Linux Advanced Tutorial

Download

Download the datasets from Zenodo: https://zenodo.org/records/14258052

The FASTQ reads are gzip-compressed (.gz). The original genome files are in FASTA format with the .fna extension.

Reference Index Sidecar Files

These files sit next to the reference genome and carry the extensions written by bwa index:

.fna.amb: records the positions of ambiguous bases (e.g., N) in the reference genome.
.fna.ann: stores sequence names, lengths, and other metadata.
.fna.bwt: contains the Burrows–Wheeler transform of the reference sequence.
.fna.pac: contains the reference sequence in packed form.
.fna.sa: contains the suffix array, used to locate sequence positions.
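
Before running an aligner it can be worth confirming that all five sidecar files are actually present. The following is a minimal sketch: the function name missing_index_files and the file name genome.fna are made up for illustration and are not part of the dataset description.

```shell
# List the index sidecar files that are missing for a reference FASTA.
# Prints one missing path per line; prints nothing if all five are present.
missing_index_files() {
    ref="$1"
    for ext in amb ann bwt pac sa; do
        [ -e "$ref.$ext" ] || printf '%s\n' "$ref.$ext"
    done
}

# Example call (genome.fna is a placeholder name):
# missing_index_files genome.fna
```

If anything is reported as missing, the index usually has to be rebuilt with the aligner's own indexing command (for BWA, bwa index genome.fna).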

Structure of FASTQ Files

FASTQ files are produced by demultiplexing Illumina sequencing data. Providers usually deliver them gzip-compressed as .fastq.gz. Each record has four lines (the "#" annotations below are explanatory and are not part of the file):

@SRR15369215.126490887  # sequence identifier
GGACCTTCTGTCATTTCACTCCTTCTGAAGTAAGGAGTGAAGTAAACACGAAGTAAACACGACAGGTTAGTCCTATTCCTTCAAGCAGGAGTACAGAAAAGAATGCAAATTCTGGGTTCTAGCCCAGCTTTTACTCCTATGGTTCTATTT  # sequence
+  # separator
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJFJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFAFJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJJJJFFFJJJJJJJJJJJ  # base call quality scores (ASCII)
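
To look at a record in your own data and confirm the four-line layout, something like the following may help. It is only a sketch: the function names first_record and check_fastq are invented here, and reads.fastq.gz is a placeholder file name.

```shell
# Print the first FASTQ record (four lines) of a gzip-compressed file.
first_record() {
    gzip -dc "$1" | head -n 4
}

# Check the four-line layout of every record:
#   line 1 starts with "@", line 3 starts with "+",
#   and the sequence (line 2) and quality (line 4) have equal length.
# Prints the number of problems found; 0 means the file looks well-formed.
check_fastq() {
    gzip -dc "$1" | awk '
        NR % 4 == 1 { if (substr($0, 1, 1) != "@") bad++ }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 { if (substr($0, 1, 1) != "+") bad++ }
        NR % 4 == 0 { if (length($0) != seqlen) bad++ }
        END { if (NR % 4 != 0) bad++; print bad + 0 }'
}

# Example calls (reads.fastq.gz is a placeholder name):
# first_record reads.fastq.gz
# check_fastq reads.fastq.gz
```

The check relies on line position (NR % 4) rather than on the leading character alone, because a quality line may itself begin with "@".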

Questions / Tasks

1) Count the number of sequencing reads in the FASTQ files

2) How many reads are shorter than 150 bp?

3) How many reads are longer than 150 bp?

4) How many reads are not equal to 150 bp?

5) How many reads are contaminated with Illumina adapters?

Adapter sequence: CTGTCTCTTATACACATCT
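
One possible way to approach the tasks above is sketched below; it is not the only solution. The function names (count_reads, length_summary, count_adapter_reads) and the file name reads.fastq.gz are invented for illustration. The sketch assumes every record occupies exactly four lines, and the adapter search only finds exact, full-length matches on the read as written, so partial adapters at read ends or adapters with mismatches are not counted.

```shell
# Task 1: number of reads = number of lines divided by 4.
count_reads() {
    gzip -dc "$1" | awk 'END { print int(NR / 4) }'
}

# Tasks 2-4: compare each read length with a target (default 150).
# Prints three tab-separated lines: shorter, longer, not_equal.
length_summary() {
    gzip -dc "$1" | awk -v target="${2:-150}" '
        NR % 4 == 2 {
            n = length($0)
            if (n < target + 0) shorter++
            else if (n > target + 0) longer++
        }
        END {
            printf "shorter\t%d\n", shorter
            printf "longer\t%d\n", longer
            printf "not_equal\t%d\n", shorter + longer
        }'
}

# Task 5: reads whose sequence line contains the adapter (exact match).
count_adapter_reads() {
    gzip -dc "$1" | awk -v adapter="${2:-CTGTCTCTTATACACATCT}" '
        NR % 4 == 2 && index($0, adapter) > 0 { hits++ }
        END { print hits + 0 }'
}

# Example calls (reads.fastq.gz is a placeholder name):
# count_reads reads.fastq.gz
# length_summary reads.fastq.gz 150
# count_adapter_reads reads.fastq.gz
```

Only the sequence line of each record (NR % 4 == 2) is examined. Counting lines that start with "@" or searching every line for the adapter can miscount, since quality strings are free to contain those characters.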