Sequencing Dataset Instructions

Download

Download the datasets from https://zenodo.org/records/14258052 .

FASTQ reads are gzip-compressed (.gz). Original genome files are in .fna (nucleotide FASTA) format.

Reference Index Sidecar Files

Structure of FASTQ Files

FASTQ files are produced by demultiplexing Illumina sequencing data. Providers usually deliver them gzip-compressed as .fastq.gz. Each record has four lines:

@SRR15369215.126490887  # sequence identifier
GGACCTTCTGTCATTTCACTCCTTCTGAAGTAAGGAGTGAAGTAAACACGAAGTAAACACGACAGGTTAGTCCTATTCCTTCAAGCAGGAGTACAGAAAAGAATGCAAATTCTGGGTTCTAGCCCAGCTTTTACTCCTATGGTTCTATTT  # sequence
+  # separator
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJFJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFAFJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJJJJFFFJJJJJJJJJJJ  # base call quality scores (ASCII)
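
The four-line layout above can be read with a short parser. The following is a minimal sketch, not a reference implementation. The names FastqRecord, read_fastq, and phred_scores are made up for illustration, and the Phred+33 quality offset is an assumption (it is the usual encoding for recent Illumina data, but this document does not state it). The "# ..." annotations in the example record are explanatory only; the sketch assumes they do not appear in the real files.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical helper type for this sketch)."""
    identifier: str  # line 1 without the leading '@'
    sequence: str    # line 2
    quality: str     # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a text handle, four lines at a time."""
    while True:
        header = handle.readline()
        if header == "":          # end of file
            return
        if header.strip() == "":  # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header.strip())
        yield FastqRecord(header[1:].rstrip("\n"), sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode ASCII quality characters; offset 33 is an assumption (Phred+33)."""
    return [ord(char) - offset for char in quality]


# Small in-memory example using made-up reads.
example = io.StringIO("@read1\nACGT\n+\nAAFJ\n")
for record in read_fastq(example):
    print(record.identifier, len(record.sequence), phred_scores(record.quality))
```

For a compressed file, the same function should work on a handle opened with gzip.open(path, "rt").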

Questions / Tasks

1) Count the number of sequencing reads in the FASTQ files

2) How many reads are shorter than 150 bp?

3) How many reads are longer than 150 bp?

4) How many reads have a length other than 150 bp?

5) How many reads are contaminated with Illumina adapters?

Adapter sequence: CTGTCTCTTATACACATCT
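
One possible way to approach tasks 1 to 5 in a single pass is sketched below. It is a sketch under stated assumptions: the function names (iter_sequences, summarize, summarize_file) and the file name in the usage note are hypothetical, every record is assumed to span exactly four lines, and a read counts as adapter-contaminated only if it contains the full adapter as an exact substring. Dedicated trimming tools also detect partial and mismatched adapters, so the count here is likely a lower bound.

```python
import gzip
from typing import Dict, Iterable, Iterator

ADAPTER = "CTGTCTCTTATACACATCT"  # adapter sequence given in the task list


def iter_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (second of every four) from FASTQ lines."""
    for index, line in enumerate(lines):
        if index % 4 == 1:
            yield line.strip()


def summarize(sequences: Iterable[str],
              read_length: int = 150,
              adapter: str = ADAPTER) -> Dict[str, int]:
    """Count reads by length class and by exact adapter occurrence."""
    stats = {"total": 0, "shorter": 0, "longer": 0,
             "not_equal": 0, "with_adapter": 0}
    for sequence in sequences:
        stats["total"] += 1                     # task 1
        if len(sequence) < read_length:
            stats["shorter"] += 1               # task 2
        elif len(sequence) > read_length:
            stats["longer"] += 1                # task 3
        if len(sequence) != read_length:
            stats["not_equal"] += 1             # task 4
        if adapter in sequence:
            stats["with_adapter"] += 1          # task 5 (exact match only)
    return stats


def summarize_file(path: str) -> Dict[str, int]:
    """Run summarize() over a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return summarize(iter_sequences(handle))
```

Usage might look like summarize_file("reads_1.fastq.gz"), where the file name is a placeholder for one of the downloaded files. Note that the answer to task 4 is the sum of the answers to tasks 2 and 3, which gives a quick consistency check.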