🧬 Advanced Linux Tutorial

The dataset for this exercise is available at:

/data/PwaniGenomics/fastq_data
# Create a symbolic link to the data (run this from your user directory)
ln -s /data/PwaniGenomics/fastq_data

Note: The FASTQ files are gzip-compressed (.gz), so decompress them on the fly (for example with zcat) before inspecting them.

Genome File Descriptions (.fna Indexes)

Structure of a FASTQ File

FASTQ files (typically .fastq.gz) contain sequencing reads. Each read consists of four lines:


@SRR15369215.126490887       # Sequence identifier
GGACCTTCTGTCATTTCACT...      # Nucleotide sequence
+                             # Separator
AAFFFJJJJJJJJJJJJJJJJ...      # Base quality scores
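
To see this four-line structure in a real file, one option is to decompress the file on the fly and print only the first few lines. The sketch below is a minimal example, not the only way to do it: `show_reads` is a hypothetical helper name, and it assumes the file sits in the current directory.

```shell
# Print the first N reads (4 lines each) of a gzipped FASTQ file.
# show_reads is a hypothetical helper; N defaults to 1.
# gzip -dc is equivalent to zcat.
show_reads() {
    gzip -dc "$1" | head -n "$(( ${2:-1} * 4 ))"
}

fq=SRR836370_1_subset.fastq.gz   # assumed to be in the current directory
if [ -f "$fq" ]; then
    show_reads "$fq" 2           # first two reads = eight lines
fi
```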
  

Read Length Analysis

Count Total Reads for SRR836370_1_subset.fastq.gz and SRR836370_2_subset.fastq.gz:
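
Because each read occupies exactly four lines, the number of reads is the line count divided by four. A possible approach is sketched below; `count_reads` is a hypothetical helper name, and the sketch assumes strict four-line records (no wrapped sequence lines) and that the files are in the current directory.

```shell
# Count reads in a gzipped FASTQ file: total lines / 4.
# count_reads is a hypothetical helper; assumes 4 lines per read.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

for fq in SRR836370_1_subset.fastq.gz SRR836370_2_subset.fastq.gz; do
    [ -f "$fq" ] || continue     # skip files that are not present
    printf '%s\t%s\n' "$fq" "$(count_reads "$fq")"
done
```

For correctly paired data, both files should report the same number of reads.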

Reads Shorter than 150bp for SRR836370_1_subset.fastq.gz and SRR836370_2_subset.fastq.gz:
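
The sequence is the second line of every four-line record, so one way to answer this is to have awk look only at lines where `NR % 4 == 2` and count those shorter than the threshold. This is a sketch under the same assumptions as before (four-line records, files in the current directory); `count_shorter` is a hypothetical helper name.

```shell
# Count reads whose sequence is shorter than LENGTH (default 150).
# usage: count_shorter FILE [LENGTH]   (hypothetical helper)
count_shorter() {
    gzip -dc "$1" |
        awk -v len="${2:-150}" 'NR % 4 == 2 && length($0) < len { n++ }
                                END { print n + 0 }'
}

for fq in SRR836370_1_subset.fastq.gz SRR836370_2_subset.fastq.gz; do
    [ -f "$fq" ] || continue     # skip files that are not present
    printf '%s\t%s\n' "$fq" "$(count_shorter "$fq" 150)"
done
```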

Reads Longer than 150bp for SRR836370_1_subset.fastq.gz and SRR836370_2_subset.fastq.gz:
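
The same idea works with the comparison reversed. A minimal sketch, again assuming four-line records and files in the current directory; `count_longer` is a hypothetical helper name.

```shell
# Count reads whose sequence is longer than LENGTH (default 150).
# usage: count_longer FILE [LENGTH]   (hypothetical helper)
count_longer() {
    gzip -dc "$1" |
        awk -v len="${2:-150}" 'NR % 4 == 2 && length($0) > len { n++ }
                                END { print n + 0 }'
}

for fq in SRR836370_1_subset.fastq.gz SRR836370_2_subset.fastq.gz; do
    [ -f "$fq" ] || continue     # skip files that are not present
    printf '%s\t%s\n' "$fq" "$(count_longer "$fq" 150)"
done
```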

Reads Not Equal to 150bp for SRR836370_1_subset.fastq.gz and SRR836370_2_subset.fastq.gz:
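
Reads that are not exactly 150 bp can be counted with a `!=` comparison on the sequence line. The sketch below makes the same assumptions as the earlier ones; `count_not_equal` is a hypothetical helper name.

```shell
# Count reads whose sequence length differs from LENGTH (default 150).
# usage: count_not_equal FILE [LENGTH]   (hypothetical helper)
count_not_equal() {
    gzip -dc "$1" |
        awk -v len="${2:-150}" 'NR % 4 == 2 && length($0) != len { n++ }
                                END { print n + 0 }'
}

for fq in SRR836370_1_subset.fastq.gz SRR836370_2_subset.fastq.gz; do
    [ -f "$fq" ] || continue     # skip files that are not present
    printf '%s\t%s\n' "$fq" "$(count_not_equal "$fq" 150)"
done
```

As a sanity check, this number should equal the count of reads shorter than 150 bp plus the count of reads longer than 150 bp.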

Reads Contaminated with Illumina Adapter for SRR836370_1_subset.fastq.gz and SRR836370_2_subset.fastq.gz:

Adapter sequence: CTGTCTCTTATACACATCT
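
One way to count contaminated reads is to search only the sequence lines for the adapter string, so that header or quality lines cannot produce false matches. The sketch below assumes four-line records and files in the current directory; `count_adapter_reads` is a hypothetical helper name, and it reports reads that contain the full adapter as an exact match.

```shell
adapter=CTGTCTCTTATACACATCT

# Count reads whose sequence line contains ADAPTER as an exact substring.
# usage: count_adapter_reads FILE ADAPTER   (hypothetical helper)
count_adapter_reads() {
    gzip -dc "$1" |
        awk -v adapter="$2" 'NR % 4 == 2 && index($0, adapter) > 0 { n++ }
                             END { print n + 0 }'
}

for fq in SRR836370_1_subset.fastq.gz SRR836370_2_subset.fastq.gz; do
    [ -f "$fq" ] || continue     # skip files that are not present
    printf '%s\t%s\n' "$fq" "$(count_adapter_reads "$fq" "$adapter")"
done
```

This will miss reads that carry only a partial adapter at the 3' end or an adapter with sequencing errors, so treat the result as a lower bound.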

Extra note

The datasets can also be accessed later at: https://tinyurl.com/Popgen2025PUZenodo