Genetic Variant Filtering Guide

Welcome! This page explains the filters used to refine genetic variant data before downstream analysis. From removing errors to keeping only high-confidence SNPs, each step helps you build a more accurate dataset for analyses such as population genetics, GWAS, and PCA. Whether you are new to variant filtering or need a deeper understanding of a particular step, this guide aims to make each filter easy to understand and apply. The table below lists the names we use to refer to the filters we apply.

Filter Name       Description
passOnly          Retained only variants that passed all internal quality filters from the variant caller (Strelka in this case).
biallelicOnly     Kept only biallelic variants (i.e., sites with exactly two alleles) for downstream compatibility and clarity.
rmvIndels         Removed insertions and deletions (INDELs), keeping only SNPs (single nucleotide polymorphisms).
minMAF0Pt05       Filtered for variants with a minor allele frequency (MAF) ≥ 0.05, ensuring informative polymorphisms and removing rare variants.
chr_E2            Restricted to a specific chromosome/region of interest (e.g., chr_E2).
minDP3            Required a minimum depth (DP) of 3 per genotype to avoid false positives due to low coverage.
minQ30            Ensured that each site has a minimum site quality score of 30, reflecting high confidence in the variant.
minGQ30           Filtered genotypes to retain only those with a Genotype Quality (GQ) ≥ 30, removing uncertain genotype calls.
hwe_0.05          Removed variants deviating significantly from Hardy-Weinberg Equilibrium (p < 0.05).
imiss_0.6         Filtered out individuals with >60% missing genotypes to maintain data quality.
miss_0.6          Removed variants missing in >60% of individuals, ensuring sufficient representation across samples.
mid95percentile   Retained the middle 95% of data based on depth distribution, trimming extreme outliers.
noZSB             Custom filter

Minor Allele Frequency (MAF)

MAF is the frequency at which the less common allele occurs at a genetic locus in a population.

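Example: If allele A makes up 90% of the allele copies in the population and allele G makes up the remaining 10%, then MAF = 0.10. Note that MAF is counted over allele copies (two per diploid individual), not over individuals.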
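
The calculation can be sketched in a few lines of Python. This is only an illustration, assuming diploid samples and a biallelic site whose genotypes are coded as alternate-allele counts (0, 1, 2, or None for a missing call). The function name and the example data are hypothetical and are not taken from the pipeline itself.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF at one biallelic site.

    genotypes: per-sample alternate-allele counts (0, 1, or 2),
    with None for a missing call. Assumes diploid samples.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes at this site")
    # Each diploid sample contributes two allele copies.
    alt_freq = sum(called) / (2 * len(called))
    # The minor allele is whichever of the two alleles is less common.
    return min(alt_freq, 1 - alt_freq)


# Hypothetical site: 10 samples, 2 alternate-allele copies out of 20.
site = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
maf = minor_allele_frequency(site)
passes = maf >= 0.05  # threshold used by minMAF0Pt05
```

Here the alternate allele accounts for 2 of 20 copies, so the site has MAF 0.10 and would be kept by a 0.05 threshold.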

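Why apply minMAF0Pt05?

This filter keeps variants with MAF ≥ 0.05, which retains informative polymorphisms and removes rare variants.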

Depth Filter: minDP3

This filter requires each genotype call to be supported by at least 3 reads. Calls with DP < 3 are excluded, because low coverage makes false positives more likely.

Quality Filter: minQ30

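This filter retains only variants with a site quality score (QUAL) ≥ 30, indicating high confidence in the variant call. QUAL is Phred-scaled, so QUAL 30 corresponds to an estimated 1-in-1,000 chance that the call is wrong, i.e. 99.9% confidence that the variant is real.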
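
Phred-scaled scores such as QUAL relate to error probability by Q = -10 · log10(P_error). The short sketch below shows that conversion in both directions; the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled score to the estimated error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    return -10 * math.log10(p)


err_q30 = phred_to_error_prob(30)   # about 0.001
confidence_q30 = 1 - err_q30        # about 0.999
err_q20 = phred_to_error_prob(20)   # about 0.01, for comparison
```

Each 10-point step changes the error probability by a factor of ten, so a threshold of 30 is ten times stricter than a threshold of 20.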

Genotype Quality Filter: minGQ30

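The genotype quality (GQ) score represents the confidence in an individual genotype call at a variant site. GQ is Phred-scaled: GQ 30 corresponds to 99.9% confidence that the genotype is correct. This filter retains only genotypes with GQ ≥ 30 and removes uncertain calls.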

Hardy-Weinberg Equilibrium (HWE)

The filter hwe_0.05 removes variants that deviate significantly from HWE (p-value < 0.05).
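
To make the HWE p-value concrete, here is a minimal sketch that tests one biallelic site from its genotype counts. The guide does not say which test produced the p-value, so the chi-square goodness-of-fit test (1 degree of freedom) used here is an assumption chosen for simplicity; exact tests are often preferred for small samples or rare alleles. The function name and the counts are hypothetical.

```python
import math


def hwe_chi_square_p(n_hom_ref, n_het, n_hom_alt):
    """P-value of a 1-df chi-square test for Hardy-Weinberg Equilibrium."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotypes at this site")
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    observed = [n_hom_ref, n_het, n_hom_alt]
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    if min(expected) == 0:
        return 1.0  # monomorphic site: nothing to test
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))


p_balanced = hwe_chi_square_p(25, 50, 25)  # counts match HWE expectations
p_no_hets = hwe_chi_square_p(50, 0, 50)    # no heterozygotes at all
removed = p_no_hets < 0.05                 # threshold used by hwe_0.05
```

The first site matches its expected counts exactly and is kept; the second has no heterozygotes despite an allele frequency of 0.5, so it deviates strongly from HWE and would be removed.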

Missingness per Sample: imiss_0.6

Removes samples with more than 60% missing genotype data.

Missingness per Site: miss_0.6

Removes variant sites where more than 60% of samples have missing data.
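
Both missingness filters reduce to counting missing calls along one axis of the genotype matrix. The sketch below assumes a small in-memory matrix with one row per site, one column per sample, and None for a missing genotype. The names and data are hypothetical, and both rates are computed on the same unfiltered matrix, whereas a real pipeline applies the filters in some order.

```python
def missing_rates(matrix):
    """Return (per-site, per-sample) missing-genotype fractions.

    matrix[i][j] is the genotype of sample j at site i; None means missing.
    """
    n_sites = len(matrix)
    n_samples = len(matrix[0])
    site_miss = [sum(g is None for g in row) / n_samples for row in matrix]
    sample_miss = [
        sum(row[j] is None for row in matrix) / n_sites
        for j in range(n_samples)
    ]
    return site_miss, sample_miss


# Hypothetical data: 3 sites (rows) x 4 samples (columns).
matrix = [
    [0, 1, None, 2],
    [1, None, None, None],
    [2, 0, None, 0],
]
site_miss, sample_miss = missing_rates(matrix)

# Keep sites and samples with at most 60% missing data.
kept_sites = [i for i, m in enumerate(site_miss) if m <= 0.6]
kept_samples = [j for j, m in enumerate(sample_miss) if m <= 0.6]
```

In this toy matrix the second site (75% missing) and the third sample (100% missing) exceed the 60% threshold and are dropped.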

What is mid95percentile?

This filter retains only those variants where the depth per site falls within the middle 95% of the overall depth distribution.
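
One way to implement this is to compute two percentile cutoffs on the per-site depths and keep the sites that fall between them. The sketch below assumes symmetric trimming at the 2.5th and 97.5th percentiles with linear interpolation between ranks. The guide does not state the exact cutoffs or interpolation method, so treat both as assumptions; the function names are hypothetical.

```python
def percentile(sorted_values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a sorted list."""
    if not sorted_values:
        raise ValueError("empty input")
    rank = (len(sorted_values) - 1) * pct / 100
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = rank - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def middle_95_bounds(depths):
    """Return the (2.5th, 97.5th) percentile depth cutoffs."""
    ordered = sorted(depths)
    return percentile(ordered, 2.5), percentile(ordered, 97.5)


def keep_middle_95(depths):
    """Return one boolean per site: True if its depth lies within the bounds."""
    low, high = middle_95_bounds(depths)
    return [low <= d <= high for d in depths]


# Hypothetical per-site depths: 1, 2, ..., 101.
depths = list(range(1, 102))
low, high = middle_95_bounds(depths)
keep = keep_middle_95(depths)
n_kept = sum(keep)
```

Trimming both tails removes sites with unusually low depth (weak support) as well as sites with unusually high depth.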