Welcome! This page breaks down the essential filters used to refine genetic variant data before downstream analysis. From removing errors to keeping only high-confidence SNPs, each step helps you build a more accurate dataset for analyses such as population genetics, GWAS, and PCA. Whether you're new to variant filtering or just need a deeper understanding of a particular step, this guide explains what each filter does and how to apply it. The table below lists the names we use to refer to the filters we apply.
Filter Name | Description |
---|---|
passOnly | Retained only variants that passed all internal quality filters from the variant caller (Strelka in this case). |
biallelicOnly | Kept only biallelic variants (i.e., sites with exactly two alleles) for downstream compatibility and clarity. |
rmvIndels | Removed insertions and deletions (INDELs), keeping only SNPs (single nucleotide polymorphisms). |
minMAF0Pt05 | Filtered for variants with a minor allele frequency (MAF) ≥ 0.05, ensuring informative polymorphisms and removing rare variants. |
chr_E2 | Restricted to a specific chromosome/region of interest (e.g., chr_E2). |
minDP3 | Required a minimum depth (DP) of 3 per genotype to avoid false positives due to low coverage. |
minQ30 | Ensured that each site has a minimum site quality score of 30, reflecting high confidence in the variant. |
minGQ30 | Filtered genotypes to retain only those with a Genotype Quality (GQ) ≥ 30, removing uncertain genotype calls. |
hwe_0.05 | Removed variants deviating significantly from Hardy-Weinberg Equilibrium (p < 0.05). |
imiss_0.6 | Filtered out individuals with >60% missing genotypes to maintain data quality. |
miss_0.6 | Removed variants missing in >60% of individuals, ensuring sufficient representation across samples. |
mid95percentile | Retained the middle 95% of data based on depth distribution, trimming extreme outliers. |
noZSB | Custom filter |
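
The first three filters in the table (passOnly, biallelicOnly, rmvIndels) are all decisions made per site, so they can be illustrated together. The following is a minimal sketch, assuming tab-separated VCF data lines with the standard column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, ...). The function name `keep_site` is hypothetical and is not part of the pipeline described here.

```python
# Hypothetical helper: not the pipeline's actual code, only an illustration
# of what passOnly + biallelicOnly + rmvIndels mean for one VCF data line.
BASES = {"A", "C", "G", "T"}

def keep_site(vcf_line: str) -> bool:
    """Return True if the line is a PASS, biallelic SNP."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt, filt = fields[3], fields[4], fields[6]

    pass_only = filt == "PASS"              # passOnly: caller's own filters passed
    biallelic = "," not in alt              # biallelicOnly: a single ALT allele
    is_snp = ref in BASES and alt in BASES  # rmvIndels: single-base REF and ALT
    return pass_only and biallelic and is_snp

snp = "chr_E2\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1"
indel = "chr_E2\t200\t.\tA\tAT\t50\tPASS\t.\tGT\t0/1"
multi = "chr_E2\t300\t.\tA\tG,T\t50\tPASS\t.\tGT\t1/2"
failed = "chr_E2\t400\t.\tC\tT\t50\tLowGQX\t.\tGT\t0/1"
```

Here only `snp` would be kept; the other three lines each fail one of the three checks.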
minMAF0Pt05
MAF is the frequency at which the less common allele occurs at a genetic locus in a population.
Example: If allele A makes up 90% of the allele copies at a site and G makes up the remaining 10%, then MAF = 0.10.
This filter keeps variants with MAF ≥ 0.05.
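To make the MAF definition concrete, here is a small sketch that computes MAF for one biallelic site. It assumes diploid genotypes encoded as the number of alternate alleles (0, 1, or 2), with `None` for a missing call. The function names are illustrative only.

```python
def minor_allele_frequency(alt_counts):
    """MAF for one biallelic site.

    alt_counts: per-sample alternate-allele counts (0, 1, 2), None = missing.
    Assumes diploid samples; missing calls are ignored.
    """
    called = [g for g in alt_counts if g is not None]
    if not called:
        return None  # no data at this site
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)

def passes_min_maf(alt_counts, threshold=0.05):
    """True if the site's MAF is at or above the threshold."""
    maf = minor_allele_frequency(alt_counts)
    return maf is not None and maf >= threshold

# Nine homozygous-reference samples and one homozygous-alternate sample:
# 2 alternate copies out of 20 allele copies in total.
site = [0] * 9 + [2]
maf = minor_allele_frequency(site)
```

With these ten samples the alternate allele is the minor allele, and the site would pass a 0.05 threshold.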
minDP3
This filter ensures that each genotype (or site) is supported by at least 3 reads. Variants with DP < 3 are excluded.
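As an illustration of a per-genotype depth threshold, the sketch below sets low-depth calls to missing rather than dropping the whole site. That choice is an assumption: the text above does not say which of the two behaviours the pipeline uses. `mask_low_depth` is a hypothetical name.

```python
def mask_low_depth(calls, depths, min_dp=3):
    """Replace genotype calls supported by fewer than min_dp reads with None.

    calls:  genotype strings for one site, e.g. "0/1"
    depths: matching per-genotype DP values (None if DP is absent)
    Assumption: low-depth calls become missing; the site itself is kept.
    """
    return [
        call if (dp is not None and dp >= min_dp) else None
        for call, dp in zip(calls, depths)
    ]

masked = mask_low_depth(["0/1", "1/1", "0/0", "0/1"], [5, 2, 3, None])
```

Calls masked this way then count as missing data, so they also affect the missingness filters described later.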
minQ30
This filter retains only variants with a site quality score (QUAL) ≥ 30, indicating high confidence in the variant call. QUAL is Phred-scaled, so QUAL 30 corresponds to a 1 in 1,000 chance that the call is wrong, i.e. 99.9% confidence that the variant is real.
minGQ30
The genotype quality (GQ) score represents the confidence in the individual genotype call at a variant site. GQ is Phred-scaled, so GQ 30 corresponds to a 1 in 1,000 chance that the genotype is wrong, i.e. 99.9% confidence that the genotype is correct.
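The 99.9% figure quoted for QUAL 30 and GQ 30 follows from the Phred scale, where Q = -10 · log10(P_error). A short sketch of the conversion in both directions (the function names are illustrative):

```python
import math

def phred_to_error_prob(q):
    """Probability that the call is wrong, given a Phred-scaled score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred-scaled score for a given error probability (0 < p <= 1)."""
    return -10 * math.log10(p)

# Confidence = 1 - error probability, for a few common thresholds.
for q in (10, 20, 30):
    print(q, 1 - phred_to_error_prob(q))
```

Each step of 10 on the Phred scale divides the error probability by ten, which is why a threshold of 30 is considerably stricter than 20.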
hwe_0.05
Removes variants that deviate significantly from HWE (p-value < 0.05).
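To show what "deviating from HWE" means numerically, here is a sketch using a one-degree-of-freedom chi-square goodness-of-fit test on genotype counts. This is an assumption made for illustration: the text does not say which test produced the p-values, and many tools use an exact test instead, which behaves better at low counts.

```python
import math

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb: observed counts of the three genotypes at one site.
    Illustrative only; not necessarily the test used by the pipeline.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return None  # no called genotypes
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    if min(expected) == 0:
        return 1.0  # monomorphic site: nothing to test
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

in_equilibrium = hwe_chi2_pvalue(25, 50, 25)  # matches p^2, 2pq, q^2 exactly
no_hets = hwe_chi2_pvalue(50, 0, 50)          # complete heterozygote deficit
```

A site would be removed by this filter when its p-value is below 0.05, as with the second example, where no heterozygotes are observed even though both alleles are common.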
imiss_0.6
Removes samples with more than 60% missing genotype data.
miss_0.6
Removes variant sites where more than 60% of samples have missing data.
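The two missingness filters look at the same genotype matrix from different directions: imiss_0.6 works per sample (column) and miss_0.6 per variant (row). A minimal sketch, assuming a list-of-lists matrix with one row per variant and `None` for a missing genotype; the function names are hypothetical:

```python
def sample_missingness(matrix):
    """Fraction of missing genotypes for each sample (column)."""
    n_sites = len(matrix)
    n_samples = len(matrix[0])
    return [
        sum(row[j] is None for row in matrix) / n_sites
        for j in range(n_samples)
    ]

def site_missingness(matrix):
    """Fraction of missing genotypes for each variant (row)."""
    return [sum(g is None for g in row) / len(row) for row in matrix]

# Rows = variants, columns = samples; None marks a missing call.
genotypes = [
    [0, None, 1],
    [1, None, None],
    [2, None, 0],
    [0, 1, 0],
]
max_missing = 0.6
kept_samples = [j for j, m in enumerate(sample_missingness(genotypes)) if m <= max_missing]
kept_sites = [i for i, m in enumerate(site_missingness(genotypes)) if m <= max_missing]
```

In this toy matrix the second sample (75% missing) and the second variant (two of three calls missing) exceed the 60% limit. Note that the sketch computes both rates on the same unfiltered matrix; in a real pipeline, removing samples first changes the per-site rates, so the order of the two filters matters.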
mid95percentile
This filter retains only those variants where the depth per site falls within the middle 95% of the overall depth distribution.
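Keeping the middle 95% amounts to trimming sites below the 2.5th percentile and above the 97.5th percentile of site depth. The sketch below assumes one depth value per site and uses linear interpolation between ranks. Both the percentile method and the helper names are assumptions; the text does not specify how the cut-offs were computed.

```python
def percentile(sorted_vals, q):
    """q-th percentile (0-100) of a sorted list, by linear interpolation."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)                          # rank just below the position
    hi = min(lo + 1, len(sorted_vals) - 1) # rank just above, clamped
    frac = pos - lo
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * frac

def middle_95_bounds(depths):
    """Lower and upper depth cut-offs enclosing the middle 95% of sites."""
    vals = sorted(depths)
    return percentile(vals, 2.5), percentile(vals, 97.5)

# Toy example: 101 sites with depths 1..101.
site_depths = list(range(1, 102))
low, high = middle_95_bounds(site_depths)
kept = [d for d in site_depths if low <= d <= high]
```

Trimming both tails removes sites with unusually low coverage as well as sites with unusually high coverage.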