VCF Filtering and PCA Workflow

Create a symbolic link to the variants file in the working directory

ln -s /data/PwaniGenomics/variants_renamed.vcf.gz

The variants file can also be downloaded from here

1. Apply Base Quality, Genotype Quality, INDELS, "PASS" filters

Apply quality-based filters to retain high-confidence variants: remove all variants with a quality score (Q) below 30, exclude genotypes with a genotype quality (GQ) below 30, remove all indels, and retain only the variants that are flagged as "PASS".
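
As a minimal sketch of what these filters do, the Python below applies them to single tab-separated VCF data lines. The function names are hypothetical, and several choices are assumptions rather than part of the original workflow: GT is taken to be the first FORMAT key, low-GQ genotypes are masked to "./." instead of dropping the whole site, and a missing GQ is treated as failing the threshold.

```python
BASES = {"A", "C", "G", "T"}

def passes_site_filters(vcf_line, min_qual=30.0):
    """Return True if the record is a PASS SNP with QUAL >= min_qual."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alts, qual, flt = fields[3], fields[4].split(","), fields[5], fields[6]
    if qual == "." or float(qual) < min_qual:
        return False
    if flt != "PASS":
        return False
    # Assumption: anything other than single-base REF and ALT counts as an indel.
    if ref.upper() not in BASES:
        return False
    return all(alt.upper() in BASES for alt in alts)

def mask_low_gq(vcf_line, min_gq=30):
    """Set genotypes with GQ < min_gq (or missing GQ) to './.'."""
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")
    if "GQ" not in keys:
        return "\t".join(fields)
    gq_idx = keys.index("GQ")
    for i in range(9, len(fields)):
        values = fields[i].split(":")
        gq = values[gq_idx] if gq_idx < len(values) else "."
        if gq == "." or float(gq) < min_gq:
            values[0] = "./."  # assumes GT is the first FORMAT key
            fields[i] = ":".join(values)
    return "\t".join(fields)
```

A record would be kept when passes_site_filters returns True, and mask_low_gq would then be applied to the kept line before writing it out.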

2. Remove non-neutral loci

Remove any variant that deviates from Hardy-Weinberg equilibrium (HWE) with a chi-square P value of less than 0.01.
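
To make the test concrete, here is a hedged Python sketch of a one-degree-of-freedom chi-square HWE test for a biallelic site, computed from genotype counts. The function name is hypothetical, and dedicated tools may use an exact test instead of the chi-square approximation, so P values can differ slightly from theirs.

```python
import math

def hwe_chisq_pvalue(n_hom_ref, n_het, n_hom_alt):
    """Chi-square (1 d.f.) P value for deviation from HWE at a biallelic site."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return 1.0
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: no deviation can be measured
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

def keep_site_hwe(n_hom_ref, n_het, n_hom_alt, min_p=0.01):
    """Keep the site unless the HWE P value falls below min_p."""
    return hwe_chisq_pvalue(n_hom_ref, n_het, n_hom_alt) >= min_p
```

For example, a site with no heterozygotes at all among 100 individuals (50 of each homozygote) would be removed, while counts of 25, 50 and 25 match the HWE expectation exactly and would be kept.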

3. Remove alleles that appear fewer than 3 times

An extremely rare allele can sometimes be an artefact of sequencing. For example, an allele that appears in the dataset only 1 or 2 times may come from a sequencing error in a single individual, since one diploid individual can carry up to 2 copies. However, if an allele appears at least 3 times, we can be confident that it is represented in at least 2 individuals in the dataset.
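
The counting logic can be sketched in Python as follows. This is an illustration under stated assumptions: the site is biallelic, genotypes are given as GT strings such as "0/1" or "1|1", and the function names are hypothetical.

```python
def minor_allele_count(genotypes):
    """Count copies of the rarer allele at a biallelic site."""
    ref_count = 0
    alt_count = 0
    for gt in genotypes:
        # Treat phased ("|") and unphased ("/") genotypes the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele == "0":
                ref_count += 1
            elif allele == "1":
                alt_count += 1
            # "." (missing) alleles are ignored
    return min(ref_count, alt_count)

def keep_site_mac(genotypes, min_count=3):
    """Keep the site only if the rarer allele appears at least min_count times."""
    return minor_allele_count(genotypes) >= min_count
```

With min_count=3, a site whose alternate allele is seen only in one homozygous individual (2 copies) is dropped, while a site with three heterozygous carriers is kept.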

4. Apply Individual Missingness Filter

Remove individuals with more than 98% missing data:
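
As a rough sketch of the calculation, the Python below computes the fraction of missing genotypes per individual and lists those above the 98% cutoff. The input layout (one list of genotype strings per variant, aligned with the sample names) and the function names are assumptions made for illustration; a genotype with any missing allele is counted as missing.

```python
def individual_missingness(genotype_rows, sample_names):
    """Return {sample: fraction of variants with a missing genotype}."""
    missing = [0] * len(sample_names)
    for row in genotype_rows:
        for i, sample_field in enumerate(row):
            gt = sample_field.split(":")[0]
            if "." in gt:  # "./.", ".|." or a half-missing call
                missing[i] += 1
    n_variants = len(genotype_rows)
    return {
        name: (missing[i] / n_variants if n_variants else 0.0)
        for i, name in enumerate(sample_names)
    }

def samples_to_remove(missingness, max_missing=0.98):
    """Names of individuals whose missing fraction exceeds max_missing."""
    return [name for name, frac in missingness.items() if frac > max_missing]
```

The returned list could then be written to a text file, one name per line, and passed to whichever tool performs the removal.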

5. Record Variant Counts for Different Site Missingness Thresholds

Extract and save the number of variants remaining at each threshold (0.1 to 0.9):
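
The tally can be illustrated with a short Python sketch. It assumes each threshold is the minimum fraction of individuals that must have a called genotype (the same meaning as the 0.5 value in step 8), and the two-column layout shown for variant_counts.txt is a hypothetical choice, not necessarily the format used in the original workflow.

```python
def site_call_rate(genotypes):
    """Fraction of individuals with a non-missing genotype at one site."""
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if "." not in g.split(":")[0])
    return called / len(genotypes)

def count_variants_by_threshold(call_rates, thresholds=None):
    """Return [(threshold, n_variants_kept)] for thresholds 0.1 to 0.9."""
    if thresholds is None:
        thresholds = [i / 10 for i in range(1, 10)]
    return [(t, sum(1 for r in call_rates if r >= t)) for t in thresholds]

def format_counts(counts):
    """Render the counts as tab-separated text with a header line."""
    lines = ["threshold\tn_variants"]
    lines.extend(f"{t}\t{n}" for t, n in counts)
    return "\n".join(lines) + "\n"
```

Writing the output of format_counts to variant_counts.txt would give a small table that can be read directly into R for plotting.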

6. Apply Site Missingness Filter

Filter out variants that are not commonly genotyped across individuals.

7. Plot variant_counts.txt Using ggplot2 in R

8. Apply Final Missingness Filter (Max Missing = 0.5)

This retains variants that have a called genotype in at least 50% of individuals.

9. Extract Depth Values Between 2.5th and 97.5th Percentiles

Use R or Python to compute the 0.025 and 0.975 quantiles of the per-site mean depth values and filter sites accordingly.
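
A minimal Python sketch of the quantile step is shown below, using only the standard library. It assumes the per-site mean depths are already available as a list of numbers; the function names are hypothetical. The quantile uses linear interpolation between order statistics, which matches the default method in R's quantile() and NumPy.

```python
def quantile(values, q):
    """Quantile with linear interpolation between sorted values."""
    data = sorted(values)
    if not data:
        raise ValueError("quantile() needs at least one value")
    pos = q * (len(data) - 1)          # fractional index into the sorted data
    lower = int(pos)                   # floor, since pos is non-negative
    upper = min(lower + 1, len(data) - 1)
    frac = pos - lower
    return data[lower] + (data[upper] - data[lower]) * frac

def depth_bounds(mean_depths, lower_q=0.025, upper_q=0.975):
    """Return the (lower, upper) depth cutoffs for the central 95% of sites."""
    return quantile(mean_depths, lower_q), quantile(mean_depths, upper_q)

def in_depth_range(mean_depth, bounds):
    """True if a site's mean depth lies within the cutoffs (inclusive)."""
    low, high = bounds
    return low <= mean_depth <= high
```

The two cutoffs returned by depth_bounds are the minimum and maximum mean depth values to use when filtering sites.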

10. Use Site Mean Depth Filter

Calculate mean depth per variant:
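
For illustration, the per-variant mean depth can be computed from a VCF data line as in the Python sketch below. It assumes per-sample depth is stored in the DP key of the FORMAT column and that samples with a missing DP are skipped rather than counted as zero; both the function name and that handling of missing values are assumptions.

```python
def site_mean_depth(vcf_line):
    """Mean of the per-sample DP values on one VCF data line, or None."""
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")
    if "DP" not in keys:
        return None
    dp_idx = keys.index("DP")
    depths = []
    for sample_field in fields[9:]:
        values = sample_field.split(":")
        # Trailing FORMAT values may be dropped for missing genotypes.
        if dp_idx < len(values) and values[dp_idx] not in (".", ""):
            depths.append(int(values[dp_idx]))
    if not depths:
        return None
    return sum(depths) / len(depths)
```

Collecting this value for every variant gives the distribution from which the 2.5th and 97.5th percentile cutoffs are taken.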

11. Apply Depth Filters to Retain Only the Middle 95% of Variants

12. Perform PCA

Convert to PLINK format and run PCA: