Create a symbolic link to the variants file:
ln -s /data/PwaniGenomics/variants_renamed.vcf.gz
The variants file can also be downloaded from here
Apply quality-based filters to retain high-confidence variants: remove all variants with a site quality (Q) < 30 or a genotype quality (GQ) < 30, remove all indels, and retain only the variants that are flagged as "PASS".
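A minimal sketch of how these filters could be applied with VCFtools, assuming VCFtools is installed and that "Q" refers to the site QUAL score. The helper name filter_quality and the output prefix variants_qc are illustrative, not part of the original material.

```shell
# Hypothetical helper: quality filtering with VCFtools (assumed to be installed).
# $1 = bgzipped input VCF, $2 = output prefix (illustrative name).
filter_quality() {
  # --minQ 30               drop sites with QUAL below 30
  # --minGQ 30              set genotypes with GQ below 30 to missing
  # --remove-indels         keep SNPs only
  # --remove-filtered-all   keep only sites whose FILTER field is PASS
  vcftools --gzvcf "$1" \
    --minQ 30 \
    --minGQ 30 \
    --remove-indels \
    --remove-filtered-all \
    --recode --recode-INFO-all \
    --out "$2"
}

# Run only when the tool and the input file are both available.
if command -v vcftools >/dev/null 2>&1 && [ -f variants_renamed.vcf.gz ]; then
  filter_quality variants_renamed.vcf.gz variants_qc   # writes variants_qc.recode.vcf
fi
```

Note that --minGQ masks individual genotypes rather than removing whole sites, so its effect only shows up in the later missingness filters.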
Remove any variant that deviates from Hardy-Weinberg equilibrium (HWE) with a chi-square P value of less than 0.01.
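One possible way to do this is the --hwe option of VCFtools. This is a sketch under assumptions: VCFtools computes an exact-test p-value rather than a chi-square one, so it only approximates the criterion stated above, and the helper name filter_hwe and the file names are illustrative.

```shell
# Hypothetical helper: drop sites that deviate from HWE.
# $1 = uncompressed input VCF, $2 = output prefix (illustrative names).
filter_hwe() {
  # Sites with an HWE p-value below 0.01 are excluded.
  vcftools --vcf "$1" --hwe 0.01 --recode --recode-INFO-all --out "$2"
}

if command -v vcftools >/dev/null 2>&1 && [ -f variants_qc.recode.vcf ]; then
  filter_hwe variants_qc.recode.vcf variants_hwe   # writes variants_hwe.recode.vcf
fi
```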
Extremely rare alleles can sometimes be an artefact of sequencing. For example, an allele that appears in the dataset only 1 or 2 times may be there because a single individual had a sequencing error. However, if an allele appears more than twice, we can be confident that it is represented in at least 2 individuals in the dataset.
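A rare-allele filter of this kind could be sketched with the --mac (minor allele count) option of VCFtools. The cutoff of 3 is an assumption, chosen because an allele seen three or more times in diploid samples must come from at least two individuals; the text does not state the exact value. The helper name filter_mac and the file names are illustrative.

```shell
# Hypothetical helper: keep sites whose minor allele count is at least $3.
# $1 = uncompressed input VCF, $2 = output prefix, $3 = minimum allele count.
filter_mac() {
  vcftools --vcf "$1" --mac "$3" --recode --recode-INFO-all --out "$2"
}

if command -v vcftools >/dev/null 2>&1 && [ -f variants_hwe.recode.vcf ]; then
  # Assumed cutoff: 3 copies of the minor allele.
  filter_mac variants_hwe.recode.vcf variants_mac 3   # writes variants_mac.recode.vcf
fi
```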
Remove individuals with more than 98% missing data:
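A sketch of this step, assuming VCFtools is installed: --missing-indv writes a per-individual report (.imiss) whose fifth column, F_MISS, is the fraction of missing genotypes. The helper name list_high_missing and all file names are illustrative.

```shell
# Hypothetical helper: print IDs of individuals whose missing fraction
# (column 5, F_MISS, of a VCFtools .imiss file) exceeds the cutoff.
# $1 = .imiss file, $2 = cutoff (e.g. 0.98).
list_high_missing() {
  awk -v cut="$2" 'NR > 1 && $5 > cut { print $1 }' "$1"
}

if command -v vcftools >/dev/null 2>&1 && [ -f variants_mac.recode.vcf ]; then
  # 1. Per-individual missingness report: indv_missing.imiss
  vcftools --vcf variants_mac.recode.vcf --missing-indv --out indv_missing
  # 2. Individuals with more than 98% missing data
  list_high_missing indv_missing.imiss 0.98 > remove_indv.txt
  # 3. Remove them from the VCF
  vcftools --vcf variants_mac.recode.vcf --remove remove_indv.txt \
    --recode --recode-INFO-all --out variants_indv
fi
```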
Extract and save the number of variants remaining at each missingness threshold (0.1 to 0.9):
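The loop could look roughly like the sketch below, assuming the thresholds are values passed to the --max-missing option of VCFtools. The helper name count_variants and the input and temporary file names are illustrative.

```shell
# Hypothetical helper: count variant records (non-header lines) in an
# uncompressed VCF file.
count_variants() {
  grep -vc '^#' "$1"
}

if command -v vcftools >/dev/null 2>&1 && [ -f variants_indv.recode.vcf ]; then
  : > variant_counts.txt                      # start with an empty results file
  for t in 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9; do
    # Filter at this threshold and write the VCF to a temporary file.
    vcftools --vcf variants_indv.recode.vcf --max-missing "$t" \
      --recode --stdout 2>/dev/null > tmp_missing.vcf
    # One line per threshold: threshold <TAB> number of variants kept.
    printf '%s\t%s\n' "$t" "$(count_variants tmp_missing.vcf)" >> variant_counts.txt
  done
  rm -f tmp_missing.vcf
fi
```

The resulting variant_counts.txt has two tab-separated columns, which makes it straightforward to read into R for plotting.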
Filter out variants that are not commonly genotyped across individuals.
Plot the counts saved in variant_counts.txt using ggplot2 in R.
A threshold of 0.5 retains variants that are present in at least 50% of individuals.
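The 50% filter itself could be sketched with --max-missing in VCFtools, where 0 allows sites that are completely missing and 1 allows no missing data. The helper name filter_missing and the file names are illustrative.

```shell
# Hypothetical helper: keep sites genotyped in at least a fraction $3 of
# individuals. $1 = uncompressed input VCF, $2 = output prefix.
filter_missing() {
  vcftools --vcf "$1" --max-missing "$3" --recode --recode-INFO-all --out "$2"
}

if command -v vcftools >/dev/null 2>&1 && [ -f variants_indv.recode.vcf ]; then
  # 0.5 = a site must be genotyped in at least 50% of individuals.
  filter_missing variants_indv.recode.vcf variants_miss 0.5
fi
```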
Use R or Python to compute the 0.025 and 0.975 quantiles of the per-variant mean depth, and filter out sites that fall outside this range.
Calculate mean depth per variant:
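The text suggests R or Python for the quantiles; as a shell-only alternative, the sketch below uses sort and awk with a nearest-rank quantile, so the cutoffs may differ slightly from the interpolated defaults in R or NumPy. It assumes VCFtools is installed and that its --site-mean-depth report (.ldepth.mean) holds MEAN_DEPTH in column 3. The helper name depth_quantile and the file names are illustrative.

```shell
# Hypothetical helper: nearest-rank quantile of column 3 (MEAN_DEPTH) of a
# VCFtools .ldepth.mean file. $1 = file, $2 = quantile q with 0 < q <= 1.
depth_quantile() {
  tail -n +2 "$1" | cut -f3 | sort -n | awk -v q="$2" '
    { v[NR] = $1 }                       # store sorted depths
    END {
      if (NR == 0) exit 1                # no data rows
      k = int(q * NR)
      if (k < q * NR) k++                # ceiling of q * NR
      if (k < 1) k = 1
      print v[k]
    }'
}

if command -v vcftools >/dev/null 2>&1 && [ -f variants_miss.recode.vcf ]; then
  # 1. Mean depth per variant: depth.ldepth.mean
  vcftools --vcf variants_miss.recode.vcf --site-mean-depth --out depth
  # 2. Lower and upper depth cutoffs
  lo=$(depth_quantile depth.ldepth.mean 0.025)
  hi=$(depth_quantile depth.ldepth.mean 0.975)
  # 3. Keep sites whose mean depth lies between the two cutoffs
  vcftools --vcf variants_miss.recode.vcf --min-meanDP "$lo" --max-meanDP "$hi" \
    --recode --recode-INFO-all --out variants_depth
fi
```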
Convert to PLINK format and run PCA:
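A minimal sketch of this step, assuming PLINK 1.9 is installed as plink. The helper name run_pca and the file names are illustrative; --double-id and --allow-extra-chr are included on the assumption that sample IDs or chromosome names may not follow PLINK's defaults.

```shell
# Hypothetical helper: convert a VCF to PLINK binary format, then run PCA.
# $1 = input VCF, $2 = output prefix (illustrative names).
run_pca() {
  # Step 1: VCF -> .bed/.bim/.fam
  plink --vcf "$1" --double-id --allow-extra-chr --make-bed --out "$2" &&
  # Step 2: PCA on the binary fileset; writes $2.eigenvec and $2.eigenval
  plink --bfile "$2" --allow-extra-chr --pca --out "$2"
}

if command -v plink >/dev/null 2>&1 && [ -f variants_depth.recode.vcf ]; then
  run_pca variants_depth.recode.vcf variants_pca
fi
```

The .eigenvec file holds the per-sample principal component scores and .eigenval the corresponding eigenvalues.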