To filter variants with a low minor allele frequency (MAF) from a variant call format (VCF) file, we can use tools such as bcftools, plink, or R.
What is minor allele frequency?
The MAF is the frequency of the least frequent allele at a variant site. This least frequent allele can be the reference allele or the alternate allele. A common threshold is 5% (0.05), but the appropriate value depends on your cohort size.
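As a quick illustration of the definition, here is a minimal sketch for a biallelic variant. The maf function name and the allele counts are made up for the example; they do not come from any tool.

```shell
# maf ALT_COUNT N_ALLELES: print the minor allele frequency of a biallelic variant.
# (Hypothetical helper, for illustration only.)
maf() {
  awk -v ac="$1" -v an="$2" 'BEGIN {
    af = ac / an                        # frequency of the alternate allele
    m = (af <= 0.5) ? af : 1 - af       # keep the frequency of the rarer allele
    printf "%.3f\n", m
  }'
}

maf 3 20    # 10 individuals, 3 alternate alleles: the alternate allele is minor -> 0.150
maf 18 20   # 10 individuals, 18 alternate alleles: the reference allele is minor -> 0.100
```

In the second call the alternate allele is the common one, so the MAF is the frequency of the reference allele.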
1. bcftools
To remove variants with a low MAF using bcftools, we need to use the view command with the --min-af parameter:
bcftools view --min-af 0.05:minor -Oz input.vcf.gz > output.vcf.gz
It is equivalent to using the abbreviated parameter -q instead of --min-af:
bcftools view -q 0.05:minor -Oz input.vcf.gz > output.vcf.gz
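To check that the filter did something, it can help to count the variants before and after. The sketch below assumes bcftools is installed; count_variants is a hypothetical helper name, and the file names in the usage comments are placeholders.

```shell
# count_variants FILE: number of variant records in a VCF/BCF file.
# -H (--no-header) prints the records without the header lines.
count_variants() {
  bcftools view -H "$1" | wc -l
}

# Usage (placeholder file names):
#   count_variants input.vcf.gz    # before filtering
#   count_variants output.vcf.gz   # after filtering
```

If both counts are identical, no variant fell below the threshold.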
2. plink
To make plink (v1.9) filter variants with a low MAF, you need to use the --maf parameter:
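plink --vcf input.vcf.gz --maf 0.05 --recode vcf-iid --out output
Note that --out takes a file prefix, not a file name: plink adds the extension itself and writes the result to output.vcf.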
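Before choosing a threshold, you may want to see how many variants it would remove. One way, sketched below, is to ask plink for a frequency report with --freq and count the rows under the threshold. The count_low_maf helper and the input_freq prefix are hypothetical, and the sketch assumes the plink 1.9 .frq layout (CHR, SNP, A1, A2, MAF, NCHROBS), where the MAF is the fifth column.

```shell
# Step 1 (not run here, placeholder names): write input_freq.frq
#   plink --vcf input.vcf.gz --freq --out input_freq

# count_low_maf FRQ_FILE THRESHOLD: count variants whose MAF is below THRESHOLD.
# Skips the header line and rows where the MAF is reported as NA.
count_low_maf() {
  awk -v t="$2" 'NR > 1 && $5 != "NA" && $5 + 0 < t + 0' "$1" | wc -l
}

# Step 2 (placeholder name):
#   count_low_maf input_freq.frq 0.05
```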
3. R
a) Import the dosages
If you don’t know how to open VCF files and extract dosages in R, I invite you to read my blog post about How to extract GT or DS fields from VCF files in R. The dosages variable is a matrix where the columns correspond to individuals and the rows correspond to variants.
b) Compute the MAF
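To filter variants with a low MAF, we first need to compute the allele frequency of each variant. The command below sums the rounded dosages of each variant and divides the result by the number of alleles (twice the number of individuals). Assuming the dosages count the alternate allele, this gives the frequency of the alternate allele, which we store in a new column named MAF:
dosages = cbind(dosages, MAF = rowSums(round(dosages)) / (2 * ncol(dosages)))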
This command was slightly adapted from this topic.
c) Filter the variants
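Because the MAF column holds the frequency of the alternate allele, the minor allele is rare when the value is either below 0.05 or above 0.95. We therefore keep the variants with a value between 0.05 and 0.95:
dosages = dosages[dosages[, "MAF"] >= 0.05 & dosages[, "MAF"] <= 0.95, , drop = FALSE]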
Conclusion
In conclusion, we can easily filter variants with a low minor allele frequency using bcftools, plink and R. What else would you like to do with VCF files?