This week, we will see how to make all variants heterozygous in a VCF file. The purpose of this step is to use this file for an unbiased allele-specific read mapping. In other words, we are going to create a custom VCF file containing both reference and alternative alleles for all variants.
Tool used
In this post, we will use the following version of bcftools:
bcftools --version
bcftools 1.11
Using htslib 1.11
Copyright (C) 2020 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Method
This is the command I use to make all variants heterozygous:
cat <(bcftools view -h ${VCF_file} | sed '/#CHROM/s/.*/#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE/') <(bcftools view -H ${VCF_file} | awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\tGT\t0\/1"}') | awk '{sub(/^chr/,"",$0)} 1' > ${VCF_file/.vcf.gz/_snps.vcf}
Let’s break it down together:
- bcftools view -h prints the header of the file
- sed ‘/#CHROM/s/.*/#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE/’ substitutes all column names by 10 other column names
- bcftools view -H prints the content of the file
- awk ‘{print $1″\t”$2″\t”$3″\t”$4″\t”$5″\t”$6″\t”$7″\t”$8″\tGT\t0\/1″}’) keeps the 8 first columns of the file, adds a ninth column with GT and a tenth column with 0/1
- awk ‘{sub(/^chr/,””,$0)} 1’ removes the “chr” string from the first colum to match my reference genome (facultative)
Result
Let’s print 20 lines from the 51th one:
tail -n +51 ${VCF_file} | head -n 20
Conclusion
In conclusion, we can make all variants heterozygous in a VCF file using the bcftools, awk and cat commands. Do you know a better way of doing that?