How to make all variants heterozygous in a VCF file

This week, we will see how to make all variants heterozygous in a VCF file. The purpose of this step is to use this file for an unbiased allele-specific read mapping. In other words, we are going to create a custom VCF file containing both reference and alternative alleles for all variants.

Tool used

In this post, we will use the following version of bcftools:

bcftools --version
bcftools 1.11
Using htslib 1.11
Copyright (C) 2020 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Method

This is the command I use to make all variants heterozygous:

cat <(bcftools view -h ${VCF_file} | sed '/#CHROM/s/.*/#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    SAMPLE/') <(bcftools view -H ${VCF_file} | awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\tGT\t0\/1"}') | awk '{sub(/^chr/,"",$0)} 1' > ${VCF_file/.vcf.gz/_snps.vcf}

Let’s break it down together:

  • bcftools view -h prints the header of the file
  • sed ‘/#CHROM/s/.*/#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE/’ substitutes all column names by 10 other column names
  • bcftools view -H prints the content of the file
  • awk ‘{print $1″\t”$2″\t”$3″\t”$4″\t”$5″\t”$6″\t”$7″\t”$8″\tGT\t0\/1″}’) keeps the 8 first columns of the file, adds a ninth column with GT and a tenth column with 0/1
  • awk ‘{sub(/^chr/,””,$0)} 1’ removes the “chr” string from the first colum to match my reference genome (facultative)

Result

Let’s print 20 lines from the 51th one:

tail -n +51 ${VCF_file} | head -n 20
Example of VCF file only containing heterozygous variants

Conclusion

In conclusion, we can make all variants heterozygous in a VCF file using the bcftools, awk and cat commands. Do you know a better way of doing that?

Related posts

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply