In this guide, we will see how to extract genotype (GT) or dosage (DS) fields from VCF files in R. We will make a speed comparison between base R and the vcfR package.
Before digging into this subject, I invite you to read my blog post about how to read VCF files in R if you are not familiar with this file format.
The VCF file that I will use today contains 423 individuals (columns) and 6,299,998 variants (rows). However, I will only import 181 individuals.
Count the header lines
First, we need to count the number of header lines that we will skip:
VCF = file("Example.vcf", "r")
skip = 0
line = readLines(VCF, 1)
while(!grepl("#CHROM", line)) {
skip = skip + 1
line = readLines(VCF, 1)
}
close(VCF)
Read the column names
Then, we need to import the column names:
columns = read.table("Example.vcf", comment.char = "", skip = skip, nrows = 1)
Extract GT or DS fields in base R
1) Read the VCF file
kept_columns = ifelse(columns %in% c("ID", individuals), "character", "NULL")
VCF = read.table("Example.vcf", comment.char = "", skip = skip,
colClasses = kept_columns, header = TRUE,
stringsAsFactors = FALSE, check.names = FALSE)
rownames(VCF) = VCF[, "ID"]
VCF = as.matrix(VCF[, individuals])
Time difference of 6.812936 mins
2) Extract the GT field
genotypes = matrix(sub(":.*", "", VCF),
nrow = nrow(VCF), ncol = ncol(VCF),
dimnames = list(row.names(VCF), colnames(VCF)))
Time difference of 8.305583 mins
3) Extract the DS field
dosages = matrix(as.numeric(sub(".*:", "", VCF)),
nrow = nrow(VCF), ncol = ncol(VCF),
dimnames = list(row.names(VCF), colnames(VCF)))
Time difference of 11.36085 mins
4) Limitations
This algorithm will only work if the format of your VCF file is “GT:DS” and it requires more memory than the vcfR package.
Extract GT or DS fields with the vcfR package
1) Import the VCF file
library(vcfR)
kept_columns = which(columns %in% c("ID", "FORMAT", individuals))
VCF = vcfR::read.vcfR("Example.vcf", cols = kept_columns, verbose = FALSE)
Time difference of 11.043 mins
2) Extract the GT field
genotypes = vcfR::extract.gt(VCF, return.alleles = FALSE, element = "GT",
as.numeric = FALSE)
Time difference of 7.927915 mins
3) Extract the DS field
dosages = vcfR::extract.gt(VCF, return.alleles = FALSE, element = "DS",
as.numeric = TRUE)
Time difference of 6.932623 mins
Conclusion
In conclusion, we can extract genotype (GT) or dosage (DS) fields from VCF files in base R or with the vcfR package. Which one do you prefer to use?