How to extract GT or DS fields from VCF files in R

In this guide, we will see how to extract genotype (GT) or dosage (DS) fields from VCF files in R. We will make a speed comparison between base R and the vcfR package.

Before digging into this subject, I invite you to read my blog post about how to read VCF files in R if you are not familiar with this file format.

The VCF file that I will use today contains 423 individuals (columns) and 6,299,998 variants (rows). However, I will only import 181 individuals.

Count the header lines

First, we need to count the number of header lines that we will skip:

VCF = file("Example.vcf", "r")
skip = 0
line = readLines(VCF, 1)

while(!grepl("#CHROM", line)) {
  skip = skip + 1
  line = readLines(VCF, 1)
}

close(VCF)

Read the column names

Then, we need to import the column names:

columns = read.table("Example.vcf", comment.char = "", skip = skip, nrows = 1)

Extract GT or DS fields in base R

1) Read the VCF file

kept_columns = ifelse(columns %in% c("ID", individuals), "character", "NULL")

VCF = read.table("Example.vcf", comment.char = "", skip = skip,
                 colClasses = kept_columns, header = TRUE,
                 stringsAsFactors = FALSE, check.names = FALSE)

rownames(VCF) = VCF[, "ID"]
VCF = as.matrix(VCF[, individuals])
Time difference of 6.812936 mins

2) Extract the GT field

genotypes = matrix(sub(":.*", "", VCF),
                   nrow = nrow(VCF), ncol = ncol(VCF),
                   dimnames = list(row.names(VCF), colnames(VCF)))
Time difference of 8.305583 mins

3) Extract the DS field

dosages = matrix(as.numeric(sub(".*:", "", VCF)),
                 nrow = nrow(VCF), ncol = ncol(VCF),
                 dimnames = list(row.names(VCF), colnames(VCF)))
Time difference of 11.36085 mins

4) Limitations

This algorithm will only work if the format of your VCF file is “GT:DS” and it requires more memory than the vcfR package.

Extract GT or DS fields with the vcfR package

1) Import the VCF file

library(vcfR)
kept_columns = which(columns %in% c("ID", "FORMAT", individuals))
VCF = vcfR::read.vcfR("Example.vcf", cols = kept_columns, verbose = FALSE)
Time difference of 11.043 mins

2) Extract the GT field

genotypes = vcfR::extract.gt(VCF, return.alleles = FALSE, element = "GT",
                             as.numeric = FALSE)
Time difference of 7.927915 mins

3) Extract the DS field

dosages = vcfR::extract.gt(VCF, return.alleles = FALSE, element = "DS",
                           as.numeric = TRUE)
Time difference of 6.932623 mins

Conclusion

In conclusion, we can extract genotype (GT) or dosage (DS) fields from VCF files in base R or with the vcfR package. Which one do you prefer to use?

Related posts

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply