How to import VCF files in R

In this blog post, we will see how to import VCF files in base R. This type of file is very common in genotyping analyses but requires a specific method to open it in R.

VCF format example

VCF files store information about genetic variants. They are typically composed of:

  • several lines starting with ## and containing the header
  • one line starting with #CHROM and containing the column names
  • hundreds of thousands lines containing sequence variations

For instance, here is a very small subset of a VCF file:

##fileformat=VCFv4.1
##source=gtc_to_vcf 1.2.0
##reference=GrCh37.fa
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=String,Description="GenCall score. For merged multi-locus entries, min(GenCall) score is reported.">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1       SAMPLE2       SAMPLE3
1       82154   rs4477212       A       G       .       PASS    .       GT:GQ   0/0:0.462962    0/0:0.462882    0/0:0.462962
1       752721  rs3131972       A       G       .       PASS    .       GT:GQ   1/1:0.831747    0/1:0.831747    0/0:0.831747
1       768448  rs12562034      G       A       .       PASS    .       GT:GQ   0/0:0.478154    0/1:0.482656    0/0:0.482656
1       776546  rs12124819      A       G       .       PASS    .       GT:GQ   0/0:0.491415    0/0:0.491415    0/0:0.491415
1       798959  rs11240777      G       A       .       PASS    .       GT:GQ   0/0:0.770731    0/1:0.770731    1/1:0.770731
1       800007  rs6681049       T       C       .       PASS    .       GT:GQ   1/1:0.615153    0/1:0.615153    1/1:0.615153
1       838555  rs4970383       C       A       .       PASS    .       GT:GQ   1/1:0.807373    1/1:0.807373    0/0:0.807373
1       846808  rs4475691       C       T       .       PASS    .       GT:GQ   1/1:0.882001    0/0:0.882001    0/0:0.882001
1       854250  rs7537756       A       G       .       PASS    .       GT:GQ   1/1:0.812973    0/0:0.812973    0/0:0.812973
1       861808  rs13302982      A       G       .       PASS    .       GT:GQ   1/1:0.78072     0/0:0.78072     1/1:0.78072

This VCF file is composed of:

  • 30 header lines (but it can be more or less depending on your VCF file)
  • 1 line with column names
  • 10 lines with genotyping information (such as chromosome, position on the chromosome, reference SNP (rs) identity, reference allele, alternate allele, genotype and genotype quality) of 3 individuals (samples n°1, 2 and 3)

Read VCF files in base R

The codes below can be used to open VCF files in R:

VCF = file("Example.VCF", "r")
skip = 0
line = readLines(VCF, 1)

while(!grepl("#CHROM", line)) {
  skip = skip + 1
  line = readLines(VCF, 1)
}

close(VCF)

This algorithm is based on an issue about reading a file until a match is found and calls the following functions:

  • file: to open the connection to the VCF file
  • readLines: to read the VCF file line by line
  • grepl: to return TRUE if the line contains #CHROM and FALSE otherwise
  • close: to close the connection to the VCF file
VCF = read.table("Example.vcf", skip = skip, comment.char = "", header = TRUE,
                 stringsAsFactors = FALSE, check.names = FALSE)

The read.table built-in function uses the following arguments:

  • skip = skip: to ignore the header lines (30 in this example)
  • comment.char = “”: to not ignore the line starting with #CHROM
  • header = TRUE: to use the line starting with #CHROM as column names
  • stringsAsFactors = FALSE: to avoid strings becoming factors
  • check.names = FALSE: to avoid “#CHROM” becoming “X.CHROM” in column names

Here is the result in R:

Example of VCF file opened in base R

Conclusion

To conclude, we can import VCF files in base R using the read.table, file, readLines, grepl and close functions. What do you think of this method?

Related posts

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply