In this blog post, we will see how to import VCF files in base R. This type of file is very common in genotyping analyses but requires a specific method to open it in R.
VCF format example
VCF files store information about genetic variants. They are typically composed of:
- several lines starting with ## and containing the header
- one line starting with #CHROM and containing the column names
- hundreds of thousands lines containing sequence variations
For instance, here is a very small subset of a VCF file:
##fileformat=VCFv4.1
##source=gtc_to_vcf 1.2.0
##reference=GrCh37.fa
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=String,Description="GenCall score. For merged multi-locus entries, min(GenCall) score is reported.">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2 SAMPLE3
1 82154 rs4477212 A G . PASS . GT:GQ 0/0:0.462962 0/0:0.462882 0/0:0.462962
1 752721 rs3131972 A G . PASS . GT:GQ 1/1:0.831747 0/1:0.831747 0/0:0.831747
1 768448 rs12562034 G A . PASS . GT:GQ 0/0:0.478154 0/1:0.482656 0/0:0.482656
1 776546 rs12124819 A G . PASS . GT:GQ 0/0:0.491415 0/0:0.491415 0/0:0.491415
1 798959 rs11240777 G A . PASS . GT:GQ 0/0:0.770731 0/1:0.770731 1/1:0.770731
1 800007 rs6681049 T C . PASS . GT:GQ 1/1:0.615153 0/1:0.615153 1/1:0.615153
1 838555 rs4970383 C A . PASS . GT:GQ 1/1:0.807373 1/1:0.807373 0/0:0.807373
1 846808 rs4475691 C T . PASS . GT:GQ 1/1:0.882001 0/0:0.882001 0/0:0.882001
1 854250 rs7537756 A G . PASS . GT:GQ 1/1:0.812973 0/0:0.812973 0/0:0.812973
1 861808 rs13302982 A G . PASS . GT:GQ 1/1:0.78072 0/0:0.78072 1/1:0.78072
This VCF file is composed of:
- 30 header lines (but it can be more or less depending on your VCF file)
- 1 line with column names
- 10 lines with genotyping information (such as chromosome, position on the chromosome, reference SNP (rs) identity, reference allele, alternate allele, genotype and genotype quality) of 3 individuals (samples n°1, 2 and 3)
Read VCF files in base R
The codes below can be used to open VCF files in R:
VCF = file("Example.VCF", "r")
skip = 0
line = readLines(VCF, 1)
while(!grepl("#CHROM", line)) {
skip = skip + 1
line = readLines(VCF, 1)
}
close(VCF)
This algorithm is based on an issue about reading a file until a match is found and calls the following functions:
- file: to open the connection to the VCF file
- readLines: to read the VCF file line by line
- grepl: to return TRUE if the line contains #CHROM and FALSE otherwise
- close: to close the connection to the VCF file
VCF = read.table("Example.vcf", skip = skip, comment.char = "", header = TRUE,
stringsAsFactors = FALSE, check.names = FALSE)
The read.table built-in function uses the following arguments:
- skip = skip: to ignore the header lines (30 in this example)
- comment.char = “”: to not ignore the line starting with #CHROM
- header = TRUE: to use the line starting with #CHROM as column names
- stringsAsFactors = FALSE: to avoid strings becoming factors
- check.names = FALSE: to avoid “#CHROM” becoming “X.CHROM” in column names
Here is the result in R:
Conclusion
To conclude, we can import VCF files in base R using the read.table, file, readLines, grepl and close functions. What do you think of this method?