To show you how to liftover genomic positions in R, we will use the association statistics from an article that reports risk loci from genome-wide association studies (GWAS) for one group of autoimmune diseases.
Load the libraries
In this tutorial, we will import the supplementary table 2 which is an excel file. We therefore need the readxl package. Moreover, we will use the rtracklayer package to liftover the genomic positions.
library(readxl)
library(rtracklayer)
Import the dataset
First, we can import the excel file and look at its content:
meta_analysis = data.frame(readxl::read_xlsx("41588_2017_BFng3760_MOESM276_ESM.xlsx", skip = 8), check.names = FALSE)
This file contains information on 241 GWAS risk loci such as the chromosome, the position and rsID of the top SNP, the position of the variants that are in linkage disequilibrium (LD) with the top SNP and the trait to which the top SNP is associated:
View(meta_analysis)
Subset the dataset
Second, we can select the columns that we will work on:
columns = c("chr", "pos", "rsID", "LD_left", "LD_right", "trait")
loci = meta_analysis[, c(1:5,9)]
names(loci) = columns
Here is the result before lifting over:
View(loci)
Chain file
Third, we need to download a chain file from the ensembl website.
Liftover the genomic positions
Fourth, we are going to liftover the genomic positions from GRCh37 to GRCh38 with this function:
lift_over = function(dataset, column){
fake_column = paste0("fake_", column)
excluded_columns = c(fake_column, "end", "width", "strand")
dataset[, fake_column] = dataset[, column] + 1
dataset_GR = GenomicRanges::makeGRangesFromDataFrame(dataset,
keep.extra.columns = T,
ignore.strand = F,
seqnames.field = "chr",
start.field = column,
end.field = fake_column)
seqlevelsStyle(dataset_GR) = "UCSC"
dataset = rtracklayer::liftOver(dataset_GR, chain_info)
dataset = as.data.frame(unlist(dataset), stringsAsFactors = FALSE)
dataset = dataset[, setdiff(names(dataset), excluded_columns)]
names(dataset)[names(dataset) == "seqnames"] = "chr"
names(dataset)[names(dataset) == "start"] = column
return(dataset)
}
We can apply it on the three columns that contain genomic positons (pos, LD_left and LD_right):
loci = lift_over(loci, "pos")
loci = lift_over(loci, "LD_left")
loci = lift_over(loci, "LD_right")
loci = loci[, columns]
And here is the result after lifting over:
View(loci)
Conclusion
In conclusion, the rtracklayer R package needs intervals to liftover genomic positions, so we need to create a fake column that we delete directly after the build conversion. Is this blog post useful to you?