How to liftover genomic positions in R

To show you how to liftover genomic positions in R, we will use the association statistics from an article that reports risk loci from genome-wide association studies (GWAS) for one group of autoimmune diseases.

Load the libraries

In this tutorial, we will import the supplementary table 2 which is an excel file. We therefore need the readxl package. Moreover, we will use the rtracklayer package to liftover the genomic positions.

library(readxl)
library(rtracklayer)

Import the dataset

First, we can import the excel file and look at its content:

meta_analysis = data.frame(readxl::read_xlsx("41588_2017_BFng3760_MOESM276_ESM.xlsx", skip = 8), check.names = FALSE)

This file contains information on 241 GWAS risk loci such as the chromosome, the position and rsID of the top SNP, the position of the variants that are in linkage disequilibrium (LD) with the top SNP and the trait to which the top SNP is associated:

View(meta_analysis)
Screenshot of the file containing information about GWAS risk loci (GRCh37)

Subset the dataset

Second, we can select the columns that we will work on:

columns = c("chr", "pos", "rsID", "LD_left", "LD_right", "trait")
loci = meta_analysis[, c(1:5,9)]
names(loci) = columns

Here is the result before lifting over:

View(loci)
Screenshot of the file containing information about GWAS risk loci after subsetting and column renaming (GRCh37)

Chain file

Third, we need to download a chain file from the ensembl website.

Liftover the genomic positions

Fourth, we are going to liftover the genomic positions from GRCh37 to GRCh38 with this function:

lift_over = function(dataset, column){
  fake_column = paste0("fake_", column)
  excluded_columns = c(fake_column, "end", "width", "strand")
  dataset[, fake_column] = dataset[, column] + 1
  dataset_GR = GenomicRanges::makeGRangesFromDataFrame(dataset,
                                                       keep.extra.columns = T,
                                                       ignore.strand = F,
                                                       seqnames.field = "chr",
                                                       start.field = column,
                                                       end.field = fake_column)
  seqlevelsStyle(dataset_GR) = "UCSC"
  dataset = rtracklayer::liftOver(dataset_GR, chain_info)
  dataset = as.data.frame(unlist(dataset), stringsAsFactors = FALSE)
  dataset = dataset[, setdiff(names(dataset), excluded_columns)]
  names(dataset)[names(dataset) == "seqnames"] = "chr"
  names(dataset)[names(dataset) == "start"] = column
  
  return(dataset)
}

We can apply it on the three columns that contain genomic positons (pos, LD_left and LD_right):

loci = lift_over(loci, "pos")
loci = lift_over(loci, "LD_left")
loci = lift_over(loci, "LD_right")
loci = loci[, columns]

And here is the result after lifting over:

View(loci)
Screenshot of the file containing information about GWAS risk loci after lifting over (GRCh38)

Conclusion

In conclusion, the rtracklayer R package needs intervals to liftover genomic positions, so we need to create a fake column that we delete directly after the build conversion. Is this blog post useful to you?

Related posts

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply