How to get the TSS of genes in R

Today, we will see how to get the transcription start site (TSS) of genes in R. Indeed, it is for example needed in expression quantitative trait loci (eQTL) analyses where we define a 2 Mb-window centered on the TSS.

What is a transcription start site (TSS)?

As the start position is always smaller than the end position regardless of the strand, the location of the TSS corresponds to the:

  • start position for genes that are on + the strand
  • end position for genes that are on – the strand

Two methods to find the TSS of genes in R

Method 1

On the one hand, we will access the Ensembl database with the biomaRt library and use the Ensembl version 110 which was released in July 2023. Then, we will define the TSS as the start position for genes on the + strand (strand = 1) and as the end position for genes on the – strand (strand = -1):

library(biomaRt)

ensembl = useEnsembl(dataset = "hsapiens_gene_ensembl",
                     biomart = "ensembl",
                     version = 110)

ensembl = getBM(mart = ensembl,
                attributes = c("ensembl_gene_id", "strand",
                               "start_position", "end_position"))

ensembl[, "transcription_start_site"] = ifelse(ensembl[, "strand"] == 1,
                                               ensembl[, "start_position"],
                                               ensembl[, "end_position"])

View(tail(ensembl))
Screenshot of the table reported by biomaRt with the method 1

Method 2

On the other hand, we will use the “transcription_start_site” column of the Ensembl database:

library(biomaRt)

ensembl = useEnsembl(dataset = "hsapiens_gene_ensembl",
                     biomart = "ensembl",
                     version = 110)

ensembl = getBM(mart = ensembl,
                attributes = c("ensembl_gene_id", "strand",
                               "start_position", "end_position",
                               "transcription_start_site"))

length(ensembl[, "ensembl_gene_id"]) #240694
length(unique(ensembl[, "ensembl_gene_id"])) #70116

View(tail(ensembl))
Screenshot of the table reported by biomaRt with the method 2 (before)

However, it outputs all gene transcripts because each transcript has its own TSS. For example, you can see that it reports several transcripts for the ENSG00000116786 gene. What if you only need one TSS per gene?

From Ensembl version 104, it is possible to find the canonical transcript of each gene using the “transcript_is_canonical” column:

library(biomaRt)

ensembl = useEnsembl(dataset = "hsapiens_gene_ensembl",
                     biomart = "ensembl",
                     version = 110)

ensembl = getBM(mart = ensembl,
                attributes = c("ensembl_gene_id", "strand",
                               "start_position", "end_position",
                               "transcription_start_site",
                               "transcript_is_canonical"))

ensembl = ensembl[which(ensembl[, "transcript_is_canonical"] == 1),]

length(ensembl[, "ensembl_gene_id"]) #70116
length(unique(ensembl[, "ensembl_gene_id"])) #70116

View(tail(ensembl))
Screenshot of the table reported by biomaRt with the method 2 (after)

Conclusion

In conclusion, we can get the TSS of genes in R using the biomaRt package. Which method do you use, especially for eQTL analyses?

Related posts

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply