How to perform a basic eQTL analysis in R

To perform a basic eQTL analysis in R, we will use the lm function from base R and extract the slope (also called beta, effect or estimate), the standard error, the t-score and the p-value of each linear regression line.

What is an eQTL

An eQTL is a genomic region associated with a decreased or an increased gene expression. There are two types of eQTLs:

  • if the variants are located on the same DNA molecule as the gene, we speak of cis eQTL
  • if the variants are not located on the same chromosome as the gene, we speak of trans eQTL

In this tutorial, we will perform a cis-eQTL mapping with an additive effect.

Data needed for an eQTL analysis

We need two types of data for the same set of individuals:

  • the read counts for each gene
  • the allele dosages for each variant in a given window around the transcription start site (TSS) of each gene

In this guide, we will consider only one gene but in reality, we test several thousand genes.

For both datasets, column names should correspond to individuals (in the same order) and row names should correspond to genes (phenotype file) or rsIDs (genotype file):

expression = expression[row.names(expression) == gene, individuals]
genotypes = genotypes[row.names(genotypes) %in% window_around_TSS, individuals]

All values for read counts and genomic dosages should be numeric.

eQTL mapping with the lm function

Several regression models can be used for an eQTL mapping but we will use a linear one today. Therefore, we need to ensure that our gene expression follows a normal distribution.

Here is how to perform the linear regression with the lm function and extract the vector of beta, standard error (se), t-values and p-values:

lm_coefficients = apply(genotypes, 1, function(genotype){
  summary(lm(expression ~ genotype))$coefficients[-1,]})

beta = lm_coefficients[1,]
se = lm_coefficients[2,]
t_value = lm_coefficients[3,]
p_value = lm_coefficients[4,]

Here is the implementation of the linear regression without the lm function (faster than lm):

ni = nrow(genotypes) #number of individuals
ng = ncol(genotypes) #number of genotypes
mean_expr = mean(expression)
mean_geno = apply(genotypes, 1, mean)

beta = apply(genotypes, 1, function(genotype){
  cov(genotype, expression) / var(genotype)})

alpha = mean_expr - beta * mean_geno

se = as.numeric(lapply(1:ng, function(genotype){
  sqrt((sum((beta[genotype] * genotypes[genotype,] + alpha[genotype] - expression)^2) / sum((genotypes[genotype,] - mean_geno[genotype])^2)) / (ni-2))}))

t_value = beta / se
p_value = 2 * pt(abs(t_value), ni - 2, lower.tail = FALSE)

The slope of each regression line corresponds to the covariance of each genotype with the expression divided by the variance of each genotype.

The alpha corresponds to the intercept with the y axis.

The standard error corresponds to the square root of the sum of the squared difference between the predicted (or fitted) expression and the real expression divided by the sum of the squared difference between the genotype and the mean genotype divided by the number of individuals minus two.

Remark

Pay attention that if the order of the alleles is reversed, beta will have an opposite sign. Therefore, if you want to compare summary statistics from different studies, you need to compare the order of the alleles and multiply beta by -1 if the order is reversed.

Conclusion

In conclusion, this post shows the principle of a basic eQTL analysis in R but in reality, we will use softwares such as QTLtools.

Related posts

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply