In this article, we will see how to perform a size factors normalization in base R by reimplementing the DESeq2 function.
What is a size factors normalization
Before digging into this subject, I invite you to read this very comprehensive guide which compares several methods for normalizing gene expression. Indeed, it gives explanations about the median of ratios method which is based on size factors.
Here is the input dataset:
The size factors normalization in DESeq2
The DESeq2 R package has a function to perform this normalization:
library(DESeq2)
size_factors = estimateSizeFactorsForMatrix(as.matrix(raw[, -1]))
test1 = raw
test1[, -1] = t(apply(test1[, -1], 1, function(x){x / size_factors}))
Here is the output dataset:
The size factors normalization in base R
We will split the procedure into five steps.
Step 0: ignore the genes that contain at least one zero
People often wonder how DESeq2 handles the genes that have at least one zero count because it would give a geometric mean equal to zero and therefore a division by zero.
If we look into the DESeq2 source code, we observe that the genes with 0s are ignored from step 1 to step 3. As in this issue, we need to remove the rows that contain at least one zero:
test2 = raw[!(apply(raw[, -1], 1, function(x) any(x == 0))),]
Step 1: compute the geometric mean per gene
First, we compute the geometric mean of each gene (row-wize):
geom_mean = apply(test2[, -1], 1, function(x){exp(mean(log(x)))})
Note: it is more efficient to take the exponential of the mean of the log of the values rather than taking the square root of the product of the values.
Step 2: divide the raw counts by the geometric mean
After that, we divide the raw gene counts by the geometric means (column-wize):
test2[, -1] = apply(test2[, -1], 2, function(x){x / geom_mean})
Step 3: compute the median of ratios per sample
Then, we compute the median of each sample (column-wize):
size_factors = apply(test2[, -1], 2, median)
Step 4: divide the raw counts by the median of ratios
Finally, we divide the raw gene counts by the size factors (row-wize):
test2 = raw
test2[, -1] = t(apply(test2[, -1], 1, function(x){x / size_factors}))
Here is the result:
Conclusion
In conclusion, we have reimplemented the DESeq2 function that performs a size factors normalization in R. What do you think of this post?