How to perform a size factors normalization in R

In this article, we will see how to perform a size factors normalization in base R by reimplementing the DESeq2 function.

What is a size factors normalization

Before digging into this subject, I invite you to read this very comprehensive guide which compares several methods for normalizing gene expression. Indeed, it gives explanations about the median of ratios method which is based on size factors.

Here is the input dataset:

Screenshort of the input dataset containing raw gene counts

The size factors normalization in DESeq2

The DESeq2 R package has a function to perform this normalization:

library(DESeq2)

size_factors = estimateSizeFactorsForMatrix(as.matrix(raw[, -1]))
test1 = raw
test1[, -1] = t(apply(test1[, -1], 1, function(x){x / size_factors}))

Here is the output dataset:

Screenshort of the input dataset containing normalized gene counts using the median of ratios method based on size factors

The size factors normalization in base R

We will split the procedure into five steps.

Step 0: ignore the genes that contain at least one zero

People often wonder how DESeq2 handles the genes that have at least one zero count because it would give a geometric mean equal to zero and therefore a division by zero.

If we look into the DESeq2 source code, we observe that the genes with 0s are ignored from step 1 to step 3. As in this issue, we need to remove the rows that contain at least one zero:

test2 = raw[!(apply(raw[, -1], 1, function(x) any(x == 0))),]

Step 1: compute the geometric mean per gene

First, we compute the geometric mean of each gene (row-wize):

geom_mean = apply(test2[, -1], 1, function(x){exp(mean(log(x)))})

Note: it is more efficient to take the exponential of the mean of the log of the values rather than taking the square root of the product of the values.

Step 2: divide the raw counts by the geometric mean

After that, we divide the raw gene counts by the geometric means (column-wize):

test2[, -1] = apply(test2[, -1], 2, function(x){x / geom_mean})

Step 3: compute the median of ratios per sample

Then, we compute the median of each sample (column-wize):

size_factors = apply(test2[, -1], 2, median)

Step 4: divide the raw counts by the median of ratios

Finally, we divide the raw gene counts by the size factors (row-wize):

test2 = raw
test2[, -1] = t(apply(test2[, -1], 1, function(x){x / size_factors}))

Here is the result:

Screenshort of the input dataset containing normalized gene counts using the median of ratios method based on size factors (reimplementation in base R)

Conclusion

In conclusion, we have reimplemented the DESeq2 function that performs a size factors normalization in R. What do you think of this post?

Related posts

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply