How to filter lowly expressed genes in R

Today, we will filter lowly expressed genes from RNA-seq datasets in R. This is an important step because genes must be expressed at a certain level to perform their biological function and because low expression is indistinguishable from technical noise.

Filtering genes with very low counts can be based on raw counts or on normalized counts (CPM, TPM, FPKM,…). In this tutorial, we will use raw counts and counts per million (CPM).

My dataset

The RNA-seq dataset that I will use today contains 387 samples and 61,541 genes (Ensembl release 105). Here is a subset:

Subset of an RNA-seq dataset in which we will filter lowly expressed genes in R

This subset contains:

  • some genes with only zero counts (no expression)
  • some genes with mostly zero counts (low expression)
  • some genes without zero counts (medium or high expression)

Example 1

The two algorithms below leave 49,802 genes in my dataset.

Remove genes if the sum across all samples is equal to 0

raw = raw[!rowSums(raw[, -1]) == 0,]

Keep genes if the sum across all samples is greater than 0

raw = raw[rowSums(raw[, -1]) > 0,]

Example 2

The four algorithms below leave 16,903 genes in my dataset.

Remove genes if ≥ 80% of samples have < 5 counts

raw = raw[!rowSums(raw[, -1] < 5) >= round(ncol(raw[, -1]) * 0.8),]

Keep genes if < 80% of samples have < 5 counts

raw = raw[rowSums(raw[, -1] < 5) < round(ncol(raw[, -1]) * 0.8),]

Remove genes if ≤ 20% of samples have ≥ 5 counts

raw = raw[!rowSums(raw[, -1] >= 5) <= round(ncol(raw[, -1]) * 0.2),]

Keep genes with if > 20% of samples have ≥ 5 counts

raw = raw[rowSums(raw[, -1] >= 5) > round(ncol(raw[, -1]) * 0.2),]

Example 3

From raw counts to CPM values

cpm = raw
cpm[, -1] = lapply(cpm[, -1], function(x) (x/sum(x)) * 1000000)

The two algorithms below leave 13,766 genes in my dataset.

Remove genes if their mean expression is < 1 CPM

cpm = cpm[!rowMeans(cpm[, -1]) < 1,]

Keep genes if their mean expression is ≥ 1 CPM

cpm = cpm[rowMeans(cpm[, -1]) >= 1,]

Conclusion

Unfortunately, there is no consensus on what is the best method to filter lowly expressed genes (in R for example) and it is really important to detail which one you used in Materials and Methods. You can choose to use the sum or the average as well as raw or normalized counts. What do you recommend?

Related posts

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply