STAR –quantMode GeneCounts vs HTSeq

In this post, we will compare STAR –quantMode GeneCounts vs HTSeq. Their goal is to perform gene quantification during or after read mapping, respectively. To achieve this, they need a GTF file containing the reference genome annotation.

Tools used

STAR --version
2.7.1a
htseq-count
Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.6.1p1.

STAR

STAR is a very efficient free software that enables to align reads to a FASTA file containing the reference genome of the species of interest. Interestingly, it has integrated the –quantMode parameter to count the number of reads per gene at the same type:

STAR \
--runMode alignReads \
--runThreadN ${SLURM_CPUS_PER_TASK} \
--genomeDir ${STARIndexDir} \
--readFilesIn ${reads1} ${reads2} \
--readFilesCommand gunzip -c \
--outFileNamePrefix STAR/${prefix}_ \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--quantMode GeneCounts \
--limitBAMsortRAM 30000000000

Result of the read mappping and gene counting:

head 10 STAR/${prefix}_ReadsPerGene.out.tab
Head of the gene count table produced by STAR --quantMode GeneCounts
tail 10 STAR/${prefix}_ReadsPerGene.out.tab
Tail of the gene count table produced by STAR --quantMode GeneCounts

HTSeq

HTSeq is another free software that quantifies the number of reads that map to genomic features (genes, exons, binding regions,…) in three different ways. The default mode (union) takes into account the reads that fully and partially overlap genes:

htseq-count -f bam -r pos -s no -a 0 STAR/${prefix}_Aligned.sortedByCoord.out.bam ${GTF} > HTSeq/${prefix}_ReadsPerGene.out.tab

I used the following parameters:

-f option for input file format:

  • SAM file: -f sam
  • BAM file: -f bam

-r option for sorting order (for paired-end reads only):

  • by read position: -r pos
  • by read name: -r name

-s option for strand specificity (depending on the library preparation method used):

  • unstranded: -s no (ex: SMART-seq v4 from Takara)
  • forward: -s yes (ex: QuantSeq 3′ mRNA FWD from Lexogen)
  • reverse: -s reverse (ex: TruSeq Stranded mRNA from Illumina)

-a option equal to 0 to not skip any alignment because of low quality

Result of the gene counting:

head 10 HTSeq/${prefix}_ReadsPerGene.out.tab
Head of the gene count table produced by HTSeq
tail 10 HTSeq/${prefix}_ReadsPerGene.out.tab

STAR vs HTSeq

Comparison of the structure of the output tables:

STARHTSeq
number of columns42
name of columnsgene_id
unstranded
read1
read2
gene_id
-s parameter
position of the four special counters belowheadtail
unmappedN_unmapped__not_aligned
multimappedN_multimapping__alignment_not_unique
uniquely mapped to no featureN_noFeature__no_feature
uniquely mapped to more than one featureN_ambiguous__ambiguous

Comparison of the content of the output tables:

diff <(sort STAR/${prefix}_ReadsPerGene.out.tab | awk '{print $1"\t"$2}') HTSeq/${prefix}_ReadsPerGene.out.tab

Only the part describing the four special counters differ between the two files. I obtained the same number of:

  • not unique / multimapped reads
  • uniquely mapped to no feature reads
  • uniquely mapped to more than one feature / ambiguous reads

However, there is a difference of 20 reads for the number of reads which are not mapped. There are 11,750,839 reads in the fastq file, which corresponds to the sum of the HTSeq output.

What happens if we don’t keep the unmapped reads

Moreover, if I redo the same analysis but with removing the –outSAMunmapped Within command from STAR, I obtain a warning for HTSeq: “Warning: Mate records missing for 20 records”.

Conclusion

In conclusion, I obtained the same gene counts with STAR and HTSeq. Do you also get the same results?

Related posts

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply