In this post, we will compare STAR –quantMode GeneCounts vs HTSeq. Their goal is to perform gene quantification during or after read mapping, respectively. To achieve this, they need a GTF file containing the reference genome annotation.
Tools used
STAR --version
2.7.1a
htseq-count
Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.6.1p1.
STAR
STAR is a very efficient free software that enables to align reads to a FASTA file containing the reference genome of the species of interest. Interestingly, it has integrated the –quantMode parameter to count the number of reads per gene at the same type:
STAR \
--runMode alignReads \
--runThreadN ${SLURM_CPUS_PER_TASK} \
--genomeDir ${STARIndexDir} \
--readFilesIn ${reads1} ${reads2} \
--readFilesCommand gunzip -c \
--outFileNamePrefix STAR/${prefix}_ \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--quantMode GeneCounts \
--limitBAMsortRAM 30000000000
Result of the read mappping and gene counting:
head 10 STAR/${prefix}_ReadsPerGene.out.tab
tail 10 STAR/${prefix}_ReadsPerGene.out.tab
HTSeq
HTSeq is another free software that quantifies the number of reads that map to genomic features (genes, exons, binding regions,…) in three different ways. The default mode (union) takes into account the reads that fully and partially overlap genes:
htseq-count -f bam -r pos -s no -a 0 STAR/${prefix}_Aligned.sortedByCoord.out.bam ${GTF} > HTSeq/${prefix}_ReadsPerGene.out.tab
I used the following parameters:
-f option for input file format:
- SAM file: -f sam
- BAM file: -f bam
-r option for sorting order (for paired-end reads only):
- by read position: -r pos
- by read name: -r name
-s option for strand specificity (depending on the library preparation method used):
- unstranded: -s no (ex: SMART-seq v4 from Takara)
- forward: -s yes (ex: QuantSeq 3′ mRNA FWD from Lexogen)
- reverse: -s reverse (ex: TruSeq Stranded mRNA from Illumina)
-a option equal to 0 to not skip any alignment because of low quality
Result of the gene counting:
head 10 HTSeq/${prefix}_ReadsPerGene.out.tab
tail 10 HTSeq/${prefix}_ReadsPerGene.out.tab
STAR vs HTSeq
Comparison of the structure of the output tables:
STAR | HTSeq | |
number of columns | 4 | 2 |
name of columns | gene_id unstranded read1 read2 | gene_id -s parameter |
position of the four special counters below | head | tail |
unmapped | N_unmapped | __not_aligned |
multimapped | N_multimapping | __alignment_not_unique |
uniquely mapped to no feature | N_noFeature | __no_feature |
uniquely mapped to more than one feature | N_ambiguous | __ambiguous |
Comparison of the content of the output tables:
diff <(sort STAR/${prefix}_ReadsPerGene.out.tab | awk '{print $1"\t"$2}') HTSeq/${prefix}_ReadsPerGene.out.tab
Only the part describing the four special counters differ between the two files. I obtained the same number of:
- not unique / multimapped reads
- uniquely mapped to no feature reads
- uniquely mapped to more than one feature / ambiguous reads
However, there is a difference of 20 reads for the number of reads which are not mapped. There are 11,750,839 reads in the fastq file, which corresponds to the sum of the HTSeq output.
What happens if we don’t keep the unmapped reads
Moreover, if I redo the same analysis but with removing the –outSAMunmapped Within command from STAR, I obtain a warning for HTSeq: “Warning: Mate records missing for 20 records”.
Conclusion
In conclusion, I obtained the same gene counts with STAR and HTSeq. Do you also get the same results?