Reference files for sequencing and genotyping data

In this post, we will see where to download some reference files that are required for the analysis of RNA sequencing and SNP genotyping data.

Genome versions

The current human genome build is available under two different names:

  • GRCh38, the name given by the Genome Reference Consortium (GRC) and used by the National Center for Biotechnology Information (NCBI) and Ensembl, whose annotation files (GTF, VCF) use 1-based coordinates
  • hg38, the name used by the University of California at Santa Cruz (UCSC), whose BED files and database tables use 0-based, half-open coordinates
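
To make the difference between the two coordinate systems concrete, here is a minimal sketch that shifts intervals from one convention to the other. It assumes tab-separated input with the chromosome, start and end in the first three columns; the function names `one_to_zero_based` and `zero_to_one_based` are made up for this post and do not belong to any tool.

```shell
# Convert 1-based, fully closed intervals (chrom, start, end; tab-separated)
# into 0-based, half-open intervals: only the start is shifted by one.
# Hypothetical helper name, not part of any existing tool.
one_to_zero_based() {
    awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 - 1, $3 }' "$@"
}

# Reverse direction: 0-based, half-open back to 1-based, fully closed.
zero_to_one_based() {
    awk 'BEGIN { FS = OFS = "\t" } { print $1, $2 + 1, $3 }' "$@"
}
```

For example, `printf '1\t101\t200\n' | one_to_zero_based` prints the same interval with a start of 100 and an unchanged end of 200. The end does not move because a closed end at position 200 and a half-open end at 200 describe the same last base.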

For my analyses, I chose the GRCh38.p13 genome version with Ensembl release 105 for annotations. The correspondence between genome assemblies and Ensembl releases can be found on the GENCODE website (by clicking on the “show all releases” button).

FASTA files (for STAR and Picard LiftoverVcf)

Firstly, FASTA files contain biological sequences of DNA, RNA or proteins. For each sequence, we have:

  • one line beginning with the “>” symbol containing the sequence name / identifier
  • one or more lines containing the sequence
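
As a quick illustration of this layout, the sketch below lists the identifier of every sequence in a FASTA file, taking the identifier to be the first word after the “>” symbol. The function name `fasta_ids` is hypothetical, and the sketch assumes an uncompressed file.

```shell
# Print the identifier of every sequence in a FASTA file, i.e. the first
# word after ">" on each header line. Hypothetical helper name.
fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$@"
}
```

It can be used as `fasta_ids my_sequences.fa | head`, where `my_sequences.fa` is a placeholder file name, to check which chromosomes and scaffolds a file contains before indexing it.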

For your information, I have written an article about how to import FASTA files in R.

For my analyses, I downloaded the following FASTA file:

wget ftp://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Here is the beginning of the file:

Beginning of the GRCh38 version 105 FASTA file

Chromosome length file (for WASP)

Secondly, a chromosome information file contains the name and length of each chromosome in the assembly. It is required by WASP, for example. I did not find where to download this file for the GRCh38.p13 genome version, so I copied and pasted its content from the NCBI website.
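
A possible alternative to copying the values by hand is to compute them from the FASTA file itself. The sketch below is only one way to do it: it assumes an uncompressed FASTA file and that a two-column, tab-separated layout (name, length) is what the downstream tool expects, which should be checked against that tool's documentation. The function name `fasta_lengths` is hypothetical.

```shell
# Print "name<TAB>length" for every sequence in a FASTA file.
# Hypothetical helper name; assumes an uncompressed file with Unix line endings.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len   # flush the previous sequence
            name = substr($1, 2)                  # identifier without ">"
            len = 0
            next
        }
        { len += length($0) }                     # add the bases on this line
        END { if (name != "") print name "\t" len }
    ' "$@"
}
```

If samtools is installed, the first two columns of the index written by `samtools faidx` hold the same information, so comparing both outputs is a simple sanity check.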

Here is an overview of its content:

Beginning of chromosome length file for the GRCh38.p13 genome version

Chain files (for Picard LiftoverVcf)

Thirdly, chain files are needed to lift over a VCF file from one genome version to another. For instance, if you plan to lift over from GRCh37 to GRCh38, you can download the chain file below:

wget ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/GRCh37_to_GRCh38.chain.gz

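Each alignment block in a chain file starts with a header line beginning with the word “chain”, followed by lines describing the aligned segments and the gaps between them.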
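
To get a feel for what a chain file contains, the sketch below counts the chains for each pair of sequence names. It relies on the UCSC chain format, in which header lines start with the word “chain” and carry the reference (target) sequence name in the third field and the query sequence name in the eighth. The function name `chain_pairs` is hypothetical.

```shell
# Count the chains for each (target name, query name) pair in a chain file.
# Reads the files given as arguments, or standard input when there are none.
# Hypothetical helper name.
chain_pairs() {
    awk '$1 == "chain" { n[$3 "\t" $8]++ }
         END { for (k in n) print k "\t" n[k] }' "$@" | sort
}
```

Since the downloaded file is compressed, it can be streamed with `gunzip -c GRCh37_to_GRCh38.chain.gz | chain_pairs`. Each output line holds the two sequence names and the number of chains linking them.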

GTF file (for STAR and HTSeq)

Finally, GTF files contain the annotation for each feature in the assembly, such as the chromosome on which it is located, its strand, and its start and end positions. I have downloaded this one:

wget ftp://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.gtf.gz

Here are the first 17 rows of the file:

Beginning of the GRCh38 version 105 GTF file
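
To check what a GTF file contains before feeding it to STAR or HTSeq, a rough sketch like the one below counts the records per feature type. It assumes the standard GTF layout, where the third tab-separated column holds the feature type (gene, transcript, exon, …) and header lines start with “#”. The function name `gtf_feature_counts` is hypothetical.

```shell
# Count the records per feature type (third column) in a GTF file,
# skipping the comment lines that start with "#".
# Reads the files given as arguments, or standard input when there are none.
# Hypothetical helper name.
gtf_feature_counts() {
    awk -F '\t' '!/^#/ { n[$3]++ }
                 END { for (k in n) print k "\t" n[k] }' "$@" | sort
}
```

The compressed file can be streamed directly with `gunzip -c Homo_sapiens.GRCh38.105.gtf.gz | gtf_feature_counts`.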

Conclusion

To sum up, I have listed where to download some reference files for the analysis of RNA sequencing and SNP genotyping data. Please let me know if something is incorrect.
