Do you want to learn how to import FASTA files in base R? Base R has no dedicated reader for this bioinformatics format, but a few base functions are enough to parse it.
FASTA format example
FASTA files contain biological sequences of DNA, RNA or proteins. For each sequence, we have:
- one line beginning with the “>” symbol containing the sequence name / identifier
- one or several line(s) containing the sequence
For example, let’s open the FASTA file below, which contains five sequences:
You can see that three sequences (seq_1, seq_3 and seq_4) are split over multiple lines. Each line contains at most 60 characters.
Read FASTA files in base R
The code below is adapted from an issue about collapsing multiple rows of strings into one row based on a condition. It uses the following base R functions:
- grepl: returns a logical vector with TRUE for the identifier lines and FALSE for the sequence lines (e.g. TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE)
- cumsum: returns an integer vector that increases by 1 at every TRUE, so each line gets the number of the record it belongs to (e.g. 1 1 1 1 1 2 2 3 3 3 4 4 4 5 5)
- sub: removes the “>” symbol from the sequence names
- tapply: groups the sequence lines by record number and applies a function to each group
- paste: concatenates the lines of one sequence into a single string
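To make these steps concrete, here is a minimal sketch that runs them one at a time on a small in-memory character vector. The identifiers (`seq_a`, `seq_b`, `seq_c`) and the sequences are made up for illustration; they are not the contents of the file above.

```r
# Hypothetical FASTA lines (made-up identifiers and sequences)
fasta <- c(">seq_a", "ACGT", "TTGA", ">seq_b", "GGCC", ">seq_c", "AAAT", "CCCG")

# Step 1: flag header lines (those starting with ">")
is_header <- grepl("^>", fasta)
# TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE

# Step 2: running count of headers = record number of each line
record <- cumsum(is_header)
# 1 1 1 2 2 3 3 3

# Step 3: strip the ">" from the header lines
ids <- sub("^>", "", fasta[is_header])

# Step 4: join the sequence lines of each record into one string
seqs <- tapply(fasta[!is_header], record[!is_header], paste, collapse = "")
```

Dropping the header positions from `record` before grouping is what keeps the identifiers out of the concatenated sequences.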
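fasta <- readLines("fasta.txt")
# Header lines start with ">"; anchor the pattern so ">" elsewhere is ignored
ids <- grepl("^>", fasta)
f <- data.frame(
  id = sub("^>", "", fasta[ids]),
  # Group sequence lines by record number and join each group into one string
  s = tapply(fasta[!ids], cumsum(ids)[!ids], function(x) {
    paste(x, collapse = "")
  }))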
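The resulting data frame has one row per sequence: the id column holds the identifier and the s column holds the full sequence as a single string.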
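For reuse, the same logic can be wrapped in a small function. The sketch below is one possible arrangement, not a standard API: the name `read_fasta`, the `sequence` column name, and the sample file contents are assumptions made for illustration. It also assumes that every header is followed by at least one sequence line; otherwise the two columns would have different lengths.

```r
# Hypothetical helper: read a FASTA file into a data frame using base R only
read_fasta <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]            # drop empty lines
  is_header <- grepl("^>", lines)          # TRUE for lines starting with ">"
  record <- cumsum(is_header)              # record number of each line
  seqs <- tapply(lines[!is_header], record[!is_header],
                 paste, collapse = "")
  data.frame(
    id = sub("^>", "", lines[is_header]),
    sequence = as.vector(seqs),            # plain character vector
    stringsAsFactors = FALSE
  )
}

# Usage with a made-up two-record file written to a temporary location
tmp <- tempfile(fileext = ".fasta")
writeLines(c(">seq_a", "ACGT", "TTGA", ">seq_b", "GGCC"), tmp)
result <- read_fasta(tmp)
print(result)
```

Converting the `tapply` result with `as.vector` gives the data frame an ordinary character column rather than a named one-dimensional array.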
Conclusion
In conclusion, it is possible to open FASTA files using base R functions only. Is this algorithm helpful to you?