How to import FASTA files in R

Do you want to learn how to import FASTA files in base R? Base R has no dedicated reader for this bioinformatics format, but a handful of base functions are enough to import it.

FASTA format example

FASTA files contain biological sequences of DNA, RNA or proteins. For each sequence, we have:

  • one line beginning with the “>” symbol containing the sequence name / identifier
  • one or several line(s) containing the sequence

For example, let’s open the FASTA file below which contains five sequences:

example of a FASTA file

You can see that three sequences (seq_1, seq_3 and seq_4) are split over multiple lines. Each line contains at most 60 characters.
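If you do not have a FASTA file at hand, the layout described above can be reproduced with a short sketch. The sequence contents here are invented for illustration (only the names seq_1 to seq_5 and the 60-character line width come from the example), wrap_sequence is a hypothetical helper name, and the file is written to a temporary directory rather than to the working directory.

```r
# Hypothetical records: the sequence contents are made up for illustration.
records <- c(
  seq_1 = strrep("ACGT", 50),    # 200 characters, wrapped over 4 lines
  seq_2 = strrep("GATTACA", 5),  # 35 characters, fits on 1 line
  seq_3 = strrep("TTGCA", 20),   # 100 characters, wrapped over 2 lines
  seq_4 = strrep("CCGA", 25),    # 100 characters, wrapped over 2 lines
  seq_5 = strrep("AT", 12)       # 24 characters, fits on 1 line
)

# Split one sequence into chunks of at most `width` characters.
wrap_sequence <- function(s, width = 60) {
  starts <- seq(1, nchar(s), by = width)
  substring(s, starts, starts + width - 1)
}

# Put each ">" identifier line in front of its wrapped sequence lines.
fasta_lines <- unlist(lapply(names(records), function(id) {
  c(paste0(">", id), wrap_sequence(records[[id]]))
}))

# Write the lines to a temporary file (the location is an assumption).
path <- file.path(tempdir(), "fasta.txt")
writeLines(fasta_lines, path)
```

Copying that file into your working directory as fasta.txt should let you follow the rest of the post with it.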

Read FASTA files in base R

The code below is adapted from an issue about collapsing multiple rows of strings into one row based on a condition. It uses the following base R functions:

  • grepl: returns a logical vector that is TRUE for the identifier lines and FALSE for the sequence lines (e.g. TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE)
  • cumsum: turns that logical vector into an integer vector that increases by 1 at every TRUE, so each line gets the number of its sequence (e.g. 1 1 1 1 1 2 2 3 3 3 4 4 4 5 5)
  • sub: removes the “>” symbol from the sequence names
  • tapply: applies a function to the lines of each sequence in turn
  • paste: concatenates the lines of one sequence into a single string
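
To see what each of these functions contributes, here is a small walk-through on a toy input. The two records (seq_a and seq_b) are made up for illustration and are not the file shown above.

```r
# Toy input: two invented records, the first one split over two lines.
fasta <- c(">seq_a", "ACGT", "TTGA", ">seq_b", "GGCC")

ids <- grepl("^>", fasta)   # TRUE for identifier lines
print(ids)                  # TRUE FALSE FALSE TRUE FALSE

group <- cumsum(ids)        # sequence number of every line
print(group)                # 1 1 1 2 2

print(sub("^>", "", fasta[ids]))   # "seq_a" "seq_b"

# Join the sequence lines that share the same sequence number.
seqs <- tapply(fasta[!ids], group[!ids], paste, collapse = "")
print(seqs)                 # "ACGTTTGA" and "GGCC", named "1" and "2"
```

The full import applies these same steps to the lines read from the file: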
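
# Read the file as a character vector, one element per line
fasta <- readLines("fasta.txt")
# TRUE for identifier lines; "^" anchors the match to the start of the line
ids <- grepl("^>", fasta)

f <- data.frame(
  # Sequence names without the leading ">"
  id = sub("^>", "", fasta[ids]),
  # cumsum(ids) numbers the sequences; paste() joins the lines of each one
  s = tapply(fasta[!ids], cumsum(ids)[!ids], function(x) {
    paste(x, collapse = "")
  }))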

Here is the resulting data frame:

example of a FASTA file imported in base R
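
If you need to import several files, the same idea can be wrapped in a function. The sketch below is one possible variant and not the code above: read_fasta is a hypothetical name (not a base R function), and it groups lines with split() instead of tapply() so that an identifier with no sequence lines yields an empty string. It also assumes that blank lines can be ignored and that the file starts with an identifier line.

```r
# read_fasta() is a hypothetical helper name, not part of base R.
read_fasta <- function(path) {
  lines <- trimws(readLines(path))
  lines <- lines[nzchar(lines)]                  # drop blank lines
  # Assumption: a valid file starts with a ">" identifier line.
  stopifnot(length(lines) > 0, startsWith(lines[1], ">"))

  record <- cumsum(startsWith(lines, ">"))       # sequence number per line
  chunks <- split(lines, record)                 # identifier + its lines

  data.frame(
    # First line of each chunk is the identifier; strip the leading ">".
    id = sub("^>", "", vapply(chunks, function(x) x[1], character(1),
                              USE.NAMES = FALSE)),
    # Remaining lines are the sequence; join them into one string.
    s = vapply(chunks, function(x) paste(x[-1], collapse = ""),
               character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Usage on a small invented file written to a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(c(">seq_a", "ACGT", "TTGA", ">seq_b", "GGCC"), tmp)
f <- read_fasta(tmp)
print(f)
print(nchar(f$s))   # sequence lengths
```

Like the code above, this treats everything after “>” as the identifier and does no validation of the sequence characters.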

Conclusion

In conclusion, it is possible to open FASTA files using base R functions only. Is this algorithm helpful to you?
