read_fasta: Functions for reading FASTA files
In jedick/canprot: Chemical Analysis of Proteins

read_fasta

R Documentation

Functions for reading FASTA files

Description

Read protein amino acid composition or sequences from a file and count numbers of amino acids in given sequences.

Usage

  read_fasta(file, iseq = NULL, type = "count", lines = NULL, 
    ihead = NULL, start = NULL, stop = NULL, molecule = "protein", id = NULL)
  count_aa(sequence, start = NULL, stop = NULL, molecule = "protein")
  sum_aa(AAcomp, abundance = 1, average = FALSE)

Arguments

`file`	character, path to FASTA file
`iseq`	numeric, which sequences to read from the file
`type`	character, type of return value (‘count’, ‘sequence’, ‘lines’, or ‘headers’)
`lines`	list of character, supply the lines here instead of reading them from file
`ihead`	numeric, which lines are headers
`start`	numeric, position in sequence to start counting
`stop`	numeric, position in sequence to stop counting
`molecule`	character, type of molecule (‘protein’, ‘DNA’, or ‘RNA’)
`id`	character, value to be used for `protein` in output table
`sequence`	character, one or more sequences
`AAcomp`	data frame, amino acid composition(s) of proteins
`abundance`	numeric, abundances of proteins
`average`	logical, return the weighted average of amino acid counts?

Details

read_fasta is used to retrieve entries from a FASTA file. Use iseq to select the sequences to read (the default is all sequences).

The function returns various data formats depending on the value of type:

‘count’: data frame of amino acid counts
‘sequence’: list of sequences
‘lines’: list of lines from the FASTA file (including headers)
‘headers’: list of header lines from the FASTA file

When type is ‘count’, the header lines of the file are parsed to obtain protein names that are put into the protein column in the result. Furthermore, if a UniProt FASTA header is detected (using the regular expression "\|......\|.*_"), the information there (accession, name, organism) is split into the protein, abbrv, and organism columns of the resulting data frame. This header parsing, which may take a while for large files, can be suppressed by supplying protein names in id.
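The header detection described above can be tried out in base R. This is a minimal sketch, not canprot's implementation: the example header and the way the fields are split are assumptions made for illustration, and only the regular expression comes from the text.

```r
# Hypothetical UniProt-style header, used only for illustration
header <- ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens"

# The regular expression quoted above, with backslashes doubled for an R string
uniprot_pattern <- "\\|......\\|.*_"
is_uniprot <- grepl(uniprot_pattern, header)

# Assumed splitting: fields are separated by "|", and the entry name
# (the first word of the third field) has the form NAME_ORGANISM
fields <- strsplit(header, "|", fixed = TRUE)[[1]]
accession <- fields[2]
entry <- strsplit(fields[3], " ", fixed = TRUE)[[1]][1]
parts <- strsplit(entry, "_", fixed = TRUE)[[1]]
name <- parts[1]
organism <- parts[2]

# A header without "|" separators does not match the pattern
plain_header <- ">my_protein example"
is_plain_uniprot <- grepl(uniprot_pattern, plain_header)
```

Because the pattern requires exactly six characters between the two "|" separators, only six-character accessions match it.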

To speed up processing, if the line numbers of the header lines were previously determined, they can be supplied in ihead. Optionally, the lines of a previously read file may be supplied in lines (in this case no file is needed so file should be set to "").
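As a rough illustration of what lines and ihead contain, the following base R sketch locates header lines in a small in-memory FASTA example and joins the sequence lines that follow each header. The example lines are invented, and the sketch assumes that every header is followed by at least one sequence line; it is not the code used by read_fasta.

```r
# Invented FASTA content; in practice this could come from readLines()
lines <- c(">prot1 first example", "MKV", "LAA",
           ">prot2 second example", "GGSGG")

# Line numbers of the headers (lines starting with ">")
ihead <- grep("^>", lines)

# Each sequence runs from the line after its header to the line
# before the next header (or to the end of the file)
first <- ihead + 1
last <- c(ihead[-1] - 1, length(lines))
sequences <- mapply(function(a, b) paste(lines[a:b], collapse = ""),
                    first, last)

# Following the argument descriptions, these could then be supplied as
# read_fasta("", lines = lines, ihead = ihead)
```

Here ihead is c(1, 4) and sequences holds "MKVLAA" and "GGSGG".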

count_aa is the underlying function that counts the numbers of each amino acid or nucleic-acid base in one or more sequences. The matching of letters is case-insensitive. A message is generated if any character in sequence, excluding spaces, is not one of the single-letter amino acid or nucleobase abbreviations. start and/or stop can be provided to process a fragment of the sequence. If only one of start or stop is present, the other defaults to 1 (start) or the length of the respective sequence (stop).
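The counting rules above can be illustrated in base R. This is a minimal sketch under stated assumptions, not the package's implementation: count_letters_sketch is a hypothetical name, it handles proteins only, and it applies start and stop before ignoring spaces.

```r
# Hypothetical helper that mimics the documented behavior for proteins
count_letters_sketch <- function(sequence, start = NULL, stop = NULL) {
  aa1 <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
           "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
  aa3 <- c("Ala", "Cys", "Asp", "Glu", "Phe", "Gly", "His", "Ile", "Lys", "Leu",
           "Met", "Asn", "Pro", "Gln", "Arg", "Ser", "Thr", "Val", "Trp", "Tyr")
  counts <- lapply(sequence, function(s) {
    s <- toupper(s)                                # case-insensitive matching
    first <- if (is.null(start)) 1 else start      # missing start defaults to 1
    last <- if (is.null(stop)) nchar(s) else stop  # missing stop defaults to the length
    chars <- strsplit(substr(s, first, last), "")[[1]]
    chars <- chars[chars != " "]                   # spaces are ignored
    unknown <- setdiff(chars, aa1)
    if (length(unknown) > 0) {
      message("unrecognized characters: ", paste(unknown, collapse = " "))
    }
    # Letters outside aa1 become NA in the factor and are not counted
    as.integer(table(factor(chars, levels = aa1)))
  })
  out <- as.data.frame(do.call(rbind, counts))
  names(out) <- aa3
  out
}

count_letters_sketch("ggsgg")$Gly               # 4
count_letters_sketch("AAAGGG", start = 4)$Ala   # 0
count_letters_sketch("AAAGGG", stop = 2)$Ala    # 2
```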

sum_aa sums the amino acid compositions in the input AAcomp data frame. It only applies to columns with the three-letter abbreviations of amino acids and to a column named chains (if present). The values in these columns are multiplied by the indicated abundance after recycling to the number of proteins. The values in these columns are then summed; if average is TRUE then the sum is divided by the number of proteins. Proteins with missing values (NA) of amino acid composition or abundance are omitted from the calculation. The output has one row and the same number of columns as the input; the value in the non-amino acid columns is taken from the first row of the input.
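The arithmetic described above can be sketched in base R as follows. This is an illustration that follows the text as written, not the package source: sum_aa_sketch is a hypothetical name, the input data are invented, and dividing by the number of proteins kept after removing NA rows is an assumption.

```r
# Hypothetical re-implementation of the steps described in the text
sum_aa_sketch <- function(AAcomp, abundance = 1, average = FALSE) {
  aa3 <- c("Ala", "Cys", "Asp", "Glu", "Phe", "Gly", "His", "Ile", "Lys", "Leu",
           "Met", "Asn", "Pro", "Gln", "Arg", "Ser", "Thr", "Val", "Trp", "Tyr")
  # Only amino acid columns and a chains column (if present) are used
  cols <- intersect(c("chains", aa3), names(AAcomp))
  # Recycle abundance to the number of proteins
  abundance <- rep_len(abundance, nrow(AAcomp))
  # Omit proteins with NA in the composition or the abundance
  keep <- stats::complete.cases(AAcomp[, cols, drop = FALSE]) & !is.na(abundance)
  kept <- AAcomp[keep, , drop = FALSE]
  # Multiply each row by its abundance, then sum down the columns
  weighted <- as.matrix(kept[, cols, drop = FALSE]) * abundance[keep]
  total <- colSums(weighted)
  if (average) total <- total / nrow(kept)
  # Other columns take their values from the first row of the input
  out <- AAcomp[1, , drop = FALSE]
  out[cols] <- as.list(total)
  out
}

# Invented input with a reduced set of amino acid columns
AAcomp <- data.frame(protein = c("p1", "p2", "p3"), chains = c(1, 1, 1),
                     Ala = c(2, 4, NA), Gly = c(1, 3, 5))
sum_aa_sketch(AAcomp, abundance = c(1, 3, 2))
```

In this example the third protein is dropped because of its NA value, so the sums are chains = 4, Ala = 14, and Gly = 10.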

Value

count_aa returns a data frame with these columns (for proteins): Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr. For ‘DNA’, the columns are changed to A, C, G, T, and for ‘RNA’, the columns are changed to A, C, G, U.

read_fasta returns a list of sequences (for type equal to ‘sequence’) or a list of lines (for type equal to ‘lines’ or ‘headers’). Otherwise (for type equal to ‘count’), it returns a data frame with these columns: protein, organism, ref, abbrv, chains, and the columns described above for count_aa.

sum_aa returns a one-row data frame.

Examples

## Reading a protein FASTA file
# The path to the file
file <- system.file("extdata/fasta/KHAB17.fasta", package = "canprot")
# Read the sequences, and print the first one
read_fasta(file, type = "sequence")[[1]]
# Count the amino acids in the sequences
aa <- read_fasta(file)
# Calculate protein length (number of amino acids in each protein)
plength(aa)
# Sum the amino acid compositions
sum_aa(aa)

# Count amino acids in a sequence
count_aa("GGSGG")
# A message is issued for unrecognized characters
count_aa("AAAXXX")
# Count nucleobases in a sequence
bases <- count_aa("ACCGGGTTT", molecule = "DNA")

jedick/canprot documentation built on June 13, 2025, 10:13 p.m.