read_fasta: Functions for reading FASTA files

View source: R/read_fasta.R

read_fasta    R Documentation

Description

Read protein amino acid composition or sequences from a file and count numbers of amino acids in given sequences.

Usage

  read_fasta(file, iseq = NULL, type = "count", lines = NULL, 
    ihead = NULL, start = NULL, stop = NULL, molecule = "protein", id = NULL)
  count_aa(sequence, start = NULL, stop = NULL, molecule = "protein")
  sum_aa(AAcomp, abundance = 1, average = FALSE)

Arguments

file

character, path to FASTA file

iseq

numeric, which sequences to read from the file

type

character, type of return value (‘count’, ‘sequence’, ‘lines’, or ‘headers’)

lines

list of character, supply the lines here instead of reading them from file

ihead

numeric, which lines are headers

start

numeric, position in sequence to start counting

stop

numeric, position in sequence to stop counting

molecule

character, type of molecule (‘protein’, ‘DNA’, or ‘RNA’)

id

character, value to be used for protein in output table

sequence

character, one or more sequences

AAcomp

data frame, amino acid composition(s) of proteins

abundance

numeric, abundances of proteins

average

logical, return the weighted average of amino acid counts?

Details

read_fasta is used to retrieve entries from a FASTA file. Use iseq to select the sequences to read (the default is all sequences).

The function returns various data formats depending on the value of type:

count

data frame of amino acid counts

sequence

list of sequences

lines

list of lines from the FASTA file (including headers)

headers

list of header lines from the FASTA file

When type is ‘count’, the header lines of the file are parsed to obtain protein names that are put into the protein column in the result. Furthermore, if a UniProt FASTA header is detected (using the regular expression "\|......\|.*_"), the information there (accession, name, organism) is split into the protein, abbrv, and organism columns of the resulting data frame. This behavior (which may take a while for large files) can be suppressed by supplying protein names in id.
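The header detection described above can be illustrated in base R. This is a minimal sketch, not the package's own parsing code: the two header strings are made-up inputs, and the way the fields are split on "|" and "_" is an assumption based on the usual UniProt header layout.

```r
# Illustrative headers: one UniProt-style, one plain (both are example inputs)
headers <- c(
  ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha",
  ">my_protein some description"
)

# The regular expression quoted in the text, with backslashes escaped for R
is_uniprot <- grepl("\\|......\\|.*_", headers)

# Split the first UniProt-style header into its pipe-delimited fields
fields <- strsplit(headers[is_uniprot][1], "|", fixed = TRUE)[[1]]
accession <- fields[2]

# The third field starts with NAME_ORGANISM, followed by a free-text description
entry <- strsplit(fields[3], " ", fixed = TRUE)[[1]][1]
parts <- strsplit(entry, "_", fixed = TRUE)[[1]]
name <- parts[1]
organism <- parts[2]

print(is_uniprot)
print(c(accession = accession, name = name, organism = organism))
```

Only the first header matches, because the second has no pipe-delimited accession field.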

To speed up processing, if the line numbers of the header lines were previously determined, they can be supplied in ihead. Optionally, the lines of a previously read file may be supplied in lines (in this case no file is needed so file should be set to "").
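As a rough illustration of what header line numbers look like, the following base-R sketch locates them in a small set of FASTA lines. It assumes the standard FASTA convention that header lines start with ">"; the example lines are made up, and the final call shown in the comment follows the argument descriptions above rather than being executed here.

```r
# Lines of a small FASTA file held in memory (illustrative content)
lines <- c(
  ">p1 first protein",
  "GGSGG",
  "AAA",
  ">p2 second protein",
  "ACDE"
)

# Header lines start with ">" in FASTA format
ihead <- grep("^>", lines)
print(ihead)

# With canprot loaded, these could then be reused as described in the text:
# read_fasta(file = "", lines = lines, ihead = ihead)
```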

count_aa is the underlying function that counts the numbers of each amino acid or nucleic-acid base in one or more sequences. The matching of letters is case-insensitive. A message is generated if any character in sequence, excluding spaces, is not one of the single-letter amino acid or nucleobase abbreviations. start and/or stop can be provided to process a fragment of the sequence. If only one of start or stop is present, the other defaults to 1 (start) or the length of the respective sequence (stop).
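The counting behavior described above can be approximated in base R. The helper below, count_letters, is a hypothetical name and only a sketch under stated assumptions; it is not the canprot implementation, handles a single protein sequence, and removes spaces after taking the start/stop fragment, which may differ from what count_aa does.

```r
# Hypothetical helper for illustration; not the canprot implementation
count_letters <- function(sequence, start = 1, stop = nchar(sequence)) {
  aa_letters <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
                  "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
  # Take the requested fragment and make matching case-insensitive
  fragment <- toupper(substr(sequence, start, stop))
  # Drop spaces, then split into single characters
  chars <- strsplit(gsub(" ", "", fragment, fixed = TRUE), "")[[1]]
  # Report characters that are not single-letter amino acid abbreviations
  unknown <- setdiff(chars, aa_letters)
  if (length(unknown) > 0) {
    message("unrecognized characters: ", paste(unknown, collapse = " "))
  }
  # Count each letter; unrecognized characters are not counted
  counts <- table(factor(chars, levels = aa_letters))
  setNames(as.integer(counts), aa_letters)
}

full <- count_letters("ggsgg")              # lowercase input is accepted
tail3 <- count_letters("GGSGG", start = 3)  # stop defaults to the sequence length
print(full[c("G", "S")])
print(tail3[c("G", "S")])
```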

sum_aa sums the amino acid compositions in the input AAcomp data frame. It only applies to columns with the three-letter abbreviations of amino acids and to a column named chains (if present). The values in these columns are multiplied by the indicated abundance after recycling to the number of proteins. The values in these columns are then summed; if average is TRUE then the sum is divided by the number of proteins. Proteins with missing values (NA) of amino acid composition or abundance are omitted from the calculation. The output has one row and the same number of columns as the input; the value in the non-amino acid columns is taken from the first row of the input.
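The arithmetic described above can be worked through on a small table in base R. This is a sketch, not the sum_aa source: the data frame has only two amino acid columns for brevity, and dividing by the number of proteins remaining after NA removal is an assumption about how the average is taken.

```r
# Illustrative compositions for three proteins; the third has a missing value
aa <- data.frame(
  protein = c("p1", "p2", "p3"),
  Ala = c(2, 4, NA),
  Gly = c(1, 3, 5)
)
aa_cols <- c("Ala", "Gly")

# Recycle abundance to the number of proteins
abundance <- rep(c(1, 2), length.out = nrow(aa))

# Omit proteins with missing composition or abundance
keep <- complete.cases(aa[, aa_cols]) & !is.na(abundance)

# Multiply each protein's counts by its abundance, then sum over proteins
weighted <- aa[keep, aa_cols] * abundance[keep]
total <- colSums(weighted)

# Average: divide the sum by the number of proteins that were kept (assumption)
average <- total / sum(keep)

print(total)
print(average)
```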

Value

count_aa returns a data frame with these columns (for proteins): Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr. For ‘DNA’, the columns are changed to A, C, G, T, and for ‘RNA’, the columns are changed to A, C, G, U.

read_fasta returns a list of sequences (for type equal to ‘sequence’) or a list of lines (for type equal to ‘lines’ or ‘headers’). Otherwise (for type equal to ‘count’), it returns a data frame with these columns: protein, organism, ref, abbrv, chains, and the columns described above for count_aa.

sum_aa returns a one-row data frame.

See Also

Pass the output of read_fasta to add.protein in the CHNOSZ package to set up thermodynamic calculations for proteins.

Examples

## Reading a protein FASTA file
# The path to the file
file <- system.file("extdata/fasta/KHAB17.fasta", package = "canprot")
# Read the sequences, and print the first one
read_fasta(file, type = "sequence")[[1]]
# Count the amino acids in the sequences
aa <- read_fasta(file)
# Calculate protein length (number of amino acids in each protein)
plength(aa)
# Sum the amino acid compositions
sum_aa(aa)

# Count amino acids in a sequence
count_aa("GGSGG")
# A message is issued for unrecognized characters
count_aa("AAAXXX")
# Count nucleobases in a sequence
bases <- count_aa("ACCGGGTTT", molecule = "DNA")

jedick/canprot documentation built on April 2, 2024, 10:29 p.m.