filter_fasta: Filter fasta files by ingroup/outgroup status and taxonomy.


Description

Given a folder containing DNA sequences in multi-fasta format (i.e., each fasta file contains more than one sequence) and a dataframe including taxonomic data and ingroup/outgroup status, filter_fasta() outputs a list of those fasta files that pass one of two filters, or both filters combined. One filter excludes fasta files that do not contain more than the minimum number of ingroup sequences. The other filter excludes fasta files that do not contain at least one sequence per ingroup taxon at the specified taxonomic rank.

Usage

filter_fasta(
  seq_folder,
  taxonomy_data,
  filter_col = NULL,
  min_taxa = NULL,
  exclude_short = FALSE,
  sample_col = "sample",
  group_col = "group",
  ...
)

Arguments

seq_folder

Character vector of length one; the path to the folder containing the fasta files (ending in .fa or .fasta) to filter.

taxonomy_data

Dataframe matching sequences to ingroup/outgroup status and (optionally) higher-level taxonomic ranks for filtering. The columns must follow this format:

sample

Unique identifier for the source of the sequence, such as a transcriptome ID or species name. All sequence names must include such an identifier.

group

Either "in" or "out" (case-insensitive), depending on whether that sample is in the ingroup or outgroup.

(user-selected taxonomic rank)

The user can provide any taxonomic rank they wish to filter by. For example, alignments can be filtered by having at least one representative of each ingroup genus (family, order, etc.) in the dataset.
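As an illustration of the column format described above, a taxonomy_data dataframe might look like the following sketch. The sample IDs and genus names are invented for this example, and "genus" is only one possible choice of rank column.

```r
# Hypothetical example data; sample IDs and taxon names are made up.
taxonomy_data <- data.frame(
  sample = c("ABCD", "EFGH", "IJKL", "MNOP"),
  group  = c("in", "in", "IN", "out"),
  genus  = c("Genus_a", "Genus_a", "Genus_b", "Genus_c"),
  stringsAsFactors = FALSE
)

# "group" is described as case-insensitive, so normalize before comparing.
ingroup <- taxonomy_data[tolower(taxonomy_data$group) == "in", ]

ingroup$sample         # ingroup sample identifiers
unique(ingroup$genus)  # ingroup taxa at the chosen rank
```

With this layout, passing filter_col = "genus" would point the rank filter at the third column.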

filter_col

Optional character; the name of the column to be used for filtering by taxonomic rank in taxonomy_data.

min_taxa

Minimum number of ingroup samples required to pass the filter.

exclude_short

Logical; should extremely short sequences be excluded from the alignment during filtering? If TRUE, the minimum length is set to be within 1 standard deviation of the mean sequence length for a given alignment.

sample_col

Optional character; user-provided column name for sample in taxonomy_data.

group_col

Optional character; user-provided column name for group in taxonomy_data.

...

Other arguments. Not used by this function, but meant to be used by drake_plan for tracking during workflows.
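The wording of exclude_short leaves the exact cutoff open. One possible reading, shown here only as an assumption and not as the package's confirmed behavior, is that sequences shorter than the mean length minus one standard deviation are dropped. The sequence names and lengths below are hypothetical.

```r
# Hypothetical sequence lengths (in bp) for a single alignment.
seq_lengths <- c(seq_a = 900, seq_b = 950, seq_c = 1000,
                 seq_d = 1050, seq_e = 200)

# Assumed rule (not confirmed by the documentation):
# minimum length = mean length - 1 standard deviation.
min_length <- mean(seq_lengths) - sd(seq_lengths)

# Keep only sequences at or above the assumed cutoff.
kept <- seq_lengths[seq_lengths >= min_length]
names(kept)
```

Under this reading, the single very short sequence falls below the cutoff while the others are retained.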

Details

For example, if the dataset includes multiple ingroup genera each with multiple samples per genus, we may wish to filter alignments such that we only keep those with at least one sequence per ingroup genus. To do this, include a column called "genus" in taxonomy_data, and set filter_col = "genus".
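The logic of the two filters can be sketched in base R for a single alignment. This is a minimal illustration, not the package's implementation: passes_filters() is a hypothetical helper, the data are invented, samples are matched by searching for the identifier inside each sequence name, and min_taxa is treated as an inclusive threshold, which is an assumption.

```r
# Hypothetical helper, not part of baitfindR.
passes_filters <- function(seq_names, taxonomy_data,
                           filter_col = NULL, min_taxa = NULL,
                           sample_col = "sample", group_col = "group") {
  # Keep only ingroup rows; group labels are case-insensitive.
  is_in <- tolower(taxonomy_data[[group_col]]) == "in"
  ingroup <- taxonomy_data[is_in, , drop = FALSE]

  # A sample counts as present if its identifier occurs in any sequence name.
  present <- vapply(
    ingroup[[sample_col]],
    function(id) any(grepl(id, seq_names, fixed = TRUE)),
    logical(1)
  )

  ok <- TRUE
  if (!is.null(min_taxa)) {
    # Assumption: the threshold is inclusive (>=).
    ok <- ok && sum(present) >= min_taxa
  }
  if (!is.null(filter_col)) {
    # Every ingroup taxon at the chosen rank needs at least one sequence.
    all_taxa <- unique(ingroup[[filter_col]])
    found_taxa <- unique(ingroup[[filter_col]][present])
    ok <- ok && all(all_taxa %in% found_taxa)
  }
  ok
}

# Hypothetical taxonomy data with two ingroup genera.
taxonomy_data <- data.frame(
  sample = c("ABCD", "EFGH", "IJKL", "MNOP"),
  group  = c("in", "in", "in", "out"),
  genus  = c("Genus_a", "Genus_a", "Genus_b", "Genus_c"),
  stringsAsFactors = FALSE
)

# Both ingroup genera represented.
both_genera <- passes_filters(c("ABCD_scaffold_1", "IJKL_scaffold_7"),
                              taxonomy_data, filter_col = "genus", min_taxa = 2)

# Two ingroup samples, but Genus_b is missing.
one_genus <- passes_filters(c("ABCD_scaffold_1", "EFGH_scaffold_3"),
                            taxonomy_data, filter_col = "genus", min_taxa = 2)
```

The second alignment meets the sample count but lacks a representative of one ingroup genus, so it would fail the rank filter in this sketch.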

Value

A named list of DNA sequences of class DNAbin that passed the filter. These are not modified in any way; they simply met the requirements of the filter.

Author(s)

Joel H Nitta, joelnitta@gmail.com

Examples

## Not run: 
filter_fasta(
  seq_folder = "some/folder/",
  taxonomy_data = onekp_data,
  filter_col = "genus",
  min_taxa = 2)
## End(Not run)

joelnitta/baitfindR documentation built on May 7, 2020, 6:21 p.m.