findSeq: Search for a query sequence against a list of assemblies

View source: R/func__findSeq.R

findSeq    R Documentation

Search for a query sequence against a list of assemblies

Description

This function determines the presence of a query sequence in a list of assembly graphs or FASTA files (readable by Bandage). It aims to answer the question of how frequently a query is found in a collection of assemblies. This function does not work on Windows unless the Linux command-line tool "cut" is available.

Usage

findSeq(
  query = NULL,
  assemblies = NULL,
  bandage.path = "./bandage",
  blast.params = "-task megablast",
  bandage.params = "--ifilter 95 --evfilter 1e-3 --pathnodes 6 --minhitcov 0.98 --minpatlen 0.98 --maxpatlen 1.02",
  n.cores = -1,
  del.temp = TRUE
)

Arguments

query

Path to a FASTA file, which may contain multiple query sequences.

assemblies

A data frame or a character matrix whose first two columns provide strain names and paths to assembly files, respectively. These two columns may be named Strain and Assembly, for instance. This argument can also be a path to a CSV file (with a header line for column names) that stores this data frame; users may create such a file in a spreadsheet program and import it into R. For Bandage, a valid assembly file can be either a SPAdes FASTG file or a FASTA file. This function searches for the query in every assembly file.
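As a minimal sketch of the expected input, the table can be built directly in R or saved as a CSV file and passed by path. The strain names and file paths below are placeholders chosen for illustration, not files shipped with the package.

```r
# Hypothetical strain names and assembly paths (placeholders for illustration)
a <- data.frame(
  Strain = c("strain1", "strain2"),
  Assembly = c("assemblies/strain1.fastg", "assemblies/strain2.fasta"),
  stringsAsFactors = FALSE
)

# Optionally save the table as a CSV file with a header line, so that its
# path can be supplied to the assemblies argument instead of the data frame
csv_path <- file.path(tempdir(), "assemblies.csv")
write.csv(a, file = csv_path, row.names = FALSE)

# Reading the file back gives the same two-column layout
a_from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
```

Either a or csv_path could then be given as the assemblies argument.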

bandage.path

Path to Bandage, with no trailing backslash or forward slash.

blast.params

Parameters passed directly to BLAST through the option "--blastp" of Bandage. Run "bandage --helpall" for details. Default: "-task megablast".

bandage.params

Parameters passed directly to Bandage. Run "bandage --helpall" to see all valid parameters. These parameters control how Bandage identifies a query.
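Because bandage.params is a single string, it can be convenient to keep the individual settings in a named vector and join them before the call. The sketch below reproduces the default string from the Usage section; the variable names filters and bandage_params are illustrative only.

```r
# Filter settings copied from the default value of bandage.params
filters <- c(ifilter = "95", evfilter = "1e-3", pathnodes = "6",
             minhitcov = "0.98", minpatlen = "0.98", maxpatlen = "1.02")

# Join each name and value as "--name value", separated by single spaces
bandage_params <- paste(paste0("--", names(filters)), filters, collapse = " ")
```

The resulting string can be supplied as bandage.params, and a single setting can be changed by editing one element of filters.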

n.cores

Number of computational cores that will be used in parallel for this function. It follows the same convention defined in the function findPhysLink. For simplicity, set it to zero to automatically detect and use all available cores; set it to -1 to leave one core out (recommended unless this function is executed through a SLURM job system).
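The convention for n.cores can be illustrated with a small helper. This is only a sketch of the rule described above, not the package's own code: resolve_cores is a hypothetical function, and treating a positive value as a request capped at the number of available cores is an assumption.

```r
library(parallel)

# Hypothetical helper, not part of GeneMates: translate n.cores into a core count
resolve_cores <- function(n.cores, available = parallel::detectCores()) {
  if (is.na(available)) available <- 1L  # detectCores() may return NA
  if (n.cores == 0) {
    n <- available        # 0: use all available cores
  } else if (n.cores == -1) {
    n <- available - 1    # -1: leave one core out
  } else {
    n <- n.cores          # assumption: a positive value requests that many cores
  }
  # Assumption: always use at least one core and never more than are available
  max(1L, min(as.integer(n), as.integer(available)))
}
```

For example, resolve_cores(-1, available = 8) evaluates to 7 under these assumptions.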

del.temp

A logical parameter determining whether to delete all temporary files under the current working directory. Default: TRUE (all of these files are removed).

Value

A single data frame of identified query paths, one (the top hit) for each assembly. NA values are present if no query path is found at all in an assembly.
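Since assemblies without a hit are reported as NA, the frequency of a query across the collection can be estimated by counting non-missing rows. The sketch below uses a mock result; the column names Strain and Path and all values are hypothetical, and the real output may be laid out differently.

```r
# Mock result with hypothetical column names and values; the real columns may differ
paths <- data.frame(
  Strain = c("strain1", "strain2", "strain3"),
  Path = c("12+,45-", NA, "7+"),
  stringsAsFactors = FALSE
)

# An assembly counts as a hit when its query path is not NA
found <- !is.na(paths$Path)
n_found <- sum(found)  # number of assemblies in which the query was found
freq <- mean(found)    # proportion of assemblies in which the query was found
```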

Author(s)

Yu Wan (wanyuac@126.com)

Examples

# 'a' is assumed to be a data frame of strain names and assembly paths (see the assemblies argument)
paths <- findSeq(query = "integrons.fna", assemblies = a, bandage.path = "apps/Bandage",
                 bandage.params = "--ifilter 95 --evfilter 1e-3 --pathnodes 6 --minhitcov 0.98 --minpatlen 0.98 --maxpatlen 1.02",
                 n.cores = 4, del.temp = FALSE)


wanyuac/GeneMates documentation built on Aug. 12, 2022, 7:37 a.m.