cmsearch: Search for Covariance Models (CM) in a set of sequences.

View source: R/inferrnal.R

cmsearchR Documentation

Search for Covariance Models (CM) in a set of sequences.

Description

This function calls "cmsearch" from Infernal. Infernal must be installed. The function is focused on retrieving the hits table and, optionally, producing an alignment.

Usage

cmsearch(
  cm,
  seq,
  glocal = TRUE,
  Z = NULL,
  output = NULL,
  alignment = NULL,
  acc = FALSE,
  noali = TRUE,
  notextw = FALSE,
  textw = NULL,
  verbose = FALSE,
  E = NULL,
  t = NULL,
  incE = NULL,
  incT = NULL,
  cut_ga = FALSE,
  cut_nc = FALSE,
  cut_tc = FALSE,
  filter_strategy = NULL,
  FZ = NULL,
  Fmid = NULL,
  notrunc = FALSE,
  anytrunc = FALSE,
  nonull3 = FALSE,
  mxsize = NULL,
  smxsize = NULL,
  cyk = FALSE,
  acyk = FALSE,
  wcx = NULL,
  toponly = TRUE,
  bottomonly = FALSE,
  tformat = NULL,
  cpu = NULL,
  stall = FALSE,
  mpi = FALSE,
  quiet = TRUE,
  table_function = c("fwf", "table", "lines")
)

Arguments

cm

(character of length 1) Path to the covariance model (.cm) file. The covariance model must include calibration data from running "cmcalibrate".

seq

(filename, character vector, Biostrings::XStringSet, or ShortRead::ShortRead Sequences to search with the CM. If a filename, the file can be of any type supported by Infernal.

glocal

(logical of length 1) Whether to run the search in glocal mode (global with respect to the model, local with respect to the sequence). When TRUE, the search is faster, but will fail to find matches with only partially homologous structure.

Z

(numeric scalar) Effective search space size in Mb for the purposes of calculating E-values.

output

(character filename) File to send the human-readable output to.

alignment

(character filename) A file to save the aligned hits to. If given, the alignment is saved in Stockholm format with annotations for secondary structure, posterior probablility, etc.

acc

(logical scalar) Use accessions instead of names in the main output.

noali

(logical scalar) Omit the alignment section from the main output.

notextw

(logical scalar) Unlimit the length of each line in the main output.

textw

(numeric scalar) Set the main output’s line length limit in characters per line. Default: 120.

verbose

(logical scalar)

E

(numeric scalar) Maximum E-value for reporting in per-target output. Default: 10.0

t

(numeric scalar) Maximum bit score for reporting in per-target reporting. Default: NULL

incE

(numeric scalar) Maximum E-value for hit inclusion. Default: 0.01

incT

(numeric scalar) Maximum bit score for hit inclusion. Default: NULL

cut_ga

(logical scalar) Use the GA (gathering) bit scores defined in the CM to set hit reporting and inclusion thresholds.

cut_nc

(logical scalar) Use the NC (noise cutoff) bit score thresholds defined in the CM to set hit reporting and inclusion thresholds.

cut_tc

(logical scalar) Use the TC (trusted cutoff) bit score thresholds defined in the CM to set hit reporting and inclusion thresholds.

filter_strategy

(character string) Filtering strategy for the acceleration pipeline. Options, from slowest (most sensitive) to fastest (least sensitive) are "max", "nohmm", "mid", "default", "rfam", "hmmonly".

FZ

(numeric scalar) Effective database size in Mb for determining filter thresholds.

Fmid

(numeric scalar) HMM filter thresholds for the "mid" filtering strategy. Default: 0.02

notrunc

(logical scalar) Turn off truncated hit detection.

anytrunc

(logical scalar) Allow truncated hits to begin and end at any position in a target sequence.

nonull3

(logical scalar) Turn off the null3 CM score corrections for biased composition.

mxsize

(numeric scalar) Maximum allowable CM DP matrix size in megabytes. Default: 128

smxsize

(numeric scalar) Maximum allowable CM search DP matrix size in megabytes. Default: 128

cyk

(logical scalar) Use the CYK algorithm, not Inside, to determine the final score of all hits.

acyk

(logical scalar) Use the CYK algorithm to align hits.

wcx

(numeric scalar) Expected maximum length of a hit, as a multiplier of consensus length of the model.

toponly

(logical scalar) Only search the top (Watson) strand of target sequences.

bottomonly

(logical scalar) Only search the bottom (Crick) strand of target sequences.

tformat

(character string) Format of the sequences in "seq". Options are "fasta", "embl", "genbank", "ddbj", "stockholm", "pfam", "a2m", "afa", "clustal", and "phylip". Default: autodetect

cpu

(integer scalar) The number of threads to use in the search.

stall

(logical scalar) For debugging the MPI master/worker version: pause after start, to enable the developer toattach debuggers to the running master and worker(s) processes

mpi

(logical scalar) Run in MPI master/worker mode, using mpirun.

quiet

(logical scalar) Suppress standard output of cmsearch, which can be long.

table_function

(character string) Either "fwf" (default), "table", or "lines"; whether to read the hits table using utils::read.fwf(), utils::read.table(), or readLines(). Neither of the two parsers works in all cases; utils::read.fwf() fails when more than one CM is present in the CM file, while utils::read.table() fails when any of the fields contain spaces.

Value

a base::data.frame with columns:

  • target_name (character) the name of the target sequence

  • taget_accession(character) the target's accession number

  • query_name(character) the name of the query CM

  • query_accession(character) the query CM's accession number

  • mdl(character) the model type ("cm" or "hmm")

  • mdl_from(integer) the start location of the hit in the model

  • mdl_to(integer) the end location of the hit in the model

  • seq_from(integer) the start location of the hit in the sequence

  • seq_to(integer) the end location of the hit in the sequence

  • strand(character) the strand the hit was found on ("+" or "-")

  • trunc)(character) whether the hit is truncated, and where ("no", "5'", "3'", "5'&3'", or "-" for hmm hits).

  • pass(integer) which algorithm pass the hit was found on.

  • gc(numeric) GC content of the hit

  • bias(numeric) biased composition correction. See the Infernal documentation.

  • score(numeric) bit-score of the hit, including the biased composition correction.

  • E_value(numeric) Expectation value for the hit.

inc

(character) "!" if the sequence meets the inclusion threshold, "?" if it only meets the reporting threshold.

Examples

    # search sequences from a fasta file for Rfam RF00002 (5.8S rRNA)
    cm <- cm_5_8S()
    sampfasta <- sample_rRNA_fasta()
    cmsearch(cm = cm, seq = sampfasta, cpu = 1)
    # also works if the fasta file has already been loaded
    samp <- Biostrings::readDNAStringSet(sampfasta)
    cmsearch(cm = cm, seq = samp, cpu = 1)

brendanf/inferrnal documentation built on Nov. 26, 2024, 4:24 p.m.