cdsearchr: Access NCBI's (batch) CD-SEARCH from R.
In vragh/seqvisr: Biological Sequence Visualization and Auxiliary Functions in R

cdsearchr

R Documentation

Access NCBI's (batch) CD-SEARCH from R.

Description

cdsearchr() provides an R interface for NCBI's CD-SEARCH sequence annotation tool. It takes the path to a FASTA file containing the query protein sequences as input and returns a data.frame containing the annotation results.

Usage

cdsearchr(queries = NA, db = c("cdd", "pfam", "smart", "tigrfam", "cog", "kog"),
  smode = c("auto", "prec", "live"), useid1 = TRUE, compbasedadj = 1,
  biascompfilter = TRUE, evalue = 0.01, tdata = c("hits", "aligns", "feats"),
  alnfmt = NA, dmode = c("rep", "std", "full"), qdefl = TRUE, cddefl = FALSE,
  maxhit = 500, check_max = 10, check_wait = 20)

Arguments

`queries`	(character string, mandatory) path to a FASTA file containing the query protein sequences.
`db`	(character string, optional) controls which databases CD-SEARCH should search the queries against. Please refer to "database selection" under the URL https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#BatchRPSBSearchMode for particulars on the databases. This parameter only has an effect if `smode` is set to "live" (see below). (Set to "cdd" by default.)
`smode`	(character string, optional) controls which search mode CD-SEARCH should use. "auto" first checks the queries against a set of precalculated results (by checking query identifiers; this only works for sequences that are already in NCBI), and if that fails, performs a "live" search against the CD-SEARCH database. "prec" returns results only for queries that have a result in the precalculated database. "live" searches every query anew against its databases even if precalculated results exist for that query. (Set to "auto" by default.)
`useid1`	(binary, optional) controls whether queries should also be searched against archived sequence identifiers if the query's identifier (if it happens to be an NCBI identifier) does not match anything in the current Entrez Protein database records. (Set to TRUE by default.)
`compbasedadj`	(integer, optional) should CD-SEARCH use compositionally corrected scoring? (0 - correction turned off; 1 - correction turned on.) (Set to 1 by default.)
`biascompfilter`	(binary, optional) should compositionally biased regions of the queries be filtered out? (Set to TRUE by default.)
`evalue`	(numeric, optional) expect value (statistical significance threshold) used for filtering and reporting annotation matches. (Set to 0.01 by default.)
`tdata`	(character string, optional) what type of target data should be returned: "hits" (domain hits), "aligns" (domain alignments), or "feats" (domain features). Changing from the default might break functionality as of the current version of `seqvisr`. (Set to "hits" by default.)
`alnfmt`	(character string, optional) data format to be used for downloading alignment data in the event `tdata` is set to "aligns". This will never be the case for cdsearchr, and this option exists only for the sake of completeness. (Set to `NA` by default.)
`dmode`	(character string, optional) which data mode must be used for the results. This dictates what set of domains are returned: the highest scoring hit for each region of the sequence ("rep"), the best hits from each database available in CD-SEARCH (so multiple hits per query region are possible; "std"), or all hits ("full"). (Set to "rep" by default.)
`qdefl`	(binary, optional) should query titles be included in the results? (Set to TRUE by default.)
`cddefl`	(binary, optional) should domain titles be included in the results? (Set to FALSE by default.)
`maxhit`	(integer, optional) maximum number of results per query that should be retrieved. Only matters if `smode` is set to "live". (Set to 500 by default.)
`check_max`	(numeric, optional) how many times should cdsearchr() query for results before giving up? (Set to 10 attempts by default.)
`check_wait`	(numeric, optional) how long – in seconds – must cdsearchr() wait between successive requests to the CD-SEARCH API while querying for the results. (Set to 20 by default.)

Details

cdsearchr() is an R-based interface to the NCBI CD-SEARCH application. It uses httr internally to submit queries and retrieve results. Once the queries have been submitted, cdsearchr() polls the CD-SEARCH server until it receives a final response (success or failure) or the number of attempts exceeds check_max. check_wait defaults to 20 seconds; users should adjust it to the size of the data set (a smaller value for smaller data sets, a larger value for larger ones).
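
The polling behaviour described above can be illustrated with a minimal sketch. This is not the code used inside cdsearchr(); poll_until_done(), make_fake_check() and the status strings "running", "success", "failure" and "timeout" are hypothetical names chosen for illustration, and the fake status function stands in for the real request to the CD-SEARCH server so that the sketch runs without network access.

```r
# Generic polling helper. `check_fn` is a hypothetical stand-in for the
# status request that would be sent to the CD-SEARCH server.
poll_until_done <- function(check_fn, check_max = 10, check_wait = 20) {
  for (attempt in seq_len(check_max)) {
    status <- check_fn()
    # Stop as soon as a final state (success or failure) is reported.
    if (status %in% c("success", "failure")) {
      return(list(status = status, attempts = attempt))
    }
    # Still running: wait before asking again (no wait after the last attempt).
    if (attempt < check_max) Sys.sleep(check_wait)
  }
  # check_max attempts were used up without a final answer.
  list(status = "timeout", attempts = check_max)
}

# Fake status source: reports "running" n_running times, then "success".
make_fake_check <- function(n_running = 2) {
  calls <- 0
  function() {
    calls <<- calls + 1
    if (calls <= n_running) "running" else "success"
  }
}

# check_wait = 0 keeps the demonstration instantaneous.
res_ok      <- poll_until_done(make_fake_check(2), check_max = 10, check_wait = 0)
res_timeout <- poll_until_done(make_fake_check(5), check_max = 3,  check_wait = 0)
```

Under a scheme like this, the longest time spent waiting is on the order of check_max * check_wait seconds (roughly 200 seconds with the defaults), which is one way to judge whether the two values suit a given data set.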

Note

cdsearchr() respects the CD-SEARCH API's submission limit of 4000 queries. Users should therefore split larger FASTA files into files of 4000 sequences or fewer before passing them to cdsearchr().
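
One possible way to pre-chunk a FASTA file is sketched below in base R. split_fasta() is a hypothetical helper, not part of seqvisr; it assumes a plain-text FASTA file in which every record starts with a ">" header line, and it reads the whole file into memory, so it is only meant for files of moderate size.

```r
# Hypothetical helper: split a FASTA file into files of at most
# `chunk_size` sequences each and return the paths of the new files.
split_fasta <- function(path, out_dir = tempdir(), chunk_size = 4000) {
  lines <- readLines(path)
  # Number the records: every header line (">") starts a new record.
  rec_id <- cumsum(startsWith(lines, ">"))
  # Ignore anything that appears before the first header.
  keep <- rec_id > 0
  lines <- lines[keep]
  rec_id <- rec_id[keep]
  # Records 1..chunk_size go to chunk 1, the next chunk_size to chunk 2, etc.
  chunk_id <- (rec_id - 1) %/% chunk_size + 1
  pieces <- split(lines, chunk_id)
  stem <- tools::file_path_sans_ext(basename(path))
  out_paths <- file.path(out_dir,
                         sprintf("%s_chunk%03d.fasta", stem, seq_along(pieces)))
  for (i in seq_along(pieces)) writeLines(pieces[[i]], out_paths[i])
  out_paths
}

# Small demonstration: five sequences split into chunks of two.
demo_fasta <- tempfile(fileext = ".fasta")
writeLines(c(">s1", "MKV", ">s2", "MAA", "GGH", ">s3", "MTT",
             ">s4", "MLL", ">s5", "MPP"), demo_fasta)
chunks <- split_fasta(demo_fasta, out_dir = tempdir(), chunk_size = 2)
headers_per_chunk <- unname(vapply(
  chunks, function(f) sum(startsWith(readLines(f), ">")), numeric(1)))
```

Each resulting path could then be passed to cdsearchr() via the queries argument, one file at a time, and the returned data.frames combined afterwards, assuming they share the same columns.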

Examples

## Not run: 
inpath <- system.file("extdata", "cdsearchr_testdata.fasta", package = "seqvisr", mustWork = TRUE)
cdsearchr(queries = inpath, check_wait = 2)

## End(Not run)

vragh/seqvisr documentation built on April 20, 2024, 10:06 a.m.