cdsearchr | R Documentation |
cdsearchr() provides an R interface for NCBI's CD-SEARCH sequence annotation tool.
It takes the path to a FASTA file containing the query protein sequences as input
and returns a data.frame containing the annotation results.
cdsearchr(queries = NA, db = c("cdd", "pfam", "smart", "tigrfam", "cog", "kog"),
smode = c("auto", "prec", "live"), useid1 = TRUE, compbasedadj = 1,
biascompfilter = TRUE, evalue = 0.01, tdata = c("hits", "aligns", "feats"),
alnfmt = NA, dmode = c("rep", "std", "full"), qdefl = TRUE, cddefl = FALSE,
maxhit = 500, check_max = 10, check_wait = 20)
queries |
(character string, mandatory) path to a FASTA file containing the query protein sequences. |
db |
(character string, optional) controls which databases CD-SEARCH should
search the queries against. Please refer to "database selection" under the URL
https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#BatchRPSBSearchMode
for particulars on the databases. This parameter only has an effect if |
smode |
(character string, optional) controls which search mode CD-SEARCH should use. "auto" will check the queries first against a set of precalculated results (by checking query identifiers; really only works if these are sequences in NCBI already), and if that fails, it performs a "live" search against the CD-SEARCH database. "prec" would return results only for queries that have a result in the precalculated database. "live" will search every query anew against its databases even if precalculated results exist for that query. (Set to "auto" by default.) |
useid1 |
(binary, optional) controls whether queries should also be searched against archived sequence identifiers if the query's identifier (if it happens to be an NCBI identifier) does not match anything in the current Entrez Protein database records. (Set to TRUE by default.) |
compbasedadj |
(integer, optional) should CD-SEARCH use compositionally corrected scoring? (0 - correction turned off; 1 - correction turned on.) (Set to 1 by default.) |
biascompfilter |
(binary, optional) should compositionally biased regions of the queries be filtered out? (Set to TRUE by default.) |
evalue |
(numeric, optional) expect value (statistical significance threshold) used for filtering and reporting annotation matches. (Set to 0.01 by default.) |
tdata |
(character string, optional) what type of target data should be
returned: "hits" (domain hits), "aligns" (domain alignments), or "feats" (domain
features). Changing from the default might break functionality as of the current
version of |
alnfmt |
(character string, optional) data format to be used for downloading
alignment data in the event |
dmode |
(character string, optional) which data mode must be used for the results. This dictates what set of domains are returned: the highest scoring hit for each region of the sequence ("rep"), the best hits from each database available in CD-SEARCH (so multiple hits per query region are possible; "std"), or all hits ("full"). (Set to "rep" by default.) |
qdefl |
(binary, optional) should query titles be included in the results? (Set to TRUE by default.) |
cddefl |
(binary, optional) should domain titles be included in the results? (Set to FALSE by default.) |
maxhit |
(integer, optional) maximum number of results per query that should be retrieved.
Only matters if |
check_max |
(numeric, optional) how many times should cdsearchr() query for results before giving up? (Set to 10 attempts by default.) |
check_wait |
(numeric, optional) how long – in seconds – must cdsearchr() wait between successive requests to the CD-SEARCH API while querying for the results. (Set to 20 by default.) |
cdsearchr() is an R-based interface to the NCBI CD-SEARCH application. It uses httr internally
to submit and retrieve data. Once the queries have been submitted, cdsearchr() will repeatedly
query the CD-SEARCH server until it receives a response (success/failure) or the number of attempts
exceeds check_max. Although check_wait has been set to 20 (seconds), it is recommended that the
user adjust this based on the size of the data set (a smaller value for smaller data sets
and vice versa).
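The interplay between check_max and check_wait can be illustrated with a minimal sketch of a bounded polling loop. This is not the code cdsearchr() uses internally: poll_until_done(), make_fake_check() and the status strings are hypothetical names introduced only for illustration, and the fake checker stands in for a real request to the CD-SEARCH server.

```r
# Hypothetical illustration only; not cdsearchr()'s internal implementation.
# `check_fun` is an assumed zero-argument function returning a status string.
poll_until_done <- function(check_fun, check_max = 10, check_wait = 20) {
  for (attempt in seq_len(check_max)) {
    status <- check_fun()
    # Stop as soon as the server reports a final state.
    if (status %in% c("success", "failure")) {
      return(list(status = status, attempts = attempt))
    }
    # Wait before the next request, but not after the last attempt.
    if (attempt < check_max) Sys.sleep(check_wait)
  }
  list(status = "timeout", attempts = check_max)
}

# A fake status checker that reports "success" on its n-th call.
make_fake_check <- function(ready_after) {
  calls <- 0
  function() {
    calls <<- calls + 1
    if (calls >= ready_after) "success" else "pending"
  }
}

res <- poll_until_done(make_fake_check(3), check_max = 10, check_wait = 0)
```

In this sketch the total time spent sleeping is at most (check_max - 1) * check_wait seconds, which is why a larger data set may call for a larger check_wait, a larger check_max, or both.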
cdsearchr() respects the CD-SEARCH API's submission limit of 4000 queries. Therefore, users should pre-chunk FASTA files into files of 4000 sequences or fewer before passing them to cdsearchr().
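One possible way to pre-chunk a FASTA file using only base R is sketched below. split_fasta() is a hypothetical helper, not part of the package, and the output file naming scheme is an assumption; it treats every line starting with ">" as the start of a new record.

```r
# Hypothetical helper (not part of the package): split a FASTA file into
# files holding at most `chunk_size` sequences each; returns the new paths.
split_fasta <- function(path, chunk_size = 4000, out_dir = tempdir()) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]                 # drop empty lines
  # Running count of header lines assigns each line to its record.
  record_id <- cumsum(startsWith(lines, ">"))
  keep <- record_id > 0                         # ignore text before first header
  lines <- lines[keep]
  record_id <- record_id[keep]
  if (length(lines) == 0) return(character(0))
  # Records 1..chunk_size go to chunk 1, the next chunk_size to chunk 2, etc.
  chunk_id <- (record_id - 1) %/% chunk_size + 1
  chunks <- split(lines, chunk_id)
  prefix <- tools::file_path_sans_ext(basename(path))
  out_paths <- file.path(
    out_dir, sprintf("%s_chunk%03d.fasta", prefix, seq_along(chunks))
  )
  for (i in seq_along(chunks)) writeLines(chunks[[i]], out_paths[i])
  out_paths
}

# Possible usage (not run; "my_proteins.fasta" is a placeholder path, and
# combining with rbind assumes every call returns the same columns):
# chunk_paths <- split_fasta("my_proteins.fasta")
# results <- do.call(rbind, lapply(chunk_paths, function(p) cdsearchr(queries = p)))
```

Each chunk can then be passed to cdsearchr() as its own queries file.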
## Not run:
inpath <- system.file("extdata", "cdsearchr_testdata.fasta", package = "seqvisr", mustWork = TRUE)
cdsearchr(queries = inpath, check_wait = 2)
## End(Not run)