get_seq_info: Get NCBI sequence record

Description

Retrieves information about sequences from NCBI records for a given organism name or taxon identifier.

Usage

get_seq_info(
  org.name,
  db,
  n.start = 1,
  n.stop = NULL,
  step = 500,
  return.dataframe = FALSE,
  check.result = FALSE,
  term = NULL,
  verbose = TRUE
)

get_seq_info_fix(
  info.list,
  web.history = NULL,
  org.name = NULL,
  db,
  n.start = 1,
  n.stop = NULL,
  step = 500,
  term = NULL,
  verbose = TRUE
)

info_listtodata(info.list, unlist = TRUE, verbose = TRUE)

Arguments

org.name

character; scientific name or taxon identifier (written as "txid0000") of the organism/taxon.

db

character; NCBI database for search. See entrez_dbs() for possible values.

n.start

integer; number of the first record to download. Default is 1.

n.stop

integer; number of the last record to download. Default is NULL, which retrieves all available GIs.

step

integer; download increment, i.e. the number of records per block. Maximum is 500.

return.dataframe

logical; whether to return the information as a structured data frame (the alternative is a list of lists).

check.result

logical; check if download was done correctly.

term

character; search query.

verbose

logical; show messages.

info.list

list of previously downloaded records.

web.history

previously saved web_history object for use in calls to NCBI. A new web.history is created if none is provided.

unlist

logical; unlist result before transforming (only recommended if step > 1).

Details

This function sends a query to the NCBI database and returns the sequence records that match it. By default the query is the organism, so the function returns data for all sequences associated with the requested organism. For example, if org.name = "Homo sapiens", the function downloads data for all records that match the query "Homo sapiens[Organism]". For any other query, use the parameter term.
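As a rough illustration of the default query described above, the sketch below builds the search string by hand. The helper build_query() is hypothetical and not part of disprose. It also assumes that a non-NULL term simply replaces the organism query, which this page does not state explicitly.

```r
# Hypothetical helper (not part of disprose): mimics the documented default,
# where the organism name is searched in the [Organism] field.
build_query <- function(org.name, term = NULL) {
  if (is.null(term)) {
    paste0(org.name, "[Organism]")
  } else {
    term  # assumption: a custom term is used as given
  }
}

build_query("Homo sapiens")
build_query("txid9606")
build_query("Homo sapiens", term = "Homo sapiens[Organism] AND mitochondrion[Filter]")
```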

The function downloads records piecemeal, in blocks of several records each. The block size is set by the parameter step. This is useful if the download is interrupted for any reason: the missing blocks can be reloaded later without downloading the entire data set again. By default, all available records are downloaded, but you may choose the start and finish points with the parameters n.start and n.stop. Numbering starts at 1, not 0.

If return.dataframe = TRUE, the resulting list of blocks (a list of lists if step > 1) is unlisted into one data frame that contains the record GI, UID, caption, source database, organism, strain etc. With return.dataframe = FALSE, the list of blocks is returned as is. Regardless of the return.dataframe setting, the list of blocks is also returned if the download was compromised. You can turn the resulting list into a data frame later with the function info_listtodata(). Note that in this case, if info.list comes from get_seq_info(), the result must be unlisted first (use unlist = TRUE).
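To make the roles of n.start, n.stop and step concrete, here is a minimal sketch of how a record range might be split into blocks. The function block_ranges() is hypothetical and only illustrates the arithmetic. It is not the package's actual implementation.

```r
# Hypothetical helper (not part of disprose): split the record range
# n.start..n.stop into consecutive blocks of at most `step` records.
block_ranges <- function(n.start = 1, n.stop, step = 500) {
  first <- seq(n.start, n.stop, by = step)   # first record of each block
  last  <- pmin(first + step - 1, n.stop)    # last block may be shorter
  data.frame(block = seq_along(first), first = first, last = last)
}

# Ten records in steps of five give two blocks: 1-5 and 6-10.
block_ranges(n.start = 1, n.stop = 10, step = 5)

# 1200 records with the default step give blocks 1-500, 501-1000, 1001-1200.
block_ranges(n.start = 1, n.stop = 1200)
```

If one block fails, only its range needs to be requested again, which is the point of downloading in blocks.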

If the download was corrupted, you may use the get_seq_info_fix() function to reload the missing blocks. Pass the corrupted list of blocks in the info.list parameter. You may also check and reload data while get_seq_info() is running by specifying check.result = TRUE.
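The download-then-repair workflow could be wrapped roughly as follows. The wrapper download_and_repair() is hypothetical. It uses only the three functions and arguments listed in Usage, and it assumes that calling get_seq_info_fix() on a list with no missing blocks is harmless, which this page does not confirm. Running it requires the disprose package and network access to NCBI.

```r
# Hypothetical wrapper (not part of disprose): download the raw list of
# blocks, reload any missing blocks, then convert the list to a data frame.
download_and_repair <- function(org.name, db, n.start = 1, n.stop = NULL,
                                step = 500) {
  # Keep the list of blocks so that it can be repaired before conversion.
  info.list <- disprose::get_seq_info(
    org.name = org.name, db = db,
    n.start = n.start, n.stop = n.stop, step = step,
    return.dataframe = FALSE, check.result = FALSE
  )
  # Reload missing blocks (assumed to leave complete blocks untouched).
  info.list <- disprose::get_seq_info_fix(
    info.list = info.list, org.name = org.name, db = db,
    n.start = n.start, n.stop = n.stop, step = step
  )
  # The list comes from get_seq_info(), so it is unlisted first.
  disprose::info_listtodata(info.list, unlist = TRUE)
}
```

A call such as download_and_repair("txid9606", db = "nucleotide", n.stop = 10, step = 5) would mirror the example at the end of this page.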

While running, the functions turn scientific notation off and then on again.
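One plausible reason for this is that large record numbers would otherwise be printed in a form such as 1e+05. The sketch below shows a common base R way to switch scientific notation off temporarily with options(scipen = ...). The helper with_plain_numbers() is hypothetical, and whether disprose uses this exact mechanism is an assumption.

```r
# Hypothetical helper (not part of disprose): evaluate an expression with
# scientific notation suppressed, then restore the previous setting.
with_plain_numbers <- function(expr) {
  old <- options(scipen = 999)       # strongly prefer fixed notation
  on.exit(options(old), add = TRUE)  # restore the caller's setting on exit
  force(expr)                        # expr is evaluated lazily, i.e. here
}

format(1e5)                      # "1e+05" under the default scipen = 0
with_plain_numbers(format(1e5))  # "100000"
```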

Value

get_seq_info() returns a data frame that contains most of the sequence information from NCBI records. If return.dataframe = FALSE or there are missing data, a list of lists is returned. The list contains the full information from NCBI records.

get_seq_info_fix() returns a list of lists.

info_listtodata() returns a data frame.
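The relation between the two return shapes can be illustrated with toy data. The records below are invented placeholders with only two fields, not real NCBI records, and the code is a generic base R sketch of flattening a list of blocks rather than the implementation of info_listtodata().

```r
# Toy list of blocks: two blocks holding two and one placeholder records.
blocks <- list(
  list(list(uid = "1", caption = "A"), list(uid = "2", caption = "B")),
  list(list(uid = "3", caption = "C"))
)

# Remove one level of nesting: list of blocks -> flat list of records.
records <- unlist(blocks, recursive = FALSE)

# Turn each record into a one-row data frame and stack the rows.
info.df <- do.call(rbind, lapply(records, as.data.frame))
info.df
```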

Functions

Author(s)

Elena N. Filatova

Examples

info.dataframe <- get_seq_info(org.name = "txid9606", db = "nucleotide", n.start = 1,
                               n.stop = 10, step = 5, return.dataframe = TRUE,
                               check.result = TRUE)

disprose documentation built on Jan. 6, 2022, 1:07 a.m.