update_annotation: update library annotation


View source: R/fun-update_annotation.R

Description

Check all geneIDs in a library annotation file against GenBank and retrieve up-to-date information.

Usage

update_annotation(infile, outfile, verbose = FALSE, ...)

check_geneids(geneIDs, verbose, ...)

check_geneid_status(geneID, verbose, ...)

get_gene_fields(geneIDs)

get_gene_fields_batch(geneIDs, verbose, ...)

get_gene_type(geneID)

extract_field(x)

Arguments

infile

file containing the original annotation; must be compatible with fread; defaults to internally stored Dharmacon annotation from 16th May 2015 (plate numbers have been unified, originally each subset was numbered independently)

outfile

(optional) path to a file to save the updated annotation

verbose

logical; whether or not to report progress

...

ellipsis to facilitate control of internal functions' verbosity

Details

Since information in databases can change, it is prudent to refresh the library annotation from time to time. This function takes every geneID in an annotation file and checks its current status in GenBank: whether it has been withdrawn or replaced (and if so, by what new geneID). It then queries GenBank again to retrieve the gene type (protein-coding, pseudogene, etc.). Once the geneIDs are updated, yet another query is sent to GenBank to retrieve the current gene symbol, gene description, map location, chromosome number, and aliases. This is done in batches of up to 499 items at a time. (This seems like an odd limit, but reutils forces saving results to a file at 500 or more records per query.)
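The batching described above can be sketched in base R. This is only an illustration of splitting a vector of geneIDs into chunks of at most 499; the helper name split_into_batches and the example IDs are made up and are not part of the package.

```r
# Split a vector of geneIDs into batches of at most `batch_size` items.
# `split_into_batches` is a hypothetical helper, not a package function.
split_into_batches <- function(geneIDs, batch_size = 499L) {
  if (length(geneIDs) == 0L) return(list())
  # Batch index for every element: 1, 1, ..., 2, 2, ...
  batch_index <- ceiling(seq_along(geneIDs) / batch_size)
  unname(split(geneIDs, batch_index))
}

ids <- seq_len(1200)                 # made-up geneIDs
batches <- split_into_batches(ids)
lengths(batches)                     # 499 499 202
```

Each element of batches could then be passed to one query, so that no single query reaches 500 records.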

Original geneIDs and gene symbols are kept in separate columns, original_geneid and original_gene_symbol.

Value

The function either invisibly returns a data frame or saves to a specified path and returns nothing.

Functions

File format

As of version 2.4 the function has been generalized: it now accepts not only the original Dharmacon file but also other text files. The input file may be tab- or comma-delimited. It must contain the following information: plate number, well/position (e.g. A01), and geneID. All other columns are immaterial but will not be dropped.
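A minimal sketch of checking an input file against these requirements is shown here. The column names plate, well and geneid are assumptions made for illustration; the names the package actually expects may differ. The file contents are made up.

```r
library(data.table)

# Hypothetical column names; the package may expect different ones.
required_columns <- c("plate", "well", "geneid")

# Write a small made-up annotation file to a temporary location.
example_file <- tempfile(fileext = ".csv")
writeLines(c(
  "plate,well,geneid,comment",
  "1,A01,7157,kept as is",
  "1,A02,672,kept as is"
), example_file)

# fread detects the delimiter (tab or comma) automatically.
annotation <- fread(example_file)

# Stop early if any required column is absent; extra columns are left alone.
missing_columns <- setdiff(required_columns, names(annotation))
if (length(missing_columns) > 0L) {
  stop("missing required columns: ", paste(missing_columns, collapse = ", "))
}
```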

Annotation files can be re-updated. In such a case the columns original_geneid and original_gene_symbol will remain as they are and the update will be run with original geneIDs rather than the updated ones.
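The re-update rule can be illustrated with a short base R sketch. The column name geneid and the helper prepare_geneids are hypothetical; original_geneid is the column named above. The sketch covers geneIDs only, and gene symbols would presumably be treated the same way.

```r
# Choose which geneIDs to query and make sure the originals are preserved.
# `geneid` is a hypothetical column name; `prepare_geneids` is not a
# package function.
prepare_geneids <- function(annotation) {
  if ("original_geneid" %in% names(annotation)) {
    # Previously updated file: query with the original IDs, leave them untouched.
    query_ids <- annotation$original_geneid
  } else {
    # First update: remember the current IDs before they get replaced.
    annotation$original_geneid <- annotation$geneid
    query_ids <- annotation$geneid
  }
  list(annotation = annotation, query_ids = query_ids)
}

# Made-up data: a fresh file and one that has been updated before.
fresh <- data.frame(plate = 1, well = c("A01", "A02"), geneid = c(7157, 672))
updated_once <- data.frame(plate = 1, well = c("A01", "A02"),
                           geneid = c(7157, 999),
                           original_geneid = c(7157, 672))

first_run  <- prepare_geneids(fresh)
second_run <- prepare_geneids(updated_once)
```

In both runs the IDs sent to the query are the original ones, so repeated updates do not drift away from the source annotation.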

Dependencies

GenBank queries are handled with the package reutils. Random errors that occur during queries are handled with retry. Data is loaded (and saved) with the package data.table. Data processing is done in base R. Several internal functions are called here; see Functions.
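To illustrate the idea of retrying a query after a random error, here is a sketch built on base R's tryCatch. It does not use or reproduce the interface of the retry package; with_retries and flaky_query are hypothetical names, and the failing query is simulated.

```r
# `with_retries` is a hypothetical helper built on base R's tryCatch;
# it is not the interface of the retry package.
with_retries <- function(fun, max_tries = 3L, wait = 0) {
  for (attempt in seq_len(max_tries)) {
    # Capture an error as a value instead of aborting.
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < max_tries) Sys.sleep(wait)
  }
  stop("all ", max_tries, " attempts failed: ", conditionMessage(result))
}

# A made-up query that fails twice before succeeding.
calls <- 0L
flaky_query <- function() {
  calls <<- calls + 1L
  if (calls < 3L) stop("temporary server error")
  "ok"
}

outcome <- with_retries(flaky_query)
```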

Processing time

check_geneid_status queries GenBank one geneID at a time, which may swamp the server; hence a 0.5-second pause is introduced before every query.
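The effect of the pause on run time can be seen in a small sketch. The functions query_with_pause and mock_query are hypothetical stand-ins; no real query is sent.

```r
# `query_fun` stands in for the real status query;
# `query_with_pause` is a hypothetical helper, not a package function.
query_with_pause <- function(geneIDs, query_fun, pause = 0.5) {
  results <- vector("list", length(geneIDs))
  for (i in seq_along(geneIDs)) {
    Sys.sleep(pause)                      # wait before every query
    results[[i]] <- query_fun(geneIDs[i])
  }
  results
}

mock_query <- function(geneID) paste("status of", geneID)

# Time three mock queries; the pauses alone take about 1.5 seconds.
elapsed <- system.time(
  statuses <- query_with_pause(c(7157, 672, 1017), mock_query)
)[["elapsed"]]
```

The pauses alone add 0.5 seconds per geneID, so a file with, for example, 18,000 geneIDs would spend about 2.5 hours waiting, before any query time is counted.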

See Also

reutils, retry, dots


olobiolo/siscreenr documentation built on Nov. 26, 2021, 3:08 p.m.