View source: R/fun-update_annotation.R
Check all geneIDs in library annotation file against GeneBank and get up-to-date information.
update_annotation(infile, outfile, verbose = FALSE, ...)
check_geneids(geneIDs, verbose, ...)
check_geneid_status(geneID, verbose, ...)
get_gene_fields(geneIDs)
get_gene_fields_batch(geneIDs, verbose, ...)
get_gene_type(geneID)
extract_field(x)
infile | file containing the original annotation; must be compatible with
outfile | (optional) path to a file to save the updated annotation
verbose | logical, whether or not to report progress
... | ellipsis to facilitate control of internal functions' verbosity
Since information in databases can change, it is prudent
to refresh the library annotation from time to time.
This function takes every geneID in an annotation file and checks its current status
in GeneBank: whether it has been withdrawn or replaced (and if so, by what new geneID).
It then queries GeneBank again to retrieve the gene type (protein-coding, pseudogene, etc.).
Once the geneIDs are updated, yet another query is sent to GeneBank
to retrieve the current gene symbol, gene description, map location,
chromosome number, and aliases. This is done in batches of up to 499 items at a time.
(This seems like an odd limit but reutils forces saving results to a file at 500 or more records per query.)
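The batching described above can be sketched in base R as follows. This is only an illustration under stated assumptions: `split_into_batches` is a hypothetical helper, not a function of the package, and the real get_gene_fields_batch may split its input differently.

```r
# Hypothetical helper (not part of the package): split a vector of geneIDs
# into consecutive batches of at most `size` items.
split_into_batches <- function(ids, size = 499L) {
  stopifnot(size >= 1L)
  if (length(ids) == 0L) return(list())
  # items 1..size get index 1, the next `size` items get index 2, and so on
  batch_index <- ceiling(seq_along(ids) / size)
  unname(split(ids, batch_index))
}

gene_ids <- as.character(seq_len(1200))   # made-up IDs for illustration
batches <- split_into_batches(gene_ids)
lengths(batches)                          # 499 499 202
```

Each element of `batches` could then be passed to one query, so that no single request reaches 500 records.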
Original geneIDs and gene symbols are kept in separate columns, original_geneid and original_gene_symbol.
The function either invisibly returns a data frame or saves to a specified path and returns nothing.
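The return behavior described above follows a common R pattern, sketched here in base R. `return_or_save` is a hypothetical name, and the `NULL` default for `outfile` and the tab-delimited output are assumptions; the package itself saves data with data.table.

```r
# Hypothetical sketch: return the result invisibly when no output path is
# given, otherwise write it to disk and return nothing useful.
return_or_save <- function(result, outfile = NULL) {
  if (is.null(outfile)) {
    return(invisible(result))
  }
  # assumption: tab-delimited text output
  write.table(result, file = outfile, sep = "\t",
              quote = FALSE, row.names = FALSE)
  invisible(NULL)
}

updated <- data.frame(geneid = c("1", "2"), stringsAsFactors = FALSE)
kept <- return_or_save(updated)             # data frame, not printed
out_path <- tempfile(fileext = ".txt")
nothing <- return_or_save(updated, out_path) # NULL, file written
```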
check_geneids: runs check_geneid_status for all geneIDs and returns a data.frame; pauses for 0.5 seconds before each request to avoid swamping the server
check_geneid_status: checks a single geneID and retrieves its withdrawn and replaced status (TRUE/FALSE) and a potential new geneID; then retrieves the gene type (protein-coding, pseudo, etc.); there is a 1 second pause between queries; returns a character vector
get_gene_fields: queries the gene database and retrieves five fields: gene symbol, gene description, map location, chromosome number, and aliases (other geneIDs associated with the geneID)
get_gene_fields_batch: runs get_gene_fields in batches of 499 or fewer; this is necessary as the results of efetch are unworkable for larger sets
get_gene_type: queries the gene database and retrieves the gene type field
extract_field: called by get_gene_type to extract the gene type, with provisions in case the field does not exist or is NA
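The "provisions" mentioned for extract_field can be illustrated with a defensive accessor. This is a sketch, not the package's code: `extract_field_sketch`, its `field` and `default` arguments, and the assumption that a record is a named list are all hypothetical (the real extract_field takes a single argument x).

```r
# Hypothetical sketch: pull one field out of a record (assumed to be a named
# list) and fall back to NA when the field is absent, empty, or NA.
extract_field_sketch <- function(record, field = "type",
                                 default = NA_character_) {
  value <- record[[field]]          # NULL when the list has no such name
  if (is.null(value) || length(value) == 0L || all(is.na(value))) {
    return(default)
  }
  as.character(value[[1L]])
}

extract_field_sketch(list(type = "protein-coding"))  # "protein-coding"
extract_field_sketch(list(symbol = "TP53"))          # NA (field missing)
extract_field_sketch(list(type = NA))                # NA (field is NA)
```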
As of version 2.4, the function has been generalized. It now serves not only the original Dharmacon file but also other text files. The input file may be tab- or comma-delimited. It must contain the following information: plate number, well/position (e.g. A01), and geneID. All other columns are immaterial but will not be dropped.
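The file requirements above could be checked along these lines. This is a base R sketch under assumptions: the column names `plate`, `position` and `geneid` are guesses for illustration, `read_annotation_sketch` is a hypothetical function, and the package itself loads data with data.table rather than read.table.

```r
# Hypothetical sketch: read a tab- or comma-delimited annotation file and
# verify that the required columns are present. Column names are assumed.
read_annotation_sketch <- function(path,
                                   required = c("plate", "position", "geneid")) {
  header <- readLines(path, n = 1L)
  count_char <- function(ch) {
    lengths(regmatches(header, gregexpr(ch, header, fixed = TRUE)))
  }
  # guess the delimiter from the header line
  sep <- if (count_char("\t") >= count_char(",")) "\t" else ","
  ann <- read.table(path, header = TRUE, sep = sep, quote = "\"",
                    colClasses = "character", check.names = FALSE,
                    stringsAsFactors = FALSE)
  missing_cols <- setdiff(required, names(ann))
  if (length(missing_cols) > 0L) {
    stop("missing required column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  ann   # extra columns are kept untouched
}

csv_path <- tempfile(fileext = ".csv")
writeLines(c("plate,position,geneid,note",
             "1,A01,7157,first",
             "1,A02,672,second"), csv_path)
ann <- read_annotation_sketch(csv_path)
```

Reading every column as character keeps well positions such as A01 and the geneIDs exactly as written in the file.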
Annotation files can be re-updated. In such a case the columns original_geneid and original_gene_symbol will remain as they are
and the update will be run with original geneIDs rather than the updated ones.
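The re-update rule above can be sketched as follows. The column names `geneid`, `gene_symbol` and `query_geneid` as well as the function `prepare_for_update` are hypothetical; only original_geneid and original_gene_symbol come from the documentation.

```r
# Hypothetical sketch: preserve the original columns on a first update and
# always query with the original geneIDs on later updates.
prepare_for_update <- function(annotation) {
  if (!"original_geneid" %in% names(annotation)) {
    # first update: copy the current values into the "original" columns
    annotation$original_geneid <- annotation$geneid
    annotation$original_gene_symbol <- annotation$gene_symbol
  }
  # first or repeated update: queries start from the original IDs
  annotation$query_geneid <- annotation$original_geneid
  annotation
}

fresh <- data.frame(geneid = c("100", "200"),
                    gene_symbol = c("AAA", "BBB"),
                    stringsAsFactors = FALSE)
first_pass <- prepare_for_update(fresh)

# pretend the first update replaced one geneID, then update again
first_pass$geneid[1] <- "101"
second_pass <- prepare_for_update(first_pass)
second_pass$query_geneid   # still the original IDs: "100" "200"
```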
GeneBank queries are handled with the package reutils.
Random errors that occur on queries are handled with retry.
Data is loaded (and saved) with package data.table.
Data processing is done in base R.
Several internal functions are called here, see Functions.
check_geneid_status queries GeneBank one geneID at a time, which may swamp the server,
hence a 0.5 second pause is introduced before every query.
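A pause before every query, combined with a retry on random errors, might look like the base R sketch below. It is a stand-in only: the package uses the retry package for error handling, and `with_retry`, `make_flaky`, `max_tries` and `wait` are hypothetical names introduced for this illustration. No real server is contacted.

```r
# Hypothetical sketch: wait before each attempt and retry a failing call
# a limited number of times.
with_retry <- function(fun, ..., max_tries = 3L, wait = 0.5) {
  for (attempt in seq_len(max_tries)) {
    Sys.sleep(wait)                      # pause before every query
    result <- tryCatch(fun(...), error = function(e) e)
    if (!inherits(result, "error")) return(result)
  }
  stop("query failed after ", max_tries, " attempts: ",
       conditionMessage(result))
}

# Stand-in for a remote query that fails a given number of times first.
make_flaky <- function(fail_times) {
  calls <- 0L
  function(gene_id) {
    calls <<- calls + 1L
    if (calls <= fail_times) stop("temporary failure")
    paste0("status-for-", gene_id)
  }
}

status <- with_retry(make_flaky(2L), "7157")   # succeeds on the third try
```

With one geneID per query, total run time grows at least linearly with the number of geneIDs times the pause length.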