0.0.87.900
, due
to a bug introduced in jamba::cPaste()
that adversely affected
output for accessions that require multiple rounds of querying.freshenGenes()
new argument ignore.case
which calls
genejam::imget()
as a drop-in replacement for mget()
.
The process was improved by calling AnnotationDbi::keys()
instead of AnnotationDbi::ls()
, and this change is at
least an order of magnitude faster. In brief benchmarks,
using ignore.case=TRUE
adds roughly 0.1 seconds per
annotation in try_list
, but otherwise is the same
speed regardless the number of input entries.freshenGenes()
was updated so the input can contain
intermediate values, for example ENTREZID values.
In fact, now input can contain a mixture of gene symbols,
ENTREZID intermediate values, and it will fill in the
holes accordingly.
new argument intermediate
to define the colname that
contains the intermediate values, most commonly EG which
are Entrez gene ID.
intermediate
except when
handle_multiple="first_hit"
any existing value in
intermediate
is used with no further processing.
All other handle_multiple
will combine entries
into intermediate
.is_empty()
is a small helper function to determine which
entries in a vector are either NA
or ""
.These two new functions are convenience functions.
I often find myself wanting the gene symbol and
long gene name, so now freshenGenes2()
does that by default.
To add gene aliases, use freshenGenes3()
.
freshenGenes2()
is a simple extension to freshenGenes()
that
has "SYMBOL", "GENENAME"
in the output by default.freshenGenes3()
is a simple extension to freshenGenes()
that
has "SYMBOL", "GENENAME", "ALIAS"
in the output by default.get_anno_db()
logic to check for reciprocal annotation names
was updated to cover more scenarios. Specifically, "org.Hs.egUNIPROT2EG"
is properly recognized, it previously was not being recognized by
the reciprocal "org.Hs.egUNIPROT"
and therefore was being skipped.To prepare for a wider release, I decided to rename (!) some arguments, to have snake_case instead of camelCase for consistency. I heard myself complaining about my own package, "Why are some arguments camelCase and others are snake_case? Pick one!" I complain with a smile on my face, but still it's a fair point.
finalList
is now final
tryList
is now try_list
annLib
is now ann_lib
I suppose I should probably rename freshenGenes()
to freshen_genes()
.
get_anno_db()
new argument ignore.case
which will build an
environment where all keys are converted to lowercase. Ultimately,
this option incurs the lowest performance hit, since the keys
only need to be converted once, then the environment can be used
repeatedly with native mget()
functions.imget()
case-insensitive mget()
-- however once I tested it,
I realized this mechanism is fairly slow when using a fairly large
annotation object. Also, if querying the same data multiple times
using imget()
, there is no re-use and the cost is incurred each
operation -- very much not ideal. This function will likely be
retired soon.better_exists()
and better_get()
which are (not so humbly)
improved versions of base::exists()
and base::get()
respectively.
Their sole benefit is to recognize a package prefix in an object name,
so things like better_exists("base::get")
will return TRUE since
that object does exist; and subsequently better_get("base::get")
will
return that object. These functions are mostly useful when using
annotation package prefixes such as "KEGG.db::KEGGPATHID2NAME"
.
I needed a simple way to test if it exists.
better_exists()
also allows multiple input values.get_anno_db()
now calls better_exists()
and better_get()
which
allows using a package prefix with annotation names.get_anno_db()
argument "revmap_suffix"
now allows multiple
possible values, it cycles through each until it finds a match, otherwise
returns NULL
. Some annotations use suffix "2ENTREZID"
instead of
"2EG"
, and still others use "2NAME"
. I'm sure there will be others.freshenGenes()
new argument handle_multiple="best_each"
which
returns the best first try for each delimited entry in each input
row. For example c("APOE","APOA")
will match "APOE"
as an
authoritatice gene symbol, but "APOA"
is matched as an alias
to the new gene symbol "LPA"
. The output will be "APOA,LPA"
.
Note that output will contain unique entries delimited, but they
will not be sorted.The next version will have handle_multiple="best_each"
which
will find the best match for each entry in a set of delimited
gene symbols. Most useful for something like pathway enrichment
results, where the goal is to retain all possible genes, yet
each gene may require a different type of annotation to
find a match. See TODO.md
for details.
freshenGenes()
includes a new example showing how to recognize
Affymetrix probesets by using a custom search library.freshenGenes()
handles multiple annotation libraries,
mostly in the form of fully described annotation names,
such as "org.Hs.egSYMBOL"
and "hgu133plus2ENTREZID"
.freshenGenes()
new option empty_rule="na"
which will
replace empty entries with NA
. Other options empty_rule="blank"
replaces with ""
, and empty_rule="original"
replaces empty
entries in the first output column with the original entry in
the first input column.get_anno_db()
now returns`NULL
when an annotation is
not found, instead of throwing an error. This change allows
the calling function to skip missing annotation gracefully
without using tryCatch()
to catch the error.freshenGenes()
now properly ignores NA values without throwing
an exception. NA values are left as-is and returned as NA in the
final output.freshenGenes()
new argument protect_inline_sep
helps to
prevent splitting single values that may include the same sep
character, for example not splitting "H4 clustered histone 10, pseudogene"
into "H4 clustered histone 10"
and "pseudogene"
. Also, the
handling of finalList
uses sep
as the split
since that sep
is known to have been used in creating the intermediate values,
therefore it should be consistent in the final step. This subtle
change helps allow a more general split pattern in the first
step, such as "[, ]+"
which splits at comma and/or space,
without splitting at spaces in subsequent steps.lgsub()
is a simple wrapper around gsub()
except that it
operates on character vectors inside a list
. This function will
probably be moved into the jamba
package in the near future.
In fact, there will probably be a small family of list-friendly
functions: lgrep(), lvigrep(), lvgrep(), lstrsplit().get_anno_db()
was updated to handle reverse-map annotations,
for example requesting org.Hs.egALIAS
and deriving it from
org.Hs.egALIAS2EG
using AnnotationDbi::revmap()
.stringsAsFactors
bugs.
Least favorite default option in R.x
to character in freshenGenes()
.Note that genejam requires one Bioconductor annotation package,
usually org.Hs.eg.db
but can be any valid organism, such
as org.Mm.eg.db
for mouse, or org.Rn.eg.db
for rat.
mget()
by using jamba::rmNA()
.Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.