knitr::opts_chunk$set( collapse=TRUE, warning=FALSE, message=FALSE, fig.height=8, fig.width=8, comment="#>", fig.path="jambaVignette-" ); library(jamba);
Jamba is intended to contain JAM base functions, to be re-used during analysis, and by other R packages. Functions are broadly divided into categories.
plotSmoothScatter()
A common problem when visualizing extremely large datasets is how to
display thousands of datapoints while indicating the amount of overlap
of those points. The R graphics::smoothScatter()
function provides an
adequate drop-in replacement for most uses of plot()
, where several
aspects are enhanced. Specifically, parameters related to the
density function and related bandwidth arguments are enabled,
which otherwise is possible but tedious.
The arguments bandwidthN
and nbin
define the number of bandwidth steps, and the number of visual bins,
respectively. The bandwidth steps defines the level of detail
calculated for the 2-dimensional density map, higher values give
more detail, seen as more "grainy" -- ultimately whether you can see
individual points. The number of visual bins determines the
amount of detail used to render the density as an image -- effectlvely
how many density pixels. Higher values show more detail, but have
much larger memory requirement, and can be slow to render for
multi-panel plots.
The argument useRaster=TRUE
by default renders the plot as
a rasterized image, making it substantially faster and smaller
especially when rendered as a PDF, SVG, or any vector type
output.
An example showing plotSmoothScatter()
with argument doTest=TRUE
.
The test data shows a set of points not well-resolved with default
smoothScatter()
, but which is more clearly displayed with
plotSmoothScatter()
.
plotSmoothScatter(doTest=TRUE);
The function imageDefault()
is a slight enhancement to
graphics::image()
specifically to enforce square pixels
when useRaster=TRUE
. The default graphics::image()
assumes
each pixel will be rendered in a 1:1 aspect ratio, therefore
the 2-dimensional density map is distorted when the axis ranges
are not symmetric 1:1. Note that plotSmoothScatter()
calls
imageDefault()
.
imageByColors()
The imageByColors()
function is intended to take a matrix or data.frame
that already contains colors in each cell. It optionally displays cell
labels when supplied.
Cell labels are grouped to display one unique label per repeated label,
using the function breaksByVector()
to group labels.
This function is particularly useful to simplify labels in a large table of repeated values, for example in experiment design.
Here, we define a simple data.frame composed of colors, then use the data.frame to label itself:
a1 <- c("red","blue")[c(1,1,2)]; b1 <- c("yellow","orange")[c(1,2,2)]; c1 <- c("purple","orange")[c(1,2,2)]; d1 <- c("purple","green")[c(1,2,2)]; df1 <- data.frame(a=a1, b=b1, c=c1, d=d1); imageByColors(df1, cellnote=df1);
Labels can be independently rotated and resized, an arbitrary example is shown below:
imageByColors(df1, cellnote=df1, useRaster=TRUE, #adjBy="column", cexCellnote=list(c(1.5,1.5,1), c(1,1.5), c(1.6,1.2), c(1.6,1.5)), srtCellnote=list(c(90,0,0), c(0,45), c(0,0,0), c(0,90,0)));
There are several useful axis labeling functions.
For log-transformed data, minorLogTicksAxis()
is a
flexible function to help deal with different transforms,
such as log2(1 + x)
, which is common when analyzing
gene expression data, or any measurement data that
may contain zeros and no negative values.
Data transformed with log2(1 + x)
must be correctly inverse-transformed to generate the
proper numeric labels.
By default, the log labels are displayed in base 10, while the underlying data can be any numeric base. (The default assumes log2.)
Also note that zero values are represented as zero.
expr <- 2^seq(from=0, to=17, length.out=20) - 1; par("mfrow"=c(2,2)); layout(matrix(ncol=2, data=c(1,2,3,3))); bp1 <- barplot(expr, las=2, col="dodgerblue", main="measurement data in normal space"); axis(1, at=bp1[,1], labels=seq_along(bp1[,1]), las=2) par("xpd"=FALSE); bp1 <- barplot(expr[c(1,which(expr<=100)+1)], ylim=c(0,100), las=2, col="dodgerblue", main="zoomed-in y-axis"); axis(1, at=bp1[,1], labels=seq_along(bp1[,1]), las=2) log_expr <- log2(1 + expr); bp1 <- barplot(log_expr, las=2, col="dodgerblue", ylab="log2(1 + x) value", main="measurement data after log2(1+x) transform"); axis(1, at=bp1[,1], labels=seq_along(bp1[,1]), las=2) minorLogTicksAxis(side=4, logBase=2, offset=1, displayBase=10); mtext(side=4, line=2.95, las=0, "normal space measurement values")
Another common example is with log fold change (also called log ratio). The log fold change is preferred to represent fold changes because it allows the "up" and "down" values to be symmetric.
fr <- round(2^(0:14/2 - 3.5), digits=2); fc <- round(ifelse(fr < 1, -1/fr, fr), digits=2); fl <- paste0(format(fr, trim=TRUE), ifelse(fr < 1, paste0(" (", format(fc, trim=TRUE), "-fold)"), "")); par("mfrow"=c(1,2)); bp1 <- barplot(fr-1, ylim=c(-5, max(fr)+2), las=2, offset=1, col="dodgerblue", main="data expressed as a ratio"); k <- (fr < 1); text(x=bp1[k,1], y=fr[k], srt=90, adj=1.2, fl[k]); text(x=bp1[!k,1], y=fc[!k], srt=90, adj=-0.2, fl[!k]); box(); log_fc <- log2(fr); barplot(log_fc, ylim=c(-4, 4), yaxt="n", col="dodgerblue", main="log ratios with log2 y-axis"); box(); minorLogTicksAxis(side=2, logBase=2, offset=0, symmetricZero=TRUE, displayBase=10); mtext(side=2, line=2, las=0, "fold changes with log10 breaks"); minorLogTicksAxis(side=4, logBase=2, offset=0, symmetricZero=TRUE, displayBase=2); mtext(side=4, line=2, las=0, "fold changes with log2 breaks");
plotPolygonDensity()
A convenience function plotPolygonDensity()
is a light wrapper
around two functions: hist()
and density()
. However, it
makes two other options convenient:
log10(1+x)
or sqrt()
,par("mar"=c(8,4,4,2), "mfrow"=c(2, 2)); options("scipen"=7); plotPolygonDensity(10^(3+rnorm(2000)), breaks=50, cex.axis=1, main="normal-scaled x-axis"); plotPolygonDensity(10^(3+rnorm(2000)), log="x", breaks=50, main="log-scaled x-axis"); plotPolygonDensity((3+rnorm(2000))^2, cex.axis=1, breaks=50, main="normal-scaled x-axis"); plotPolygonDensity((3+rnorm(2000))^2, cex.axis=1, xScale="sqrt", breaks=50, main="sqrt-scaled x-axis"); drawLabels(preset="bottom", txt="sqrt-scaled x-axis", labelCex=1.5)
drawLabels()
drawLabels()
is aimed at base R graphics, and provides a quick
way to add a label to a plot. The argument preset
is used to
place the label relative to the sides and corners of the plot.
par("mfrow"=c(1,1)) plotPolygonDensity((3+rnorm(2000))^2, cex.axis=1, xScale="sqrt", breaks=50, main=""); drawLabels(preset="bottom", txt="sqrt-scaled x-axis", labelCex=1.5)
For me, color plays a big role in my daily work, both in how I use R, and the figures and visualizations I produce during data analysis.
Another Jam package colorjam
focuses on defining
categorical colors in an extensible manner.
getColorRamp()
getColorRamp()
is a workhorse of several other functions
and workflows. It tries to make convenient the job of
obtaining a color ramp (aka a color palette, or color
gradient). It interfaces with RColorBrewer
and viridisLite
for color palette names, and allows some useful extensions.
The suffix "_r"
is convenient for reversing the order of
colors, especially with RColorBrewer
palette "RdBu"
which provides red-white-blue colors. On heatmaps, for me
anyway, "red" is associated with "heat" and therefore the
high numeric value. Using "RdBu_r"
will provide
blue-white-red colors.
showColors()
is used to display a color ramp, by default
including the labels using imageByColors()
.
warpRamp()
is used with color gradients to compress or
expand the color gradient.
colorjam::closestRcolor()
is used to assign an R color
label to any R color in hex "#RRGGBB"
format.
colorjam::rainbowJam()
is used to define a categorical
color palette with n
colors, relatively evenly spaced
around a modified red-yellow-blue color wheel. (This color
wheel avoids the dominance of green and blue-green colors
that result from most "rainbow" color palettes.)
The function alternates the brightness value and
color saturation, so adjacent
colors are more visibly distinct.
showColors(list( Reds=getColorRamp("Reds"), RdBu=getColorRamp("RdBu"), RdBu_r=getColorRamp("RdBu_r"), `RdBu_r_warped x5`=warpRamp(getColorRamp("RdBu_r"), lens=5), `RdBu_r_warped x-5`=warpRamp(getColorRamp("RdBu_r"), lens=-5) )); showColors(vigrep("red", colors()))
mixedSort()
, mixedSortDF()
, mixedOrder()
The gtools
package implemented a function gtools::mixedsort()
which aimed to sort character vectors similar to GNU version sort
(gsort -V
), where numeric values within an alphanumeric
string would be sorted in proper numerical order.
The mixedSort()
is an optimization and extension, driven by the
desire to sort tens of thousands of gene symbols (and micro-RNAs, miRNA)
efficiently and based upon known order. Also, because miRNA nomenclature
includes a dash '-' character, numbers are not assumed to be negative
by default, but this behavior can be controlled.
For example "miR-2" should sort before "miR-11", and after "miR-1".
if (suppressPackageStartupMessages(require(gtools))) { x <- c("miR-12","miR-1","miR-122","miR-1b", "miR-1a", "miR-2"); df <- data.frame(sort=sort(x), gtools_mixedsort=gtools::mixedsort(x), mixedSort=mixedSort(x)); } else { x <- c("miR-12","miR-1","miR-122","miR-1b", "miR-1a", "miR-2"); df <- data.frame(sort=sort(x), mixedSort=mixedSort(x)); } df;
Note: it took roughly 12 seconds to sort 43,000 gene symbols using
gtools::mixedsort()
, and roughly 0.5 seconds using mixedSort()
.
The mixedOrder()
function is actually the core ordering algorithm,
called by mixedSort()
. The mixedOrder()
function does not fully behave
like order()
, in that order()
accepts a list of vectors as one type of input,
and applies the order logic to each list iteratively, in order break ties.
Instead the mmixedOrder()
function was implemented to serve that purpose.
For the most part, mixedSort()
or mixedSortDF()
should fulfill
most use cases.
mixedSortDF()
The mixedSortDF()
function is a convenient wrapper around mmixedOrder()
applied to data.frames. It sorts columns in a specified order, maintaining
ties in each column, and breaking ties as warranted in subsequent columns.
It has the added benefit of maintaining factor order, so a data.frame
that contains a column of factors with a specified order will maintain that
order during the sort process. This behavior is especially useful when
sorting a data.frame
with experiment design, where each column may contain
an ordered factor indicating the proper sample group ordering.
x <- c("miR-12","miR-1","miR-122","miR-1b", "miR-1a","miR-2"); g <- rep(c("Air", "Treatment", "Control"), 2); gf <- factor(g, levels=c("Control","Air", "Treatment")); df2 <- data.frame(groupfactor=gf, miRNA=x, stringsAsFactors=FALSE); df2sorted <- mixedSortDF(df2);
#knitr::kable(df2sorted);
df2sorted;
setPrompt()
setPrompt()
is a simple function that creates an R prompt
which includes the project name, R version, and current process ID.
When possible, the prompt uses color text.
It is especially helpful if you find yourself working on multiple active R projects at the same time. For me, the more visual reminders the better!
The process ID is particularly helpful if one R session becomes inactive, and if there are other active R sessions open. The process ID is the best way to identify the specific task so it can be killed.
setPrompt("jambaVignette"); # {jambaVignette}-R-3.6.0_10789>
writeOpenxlsx()
writeOpenxlsx()
is intended to make several capabilities of
the amazing openxlsx
package convenient.
If you ever found yourself saving data to Excel, then opening the file to apply numerous custom options -- this function is aimed at you! For Excel files with multiple worksheets, it quickly becomes tedious to apply all options consistently to each worksheet.
The capabilities are described below:
.xlsx
file.vigrep()
, provigrep()
, igrep()
, igrepHas()
I use these functions numerous times every day, in all
aspects of my analysis workflow. They're very simple
and convenient wrappers around base::grep()
:
vigrep()
- extends grep to use value=TRUE
and ignore.case=TRUE
provigrep()
- extends vigrep()
to use a vector of patterns,
and return values in the order they are matched. Great when you want
"all the controls first, then all the treated, then everything else".igrep()
- extends grep to use ignore.case=TRUE
, case-insensitive
matching.igrepHas()
- extends igrep()
to return TRUE
or FALSE
if any value in a vector matches the pattern.gsubOrdered()
gsubOrdered()
is an extension to gsub()
that preserves
factor order of the input data, by creating new ordered factor
levels using the same gsub()
replacement. If you have
experimental factors in a specific order, want to maintain
that order while making changes to the labels themselves,
this function is for you.
pasteByRow()
and pasteByRowOrdered()
pasteByRow()
is a lightweight by efficient method
for combining multiple columns into one character string.
After numerous benchmark tests, it appears the fastest
method for combining columns in a data.frame
with
10,000 or more rows of data, was to run a series of
paste()
operations. Every row-based method is far
worse, probably because the paste()
method is
vectorized and scales well.
However, this function also provides some convenient defaults, like skipping empty columns, and trimming leading and trailing blank entries.
I use this function frequently to make sample group
labels, combining values from a data.frame
with
multiple columns containing experimental factors.
pasteByRowOrdered()
is an extension of pasteByRow()
that maintains factor level order of each column,
and creates combined factor levels in that same
consistent order. If your data.frame
contains
experimental factors with a defined order, this
function will produce a factor where those
orders are also maintained.
a1 <- factor(c("mutant", "control")[c(1,1,2)], levels=c("control", "mutant")); b1 <- factor(c("vehicle", "treated")[c(2,1,1)], levels=c("vehicle", "treated")); d1 <- c("purple","green")[c(1,2,2)]; df2 <- data.frame(a=a1, b=b1, d=d1); df2; pasteByRow(df2); pasteByRowOrdered(df2); df3 <- data.frame(df2, pasteByRowOrdered=pasteByRowOrdered(df2)); mixedSortDF(df3, byCols="pasteByRowOrdered")
makeNames()
, nameVector()
, nameVectorN()
Again, I use these functions daily, to create unique names from a character vector, using a controlled vocabulary.
makeNames()
simply returns unique names, for
duplicated values it uses the suffix style _v1
, _v2
, _v3
.
The suffix can be controlled, whether to add a suffix to
singlet entries, what number to start with, etc.
nameVector()
uses a vector to name itself, returning
the same vector with names assigned by makeNames()
.
This function is helpful when used in an lapply()
because it enforces unique names which are returned
in the output list
from the lapply()
.
nameVectorN()
creates a named vector of the vector names.
I use it frequently in lapply()
functions to control
the names of the output list
.
x <- rep(head(letters, 4), c(2,4,1,5)); x; makeNames(x); nameVector(x); y <- nameVector(x); nameVectorN(y); lapply(nameVectorN(y), function(i){ i })
cPaste()
, cPasteSU()
, cPasteU()
cPaste()
is an efficient implementation intended
to concatenate values using a delimiter, similar to
paste(x, collapse=",")
, except in a list
context.
After numerous benchmark tests, there was no faster
implementation than the one from the Bioconductor
S4Vectors
package, which provides S4Vectors::unstrsplit()
.
cPasteU()
makes the values unique, inside each vector
of the list.
cPasteSU()
applies mixedSort()
after making the
values in each vector unique.
The mixedSort()
and unique()
functions are applied
in extended vector form, so each are applied only once
even when operating on a long list of vectors,
for example a list with >10,000 character vectors
should take less than 1 second.
These functions are very useful when operating on
a list of gene symbols. For example, a vector of
500,000 assay probe names may be converted to a list of
gene symbols, with some assay probe names associated
with multiple gene symbols. The function cPasteSU()
combines gene symbols with delimiter ","
, after
sorting and making values unique.
It is also useful with gene-pathway data, where biological pathways are associated with a long list of gene symbols.
set.seed(123); x <- lapply(seq_len(6), function(i){ paste0("Gene", sample(LETTERS, sample(c(1,1,2,5,9), 1), replace=TRUE)); }); cPaste(x); cPasteU(x); cPasteSU(x); data.frame(cPaste=cPaste(x), cPasteU=cPasteU(x), cPasteSU=cPasteSU(x))
jargs()
Jam extension to args()
.
If you ever used args()
and found it difficult to read the
jumble of function argument names, this function is for you!
Also, if you ever work with functions that have a lot of arguments, you can search arguments by pattern.
jargs(plotSmoothScatter) jargs(plotSmoothScatter, "^y")
sdim()
and ssdim()
These functions apply dim()
to a list, or
list of lists. In fact, they work with more than
just lists, they work with S4 object slotNames
,
multi-dimensional arrays,
and other objects which are list-like, for
example model fit objects are often stored
as a list.
For convenience, the output includes the class of each element in the list.
L <- list(LETTERS=LETTERS, letters=letters, lettersDF=data.frame(LETTERS, letters)); sdim(L); L2 <- list(List1=L, List2=L); sdim(L2); ssdim(L2);
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.