Kmisc introduces a grab-bag of utility functions that should be useful to various
kinds of useRs. Some of the most useful functions in the package are demoed
in this vignette.
set.seed(123)
library(data.table)
library(Kmisc)
library(lattice)
library(grid)
library(Rcpp)
library(knitr)
library(microbenchmark)
dat <- data.frame( x=letters[1:4], y=1:4, z=LETTERS[1:4] )
opts_chunk$set( results="markup" )
without: This function is used to remove columns from a list / data.frame.
## let's remove columns 'x' and 'z' from dat.
tryCatch( dat[ -c('x', 'z') ], error=function(e) print(e$message) )
## oh :(
dat[ !(names(dat) %in% c('x', 'z')) ]
## I always find that a bit awkward. Let's use Kmisc's without instead.
without(dat, x, z)
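To make the idea concrete, here is a minimal base-R sketch of how a helper that takes bare column names could be written. The name `without_sketch` is hypothetical and this is not Kmisc's implementation; it simply assumes the columns are passed as unquoted symbols and captures them with `substitute()`.

```r
# Hypothetical base-R stand-in for the idea behind without(); not Kmisc's code.
without_sketch <- function(x, ...) {
  # Capture the unevaluated arguments and turn each symbol into a string
  drop <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  # Keep every element whose name was not listed
  x[ !(names(x) %in% drop) ]
}

dat <- data.frame( x=letters[1:4], y=1:4, z=LETTERS[1:4] )
res_without <- without_sketch(dat, x, z)
print(res_without)
```

The same `names(x) %in% ...` test from the awkward version is still doing the work; the helper only hides it behind a friendlier call.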
extract: Extract vectors from a data.frame or list. Although there is already
a good subsetting syntax for lists and vectors, I wanted a complementary
function for without.
extract(dat, x, y)
re_without, re_extract: Extract variables whose names don't match / do match
a regular expression pattern.
re_extract(dat, "[xy]")
re_without(dat, "[xy]")
swap: Replace elements in a vector.
tDat <- dat ## make a temporary copy of dat
## Replace some elements in tDat$y
tDat$y <- swap( tDat$y, from=c(2, 4), to=c(20, 40) )
cbind( dat$y, tDat$y )
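The replacement rule can be sketched in base R with `match()`. The helper name `swap_sketch` is hypothetical and the code is only an illustration of the from/to mapping, not Kmisc's implementation; it assumes `from` and `to` have the same length.

```r
# Hypothetical base-R illustration of a from -> to replacement; not Kmisc's code.
swap_sketch <- function(vec, from, to) {
  idx <- match(vec, from)        # position of each element in 'from', NA if absent
  hit <- !is.na(idx)
  out <- vec
  out[hit] <- to[idx[hit]]       # replace only the matched elements
  out
}

res_swap <- swap_sketch( 1:4, from=c(2, 4), to=c(20, 40) )
print(res_swap)
```

Elements that do not appear in `from` pass through unchanged.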
factor_to_char, char_to_factor: A pair of functions that recurse through
a list / data.frame and convert all elements that are factors to characters,
or all elements that are characters to factors, respectively.
## stringsAsFactors=TRUE so that y and z are factors on R >= 4.0.0 as well
bDat <- data.frame( x=rnorm(10), y=sample(letters,10), z=sample(letters,10), stringsAsFactors=TRUE )
str( bDat )
str( factor_to_char(bDat) )
dapply: The data.frame version of the l/sapply series of functions.
Why have this function when sapply does much the same? I always get
frustrated with the fact that either an array or a list is returned
by sapply, but never a data.frame.
dat <- data.frame( x = rnorm(100), y = rnorm(100), z = rnorm(100) )
dapply( dat, summary )
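A rough base-R approximation of the same idea is to wrap `sapply` and coerce the result. `dapply_sketch` is a hypothetical name, and the sketch assumes that the applied function returns a vector of the same length (greater than one) for every column, so that `sapply` yields a matrix.

```r
# Hypothetical sketch: apply FUN to every column, then coerce to a data.frame.
# Assumes FUN returns equal-length vectors so sapply() produces a matrix.
dapply_sketch <- function(df, FUN, ...) {
  as.data.frame( sapply(df, FUN, ...) )
}

set.seed(1)
dat <- data.frame( x = rnorm(100), y = rnorm(100), z = rnorm(100) )
res_dapply <- dapply_sketch(dat, quantile)
print(res_dapply)
```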
kMerge: Left joins, a.k.a. merge( all.x=TRUE, ... ), without any mangling
of the order.
dat1 <- data.frame( id=5:1, x=c("a","a","b","b","b"), y=rnorm(5) )
dat2 <- data.frame( id=c(1, 2, 4), z=rnorm(3) )
## default merge changes id order
merge( dat1, dat2, by="id", all.x=TRUE )
## even the sort parameter can't save you
## (sort=TRUE is the default, so sort=FALSE is the call that differs from the one above)
merge( dat1, dat2, by="id", all.x=TRUE, sort=FALSE )
# kMerge keeps it as is
kMerge( dat1, dat2, by="id" )
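An order-preserving left join can be sketched in base R with `match()`. The helper `left_join_sketch` is hypothetical and is not how kMerge is implemented; it assumes a single key column and unique key values in the right-hand table.

```r
# Hypothetical order-preserving left join; assumes one key column and
# unique keys in 'y'. Not Kmisc's implementation.
left_join_sketch <- function(x, y, by) {
  idx <- match(x[[by]], y[[by]])                  # row of y for each row of x, NA if none
  extra <- y[idx, setdiff(names(y), by), drop=FALSE]
  rownames(extra) <- NULL
  cbind(x, extra)                                 # x keeps its original row order
}

set.seed(1)
dat1 <- data.frame( id=5:1, x=c("a","a","b","b","b"), y=rnorm(5) )
dat2 <- data.frame( id=c(1, 2, 4), z=rnorm(3) )
res_join <- left_join_sketch(dat1, dat2, by="id")
print(res_join)
```

Because rows of the left table are never reordered, unmatched ids simply receive NA in the joined columns.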
in_interval: A fast C implementation for determining which elements of a
vector x lie within the half-open interval [lo, hi).
x <- runif(10)*10; lo <- 5; hi <- 10
print( data.frame( x=x, between_5_and_10=in_interval(x, lo, hi) ) )
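The half-open convention is easy to get wrong at the boundaries, so here is the equivalent test written in plain R. `in_interval_sketch` is a hypothetical name used only for illustration; it mirrors the [lo, hi) definition given above rather than the package's C code.

```r
# Hypothetical plain-R equivalent of the [lo, hi) test described above.
in_interval_sketch <- function(x, lo, hi) x >= lo & x < hi

# lo itself is inside the interval, hi itself is not
res_interval <- in_interval_sketch( c(4.9, 5, 9.99, 10), lo=5, hi=10 )
print(res_interval)
```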
stack_list: Use this to stack data.frames in a list. This can be useful if
e.g. you've run some kind of bootstrap procedure and have all your results
stored as a list of data.frames; even do.call( rbind, dfs ) can be slow.
The difference is even more prominent when used on very large lists.
This is now partially superseded by data.table::rbindlist, which has a much
faster C implementation.
dfs <- replicate(1E3,
  data.frame(x=rnorm(10), y=sample(letters,10), z=sample(LETTERS,10)),
  simplify=FALSE
)
str( stack_list(dfs) )
system.time( stack_list(dfs) )
system.time( do.call(rbind, dfs) )
system.time( data.table::rbindlist(dfs) )
R is missing some nice builtin 'string' functions. I've introduced a few functions for common string operations.
str_rev: Reverses each string in a character vector, i.e. a vector of strings.
str_rev2 is there if you need to reverse a potentially unicode string.
str_rev( c("ABC", "DEF", NA, paste(LETTERS, collapse="") ) )
str_rev2( c("はひふへほ", "abcdef") )
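For comparison, reversing each string can be written in base R by splitting into characters, reversing, and pasting back together. `str_rev_sketch` is a hypothetical helper, not the package's implementation, and it keeps NA entries as NA by explicit choice.

```r
# Hypothetical base-R version of per-string reversal; not Kmisc's code.
str_rev_sketch <- function(x) {
  out <- vapply(
    strsplit(x, ""),                                # split each string into characters
    function(chars) paste(rev(chars), collapse=""),
    character(1)
  )
  out[is.na(x)] <- NA_character_                    # keep missing values missing
  out
}

res_rev <- str_rev_sketch( c("ABC", "DEF", NA) )
print(res_rev)
```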
str_slice: Slices each string in a vector into consecutive pieces of length n.
str_slice2 exists for potentially unicode strings.
str_slice( c("ABCDEF", "GHIJKL", "MNOP", "QR"), 2 )
str_slice2( "ハッピー", 2 )
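The slicing itself can be sketched with `substring()`. `str_slice_sketch` is a hypothetical helper that returns a list with one character vector per input string; whether that matches the exact return shape of the Kmisc function is an assumption, and NA inputs are not handled.

```r
# Hypothetical base-R slicing into pieces of n characters; not Kmisc's code.
# Returns a list with one character vector per input string.
str_slice_sketch <- function(x, n) {
  lapply(x, function(s) {
    starts <- seq(1, max(nchar(s), 1), by=n)   # start position of each piece
    substring(s, starts, starts + n - 1)
  })
}

res_slice <- str_slice_sketch( c("ABCDEF", "GHIJKL", "MNOP", "QR"), 2 )
print(res_slice)
```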
str_sort: Sorts the characters within a string.
str_sort("asnoighewgypfuiweb")
str_collapse: Collapse a character vector into a single string using Rcpp
sugar; operates like R's paste(..., collapse=""), but works much faster.
str_collapse( c("ABC", "DEF", "GHI") )
Sometimes, you get really large data files that just aren't going to fit into RAM. You really wish you could split them up in a structured way, transform them in some way, and then put them back together. One might consider a more 'enterprise' edition of the split-apply-combine framework (Hadoop and friends), but one dirty alternative is to use C++ to munge through a text file and pull out things that we actually want.
split_file: This function splits a delimited file into multiple files, according to
unique entries in a chosen column.
extract_rows_from_file: From a delimited text file, extract only the rows for
which the entries in a particular column match some set of items that you
wish to keep.
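To illustrate the streaming idea without loading the whole file, here is a minimal base-R sketch that reads a file in chunks and writes out only the matching rows. The function `extract_rows_sketch` and its arguments (`column`, `keep`, `sep`, `chunk`) are hypothetical and do not reflect the signature of the Kmisc function; the sketch also assumes the file has no header row and no quoted delimiters.

```r
# Hypothetical chunked row filter in base R; not Kmisc's C++ implementation.
# Assumes no header row and no quoted delimiters.
extract_rows_sketch <- function(path, out, column, keep, sep="\t", chunk=10000L) {
  con_in <- file(path, open="r")
  on.exit(close(con_in), add=TRUE)
  con_out <- file(out, open="w")
  on.exit(close(con_out), add=TRUE)
  repeat {
    lines <- readLines(con_in, n=chunk)             # only 'chunk' lines in memory at once
    if (length(lines) == 0L) break
    fields <- vapply(
      strsplit(lines, sep, fixed=TRUE),
      function(f) f[column],                        # value of the chosen column
      character(1)
    )
    writeLines(lines[fields %in% keep], con_out)
  }
  invisible(out)
}

src <- tempfile(); dst <- tempfile()
writeLines( c("a\t1", "b\t2", "a\t3", "c\t4"), src )
extract_rows_sketch(src, dst, column=1, keep="a")
kept <- readLines(dst)
print(kept)
```

A pure-R loop like this will be much slower than a compiled implementation, but it shows why memory use stays bounded: only one chunk of lines is held at a time.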
Use these functions to generate C++ / Rcpp-backed functions for common R-style operations.
Rcpp_tapply_generator: Generate fast tapply-style functions from C++
code through Rcpp. See the example.
dat <- data.frame( y=rnorm(100), x=sample(letters[1:5], 100, TRUE) )
tMean <- Rcpp_tapply_generator("return mean(x);")
with( dat, tMean(y, x) )
with( dat, tapply(y, x, mean) )
microbenchmark(
  Kmisc=with( dat, tMean(y, x) ),
  R=with( dat, tapply(y, x, mean) ),
  times=5
)
Rcpp_apply_generator: An apply function generator tailored to 2D matrices.
However, your function definition must return a scalar value.
aMean <- Rcpp_apply_generator("return mean(x);")
mat <- matrix( rnorm(100), nrow=10 )
aMean(mat, 2)
apply(mat, 2, mean)
microbenchmark( Kmisc=aMean(mat, 2), R=apply(mat, 2, mean) )
tapply_: This function operates like tapply but works faster through a
faster factor generating function, as well as an optimized split. Note,
however, that it is restricted to the (common) case of your value and grouping
variables being column vectors.
library(microbenchmark)
y <- rnorm(1000); x <- sample(letters[1:5], 1000, TRUE)
tapply(y, x, mean)
tapply_(y, x, mean)
microbenchmark( times=10,
  tapply(y, x, mean),
  tapply_(y, x, mean),
  tMean(y, x)
)
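The split-then-apply structure that tapply follows for a single grouping vector can be written out by hand in base R. This sketch only shows that structure; it does not reproduce the faster factor generation or optimized split that the text attributes to tapply_.

```r
# Base-R illustration of split-then-apply for one grouping vector.
# It shows the structure only and is not Kmisc's optimized code.
set.seed(1)
y <- rnorm(1000); x <- sample(letters[1:5], 1000, TRUE)

res_split  <- vapply( split(y, x), mean, numeric(1) )  # group, then summarise each group
res_tapply <- tapply(y, x, mean)

print(res_split)
print(res_tapply)
```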
melt_: This function operates like reshape2::melt, but works almost
entirely through C and hence is much faster.
dat <- data.frame( id=LETTERS[1:5], x1=rnorm(5), x2=rnorm(5), x3=rnorm(5) )
print(dat)
melt_(dat, id.vars="id")
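For readers unfamiliar with melting, the wide-to-long reshaping can be sketched in base R as follows. `melt_sketch` is a hypothetical helper, not the package's C implementation; it assumes all non-id columns share a type, and the output column names `variable` and `value` follow the reshape2 convention.

```r
# Hypothetical base-R wide-to-long reshape; not Kmisc's C implementation.
# Assumes all measured (non-id) columns have the same type.
melt_sketch <- function(df, id.vars) {
  measure <- setdiff(names(df), id.vars)
  n <- nrow(df)
  # Repeat the id columns once per measured column
  out <- df[rep(seq_len(n), times=length(measure)), id.vars, drop=FALSE]
  rownames(out) <- NULL
  out$variable <- factor(rep(measure, each=n), levels=measure)
  out$value <- unlist(df[measure], use.names=FALSE)   # stack measured columns end to end
  out
}

set.seed(1)
wide <- data.frame( id=LETTERS[1:5], x1=rnorm(5), x2=rnorm(5), x3=rnorm(5) )
long <- melt_sketch(wide, id.vars="id")
print(long)
```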
factor_: A faster, simpler implementation of factor through Rcpp. This might
be useful in some rare cases where speed is essential.
lets <- sample(letters, 1E6, TRUE)
stopifnot( identical( factor_(lets), factor(lets) ) )
microbenchmark( times=5, factor_(lets), factor(lets) )
html: Custom HTML in an R Markdown document.
html(
  table( class="table table-bordered table-striped table-condensed table-hover", ## bootstrap classes
    tr( td("Apples"), td("Bananas") ),
    tr( td("20"), td("30") )
  )
)
anatomy, anat: Like str, but much faster. It won't choke on very large data.frames.
df <- data.table(x=1, y=2)
str(df)
anatomy(df)