Kmisc introduces a grab-bag of utility functions that should be useful to various
useRs. Some of the most useful functions in the package are demoed
in this vignette.
set.seed(123)
library(data.table)
library(Kmisc)
library(lattice)
library(grid)
library(Rcpp)
library(knitr)
library(microbenchmark)
dat <- data.frame( x=letters[1:4], y=1:4, z=LETTERS[1:4] )
opts_chunk$set( results="markup" )
without: This function is used to remove columns from a data.frame, given their unquoted names.
## let's remove columns 'x' and 'z' from dat.
tryCatch( dat[ -c('x', 'z') ], error=function(e) print(e$message) )
## oh :(
dat[ !(names(dat) %in% c('x', 'z')) ]
## I always find that a bit awkward. Let's use Kmisc's without instead.
without(dat, x, z)
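For comparison, here is a minimal base R sketch of the same idea using quoted names. The helper `drop_cols` and the data.frame `demo` are hypothetical names introduced only for this illustration; this is not how `without` is implemented.

```r
# Hypothetical helper (not part of Kmisc): drop columns by quoted name.
drop_cols <- function(df, cols) {
  # setdiff() keeps every column name that is not listed in 'cols'
  df[ setdiff(names(df), cols) ]
}

demo <- data.frame( x=letters[1:4], y=1:4, z=LETTERS[1:4] )
drop_cols(demo, c("x", "z"))  # a data.frame holding only column 'y'
```

The unquoted form offered by `without` saves the quoting and the `c()` call at the console.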
extract: Extract vectors from a data.frame or list. Although there is already
a good subsetting syntax for lists and vectors, I wanted a complementary
function for without.
extract(dat, x, y)
re_without, re_extract: Extract variables whose names don't match / do match
a regular expression pattern.
re_extract(dat, "[xy]")
re_without(dat, "[xy]")
swap: Replace elements in a vector.
tDat <- dat ## make a temporary copy of dat
## Replace some elements in tDat$y
tDat$y <- swap( tDat$y, from=c(2, 4), to=c(20, 40) )
cbind( dat$y, tDat$y )
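The replacement logic can be sketched in base R with `match`. The helper `swap_sketch` is a hypothetical name, and the sketch assumes that elements not found in `from` are left unchanged; it is an illustration, not Kmisc's implementation.

```r
# Hypothetical base R sketch of a from/to replacement.
swap_sketch <- function(vec, from, to) {
  idx <- match(vec, from)      # position of each element in 'from', NA if absent
  hit <- !is.na(idx)           # elements that have a replacement
  vec[hit] <- to[ idx[hit] ]   # substitute the matching 'to' values
  vec
}

swap_sketch( 1:4, from=c(2, 4), to=c(20, 40) )
```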
char_to_factor, factor_to_char: A pair of functions that recurse through
a list / data.frame and set all elements that are characters to factors,
and vice versa.
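## stringsAsFactors=TRUE is needed in R >= 4.0.0 for y and z to start out as factors
bDat <- data.frame( x=rnorm(10), y=sample(letters,10), z=sample(letters,10), stringsAsFactors=TRUE )
str( bDat )
str( factor_to_char(bDat) )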
dapply: A data.frame version of the l/sapply series of functions.
Why have this function when sapply does much the same? I always get
frustrated with the fact that sapply returns either an array or a list,
but never a data.frame.
dat <- data.frame( x = rnorm(100), y = rnorm(100), z = rnorm(100) )
dapply( dat, summary )
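To make the complaint concrete, the base R route looks roughly like this. The names `num`, `as_matrix` and `as_df` are made up for the illustration, and the coercion only works when every column yields a result of the same length.

```r
num <- data.frame( x=c(1, 2, 3), y=c(4, 5, 6) )

as_matrix <- sapply( num, summary )     # simplifies to a numeric matrix here
as_df <- as.data.frame( as_matrix )     # an extra step to get back a data.frame

class(as_matrix)
class(as_df)
```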
kMerge: Left joins, i.e. merge( all.x=TRUE, ... ), without any mangling
of the row order of the left-hand data.frame.
dat1 <- data.frame( id=5:1, x=c("a","a","b","b","b"), y=rnorm(5) )
dat2 <- data.frame( id=c(1, 2, 4), z=rnorm(3) )
## default merge changes id order
merge( dat1, dat2, by="id", all.x=TRUE )
## even the sort parameter can't save you
## (sort=TRUE is the default; sort=FALSE still puts the unmatched rows last)
merge( dat1, dat2, by="id", all.x=TRUE, sort=FALSE )
# kMerge keeps it as is
kMerge( dat1, dat2, by="id" )
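An order-preserving left join can be sketched in base R with `match`. The helper `left_join_keep_order` and the data.frames `a` and `b` are hypothetical; the sketch assumes a single key column whose values are unique in the right-hand table, and it is not a description of how kMerge works internally.

```r
# Hypothetical sketch: left join that keeps the row order of 'x'.
left_join_keep_order <- function(x, y, by) {
  idx <- match( x[[by]], y[[by]] )   # row of y for each row of x, NA if no match
  extra <- y[ idx, setdiff(names(y), by), drop=FALSE ]
  rownames(extra) <- NULL            # discard the row names created by NA indexing
  cbind(x, extra)
}

a <- data.frame( id=5:1, x=c("a","a","b","b","b") )
b <- data.frame( id=c(1, 2, 4), z=c(0.1, 0.2, 0.4) )
left_join_keep_order(a, b, by="id")  # ids stay in the order 5, 4, 3, 2, 1
```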
in_interval: A fast C implementation for determining which elements of a
vector x lie within an interval.
x <- runif(10)*10; lo <- 5; hi <- 10
print( data.frame( x=x, between_5_and_10=in_interval(x, lo, hi) ) )
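The same test can be written in base R with vectorized comparisons. The helper `in_interval_sketch` and its `include_lower` / `include_upper` arguments are assumptions made for this illustration; how in_interval itself treats the endpoints is not stated above, so check its help page before relying on boundary behaviour.

```r
# Hypothetical base R sketch; endpoint handling is an explicit choice here.
in_interval_sketch <- function(x, lo, hi, include_lower=TRUE, include_upper=FALSE) {
  lower_ok <- if (include_lower) x >= lo else x > lo
  upper_ok <- if (include_upper) x <= hi else x < hi
  lower_ok & upper_ok
}

in_interval_sketch( c(1, 5, 7.5, 10), lo=5, hi=10 )
```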
stack_list: Use this to stack data.frames in a list. This can be useful if
e.g. you've run some kind of bootstrap procedure and have all your results
stored as a list of data.frames, where even
do.call( rbind, dfs ) can be slow.
The difference is even more prominent when used on very large lists.
This is now partially superseded by
data.table::rbindlist, which has a much
faster C implementation.
dfs <- replicate(1E3,
  data.frame(x=rnorm(10), y=sample(letters,10), z=sample(LETTERS,10)),
  simplify=FALSE
)
str( stack_list(dfs) )
system.time( stack_list(dfs) )
system.time( do.call(rbind, dfs) )
system.time( data.table::rbindlist(dfs) )
R is missing some nice builtin 'string' functions. I've introduced a few functions for common string operations.
str_rev: Reverses each string in a character vector; i.e., a vector of strings.
str_rev2 is there if you need to reverse a potentially unicode string.
str_rev( c("ABC", "DEF", NA, paste(LETTERS, collapse="") ) )
str_rev2( c("はひふへほ", "abcdef") )
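For reference, a base R version of string reversal might look like the sketch below. `str_rev_sketch` is a hypothetical name, and the explicit NA handling is a choice made for this illustration.

```r
# Hypothetical base R sketch: reverse each string in a character vector.
str_rev_sketch <- function(x) {
  out <- vapply(
    strsplit(x, "", fixed=TRUE),                      # split into single characters
    function(chars) paste(rev(chars), collapse=""),   # reverse and glue back together
    character(1)
  )
  out[ is.na(x) ] <- NA_character_                    # keep missing values missing
  out
}

str_rev_sketch( c("ABC", "DEF", NA) )
```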
str_slice: Slices a vector of strings at consecutive indices.
str_slice2 exists for potentially unicode strings.
str_slice( c("ABCDEF", "GHIJKL", "MNOP", "QR"), 2 )
str_slice2( "ハッピー", 2 )
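Slicing at consecutive indices can be sketched in base R with `substring`. The helper `str_slice_sketch` is hypothetical, it assumes non-missing input, and returning one character vector per input string is an assumption of this sketch rather than a statement about str_slice's return value.

```r
# Hypothetical base R sketch: cut each string into pieces of width n.
str_slice_sketch <- function(x, n) {
  lapply(x, function(s) {
    starts <- seq(1, max(nchar(s), 1), by=n)   # start index of each piece
    substring(s, starts, starts + n - 1)       # the last piece may be shorter
  })
}

str_slice_sketch( c("ABCDEF", "QR"), 2 )
```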
str_sort: Sort a string.
str_collapse: Collapse a string using
Rcpp sugar; operates like
paste(..., collapse=""), but works much faster.
str_collapse( c("ABC", "DEF", "GHI") )
Sometimes, you get really large data files that just aren't going to fit into RAM. You really wish you could split them up in a structured way, transform them in some way, and then put them back together. One might consider a more 'enterprise' edition of the split-apply-combine framework (Hadoop and friends), but one dirty alternative is to use C++ to munge through a text file and pull out things that we actually want.
split_file: This function splits a delimited file into multiple files, according to
unique entries in a chosen column.
extract_rows_from_file: From a delimited text file, extract only the rows for
which the entries in a particular column match some set of items that you
wish to keep.
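The general approach behind both functions, streaming through the file in chunks rather than loading it whole, can be sketched in plain R. The helpers `extract_rows_sketch` and `split_file_sketch`, their arguments, and the `.txt` output naming are all hypothetical; the Kmisc functions are written in C++ and their signatures may differ. The sketch also assumes no header row, no quoted delimiters, and key values that are safe to use as file names.

```r
# Hypothetical helper: pull the key field out of each delimited line.
key_of <- function(lines, column, sep) {
  vapply( strsplit(lines, sep, fixed=TRUE), function(f) f[column], character(1) )
}

# Hypothetical sketch: keep only lines whose key is in 'keep'.
extract_rows_sketch <- function(path, out, column, sep, keep, chunk_size=10000L) {
  con_in <- file(path, open="r");  on.exit(close(con_in), add=TRUE)
  con_out <- file(out, open="w");  on.exit(close(con_out), add=TRUE)
  repeat {
    lines <- readLines(con_in, n=chunk_size)   # only one chunk is held in memory
    if (length(lines) == 0L) break
    writeLines( lines[ key_of(lines, column, sep) %in% keep ], con_out )
  }
  invisible(out)
}

# Hypothetical sketch: route each line to a file named after its key.
split_file_sketch <- function(path, out_dir, column, sep, chunk_size=10000L) {
  con_in <- file(path, open="r");  on.exit(close(con_in), add=TRUE)
  repeat {
    lines <- readLines(con_in, n=chunk_size)
    if (length(lines) == 0L) break
    groups <- split( lines, key_of(lines, column, sep) )
    for (k in names(groups)) {
      con_out <- file( file.path(out_dir, paste0(k, ".txt")), open="a" )  # append mode
      writeLines( groups[[k]], con_out )
      close(con_out)
    }
  }
  invisible(out_dir)
}

src <- tempfile(); dst <- tempfile(); out_dir <- tempfile(); dir.create(out_dir)
writeLines( c("a\t1", "b\t2", "a\t3", "c\t4"), src )
extract_rows_sketch( src, dst, column=1, sep="\t", keep=c("a", "c") )
split_file_sketch( src, out_dir, column=1, sep="\t" )
readLines(dst)
list.files(out_dir)
```

A pure R loop like this will be slower than a compiled implementation, which is the motivation for doing the munging in C++.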
Use these functions to generate C++ / Rcpp-backed functions for common R-style operations.
Rcpp_tapply_generator: Generate fast
tapply style functions from C++
code through Rcpp. See the example.
dat <- data.frame( y=rnorm(100), x=sample(letters[1:5], 100, TRUE) )
tMean <- Rcpp_tapply_generator("return mean(x);")
with( dat, tMean(y, x) )
with( dat, tapply(y, x, mean) )
microbenchmark(
  Kmisc=with( dat, tMean(y, x) ),
  R=with( dat, tapply(y, x, mean) ),
  times=5
)
Rcpp_apply_generator: An apply function generator tailored to 2D matrices.
However, your function definition must return a scalar value.
aMean <- Rcpp_apply_generator("return mean(x);")
mat <- matrix( rnorm(100), nrow=10 )
aMean(mat, 2)
apply(mat, 2, mean)
microbenchmark(
  Kmisc=aMean(mat, 2),
  R=apply(mat, 2, mean)
)
tapply_: This function operates like
tapply but works faster through a
faster factor generating function, as well as an optimized split. Note that
it is however restricted to the (common) case of your value and grouping
variables being column vectors.
library(microbenchmark)
y <- rnorm(1000); x <- sample(letters[1:5], 1000, TRUE)
tapply(y, x, mean)
tapply_(y, x, mean)
microbenchmark( times=10,
  tapply(y, x, mean),
  tapply_(y, x, mean),
  tMean(y, x)
)
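The factor / split / apply pipeline mentioned above can be written out step by step in base R, which shows where the time goes. The variable names `vals`, `grp`, `groups` and `means` are made up for the illustration; this is not tapply_'s code.

```r
vals <- c(1, 2, 3, 4, 5, 6)
grp <- c("a", "b", "a", "b", "a", "b")

groups <- split( vals, factor(grp) )        # step 1: build the factor and split the values
means <- vapply( groups, mean, numeric(1) ) # step 2: apply the function to each group
means
```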
melt_: This function operates like
reshape2::melt, but works almost
entirely through C and hence is much faster.
dat <- data.frame( id=LETTERS[1:5], x1=rnorm(5), x2=rnorm(5), x3=rnorm(5) )
print(dat)
melt_(dat, id.vars="id")
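For readers unfamiliar with melting, the wide-to-long reshaping can be sketched by hand in base R. The names `wide`, `measure` and `long` are hypothetical, and the `variable` / `value` column names follow the reshape2 convention; whether melt_ uses exactly these names is an assumption.

```r
wide <- data.frame( id=c("A", "B"), x1=c(1, 2), x2=c(3, 4) )
measure <- setdiff( names(wide), "id" )   # every non-id column gets stacked

long <- data.frame(
  id = rep( wide$id, times=length(measure) ),      # ids repeat once per measured column
  variable = rep( measure, each=nrow(wide) ),      # name of the source column
  value = unlist( wide[measure], use.names=FALSE ) # values stacked column by column
)
long
```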
factor_: A faster, simpler implementation of
factor through Rcpp. This might
be useful in some rare cases where speed is essential.
lets <- sample(letters, 1E6, TRUE)
stopifnot( identical( factor_(lets), factor(lets) ) )
microbenchmark( times=5,
  factor_(lets),
  factor(lets)
)
html: Custom HTML in an R Markdown document.
html(
  table( class="table table-bordered table-striped table-condensed table-hover", ## bootstrap classes
    tr( td("Apples"), td("Bananas") ),
    tr( td("20"), td("30") )
  )
)
anatomy: Like str, but much faster. It won't choke on very large data.frames.
df <- data.table(x=1, y=2)
str(df)
anatomy(df)