Introduction to Kmisc

Kmisc introduces a grab-bag of utility functions that should be useful to various kinds of useRs. Some of the most useful functions in the package are demoed in this vignette.

set.seed(123)
library(data.table)
library(Kmisc)
library(lattice)
library(grid)
library(Rcpp)
library(knitr)
library(microbenchmark)
dat <- data.frame( x=letters[1:4], y=1:4, z=LETTERS[1:4] )
opts_chunk$set(
  results="markup"
)

without: This function is used to remove columns from a list / data.frame.

## let's remove columns 'x' and 'z' from dat.
tryCatch( dat[ -c('x', 'z') ], error=function(e) print(e$message) )
## oh :(
dat[ !(names(dat) %in% c('x', 'z')) ]
## I always find that a bit awkward. Let's use Kmisc's without instead.
without(dat, x, z)
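
For readers curious how unquoted column names can be handled, here is a minimal base-R sketch. The name without_sketch is hypothetical and this is not Kmisc's implementation; it only assumes the names are captured with substitute().

```r
# Hypothetical stand-in for without(); illustrative only.
without_sketch <- function(x, ...) {
  # Turn the unquoted arguments in ... into character strings.
  drop <- vapply(as.list(substitute(list(...)))[-1L], deparse, character(1))
  # Keep every element whose name is not in the drop set.
  x[!(names(x) %in% drop)]
}

df <- data.frame(x = letters[1:4], y = 1:4, z = LETTERS[1:4])
res_without <- without_sketch(df, x, z)
print(res_without)
```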

extract: Extract vectors from a data.frame or list. Although there is already a good subsetting syntax for lists and vectors, I wanted a complementary function for without.

extract(dat, x, y)

re_without, re_extract: Extract variables whose names don't match / do match a regular expression pattern.

re_extract(dat, "[xy]")
re_without(dat, "[xy]")
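
In base R, the same selection can be approximated with grepl() on the names. The two helper names below are hypothetical and only sketch the idea:

```r
# Hypothetical helpers; they mimic the described behaviour with base R only.
re_extract_sketch <- function(x, pattern) x[grepl(pattern, names(x))]
re_without_sketch <- function(x, pattern) x[!grepl(pattern, names(x))]

df <- data.frame(x = 1:2, y = 3:4, z = 5:6)
kept <- re_extract_sketch(df, "[xy]")     # columns whose names match
dropped <- re_without_sketch(df, "[xy]")  # columns whose names do not match
```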

swap: Replace elements in a vector.

tDat <- dat ## make a temporary copy of dat

## Replace some elements in tDat$y
tDat$y <- swap( tDat$y, from=c(2, 4), to=c(20, 40) )
cbind( dat$y, tDat$y )
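
The replacement logic can be sketched in base R with match(). The function swap_sketch is a hypothetical illustration, not the package's code, and it assumes from and to have the same length:

```r
# Hypothetical base-R version of the from/to replacement.
swap_sketch <- function(vec, from, to) {
  idx <- match(vec, from)   # position of each element in `from`, NA if absent
  hit <- !is.na(idx)        # elements that have a replacement
  vec[hit] <- to[idx[hit]]
  vec
}

swapped <- swap_sketch(1:4, from = c(2, 4), to = c(20, 40))
print(swapped)
```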

factor_to_char, char_to_factor: A set of functions that recurse through a list / data.frame and set all elements that are characters to factors, and vice versa.

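## stringsAsFactors=TRUE is needed on R >= 4.0.0, where strings are no longer
## converted to factors by default
bDat <- data.frame( x=rnorm(10), y=sample(letters,10), z=sample(letters,10), stringsAsFactors=TRUE )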
str( bDat )
str( factor_to_char(bDat) )
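
The recursion itself is simple to sketch in base R. The function below is a hypothetical illustration (not the package's implementation) that walks a list or data.frame and converts any factor it finds:

```r
# Hypothetical recursive conversion of factors to characters.
factor_to_char_sketch <- function(x) {
  if (is.factor(x)) return(as.character(x))
  # Lists and data.frames: recurse into each element, keeping attributes.
  if (is.list(x)) x[] <- lapply(x, factor_to_char_sketch)
  x
}

fdf <- data.frame(n = 1:3, f = c("a", "b", "c"), stringsAsFactors = TRUE)
cdf <- factor_to_char_sketch(fdf)
str(cdf)
```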

dapply: The data.frame version of the l/sapply series of functions.

Why have this function when sapply still does much the same? I always get frustrated with the fact that either an array or a list is returned by sapply, but never a data.frame.

dat <- data.frame( x = rnorm(100), y = rnorm(100), z = rnorm(100) )
dapply( dat, summary )
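
Roughly the same effect can be had in base R by applying the function column-wise and coercing the result. The sketch below uses a hypothetical name and assumes the function returns vectors of equal length for every column:

```r
# Hypothetical sketch: apply f to each column and return a data.frame.
dapply_sketch <- function(df, f, ...) {
  as.data.frame(lapply(df, f, ...))
}

nums <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))
stats <- dapply_sketch(nums, function(col) c(mean = mean(col), max = max(col)))
print(stats)
```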

kMerge: Left joins, a.k.a. merge( all.x=TRUE, ... ), without any mangling of the row order.

dat1 <- data.frame( id=5:1, x=c("a","a","b","b","b"), y=rnorm(5) )
dat2 <- data.frame( id=c(1, 2, 4), z=rnorm(3) )

## default merge changes id order
merge( dat1, dat2, by="id", all.x=TRUE )
## even the sort parameter can't save you
merge( dat1, dat2, by="id", all.x=TRUE, sort=FALSE )
# kMerge keeps it as is
kMerge( dat1, dat2, by="id" )
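
An order-preserving left join can be sketched in base R with match(). The function name is hypothetical, this is not how kMerge is implemented, and the sketch assumes the key is unique in the right-hand table:

```r
# Hypothetical order-preserving left join on a single key column.
left_join_sketch <- function(x, y, by) {
  idx <- match(x[[by]], y[[by]])   # row of y for each row of x, NA if no match
  extra <- y[idx, setdiff(names(y), by), drop = FALSE]
  out <- cbind(x, extra)
  rownames(out) <- NULL
  out
}

a <- data.frame(id = 5:1, x = c("a", "a", "b", "b", "b"))
b <- data.frame(id = c(1, 2, 4), z = c(0.1, 0.2, 0.4))
joined <- left_join_sketch(a, b, by = "id")
print(joined)   # rows stay in the order of `a`
```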

in_interval: A fast C implementation for determining which elements of a vector x lie within the half-open interval [lo, hi).

x <- runif(10)*10; lo <- 5; hi <- 10
print( data.frame( x=x, between_5_and_10=in_interval(x, lo, hi) ) )
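
For reference, the half-open test [lo, hi) corresponds to the following vectorised base-R expression. The function name is hypothetical:

```r
# Hypothetical base-R equivalent: lower bound included, upper bound excluded.
in_interval_sketch <- function(x, lo, hi) x >= lo & x < hi

vals <- c(4.9, 5, 7.5, 10)
flags <- in_interval_sketch(vals, lo = 5, hi = 10)
print(flags)
```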

stack_list: Use this to stack data.frames in a list. This can be useful if e.g. you've run some kind of bootstrap procedure and have all your results stored as a list of data.frames -- even do.call( rbind, dfs ) can be slow. The difference is even more prominent when used on very large lists.

This is now largely superseded by data.table::rbindlist, which has a much faster C implementation.

dfs <- replicate(1E3, 
  data.frame(x=rnorm(10), y=sample(letters,10), z=sample(LETTERS,10)),
  simplify=FALSE
  )
str( stack_list(dfs) )
system.time( stack_list(dfs) )
system.time( do.call(rbind, dfs) )
system.time( data.table::rbindlist(dfs) )
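
One reason repeated rbind calls are slow is that every step copies whole data.frames. A column-wise approach avoids this. The sketch below is a hypothetical base-R illustration, assuming all data.frames share the same column names and that the columns are plain vectors (not factors):

```r
# Hypothetical column-wise stacking of a list of data.frames.
stack_sketch <- function(dfs) {
  cols <- names(dfs[[1L]])
  # Concatenate each column across all data.frames in one pass.
  stacked <- lapply(cols, function(nm) {
    unlist(lapply(dfs, `[[`, nm), use.names = FALSE)
  })
  names(stacked) <- cols
  as.data.frame(stacked, stringsAsFactors = FALSE)
}

parts <- replicate(
  3,
  data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE),
  simplify = FALSE
)
stacked_df <- stack_sketch(parts)
str(stacked_df)
```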

Fast String Operations

R is missing some nice built-in 'string' functions. I've introduced a few functions for common string operations.

str_rev: Reverses each element of a character vector; i.e., a vector of strings. str_rev2 is there if you need to reverse a potentially unicode string.

str_rev( c("ABC", "DEF", NA, paste(LETTERS, collapse="") ) )
str_rev2( c("はひふへほ", "abcdef") )
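
For comparison, a slower base-R way to reverse each string is to split it into characters, reverse them and paste them back together. The function name below is hypothetical:

```r
# Hypothetical base-R string reversal; NA inputs stay NA.
str_rev_sketch <- function(x) {
  out <- vapply(
    strsplit(x, ""),
    function(ch) paste(rev(ch), collapse = ""),
    character(1)
  )
  out[is.na(x)] <- NA_character_
  out
}

reversed <- str_rev_sketch(c("ABC", "DEF", NA))
print(reversed)
```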

str_slice: Slices each string in a vector into consecutive substrings of length n. str_slice2 exists for potentially unicode strings.

str_slice( c("ABCDEF", "GHIJKL", "MNOP", "QR"), 2 )
str_slice2( "ハッピー", 2 )
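
The slicing can be sketched in base R with substring(). The function name is hypothetical, and the sketch assumes non-empty, non-NA input strings:

```r
# Hypothetical base-R slicing into pieces of length n.
str_slice_sketch <- function(x, n) {
  lapply(x, function(s) {
    starts <- seq(1L, nchar(s), by = n)    # start index of each piece
    substring(s, starts, starts + n - 1L)  # the last piece may be shorter
  })
}

slices <- str_slice_sketch(c("ABCDEF", "QR"), 2)
print(slices)
```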

str_sort: Sort the characters within a string.

str_sort("asnoighewgypfuiweb")

str_collapse: Collapse a string using Rcpp sugar; operates like R's paste(..., collapse=""), but works much faster.

str_collapse( c("ABC", "DEF", "GHI") )

File I/O

Sometimes, you get really large data files that just aren't going to fit into RAM. You really wish you could split them up in a structured way, transform them in some way, and then put them back together. One might consider a more 'enterprise' edition of the split-apply-combine framework (Hadoop and friends), but one dirty alternative is to use C++ to munge through a text file and pull out things that we actually want.

split_file: This function splits a delimited file into multiple files, according to unique entries in a chosen column.

extract_rows_from_file: From a delimited text file, extract only the rows for which the entries in a particular column match some set of items that you wish to keep.
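
To make the idea concrete, here is a rough pure-R sketch of splitting a delimited file by the values in one column while reading it in chunks, so that the whole file never has to sit in memory. It is not the package's C++ implementation. The function name, the chunk argument and the output naming scheme are assumptions made for illustration, and the sketch assumes a header-less, well-formed file.

```r
# Hypothetical chunked file splitter; illustrative only.
split_file_sketch <- function(path, column, sep = "\t",
                              out_dir = tempdir(), chunk = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  outputs <- character(0)
  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break
    # Pull the key out of the chosen column for every line in the chunk.
    keys <- vapply(strsplit(lines, sep, fixed = TRUE), `[`, character(1), column)
    for (k in unique(keys)) {
      out <- file.path(out_dir, paste0(k, ".txt"))
      out_con <- file(out, open = "a")   # append so later chunks are kept
      writeLines(lines[keys == k], out_con)
      close(out_con)
      outputs <- union(outputs, out)
    }
  }
  outputs
}

src <- tempfile(fileext = ".txt")
writeLines(c("a\t1", "b\t2", "a\t3"), src)
out_dir <- tempfile("split_")
dir.create(out_dir)
files <- split_file_sketch(src, column = 1L, out_dir = out_dir, chunk = 2L)
```

Extracting only the rows that match a set of keys follows the same pattern: filter each chunk with keys %in% wanted and append the survivors to a single output file.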

C++ Function Generators

Use these functions to generate C++ / Rcpp-backed functions for common R-style operations.

Rcpp_tapply_generator: Generate fast tapply style functions from C++ code through Rcpp. See the example.

dat <- data.frame( y=rnorm(100), x=sample(letters[1:5], 100, TRUE) )
tMean <- Rcpp_tapply_generator("return mean(x);")
with( dat, tMean(y, x) )
with( dat, tapply(y, x, mean) )
microbenchmark(
  Kmisc=with( dat, tMean(y, x) ),
  R=with( dat, tapply(y, x, mean) ),
  times=5
)

Rcpp_apply_generator: An apply function generator tailored to 2D matrices. However, your function definition must return a scalar value.

aMean <- Rcpp_apply_generator("return mean(x);")
mat <- matrix( rnorm(100), nrow=10 )
aMean(mat, 2)
apply(mat, 2, mean)
microbenchmark(
  Kmisc=aMean(mat, 2),
  R=apply(mat, 2, mean)
)

Faster Versions of Commonly Used R Functions

tapply_: This function operates like tapply, but is faster thanks to a faster factor-generating function and an optimized split. Note, however, that it is restricted to the (common) case where both the value and the grouping variable are plain vectors.

library(microbenchmark)
y <- rnorm(1000); x <- sample(letters[1:5], 1000, TRUE)
tapply(y, x, mean)
tapply_(y, x, mean)
microbenchmark( times=10,
  tapply(y, x, mean),
  tapply_(y, x, mean),
  tMean(y, x)
)
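
In this restricted single-grouping case, tapply reduces to "split, then apply". A hypothetical base-R sketch of that reduction (without any of the package's optimizations, and assuming f returns a single number per group) looks like this:

```r
# Hypothetical split-then-apply version of a one-factor tapply.
tapply_sketch <- function(y, g, f) {
  vapply(split(y, g), f, numeric(1))
}

grp_means <- tapply_sketch(c(1, 2, 3, 4), c("a", "b", "a", "b"), mean)
print(grp_means)
```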

melt_: This function operates like reshape2::melt, but works almost entirely through C and hence is much faster.

dat <- data.frame(
  id=LETTERS[1:5],
  x1=rnorm(5),
  x2=rnorm(5),
  x3=rnorm(5)
)
print(dat)
melt_(dat, id.vars="id")
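
What melting does to the data can be shown with a small base-R sketch: the id columns are repeated once per measured column, and the measured columns are stacked into a single value column. The function name is hypothetical and the sketch assumes all measured columns share one type:

```r
# Hypothetical base-R melt for illustration.
melt_sketch <- function(df, id.vars) {
  measure <- setdiff(names(df), id.vars)
  n <- nrow(df)
  data.frame(
    df[rep(seq_len(n), times = length(measure)), id.vars, drop = FALSE],
    variable = rep(measure, each = n),
    value = unlist(df[measure], use.names = FALSE),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

wide <- data.frame(id = c("A", "B"), x1 = c(1, 2), x2 = c(3, 4),
                   stringsAsFactors = FALSE)
long <- melt_sketch(wide, id.vars = "id")
print(long)
```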

factor_: A faster, simpler implementation of factor through Rcpp. This might be useful in some rare cases where speed is essential.

lets <- sample(letters, 1E6, TRUE)
stopifnot( identical(
  factor_(lets),
  factor(lets)
) )
microbenchmark( times=5,
  factor_(lets),
  factor(lets)
)
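
Conceptually, building a factor from a character vector comes down to sorting the unique values and matching each element against them. The sketch below is a hypothetical base-R illustration of that idea, not the Rcpp code, and it assumes the input has no NA values:

```r
# Hypothetical minimal factor constructor for character input.
factor_sketch <- function(x) {
  lv <- sort(unique(x))             # levels in sorted order
  structure(match(x, lv),           # integer codes pointing into the levels
            levels = as.character(lv),
            class = "factor")
}

chars <- c("b", "a", "c", "a")
fac <- factor_sketch(chars)
print(fac)
```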

html: Custom HTML in an R Markdown document.

html(
  table( class="table table-bordered table-striped table-condensed table-hover", ## bootstrap classes
    tr(
      td("Apples"),
      td("Bananas")
    ),
    tr(
      td("20"),
      td("30")
    )
  )
)

anatomy, anat: Like str, but much faster. It won't choke on very large data.frames.

df <- data.table(x=1, y=2)
str(df)
anatomy(df)



Kmisc documentation built on May 29, 2017, 1:43 p.m.