It's good to get data into and out of R! More compatibility, more data, more better. But which is fastest?
Of course, speed might not be your only criterion. Different data formats have different capabilities and purposes:
serialize()
is your best best for getting R objects out of and
back into R the way you had them, but doesn't do much for you wneh
communicating with other systems.dump()
is your next best bet, while being a somewhat
human-readable text format.dims
and class
don't have equivalents
in Javascript. Different packages also take different approaches in
representing JSON objects in R, and vice versa.```{R setup, cache=FALSE, echo=FALSE, message=FALSE} library(knitr) opts_chunk$set(cache = TRUE, autodep = TRUE, message = FALSE, warning = FALSE) message("cache path is '", opts_chunk$get("cache.path"), "'")
All that said.... Which is fastest? Which is most efficient? Under which set of arbitrarily chosen tests? I'll give more details below, but first I'll show the results. In both graphs, lower is better. ```{R, fig_width=6, fig_height=6, fig.asp = 1, cache=FALSE} load_cache("timings","timing.plot")
On the horizontal axis is the normalized size of the object being transmitted. On the vertical is the time taken to transmit the test dataset. There are four scenarios corresponding to how the dataset is transmitted or stored. All tests are using the default options for each respective library.
This following graph shows the amount of storage used to encode the test dataset.
```{R, cache=FALSE} load_cache("datasize","datasize.plot")
# Benchmarking Data Export / Import This document is included as a static vignette, because running benchmark code would not be kind to package builders. I'll try picking a reference dataset and timing how long it takes to encode and decode it. I'll start with the dataset `nycflights13`, consisting of five data frames, since it's large enough to take a measurable time to process. The file `inst/benchmarking.R` contains code for test generation and timing. ```{R cache = FALSE, results = "hide"} source(system.file("benchmarking.R", package = "msgpack"))
I will test each encoder under the following scenarios:
convert
is for just converting an R object to an R raw or
character object and back, in memory.conn
also does an in-memory conversion but with R textConnection
or rawConnection
objects. (The rawConnection
objects seem to
work faster, even with text formats.)file
writes to a temp file and then reads it back.fifo
forks the R process and writes from one while reading
from the other over a UNIX named pipe.tcp
forks the R process and writes from one to the other using a
TCP socket.remote
connects to a R process on a remote machine, and sends
data from the local to the remote over a TCP socket.Each test will record the OS-reported CPU time, as well as elapsed clock time for reading, writing and total.
For fifo
, tcp
, and remote
tests, the reading and writing may
happen concurrently, so that total.elapsed < read.elapsed +
write.elapsed
. On the other hand, some packages cannot cope with
reading from asynchronous connections (more on this in the
details). In these cases I need to make sure all data is read into a
buffer before calling the decoder.
These packages turn out to vary in performance by a couple orders of
magnitude. So the test harness has to dynamically vary tne input size
so as to not take forever to run. It starts with a small subsample
of the dataset and increases its size until the encode and decode
takes 10 seconds (or the entire dataset is transmitted). (See function
testCurve
in inst/benchmarking.R
.).
On to the per-package benchmarks.
serialize
and unserialize
produce faithful replications of R
objects including R-specific structures like closures, environments,
and attributes. But generally only R code can read it. It is useful
for communicating between R processes.
For instance:
unserialize.inmem <- timeConvert(dataset, unserialize, function(data) serialize(data, NULL)) showTimings(unserialize.inmem, "R serialization (in memory)")
This is reasonably speedy, too.
However, I run into a problem if I try to transfer too large an object over
a fifo or socket and read it with unserialize().
```{R, error=TRUE} unserialize.bad <- timeFifoTransfer(dataset, unserialize, serialize, catch=FALSE)
This appears to be because `unserialize` doesn't operate concurrently, in the sense that it doesn't recover from finding the end of the line having only read part of a message. Meanwhile, on my machine `fifo()` and `socketConnection()` do not seem to block even if `blocking = TRUE` is set. They always wait for at least one byte, but may return fewer than requested. So `unserialize` does not work easily with socket connections. One workaround is to exhaustively read the connection before handing off the data to `unserialize`. I handle that off screen in `bufferBytes` in `inst/benchmarking.R`. ```{R} unserialize.socket <- timeSocketTransfer(dataset, bufferBytes(unserialize), serialize) showTimings(unserialize.socket, c("R serialization (over TCP, same host, blocking read)"))
But this strategy only works for transmitting one object per connection, and for one connection at a time.
Another way you can do it is to wrap R serialization with
msgpack::msgConnection
. This only adds a few bytes to each message,
and msgpack will handle assembling complete messages. This also allows
you to send several values per connection, or poll several connections
connections until one of them returns a decoded message.
unserialize.wrapped_socket <- timeSocketTransfer(dataset, unserialize, serialize, wrap = msgpack::msgConnection)
showTimings(unserialize.wrapped_socket, "R serialization over msgpack over TCP (same host)")
Surprisingly, this actually works faster. R
unserialize.wrapped_socket$total.elapsed <
unserialize.socket$total.elapsed || stop()), include=FALSE
. One
hypothesis that might account for this is that there are fewer write
syscalls if the object is serialized into memory before sending, and
fewer read syscalls if the object is prepended with its length before
reading.
Now I'll start collecting benchmarks systematically. Since the methods
we will explore have such a wide variance of performance, we will test
them with successively larger datasets, until they exceed a
timeout. The serialize.spec
data structure annotated below specifies
how to test R serialization and how to label the results. See the code
file inst/benchmarking.R
for more details.
```{R serialization}
serialize.spec <- combine_opts( # options of the same name concatenated list( method = list( # the factor "method" convert = list(method = timeConvert)), # has a level "convert" # that passes timeConvert # to the "method" # argument of the test # function encoder = list( # the factor "encoder" serialize = list( # has a level named "serialize" writer = serialize, # with these arguments specifying, reader, writer, to, from from = unserialize, to = function(data) serialize(data, NULL)))), # Combined with these common options (that include connection-based # test methods buffer_read_options(reader = unserialize, raw = TRUE) )
serialize.calls <- arg_df(serialize.spec)
```{R} serialize.timings <- run_tests(serialize.calls)
```{R include=FALSE, results="hide"} benchmarks <- data.frame()
```{R include=FALSE, results="hide"} benchmarks <- store(serialize.timings, benchmarks)
test.plot <- (ggplot( filter(benchmarks, encoder=="serialize")) + aes(x = size, y = total.elapsed, color = method) + geom_line() + scale_x_continuous(limits = c(0, 1), name = "Fraction of nycflights13 transmitted") + scale_y_continuous(limits = c(0, NA), name = "Elapsed time (s)") ) test.plot + labs(title="R serialization performance")
Here the horizontal axis is the size of the test dataset, and the vertical axis is the time taken to transmit and receive.
R's implementation of fifo connections seem to be consistently slower than other methods on the test system (OS X). The next slowest, in the case of R serialization, is the test case where data is transmitted over a gigabit link (if it matters, the reading computer is an Ubuntu box slower than the writing computer.)
The dput
and deparse
functions render R data objects to an ASCII
connection in R-like syntax. The idea is that the text output "looks
like" the code it takes to construct the object, to the extent that
the mechanism for reading objects back in is to eval
or source
the
text. (Hopefully one does not do this with untrusted data. A better
technique may be to evaluate the data in a limited environment that
just contains the needed constructors like structure
, list
and c
etc.)
## Annoyingly, the behavior of "dump" and "dput" depend on this global option. options(deparse.max.lines = NULL) #the output mimics the input dput(control=c(), list(1, "2", verb=quote(buckle), my=c("s", "h", "o", "e")))
(Note how dput
fails on transmitting language objects; if we try to
eval the above we will try to evaluate "buckle" instead of getting
just the name object. as.name("buckle")
.)
Performance-wise, dput
and source
should not be used for large
datasets, because they display an O(n^2) characteristic in terms of
the data size.
```{R dput} options(deparse.max.lines = NULL)
dput.timings <- run_tests(arg_df(c( all.common.options, list( encoder = list( dput = list( from = function(t) eval(parse(text=t)), to = deparse, reader = function(c) eval(parse(c)), writer = function(x, c) dput(x, file=c)))))))
```{R, include = FALSE, results = "hide"} benchmarks <- store(dput.timings, benchmarks)
(test.plot + labs(title="dput performance")) %+% filter(benchmarks, encoder == "dput")
It's interesting that deparse
(which is used for the "in-memory
conversion" test method labeled conn
) is much slower than the
connection-based dput
.
The jsonlite
package includes a fromJSON and toJSON implementation. It also
supports streaming reads and write, but only of records consisting of
one data frame per message. Data frames are sent row-wise.
Since the test dataset consists of several data frames, I will send one after the other over the lifetime of one connection.
```{R jsonlite} jsonlite_reader <- function(conn) { append <- msgpack:::catenator() jsonlite::stream_in(conn, verbose = FALSE, handler = function(x) append(list(x))) append(action="read") }
jsonlite_writer <- function(l, conn) { lapply(l, function(x) jsonlite::stream_out(x, verbose=FALSE, conn)) }
jsonlite.spec <- c( all.common.options, list( encoder = list( jsonlite = list( to = jsonlite::toJSON, from = jsonlite::fromJSON, reader = jsonlite_reader, writer = jsonlite_writer, raw = FALSE))))
jsonlite.timings <- run_tests(arg_df(jsonlite.spec))
```{R echo=FALSE, results="hide"} getElapsed <- function(d) filter(d, size == 1, method=="convert")$total.elapsed[[1]] getRemote <- function(d) filter(d, size == 1, method=="remote")$total.elapsed[[1]] `%digits%` <- function(x, y) format(x, digits=y)
jsonlite performs reasonably well, but is several times slower than serialization.
```{R, results = "hide"} benchmarks <- store(jsonlite.timings, benchmarks)
```{R} (test.plot + labs(title="jsonlite performance")) %+% filter(benchmarks, encoder == "jsonlite")
msgpack
is the package this vignette is written for.
```{R msgpack_remote} msgpack_remote.spec <- c( common.options, list(method = list(remote = list(method=timeRemoteTransfer)), encoder = list( msgpack = list( reader = msgpack::readMsg, writer = msgpack::writeMsg, wrap = msgpack::msgConnection))))
msgpackR_remote.timings <- run_tests(arg_df(msgpack_remote.spec)) (test.plot + labs(title="msgpack")) %+% msgpackR_remote.timings
```{R msgpack} msgpack.spec <- c( all.common.options, list( encoder = list( msgpack = list( wrap = msgpack::msgConnection, reader = msgpack::readMsg, writer = msgpack::writeMsg, to = msgpack::packMsg, from = msgpack::unpackMsg))))
msgpack.timings <- run_tests(arg_df(msgpack.spec))
```{R, results = "hide"} benchmarks <- store(msgpack.timings, benchmarks)
```{R} (test.plot + labs(title="msgpack performance")) %+% filter(benchmarks, encoder == "msgpack")
Implementing the streaming-mode callbacks has helped a lot, but there is still a quadratic characteristic going on here. Need to do some profiling of memory allocation.
Interestingly, msgpack is already faster than serialize for the remote use case (modulo some network glitches affecting one or two datapoints.)
There is an older pure-R implementation of msgpack on CRAN. One quirk is that
it doesn't accept NA
in R vectors.
```{R, error = TRUE} msgpackR::pack(c(1, 2, 3)) msgpackR::pack(c(1, 2, 3, NA))
As a workaround I'll substitute out all the NA values in the dataset. ```{R} dataset_mungenull <- map(dataset, map_dfc, function(col) ifelse(is.na(col), 9999, col))
```{R msgpackR} msgpackR.spec <- c( all.common.options %but% list( dataset = list(nycflights13 = list(data = dataset_mungenull)), note = list("no NAs" = list())), list( encoder = list( msgpackR = list( from = msgpackR::unpack, to = msgpackR::pack, reader = bufferBytes(msgpackR::unpack), writer = function(data, conn) writeBin(msgpackR::pack(data), conn)))))
```{R, results = "hide"} msgpackR.timings <- run_tests(arg_df(msgpackR.spec))
```{R, include=FALSE, results="hide"} benchmarks <- store(msgpackR.timings, benchmarks)
```{R} (test.plot + labs(title="msgpackR performance")) %+% filter(benchmarks, encoder == "msgpackR")
Performance-wise, it is quite slow, but a deeper concern is that if I send larger objects and inspect the results it sometimes looks garbled, and there are intermittent errors like:
```{R error=TRUE} str(subsample(dataset_mungenull, .000044) %>% (msgpackR::pack) %>% (msgpackR::unpack))
## rjson ```{R rjson} rjson.spec <- combine_opts( list( method = list( convert = list(method = timeConvert)), encoder = list( rjson = list( from = rjson::fromJSON, to = rjson::toJSON, writer = function(data, con) writeChar(rjson::toJSON(data), con)))), buffer_read_options(reader = function(con) rjson::fromJSON(file=con), raw = TRUE, buffer = bufferRawConn)) rjson.timings <- run_tests(arg_df(rjson.spec))
```{R include=FALSE, echo=FALSE} inmemory.factor <- (getElapsed(rjson.timings) / getElapsed(msgpack.timings)) remote.factor <- (getRemote(rjson.timings) / getRemote(msgpack.timings))
`rjson` is the fastest JSON implementation -- only `r inmemory.factor %digits% 2` times slower than `msgpack`, in memory, and `r remote.factor %digits% 2` times slower across the wire. It does not support streaming reads, so we must byte-buffer to read from connections. But that turns out to be quite fast as well. ```{R results="hide"} benchmarks <- store(rjson.timings, benchmarks)
(test.plot + labs(title="rjson performance")) %+% filter(benchmarks, encoder == "rjson")
I am getting the following intermittent error in RJSONIO (i.e. this document fails to render once in a while.)
Error in RJSONIO::readJSONStream(con) : failed to parse json at 10240
```{R RJSONIO} RJSONIO.spec <- c( all.common.options, list( encoder = list( RJSONIO = list( from = RJSONIO::fromJSON, to = RJSONIO::toJSON, writer = function(x, con) writeBin(RJSONIO::toJSON(x), con), reader = RJSONIO::readJSONStream, raw = TRUE))))
RJSONIO.timings <- run_tests(arg_df(RJSONIO.spec))
```{R, include = FALSE, results = "hide"} benchmarks <- store(RJSONIO.timings, benchmarks)
RJSONIO offers a function to do streaming reads from connection, but it has much overhead compared with in-memory conversion.
(test.plot + labs(title="RJSONIO performance")) %+% filter(benchmarks, encoder == "RJSONIO")
YAML is kind of "JSON, but more like Markdown." Allegedly easier to read but with a more complex grammar. It's popular for config files.
cat(yaml::as.yaml(list(compact=TRUE, schema = 0)))
Unfortunately, the YAML package produces a protection stack overflow when decoding too large a message.
```{R, error = TRUE} oops <- yaml::yaml.load(yaml::as.yaml(subsample(dataset, 0.15)))
```{R} yaml.timings <- run_tests(arg_df(c( all.common.options, list( note = list("Max 0.14" = list(max = 0.14)), encoder = list( yaml = list( reader = yaml::yaml.load_file, writer = function(data, conn) writeChar(yaml::as.yaml(data), conn), to = yaml::as.yaml, from = yaml::yaml.load, raw = TRUE))))))
```{R, results = "hide"} benchmarks <- store(yaml.timings, benchmarks)
```{R} (test.plot + labs(title="yaml performance")) %+% filter(benchmarks, encoder == "yaml")
We're going here aren't we. Let's see if we can send messages with CSV. To send several CSV tables over one connection, I'll prepend to each message a header saying how many rows to read following.
writeCsvs <- function(data, con) { for (nm in names(data)) { write.csv(as_tibble(list(name = nm, nrows = nrow(data[[nm]]))), con) write.csv(data[[nm]], con, row.names = FALSE) } } readCsvs <- function(data, con) { output <- list() tryCatch( repeat { header <- read.csv(con2, nrows=1, stringsAsFactors = FALSE) output[[header$name]] <- read.csv(con2, nrows=header$nrows, stringsAsFactors = FALSE) }, error = force) output }
I have just had a sinking feeling that a lot of people actually build web services that talk in CSV.
Unfortunately read.table
can't cope with non-blocking connections, so
I have to buffer the read into memory when reading from a fifo or socket.
```{R, results="hide"} csv.timings <- run_tests( arg_df(c( list(encoder = list(csv = list(writer = writeCsvs))), buffer_read_options(reader = readCsvs, raw = TRUE))))
```{R, results = "hide"} benchmarks <- store(csv.timings, benchmarks)
(test.plot + labs(title="R csv performance")) %+% filter(benchmarks, encoder == "csv")
Surprisingly fast at writing to files.
```{R results="hide", include=FALSE}
library(gnm) library(broom) library(stringr)
m <- gnm(data = benchmarks, total.elapsed ~ Mult(size, method, encoder))
getEncoder <- function(term) { match <- str_match(term, "\.(size|method|encoder)(\w*)") match <- match[,-1] dimnames(match) <- list(NULL, list("param", "value")) as_data_frame(match) }
coefs <- (m %>% tidy %>% bind_cols(., getEncoder(.$term)))
encoder_order <- (coefs %>% filter(param == "encoder") %>% arrange(desc(abs(estimate))) %>% .$value )
method_order <- (coefs %>% filter(param == "method") %>% arrange(desc(abs(estimate))) %>% .$value )
The most importent scenarios are conversion in memory, writing to file, writing to another process on the same host, and writing to remote host. ```{R timings, fig_width=6, fig_height=6, fig.asp = 1} library(directlabels) timing.plot <- ( benchmarks %>% subset(method %in% c("convert", "socket", "remote", "file")) %>% mutate(method = c(convert="Conversion in memory", socket="TCP (same host)", remote="TCP (over LAN)", file = "File I/O")[method]) %>% mutate(encoder = factor(encoder, levels=encoder_order)) %>% { (ggplot(.) + aes(y = total.elapsed, x = size, color = encoder, group = encoder) + facet_wrap(~method) + geom_point() + geom_line() + theme(aspect.ratio=1) + scale_x_continuous(name = "Fraction of nycflights13 transmitted") + scale_y_continuous(limits = c(0, NA), name = "Elapsed time (s)") + labs(title="Elapsed time to encode, write, read, decode, by package") + expand_limits(x=1.5, y=30) + geom_dl(aes(label=encoder), debug=FALSE, method = list( "last.points", rot=45, "bumpup", dl.trans(x=x+0.2, y=y+0.2))) + guides(color=FALSE) )}) timing.plot
```{R datasize}
needed <- (benchmarks %>% filter(!is.na(bytes)) %>% select(., encoder, method, size, bytes) %>% group_by(encoder) %>% filter(row_number() == 1) %>% arrange(desc(size), .by_group=TRUE) %>% slice(1) %>% select(encoder, method) )
model <- ( needed %>% inner_join(benchmarks) %>% select(bytes, encoder, size) %>% lm(formula = bytes ~ encoder * size))
predicted <- (needed %>% mutate(size = 1) %>% augment(model, newdata=.) %>% rename(bytes = .fitted) )
datasize.plot <- (predicted %>% mutate(encoder = factor(encoder, levels=encoder_order)) %>% ggplot %>% +aes(x = reorder(encoder, bytes), y=bytes/2^20, fill=encoder) %>% +geom_col() %>% +labs(y = "MiB", x = NULL, title = "Space used to encode nycflights13 (shorter is better)") %>% +geom_dl(aes(label=encoder, group=encoder), method=c(dl.trans(y=y+0.1), "top.bumpup")) %>% +guides(fill = FALSE) %>% +theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank() ) ) datasize.plot
The difference in size between different JSON and msgpack implementations could use some investigation. It may down to whitespace, sending data row-wise vs. col-wise, differences in the mapping between R and data format types, or bugs. ```{R} save(benchmarks, file="../inst/benchmarks.RData")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.