Which Data Format is Fastest Data Format?

It's good to get data into and out of R! More compatibility, more data, more better. But which is fastest?

Of course, speed might not be your only criterion. Different data formats have different capabilities and purposes:

```{R setup, cache=FALSE, echo=FALSE, message=FALSE}
library(knitr)
opts_chunk$set(cache = TRUE, autodep = TRUE, message = FALSE, warning = FALSE)
message("cache path is '", opts_chunk$get("cache.path"), "'")
```

All that said.... Which is fastest? Which is most efficient? Under
which set of arbitrarily chosen tests? I'll give more details below,
but first I'll show the results. In both graphs, lower is better.

```{R, fig_width=6, fig_height=6, fig.asp = 1, cache=FALSE}

On the horizontal axis is the normalized size of the object being transmitted. On the vertical is the time taken to transmit the test dataset. There are four scenarios corresponding to how the dataset is transmitted or stored. All tests are using the default options for each respective library.

This following graph shows the amount of storage used to encode the test dataset.

```{R, cache=FALSE}
load_cache("datasize","datasize.plot")
```

# Benchmarking Data Export / Import

This document is included as a static vignette, because running
benchmark code would not be kind to package builders.

I'll try picking a reference dataset and timing how long it takes to
encode and decode it. I'll start with the dataset `nycflights13`,
consisting of five data frames, since it's large enough to take a
measurable time to process.

The file `inst/benchmarking.R` contains code for test generation and

```{R cache = FALSE, results = "hide"}
source(system.file("benchmarking.R", package = "msgpack"))
```

I will test each encoder under the following scenarios:

Each test will record the OS-reported CPU time, as well as elapsed clock time for reading, writing and total.
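As a rough illustration of what gets recorded, base R's `system.time()` reports user CPU, system CPU, and elapsed time for an expression. The helper `time_phases()` below is hypothetical and is not the harness in `inst/benchmarking.R`; it only sketches how separate write and read timings might be collected for a sequential, in-memory test.

```r
# Hypothetical helper: time an encode ("write") and a decode ("read") step
# separately using base R's system.time().
time_phases <- function(data, encode, decode) {
  write_t <- system.time(encoded <- encode(data))
  read_t  <- system.time(decoded <- decode(encoded))
  list(
    write.elapsed = write_t[["elapsed"]],
    read.elapsed  = read_t[["elapsed"]],
    # user + system CPU time charged to this R process
    write.cpu     = write_t[["user.self"]] + write_t[["sys.self"]],
    read.cpu      = read_t[["user.self"]] + read_t[["sys.self"]],
    # in this sequential sketch the total is just the sum of both phases
    total.elapsed = write_t[["elapsed"]] + read_t[["elapsed"]],
    result        = decoded
  )
}

timing <- time_phases(mtcars,
                      encode = function(x) serialize(x, NULL),
                      decode = unserialize)
```

In a concurrent test the total would have to be measured around both ends at once rather than summed.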

For fifo, tcp, and remote tests, the reading and writing may happen concurrently, so that total.elapsed < read.elapsed + write.elapsed. On the other hand, some packages cannot cope with reading from asynchronous connections (more on this in the details). In these cases I need to make sure all data is read into a buffer before calling the decoder.

These packages turn out to vary in performance by a couple of orders of magnitude. So the test harness has to dynamically vary the input size so as to not take forever to run. It starts with a small subsample of the dataset and increases its size until the encode and decode takes 10 seconds (or the entire dataset is transmitted). (See function `testCurve` in `inst/benchmarking.R`.)
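The idea of growing the input until a time budget is hit can be sketched as follows. This is not the actual `testCurve`; the function name `grow_until_timeout`, its arguments, and the doubling schedule are assumptions made for illustration.

```r
# Hypothetical sketch of an adaptive benchmark loop (not the real testCurve).
# `run_once(fraction)` is assumed to perform one encode/decode round trip on
# the given fraction of the dataset.
grow_until_timeout <- function(run_once, start = 0.001, factor = 2,
                               timeout = 10) {
  results <- data.frame(size = numeric(0), elapsed = numeric(0))
  fraction <- start
  repeat {
    elapsed <- system.time(run_once(fraction))[["elapsed"]]
    results <- rbind(results, data.frame(size = fraction, elapsed = elapsed))
    # stop once a run is too slow or the whole dataset has been used
    if (elapsed >= timeout || fraction >= 1) break
    fraction <- min(1, fraction * factor)
  }
  results
}

# Usage with a small built-in data frame standing in for the test dataset.
curve <- grow_until_timeout(
  function(fraction) {
    n <- max(1L, as.integer(round(fraction * nrow(mtcars))))
    unserialize(serialize(mtcars[seq_len(n), ], NULL))
  },
  start = 0.25)
```

Slow encoders stop early with only a few small data points, while fast ones run all the way to the full dataset.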

On to the per-package benchmarks.

## R serialization

`serialize` and `unserialize` produce faithful replicas of R objects, including R-specific structures like closures, environments, and attributes. But generally only R code can read the output. It is useful for communicating between R processes.

For instance:

```{R}
unserialize.inmem <- timeConvert(dataset,
                                 function(data) serialize(data, NULL))
showTimings(unserialize.inmem, "R serialization (in memory)")
```

This is reasonably speedy, too.
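To make the "faithful replication" point concrete, here is a small self-contained round trip through `serialize()` and `unserialize()` using only base R. The object names (`make_counter`, `payload`) are invented for the example.

```r
# Round-trip a closure and an object with attributes through serialize().
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}
counter <- make_counter()
counter()                      # advance the enclosed state once

payload <- list(
  counter = counter,
  tagged  = structure(1:3, units = "seconds", class = "measurement")
)

bytes <- serialize(payload, NULL)   # NULL connection: return a raw vector
copy  <- unserialize(bytes)

copy$counter()                 # continues from the captured state
attributes(copy$tagged)        # attributes survive the round trip
```

The copy carries its own enclosing environment, so calling `copy$counter()` does not affect the original `counter`.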

However, I run into a problem if I try to transfer too large an object over a fifo or socket and read it with unserialize().

```{R, error=TRUE}
unserialize.bad <- timeFifoTransfer(dataset, unserialize, serialize, catch=FALSE)
```

This appears to be because `unserialize` doesn't operate concurrently,
in the sense that it doesn't recover from finding the end of the line
having only read part of a message. Meanwhile, on my machine `fifo()`
and `socketConnection()` do not seem to block even if `blocking =
TRUE` is set. They always wait for at least one byte, but may return
fewer than requested. So `unserialize` does not work easily with socket

One workaround is to exhaustively read the connection before handing
off the data to `unserialize`. I handle that off screen in
`bufferBytes` in `inst/benchmarking.R`.
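A minimal sketch of that buffering idea, using only base R, might look like the following. It is a stand-in written for this explanation, not the actual `bufferBytes`; the name `read_all_bytes` and the chunk size are assumptions, and it treats an empty read as end of stream, which is only safe once the writer has closed its end.

```r
# Hypothetical stand-in for the buffering idea (not the real bufferBytes):
# drain a connection in chunks, then decode the complete buffer in one call.
read_all_bytes <- function(con, chunk_size = 65536L) {
  chunks <- list()
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break     # nothing left: treat as end of stream
    chunks[[length(chunks) + 1L]] <- chunk
  }
  if (length(chunks) == 0L) raw(0) else do.call(c, chunks)
}

# Demonstrate with an in-memory connection standing in for a socket.
con <- rawConnection(serialize(mtcars, NULL))
buffered <- read_all_bytes(con, chunk_size = 256L)
close(con)
restored <- unserialize(buffered)
```

Because `unserialize` only ever sees a complete raw vector, short reads on the connection no longer matter.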

unserialize.socket <- timeSocketTransfer(dataset,

            c("R serialization (over TCP, same host, blocking read)"))

But this strategy only works for transmitting one object per connection, and for one connection at a time.

Another way you can do it is to wrap R serialization with `msgpack::msgConnection`. This only adds a few bytes to each message, and msgpack will handle assembling complete messages. This also allows you to send several values per connection, or poll several connections until one of them returns a decoded message.

unserialize.wrapped_socket <- timeSocketTransfer(dataset,
                                                 wrap = msgpack::msgConnection)
            "R serialization over msgpack over TCP (same host)")

Surprisingly, this actually works faster (when this document is rendered, it checks that `unserialize.wrapped_socket$total.elapsed < unserialize.socket$total.elapsed`). One hypothesis that might account for this is that there are fewer write syscalls if the object is serialized into memory before sending, and fewer read syscalls if the object is prepended with its length before reading.
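The length-prefix idea in that hypothesis can be sketched in base R as below. This is not msgpack's wire format and not what `msgConnection` does internally; the 4-byte big-endian header and the function names `write_framed` and `read_framed` are assumptions chosen for illustration.

```r
# Sketch of length-prefixed framing: each message is a 4-byte big-endian
# length followed by that many bytes of serialized payload.
write_framed <- function(obj, con) {
  payload <- serialize(obj, NULL)          # serialize fully in memory first
  writeBin(length(payload), con, size = 4L, endian = "big")
  writeBin(payload, con)                   # then one write for the body
  invisible(length(payload))
}

read_framed <- function(con) {
  n <- readBin(con, what = "integer", n = 1L, size = 4L, endian = "big")
  if (length(n) == 0L) return(NULL)        # no more messages
  unserialize(readBin(con, what = "raw", n = n))
}

# Two messages written to, and read back from, in-memory connections.
out <- rawConnection(raw(0), open = "wb")
write_framed(head(mtcars), out)
write_framed(letters, out)
wire <- rawConnectionValue(out)
close(out)

inp <- rawConnection(wire)
first  <- read_framed(inp)
second <- read_framed(inp)
close(inp)
```

On a real socket the body read would need to loop until all `n` bytes have arrived, since a single `readBin` call may return fewer bytes than requested.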

Now I'll start collecting benchmarks systematically. Since the methods we will explore have such a wide variance of performance, we will test them with successively larger datasets, until they exceed a timeout. The serialize.spec data structure annotated below specifies how to test R serialization and how to label the results. See the code file inst/benchmarking.R for more details.

```{R serialization}
# A spec is a named list of test factors.
# A test factor is a named list of arguments that will be passed to the
# test function at each factor level.
# (Actually go look at the simpler test spec for msgpack before trying to
# grok this)
serialize.spec <- combine_opts(  # options of the same name concatenated
  list(
    method = list(               # the factor "method"
      convert = list(method = timeConvert)),  # has a level "convert"
                                 # that passes timeConvert
                                 # to the "method"
                                 # argument of the test
                                 # function
    encoder = list(              # the factor "encoder"
      serialize = list(          # has a level named "serialize"
        writer = serialize,      # with these arguments specifying
                                 # reader, writer, to, from
        from = unserialize,
        to = function(data) serialize(data, NULL)))),
  # Combined with these common options (that include connection-based
  # test methods)
  buffer_read_options(reader = unserialize, raw = TRUE)
)

# arg_df takes the above spec and takes its outer product, generating
# a data frame with case labels and arguments used
serialize.calls <- arg_df(serialize.spec)

serialize.timings <- run_tests(serialize.calls)
```

```{R include=FALSE, results="hide"}
benchmarks <- data.frame()
```

```{R include=FALSE, results="hide"}
benchmarks <- store(serialize.timings, benchmarks)
test.plot <- (ggplot(
  filter(benchmarks, encoder=="serialize"))
  + aes(x = size, y = total.elapsed, color = method)
  + geom_line()
  + scale_x_continuous(limits = c(0, 1), name = "Fraction of nycflights13 transmitted")
  + scale_y_continuous(limits = c(0, NA), name = "Elapsed time (s)"))
test.plot + labs(title="R serialization performance")
```

Here the horizontal axis is the size of the test dataset, and the vertical axis is the time taken to transmit and receive.

R's implementation of fifo connections seems to be consistently slower than other methods on the test system (OS X). The next slowest, in the case of R serialization, is the test case where data is transmitted over a gigabit link. (If it matters, the reading computer is an Ubuntu box that is slower than the writing computer.)


The `dput` and `deparse` functions render R data objects to an ASCII connection in R-like syntax. The idea is that the text output "looks like" the code it takes to construct the object, to the extent that the mechanism for reading objects back in is to `eval` or `source` the text. (Hopefully one does not do this with untrusted data. A better technique may be to evaluate the data in a limited environment that contains only the needed constructors, such as `structure`, `list`, and `c`.)
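One way to sketch the "limited environment" idea in base R is shown below. The names `safe_env` and `safe_read`, and the particular set of allowed functions, are assumptions for this example; a real allow-list would depend on what the data contains, and this is not a vetted sandbox.

```r
# Evaluate deparsed data in an environment that only knows a handful of
# constructors, so that arbitrary function calls fail instead of running.
safe_env <- new.env(parent = emptyenv())
for (fn in c("list", "c", "structure", "-", ":")) {
  assign(fn, get(fn, envir = baseenv()), envir = safe_env)
}
safe_read <- function(text) eval(parse(text = text), envir = safe_env)

original <- list(1, "2", my = c("s", "h", "o", "e"))
text     <- paste(deparse(original), collapse = "\n")
restored <- safe_read(text)

# A call to anything outside the allowed set is rejected with an error.
blocked <- tryCatch(safe_read('system("echo oops")'),
                    error = function(e) "rejected")
```

Because the environment's parent is `emptyenv()`, a name such as `system` cannot be resolved and the call fails before anything runs.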

```{R}
## Annoyingly, the behavior of "dump" and "dput" depend on this global option.
options(deparse.max.lines = NULL)

#the output mimics the input
dput(
     list(1, "2", verb=quote(buckle), my=c("s", "h", "o", "e")))
```

(Note how `dput` fails on transmitting language objects; if we try to `eval` the above, we will try to evaluate `buckle` instead of getting just the name object `as.name("buckle")`.)
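The failure, and one possible way around it, can be reproduced with `deparse` directly. This sketch assumes the documented behavior of the `"quoteExpressions"` deparse option (part of `control = "all"`), which may differ between R versions; the variable names are invented for the example.

```r
x <- list(verb = quote(buckle))

# With default options the quote() is dropped, so the text is no longer
# plain data: evaluating it looks up a variable named `buckle`.
plain   <- deparse(x)
scratch <- new.env(parent = baseenv())
result  <- tryCatch(eval(parse(text = plain), scratch),
                    error = function(e) "error: object not found")

# control = "all" includes "quoteExpressions", which keeps the symbol quoted.
quoted   <- deparse(x, control = "all")
restored <- eval(parse(text = quoted), scratch)
```

`dput` accepts the same `control` argument, so the same option can be passed when writing to a connection.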

Performance-wise, dput and source should not be used for large datasets, because they display an O(n^2) characteristic in terms of the data size.

```{R dput}
options(deparse.max.lines = NULL)

dput.timings <- run_tests(arg_df(c(
  all.common.options,
  list(
    encoder = list(
      dput = list(
        from = function(t) eval(parse(text=t)),
        to = deparse,
        reader = function(c) eval(parse(c)),
        writer = function(x, c) dput(x, file=c)))))))
```

```{R, include = FALSE, results = "hide"}
benchmarks <- store(dput.timings, benchmarks)
(test.plot + labs(title="dput performance")) %+% filter(benchmarks, encoder == "dput")
```

It's interesting that deparse (which is used for the "in-memory conversion" test method labeled conn) is much slower than the connection-based dput.


The jsonlite package includes `fromJSON` and `toJSON` implementations. It also supports streaming reads and writes, but only of records consisting of one data frame per message. Data frames are sent row-wise.

Since the test dataset consists of several data frames, I will send one after the other over the lifetime of one connection.

```{R jsonlite}
jsonlite_reader <- function(conn) {
  append <- msgpack:::catenator()
  jsonlite::stream_in(conn, verbose = FALSE,
                      handler = function(x) append(list(x)))
  append(action="read")
}

jsonlite_writer <- function(l, conn) {
  lapply(l, function(x) jsonlite::stream_out(x, verbose=FALSE, conn))
}

jsonlite.spec <- c(
  all.common.options,
  list(
    encoder = list(
      jsonlite = list(
        to = jsonlite::toJSON,
        from = jsonlite::fromJSON,
        reader = jsonlite_reader,
        writer = jsonlite_writer,
        raw = FALSE))))

jsonlite.timings <- run_tests(arg_df(jsonlite.spec))
```

```{R echo=FALSE, results="hide"}
getElapsed <- function(d) filter(d, size == 1, method=="convert")$total.elapsed[[1]]
getRemote <- function(d) filter(d, size == 1, method=="remote")$total.elapsed[[1]]
`%digits%` <- function(x, y) format(x, digits=y)
```

jsonlite performs reasonably well, but is several times slower than serialization.

```{R, results = "hide"}
benchmarks <- store(jsonlite.timings, benchmarks)

(test.plot + labs(title="jsonlite performance")) %+% filter(benchmarks, encoder == "jsonlite")
```


msgpack is the package this vignette is written for.

```{R msgpack_remote}
msgpack_remote.spec <- c(
  common.options,
  list(method = list(remote = list(method=timeRemoteTransfer)),
       encoder = list(
         msgpack = list(
           reader = msgpack::readMsg,
           writer = msgpack::writeMsg,
           wrap = msgpack::msgConnection))))

msgpackR_remote.timings <- run_tests(arg_df(msgpack_remote.spec))
(test.plot + labs(title="msgpack")) %+% msgpackR_remote.timings
```

```{R msgpack}
msgpack.spec <- c(
  all.common.options,
  list(
    encoder = list(
      msgpack = list(
        wrap = msgpack::msgConnection,
        reader = msgpack::readMsg,
        writer = msgpack::writeMsg,
        to = msgpack::packMsg,
        from = msgpack::unpackMsg))))
msgpack.timings <- run_tests(arg_df(msgpack.spec))
```

```{R, results = "hide"}
benchmarks <- store(msgpack.timings, benchmarks)

(test.plot + labs(title="msgpack performance")) %+% filter(benchmarks, encoder == "msgpack")
```

Implementing the streaming-mode callbacks has helped a lot, but there is still a quadratic characteristic going on here. Need to do some profiling of memory allocation.

Interestingly, msgpack is already faster than `serialize` for the remote use case (modulo some network glitches affecting one or two data points).


There is an older pure-R implementation of msgpack on CRAN. One quirk is that it doesn't accept NA in R vectors.

```{R, error = TRUE}
msgpackR::pack(c(1, 2, 3))
msgpackR::pack(c(1, 2, 3, NA))
```

As a workaround I'll substitute out all the NA values in the dataset.

```{R}
dataset_mungenull <- map(dataset, map_dfc,
                         function(col) ifelse(is.na(col), 9999, col))
```

```{R msgpackR}
msgpackR.spec <- c(
  all.common.options %but% list(
    dataset = list(nycflights13 = list(data = dataset_mungenull)),
    note = list("no NAs" = list())),
  list(
    encoder = list(
      msgpackR = list(
        from = msgpackR::unpack,
        to = msgpackR::pack,
        reader = bufferBytes(msgpackR::unpack),
        writer = function(data, conn) writeBin(msgpackR::pack(data), conn)))))
```

```{R, results = "hide"}
msgpackR.timings <- run_tests(arg_df(msgpackR.spec))
```

```{R, include=FALSE, results="hide"}
benchmarks <- store(msgpackR.timings, benchmarks)

(test.plot + labs(title="msgpackR performance")) %+% filter(benchmarks, encoder == "msgpackR")
```

Performance-wise, it is quite slow, but a deeper concern is that if I send larger objects and inspect the results, they sometimes look garbled, and there are intermittent errors like:

```{R error=TRUE}
str(subsample(dataset_mungenull, .000044) %>% (msgpackR::pack) %>% (msgpackR::unpack))
```

## rjson

```{R rjson}
rjson.spec <- combine_opts(
  list(
    method = list(
      convert = list(method = timeConvert)),
    encoder = list(
      rjson = list(
        from = rjson::fromJSON,
        to = rjson::toJSON,
        writer = function(data, con) writeChar(rjson::toJSON(data), con)))),
  buffer_read_options(reader = function(con) rjson::fromJSON(file=con),
                      raw = TRUE,
                      buffer = bufferRawConn))
rjson.timings <- run_tests(arg_df(rjson.spec))
```

```{R include=FALSE, echo=FALSE}
inmemory.factor <- (getElapsed(rjson.timings) / getElapsed(msgpack.timings))
remote.factor <- (getRemote(rjson.timings) / getRemote(msgpack.timings))
```

`rjson` is the fastest JSON implementation -- only `r inmemory.factor %digits% 2` 
times slower than `msgpack`, in memory, and `r remote.factor %digits% 2`
times slower across the wire. It does not
support streaming reads, so we must byte-buffer to read from
connections. But that turns out to be quite fast as well.

```{R results="hide"}
benchmarks <- store(rjson.timings, benchmarks)
(test.plot + labs(title="rjson performance")) %+% filter(benchmarks, encoder == "rjson")
```


I am getting the following intermittent error in RJSONIO (i.e. this document fails to render once in a while.)

Error in RJSONIO::readJSONStream(con) : failed to parse json at 10240

```{R RJSONIO}
RJSONIO.spec <- c(
  all.common.options,
  list(
    encoder = list(
      RJSONIO = list(
        from = RJSONIO::fromJSON,
        to = RJSONIO::toJSON,
        writer = function(x, con) writeBin(RJSONIO::toJSON(x), con),
        reader = RJSONIO::readJSONStream,
        raw = TRUE))))

RJSONIO.timings <- run_tests(arg_df(RJSONIO.spec))
```

```{R, include = FALSE, results = "hide"}
benchmarks <- store(RJSONIO.timings, benchmarks)
```

RJSONIO offers a function to do streaming reads from a connection, but it has considerable overhead compared with in-memory conversion.

```{R}
(test.plot + labs(title="RJSONIO performance")) %+% filter(benchmarks, encoder == "RJSONIO")
```


YAML is kind of "JSON, but more like Markdown." Allegedly easier to read but with a more complex grammar. It's popular for config files.

cat(yaml::as.yaml(list(compact=TRUE, schema = 0)))

Unfortunately, the YAML package produces a protection stack overflow when decoding too large a message.

```{R, error = TRUE}
oops <- yaml::yaml.load(yaml::as.yaml(subsample(dataset, 0.15)))
```

```{R}
yaml.timings <- run_tests(arg_df(c(
  all.common.options,
  list(
    note = list("Max 0.14" = list(max = 0.14)),
    encoder = list(
      yaml = list(
        reader = yaml::yaml.load_file,
        writer = function(data, conn) writeChar(yaml::as.yaml(data), conn),
        to = yaml::as.yaml,
        from = yaml::yaml.load,
        raw = TRUE))))))
```

```{R, results = "hide"}
benchmarks <- store(yaml.timings, benchmarks)

(test.plot + labs(title="yaml performance")) %+% filter(benchmarks, encoder == "yaml")
```


We're going here, aren't we? Let's see if we can send messages with CSV. To send several CSV tables over one connection, I'll prepend to each message a header saying how many rows follow.

```{R}
writeCsvs <- function(data, con) {
  for (nm in names(data)) {
    # header record: table name and the number of rows that follow
    write.csv(as_tibble(list(name = nm, nrows = nrow(data[[nm]]))), con)
    write.csv(data[[nm]], con, row.names = FALSE)
  }
}

readCsvs <- function(con) {
  output <- list()
  # read header/table pairs until read.csv fails at the end of the input
  tryCatch(
    repeat {
      header <- read.csv(con, nrows=1, stringsAsFactors = FALSE)
      output[[header$name]] <- read.csv(con, nrows=header$nrows, stringsAsFactors = FALSE)
    },
    error = force)
  output
}
```

I have just had a sinking feeling that a lot of people actually build web services that talk in CSV.

Unfortunately read.table can't cope with non-blocking connections, so I have to buffer the read into memory when reading from a fifo or socket.

```{R, results="hide"}
csv.timings <- run_tests(
  arg_df(c(
    list(encoder = list(csv = list(writer = writeCsvs))),
    buffer_read_options(reader = readCsvs, raw = TRUE))))
```

```{R, results = "hide"}
benchmarks <- store(csv.timings, benchmarks)
(test.plot + labs(title="R csv performance")) %+% filter(benchmarks, encoder == "csv")
```

Surprisingly fast at writing to files.

# Comparison of Encoder Speed

```{R results="hide", include=FALSE}
# Derive some kind of ordering from fastest to slowest (for choosing our
# color palette):
library(gnm)
library(broom)
library(stringr)

# order the factors from high to low by ad-hoc regression model...
m <- gnm(data = benchmarks, total.elapsed ~ Mult(size, method, encoder))

getEncoder <- function(term) {
  # backslashes must be doubled inside an R string literal
  match <- str_match(term, "\\.(size|method|encoder)(\\w*)")
  match <- match[,-1]
  dimnames(match) <- list(NULL, list("param", "value"))
  as_data_frame(match)
}

coefs <- (m %>% tidy %>% bind_cols(., getEncoder(.$term)))

encoder_order <- (coefs
  %>% filter(param == "encoder")
  %>% arrange(desc(abs(estimate)))
  %>% .$value
)

method_order <- (coefs
  %>% filter(param == "method")
  %>% arrange(desc(abs(estimate)))
  %>% .$value
)
```

The most important scenarios are conversion in memory, writing to
file, writing to another process on the same host, and writing to
a remote host.

```{R timings, fig_width=6, fig_height=6, fig.asp = 1}
timing.plot <- ( benchmarks
  %>% subset(method %in% c("convert", "socket", "remote", "file"))
  %>% mutate(method = c(convert="Conversion in memory", socket="TCP (same host)",
                        remote="TCP (over LAN)", file = "File I/O")[method])
  %>% mutate(encoder = factor(encoder, levels=encoder_order))
  %>% { (ggplot(.)
    + aes(y = total.elapsed, x = size, color = encoder, group = encoder)
    + facet_wrap(~method)
    + geom_point()
    + geom_line()
    + theme(aspect.ratio=1)
    + scale_x_continuous(name = "Fraction of nycflights13 transmitted")
    + scale_y_continuous(limits = c(0, NA), name = "Elapsed time (s)")
    + labs(title="Elapsed time to encode, write, read, decode, by package")
    + expand_limits(x=1.5, y=30)
    + geom_dl(aes(label=encoder),
              method = list(
                dl.trans(x=x+0.2, y=y+0.2)))
    + guides(color=FALSE)
  )}
)
timing.plot
```

# Comparison of Encoder Data Usage

```{R datasize}
# We didn't have time to test each encoder with the full
# dataset, so we'll extrapolate from the largest data set tested
needed <- (benchmarks
  %>% filter(!is.na(bytes))
  %>% select(., encoder, method, size, bytes)
  %>% group_by(encoder)
  %>% filter(row_number() == 1)
  %>% arrange(desc(size), .by_group=TRUE)
  %>% slice(1)
  %>% select(encoder, method)
)

model <- (needed
  %>% inner_join(benchmarks)
  %>% select(bytes, encoder, size)
  %>% lm(formula = bytes ~ encoder * size))

predicted <- (needed
  %>% mutate(size = 1)
  %>% augment(model, newdata=.)
  %>% rename(bytes = .fitted)
)

datasize.plot <- (predicted
  %>% mutate(encoder = factor(encoder, levels=encoder_order))
  %>% ggplot
  %>% +aes(x = reorder(encoder, bytes), y=bytes/2^20, fill=encoder)
  %>% +geom_col()
  %>% +labs(y = "MiB", x = NULL,
            title = "Space used to encode nycflights13 (shorter is better)")
  %>% +geom_dl(aes(label=encoder, group=encoder),
               method=c(dl.trans(y=y+0.1), "top.bumpup"))
  %>% +guides(fill = FALSE)
  %>% +theme(axis.title.x=element_blank(),
             axis.text.x=element_blank(),
             axis.ticks.x=element_blank()
             )
)
datasize.plot
```

The difference in size between different JSON and msgpack
implementations could use some investigation. It may come down to
whitespace, sending data row-wise vs. col-wise, differences in the
mapping between R and data format types, or bugs.

save(benchmarks, file="../inst/benchmarks.RData")

