pRDS is an R package that uses existing parallelized compression software (e.g., pigz) to read and write compressed RDS files, leading to a speedup given the multi-threading capabilities of modern computers. The package looks for the appropriate compression software in the host system's path and, if found, offloads the handling of compression and decompression tasks to the external program.
Writing RDS files is rather fast if the compression option is switched off. But with compression turned on, it becomes impossible to save large quantities of data quickly, as R's base implementation uses only a single thread. One way to solve this issue is to create the appropriate C bindings for the existing compression libraries. But bindings are hard to maintain and tend to break when the upstream library changes. Quite a few projects of this kind have already gone bust. Command-line interfaces, though, rarely change, and OS package managers take care of maintaining them. pRDS smartly takes the dumb approach to fast compression and decompression: relying on the external, well-maintained tools that we all have access to on our computers. This is particularly useful in HPC environments and on analytic servers where we don't have access to all the libraries we want, but where most popular compression packages are pre-installed.
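The idea of handing compression to an external program can be sketched in base R with `pipe()` connections. This is only an illustration and not pRDS's actual implementation: the helper names `find_gzip_tool`, `save_rds_piped` and `read_rds_piped` are hypothetical, and the sketch assumes a Unix-like shell with `pigz` or `gzip` on the path.

```r
# Hypothetical helpers (not part of pRDS): stream the serialized object
# through an external compressor instead of R's single-threaded one.
find_gzip_tool <- function() {
  # Prefer the parallel pigz; fall back to plain gzip if it is missing.
  found <- Sys.which(c("pigz", "gzip"))
  found <- found[nzchar(found)]
  if (length(found) == 0L) stop("neither pigz nor gzip found on PATH")
  names(found)[1L]
}

save_rds_piped <- function(object, file, tool = find_gzip_tool()) {
  # "<tool> -c" compresses stdin to stdout; the shell redirects it to `file`.
  con <- pipe(paste(tool, "-c >", shQuote(file)), open = "wb")
  on.exit(close(con))  # closing waits for the compressor to finish
  saveRDS(object, con)
  invisible(file)
}

read_rds_piped <- function(file, tool = find_gzip_tool()) {
  # "<tool> -dc" decompresses `file` to stdout, which R reads as a raw stream.
  con <- pipe(paste(tool, "-dc", shQuote(file)), open = "rb")
  on.exit(close(con))
  readRDS(con)
}

path <- tempfile(fileext = ".rds")
save_rds_piped(mtcars, path)
restored <- read_rds_piped(path)
```

Because the result is an ordinary gzip-compressed RDS file, base R's own `readRDS(path)` should be able to read it as well.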
You can easily use `devtools` to install pRDS:
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("retrography/pRDS")
You will probably also need to install one or more of the compression tools that pRDS recognizes, so that pRDS can work its magic. Those include xz, lbzip2, pixz, pxz, pbzip2, zstd, 7-zip, and pigz. For that you will need to use your system's package manager (brew, APT, YUM, etc.). Note that not all of these packages are available on every Linux/Unix/macOS platform, which is why pRDS supports so many of them.
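To see which of these tools R can currently find, a quick check with base R's `Sys.which()` may help. The executable names below are assumptions (for example `7z` for 7-zip) and may differ on your system.

```r
# Tool names as listed in this README; the executable names
# (e.g. "7z" for 7-zip) are assumptions and may differ on your system.
tools <- c("xz", "lbzip2", "pixz", "pxz", "pbzip2", "zstd", "7z", "pigz")

# Sys.which() returns the full path of each executable, or "" if not found.
paths <- unname(Sys.which(tools))
available <- data.frame(
  tool  = tools,
  found = nzchar(paths),
  path  = paths,
  stringsAsFactors = FALSE
)
print(available)
```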
On Windows (to the best of my knowledge) you will need zstd for gzip, as well as 7-zip for xz and bzip2 compression. You will also need to install Rtools, which includes crucial utilities for detecting the MIME type of files:
choco install rtools zstandard 7zip
For pRDS to find your programs, they have to be in the system's path. zstd adds its executables directly to the path. For 7-zip you have to do it yourself by adding the directory (normally `C:\Program Files\7-Zip`) to the `PATH` environment variable. Otherwise you can add it to R's `PATH` environment:
Sys.setenv(PATH = paste("c:\\Program Files\\7-Zip", Sys.getenv("PATH"), sep=";"))
The same holds if you have pigz or any other compression software installed.
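A slightly more general version of the same idea, as a sketch: the helper `prepend_to_path` is hypothetical (not part of pRDS), uses the platform's own path separator, and only affects the current R session.

```r
# Hypothetical helper: prepend a directory to PATH for this R session only.
prepend_to_path <- function(dir) {
  sep <- .Platform$path.sep  # ";" on Windows, ":" elsewhere
  current <- strsplit(Sys.getenv("PATH"), sep, fixed = TRUE)[[1]]
  if (!dir %in% current) {
    Sys.setenv(PATH = paste(c(dir, current), collapse = sep))
  }
  invisible(Sys.getenv("PATH"))
}

prepend_to_path(tempdir())  # stand-in for, e.g., the 7-Zip directory
Sys.which("7z")             # "" means R still cannot see the executable
```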
pRDS is designed to be a drop-in replacement for R's base RDS manipulation functions. Its functions override `readRDS` and `saveRDS` to introduce new versions that use system commands with parallel implementations. Call them the same way you call the original functions from the `base` package. In addition to the usual parameters, you can pass extra parameters to these functions, which are in turn channeled to the underlying `cmpFile` function. For instance, you can control the number of cores to be used by setting the `core` parameter, or change the compression level by passing a value to the `compression` parameter. You can also explicitly set the default compression software for a given format using the `setDefaultCmd` function (supported software only).
Note that for the package to know how many cores it can use for its task, you will have to set the `mc.cores` option:
options(mc.cores = parallel::detectCores())
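To illustrate how a core count like this could end up on an external tool's command line, here is a sketch. The helper `thread_flag` is hypothetical and does not reflect pRDS internals; the flags are the thread-count options documented by the tools themselves (`pigz -p`, `xz -T`, `zstd -T`, `lbzip2 -n`, `pbzip2 -p`).

```r
# Hypothetical helper (not pRDS's API): turn a core count into the
# thread-count flag of a few common compressors.
thread_flag <- function(tool, cores = getOption("mc.cores", 1L)) {
  cores <- as.integer(cores)
  switch(tool,
    pigz   = paste("-p", cores),
    xz     = paste0("-T", cores),
    zstd   = paste0("-T", cores),
    lbzip2 = paste("-n", cores),
    pbzip2 = paste0("-p", cores),
    stop("no thread flag known for: ", tool)
  )
}

thread_flag("pigz", cores = 4)
thread_flag("xz", cores = 2)
thread_flag("zstd")  # falls back to getOption("mc.cores"), or 1 if unset
```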
I ran this benchmark on my Mac (i7, 4 cores, 8 threads) in a loosely controlled environment, so not in an entirely scientific manner. The test file consisted of about 17 million tweets (3 GB CSV). All tests were conducted using the default compression levels (gzip = 6, bzip2 = 9, xz = 6). Four of the tools don't work as intended on my computer.
| Compressor | Format       | Write (sec) | Read (sec) | Size (MB) | Note                                                        |
| ---------- | ------------ | ----------- | ---------- | --------- | ----------------------------------------------------------- |
| R native   | Uncompressed | 99          | 42         | 4236      |                                                             |
| R native   | gzip         | 296         | 62         | 1492      |                                                             |
| pigz       | gzip         | 136         | 35         | 1497      |                                                             |
| zstd       | gzip         | -           | -          | -         | Not parallel                                                |
| R native   | xz           | 2626        | 105        | 772       |                                                             |
| 7z         | xz           | 580         | 83         | 741       |                                                             |
| xz         | xz           | 469         | 64         | 795       |                                                             |
| pixz       | xz           | 443         | 63         | 805       |                                                             |
| pxz        | xz           | -           | -          | -         | Doesn't compile                                             |
| zstd       | xz           | -           | -          | -         | Not parallel                                                |
| R native   | bzip2        | 519         | 229        | 1054      |                                                             |
| lbzip2     | bzip2        | 222         | 55         | 1053      |                                                             |
| pbzip2     | bzip2        | -           | -          | -         | Hangs                                                       |
| 7z         | bzip2        | 1060        | 82         | 704       | Compression parameter bad? Only decompresses its own files. |
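To put the timings above in relative terms, the following snippet computes each tool's speedup over R's native implementation for the same format. The numbers are copied from the table; the variable names are only illustrative.

```r
# Timings (seconds) copied from the table above.
bench <- data.frame(
  tool   = c("pigz", "7z", "xz", "pixz", "lbzip2", "7z"),
  format = c("gzip", "xz", "xz", "xz", "bzip2", "bzip2"),
  write  = c(136, 580, 469, 443, 222, 1060),
  read   = c(35, 83, 64, 63, 55, 82),
  stringsAsFactors = FALSE
)
# R's native implementation for the same formats serves as the baseline.
native <- data.frame(
  format       = c("gzip", "xz", "bzip2"),
  native_write = c(296, 2626, 519),
  native_read  = c(62, 105, 229),
  stringsAsFactors = FALSE
)

speedup <- merge(bench, native, by = "format")
speedup$write_speedup <- round(speedup$native_write / speedup$write, 2)
speedup$read_speedup  <- round(speedup$native_read / speedup$read, 2)
print(speedup[, c("tool", "format", "write_speedup", "read_speedup")])
```

Values below 1 mean the external tool was slower than R's native code, as happens with 7z when writing bzip2.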
Make sure to test the different tools on your platform and particular setup, as your tools may not have been compiled or configured properly for parallel operation.
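A minimal timing sketch for such a test, using only base R and a synthetic data frame as a stand-in for your own data. As written it measures R's native implementation; the assumption is that with pRDS attached and masking these functions as described above, the same calls would exercise the external tools instead. The helper `time_format` is hypothetical.

```r
# Synthetic stand-in for your own data; replace with a realistic object.
set.seed(1)
x <- data.frame(id = seq_len(1e5), value = rnorm(1e5),
                label = sample(letters, 1e5, replace = TRUE))

# Hypothetical helper: time one write/read round trip for a given format.
time_format <- function(object, compress) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path))
  write_sec <- system.time(saveRDS(object, path, compress = compress))[["elapsed"]]
  read_sec  <- system.time(readRDS(path))[["elapsed"]]
  data.frame(format = compress, write_sec = write_sec, read_sec = read_sec,
             size_mb = file.size(path) / 1024^2)
}

results <- do.call(rbind, lapply(c("gzip", "bzip2", "xz"), time_format, object = x))
print(results)
```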
pRDS is published under the GNU General Public License, version 2, because it proudly steals some GPL code from R itself.