Introduction to `prt`
In prt: Tabular Data Backed by Partitioned 'fst' Files

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(prt)

The prt object introduced by this package is intended to represent tabular data stored as one or more fst files. This is in similar spirit as disk.frame, but is much less ambitious in scope and therefore much simpler in implementation. While the disk.frame package attempts to provide a dplyr compliant API and offers parallel computation via the future package, the intended use-case for prt objects is the situation where only a (small) subset of rows of the (large) tabular dataset are of interest for analysis at once. This subset can be specified using the base generic function subset() and the selected data is read into memory as a data.table object. Subsequent data operations and analysis is then preformed on this data.table representation. For this reason, partition-level parallelism is not in-scope for prt as fst already provides an efficient shared memory parallel implementation for decompression. Furthermore the much more complex multi-function non-standard evaluation API provided by dplyr was forgone in favor of the very simple one-function approach presented by the base R S3 generic function subset().

For the purpose of illustration of some prt features and particularities, we instantiate a dataset as data.table object and create a temporary directory which will contain the file-based data back ends.

tmp <- tempfile()
dir.create((tmp))

dat <- data.table::setDT(nycflights13::flights)
print(dat)

Creating a prt object consisting of 2 partitions can for example be done as

flights <- as_prt(dat, n_chunks = 2L, dir = tempfile(tmpdir = tmp))
print(flights)

This simply splits rows of dat into 2 equally sized groups, preserving the original row ordering and writes each group to its own fst file. Depending on the types of queries that are most frequently run against the data, this naive partitioning might not be optimal. While fst does provide random row access, row selection is only possible via index ranges. Consequently, for each partition all rows that fall into the range between the minimum and the maximum required index will be read into memory and superfluous rows are discarded. If for example the data were to be most frequently accessed by airline, the resulting data loads would be more efficient if the data was already sorted by carrier codes.

dat <- data.table::setorderv(dat, "carrier")
grp <- cumsum(table(dat$carrier)) / nrow(dat) < 0.5
dat <- split(dat, grp[dat$carrier])

by_carrier <- as_prt(dat, dir = tempfile(tmpdir = tmp))
by_carrier

The behavior of subsetting operations on prt objects is modeled after that of tibble objects. Columns can be extracted using [[, $ (with partial matching being disallowed), or by selecting a single column with [ and passing TRUE as drop argument.

str(flights[[1L]])
identical(flights[["year"]], flights$year)
identical(flights[["year"]], flights[, "year", drop = TRUE])
str(flights$yea)

If the object resulting from the subsetting operation is two-dimensional, it is returned as data.table object. Apart form this distinction, again the intent is to replicate tibble behavior. One way in which tibble and data.frame do not behave in the same way is in default coercion to lower dimensions. The default value for the drop argument of [.data.frame is FALSE if only one row is returned but changes to TRUE where the result is a single column, while it is always FALSE for tibbles. A difference in behavior between data.table and tibble (any by extension prt) is a missing j argument: in the tibble (and in the data.frame) implementation, the i argument is then interpreted as column specification, whereas for data.frames, i remains a row selection.

datasets::mtcars[, "mpg"]
flights[, "dep_time"]

jan_dt <- flights[flights$month == 1L, ]

jan_dt[1L]
flights[1L]

Deviation of prt subsetting behavior from that of tibble objects is most likely unintentional and bug reports are much appreciated as github issues.

The main feature of prt is the ability to load only a subset of a much larger tabular dataset and a useful function for selecting rows and columns of a table in a concise manner is the base R S3 generic function subset(). As such, a prt specific method is provided by this package. Using this functionality, above query for selecting all flights in January can be written as follows

identical(jan_dt, subset(flights, month == 1L))

To illustrate the importance of row-ordering consider the following small benchmark example: we subset on the carrier column, selecting only American Airlines flights. In one prt object, rows are ordered by carrier whereas in the other they are not, which will cause rows that are interleaved with those corresponding to AA flights to be read and discarded.

bench::mark(
  subset(flights, carrier == "AA"),
  subset(by_carrier, carrier == "AA")
)

A common problem with non-standard evaluation (NSE) is potential ambiguity. Symbols in expressions passed as subset and select arguments are first resolved in the context of the data, followed by the environment the expression was created in (the quosure environment). Expressions are evaluated using rlang::eval_tidy(), which makes possible the distinction between symbols referring to the data mask from those referring to the expression environment. This can either be achieved using the .data and .env pronouns or by forcing parts of the expression.

month <- 1L
subset(flights, month == month, 1L:7L)

identical(jan_dt, subset(flights, month == !!month))
identical(jan_dt, subset(flights, .env$month == .data$month))

While in the above example it is fairly clear what is happening and it should come as no surprise that the symbol month cannot simultaneously refer to a value in the calling environment and the name of a column in the data mask, a more subtle issue is considered in the following example. The environment which takes precedence for evaluating the select argument is a named list of column indices. This makes it possible for example to specify a range of columns as (and makes the behavior of subset() being applied to a prt object consistent with that of a data.frame).

subset(flights, select = year:day)

Now recall that symbols that cannot be resolved in this data environment will be looked up in the calling environment. Therefore the following effect, while potentially unintuitive, can easily be explained. Again, the .data and .env pronouns can be used to resolve potential issues.

sched_dep_time <- "dep_time"
colnames(subset(flights, select = sched_dep_time))

actual_dep_time <- "dep_time"
colnames(subset(flights, select = actual_dep_time))

colnames(subset(flights, select = .env$sched_dep_time))
colnames(subset(flights, select = .env$actual_dep_time))

colnames(subset(flights, select = .data$sched_dep_time))
colnames(subset(flights, select = .data$actual_dep_time))

By default, subset expressions have to be evaluated on the entire dataset at once in order to be consistent with base R subset() for data.frames. Often times this is inefficient and this behavior can be modified using the part_saft argument. Consider the following query which selects all rows where the arrival delay is larger than the mean arrival delay. Obviously an expression like this can yield different results depending on whether it is evaluated on individual partitions or over the entire data. Other queries such as the one above where we threshold on a fixed value, however can safely be evaluated on partitions individually.

is_true <- function(x) !is.na(x) & x
expr <- quote(is_true(arr_delay > mean(arr_delay, na.rm = TRUE)))
nrow(subset_quo(flights, expr, part_safe = FALSE))
nrow(subset_quo(flights, expr, part_safe = TRUE))

As an aside, in addition to subset(), which creates quosures from the expressions passed as subset and select, (using rlang::enquo()) the function subset_quo() which operates on already quoted expressions is exported as well. Thanks to the double curly brace forwarding operator introduced in rlang 0.4.0, this escape-hatch mechanism however is of lesser importance.

col_safe_subset <- function(x, expr, cols) {
  stopifnot(is_prt(x), is.character(cols))
  subset(x, {{ expr }}, .env$cols)
}

air_time <- c("dep_time", "arr_time")
col_safe_subset(flights, month == 1L, air_time)

In addition to subsetting, concise and informative printing is another area which effort ha been put into. Inspired by (and liberally borrowing code from) tibble, the print() method of fst objects adds the data.table approach of showing both the first and last n rows of the table in question. This functionality can be used by other classes used to represent tabular data, as the function trunc_dt() driving this is exported. All that is required are implementations of the base S3 generic functions dim(), head(), tail() and of course print().

new_tbl <- function(...) structure(list(...), class = "my_tbl")

dim.my_tbl <- function(x) {
  rows <- unique(lengths(x))
  stopifnot(length(rows) == 1L)
  c(rows, length(x))
}

head.my_tbl <- function(x, n = 6L, ...) {
  as.data.frame(lapply(x, `[`, seq_len(n)))
}

tail.my_tbl <- function(x, n = 6L, ...) {
  as.data.frame(lapply(x, `[`, seq(nrow(x) - n + 1L, nrow(x))))
}

print.my_tbl <- function(x, ..., n = NULL, width = NULL,
                         max_extra_cols = NULL) {

  out <- format_dt(x, n = n, width = width, max_extra_cols = max_extra_cols)
  out <- paste0(out, "\n")

  cat(out, sep = "")

  invisible(x)
}

if (base::getRversion() < "4.0.0") {

  .S3method <- function(generic, class, method) {

    if(missing(method)) {
      method <- paste(generic, class, sep = ".")
    }

    method <- match.fun(method)

    registerS3method(generic, class, method, envir = parent.frame())

    invisible(NULL)
  }
}

Map(.S3method, c("dim", "head", "tail", "print"), "my_tbl",
    list(dim.my_tbl, head.my_tbl, tail.my_tbl, print.my_tbl))

new_tbl(a = letters, b = 1:26)

Similarly, the function glimpse_dt() which can be used to implement a class-specific function for the tibble S3 generic tibble::glimpse(). In order to customize the text description of the object a class-specific function for the tibble S3 generic tibble::tbl_sum() can be provided.