In olascodgreat/samife: A Grammar of Data Manipulation

overview

This is a complete redesign of how we evaluate expressions in dplyr. We no longer attempt to evaluate only part of an expression. We now either:

- recognize the entire expression, e.g. n() or mean(x), and use C++ code to evaluate it (this is what we call hybrid evaluation now, but I guess another term would be better), or
- if the expression is not recognized, use standard evaluation in a suitable environment.

data mask

When used internally in the C++ code, a tibble becomes one of three classes: GroupedDataFrame, RowwiseDataFrame or NaturalDataFrame. Most internal code is templated on these classes, e.g. summarise is:

// [[Rcpp::export]]
SEXP summarise_impl(DataFrame df, QuosureList dots) {
  check_valid_colnames(df);
  if (is<RowwiseDataFrame>(df)) {
    return summarise_grouped<RowwiseDataFrame>(df, dots);
  } else if (is<GroupedDataFrame>(df)) {
    return summarise_grouped<GroupedDataFrame>(df, dots);
  } else {
    return summarise_grouped<NaturalDataFrame>(df, dots);
  }
}
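The same pattern, one runtime check at the entry point followed by a single templated implementation, can be sketched without Rcpp. This is only an illustration: `Frame`, `Kind`, `Natural`, `Grouped`, `Rowwise`, `summarise_sketch` and `summarise_dispatch` are hypothetical stand-ins, not dplyr's classes, and each "kind" merely reports how many slices it would produce.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for NaturalDataFrame, GroupedDataFrame and
// RowwiseDataFrame. Each one only knows how many slices it produces.
struct Natural {
  static const char* name() { return "natural"; }
  static std::size_t ngroups(std::size_t, std::size_t) { return 1; }
};
struct Grouped {
  static const char* name() { return "grouped"; }
  static std::size_t ngroups(std::size_t, std::size_t declared) { return declared; }
};
struct Rowwise {
  static const char* name() { return "rowwise"; }
  static std::size_t ngroups(std::size_t nrows, std::size_t) { return nrows; }
};

enum class Kind { Natural, Grouped, Rowwise };

// Hypothetical data frame description: its kind, row count and group count.
struct Frame {
  Kind kind;
  std::size_t nrows;
  std::size_t declared_groups;
};

// One implementation, written once and instantiated for each kind.
template <typename SlicedTibble>
std::string summarise_sketch(const Frame& df) {
  return std::string(SlicedTibble::name()) + ":" +
         std::to_string(SlicedTibble::ngroups(df.nrows, df.declared_groups));
}

// Runtime dispatch happens exactly once, as in summarise_impl.
std::string summarise_dispatch(const Frame& df) {
  switch (df.kind) {
    case Kind::Rowwise: return summarise_sketch<Rowwise>(df);
    case Kind::Grouped: return summarise_sketch<Grouped>(df);
    default:            return summarise_sketch<Natural>(df);
  }
}
```

Everything after the dispatch is resolved at compile time, so the per-group loops do not pay for virtual calls.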

The DataMask<SlicedTibble> template class is used by both hybrid and standard evaluation to extract the relevant information from the columns (original columns, or columns that have just been created by mutate() or summarise()).

standard evaluation

meta information about the groups

The functions n(), row_number() and group_indices(), when called without arguments, lack contextual information, i.e. the current group size and index, so they look for that information in a special environment:

n <- function() {
  from_context("..group_size")
}

The DataMask class is responsible for updating the variables ..group_size and ..group_number:

    // update the data context variables, these are used by n(), ...
    get_context_env()["..group_size"] = indices.size();
    get_context_env()["..group_number"] = indices.group() + 1;

All other functions can simply be called with standard evaluation in the data mask.
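The handshake between the mask and argument-less functions such as n() can be sketched in plain C++. This is a minimal illustration under assumptions: the context environment is modelled as a `std::map`, and `context_env`, `enter_group`, `n` and `group_number` are hypothetical names. Only the variable names `..group_size` and `..group_number`, and the one-based group number, come from the text above.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the special context environment.
std::map<std::string, int>& context_env() {
  static std::map<std::string, int> env;
  return env;
}

// Look a variable up in the context, failing loudly when it is missing
// (e.g. when n() is called outside of a data context).
int from_context(const std::string& name) {
  std::map<std::string, int>::const_iterator it = context_env().find(name);
  if (it == context_env().end()) {
    throw std::runtime_error(name + " is not available in the current context");
  }
  return it->second;
}

// What the mask does when it moves to a group: publish size and 1-based number.
void enter_group(int zero_based_group, int group_size) {
  context_env()["..group_size"] = group_size;
  context_env()["..group_number"] = zero_based_group + 1;
}

// Argument-less functions simply read the published values.
int n() { return from_context("..group_size"); }
int group_number() { return from_context("..group_number"); }
```

The functions themselves stay stateless; all per-group state lives in one place that the mask owns.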

active and resolved bindings

When doing standard evaluation, we need to install a data mask that resolves the symbols from the data to the relevant subset. The simple solution would be to update the data mask at each iteration with subsets for all the variables, but that would be potentially expensive and wasteful, as we might not need all of the variables at a given time, e.g. in this case:

iris %>% group_by(Species) %>% summarise(Sepal.Length = +mean(Sepal.Length))

We only need to materialize Sepal.Length; we don't need the other variables.

DataMask installs an active binding for each variable in one of the environments (the top one) in the data mask ancestry. The active binding function is generated by the following function, so that it holds an index and a pointer to the data mask in its enclosure.

.make_active_binding_fun <- function(index, subsets){
  function() {
    materialize_binding(index, subsets)
  }
}

When hit, the active binding calls the materialize_binding function:

// [[Rcpp::export]]
SEXP materialize_binding(int idx, XPtr<DataMaskBase> mask) {
  return mask->materialize(idx);
}

The DataMask<>::materialize(idx) method returns the materialized subset, but also:
- installs the result in the bottom environment of the data mask, so that it masks the active binding. The point is to call the active binding only once.
- remembers that the binding at position idx has been materialized, so that before evaluating the same expression in the next group, it is proactively materialized, because it is very likely that we need the same variables for all groups.

When we move to the next expression to evaluate, DataMask forgets about the materialized bindings so that the active binding can be triggered again as needed.
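The materialize, remember and forget cycle described above can be sketched in plain C++. This is a rough model under assumptions, not dplyr's DataMask: `LazyMask` and its members are hypothetical, columns are `std::vector<double>`, and a single map plays the role of the bottom environment that shadows the active bindings.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical lazy mask: subsets a column only when it is first asked for.
class LazyMask {
public:
  typedef std::vector<double> Column;

  explicit LazyMask(std::map<std::string, Column> columns)
    : columns_(std::move(columns)), lazy_hits_(0) {}

  // New expression: forget which bindings were materialized.
  void rechain() { resolved_.clear(); }

  // New group: store the indices and proactively rematerialize the
  // bindings that the current expression has already used.
  void update(const std::vector<std::size_t>& indices) {
    indices_ = indices;
    for (auto& binding : resolved_) {
      binding.second = subset(binding.first);
    }
  }

  // Plays the role of the active binding: the slow path runs at most once
  // per expression, later reads are served from resolved_.
  const Column& get(const std::string& name) {
    auto it = resolved_.find(name);
    if (it == resolved_.end()) {
      ++lazy_hits_;
      it = resolved_.emplace(name, subset(name)).first;
    }
    return it->second;
  }

  int lazy_hits() const { return lazy_hits_; }

  int subset_count(const std::string& name) const {
    auto it = subset_counts_.find(name);
    return it == subset_counts_.end() ? 0 : it->second;
  }

private:
  // Extract the rows of the current group from one column.
  Column subset(const std::string& name) {
    const Column& full = columns_.at(name);
    Column out;
    out.reserve(indices_.size());
    for (std::size_t i : indices_) out.push_back(full.at(i));
    ++subset_counts_[name];
    return out;
  }

  std::map<std::string, Column> columns_;
  std::map<std::string, Column> resolved_;     // materialized for this expression
  std::map<std::string, int> subset_counts_;   // bookkeeping for the example only
  std::vector<std::size_t> indices_;
  int lazy_hits_;
};
```

With two columns where the expression only touches `x`, the sketch never subsets `y`, and after the first group `x` is refreshed in `update()` without going through the slow path again.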

use case of the DataMask class

Before evaluating expressions, construct a DataMask from a tibble:

DataMask<SlicedTibble> mask(tbl);

Before evaluating a new expression, we need to call rechain(parent_env) to prepare the data mask to evaluate the expression with a given parent environment. This "forgets" about the materialized bindings.

mask.rechain(quosure.env());

Before evaluating the expression on a new group, the indices are updated; this includes rematerializing the already materialized bindings.
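The order of these three steps can be made explicit with a tracing stand-in. This is only a sketch of the call sequence: `TracingMask`, its `eval` method and `evaluate_all` are hypothetical, and the parent environment is reduced to a fixed string.

```cpp
#include <string>
#include <vector>

// Hypothetical mask that only records the calls it receives.
struct TracingMask {
  std::vector<std::string> log;

  void rechain(const std::string& parent_env) { log.push_back("rechain(" + parent_env + ")"); }
  void update(int group) { log.push_back("update(" + std::to_string(group) + ")"); }
  void eval(const std::string& expr) { log.push_back("eval(" + expr + ")"); }
};

// Evaluate every expression on every group and return the recorded calls.
std::vector<std::string> evaluate_all(const std::vector<std::string>& exprs, int ngroups) {
  TracingMask mask;                      // constructed once per tibble
  for (const std::string& expr : exprs) {
    mask.rechain("env");                 // once per expression
    for (int g = 0; g < ngroups; ++g) {
      mask.update(g);                    // once per group, before evaluating
      mask.eval(expr);
    }
  }
  return mask.log;
}
```

For two expressions and two groups the log starts with `rechain(env)`, `update(0)`, `eval(a)`, which mirrors the sequence described above.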

hybrid evaluation

Use of DataMask

Hybrid evaluation also uses the DataMask<> class, but it only needs to quickly retrieve the data for an entire column. This is what the maybe_get_subset_binding method does.

  // returns a pointer to the ColumnBinding if it exists
  // this is mostly used by the hybrid evaluation
  const ColumnBinding<SlicedTibble>* maybe_get_subset_binding(const SymbolString& symbol) const {
    int pos = symbol_map.find(symbol);
    if (pos >= 0) {
      return &column_bindings[pos];
    } else {
      return 0;
    }
  }

When the symbol map contains the binding, we get a ColumnBinding<SlicedTibble>*. These objects hold the following fields:

  // is this a summary binding, i.e. does it come from summarise
  bool summary;

  // symbol of the binding
  SEXP symbol;

  // data. it is owned either by the original data frame or by the
  // accumulator, so no need for additional protection here
  SEXP data;

Hybrid evaluation only needs summary and data.
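The "pointer or null" lookup can be tried out in isolation. In this minimal sketch, `BindingSketch` and `BindingTable` are hypothetical, a `std::unordered_map` stands in for the symbol map, and replacing a binding when the same symbol is set twice is an assumption made for the example rather than documented dplyr behaviour.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical counterpart of ColumnBinding, reduced to plain C++ types.
struct BindingSketch {
  bool summary;               // true when the column was created by summarise()
  std::string symbol;
  std::vector<double> data;
};

class BindingTable {
public:
  // Add a binding, or replace the existing one with the same symbol
  // (the replacement rule is an assumption of this sketch).
  void set(const std::string& symbol, const std::vector<double>& data, bool summary) {
    BindingSketch binding = {summary, symbol, data};
    std::unordered_map<std::string, std::size_t>::const_iterator it = positions_.find(symbol);
    if (it == positions_.end()) {
      positions_[symbol] = bindings_.size();
      bindings_.push_back(binding);
    } else {
      bindings_[it->second] = binding;
    }
  }

  // Returns a pointer to the binding, or nullptr when the symbol is unknown.
  const BindingSketch* maybe_get(const std::string& symbol) const {
    std::unordered_map<std::string, std::size_t>::const_iterator it = positions_.find(symbol);
    return it == positions_.end() ? nullptr : &bindings_[it->second];
  }

private:
  std::unordered_map<std::string, std::size_t> positions_;
  std::vector<BindingSketch> bindings_;
};
```

Returning a raw pointer keeps the "not a column" case cheap to test, which matters because the hybrid evaluator probes many symbols that turn out not to be columns.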

Expression

When attempting to evaluate an expression with the hybrid evaluator, we first construct an Expression object. This class has methods to quickly check if the expression can be managed, e.g.

    // sum( <column> ) and base::sum( <column> )
    if (expression.is_fun(s_sum, s_base, ns_base)) {
      Column x;
      if (expression.is_unnamed(0) && expression.is_column(0, x)) {
        return sum_(data, x, /* na.rm = */ false, op);
      } else {
        return R_UnboundValue;
      }
    }

This checks that the call matches sum(<column>) or base::sum(<column>) where <column> is a column from the data mask.

In that example, the Expression class checks that:
- the first argument is not named
- the first argument is a column from the data

Otherwise, it is an expression that we can't handle, so we return R_UnboundValue, which is how hybrid evaluation signals that it gives up on the expression and that it should be evaluated with standard evaluation.
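The "recognize the whole call or give up" contract can be sketched without R. Everything here is hypothetical: a call is modelled as a small `Call` struct, a boolean return value plays the role of R_UnboundValue, and, unlike the snippet above, the sketch additionally assumes the call has exactly one argument.

```cpp
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical, simplified model of a call such as base::sum(x).
struct Arg { std::string name; std::string value; };   // empty name means unnamed
struct Call { std::string pkg; std::string fun; std::vector<Arg> args; };
typedef std::map<std::string, std::vector<double> > Columns;

// Returns true and fills `out` only for sum(<column>) or base::sum(<column>).
// Returning false is this sketch's way of giving up.
bool hybrid_sum(const Call& call, const Columns& data, double& out) {
  if (call.fun != "sum") return false;
  if (!call.pkg.empty() && call.pkg != "base") return false;
  if (call.args.size() != 1) return false;           // assumption of the sketch
  if (!call.args[0].name.empty()) return false;      // named argument
  Columns::const_iterator col = data.find(call.args[0].value);
  if (col == data.end()) return false;                // not a column of the data
  out = std::accumulate(col->second.begin(), col->second.end(), 0.0);
  return true;
}

// Try the fast path first, otherwise report that standard evaluation is needed.
std::string evaluate(const Call& call, const Columns& data) {
  double result = 0.0;
  return hybrid_sum(call, data, result) ? "hybrid" : "standard";
}
```

Because giving up is always safe, the recognizer can stay strict: anything it does not match exactly is left to standard evaluation.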

Expression has the following methods:

inline bool is_fun(SEXP symbol, SEXP pkg, SEXP ns) : are we calling fun? If so, does fun currently resolve to the function we intend? It might not if the function is masked, which allows things like this:

> n <- function() 42
> summarise(iris, nn = n())
  nn
1 42

bool is_valid() const : is the expression valid? The Expression constructor rules out a few expressions that have no chance of being handled, such as pkg::fun() when pkg is none of dplyr, stats or base
SEXP value(int i) const : the expression at position i
bool is_named(int i, SEXP symbol) const : is the i'th argument named symbol?
bool is_scalar_logical(int i, bool& test) const : is the i'th argument a scalar logical? We need this for handling e.g. na.rm = TRUE
bool is_scalar_int(int i, int& out) const : is the i'th argument a scalar int? We need this for n = <int>
bool is_column(int i, Column& column) const : is the i'th argument a column?

hybrid_do

The hybrid_do function uses methods from Expression to quickly assess if it can handle the expression and then calls the relevant function from dplyr::hybrid:: to create the result at once:

    if (expression.is_fun(s_sum, s_base, ns_base)) {
      // sum( <column> ) and base::sum( <column> )
      Column x;
      if (expression.is_unnamed(0) && expression.is_column(0, x)) {
        return sum_(data, x, /* na.rm = */ false, op);
      }
    } else if (expression.is_fun(s_mean, s_base, ns_base)) {
      // mean( <column> ) and base::mean( <column> )

      Column x;
      if (expression.is_unnamed(0) && expression.is_column(0, x)) {
        return mean_(data, x, false, op);
      }
    } else if ...

The functions in the C++ dplyr::hybrid:: namespace create objects whose classes hold:
- the type of output they create
- the information they need (e.g. the column, the value of na.rm, ...)

These classes all have these methods:
- summarise() to return a result of the same size as the number of groups. This is used when op is a Summary. This returns R_UnboundValue to give up when the class can't do that, e.g. the classes behind lag
- window() to return a result of the same size as the number of rows in the original data set.

The classes typically don't provide these methods directly, but rather inherit, via CRTP, from one of:
- HybridVectorScalarResult, so that the class only has to provide a process method, for example the Count class:

template <typename SlicedTibble>
class Count : public HybridVectorScalarResult<INTSXP, SlicedTibble, Count<SlicedTibble> > {
public:
  typedef HybridVectorScalarResult<INTSXP, SlicedTibble, Count<SlicedTibble> > Parent ;

  Count(const SlicedTibble& data) : Parent(data) {}

  int process(const typename SlicedTibble::slicing_index& indices) const {
    return indices.size();
  }
} ;

HybridVectorScalarResult uses the result of process in both summarise() and window().

- HybridVectorVectorResult, which expects a fill method, e.g. the implementation of ntile(n = <int>) uses this class, which derives from HybridVectorVectorResult:

template <typename SlicedTibble>
class Ntile1 : public HybridVectorVectorResult<INTSXP, SlicedTibble, Ntile1<SlicedTibble> > {
public:
  typedef HybridVectorVectorResult<INTSXP, SlicedTibble, Ntile1> Parent;

  Ntile1(const SlicedTibble& data, int ntiles_): Parent(data), ntiles(ntiles_) {}

  void fill(const typename SlicedTibble::slicing_index& indices, Rcpp::IntegerVector& out) const {
    int m = indices.size();
    for (int j = m - 1; j >= 0; j--) {
      out[ indices[j] ] = (int)floor((ntiles * j) / m) + 1;
    }
  }

private:
  int ntiles;
};
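The two CRTP bases can be sketched without Rcpp to show how summarise() and window() are derived from process or fill. All names here are hypothetical simplifications: groups are plain vectors of row positions assumed to partition rows 0..n-1, results are `std::vector<int>`, and a `false` return from summarise() stands in for R_UnboundValue.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::size_t> Indices;  // row positions of one group
typedef std::vector<Indices> Groups;       // assumed to partition rows 0..n-1

inline std::size_t total_rows(const Groups& groups) {
  std::size_t n = 0;
  for (const Indices& g : groups) n += g.size();
  return n;
}

// Scalar flavour: Impl provides `int process(const Indices&) const`.
template <typename Impl>
class ScalarResultSketch {
public:
  explicit ScalarResultSketch(const Groups& groups) : groups_(groups) {}

  // One value per group; always possible for a scalar result.
  bool summarise(std::vector<int>& out) const {
    out.clear();
    for (const Indices& g : groups_) out.push_back(impl().process(g));
    return true;
  }

  // One value per row: the group's value is recycled over its rows.
  std::vector<int> window() const {
    std::vector<int> out(total_rows(groups_));
    for (const Indices& g : groups_) {
      const int value = impl().process(g);
      for (std::size_t row : g) out[row] = value;
    }
    return out;
  }

private:
  const Impl& impl() const { return static_cast<const Impl&>(*this); }
  Groups groups_;
};

// Vector flavour: Impl provides `void fill(const Indices&, std::vector<int>&) const`.
template <typename Impl>
class VectorResultSketch {
public:
  explicit VectorResultSketch(const Groups& groups) : groups_(groups) {}

  // Cannot be reduced to one value per group: report "give up".
  bool summarise(std::vector<int>&) const { return false; }

  std::vector<int> window() const {
    std::vector<int> out(total_rows(groups_));
    for (const Indices& g : groups_) static_cast<const Impl&>(*this).fill(g, out);
    return out;
  }

private:
  Groups groups_;
};

class CountSketch : public ScalarResultSketch<CountSketch> {
public:
  explicit CountSketch(const Groups& groups) : ScalarResultSketch<CountSketch>(groups) {}
  int process(const Indices& indices) const { return static_cast<int>(indices.size()); }
};

class NtileSketch : public VectorResultSketch<NtileSketch> {
public:
  NtileSketch(const Groups& groups, int ntiles)
    : VectorResultSketch<NtileSketch>(groups), ntiles_(ntiles) {}

  // Same arithmetic as Ntile1::fill above: the tile depends on the position in the group.
  void fill(const Indices& indices, std::vector<int>& out) const {
    const int m = static_cast<int>(indices.size());
    for (int j = 0; j < m; ++j) out[indices[j]] = (ntiles_ * j) / m + 1;
  }

private:
  int ntiles_;
};
```

The derived classes contain only the per-group logic; the looping over groups and the choice between one value per group and one value per row live in the bases, and the `static_cast` to `Impl` lets the compiler inline `process` and `fill` without virtual dispatch.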