overview

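This is a complete redesign of how we evaluate expressions in dplyr. We no longer attempt to evaluate part of an expression. We now either handle the entire expression with hybrid evaluation, or give up on it and use standard evaluation in the data mask.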

data mask

When used internally in the C++ code, a tibble becomes one of the three classes GroupedDataFrame, RowwiseDataFrame or NaturalDataFrame. Most internal code is templated by these classes, e.g. summarise is:

// [[Rcpp::export]]
SEXP summarise_impl(DataFrame df, QuosureList dots) {
  check_valid_colnames(df);
  if (is<RowwiseDataFrame>(df)) {
    return summarise_grouped<RowwiseDataFrame>(df, dots);
  } else if (is<GroupedDataFrame>(df)) {
    return summarise_grouped<GroupedDataFrame>(df, dots);
  } else {
    return summarise_grouped<NaturalDataFrame>(df, dots);
  }
}
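The dispatch above can be sketched in standalone C++, without Rcpp. This is only an illustration of the pattern: the tag structs, the Kind enum, name() and the two function names are hypothetical stand-ins, not dplyr code.

```cpp
#include <string>

// Hypothetical stand-ins for the three tibble classes.
struct RowwiseDataFrame { static std::string name() { return "rowwise"; } };
struct GroupedDataFrame { static std::string name() { return "grouped"; } };
struct NaturalDataFrame { static std::string name() { return "natural"; } };

// Hypothetical runtime tag; the real code inspects the R object instead.
enum class Kind { Rowwise, Grouped, Natural };

// One templated implementation, instantiated once per tibble class.
template <typename SlicedTibble>
std::string summarise_sketch() {
  return "summarise on " + SlicedTibble::name() + " data";
}

// A runtime check selects the compile-time instantiation.
std::string summarise_dispatch(Kind kind) {
  switch (kind) {
    case Kind::Rowwise: return summarise_sketch<RowwiseDataFrame>();
    case Kind::Grouped: return summarise_sketch<GroupedDataFrame>();
    default:            return summarise_sketch<NaturalDataFrame>();
  }
}
```

The runtime branch is taken once per call; everything below it is resolved at compile time for the chosen class.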

The DataMask<SlicedTibble> template class is used by both hybrid and standard evaluation to extract the relevant information from the columns (original columns or columns that have just been made by mutate() or summarise()).

standard evaluation

meta information about the groups

The functions n(), row_number() and group_indices(), when called without arguments, lack contextual information, i.e. the current group size and index, so they look for that information in a special context environment:

n <- function() {
  from_context("..group_size")
}

The DataMask class is responsible for updating the variables ..group_size and ..group_number:

    // update the data context variables, these are used by n(), ...
    get_context_env()["..group_size"] = indices.size();
    get_context_env()["..group_number"] = indices.group() + 1;

All other functions can just be called with standard evaluation in the data mask.
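The context mechanism behind n() can be sketched in plain C++. The map below is a hypothetical stand-in for the context environment, and enter_group() is an invented name for the update step; only the variable names ..group_size and ..group_number come from the text above.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the context environment: a name -> value table.
std::map<std::string, int>& context_env() {
  static std::map<std::string, int> env;
  return env;
}

// What the data mask does before evaluating in a group:
// record the group size and the 1-based group number.
void enter_group(int group_size, int zero_based_group) {
  context_env()["..group_size"] = group_size;
  context_env()["..group_number"] = zero_based_group + 1;
}

// Counterpart of from_context(): fail loudly outside of a data context.
int from_context(const std::string& name) {
  std::map<std::string, int>::const_iterator it = context_env().find(name);
  assert(it != context_env().end() && "called outside of a data context");
  return it->second;
}

// n() takes no argument; it reads the size of the current group.
int n() { return from_context("..group_size"); }
```

Because the function reads shared state, it only gives a meaningful answer after the mask has entered a group.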

active and resolved bindings

When doing standard evaluation, we need to install a data mask in which the symbols from the data evaluate to the relevant subset. The simple solution would be to update the data mask at each iteration with subsets for all the variables, but that would be potentially expensive and wasteful, as we might not need all of the variables at a given time, e.g. in this case:

iris %>% group_by(Species) %>% summarise(Sepal.Length = +mean(Sepal.Length))

We only need to materialize Sepal.Length; we don't need the other variables.

DataMask installs an active binding for each variable in one of the environments in the data mask ancestry (the top one). The active binding function is generated by the function below, so that it holds an index and a pointer to the data mask in its enclosure:

.make_active_binding_fun <- function(index, subsets){
  function() {
    materialize_binding(index, subsets)
  }
}

When hit, the active binding calls the materialize_binding function:

// [[Rcpp::export]]
SEXP materialize_binding(int idx, XPtr<DataMaskBase> mask) {
  return mask->materialize(idx);
}

The DataMask<>::materialize(idx) method returns the materialized subset, but also:
- installs the result in the bottom environment of the data mask, so that it masks the active binding. The point is to call the active binding only once.
- remembers that the binding at position idx has been materialized, so that before evaluating the same expression in the next group, it is proactively materialized, because it is very likely that we need the same variables for all groups.

When we move to the next expression to evaluate, DataMask forgets about the materialized bindings so that the active binding can be triggered again as needed.
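The materialize-once, re-materialize-proactively, then forget cycle can be sketched without R environments. LazyMask and all of its members are hypothetical; a std::map plays the role of the bottom environment that masks the active bindings, and columns are plain vectors of doubles.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, simplified mask: a group is a list of row indices.
class LazyMask {
public:
  typedef std::vector<double> Column;
  typedef std::vector<std::size_t> Indices;

  explicit LazyMask(const std::vector<Column>& columns) : columns_(columns) {}

  // Move to another group: drop the cached subsets, then proactively
  // rebuild the ones the current expression has already asked for.
  void update(const Indices& indices) {
    indices_ = indices;
    resolved_.clear();
    for (std::size_t idx : materialized_) materialize(idx);
  }

  // Move to another expression: forget which bindings were used.
  void reset() { materialized_.clear(); resolved_.clear(); }

  // Plays the role of the active binding: subset on first access only.
  const Column& get(std::size_t idx) {
    std::map<std::size_t, Column>::iterator it = resolved_.find(idx);
    return it != resolved_.end() ? it->second : materialize(idx);
  }

  // Number of subsets built so far, to observe the laziness.
  int subset_count() const { return subset_count_; }

private:
  const Column& materialize(std::size_t idx) {
    Column out;
    for (std::size_t row : indices_) out.push_back(columns_[idx][row]);
    ++subset_count_;
    materialized_.insert(idx);               // remember for the next group
    return resolved_[idx] = std::move(out);  // cache masks the lazy lookup
  }

  std::vector<Column> columns_;
  Indices indices_;
  std::map<std::size_t, Column> resolved_;
  std::set<std::size_t> materialized_;
  int subset_count_ = 0;
};
```

In this sketch a column that the expression never touches is never subset, whatever the number of groups.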

use case of the DataMask class

DataMask<SlicedTibble> mask(tbl);
mask.rechain(quosure.env());

hybrid evaluation

Use of DataMask

Hybrid evaluation also uses the DataMask<> class, but it only needs to quickly retrieve the data for an entire column. This is what the maybe_get_subset_binding method does.

  // returns a pointer to the ColumnBinding if it exists
  // this is mostly used by the hybrid evaluation
  const ColumnBinding<SlicedTibble>* maybe_get_subset_binding(const SymbolString& symbol) const {
    int pos = symbol_map.find(symbol);
    if (pos >= 0) {
      return &column_bindings[pos];
    } else {
      return 0;
    }
  }

When the symbol map contains the binding, we get a ColumnBinding<SlicedTibble>*. These objects hold these fields:

  // is this a summary binding, i.e. does it come from summarise
  bool summary;

  // symbol of the binding
  SEXP symbol;

  // data. it is owned either by the original data frame or by the
  // accumulator, so no need for additional protection here
  SEXP data;

Hybrid evaluation only needs summary and data.
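The lookup used by hybrid evaluation can be sketched as a name-to-position map in front of a vector of bindings. ColumnBindingSketch, MaskSketch, add() and maybe_get() are hypothetical names, and the binding keeps only the two fields hybrid evaluation reads.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified binding.
struct ColumnBindingSketch {
  bool summary;              // does the column come from summarise()?
  std::vector<double> data;  // the whole column, not a per-group subset
};

class MaskSketch {
public:
  // Register a column under a name. Pointers returned earlier by
  // maybe_get() may be invalidated by a later add().
  void add(const std::string& name, bool summary, std::vector<double> data) {
    symbol_map_[name] = bindings_.size();
    bindings_.push_back({summary, std::move(data)});
  }

  // Returns a null pointer when the symbol is not a column of the data.
  const ColumnBindingSketch* maybe_get(const std::string& name) const {
    std::unordered_map<std::string, std::size_t>::const_iterator it =
        symbol_map_.find(name);
    return it == symbol_map_.end() ? nullptr : &bindings_[it->second];
  }

private:
  std::unordered_map<std::string, std::size_t> symbol_map_;
  std::vector<ColumnBindingSketch> bindings_;
};
```

Returning a pointer rather than a copy keeps the lookup cheap, and the null case doubles as the "not a column" answer.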

Expression

When attempting to evaluate an expression with the hybrid evaluator, we first construct an Expression object. This class has methods to quickly check if the expression can be managed, e.g.

    // sum( <column> ) and base::sum( <column> )
    if (expression.is_fun(s_sum, s_base, ns_base)) {
      Column x;
      if (expression.is_unnamed(0) && expression.is_column(0, x)) {
        return sum_(data, x, /* na.rm = */ false, op);
      } else {
        return R_UnboundValue;
      }
    }

This checks that the call matches sum(<column>) or base::sum(<column>) where <column> is a column from the data mask.

In that example, the Expression class checks that:
- the first argument is not named
- the first argument is a column from the data

Otherwise, it is an expression that we can't handle, so we return R_UnboundValue, which is how hybrid evaluation signals that it gives up on the expression and that it should be evaluated with standard evaluation.
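The match-or-give-up logic can be sketched on a toy representation of a call. Arg, Call, HybridResult and hybrid_sum() are hypothetical; a handled flag plays the role of R_UnboundValue, and the sketch restricts itself to the one-argument form, which is a simplification of this example rather than a rule taken from dplyr.

```cpp
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical, simplified call: function name plus (name, symbol) arguments.
struct Arg { std::string name; std::string symbol; };
struct Call { std::string fun; std::vector<Arg> args; };

typedef std::map<std::string, std::vector<double> > Columns;

// handled == false plays the role of R_UnboundValue:
// "not handled here, fall back to standard evaluation".
struct HybridResult { bool handled; double value; };

HybridResult give_up() { return HybridResult{false, 0.0}; }

HybridResult hybrid_sum(const Call& call, const Columns& data) {
  // sum( <column> ) and base::sum( <column> )
  if (call.fun != "sum" && call.fun != "base::sum") return give_up();
  if (call.args.size() != 1) return give_up();       // simplification
  const Arg& first = call.args[0];
  if (!first.name.empty()) return give_up();         // argument is named
  Columns::const_iterator col = data.find(first.symbol);
  if (col == data.end()) return give_up();           // not a column
  double total = std::accumulate(col->second.begin(), col->second.end(), 0.0);
  return HybridResult{true, total};
}
```

Every failed check leads to the same give-up value, so the caller needs a single test to decide whether to fall back.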

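Expression has methods such as is_fun(), is_unnamed() and is_column(), all used in the snippet above. The session below shows a user-defined n() taking precedence: summarise() returns that function's result (42) rather than the group size.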

> n <- function() 42
> summarise(iris, nn = n())
  nn
1 42

hybrid_do

The hybrid_do function uses methods from Expression to quickly assess if it can handle the expression and then calls the relevant function from dplyr::hybrid:: to create the result at once:

    if (expression.is_fun(s_sum, s_base, ns_base)) {
      // sum( <column> ) and base::sum( <column> )
      Column x;
      if (expression.is_unnamed(0) && expression.is_column(0, x)) {
        return sum_(data, x, /* na.rm = */ false, op);
      }
    } else if (expression.is_fun(s_mean, s_base, ns_base)) {
      // mean( <column> ) and base::mean( <column> )

      Column x;
      if (expression.is_unnamed(0) && expression.is_column(0, x)) {
        return mean_(data, x, false, op);
      }
    } else if ...

The functions in the C++ dplyr::hybrid:: namespace create objects whose classes hold:
- the type of output they create
- the information they need (e.g. the column, the value of na.rm, ...)

These classes all have these methods:
- summarise() to return a result of the same size as the number of groups. This is used when op is a Summary. This returns R_UnboundValue to give up when the class can't do that, e.g. the classes behind lag.
- window() to return a result of the same size as the number of rows in the original data set.

The classes typically don't provide these methods directly, but rather inherit them, via CRTP, from a base class such as HybridVectorScalarResult, so that the class only has to provide a process method, for example the Count class:

template <typename SlicedTibble>
class Count : public HybridVectorScalarResult<INTSXP, SlicedTibble, Count<SlicedTibble> > {
public:
  typedef HybridVectorScalarResult<INTSXP, SlicedTibble, Count<SlicedTibble> > Parent ;

  Count(const SlicedTibble& data) : Parent(data) {}

  int process(const typename SlicedTibble::slicing_index& indices) const {
    return indices.size();
  }
} ;
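How a CRTP base can derive both methods from a single process() can be sketched in standalone C++. ScalarResultSketch and CountSketch are hypothetical and return plain vectors of int; the sketch also assumes the groups partition the rows 0 to nrows - 1.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::size_t> Indices;  // row numbers of one group

// Hypothetical CRTP base: Impl only has to provide process(indices).
template <typename Impl>
class ScalarResultSketch {
public:
  explicit ScalarResultSketch(const std::vector<Indices>& groups)
    : groups_(groups) {}

  // One value per group.
  std::vector<int> summarise() const {
    std::vector<int> out;
    for (const Indices& g : groups_) out.push_back(self().process(g));
    return out;
  }

  // One value per row: the group's value is repeated for each of its rows.
  std::vector<int> window() const {
    std::size_t nrows = 0;
    for (const Indices& g : groups_) nrows += g.size();
    std::vector<int> out(nrows);
    for (const Indices& g : groups_) {
      int value = self().process(g);
      for (std::size_t row : g) out[row] = value;
    }
    return out;
  }

private:
  // Static downcast: no virtual call is needed to reach process().
  const Impl& self() const { return static_cast<const Impl&>(*this); }
  std::vector<Indices> groups_;
};

// Counterpart of Count: the value of a group is its size.
class CountSketch : public ScalarResultSketch<CountSketch> {
public:
  explicit CountSketch(const std::vector<Indices>& groups)
    : ScalarResultSketch<CountSketch>(groups) {}

  int process(const Indices& indices) const {
    return static_cast<int>(indices.size());
  }
};
```

With groups {0, 2} and {1, 3, 4}, summarise() yields 2 and 3, and window() yields 2, 3, 2, 3, 3.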

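HybridVectorScalarResult uses the result of process in both summarise() and window(). Classes that produce one value per row provide a fill method instead, through HybridVectorVectorResult, for example the Ntile1 class: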

template <typename SlicedTibble>
class Ntile1 : public HybridVectorVectorResult<INTSXP, SlicedTibble, Ntile1<SlicedTibble> > {
public:
  typedef HybridVectorVectorResult<INTSXP, SlicedTibble, Ntile1> Parent;

  Ntile1(const SlicedTibble& data, int ntiles_): Parent(data), ntiles(ntiles_) {}

  void fill(const typename SlicedTibble::slicing_index& indices, Rcpp::IntegerVector& out) const {
    int m = indices.size();
    for (int j = m - 1; j >= 0; j--) {
      out[ indices[j] ] = (int)floor((ntiles * j) / m) + 1;
    }
  }

private:
  int ntiles;
};

The result of fill is only used in window(). The summarise() method simply returns R_UnboundValue to give up.
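The arithmetic in fill can be checked on plain vectors. fill_ntile() and ntile_window() are hypothetical names; the formula is the one from Ntile1::fill, written with integer division, which gives the same result as floor() here because both operands are non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::size_t> Indices;  // row numbers of one group

// Same arithmetic as Ntile1::fill: position j within a group of m rows
// goes to tile (ntiles * j) / m + 1.
void fill_ntile(const Indices& indices, int ntiles, std::vector<int>& out) {
  int m = static_cast<int>(indices.size());
  for (int j = m - 1; j >= 0; j--) {
    out[indices[j]] = (ntiles * j) / m + 1;
  }
}

// window(): one value per row, filled group by group.
// Assumes every index in groups is smaller than nrows.
std::vector<int> ntile_window(const std::vector<Indices>& groups,
                              std::size_t nrows, int ntiles) {
  assert(ntiles > 0);
  std::vector<int> out(nrows, 0);
  for (const Indices& g : groups) fill_ntile(g, ntiles, out);
  return out;
}
```

For a single group of five rows and ntiles = 2, positions j = 0 to 4 give (2 * j) / 5 = 0, 0, 0, 1, 1, hence tiles 1, 1, 1, 2, 2.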



olascodgreat/samife documentation built on May 13, 2019, 6:11 p.m.