title: "origami: A Generalized Framework for Cross-Validation in R" tags: - "R" - "statistics" - "machine learning" - "cross-validation" authors: - name: "Jeremy R. Coyle" orcid: 0000-0002-9874-6649 affiliation: 1 - name: "Nima S. Hejazi" orcid: 0000-0002-7127-2789 affiliation: 1 affiliations: - name: "Division of Biostatistics, University of California, Berkeley" index: 1 date: "19 June 2017" bibliography: paper.bib
Cross-validation is an essential tool for evaluating how any given data analytic procedure extends from a sample to the target population from which the sample is derived. It has seen widespread application in all facets of statistics, perhaps most notably statistical machine learning. When used for model selection, cross-validation has powerful optimality properties [@Vaart:2006bz; @vanderLaan:2007bz].
Cross-validation works by partitioning a sample into complementary subsets,
applying a particular data analytic (statistical) routine on a subset (the
"training" set), and evaluating the routine of choice on the complementary
subset (the "testing" set). This procedure is repeated across multiple
partitions of the data, and a variety of different partitioning schemes exist,
such as $V$-fold cross-validation and bootstrap cross-validation. origami
, a
package for the R language for statistical computing [@R], supports many of the
existing cross-validation schemes, providing a suite of tools that generalize
the application of cross-validation to arbitrary data analytic procedures.
The main function in the origami
R package is cross_validate
. To start off,
the user must define folds and a function that operates on each fold. Once
these are passed to cross_validate
, this function will map the fold-specific
function across the folds, combining the results in a reasonable way. Specific
details on each each step of this process are given below.
The folds
object passed to cross_validate
is a list of folds. Such lists can
be generated using the make_folds
function. Each fold consists of a list
with a training
index vector, a validation
index vector, and a fold_index
(its order in the list of folds). This function supports a variety of
cross-validation schemes including $V$-fold and bootstrap cross-validation, as
well as time series methods like "rolling window". See [@vanderLaan:2007bz]
for formal definitions of these schemes. make_folds
can balance across levels
of a variable (stratify_ids
), and it can also keep all observations from the
same independent unit together (cluster
). We invite interested users to
consult the documentation of the make_folds
function for further details.
The cv_fun
argument to cross_validate
is a function that will perform some
operation on each fold. The first argument to this function must be fold
,
which will receive an individual fold object to operate on. Additional arguments
can be passed to cv_fun
using the ...
argument to cross_validate
. Within
this function, the convenience functions training
, validation
and
fold_index
can return the various components of a fold object. If training
or validation
is passed an object, it will index into it in a sensible way.
For instance, if it is a vector, it will index the vector directly. If it is a
data.frame
or matrix
, it will index rows. This allows the user to easily
partition data into training and validation sets. This fold function must return
a named list of results containing whatever fold-specific outputs are generated.
cross-validate
After defining folds, cross_validate
can be used to map the cv_fun
across
the folds
using future_lapply
. This means that it can be easily parallelized
by specifying a parallelization scheme (i.e., a plan
). See the future
package for further details.
The application of cross_validate
generates a list of results. As described
above, each call to cv_fun
itself returns a list of results, with different
elements for each type of result we care about. The main loop generates a list
of these individual lists of results (a sort of "meta-list"). This "meta-list"
is then inverted such that there is one element per result type (this too is a
list of the results for each fold). By default, combine_results
is used to
combine these results type lists in a sensible manner. How results are combined
is determined automatically by examining the data types of the results from the
first fold. This can be modified by specifying a list of arguments to
.combine_control
. See the help for combine_results
for more details. In
most cases, the defaults should suffice.
\newpage
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.