lol.xval.eval: Embedding Cross Validation


View source: R/xval.R

Description

A function for performing leave-one-out (or k-fold) cross-validation for a given embedding model. This function produces fold-wise cross-validated misclassification rates for standard embedding techniques. Users can optionally specify custom embedding techniques by properly configuring the alg.* parameters and hyperparameters. Any classifier implementing the S3 predict method can be used to compute the misclassification rate, with its hyperparameters specified via the classifier.* parameters.

Usage

lol.xval.eval(X, Y, r, alg, sets = NULL, alg.dimname = "r",
  alg.opts = list(), alg.embedding = "A", classifier = lda,
  classifier.opts = list(), classifier.return = "class", k = "loo", ...)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the samples with K unique labels.

r

the number of embedding dimensions desired, where r <= d.

alg

the algorithm to use for embedding. Should be a function that accepts inputs X and Y and has a parameter named by alg.dimname if alg is supervised, or just X and the alg.dimname parameter if alg is unsupervised. This algorithm should return a list containing a matrix that embeds from d to r <= d dimensions.
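As a sketch of this interface, a custom supervised embedding might look like the following (my.embed is illustrative, not part of lolR); it accepts X, Y, and a parameter matching the default alg.dimname of "r", and returns a list whose "A" attribute holds the [d, r] embedding matrix:

```r
# Illustrative custom embedding via PCA (ignores Y); not part of lolR.
# Accepts X, Y, and "r" (the default alg.dimname), and returns a list
# with the [d, r] embedding matrix under "A" (the default alg.embedding).
my.embed <- function(X, Y, r) {
  Xc <- sweep(X, 2, colMeans(X))   # center each column
  A <- svd(Xc, nu = 0, nv = r)$v   # top-r right singular vectors, [d, r]
  list(A = A)
}
# then, for example: xval.fit <- lol.xval.eval(X, Y, r=5, my.embed)
```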

sets

a user-defined cross-validation set. Defaults to NULL.

  • is.null(sets) randomly partition the inputs X and Y into training and testing sets.

  • !is.null(sets) use a user-defined partitioning of the inputs X and Y into training and testing sets. Should be in the format of the outputs from lol.xval.split. That is, a list with each element containing X.train, an [n-k][d] subset of data to train the model on; Y.train, an [n-k] subset of class labels for X.train; X.test, a [k][d] subset of data to test the model on; and Y.test, a [k] subset of class labels for X.test.

alg.dimname

the name of the parameter accepted by alg for indicating the embedding dimensionality desired. Defaults to "r".

alg.opts

the hyper-parameter options you want to pass into your algorithm, as a keyworded list. Defaults to list(), or no hyper-parameters.

alg.embedding

the attribute returned by alg containing the embedding matrix. Defaults to assuming that alg returns the embedding matrix in an attribute named "A".

  • !is.nan(alg.embedding) Assumes that alg will return a list containing an attribute, alg.embedding, a [d, r] matrix that embeds [n, d] data from [d] to [r < d] dimensions.

  • is.nan(alg.embedding) Assumes that alg returns a [d, r] matrix that embeds [n, d] data from [d] to [r < d] dimensions.
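As a sketch of the second case: if an algorithm returns the [d, r] matrix directly rather than inside a list, pass alg.embedding=NaN (my.embed.raw is illustrative, not part of lolR):

```r
# Illustrative embedding that returns a bare [d, r] matrix, not a list.
my.embed.raw <- function(X, Y, r) {
  svd(sweep(X, 2, colMeans(X)), nu = 0, nv = r)$v
}
# xval.fit <- lol.xval.eval(X, Y, r=5, my.embed.raw, alg.embedding=NaN)
```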

classifier

the classifier to use for assessing performance. The classifier should accept X, an [n, d] array, and Y, an [n] array of labels, as its first two arguments. The class should implement a predict method, predict.classifier, compatible with the stats::predict S3 generic. Defaults to MASS::lda.

classifier.opts

any extraneous options to be passed to the classifier function, as a list. Defaults to an empty list.

classifier.return

if the return type of stats::predict is a list, the attribute containing the prediction labels. Defaults to "class", the attribute returned by MASS::lda.

  • !is.nan(classifier.return) Assumes that predict.classifier will return a list containing an attribute, classifier.return, that encodes the predicted labels.

  • is.nan(classifier.return) Assumes that predict.classifier returns a [n] vector/array containing the prediction labels for [n, d] inputs.
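A minimal custom classifier fitting this interface might be sketched as follows (nearest.mean and its predict method are illustrative, not part of lolR). Since its predict method returns a bare label vector, it would be used with classifier.return=NaN:

```r
# Illustrative nearest-class-mean classifier; not part of lolR.
nearest.mean <- function(X, Y, ...) {
  mus <- apply(X, 2, function(col) tapply(col, Y, mean))  # [K, d] class means
  structure(list(mus = mus), class = "nearest.mean")
}
# S3 predict method returning a bare [n] vector of labels, so pass
# classifier.return=NaN to lol.xval.eval.
predict.nearest.mean <- function(object, X, ...) {
  K <- nrow(object$mus)
  D <- as.matrix(dist(rbind(object$mus, X)))   # pairwise distances
  D <- D[-(1:K), 1:K, drop = FALSE]            # test-point-to-mean distances
  rownames(object$mus)[apply(D, 1, which.min)]
}
```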

k

the cross-validated method to perform. Defaults to 'loo'. If sets is provided, this option is ignored. See lol.xval.split for details.

  • 'loo' Leave-one-out cross validation

  • is.integer(k) perform k-fold cross-validation with k as the number of folds.

...

trailing args.

rank

whether to force the training set to low-rank. Defaults to FALSE. If sets is provided, this option is ignored. See lol.xval.split for details.

  • if rank == FALSE, uses the default cross-validation method with standard k-fold validation. Training sets are k-1 folds and testing sets are 1 fold, where the fold held out for testing is rotated to ensure no dependence of potential downstream inference on the cross-validated misclassification rates.

  • if rank == TRUE, uses a cross-validation method with ntrain = min((k-1)/k*n, d) sample training sets, where d is the number of dimensions in X. This ensures that the training data is always low-rank, with ntrain < d + 1. Note that the resulting training sets may have ntrain < (k-1)/k*n, but the resulting testing sets will always be properly rotated with ntest = n/k to ensure no dependencies in fold-wise testing.
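The rank option is passed through the trailing arguments to the fold construction; a sketch of forcing low-rank training folds (assuming X, Y, and r are defined as in the Examples below):

```r
# Force low-rank training sets during 10-fold cross-validation; rank is
# forwarded through ... (see lol.xval.split for details).
xval.fit <- lol.xval.eval(X, Y, r, lol.project.lol, k=10, rank=TRUE)
```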

Value

Returns a list containing:

lhat

the mean cross-validated error.

model

The model returned by alg computed on all of the data.

classifier

The classifier trained on all of the embedded data.

lhats

the cross-validated error for each of the k-folds.

Details

For more details see the help vignette: vignette("xval", package = "lolR")

For extending cross-validation techniques shown here to arbitrary embedding algorithms, see the vignette: vignette("extend_embedding", package = "lolR")

For extending cross-validation techniques shown here to arbitrary classification algorithms, see the vignette: vignette("extend_classification", package = "lolR")

Author(s)

Eric Bridgeford

Examples

# train model and analyze with loo validation using the nearestCentroid classifier
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
r=5  # embed into r=5 dimensions
# run cross-validation with the nearestCentroid method and
# leave-one-out cross-validation, which returns only
# prediction labels so we specify classifier.return as NaN
xval.fit <- lol.xval.eval(X, Y, r, lol.project.lol,
                          classifier=lol.classify.nearestCentroid,
                          classifier.return=NaN, k='loo')

# train model and analyze with 5-fold validation using lda classifier
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
xval.fit <- lol.xval.eval(X, Y, r, lol.project.lol, k=5)

# pass in existing cross-validation sets
sets <- lol.xval.split(X, Y, k=2)
xval.fit <- lol.xval.eval(X, Y, r, lol.project.lol, sets=sets)

neurodata/lol documentation built on Oct. 17, 2018, 8:58 a.m.