lol.xval.optimal_dimselect: Optimal Cross-Validated Number of Embedding Dimensions


View source: R/xval.R

Description

A function for performing leave-one-out cross-validation for a given embedding model, that allows users to determine the optimal number of embedding dimensions for their algorithm-of-choice. This function produces fold-wise cross-validated misclassification rates for standard embedding techniques across a specified selection of embedding dimensions. Optimal embedding dimension is selected as the dimension with the lowest average misclassification rate across all folds. Users can optionally specify custom embedding techniques with proper configuration of alg.* parameters and hyperparameters. Optional classifiers implementing the S3 predict function can be used for classification, with hyperparameters to classifiers for determining misclassification rate specified in classifier.*.

Usage

lol.xval.optimal_dimselect(
  X,
  Y,
  rs,
  alg,
  sets = NULL,
  alg.dimname = "r",
  alg.opts = list(),
  alg.embedding = "A",
  alg.structured = TRUE,
  classifier = lda,
  classifier.opts = list(),
  classifier.return = "class",
  k = "loo",
  rank.low = FALSE,
  ...
)

Arguments

X

[n, d] the data with n samples in d dimensions.

Y

[n] the labels of the samples with K unique labels.

rs

[r.n] the embedding dimensions to investigate over, where max(rs) <= d.

alg

the algorithm to use for embedding. Should be a function that accepts inputs X and Y and an embedding dimension r if alg is supervised, or just X and an embedding dimension r if alg is unsupervised. This algorithm should return a list containing a matrix that embeds from d to r < d dimensions.
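For illustration, a minimal unsupervised embedding in this shape can be sketched with base R's prcomp (the function name pca.embed is hypothetical; lolR ships its own embedding algorithms such as lol.project.lol):

```r
# Hypothetical unsupervised embedding satisfying the alg contract:
# accepts X and an embedding dimension r, and returns a list whose
# "A" element is the [d, r] projection matrix (the default alg.embedding).
pca.embed <- function(X, r) {
  pc <- prcomp(X, center = TRUE)
  list(A = pc$rotation[, 1:r, drop = FALSE])
}

# embed [n=100, d=5] data down to r = 2 dimensions
X <- matrix(rnorm(100 * 5), nrow = 100)
fit <- pca.embed(X, r = 2)
Xr <- X %*% fit$A  # [100, 2] embedded data
```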

sets

a user-defined cross-validation set. Defaults to NULL.

  • is.null(sets) randomly partition the inputs X and Y into training and testing sets.

  • !is.null(sets) use a user-defined partitioning of the inputs X and Y into training and testing sets. Should be in the format of the outputs from lol.xval.split; that is, a list with each element containing X.train, an [n-k][d] subset of data to train on; Y.train, an [n-k] subset of class labels for X.train; X.test, a [k][d] subset of data to test the model on; and Y.test, a [k] subset of class labels for X.test.

alg.dimname

the name of the parameter accepted by alg for indicating the embedding dimensionality desired. Defaults to r.

alg.opts

the hyper-parameter options to pass to your algorithm as a keyworded list. Defaults to list(), or no hyper-parameters. This should not include the number of embedding dimensions, r, which are passed separately in the rs vector.

alg.embedding

the attribute returned by alg containing the embedding matrix. Defaults to assuming that alg returns an embedding matrix as "A".

  • !is.nan(alg.embedding) Assumes that alg will return a list containing an attribute, alg.embedding, a [d, r] matrix that embeds [n, d] data from [d] to [r < d] dimensions.

  • is.nan(alg.embedding) Assumes that alg returns a [d, r] matrix that embeds [n, d] data from [d] to [r < d] dimensions.

alg.structured

a boolean to indicate whether the embedding matrix is structured. Provides performance increase by not having to compute the embedding matrix xv times if unnecessary. Defaults to TRUE.

  • TRUE assumes that if Ar: R^d -> R^r embeds from d to r dimensions and Aq: R^d -> R^q from d to q > r dimensions, that Aq[, 1:r] == Ar,

  • FALSE assumes that if Ar: R^d -> R^r embeds from d to r dimensions and Aq: R^d -> R^q from d to q > r dimensions, that Aq[, 1:r] != Ar.

classifier

the classifier to use for assessing performance. The classifier should accept X, an [n, d] array, and Y, an [n] array of labels, as its first two arguments. The class should implement a predict function, predict.classifier, that is compatible with the stats::predict S3 method. Defaults to MASS::lda.

classifier.opts

any extraneous options to be passed to the classifier function, as a list. Defaults to an empty list.

classifier.return

if the return type is a list, classifier.return encodes the attribute containing the prediction labels from stats::predict. Defaults to "class", the return type of MASS::lda.

  • !is.nan(classifier.return) Assumes that predict.classifier will return a list containing an attribute, classifier.return, that encodes the predicted labels.

  • is.nan(classifier.return) Assumes that predict.classifier returns a [n] vector/array containing the prediction labels for [n, d] inputs.
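As a sketch of this contract, the hypothetical nearest-centroid classifier below implements an S3 predict method that returns a list with a class attribute, matching the default classifier.return = "class" (the names centroid.fit and predict.centroid are illustrative, not part of lolR):

```r
# Hypothetical classifier obeying the contract: fit on X ([n, d]) and
# Y ([n]), with an S3 predict method returning list(class = ...).
centroid.fit <- function(X, Y) {
  labels <- sort(unique(Y))
  centroids <- t(sapply(labels, function(k) colMeans(X[Y == k, , drop = FALSE])))
  structure(list(centroids = centroids, labels = labels), class = "centroid")
}

predict.centroid <- function(object, newdata, ...) {
  # squared Euclidean distance from each row of newdata to each centroid
  d2 <- apply(object$centroids, 1, function(mu)
    rowSums(sweep(newdata, 2, mu)^2))
  list(class = object$labels[max.col(-d2)])  # nearest centroid per row
}

# two well-separated classes
X <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2))
Y <- rep(c(1, 2), each = 25)
yhat <- predict(centroid.fit(X, Y), X)$class
```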

k

the cross-validated method to perform. Defaults to 'loo'. If sets is provided, this option is ignored. See lol.xval.split for details.

  • 'loo' Leave-one-out cross validation

  • is.integer(k) perform k-fold cross-validation with k as the number of folds.

rank.low

whether to force the training set to low-rank. Defaults to FALSE. If sets is provided, this option is ignored. See lol.xval.split for details.

  • if rank.low == FALSE, uses the default cross-validation method with standard k-fold validation. Training sets are k-1 folds, and testing sets are 1 fold, where the fold held out for testing is rotated to ensure no dependence of potential downstream inference on the cross-validated misclassification rates.

  • if rank.low == TRUE, uses a cross-validation method with ntrain = min((k-1)/k*n, d) sample training sets, where d is the number of dimensions in X. This ensures that the training data is always low-rank, with ntrain < d + 1. Note that the resulting training sets may have ntrain < (k-1)/k*n, but the resulting testing sets will always be properly rotated with ntest = n/k to ensure no dependencies in fold-wise testing.
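A worked example of these set sizes, assuming n = 200 samples, d = 30 dimensions, and k = 5 folds:

```r
n <- 200; d <- 30; k <- 5
ntrain <- min((k - 1) / k * n, d)  # min(160, 30) = 30, so training stays low-rank
ntest  <- n / k                    # 40 samples held out per fold
```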

...

trailing args.

Value

Returns a list containing:

folds.data

the results, as a data-frame, of the per-fold classification accuracy.

foldmeans.data

the results, as a data-frame, of the average classification accuracy for each r.

optimal.lhat

the classification error of the optimal r.

optimal.r

the optimal number of embedding dimensions from rs.

model

the model trained on all of the data at the optimal number of embedding dimensions.

classifier

the classifier trained on all of the data at the optimal number of embedding dimensions.

Details

For more details see the help vignette: vignette("xval", package = "lolR")

For extending cross-validation techniques shown here to arbitrary embedding algorithms, see the vignette: vignette("extend_embedding", package = "lolR")

For extending cross-validation techniques shown here to arbitrary classification algorithms, see the vignette: vignette("extend_classification", package = "lolR")

Author(s)

Eric Bridgeford

Examples

# train model and analyze with loo validation using the nearestCentroid classifier
library(lolR)
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
# run cross-validation with the nearestCentroid method and
# leave-one-out cross-validation, which returns only
# prediction labels so we specify classifier.return as NaN
xval.fit <- lol.xval.optimal_dimselect(X, Y, rs=c(5, 10, 15), lol.project.lol,
                          classifier=lol.classify.nearestCentroid,
                          classifier.return=NaN, k='loo')

# train model and analyze with 5-fold validation using lda classifier
data <- lol.sims.rtrunk(n=200, d=30)  # 200 examples of 30 dimensions
X <- data$X; Y <- data$Y
xval.fit <- lol.xval.optimal_dimselect(X, Y, rs=c(5, 10, 15), lol.project.lol, k=5)

# pass in existing cross-validation sets
sets <- lol.xval.split(X, Y, k=2)
xval.fit <- lol.xval.optimal_dimselect(X, Y, rs=c(5, 10, 15), lol.project.lol, sets=sets)

neurodata/lol documentation built on March 3, 2021, 1:46 a.m.