cv.multiview: Perform k-fold cross-validation for cooperative learning

View source: R/cv.multiview.R


Perform k-fold cross-validation for cooperative learning

Description

Does k-fold cross-validation (CV) for multiview and produces a CV curve.

Usage

cv.multiview(
  x_list,
  y,
  family = gaussian(),
  rho = 0,
  weights = NULL,
  offset = NULL,
  lambda = NULL,
  type.measure = c("default", "mse", "deviance", "class", "auc", "mae", "C"),
  nfolds = 10,
  foldid = NULL,
  alignment = c("lambda", "fraction"),
  grouped = TRUE,
  keep = FALSE,
  trace.it = 0,
  ...
)

Arguments

x_list

a list of x matrices with the same number of rows nobs

y

the quantitative response with length equal to nobs, the (same) number of rows in each x matrix

family

A description of the error distribution and link function to be used in the model. This is the result of a call to a family function. Default is stats::gaussian. (See stats::family for details on family functions.)

rho

the weight on the agreement penalty, default 0. rho=0 is a form of early fusion, and rho=1 is a form of late fusion. We recommend trying a few values of rho including 0, 0.1, 0.25, 0.5, and 1 first; sometimes rho larger than 1 can also be helpful.

weights

Observation weights; defaults to 1 per observation

offset

Offset vector (matrix) as in multiview

lambda

A user-supplied lambda sequence, default NULL. Typical usage is to have the program compute its own lambda sequence. This sequence, in general, is different from that used in the glmnet::glmnet() call (named lambda). Note that this is done for the full model (master sequence), and separately for each fold. The fits are then aligned using the glmnet lambda sequence associated with the master sequence (see the alignment argument for additional details). Adapting lambda for each fold leads to better convergence. When lambda is supplied, the same sequence is used everywhere, which in some GLMs can lead to convergence issues.
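
For example, a minimal sketch of supplying your own decreasing lambda sequence; the simulated data and the specific grid below are illustrative only, not a recommendation:

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)
z <- matrix(rnorm(100 * 20), 100, 20)
y <- rowSums(x[, 1:3]) + rowSums(z[, 1:3]) + rnorm(100)
user_lambda <- exp(seq(log(1), log(0.01), length.out = 30))
cvfit <- cv.multiview(list(x = x, z = z), y, rho = 0.3, lambda = user_lambda)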

type.measure

loss to use for cross-validation. Currently six options, not all available for all models. The default is type.measure="deviance", which uses squared-error for gaussian models (a.k.a. type.measure="mse" there), deviance for logistic and poisson regression, and partial-likelihood for the Cox model. type.measure="class" applies to binomial and multinomial logistic regression only, and gives misclassification error. type.measure="auc" is for two-class logistic regression only, and gives area under the ROC curve. type.measure="mse" or type.measure="mae" (mean absolute error) can be used by all models except the "cox"; they measure the deviation of the fitted mean from the response. type.measure="C" is Harrell's concordance measure, only available for cox models.
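
For example, a brief sketch of requesting AUC for a binary outcome (reusing x, z, and y from the sketch under lambda above; the binary response by and nfolds = 5 are chosen here purely for illustration):

by <- 1 * (y > median(y))
cvfit_auc <- cv.multiview(list(x = x, z = z), by, family = binomial(),
                          type.measure = "auc", nfolds = 5, rho = 0.3)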

nfolds

number of folds - default is 10. Although nfolds can be as large as the sample size (leave-one-out CV), it is not recommended for large datasets. Smallest value allowable is nfolds=3

foldid

an optional vector of values between 1 and nfolds identifying what fold each observation is in. If supplied, nfolds can be missing.
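
For example, a minimal sketch of supplying your own fold assignments (x, z, and y as in the sketch under lambda above):

foldid <- sample(rep(seq(10), length.out = nrow(x)))
cvfit_f <- cv.multiview(list(x = x, z = z), y, rho = 0.3, foldid = foldid)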

alignment

This is an experimental argument, designed to fix the problems users were having with CV, with possible values "lambda" (the default) or "fraction". With "lambda", the lambda values from the master fit (on all the data) are used to line up the predictions from each of the folds. In some cases this can give strange values, since the effective lambda values in each fold could be quite different. With "fraction", we line up the predictions in each fold according to the fraction of progress along the regularization path. If a lambda argument is also provided in the call, alignment="fraction" is ignored (with a warning).
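
For example, a one-line sketch of switching the alignment (x, z, and y as in the sketch under lambda above):

cvfit_a <- cv.multiview(list(x = x, z = z), y, rho = 0.3, alignment = "fraction")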

grouped

This is an experimental argument, with default TRUE, and can be ignored by most users. For all models except the "cox", this refers to computing nfolds separate statistics, and then using their mean and estimated standard error to describe the CV curve. If grouped=FALSE, an error matrix is built up at the observation level from the predictions from the nfolds fits, and then summarized (does not apply to type.measure="auc"). For the "cox" family, grouped=TRUE obtains the CV partial likelihood for the Kth fold by subtraction: the log partial likelihood evaluated on the (K-1)/K dataset is subtracted from that evaluated on the full dataset. This makes more efficient use of risk sets. With grouped=FALSE, the log partial likelihood is computed only on the Kth fold.

keep

If keep=TRUE, a prevalidated array is returned containing fitted values for each observation and each value of lambda. This means these fits are computed with this observation and the rest of its fold omitted. The foldid vector is also returned. Default is keep=FALSE.
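
For example, a brief sketch of inspecting the prevalidated fits (x, z, and y as in the sketch under lambda above):

cvfit_k <- cv.multiview(list(x = x, z = z), y, rho = 0.3, keep = TRUE)
dim(cvfit_k$fit.preval)   # one row per observation, one column per lambda
table(cvfit_k$foldid)     # the fold assignments that were used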

trace.it

If trace.it=1, then progress bars are displayed; useful for big models that take a long time to fit.

...

Other arguments that can be passed to multiview

Details

The current code can be slow for "large" data sets, e.g. when the number of features is larger than 1000. It can be helpful to see the progress of multiview as it runs; to do this, set trace.it = 1 in the call to multiview or cv.multiview. With this, multiview prints out its progress along the way. One can also pre-filter the features to a smaller set, using the exclude option, with a filter function.
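
For example, a one-line sketch of turning on the progress display (x, z, and y as in the sketch under the lambda argument above):

cvfit_t <- cv.multiview(list(x = x, z = z), y, rho = 0.3, trace.it = 1)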

If there are missing values in the feature matrices: we recommend that you center the columns of each feature matrix, and then fill in the missing values with 0.

For example,
x <- scale(x, TRUE, FALSE)   # center the columns of x
x[is.na(x)] <- 0             # then fill missing values with 0
z <- scale(z, TRUE, FALSE)   # center the columns of z
z[is.na(z)] <- 0             # then fill missing values with 0

Then run multiview in the usual way. It will exploit the assumed shared latent factors to make efficient use of the available data.

The function runs multiview nfolds+1 times; the first run is to get the lambda sequence, and the remaining runs compute the fit with each of the folds omitted. The error is accumulated, and the average error and standard deviation over the folds are computed. Note that cv.multiview does NOT search for values of rho. A specific value should be supplied, else rho=0 is assumed by default. If users would like to cross-validate rho as well, they should call cv.multiview with a pre-computed vector foldid, and then use this same fold vector in separate calls to cv.multiview with different values of rho, as in the sketch below.
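
For example, a hedged sketch of cross-validating rho with a common foldid (x, z, and y as in the sketch under the lambda argument above; the grid of rho values is illustrative only):

foldid <- sample(rep(seq(10), length.out = nrow(x)))
rhos <- c(0, 0.1, 0.25, 0.5, 1)
cvfits <- lapply(rhos, function(r)
  cv.multiview(list(x = x, z = z), y, rho = r, foldid = foldid))
sapply(cvfits, function(f) min(f$cvm))   # smallest CV error achieved at each rho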

Value

an object of class "cv.multiview" is returned, which is a list with the ingredients of the cross-validation fit.

lambda

the values of lambda used in the fits.

cvm

The mean cross-validated error - a vector of length length(lambda).

cvsd

estimate of standard error of cvm.

cvup

upper curve = cvm+cvsd.

cvlo

lower curve = cvm-cvsd.

nzero

number of non-zero coefficients at each lambda.

name

a text string indicating type of measure (for plotting purposes).

multiview.fit

a fitted multiview object for the full data.

lambda.min

value of lambda that gives minimum cvm.

lambda.1se

largest value of lambda such that error is within 1 standard error of the minimum.

fit.preval

if keep=TRUE, this is the array of prevalidated fits. Some entries can be NA if that and subsequent values of lambda are not reached for that fold.

foldid

if keep=TRUE, the fold assignments used.

index

a one-column matrix with the indices of lambda.min and lambda.1se in the sequence of coefficients, fits, etc.
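
For example, a brief sketch of pulling out a few of these components from a fitted cv.multiview object (cvfit as in the sketch under the lambda argument above):

cvfit$lambda.min          # lambda giving the minimum CV error
cvfit$lambda.1se          # largest lambda within 1 SE of the minimum
cvfit$cvm[cvfit$index]    # CV error at lambda.min and lambda.1se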

Examples

# Gaussian
# Generate data based on a factor model
set.seed(1)
x = matrix(rnorm(100*20), 100, 20)
z = matrix(rnorm(100*20), 100, 20)
U = matrix(rnorm(100*5), 100, 5)
for (m in seq(5)) {
    u = rnorm(100)
    x[, m] = x[, m] + u
    z[, m] = z[, m] + u
    U[, m] = U[, m] + u
}
x = scale(x, center = TRUE, scale = FALSE)
z = scale(z, center = TRUE, scale = FALSE)
beta_U = c(rep(0.1, 5))
y = U %*% beta_U + 0.1 * rnorm(100)
fit1 = cv.multiview(list(x=x,z=z), y, rho = 0.3)

# plot the cross-validation curve
plot(fit1)

# extract coefficients
coef(fit1, s="lambda.min")

# extract ordered coefficients
coef_ordered(fit1, s="lambda.min")

# make predictions
predict(fit1, newx = list(x[1:5, ],z[1:5,]), s = "lambda.min")

# Binomial

by = 1 * (y > median(y)) 
fit2 = cv.multiview(list(x=x,z=z), by, family = binomial(), rho = 0.9)
predict(fit2, newx = list(x[1:5, ],z[1:5,]), s = "lambda.min", type = "response")
plot(fit2)
coef(fit2, s="lambda.min")
coef_ordered(fit2, s="lambda.min")

# Poisson
py = matrix(rpois(100, exp(y))) 
fit3 = cv.multiview(list(x=x,z=z), py, family = poisson(), rho = 0.6)
predict(fit3, newx = list(x[1:5, ],z[1:5,]), s = "lambda.min", type = "response") 
plot(fit3)
coef(fit3, s="lambda.min")
coef_ordered(fit3, s="lambda.min")

