cv.ses: Cross-Validation for SES and MMPC

Description

The function performs a k-fold cross-validation for identifying the best values for the SES and MMPC 'max_k' and 'threshold' hyper-parameters.

Usage

cv.ses(target, dataset, wei = NULL, kfolds = 10, folds = NULL, 
alphas = c(0.1, 0.05, 0.01), max_ks = c(3, 2), task = NULL, 
metric = NULL, metricbbc = NULL, modeler = NULL, ses_test = NULL, 
ncores = 1, B = 1)

cv.mmpc(target, dataset, wei = NULL, kfolds = 10, folds = NULL, 
alphas = c(0.1, 0.05, 0.01), max_ks = c(3, 2), task = NULL, 
metric = NULL, metricbbc = NULL, modeler = NULL, mmpc_test = NULL, 
ncores = 1, B = 1)

cv.waldses(target, dataset, wei = NULL, kfolds = 10, folds = NULL, 
alphas = c(0.1, 0.05, 0.01), max_ks = c(3, 2), task = NULL, 
metric = NULL, metricbbc = NULL, modeler = NULL, ses_test = NULL,
ncores = 1, B = 1)

cv.waldmmpc(target, dataset, wei = NULL, kfolds = 10, folds = NULL, 
alphas = c(0.1, 0.05, 0.01), max_ks = c(3, 2), task = NULL, 
metric = NULL, metricbbc = NULL, modeler = NULL, mmpc_test = NULL, 
ncores = 1, B = 1)

cv.permses(target, dataset, wei = NULL, kfolds = 10, folds = NULL, 
alphas = c(0.1, 0.05, 0.01), max_ks = c(3, 2), task = NULL, 
metric = NULL, metricbbc = NULL, modeler = NULL, ses_test = NULL, R = 999, 
ncores = 1, B = 1)

cv.permmmpc(target, dataset, wei = NULL, kfolds = 10, folds = NULL, 
alphas = c(0.1, 0.05, 0.01), max_ks = c(3, 2), task = NULL, 
metric = NULL, metricbbc = NULL, modeler = NULL, mmpc_test = NULL, R = 999, 
ncores = 1, B = 1)

Arguments

target

The target or class variable, as in SES and MMPC. Unlike in those functions, it cannot be a single integer giving the column of the target in the dataset.

dataset

The dataset object as in SES and MMPC.

wei

A vector of weights to be used for weighted regression. The default value is NULL.

kfolds

The number of folds in the k-fold cross-validation (integer).

folds

The folds of the data to use (a list generated by the function generateCVRuns from the TunePareto package). If NULL, the folds are created internally with the same function.

alphas

A vector of values of the SES or MMPC threshold hyper-parameter to be tried in the cross-validation.

max_ks

A vector of values of the SES or MMPC max_k hyper-parameter to be tried in the cross-validation.

task

A character ("C", "R" or "S"). It can be "C" for classification (logistic, multinomial or ordinal regression), "R" for regression (robust and non robust linear regression, median regression, (zero inflated) Poisson and negative binomial regression, beta regression) or "S" for survival regression (Cox, Weibull or exponential regression).

metric

A metric function provided by the user. If NULL, the following functions are used: auc.mxm, mse.mxm and ci.mxm for classification, regression and survival analysis tasks, respectively. See Details for more. If you know which metric you want, supply it here so that the function does not choose something else. Note that the function name is given as it is, without quotation marks.

metricbbc

This is the same argument as "metric", with the difference that the name must be given as a quoted character string. For example, if metric = auc.mxm, then metricbbc = "auc.mxm". Both arguments must refer to the same function. This argument is used by the function bbc, which performs bootstrap bias correction of the estimated performance (Tsamardinos, Greasidou and Borboudakis, 2018). It is taken into account only if the last argument (B) is more than 1.

modeler

A modeling function provided by the user. If NULL, the following functions are used: glm.mxm, lm.mxm and coxph.mxm for classification, regression and survival analysis tasks, respectively. See Details for more. If you know which modeler you want, supply it here so that the function does not choose something else. Note that the function name is given as it is, without quotation marks.

ses_test

A function object that defines the conditional independence test used in the SES function (see also the SES help page). If NULL, "testIndLogistic", "testIndFisher" and "censIndCR" are used for classification, regression and survival analysis tasks, respectively. If you know which test you want, supply it here so that the function does not choose something else. Not all tests can be used here: "testIndClogit", "testIndMVreg", "testIndIG", "testIndGamma", "testIndZIP" and "testIndTobit" are not available at the moment.

mmpc_test

A function object that defines the conditional independence test used in the MMPC function (see also the SES help page). If NULL, "testIndLogistic", "testIndFisher" and "censIndCR" are used for classification, regression and survival analysis tasks, respectively.

R

The number of permutations, set to 999 by default. There is a trick to avoid performing all permutations. As soon as the number of times the permuted test statistic exceeds the observed test statistic is more than 50 (if threshold = 0.05 and R = 999), the p-value has exceeded the significance level (threshold value) and hence the predictor variable is not significant. There is no need to carry out the remaining permutations, as a decision has already been made.

ncores

The number of cores to use. This argument is valid only if you have a multi-threaded machine.

B

How many bootstrap re-samples to draw. This argument is used by the function bbc, which performs bootstrap bias correction of the estimated performance (Tsamardinos, Greasidou and Borboudakis, 2018). If you have thousands of samples (observations), the correction might not be necessary, as there is no optimistic bias to be corrected. The sample size above which this holds cannot be told beforehand, however. SES and MMPC were designed for low sample size cases, hence bootstrap bias correction is probably a must.
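
The early stopping rule described for the argument R can be illustrated with a minimal sketch in base R. The helper perm_test_early_stop and its test statistic (the absolute Pearson correlation) are assumptions made for illustration only; they are not part of MXM and do not reproduce the tests used by cv.permses or cv.permmmpc.

```r
# Hypothetical helper, not part of MXM: permutation test with early stopping.
# The test statistic (absolute Pearson correlation) is an illustrative assumption.
perm_test_early_stop <- function(x, y, R = 999, threshold = 0.05) {
  observed <- abs(cor(x, y))      # observed test statistic
  limit <- threshold * (R + 1)    # 50 when threshold = 0.05 and R = 999
  exceed <- 0                     # permuted statistics larger than the observed one
  for (i in seq_len(R)) {
    permuted <- abs(cor(sample(x), y))
    if (permuted > observed) exceed <- exceed + 1
    if (exceed > limit) {
      # the p-value can no longer drop below the threshold, so stop here
      return(list(significant = FALSE, permutations_used = i))
    }
  }
  pvalue <- (exceed + 1) / (R + 1)
  list(significant = pvalue <= threshold, permutations_used = R)
}

set.seed(1)
x <- rnorm(100)
# strongly related pair: all R permutations are carried out
strong <- perm_test_early_stop(x, x + rnorm(100, sd = 0.1))
# pair with an observed correlation of exactly zero: the loop stops early
x0 <- rep(c(-1, 1), 50)
y0 <- rep(c(-1, -1, 1, 1), 25)
unrelated <- perm_test_early_stop(x0, y0)
```

For a clearly non-significant predictor only a small fraction of the R permutations is needed, which is where the saving in computation time comes from.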

Details

Input for metric functions: predictions: a vector of predictions to be tested. test_target: the actual values of the target variable, to be compared with the predictions.

The output of a metric function is a single numeric value. Higher values indicate better performance. Metrics based on error measures should be modified accordingly (e.g., by multiplying the error by -1).

The metric functions that are currently supported are:

  • auc.mxm: "area under the receiver operating characteristic curve" metric for binary logistic regression.

  • acc.mxm: accuracy for binary logistic regression.

  • fscore.mxm: F score for binary logistic regression.

  • prec.mxm: precision for binary logistic regression.

  • euclid_sens.spec.mxm: Euclidean norm of 1 - sensitivity and 1 - specificity for binary logistic regression.

  • spec.mxm: specificity for logistic regression.

  • sens.mxm: sensitivity for logistic regression.

  • acc_multinom.mxm: accuracy for multinomial logistic regression.

  • mse.mxm: mean squared error, for robust and non robust linear regression and median (quantile) regression (multiplied by -1).

  • pve.mxm: 1 - (mean squared error)/( (n - 1) * var(y_out) ), for non robust linear regression. It is basically the proportion of variance explained in the test set.

  • ci.mxm: 1 - concordance index as provided in the rcorr.cens function from the survival package. This is to be used with the Cox proportional hazards model only.

  • ciwr.mxm: concordance index as provided in the rcorr.cens function from the survival package. This is to be used with the Weibull regression model only.

  • poisdev.mxm: Poisson regression deviance (multiplied by -1).

  • nbdev.mxm: Negative binomial regression deviance (multiplied by -1).

  • binomdev.mxm: Binomial regression deviance (multiplied by -1).

  • ord_mae.mxm: Ordinal regression mean absolute error (multiplied by -1).

  • mae.mxm: Mean absolute error (multiplied by -1).

  • mci.mxm: Matched concordance index (for conditional logistic regression).

Usage: metric(predictions, test_target)
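
As an illustration of the interface above, here is a minimal sketch of a user-defined metric. The function neg_rmse is hypothetical and not part of MXM; it follows the stated usage metric(predictions, test_target) and multiplies the error by -1 so that higher values indicate better performance. The trailing ... is a defensive assumption, so that any additional arguments passed by the caller are ignored.

```r
# Hypothetical user-defined metric, not part of MXM:
# negative root mean squared error, so that higher values are better.
neg_rmse <- function(predictions, test_target, ...) {
  -sqrt(mean((test_target - predictions)^2))
}

# toy check on made-up numbers
perfect <- neg_rmse(c(1, 2, 3), c(1, 2, 3))   # no error at all
worse   <- neg_rmse(c(1, 2, 3), c(1, 2, 5))   # one prediction is off by 2

# It would then be supplied unquoted, e.g. (assuming a regression task):
# cv.ses(target, dataset, task = "R", metric = neg_rmse)
```

Following the description of metricbbc, the same function would be named as a character string ("neg_rmse") there if bootstrap bias correction is requested.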

Input for modelling functions: train_target: the target variable used in the training procedure. sign_data: the training set. sign_test: the test set.

A modelling function returns a single vector of predictions, obtained by fitting the model on sign_data and train_target and applying it to sign_test.

The modelling functions that are currently supported are:

  • glm.mxm: fits a glm for a binomial family (classification task).

  • multinom.mxm: fits a multinomial regression model (classification task).

  • lm.mxm: fits a linear model (regression task).

  • coxph.mxm: fits a cox proportional hazards regression model (survival task).

  • weibreg.mxm: fits a Weibull regression model (survival task).

  • rq.mxm: fits a quantile (median) regression model (regression task).

  • lmrob.mxm: fits a robust linear model (regression task).

  • pois.mxm: fits a poisson regression model (regression task).

  • nb.mxm: fits a negative binomial regression model (regression task).

  • ordinal.mxm: fits an ordinal regression model (regression task).

  • beta.mxm: fits a beta regression model (regression task). The predicted values are mapped onto the real line using the logit transformation. This is so that the "mse.mxm" metric function can be used. In addition, this way the performance can be compared with the regression scenario, where the logit is applied and then a regression model is employed.

  • clogit: fits a conditional logistic regression model.

Usage: modeler(train_target, sign_data, sign_test)
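
A minimal sketch of a user-defined modelling function with this interface is given below. The function ols_modeler is hypothetical and not part of MXM; it assumes a continuous target and fits an ordinary linear model with lm from base R, returning a single vector of predictions for the test set.

```r
# Hypothetical user-defined modeler, not part of MXM:
# ordinary least squares fitted on the training set, predictions on the test set.
ols_modeler <- function(train_target, sign_data, sign_test) {
  train <- as.data.frame(sign_data)
  test <- as.data.frame(sign_test)
  names(test) <- names(train)                # same variable names in both sets
  fit <- lm(train_target ~ ., data = train)  # train_target is taken from the function scope
  as.numeric(predict(fit, newdata = test))
}

# toy check on simulated, noise-free data
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
y <- 1 + 2 * x[, 1] - x[, 2]
preds <- ols_modeler(y[1:80], x[1:80, ], x[81:100, ])
```

Because the simulated target is an exact linear function of the two predictors, the predictions should match y[81:100] up to numerical error.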

The procedure will be more automated in the future and more functions will be added. The multithreaded functions have been tested and no error has been detected. However, if you spot any suspicious results please let us know.

Value

A list including:

cv_results_all

A list with predictions, performances and signatures for each fold and each SES or MMPC configuration (e.g. cv_results_all[[3]]$performances[1] indicates the performance of the 1st fold with the 3rd configuration of SES or MMPC). In the case of the multi-threaded functions (cvses.par and cvmmpc.par) this is a list with a matrix. The rows correspond to the folds and the columns to the configurations (pairs of threshold and max_k).

best_performance

A numeric value that represents the best average performance.

best_configuration

A list that corresponds to the best configuration of SES or MMPC including id, threshold (named 'a') and max_k.

bbc_best_performance

The bootstrap bias corrected best performance if B was more than 1, otherwise this is NULL.

runtime

The runtime of the cross-validation procedure.

Bear in mind that the values can be extracted with the $ symbol, i.e. this is an S3 class output.
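
How best_performance and best_configuration relate to a folds-by-configurations matrix of performances can be sketched in base R as follows. The performances are made up, and the ordering of the configurations produced by expand.grid is an assumption; the ids assigned internally by cv.ses or cv.mmpc may differ.

```r
alphas <- c(0.1, 0.05, 0.01)
max_ks <- c(3, 2)

# All (threshold, max_k) pairs; this ordering is an assumption for illustration.
configs <- expand.grid(a = alphas, max_k = max_ks)

# Made-up performances: 5 folds (rows) by 6 configurations (columns).
set.seed(1)
perf <- matrix(-runif(5 * nrow(configs)), nrow = 5)

# Average over the folds; higher values indicate better performance.
avg_perf <- colMeans(perf)
best_id <- which.max(avg_perf)

best_performance <- avg_perf[best_id]
best_configuration <- list(id = best_id,
                           a = configs$a[best_id],
                           max_k = configs$max_k[best_id])
```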

Author(s)

R implementation and documentation: Giorgos Athineou <athineou@csd.uoc.gr> and Vincenzo Lagani <vlagani@csd.uoc.gr>

References

Ioannis Tsamardinos, Elissavet Greasidou and Giorgos Borboudakis (2018). Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. Machine Learning (To appear). https://link.springer.com/article/10.1007/s10994-018-5714-4

Harrell F. E., Lee K. L. and Mark D. B. (1996). Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in medicine, 15(4), 361-387.

Hanley J. A. and McNeil B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36.

Brentnall A. R., Cuzick J., Field J. and Duffy S. W. (2015). A concordance index for matched case-control studies with applications in cancer risk. Statistics in medicine, 34(3), 396-405.

Pedregosa F., Bach F. and Gramfort A. (2017). On the consistency of ordinal regression methods. The Journal of Machine Learning Research, 18(1), 1769-1803.

See Also

SES, CondIndTests, cv.gomp, bbc, testIndFisher, testIndLogistic, gSquare, censIndCR

Examples

library(MXM)

set.seed(1234)

# simulate a dataset with continuous data
dataset <- matrix( rnorm(100 * 50), ncol = 50 )
# the target feature is the last column of the dataset as a vector
target <- dataset[, 50]
dataset <- dataset[, -50]

# use all 100 observations as the train set
train_set <- dataset[1:100, ]
train_target <- target[1:100]

# run a 5-fold CV for the regression task
best_model <- cv.ses(target = train_target, dataset = train_set, kfolds = 5, task = "R")

# get the results
best_model$best_configuration
best_model$best_performance

# summary elements of the process. Press tab after each $ to view all the elements and
# choose the one you are interested in.
# best_model$cv_results_all[[...]]$...
# i.e.
# mse value for the 1st configuration of SES in the 5th fold
abs( best_model$cv_results_all[[ 1 ]]$performances[5] )

best_a <- best_model$best_configuration$a
best_max_k <- best_model$best_configuration$max_k

MXM documentation built on Aug. 25, 2022, 9:05 a.m.