The cvma
package provides a method for summarizing the strength of association between a set of variables and a multivariate outcome. In particular, cvma
uses an ensemble machine learning-based approach to detect and quantify the measure of association of nonlinear relationships and covariate interactions. In addition, it allows for testing the strong null hypothesis of no association between a set of variables and a multivariate outcome. Lastly, cvma
performs variable importance analysis, and therefore summarizes each groups' association with the outcome.
In the following sections, we examine the use of cvma
in a variety of simple examples. The package can be installed as follows:
if (!("cvma" %in% installed.packages())) { devtools::install_github("benkeser/cvma") }
Once the package is installed, we can load it using the following command:
suppressMessages(library(cvma))
The cvma
package provides flexible interface for computing cross-validation-based measures of maximal association. The method implemented uses covariate information to estimate combination of outcomes that optimizes a measure of predictive performance for given prediction algorithms. In essence, it estimates the combination of outcomes that is "easiest to predict" based on the available covariates.
In an outer layer of V-fold cross validation, training samples are used to train a prediction algorithm for each outcome. Multiple algorithms may be used and their predictions combined using the Super Learner methodology [@vdl2007super]. Note that Super Learner weights will be estimated based on V-2 fold cross-validation.
An inner layer of V-1 cross validation is used to determine a user-specified combination of outcomes that maximizes a user-specified prediction criteria. The outer layer validation sample is used to compute a user-specified cross-validated measure of performance of the prediction algorithm for predicting the combined outcome that was computed in the training sample. Several common choices for outcome combinations (convex combination of outcomes and single outcome that is most associated) and prediction criteria (nonparametric R^2, negative log-likelihood, and area under ROC curve) are implemented. Note that users may specify their own criteria as well. All implemented control options for cvma
can be seen using the list_control_options
function.
list_control_options()
The function returns the cross-validated summary measure for the maximally combined outcome and, if desired, the cross-validated summary measure for each outcome.
In order to illustrate cross-validated nonparametric R-squared for evaluating maximum association, we first simulate a simple dataset with continuous outcomes:
set.seed(1234) X <- data.frame(x1=runif(n=100,0,5), x2=runif(n=100,0,5)) Y1 <- rnorm(100, X$x1 + X$x2, 1) Y2 <- rnorm(100, X$x1 + X$x2, 3) Y <- data.frame(Y1 = Y1, Y2 = Y2)
Next, we run cvma
with 10 folds and 3 learners (generalized linear models, mean and neural networks). In order to specify more algorithms, see SuperLearner::listWrappers()
for implemented learners supported by the package SuperLearner
. We emphasize that \code{all_learner_fits} must be set to TRUE for future re-fitting.
fit <- cvma(Y = Y, X = X, V = 10, learners = c("SL.glm","SL.mean","SL.nnet"), return_control = list(outer_weight = TRUE, outer_sl = TRUE, inner_sl = FALSE, all_y = TRUE, all_learner_assoc = TRUE, all_learner_fits = TRUE)) fit
We can explore the results of cvma
a bit more.
fit$cv_assoc
returns risk for the entire procedure, including the cross-validation measure, confidence interval, p-value and influence curve.
fit$cv_assoc_all_y
returns cross-validated performance metrics for all the outcomes, with combined learners.
fit$cv_assoc_all_learners
returns for each outcome and learner cross-validated metric, confidence interval, associated p-value and influence curve.
fit$sl_fits
will return the super learner fit for each outcome on all the data, including risk for each learner and their separate fits.
fit$outer_weight
returns outcome weights for outer-most fold of cross-validation and the mean of the combined outcomes.
fit$outer_weight
returns outcome weights for inner-most fold of cross-validation, folds used for training and the mean of the combined outcomes.
We can generate a nice output using the summary
function. In particular, this function will summarize results of running cvma
by providing a user-friendly output for final weights, Super Learner risks and weights, cross-validation associated measures for each outcome and for each learner.
#Final weights: summary(fit, "weights") #Super Learner output: summary(fit, "superlearner") #Cross-validation measures for each outcome: summary(fit, "outcomes") #Cross-validation measures for each learner in the library: summary(fit, "learners")
We note that the majority of the computation time in cvma
is spent fitting learners. In order to allow the user to take advatange of many different options cvma
provides, \code{reweight_cvma} functon re-computes weights without re-fitting all of the learners. In addition to the standard input arguments to cvma
, we just need to add the previously obtained fit object:
re_fit <- reweight_cvma(fit, Y = Y, X = X, sl_control = list(ensemble_fn = "ensemble_linear", optim_risk_fn = "optim_risk_sl_se", weight_fn = "weight_sl_01", cv_risk_fn = "cv_risk_sl_r2", family = gaussian(), alpha = 0.05), y_weight_control = list(ensemble_fn = "ensemble_linear", weight_fn = "weight_y_01", optim_risk_fn = "optim_risk_y_r2", cv_risk_fn = "cv_risk_y_r2", alpha = 0.05)) re_fit
Instead of nonparametric $R^2$, we can use negative log-likelihood as prediction criteria instead. Similarly as before, we first simulate a simple dataset with binary outcomes:
set.seed(1234) X <- data.frame(x1=runif(n=100,0,5), x2=runif(n=100,0,5)) Y1 <- rbinom(100, 1, plogis(-2 + 0.1*X$x1 + 0.2*X$x2)) Y2 <- rbinom(100, 1, plogis(-2 + 0.1*X$x1)) Y <- data.frame(Y1 = Y1, Y2 = Y2)
We run cvma
with 10 folds and 3 learners as before, but this time specifying the cv_risk_fn
for both sl_control
and y_weight_control
to negative log-likelihood cross-validated measure. We also define loss by the negative log-likelihood, and change the outcome type in to binomial() for prediction.
fit <- cvma(Y = Y, X = X, V = 10, learners = c("SL.glm","SL.mean","SL.nnet"), sl_control = list(ensemble_fn = "ensemble_linear", optim_risk_fn = "optim_risk_sl_nloglik", weight_fn = "weight_sl_convex", cv_risk_fn = "cv_risk_sl_nloglik", family = binomial(), alpha = 0.05), y_weight_control = list(ensemble_fn = "ensemble_linear", weight_fn = "weight_y_01", optim_risk_fn = "optim_risk_y_nloglik", cv_risk_fn = "cv_risk_y_nloglik", alpha = 0.05)) fit
Finally, we can use the area area under ROC curve as prediction criteria. In order to accomplish this, we use auc appropriate cv_risk_fn
, optim_risk_fn
and weight_fn
as illustrated below:
fit <- cvma(Y = Y, X = X, V = 5, learners = c("SL.glm","SL.mean"), sl_control = list(ensemble_fn = "ensemble_linear", optim_risk_fn = "optim_risk_sl_nloglik", weight_fn = "weight_sl_convex", cv_risk_fn = "cv_risk_sl_auc", family = binomial(), alpha = 0.05), y_weight_control = list(ensemble_fn = "ensemble_linear", weight_fn = "weight_y_01", optim_risk_fn = "optim_risk_y_auc", cv_risk_fn = "cv_risk_y_auc", alpha = 0.05)) fit
Instead of assessing the overall summary of association between $X$ and $Y$, one might be interested in quantifying the relative importance of each component of $X$. The function compare_cvma
function computes the differences in the estimated association between $X$ and $Y$ when considering different sets of variables. In essence, the proposed method is similar to well known random forests, in that it measures the change in predictive performance with and without each predictor being considered [@breiman2001]. We emphasize that the method implemented in compare_cvma
yields straightforward, asymptotically justified inference, as opposed to many existing variable importance methods.
Once again, we simulate a simple data structure in order to illustrate how to use compare_cvma
function to assess variable importance:
set.seed(1234) X <- data.frame(x1=runif(n=100,0,5), x2=runif(n=100,0,5), x3=runif(n=100,0,5)) Y1 <- rnorm(100, X$x1 + X$x2, 1) Y2 <- rnorm(100, X$x1 + X$x2 + X$x3, 3) Y <- data.frame(Y1 = Y1, Y2 = Y2)
For computational simplicity of this example, we only look at low-dimensional $X$ and $Y$, and fit with simple learners and $V=10$. To assess the importance of variable $x3$ with $x3 \in X$, we repeat the procedure described in the previous section but restrict to variables not in this subset. Therefore, we obtain an estimate of the cross-validated performance measure for the composite prediction algorithm based on a subset of variables, as well as when all variables are used. First we fit data with all variables in $X$:
fit1 <- cvma(Y = Y, X = X, V = 10, learners = c("SL.glm","SL.mean"))
Then, we repeat the procedure restricting to variables not in the subset ${x3}$ of $X$:
fit2 <- cvma(Y = Y, X = X[,-3], V = 10, learners = c("SL.glm","SL.mean"))
We are now ready to assess the variable importance of $x3$ for association between $X$ and $Y$. The package cvma
provides 3 different options for the type of comparison to be made. With $R^2$ as the example metric, we can look at the differences in $R^2$ between the two fits using the following command:
fit<-compare_cvma(fit1, fit2, contrast = "diff") fit
The output of compare_cvma
includes the contrast measure, associated confidence interval for $1-\alpha$, and p-value. In addition, we can look at the ratio of $R^2$ instead, with symmetric CI.
fit<-compare_cvma(fit1, fit2, contrast = "ratio") fit
Finally, we can also assess variable importance using the ratio of $R^2$, with symmetric CI on the log-scale:
fit<-compare_cvma(fit1, fit2, contrast = "logratio") fit
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.