eNetXplorer: generates family of elastic net models for different alphas
In juliancandia/eNetXplorer: Quantitative Exploration of Elastic Net Families for Generalized Linear Models

eNetXplorer

R Documentation

generates family of elastic net models for different alphas

Description

Elastic net uses a mixing parameter alpha to tune the penalty term continuously from ridge (alpha=0) to lasso (alpha=1). eNetXplorer generates a family of elastic net models over different values of alpha for the quantitative exploration of the effects of shrinkage. For each alpha, the regularization parameter lambda is chosen by optimizing a quality (objective) function based on out-of-bag cross-validation predictions. Statistical significance of each model, as well as that of individual features within a model, is assigned by comparison to a set of null models generated by random permutations of the response. eNetXplorer fits linear (gaussian), logistic (binomial), multinomial, and Cox regression models.

Usage

eNetXplorer(x, y, family=c("gaussian","binomial","multinomial","cox"),
alpha=seq(0,1,by=0.2), nlambda=100, nlambda.ext=NULL, seed=NULL, scaled=TRUE, 
n_fold=5, n_run=100, n_perm_null=25, save_obj=FALSE, dest_dir=getwd(), 
dest_dir_create=TRUE, dest_dir_create_recur=FALSE, dest_obj="eNet.Robj",
save_lambda_QF_full=FALSE, QF.FUN=NULL, QF_label=NULL, 
QF_gaussian=c("cor.pearson","cor.spearman","cor.kendall","mse"),
binom_method=c("accuracy","precision","recall","Fscore","specificity","auc"),
multinom_method=c("avg accuracy","avg precision","avg recall","avg Fscore"), 
binom_pos=NULL, fscore_beta=NULL, fold_distrib_fail.max=100, 
cox_index=c("concordance","D-index"), logrank=FALSE, survAUC=FALSE, 
survAUC_time=NULL, ...)

Arguments

`x`	Input numerical matrix with instances as rows and features as columns. Instance and feature labels should be provided as row and column names, respectively. Can be in sparse matrix format (inherit from class `"sparseMatrix"` as in package `Matrix`). Cannot handle missing values.
`y`	Response variable. For `family="gaussian"`, numerical vector. For `family=` `"binomial"`, factor with two levels. For `family="multinomial"`, factor with two or more levels. For categorical families, if a vector is supplied, it will be coerced into a factor. For `family="cox"`, matrix with columns named "time" and "status", where the latter is a binary indicator of event (1) or right-censoring (0).
`family`	Response type: `"gaussian"` (numerical), `"binomial"` (2-level factor), `"multinomial"` (factor with >=2 levels) or `"cox"` (survival time and censoring status).
`alpha`	Sequence of values for the mixing parameter penalty term in the elastic net family. Default is `seq(0,1,by=0.2)`.
`nlambda`	Number of values for the regularization parameter `lambda`. Default is 100. Irrespective of `nlambda`, the range of `lambda` values is assigned by `glmnet`.
`nlambda.ext`	If set to a value larger than `nlambda`, this will be the number of values for `lambda` obtained by extending the range assigned by `glmnet` symmetrically while keeping the `lambda` density uniform in log scale. Default is `NULL`, which will not extend the range of `lambda` assigned by `glmnet`.
`seed`	Sets the pseudo-random number seed to enforce reproducibility. Default is `NULL`.
`scaled`	Z-score transformation of individual features across all instances. Default is `TRUE`.
`n_fold`	Number of cross-validation folds per run. `lambda` is chosen based on the maximization of a quality function on out-of-bag-instances averaged over all runs. Default is 5.
`n_run`	Number of runs (i.e. cross-validated model iterations); for each run, instances are randomly assigned to cross-validation folds. Default is 100.
`n_perm_null`	Number of random null-model permutations of the response per run. Default is 25.
`save_obj`	Logical to save the `eNetXplorer` object. Default is `FALSE`.
`dest_dir`	Destination directory. Default is the working directory.
`dest_dir_create`	Creates destination directory if it does not exist already. Default is `TRUE`.
`dest_dir_create_recur`	Creates destination directory recursively if it does not exist already. Default is `FALSE`.
`dest_obj`	Name for output `eNetXplorer` object.
`save_lambda_QF_full`	Full lambda vs QF information is included in the `eNetXplorer` object. Default is `FALSE`.
`QF.FUN`	User-defined quality (objective) function as maximization criterion to select `lambda` based on response vs out-of-bag predictions (see example below). If not set, family-specific default quality functions are used, as follows: for `family="gaussian"`, default is correlation; for `family="binomial"`, it is accuracy; for `family="multinomial"`, it is average accuracy; for `family="cox"`, it is the concordance index (default) or D-index (set by `cox_index`).
`QF_label`	Label for user-defined quality function, if `QF.FUN` is provided.
`QF_gaussian`	For `family="gaussian"`, this selects the default quality function as correlation (Pearson, Spearman and Kendall methods available) or mean squared error (`"mse"`). Default is `"cor.pearson"`.
`binom_method`	For `family="binomial"`, method to be used in the quality function. Default is `"accuracy"`.
`multinom_method`	For `family="multinomial"`, method to be used in the quality function. Default is `"avg accuracy"`.
`binom_pos`	For `family="binomial"` and quality function methods other than the default (`"accuracy"`), this is the class to be considered positive. Default is the first level of the response factor.
`fscore_beta`	For `family="binomial"` and quality function method `"Fscore"`, or for `family=` `"multinomial"` and quality function method `"avg Fscore"`, this is the beta factor to balance precision and recall. Default is 1.
`fold_distrib_fail.max`	For categorical models, maximum number of failed attempts per run to have all classes represented in each in-bag fold. If this number is exceeded, the execution is halted; try again with larger `n_fold`, by removing/reasigning classes of small size, and/or with larger `fold_distrib_fail.max`. Default is 100.
`cox_index`	For `family="cox"`, index method to be used in the default quality function. Default is `"concordance"`, alternative choice is `"D-index"`.
`logrank`	For `family="cox"`, logical to generate cross-validated log-rank test p-values of low- vs high-risk groups, defined by the median of out-of-bag predicted risk. Default is `FALSE`.
`survAUC`	For `family="cox"`, logical to calculate area-under-curve (AUC) from cross-validated time-dependent ROC curves based on out-of-bag predicted risk. Default is `FALSE`.
`survAUC_time`	For `family="cox"` (if `survAUC=T`), numerical vector with timepoints of interest; time must be in the same units as the response variable `y`.
`...`	Accepts parameters from `glmnet.control(...)` to allow changes of factory default parameters in `glmnet`. If not explicitly set, it will use factory defaults.

Details

For each alpha, a set of nlambda values is obtained using the full data; if provided, nlambda.ext allows to extend the range of lambda values symmetrically while keeping its density uniform in log scale. Using these values of lambda, elastic net cross-validation models are generated for n_run random assignments of instances among n_fold folds; the best lambda is determined by the maximization of a quality (objective) function that compares out-of-bag predictions against the response. A variety of quality functions are implemented for each response type, namely: for gaussian models, correlation (Pearson, Spearman and Kendall methods available) and mean squared error; for binomial models, accuracy, precision, recall, F-score, specificity, and area-under-curve; for multinomial models, average accuracy, precision, recall, and F-score; for Cox regression models, concordance and D-index (Schroeder et al). Some of these choices require additional parameters: binomial measures that are not invariant under class permutation (see Sokolova & Lapalme) require to specify which class is to be considered positive; F-score requires to specify the value of the beta factor to balance precision and recall (F-score equals precision for beta=0 and tends to recall in the large beta limit). Besides these built-in options, user-defined quality functions can be provided via QF.FUN. For each run, using the same assignment of instances into folds, n_perm_null null models are generated by shuffling the response. By using the quality function to compare the out-of-bag performance of the model to that of the null models, an empirical significance p-value is assigned to the model. Similar procedures allow to obtain p-values for individual features based on absolute coefficient magnitude and on the frequency of non-zero coefficients. A family of elastic net models is thus generated for multiple values of alpha spanning the range from ridge (alpha=0) to lasso (alpha=1). This function returns an eNetXplorer object on which summary, plotting and export functions in this package can be applied for further analysis. For details about the underlying elastic net models (Friedman et al; Zhou & Hastie), refer to the glmnet package and references therein. For more details about eNetXplorer, see Candia & Tsang and the package vignette.

For Cox regression models, setting logrank=T generates cross-validated log-rank test p-values of low- vs high-risk groups, which are defined by the median of out-of-bag predicted risk (Simon et al). Moreover, setting survAUC=T and providing a numerical vector survAUC_time with timepoints of interest generates the AUC from cross-validated time-dependent ROC curves based on out-of-bag predicted risk (Simon et al) using the timeROC package (Blanche et al).

Value

An object with S3 class "eNetXplorer".

`predictor`	Predictor matrix used for regression (in sparse matrix format).
`response`	Response variable used for regression.
`family`	Input parameter.
`alpha`	Input parameter.
`nlambda`	Input parameter.
`nlambda.ext`	Input parameter.
`seed`	Input parameter.
`scaled`	Input parameter.
`n_fold`	Input parameter.
`n_run`	Input parameter.
`n_perm_null`	Input parameter.
`QF_label`	Input parameter.
`QF_gaussian`	Input parameter.
`binom_method`	Input parameter.
`multinom_method`	Input parameter.
`binom_pos`	Input parameter.
`fscore_beta`	Input parameter.
`fold_distrib_fail.max`	Input parameter.
`cox_index`	Input parameter.
`logrank`	Input parameter.
`survAUC`	Input parameter.
`survAUC_time`	Input parameter.
`survAUC_method`	Input parameter.
`survAUC_lambda`	Input parameter.
`survAUC_span`	Input parameter.
`instance`	Instance labels.
`feature`	Feature labels.
`glmnet_params`	`glmnet` parameters used for regression.
`best_lambda`	`lambda` values chosen by cross-validation.
`model_QF_est`	Quality function values obtained by cross-validation.
`QF_model_vs_null_pval`	P-value from model vs null comparison to assess statistical significance.
`lambda_values`	List of `lambda` values used for each `alpha`.
`lambda_QF_est`	List of quality function values obtained for each `alpha`.
`predicted_values`	List of out-of-bag predicted values for each `alpha`; rows are instances and columns are median/mad predictions (for linear and Cox regression) or class predictions (for binomial and multinomial regression).
`feature_coef_wmean`	Mean of feature coefficients (over runs) weighted by non-zero frequency (over folds) in sparse matrix format, with features as rows and `alpha` values as columns. For multinomial regression, it is a list of matrices (one matrix for each class).
`feature_coef_wsd`	Standard deviation of feature coefficients (over runs) weighted by non-zero frequency (over folds) in sparse matrix format, with features as rows and `alpha` values as columns. For multinomial regression, it is a list of matrices (one matrix for each class).
`feature_freq_mean`	Mean of non-zero frequency in sparse matrix format, with features as rows and `alpha` values as columns. For multinomial regression, it is a list of matrices (one matrix for each class).
`feature_freq_sd`	Standard deviation of non-zero frequency in sparse matrix format, with features as rows and `alpha` values as columns. For multinomial regression, it is a list of matrices (one matrix for each class).
`null_feature_coef_wmean`	Analogous to `feature_coef_wmean` for null model permutations.
`null_feature_coef_wsd`	Analogous to `feature_coef_wsd` for null model permutations.
`null_feature_freq_mean`	Analogous to `feature_freq_mean` for null model permutations.
`null_feature_freq_sd`	Analogous to `feature_freq_sd` for null model permutations.
`feature_coef_model_vs_null_pval`	P-value from model vs null comparison to assess statistical significance of mean non-zero feature coefficients in sparse matrix format, with features as rows and `alpha` values as columns. For multinomial regression, it is a list of matrices (one matrix for each class).
`feature_freq_model_vs_null_pval`	P-value from model vs null comparison to assess statistical significance of mean non-zero feature frequencies in sparse matrix format, with features as rows and `alpha` values as columns. For multinomial regression, it is a list of matrices (one matrix for each class).
`logrank_pval`	For Cox regression (if `logrank=T`), cross-validated log-rank test p-value of low- vs high-risk groups, defined by the median of out-of-bag predicted risk.
`AUC_mean`	For Cox regression (if `survAUC=T`), mean AUC from cross-validated time-dependent ROC curves based on out-of-bag predicted risk, with timepoints (given by `survAUC_time`) as rows and `alpha` values as columns.
`AUC_sd`	For Cox regression (if `survAUC=T`), standard deviation of AUC.
`AUC_perc025`	For Cox regression (if `survAUC=T`), 2.5th percentile of AUC.
`AUC_perc500`	For Cox regression (if `survAUC=T`), 50th percentile (median) of AUC.
`AUC_perc975`	For Cox regression (if `survAUC=T`), 97.5th percentile of AUC.
`AUC_pval`	For Cox regression (if `survAUC=T`), p-value of AUC from model vs null comparison to assess statistical significance.

Author(s)

Julian Candia and John S. Tsang
Maintainer: Julian Candia julian.candia@nih.gov

References

Blanche P, Dartigues J-F and Jacqmin-Gadda H. Estimating and comparing time-dependent areas under receiver operating characteristic curves for censored event times with competing risks, Statistics in Medicine (2013) 32:5381-5397.

Candia J and Tsang JS. eNetXplorer: an R package for the quantitative exploration of elastic net families for generalized linear models, BMC Bioinformatics (2019) 20:189.

Friedman J, Hastie T and Tibshirani R. Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software (2010) 33:1-22.

Schroeder MS, Culhane AC, Quackenbush J, Haibe-Kains B. survcomp: an R/Bioconductor package for performance assessment and comparison of survival models, Bioinformatics (2011) 27:3206-8.

Simon RM, Subramanian J, Li M-C and Menezes S. Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data, Briefings in Bioinformatics (2011) 12:203-14.

Sokolova M and Lapalme G. A systematic analysis of performance measures for classification tasks, Information Processing and Management (2009) 45, 427-437.

Zou H and Hastie T. Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society Series B (2005) 67:301-20.

Examples



# Linear models (synthetic dataset comprised of 20 features and 75 instances):
data(QuickStartEx)
fit = eNetXplorer(x=QuickStartEx$predictor, y=QuickStartEx$response,
family="gaussian", n_run=20, n_perm_null=10, seed=111)


# Example showing an explicit (i.e. custom) implementation of mean squared error as QF
# Note: mean squared error as QF is a built-in option using \code{QF_gaussian="mse"}
data(QuickStartEx)
customQF = function(predicted,response){
     -mean((predicted-response)**2)
}
fit = eNetXplorer(x=QuickStartEx$predictor, y=QuickStartEx$response,
family="gaussian", n_run=20, n_perm_null=10, seed=111, QF.FUN=customQF, QF_label="MSE")


# Linear models to predict numerical day-70 H1N1 serum titers based on 
# day-7 cell population frequencies:
data(H1N1_Flow)
fit = eNetXplorer(x=H1N1_Flow$predictor_day7, y=H1N1_Flow$response_numer[rownames(
H1N1_Flow$predictor_day7)], family="gaussian", n_run=25, n_perm_null=15, seed=111)


# Binomial models to predict acute myeloid (AML) vs acute lymphoblastic (ALL) 
# leukemias: 
data(Leukemia_miR)
fit = eNetXplorer(x=Leuk_miR_filt$predictor, y=Leuk_miR_filt$response_binomial, 
family="binomial", n_run=25, n_perm_null=15, seed=111)


# Multinomial models to predict acute myeloid (AML), acute B-cell lymphoblastic 
# (B-ALL) and acute T-cell lymphoblastic (T-ALL) leukemias:
data(Leukemia_miR)
fit = eNetXplorer(x=Leuk_miR_filt$predictor, y=Leuk_miR_filt$response_multinomial,
family="multinomial", n_run=25, n_perm_null=15, seed=111)


# Binomial models to predict B-ALL vs T-ALL:
data(Leukemia_miR)
fit = eNetXplorer(x=Leuk_miR_filt$predictor[Leuk_miR_filt$response_multinomial!="AML",],
y=Leuk_miR_filt$response_multinomial[Leuk_miR_filt$response_multinomial!="AML"], 
family="binomial", n_run=25, n_perm_null=15, seed=111)


# Cox regression models to predict survival based on 7-gene signature:
data(breastCancerSurv)
fit = eNetXplorer(x=breastCancerSurv$predictor, y=breastCancerSurv$response, family="cox", 
n_run=25, n_perm_null=15, seed=111)

juliancandia/eNetXplorer documentation built on May 11, 2023, 1:59 a.m.