eNetXplorer: generates family of elastic net models for different alphas



Description

Elastic net uses a mixing parameter alpha to tune the penalty term continuously from ridge (alpha=0) to lasso (alpha=1). eNetXplorer generates a family of elastic net models over different values of alpha for the quantitative exploration of the effects of shrinkage. For each alpha, the regularization parameter lambda is chosen by optimizing a quality function based on out-of-bag cross-validation predictions. Statistical significance of each model, as well as that of individual features within a model, is assigned by comparison to a set of null models generated by random permutations of the response. eNetXplorer fits linear (gaussian), logistic (binomial) and multinomial models.
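As an illustration of how alpha interpolates between ridge and lasso, the sketch below evaluates the elastic net penalty in the parameterization used by glmnet, lambda * [(1 - alpha)/2 * sum(beta^2) + alpha * sum(|beta|)], applied to the non-intercept coefficients. The helper name enet_penalty is hypothetical and is not part of eNetXplorer or glmnet.

```r
# Hypothetical helper (not part of eNetXplorer or glmnet): elastic net penalty
# in the glmnet parameterization, for a vector of non-intercept coefficients.
enet_penalty <- function(beta, lambda, alpha) {
  stopifnot(alpha >= 0, alpha <= 1, lambda >= 0)
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(1, -2, 0)
alpha_grid <- seq(0, 1, by = 0.2)  # same grid as the default alpha argument

# Penalty for each alpha, from pure ridge (alpha=0) to pure lasso (alpha=1).
penalties <- sapply(alpha_grid, function(a) enet_penalty(beta, lambda = 1, alpha = a))
```

At alpha=0 only the squared (ridge) term contributes, and at alpha=1 only the absolute-value (lasso) term does.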

Usage

eNetXplorer(x, y, family=c("gaussian","binomial","multinomial"), 
alpha=seq(0,1,by=0.2), nlambda=100, nlambda.ext=NULL, seed=NULL, scaled=T, 
n_fold=5, n_run=100, n_perm_null=25, QF.FUN=NULL, QF_label=NULL, 
cor_method=c("pearson","kendall","spearman"), fold_distrib_fail.max=100, ...)

Arguments

x

Input numerical matrix with instances as rows and features as columns. Instance and feature labels should be provided as row and column names, respectively. Can be in sparse matrix format (inherit from class "sparseMatrix" as in package Matrix). Cannot handle missing values.

y

Response variable. For family="gaussian", numerical vector. For family="binomial", factor with two levels. For family="multinomial", factor with two or more levels. For categorical families, if a vector is supplied, it will be coerced into a factor.

family

Response type: "gaussian" (numerical), "binomial" (2-level factor), or
"multinomial" (factor with >=2 levels).

alpha

Sequence of values for the mixing parameter of the elastic net penalty term. Default is seq(0,1,by=0.2).

nlambda

Number of values for the regularization parameter lambda. Default is 100. Irrespective of nlambda, the range of lambda values is assigned by glmnet.

nlambda.ext

If set to a value larger than nlambda, this will be the number of values for lambda obtained by extending the range assigned by glmnet symmetrically while keeping the lambda density uniform in log scale. Default is NULL, which will not extend the range of lambda assigned by glmnet.

seed

Sets the pseudo-random number seed to enforce reproducibility. Default is NULL.

scaled

Z-score transformation of individual features across all instances. Default is TRUE.

n_fold

Number of cross-validation folds per run. lambda is chosen by maximizing a quality function on out-of-bag instances, averaged over all runs. Default is 5.

n_run

Number of runs; for each run, instances are randomly assigned to cross-validation folds. Default is 100.

n_perm_null

Number of random null-model permutations of the response per run. Default is 25.

QF.FUN

User-defined quality function as maximization criterion to select lambda based on response vs out-of-bag predicted instances. For family="gaussian", default is correlation; for family="binomial", it is accuracy; for family="multinomial", it is average accuracy.

QF_label

Label for user-defined quality function, if QF.FUN is provided.

cor_method

For family="gaussian", correlation method to be used in the default quality function cor.test. Default is "pearson".

fold_distrib_fail.max

For categorical models, maximum number of failed attempts per run to have all classes represented in each in-bag fold. If this number is exceeded, the execution is halted; try again with larger n_fold, by removing/reassigning classes of small size, and/or with larger fold_distrib_fail.max. Default is 100.

...

Accepts parameters from glmnet.control(...) to allow changes of factory default parameters in glmnet. If not explicitly set, it will use factory defaults.
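To make the nlambda.ext argument above more concrete, here is a minimal sketch of one way to extend a lambda sequence symmetrically while keeping the spacing uniform in log scale. The function extend_lambda is hypothetical and the package's own implementation may differ; in particular, placing the spare value on the high side when the number of extra values is odd is an assumption of this sketch.

```r
# Hypothetical helper: extend a log-uniform lambda sequence on both sides,
# keeping the same step in log scale. Not taken from the eNetXplorer source.
extend_lambda <- function(lambda, nlambda.ext = NULL) {
  n <- length(lambda)
  if (is.null(nlambda.ext) || nlambda.ext <= n) return(lambda)  # nothing to extend
  log_lambda <- sort(log(lambda), decreasing = TRUE)
  step <- (log_lambda[1] - log_lambda[n]) / (n - 1)  # positive log-scale spacing
  n_extra <- nlambda.ext - n
  n_hi <- ceiling(n_extra / 2)  # assumption: spare value goes to the high side
  n_lo <- n_extra - n_hi
  hi <- if (n_hi > 0) log_lambda[1] + step * (n_hi:1) else numeric(0)
  lo <- if (n_lo > 0) log_lambda[n] - step * (1:n_lo) else numeric(0)
  exp(c(hi, log_lambda, lo))
}

# Example: 5 values spanning log(lambda) from 0 to -4, extended to 9 values.
lambda <- exp(seq(0, -4, length.out = 5))
lambda_ext <- extend_lambda(lambda, nlambda.ext = 9)
```

In this example the extended sequence spans log(lambda) from 2 to -6 with the original unit spacing preserved.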

Details

For each alpha, a set of nlambda values is obtained using the full data; if provided, nlambda.ext extends the range of lambda values symmetrically while keeping their density uniform in log scale. Using these values of lambda, elastic net cross-validation models are generated for n_run random assignments of instances among n_fold folds; the best lambda is determined by maximizing a quality function that compares out-of-bag predictions against the response. User-defined quality functions can be provided via QF.FUN; otherwise, defaults are used (correlation for gaussian models, accuracy for binomial models, and average accuracy for multinomial models). For each run, using the same assignment of instances into folds, n_perm_null null models are generated by shuffling the response. By using the quality function to compare the out-of-bag performance of the model to that of the null models, an empirical significance p-value is assigned to the model. Similar procedures yield p-values for individual features based on absolute coefficient magnitude and on the frequency of non-zero coefficients. A family of elastic net models is thus generated for multiple values of alpha spanning the range from ridge (alpha=0) to lasso (alpha=1). This function returns an eNetXplorer object to which the summary, plotting and export functions in this package can be applied for further analysis. For details about the underlying elastic net models, please refer to the glmnet package and references therein.
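The run/fold/permutation scheme described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: ordinary least squares (lm) stands in for the elastic net fit, no lambda selection is performed, the helper names oob_predict and quality are hypothetical, and the empirical p-value formula is an assumption.

```r
# Sketch only: lm() stands in for the elastic net fit, and the p-value
# formula at the end is an assumption, not taken from the package.
set.seed(111)

# Synthetic data: 60 instances, 3 features, response driven by f1 and f2.
n <- 60
x <- matrix(rnorm(n * 3), nrow = n, dimnames = list(NULL, c("f1", "f2", "f3")))
y <- 2 * x[, "f1"] - x[, "f2"] + rnorm(n)

# Out-of-bag predictions for one assignment of instances to folds:
# each fold is predicted by a model trained on the remaining (in-bag) folds.
oob_predict <- function(x, y, fold_id) {
  pred <- numeric(length(y))
  for (k in unique(fold_id)) {
    in_bag <- fold_id != k
    train <- data.frame(y = y[in_bag], x[in_bag, , drop = FALSE])
    test <- data.frame(x[!in_bag, , drop = FALSE])
    pred[!in_bag] <- predict(lm(y ~ ., data = train), newdata = test)
  }
  pred
}

# Quality function: Pearson correlation between predictions and response.
quality <- function(pred, y) cor(pred, y, method = "pearson")

n_fold <- 5
n_run <- 10
n_perm_null <- 5
qf_model <- numeric(n_run)
qf_null <- matrix(NA_real_, nrow = n_run, ncol = n_perm_null)

for (r in seq_len(n_run)) {
  # Random assignment of instances to folds for this run.
  fold_id <- sample(rep(seq_len(n_fold), length.out = n))
  qf_model[r] <- quality(oob_predict(x, y, fold_id), y)
  # Null models: shuffle the response, reusing the same fold assignment.
  for (p in seq_len(n_perm_null)) {
    y_null <- sample(y)
    qf_null[r, p] <- quality(oob_predict(x, y_null, fold_id), y_null)
  }
}

# Empirical p-value: share of null quality values at least as large as the
# run-averaged model quality (with a +1 correction; an assumption here).
qf_obs <- mean(qf_model)
pval <- (1 + sum(qf_null >= qf_obs)) / (1 + length(qf_null))
```

Reusing fold_id for the permuted responses mirrors the statement that null models are generated using the same assignment of instances into folds.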

Value

An object with S3 class "eNetXplorer".

predictor

Predictor matrix used for regression (in sparse matrix format).

response

Response variable used for regression.

family

Input parameter.

alpha

Input parameter.

nlambda

Input parameter.

nlambda.ext

Input parameter.

seed

Input parameter.

scaled

Input parameter.

n_fold

Input parameter.

n_run

Input parameter.

n_perm_null

Input parameter.

QF_label

Input parameter.

cor_method

Input parameter.

fold_distrib_fail.max

Input parameter.

instance

Instance labels.

feature

Feature labels.

glmnet_params

glmnet parameters used for regression.

best_lambda

lambda values chosen by cross-validation.

model_QF_est

Quality function values obtained by cross-validation.

QF_model_vs_null_pval

P-value from model vs null comparison to assess statistical significance.

lambda_values

List of lambda values used for each alpha.

lambda_QF_est

List of quality function values obtained for each alpha.

predicted_values

List of out-of-bag predicted values for each alpha; rows are instances and columns are median/mad predictions (for linear regression) or class predictions (for binomial and multinomial regression).

feature_coef_wmean

Mean of feature coefficients (over runs) weighted by non-zero frequency (over folds) in sparse matrix format, with features as rows and alpha values as columns. For multinomial regression, it is a list of matrices (one matrix for each class).

feature_coef_wsd

Standard deviation of feature coefficients (over runs) weighted by non-zero frequency (over folds) in sparse matrix format, with features as rows and alpha values as columns. For multinomial regression, it is a list of matrices (one matrix for each class).

feature_freq_mean

Mean of non-zero frequency in sparse matrix format, with features as rows and alpha values as columns. For multinomial regression, it is a list of matrices (one matrix for each class).

feature_freq_sd

Standard deviation of non-zero frequency in sparse matrix format, with features as rows and alpha values as columns. For multinomial regression, it is a list of matrices (one matrix for each class).

null_feature_coef_wmean

Analogous to feature_coef_wmean for null model permutations.

null_feature_coef_wsd

Analogous to feature_coef_wsd for null model permutations.

null_feature_freq_mean

Analogous to feature_freq_mean for null model permutations.

null_feature_freq_sd

Analogous to feature_freq_sd for null model permutations.

feature_coef_model_vs_null_pval

P-value from model vs null comparison to assess statistical significance of mean non-zero feature coefficients in sparse matrix format, with features as rows and alpha values as columns. For multinomial regression, it is a list of matrices (one matrix for each class).

feature_freq_model_vs_null_pval

P-value from model vs null comparison to assess statistical significance of mean non-zero feature frequencies in sparse matrix format, with features as rows and alpha values as columns. For multinomial regression, it is a list of matrices (one matrix for each class).

Author(s)

Julian Candia and John S. Tsang
Maintainer: Julian Candia

See Also

summary, plot, summaryPDF, export

Examples

## Not run: 
# Linear models (synthetic dataset comprised of 20 features and 75 instances):
data(QuickStartEx)
fit = eNetXplorer(x=QuickStartEx$predictor, y=QuickStartEx$response,
family="gaussian", n_run=20, n_perm_null=10, seed=111)

# Linear models to predict numerical day-70 H1N1 serum titers based on 
# day-7 cell population frequencies:
data(H1N1_Flow)
fit = eNetXplorer(x=H1N1_Flow$predictor_day7, y=H1N1_Flow$response_numer[rownames(
H1N1_Flow$predictor_day7)], family="gaussian", n_run=25, n_perm_null=15, seed=111)

# Binomial models to predict acute myeloid (AML) vs acute lymphoblastic (ALL) 
# leukemias:
data(Leukemia_miR)
fit = eNetXplorer(x=Leukemia_miR$predictor, y=Leukemia_miR$response_binomial, 
family="binomial", n_run=25, n_perm_null=15, seed=111)

# Multinomial models to predict acute myeloid (AML), acute B-cell lymphoblastic 
# (B-ALL) and acute T-cell lymphoblastic (T-ALL) leukemias:
data(Leukemia_miR)
fit = eNetXplorer(x=Leukemia_miR$predictor, y=Leukemia_miR$response_multinomial,
family="multinomial", n_run=25, n_perm_null=15, seed=111)

# Binomial models to predict B-ALL vs T-ALL:
data(Leukemia_miR)
fit = eNetXplorer(x=Leukemia_miR$predictor[Leukemia_miR$response_multinomial!="AML",],
y=Leukemia_miR$response_multinomial[Leukemia_miR$response_multinomial!="AML"], 
family="binomial", n_run=25, n_perm_null=15, seed=111)

## End(Not run)

juliancandia/eNetXplorer documentation built on April 22, 2018, 9:20 p.m.