d2wlasso: d2wlasso package


View source: R/main_functions.R

Description

This package provides functions to perform variable selection with the weighted lasso for both linear regression and Cox proportional hazards regression. The weights direct the variable selection procedure so that covariates that are highly associated with the response are likely to be selected and covariates that are weakly associated with the response are less likely to be selected. The association between the response and each covariate is measured from simpler linear/Cox regressions of the response on that covariate; the measures include, for example, q-values, partial correlation coefficients, and t-statistics of regression coefficients.

Performs variable selection with covariates multiplied by weights that direct which variables are likely to be associated with the response.

Usage

d2wlasso(
  x,
  z,
  y,
  cox.delta = NULL,
  factor.z = TRUE,
  regression.type = c("linear", "cox")[1],
  weight.type = c("one", "corr.estimate", "corr.pvalue", "corr.bh.pvalue", "corr.tstat",
    "corr.qvalue", "parcor.estimate", "parcor.pvalue", "parcor.bh.pvalue", "parcor.tstat",
    "parcor.qvalue", "exfrequency.random.partition.aic",
    "exfrequency.random.partition.bic", "exfrequency.kmeans.partition.aic",
    "exfrequency.kmeans.partition.bic", "exfrequency.kquartiles.partition.aic",
    "exfrequency.kquartiles.partition.bic", "exfrequency.ksorted.partition.aic",
    "exfrequency.ksorted.partition.bic")[1],
  weight_fn = function(x) { x },
  ttest.pvalue = TRUE,
  q_opt_tuning_method = c("bootstrap", "smoother")[2],
  qval.alpha = 0.15,
  alpha.bh = 0.05,
  robust = TRUE,
  show.plots = FALSE,
  pi0.known = FALSE,
  pi0.val = 0.9,
  penalty.choice = c("cv.mse", "cv.penalized.loss", "penalized.loss",
    "deviance.criterion")[3],
  est.MSE = c("est.var", "step")[1],
  cv.folds = 10,
  mult.cv.folds = 0,
  penalized.loss.delta = 2,
  nboot = 100,
  k.split = 4,
  step.direction = "backward"
)

Arguments

x

(n by m) matrix of main covariates where m is the number of covariates and n is the sample size.

z

(n by 1) matrix of additional fixed covariate affecting response variable. This covariate should always be selected. Can be NULL.

y

(n by 1) a matrix corresponding to the response variable. If regression.type is "cox", y contains the observed event times.

cox.delta

(n by 1) a matrix that denotes censoring when regression.type is "cox" (1 denotes survival event is observed, 0 denotes the survival event is censored). Can be NULL.

factor.z

logical. If TRUE, the fixed variable z is a factor variable.

regression.type

a character indicator that is either "linear" for linear regression or "cox" for Cox proportional hazards regression. Default is "linear".

weight.type

Character value denoting which weights to be used for the weighted lasso, where each covariate in x is multiplied by a scalar weight. Options include

  • one: The scalar weight is one.

  • corr.estimate: The scalar weight for covariate x_j is the Pearson correlation between x_j and y.

  • corr.pvalue: The scalar weight for covariate x_j is the p-value of the coefficient of x_j in the regression of y on x_j.

  • corr.bh.pvalue: The scalar weight for covariate x_j is the Benjamini-Hochberg adjusted p-value from corr.pvalue.

  • corr.qvalue: The scalar weight for covariate x_j is the q-value transform of the p-value from corr.pvalue.

  • corr.tstat: The scalar weight for covariate x_j is the t-statistic associated with testing the significance of x_j in the regression of y on x_j.

  • parcor.estimate: The scalar weight for covariate x_j is the partial correlation between x_j and y after adjustment for z.

  • parcor.pvalue: The scalar weight for covariate x_j is the p-value of the coefficient of x_j in the regression of y on z and x_j.

  • parcor.bh.pvalue: The scalar weight for covariate x_j is the Benjamini-Hochberg adjusted p-value from parcor.pvalue.

  • parcor.qvalue: The scalar weight for covariate x_j is the q-value transform of the p-value from parcor.pvalue.

  • parcor.tstat: The scalar weight for covariate x_j is the t-statistic associated with testing the significance of x_j in the regression of y on z and x_j.

  • exfrequency.random.partition.aic: The scalar weight for covariate x_j is an exclusion frequency. The exclusion frequency is obtained as follows: we first partition the covariates into k.split random groups, and we apply a stepwise linear/Cox regression of the response on each group of covariates. The final model is selected using an AIC criterion, and we track whether x_j is excluded from the final model. We repeat this procedure nboot times, and the exclusion frequency is the proportion of repetitions in which x_j is excluded.

  • exfrequency.random.partition.bic: The scalar weight for covariate x_j is computed as in exfrequency.random.partition.aic, except that the final model within each stepwise regression is selected using a BIC criterion.

  • exfrequency.kmeans.partition.aic: The scalar weight for covariate x_j is an exclusion frequency. The exclusion frequency is obtained as follows: we apply ridge regression of the response on all covariates and obtain a ridge regression coefficient for each covariate. We then partition the covariates into k.split groups using a K-means criterion on the ridge regression coefficients, and we apply a stepwise linear/Cox regression of the response on each group of covariates. The final model is selected using an AIC criterion, and we track whether x_j is excluded from the final model. We repeat this procedure nboot times, and the exclusion frequency is the proportion of repetitions in which x_j is excluded.

  • exfrequency.kmeans.partition.bic: The scalar weight for covariate x_j is computed as in exfrequency.kmeans.partition.aic, except that the final model within each stepwise regression is selected using a BIC criterion.

  • exfrequency.kquartiles.partition.aic: The scalar weight for covariate x_j is an exclusion frequency. The exclusion frequency is obtained as follows: we apply ridge regression of the response on all covariates and obtain a ridge regression coefficient for each covariate. We then partition the covariates into k.split groups using k-quantiles of the ridge regression coefficients, and we apply a stepwise linear/Cox regression of the response on each group of covariates. The final model is selected using an AIC criterion, and we track whether x_j is excluded from the final model. We repeat this procedure nboot times, and the exclusion frequency is the proportion of repetitions in which x_j is excluded.

  • exfrequency.kquartiles.partition.bic: The scalar weight for covariate x_j is computed as in exfrequency.kquartiles.partition.aic, except that the final model within each stepwise regression is selected using a BIC criterion.

  • exfrequency.ksorted.partition.aic: The scalar weight for covariate x_j is an exclusion frequency. The exclusion frequency is obtained as follows: we apply ridge regression of the response on all covariates and obtain a ridge regression coefficient for each covariate. We then order the ridge regression coefficients in descending order and split the ordered covariates into k.split groups. We then apply a stepwise linear/Cox regression of the response on each group of covariates. The final model is selected using an AIC criterion, and we track whether x_j is excluded from the final model. We repeat this procedure nboot times, and the exclusion frequency is the proportion of repetitions in which x_j is excluded.

  • exfrequency.ksorted.partition.bic: The scalar weight for covariate x_j is computed as in exfrequency.ksorted.partition.aic, except that the final model within each stepwise regression is selected using a BIC criterion.
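As a rough illustration of the corr.* and parcor.* options listed above, the quantities they describe can be reproduced with base R for the linear case. This is a minimal sketch, not the package's internal code: the seed, the object names (marginal, adjusted, marginal_bh, adjusted_bh) and the use of lm() are assumptions made for the example.

```r
# Simulated data mirroring the Examples section (the seed is an assumption).
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 5), n, 5)
z <- matrix(rbinom(n, 1, 0.5), n, 1)
y <- z[, 1] + 2 * x[, 1] - 2 * x[, 2] + rnorm(n)

# Marginal ("corr.*") quantities: regress y on each x_j alone.
marginal <- t(sapply(seq_len(ncol(x)), function(j) {
  fit <- summary(lm(y ~ x[, j]))$coefficients
  c(estimate = cor(x[, j], y),
    tstat    = fit[2, "t value"],
    pvalue   = fit[2, "Pr(>|t|)"])
}))

# Adjusted ("parcor.*") quantities: regress y on z and x_j.
adjusted <- t(sapply(seq_len(ncol(x)), function(j) {
  fit <- summary(lm(y ~ z[, 1] + x[, j]))$coefficients
  # Partial correlation of x_j and y given z: correlate the two residual vectors.
  pc <- cor(resid(lm(x[, j] ~ z[, 1])), resid(lm(y ~ z[, 1])))
  c(estimate = pc,
    tstat    = fit[3, "t value"],
    pvalue   = fit[3, "Pr(>|t|)"])
}))

# Benjamini-Hochberg adjusted p-values (the "*.bh.pvalue" quantities).
marginal_bh <- p.adjust(marginal[, "pvalue"], method = "BH")
adjusted_bh <- p.adjust(adjusted[, "pvalue"], method = "BH")
```

Each row of marginal and adjusted corresponds to one column of x, so any column of these matrices is a candidate weight vector of length m.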

weight_fn

A user-defined function to be applied to the weights for the weighted lasso. Default is the identity function.
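A few candidate transformations are sketched below to show the shape such a function takes: it receives a numeric vector of weights and returns a vector of the same length. The function names (identity_fn, floor_fn, abs_fn) are hypothetical and not part of the package, and how d2wlasso uses the transformed weights internally is not described here.

```r
# Hypothetical weight transformations; the names are illustrative only.
identity_fn <- function(x) x              # same behaviour as the documented default
floor_fn    <- function(x) pmax(x, 1e-4)  # keep weights away from exactly zero
abs_fn      <- function(x) abs(x)         # drop the sign of a t-statistic or correlation

w <- c(0, 0.02, 0.5, 1)   # made-up weights, e.g. p-values
floor_fn(w)

# Passing one of them to the documented argument would look like this
# (not run here, since it requires the d2wlasso package):
# d2wlasso(x, z, y, weight.type = "parcor.pvalue", weight_fn = floor_fn)
```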

ttest.pvalue

logical indicator used when weight.type is "corr.pvalue", "corr.bh.pvalue", "corr.qvalue", "parcor.pvalue", "parcor.bh.pvalue", or "parcor.qvalue". If TRUE, the p-value for each covariate is computed from a univariate linear/Cox regression of the response on that covariate. If FALSE, the p-value is computed from the correlation coefficient between the response and each covariate. Default is TRUE.
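To make the two routes concrete, the sketch below compares a regression-based p-value with a correlation-based one in base R. For a single covariate in a linear model the slope t-test and the Pearson correlation test give the same p-value; this uses cor.test() as a stand-in, and the package's own correlation-based computation (and the Cox or z-adjusted cases) may differ. The seed and variable names are assumptions.

```r
set.seed(2)
xj <- rnorm(50)
y  <- 1 + 0.4 * xj + rnorm(50)

# Route 1: p-value of the slope in the univariate regression of y on x_j.
p_regression <- summary(lm(y ~ xj))$coefficients["xj", "Pr(>|t|)"]

# Route 2: p-value of the Pearson correlation test (t distribution, n - 2 df).
p_correlation <- cor.test(xj, y)$p.value

all.equal(p_regression, p_correlation)
```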

q_opt_tuning_method

character indicator used when weight.type is "corr.qvalue" or "parcor.qvalue". Options are "bootstrap" or "smoother" to specify how the optimal tuning parameter is obtained when computing q-values from Storey and Tibshirani (2003). Default is "smoother" (smoothing spline).

qval.alpha

scalar value used when weight.type is "corr.qvalue" or "parcor.qvalue". The choice of qval.alpha indicates the cut-off for q-values used to obtain the result threshold.selection. The result threshold.selection contains all covariates whose q-value is less than qval.alpha.

alpha.bh

scalar value used when weight.type is "corr.pvalue", "corr.bh.pvalue", "parcor.pvalue", or "parcor.bh.pvalue". The choice of alpha.bh indicates the cut-off for p-values used to obtain the result threshold.selection. The result threshold.selection contains all covariates whose p-value is less than alpha.bh.
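The thresholding rule described for alpha.bh can be sketched with base R's p.adjust(). The p-values below are made up for illustration, and the object names are assumptions; the sketch only shows how raw and Benjamini-Hochberg adjusted p-values can lead to different selections at the same cut-off.

```r
# Made-up p-values for five covariates.
pvals <- c(x1 = 0.0004, x2 = 0.012, x3 = 0.04, x4 = 0.20, x5 = 0.74)
alpha.bh <- 0.05

# Benjamini-Hochberg adjusted p-values.
bh <- p.adjust(pvals, method = "BH")

# Covariates whose (raw or adjusted) p-value falls below the cut-off.
selected_raw <- names(pvals)[pvals < alpha.bh]
selected_bh  <- names(bh)[bh < alpha.bh]
```

Here x3 passes the cut-off on its raw p-value but not after adjustment, which is the usual effect of a multiplicity correction.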

robust

logical indicator used when weight.type is "corr.qvalue" or "parcor.qvalue". If TRUE, q-values computed as in Storey and Tibshirani (2003) are robust for small p-values.

show.plots

logical indicator. When weight.type is "corr.qvalue" or "parcor.qvalue", show.plots refers to figures associated with q-value computations as proposed in Storey and Tibshirani (2003). If show.plots is TRUE, we display the density histogram of the original p-values, the density histogram of the q-values, a scatter plot of \hat{\pi} versus \lambda in the computation of q-values, and a scatter plot of the number of significant tests versus the q-value cut-off. When penalty.choice is "penalized.loss", show.plots refers to plots associated with the penalized loss criterion. If TRUE, a plot of the penalized loss criterion versus steps in the LARS algorithm of Efron et al. (2004) is displayed. Default of show.plots is FALSE.

pi0.known

logical indicator used when weight.type is "corr.qvalue" or "parcor.qvalue". If TRUE, when computing q-values, the estimate of the true proportion of the null hypothesis is set to the value of pi0.val given by the user. If FALSE, the estimate of the true proportion of the null hypothesis is computed by bootstrap or smoothing spline as proposed in Storey and Tibshirani (2003). Default is FALSE.
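When the null proportion is not supplied, it has to be estimated from the p-values. The sketch below computes the basic Storey-type estimate at a grid of tuning values, which is the quantity the bootstrap or smoothing-spline step then summarises. It is a simplified illustration on simulated p-values; the function name pi0_hat, the grid, and the simulation settings are assumptions, and the package's own tuning step is not reproduced.

```r
set.seed(3)
# Simulated p-values: 900 true nulls (uniform) and 100 non-nulls (concentrated near 0).
pvals <- c(runif(900), rbeta(100, 0.5, 20))

# Estimate of the null proportion at one tuning value lambda:
# the share of p-values above lambda, rescaled by the width of (lambda, 1].
pi0_hat <- function(p, lambda) mean(p > lambda) / (1 - lambda)

lambda_grid <- seq(0.05, 0.90, by = 0.05)
pi0_grid <- sapply(lambda_grid, function(l) pi0_hat(pvals, l))

# plot(lambda_grid, pi0_grid) would show the estimate as a function of lambda.
```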

pi0.val

scalar used when weight.type is "corr.qvalue" or "parcor.qvalue". A user supplied estimate of the true proportion of the null hypothesis. Used only when pi0.known is TRUE. Default is 0.9.
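With a fixed null proportion, the standard (non-robust) q-value reduces to the Benjamini-Hochberg adjusted p-value scaled by that proportion. The sketch below uses this textbook form with made-up p-values; whether d2wlasso computes its q-values in exactly this way, in particular with robust = TRUE, is an assumption.

```r
pvals   <- c(0.0004, 0.012, 0.04, 0.20, 0.74)  # made-up p-values
pi0.val <- 0.9                                 # user-supplied null proportion

# q-value with known pi0: pi0 times the step-up (BH) adjusted p-value.
qvals <- pi0.val * p.adjust(pvals, method = "BH")

# Covariates whose q-value is below the cut-off qval.alpha.
qval.alpha <- 0.15
selected <- which(qvals < qval.alpha)
```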

penalty.choice

character that indicates the variable selection criterion. Options are "cv.mse" for the K-fold cross-validated mean squared prediction error, "penalized.loss" for the penalized loss criterion, which requires specification of the penalization parameter penalized.loss.delta, "cv.penalized.loss" for the K-fold cross-validated criterion to determine delta in the penalized loss criterion, and "deviance.criterion" for optimizing the Cox proportional hazards deviance (only available when regression.type is "cox"). Default is "penalized.loss".

est.MSE

character that indicates how the mean squared error is estimated in the penalized loss criterion when penalty.choice is "penalized.loss" or "cv.penalized.loss". Options are "est.var" which means the MSE is sd(y) * sqrt(n/(n-1)) where n is the sample size, and "step" which means we use the MSE from forward stepwise regression with AIC as the selection criterion. Default is "est.var".

cv.folds

scalar denoting the number of folds for cross-validation when penalty.choice is "cv.mse" or "cv.penalized.loss". Default is 10.

mult.cv.folds

scalar denoting the number of times the cross-validation procedure is repeated when penalty.choice is "cv.mse" or "cv.penalized.loss". Default is 0.
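The idea behind cv.folds and repeated cross-validation can be sketched in base R with an ordinary linear model. This is an illustration of K-fold mean squared prediction error only; the helper cv_mse, the seed, and the choice of 20 repetitions are assumptions, and the way d2wlasso combines repeated runs is not shown.

```r
set.seed(4)
n <- 100
x <- matrix(rnorm(n * 5), n, 5)
y <- 2 * x[, 1] - 2 * x[, 2] + rnorm(n)
dat <- data.frame(y = y, x)   # columns are named y, X1, ..., X5

# One K-fold estimate of the mean squared prediction error for a model formula.
# The response column is assumed to be called "y".
cv_mse <- function(formula, data, cv.folds = 10) {
  # Random assignment of each observation to one of cv.folds folds.
  fold <- sample(rep(seq_len(cv.folds), length.out = nrow(data)))
  errs <- sapply(seq_len(cv.folds), function(k) {
    fit  <- lm(formula, data = data[fold != k, ])
    test <- data[fold == k, ]
    mean((test$y - predict(fit, newdata = test))^2)
  })
  mean(errs)
}

# Repeating the procedure with fresh random folds reduces the dependence
# on any single fold assignment.
repeated <- replicate(20, cv_mse(y ~ X1 + X2, dat))
mean(repeated)
```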

penalized.loss.delta

scalar to indicate the choice of the penalization parameter delta in the penalized loss criterion when penalty.choice is "penalized.loss". Default is 2.

nboot

scalar denoting the number of bootstrap samples obtained for exclusion frequency weights when weight.type is "exfrequency.random.partition.aic", "exfrequency.random.partition.bic", "exfrequency.kmeans.partition.aic", "exfrequency.kmeans.partition.bic","exfrequency.kquartiles.partition.aic", "exfrequency.kquartiles.partition.bic","exfrequency.ksorted.partition.aic","exfrequency.ksorted.partition.bic". Default is 100.

k.split

scalar that indicates the number of partitions used to compute the exclusion frequency weights when weight.type is "exfrequency.random.partition.aic", "exfrequency.random.partition.bic", "exfrequency.kmeans.partition.aic", "exfrequency.kmeans.partition.bic","exfrequency.kquartiles.partition.aic", "exfrequency.kquartiles.partition.bic","exfrequency.ksorted.partition.aic","exfrequency.ksorted.partition.bic". Default is 4.

step.direction

character that indicates the direction of stepwise regression used to compute the exclusion frequency weights when weight.type is "exfrequency.random.partition.aic", "exfrequency.random.partition.bic", "exfrequency.kmeans.partition.aic", "exfrequency.kmeans.partition.bic","exfrequency.kquartiles.partition.aic", "exfrequency.kquartiles.partition.bic","exfrequency.ksorted.partition.aic","exfrequency.ksorted.partition.bic". One of "both", "forward" or "backward". Default is "backward".
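The roles of nboot, k.split, and step.direction can be illustrated with a base R sketch of the random-partition exclusion frequency for the linear case. It follows the description given for exfrequency.random.partition.aic but is not the package's implementation: the function exclusion_frequency, its argument k (the penalty passed to step()), the seed, and the reduced nboot are assumptions made to keep the example short.

```r
set.seed(5)
n <- 100
x <- matrix(rnorm(n * 8), n, 8)
colnames(x) <- paste0("x", 1:8)
y <- 2 * x[, 1] - 2 * x[, 2] + rnorm(n)
dat <- data.frame(y = y, x)

exclusion_frequency <- function(dat, k.split = 4, nboot = 20, k = 2,
                                step.direction = "backward") {
  vars <- setdiff(names(dat), "y")
  excluded <- setNames(numeric(length(vars)), vars)
  for (b in seq_len(nboot)) {
    # Random partition of the covariates into k.split groups.
    groups <- split(sample(vars),
                    rep(seq_len(k.split), length.out = length(vars)))
    for (g in groups) {
      full <- lm(reformulate(g, response = "y"), data = dat)
      # Stepwise selection within the group; k = 2 is AIC, k = log(n) is BIC.
      chosen <- step(full, direction = step.direction, k = k, trace = 0)
      kept <- attr(terms(chosen), "term.labels")
      dropped <- setdiff(g, kept)
      excluded[dropped] <- excluded[dropped] + 1
    }
  }
  # Proportion of repetitions in which each covariate was excluded.
  excluded / nboot
}

freq_aic <- exclusion_frequency(dat)
freq_bic <- exclusion_frequency(dat, k = log(nrow(dat)))
```

Covariates with a strong effect (x1 and x2 here) should have exclusion frequencies near zero, while noise covariates tend to be excluded more often.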

Value

References

Garcia, T.P. and Müller, S. (2016). Cox regression with exclusion frequency-based weights to identify neuroimaging markers relevant to Huntington’s disease onset. Annals of Applied Statistics, 10, 2130-2156.

Garcia, T.P. and Müller, S. (2014). Influence of measures of significance-based weights in the weighted Lasso. Journal of the Indian Society of Agricultural Statistics (Invited paper), 68, 131-144.

Garcia, T.P., Mueller, S., Carroll, R.J., Dunn, T.N., Thomas, A.P., Adams, S.H., Pillai, S.D., and Walzem, R.L. (2013). Structured variable selection with q-values. Biostatistics, DOI:10.1093/biostatistics/kxt012.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Annals of Statistics 32, 407–499.

Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences 100, 9440-9445.

Examples

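library(d2wlasso)

## Linear model: n = 100 observations, 5 covariates in x, one binary covariate z.
## Only z, x[,1] and x[,2] affect the response.
x = matrix(rnorm(100*5, 0, 1),100,5)
z = matrix(rbinom(100, 1, 0.5),100,1)
y = matrix(z[,1] + 2*x[,1] - 2*x[,2] + rnorm(100, 0, 1), 100)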

dwl0 = d2wlasso(x,z,y)
dwl1 = d2wlasso(x,z=NULL,y,weight.type="corr.pvalue")
dwl2 = d2wlasso(x,z,y,weight.type="parcor.qvalue")
dwl3 = d2wlasso(x,z,y,weight.type="parcor.bh.pvalue")
dwl4 = d2wlasso(x,z,y,weight.type="parcor.qvalue",mult.cv.folds=100)
dwl5 = d2wlasso(x,z,y,weight.type="exfrequency.random.partition.aic")
dwl6 = d2wlasso(x,z,y,weight.type="exfrequency.kmeans.partition.aic")
dwl7 = d2wlasso(x,z,y,weight.type="exfrequency.kquartiles.partition.aic")
dwl8 = d2wlasso(x,z,y,weight.type="exfrequency.ksorted.partition.aic")

## Cox model: y holds event times; cox.delta = 1 for every subject (no censoring)
x = matrix(rnorm(100*5, 0, 1),100,5)
z = matrix(rbinom(100, 1, 0.5),100,1)
y = matrix(exp(z[,1] + 2*x[,1] - 2*x[,2] + rnorm(100, 0, 2)), 100)
cox.delta = matrix(1,nrow=length(y),ncol=1)
dwl0.cox = d2wlasso(x,z,y,cox.delta,regression.type="cox",penalty.choice="cv.mse")
dwl1.cox = d2wlasso(x,z=NULL,y,cox.delta,
  regression.type="cox",weight.type="corr.pvalue",penalty.choice="cv.mse")
dwl2.cox = d2wlasso(x,z,y,cox.delta,
  regression.type="cox",weight.type="parcor.qvalue",penalty.choice="cv.mse")
dwl3.cox = d2wlasso(x,z,y,cox.delta,
  regression.type="cox",weight.type="parcor.bh.pvalue",penalty.choice="cv.mse")
dwl4.cox = d2wlasso(x,z,y,cox.delta,
  regression.type="cox",weight.type="parcor.qvalue",
  mult.cv.folds=100,penalty.choice="cv.mse")
dwl5.cox = d2wlasso(x,z,y,cox.delta,regression.type="cox",
  weight.type="exfrequency.random.partition.aic",penalty.choice="cv.mse")
dwl6.cox = d2wlasso(x,z,y,cox.delta,regression.type="cox",
  weight.type="exfrequency.kmeans.partition.aic",penalty.choice="cv.mse")
dwl7.cox = d2wlasso(x,z,y,cox.delta,regression.type="cox",
  weight.type="exfrequency.kquartiles.partition.aic",penalty.choice="cv.mse")
dwl8.cox = d2wlasso(x,z,y,cox.delta,regression.type="cox",
  weight.type="exfrequency.ksorted.partition.aic",penalty.choice="cv.mse")

rakheon/d2wlasso documentation built on Feb. 26, 2020, 10:39 p.m.