This is supplementary material to the paper "A Scalable Empirical Bayes Approach to Variable Selection in Generalized Linear Models" (Bar, Booth & Wells, 2019). This vignette provides code and sample output for some simulated data and case studies.
SEMMS is implemented in R and is available as a package on GitHub, at https://github.com/haimbar/SEMMS .
SEMMS requires R version 3.6.0 or later. Five packages must be pre-installed: Rcpp, RcppArmadillo, MASS, car, and, as of version 0.2.0, the edgefinder package, which is available from https://github.com/haimbar/edgefinder and can be installed with
devtools::install_github("haimbar/edgefinder")
To install SEMMS,
devtools::install_github("haimbar/SEMMS")
To create files which SEMMS can use, please read the documentation of the function readInputFile.
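For example, a minimal call might look as follows (a sketch only; the file name and column indices below are placeholders, not files shipped with the package):

library(SEMMS)
# Hypothetical comma- or tab-separated file with the response in column 1
# and 100 candidate predictors in columns 2-101:
dataYXZ <- readInputFile("mydata.csv", ycol = 1, Zcols = 2:101)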
The air-pollution data set analyzed in this case study was first introduced in (Breiman and Friedman, 1985), where it was used to illustrate the ACE procedure. It consists of daily measurements of ozone concentration levels in the Los Angeles basin, collected over 330 days in 1976. There are eight meteorological explanatory variables, labeled x1,...,x8.
We refer to a more recent analysis in (Lee, Nelder, and Pawitan, 2006, subsection 2.4.4), which also uses x9, the day of the year (doy). Selecting a first-order linear regression model can be done easily by checking all 2^9 = 512 possible models, but this strategy is not feasible when we wish to include second- or third-order terms (with 2^54 and 2^219 possible models, respectively, since there are 54 terms of degree at most two and 219 of degree at most three). We consider models with first- and second-order terms, for a total of 54 putative predictors. We standardize these predictors using the scale function in R. We use SEMMS to fit a log-linear model in which log(Y) is assumed to be normally distributed. The following code shows how to run the "greedy" version of SEMMS (rnd=F in the code below) and produce some plots (see Figure 1 and Figure 2 below).
library(SEMMS)
# The ozone data, adding second order terms
fn <- system.file("extdata", "ozone.txt", package = "SEMMS", mustWork = TRUE)
dataYXZ <- readInputFile(fn, ycol=2, skip=19, Zcols=3:11, addIntercept = TRUE, logTransform = 2, twoWay = TRUE)
nn <- 20 # initial guess for the number of non-nulls
distribution <- "gaussian"
rnd <- F
fittedSEMMS <- fitSEMMS(dataYXZ, mincor=0.75, nn=nn, minchange=1, initWithEdgeFinder=F, distribution="N", verbose=T, rnd=rnd)
foundSEMMS <- sort(union(which(fittedSEMMS$gam.out$lockedOut != 0), fittedSEMMS$gam.out$nn))
fittedGLM <- runLinearModel(dataYXZ, fittedSEMMS$gam.out$nn, "N")
print(summary(fittedGLM$mod))
plotMDS(dataYXZ, fittedSEMMS, fittedGLM, ttl="Ozone Data")
plotFit(fittedGLM)
[Figure 1: Ozone data, MDS plot of the selected model (plotMDS)]
[Figure 2: Ozone data, fitted versus observed values (plotFit)]
The model with the variables depicted in Figure 1 fits well, as can be seen in Figure 2, and it explains 83.3% of the variability in the data. The adjusted R^2 is 82.5%, and the AIC is 187 (Akaike, 1974), compared with 75% and 303, respectively, in (Lee et al., 2006).
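These summaries can be extracted from the fitted model object, for example as follows (a short sketch, assuming that fittedGLM$mod returned by runLinearModel is an ordinary lm fit):

summary(fittedGLM$mod)$r.squared      # proportion of variability explained
summary(fittedGLM$mod)$adj.r.squared  # adjusted R^2
AIC(fittedGLM$mod)                    # Akaike information criterion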
Notes about the code: instead of providing an initial guess for the number of non-null predictors (nn), one can supply a specific initial set of variables via the nnset argument, for example
fittedSEMMS <- fitSEMMS(dataYXZ, mincor=0.75, nnset=c(9,45,54), minchange=1, distribution="N",verbose=T,rnd=rnd)
One may also choose the initial set of variables using a different method, such as the lasso. Because of the iterative nature of our generalized alternating minimization (GAM) algorithm, SEMMS is guaranteed to do at least as well as the method used to initialize it, since each iteration yields a non-increasing Kullback-Leibler divergence. For example, using the ncvreg package with the MCP penalty, we can do:
library(ncvreg)
cv <- cv.ncvreg(dataYXZ$Z, dataYXZ$Y, family="gaussian", penalty="MCP")
fit <- cv$fit
beta <- fit$beta[, cv$min]
nnset <- which(beta != 0)[-1] - 1
fittedSEMMS <- fitSEMMS(dataYXZ, mincor=0.75, nnset=nnset, minchange=1, distribution="N", verbose=T, rnd=rnd)
This yields an initial set of 16 variables in nnset.
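One can then compare this initial set with the set ultimately selected by SEMMS (a short sketch, using the objects defined above):

initSet  <- nnset                      # variables chosen by the MCP fit
finalSet <- fittedSEMMS$gam.out$nn     # variables selected by SEMMS
intersect(initSet, finalSet)           # retained by SEMMS
setdiff(initSet, finalSet)             # dropped during the GAM iterations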
Finally, we show the results when SEMMS is run with the edgefinder option set to TRUE:
fittedSEMMS <- fitSEMMS(dataYXZ, mincor=0.8, nn=15, minchange=1, initWithEdgeFinder=T, distribution="N", verbose=T, rnd=F)
fittedGLM <- runLinearModel(dataYXZ, fittedSEMMS$gam.out$nn, "N")
plotMDS(dataYXZ, fittedSEMMS, fittedGLM, ttl="Ozone data")
[Figure 3: Ozone data, MDS plot of the model selected with the edgefinder initialization]
We can obtain a table of the estimated coefficients for the final model, as follows.
library(xtable)
regressionTable <- summary(fittedGLM$mod)$coefficients
zvars <- grep("Z\\d{4}", rownames(regressionTable), perl = T)
rownames(regressionTable)[zvars] <- dataYXZ$originalZnames[fittedSEMMS$gam.out$nn]
print(xtable(regressionTable), type="html")
|             | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 2.21 | 0.02 | 114.88 | 0.00 |
| vh          | 0.26 | 0.03 | 8.84 | 0.00 |
| ibh         | -0.22 | 0.03 | -8.25 | 0.00 |
| vis         | -0.09 | 0.02 | -4.09 | 0.00 |
| doy         | -0.08 | 0.02 | -3.65 | 0.00 |
| humid*dpg   | -0.19 | 0.02 | -7.61 | 0.00 |
| ibh*dpg     | -0.16 | 0.02 | -7.18 | 0.00 |
| doy*doy     | -0.28 | 0.02 | -12.52 | 0.00 |
The log(ozone) levels are a quadratic function of the day of the year. The maximum of this second-degree polynomial with respect to doy corresponds approximately to June 20, very close to the summer solstice. This captures the seasonal effect, as it is well known that ozone levels are associated with the duration of daylight.
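For reference, the maximizing day follows from the vertex of the fitted quadratic: for $\beta_1 d + \beta_2 d^2$ with $\beta_2 < 0$, the maximum is attained at $d^\ast = -\beta_1/(2\beta_2)$, evaluated on the scale on which the doy terms enter the model and then mapped back to the original day-of-year scale.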
The NKI70 data set is available from the “penalized” package in R (Goeman, 2010). In (Bar, Booth, and Wells, 2019) we discuss how we used this data set to determine which genes are associated with either death or recurrence of metastasis. Here, we show how to invoke SEMMS to perform the analysis, using a Poisson response, following (Whitehead, 1980). The selected model is shown graphically in Figure 4.
fn <- system.file("extdata", "NKI70_t1.RData", package = "SEMMS", mustWork = TRUE) dataYXZ <- readInputFile(file=fn, ycol=1, Zcols=2:73) fittedSEMMS <- fitSEMMS(dataYXZ, mincor=0.8, nn=6, minchange = 1, distribution="P", verbose=T, rnd=F) # Fit the linear model using the selected predictors fittedGLM <- runLinearModel(dataYXZ,fittedSEMMS$gam.out$nn, "P") plotMDS(dataYXZ, fittedSEMMS, fittedGLM, ttl="NKI70") summary(fittedGLM$mod)
[Figure 4: NKI70 data, MDS plot of the selected model]
The estimated coefficients for the selected model are as follows:
|             | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -2.81 | 0.40 | -7.04 | 0.00 |
| log.t_i1.   | -0.37 | 0.09 | -4.17 | 0.00 |
| SCUBE2      | -0.35 | 0.27 | -1.27 | 0.20 |
| KNTC2       | -0.79 | 0.33 | -2.41 | 0.02 |
| ZNF533      | -0.56 | 0.36 | -1.56 | 0.12 |
| IGFBP5.1    | 0.36 | 0.16 | 2.20 | 0.03 |
| PRC1        | 1.09 | 0.38 | 2.91 | 0.00 |
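Since the response is modeled as Poisson, exponentiated coefficients can be interpreted as rate ratios. A short sketch (assuming fittedGLM$mod is a standard glm fit):

# Rate ratios with approximate (Wald) 95% confidence intervals
exp(cbind(RateRatio = coef(fittedGLM$mod), confint.default(fittedGLM$mod)))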
SEMMS allows the user to 'lock in' variables, that is, to specify which variables must be included in the model. For example, if we want to include log.t_i1. and Age in the model, we invoke SEMMS using both the Xcols and Zcols parameters, like this:
dataYXZ <- readInputFile(file=fn, ycol=1, Xcols=2:3, Zcols=4:73)
fittedSEMMS <- fitSEMMS(dataYXZ, mincor=0.8, nn=6, minchange = 1, distribution="P", verbose=T, rnd=F)
fittedGLM <- runLinearModel(dataYXZ, fittedSEMMS$gam.out$nn, "P")
plotMDS(dataYXZ, fittedSEMMS, fittedGLM, ttl="NKI70")
summary(fittedGLM$mod)
The plotMDS function does not show locked-in predictors, but their coefficients are estimated by runLinearModel.
|             | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -2.77 | 0.40 | -7.00 | 0.00 |
| log.t_i1.   | -0.38 | 0.09 | -4.28 | 0.00 |
| Age         | -0.18 | 0.19 | -0.95 | 0.34 |
| ZNF533      | -0.72 | 0.39 | -1.85 | 0.06 |
| COL4A2      | 0.45 | 0.20 | 2.29 | 0.02 |
| IGFBP5.1    | 0.36 | 0.17 | 2.17 | 0.03 |
| PRC1        | 0.55 | 0.28 | 1.96 | 0.05 |
The following example consists of 1,000 predictors, of which 20 are significant and highly correlated with each other. The model was denoted by N5 in (Bar, Booth, and Wells, 2019); the error terms are i.i.d. from a standard normal distribution, and Z1,...,Z20 are drawn from a multivariate normal distribution with an autoregressive, AR(1), correlation structure with $\rho=0.95$. The sample size is N=100, and in our simulations the median number of true positives obtained by SEMMS is 20 and the median number of false positives is 0. The code for fitting the mixture model to this data via SEMMS is given here. Note that the input file may be a text file (e.g., comma or tab separated), or it can be an RData file which was created previously with the readInputFile function.
fn <- system.file("extdata", "AR1SIM.RData", package = "SEMMS", mustWork = TRUE) dataYXZ <- readInputFile(fn, ycol=1, Zcols=2:1001) fittedSEMMS <- fitSEMMS(dataYXZ, mincor=0.8, nn=15, minchange = 1, distribution="N",verbose=T,rnd=F) # Fit the linear model using the selected predictors fittedGLM <- runLinearModel(dataYXZ,fittedSEMMS$gam.out$nn, "N") plotMDS(dataYXZ, fittedSEMMS, fittedGLM, ttl="AR1 simulation")
In this case we use edgefinder to detect highly correlated pairs, and we find all of the true predictors and no false positives. If we do not use edgefinder and instead rely on the mincor=0.8 setting, SEMMS finds one false positive (v211) and 20 true positives. Figure 5 shows a graphical representation of the selected model. The red diamonds represent the variables which were selected by SEMMS, and the gold circles correspond to variables which are highly correlated with a selected predictor. This diagram, obtained from the plotMDS function in the SEMMS package, clearly shows the AR(1) structure among the significant predictors.
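The true/false positive counts quoted above can be verified directly, since in this simulated data set Z1,...,Z20 are the true predictors (a short sketch, assuming the indices in fittedSEMMS$gam.out$nn refer to the columns of Z in their original order):

trueSet  <- 1:20                       # indices of the true predictors
selected <- fittedSEMMS$gam.out$nn     # indices selected by SEMMS
sum(selected %in% trueSet)             # number of true positives
sum(!(selected %in% trueSet))          # number of false positives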
[Figure 5: AR1 simulation, MDS plot of the selected model]
The data set SimBin provided with the package contains data simulated according to setting B2 in (Bar, Booth, and Wells, 2019), with 1,000 predictors of which 10 are associated with a binary response. The true predictors consist of two sets of variables (Z1-Z5 and Z101-Z105), each having an autoregressive, AR(1), correlation structure with rho=0.95.
fn <- system.file("extdata", "SimBin.RData", package = "SEMMS", mustWork = TRUE) dataYXZ <- readInputFile(fn, ycol=1, Zcols=2:1001) fittedSEMMS <- fitSEMMS(dataYXZ, mincor=0.7, nn=5, minchange = 1, distribution="B", verbose=T, rnd=F) fittedGLM <- runLinearModel(dataYXZ,fittedSEMMS$gam.out$nn, "B") plotMDS(dataYXZ, fittedSEMMS, fittedGLM, ttl="Simulated Binomial (AR1)")
[Figure 6: Simulated binomial data (AR1), MDS plot of the selected model]
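For the binary response, the in-sample fit of the selected model can be examined via a simple classification table (a sketch, assuming fittedGLM$mod is a standard logistic glm fit and the response is coded 0/1):

phat <- predict(fittedGLM$mod, type = "response")               # fitted probabilities
table(observed = dataYXZ$Y, predicted = as.integer(phat > 0.5)) # 0.5 cutoff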
Akaike, Hirotugu. 1974. "A new look at the statistical model identification." IEEE Transactions on Automatic Control 19 (6): 716-723.
Bar, Haim, James G. Booth, and Martin T. Wells. 2019. "A Scalable Empirical Bayes Approach to Variable Selection in Generalized Linear Models." Journal of Computational and Graphical Statistics. DOI: 10.1080/10618600.2019.1706542.
Breiman, L., and J. Friedman. 1985. "Estimating Optimal Transformations for Multiple Regression and Correlation." Journal of the American Statistical Association 80: 580-598.
Goeman, J. J. 2010. "L1 Penalized Estimation in the Cox Proportional Hazards Model." Biometrical Journal 52 (1).
Hastie, T., and R. Tibshirani. 1990. Generalized Additive Models. New York, NY, USA: Chapman and Hall.
Lee, Y, J A Nelder, and Y Pawitan. 2006. Generalized Linear Models with Random Effects. London, UK: Chapman & Hall/CRC.
Whitehead, J. 1980. "Fitting Cox’s Regression Model to Survival Data Using GLIM." Applied Statistics 29: 268-275.