Background on model selection and the mplot package.

The methods provided by the mplot package rely heavily on various bootstrap techniques to give an indication of how stable the selection of a given model or variable is. Although we do not do so here, they could equally be implemented with resampling methods other than the bootstrap, for example cross validation. The m in mplot stands for model selection/building, and we anticipate that in future more graphs and methods will be added to the package to further aid better and more stable building of regression models. The intention is to encourage researchers to engage more closely with the model selection process, allowing them to pair their experience and domain-specific knowledge with comprehensive summaries of the relative importance of various statistical models.

Two major challenges in model building are the vast number of models to choose from and the myriad ways to do so. Standard approaches include stepwise variable selection techniques and, more recently, the lasso. A common issue with these and other methods is their instability, that is, the tendency for small changes in the data to lead to the selection of different models.

An early and significant contribution to bootstrap model selection is @Shao:1996, who showed that carefully selecting $m$ in an $m$-out-of-$n$ bootstrap drives the theoretical properties of the model selector. @Mueller:2005 and @Mueller:2009 modified and generalised Shao's $m$-out-of-$n$ bootstrap model selection method to robust settings, first in linear regression and then in generalised linear models. The bootstrap is also used in regression models that are not yet covered by the mplot package, such as mixed models [e.g. @Shang:2008] and partially linear models [e.g. @Mueller:2009b], as well as for the selection of tuning parameters in regularisation methods [e.g. @Park:2014].
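As a minimal illustration of the resampling idea (not Shao's full procedure), the following base R sketch repeatedly draws $m$-out-of-$n$ bootstrap samples, compares two candidate models by BIC on each resample and tallies how often each wins; the choice $m = \lfloor n^{0.75} \rfloor$ and the candidate models are purely illustrative:

```r
# Illustrative m-out-of-n bootstrap: resample m < n rows with replacement,
# pick the better of two candidate models by BIC, and tally the winners.
set.seed(2)
n <- nrow(mtcars)
m <- floor(n^0.75)  # illustrative choice of m < n
wins <- replicate(200, {
  idx <- sample(n, m, replace = TRUE)
  b1 <- BIC(lm(mpg ~ wt, data = mtcars[idx, ]))
  b2 <- BIC(lm(mpg ~ wt + hp, data = mtcars[idx, ]))
  if (b1 < b2) "mpg ~ wt" else "mpg ~ wt + hp"
})
table(wins) / length(wins)  # empirical selection proportions
```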

Assume that we have $n$ independent observations $\mathbf{y} = (y_{1},\ldots,y_{n})^{\top}$ and an $n\times p$ full rank design matrix $\mathbf{X}$ whose columns are indexed by $1,\ldots,p$. Let $\alpha$ denote any subset of $p_{\alpha}$ distinct elements from $\{1,\ldots,p\}$. Let $\mathbf{X}_{\alpha}$ be the corresponding $n\times p_{\alpha}$ design matrix and $\mathbf{x}_{\alpha i}^{\top}$ denote the $i$th row of $\mathbf{X}_{\alpha}$.

The mplot package focuses specifically on linear and generalised linear models (GLMs). In the context of GLMs, a model $\alpha$ for the relationship between the response $\mathbf{y}$ and the design matrix $\mathbf{X}_{\alpha}$ is specified by \begin{align} \mathbb{E}(\mathbf{y}) = h(\mathbf{X}_{\alpha}\boldsymbol{\beta}_{\alpha}), \text{ and } \operatorname{var}(\mathbf{y}) = \sigma^{2}v(h(\mathbf{X}_{\alpha}\boldsymbol{\beta}_{\alpha})), \end{align} where $\boldsymbol{\beta}_{\alpha}$ is an unknown $p_{\alpha}$-vector of regression parameters and $\sigma$ is an unknown scale parameter. Here $\mathbb{E}(\cdot)$ and $\operatorname{var}(\cdot)$ denote the expected value and variance of a random variable, $h$ is the inverse of the usual link function, applied elementwise, and both $h$ and $v$ are assumed known. When $h$ is the identity and $v(\cdot)=1$, we recover the standard linear model.
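As a concrete illustration of this notation, the following sketch simulates data, extracts a submatrix $\mathbf{X}_{\alpha}$, and fits both a standard linear model and a logistic GLM, for which $h$ is the inverse logit; all object names are illustrative:

```r
# Minimal sketch: n observations, p candidate regressors, and the
# submatrix X_alpha for a subset alpha of {1, ..., p}.
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
alpha <- c(1, 3, 5)          # a subset of column indices
X_alpha <- X[, alpha]        # the n x p_alpha design matrix
beta_alpha <- c(1, -0.5, 2)  # illustrative regression parameters

# Standard linear model: h is the identity and v(.) = 1
y <- X_alpha %*% beta_alpha + rnorm(n)
fit_lm <- lm(y ~ X_alpha)

# Logistic GLM: h = plogis() is the inverse of the logit link
y_bin <- rbinom(n, 1, plogis(X_alpha %*% beta_alpha))
fit_glm <- glm(y_bin ~ X_alpha, family = binomial)
```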

The purpose of model selection is to choose one or more models $\alpha$ from a set of candidate models, which may be the set of all models $\mathcal{A}$ or a reduced model set (obtained, for example, using an initial screening method). Many model selection procedures assess model fit using the generalised information criterion, \begin{equation} \textrm{GIC}(\alpha,\lambda) = \hat{Q}(\alpha) + \lambda p_{\alpha}. \label{GIC} \end{equation} The $\hat{Q}(\alpha)$ component is a measure of description loss or lack of fit, that is, a function describing how well a model fits the data, for example, the residual sum of squares or $-2~\times~\text{log-likelihood}$. The number of independent regression model parameters, $p_{\alpha}$, is a measure of model complexity, and the penalty multiplier, $\lambda$, determines the properties of the model selection criterion [@Mueller:2010; @Murray:2013]. When $\hat{Q}(\alpha)=-2\times\text{log-likelihood}(\alpha)$, special cases include the AIC with $\lambda=2$, the BIC with $\lambda=\log(n)$ and, more generally, the GIC with arbitrary $\lambda\in\mathbb{R}$ [@Konishi:1996].
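To make the criterion concrete, the short sketch below computes $\textrm{GIC}(\alpha,\lambda)$ from a fitted model and recovers the AIC and BIC as special cases; the helper gic() is our own illustrative function, not part of mplot:

```r
# GIC(alpha, lambda) = Qhat(alpha) + lambda * p_alpha, where
# Qhat(alpha) = -2 * log-likelihood and p_alpha is the parameter count.
gic <- function(fit, lambda) {
  ll <- logLik(fit)
  -2 * as.numeric(ll) + lambda * attr(ll, "df")
}
fit <- lm(mpg ~ wt + hp, data = mtcars)
gic(fit, lambda = 2)                  # identical to AIC(fit)
gic(fit, lambda = log(nrow(mtcars)))  # identical to BIC(fit)
```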

The mplot package currently implements variable inclusion plots, model stability plots and a model selection procedure inspired by the adaptive fence of @Jiang:2008. Variable inclusion plots were introduced independently by @Mueller:2010 and @Meinshausen:2010. The idea is that the best model is selected over a range of values of the penalty multiplier $\lambda$, and the results are visualised on a plot which shows how often each variable is included in the best model. These types of plots have previously been referred to as stability paths, model selection curves and, most recently, variable inclusion plots (VIPs) in @Murray:2013. An alternative to penalising for the number of variables in a model is to assess the fit of models within each model size. This is the approach taken in our model stability plots, where searches are performed over a number of bootstrap replications and the best models of each size are tallied. The rationale is that if there exists a correct model of a particular size, it will be selected overwhelmingly more often than other models of the same size. Finally, the adaptive fence was introduced by @Jiang:2008, and later simplified by @Jiang:2009, to select mixed models. This is the first time code has been made available to implement the adaptive fence, and the first time the adaptive fence has been applied to linear and generalised linear models.
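The sketch below shows how these three methods are typically invoked on the package's built-in artificial example; the function names vis() and af() and the plot arguments follow the package documentation, though defaults such as the number of bootstrap replications B may differ across versions:

```r
library(mplot)

lm.art <- lm(y ~ ., data = artificialeg)  # full model on built-in data
vis.art <- vis(lm.art, B = 150)  # bootstrap: VIPs + model stability
af.art  <- af(lm.art, B = 150)   # adaptive fence

plot(vis.art, which = "vip")     # variable inclusion plot
plot(vis.art, which = "boot")    # model stability plot
plot(af.art)                     # adaptive fence plot
```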

For all methods, we provide publication-quality classical plot methods using base R graphics, as well as interactive plots using the googleVis package [@Gesmann:2011]. We also add further utility to these plot methods by packaging the results in a shiny web interface, which facilitates a high degree of interactivity [@Chang:2015a; @Chang:2015b].
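For example, assuming the objects from the previous sketch, the shiny interface is launched by passing them to the mplot() function:

```r
# Launch the browser-based shiny interface, combining the fitted model
# with the vis() and af() results computed above.
mplot(lm.art, vis.art, af.art)
```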

In the rejoinder to their least angle regression paper, @Tibshirani:2004 comment,

"In actual practice, or at least in good actual practice, there is a cycle of activity between the investigator, the statistician and the computer ... The statistician examines the output critically, as did several of our commentators, discussing the results with the investigator, who may at this point suggest adding or removing explanatory variables, and so on, and so on."

We hope the suite of methods available in the mplot package adds valuable information to this cycle of activity between researchers and statisticians, in particular by providing statisticians and researchers alike with a deeper understanding of the relative importance of different models and the variables contained therein.

In the artificial example, we demonstrated a situation where giving the researcher more information in a graphical presentation can lead to choosing the correct model when standard stepwise procedures would have failed.

The diabetes data set suggested the existence of a number of different dominant models at various model sizes, which could then be investigated further, for example, statistically using cross validation to determine predictive ability, or in discussion with researchers to see which makes the most practical sense. In contrast, no clear models are suggested for the birth weight example: the adaptive fence has no peaks, nor is there a clearly dominant model in the model stability plot, even though all but one of the variables are more informative than the added redundant variable in the variable inclusion plot.

While the core of the mplot package is built around exhaustive searches, these become computationally infeasible as the number of variables grows. We have implemented analogues of the model stability plots and variable inclusion plots based on glmnet, which extends the concept of model stability to much larger model sizes, though the results are no longer based on exhaustive searches.
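A sketch of this glmnet-based alternative using the package's bglmnet() function is given below; the argument and plot option names follow the package documentation but may vary between versions:

```r
# Approximate model stability for larger problems: bootstrap over
# glmnet's regularisation path instead of an exhaustive search.
bg.art <- bglmnet(lm.art, B = 150)
plot(bg.art, which = "boot")  # glmnet-based model stability plot
plot(bg.art, which = "vip")   # glmnet-based variable inclusion plot
```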

The graphs provided by the mplot package are a major contribution. A large amount of information is generated by the various methods, and the best way to interpret that information is through effective visualisations. For example, as was shown in the diabetes example, the path a variable takes through the variable inclusion plot is often more important than its average inclusion probability over the range of penalty values considered. It can also be instructive to observe when there are no peaks in the adaptive fence plot, as this indicates that the variability of the log-likelihood is limited and no single model stands apart from the others. Such a relatively flat likelihood across models would also manifest in the model stability plot as the absence of a dominant model over the range of model sizes considered.

Although interpretation of the model selection plots provided here is something of an art, this is not something to shy away from. We accept this of QQ-plots and residual plots, and train young statisticians to interpret them. There is a wealth of information in our plots, particularly the interactive versions enhanced with the shiny interface, that can better inform a researcher's model selection choices.

References


