Theoretical Framework

Bayesian Model Averaging (BMA)

The objective of variable selection in linear regression is to identify the set of candidate predictors that produces the "best" model. Here, the "best" model may mean the model that best predicts unseen cases, or the model best supported by the data. Given a dependent variable $Y$ and a set of $k$ candidate predictors $X_1, X_2, ..., X_k$, the "best" model is more concretely described as follows (a short simulation of this setup is sketched after the list): $$Y = \beta_0 + \displaystyle\sum_{j=1}^{p}\beta_jX_j + \epsilon = X\beta + \epsilon$$ where:
$X_1, X_2, ..., X_p$ is a subset of $X_1, X_2, ..., X_k$
$X$ is an $n \times (p + 1)$ matrix containing a column of ones and the observed data on the $p$ predictors
$\epsilon \sim N(0,\sigma^2I)$
$\beta$ is the vector of $(p + 1)$ unknown parameters
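
As a concrete, purely hypothetical illustration of this setup, the following sketch simulates $n = 100$ observations from a model in which two of three candidate predictors enter the chosen subset; the variable names are illustrative only.

```r
# Hypothetical illustration of Y = X beta + eps with k = 3 candidate
# predictors, of which p = 2 enter the chosen model.
set.seed(42)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   # candidate predictors

X    <- cbind(1, x1, x3)                # n x (p + 1) design matrix (intercept + 2 predictors)
beta <- c(2, 0.5, -1)                   # (p + 1) unknown parameters
eps  <- rnorm(n, mean = 0, sd = 1)      # eps ~ N(0, sigma^2 I) with sigma = 1
Y    <- drop(X %*% beta + eps)          # simulated response

# OLS fit of this particular subset model, for reference
coef(lm(Y ~ x1 + x3))
```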

However, model selection exercises that lead to a single "best" model ignore model uncertainty [@Hodges1987; @Draper1995; @Raftery1997] and lead to an underestimation of uncertainty when making inferences about quantities of interest [@Raftery1997].

Bayesian model averaging, employed when several statistically reasonable models exist, addresses model uncertainty and leads to optimal predictive ability [@Raftery1994] by averaging over all possible combinations of predictors when making inferences about quantities of interest. The final estimates are computed as a weighted average of the parameter estimates from each of the candidate models.

The standard BMA solution, first introduced by Leamer in 1978, defines the set of all possible models as $M = \{M_1,..., M_K\}$. If $\Delta$ is the quantity of interest, such as a movie rating prediction or tomorrow's stock price, then the posterior distribution of $\Delta$ given the data $D$ is: $$Pr(\Delta|D) = \displaystyle\sum_{k=1}^KPr(\Delta|M_k, D)Pr(M_k|D)$$ This is an average of the posterior distributions under each model, weighted by the corresponding posterior model probabilities [@Raftery1997]. The posterior model probability of $M_k$ is then computed as the ratio of its marginal likelihood to the sum of the marginal likelihoods over the entire model space and is given by [@Amini]:

$$Pr(M_k|D) = \frac{Pr(D|M_k)Pr(M_k)}{\displaystyle\sum_{i=1}^{K}Pr(D|M_i)Pr(M_i)}$$ where:
$$Pr(D|M_k) = \int Pr(D|\theta_k, M_k)Pr(\theta_k|M_k)d\theta_k$$ is the marginal likelihood of model $M_k$, $\theta_k$ is the vector of parameters of model $M_k$, $Pr(\theta_k|M_k)$ is the prior density of $\theta_k$ under model $M_k$, $Pr(D|\theta_k, M_k)$ is the likelihood of the data given $M_k$ and $\theta_k$, and $Pr(M_k)$ is the prior probability that $M_k$ is the true model.
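
As a small numerical illustration with made-up marginal likelihoods, suppose $K = 3$, the prior model probabilities are equal, and $Pr(D|M_1) = 0.010$, $Pr(D|M_2) = 0.030$, and $Pr(D|M_3) = 0.060$. Then $$Pr(M_2|D) = \frac{0.030 \times \frac{1}{3}}{(0.010 + 0.030 + 0.060) \times \frac{1}{3}} = \frac{0.030}{0.100} = 0.30,$$ and likewise $Pr(M_1|D) = 0.10$ and $Pr(M_3|D) = 0.60$.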

At this stage, the posterior inclusion probability of each candidate predictor, $Pr(\beta_p\neq0|D)$, is obtained by summing the posterior model probabilities over all models that include that predictor. Referring back to the linear regression model, the posterior mean and variance of the coefficient vector $\beta$ are defined as: $$E[\beta|D] = \displaystyle\sum_{j=1}^{2^k}\hat{\beta}_jPr(M_j|D),$$ $$V[\beta|D] = \displaystyle\sum_{j=1}^{2^k}\left(Var[\beta|D,M_j] +\hat\beta_j^2\right)Pr(M_j|D) - E[\beta|D]^2,$$ where $\hat\beta_j$ is the posterior mean of $\beta$ under model $M_j$. Averaging over all models in this way provides better predictive results, as measured by the logarithmic scoring rule, than any single model $M_j$ $(j = 1,...,K)$ [@Raftery1997].
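
To make these weighted averages concrete, the following minimal base-R sketch (not the package's implementation) enumerates all $2^k$ subsets of three illustrative predictors, approximates the posterior model probabilities with BIC weights under a uniform model prior, and computes the model-averaged coefficients and posterior inclusion probabilities; the data frame `df` and variables `y`, `x1`-`x3` are hypothetical.

```r
# Minimal BMA sketch: enumerate all 2^k models, weight by BIC, average coefficients.
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 1 + 0.5 * df$x1 - 0.8 * df$x3 + rnorm(100)

preds   <- c("x1", "x2", "x3")
subsets <- expand.grid(rep(list(c(FALSE, TRUE)), length(preds)))   # 2^k inclusion patterns

fits <- apply(subsets, 1, function(inc) {
  rhs <- if (any(inc)) paste(preds[inc], collapse = " + ") else "1"
  lm(as.formula(paste("y ~", rhs)), data = df)
})

# Posterior model probabilities approximated by BIC weights (uniform model prior)
bics    <- sapply(fits, BIC)
weights <- exp(-0.5 * (bics - min(bics)))
weights <- weights / sum(weights)

# Model-averaged posterior means: E[beta|D] = sum_j beta_hat_j * Pr(M_j|D)
coef_mat <- t(sapply(fits, function(f) {
  b <- setNames(numeric(length(preds) + 1), c("(Intercept)", preds))
  b[names(coef(f))] <- coef(f)
  b
}))
post_mean <- colSums(coef_mat * weights)

# Posterior inclusion probabilities: total weight of models containing each predictor
pip <- colSums(as.matrix(subsets) * weights)
names(pip) <- preds

round(post_mean, 3)
round(pip, 3)
```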

Prior Distributions of Parameters

To implement BMA, one must specify prior distributions over the parameters of every model, as well as prior probabilities for the models themselves. If prior information about the parameters and the models is available, it should be used. However, if the amount of prior information is small relative to the effort required to specify it, as is often the case, default, so-called "non-informative" or "reference", priors may be used. The choice of default prior affects the integrated likelihood, the key factor in computing posterior model weights, so the prior density should be wide enough to be reasonably flat over the region of the parameter space where the likelihood is large, but not so spread out that it lowers the prior density at the posterior mode, since this decreases the integrated likelihood and may unnecessarily penalize larger models [@Raftery1994].

Zellner's g-prior

Zellner [@zellner1986] proposed a multivariate normal-gamma conjugate prior, the so-called 'g-prior', which is given by: $$\beta | \sigma^2 \sim N(\beta_0, g\sigma^2(X'X)^{-1}).$$ Typically, $\beta_0 = 0$, and the Gaussian posterior for $\beta$ is given by: $$\beta|\sigma^2, X, Y \sim N(\Sigma^{-1}a,\Sigma^{-1}),$$ where $$\Sigma = \frac{1}{g}(\sigma^2(X'X)^{-1})^{-1} + \frac{1}{\sigma^2}X'X,$$ and $$a = \frac{1}{g}(\sigma^2(X'X)^{-1})^{-1}\beta_0 + \frac{1}{\sigma^2}X'Y.$$ Since $$\hat{\beta} = (X'X)^{-1}X'Y,$$ the posterior distribution of $\beta$ is $$\beta|Y,\sigma^2,X \sim N\left(\frac{1}{g+1}(\beta_0 + g\hat{\beta}),\frac{g\sigma^2}{g+1}(X'X)^{-1}\right).$$ Hence, the posterior mean is: $$\mathbb{E}[\beta|Y,\sigma^2,X] = \Sigma^{-1}a = \frac{1}{1+g}\beta_0+\frac{g}{1+g}(X'X)^{-1}X'Y = \frac{1}{1+g}\beta_0+\frac{g}{1+g}\hat{\beta}.$$ By reducing prior elicitation to two components, the prior mean $\beta_0$, taken here to be 0, and the scalar $g$, Zellner's informative g-prior determines how much the prior distribution of $\beta$ contributes to the posterior. Because it yields a closed form for the marginal likelihood, Zellner's g-prior is popular in variable selection.
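
This shrinkage interpretation of the posterior mean can be checked numerically; the following is a minimal base-R sketch with simulated, purely illustrative data (not the package's implementation), showing the posterior mean moving from the prior mean toward the OLS estimate as $g$ grows.

```r
# Sketch: g-prior posterior mean as shrinkage of OLS toward the prior mean (here 0).
set.seed(7)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))                 # intercept + 2 predictors
Y <- X %*% c(1, 2, -1) + rnorm(n)

beta_hat <- solve(t(X) %*% X, t(X) %*% Y)         # OLS: (X'X)^{-1} X'Y
beta_0   <- rep(0, ncol(X))                       # prior mean

# Posterior mean under Zellner's g-prior: (1/(1+g)) beta_0 + (g/(1+g)) beta_hat
gprior_mean <- function(g) beta_0 / (1 + g) + (g / (1 + g)) * as.vector(beta_hat)

sapply(c(1, 10, n, 1000), gprior_mean)            # columns approach the OLS estimate
```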

However, one is still left with the choice of a value of $g$ that avoids paradoxes such as Lindley's paradox [@Lindley1957] and the information paradox [@Liang2008a]. Nine priors supported in the literature (`r kfigr::figr(label = "priors", prefix = TRUE, link = TRUE, type = "Table")`), each with designated data-dependent values for $g$, were explored in this experiment.

`r kfigr::figr(label = "priors", prefix = TRUE, link = TRUE, type = "Table")`: Parameter priors

```{r priors}
# Read the table of parameter priors and render it with knitr/kableExtra
library(magrittr)  # provides the %>% pipe
priors <- openxlsx::read.xlsx(xlsxFile = "../vignettes/figs_tables/priors.xlsx")
knitr::kable(priors, digits = 2) %>%
  kableExtra::kable_styling(bootstrap_options = c("hover", "condensed", "responsive"),
                            full_width = TRUE, position = "center")
```

The Bayesian Information Criterion (BIC) prior is a reference prior based upon an approximation of the log marginal likelihood of the data under model $M_k$ using the Bayesian information criterion:
$$\log Pr(D|M_k) \approx c - \tfrac{1}{2}BIC_k,$$ where
$$BIC_k = -2\log(\hat{L}) + p\log(n),$$
$\hat{L}$ is the maximized value of the model's likelihood function, $p$ is the number of parameters, and $n$ is the number of observations. The value of $g$ associated with BIC is obtained by setting $\log(n)$ equal to $(1+\frac{1}{g})\log(1+g)$. This prior is typically flat where the likelihood is large and contains about the same amount of information as a typical single observation [@Wasserman1996].
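
Taking the stated relationship between $\log(n)$ and $(1+\frac{1}{g})\log(1+g)$ at face value, the implied value of $g$ for a given sample size can be found numerically; the sketch below uses base R's `uniroot()` with illustrative sample sizes.

```r
# Solve (1 + 1/g) * log(1 + g) = log(n) numerically for g, for a few sample sizes.
g_for_bic <- function(n) {
  uniroot(function(g) (1 + 1 / g) * log(1 + g) - log(n),
          interval = c(1e-6, 1e9))$root
}
sapply(c(50, 100, 1000), g_for_bic)
```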

The Akaike Information Criterion (AIC) prior is the same as the BIC prior above, except that the AIC is used to approximate the log marginal likelihood of the data under model $M_k$. The AIC of model $M_k$ is: $$AIC_k = 2p - 2\log(\hat{L}).$$

The Empirical Bayes global prior (EB-global) uses an EM algorithm to find a common, or global, estimate of $g$ averaged over all models [@Liang2008a].

The Empirical Bayes local prior (EB-local) uses the maximum likelihood estimate of $g$ from the marginal likelihood of each model [@Liang2008a].

Zellner's g-prior is the multivariate normal-gamma conjugate prior described above, where $\beta | \sigma^2 \sim N(\beta_0, g\sigma^2S^{-1}_{XX})$ with $g = n$, and the scaled variances and covariances are obtained from ordinary least squares (OLS).

Hyper-g is a class of g-priors based upon a continuous proper hyperprior $f(g)$, giving a mixture of g-priors in which the prior on $g/(1+g)$ is a $Beta(1, \alpha/2)$ [@Clyde2017]. Hyper-g-n is a special case of the Hyper-g prior where $u = g/n$ and $u \sim Beta(1, \alpha/2)$, which provides consistency when the null model is true [@Clyde2017]. Hyper-g Laplace is the same as the Hyper-g prior, but uses a Laplace approximation to integrate over the prior on $g$.

Lastly, the Zellner-Siow prior places a gamma prior on $n/g$ with shape $1/2$ and rate $1/2$.
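
Several of these parameter priors can be fit directly with the BAS package; the sketch below is a minimal, hypothetical example (the data frame `df` and response `y` are illustrative, and the prior names follow the `bas.lm()` documentation).

```r
# Sketch: BMA under a few of the parameter priors listed above, using BAS.
library(BAS)

set.seed(3)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 1 + 2 * df$x1 + rnorm(100)

# Zellner's g-prior with g = n, uniform prior over the 2^k models
fit_g  <- bas.lm(y ~ ., data = df, prior = "g-prior", alpha = nrow(df),
                 modelprior = uniform())

# Zellner-Siow and hyper-g/n mixtures of g-priors
fit_zs <- bas.lm(y ~ ., data = df, prior = "ZS-null",   modelprior = uniform())
fit_hg <- bas.lm(y ~ ., data = df, prior = "hyper-g-n", modelprior = uniform())

# Model-averaged coefficients, posterior SDs, and inclusion probabilities
coef(fit_zs)
summary(fit_g)
```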

Model Priors

For this experiment, the uniform distribution, which assigns an equal prior probability to all models, was used such that $Pr(M_k) = 1 / K$ for each $k$ [@Raftery1988].
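
For example, with $k$ candidate predictors the model space contains $K = 2^k$ models, each receiving prior probability $1/2^k$; the snippet below computes this for an illustrative $k$, which in the BAS package corresponds to `modelprior = uniform()` in `bas.lm()`.

```r
# Uniform model prior: with k candidate predictors there are K = 2^k models,
# each with prior probability 1/K (k = 5 here is purely illustrative).
k <- 5
K <- 2^k
1 / K          # Pr(M_k) = 1/K for every model
```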



