Note

The blmr package was created as part of a Data Science course at Aarhus University and is NOT intended for production use. Especially, correct handling of prior means and data precision is not guaranteed.

Introduction

This vignette describes the various functionalities of the blmr package. The blmr package allows fitting of Bayesian linear models to data. The fit is stored in an S3 object of class blm. The package includes many of the generic functions available for linear model lm objects (coef, resid, ...). The plotting function is designed to visualize the blm model together with the data. At the end I also present a few functions for the analysis of blm objects.

Core Functions Math

I don't provide example data but rather demonstrate the functionality using random data. The basic linear regression model is described by the formula: $$y = w_0 + w_1 x + \epsilon$$

However, this can easily be generalized to fit an arbitrary number of additive terms: $$\mathbf{y} = \mathbf{\Phi_x} \mathbf{w} + \boldsymbol{\epsilon}$$

where:
$\mathbf{\Phi_x}$ is the design matrix.
$\mathbf{w}$ is the vector of coefficients, i.e. the weights on the design matrix columns.
$\boldsymbol{\epsilon}$ is the vector of errors on the observables.
$\mathbf{y}$ is the vector of observed values.

The frequentist solution for fitting linear models is to minimize the sum of squared residuals (i.e. the sum of squared distances between observed and fitted y values). There is a very elegant solution to this problem using matrix algebra (the so-called Moore-Penrose pseudoinverse): $$\mathbf{w} = (\mathbf{\Phi_x}^T\mathbf{\Phi_x})^{-1}\mathbf{\Phi_x}^T\mathbf{y}$$
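As a quick illustration, this normal-equation solution can be written in a few lines of plain R (the variable names are my own):

# minimal sketch of the pseudoinverse solution, compared against lm()
set.seed(42)
x <- 1:10
y <- 1 + 2 * x + rnorm(10)
Phi <- cbind(1, x)                        # design matrix Phi_x with intercept column
w <- solve(t(Phi) %*% Phi, t(Phi) %*% y)  # w = (Phi^T Phi)^-1 Phi^T y
drop(w)
coef(lm(y ~ x))                           # matches the lm coefficients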

For a Bayesian approach one needs to consider not the best fit, but a likelihood distribution of coefficients that updates a prior distribution of the weights. For simplicity, both prior and likelihood are assumed to be normally distributed.

Prior: $$\mathbf{w_{0}} \sim N(\boldsymbol{\mu_{0}}, \sigma^2 \mathbf{\Lambda_{0}^{-1}})$$

Posterior: $$\mathbf{w_{x,y}} \sim N(\boldsymbol{\mu_{x,y}}, \sigma^2 \mathbf{\Lambda_{x,y}^{-1}})$$

Posterior Distribution of Coefficients

Thomas provided the following equations for updating (variable names slightly changed for consistency): $$\mathbf{\Lambda_{x,y}} = \beta \mathbf{\Phi_x}^T \mathbf{\Phi_x} + \mathbf{\Lambda_{0}}$$ $$\boldsymbol{\mu_{x,y}} = \beta \mathbf{\Lambda_{x,y}^{-1}} \mathbf{\Phi_x}^T \mathbf{y}$$

Wikipedia (https://en.wikipedia.org/wiki/Bayesian_linear_regression) suggests the following equations: $$\mathbf{\Lambda_{x,y}} = \mathbf{\Phi_x}^T\mathbf{\Phi_x} + \mathbf{\Lambda_{0}}$$

$$\boldsymbol{\mu_{x,y}} = \mathbf{\Lambda_{x,y}^{-1}} (\mathbf{\Lambda_{0}}\boldsymbol{\mu_{0}} + \mathbf{\Phi_x}^T\mathbf{y})$$

This is somewhat odd. The solution from Wikipedia apparently does not consider the error precision in the calculations. That is, the precision of the posterior coefficients would not be influenced by the precision of the data. By contrast, the equation for the posterior mean from Thomas does not consider the prior means when computing the posterior MAP values, i.e. this solution is possibly only correct when $\boldsymbol{\mu_{0}}=0$.

I found the complete formulas in freely available course material (http://gandalf.psych.umn.edu/users/schrater/schrater_lab/courses/PattRecog09/BayesRegress.pdf) from Paul Schrater (University of Minnesota). They are similar to Thomas's, with a term including the mean of the prior added to the equation for the posterior mean.

$$\mathbf{\Lambda_{x,y}} = \beta \mathbf{\Phi_x}^T\mathbf{\Phi_x} + \mathbf{\Lambda_{0}}$$

$$\boldsymbol{\mu_{x,y}} = \mathbf{\Lambda_{x,y}^{-1}} (\mathbf{\Lambda_{0}}\boldsymbol{\mu_{0}} + \beta \mathbf{\Phi_x}^T\mathbf{y})$$

The solution used in the blm function of the blmr package implements the equations as given by Paul Schrater.
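As a sanity check, the Schrater update can be written out in a few lines of plain R. This is only a sketch of the math, not the actual blm internals:

# sketch: posterior update per Schrater
# Phi: n x p design matrix, y: response vector, beta: error precision
# mu0: prior mean vector, Lambda0: prior precision matrix
posterior_update <- function(Phi, y, beta, mu0, Lambda0) {
  Lambda_xy <- beta * t(Phi) %*% Phi + Lambda0
  mu_xy <- solve(Lambda_xy, Lambda0 %*% mu0 + beta * t(Phi) %*% y)
  list(mean = drop(mu_xy), precision = Lambda_xy)
}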

Distribution of Fitted/Predicted Values

The distribution of predicted/fitted values for y is from Thomas (identical to the one provided by Schrater), where $\mathbf{\phi_x}$ is the feature vector of a single input: $$p(y_{pred}|model) = N(y|\boldsymbol{\mu_{x,y}}^T\mathbf{\phi_x},\ \frac{1}{\beta} + \mathbf{\phi_x}^T \mathbf{\Lambda_{x,y}^{-1}} \mathbf{\phi_x})$$
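Continuing the sketch above, the predictive mean and variance for a single new feature vector phi_new can be computed as follows (again, only an illustration of the math):

# sketch: predictive distribution for one new feature vector phi_new (length p)
# mean: phi_new^T mu_xy, variance: 1/beta + phi_new^T Lambda_xy^-1 phi_new
predict_one <- function(phi_new, post, beta) {
  m <- sum(phi_new * post$mean)
  v <- 1/beta + drop(t(phi_new) %*% solve(post$precision, phi_new))
  c(mean = m, var = v)
}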

Total Probability of a Model Fit

I used a formula from Wikipedia (https://en.wikipedia.org/wiki/Bayesian_linear_regression). This equation is based on the fit and the response, and on the parameters a and b of the prior and posterior inverse gamma distributions.

$$ p(\textbf{y}|model) = \frac{1}{(2\pi)^{\frac{n}{2}}} \sqrt{\frac{\det(\mathbf{\Lambda_{0}})}{\det(\mathbf{\Lambda_{x,y}})}} \cdot \frac{b_{0}^{a_{0}}}{b_{x,y}^{a_{x,y}}} \cdot \frac{\Gamma(a_{x,y})}{\Gamma(a_{0})}$$

where:
$a_{0}$: prior value for the inverse gamma distribution parameter a (default: $a_{0} = 1$).
$b_{0}$: prior value for the inverse gamma distribution parameter b (default: $b_{0} = 1$).
$a_{x,y} = a_{0} + \frac{n}{2}$: posterior value for a.
$b_{x,y} = b_{0} + \frac{1}{2}(\textbf{y}^T \textbf{y} + \boldsymbol{\mu_{0}^T \Lambda_{0} \mu_{0}} - \boldsymbol{ \mu_{x,y}^T \Lambda_{x,y} \mu_{x,y}})$: posterior value for b.

The above equation leads to very small probabilities, which for many models are not computable in R. However, it is possible to compute the log-likelihood instead, and this is what is computed by default. $$\ln(p(\textbf{y}|model)) = -\frac{n\ln(2\pi) + \ln(\det(\mathbf{\Lambda_{x,y}})) - \ln(\det(\mathbf{\Lambda_{0}}))}{2} + a_{0}\ln(b_{0}) - a_{x,y}\ln(b_{x,y}) + \ln(\Gamma(a_{x,y})) - \ln(\Gamma(a_{0}))$$
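The log form translates almost term by term into R. A sketch under the same assumptions as the sketches above (lgamma and determinant keep the computation in log space):

# sketch: log marginal likelihood from the prior and posterior quantities
log_evidence <- function(y, mu0, Lambda0, mu_xy, Lambda_xy, a0 = 1, b0 = 1) {
  n <- length(y)
  a_xy <- a0 + n / 2
  b_xy <- b0 + 0.5 * (sum(y^2) + drop(t(mu0) %*% Lambda0 %*% mu0) -
                        drop(t(mu_xy) %*% Lambda_xy %*% mu_xy))
  ld0 <- as.numeric(determinant(Lambda0)$modulus)    # log(det(Lambda_0))
  ld1 <- as.numeric(determinant(Lambda_xy)$modulus)  # log(det(Lambda_x,y))
  -0.5 * (n * log(2 * pi) + ld1 - ld0) +
    a0 * log(b0) - a_xy * log(b_xy) + lgamma(a_xy) - lgamma(a0)
}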

To install and load the package:

#devtools::install_github('manschmi/blmr')
library(blmr)
library(ggplot2)

Fitting a Bayesian Model

To simulate a simple linear regression problem I use the following model data:

set.seed(0)
w0 <- 1
w1 <- 2
x <- seq(-100,100,10)
b <- 0.001
e <- rnorm(length(x), mean=0, sd=sqrt(1/b) )
y <- w0 + w1*x + e

The blm fit using default settings is straightforward.

blm_mod <- blm(y~x)
blm_mod

With default settings the result is very similar to an lm fit.

lm_mod <- lm(y~x)
lm_mod

We can easily visualize this (more on the plotting function later):

plot(blm_mod, legend_parm=list(cex=.5))

Priors

The default blm fit is extremely close to the lm fit. This is mostly because an uninformative prior is used: here the prior has mean 0 for both intercept and slope and variance 1. But what if you want to specify a different prior? The prior (as supported by this package) needs to be a multivariate normal distribution, and the package therefore contains an S3 class mvnd for multivariate normal distributions. Its functionality is very simple and easiest demonstrated by example:

custom_prior <- mvnd(means = c(1,1), covar = matrix(c(2,.2,.2,2),ncol=2))
custom_prior

blm(y~x, prior=custom_prior)

A typical use case would be to use the posterior distribution of an existing blm object as prior for a new model. And this is easy, as a blm object can be used directly as a prior.

blm(y~x, prior=blm_mod)

Precision

When the precision of the error $\beta$ is not provided, it is estimated from the data using the deviance of the observed y values from the lm (i.e. MLE) fit, via the following steps: $$\textit{1. MLE coefficients: } \mathbf{w} = (\mathbf{\Phi_x}^T\mathbf{\Phi_x})^{-1}\mathbf{\Phi_x}^T\mathbf{y}$$ $$\textit{2. MLE residuals: } \mathbf{r} = \mathbf{y} - \mathbf{\Phi_x}\mathbf{w}$$ $$\textit{3. precision: } \beta = \left({\frac{\sum{r_i^2}}{n-p}}\right)^{-1}$$

Pretty straightforward: simply do the MLE fit and invert the variance of the residuals, computed with the degrees of freedom of the regression model.
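In plain R this amounts to something like the following (a sketch, not the package internals):

# sketch: estimate the error precision beta from the MLE (lm) fit
mle_fit <- lm(y ~ x)
beta_hat <- 1 / (sum(resid(mle_fit)^2) / df.residual(mle_fit))
beta_hat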

Alternatively, it is also possible to specify beta as an argument to the blm function.

set.seed(0)
w0 <- 1
w1 <- 2
x <- seq(-100,100,10)
b <- 0.001
y <- w0 + w1*x + rnorm(length(x), mean=0, sd=sqrt(1/b) )
mod_with_b <- blm(y~x, beta=b)
mod_with_b
precision(mod_with_b)

Compare to a fit without providing the precision (i.e. where it is estimated from the data):

mod_wout_b <- blm(y~x)
mod_wout_b
precision(mod_wout_b)

The MAP and covariance values are very close, although not exactly the same.
Let's check it out on a plot:
plot(blm(y~x), legend_parm=list(cex=.5))
abline(blm(y~x, beta = b), col='green', lwd=3)
abline(lm(y~x), col='orange', lwd=3)

The blm object

The blm object contains the following slots. The most important of these can be accessed via generic or blm-specific functions (in parentheses).
call: the matched call ( no accessor function )
formula: the formula used ( formula() )
df.residual: the degrees of freedom of the model ( df.residual() )
frame: the model frame used ( model.frame() )
matrix: the model matrix used ( model.matrix() )
beta: the precision of the data ( precision() )
prior: the prior distribution used ( no accessor function )
posterior: the posterior distribution ( coef(), coef( , covar=TRUE) )

The coef() (or coefficients()) function returns by default only the MAP estimate of the coefficients.

coef(blm_mod)

To get the covariance of the posterior distribution, set the argument covar = TRUE:

coef(blm_mod, covar=TRUE)

Extracting residuals, fitted values, ...

Again, this is done using generic functions. So far, the following are implemented:

resid, residuals
deviance
fitted
predict
confint

fitted and residuals

For fitted and residuals, it is possible to retrieve the variance of the estimate (the residuals have the same variance as the fitted values they are calculated from) together with the MAP estimate using the parameter var = TRUE.

fitted(blm_mod)

fitted(blm_mod, var=TRUE)

resid(blm_mod)

resid(blm_mod, var=TRUE)

predict

The implementation of predict is different, to provide consistency with predict.lm, and uses the arguments se.fit and interval. se.fit values are computed as $\sqrt{\sigma^2}$ of the fitted values. Confidence intervals are the quantiles of the predicted distribution.

predict(blm_mod)

predict(blm_mod, se.fit=TRUE)

predict(blm_mod, se.fit=TRUE, interval='confidence')

This is inconsistent with fitted and resid. Importantly, however, it allows adding a blm fit natively to ggplot2 objects, just like an lm fit (see below under Plotting).
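Assuming the return structure mimics predict.lm (a $fit and a $se.fit component; this is an assumption about the interface, not documented behaviour), the interval bounds should simply be the fit plus/minus the normal quantile times se.fit:

# hedged sketch: reconstruct the 95% interval from fit and se.fit
p <- predict(blm_mod, se.fit = TRUE)
lower <- p$fit - qnorm(0.975) * p$se.fit
upper <- p$fit + qnorm(0.975) * p$se.fit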

confint

Bayesian statistics does not use the term 'confidence interval' (the analogous concept is the credible interval). However, the 95% quantile interval of the distribution of the parameters is a natural choice for consistency with generic model functions, so this is what is provided here.

confint(blm_mod)
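This should amount to normal quantiles computed from the posterior MAP values and the marginal posterior standard deviations; a sketch of the idea:

# sketch: 95% interval from posterior MAP and marginal standard deviations
map <- coef(blm_mod)
sds <- sqrt(diag(coef(blm_mod, covar = TRUE)))
cbind(lower = map - qnorm(0.975) * sds,
      upper = map + qnorm(0.975) * sds)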

update

The update function of the blmr package was designed for versatility. It takes a blm object as input, together with a set of parameters to update (a new formula, prior and/or data), and returns a new model with the updated fit. By default, the posterior of the input model is used as prior for the updated model. Let's update our blm_mod, for example:

x2 <- rnorm(50) 
y2 <- rnorm(50, w1 * x2 + w0, sqrt(1/b))
new_mod <- update(blm_mod, data=data.frame(x=x2, y=y2)) 
new_mod

As can be seen, the posterior of the input model blm_mod is used as prior for the updated model in this case. If we want the updated model to use the same prior as the original, this needs to be specified:

update(blm_mod, prior=blm_mod$prior, data=data.frame(x=x2, y=y2)) 

We can also update the formula. Typically this involves dropping factors.

update(blm_mod, y~x+0, prior=blm_mod$prior, data=data.frame(x=x2, y=y2)) 

As you can see there is a warning that factors are dropped.

And this can be done using R formula update semantics.

update(blm_mod, ~.+0, prior=blm_mod$prior, data=data.frame(x=x2, y=y2)) 

Note: one cannot (in the current implementation) update to more complex models.

Plotting

The function plot.blm produces a single plot of the data points, together with the blm MAP estimate, the 95% quantile band, and the lm fit. Note that legend_parm passes named arguments to legend(). The cex=.5 is used to make prettier plots in the vignette.

plot(blm_mod, legend_parm=list(cex=.5))

There is a little catch here: the blm object does not store the raw data used to create the model. In a scenario where the x values are not part of the model matrix and frame, we need to provide them as arguments to the function:

w0 <- .2
w1 <- 3
x <- seq(-10,10,1)
b <- 1.3
y <- w0 + w1*cos(x) + rnorm(length(x), mean=0, sd=sqrt(1/b) )

model <- blm(y ~ cos(x), prior = NULL, beta = b, data = data.frame(x=x, y=y))
#plot(model) fails due to 'lack' of x in the 'model' object

plot(model, explanatory='x', legend_parm=list(cex=.5))

Plotting using abline

One can also add the blm fit line using the abline function, as for lm fits. Note: this only works for straight lines; it simply extracts the first 2 coefficients via the coef() function.

plot(y~x)
abline(blm(y~x))

Plotting with ggplot2

One can also add blm fit lines to ggplot2 plots using stat_smooth. This simply builds the model using default settings and calls the predict function with se.fit = TRUE; the standard error ribbon is derived from the distribution of the fitted values.

d <- data.frame(x=x, y=y)

ggplot(d, aes(x=x, y=y)) +
  geom_point() +
  stat_smooth(method='blm', fill='blue', colour='blue') +
  stat_smooth(method='lm', fill='red', colour='red')

Of course, one can pass additional arguments to blm from within stat_smooth:

ggplot(d, aes(x=x, y=y)) +
  geom_point() +
  stat_smooth(method='blm', formula=y~cos(x), beta=1)

Diagnostic Plots

There is also support for some diagnostic plots, akin to plot.lm:

diagnostic_plots(model)

To visualize the distribution of the coefficients I also implemented a kernel density plot for mvnd objects:

kernel_density(model$prior, xlim=c(-4,4), ylim=c(-4,4), main = 'prior')
kernel_density(model$posterior, xlim=c(-4,4), ylim=c(-4,4), main = 'posterior')

Complex Models

Additional terms can be used, as already seen above.

A model with a cosine term.

w0 <- .2
w1 <- 3
w2 <- 10
x <- seq(-100,100,1)
b <- 1.3
y <- w0 + w1*x + w2*cos(x) + rnorm(length(x), mean=0, sd=sqrt(1/b) )

mod <- blm(y ~ x + cos(x), prior = NULL, beta = b, data = data.frame(x=x, y=y))
summary(mod)

A model with a polynomial term.

w0 <- .2
w1 <- 3
w2 <- 10
x <- seq(-100,100,1)
b <- 0.00003
y <- w0 + w1*x + w2*x^2 + rnorm(length(x), mean=0, sd=sqrt(1/b) )

mod <- blm(y ~ x + I(x^2), prior = NULL, beta = b, data = data.frame(x=x, y=y))
summary(mod)
plot(mod, pch=19, xlim=c(-15,15), ylim=c(-200,2000), legend_parm=list(cex=.5))

or also:

mod <- blm(y ~ poly(x,2, raw=TRUE), prior = NULL, beta = b, data = data.frame(x=x, y=y))
summary(mod)
plot(mod, explanatory='x',pch=19, xlim=c(-15,15), ylim=c(-200,2000), legend_parm=list(cex=.5))
Model Analysis and Comparison

The parts below are of a preliminary, experimental nature and there are absolutely no guarantees of correct functioning.


Bayesian Information Criterion (BIC)

Note, this does nothing special for a blm compared to an lm object. It simply computes:
$$BIC = n \log\left(\frac{\sum{RSS}}{n}\right) + k \log(n)$$
where n is the number of data points, k is the number of parameters, and RSS are the squared residuals of the fit.
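A sketch of this computation from a fitted model's residuals (my own helper, not the package's bic internals):

# sketch: BIC from residuals; n observations, k parameters
bic_by_hand <- function(res, k) {
  n <- length(res)
  n * log(sum(res^2) / n) + k * log(n)
}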


Bayes Factor

The Bayes factor applies a likelihood ratio test comparing the total probabilities of two models. The equation for the likelihood of a *blm* model fit is given in the introduction.
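Since the log marginal likelihoods are what get computed in practice, the Bayes factor of model 1 over model 2 is presumably just their exponentiated difference; as a sketch:

# sketch: Bayes factor from two log marginal likelihoods ll1 and ll2
bf_from_loglik <- function(ll1, ll2) exp(ll1 - ll2)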

set.seed(1)
x <- seq(-10,10,.1)
b <- 0.3

w0 <- 0.2 ; w1 <- 3 ; w2 <- 10

y <- rnorm(201, mean = w0 + w1 * x + w2 * sin(x), sd = sqrt(1/b))
mod1 <- blm(y ~ x + sin(x))
bic(mod1)
plot(mod1, xlim=c(-10,10), legend_parm=list(cex=.5))

Compare this fit to another model with the sine term removed, which clearly fits less well:

mod2 <- blm(y ~ x)

bic(mod2)

The BIC for the second, much less well-fitting model is much higher; the BIC thus indicates a better fit for mod1. See ?bic for more information on how to interpret the values.

Bayes factor comparing the 2 models

bayes_factor(mod1, mod2)

Very strong support for mod1. See ?bayes_factor for more information on how to interpret the value.

Anova F-test comparing the 2 models

anova(mod1, mod2)

Strong support for mod1 over mod2.

Compare less separated models

b <- 0.003
y <- rnorm(201, mean = w0 + w1 * x + w2 *sin(x), sd = sqrt(1/b)) 
mod1 <- blm(y ~ x + sin(x))
bic(mod1)

plot(mod1, xlim=c(-10,10), legend_parm=list(cex=.5))
mod2 <- blm(y ~ x)
bic(mod2)

... the BIC still indicates some positive support for the more complex mod1, but it is not very strong.

Bayes factor comparing the 2 models

bayes_factor(mod1, mod2)

No significant support for mod1 over mod2.

Anova F-test comparing the 2 models

anova(mod1, mod2)

Still strong support for mod1 over mod2.

Some Examples that Illustrate the Behaviour of Bayesian Linear Models

Variance of Fitted Values increases with distance to the data

w0 <- 0.3 ; w1 <- 1.1 ; b <- 1.3
x <- rnorm(100)
y <- rnorm(100, w1 * x + w0, sqrt(1/b))
mod <- blm(y~x, beta=b, data=data.frame(x=x, y=y))
plot(mod, caption='data range', legend_parm=list(cex=.5))
plot(mod, xlim=c(-100,100), ylim=c(-100,100), caption='extended range', legend_parm=list(cex=.5))

Updating the model with itself improves the fit

As a criterion, I measure the Mahalanobis distance between the coefficients of one model and the posterior distribution of another.
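For an mvnd with mean mu and covariance S, this distance can be sketched with the base stats function (not necessarily how the package's mahal is implemented):

# sketch: Mahalanobis distance of coefficient vector w to N(mu, S)
# stats::mahalanobis returns the squared distance, hence the sqrt
mahal_sketch <- function(w, mu, S) sqrt(mahalanobis(w, center = mu, cov = S))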

mod2 <- update(mod)
mod3 <- update(mod2)
mod4 <- update(mod3)

mahal(mod4$posterior, coef(mod4))
mahal(mod4$posterior, coef(mod3))
mahal(mod4$posterior, coef(mod2))
mahal(mod4$posterior, coef(mod))

This should also be reflected in decreasing deviance.

deviance(mod4)
deviance(mod3)
deviance(mod2)
deviance(mod)

