The blm package contains a simple Gibbs and Metropolis-Hastings sampler for fitting a Bayesian linear regression model. It is written partly in R and partly in Julia; the bridge between the two languages is handled by the JuliaCall library. Julia is a high-level, high-performance language, which matters here because an MCMC sampler repeatedly draws from conditional posterior distributions, so speed and efficiency are at a premium.
In this document, I focus only on the mechanics of the functions, not on the interpretation of the results.
To set up the blm package, we first load it in R:
library(blm)
This also loads the magrittr package, which is convenient because we can chain commands using the forward pipe operator (%>%).
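As a quick illustration of the pipe itself (this snippet is not specific to blm), the pipe passes the result on its left as the first argument of the function on its right:
# Equivalent to sum(sqrt(c(1, 4, 9))), but reads left to right
c(1, 4, 9) %>%
  sqrt() %>%
  sum()
# [1] 6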
Next, we load the Julia environment
blm_setup()
This can take a couple of seconds.
The data we will use are included in the blm package and contain compensation data on $336$ directors in $52$ companies across $3$ sectors.
library(table1)
data("directors")
dir2 <- directors
# Add labels
label(dir2$Compensation) <- "Compensation ('000 GBR)"
label(dir2$Age) <- "Age (years)"
label(dir2$Male) <- "Gender (male)"
label(dir2$Sector) <- "Sector the company belongs to"
table1(Compensation ~ Age + Male + Sector, data=dir2)
knitr::include_graphics("sumstats.png", dpi=100)
From the summary statistics, we gather that the outcome variable is skewed, which is why we will use $\log{(\text{compensation})}$ rather than the untransformed variable. We also center the Age and Gender predictors.^[Even though Male is a categorical variable, centering it reduces autocorrelation.]
library(dplyr)
directors <- directors %>%
  mutate(Compensation = log(Compensation),
         Age = Age - mean(Age),
         Male = as.numeric(Male),
         Male = Male - mean(Male))
For comparison purposes, we run a linear regression (lm) model fitted by maximum likelihood:
# Linear model for comparison
ffit <- lm("Compensation ~ Age + Male", data=directors)
summary(ffit)
To fit a blm model, we call the blm() function, which is similar to the lm() function.
bfit <- blm("Compensation ~ Age + Male", data=directors)
At this point, we can call print() and summary() on the model object to check the status of the model.
print(bfit)
This gives us information about the data, the sampling options and the priors, which are uninformative at this point.
summary(bfit)
summary() gives us information about the model fit, MAP estimates and so on. Given that we have not sampled the posterior, however, these statistics are not yet supplied.
We can update the sampling settings as follows:
bfit <- bfit %>%
  # Update sampling settings
  set_sampling_options(., chains = 3, iterations = 15000, thinning = 2, burn = 2000)
Priors may be made informative as follows:
bfit <- bfit %>%
  # Add priors (we're on a log scale)
  set_prior("b1", mu=.01, sd=.2) %>%  # 1% increase / year ==> 20% spread
  set_prior("b2", mu=.05, sd=.03)
print(bfit)
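To see what these prior means imply: because the outcome is on a log scale, a coefficient $b$ corresponds roughly to a $100 \cdot (\exp{(b)} - 1)$ percent change in compensation. A quick check with ordinary R arithmetic (not part of blm):
# Prior mean of 0.01 for Age: about a 1% increase in compensation per year
100 * (exp(0.01) - 1)
# Prior mean of 0.05 for Male: about a 5% difference
100 * (exp(0.05) - 1)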
To add informative hypotheses to the model, we can use set_hypothesis():
bfit <- bfit %>%
  # H1: Males earn about the same as females
  set_hypothesis("b0 + b2 < b0") %>%
  # H2: Directors only earn more as they get older
  set_hypothesis("b1 > 0")
print(bfit)
If we do not specify initial values, then blm will draw these from the informative or uninformative priors. Given that we are on a log scale, it would be good to set these close to $0$.
bfit <- bfit %>%
  # Set initial values.
  # If we draw from the priors, the starting values will be too large.
  # This is an issue for MH because it takes many more iterations to converge.
  set_initial_values(chain_1 = list("b" = c(7, -5, 0), "sigma" = 1),
                     chain_2 = list("b" = c(-5, 0, 7), "sigma" = 2))
If you want to change the sampler (i.e. use Metropolis-Hastings instead of the Gibbs sampler), you can use set_sampler(). However, this is not recommended, as you will only lose efficiency; the MH algorithm was implemented for didactic reasons.
bfit <- bfit %>%
  # Change the sampler for Age to Metropolis-Hastings (random walk).
  # The lambda parameter controls the variance of the (normal) proposal distribution.
  set_sampler("b1", type="MH", lambda=0.01)
If you want to compute the intercept-only model alongside the model you are interested in, you can add a call to compute_null_model():
bfit <- bfit %>%
  # This has the same parameters as set_sampling_options()
  compute_null_model()
You can always use get_parameter_names() to obtain a mapping of variables to parameters:
bfit %>% get_parameter_names()
If we are happy with these settings, we can sample from the posterior distribution. The first call to sample_posterior() takes the longest, because Julia's Just-In-Time (JIT) compiler compiles the code the first time it is called; subsequent calls reuse the compiled code. To carry out the blm sampling plan, you execute:
bfit <- bfit %>%
  # Sample the posterior
  sample_posterior()
Sampling has several effects: most importantly, the posterior draws and the results derived from them (discussed below) are stored in the blm object. If you want to discard the posterior samples, you can call delete_posterior():
bfit <- bfit %>%
  # Remove posterior samples
  delete_posterior()
To draw more samples, call update_posterior():
bfit <- bfit %>%
  # Draw more samples
  update_posterior(iterations=10000)
To assess convergence, we can check the convergence plots:
bfit %>% plot("history")
plot(bfit, "density")
plot(bfit, "autocorrelation")
If these look fine, you can also view the Gelman-Rubin statistic:
bfit %>% evaluate_convergence_diagnostics()
The burn-in diagnostics are the coefficients obtained by regressing the samples on the iteration index; they will deviate from zero if there is a trend left in the chain (unless the trends are non-linear and cancel each other out).
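To get a feel for the idea behind this diagnostic, here is a minimal sketch in plain R (not blm's internal code): regress a chain on its iteration index and look at the slope.
set.seed(1)
# A chain with an initial downward transient followed by stationary noise
chain <- c(seq(3, 0, length.out = 200), rnorm(800))
# A slope clearly different from zero signals a remaining trend (burn-in too short)
summary(lm(chain ~ seq_along(chain)))$coefficients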
If you compute the intercept-only model, you will also want to check whether that model has converged:
plot(bfit, "nullmodel")
To evaluate the effect of autocorrelation on the effective sample size of the MCMC samples, you can execute:
bfit %>% evaluate_effective_sample_size()
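As a side note, the effect of autocorrelation on the effective sample size can also be illustrated outside of blm, for example with the coda package (this is just an illustration, not part of the blm workflow):
library(coda)
set.seed(42)
# A strongly autocorrelated AR(1) chain of length 5000
chain <- as.numeric(arima.sim(list(ar = 0.9), n = 5000))
effectiveSize(mcmc(chain))  # far fewer than 5000 effective draws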
If you use the MH sampler, you can also check the number of accepted samples like so:
bfit %>% evaluate_accepted_draws()
Most results are returned when you sample the model. You can view them by executing:
summary(bfit)
A model that contains all optional elements has the following objects embedded in it:
names(bfit)
You can use contains() to see whether your model contains a specific element^[contains() is an exported function, so it may clash with similarly named functions from other packages, e.g. dplyr.]:
# Use blm::contains() in case of clashes
blm::contains(bfit, "rsq")
If you want to get a summary of an individual element, you can execute, for example:
# i.e. r-squared
bfit %>%
  get_value("rsq") %>%
  summary()
You can also view the posterior for the $R^2$ values:
bfit %>% get_value("rsq") %>% plot()
blm contains three posterior predictive checks. The first checks the residuals for skewness, the second for heteroskedasticity and the third for independence.
bfit <- bfit %>%
  # Use a fraction of the total posterior
  # (to reduce the size of the output and because it is heavy on this pdf file)
  evaluate_ppc(p=.15)
The PPC results are the only results not automatically included in the summary output, and evaluate_ppc() is the only function in the evaluate_ family that the user must call explicitly.
bfit %>% get_value("ppc") %>% summary()
We can also plot these results:
bfit %>% get_value("ppc") %>% plot("independence") %>% plot() bfit %>% get_value("ppc") %>% plot("normality") %>% plot() bfit %>% get_value("ppc") %>% plot("heteroskedasticity") %>% plot()
The blm library contains a ggplot2 theme called theme_blm().
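As a quick illustration of how such a theme can be applied (the plot itself is just an example and is not part of the vignette's analysis):
library(ggplot2)
# Apply the blm theme to a scatterplot of the (transformed) directors data
ggplot(directors, aes(x = Age, y = Compensation)) +
  geom_point() +
  theme_blm()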