The blm package contains a simple Gibbs and Metropolis-Hastings sampler for fitting a Bayesian linear regression model. It is written partly in R and partly in Julia; the bridge between the two languages is handled by the JuliaCall library. Julia is a high-level, high-performance language, which matters here because an MCMC sampler repeatedly draws from conditional posterior distributions, so speed and efficiency are at a premium.
In this document, I focus only on the mechanics of the functions, not on the interpretation of the results.
To set up the blm package, we first load it in R:
library(blm)
This also loads the magrittr package, which is convenient because we can chain commands using the forward pipe operator (%>%).
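As a quick illustration of the pipe itself (this snippet is not specific to blm), the pipe passes the result on its left as the first argument of the function on its right:
# Equivalent to sum(sqrt(c(1, 4, 9))), but reads left to right
c(1, 4, 9) %>%
  sqrt() %>%
  sum()
# [1] 6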
Next, we load the Julia environment
blm_setup()
This can take a couple of seconds.
The data we will use are included in the blm package and contain compensation data on $336$ directors in $52$ companies across $3$ sectors.
library(table1)
data("directors")
dir2 <- directors
# Add labels
label(dir2$Compensation) <- "Compensation ('000 GBR)"
label(dir2$Age) <- "Age (years)"
label(dir2$Male) <- "Gender (male)"
label(dir2$Sector) <- "Sector the company belongs to"
table1(Compensation ~ Age + Male + Sector, data=dir2)
knitr::include_graphics("sumstats.png", dpi=100)
From the summary statistics, we gather that the outcome variable is skewed, which is why we will use $\log{(\text{compensation})}$ rather than the untransformed variable. We also center the Age and Gender predictors.^[Even though Male is a categorical variable, centering it reduces autocorrelation.]
library(dplyr)
directors <- directors %>%
  mutate(Compensation = log(Compensation),
         Age = Age - mean(Age),
         Male = as.numeric(Male),
         Male = Male - mean(Male))
For comparison purposes, we run a linear regression (lm) model fitted by maximum likelihood:
# Linear model for comparison
ffit <- lm("Compensation ~ Age + Male", data=directors)
summary(ffit)
To fit a blm model, we call the blm() function, which is similar to the lm() function.
bfit <- blm("Compensation ~ Age + Male", data=directors)
At this point, we can call print() and summary() on the model object to check the status of the model.
print(bfit)
This gives us information about the data, the sampling options and the priors, which are uninformative at this point.
summary(bfit)
summary() gives us information about the model fit, MAP estimates and so on. Given that we have not sampled the posterior, however, these statistics are not yet supplied.
We can update the sampling settings as follows:
bfit <- bfit %>%
  # Update sampling settings
  set_sampling_options(., chains = 3, iterations = 15000, thinning = 2, burn = 2000)
Priors may be made informative as follows:
bfit <- bfit %>%
  # Add priors (we're on a log scale)
  set_prior("b1", mu=.01, sd=.2) %>%  # 1% increase / year ==> 20% spread
  set_prior("b2", mu=.05, sd=.03)
print(bfit)
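To see what these prior means imply: because the outcome is on a log scale, a coefficient $b$ corresponds roughly to a $100 \cdot (\exp{(b)} - 1)$ percent change in compensation. A quick check with ordinary R arithmetic (not part of blm):
# Prior mean of 0.01 for Age: about a 1% increase in compensation per year
100 * (exp(0.01) - 1)
# Prior mean of 0.05 for Male: about a 5% difference
100 * (exp(0.05) - 1)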
To add informative hypotheses to the model, we can use set_hypothesis():
bfit <- bfit %>%
  # H1: Males earn about the same as females
  set_hypothesis("b0 + b2 < b0") %>%
  # H2: Directors only earn more as they get older
  set_hypothesis("b1 > 0")
print(bfit)
If we do not specify initial values, then blm will draw these from the informative or uninformative priors. Given that we are on a log scale, it would be good to set these close to $0$.
bfit <- bfit %>%
  # Set initial values.
  # If we draw from the priors, the starting values will be too large.
  # This is an issue for MH because it takes many more iterations to converge.
  set_initial_values(chain_1 = list("b" = c(7, -5, 0), "sigma" = 1),
                     chain_2 = list("b" = c(-5, 0, 7), "sigma" = 2))
If you want to change the sampler (i.e. use Metropolis-Hastings instead of the Gibbs sampler), you can use set_sampler(). However, this is not recommended, as you will only lose efficiency; the MH algorithm was implemented for didactic reasons.
bfit <- bfit %>%
  # Change the sampler for Age to Metropolis-Hastings (random walk).
  # The lambda parameter controls the variance of the (normal) proposal distribution.
  set_sampler("b1", type="MH", lambda=0.01)
If you want to compute the intercept-only model alongside the model you are interested in, you can add a call to compute_null_model():
bfit <- bfit %>%
  # This has the same parameters as set_sampling_options()
  compute_null_model()
You can always use get_parameter_names() to obtain a mapping of variables to parameters:
bfit %>% get_parameter_names()
If we are happy with these settings, we can sample from the posterior distribution. The first call to sample_posterior() takes the longest, because Julia's Just-In-Time (JIT) compiler compiles the code the first time it is called; subsequent calls reuse the compiled code. To carry out the blm sampling plan, you execute:
bfit <- bfit %>%
  # Sample the posterior
  sample_posterior()
Sampling has several effects: most importantly, the posterior draws and the results derived from them (discussed below) are stored in the blm object. If you want to discard the posterior samples, you can call delete_posterior():
bfit <- bfit %>%
  # Remove posterior samples
  delete_posterior()
To draw more samples, call update_posterior():
bfit <- bfit %>%
  # Draw more samples
  update_posterior(iterations=10000)
To assess convergence, we can check the convergence plots:
bfit %>% plot("history")
plot(bfit, "density")
plot(bfit, "autocorrelation")
If these look fine, you can also view the Gelman-Rubin statistic:
bfit %>% evaluate_convergence_diagnostics()
The burn-in diagnostics are the coefficients obtained by regressing the samples on the iteration index; they will deviate from zero if there is a trend left in the chain (unless the trends are non-linear and cancel each other out).
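To get a feel for the idea behind this diagnostic, here is a minimal sketch in plain R (not blm's internal code): regress a chain on its iteration index and look at the slope.
set.seed(1)
# A chain with an initial downward transient followed by stationary noise
chain <- c(seq(3, 0, length.out = 200), rnorm(800))
# A slope clearly different from zero signals a remaining trend (burn-in too short)
summary(lm(chain ~ seq_along(chain)))$coefficients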
If you compute the intercept-only model, you will also want to check whether that model has converged:
plot(bfit, "nullmodel")
To evaluate the effect of autocorrelation on the effective sample size of the MCMC samples, you can execute:
bfit %>% evaluate_effective_sample_size()
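As a side note, the effect of autocorrelation on the effective sample size can also be illustrated outside of blm, for example with the coda package (this is just an illustration, not part of the blm workflow):
library(coda)
set.seed(42)
# A strongly autocorrelated AR(1) chain of length 5000
chain <- as.numeric(arima.sim(list(ar = 0.9), n = 5000))
effectiveSize(mcmc(chain))  # far fewer than 5000 effective draws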
If you use the MH sampler, you can also check the number of accepted samples like so:
bfit %>% evaluate_accepted_draws()
Most results are returned when you sample the model. You can view them by executing:
summary(bfit)
A model that contains all optional elements has the following objects embedded in it:
names(bfit)
You can use contains() to see whether your model contains a specific element^[contains() is an exported function, so it may clash with similarly named functions from other packages, e.g. dplyr.]:
# Use blm::contains() in case of clashes
blm::contains(bfit, "rsq")
If you want to get a summary of an individual element, you can execute, for example:
# i.e. r-squared
bfit %>%
  get_value("rsq") %>%
  summary()
You can also view the posterior for the $R^2$ values:
bfit %>% get_value("rsq") %>% plot()
blm contains three posterior predictive checks. The first checks the residuals for skewness, the second for heteroskedasticity and the third for independence.
bfit <- bfit %>%
  # Use a fraction of the total posterior
  # (to reduce the size of the output and because it is heavy on this pdf file)
  evaluate_ppc(p=.15)
The PPC results are the only results not automatically included in the summary output, and evaluate_ppc() is the only function in the evaluate_ family that the user must call explicitly.
bfit %>% get_value("ppc") %>% summary()
We can also plot these results:
bfit %>% get_value("ppc") %>% plot("independence") %>% plot() bfit %>% get_value("ppc") %>% plot("normality") %>% plot() bfit %>% get_value("ppc") %>% plot("heteroskedasticity") %>% plot()
The blm library contains a ggplot2 theme called theme_blm().
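As a quick illustration of how such a theme can be applied (the plot itself is just an example and is not part of the vignette's analysis):
library(ggplot2)
# Apply the blm theme to a scatterplot of the (transformed) directors data
ggplot(directors, aes(x = Age, y = Compensation)) +
  geom_point() +
  theme_blm()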