The ability to pass arbitrary random effects matrices to lme4
makes lmerMultiMember
useful for more than just vanilla multiple membership models. Any model that can be specified as a combination of "conventional" fixed effects and an arbitrary random effects matrix can be specified using lmerMultiMember
. (Although for certain model specifications the underlying lme4
model fitting may not converge.)
One class of models that can be fit using lmerMultiMember
are Bradley-Terry models, used to predict the probability of a given individual winning in a paired comparison with another individual, even when information is incomplete (i.e. not every individual has directly encountered every other individual). One fairly prominent application is to predict rankings of chess players, even if not every player in the list has played every other player.
The Bradley-Terry model can be written as a Generalized Linear Mixed-effects Model (GLMM) as follows: $$\operatorname{logit}(P(i > j)) = \log\left(\frac{P(i > j)}{1 - P(i > j)}\right) = \log\left(\frac{P(i > j)}{P(j > i)}\right) = \beta_i - \beta_j$$ If you're not that comfortable with the math, don't worry! This formulation of the Bradley-Terry model is just a logistic regression with a random effects matrix that has the usual positive indicators for one individual in each comparison, but negative indicators for the other individual. In this formulation we model the probability the the individual with the positive indicator "wins" the comparison. There is often a natural way to assign individuals to sides of the comparison, for instance by always giving the home team the positive indicator (in this case, if we model an intercept, that intercept represents the home field advantage) or, in chess, to always assign the player with the white pieces the positive indicator.
lmerMultiMember::bradleyterry_from_vectors()
is a helper function that creates the indicator matrix from a vector of home teams (i.e. individuals that will be assigned positive indicators) and a vector of visiting teams (i.e. individuals that will be assigned negative indicators). If this function is used to generate the indicator matrix, we will be modeling the probability of the home team winning.
To demonstrate how to specify and fit a Bradley-Terry model in lmerMultiMember
, in this vignette we will rank National Football League teams by their performance in the 2021 season. Of course we know who "won" the season, because the Los Angeles Rams won the Super Bowl at the end of the playoffs. However, with a Bradley-Terry model, we can predict win probabilities even for teams who did not play a single game against each other during the season. For example, I might be wondering how my local team, the Green Bay Packers, would have fared against my spouse's home team, the Los Angeles Chargers (formerly of San Diego), even though these teams are in different conferences and did not play against each other in 2021 season. As we'll demonstrate in this vignette, the coefficients from the Bradley-Terry model can give you a predicted outcome for this hypothetical game.
NFL scores for the 2021 season are included with lmerMultiMember
as a dataset, so all we need to do to load the data is load the package itself. Then, we can take a quick look at the data.
# load packages library(lmerMultiMember) # for dataset and modeling library(kableExtra) # for displaying tabular data library(dplyr) # for data manipulation library(sjPlot) # for plotting model coefficients library(ggplot2) # for general plotting stuff # display the NFL 2021 season data nfl_scores_2021 |> kable() |> kable_styling(bootstrap_options = c("hover", "condensed"), font_size = 16) |> scroll_box(height = "250px")
From this dataset we can generate a Bradley-Terry indicator matrix using lmerMultiMember::bradleyterry_from_vectors()
.
# generate indicator matrix from home team and visiting team vectors W <- bradleyterry_from_vectors(nfl_scores_2021$home_team, nfl_scores_2021$visiting_team) # check dimensions of indicator matrix (should be teams x games) # i.e. 32 x 285 paste(c("rows/teams:", ", columns/games:"), dim(W)) |> cat()
With the indicator matrix generated, all we need to do is specify a logistic glmer()
model to predict wins by the home team, and insert our indicator model using a dummy variable.
# convert winners to binary variable for more transparent interpretation nfl_scores_2021$home_win <- recode(nfl_scores_2021$winner, "visiting" = 0, "home" = 1) # fit model m <- glmer(home_win ~ (1 | indicators), family = binomial, memberships = list(indicators = W), data = nfl_scores_2021) # show model summary summary(m)
With the model fit, we can start interpreting the coefficients. The intercept of the model represents the home field advantage (i.e. the probability of a home team winning, regardless of that team's strength). The strength of individual teams is captured in the random intercepts.
# plot random effects plot_model(m, type = "re", sort.est = TRUE, grid = FALSE, line.size = 1.0, dot.size = 3.0) + coord_cartesian() + theme_bw() + ggtitle("Odds ratio of team winning against League-average team") + theme(axis.ticks = element_blank(), axis.text.x = element_text(size = 12, angle = 60, hjust = 1.0), axis.text.y = element_text(size = 12), plot.title = element_text(size = 14), rect = element_rect(fill = "transparent")) + scale_y_continuous() + geom_hline(yintercept = 1.0, color = "coral2", size = 1.0) + scale_color_manual(values = c("coral3", "coral1"))
While the model predicts on the scale of logits (log-odds), sjPlot
automatically transforms the coefficients to odds ratios by exponentiating them. This is a fairly natural way of interpreting coefficients from a model like this, so let's take a look at the best and worst teams.
The Los Angeles Rams, the team with the highest random intercept, has an odds ratio of almost 2, meaning they are favored 2:1 to win against the hypothetical average team in the League. The Rams were the eventual Super Bowl winners in 2021, so this seems like a sensible model prediction, at least superficially.
The Jacksonville Jaguars, the team with the lowest random intercept, has an odds ratio of about .5, meaning their odds of beating the hypothetical League average team are 1:2, which is consistent with their League-worst 3-14 win/loss record.
So far, so good; the model correctly predicted the League's best and worst teams in the 2021 season. So what about our hypothetical matchup, the LA Chargers playing the Green Bay Packers?
It's easiest to understand how to make this prediction if we compute it by hand. We start by looking up the random intercepts for the Packers (GB) and Chargers (LAC).
# extract random intercepts for Green Bay and the LA Chargers ranef(m)$indicators[c("GB", "LAC"), , drop = FALSE] |> round(2)
By exponentiating the difference between these intercepts, we can compute the odds ratio in favor of Green Bay winning this hypothetical matchup.
$$OR(GB > LAC) = \exp(\beta_{GB} - \beta_{LAC}) = \exp(.50 - .07) = 1.54$$
The odds of the 2021 Packers winning against the 2021 Chargers, disregarding any potential home field advantage, are 1.54:1.
To double-check the predictions we got from lmerMultiMember
, we can fit the same model structure in Stan, the probabilistic programming language for Bayesian models. Replicating this model in a Bayesian framework is convenient, because we can reproduce the mild shrinkage that lme4
applies to its random effects. (We can approximate the implicit uniform prior on home field advantage in the lmerMultiMember
by putting a very wide prior on that coefficient in the Stan model.)
We first specify the Stan model's structure:
```{stan, output.var = "model_stan", eval = FALSE}
data {
int
Then, we pass data to the model and sample from it using `rstan`. Conveniently, we can use the helper function `lmerMultiMember::bradleyterry_from_vectors()` to generate an indicator matrix for Stan, as well: ```r # use multiple cores for sampling options(mc.cores = parallel::detectCores()) # prepare data for passing to Stan sampler stan_data <- list( home_win = nfl_scores_2021$home_win, N = nrow(nfl_scores_2021), M = length(unique(nfl_scores_2021$home_team)), W = t(as.matrix(bradleyterry_from_vectors(nfl_scores_2021$home_team, nfl_scores_2021$visiting_team))) ) # sample from Stan model stan_fit <- rstan::sampling(model_stan, data = stan_data, warmup = 2000, iter = 8000) # create tidy data frame from posterior samples stan_post <- broom.mixed::tidy(stan_fit, pars = "team_OR", conf.int = TRUE) # store fitted Stan model save(stan_post, file = "bt_model_stan_fit.Rds")
Finally, we can plot the HDIs from the posterior samples in the same manner that we plotted the lmerMultiMember
effects:
# load fitted Stan model (this is to save resources when building vignettes) load("bt_model_stan_fit.Rds") # sort teams from low to high and color-code stan_post$term <- sort(unique(nfl_scores_2021$home_team)) stan_post$color <- cut(stan_post$estimate, breaks = c(0, 1, Inf), labels = c("negative", "positive")) # plot 95% HDIs stan_post[order(stan_post$estimate), ] |> ggplot(aes(y = estimate, x = factor(term, level = term), color = color)) + geom_point(size = 3) + geom_linerange(aes(ymin = conf.low, ymax = conf.high, x = factor(term, level = term), color = color), size = 1) + theme_bw() + ggtitle("Odds ratio of team winning against League-average team") + theme(axis.ticks = element_blank(), axis.text.x = element_text(size = 12, angle = 60, hjust = 1.0), axis.text.y = element_text(size = 12), plot.title = element_text(size = 14), legend.position = "none", rect = element_rect(fill = "transparent")) + xlab("") + ylab("") + geom_hline(yintercept = 1.0, color = "coral2", size = 1.0) + scale_color_manual(values = c("coral3", "coral1"))
the rstan
and lmerMultiMember
estimates of team strength seem to match, although the uncertainty intervals are a little bit wider here. (This could potentially be fixed by setting stricter priors.)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.