Off-label usage of lmerMultiMember

The ability to pass arbitrary random effects matrices to lme4 makes lmerMultiMember useful for more than just vanilla multiple membership models. Any model that can be specified as a combination of "conventional" fixed effects and an arbitrary random effects matrix can be specified using lmerMultiMember. (Although for certain model specifications the underlying lme4 model fitting may not converge.)

One class of models that can be fit using lmerMultiMember are Bradley-Terry models, used to predict the probability of a given individual winning in a paired comparison with another individual, even when information is incomplete (i.e. not every individual has directly encountered every other individual). One fairly prominent application is to predict rankings of chess players, even if not every player in the list has played every other player.

The Bradley-Terry model can be written as a Generalized Linear Mixed-effects Model (GLMM) as follows: $$\operatorname{logit}(P(i > j)) = \log\left(\frac{P(i > j)}{1 - P(i > j)}\right) = \log\left(\frac{P(i > j)}{P(j > i)}\right) = \beta_i - \beta_j$$ If you're not that comfortable with the math, don't worry! This formulation of the Bradley-Terry model is just a logistic regression with a random effects matrix that has the usual positive indicators for one individual in each comparison, but negative indicators for the other individual. In this formulation we model the probability the the individual with the positive indicator "wins" the comparison. There is often a natural way to assign individuals to sides of the comparison, for instance by always giving the home team the positive indicator (in this case, if we model an intercept, that intercept represents the home field advantage) or, in chess, to always assign the player with the white pieces the positive indicator.

lmerMultiMember::bradleyterry_from_vectors() is a helper function that creates the indicator matrix from a vector of home teams (i.e. individuals that will be assigned positive indicators) and a vector of visiting teams (i.e. individuals that will be assigned negative indicators). If this function is used to generate the indicator matrix, we will be modeling the probability of the home team winning.

Predicting NFL team rankings for the 2021 season

To demonstrate how to specify and fit a Bradley-Terry model in lmerMultiMember, in this vignette we will rank National Football League teams by their performance in the 2021 season. Of course we know who "won" the season, because the Los Angeles Rams won the Super Bowl at the end of the playoffs. However, with a Bradley-Terry model, we can predict win probabilities even for teams who did not play a single game against each other during the season. For example, I might be wondering how my local team, the Green Bay Packers, would have fared against my spouse's home team, the Los Angeles Chargers (formerly of San Diego), even though these teams are in different conferences and did not play against each other in 2021 season. As we'll demonstrate in this vignette, the coefficients from the Bradley-Terry model can give you a predicted outcome for this hypothetical game.

NFL scores for the 2021 season are included with lmerMultiMember as a dataset, so all we need to do to load the data is load the package itself. Then, we can take a quick look at the data.

# load packages
library(lmerMultiMember)  # for dataset and modeling
library(kableExtra)  # for displaying tabular data
library(dplyr)  # for data manipulation
library(sjPlot)  # for plotting model coefficients
library(ggplot2)  # for general plotting stuff

# display the NFL 2021 season data
nfl_scores_2021 |>
  kable() |>
  kable_styling(bootstrap_options = c("hover", "condensed"), font_size = 16) |>
  scroll_box(height = "250px")

Generating a Bradley-Terry indicator matrix and fitting a model

From this dataset we can generate a Bradley-Terry indicator matrix using lmerMultiMember::bradleyterry_from_vectors().

# generate indicator matrix from home team and visiting team vectors
W <- bradleyterry_from_vectors(nfl_scores_2021$home_team,
                               nfl_scores_2021$visiting_team)

# check dimensions of indicator matrix (should be teams x games)
# i.e. 32 x 285
paste(c("rows/teams:", ", columns/games:"), dim(W)) |> cat()

With the indicator matrix generated, all we need to do is specify a logistic glmer() model to predict wins by the home team, and insert our indicator model using a dummy variable.

# convert winners to binary variable for more transparent interpretation
nfl_scores_2021$home_win <- recode(nfl_scores_2021$winner,
                                   "visiting" = 0, "home" = 1)

# fit model
m <- glmer(home_win ~ (1 | indicators),
           family = binomial,
           memberships = list(indicators = W),
           data = nfl_scores_2021)

# show model summary
summary(m)

Sanity check on random intercepts (team strength)

With the model fit, we can start interpreting the coefficients. The intercept of the model represents the home field advantage (i.e. the probability of a home team winning, regardless of that team's strength). The strength of individual teams is captured in the random intercepts.

# plot random effects
plot_model(m, type = "re", sort.est = TRUE, grid = FALSE,
           line.size = 1.0, dot.size = 3.0) +
  coord_cartesian() +
  theme_bw() +
  ggtitle("Odds ratio of team winning against League-average team") +
  theme(axis.ticks = element_blank(),
        axis.text.x = element_text(size = 12, angle = 60, hjust = 1.0),
        axis.text.y = element_text(size = 12),
        plot.title = element_text(size = 14),
        rect = element_rect(fill = "transparent")) +
  scale_y_continuous() +
  geom_hline(yintercept = 1.0, color = "coral2", size = 1.0) +
  scale_color_manual(values = c("coral3", "coral1"))

While the model predicts on the scale of logits (log-odds), sjPlot automatically transforms the coefficients to odds ratios by exponentiating them. This is a fairly natural way of interpreting coefficients from a model like this, so let's take a look at the best and worst teams.

The Los Angeles Rams, the team with the highest random intercept, has an odds ratio of almost 2, meaning they are favored 2:1 to win against the hypothetical average team in the League. The Rams were the eventual Super Bowl winners in 2021, so this seems like a sensible model prediction, at least superficially.

The Jacksonville Jaguars, the team with the lowest random intercept, has an odds ratio of about .5, meaning their odds of beating the hypothetical League average team are 1:2, which is consistent with their League-worst 3-14 win/loss record.

Predicting the winner of an unplayed game

So far, so good; the model correctly predicted the League's best and worst teams in the 2021 season. So what about our hypothetical matchup, the LA Chargers playing the Green Bay Packers?

It's easiest to understand how to make this prediction if we compute it by hand. We start by looking up the random intercepts for the Packers (GB) and Chargers (LAC).

# extract random intercepts for Green Bay and the LA Chargers
ranef(m)$indicators[c("GB", "LAC"), , drop = FALSE] |> round(2)

By exponentiating the difference between these intercepts, we can compute the odds ratio in favor of Green Bay winning this hypothetical matchup.

$$OR(GB > LAC) = \exp(\beta_{GB} - \beta_{LAC}) = \exp(.50 - .07) = 1.54$$

The odds of the 2021 Packers winning against the 2021 Chargers, disregarding any potential home field advantage, are 1.54:1.

Replicating the Bradley-Terry model in Stan

To double-check the predictions we got from lmerMultiMember, we can fit the same model structure in Stan, the probabilistic programming language for Bayesian models. Replicating this model in a Bayesian framework is convenient, because we can reproduce the mild shrinkage that lme4 applies to its random effects. (We can approximate the implicit uniform prior on home field advantage in the lmerMultiMember by putting a very wide prior on that coefficient in the Stan model.)

We first specify the Stan model's structure:

```{stan, output.var = "model_stan", eval = FALSE} data { int N; // number of games int M; // number of teams matrix[N, M] W; // indicator matrix int home_win[N]; // outcome } parameters { real home_adv; // home field advantage vector[M] team_logOR; // team strength (log odds ratio) real sigma; // sd of team strengths } transformed parameters { vector[M] team_OR; // odds ratio per team team_OR = exp(team_logOR); // exponentiate team log OR } model { home_adv ~ normal(0, 100); // wide normal prior on home field advantage team_logOR ~ normal(0, sigma); // normal priors on team strength home_win ~ bernoulli_logit(home_adv + W * team_logOR); // likelihood }

Then, we pass data to the model and sample from it using `rstan`. Conveniently, we can use the helper function `lmerMultiMember::bradleyterry_from_vectors()` to generate an indicator matrix for Stan, as well:

```r

# use multiple cores for sampling
options(mc.cores = parallel::detectCores())

# prepare data for passing to Stan sampler
stan_data <- list(
  home_win = nfl_scores_2021$home_win,
  N = nrow(nfl_scores_2021),
  M = length(unique(nfl_scores_2021$home_team)),
  W = t(as.matrix(bradleyterry_from_vectors(nfl_scores_2021$home_team,
                                            nfl_scores_2021$visiting_team)))
)

# sample from Stan model
stan_fit <- rstan::sampling(model_stan, data = stan_data,
                               warmup = 2000, iter = 8000)

# create tidy data frame from posterior samples
stan_post <- broom.mixed::tidy(stan_fit, pars = "team_OR", conf.int = TRUE)

# store fitted Stan model
save(stan_post, file = "bt_model_stan_fit.Rds")

Finally, we can plot the HDIs from the posterior samples in the same manner that we plotted the lmerMultiMember effects:

# load fitted Stan model (this is to save resources when building vignettes)
load("bt_model_stan_fit.Rds")

# sort teams from low to high and color-code
stan_post$term <- sort(unique(nfl_scores_2021$home_team))
stan_post$color <- cut(stan_post$estimate, breaks = c(0, 1, Inf),
                  labels = c("negative", "positive"))

# plot 95% HDIs
stan_post[order(stan_post$estimate), ] |>
  ggplot(aes(y = estimate, x = factor(term, level = term), color = color)) +
  geom_point(size = 3) +
  geom_linerange(aes(ymin = conf.low, ymax = conf.high,
                     x = factor(term, level = term), color = color), size = 1) +
  theme_bw() +
  ggtitle("Odds ratio of team winning against League-average team") +
  theme(axis.ticks = element_blank(),
        axis.text.x = element_text(size = 12, angle = 60, hjust = 1.0),
        axis.text.y = element_text(size = 12),
        plot.title = element_text(size = 14),
        legend.position = "none",
        rect = element_rect(fill = "transparent")) +
  xlab("") + ylab("") +
  geom_hline(yintercept = 1.0, color = "coral2", size = 1.0) +
  scale_color_manual(values = c("coral3", "coral1"))

the rstan and lmerMultiMember estimates of team strength seem to match, although the uncertainty intervals are a little bit wider here. (This could potentially be fixed by setting stricter priors.)



jvparidon/lmerMultiMember documentation built on Oct. 22, 2023, 3:42 p.m.