In gnattino/polymatching: A Matching Algorithm for Designs with Multiple Groups

knitr::opts_chunk$set(echo = TRUE, 
                      message = FALSE, 
                      warning = FALSE)

The algorithm: a brief overview

The polymatching algorithm is organized in two main stages. First, a starting matched sample is generated to initialize the algorithm. Second, an iterative procedure is applied to reduce the distance within matched sets, thus improving the quality of the solution.

For the initialization of the first matched sample, this package implements a sequential strategy. The units of the first two groups are optimally matched. Then, the units from the third group are optimally matched to the existing pairs. The units from the fourth group are then optimally matched to the existing triplets, and so on. Note that, at each step, the existing two-group optimal matching procedure can be applied because each problem can be addressed by a two-group matching algorithm (matching units to fixed pairs, units to fixed triplets and so on). The order of the considered groups can be specified by the user or, by default, is based on their size, sequentially matching groups from the smallest to the largest. Furthermore, the package enables user to provide any given matched set as starting point, so that the algorithm can explore any improvement in terms of total distance.

The second stage of the algorithm is the iterative procedure. For example, with $K$ groups, the algorithm starts by relaxing the linkage with the first group and the units of this group are rematched to the fixed $(K-1)$-uples. The procedure is repeated for each of the $K$ groups and the attained total distance is recorded. If any newly generated matched set improves the total distance, the matched sample attaining the smallest total distance is selected and the iterations continue. Again, the linkage to each group is sequentially relaxed and possible reductions in the total distance are explored. The algorithm stops when no improvement in the total distance is found.

A toy dataset

To showcase the functionalities of the polymatching package, let's start by generating a toy dataset with 3 continuous variables ($X_1$, $X_2$ and $X_3$), one binary variable ($Z$) and one categorical variable ($W$), with four levels. The units are divided in 4 groups, labeled as A, B, C and D, with different size ($n_A = 10$, $n_B = 100$, $n_C = n_D = 500$). Let $G$ be the name of the variable identifying the assigned group.

#Set seed for reproducibility of results
set.seed(1234)

#Group sizes
n_A <- 50
n_B <- 750
n_C <- n_D <- 1000

#Generate variables separately for each group
dat_A <- data.frame(X_1 = rnorm(n_A, mean = 0,  sd = 1), 
                    X_2 = rnorm(n_A, mean = 5,  sd = 1), 
                    X_3 = rnorm(n_A, mean = -5,  sd = 1), 
                    Z = factor(rbinom(n_A, size = 1, prob = .5)),
                    W = factor(apply(rmultinom(n_A, size = 1, prob = c(.25, .25, .25, .25)) == 1, 
                                     MARGIN = 2, 
                                     FUN = which)),
                    G = "A",
                    stringsAsFactors = FALSE)

dat_B <- data.frame(X_1 = rnorm(n_B, mean = 0,  sd = 1), 
                   X_2 = rnorm(n_B, mean = 5,  sd = 1), 
                   X_3 = rnorm(n_B, mean = -5,  sd = 1), 
                   Z = factor(rbinom(n_B, size = 1, prob = .5)),
                   W = factor(apply(rmultinom(n_A, size = 1, prob = c(.25, .25, .25, .25)) == 1, 
                                     MARGIN = 2, 
                                     FUN = which)),
                   G = "B",
                   stringsAsFactors = FALSE)

dat_C <- data.frame(X_1 = rnorm(n_C, mean = -0.5,  sd = 1), 
                   X_2 = rnorm(n_C, mean = 4.5,  sd = 1), 
                   X_3 = rnorm(n_C, mean = -4.5,  sd = 1), 
                   Z = factor(rbinom(n_C, size = 1, prob = .2)),
                   W = factor(apply(rmultinom(n_A, size = 1, prob = c(.1, .2, .3, .4)) == 1, 
                                     MARGIN = 2, 
                                     FUN = which)),
                   G = "C",
                   stringsAsFactors = FALSE)

dat_D <- data.frame(X_1 = rnorm(n_D, mean = 0.5,  sd = 1), 
                   X_2 = rnorm(n_D, mean = 5.5,  sd = 1), 
                   X_3 = rnorm(n_D, mean = -5.5,  sd = 1), 
                   Z = factor(rbinom(n_D, size = 1, prob = .8)),
                   W = factor(apply(rmultinom(n_A, size = 1, prob = c(.4, .3, .2, .1)) == 1, 
                                     MARGIN = 2, 
                                     FUN = which)),
                   G = "D",
                   stringsAsFactors = FALSE)

dat <- rbind(dat_A, dat_B, dat_C, dat_D)

The following plot shows the distribution of the continuous variables.

library(ggplot2)
library(tidyr)

dat %>%
  pivot_longer(cols = c("X_1", "X_2", "X_3"), 
               names_to = "variable", 
               values_to = "value") %>%
  ggplot() + 
  geom_density(aes(x = value, group = G, color = G, after_stat(count))) +
  facet_grid(cols = vars(variable), scales = "free") +
  labs(x = element_blank(), y = element_blank(), color = "Group")

The distribution of the binary and categorical variables is represented below. The two figures show a good overlap in all of the variables.

dat %>%
  pivot_longer(cols = c("Z", "W"), 
               names_to = "variable", 
               values_to = "value") %>%
  ggplot() + 
  geom_bar(aes(x = G, fill = value)) +
  facet_grid(cols = vars(variable), scales = "free") +
  labs(x = "Group", y = "N", fill = "Value")

Getting started with `polymatch`

The function polymatch can be used to generate matched sets with units that are similar with respect to the desired variables. In the simplest matched design, matched sets are made by one unit for each group. In the simulated dataset, this means creating quadruplets with one subject from each of the four groups. This is the default design of polymatch. The following code can be used to generate matched sets that are similar with respect to the three continuous variables $X_1$, $X_2$ and $X_3$.

library(polymatching)
result_match_1 <- polymatch(G ~ X_1 + X_2 + X_3, 
                        data = dat,
                        distance = "euclidean")

The output shows the total distance attained at each step of the iterative process. To identify the generated sets, the IDs of the matched sets can be added to the dataset as a new variable.

dat$match_id_1 <- result_match_1$match_id

We can check that each group is made of four units, one from each group. The number of matched sets is 10, as this is the size of the smallest group.

head(table(dat$match_id_1, 
      dat$G, 
      dnn = c("Match ID", "Group")))

Unbalanced matched sets

As the groups B, C and D are much larger than A, we can also match multiple subjects from B, C and D to each subject in A. This can be done with the parameter vectorK of the function.

result_match_2 <- polymatch(G ~ X_1 + X_2 + X_3, 
                        data = dat,
                        distance = "euclidean",
                        vectorK = c("A" = 1, 
                                    "B" = 2,
                                    "C" = 3,
                                    "D" = 3))
dat$match_id_2 <- result_match_2$match_id

Now, each unit from A is matched to 2 units from B and to 3 units from C and D.

head(table(dat$match_id_2, 
      dat$G, 
      dnn = c("Match ID", "Group")))

Exact matching constraints

It is possible to force the algorithm to match only the units that share the value of one variable (or multiple variables). This can be done with the exactMatch parameter. With the following code, only units with the same value in $Z$ and $W$ can be matched.

result_match_3 <- polymatch(G ~ X_1 + X_2 + X_3, 
                        data = dat,
                        distance = "euclidean",
                        exactMatch = ~ Z + W)
dat$match_id_3 <- result_match_3$match_id

Propensity score matching

A popular strategy to balance multiple covariates is to match on the propensity score. In the two-group setting, it is defined as the probability of receiving the treatment or exposure of interest and is commonly estimated through logistic regression. With multiple treatment groups, it is defined as the vector of probabilities of receiving each treatment and is generally estimated through a multinomial logistic model.

library(VGAM)
psModel <- vglm(G ~ X_1 + X_2 + X_3 + Z + W,
                family = multinomial, 
                data = dat)
summary(psModel)

With 4 groups, the model estimates 4 probabilities and 3 logits ($log(P(G='A')/P(G='D'))$, $log(P(G='B')/P(G='D'))$, $log(P(G='C')/P(G='D'))$). Units are generally matched on the logits of the propensity score.

logitPS <- predict(psModel, type = "link")
dat$logit_AvsD <- logitPS[,1]
dat$logit_BvsD <- logitPS[,2]
dat$logit_CvsD <- logitPS[,3]

result_match_4 <- polymatch(G ~ logit_AvsD + logit_BvsD + logit_CvsD, 
                            data = dat,
                            distance = "euclidean")
dat$match_id_4 <- result_match_4$match_id