knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
The polymatching algorithm is organized in two main stages. First, a starting matched sample is generated to initialize the algorithm. Second, an iterative procedure is applied to reduce the distance within matched sets, thus improving the quality of the solution.
For the initialization of the first matched sample, this package implements a sequential strategy. The units of the first two groups are optimally matched. Then, the units from the third group are optimally matched to the existing pairs. The units from the fourth group are then optimally matched to the existing triplets, and so on. Note that, at each step, the existing two-group optimal matching procedure can be applied because each problem can be addressed by a two-group matching algorithm (matching units to fixed pairs, units to fixed triplets and so on). The order of the considered groups can be specified by the user or, by default, is based on their size, sequentially matching groups from the smallest to the largest. Furthermore, the package enables user to provide any given matched set as starting point, so that the algorithm can explore any improvement in terms of total distance.
The second stage of the algorithm is the iterative procedure. For example, with $K$ groups, the algorithm starts by relaxing the linkage with the first group and the units of this group are rematched to the fixed $(K-1)$-uples. The procedure is repeated for each of the $K$ groups and the attained total distance is recorded. If any newly generated matched set improves the total distance, the matched sample attaining the smallest total distance is selected and the iterations continue. Again, the linkage to each group is sequentially relaxed and possible reductions in the total distance are explored. The algorithm stops when no improvement in the total distance is found.
To showcase the functionalities of the polymatching
package, let's start by generating a toy dataset with 3 continuous variables ($X_1$, $X_2$ and $X_3$), one binary variable ($Z$) and one categorical variable ($W$), with four levels. The units are divided in 4 groups, labeled as A, B, C and D, with different size ($n_A = 10$, $n_B = 100$, $n_C = n_D = 500$). Let $G$ be the name of the variable identifying the assigned group.
#Set seed for reproducibility of results set.seed(1234) #Group sizes n_A <- 50 n_B <- 750 n_C <- n_D <- 1000 #Generate variables separately for each group dat_A <- data.frame(X_1 = rnorm(n_A, mean = 0, sd = 1), X_2 = rnorm(n_A, mean = 5, sd = 1), X_3 = rnorm(n_A, mean = -5, sd = 1), Z = factor(rbinom(n_A, size = 1, prob = .5)), W = factor(apply(rmultinom(n_A, size = 1, prob = c(.25, .25, .25, .25)) == 1, MARGIN = 2, FUN = which)), G = "A", stringsAsFactors = FALSE) dat_B <- data.frame(X_1 = rnorm(n_B, mean = 0, sd = 1), X_2 = rnorm(n_B, mean = 5, sd = 1), X_3 = rnorm(n_B, mean = -5, sd = 1), Z = factor(rbinom(n_B, size = 1, prob = .5)), W = factor(apply(rmultinom(n_A, size = 1, prob = c(.25, .25, .25, .25)) == 1, MARGIN = 2, FUN = which)), G = "B", stringsAsFactors = FALSE) dat_C <- data.frame(X_1 = rnorm(n_C, mean = -0.5, sd = 1), X_2 = rnorm(n_C, mean = 4.5, sd = 1), X_3 = rnorm(n_C, mean = -4.5, sd = 1), Z = factor(rbinom(n_C, size = 1, prob = .2)), W = factor(apply(rmultinom(n_A, size = 1, prob = c(.1, .2, .3, .4)) == 1, MARGIN = 2, FUN = which)), G = "C", stringsAsFactors = FALSE) dat_D <- data.frame(X_1 = rnorm(n_D, mean = 0.5, sd = 1), X_2 = rnorm(n_D, mean = 5.5, sd = 1), X_3 = rnorm(n_D, mean = -5.5, sd = 1), Z = factor(rbinom(n_D, size = 1, prob = .8)), W = factor(apply(rmultinom(n_A, size = 1, prob = c(.4, .3, .2, .1)) == 1, MARGIN = 2, FUN = which)), G = "D", stringsAsFactors = FALSE) dat <- rbind(dat_A, dat_B, dat_C, dat_D)
The following plot shows the distribution of the continuous variables.
library(ggplot2) library(tidyr) dat %>% pivot_longer(cols = c("X_1", "X_2", "X_3"), names_to = "variable", values_to = "value") %>% ggplot() + geom_density(aes(x = value, group = G, color = G, after_stat(count))) + facet_grid(cols = vars(variable), scales = "free") + labs(x = element_blank(), y = element_blank(), color = "Group")
The distribution of the binary and categorical variables is represented below. The two figures show a good overlap in all of the variables.
dat %>% pivot_longer(cols = c("Z", "W"), names_to = "variable", values_to = "value") %>% ggplot() + geom_bar(aes(x = G, fill = value)) + facet_grid(cols = vars(variable), scales = "free") + labs(x = "Group", y = "N", fill = "Value")
polymatch
The function polymatch
can be used to generate matched sets with units that are similar with respect to the desired variables. In the simplest matched design, matched sets are made by one unit for each group. In the simulated dataset, this means creating quadruplets with one subject from each of the four groups. This is the default design of polymatch
. The following code can be used to generate matched sets that are similar with respect to the three continuous variables $X_1$, $X_2$ and $X_3$.
library(polymatching) result_match_1 <- polymatch(G ~ X_1 + X_2 + X_3, data = dat, distance = "euclidean")
The output shows the total distance attained at each step of the iterative process. To identify the generated sets, the IDs of the matched sets can be added to the dataset as a new variable.
dat$match_id_1 <- result_match_1$match_id
We can check that each group is made of four units, one from each group. The number of matched sets is 10, as this is the size of the smallest group.
head(table(dat$match_id_1, dat$G, dnn = c("Match ID", "Group")))
As the groups B, C and D are much larger than A, we can also match multiple subjects from B, C and D to each subject in A. This can be done with the parameter vectorK
of the function.
result_match_2 <- polymatch(G ~ X_1 + X_2 + X_3, data = dat, distance = "euclidean", vectorK = c("A" = 1, "B" = 2, "C" = 3, "D" = 3)) dat$match_id_2 <- result_match_2$match_id
Now, each unit from A is matched to 2 units from B and to 3 units from C and D.
head(table(dat$match_id_2, dat$G, dnn = c("Match ID", "Group")))
It is possible to force the algorithm to match only the units that share the value of one variable (or multiple variables). This can be done with the exactMatch
parameter. With the following code, only units with the same value in $Z$ and $W$ can be matched.
result_match_3 <- polymatch(G ~ X_1 + X_2 + X_3, data = dat, distance = "euclidean", exactMatch = ~ Z + W) dat$match_id_3 <- result_match_3$match_id
A popular strategy to balance multiple covariates is to match on the propensity score. In the two-group setting, it is defined as the probability of receiving the treatment or exposure of interest and is commonly estimated through logistic regression. With multiple treatment groups, it is defined as the vector of probabilities of receiving each treatment and is generally estimated through a multinomial logistic model.
library(VGAM) psModel <- vglm(G ~ X_1 + X_2 + X_3 + Z + W, family = multinomial, data = dat) summary(psModel)
With 4 groups, the model estimates 4 probabilities and 3 logits ($log(P(G='A')/P(G='D'))$, $log(P(G='B')/P(G='D'))$, $log(P(G='C')/P(G='D'))$). Units are generally matched on the logits of the propensity score.
logitPS <- predict(psModel, type = "link") dat$logit_AvsD <- logitPS[,1] dat$logit_BvsD <- logitPS[,2] dat$logit_CvsD <- logitPS[,3] result_match_4 <- polymatch(G ~ logit_AvsD + logit_BvsD + logit_CvsD, data = dat, distance = "euclidean") dat$match_id_4 <- result_match_4$match_id
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.