Matching Patients in the `cancer` Dataset with `vecmatch`
In vecmatch: Generalized Propensity Score Estimation and Matching for Multiple Groups

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.asp = 0.8,
  echo = TRUE
)

Practical Example of the Vector Matching Workflow

In this chapter, we will describe an exemplary data analysis of the cancer dataset from the vecmatch package, matching multiple cohorts based on selected covariates. The vecmatch package follows a strict workflow based on the vector matching algorithm defined by @lopez2017estimation, and it is advisable to follow it. The whole process consists of five steps to ensure the best possible matching quality using the vector matching algorithm.

Step 1: Data Exploration and Initial Imbalance Assessment

The vecmatch package provides two graphical functions for assessing initial dataset imbalance: raincloud() and mosaic(). The raincloud() function is designed for continuous variables, while mosaic() is for categorical variables. Both functions allow grouping by up to two categorical variables using the group and facet aesthetics and compute various statistical summaries, including significance tests and effect size coefficients.

In this example analysis, we focus on two predictor variables from the cancer dataset: the discrete sex and the continuous age. To evaluate differences in age across status groups, we can use the raincloud() function as follows:

library(vecmatch)
library(ggplot2)

raincloud(cancer,
  age,
  status,
  significance = "t_test",
  sig_label_color = TRUE,
  sig_label_size = 3,
  limits = c(10, 120)
) +
  scale_y_continuous(breaks = seq(10, 100, 10))

This plot presents the conditional distribution of age across different status groups. The standardized mean differences (SMD) are represented by the brackets on the right side of the jittered point plot. The results of the Student's t-tests are displayed alongside the boxplots on the right, providing insights into the statistical significance of the differences in age between the groups.

Similarly, we can evaluate the differences in the sex variable using the mosaic() function. The code snippet below demonstrates this:

mosaic(cancer,
  status,
  sex,
  group_counts = TRUE,
  significance = TRUE
)

The plot shows the conditional distribution of sex across different status groups. The counts for all groups are displayed inside the corresponding boxes, representing each subgroup within the mosaic plot. Partial significance is visually coded through color filling aesthetics. The labels provide the results of the Chi-squared test of independence, summarizing the statistical assessment of the relationship between the sex and status variables.

Based on the plots, we can observe some imbalance in the age and sex variables. However, in both cases, the differences between the status groups appear to be rather moderate. All Student's t-tests, adjusted for multiple comparisons, resulted in statistically insignificant differences in mean age values between the groups. Additionally, 2 out of the 6 calculated standardized mean differences (SMDs) fell below the 0.10 threshold, indicating relatively small differences.

After a closer examination of the mean age values, we can notice that they form two separate clusters. The first cluster includes controls and patients with adenomas, who tend to be younger, while the second cluster consists of patients with benign and malignant colorectal cancer, who are generally older. The primary objective of the matching process will then be to align these two clusters, ensuring they have a common mean age value.

There were no significant differences in the sex distributions across the status groups, as the overall Chi-square test of independence was insignificant. All partial significance tests were statistically insignificant, indicating small values of standardized Pearson's residuals.

Given the relatively small differences between the status groups, the vector matching algorithm is expected to produce satisfactory results.

Step 2: Estimation of Generalized Propensity Scores

The next step in the vector matching algorithm is the calculation of generalized propensity scores (GPS) for the treatment variable. These scores represent the treatment assignment probabilities and are based on user-defined covariates. The relationship between the treatment variable and predictors can be easily defined using R-specific formula notation, allowing for both additive and interaction effects.

The vecmatch package provides a straightforward method for calculating the GPS-matrix with different methods using the estimate_gps() function. However, custom approaches can also be applied within the vector-matching workflow. The GPS-matrix used in subsequent analyses must have the exact same structure as the output of estimate_gps, containing the treatment variable and the treatment assignment probabilities for each level of the treatment variable. Importantly, the probabilities for each row in the GPS-matrix must sum to 1. The resulting data.frame has to be of class gps.

In the following example, we will use a simple formula with age and sex as predictors, allowing for an interaction effect. The method used to estimate the generalized propensity scores (GPS) will be a multinomial logistic regression, which can handle multiple levels of the treatment variable. The code for estimating the GPS using the multinomial logistic regression model is as follows:

formula_cancer <- status ~ age * sex
gps_matrix <- estimate_gps(formula_cancer,
  cancer,
  method = "multinom",
  reference = "control"
)
head(gps_matrix, 7)

Step 3: Calculating Common Support Region Borders

To remove observations that are not eligible for further matching, the borders of the common support region (CSR) must be calculated. The csregion() function facilitates this process by taking the gps_matrix object as its sole argument. It computes the CSR borders and automatically filters the input gps_matrix to include only the observations that fall within the calculated CSR borders.

The function returns an object of class csr_matrix, which contains the filtered data along with several additional attributes. These attributes enable users to manually filter the initial gps_matrix or access a summary of the CSR border calculation as a data.frame object. The csregion() function can be used as follows:

csr_matrix <- csregion(gps_matrix)

We can then compare the dimensions of the original gps_matrix and the filtered csr_matrix.

dim(gps_matrix)
dim(csr_matrix)

Clearly, 24 observations have been filtered out of the original matrix.

Next, as recommended by @lopez2017estimation, the GPS can be recalculated using only the data within the CSR. However, in our experience, this additional step does not result in a significant improvement in matching quality, especially if the number of observations that fall outside of the CSR is relatively small in comparison to the whole dataset size. Therefore, we decided to bypass this step and proceed with matching directly on the filtered dataset.

Step 4: k-Means Clustering and Matching {#sec-examatch}

The most crucial step of the vector matching algorithm is the k-means clustering and subsequent matching within the resulting clusters. This approach ensures that only observations with similar GPS vectors are matched, thereby justifying the name vector matching. The core components of this step are the stats::kmeans() function [@rlang2024], which performs clustering, and the Matching::Matchby() or optmatch::fullmatch() functions [@sekhon2011multivariate, @hansen2006optimal], which match observations within each k-means cluster. These functionalities are combined into a single function named match_gps().

The match_gps() function takes the csr_matrix as input and returns a matched dataset. The matching process can be customized using various arguments, and it is advisable to experiment with different parameter combinations to optimize both matching quality and the number of matched samples (see appendix \ref{app:practical}). For the replicability of results, it is also crucial to set the random number generator seed before running the clustering and matching procedures. This ensures that the clustering assignments and matches are consistent across different runs of the analysis, allowing others to reproduce the results exactly.

The match_gps() function can be used as follows:

set.seed(164373)
matched_cancer <- match_gps(csr_matrix,
  caliper = 0.21,
  kmeans_cluster = 2,
  reference = "control",
  replace = TRUE,
  method = "fullopt",
  order = "desc"
)

Step 5: Post-Matching Quality Assessment {#quality}

The vecmatch package provides a dedicated function, balqual(), to assess the quality of post-matching results. This function compares various metrics and statistical summaries between the pre- and post-matching datasets, using a user-defined formula to evaluate balance. Its usage is straightforward and requires only the matched dataset (produced by match_gps()) and the formula used for the calculation of the GPS:

balqual(matched_cancer,
  formula_cancer,
  type = "smd",
  statistic = "max",
  round = 4
)

After assessing the quality with balqual(), we can combine both matched and unmatched datasets into a single data frame to visualize age and sex using raincloud() and mosaic():

matched_cancer$dataset <- "matched"
unmatched_cancer <- cancer
unmatched_cancer$dataset <- "unmatched"
data_full <- rbind(matched_cancer, unmatched_cancer)

raincloud(data_full,
  age,
  status,
  dataset,
  significance = "t_test",
  sig_label_color = TRUE,
  sig_label_size = 3,
  limits = c(10, 120)
) +
  scale_y_continuous(breaks = seq(10, 100, 10))

mosaic(data_full,
  status,
  sex,
  dataset,
  group_counts = TRUE,
  significance = TRUE
)

References

Any scripts or data that you put into this service are public.

vecmatch documentation built on June 8, 2025, 9:36 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

vecmatch
Generalized Propensity Score Estimation and Matching for Multiple Groups

Matching Patients in the `cancer` Dataset with `vecmatch`
In vecmatch: Generalized Propensity Score Estimation and Matching for Multiple Groups

Practical Example of the Vector Matching Workflow

Step 1: Data Exploration and Initial Imbalance Assessment

Step 2: Estimation of Generalized Propensity Scores

Step 3: Calculating Common Support Region Borders

Step 4: k-Means Clustering and Matching {#sec-examatch}

Step 5: Post-Matching Quality Assessment {#quality}

References

Try the vecmatch package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

vecmatch Generalized Propensity Score Estimation and Matching for Multiple Groups

Matching Patients in the `cancer` Dataset with `vecmatch` In vecmatch: Generalized Propensity Score Estimation and Matching for Multiple Groups

Practical Example of the Vector Matching Workflow

Step 1: Data Exploration and Initial Imbalance Assessment

Step 2: Estimation of Generalized Propensity Scores

Step 3: Calculating Common Support Region Borders

Step 4: k-Means Clustering and Matching {#sec-examatch}

Step 5: Post-Matching Quality Assessment {#quality}

References

Try the vecmatch package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

vecmatch
Generalized Propensity Score Estimation and Matching for Multiple Groups

Matching Patients in the `cancer` Dataset with `vecmatch`
In vecmatch: Generalized Propensity Score Estimation and Matching for Multiple Groups