In kamclean/finalpsm: Propensity-score matching with tidyverse and finalfit

knitr::opts_chunk$set(echo = TRUE, message=F, warning=F)

library(dplyr); library(tibble); library(finalpsm); library(patchwork)

Propensity-score Matching (PSM) with the finalpsm package

Create the matching dataset

Firstly we should ensure all variables of interest are in the correct formats for subsequent propensity-score matching. Variables should either be factors or numerical as appropriate.

We will be using the survival::colon dataset as the basis of our example.

data <- tibble::as_tibble(survival::colon) %>%

  dplyr::filter(etype==2) %>% # Outcome of interest is death
  dplyr::filter(rx!="Obs") %>%  # rx will be our binary treatment variable
  dplyr::select(-etype,-study, -status) %>% # Remove superfluous variables

  # Convert into numeric and factor variables
  dplyr::mutate_at(vars(obstruct, perfor, adhere, node4), function(x){factor(x, levels=c(0,1), labels = c("No", "Yes"))}) %>%
  dplyr::mutate(rx = factor(rx),
                mort365 = cut(time, breaks = c(-Inf, 365, Inf), labels = c("Yes", "No")),
                sex = factor(sex, levels=c(0,1), labels = c("Female", "Male")),
                differ = factor(differ, levels = c(1,2,3), labels = c("Well", "Moderate", "Poor")),
                extent = factor(extent, levels = c(1,2,3, 4), labels = c("Submucosa", "Muscle", "Serosa", "Contiguous Structures")),
                surg = factor(surg, levels = c(0,1), labels = c("Short", "Long"))) %>%

  # Logical value for outcome (for survival analysis)
  dplyr::mutate(status = mort365=="Yes")

knitr::kable(head(data, 10))

Propensity-score matching with matchit

With the data pre-processed, we can use the finalpsm matchit() function to create the PSM dataset. This is essentially a wrapper function to the MatchIt matchit() function - one of the most widely used packages for PSM within R. However, the finalpsm matchit() addresses what are felt to be several key issues:

Issue: The MatchIt::matchit() does not easily fit into a tidyverse workflow.
Solution: Inspired by the tidyverse and finalfit, the matchit() function facilitates a modulator approach to building formulae.

Problem: The MatchIt::matchit() will not run if there is missing data present in the strata or explanatory variables.
Solution: The matchit() function performs pairwise deletion only on those the strata or explanatory variables

Problem: The MatchIt::matchit() will not run if the strata variable is not a numeric binary variable (0 or 1).
Solution: The matchit() function handles the strata variable internally to allow either a factor or numeric variable to be used. The lowest value or first level will be designed the control (0), and the highest value or second level will be designated the treatment (1).

There are 4 types of variables that can be specified in the finalpsm::match function:

strata - The binary treatment variable
explanatory - All variables that are desired to be used to generate the propensity-score.
id - An optional variable that allows any number of columns that provide a method of identifying the patient to be retained (e.g. the record or hospital id for each patient). This is not used in the propensity-score matching process.
dependent- An optional variable that allows any number of columns that contain the desired outcome(s) of the patient that will be used within the subsequent modeling. This is not used in the propensity-score matching process.

All other inputs to the MatchIt::matchit() are accepted (see MatchIt documentation) with the default method of matching being full matching.

output <- data %>%
  finalpsm::match(strata = "rx",
                    explanatory = c("age","sex", "obstruct", "differ", "surg"),
                    id = "id",
                    dependent = c("mort365", "time"),
                    method = "full")

The outputs from the finalpsm::match() function include:

1. object: This can be used as like any MatchIt::matchit() output and so facilitates its use within existing scripts (although this function is designed with the intention to be used with subsequently described functions).

output$object

1. data: This is the final matched dataset. However, this is not just the output from MatchIt::match.data() applied to the MatchIt::matchit() object, but also all specified variables under the dependent or id parameters. Unspecified variables are removed.
Note: If the keep_all argument is set to TRUE, this retains all columns in the original dataset (and so removes the need for specifying dependent or id variables).

head(output$data, 10) %>% knitr::kable()

There are 3 additional columns appended to the dataset following propensity-score matching:

distance: This is the propensity-score generated for that record. This is the probability of that patient being in the treatment group (strata), given the covariates supplied (explanatory).
weights: This is the weighting provided to each patient within the matching procedure (particularly relevant in full matching). As stated by the MatchIt authors (Ho et al.): "If one chooses options that allow matching with replacement, or any solution that has different numbers of controls (or treateds) within each subclass or strata (such as full matching), then the parametric analysis following matching must accomodate these procedures". As such, the weights column is used within subsequent modeling to ensure that the dataset remains balanced on the key variables.
subclass: This is which subclass (aka matched cases and controls) the patient has been divided into based on their propensity score. Within each of the subclasses the propensity score is, at least approximately, constant.

Assessment of Balance

So we have gone to the effort of matching the sample to allow inference. However, before we start using the matched dataset generated, we should ensure that the PSM process has achieved its goal of creating a balanced sample on our observed variables (as that is a key determinant of the validity of any conclusions based on this data).

This can be achieved in 2 ways: both visual and quantitative methods of assessing balance.

Balance Visualisation

There are several ways to visualise balance after PSM, which broadly fall into 2 categories: overall assessment and individual covariate assessment (explanatory).

The balance_plot() function provides the capability to easily generate several ggplots to allow quick assessment.

All plots only require the object produced by matchit.
All plots produced are relatively plain, but can be aesthetically modified further via ggplot2 functions.

1. Overall balance between treatment groups

This can be visualised as a density or jitter plot to allow comparison of the overall distribution of propensity-scores between the treatment groups.

finalpsm::balance_plot(output, type = "density") / finalpsm::balance_plot(output, type = "jitter")

Well balanced groups should follow approximately the same pattern (as in this case).

2. Covariate balance between treatment groups

This can be visualised as a geom_point plot with a line of best fit to allow comparison of the distribution of propensity-scores between treatment groups for the different covariates (explanatory variables).

# https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html
covariate <- finalpsm::balance_plot(output, type = "covariate")

covariate$factor / covariate$numeric

Well balanced groups should follow approximately the same pattern (as in this case).

This can also be visualized as a Love plot graphically displaying covariate balance before and after adjusting.

An absolute SMD of 0 is complete balance. This plot requires a decision regarding the threshold of what is considered "balanced" within the dataset (e.g. the accepted standardised mean difference (SMD) between the treatment and control groups for each variable). The default is an absolute SMD of 0.2, below which the variables are considered well balanced.
The arrow starts at the unmatched absolute SMD for that variable, and points to the absolute SMD following propensity score matching.

# https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html
finalpsm::balance_plot(output, type = "love", threshold = 0.2)

From the Love plot here, we can see that while some variables have become less balanced as a result of the propensity score matching process, these remain overall within our stated threshold of "good" balance (0.2). The unbalanced variable (sex) has seen a substantial improvement in balance as a result of the PSM process.

Balance Table

However, people usually like some objective numbers thrown in (even if it's sometimes about as arbitrary as squinting at a plot).

There are several packages out there that include some measure of formal quantification of the balance in a PSM sample ("balance tables"), MatchIt included. However, these tend to be poorly formatted for readability, comparison and publication.

The cobalt package is far more sophisticated regarding assessment of balance than finalpsm, however does not work well within the intended tidyverse / finalfit workflow to produce tables formatted for publication.

finalpsm::balance_table(output, threshold = 0.2) %>% knitr::kable()

As you can see with the unm_balance and mat_balance columns, the sample is already relatively well balanced in the "unmatched" sample (unm_) with in imbalance in the sex variable (more females in the treated group). However, in the propensity-score matched sample, a much improved balance has been achieved across all variables (all now below the a priori absolute standardised mean difference of 0.2).

Note: There is recognise concerns regarding the use of significant tests to determine differences in treatment groups as a method of assessing balance with propensity score matching Imai et al. (2008). As such, this is not automatically provided, but can be included if desired if p=TRUE argument is added.

kamclean/finalpsm documentation built on Oct. 3, 2023, 3:52 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com