finalpsm: Propensity-score matching with tidyverse and finalfit

Propensity-score Matching (PSM) with the finalpsm package

Firstly we should ensure all variables of interest are in the correct formats for subsequent propensity-score matching. Variables should either be factors or numerical as appropriate.

We will be using the survival::colon dataset as the basis of our example.

data <- tibble::as_tibble(survival::colon) %>%

  dplyr::filter(etype==2) %>% # Outcome of interest is death
  dplyr::filter(rx!="Obs") %>%  # rx will be our binary treatment variable
  dplyr::select(-etype,-study, -status) %>% # Remove superfluous variables

  # Convert into numeric and factor variables
  dplyr::mutate_at(vars(obstruct, perfor, adhere, node4), function(x){factor(x, levels=c(0,1), labels = c("No", "Yes"))}) %>%
  dplyr::mutate(rx = factor(rx),
                mort365 = cut(time, breaks = c(-Inf, 365, Inf), labels = c("Yes", "No")),
                sex = factor(sex, levels=c(0,1), labels = c("Female", "Male")),
                differ = factor(differ, levels = c(1,2,3), labels = c("Well", "Moderate", "Poor")),
                extent = factor(extent, levels = c(1,2,3, 4), labels = c("Submucosa", "Muscle", "Serosa", "Contiguous Structures")),
                surg = factor(surg, levels = c(0,1), labels = c("Short", "Long"))) %>%

  # Logical value for outcome (for survival analysis)
  dplyr::mutate(status = mort365=="Yes")

knitr::kable(head(data, 10))

id rx sex age obstruct perfor adhere nodes differ extent surg node4 time mort365 status 1 Lev+5FU Male 43 No No No 5 Moderate Serosa Short Yes 1521 No FALSE 2 Lev+5FU Male 63 No No No 1 Moderate Serosa Short No 3087 No FALSE 4 Lev+5FU Female 66 Yes No No 6 Moderate Serosa Long Yes 293 Yes TRUE 6 Lev+5FU Female 57 No No No 9 Moderate Serosa Short Yes 1767 No FALSE 7 Lev Male 77 No No No 5 Moderate Serosa Long Yes 420 No FALSE 9 Lev Male 46 No No Yes 2 Moderate Serosa Short No 3173 No FALSE 10 Lev+5FU Female 68 No No No 1 Moderate Serosa Long No 3308 No FALSE 11 Lev Female 47 No No Yes 1 Moderate Serosa Short No 2908 No FALSE 12 Lev+5FU Male 52 No No No 2 Poor Serosa Long No 3309 No FALSE 14 Lev Male 68 Yes No No 3 Moderate Serosa Short No 2910 No FALSE

With the data pre-processed, we can use the finalpsm matchit() function to create the PSM dataset. This is essentially a wrapper function to the MatchIt matchit() function - one of the most widely used packages for PSM within R. However, the finalpsm matchit() addresses what are felt to be several key issues:

Issue: The MatchIt::matchit() does not easily fit into a tidyverse workflow.
Solution: Inspired by the tidyverse and finalfit, the matchit() function facilitates a modulator approach to building formulae.
Problem: The MatchIt::matchit() will not run if there is missing data present in the strata or explanatory variables.
Solution: The matchit() function performs pairwise deletion only on those the strata or explanatory variables
Problem: The MatchIt::matchit() will not run if the strata variable is not a numeric binary variable (0 or 1).
Solution: The matchit() function handles the strata variable internally to allow either a factor or numeric variable to be used. The lowest value or first level will be designed the control (0), and the highest value or second level will be designated the treatment (1).

There are 4 types of variables that can be specified in the finalpsm::match function:

strata - The binary treatment variable
explanatory - All variables that are desired to be used to generate the propensity-score.
id - An optional variable that allows any number of columns that provide a method of identifying the patient to be retained (e.g. the record or hospital id for each patient). This is not used in the propensity-score matching process.
dependent- An optional variable that allows any number of columns that contain the desired outcome(s) of the patient that will be used within the subsequent modeling. This is not used in the propensity-score matching process.

All other inputs to the MatchIt::matchit() are accepted (see MatchIt documentation) with the default method of matching being full matching.

The outputs from the finalpsm::match() function include:

1. object: This can be used as like any MatchIt::matchit() output and so facilitates its use within existing scripts (although this function is designed with the intention to be used with subsequently described functions).

output$object

## 
## Call: 
## MatchIt::matchit(formula = rx_01 ~ age + sex + obstruct + differ + 
##     surg, data = data_match, method = method)
## 
## Sample sizes:
##           Control Treated
## All           300     298
## Matched       300     298
## Discarded       0       0

1. data: This is the final matched dataset. However, this is not just the output from MatchIt::match.data() applied to the MatchIt::matchit() object, but also all specified variables under the dependent or id parameters. Unspecified variables are removed.
Note: If the keep_all argument is set to TRUE, this retains all columns in the original dataset (and so removes the need for specifying dependent or id variables).

head(output$data, 10) %>% knitr::kable()

rowid id rx rx_01 age sex obstruct differ surg distance weights subclass match mort365 time 1 1 Lev+5FU 1 43 Male No Moderate Short 0.4639307 1.0000000 1 Matched No 1521 2 2 Lev+5FU 1 63 Male No Moderate Short 0.4477382 1.0000000 198 Matched No 3087 3 4 Lev+5FU 1 66 Female Yes Moderate Long 0.5182585 1.0000000 11 Matched Yes 293 4 6 Lev+5FU 1 57 Female No Moderate Short 0.5605443 1.0000000 63 Matched No 1767 5 7 Lev 0 77 Male No Moderate Long 0.4334800 0.3355705 227 Matched No 420 6 9 Lev 0 46 Male No Moderate Short 0.4614961 1.0067114 76 Matched No 3173 7 10 Lev+5FU 1 68 Female No Moderate Long 0.5486730 1.0000000 150 Matched No 3308 8 11 Lev 0 47 Female No Moderate Short 0.5685687 1.0067114 204 Matched No 2908 9 12 Lev+5FU 1 52 Male No Poor Long 0.5115592 1.0000000 163 Matched No 3309 10 14 Lev 0 68 Male Yes Moderate Short 0.4121929 1.0067114 139 Matched No 2910

There are 3 additional columns appended to the dataset following propensity-score matching:

distance: This is the propensity-score generated for that record. This is the probability of that patient being in the treatment group (strata), given the covariates supplied (explanatory).
weights: This is the weighting provided to each patient within the matching procedure (particularly relevant in full matching). As stated by the MatchIt authors (Ho et al.): “If one chooses options that allow matching with replacement, or any solution that has different numbers of controls (or treateds) within each subclass or strata (such as full matching), then the parametric analysis following matching must accomodate these procedures”. As such, the weights column is used within subsequent modeling to ensure that the dataset remains balanced on the key variables.
subclass: This is which subclass (aka block) the patient has been divided into based on their propensity score. Within each of the subclasses the propensity score is, at least approximately, constant. This is only present with full or sub-classification propensity score matching.

So we have gone to the effort of matching the sample to allow inference. However, before we start using the matched dataset generated, we should ensure that the PSM process has achieved its goal of creating a balanced sample on our observed variables (as that is a key determinant of the validity of any conclusions based on this data).

This can be achieved in 2 ways: both visual and quantitative methods of assessing balance.

There are several ways to visualise balance after PSM, which broadly fall into 2 categories: overall assessment and individual covariate assessment (explanatory).

The balance_plot() function provides the capability to easily generate several ggplots to allow quick assessment.

All plots only require the object produced by matchit.
All plots produced are relatively plain, but can be aesthetically modified further via ggplot2 functions.

1. Overall balance between treatment groups

This can be visualised as a density or jitter plot to allow comparison of the overall distribution of propensity-scores between the treatment groups.

finalpsm::balance_plot(output, type = "density") / finalpsm::balance_plot(output, type = "jitter")

Well balanced groups should follow approximately the same pattern (as in this case).

2. Covariate balance between treatment groups

This can be visualised as a geom_point plot with a line of best fit to allow comparison of the distribution of propensity-scores between treatment groups for the different covariates (explanatory variables).

# https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html
covariate <- finalpsm::balance_plot(output, type = "covariate")

covariate$factor / covariate$numeric

Well balanced groups should follow approximately the same pattern (as in this case).

This can also be visualized as a Love plot graphically displaying covariate balance before and after adjusting.

An absolute SMD of 0 is complete balance. This plot requires a decision regarding the threshold of what is considered “balanced” within the dataset (e.g. the accepted standardised mean difference (SMD) between the treatment and control groups for each variable). The default is an absolute SMD of 0.2, below which the variables are considered well balanced.
The arrow starts at the unmatched absolute SMD for that variable, and points to the absolute SMD following propensity score matching.

# https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html
finalpsm::balance_plot(output, type = "love", threshold = 0.2)

From the Love plot here, we can see that while some variables have become less balanced as a result of the propensity score matching process, these remain overall within our stated threshold of “good” balance (0.2). The unbalanced variable (sex) has seen a substantial improvement in balance as a result of the PSM process.

Balance Table

However, people usually like some objective numbers thrown in (even if it’s sometimes about as arbitrary as squinting at a plot).

There are several packages out there that include some measure of formal quantification of the balance in a PSM sample (“balance tables”), MatchIt included. However, these tend to be poorly formatted for readability, comparison and publication.

The cobalt package is far more sophisticated regarding assessment of balance than finalpsm, however does not work well within the intended tidyverse / finalfit workflow to produce tables formatted for publication.

finalpsm::balance_table(output, threshold = 0.2) %>% knitr::kable()

label level unm_con unm_trt unm_smd unm_balance mat_con mat_trt mat_smd mat_balance age (SD) 60.2 (11.7) 59.7 (12.3) 0.037 Yes 60.2 (12.3) 59.7 (12.3) 0.011 Yes sex Female 131 (43.7) 161 (54.0) 0.208 No 157 (52.3) 161 (54.0) 0.034 Yes Male 169 (56.3) 137 (46.0) 143 (47.7) 137 (46.0) obstruct No 241 (80.3) 244 (81.9) 0.039 Yes 257 (85.7) 244 (81.9) 0.103 Yes Yes 59 (19.7) 54 (18.1) 43 (14.3) 54 (18.1) differ Well 37 (12.3) 29 (9.7) 0.116 Yes 26 (8.7) 29 (9.7) 0.038 Yes Moderate 219 (73.0) 215 (72.1) 218 (72.7) 215 (72.1) Poor 44 (14.7) 54 (18.1) 56 (18.7) 54 (18.1) surg Short 222 (74.0) 222 (74.5) 0.011 Yes 234 (78.0) 222 (74.5) 0.082 Yes Long 78 (26.0) 76 (25.5) 66 (22.0) 76 (25.5)

As you can see with the unm_balance and mat_balance columns, the sample is already relatively well balanced in the “unmatched” sample (unm_) with in imbalance in the sex variable (more females in the treated group). However, in the propensity-score matched sample, a much improved balance has been achieved across all variables (all now below the a priori absolute standardised mean difference of 0.2).

Note: There is recognise concerns regarding the use of significant tests to determine differences in treatment groups as a method of assessing balance with propensity score matching Imai et al. (2008). As such, this is not automatically provided, but can be included if desired if p=TRUE argument is added.

kamclean/finalpsm documentation built on Oct. 3, 2023, 3:52 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com