Propensity-score Matching (PSM) with the finalpsm package

Create the matching dataset

First, ensure that all variables of interest are in the correct format for the subsequent propensity-score matching: categorical variables should be factors and continuous variables should be numeric.

We will be using the survival::colon dataset as the basis of our example.

library(dplyr) # provides the %>% pipe and vars() used below

data <- tibble::as_tibble(survival::colon) %>%

  dplyr::filter(etype==2) %>% # Outcome of interest is death
  dplyr::filter(rx!="Obs") %>%  # rx will be our binary treatment variable
  dplyr::select(-etype,-study, -status) %>% # Remove superfluous variables

  # Convert into numeric and factor variables
  dplyr::mutate_at(vars(obstruct, perfor, adhere, node4), function(x){factor(x, levels=c(0,1), labels = c("No", "Yes"))}) %>%
  dplyr::mutate(rx = factor(rx),
                mort365 = cut(time, breaks = c(-Inf, 365, Inf), labels = c("Yes", "No")),
                sex = factor(sex, levels=c(0,1), labels = c("Female", "Male")),
                differ = factor(differ, levels = c(1,2,3), labels = c("Well", "Moderate", "Poor")),
                extent = factor(extent, levels = c(1,2,3, 4), labels = c("Submucosa", "Muscle", "Serosa", "Contiguous Structures")),
                surg = factor(surg, levels = c(0,1), labels = c("Short", "Long"))) %>%

  # Logical value for outcome (for survival analysis)
  dplyr::mutate(status = mort365=="Yes")

knitr::kable(head(data, 10))

| id|rx      |sex    | age|obstruct |perfor |adhere | nodes|differ   |extent |surg  |node4 | time|mort365 |status |
|--:|:-------|:------|---:|:--------|:------|:------|-----:|:--------|:------|:-----|:-----|----:|:-------|:------|
|  1|Lev+5FU |Male   |  43|No       |No     |No     |     5|Moderate |Serosa |Short |Yes   | 1521|No      |FALSE  |
|  2|Lev+5FU |Male   |  63|No       |No     |No     |     1|Moderate |Serosa |Short |No    | 3087|No      |FALSE  |
|  4|Lev+5FU |Female |  66|Yes      |No     |No     |     6|Moderate |Serosa |Long  |Yes   |  293|Yes     |TRUE   |
|  6|Lev+5FU |Female |  57|No       |No     |No     |     9|Moderate |Serosa |Short |Yes   | 1767|No      |FALSE  |
|  7|Lev     |Male   |  77|No       |No     |No     |     5|Moderate |Serosa |Long  |Yes   |  420|No      |FALSE  |
|  9|Lev     |Male   |  46|No       |No     |Yes    |     2|Moderate |Serosa |Short |No    | 3173|No      |FALSE  |
| 10|Lev+5FU |Female |  68|No       |No     |No     |     1|Moderate |Serosa |Long  |No    | 3308|No      |FALSE  |
| 11|Lev     |Female |  47|No       |No     |Yes    |     1|Moderate |Serosa |Short |No    | 2908|No      |FALSE  |
| 12|Lev+5FU |Male   |  52|No       |No     |No     |     2|Poor     |Serosa |Long  |No    | 3309|No      |FALSE  |
| 14|Lev     |Male   |  68|Yes      |No     |No     |     3|Moderate |Serosa |Short |No    | 2910|No      |FALSE  |
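
Before matching, it can be worth confirming that every column really is a factor or a numeric. The following is a minimal base R sketch of such a check; the `toy` data frame and the `not_ready` name are made up for illustration and are not part of finalpsm.

```r
# Toy data standing in for the pre-processed dataset (illustrative values only)
toy <- data.frame(
  age = c(43, 63, 66),
  sex = factor(c("Male", "Male", "Female")),
  rx  = c("Lev+5FU", "Lev+5FU", "Lev"), # deliberately left as character
  stringsAsFactors = FALSE
)

# TRUE for columns that are already a factor or numeric
ok <- vapply(toy, function(x) is.factor(x) || is.numeric(x), logical(1))

# Names of the columns that still need converting
not_ready <- names(ok)[!ok]
print(not_ready)
```

Any column reported here (in this toy case, the character column `rx`) would need converting, for example with `factor()`, before matching.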

Propensity-score matching with matchit

With the data pre-processed, we can use finalpsm::match() to create the PSM dataset. This is essentially a wrapper around MatchIt::matchit(), MatchIt being one of the most widely used packages for PSM within R. However, finalpsm::match() addresses what are felt to be several key issues:

There are 4 types of variables that can be specified in the finalpsm::match function:

All other arguments of MatchIt::matchit() are accepted (see the MatchIt documentation); the default matching method is full matching.
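
For orientation, the underlying MatchIt call that such a wrapper passes through to might look roughly like the sketch below. This is not the finalpsm implementation: the data preparation, the `d`, `m` and `matched` names, and the choice to drop incomplete cases are assumptions made for illustration. Full matching in MatchIt relies on the optmatch package, so the sketch only runs when both packages are installed.

```r
if (requireNamespace("MatchIt", quietly = TRUE) &&
    requireNamespace("optmatch", quietly = TRUE)) {

  # Same cohort as above: death as the outcome, two active treatment arms
  d <- subset(survival::colon, etype == 2 & rx != "Obs")
  d$rx_01 <- as.integer(d$rx == "Lev+5FU") # 1 = treated, 0 = control

  # Treat the categorical covariates as factors
  for (v in c("sex", "obstruct", "differ", "surg")) d[[v]] <- factor(d[[v]])

  # matchit() does not accept missing covariate values, so drop them (assumption)
  covariates <- c("age", "sex", "obstruct", "differ", "surg")
  d <- d[complete.cases(d[, covariates]), ]

  # Full matching on a propensity score estimated from the covariates
  m <- MatchIt::matchit(rx_01 ~ age + sex + obstruct + differ + surg,
                        data = d, method = "full")

  # match.data() returns the data with distance, weights and subclass columns
  matched <- MatchIt::match.data(m)
  print(head(matched[, c("id", "rx_01", "distance", "weights", "subclass")]))
}
```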

The outputs from the finalpsm::match() function include:

output$object

## 
## Call: 
## MatchIt::matchit(formula = rx_01 ~ age + sex + obstruct + differ + 
##     surg, data = data_match, method = method)
## 
## Sample sizes:
##           Control Treated
## All           300     298
## Matched       300     298
## Discarded       0       0
head(output$data, 10) %>% knitr::kable()

| rowid| id|rx      | rx_01| age|sex    |obstruct |differ   |surg  |  distance|   weights|subclass |match   |mort365 | time|
|-----:|--:|:-------|-----:|---:|:------|:--------|:--------|:-----|---------:|---------:|:--------|:-------|:-------|----:|
|     1|  1|Lev+5FU |     1|  43|Male   |No       |Moderate |Short | 0.4639307| 1.0000000|1        |Matched |No      | 1521|
|     2|  2|Lev+5FU |     1|  63|Male   |No       |Moderate |Short | 0.4477382| 1.0000000|198      |Matched |No      | 3087|
|     3|  4|Lev+5FU |     1|  66|Female |Yes      |Moderate |Long  | 0.5182585| 1.0000000|11       |Matched |Yes     |  293|
|     4|  6|Lev+5FU |     1|  57|Female |No       |Moderate |Short | 0.5605443| 1.0000000|63       |Matched |No      | 1767|
|     5|  7|Lev     |     0|  77|Male   |No       |Moderate |Long  | 0.4334800| 0.3355705|227      |Matched |No      |  420|
|     6|  9|Lev     |     0|  46|Male   |No       |Moderate |Short | 0.4614961| 1.0067114|76       |Matched |No      | 3173|
|     7| 10|Lev+5FU |     1|  68|Female |No       |Moderate |Long  | 0.5486730| 1.0000000|150      |Matched |No      | 3308|
|     8| 11|Lev     |     0|  47|Female |No       |Moderate |Short | 0.5685687| 1.0067114|204      |Matched |No      | 2908|
|     9| 12|Lev+5FU |     1|  52|Male   |No       |Poor     |Long  | 0.5115592| 1.0000000|163      |Matched |No      | 3309|
|    10| 14|Lev     |     0|  68|Male   |Yes      |Moderate |Short | 0.4121929| 1.0067114|139      |Matched |No      | 2910|
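
The distance column in the table above holds each patient's estimated propensity score. The matchit() call printed earlier does not set a distance argument, and MatchIt's default there is a logistic regression of treatment on the covariates. A minimal base R sketch of that idea follows; the variable names and the handling of missing values are assumptions, so the values need not reproduce the table exactly.

```r
# Same cohort as above: death as the outcome, two active treatment arms
ps_data <- subset(survival::colon, etype == 2 & rx != "Obs")
ps_data$rx_01 <- as.integer(ps_data$rx == "Lev+5FU") # 1 = treated, 0 = control

# Treat the categorical covariates as factors
for (v in c("sex", "obstruct", "differ", "surg")) {
  ps_data[[v]] <- factor(ps_data[[v]])
}

# Keep complete cases only (assumption: incomplete rows are dropped)
covariates <- c("age", "sex", "obstruct", "differ", "surg")
ps_data <- ps_data[complete.cases(ps_data[, covariates]), ]

# Logistic regression of treatment on the covariates
ps_model <- glm(rx_01 ~ age + sex + obstruct + differ + surg,
                data = ps_data, family = binomial())

# The fitted probability of treatment is the propensity score
ps_data$distance <- fitted(ps_model)
summary(ps_data$distance)
```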

There are 3 additional columns appended to the dataset following propensity-score matching:

Assessment of Balance

We have gone to the effort of matching the sample to allow inference. However, before we start using the matched dataset, we should check that the PSM process has achieved its goal of balancing the sample on our observed variables, as balance is a key determinant of the validity of any conclusions drawn from these data.

Balance can be assessed in 2 ways: visually and quantitatively.

Balance Visualisation

There are several ways to visualise balance after PSM, which broadly fall into 2 categories: overall assessment and assessment of the individual covariates (explanatory variables).

The balance_plot() function provides the capability to easily generate several ggplots to allow quick assessment.

1. Overall balance between treatment groups

This can be visualised as a density or jitter plot to allow comparison of the overall distribution of propensity-scores between the treatment groups.

library(patchwork) # provides the / operator used to stack ggplots
finalpsm::balance_plot(output, type = "density") / finalpsm::balance_plot(output, type = "jitter")

Well-balanced groups should follow approximately the same pattern (as in this case).
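
To make the idea of comparing propensity-score distributions concrete, here is a small base R sketch that overlays the two group densities. It is not how balance_plot() is implemented, and the scores are simulated purely for illustration (the group sizes mirror the sample above, but the values are invented).

```r
set.seed(1)

# Simulated propensity scores (illustrative only, not the colon data)
ps_control <- rbeta(300, 5, 5)
ps_treated <- rbeta(298, 5.5, 5)

# Kernel density estimates on the same 0-1 scale
d_con <- density(ps_control, from = 0, to = 1)
d_trt <- density(ps_treated, from = 0, to = 1)

# Overlay the curves: heavily overlapping curves suggest good overall balance
plot(d_con, main = "Propensity score by treatment group",
     xlab = "Propensity score", ylim = range(0, d_con$y, d_trt$y))
lines(d_trt, lty = 2)
legend("topright", legend = c("Control", "Treated"), lty = c(1, 2))
```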

2. Covariate balance between treatment groups

This can be visualised as a scatter plot (geom_point) with a line of best fit, allowing comparison of the distribution of propensity scores between the treatment groups for each covariate (explanatory variable).

# https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html
covariate <- finalpsm::balance_plot(output, type = "covariate")

covariate$factor / covariate$numeric

Well-balanced groups should follow approximately the same pattern (as in this case).

This can also be visualised as a Love plot, which graphically displays covariate balance before and after adjustment.

# https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html
finalpsm::balance_plot(output, type = "love", threshold = 0.2)

From the Love plot here, we can see that while some variables have become less balanced as a result of the propensity-score matching process, they all remain within our stated threshold of “good” balance (0.2). The one variable that was unbalanced before matching (sex) has seen a substantial improvement in balance as a result of the PSM process.

Balance Table

However, people usually like some objective numbers thrown in (even if it’s sometimes about as arbitrary as squinting at a plot).

There are several packages out there that include some measure of formal quantification of the balance in a PSM sample (“balance tables”), MatchIt included. However, these tend to be poorly formatted for readability, comparison and publication.

The cobalt package is far more sophisticated than finalpsm in its assessment of balance; however, it does not work well within the intended tidyverse / finalfit workflow for producing tables formatted for publication.

finalpsm::balance_table(output, threshold = 0.2) %>% knitr::kable()

|label    |level    |unm_con     |unm_trt     |unm_smd |unm_balance |mat_con     |mat_trt     |mat_smd |mat_balance |
|:--------|:--------|:-----------|:-----------|:-------|:-----------|:-----------|:-----------|:-------|:-----------|
|age      |(SD)     |60.2 (11.7) |59.7 (12.3) |0.037   |Yes         |60.2 (12.3) |59.7 (12.3) |0.011   |Yes         |
|sex      |Female   |131 (43.7)  |161 (54.0)  |0.208   |No          |157 (52.3)  |161 (54.0)  |0.034   |Yes         |
|         |Male     |169 (56.3)  |137 (46.0)  |        |            |143 (47.7)  |137 (46.0)  |        |            |
|obstruct |No       |241 (80.3)  |244 (81.9)  |0.039   |Yes         |257 (85.7)  |244 (81.9)  |0.103   |Yes         |
|         |Yes      |59 (19.7)   |54 (18.1)   |        |            |43 (14.3)   |54 (18.1)   |        |            |
|differ   |Well     |37 (12.3)   |29 (9.7)    |0.116   |Yes         |26 (8.7)    |29 (9.7)    |0.038   |Yes         |
|         |Moderate |219 (73.0)  |215 (72.1)  |        |            |218 (72.7)  |215 (72.1)  |        |            |
|         |Poor     |44 (14.7)   |54 (18.1)   |        |            |56 (18.7)   |54 (18.1)   |        |            |
|surg     |Short    |222 (74.0)  |222 (74.5)  |0.011   |Yes         |234 (78.0)  |222 (74.5)  |0.082   |Yes         |
|         |Long     |78 (26.0)   |76 (25.5)   |        |            |66 (22.0)   |76 (25.5)   |        |            |

As you can see from the unm_balance and mat_balance columns, the “unmatched” sample (unm_) is already relatively well balanced, with an imbalance only in the sex variable (more females in the treated group). In the propensity-score matched sample (mat_), balance has been achieved across all variables: every absolute standardised mean difference is now below the a priori threshold of 0.2.
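
As a rough illustration of where such numbers come from, the sketch below computes an absolute standardised mean difference for a binary covariate using the common pooled-variance formula, then applies the 0.2 threshold. The `smd_binary()` helper is hypothetical and not a finalpsm function, and finalpsm may calculate its values differently (for example, by applying the matching weights in the matched sample). The counts are the unmatched female counts from the table above.

```r
# Absolute SMD for a binary covariate, pooled-variance form:
# |p_trt - p_con| / sqrt((var_trt + var_con) / 2)
smd_binary <- function(n_event_trt, n_trt, n_event_con, n_con) {
  p_trt <- n_event_trt / n_trt
  p_con <- n_event_con / n_con
  pooled_sd <- sqrt((p_trt * (1 - p_trt) + p_con * (1 - p_con)) / 2)
  abs(p_trt - p_con) / pooled_sd
}

# Unmatched sample: 161 of 298 treated and 131 of 300 controls are female
smd_sex <- smd_binary(n_event_trt = 161, n_trt = 298,
                      n_event_con = 131, n_con = 300)
round(smd_sex, 3)

# Flag balance against the a priori threshold
threshold <- 0.2
balanced <- smd_sex < threshold
balanced
```

With these counts the result rounds to 0.208, in line with the unm_smd reported for sex, and the variable is flagged as not balanced.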



kamclean/finalpsm documentation built on Oct. 3, 2023, 3:52 a.m.