Firstly we should ensure all variables of interest are in the correct formats for subsequent propensity-score matching. Variables should either be factors or numerical as appropriate.
We will be using the survival::colon
dataset as the basis of our
example.
data <- tibble::as_tibble(survival::colon) %>%
dplyr::filter(etype==2) %>% # Outcome of interest is death
dplyr::filter(rx!="Obs") %>% # rx will be our binary treatment variable
dplyr::select(-etype,-study, -status) %>% # Remove superfluous variables
# Convert into numeric and factor variables
dplyr::mutate_at(vars(obstruct, perfor, adhere, node4), function(x){factor(x, levels=c(0,1), labels = c("No", "Yes"))}) %>%
dplyr::mutate(rx = factor(rx),
mort365 = cut(time, breaks = c(-Inf, 365, Inf), labels = c("Yes", "No")),
sex = factor(sex, levels=c(0,1), labels = c("Female", "Male")),
differ = factor(differ, levels = c(1,2,3), labels = c("Well", "Moderate", "Poor")),
extent = factor(extent, levels = c(1,2,3, 4), labels = c("Submucosa", "Muscle", "Serosa", "Contiguous Structures")),
surg = factor(surg, levels = c(0,1), labels = c("Short", "Long"))) %>%
# Logical value for outcome (for survival analysis)
dplyr::mutate(status = mort365=="Yes")
knitr::kable(head(data, 10))
id
rx
sex
age
obstruct
perfor
adhere
nodes
differ
extent
surg
node4
time
mort365
status
1
Lev+5FU
Male
43
No
No
No
5
Moderate
Serosa
Short
Yes
1521
No
FALSE
2
Lev+5FU
Male
63
No
No
No
1
Moderate
Serosa
Short
No
3087
No
FALSE
4
Lev+5FU
Female
66
Yes
No
No
6
Moderate
Serosa
Long
Yes
293
Yes
TRUE
6
Lev+5FU
Female
57
No
No
No
9
Moderate
Serosa
Short
Yes
1767
No
FALSE
7
Lev
Male
77
No
No
No
5
Moderate
Serosa
Long
Yes
420
No
FALSE
9
Lev
Male
46
No
No
Yes
2
Moderate
Serosa
Short
No
3173
No
FALSE
10
Lev+5FU
Female
68
No
No
No
1
Moderate
Serosa
Long
No
3308
No
FALSE
11
Lev
Female
47
No
No
Yes
1
Moderate
Serosa
Short
No
2908
No
FALSE
12
Lev+5FU
Male
52
No
No
No
2
Poor
Serosa
Long
No
3309
No
FALSE
14
Lev
Male
68
Yes
No
No
3
Moderate
Serosa
Short
No
2910
No
FALSE
With the data pre-processed, we can use the finalpsm matchit()
function to create the PSM dataset. This is essentially a wrapper
function to the MatchIt matchit()
function - one of the most widely
used packages for PSM within R. However, the finalpsm matchit()
addresses what are felt to be several key issues:
Issue: The MatchIt::matchit()
does not easily fit into a tidyverse
workflow.
Solution: Inspired by the tidyverse and finalfit, the matchit()
function facilitates a modulator approach to building formulae.
Problem: The MatchIt::matchit()
will not run if there is missing
data present in the strata or explanatory variables.
Solution: The matchit()
function performs pairwise deletion only
on those the strata or explanatory variables
Problem: The MatchIt::matchit()
will not run if the strata
variable is not a numeric binary variable (0 or 1).
Solution: The matchit()
function handles the strata
variable
internally to allow either a factor or numeric variable to be used.
The lowest value or first level will be designed the control (0),
and the highest value or second level will be designated the
treatment (1).
There are 4 types of variables that can be specified in the
finalpsm::match
function:
strata
- The binary treatment variable
explanatory
- All variables that are desired to be used to
generate the propensity-score.
id
- An optional variable that allows any number of columns that
provide a method of identifying the patient to be retained (e.g. the
record or hospital id for each patient). This is not used in the
propensity-score matching process.
dependent
- An optional variable that allows any number of columns
that contain the desired outcome(s) of the patient that will be used
within the subsequent modeling. This is not used in the
propensity-score matching process.
All other inputs to the MatchIt::matchit()
are accepted (see MatchIt
documentation)
with the default method of matching being full matching.
The outputs from the finalpsm::match()
function include:
object
: This can be used as like any MatchIt::matchit()
output and so facilitates its use within existing scripts
(although this function is designed with the intention to be
used with subsequently described functions).output$object
##
## Call:
## MatchIt::matchit(formula = rx_01 ~ age + sex + obstruct + differ +
## surg, data = data_match, method = method)
##
## Sample sizes:
## Control Treated
## All 300 298
## Matched 300 298
## Discarded 0 0
data
: This is the final matched dataset. However, this is not
just the output from MatchIt::match.data()
applied to the
MatchIt::matchit()
object, but also all specified variables
under the dependent
or id
parameters. Unspecified variables
are removed.Note: If the keep_all
argument is set to TRUE, this retains
all columns in the original dataset (and so removes the need for
specifying dependent or id variables).
head(output$data, 10) %>% knitr::kable()
rowid
id
rx
rx_01
age
sex
obstruct
differ
surg
distance
weights
subclass
match
mort365
time
1
1
Lev+5FU
1
43
Male
No
Moderate
Short
0.4639307
1.0000000
1
Matched
No
1521
2
2
Lev+5FU
1
63
Male
No
Moderate
Short
0.4477382
1.0000000
198
Matched
No
3087
3
4
Lev+5FU
1
66
Female
Yes
Moderate
Long
0.5182585
1.0000000
11
Matched
Yes
293
4
6
Lev+5FU
1
57
Female
No
Moderate
Short
0.5605443
1.0000000
63
Matched
No
1767
5
7
Lev
0
77
Male
No
Moderate
Long
0.4334800
0.3355705
227
Matched
No
420
6
9
Lev
0
46
Male
No
Moderate
Short
0.4614961
1.0067114
76
Matched
No
3173
7
10
Lev+5FU
1
68
Female
No
Moderate
Long
0.5486730
1.0000000
150
Matched
No
3308
8
11
Lev
0
47
Female
No
Moderate
Short
0.5685687
1.0067114
204
Matched
No
2908
9
12
Lev+5FU
1
52
Male
No
Poor
Long
0.5115592
1.0000000
163
Matched
No
3309
10
14
Lev
0
68
Male
Yes
Moderate
Short
0.4121929
1.0067114
139
Matched
No
2910
There are 3 additional columns appended to the dataset following propensity-score matching:
distance
: This is the propensity-score generated for that record.
This is the probability of that patient being in the treatment group
(strata
), given the covariates supplied (explanatory
).
weights
: This is the weighting provided to each patient within the
matching procedure (particularly relevant in full matching). As
stated by the MatchIt authors (Ho et
al.):
“If one chooses options that allow matching with replacement, or
any solution that has different numbers of controls (or treateds)
within each subclass or strata (such as full matching), then the
parametric analysis following matching must accomodate these
procedures”. As such, the weights
column is used within
subsequent modeling to ensure that the dataset remains balanced on
the key variables.
subclass
: This is which subclass (aka block) the patient has been
divided into based on their propensity score. Within each of the
subclasses the propensity score is, at least approximately,
constant. This is only present with full or sub-classification
propensity score matching.
So we have gone to the effort of matching the sample to allow inference. However, before we start using the matched dataset generated, we should ensure that the PSM process has achieved its goal of creating a balanced sample on our observed variables (as that is a key determinant of the validity of any conclusions based on this data).
This can be achieved in 2 ways: both visual and quantitative methods of assessing balance.
There are several ways to visualise balance after PSM, which broadly
fall into 2 categories: overall assessment and individual covariate
assessment (explanatory
).
The balance_plot()
function provides the capability to easily generate
several ggplots to allow quick assessment.
All plots only require the object produced by matchit
.
All plots produced are relatively plain, but can be aesthetically modified further via ggplot2 functions.
This can be visualised as a density or jitter plot to allow comparison of the overall distribution of propensity-scores between the treatment groups.
finalpsm::balance_plot(output, type = "density") / finalpsm::balance_plot(output, type = "jitter")
Well balanced groups should follow approximately the same pattern (as in this case).
This can be visualised as a geom_point plot with a line of best fit to
allow comparison of the distribution of propensity-scores between
treatment groups for the different covariates (explanatory
variables).
# https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html
covariate <- finalpsm::balance_plot(output, type = "covariate")
covariate$factor / covariate$numeric
Well balanced groups should follow approximately the same pattern (as in this case).
This can also be visualized as a Love plot graphically displaying covariate balance before and after adjusting.
An absolute SMD of 0 is complete balance. This plot requires a decision regarding the threshold of what is considered “balanced” within the dataset (e.g. the accepted standardised mean difference (SMD) between the treatment and control groups for each variable). The default is an absolute SMD of 0.2, below which the variables are considered well balanced.
The arrow starts at the unmatched absolute SMD for that variable, and points to the absolute SMD following propensity score matching.
# https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html
finalpsm::balance_plot(output, type = "love", threshold = 0.2)
From the Love plot here, we can see that while some variables have become less balanced as a result of the propensity score matching process, these remain overall within our stated threshold of “good” balance (0.2). The unbalanced variable (sex) has seen a substantial improvement in balance as a result of the PSM process.
However, people usually like some objective numbers thrown in (even if it’s sometimes about as arbitrary as squinting at a plot).
There are several packages out there that include some measure of formal
quantification of the balance in a PSM sample (“balance tables”),
MatchIt
included. However, these tend to be poorly formatted for
readability, comparison and publication.
The cobalt
package
is far more sophisticated regarding assessment of balance than
finalpsm
, however does not work well within the intended tidyverse
/
finalfit
workflow to produce tables formatted for publication.
finalpsm::balance_table(output, threshold = 0.2) %>% knitr::kable()
label
level
unm_con
unm_trt
unm_smd
unm_balance
mat_con
mat_trt
mat_smd
mat_balance
age
(SD)
60.2 (11.7)
59.7 (12.3)
0.037
Yes
60.2 (12.3)
59.7 (12.3)
0.011
Yes
sex
Female
131 (43.7)
161 (54.0)
0.208
No
157 (52.3)
161 (54.0)
0.034
Yes
Male
169 (56.3)
137 (46.0)
143 (47.7)
137 (46.0)
obstruct
No
241 (80.3)
244 (81.9)
0.039
Yes
257 (85.7)
244 (81.9)
0.103
Yes
Yes
59 (19.7)
54 (18.1)
43 (14.3)
54 (18.1)
differ
Well
37 (12.3)
29 (9.7)
0.116
Yes
26 (8.7)
29 (9.7)
0.038
Yes
Moderate
219 (73.0)
215 (72.1)
218 (72.7)
215 (72.1)
Poor
44 (14.7)
54 (18.1)
56 (18.7)
54 (18.1)
surg
Short
222 (74.0)
222 (74.5)
0.011
Yes
234 (78.0)
222 (74.5)
0.082
Yes
Long
78 (26.0)
76 (25.5)
66 (22.0)
76 (25.5)
As you can see with the unm_balance
and mat_balance
columns, the
sample is already relatively well balanced in the “unmatched” sample
(unm_
) with in imbalance in the sex variable (more females in the
treated group). However, in the propensity-score matched sample, a much
improved balance has been achieved across all variables (all now below
the a priori absolute standardised mean difference of 0.2).
p=TRUE
argument is added.Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.