knitr::opts_chunk$set(collapse = TRUE, prompt=TRUE)
All the software used in this worksheet is freely available. The R statistical package is installed for you in the lab, but you may download and install R for Windows, Mac, and Linux systems from: http://www.r-project.org.
The following document walks through a common propensity score
matching work-flow in R. Example R code will appear with a >
indicating the command prompt. You may type this code yourself ---
each line is a command to R. Output will follow prefaced by
##
. (In R, #
represents a comment; any command preceded by any
number of #
's will not be executed.) For example:
2 + 2
(Note that when entering the code yourself, do not include the >
in
your command. Also, for longer lines of code in this document, the
text may wrap onto a second line, with the second line preceded by a
+
sign. When entering the code yourself, you do not have to wrap the
lines, and do not include the +
.)
R stores data in named variables using the arrow operator:
my.variable <- 2 + 2 my.variable * 3
R add-on packages are available to install directly from R:
install.packages("optmatch") install.packages("RItools")
These commands will ask you to select a CRAN server. Any server will
do. You may also be asked whether you'd like to set up a "personal
library to install packages into"; if so, answer yes. (The default
personal library location that R will suggest should be OK.) You'll
only need to run these commands the first time you want to use
optmatch
or RItools
on a particular computer, or when you install
a new version of R.
You'll also need to enable the mlm
function. It is in an
optmatchExperimental branch, which as the name implies, exists for
testing out new code that will eventually migrate into the main
optmatch
branch. The following two commands will load in a working copy of mlm
:
install.packages("devtools") devtools::install_github(repo = "markmfredrickson/optmatchExperimental")
Attach extension packages that we'll be using for matching and associated diagnostics:
library(optmatch) library(RItools) devtools::load_all()
library(optmatch) library(RItools) library(optmatchExperimental)
You'll do this each time you start a new R session and want to run matching commands.
To load the nuclear plants data, enter
data(nuclearplants)
data(nuclearplants)
To see the first six rows:
head(nuclearplants)
For more on the variables here, enter
help("nuclearplants")
You can directly access a variable within this data frame as follows. (Try typing in the commands to see what they do.)
nuclearplants$pt table(nuclearplants$pt) with(nuclearplants, table(pt))
The variable you will have just viewed and tabulated, pt
, is a dummy
for whether the plant was built with "partial turnkey guarantees."
These plants were not comparable to the others in terms of
construction costs. Let's exclude them for the time being, for
simplicity. To do this we'll create a data table (in R jargon, a
"data frame") of just those observations for which pt
is 0:
nuke.nopt <- subset(nuclearplants, pt == 0)
To inspect its first six or last six entries, do
head(nuke.nopt) tail(nuke.nopt)
To view this as presenting a matching problem, we'll think of plants
built on the site of a previously existing plant (pr == 1
) as the
treatment group and plants on new sites (pr == 0
) as comparisons.
To check the number of treated and control plants:
table(nuke.nopt$pr)
To get the pair match minimizing the mean paired distance on cap
,
among all collections of 7 non-overlapping pairs, do
pairmatch(pr ~ cap, data = nuke.nopt)
For a more readable report of who gets matched to whom, type
print(pairmatch(pr ~ cap, data = nuke.nopt), grouped = TRUE)
For matching on both date
and cap
, you'd type pairmatch(pr ~ cap
+ date, ...)
instead of pairmatch(pr ~ cap, ...)
. We'll talk later
about how this combines discrepancies on the two variables. For now,
note the form of the output this command generates: a variable of the
same length as the variables making up nuke.nopt
, assigning a
distinct name to each matched set. To fix your intuition, you might
try connecting up below the units that pairmatch()
has placed in the
same matched sets.
library(pander) a <- with(nuke.nopt, data.frame( Plant=row.names(nuke.nopt), Date=round(date-65, 1), Capacity=round(x=(cap-400),digits=-1))[as.logical(pr),]) b <- with(nuke.nopt, data.frame( Plant=row.names(nuke.nopt), Date=round(date-65, 1), Capacity=round(x=(cap-400),digits=-1))[!as.logical(pr),]) c <- cbind(data.frame(rbind(as.matrix(a), matrix(nrow=nrow(b)-nrow(a), ncol=3))), b) rownames(c) <- NULL pandoc.table(c, style="multiline", missing="", caption='New-site (left columns) versus existing-site (right columns) plants. "date" is `date-65`; "capacity" is `cap-400`.')
For basic summary information about this match, try
summary(pairmatch(pr ~ cap, data = nuke.nopt))
If you've already typed in the pairmatch(...)
part, you can use the
up-arrow, Home and End keys to avoid having to re-type. Alternatively,
to assign the name "pm
" to the matching result, do
pm <- pairmatch(pr ~ cap, data = nuke.nopt)
Now, you can just type print(pm, grouped = TRUE)
or summary(pm)
.
The following would give a basic matched analysis of the effect of new
or existing site on construction costs is given with the help of R's
linear modeling function. In effect, the existing site effect is
estimated as one "way" in a two-way \textsc{anova}, the other "way"
being the factor variable that represents the matching result,
i.e. pm
.
summary(lm(cost ~ pr + pm, data = nuke.nopt))
There are other types of matches you might want to try. Here's how to create matched triples (each treatment group unit is matched to two control group units):
tm <- pairmatch(pr ~ cap, controls = 2, data = nuke.nopt)
There will be further variations suggested on the slides.
It's possible to give the software an impossible list of requirements for a match. For instance, try running the following:
pairmatch(pr ~ cap, controls = 3, data=nuke.nopt)
The problem here is that the data don't have 3 comparison units to go with each treatment unit, since we have 7 treatment units but only 19 comparison units.
Matching can also fail because the distance matrix embodies matching constraints that are impossible to meet. In these cases the matching function will generally run without complaint, although it won't create any matches. Here is an example, where the caliper is so narrow as to forbid all possible matches:
pairmatch(pr ~ cap + cost, caliper=.001, data = nuke.nopt)
Behind the scenes, the caliper
argument restricts how the maximum
distance between matched objects. For example, consider Table 1
above. Plants A and H are 1.3 units apart in date
. If we assigned
caliper=1
, they could never be matched because they exceed the
caliper limit.
If before matching you want to remove just the subjects lacking a
counterpart within caliper distance, you can do pairmatch(...,
remove.unmatchables = TRUE)
. That won't help with the minuscule
caliper above, but with less extreme calipers it helps you salvage a
few matches.
Getting back to a matching that succeeded, note that summary()
reports information about how close the matches are.
summary(pm)
Comparing overt biases before and after matching. An assessment of
the unmatched difference between the groups on cap
can be had via:
cap.noadj <- lm(cap ~ pr, data = nuke.nopt) summary(cap.noadj)
The output is suppressed, as most it is not relevant to balance. This variation hones in on the part that is:
summary(lm(cap ~ pr, data = nuke.nopt))$coeff["pr",]
(Note again the use of square brackets, [
and ]
, for specifying
subsets of a matrix. With R one has to carefully distinguish square
brackets, curly brackets and parentheses.)
Here is a parallel calculation that takes the match pm
into account.
summary(lm(cap ~ pr + pm, data = nuke.nopt))$coeff["pr",]
The RItools package's xBalance
function zeroes in on balance, and
facilitiates checking balance on multiple variables at the same time. Here are some examples:
xBalance(pr ~ cap + t2, report="all", data=nuke.nopt) xBalance(pr ~ cap + t2 + strata(pm),##, strata=data.frame("none"="-", "pm"=pm), data=nuke.nopt, report=c("adj.mean.diffs", "std", "z"))
Exercises.
pm
, tm
and the unmatched samples in terms of balance on
t2
.pm
, tm
and the unmatched samples in terms of balance on
date
.pm
to Mahalanobis pair matching on t1
in terms of
balance on date
.cap
and date
to
Mahalanobis pair matching on cap
, date
and each of t1,t2
. Add
the last two variables in one at a time, so that you're comparing a
total of three matches. Compare on balance in cap
and t2
.mlm
for post-matching analysisLet's attempt to fit a regression model that accounts for the matched structure. We saw earlier that we can do a basic matched analysis by adding a fixed effect per matched set.
summary(lm(cost ~ t1 + pm, data = nuke.nopt))
However, as the number of matched sets grows, this can become
unwieldy. The mlm
function addresses this concern by instead
accounting for the matched structures internally. The syntax mirrors
the syntax for lm
, and has the additional requirement that one
object on the right-hand side of the formula is an optmatch
object
(in this case, pm
).
mlm1 <- mlm(cost ~ t1 + pm, data = nuke.nopt) summary(mlm1)
You'll notice that there is no coefficient estimated for pm
.
The standard errors associated with the coefficients in mlm1
do not
account for the matched structure either. We can use sandwich
estimators to properly compute them. Sandwich-style estimates of
standard errors are more robust to model misspecification, but are
less efficient than traditional standard error estimates.
(The sandwich
package should be installed on R by default, if it
appears to not be, you may need to install it using
install.packages(sandwich)
.)
library(sandwich) sandwich(mlm1)
Keep in mind that sandwich
returns variance estimates, so we need to
take the squareroot of the diagonal components to get the standard
error estimates.
sqrt(diag(sandwich(mlm1)))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.