masc: Cross-validated Estimation of the Matching and Synthetic...

Description Usage Arguments Details Value References See Also Examples

View source: R/crossvalidation.R

Description

Implements the matching and synthetic control (masc) estimator of Kellogg, Mogstad, Pouliot, and Torgovitsky (2019).

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
masc(
  treated,
  donors,
  treatment = NULL,
  sc_est = sc_estimator,
  tune_pars_list = list(),
  cv_pars = list(forecast.minlength = 1, forecast.maxlength = 1),
  nogurobi = FALSE,
  alloutput = FALSE
)

Arguments

treated

A Tx1 matrix of outcomes for the treated unit.

donors

A TxN matrix of outcome paths for untreated units, each column being a control unit.

Note that currently this estimator is designed so that donors and treated should only contain the time series values for a single outcome for each unit. It does not allow for including other covariates.

treatment

An integer. The period T' in which forecasting begins. If NULL or T'>T, then we assume all data is pre-treatment.

sc_est

A function which constructs weights associated with a synthetic control-type estimator. See sc_estimator for input and output if you'd prefer to substitute your own estimator.

tune_pars_list

A list containing 4 elements. You may specify only one of min_preperiods and set_f. Those elements describe the folds we include in the cross-validation procedure. Each fold f is indexed by the last period it uses for estimation. That is, fold f will fit estimators using data from period 1 through period f, and forecast into period f+1.

m:

a vector of integers. Denotes the set of nearest neighbor estimators from which we are allowed to pick. E.g., tune_pars_list$m=c(1,3,5) would allow us to pick from 1-NN, 3-NN, or 5-NN. Alternatively, tune_pars_list$m permits a logical vector. In this case, e.g., tune_pars_list$m=c(FALSE,TRUE,TRUE) would allow us to pick from 2-NN or 3-NN. If NULL, we default to allowing all possible nearest neighbor estimators.

min_preperiods:

an integer. The smallest number of estimation periods allowed in a fold used for cross-validation. We use all folds from fold min_preperiods up to the latest possible fold treatment-2.

set_f:

a list containing a single element, a vector of integers. Identifies the set of folds used for cross-validation. As above, each integer identifies a fold by the last time period it uses in estimation. E.g., set_f=c(7,8,9) would implement cross-validation using fold 7, fold 8, and fold 9.

weights_f:

a vector of length length(set_f) or length(min_preperiods:(treatment-2)), containing weakly positive relative weight values for each of the cross-validation folds. The elements of weights_f are normalized to sum to 1

If neither min_preperiods nor set_f are specified, then we set min_preperiods to ceiling(treatment/2). In other words, we pick the first cross-validation fold so that it is estimated on the first half of the pre-period data. By default, all folds are equally weighted.

cv_pars

A list containing 2 integer elements, forecast.minlength and forecast.maxlength. Cross-validation fold f will forecast into periods f+forecast.minlength and up to period f+forecast.maxlength or the treatment period (whichever comes first). If f+forecast.minlength lies in the treatment interval for one of the folds f given by the user, then masc returns an error.

nogurobi

A logical value. If true, uses LowRankQP to solve the synthetic control estimator, rather than gurobi.

alloutput

A logical value. If true, output includes a list all.results containing full set of output associated with each candidate nearest neighbor estimator.

Details

The masc estimator takes a convex combination of a nearest neighbor estimator and a synthetic control estimator. That combination is parametrized by a model averaging parameter which takes a value of 1 when the estimator is equivalent to a pure matching estimator, and 0 when it is equivalent to a pure synthetic control estimator. This function selects the nearest neighbor estimator and model averaging parameter by a rolling-origin cross-validation procedure. Computationally, we minimize the cross-validation criterion in two steps. First, for each candidate nearest neighbor estimator, we solve for the model averaging parameter using an analytic expression (see Equation 15 of the working paper). Then, we select the candidate nearest neighbor estimator which produces the smallest cross-validation criterion value.

This implementation by default uses gurobi interfaced with R to solve for the synthetic control estimator. Gurobi and its associated R package are available on the gurobi website: https://cran.r-project.org/web/packages/prioritizr/vignettes/gurobi_installation.html

It is recommended to use this implementation of the masc estimator. However, the function may alternatively use an implementation based on LowRankQP for convenience.

Value

By default, returns a list containing five objects:

phi_hat:

selected value for the model averaging parameter (1 is pure matching, 0 pure synthetic control).

m_hat:

selected matching (nearest neighbor) estimator.

weights:

The vector length N containing weights placed on each control unit.

pred.error:

The vector of treatment effects implied by the masc counterfactual, for periods T' to T.

cv.error:

The average (weighted by weights_f) of the cross-validation errors generated by each fold.

cv.error.byfold:

The cross-validation error generated by each fold.

Additionally, returns the following object if alloutput=TRUE:

all.results:

A list containing the five above output components for each candidate matching estimator. Values for each candidate are appended as columns.

References

Kellogg, M., M. Mogstad, G. Pouliot, and A. Torgovitsky. Combining Matching and Synthetic Control to Trade off Biases from Extrapolation and Interpolation. Working Paper, 2019.

See Also

Other masc functions: cv_masc(), masc_by_phi(), sc_estimator(), solve_masc()

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
##First example: a small, arbitrary dataset.

data3 <- data.frame(t = 1:10, treated = 1:10,
control1 = c(15,12,14,22,21,28,29,30,29,31),
control2 = c(10,5,10,12,16,20,22,21,23,21),
control3 = c(-7,-11,-7,-3,-4,-9,-7,-3,-7,-11),
control4 = c(-15,-13,-14,-16,-15,-12,-13,-13,-14,-14))

#controls are above stored as differences from treated unit. Translating into levels:
data3$control1 <- data3$control1 + data3$treated
data3$control2 <- data3$control2 + data3$treated
data3$control3 <- data3$control3 + data3$treated
data3$control4 <- data3$control4 + data3$treated

#defining treatment period:
treatperiod <- 6

result<-masc(treated=data3$treated,donors=data3[,-c(1,2)],treatment=treatperiod,
            tune_pars_list=list(m=1:3,min_preperiods=3))

#an equivalent specification:
result<-masc(treated=data3$treated,donors=data3[,-c(1,2)],treatment=treatperiod,
            tune_pars_list=list(m=1:3,set_f=list(3:4)))

#Weights selected, for controls 1 through 4 respectively:
print(round(result$weights,2))

##Second example: Terrorism in the Basque Region, from
##Abadie and Gardeazabal (2003).

#First, load the Synth package, which includes the dataset:
if (requireNamespace("Synth",quietly=TRUE) & requireNamespace("data.table",quietly=TRUE)){
library(Synth)
library(data.table)
data(basque)
basque<-as.data.table(basque)
basque <- basque[regionno!=1,]
basque[,regionname:= gsub(" (.*)","",regionname)]
#Grabbing region names:
names<- c(unique(basque[regionno==17,regionname]),unique(basque[regionno!=17,regionname]))
basque <- cbind(basque[regionno==17,gdpcap],
                                            t(reshape(basque[regionno!=17,.(regionno,year,gdpcap)],
                                             idvar='regionno', timevar='year',direction='wide')[,-"regionno",with=FALSE]))


result <- masc(treated=basque[,1], donors=basque[,-1],treatment=16, tune_pars_list=list(m=1:10,
                                             min_preperiods=8))
names(result$weights)<-names[-1]

#weights on control units:
print(round(result$weights,3))

#Treatment effects of terrorism on GDP per capita
#in thousands of 1986 US dollars, over 1970-1975:
#(first 6 years of treatment)
print(result$pred.error[1:6,])

#Selected tuning parameters?
print(paste0("Selected matching estimator: ",result$m_hat))
print(paste0("Selected weight on matching: ",result$phi_hat))

#Now, examine the shape of A) the CV error (mean square prediciton error in pre-period) and
# B) average prediction error (AKA treatment effect) over the first 5 treatment years,
#both over values of phi, fixing the matching estimator (moving from matching to synthetic controls)
phis<-seq(0,1,length.out=100)
phi_table<-masc_by_phi(treated=basque[,1], donors=basque[,-1],treatment=16, tune_pars=list(m=result$m_hat,
                                             min_preperiods=8,phis=phis))
#Printing CV error and prediction error over values of phi. CV error is clearly lowest at intermediary values of phi,
#suggesting an estimator between matching and synthetic controls does best at forecasting. The average medium-run treatment
#effect is monotonically increasing as we move away from synthetic control and toward matching.
print(phi_table)
}

maxkllgg/masc documentation built on Sept. 1, 2020, 5:35 p.m.