exclude: true class: center, middle
class: inverse, middle, center, hide-logo
library(RefManageR)
BibOptions(check.entries = FALSE, bib.style = "authoryear", cite.style = "authoryear", style = "markdown", hyperlink = FALSE, dashed = FALSE)
myBib <- ReadBib("./biblio.bibtex", check = FALSE)
Goal: quantify the effect of treatment $T \in \{0, 1\}$ on outcome $Y$
`r NoCite(myBib, 'DAME', 'FLAME')`
--
Two potential outcomes for each unit: $\{Y_i(1), Y_i(0)\}$, denoting the response under treatment and under control
--
--
--
In observational data, covariates $\mathbf{X} \in \mathbb{R}^p$ might be confounders
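A minimal simulated sketch of why confounding matters, assuming a single binary confounder `x` that raises both the chance of treatment and the outcome. All names and numbers are illustrative and are not taken from the natality data:

```r
set.seed(42)
n <- 10000
x <- rbinom(n, 1, 0.5)                              # binary confounder
treated <- rbinom(n, 1, ifelse(x == 1, 0.8, 0.2))   # x makes treatment more likely
y <- 2 * treated + 5 * x + rnorm(n)                 # true treatment effect is 2

# Naive comparison of group means absorbs the effect of x
naive <- mean(y[treated == 1]) - mean(y[treated == 0])

# Comparing within levels of x (exact matching on x) removes that bias
within_x <- sapply(0:1, function(v) {
  mean(y[treated == 1 & x == v]) - mean(y[treated == 0 & x == v])
})
adjusted <- mean(within_x)
```

In this setup `naive` lands near 5, while `adjusted` is close to the true effect of 2.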
??? Can't naively compare the average outcome of control units to estimate a control counterfactual
One approach to estimating counterfactuals under confounding: matching
--
If, for treated unit $i$, there existed control unit $k$ such that $\mathbf{x}_k = \mathbf{x}_i$, then:
--
- $Y_k$ (observed) is a good estimate of $Y_i(0)$ (unobserved)
--
But exact matches are unlikely in high dimensional settings
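A small simulation of this point, assuming independent binary covariates and random treatment assignment (both simplifications chosen only for illustration). It reports the share of treated units that have at least one control with identical covariates:

```r
set.seed(1)

# Share of treated units with an exact control match, for n units and p binary covariates
match_rate <- function(n, p) {
  X <- matrix(rbinom(n * p, 1, 0.5), nrow = n)
  treated <- rbinom(n, 1, 0.5)
  key <- apply(X, 1, paste, collapse = "")    # one string per covariate profile
  control_keys <- unique(key[treated == 0])
  mean(key[treated == 1] %in% control_keys)
}

rates <- sapply(c(2, 5, 10, 20), function(p) match_rate(500, p))
```

With 2 covariates essentially every treated unit finds an exact match; with 20 covariates almost none do.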
--
Given a unit $i$, covariate weights $\mathbf{w}$, and a covariate selection vector $\boldsymbol{\theta}$, define the AME problem:
$$\overbrace{\text{argmax}_{\boldsymbol{\theta} \in \{0, 1\}^p}\;\boldsymbol{\theta}^T\mathbf{w}}^{\text{most important covariate set}}\quad\text{s.t.}\quad \exists k\;\:\text{with}\;\: \underbrace{\mathbf{x}_{k} \circ \boldsymbol{\theta} = \mathbf{x}_{i} \circ \boldsymbol{\theta}}_{\text{exact matching on }\boldsymbol{\theta}} \;\:\text{and}\;\: \color{blue}{\underbrace{T_{k} = 1 - T_i}_{\text{opposite treatment}}}$$
Given a unit $i$, covariate weights $\mathbf{w}$, and a covariate selection vector $\boldsymbol{\theta}$, define the AME problem:
$$\color{blue}{\overbrace{\text{argmax}_{\boldsymbol{\theta} \in \{0, 1\}^p}\;\boldsymbol{\theta}^T\mathbf{w}}^{\text{most important covariate set}}}\quad\text{s.t.}\quad \exists k\;\:\text{with}\;\: \underbrace{\mathbf{x}_{k} \circ \boldsymbol{\theta} = \mathbf{x}_{i} \circ \boldsymbol{\theta}}_{\text{exact matching on }\boldsymbol{\theta}} \;\:\text{and}\;\: \underbrace{T_{k} = 1 - T_i}_{\text{opposite treatment}}$$
--
Implicitly defines a distance metric that:
--
2. Matches exactly when possible
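To make the AME problem concrete, here is a brute-force sketch for a single unit. It enumerates every $\boldsymbol{\theta} \in \{0, 1\}^p$, so it is only feasible for small $p$, and it is not how the package solves the problem. The function name `solve_ame` and the toy data are hypothetical:

```r
# Brute-force AME for unit i: best theta (by theta' w) that admits a match
# with at least one unit of the opposite treatment
solve_ame <- function(i, X, treated, w) {
  p <- ncol(X)
  thetas <- as.matrix(expand.grid(rep(list(0:1), p)))  # all theta in {0, 1}^p
  opposite <- which(treated == 1 - treated[i])
  best <- NULL
  best_val <- -Inf
  for (r in seq_len(nrow(thetas))) {
    theta <- thetas[r, ]
    cols <- which(theta == 1)
    # Opposite-treatment units that agree with unit i on the selected covariates
    agree <- vapply(opposite, function(k) all(X[k, cols] == X[i, cols]), logical(1))
    val <- sum(theta * w)
    if (any(agree) && val > best_val) {
      best_val <- val
      best <- list(theta = unname(theta), matches = opposite[agree])
    }
  }
  best
}

X <- rbind(c(1, 0, 1),   # unit 1 (treated)
           c(1, 0, 0),   # unit 2 (control): agrees with unit 1 on covariates 1 and 2
           c(0, 0, 1))   # unit 3 (control): agrees with unit 1 on covariates 2 and 3
treated <- c(1, 0, 0)
w <- c(3, 2, 1)          # illustrative covariate weights

res <- solve_ame(1, X, treated, w)
res$theta     # selects covariates 1 and 2
res$matches   # unit 2
```

Unit 1 has no exact match on all three covariates, so the solution matches it to unit 2 on the two most heavily weighted covariates rather than to unit 3.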
--
Iterate over covariate sets, starting with more important ones
-- exclude: true In practice, don't have $\mathbf{w}$; run ML algorithm on separate holdout set
Compute Predictive Error ( $\mathtt{PE}$ ): error in using a covariate set to predict the outcome
Determines next covariate set to match on
Learning a distance metric
??? We're going to try to solve the AME problem for each unit. In practice, we pick a theta, starting with a theta of all 1s, which corresponds to exact matching (the best we can possibly do), and match all possible units. Then we choose another theta and match those units. Because in practice we don't have fixed covariate weights, for each of these thetas, ..
--
--
Solves the AME problem exactly for each unit
--
Efficient solution via downward closure property
--
--
Approximates the exact solution via backwards stepwise selection.
--
At each iteration, eliminate an entire covariate
??? Given all this background, it's now very natural and easy to explain two of our methods
In practice, don't have $\mathbf{w}$; run ML algorithm on separate holdout set
--
Compute Predictive Error ( $\mathtt{PE}$ ): error using covariate set to predict outcome
--
Determines next covariate set to match on
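A rough sketch of how $\mathtt{PE}$ can pick the next covariate set, assuming `lm` as a stand-in for the ML algorithm and in-sample mean squared error, summed over treatment groups, as the error measure. The helper `compute_PE` and the simulated holdout set are hypothetical:

```r
set.seed(7)
n <- 400
holdout <- data.frame(x1 = rbinom(n, 1, 0.5),
                      x2 = rbinom(n, 1, 0.5),
                      x3 = rbinom(n, 1, 0.5),
                      treated = rbinom(n, 1, 0.5))
# Illustrative outcome: x1 matters a lot, x2 a little, x3 not at all
holdout$y <- 5 * holdout$x1 + 1 * holdout$x2 + 2 * holdout$treated + rnorm(n)

# PE of a covariate set: MSE of one lm fit per treatment group, summed
compute_PE <- function(covs, data) {
  sum(sapply(0:1, function(g) {
    d <- data[data$treated == g, c(covs, "y"), drop = FALSE]
    mean(residuals(lm(y ~ ., data = d))^2)
  }))
}

covs <- c("x1", "x2", "x3")
# PE of the set obtained by dropping each covariate in turn
PE_if_dropped <- sapply(covs, function(v) compute_PE(setdiff(covs, v), holdout))
drop_next <- names(which.min(PE_if_dropped))   # covariate whose removal hurts prediction least
```

Here `drop_next` is `"x3"`, the covariate that carries no information about the outcome, so the next round of matching would use `x1` and `x2`.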
exclude: true
Oftentimes don't have a priori measures of covariate importance
-- exclude: true At every iteration, run ML algorithm on separate holdout set to model how well a covariate set predicts the outcome
-- exclude: true The Predictive Error ( $\mathtt{PE}$ ) measures the error in doing so and determines what covariate set next to match on.
exclude: true
class: inverse, middle, center, hide-logo
FLAME
`FLAME` and `DAME` are the workhorses of the package
Match input data under a wide variety of specifications
Efficient bit-vector routine for making matches
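A sketch in the spirit of the bit-vector idea, assuming binary covariates. It is an illustration, not the package's internal routine. Each unit's covariate profile is encoded as one integer, and a unit is matched when some unit with the same profile has the opposite treatment:

```r
X <- rbind(c(1, 0), c(1, 0), c(0, 1), c(0, 1), c(1, 1))
treated <- c(1, 0, 1, 1, 0)

key <- as.vector(X %*% 2^(seq_len(ncol(X)) - 1))   # covariate profile as an integer
key_plus <- 2 * key + treated                        # profile plus treatment indicator

ones <- rep(1, length(key))
n_key <- ave(ones, key, FUN = sum)             # units sharing the covariate profile
n_key_plus <- ave(ones, key_plus, FUN = sum)   # units sharing profile and treatment

# The counts differ exactly when the profile contains both treated and control units
matched <- n_key != n_key_plus
```

Units 1 and 2 are matched; units 3 and 4 share covariates but are both treated, and unit 5 has a unique profile, so none of them is matched. The appeal of this approach is that it needs only vectorized arithmetic and group counts, with no pairwise comparisons.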
Return S3 objects of class `ame` with `print`, `plot`, and `summary` methods
CRAN
install.packages('FLAME')
GitHub
library(devtools)
install_github('https://github.com/vittorioorlandi/FLAME')
# Or (mirror of the above)
install_github('https://github.com/almost-matching-exactly/R-FLAME')
library(FLAME)
natality_out <- readRDS('../natality/natality_out_500k_lm.rds')
US 2010 Natality Data `r Citep(myBib, 'natality2010')`.
Data on neonatal health outcomes in Neonatal Intensive Care Unit (NICU)
Effect of "extreme smoking" ( $\geq 10$ cigarettes a day during pregnancy) on birth weight `r Citep(myBib, 'kondracki2020')`.
Subset of ~500k observations with 16 covariates including sex of infant, races of parents, previous Cesarean deliveries, and others.
`missing_data`: how missing values in `data` to be matched are handled
--
- `drop`: effectively drop units with missingness from the data
--
- `impute`: impute missing values and match on complete dataset
--
- `keep`: keep missing values but do not match on them
--
`missing_holdout` is analogous, with `drop` and `impute` options
Two implemented options for computing $\mathtt{PE}$:
- `glmnet::cv.glmnet` with 5-fold cross-validation (default)
- `xgboost::xgb.cv` with 5-fold cross-validation
Supply your own function:
my_PE_lm <- function(X, Y) {
  df <- as.data.frame(cbind(X, Y = Y))
  return(lm(Y ~ ., df)$fitted.values)
}
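A quick sanity check for a user-supplied function, assuming it receives covariates `X` and an outcome vector `Y` and should return one in-sample prediction per row. The toy data `X_toy` and `Y_toy` are made up for illustration:

```r
my_PE_lm <- function(X, Y) {
  df <- as.data.frame(cbind(X, Y = Y))
  return(lm(Y ~ ., df)$fitted.values)
}

set.seed(3)
X_toy <- data.frame(a = rbinom(50, 1, 0.5), b = rbinom(50, 1, 0.5))
Y_toy <- 1 + 2 * X_toy$a + rnorm(50)

preds <- my_PE_lm(X_toy, Y_toy)
length(preds) == nrow(X_toy)   # one prediction per unit
mean((Y_toy - preds)^2)        # the kind of error a PE computation would summarize
```

Checking the output length on a small data frame like this is a cheap way to catch a function that returns a model object or out-of-sample predictions by mistake.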
Full call to use FLAME to match natality data:
my_PE_lm <- function(X, Y) {
  df <- as.data.frame(cbind(X, Y = Y))
  return(lm(Y ~ ., df)$fitted.values)
}
natality_out <- FLAME(data = natality, holdout = 0.25,
                      replace = FALSE,
                      treated_column_name = 'smokes10',
                      outcome_column_name = 'dbwt',
                      missing_data = 'drop', missing_holdout = 'drop',
                      PE_method = my_PE_lm,
                      estimate_CATEs = TRUE)
Full call to use FLAME to match natality data:
my_PE_lm <- function(X, Y) {
  df <- as.data.frame(cbind(X, Y = Y))
  return(lm(Y ~ ., df)$fitted.values)
}
natality_out <- FLAME(data = natality, holdout = 0.25,
                      replace = FALSE, #<<
                      treated_column_name = 'smokes10',
                      outcome_column_name = 'dbwt',
                      missing_data = 'drop', missing_holdout = 'drop',
                      PE_method = my_PE_lm,
                      estimate_CATEs = TRUE)
Full call to use FLAME to match natality data:
my_PE_lm <- function(X, Y) {
  df <- as.data.frame(cbind(X, Y = Y))
  return(lm(Y ~ ., df)$fitted.values)
}
natality_out <- FLAME(data = natality, holdout = 0.25,
                      replace = FALSE,
                      treated_column_name = 'smokes10', #<<
                      outcome_column_name = 'dbwt', #<<
                      missing_data = 'drop', missing_holdout = 'drop',
                      PE_method = my_PE_lm,
                      estimate_CATEs = TRUE)
Full call to use FLAME to match natality data:
my_PE_lm <- function(X, Y) {
  df <- as.data.frame(cbind(X, Y = Y))
  return(lm(Y ~ ., df)$fitted.values)
}
natality_out <- FLAME(data = natality, holdout = 0.25,
                      replace = FALSE,
                      treated_column_name = 'smokes10',
                      outcome_column_name = 'dbwt',
                      missing_data = 'drop', missing_holdout = 'drop', #<<
                      PE_method = my_PE_lm,
                      estimate_CATEs = TRUE)
Full call to use FLAME to match natality data:
my_PE_lm <- function(X, Y) {
  df <- as.data.frame(cbind(X, Y = Y))
  return(lm(Y ~ ., df)$fitted.values)
}
natality_out <- FLAME(data = natality, holdout = 0.25,
                      replace = FALSE,
                      treated_column_name = 'smokes10',
                      outcome_column_name = 'dbwt',
                      missing_data = 'drop', missing_holdout = 'drop',
                      PE_method = my_PE_lm, #<<
                      estimate_CATEs = TRUE)
Full call to use FLAME to match natality data:
my_PE_lm <- function(X, Y) {
  df <- as.data.frame(cbind(X, Y = Y))
  return(lm(Y ~ ., df)$fitted.values)
}
natality_out <- FLAME(data = natality, holdout = 0.25,
                      replace = FALSE,
                      treated_column_name = 'smokes10',
                      outcome_column_name = 'dbwt',
                      missing_data = 'drop', missing_holdout = 'drop',
                      PE_method = my_PE_lm,
                      estimate_CATEs = TRUE) #<<
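A minimal sketch of what a CATE estimate means for one matched group, assuming it is computed as the difference in mean outcomes between treated and control units in that group. The data frame below is a made-up toy group that reuses the `smokes10` and `dbwt` column names; the `sex` column and all values are hypothetical:

```r
# Toy matched group: all units share the same value of the matched covariate
matched_group <- data.frame(
  sex      = c("F", "F", "F", "F"),
  smokes10 = c(1, 1, 0, 0),
  dbwt     = c(3000, 3100, 3300, 3400)   # invented birth weights in grams
)

# Difference in mean outcome between treated and control units in the group
cate <- with(matched_group,
             mean(dbwt[smokes10 == 1]) - mean(dbwt[smokes10 == 0]))
cate   # -300
```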
??? Estimate of the treatment effect for units that share certain covariate values
print(natality_out, linewidth = 60, digits = 1)
plot(natality_out, which_plots = 1)
plot(natality_out, which_plots = 2)
plot(natality_out, which_plots = 3)
plot(natality_out, which_plots = 4)
# Just because it takes a while; nothing fishy :)
natality_summ <- readRDS('../natality/natality_summary.rds')
(natality_summ <- summary(natality_out))
print(natality_summ)
```{css, echo=FALSE}
pre {
  background: #FFFFFF;
  max-width: 100%;
  overflow-x: scroll;
}
```
```r
op <- options('width' = 250)
high_quality_MG <- MG(natality_summ$MG$highest_quality[1], natality_out)[[1]]
head(high_quality_MG, n = 14)
options(op)
```
FLAME and DAME are scalable algorithms for observational causal inference
Use ML on a holdout set to learn a distance metric that prioritizes matches on more important covariates
Resulting matched groups are interpretable and high quality
Future work:
- Database implementation
- Algorithms for mixed data
class: middle, center
.center[]
almost-matching-exactly.github.io
???
Documentation, links to papers, Python package
PrintBibliography(myBib)