ECICourse::upload_images_to_canvas() knitr::opts_chunk$set( echo = TRUE, message = TRUE, collapse = TRUE )
Read along with this tutorial, and run the R code in RStudio as you go (copy and paste, or type directly into R).
Start by loading packages:
library(ECICourse) library(tidyverse)
The data for this tutorial describe a fictional observational study conducted by an online seller of accessories for smart phones and tablet computers. The company emailed a single-use coupon code to all 16,576 customers in its database whose most recent order was between 1 and 6 months ago. The coupons could be used to receive a 30% discount the next order, as long as they were redeemed within the next month.
The company wants to learn how much revenue increased as a result of customers using the coupons. However, because the customers chose whether to use the coupon or not, there is a problem of selection (non-random treatment). For example, high frequency customers might be more likely to place an order during the experiment than low frequency customers. The company therefore realizes that the naive estimator of the average treatment effect will be biased and inconsistent.
In this tutorial, we will use a classical matching strategy to calculate the average treatment effects among the treated (ATT; those using the coupons on their orders), controls (ATC; those who did not use the coupon), and all customers (ATE).
To obtain the data describing this experiment, run the following code:
coupon <- get_ECI_data('module-6-tutorial')
The data contain an id for each customer (customer
), background information describing each customer (explained below), indicators for whether the customer placed an order (placed_order
), whether they redeemed the coupon with that order (used_coupon
), and the size of the first order the customer placed during the promotion, not including the discount (order_size
). If no order was placed (placed_order == 0
), then order_size == 0
. To simplify this tutorial, we will assume that none of the customers placed more than one order during the month of the promotion.
coupon
The columns last_order
, num_orders
, and avg_value
are variables describing the past purchase behavior of each customer:
The value of last_order
is set to the number of months since the customer's most recent order.
The value of num_orders
is set to the number of previous orders in the last six months, binned into values of 1, 2, and 3 or more.
The value of avg_value
is based on the average revenue per order, and is binned into three groups for high, medium, and low value.
The rate of coupon use varied across customers with different values of last_order
, num_orders
, and avg_value
.
coupon |> group_by(last_order, num_orders, avg_value) |> summarise(used_coupon = mean(used_coupon), .groups = 'drop') |> ggplot(aes(x = last_order, y = used_coupon, colour = factor(num_orders))) + geom_line() + scale_y_continuous(labels = scales::percent) + scale_x_continuous(breaks = seq(0, 12, by = 3)) + scale_colour_viridis_d(option = 'C', end = .85) + facet_wrap(~avg_value) + labs(x = 'Months since last order', y = 'Percent using coupon', colour = 'Previous orders') + theme_bw()
Some customers placed orders without using their coupon. These customers might have deleted the email without reading it.
coupon |> group_by(used_coupon) |> summarise(num_placing_order = sum(placed_order))
This raises an interesting point about the effect of the coupon. For some customers who used the coupon, receiving the coupon might have prompted them to place an order they otherwise would not have made. For these customers, we might expect the treatment effect to be rather large, because their values of Y(0) (how much they would spend without the coupon) would be equal to zero. At the same time, we can imagine that a different subset of customers who used the coupon might have placed an order even if they hadn't received the coupon. For these customers, we might expect their values of Y(0) to be greater than 0, and thus we might find a relatively smaller treatment effect.
Which of these two processes is actually at play is unobserved to us. By assuming that selection is on the observables (and only on the observables), we are assuming that among any group of customers with the same observed variables, the unobserved process determining whether a customer uses the coupon is unrelated to the size of their treatment effect. This is a very strong assumption!
The company believes that the stratifying variables last_order
, num_orders
, and avg_value
completely account for all systematic variation in whether customers used the coupon. In other words, the company is willing to assume that all selection is on the observables. This means that among groups of customers in the same stratum---i.e., customers who have all three values of last_order
, num_orders
, and avg_value
in common---we are assuming that coupon use was essentially random with respect to their potential outcomes. To repeat, this is a very strong assumption to make, but it is what we need to assume in order to use a classical matching procedure to estimate average treatment effects. And in any case, the company believes this is close enough to the truth to make this assumption.
To help build intuition for how matching works, we will first consider just a single stratum of customers with last_order == 3
, num_orders == '2'
, and avg_value == 'Medium value'
:
coupon |> filter(last_order == 3 & num_orders == '2' & avg_value == 'Medium value')
high_S <- coupon |> filter(last_order == 3 & num_orders == '2' & avg_value == 'Medium value')
The assumption of selection on the observables means that within this group, coupon use is independent of potential outcomes---how large orders would be when using or not using a coupon. Under this assumption, any difference in average order size among coupon users and non-users (in this stratum) is due to the coupon.
coupon |> filter(last_order == 3 & num_orders == '2' & avg_value == 'Medium value') |> group_by(used_coupon) |> summarise(avg_order_size = mean(order_size), n_customers = n())
The conditional average treatment effect (CATE) for this stratum of customers is difference between these two average order sizes.
coupon |> filter(last_order == 3 & num_orders == '2' & avg_value == 'Medium value') |> group_by(used_coupon) |> summarise(avg_order_size = mean(order_size)) |> pivot_wider(names_from = 'used_coupon', values_from = 'avg_order_size', names_prefix = 'coupon_') |> mutate(CATE = coupon_1 - coupon_0)
Following this logic, we can calculate CATEs for all strata---i.e., for all unique combinations of the three stratifying variables (last_order
, num_orders
, and avg_value
). First, we calculate the average order size among coupon users and non-users in each stratum:
avg_order_sizes_by_group <- coupon |> group_by(last_order, num_orders, avg_value, used_coupon) |> summarise(avg_order_size = mean(order_size), .groups = 'drop') avg_order_sizes_by_group
We can now calculate the CATE for each group:
avg_order_sizes_by_group |> pivot_wider(names_from = 'used_coupon', values_from = 'avg_order_size', names_prefix = 'coupon_') |> mutate(CATE = coupon_1 - coupon_0) |> # renaming columns to make it easier to read: rename(with_coupon = coupon_1, without_coupon = coupon_0)
The estimated CATEs shown above vary widely across groups of customers. As we will see below, one way to think of classical matching is that it works by taking weighted averages of these CATEs in order to generate unbiased and consistent estimates for the average treatment effect (ATE), average treatment effect on the treated (ATT), and the average treatment effect on the controls (ATC). The weights in the weighted averages depend on the target effect to be estimated.
First, we will use a matching procedure to estimate the ATT. For these data, the ATT is the average effect of coupon use on customers who chose to use the coupon. If that's confusing, think instead of the ATT as an answer to the following question: "Among customers who chose to use the coupon, how much smaller would their orders have been if they hadn't used the coupon?"
In the material below, we will first apply the matching procedure in a way that goes slowly and uses a lot of code, but is conceptually more instructive. At the end I will show you an equivalent procedure, using the Matching R package, that requires less code.
First, isolate the treatment and control groups into different data frames:
treatment <- coupon |> filter(used_coupon == 1) control <- coupon |> filter(used_coupon == 0)
The two data frames are not balanced, by which we mean that within each stratum of the variables last_order
, num_orders
, and avg_value
, there are a different number of coupon users and non-users.
treatment |> count(last_order, num_orders, avg_value) control |> count(last_order, num_orders, avg_value)
To calculate the ATT, we want to construct an artificial control group that is balanced with respect to the treatment group. Consider this example: There are r sum(high_S$used_coupon)
treated customers with last_order == 3
, num_orders == '2'
, and avg_value == 'Medium value'
.
n_treated_in_this_stratum <- treatment |> filter(last_order == 3 & num_orders == '2' & avg_value == 'Medium value') |> nrow() n_treated_in_this_stratum
To calculate the ATT, the artificial control group for this stratum needs to contain exactly r sum(high_S$used_coupon)
coupon non-users. After sampling, the artificial control group for this stratum might look something like this:
set.seed(1)
artificial_control_for_this_stratum <- control |> filter(last_order == 3 & num_orders == '2' & avg_value == 'Medium value') |> sample_n(n_treated_in_this_stratum, replace = TRUE) |> arrange(customer) artificial_control_for_this_stratum
To calculate the ATT for the entire customer base, we need to repeat this sampling procedure for each stratum (i.e., unique combination of values for the three variables). To do that, we need to know how many treated customers there are in each stratum. We will calculate that first:
n_treated_per_stratum <- treatment |> count(last_order, num_orders, avg_value, name = 'n_treated') n_treated_per_stratum
Next, within each stratum, we need to sample n_treated
customers from the control group. The approach here will be to join the tables control
(containing control group customers) and n_treated_per_stratum
(indicating how many control customers to sample from each stratum). We will then use the values of n_treated
and the sample_n
function (from dplyr) to construct the artificial control group.
control_with_n_treated <- inner_join(control, n_treated_per_stratum, by = c("last_order", "num_orders", "avg_value")) control_with_n_treated artificial_control <- control_with_n_treated |> group_by(last_order, num_orders, avg_value, n_treated) |> # within each stratum, sample n_treated # controls with replacement: sample_n(size = n_treated[1], replace = TRUE) |> ungroup() |> select(-n_treated)
The treatment
and artificial_control
data frames are now balanced within each stratum:
treatment |> count(last_order, num_orders, avg_value) artificial_control |> count(last_order, num_orders, avg_value)
Now that we have an artificial control group that is balanced with respect to the treatment group, the next step is to combine the treatment and artificial control groups in order to calculate the average treatment effect on the treated (ATT):
treatment_and_artificial_control <- bind_rows(treatment, artificial_control) treatment_and_artificial_control |> group_by(used_coupon) |> summarise(avg_order_size = mean(order_size)) |> pivot_wider(names_from = used_coupon, values_from = avg_order_size) |> mutate(att = `1` - `0`)
The code above is spread out with a lot of discussion. Here is a version that is more compact, and makes use of the group_map
function in dplyr. Feel free to skip over this material and jump to the next subsection (showing how to use the Matching package).
coupon |> group_by(last_order, num_orders, avg_value) |> # number of customers in the treated condition within each stratum: mutate(n_treated = sum(used_coupon)) |> group_by(last_order, num_orders, avg_value, used_coupon, n_treated) |> group_map( # apply the function below to each group_by subset of rows: ~sample_n(.x, # sample rows from this group_by subset (.x) size = .y$n_treated, # <- the number of rows to sample = n_treated (.y refers to the grouping variables) replace = !.y$used_coupon), # sample with replacement from the control group (sample all from the treated group) .keep = TRUE) |> bind_rows() |> # calculate average order size for treatment/control group_by(used_coupon) |> summarise(avg_order_size = mean(order_size)) |> # calculate ATT pivot_wider(names_from = 'used_coupon', values_from = 'avg_order_size', names_prefix = 'coupon_') |> summarise(ATT = mean(coupon_1 - coupon_0))
The estimated ATT is not precisely the same in the two preceding examples because the randomly sampled artificial control groups are not the same.
First, load the Matching package. The Matching package also loads the MASS package, which replaces dplyr's version of select
. We will need dplyr's select
, however, so we call select <- dplyr::select
after loading packages.
library(Matching) select <- dplyr::select # because MASS replaced dplyr::select and we want the dplyr version
To calculate the ATT using the Matching package, we need to create distinct variables for 1) the outcome variable (order_size
), 2) the matching variables (last_order
, num_orders
, and avg_value
), and 3) the treatment variable (used_coupon
):
Y <- coupon$order_size X <- coupon |> select(last_order, num_orders, avg_value) |> # Matching expects a matrix of numbers, not a data frame with factors: mutate(across(everything(), as.numeric)) |> as.matrix() D <- coupon$used_coupon att <- Match(Y = Y, Tr = D, X = X, estimand = 'ATT', M = 1, replace = TRUE, ties = TRUE) summary(att)
Notice that the Match
procedure calculates a standard error for the estimate, which we didn't do in our version based on dplyr.
To calculate the ATC, we follow a very similar procedure, except we sample treatment cases in order to create an artificial treatment group that matches the number of control units in each stratum. Hence, code below is similar to what we did above.
control_and_artificial_treatment <- coupon |> group_by(last_order, num_orders, avg_value) |> mutate(n_control = sum(used_coupon == 0)) |> group_by(last_order, num_orders, avg_value, used_coupon, n_control) |> group_map( ~sample_n(.x, size = .y$n_control, replace = .y$used_coupon), .keep = TRUE) |> bind_rows()
To calculate the ATC we continue in a similar manner as before:
control_and_artificial_treatment |> group_by(used_coupon) |> summarise(avg_order_size = mean(order_size), .groups = 'drop') |> pivot_wider(names_from = 'used_coupon', values_from = 'avg_order_size', names_prefix = 'coupon_') |> summarise(ATC = mean(coupon_1 - coupon_0))
The ATC for this study represents the increase in order size we would expect among customers who didn't use the coupon, if we could somehow force them to place an order using the coupon.
We can also use the Matching package to calculate the ATC and ATE, just as we did when calculating the ATT.
atc <- Match(Y = Y, Tr = D, X = X, estimand = 'ATC', M = 1, replace = TRUE, ties = TRUE) summary(atc) ate <- Match(Y = Y, Tr = D, X = X, estimand = 'ATE', M = 1, replace = TRUE, ties = TRUE) summary(ate)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.