knitr::opts_chunk$set(
  dev = "png",
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6,
  fig.align = "center",
  warning = FALSE
)

GeoLift: An end-to-end solution to Measure Lift at a Geo Level {.tabset}

Introduction

What is Lift?

Lift is a scientific measurement technique that allows users to accurately understand and compare the value of their strategies. Lift, which is grounded in experimental design, is typically used to measure the true or causal effect that an ad campaign had on business metrics such as Sales, App Installs, or Leads. By estimating the incremental effect that these strategies provide, analysts are able to understand, analyze, and optimize their cross-channel strategies through data and science.

What is Geo Experimentation?

There are many ways to measure Lift, and Geo Experimentation is one that has been rapidly gaining popularity. In general, Geo Experimentation refers to any test where the treatment is administered at the geo level. In practice, this often means that ad campaigns are geo-targeted in such a way that the test regions are eligible to see ads while the control ones aren't. Geo Experimental methods use this information to calculate the true value that this strategy had on your audience. Moreover, since they rely on aggregated data, they're privacy-safe, can work across channels, and are very robust against signal loss.

Why GeoLift?

GeoLift is Meta's open-source solution to measure Lift at a geographical level. By leveraging the latest advances in Synthetic Control Methods, GeoLift empowers you to start making decisions grounded in incrementality and measure the true value of your marketing campaigns. This tool offers an end-to-end solution to geo-experimentation which spans data ingestion, power analyses, market selection, and inference in this easy-to-use R package. Moreover, the tool's open-source nature makes it transparent and fully reproducible!

Installation

You can install GeoLift from our GitHub repository using the remotes package.

## Install remotes if not already installed
install.packages("remotes", repos = "http://cran.us.r-project.org")
## Install augsynth & GeoLift from GitHub
remotes::install_github("ebenmichael/augsynth")
remotes::install_github("facebookincubator/GeoLift")
library(GeoLift)
library(gridExtra)
library(ggplot2)

Example - Data

To show an end-to-end implementation of GeoLift, we will use simulated data of 40 US cities across 90 days to: design a test, select test markets, run power calculations, and finally calculate the Lift caused by the campaign. As with every GeoLift test, we start by analyzing pre-test historical information. We will use the data included in the GeoLift package.

data(GeoLift_PreTest)

The GeoLift_PreTest data set contains three variables:

  1. location (city)
  2. date (in "yyyy-mm-dd" format)
  3. Y (number of conversions/KPI in each day/location).
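
For a quick sanity check before converting the data, you can inspect the raw data frame directly:

# Quick look at the raw pre-test data before reading it into GeoLift's format
head(GeoLift_PreTest)
str(GeoLift_PreTest)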

Every GeoLift experiment should contain at least these three variables, which reflect when, where, and how much of the KPI was measured. Nevertheless, if you have more data available, you can include covariates in GeoLift through the X parameter of all GeoLift functions to improve the results.

The first step to run a GeoLift test is to read the data into the proper format using the GeoDataRead function.

GeoTestData_PreTest <- GeoDataRead(
  data = GeoLift_PreTest,
  date_id = "date",
  location_id = "location",
  Y_id = "Y",
  X = c(), # empty list as we have no covariates
  format = "yyyy-mm-dd",
  summary = TRUE
)
head(GeoTestData_PreTest)

This function analyzes the data set, handles locations with missing data, and returns a data frame with time-stamps instead of dates. In this case, since we're inputting daily data, each time unit represents a day.
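
Although this example has no covariates, the X parameter mentioned above takes a vector of covariate column names. The sketch below is purely hypothetical: the covariate column and its values are made up only to show the call shape.

# Hypothetical covariate example: attach a made-up covariate column to the raw data
# and pass its name through the X parameter (values are random noise, for illustration only)
GeoLift_PreTest_cov <- GeoLift_PreTest
GeoLift_PreTest_cov$covariate_1 <- rnorm(nrow(GeoLift_PreTest_cov))
GeoTestData_PreTest_cov <- GeoDataRead(
  data = GeoLift_PreTest_cov,
  date_id = "date",
  location_id = "location",
  Y_id = "Y",
  X = c("covariate_1"), # vector of covariate column names
  format = "yyyy-mm-dd",
  summary = TRUE
)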

Note: Before reading the data into this format, always make sure that there are no missing variables, NAs, or locations with missing time-stamps as those will be dropped by the GeoDataRead() function.
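
A minimal version of this check (a sketch using base R and dplyr, which is also used with the :: prefix later in this vignette) could look like this:

# Sketch of a pre-read sanity check: count missing values and flag locations
# that do not have an observation for every date in the panel
sum(is.na(GeoLift_PreTest))
obs_per_location <- dplyr::count(GeoLift_PreTest, location)
dplyr::filter(obs_per_location, n != length(unique(GeoLift_PreTest$date)))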

A good next step is to plot the panel data with GeoPlot to observe its trend and contribution per location, and to detect any data anomalies before moving on to the data analysis.

GeoPlot(GeoTestData_PreTest,
  Y_id = "Y",
  time_id = "time",
  location_id = "location"
)

In this case we see a similar pattern that's shared across all locations. These structural similarities between regions are the key to a successful test!

Example - Power Analysis

Running a prospective power analysis is fundamental prior to executing a test. It is only through a thorough statistical analysis of our data that we can set the test up for success. In general, the power analysis helps us determine which locations make the best test and control markets, how long the test should run, what the Minimum Detectable Effect is, and how much budget is needed for a well-powered experiment.

Assessing the power and selecting the test markets for a GeoLift test can be accomplished through the GeoLiftMarketSelection() function. Through a series of simulations, this algorithm will find the best combinations of test and control locations for the experiment. Moreover, for each of these test market selections, the function will display the Minimum Detectable Effect, the minimum investment needed to run a successful test, and other important model-fit metrics that will help us select the test that best matches our goals and resources.

Understanding Power Function Parameters

The key parameters needed to run this function, all of which appear in the call below, are the GeoLift data frame (data), the test durations to evaluate (treatment_periods), the number of test markets to consider (N), the range of effect sizes to simulate (effect_size), how many of the most recent possible test periods to simulate (lookback_window), markets that must be included in or excluded from the test (include_markets, exclude_markets), the cost per incremental conversion (cpic), the maximum available budget (budget), the significance level (alpha), and whether to use a Fixed Effects model (fixed_effects).

Continuing with the example, and in order to explore the function's capabilities, let's assume we have two restrictions: Chicago must be part of the test markets and we have up to \$100,000 to run the test. We can include these constraints in GeoLiftMarketSelection() with the include_markets and budget parameters and proceed with the market selection. Moreover, after observing in GeoPlot() that the historical KPI values have been stable across time, we will proceed with a model with Fixed Effects. Finally, given a CPIC = \$7.50 obtained from a previous Lift test, a range between two and five test markets, and a duration between 10 and 15 days, we obtain:

MarketSelections <- GeoLiftMarketSelection(
  data = GeoTestData_PreTest,
  treatment_periods = c(10, 15),
  N = c(2, 3, 4, 5),
  Y_id = "Y",
  location_id = "location",
  time_id = "time",
  effect_size = seq(-0.25, 0.25, 0.05),
  lookback_window = 1,
  include_markets = c("chicago"),
  exclude_markets = c("honolulu"),
  cpic = 7.50,
  budget = 100000,
  alpha = 0.1,
  Correlations = TRUE,
  fixed_effects = TRUE,
  side_of_test = "one_sided",
  parallel = FALSE
)

Power output - ranking variables

The results of the power analysis and market selection provide several key metrics, such as the Minimum Detectable Effect, the minimum required investment, model-fit diagnostics, and an overall rank, that we can use to select our test market.
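
The full table of candidate markets and their metrics is stored in the BestMarkets element of the object returned by GeoLiftMarketSelection(), which we also use further below; printing its first rows is a convenient way to review the ranking:

# Review the ranked test-market candidates
head(MarketSelections$BestMarkets)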

The results in MarketSelections show that the test markets with the best ranks are: (chicago, cincinnati, houston, portland), (atlanta, chicago, cleveland, las vegas, reno), and (chicago, portland), all of them tied at rank 1. We can plot() two of these market selections to inspect them further. This plot will show what the results of the GeoLift() model would look like for the latest possible test period, as well as the test's power curve across a single simulation.

# Plot for chicago, cincinnati, houston, portland for a 15 day test
plot(MarketSelections, market_ID = 2, print_summary = FALSE)

# Plot for chicago, portland for a 15 day test
plot(MarketSelections, market_ID = 3, print_summary = FALSE)

Power output - deep dive into power curves

In order to ensure that power is consistent throughout time for these locations, we can run more than 1 simulation for each of the top contenders that came out of GeoLiftMarketSelection.

We will do this by running the GeoLiftPower() function with the lookback_window expanded to 10 days, only for this treatment combination, and plot the results.

NOTE: You could repeat this process for the top 5 treatment combinations that come out of GeoLiftMarketSelection, with increased lookback windows and compare their power curves. We will do it only for Chicago and Portland here.

market_id <- 3
market_row <- dplyr::filter(MarketSelections$BestMarkets, ID == market_id)
treatment_locations <- stringr::str_split(market_row$location, ", ")[[1]]
treatment_duration <- market_row$duration
lookback_window <- 10

power_data <- GeoLiftPower(
  data = GeoTestData_PreTest,
  locations = treatment_locations,
  effect_size = seq(-0.15, 0.15, 0.01),
  lookback_window = lookback_window,
  treatment_periods = treatment_duration,
  cpic = 7.5,
  side_of_test = "one_sided",
  parallel = FALSE
)

plot(power_data, show_mde = TRUE, smoothed_values = FALSE, breaks_x_axis = 5) +
  labs(caption = unique(power_data$location))

While both market selections perform excellently on all metrics, we will move forward with the latter since it allows us to run a successful test with a smaller budget. Also, the power curve is symmetric with respect to the y axis, it has no power when the true effect is zero, and it behaves similarly with more simulations. Finally, changing the print_summary parameter of plot() to TRUE can provide us with additional information about this market selection.

# Plot for chicago, portland for a 15 day test
plot(MarketSelections, market_ID = 3, print_summary = TRUE)

Note: Given that we are not using the complete pre-treatment data to calculate the weights in our power analysis simulations, the weights displayed by the plotting function above are not the final values. However, you can easily obtain them with the GetWeights() function.

weights <- GetWeights(
  Y_id = "Y",
  location_id = "location",
  time_id = "time",
  data = GeoTestData_PreTest,
  locations = c("chicago", "portland"),
  pretreatment_end_time = 90,
  fixed_effects = TRUE
)

# Top weights
head(dplyr::arrange(weights, desc(weight)))

Example - Analyzing the Test Results

Based on the results of the Power Calculations, a test was set up in which a \$65,000, 15-day marketing campaign was executed in the cities of Chicago and Portland while the rest of the locations were put on holdout. Following the completion of this marketing campaign, we receive sales data that reflects these results. This new data set has the same format and information as the pre-test one but crucially includes results for the duration of the campaign. Depending on the vertical and product, adding a post-campaign cooldown period might be useful.

Test Data

Data for the campaign results can be accessed at GeoLift_Test.

data(GeoLift_Test)

Similarly to the process executed at the beginning of the Power Analysis phase, we read the data into GeoLift's format using the GeoDataRead function. You can observe in the summary output that 15 additional periods are contained in the new GeoLift data object.

GeoTestData_Test <- GeoDataRead(
  data = GeoLift_Test,
  date_id = "date",
  location_id = "location",
  Y_id = "Y",
  X = c(), # empty list as we have no covariates
  format = "yyyy-mm-dd",
  summary = TRUE
)
head(GeoTestData_Test)

The results can also be plotted using the GeoPlot function. However, for post-campaign data we can include the time-stamp at which the campaign started through the treatment_start parameter to clearly separate the two periods. Plotting the time-series is always useful to detect any anomalies with the data and to start noticing patterns with the test.

GeoPlot(GeoTestData_Test,
  Y_id = "Y",
  time_id = "time",
  location_id = "location",
  treatment_start = 91
)

GeoLift Inference

The next step in the process is to calculate the actual Lift caused by the marketing campaigns on our test locations. To do so, we make use of the GeoLift() function, which takes as input the GeoLift data frame as well as information about the test, such as which cities were in the treatment group, when the test started, and when it ended, through the locations, treatment_start_time, and treatment_end_time parameters respectively.

GeoTest <- GeoLift(
  Y_id = "Y",
  data = GeoTestData_Test,
  locations = c("chicago", "portland"),
  treatment_start_time = 91,
  treatment_end_time = 105
)
GeoTest

The results show that the campaigns led to a 5.4% lift in units sold, corresponding to 4667 incremental units for this 15-day test. Moreover, the Average Estimated Treatment Effect is 155.556 units for every day of the test. Most importantly, we observe that these results are statistically significant at a 95% level. In fact, there's only a 1.1% chance of observing an effect of this magnitude or larger if the actual treatment effect were zero. In other words, it is extremely unlikely that these results are just due to chance. To dig deeper into the results, we can run summary() on our GeoLift object.
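
As a quick back-of-the-envelope check (this assumes the ATT is reported per treated location per day, which is consistent with the totals reported above):

# 155.556 incremental units per location per day * 2 test locations * 15 treatment days
155.556 * 2 * 15 # roughly 4,667 incremental units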

summary(GeoTest)

The summary shows additional test statistics, such as the p-value, which was equal to 0.01, confirming the strong statistical significance of these results. Moreover, the summary function provides Balance Statistics, which display data about our model's fit. The main metric of model fit used in GeoLift is the L2 Imbalance, which represents how far our synthetic control was from the actual observed values in the pre-treatment period. That is, how close the synthetic Chicago + Portland unit we created is to the observed values of these cities in the period before the intervention. A small L2 Imbalance score means that our model did a great job replicating our test locations, while a large one would indicate a poor fit. However, the L2 Imbalance metric is scale-dependent, meaning that it can't be compared between models with different KPIs or number of testing periods. For instance, the L2 Imbalance of a model run on grams of product sold will be significantly larger than that of a model run on tons of product sold, even if they represent the same basic underlying metric.
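
To make the scale-dependence point concrete, here is a small illustration with made-up residuals, assuming the L2 Imbalance is computed as the L2 norm of the pre-treatment fit error, as in standard Synthetic Control implementations:

# Hypothetical pre-treatment residuals for the same fit, expressed in grams and in tons
fit_error_grams <- c(1e6, -2e6, 5e5)
fit_error_tons <- fit_error_grams / 1e6
sqrt(sum(fit_error_grams^2)) # large L2 imbalance
sqrt(sum(fit_error_tons^2)) # tiny L2 imbalance, despite an identical underlying fit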

Therefore, given that it's hard to tell whether the model had a good or poor fit by simply looking at the value of the L2 Imbalance metric, we also include the Scaled L2 Imbalance statistic, which is easier to interpret as it's bounded in the range between 0 and 1. A value close to zero represents a good model fit, while values nearing 1 indicate poor performance by the Synthetic Control Model. This scaling is accomplished by comparing the L2 Imbalance of our Synthetic Control Method with the L2 Imbalance obtained by a baseline/naive model (instead of carefully calculating the optimal weighting scheme for the Synthetic Control, we assign equal weights to each unit in the donor pool). The latter provides an upper bound of L2 Imbalance; therefore, the Scaled L2 Imbalance shows us how much better our GeoLift model is than the baseline.

In fact, another way to look at the Scaled L2 Imbalance is as the percent improvement over a naive model, which can be obtained by subtracting our model's Scaled L2 Imbalance from 100%. In this case, an improvement close to 100% (which corresponds to a Scaled L2 Imbalance close to zero) represents a good model fit. Finally, we also include the weights that generate our Synthetic Control. In this test we note that the locations that contribute the most to our GeoLift model are Cincinnati, Miami, and Baton Rouge.
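
Before moving on to the plots, and using purely hypothetical numbers, the relationship between the imbalance metrics described above can be summarized as follows:

# Hypothetical values, only to illustrate how the imbalance metrics relate
l2_imbalance <- 120 # scale-dependent fit error of the GeoLift model
l2_imbalance_naive <- 1500 # fit error of the equal-weights (naive) baseline
scaled_l2_imbalance <- l2_imbalance / l2_imbalance_naive # bounded between 0 and 1
1 - scaled_l2_imbalance # percent improvement over the naive model (0.92 = 92%)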

plot(GeoTest, type = "Lift")

Plotting the results is a great way to assess the model's fit and how effective the campaign was. Taking a close look at the pre-treatment period (the period before the dotted vertical line) provides insight into how well our Synthetic Control Model fitted our data. In this specific example, we see that the observed values of the Chicago + Portland test, represented by the solid black line, were closely replicated by our SCM model, shown as the dashed red line. Furthermore, looking at the test period, we can notice the campaign's incrementality, shown as the difference between the sales observed in the test markets and our counterfactual synthetic location. This marked difference between an almost-exact match in the pre-treatment periods and the gap in the test time-stamps provides strong evidence of a successful campaign.

plot(GeoTest, type = "ATT")

Looking at the Average Estimated Treatment Effect's plot can also be extremely useful. The ATT metric shows us the magnitude of the Average Treatment Effect on a daily basis, in contrast with the previous (Lift) plot, which focused on aggregated effects. This is a great example of a good GeoLift model, as it has very small ATT values in the pre-treatment period and large ones when the treatment is administered to our test locations. Moreover, point-wise confidence intervals are included in this chart, which help us gauge how significant each day's Lift has been.

Improving The Model

While the results obtained from the test are robust and highly significant, a useful feature of GeoLift is its ability to improve the model fit even further and reduce bias through augmentation by a prognostic function. There are several options for augmenting the standard GeoLift model, such as regularization (specifically Ridge) and an application of the Generalized Synthetic Control Model (GSC). Each of these approaches provides its own set of advantages: Ridge regularization usually performs well when the number of units and time periods isn't large, while GSC helps improve fit in situations with many pre-treatment periods. GeoLift also offers the option to let the model decide which is the best approach by setting the model parameter to "best".

GeoTestBest <- GeoLift(
  Y_id = "Y",
  data = GeoTestData_Test,
  locations = c("chicago", "portland"),
  treatment_start_time = 91,
  treatment_end_time = 105,
  model = "best"
)
GeoTestBest
summary(GeoTestBest)
plot(GeoTestBest, type = "Lift")
plot(GeoTestBest, type = "ATT")

The new results augment the GeoLift model with a Ridge prognostic function, which improves the model fit as seen in the new L2 Imbalance metrics. This additional robustness translates into a small increase in the Percent Lift. Furthermore, by augmenting the model with a prognostic function, we obtain an estimate of the bias that was removed by the Augmented Synthetic Control Model.



