Introduction to funneljoin

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  message = FALSE
)

The goal of funneljoin is to make it easy to analyze behavior funnels. For example, maybe you're interested in finding the people who visit a page and then register. Or you want all the times people click on an item and add it to their cart within 2 days.

You can do this with funneljoin's after_join() function. The arguments are:

after_join types

Funneljoins can be any combination of first, last, any, and lastbefore with first, last, any, and firstafter.

Let's take an example. We'll use two tables, one of landings and one of registrations, that come with the package. Each has a user_id and timestamp column.

library(dplyr)
library(funneljoin)
landed
registered

Let's say we wanted to get only the first time people landed and the first time they registered, if it was after their first landing. We would use a first-first inner join.

landed %>%
  after_inner_join(registered, 
                   by_user = "user_id",
                   by_time = "timestamp",
                   type = "first-first")

We now have a dataset of four rows. Notice a few things:

Max gap

What if instead we wanted all landing-registration pairs that happened within a 4-day window? We now add the max_gap argument. This takes either a number, which is the number of seconds, or a difftime object. Instead of calculating the number of seconds in 4 days, we'll just make a difftime object. Because we want any pair, not just the pairs of the first landings and registrations, we change the type to any-any.

We'll also add gap_col = TRUE to return a column, .gap, which is the gap in seconds between the events.

landed %>%
  after_inner_join(registered, 
                   by_user = "user_id",
                   by_time = "timestamp",
                   type = "any-any", 
                   max_gap = as.difftime(4, units = "days"),
                   gap_col = TRUE)

We now have a dataset of 7 rows. Notice a few things:

Other types of joins

Left joins are very handy when you want to know what percentage of people who did X did Y afterward. For example, we may have an experiment where we want to know "what percentage of people registered after entering the experiment?"

In this case, we'd use a first-firstafter left join - we want to get a person's first experiment start and then their first registration afterward, if they have one. Let's add another column to our experiment starts table, which experiment variant someone was in.

experiment_starts <- tibble::tribble(
  ~user_id, ~timestamp, ~ alternative.name,
  1, "2018-07-01", "control",
  2, "2018-07-01", "treatment",
  3, "2018-07-02", "control",
  4, "2018-07-01", "control",
  4, "2018-07-04", "control",
  5, "2018-07-10", "treatment",
  5, "2018-07-12", "treatment",
  6, "2018-07-07", "treatment",
  6, "2018-07-08", "treatment"
) %>%
  mutate(timestamp = as.Date(timestamp))

experiment_registrations <- tibble::tribble(
  ~user_id, ~timestamp, 
  1, "2018-07-02", 
  3, "2018-07-02", 
  4, "2018-06-10", 
  4, "2018-07-02", 
  5, "2018-07-11", 
  6, "2018-07-10", 
  6, "2018-07-11", 
  7, "2018-07-07"
) %>%
  mutate(timestamp = as.Date(timestamp))
experiment_starts %>%
  after_left_join(experiment_registrations, 
                   by_user = "user_id",
                   by_time = "timestamp",
                   type = "first-firstafter")

We now have an entry for each user in the experiment_starts table, which has the time of their first registration afterward or NA if they did not have a registration afterwards. Each user only appears once since it's a first-something join.

We can use funneljoin's summarize_conversions() function to get the number of starts and number of conversions (registrations) in each group - you just need to specify the column that indicates whether someone converted - if it's NA or FALSE, it will be treated as FALSE, otherwise TRUE.

experiment_starts %>%
  after_left_join(experiment_registrations, 
                   by_user = "user_id",
                   by_time = "timestamp",
                   type = "first-firstafter") %>% 
  group_by(alternative.name) %>%
  summarize_conversions(converted = timestamp.y)

summarize_conversions also works if we have multiple types of conversions, specified by a column.

for_conversion <- tibble::tribble(
  ~"experiment_group", ~"first_event", ~"last_event", ~"type", 
  "control", "2018-07-01", NA, "click",
  "control", "2018-07-02", NA, "click",
  "control", "2018-07-03", "2018-07-05", "click",
  "treatment", "2018-07-01", "2018-07-05", "click",
  "treatment", "2018-07-01", "2018-07-05", "click",
  "control", "2018-07-01", NA, "purchase",
  "control", "2018-07-02", NA, "purchase",
  "control", "2018-07-03", NA, "purchase",
  "treatment", "2018-07-01", NA, "purchase",
  "treatment", "2018-07-01", "2018-07-05", "purchase"
)

for_conversion %>%
  group_by(type, experiment_group) %>%
  summarize_conversions(converted = last_event)

Summarize Prop Tests

summarize_prop_tests() takes in a dataset with at least three columns - nb_users, nb_conversions, and a column indicating experiment group. It can also have an additional column that is the type of conversion - for example, you could have clicks and purchases. Each type of conversion can only have two rows, one control and one other group. If you have that additional column of type, you need to group by it first.

It returns a dataset with at least 5 columns:

If you had a type column, it will also be in the output.

tbl <- tibble::tribble(
  ~ experiment_group, ~nb_users, ~nb_conversions, ~type,
  "control", 500, 200, "purchase",
  "treatment", 500, 100, "purchase", 
  "control", 500, 360, "click",
  "treatment", 500, 375, "click"
)

tbl %>%
  group_by(type) %>%
  summarize_prop_tests(alternative_name = experiment_group)


Try the funneljoin package in your browser

Any scripts or data that you put into this service are public.

funneljoin documentation built on March 31, 2023, 11:07 p.m.