knitr::opts_chunk$set(echo = TRUE)

This vignette shows the analysis done on a survey experiment. A survey experiment took place, where a random combination of phrases was shown to respondents, each respondent saw 12 different random permutations. Respondents were then asked "How likely are you to download this mobile app?". The app in question is referring to an app to help people sleep better.

First of all, we load all the libraries we need

library(haven)
library(labelled)
library(DT)
library(cowplot)
library(grid)
library(gridExtra)
library(scales)
library(patchwork)

library(dplyr)
library(ggplot2)
library(stringr)
library(purrr)
library(tibble)
library(tidyr)

Then we install our own library

library(testGradient)

Exploratory data analysis

We have been provided with two data sets:

We will perform an EDA on each one of the datasets.

Survey data

We can begin by a quick glimpse at the survey data.

survey <- data.frame(survey)

DT::datatable(survey[,1:5])

We observe that we have 892 entries and 100 variables, composed of mainly demographic data.

dim(survey)

We will set our plot dimensions.

base_dimension <- patchwork::get_dim(get_bar_chart_labels(data = survey, variable = d_education, title = "Plot"))

Place of residence

Almost half of respondents live in a suburban setting. Followed by an urban setting for more than a third of them, and then in a rural setting for around 1 in 6 respondents.

get_bar_chart_labels(data = survey, variable = d_urban, title = "Place of residence") %>% set_dim(base_dimension)

Gender

As expected, women are make up a bit over half of our respondents.

get_bar_chart_labels(data = survey, variable = s_gender, title = "Gender of the respondents")%>% set_dim(base_dimension)

Education

We can break down our respondents by education.

get_bar_chart_labels(data = survey, variable = d_education, title = "Education of the respondents") %>% set_dim(base_dimension)

For simplifying our analysis and avoid having too many levels, we will recode this variable into just three levels :

education_labels = c("Less than collegue" = 1, "At least some college" = 2, "More than college" = 3 )

survey <- survey %>%
    mutate(d_education_collapse = labelled (case_when(d_education <= 2 ~ 1,
                                                    d_education > 2 & d_education <7  ~ 2,
                                                    d_education >= 7 ~ 3),
                                                    education_labels)) 

And we get a clearer picture of the distribution of education levels among our respondents. Almost 1 in 4 of our respondents have "Less than college" , more than half have "At least some college", and 1 in 5 have "More than college".

get_bar_chart_labels(data = survey, variable = d_education_collapse, title = "Education of the respondents") %>% set_dim(base_dimension)

Race of the respondents

Whites make around 60% of our sample, Hipanic and Latino represent around 1 in 6, our respondents, and African Americans represent around 1 in 8.

get_bar_chart_labels(data = survey, variable = s_race, title = "Race of the respondents") %>% set_dim(base_dimension)

Income

Around one in 4 of respondents earn "Less than 25,000", one in 4 earn between "\$25,000 and \$50,000", around 30% earn between "\$0,000 and \$100,000" and around one u-in five earn "\$100,000 or more".

get_bar_chart_labels(data = survey, variable = s_hhincome, title = "Income of the respondents") %>% set_dim(base_dimension)

Experiment data

We will now turn our attention to the experiment data to have a glimpse of the experiment and the outcomes. The experiment consisted in showing to each respondent 12 different permutations of a message intended to make them download the app in order to determine with which permutation the respondent are most likely to download the app.

So we begin by getting the experiment data, and joining it with the survey data.

experiment <- data.frame(experiment)


experiment_aug <- left_join(experiment, survey, by = "response_id")
DT::datatable(experiment_aug[,1:5])

The permutations were done on the following variables:

experiment_aug %>% select(duration:social_proof) %>% purrr::map(unique)

As every respondent was shown 12 different permutations, we have in total 10, 700 observations in the experiment dataset.

dim(experiment_aug)

We will begin by having a look at the distribution of answers. First, we will add labels to our dataset. We know that the corresponding labels are the following :

answer_labels = c("Very unlikely" = 1, "Somewhat unlikely" = 2, 
                  "Somewhat likely" = 3, "Very likely" = 4)

experiment_aug <- experiment_aug %>%
    mutate(answer = labelled(answer, answer_labels))

Once we have added our labels, we will observe the distribution of answers. We can see that the answers are not evenly distributed. The sum of the relative frequencies of the answers "Somewhat unlikely" and of "Somewhat likely" are around the same as the relative frequency of the "Very unlikely" alone.

get_bar_chart_labels(data = experiment_aug, variable = answer, title = "Answer of the respondents")

We will also want to see if the different levels of each categorical variable are more or less balanced in the data set. As we can see, the different levels of the categorical variables of the experiment are relatively well balanced. We do not observe for any variable that one level has a substantially higher share than the others.

experiment_graphs = c("duration", "offer", "outcome", "price", "rtb" )

list_graphs = map(.x = experiment_graphs, ~get_bar_simple(data = experiment_aug, variable = .x))
plots <-plot_grid(list_graphs[[1]], list_graphs[[2]], list_graphs[[3]], list_graphs[[4]]       ,list_graphs[[5]],
          labels = experiment_graphs, 
          label_size = 12,
          ncol = 1,
          nrow = 5,
          align = "hv",
          label_y = 1)
grid.arrange(plots, 
             bottom = textGrob("Relative frequency", gp = gpar(fontsize = 14, fontface="bold")))

Effect of demographic variables

We will look at the effect of the main demographic variables on the likelihood to say they would download the app. For this, we will be using get_index(), which is a function that allows us to compare the grouped distribution of a certain level of a categorical variable against the total population.

If, for example 50% of all rural respondents say they are "Very likely" to download the app, but only 40% of the total population say the same, then our index would be 50/40*100 = 125. This would mean that, proportionally to the general population, rural respondents are more likely to download the app. We can see the below example.

DT::datatable(get_index(data = experiment_aug, grouping_var = d_urban, index_var = answer))

We will better visualize the results by using graphs. We will first set the size of all of our graphs.

base_dimension_2 <- patchwork::get_dim(get_index(data = experiment_aug, grouping_var = s_race, index_var = answer) %>%
  get_bar_chart_index(grouping_var = s_race, title = "Plot"))

We can see that rural and suburban residents tend to have the same distribution among the Answer variable. However, urban residents exhibit a distribution that is eschewed in exactly the opposite direction. Rural and suburban residents are more more unlikely than the general population to say they would download the app, whereas urban resident are more likely to say they would do it.

get_index(data = experiment_aug, grouping_var = d_urban, index_var = answer) %>%
  get_bar_chart_index(grouping_var = d_urban, title = "Answer index by place of residence") %>%
  set_dim(base_dimension_2)

We turn now on the distribution of answers by gender. We can observe clear differences in gender, with men being more likely than women to say they would download the app.

get_index(data = experiment_aug, grouping_var = s_gender, index_var = answer) %>%
  get_bar_chart_index(grouping_var = s_gender, title = "Answer index by gender") %>%
  set_dim(base_dimension_2)

Then we obtain the distribution of answers by education. Respondents with "More than college" tend to be more much likely than the general population to say they would download the app.

get_index(data = experiment_aug, grouping_var = d_education_collapse, index_var = answer) %>%
  get_bar_chart_index(grouping_var = d_education_collapse, title = "Answer index by education")%>%
  set_dim(base_dimension_2)

By looking at the grouped distribution by income levels, we can say that those who earn more than \$100,00 per year or more are less likely to say they would download the app.

get_index(data = experiment_aug, grouping_var = s_hhincome, index_var = answer) %>%
  get_bar_chart_index(grouping_var = s_hhincome, title = "Answer index by education") %>%
  set_dim(base_dimension_2)

We turn to the distribution of answers by race. We can observe that African Americans are much more likely than the general population to say they would be "Very likely" to download the app, while white people are more likely than the general population to say they would be "Very unlikely" to download the app.

get_index(data = experiment_aug, grouping_var = s_race, index_var = answer) %>%
  get_bar_chart_index(grouping_var = s_race, title = "Answer index by race") %>%
  set_dim(base_dimension_2)

We will now turn to our attention to see if the distribution of answers differs between levels of the question How often would you say you have trouble sleeping at night during a typical week?. Unsurprisingly, those who say they have trouble sleeping every night are much more likely than the general population to say they would likely download the app.

get_index(data = experiment_aug, grouping_var = s_problem, index_var = answer) %>%
  get_bar_chart_index(grouping_var = s_problem, title = "Answer index by frequency of trouble slepping") %>%
  set_dim(base_dimension_2)

Effect of experiment variables

We will explore the effect of the experiment's variables on the distribution of the likelihood to download the app.

Duration

There seems to be no notable differences on the distribution of the likelihood to download the app by the different of the duration variable.

get_index(data = experiment_aug, grouping_var = duration, index_var = answer) %>%
  get_bar_chart_index(grouping_var = duration, title = "Answer index by duration") %>%
  set_dim(base_dimension_2)
Offer

There are no significant differences in distribution of the likelihood of downloading the app by the offer phase.

get_index(data = experiment_aug, grouping_var = offer, index_var = answer) %>%
  get_bar_chart_index(grouping_var = offer, title = "Answer index by offer") %>%
  set_dim(base_dimension_2)
Outcome

The phase"breaking bad habits and creating new routines" seems to be the best performing one.

get_index(data = experiment_aug, grouping_var = outcome, index_var = answer) %>%
  get_bar_chart_index(grouping_var = outcome, title = "Answer index by offer") %>%
  set_dim(base_dimension_2)
Price
get_index(data = experiment_aug, grouping_var = price, index_var = answer) %>%
  get_bar_chart_index(grouping_var = price, title = "Answer index by offer") %>%
  set_dim(base_dimension_2)
rtb

We observe a clear difference in the distribution of the answer variable according to the rtb phrase that was chosen. The phrase that appear to be more personalized, like "personalized, human coaching" and "a program created just for you" seem to make people more likely to download the app.

get_index(data = experiment_aug, grouping_var = rtb, index_var = answer) %>%
  get_bar_chart_index(grouping_var = rtb, title = "Answer index by offer") %>%
  set_dim(base_dimension_2)
Social proof

The best performing phrase appears to be "scientific evidence".

get_index(data = experiment_aug, grouping_var = social_proof, index_var = answer) %>%
  get_bar_chart_index(grouping_var = social_proof, title = "Answer index by offer") %>%
  set_dim(base_dimension_2)

Data preparation for modeling

Before starting the modeling, we need to prepare our data.

train <- experiment_aug %>%
  mutate(across(contains("m2_awareness_apps_"), ~if_else(is.na(.x), 0, 1)),
         across(contains("source_"), ~if_else(is.na(.x), 0, 1)),
         across(matches("behavior_\\d"), ~if_else(is.na(.x), 0, 1))) %>%
  mutate(m2_awareness_apps = rowSums(across(contains("m2_awareness_apps_"))),
         behavior = rowSums(across(matches("behavior_\\d"))))

For simplifying this first analysis, we keep only the variable that have no missing values after the quick data preparation we described above.

columns_keep <- train %>% map(~sum(is.na(.))) %>% keep(.==0) %>% names()

train <- train[columns_keep]

We get a vector of variables (or features), and then separate our into the independent variable data set, and the target (or independent variable) data set.

non_feature_vectors <- c("response_id", "task","answer", "weights")
feature_vectors <- setdiff(columns_keep, non_feature_vectors)

features_data <- train[feature_vectors]
target <- train["answer"]
#chisq.test(train$answer, train$duration, correct=FALSE)

Modeling

Chi-square test

We will perform a chi-square test on each variable. The purpose of doing a chi-square test is to determine if there is a significant difference between the expected and the observed frequencies in a dataset between different categories.

list_chi_square <- map(features_data, possibly(function(x) {
  chisq.test(target, x, correct = FALSE)}
  , NA_real_ ))

And then we recuperate the output of the tests

chi_statistic <- map(list_chi_square, "statistic") %>% tibble::enframe() %>% unnest()
chi_p_values <- map(list_chi_square, "p.value") %>% tibble::enframe() %>% unnest()

chi_results <- tibble(feature = chi_statistic$name,
                      statistic = chi_statistic$value,
                      p_value = chi_p_values$value) %>%
                mutate(across(where(is.double), ~scientific(., digits = 3))) 

From a first glance at the data, it would appear that the only variable that is not independent from the answer are only two:

DT::datatable(chi_results)

Both of these variable are part of the experiment's variables. However, we must keep in mind that a chi-square test is done exclusively between two variables, and, in this specific case, it would not allow us to detect relationships between independent variables. For example, if the duration variable really has a statistically significant effect on the likelihood to download the app but only after controlling for demographic variables, the chi-square test would not allow us to detect it.

Ordinal logistic regression

We will perform an ordinal logistic regression. An ordinal logistic regression serves to predict a dependent categorical variable where classes are ordered. In this case, we say that our classes are ordered because we can establish an order between the answers, which are:

print(answer_labels)

We can consider the four possible responses as different degrees of likelihood to download the app. In common language terms, we could say, that:

We prepare our data to be fed to the model, namely by ordering the answer variable and by casting as factors the categorical variables. For the variables related to the beliefs of the individual, the m1_philosophy_* we will do an ordinal approximation and treat them as continuous.

train_logit <- train %>%
mutate(answer = unlabelled(answer), 
       answer = factor(answer, 
                       levels = c("Very unlikely", "Somewhat unlikely", "Somewhat likely", "Very likely"), ordered=TRUE)) %>%
mutate(across(where(is_character), as.factor),
       across(past_coach:d_employment, unlabelled),
       across(d_urban:s_problem, unlabelled)) 

We run our model

model <- MASS::polr(formula = answer ~ duration + offer + outcome + price + rtb +    social_proof + s_gender +  m2_awareness_apps + behavior + past_coach + interest_coach + d_parent + d_employment + s_gender + s_race + s_problem + m1_philosophy_2 + m1_philosophy_5 + m1_philosophy_6 + m1_philosophy_8 + s_problem, data = train_logit, Hess = TRUE, weights = weights, method = "logistic" )

And print its output

print(model)

We get the variable coefficients and p-values.

model_summary <- coef(summary(model))

pval <- pnorm(abs(model_summary[, "t value"]),lower.tail = FALSE)* 2
model_summary <- as.data.frame(cbind(model_summary, "p value" = round(pval,3)))

We format our model output in a data frame.

model_summary_df <- rownames_to_column(model_summary, var = "variable") %>%
  mutate(across(where(is.double), ~scientific(., digits = 3)))
DT::datatable(model_summary_df)

In an ordinal logistic regression, we can interpret the coefficients in the following way:

From the coefficients and p-values of the regression, we can make the following interpretations:

Conclusions

Based on the above, we can say that:

I would tell the client that, in order to improve the download rate, I would recommend to:



NahieliV/testGradient_rep documentation built on Dec. 17, 2021, 5:19 a.m.