In dkopcino/sleepers: Tools and Analysis for the Gradient Metrics Homework Assignment

knitr::opts_chunk$set(
 echo = FALSE,
 warning = FALSE,
 message = FALSE,
  collapse = TRUE,
  comment = "#>"
)

# file names
experiment_data_file = "../inst/extdata/experiment_data.sav"
survey_data_file = "../inst/extdata/survey_data.sav"
sleepers_model_file = "../data/sleepers.plr.RDS"
sleepers_final_model_file = "../data/sleepers.plr.final.RDS"
buyer_model_file = "../data/buyer.glm.RDS"
buyer_final_model_file = "../data/buyer.glm.final.RDS"
buyer_best_personals_file = "../data/buyer_best_personals.xlsx"
cluster_model_file = "../data/buyer.cluster.RDS"

Introduction

In this vignette we will present the analysis of the survey data provided by Gradient Metrics. The data was gathered to find out the best marketing message for the mobile application that helps people with sleeping problems.

The parts of the message were organized into groups ("attributes") as follows:

#library(sleepers)
library(haven)
library(dplyr)
library(stringr)

experiment_data = read_sav(experiment_data_file)

# perform data adjustment, factorization, normalization...

# factorize all columns except task, duration and price
fact_columns = setdiff(colnames(experiment_data), c("task", "duration", "price"))
experiment_data[, fact_columns] = lapply(experiment_data[, fact_columns], factor)

# now factorize duration and price, but create new columns, this is just for pretty printing
experiment_data$duration_f = factor(experiment_data$duration, levels =  c("3 months", "6 months", "12 months"))
experiment_data$price_f = factor(experiment_data$price, levels = c("$20/month", "$30/month", "$40/month"))

# save factorized attributes columns names for later usage
att_columns = setdiff(colnames(experiment_data), c("response_id", "task", "duration", "price", "answer"))

# cast duration to numeric and normalize it to 0-1 range
# duration has 3 values c("3 months", "6 months", "12 months") which are not equidistant
# we want to capture this difference between the values and allow for other values too
experiment_data$duration = as.numeric(unlist(lapply(str_split(experiment_data$duration, "[[:space:]]+month(s)*"), "[[", 1)))
min_experiment_data_duration = min(experiment_data$duration)
max_experiment_data_duration = max(experiment_data$duration)
experiment_data$duration = (experiment_data$duration - min_experiment_data_duration)/(max_experiment_data_duration - min_experiment_data_duration)

# cast price to numeric and make logarithm
# price has 3 values c("$20/month", "$30/month", "$40/month") and these are equidistant
# we want to capture the ordering of these values on logarithmic scale (usually better for prices because
# it givess smaller difference on higher prices than on lower prices - more realistic) and allow for other values too
experiment_data$price = as.numeric(gsub("\\$", "", unlist(lapply(str_split(experiment_data$price, "/month(s)*"), "[[", 1))))
experiment_data$log_price = log(experiment_data$price)

# answer has only numbers and we want to have text descriptions
levels(experiment_data$answer) = c("Very unlikely", "Somewhat unlikely", "Somewhat likely", "Very likely")
experiment_data$answer = factor(experiment_data$answer, levels = c("Very unlikely", "Somewhat unlikely", "Somewhat likely", "Very likely"), ordered = TRUE)

# IMPORTANT: create a predict_transform function that will take new data given in the same format as 
# experiment_data and perform adjustments as described above to be able to make predictions.
predict_transform = function(new_data) {
  for (cn in setdiff(fact_columns, "answer")) {
    new_data[[cn]] = factor(new_data[[cn]], levels = levels(experiment_data[[cn]]))
  }
  if (!is.null(new_data[["answer"]])) {
    # here we assume that the answers are given as in the original data, with numbers indicating the answer
    new_data[["answer"]] = factor(new_data[["answer"]])
    levels(new_data[["answer"]]) = levels(experiment_data[["answer"]])
  }
  new_data$duration = as.numeric(unlist(lapply(str_split(new_data$duration, "[[:space:]]+month(s)*"), "[[", 1)))
  new_data$duration = (new_data$duration - min_experiment_data_duration)/(max_experiment_data_duration - min_experiment_data_duration)
  new_data$price = as.numeric(gsub("\\$", "", unlist(lapply(str_split(new_data$price, "/month(s)*"), "[[", 1))))
  new_data$log_price = log(new_data$price)
  #new_data$price = (new_data$price - min_experiment_data_price)/(max_experiment_data_price - min_experiment_data_price)
  new_data
}

# printout attributes and values
r = lapply(att_columns, function(cn) {
  cn_levels = levels(experiment_data[[cn]])
  print(paste0("Attribute ", cn, " (", length(cn_levels), " values): ", paste(cn_levels, collapse = " | ")))
})

# just checking if every respondent answered the same number of questions > YES
# respondent_tasks = experiment_data %>% group_by(response_id) %>% summarise(max_task = max(task))
# max(respondent_tasks$max_task)-min(respondent_tasks$max_task)

There were r length(levels(experiment_data$response_id)) respondents and each respondent answered r max(experiment_data$task) questions.

Each question had r length(levels(experiment_data$answer)) answer options: r paste0(levels(experiment_data$answer), collapse = " | "). Respondent chose one answer option on each question.

Let's have a look at how many times each attribute was shown and the frequency distribution of the answers:

summary(experiment_data[, c(att_columns, "answer")])
# here we also see that we don't have any NAs, missing data

The attributes frequencies seem equally distributed. The answers frequencies tend to be "Very unlikely" but this can be simply a result of lack of buying intent by the respondents.

Since the survey data is actually a panel data (many answers from the same respondent) we will also check if there are some inconsistencies or strange behavior with respect to the respondents. Here is a summary of the buying intent answers for a few respondents:

rank_1 = experiment_data %>% group_by(response_id) %>% 
  summarise(min_buying_intent = round(min(as.numeric(answer)), 2), 
            avg_buying_intent = round(mean(as.numeric(answer)), 2),
            max_buying_intent = round(max(as.numeric(answer)), 2),)
colnames(rank_1)[1] = "response_id"
head(rank_1, n = 10)

Some respondents seem to be more willing to buy, some less, but we can notice that e.g. respondent with id "R_0ezucdFLFLIYzK1" has the minimum and maximum buying intent 1. This means that all his/her answers were "Very unlikely". Such respondents pose a problem for our (later) modeling because their answers provide no information about the value of attributes, they just "pull" everything towards their single answer, in this case "Very unlikely".

Let's see how many respondents like this one we have - always giving the same answer, grouped with respect to the buying intent options:

always_the_same_answer = rank_1$min_buying_intent == rank_1$max_buying_intent
paste0("Always the same answer: ", sum(always_the_same_answer))
paste0("All Very unlikely: ", sum(rank_1$max_buying_intent == 1))
paste0("All Somewhat unlikely: ", sum((rank_1$min_buying_intent == rank_1$max_buying_intent) & (rank_1$max_buying_intent == 2)))
paste0("All Somewhat likely: ", sum((rank_1$min_buying_intent == rank_1$max_buying_intent) & (rank_1$max_buying_intent == 3)))
paste0("All Very likely: ", sum(rank_1$min_buying_intent == 4))

This is quite a big number (r round((sum(always_the_same_answer)/nrow(rank_1))*100, 2)%) of people who constantly chose the same answer. Since the goal of this analysis is to find how the attributes affect the buying intent, we could say that for this group of respondents - it doesn't!

If this was not panel data we could conclude that the attributes are simply not driving enough buying intent. But since we have a panel data, we tend to conclude that the respondents who always chose the same answer either did not understand the task or simply don't care about the whole survey and didn't think about the questions at all. In both cases their answers are invalid.

That is why we will exclude the answers given by these people from the further analysis. This doesn't mean that these respondents shouldn't be further contacted or similar (especially those that always wanted to buy), but for the sake of this analysis their answers are invalid.

Note: we tried using all the data, but the results were much statisically weaker (e.g. much larger AIC) as expected.

experiment_data = experiment_data[experiment_data$response_id %in% rank_1[!always_the_same_answer, "response_id", drop = TRUE], ]

Besides the survey about the best marketing message for the mobile application, respondents also filled a personal survey, giving information about them, their sleeping problems, ways of coping with them etc.

survey_data = read_sav(survey_data_file)
# factorize all columns
survey_data_1 = data.frame(lapply(survey_data, factor))
# levels are now numeric, we want to have text descriptions on what each level means
fact_columns_names = colnames(survey_data)[colnames(survey_data) != "response_id"]
for (cn in fact_columns_names) {
  # have to make this check because some labels were NULL, unclear why
  if (!is.null(attributes(survey_data[[cn]])$labels)) {
    levels(survey_data_1[[cn]]) = names(attributes(survey_data[[cn]])$labels)
  }
}
survey_data = survey_data_1

There were r ncol(survey_data) additional questions in this personal survey. With this additional data, we tried to segment the respondents (with respect to the answers they gave) and identify a segment/group that would be more willing to buy the mobile application.

Experiment Data Analysis and Modelling

As a first step in our analysis of the (experimental) survey data, we will make a quick "counting" analysis: let's see what is the average "buying intent" (from 1 = r levels(experiment_data$answer)[1] to r length(levels(experiment_data$answer)) = r levels(experiment_data$answer)[length(levels(experiment_data$answer))]) for different attributes and their values.

library(ggplot2)
library(gridExtra)

cutlabelsN = function(x, N = 10) {
  is_long = nchar(x) > N
  x[is_long] = paste0(substr(x[is_long], 1, N-3), "...")
  x
}

cutlabels5 = function(x) cutlabelsN(x, 5)

plot_rvar = function(rvar, df) {
  rank_1 = df[!is.na(df[[rvar]]), ] %>% group_by(get(rvar)) %>% summarise(avg_buying_intent = round(mean(as.numeric(answer)), 2), .groups = 'keep')
  colnames(rank_1)[1] = rvar
  label_fill = rep("gray", length(rank_1$avg_buying_intent))
  label_fill[rank_1$avg_buying_intent == max(rank_1$avg_buying_intent)] = "red"
  label_fill[rank_1$avg_buying_intent == min(rank_1$avg_buying_intent)] = "pink"
  ggplot(data = rank_1, aes(x = get(rvar), y = avg_buying_intent)) + scale_x_discrete(labels = cutlabelsN) + geom_col(fill = label_fill) + geom_text(aes(x = get(rvar), y = avg_buying_intent/2, label = avg_buying_intent), data = rank_1, size = 3, colour = "white") + xlab(rvar) + theme(axis.text.x = element_text(size = 8), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank())
}
ggi = lapply(att_columns, plot_rvar, df = experiment_data)
marrangeGrob(ggi, nrow = 3, ncol = 2)

We can see that the price attribute levels show relatively clear trend (higher buying intent for lower price, lower buying intent for higher price). This is expected and shows that we don't have strange behavior.

We will transform the price attribute to numeric to capture other possible values (which are not used in this experiment). We will also transform the price to its logarithm because it better captures the buyers behavior (smaller difference between higher prices than between smaller prices).

With other attributes, we can't see such a clear trend.

Let's also have a look at the average buying intents for different 2-combinations of attribute levels:

two_atts_combs = combn(x = att_columns, m = 2)
ggi2 = lapply(1:ncol(two_atts_combs), function(two_atts_i) {
  two_atts = two_atts_combs[, two_atts_i]
  fst_att = two_atts[1]
  snd_att = two_atts[2]
  #print(paste0(fst_att, ", ", snd_att))
  fst_levs = levels(experiment_data[[fst_att]])
  snd_levs = levels(experiment_data[[snd_att]])
  rank_1 = experiment_data[!is.na(experiment_data[[fst_att]]) & !is.na(experiment_data[[snd_att]]), ] %>% group_by(get(fst_att), get(snd_att)) %>% summarise(avg_buying_intent = round(mean(as.numeric(answer)), 2), .groups = 'keep')
  if (nrow(rank_1) == 0) return(NULL)
  colnames(rank_1)[1:2] = c(fst_att, snd_att)
  label_fill = rep("gray", length(rank_1$avg_buying_intent))
  label_fill[rank_1$avg_buying_intent == max(rank_1$avg_buying_intent)] = "red"
  label_fill[rank_1$avg_buying_intent == min(rank_1$avg_buying_intent)] = "pink"
  ggplot(data = rank_1, aes(x = get(fst_att), y = get(snd_att))) + scale_x_discrete(labels = cutlabelsN) + scale_y_discrete(labels = cutlabelsN) + geom_label(aes(x = get(fst_att), y = get(snd_att), label = avg_buying_intent), data = rank_1, colour = "white", fill = label_fill) + xlab(fst_att)+ ylab(snd_att) + theme(axis.text.x = element_text(size = 8), axis.text.y = element_text(size = 8))
})
ggi2 = ggi2[!sapply(ggi2, is.null)]
marrangeGrob(ggi2, nrow = 2, ncol = 2)

#grid.arrange(ggi[[1]], ggi[[2]], ggi[[3]], ggi[[4]], ggi[[5]], ggi[[6]], ggi[[7]], ggi[[8]], ggi[[9]], ncol = 2)

We can see that some combinations of attributes have somewhat bigger difference between the smallest (pink) and largest (red) average buying intent, e.g. offer + rtb, offer + price, outcome + price, rtb + price.... This indicates that some of these combinations could have an interaction effect, where one value combined with another value gives a cumulative (positive or negative) effect on the average buying intent. Thus maybe we should include these combinations in the modeling.

It is now time to build a model. Since our output variable answer is an ordinal variable (has comparable values), we will use an ordinal logistic regression model with answer as the output variable and all other variables as predictors.

We get the following summary:

library(MASS)

options(contrasts = c("contr.treatment", "contr.poly"))

if (file.exists(sleepers_model_file)) {
  sleepers.plr = readRDS(sleepers_model_file)
} else {
  sleepers.plr <- polr(answer ~ duration + offer + outcome + log_price + rtb + social_proof, data = experiment_data, Hess = TRUE)
  # save the model 
  saveRDS(sleepers.plr, sleepers_model_file)
}

#logLik(sleepers.plr)

source("../R/summary_helper.R")

sumtable = get_summary_table_for_polr_model(sleepers.plr, experiment_data[1, c("offer", "outcome", "rtb", "social_proof"), drop = FALSE])

#sumtable_cis = confint(sleepers.plr)

library(tibble)

sumtable[, c("Value", "p.value", "signif", "odds_effect")] %>% tibble::rownames_to_column("rnames") %>% mutate(rnames = substr(rnames, 1, 30), odds_effect = format(odds_effect, digits = 3)) %>% tibble::column_to_rownames("rnames")

Each coefficient represents attribute value (numeric or a level). Without getting into too many details about this output format, we can see (from signif column) that only r paste0(rownames(sumtable[sumtable$signif != "", ]), collapse = ", ") coefficients are significant (not 0 with high probability).

Negative coefficients in the table above mean that increasing the corresponding attribute value reduces the odds of the buying intent and vice versa - positive coefficients mean that increasing the corresponding attribute value increases the odds of the buying intent. So e.g. increasing price (with its negative coefficient r sumtable["price", "Value"]) reduces the buying intent while giving "scientific evidence" in social_proof (with its positive coefficient r sumtable["social_proofscientific evidence", "Value"]) increases the buying intent compared with the "r levels(experiment_data$social_proof)[1]".

The following coefficients are significant and decrease the buying odds: r paste0(rownames(sumtable[(sumtable$signif != "") && (sumtable$Value < 0), ]), collapse = ", ").

The following coefficients are significant and increase the buying odds: r paste0(rownames(sumtable[(sumtable$signif != "") && (sumtable$Value > 0), ]), collapse = ", ").

duration is not significant, so it has no impact on buying odds.

offer has no significant values, so it has no impact on buying odds.

outcome has no significant values, so it has no impact on buying odds.

log_price is significant with a negative coefficient, so lower log_price (meaning lower price) increases buying odds, while higher log_price (meaning higher price) decreases buying odds. This is natural and expected.

rtb is significant on a few values, but the coefficients are negative, so respondents prefer r levels(experiment_data$rtb)[1] than the other values.

social_proof is significant on a value "scientific evidence" and the coefficient is positive, so respondents prefer "scientific evidence" than the other values.

All intercepts are significant and negative which shows that when all attributes are at 0 or at their first level, respondents tend to be more willing to buy than not to buy. So negative coefficients increase buying odds - this is contrary to the other coefficients but it's due to some historical notation reasons. But since intercepts are not something we can control, we will focus on the non-intercept coefficients.

All coefficients are given on a logarithmic scale, so to find how they really affect the odds of buying intent, we have to exponentiate them. This is the "odds_effect" column. If this number is < 1, the attribute value decreases the odds of buying. If this number is > 1, the attribute value increases the odds of buying.

A lot of coefficients are non significant and a model is better if it has less parameters so we tested if we can remove some variables from the model or improve the model by adding interactions (as explained before with the graphics).

#addterm(sleepers.plr, ~.^2, test = "Chisq")
#sleepers.plr2 <- stepAIC(sleepers.plr, ~ . ^2)
#sleepers.plr2$anova
#anova(sleepers.plr, sleepers.plr2)

We found that we could remove duration, offer, outcome without losing statistical quality (AIC has actually decreased around 5). Thus our final model summary is:

if (file.exists(sleepers_final_model_file)) {
  sleepers.plr.final = readRDS(sleepers_final_model_file)
} else {
  sleepers.plr.final <- polr(answer ~ log_price + rtb + social_proof, data = experiment_data, Hess = TRUE)
  # save the model 
  saveRDS(sleepers.plr.final, sleepers_final_model_file)
}

#logLik(sleepers.plr.final)

sumtable = get_summary_table_for_polr_model(sleepers.plr.final, experiment_data[1, c("rtb", "social_proof"), drop = FALSE])

#sumtable_cis = confint(sleepers.plr.final)

sumtable[, c("signif", "explanation")] %>% tibble::rownames_to_column("rnames") %>% mutate(rnames = substr(rnames, 1, 30)) %>% tibble::column_to_rownames("rnames")

Since we built an ordinal regression model, we have to check its main assumption (parallel regression assumption, a.k.a. the proportional odds assumption).

library(brant)
brant(sleepers.plr.final)

Looking at the probability column (it's the 3rd number in each row, some are moved to left because of an error in tabulating the output), we can see that all coefficients except log_price and Omnibus have probability of the assumption H0 at least 0.05 (5%). This means that we can trust the H0 and the modeling assumption holds for these coefficients.

We ignore the Omnibus 0 probability because it's essentially a product of all the other probabilities and thus very small by default.

log_price doesn't follow the proportional odds assumption, but this is expected, because influence of the price on the buying intent is not proportional, e.g. a higher price will affect the difference between Somewhat unlikely and Very unlikely less than the difference between Very likely and Somewhat likely.

Thus altogether we can proceed with the analysis.

#pr <- profile(sleepers.plr)
#confint(pr)
#plot(pr)
#pairs(pr)

To check how our model works on the existing data, we ran a check with the following result:

new_data = read_sav(experiment_data_file)
new_data = predict_transform(new_data)
# clear the same invalid respondents as previously
new_data = new_data[new_data$response_id %in% rank_1[!always_the_same_answer, "response_id", drop = TRUE], ]

# do we want to predict class = the highest probability will always win?
# note: this never gives the Very likely answer
#new_data_predicted_class = predict(sleepers.plr, new_data, type = "class")
# or do we want to take probabilities and make sample with these probabilities to determine class?
new_data_predicted_probs = predict(sleepers.plr.final, new_data, type = "p")
new_data_predicted_class = unlist(lapply(1:nrow(new_data_predicted_probs),function(i) sample(colnames(new_data_predicted_probs), 1, replace = TRUE, prob = new_data_predicted_probs[i, ])))
new_data_predicted_class = factor(new_data_predicted_class, levels = levels(experiment_data$answer))

library(caret)
confusionMatrix(new_data_predicted_class, new_data$answer)

Overall accuracy (around 26%) is not too good which indicates that the given attributes are not very good predictors of buying intent.

The Best Message

Summing everything up, the best marketing message would consist of:

Duration: 12 months (or any other value because it did not have a significant impact, but a longer value of duration would benefit the application seller)
Offer: improve your health for the long-run (or any other value because it did not have a significant impact)
Outcome: breaking bad habits and creating new routines (or any other value because it did not have a significant impact)
Price: $20/month
Rtb: a program created just for you OR cognitive behavioral therapy
Social proof: scientific evidence

The lowest price is the least beneficial for the application seller, but when combined with the longer duration, which did not seem to bother the respondents, the overall benefit would be higher than pushing higher prices with the shorter duration.

Respondents Segmentation and Predicting Buyers

Now we will analyze the other part of the survey data, related to demographic and other personal information.

Since the data contains quite a lot of missing information (NA), we cleaned and imputed meaningful data wherever we could.

We initially removed the weights variable because we didn't understand its meaning and interpretation.

# order survey_data by response_id
survey_data = survey_data[order(survey_data$response_id), ]

# Try to clean up NAs.

# demography
# "d_urban" "s_gender" "s_race" "d_education" "s_hhincome" "s_problem" no NAs
# "d_marital" "d_h_hnumber" "d_parent" no NAs
# "d_child_infant" "d_child_young" "d_child_older" all have 592 NAs, probably corresponding to the 592 where d_parent = No
# we can set these NAs to appropriate levels
survey_data[is.na(survey_data$d_child_infant), ]$d_child_infant = "None"
survey_data[is.na(survey_data$d_child_young), ]$d_child_young = "None"
survey_data[is.na(survey_data$d_child_older), ]$d_child_older = "None"
# "d_politics" 25 NAs, change to Other
survey_data[is.na(survey_data$d_politics), ]$d_politics = "Other"
# "d_political_view" "d_employment" 
# "d_work_schedule" "d_work_hours" both have 459 NAs, probably corresponding to d_employment = Temporarily laid off,
# Unemployed, Retired, Permanently disabled, Taking care of home or family, Student, Other (total 459)
# we will put Don’t know / not applicable for d_work_schedule
survey_data[is.na(survey_data$d_work_schedule), ]$d_work_schedule = "Don’t know / not applicable"
# we will put Less than 20 for d_work_hours
survey_data[is.na(survey_data$d_work_hours), ]$d_work_hours = "Less than 20"
# "s_region" "s_age" "weights"

# philosophy is ok, no NAs
#summary(survey_data[, grep("m1_philosophy", colnames(survey_data))])

# attitudes is ok, no NAs
#summary(survey_data[, grep("m2_attitudes", colnames(survey_data))])

# helper function, replaces NAs in factor columns defined by nameregexp
# with new level NAlevel and adds the new level to the factor levels for the columns
replaceNAandrefactor = function(nameregexp, NAlevel) {
  for (cn in grep(nameregexp, colnames(survey_data))) {
    levels(survey_data[, cn]) <<- c(levels(survey_data[, cn]), NAlevel)
    survey_data[is.na(survey_data[[cn]]), cn] <<- NAlevel
  }
}

# awareness is not ok, many NAs, probably related to the fact that people don't use/know some app
# we replace NA with None and refactor the variable
#summary(survey_data[, grep("m2_awareness", colnames(survey_data))])
# source is not OK, similar to awareness
#summary(survey_data[, grep("source", colnames(survey_data))])
# behavior_N is not OK, similar to awareness
#summary(survey_data[, grep("behavior_[[:digit:]]+", colnames(survey_data))])
replaceNAandrefactor("m2_awareness|source|behavior_[[:digit:]]+", "None")
# behavior_a_N is not OK, NAs probably mean I’m not sure
#summary(survey_data[, grep("behavior_[[:alpha:]]+", colnames(survey_data))])
replaceNAandrefactor("behavior_[[:alpha:]]+", "I’m not sure")

# awareness_apps have a "None" as a second level, we will change that so "None" will be the 0-level which is more logical
for (cn in colnames(survey_data)[grep("m2_awareness_apps_", colnames(survey_data))]) {
  survey_data[[cn]] = factor(survey_data[[cn]], c("None", setdiff(levels(survey_data[[cn]]), "None")))
}

# order rank_1 by response_id to be able to compare with survey_data
rank_1 = rank_1[order(rank_1$response_id), ]

# check that the response_ids are equally ordered both in rank_1 and survey_data
#sum(levels(rank_1$response_id)[rank_1$response_id] == levels(survey_data$response_id)[survey_data$response_id]) == nrow(rank_1)

buyer_levels = c("non-buyer", "buyer") # when casting to numeric - 1, non-buyer is 0, buyer is 1

# make simple flag variable for buyers
rank_1$buyer = ifelse(rank_1$avg_buying_intent >= 3, "buyer", "non-buyer")
rank_1$buyer_f = factor(rank_1$buyer, levels = buyer_levels)

Since our main goal is to try to identify the buying influencing variables, we connected the information if a respondent is a buyer (has average buying intent as previously defined >= 3) or not with the survey data and we will try to identify the best predictors for the new buyer variable. With this partitioning, we have the following numbers:

table(rank_1$buyer_f)

The number of survey variables is large (99) and the number of observations (the same as number of respondents = r nrow(survey_data)) not too large, so we ran a series of Chi-squared test to see which variables show some correlation with the buyer variable. We kept these for the further analysis:

# remove weights because it seems numeric and it is not clear what it means
# find columns that have some correlation (chi-squared test) with the buyer flag

cns = c()
for (cn in setdiff(colnames(survey_data), c("response_id", "weights"))) {
  #print(cn)
  if (max(as.numeric(survey_data[[cn]])) > min(as.numeric(survey_data[[cn]]))) {
    cst = chisq.test(rank_1$buyer_f, survey_data[[cn]], simulate.p.value = TRUE)
    if (cst$p.value < 0.001) cns = c(cns, cn)
  } else {
    #print(paste0(cn, " has only one value"))
  }
}

cns

Another way to see which variables could lead to higher buying probability would be to simply count the number of buyers in different groups, defined by different variables and their levels. Here are the results of such counting analysis, where we kept only the groups with at least 50% of buyers:

buyer_cns = data.frame()
for (cn in cns) {
  buyer_cn = data.frame(table(rank_1$buyer_f, survey_data[[cn]])) %>% group_by(Var2) %>% 
    summarise(ATT_NAME = cn, BUYERS = Freq[2], PROB_BUYER_PERC = round((Freq[2]/(Freq[2]+Freq[1]))*100), .groups = 'keep') %>% 
    filter(PROB_BUYER_PERC > 50) %>% rename(ATT_VALUE = Var2) %>% relocate(ATT_NAME) %>% data.frame()
  if (nrow(buyer_cn) > 0) buyer_cns = rbind(buyer_cns, buyer_cn)
}
buyer_cns %>% mutate(ATT_VALUE = substr(ATT_VALUE, 1, 20))

If the application seller has ability to target the groups defined with above attributes and values, then it is probably the best way to go - these are e.g. people who already used a sleep-helping app, visited a sleep clinic, had a coach that helped them before etc.

But realistically, unless really many people fill the survey as in this project, it is hard to find the members of many of the groups above. Therefore, we will concentrate on more general variables/groups, that could be targeted with e.g. online ads or similar.

Predicting Buyers

Continuing the discussion in the previous section, we decided to concentrate on the following attributes: d_urban, s_gender, s_race, d_education, s_hhincome, m2_awareness_apps_1, m2_awareness_apps_4, m2_awareness_apps_5, m2_awareness_apps_6, m2_awareness_apps_7, m2_awareness_apps_8, m2_awareness_apps_9, m2_awareness_apps_10, m2_awareness_apps_11, m2_awareness_apps_12, m2_awareness_apps_13, m2_awareness_apps_14, d_marital, d_h_hnumber, d_parent, d_child_infant, d_child_young, d_child_older, d_politics, d_political_view, d_employment, d_work_schedule, d_work_hours, s_region, s_age. These variables represent either some general demography and could be targeted via e.g. online ads ("if you have a sleeping problem, try our app") or sleep-helping apps awareness and could again be targeted via application stores (to those that downloaded some app - "try our app").

Now we're set up to build a model that will try to predict if a respondent is a buyer or not from the variables/answers given. We get the following summary (we show only significant estimates because there are too many to show them all):

cns = c("d_urban", "s_gender", "s_race", "d_education", "s_hhincome", "m2_awareness_apps_1", "m2_awareness_apps_4", "m2_awareness_apps_5", "m2_awareness_apps_6", "m2_awareness_apps_7", "m2_awareness_apps_8", "m2_awareness_apps_9", "m2_awareness_apps_10", "m2_awareness_apps_11", "m2_awareness_apps_12", "m2_awareness_apps_13", "m2_awareness_apps_14", "d_marital", "d_h_hnumber", "d_parent", "d_child_infant", "d_child_young", "d_child_older", "d_politics", "d_political_view", "d_employment", "d_work_schedule", "d_work_hours", "s_region", "s_age")

buyer_survey_data = data.frame(buyer_f = rank_1$buyer_f, survey_data[, cns])
buyer_survey_data$buyer = as.numeric(buyer_survey_data$buyer_f) - 1

if (file.exists(buyer_model_file)) {
  buyer.glm = readRDS(buyer_model_file)
} else {
  buyer.glm <- glm(as.formula(paste0("buyer ~ ", paste(cns, collapse = " + "))), 
                   data = buyer_survey_data, 
                   family = "binomial")
  saveRDS(buyer.glm, buyer_model_file)
}

sumtable.buyer.glm = get_summary_table_for_glm_logistic_model(buyer.glm, survey_data[, cns])

sumtable.buyer.glm[sumtable.buyer.glm$signif != "", c("Estimate", "Pr...z..", "signif", "odds_effect")] %>% tibble::rownames_to_column("rnames") %>% mutate(rnames = substr(rnames, 1, 30), odds_effect = format(odds_effect, digits = 3)) %>% tibble::column_to_rownames("rnames")

As with the experimental data before, we will try to reduce the number of variables without losing statistical quality. We get:

cns = c("d_urban", "m2_awareness_apps_6", "m2_awareness_apps_9", "m2_awareness_apps_10", "m2_awareness_apps_11", "m2_awareness_apps_12", "m2_awareness_apps_13", "m2_awareness_apps_14", "d_parent", "d_politics", "d_work_schedule", "s_age")

if (file.exists(buyer_final_model_file)) {
  buyer.glm.final = readRDS(buyer_final_model_file)
} else {
  buyer.glm.final <- glm(as.formula(paste0("buyer ~ ", paste(cns, collapse = " + "))), 
                         data = buyer_survey_data, 
                         family = "binomial")
  saveRDS(buyer.glm.final, buyer_final_model_file)
}

sumtable.buyer.glm.final = get_summary_table_for_glm_logistic_model(buyer.glm.final, survey_data[, cns])

sumtable.buyer.glm.final[sumtable.buyer.glm.final$signif != "", c("Estimate", "Pr...z..", "signif", "odds_effect")] %>% tibble::rownames_to_column("rnames") %>% mutate(rnames = substr(rnames, 1, 30), odds_effect = format(odds_effect, digits = 3)) %>% tibble::column_to_rownames("rnames")

Before continuing with the analysis of the coefficients, we check how does our model work on the existing data:

library(pscl)
library(caret)
pR2(buyer.glm.final) # pseudo R-squared to check fit
predict_probs = predict(buyer.glm.final, buyer_survey_data, type = "response")
predicted_class = factor(as.numeric(predict_probs > 0.5), levels = c("0", "1"))
levels(predicted_class) = buyer_levels
confusionMatrix(data = predicted_class, reference = buyer_survey_data$buyer_f)

#coef_cis = confint(buyer.glm)

McFadden's pseudo R-squared, accuracy and other checks are ok, so we can continue with our analysis.

Now that we have coefficients of all variables that occurred in the survey, we can pull out the values of each variable that "pull" the buying odds in the positive direction (compared to other values in the same attribute):

# find the values from the possible survey data answers that maximally increase the odds of being a buyer
best_personals = data.frame()
for (an in cns) {
  an_row_indices = grep(paste0("^", an, "[^[:digit:]]"), rownames(sumtable.buyer.glm.final))
  if (length(an_row_indices) == 0) an_row_indices = grep(paste0("^", an), rownames(sumtable.buyer.glm.final))
  an_rows = sumtable.buyer.glm.final[an_row_indices, , drop = FALSE]
  # we treat all non-significant rows, including the reference level row, as 0
  # if there is no significant row with a positive estimate, we suggest all non-significant rows
  signif_indices = an_rows$Pr...z.. < 0.1
  signif_an_rows = an_rows[signif_indices, , drop = FALSE]
  if ((nrow(signif_an_rows) > 0) && (max(signif_an_rows$Estimate) > 0)) {
    max_estimate_level = rownames(signif_an_rows)[which.max(signif_an_rows$Estimate)]
    max_estimate_level_clr = gsub(an, "", max_estimate_level)
    max_estimate_level_clr_estimate = an_rows[max_estimate_level, ]$Estimate
    max_estimate_level_clr_p_value = an_rows[max_estimate_level, "Pr...z.."]
    max_estimate_level_clr_odds_effect = an_rows[max_estimate_level, "odds_effect"]
    max_estimate_level_clr_explanation = an_rows[max_estimate_level, ]$explanation
    is_single_level = 1
    best_personals = rbind(best_personals, 
                           c(an, 
                             max_estimate_level_clr, 
                             max_estimate_level_clr_estimate, 
                             max_estimate_level_clr_p_value,
                             max_estimate_level_clr_odds_effect,
                             max_estimate_level_clr_explanation,
                             is_single_level
                             )
                           )
  } else {
    non_signif_an_rows = an_rows[!signif_indices, , drop = FALSE] # at least reference level is here
    max_estimate_level = rownames(non_signif_an_rows)
    max_estimate_level_clr = gsub(paste0(an, "|, ref. level"), "", max_estimate_level)
    is_single_level = as.numeric(length(max_estimate_level_clr) == 1)
    for (mel in max_estimate_level_clr) {
      best_personals = rbind(best_personals, 
                             c(an, 
                               mel, 
                               0, 
                               1,
                               1,
                               "no effect on buying odds",
                               is_single_level
                               )
                             )
    }
  }
}
colnames(best_personals) = c("personal", "value", "estimate", "p-value", "odds_effect", "explanation", "is_single")

# have to convert back to numeric, don't know why the columns were converted to character in the loop above
numeric_columns = c("estimate", "p-value", "odds_effect")
best_personals[, numeric_columns] = lapply(best_personals[, numeric_columns], as.numeric)

# # find the best answers in the survey data to identify respondents who gave these answers
# fnd_indices = lapply(unique(best_personals[best_personals$estimate != 0, ]$personal), function(cn) {
# #fnd_indices = lapply(unique(best_personals$personal), function(cn) {
#   which(survey_data[[cn]] %in% best_personals[best_personals$personal == cn, "value"])
# })
# 
# fnd_indices_intersect = Reduce(intersect, fnd_indices)

# are these buyers?
#table(buyer_survey_data[fnd_indices_intersect, ]$buyer_f)

library(openxlsx)
write.xlsx(best_personals, buyer_best_personals_file)

# printout details and the total effect on odds for the best answers
#total_odds_effect = prod(best_personals[best_personals$personal > 0, ]$odds_effect)
#best_personals[best_personals$estimate > 0, c("personal", "value", "odds_effect")] %>% mutate(value = substr(value, 1, 20))
best_personals[, c("personal", "value", "odds_effect")] %>% mutate(value = substr(value, 1, 20))

The results intuitively make sense: higher probability buyers are

urban (more sleeping problems, more used to using mobile apps),
users of other sleep helping apps (Whoop, Sleep quality, Oura ring, Sleepio),
not users of the FitBit app,
parents,
with fixed working schedule (a lot of work),
with rotating work schedule (repetitive changes, harder to get a good sleeping habit),
younger (more inclined to using mobile apps for help).

So the best approach to find buyers in general would be to target people that satisfy the values of the attributes above.

Clustering Respondents

We also tried to cluster the respondents according to the answers given in the survey to see if we can correlate some segments with the buyers. This was more of a check of the results above because we answered the business question.

The first step in clustering is to define the number of clusters we wish to create. This is usually done with a so called "elbow method". A good guess for the number of clusters is picked at a point where a sudden change in direction (an "elbow") can be seen in the graphs below:

source("../R/clustering.R")

maximal_k = 10

clust_survey_data = buyer_survey_data[, !(colnames(buyer_survey_data) %in% c("buyer", "buyer_f"))]
clust_survey_data[, colnames(clust_survey_data)] = lapply(clust_survey_data, as.numeric)

elbow_method(clust_survey_data, k = maximal_k, method = "kmeans")

elbow_method(clust_survey_data, k = maximal_k, method = "hclust")

The "WCSS" is a total sum of squares within all the clusters. Where the sudden change in direction of the graph occurs is a probable good guess of the real number of clusters in the data.

The graphs above show an "elbow" only for 2 clusters so we proceed with 2 clusters. We tried 2 clustering methods - k-means and hierarchical clustering. k-means turned out better in a sense of lesser total sum of squares within all the clusters.

### --> Define the number of clusters to be used by all algorithms
n_clusters = 2

library(fpc)

if (file.exists(cluster_model_file)) {
  cluster_model = readRDS(cluster_model_file)
} else {
  clusboot = clusterboot(clust_survey_data, clustermethod = kmeansCBI, k = n_clusters)
  #clusboot = clusterboot(clust_survey_data, clustermethod = hclustCBI, method = "ward.D", k = n_clusters)
  cluster_model = list("model" = clusboot$result, "clustering" = clusboot$result$partition)
  saveRDS(cluster_model, cluster_model_file)
}

dmodel = cluster_model$model
dclustering = cluster_model$clustering
#wss.total(clust_survey_data, dclustering)

princ = prcomp(clust_survey_data)
nComp = 2
project = predict(princ, newdata = clust_survey_data)[, 1:nComp]
project.plus = cbind(as.data.frame(project),
                     cluster = as.factor(dclustering))
print(ggplot(project.plus, aes(x = PC1, y = PC2)) +
  geom_point(aes(color = cluster, shape = cluster)))

As the image above shows, 2 clusters are well separated. Let's see how do the clusters correspond to buyers:

# ggplot(data = buyer_survey_data, aes(x = d_urban, fill = clustering)) + geom_bar(position = "dodge") + scale_x_discrete(labels = cutlabelsN) + xlab("d_urban")

buyer_survey_data$clustering = factor(dclustering)

table(buyer_survey_data$buyer_f, buyer_survey_data$clustering)

Cluster 2 has a majority non-buyers. Cluster 1 has a majority of buyers, although not so much prevailing.

Now we will compare the variables identified as the most important in predicting buyers in the previous section with respect to the clusters we just created. These variables (and their levels) are:

best_personals_single_i = (best_personals$is_single == 1)
paste0(best_personals[best_personals_single_i, ]$personal, ":", best_personals[best_personals_single_i, ]$value)

We want to see if there are some patterns that connect the clusters and the variables above. Here is a list of bar-charts showing the distribution of each variable with respect to the clusters:

#ggi = lapply(setdiff(colnames(buyer_survey_data), c("buyer", "buyer_f", "clustering")), 
ggi = lapply(best_personals[best_personals$is_single == 1, ]$personal, 
             function(cn) {
               ggplot(data = buyer_survey_data, aes(x = get(cn), fill = clustering)) + geom_bar(position = "dodge") + scale_x_discrete(labels = cutlabels5) + xlab(cn)
             }
             )
marrangeGrob(ggi, nrow = 3, ncol = 2)

We can see e.g.:

d_urban:Urban (An area that is commonly within a town or city with a large population) has more cluster 1 respondents, which are higher probability buyers. This is consistent with previous conclusions.
m2_awareness_apps_9:None has more cluster 1 respondents, which are higher probability buyers. This is consistent with previous conclusions.
m2_awareness_apps_10:Whoop has more cluster 1 respondents, which are higher probability buyers. This is consistent with previous conclusions.
m2_awareness_apps_11:Sleep quality has more cluster 1 respondents, which are higher probability buyers. This is consistent with previous conclusions.
m2_awareness_apps_12:Oura ring has more cluster 1 respondents, which are higher probability buyers. This is consistent with previous conclusions.
m2_awareness_apps_13:Sleepio has more cluster 1 respondents, which are higher probability buyers. This is consistent with previous conclusions.
m2_awareness_apps_14:None has more cluster 1 respondents, which are higher probability buyers. This is consistent with previous conclusions.
d_parent:Yes has more cluster 1 respondents, which are higher probability buyers. This is consistent with previous conclusions.

We can conclude that all our previous conclusions and the new clustering support each other in the "prediction" of buyers.

We also looked at the clustering with respect to the other variables that previously turned insignificant, in hope that we could find some additional positive predictors for buyers (from the clustering itself).

#ggi = lapply(setdiff(colnames(buyer_survey_data), c("buyer", "buyer_f", "clustering")), 
ggi = lapply(c("d_employment", "d_work_schedule", "d_work_hours"), 
             function(cn) {
               ggplot(data = buyer_survey_data, aes(x = get(cn), fill = clustering)) + geom_bar(position = "dodge") + scale_x_discrete(labels = cutlabels5) + xlab(cn)
             }
             )
marrangeGrob(ggi, nrow = 2, ncol = 2)

Although some values have clearly more cluster 1 respondents and thus could have a positive influence on the buying odds, we tend to trust the real prediction model that we built before because clustering algorithm was not run as a model for predicting the buyers (but finding similarities) and because there are not many clustering differences, except for those already matched with the prediction model. So we will keep the buying predictors from the prediction model.

The Best Target Public

All together, here are all the variables and levels that have a high probability of a positive effect on buying odds:

d_urban:Urban (An area that is commonly within a town or city with a large population),
m2_awareness_apps_9:None,
m2_awareness_apps_10:Whoop,
m2_awareness_apps_11:Sleep quality,
m2_awareness_apps_12:Oura ring,
m2_awareness_apps_13:Sleepio,
m2_awareness_apps_14:None,
d_parent:Yes.

Discussion

If the survey was choice based instead of ratings based we could make a better analysis, e.g. we would avoid having respondents that always pick the same answer.

It would be better if we had more choice experiment survey data instead of so many demographic survey data (100 variables).

If we had clearer explanations on the survey variables, perhaps it could help in better predictions.

dkopcino/sleepers documentation built on Dec. 31, 2020, 11:21 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

dkopcino/sleepers
Tools and Analysis for the Gradient Metrics Homework Assignment

In dkopcino/sleepers: Tools and Analysis for the Gradient Metrics Homework Assignment

Introduction

Experiment Data Analysis and Modelling

The Best Message

Respondents Segmentation and Predicting Buyers

Predicting Buyers

Clustering Respondents

The Best Target Public

Discussion

R Package Documentation

Browse R Packages

We want your feedback!

dkopcino/sleepers Tools and Analysis for the Gradient Metrics Homework Assignment

In dkopcino/sleepers: Tools and Analysis for the Gradient Metrics Homework Assignment

Introduction

Experiment Data Analysis and Modelling

The Best Message

Respondents Segmentation and Predicting Buyers

Predicting Buyers

Clustering Respondents

The Best Target Public

Discussion

R Package Documentation

Browse R Packages

We want your feedback!

dkopcino/sleepers
Tools and Analysis for the Gradient Metrics Homework Assignment