knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(zmarket) library(dplyr) library(ggplot2) library(magrittr) library(purrr)
This document contains a simple exploration and modeling of the Sleep App data provided by Gradient Metrics. Our main sources are Harrel [-@harrel] and Rao [-@rao].
To examine the experimental design, we calculate an index of all levels of discrete variables using all other discrete variables as grouping variables. For a level of a discrete variable, this index with respect to the level of another grouping variable is equal to its frequency in that level of the grouping variable divided by its frequency across all levels of the grouping variable times 100.
discrete_vars <- experiment_data %>% select(duration:social_proof) indtabs <- list() i <- 1 cat("## Tables of indexes\n Groups levels are on rows and index levels are on columns.") for (v1 in names(discrete_vars)) { for (v2 in setdiff(names(discrete_vars), v1)) { indtabs[[i]] <- index(experiment_data[[v1]], experiment_data[[v2]]) print(knitr::kable(indtabs[[i]])) i <- i + 1 } }
All index values are close to 100, which means that these variables do not have strong correlations in the overall data set. That is consistent with the description of the experiment as 12 random permutations of a message with 6 attributes.
Next we reproduce a result from the assignment to allow us to comment on it.
(levs <- experiment_data %>% select(duration:social_proof) %>% map(unique) %>% map( ~ t(t(.x))))
There are 6 attributes in the experiment with between three and six levels. If we use a dummy variable encoding of each attribute, our model would have r sum(map_dbl(levs, ~nrow(.x) - 1))
parameters. Therefore, it is not reasonable for us to fit such a model for each individual respondent. Therefore, a key modeling challenge will be fit models with relatively large numbers of parameters from relatively small numbers of observations. Our main tools will be to try to pool individuals into homogeneous groups and fit models for those groups and to apply penalties to model parameters to reduce their estimation variance.
To avoid overfitting, we remove 2 tasks from each respondent. We will later use this reserved data for model validation.
set.seed(1) wdat <- experiment_data %>% group_by(response_id) %>% slice_sample(n = 10) %>% mutate(foldid = 1:n()) vdat <- experiment_data %>% anti_join(wdat, by = c("response_id", "task"))
To get an idea about the heterogeneity in among respondents, we'll look at some summary statistics of the distribution of their answers across the ten tasks in our working data.
rstats <- wdat %>% group_by(response_id) %>% summarise( mean_ans = mean(answer), sd_ans = sd(answer), max_ans = max(answer), min_ans = min(answer), range_ans = max_ans - min_ans ) rstats %>% ggplot(aes(x = mean_ans)) + geom_histogram() rstats %>% ggplot(aes(x = sd_ans)) + geom_histogram() rstats %>% ggplot(aes(x = range_ans)) + geom_histogram() rstats %>% ggplot(aes(x = max_ans)) + geom_histogram() rstats %>% ggplot(aes(x = min_ans)) + geom_histogram()
The mean response appears to follow a mixture distribution. A certain fraction of about 250 / 892 of respondents answered 1 ("Very unlikely") in all 12 tasks. Another 70 or so respondents also provided the same answer to all tasks, resulting in about 320 respondents with an standard deviation of zero and a range of 1. Ten individuals had a minimum answer of 4 and must have answered 4 for all tasks. Models for these groups of constant-response individuals could be very simple. They could maximize AIC by having a constant intercept only.
The number of respondents in the sample was a decreasing function of the range of respondents. Ranges of 3 were even rarer than extrapolation of the linear trend from ranges of 0 to 2 would predict. Although rare, the sensitivity of these individuals to some attributes in the experiment may be valuable to know.
On balance the mean answers from the respondents that did not always answer 1 seem to be uniformly distributed. There may be more variation among groups of individuals than within individuals. This variation maybe explainable in terms of variables included in the survey.
Hmisc::describe(survey_data)
There's quite a lot to work with here. One challenge is that missing data seems common.
As a starting point, we'll fit a CR (continuation ratio) model [@harrel] that assumes homogeneity among respondents. According to Rao [-@rao], this homogeneity assumption is rarely satisfied but accounting for heterogeneity with an overly complex model could produce results that do not generalize beyond the training data and are difficult to interpret. Additionally, our time for this analysis is highly limited. Therefore, we will start with a simple model and build up complexity towards an optimal level. The continuation ratio model is a very simple model that is appropriate for an ordinal response variable.
u <- with(wdat, rms::cr.setup(answer)) crdat <- wdat[u$subs,] %>% ungroup() %>% select(task, response_id, foldid, duration:social_proof) crdat$cohort <- u$cohort crdat$Y <- u$y m1 <- rms::lrm(Y ~ cohort + duration + offer + outcome + price + rtb + social_proof, data = crdat) m1 anova(m1)
This model does not fit particularly well, but even so there are several statistically significant variables. The most important appears to be the cohort variable, followed by price and social proof.
We'll use prediction performance on the validation data as our main metric to compare this model with later models. We will use the log loss, equivalent to the negative log likelihood, of the model on the validation data as our specific metric.
valu <- with(vdat, rms::cr.setup(answer)) val_crdat <- vdat[valu$subs,] %>% ungroup() %>% select(response_id, duration:social_proof) val_crdat$cohort <- valu$cohort val_crdat$y <- valu$y calc_val_logloss <- function(mod, valdata) { vdf <- as.data.frame(valdata) # tibbles cause errors because predictrms uses [,i] to select vectors p <- predict(mod, vdf, type = "fitted") - sum(log(ifelse(vdf$y == 1, p, 1 - p))) } validation_log_losses <- list() (validation_log_losses$m1 <- calc_val_logloss(m1, val_crdat))
Next we will try relaxing the assumption in the CR model that parameters are the same in all cohorts. Said differently, we'll relax the assumption that all attributes have the same effect on ratings no matter whether the rating is greater than 1, 2, or 3. We've already seen that many 1 responses are from individuals who do seem to be affected by any attributes, so this assumption seems poor.
m2 <- rms::lrm( Y ~ cohort * (duration + offer + outcome + price + rtb + social_proof), x = TRUE, y = TRUE, data = crdat ) m2 anova(m2) (validation_log_losses$m2 <- calc_val_logloss(m2, val_crdat))
Cohort has a significant interaction with price, and overall is significant. However, the log loss of this much more complicated model is only marginally lower. We have gained very little for a great increase in model complexity. Next, we'll try to find a more optimal level of complexity with penalized maximum likelihood.
(pt <- rms::pentrace(m2, list(simple = 0, interaction = seq(0, 4000, by = 100))))
The best model seems to have twice as many degrees of freedom as our first model, but is much simpler than our second model.
m3 <- update(m2, penalty = list(simple = 0, interaction = 2000), data = crdat) (validation_log_losses$m3 <- calc_val_logloss(m3, val_crdat))
Interestingly, this model does not have lower loss than m2 in our validation data. But the difference may not be statistically significant, and m3 is considerably simpler:
rms::effective.df(m3)
We'll now try building up complexity by removing the assumption of homogeneity among respondents. We will model this first by allowing the intercept term for each respondent to vary. The full model is unlikely to be optimal so we'll again use penalized maximum likelihood to determine a more optimal level of model complexity. In fact, I am not able to fit an unpenalized model on my system due to numerical problems which appear to be caused by large variances.
m4 <- rms::lrm( Y ~ cohort + response_id + duration + offer + outcome + price + rtb + social_proof, x = TRUE, y = TRUE, penalty = 10, data = crdat ) anova(m4) (validation_log_losses$m4 <- calc_val_logloss(m4, val_crdat))
As expected from our EDA, variation among respondents explains a lot of the variation in the outcomes. Our log loss in the validation data has gone down considerably now that our model can account for the individual providing the answer. As a next step, we'll try optimizing the penalty to some extent. We could use rms::pentrace again for this but the time required to fit models with rms::lrm is becoming a source of friction in our analysis. Thus, next we try using glmnet for fitting logistic regression models with L1-regularization. The L1 component has the added benefit of encouraging sparsity in the model fit. This will have the effect of grouping some respondents with similar intercept into a homogeneous cluster. We also opt for L1 rather than elastic net regularization because there are no correlations in our predictive variables.
library(glmnet) library(doMC) registerDoMC(cores = 2) foldid <- crdat$foldid system.time( cvfit <- cv.glmnet( m4$x, m4$y, foldid = foldid, family = "binomial", type.measure = "deviance", parallel = TRUE ) ) plot(cvfit) newx <- rms::lrm( y ~ cohort + response_id + duration + offer + outcome + price + rtb + social_proof, method = "model.matrix", data = val_crdat ) stopifnot(all(colnames(newx) == colnames(m4$x))) pcvfit <- predict(cvfit, newx = newx, s = "lambda.min", type = "response") %>% as.numeric() (validation_log_losses$m5 <- -sum(log(ifelse( val_crdat$y == 1, pcvfit, 1 - pcvfit ))))
The glmnet model with a cross-validated penalty is by far the best model for predicting our validation data. As expected, this model grouped together many respondents with the intercept because the L1 penalty shrank the effect of their coefficients to zero:
colnames(newx)[which(coef(cvfit, s = "lambda.min") == 0)]
Some of the variation among individuals may be explainable in terms of variables in the survey data. Also, these variables might be useful for determining the response of individuals to certain message attributes. Before attempting to put all of the survey variables directly in the model, we'll try reducing the dimensionality of the data with PCA. This may provide limit some overfitting and provide for an easier to understand model. We'll further simplify matters by excluding variables that included missing values.
nonmissing <- purrr::map_lgl(survey_data, ~ all(!is.na(.x))) sd2 <- survey_data[, nonmissing] %>% select(-response_id,-weights) mmat <- glmnet::makeX(train = sd2) mmat2 <- mmat[,-which(apply(mmat, 2, sd) == 0)] res <- cor(mmat2, method = "pearson") corrplot::corrplot(res, method = "color", order = "hclust", tl.pos = 'n')
On second thought, there does not appear to be many very strong correlations among predictors constructed from the survey data. It may be easier and more effective to simply add these variables to the regularized regression model directly.
crdat2 <- survey_data[, nonmissing] %>% select(-weights) %>% right_join(crdat, by = "response_id") x2 <- glmnet::makeX(train = select(crdat2,-Y,-foldid,-task)) novar <- which(apply(x2, 2, sd) == 0) x3 <- x2[,-novar] system.time( cvfit2 <- cv.glmnet( x3, crdat2$Y, foldid = crdat2$foldid, family = "binomial", type.measure = "deviance" ) ) plot(cvfit2) val_crdat2 <- survey_data[, nonmissing] %>% select(-weights) %>% right_join(val_crdat, by = "response_id") newx2 <- glmnet::makeX(train = select(val_crdat2,-y)) stopifnot(all(colnames(newx2) == colnames(x2))) newx3 <- newx2[,-novar] pcvfit2 <- predict(cvfit2, newx = newx3, s = "lambda.min", type = "response") %>% as.numeric() (validation_log_losses$m6 <- -sum(log( ifelse(val_crdat2$y == 1, pcvfit2, 1 - pcvfit2) )))
Adding the survey variables led to a modest improvement of about 4 in the log loss on the validation data.
Although these did not have a big effect on validation log loss when we investigated it earlier, adding them was beneficial and perhaps the effect size will be different now that our model has grown.
fm <- ~ cohort:(duration + offer + outcome + rtb + price + social_proof) + . x4a <- model.matrix(fm, select(crdat2,-Y,-foldid,-task)) not_constant <- apply(x4a, 2, sd) > 0 x4 <- cbind("(Intercept)" = x4a[, "(Intercept)"], x4a[, not_constant]) system.time( cvfit3 <- cv.glmnet( x4, crdat2$Y, foldid = crdat2$foldid, family = "binomial", type.measure = "deviance" ) ) plot(cvfit3) newx4a <- model.matrix(fm, select(val_crdat2,-y)) newx4 <- cbind("(Intercept)" = newx4a[, "(Intercept)"], newx4a[, not_constant]) stopifnot(isTRUE(all(colnames(newx4) == colnames(x4)))) pcvfit3 <- predict(cvfit3, newx = newx4, s = "lambda.min", type = "response") %>% as.numeric() (validation_log_losses$m7 <- -sum(log( ifelse(val_crdat2$y == 1, pcvfit3, 1 - pcvfit3) )))
The reduction in log loss is much larger than before.
We will now use our model with greatest predictive performance to determine how each attribute and attribute level influences the stated likeliness to download. To begin, we will examine the estimated coefficients for the attribute main effects in our models with the penalty which minimized cross-validation deviance.
ests7 <- predict(cvfit3, s = "lambda.min", type = "coef") atts_main <- colnames(x4) %>% stringr::str_subset("^price|^offer|^outcome|^duration|^rtb|^social_proof") reflevels <- select(crdat2,-Y,-task,-foldid,-s_problem) %>% purrr::map( ~ if (is.factor(.x)) { levels(.x)[1] } else levels(factor(.x))[1]) pdata <- data.frame(coef_name = atts_main, est = ests7[atts_main,]) %>% mutate( var = stringr::str_extract(atts_main, "duration|price|offer|rtb|social_proof|outcome"), level = stringr::str_remove(atts_main, "duration|price|offer|rtb|social_proof|outcome") ) pdata %>% ggplot(aes(x = level, y = -est)) + geom_col() + facet_wrap( ~ var, scales = "free_y", ncol = 1) + coord_flip() + labs(y = "Effect of attribute levels on logit of increasing likeliness of download rating") + geom_hline(yintercept = 0, col = "grey") pdata %>% ggplot(aes(x = level, y = exp(-est))) + geom_col() + facet_wrap( ~ var, scales = "free_y", ncol = 1) + coord_flip() + geom_hline(yintercept = 1, col = "grey") + labs(y = "Ratio of odds of increasing download rating:\n odds(with with attribute level) / odds(without attribute level)")
To arrive at the interpretation of the coefficients plotted, we referred to the definition of the in equation 13.1 of Harrel [@harrel]: \begin{equation} Pr(Y = j | Y \ge j, X) = \frac{1}{1 + \exp[-(\alpha + \theta_j + X \gamma) ]}, \end{equation} where $Y$ is the rating, $X$ is the matrix of predictors, $\alpha$ is the intercept, $\theta_j$ are the cohort effects, and $\gamma$ are the regression model coefficients.
Another key point for interpretation is that these effects are contrast with a reference level for each variable. These reference levels are:
data.frame("reference" = unlist(reflevels)) %>% knitr::kable()
We used the first levels of factors in the provided survey data set as a reference level. Levels for other variables were selected by the default method of alphabetical sorting, and choosing different reference levels could be useful for directly displaying effects of interest.
In the context of our data, we can think of the linear predictor in our model as describing the logit of conditional probabilities of not increasing a download rating beyond level $j$, where we have conditioned on the event that the rating is at least $j$. Thus to obtain the effects on the logits of increasing download ratings, we simply reverse the sign. The interpretation of the exponential of the regression coefficients in terms of odds ratios is appropriate because our variables are encoded as zeros or ones to indicate absence or presence of an attribute.
Next, we consider the interactions of attributes with the cohort variables.
attrs_coh <- colnames(x4) %>% stringr::str_subset(":") pdata2 <- data.frame(coef_name = attrs_coh, estchange = ests7[attrs_coh,]) %>% mutate( var = stringr::str_extract(attrs_coh, "duration|price|offer|rtb|social_proof|outcome"), coh = stringr::str_extract(attrs_coh, "answer>=[1-3]|all"), level = stringr::str_replace_all( attrs_coh, "cohortall|cohortanswer>=[2-3]|:|duration|price|offer|rtb|social_proof|outcome", "" ) ) %>% left_join(pdata, by = c("level", "var")) %>% mutate(comb_est = est + estchange) pdata2 %>% ggplot(aes( x = level, y = -comb_est, fill = coh )) + geom_col(position = position_dodge()) + facet_wrap( ~ var, scales = "free_y", ncol = 1) + coord_flip() + scale_fill_discrete(name = "Answer level") + labs(y = "Effect of attribute levels on logit of increasing likeliness of download rating\n conditional on rating value")
The effect of an attribute level on further increasing the likelihood of download can depend on whether the increase will be from a very unlikely to somewhat unlikely (Answer group all), somewhat unlikely likely to somewhat likely (answer>=2), or somewhat likely to very likely (answer>=3). For example the $40/month price seems to keep individuals from moving from an answer of 2 to 3 or 4 more than from an answer of 3 to 4 or away from 1. Scientific evidence seems most useful in converting answers from somewhat likely to very likely.
Next we will put the overall effects of the variables in perspective with each other and with the other variables in the model. We will use the sum of the absolute values of all effects of the levels of a variable as a measure of variable importance.
pdata %>% ggplot(aes( x = var, y = abs(est), fill = paste(var, level, sep = "=") )) + geom_col() + scale_fill_discrete(name = "Attribute level") + labs(x = "Attribute", y = "Importance") + coord_flip() + theme(legend.position = "top") + guides(fill = guide_legend(ncol = 2, byrow = TRUE))
Overall, the most important attributes are price, social_proof, and rtb. Next, we'll examine the importance of interactions for each variables by summing the absolute values of their effect sizes.
pdata2 %>% ggplot(aes(x = var, y = abs(estchange))) + geom_col() + labs(x = "Attribute interactions with answer group", y = "Importance") + ylim(c(0, 2.2)) + coord_flip()
The importance of interactions is often greater than the main effects of attributes, and the sum of all interaction importances is greater than the importance of any one attribute.
Now lets see how these attribute-related effects compare to others in our model. To begin with, we'll plot the distribution of all (negated) regression coefficients.
cdf <- data.frame(ncoefficients = -ests7[, 1], vname = rownames(ests7)) cdf %>% ggplot(aes(x = ncoefficients)) + geom_histogram(binwidth = 0.1) + labs(x = "Effect on logit of increasing download rating")
Clearly, the coefficients for attributes are far from the largest ones in the model. Next, lets stratify by cohort effects, attribute effects, and respondent level effects.
cdf2 <- cdf %>% mutate( var_type = case_when( stringr::str_detect( vname, "^offer|^price|^social_proof|^rtb|^outcome|^duration" ) ~ "attribute", stringr::str_detect(vname, "^response_id") ~ "response_id", stringr::str_detect(vname, "^cohort") ~ "cohort/Intercept", vname == "(Intercept)" ~ "cohort/Intercept", TRUE ~ "survey" ) ) cdf2 %>% ggplot(aes(x = abs(ncoefficients))) + geom_histogram(binwidth = 0.1) + facet_grid(var_type ~ ., scales = "free_y") + scale_x_continuous(trans = "log1p") + labs(x = "Variable importance")
Clearly, respondent_id are among the most important variables for explaining the rated likelihood of downloading. We are able to explain some of this individual level variation with our survey variables:
cdf2 %>% filter(var_type == "survey") %>% select(Variable = vname, "Effect on logit of increasing rating" = ncoefficients) %>% knitr::kable(row.names = FALSE)
Many of these effects seem reasonable. There's a negative correlation of age with an individuals likelihood to download, which may be a sign of younger individuals being more conversant with technology. Individuals who are very interested in a coach are more likely to download. Further analysis and modeling of these data would likely yield many more potentially useful insights.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.