# overall set-up 
library(MASS)
library(flexclust)
library(stringr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(CustSegs)
data("volunteers")
vol_ch <- volunteers[-(1:2)]

Introduction

Customer preference surveys ask a question like "Please check off of the following [things] which are [important] for you." This example uses the volunteers data set from package flexclust. From the package documentation:

Part of an Australian survey on motivation of 1415 volunteers to work for non-profit organisations like Red Cross, State Emergency Service, Rural Fire Service, Surf Life Saving, Rotary, Parents and Citizens Associations, etc..

The survey has r ncol(vol_ch) preference check boxes for various motivations which were important motivators in their decision to volunteer. These are: r paste(colnames(vol_ch), collapse = ", ").

We will show how the flexclust package ...

NTS: include references to Fritz's package & papers & Sara & Fritz's marketing application papers.

An example flexclust segmentation run

First, make a data frame with just the preference choice columns in vol.

library(flexclust)
data("volunteers")
vol_ch <- volunteers[-(1:2)]
vol.mat <- as.matrix(vol_ch)
knitr::kable(vol_ch[1:5, 1:7])

Plot the percent of each preference that is checked.

``` {r echo = FALSE, fig.width = 8, fig.height = 6} vol_ch.pcts <- vol_ch %>% gather("Question", "Response") %>% group_by(Question) %>% summarize(Pct_Checked = 100.0 * sum(Response) / n())

ggplot(vol_ch.pcts, aes(x = Pct_Checked, y = reorder(Question, -as.integer(Question)))) + geom_point(size = 2) + coord_cartesian(xlim = c(0, 100)) + ylab("Preference Question") + xlab("% Checked") + ggtitle("Response to Each Preference")

Set the parameters for the cluster algorithm.

``` {r}
fc_cont <- new("flexclustControl")  ## flexclustControl object holds "hyperparameters"
fc_cont@tolerance <- 0.1   ## kcca only uses if classify == "weighted"
fc_cont@iter.max <- 30
fc_cont@verbose <- 1

fc_family <- "ejaccard"             ## distance metric

Now we can invoke kcca to do clustering. Start with three clusters to show the process.

seed1 <- 577 #243       ## Why we use this seed will become clear below
fc_seed <- seed1

num_clusters <- 3

set.seed(fc_seed)

## verbose > 0 will show iterations
vol.cl <- kcca(vol.mat, k = num_clusters, save.data = TRUE,
                control = fc_cont, family = kccaFamily(fc_family))

summary(vol.cl)

``` {r echo = FALSE}

set up plot titles for this run

main_text1 <- "Volunteers Stated Preferences Survey" sub_text <- paste0("kcca ", vol.cl@family@name, " - ", num_clusters, " clusters (seed = ", fc_seed, ")")

#### Segment separation plot (aka neighborhood plot) 

``` {r fig.width = 7, fig.height = 7.7}
vol.pca <- prcomp(vol.mat)      ## plot on first two principal components
plot(vol.cl, data = vol.mat, project = vol.pca,
     main = paste0(main_text1, " - Segment Seperation Plot"), 
     sub = sub_text)

This plot shows each respondent's preferences plotted on the surface defined by the first two principal components. Centroids of each cluster (segment) are the numbered circles. The color indicates the respondent's cluster membership. The solid line (convex hull) encloses 50% of the points in the cluster while the dashed line encloses 95% of the points. The separation between each cluster is indicated by the thinness of the line between any two centroids. Of course, the physical distance between centroids on the PC2 x PC1 plane does not correspond to the actual separation in any non-trivial problem.

Segment profile plot - the primary tool for interpreting the solution as customer segments or persona.

``` {r fig.width = 8, fig.height = 6} barchart(vol.cl, strip.prefix = "#", shade = TRUE, layout = c(vol.cl@k, 1), main = paste0(main_text1, " - Segment Profile Plot"))

This plot has a facet for each cluster plotted above. The colors are consistent between the two plots. The header in each facet gives the cluster number, the number of respondents assigned to that cluster and the percent of the total respondents. For each of the preferences, the bar width shows the proportion checked for that cluster. For reference, the overall population proportion is shown by the red line & dot. (which corresponds to the dot plot above). 

A bar is grayed out when the preference is not important is distinguishing a cluster from the others. But note that grayed bar(s) may well be important when coming up with the customer segmentation (or persona) story.

> Exercise for the reader: How would you describe each of these three segments?

These are the basic ideas behind using _flexclust_ to segment customers based on results of a stated preference survey.
Next we will look at two important practical issues:

1. Is the cluster solution stable for a given number of clusters (k)?
2. How many clusters best describe the respondents?

### Stability of the solutions for any k
#### The Stability Problem
If we run kcca() on the same data but with different starting seeds for any real customer dataset (in which clusters are typically somewhat fuzzy), we expect to get different solutions. For example comparing the above solution with two other solutions, we see the results are somewhat different:

```r
fc_cont@verbose <- 0
set.seed(fc_seed)
vol.cl <- kcca(vol.mat, k = num_clusters, save.data = TRUE,
                control = fc_cont, family = kccaFamily(fc_family))
vol.pca <- prcomp(vol.mat)      ## plot on first two principal components
plot(vol.cl, data = vol.mat, project = vol.pca,
           main = paste0("Segment Seperation Plot, k=", num_clusters, ", seed=", fc_seed))
barchart(vol.cl, strip.prefix = "#", shade = TRUE, layout = c(3, 1),
         main = paste0("Segment Profile Plot, k=", num_clusters, ", seed=", fc_seed))

fc_seed <- 215
set.seed(fc_seed)
vol.cl <- kcca(vol.mat, k = num_clusters, save.data = TRUE,
                control = fc_cont, family = kccaFamily(fc_family))
vol.pca <- prcomp(vol.mat)      ## plot on first two principal components
plot(vol.cl, data = vol.mat, project = vol.pca,
           main = paste0("Segment Seperation Plot, k=", num_clusters, ", seed=", fc_seed))
barchart(vol.cl, strip.prefix = "#", shade = TRUE, layout = c(3, 1),
         main = paste0("Segment Profile Plot, k=", num_clusters, ", seed=", fc_seed))

fc_seed <- 129
set.seed(fc_seed)
vol.cl <- kcca(vol.mat, k = num_clusters, save.data = TRUE,
                control = fc_cont, family = kccaFamily(fc_family))
vol.pca <- prcomp(vol.mat)      ## plot on first two principal components
plot(vol.cl, data = vol.mat, project = vol.pca,
           main = paste0("Segment Seperation Plot, k=", num_clusters, ", seed=", fc_seed))
barchart(vol.cl, strip.prefix = "#", shade = TRUE,  layout = c(3, 1),
         main = paste0("Segment Profile Plot, k=", num_clusters, ", seed=", fc_seed))

These three examples are similar in that there a single predominate cluster but looking closer there are important differences in how each solution is interpreted.

One Simple Solution

We need a method to explore many solutions and look for a frequent stable solution. The challenge is coming up with a measure of solution similarty. kcca() returns an object with 17 slots, so there are a lot of metrics we could devise.

Keeping it very simple, let's just look at the scatter plot of the number of members in the two largest clusters. This is easy to get from slot @clusinfo:

r str(vol.cl@clusinfo)

The plan is to will run kcca() a few hundred times, incrementing the seed with each run to get data for the plot. First we build up a data.frame capturing @clusinfo for each run, where the run is identified by the values or k and the seed.

fc_seed <- 123
num_clusters <- 3
num_trys <- 500
fc_cont@verbose <- 0
cli_trys <- NULL

# build df with cluster info for each seed
for (itry in 1:num_trys) {
  fc_seed <- fc_seed + 1
  set.seed(fc_seed)
  cli <- kcca(vol.mat, k = num_clusters, save.data = TRUE,
              control = fc_cont, family = kccaFamily(fc_family))
  cli_info <- cli@clusinfo %>%
    mutate(clust_num = row_number(),
           clust_rank = rank(desc(size), ties.method = "first")) %>%
    arrange(clust_rank) %>%
    dplyr::select(c(6, 5, 1:4))
  cli_try <- cbind(data.frame(k = num_clusters, seed = fc_seed),
                   cli_info)
  cli_trys <- rbind(cli_trys, cli_try)
}
cli_trys <- as.tbl(cli_trys)
cli_trys

It is ordered by seed (aka runID) and the cluster rank. "clust_num" is the original cluster sequence number coming out of kcca(). Notice that cluster #1 is not necessarly the largest cluster. (This is an anoyance with the "random walk" nature of kcca(). The same cluster, based on its properties, will not necessarly be in the sequence between runs.)

We need to massage cli_trys so it is suitable for plotting. At the same time we pick up the location of the peak.

# set up plot of size of rank 2 x rank 1
cli_sizes <- cli_trys %>%
  dplyr::select(k, seed, clust_num, clust_rank, size) %>%
  filter(clust_rank <= 2) %>%
  mutate(clust_label = paste0("Size_", clust_rank),
         in_order = clust_num == clust_rank) %>%
  dplyr::select(-clust_rank, -clust_num) %>%
  spread(key = clust_label, value = size) %>% 
  group_by(k, seed) %>% 
  summarize(in_order = all(in_order),
            Size_1 = min(Size_1, na.rm = TRUE),
            Size_2 = min(Size_2, na.rm = TRUE))

# get location of peak numerically with MASS:kde2d
s2d <- with(cli_sizes, kde2d(Size_1, Size_2, n = 100))
s2d_peak <- which(s2d$z == max(s2d$z))
Size_1_peak_at <- round(s2d$x[s2d_peak %% 100], 1)
Size_2_peak_at <- round(s2d$y[s2d_peak %/% 100], 1)

From which we can plot:

xend <- Size_1_peak_at + 100
yend <- Size_2_peak_at + 100
ggplot(cli_sizes, aes(Size_1, Size_2)) +
  geom_point(alpha = 0.5, size = 2) +
  stat_density2d() +
  annotate("segment", x = Size_1_peak_at, y = Size_2_peak_at,
           xend = xend, yend = yend, color = "red", size = 1) +
  annotate("text", xend, yend, label = paste0("(", Size_1_peak_at, ", ", Size_2_peak_at, ")"), vjust = 0) +
  ggtitle(paste0("Size of Cluster 2 by Size of Cluster 1 for k=", num_clusters, ", # tries=", num_trys))

Now we just need the distance of each solution's first & second cluster counts to the peak we found above:

cli_best <- cli_sizes %>%
  filter(in_order) %>%    ## just look at solutions with clusters in decending sizes
  mutate(distance = sqrt((Size_1 - Size_1_peak_at)^2 + (Size_2 - Size_2_peak_at)^2)) %>%
  arrange(distance)
cli_best

Taking the value pair (k = r cli_best[1, 1], seed = r cli_best[1, 2]) from solution closest to the peak (the top row), gives us the parameters we need to re-generate the plots for the "stable clustering" for a given k. We apply this method in "production" mode next.

Stable Clusters for k = 2, 3, 4, ...

{r echo=FALSE, fig.width = 8, fig.height = 6, cache=TRUE} clusters <- NULL for(k in 2:15) { cat("k =", k, "\n") num_clusters <- k cli <- fc_rclust(vol.mat, k=num_clusters, fc_cont, nrep=200, fc_family) cli head(cli$best, 3) fc_seed <- as.integer(cli$best[1, 2]) x <- try(set.seed(fc_seed)) if(!is.null(x)) { cat("set.seed error, invalid cli$best. Skipping k = ", k) next() } set.seed(fc_seed) vol.cl <- kcca(vol.mat, k = num_clusters, save.data = TRUE, control = fc_cont, family = kccaFamily(fc_family)) vol.cl <- fc_reorder(vol.cl) clusters <- rbind(clusters, data.frame(k = k, seed = fc_seed, cluster = vol.cl@cluster)) vol.pca <- prcomp(vol.mat) ## plot on first two principal components plot(vol.cl, data = vol.mat, project = vol.pca, main = paste0("Segment Seperation Plot, k=", num_clusters, ", seed=", fc_seed)) bc <- barchart(vol.cl, strip.prefix = "#", shade = TRUE, layout = c(num_clusters, 1), main = paste0("Segment Profile Plot, k=", num_clusters, ", seed=", fc_seed)) print(bc) }



ds4ci/CustSegs documentation built on May 15, 2019, 2:56 p.m.