knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(choicetools)
In this vignette, we demonstrate a simple choice-based conjoint (CBC)
analysis project, end to end, using the choicetools
package.
Choice-based conjoint is a variety of conjoint analysis in which respondents choose among several (typically 2-4) products that differ in various attributes such as brand, performance, and price. Each attribute will have two or more levels, such as brand name or specific prices. A complete product (aka 'concept') is a fully specified set of attributes and levels (Chapman & Feit, 2019).
For example, a Ford Truck with 300 HP engine at $35000 is an example of a product that 'conjoins' four attributes with specific levels: body style (with the level 'truck'), brand (Ford), engine size (300 HP), and price ($35000).
By randomizing the products' attributes and asking for repeated selection of preference, we can estimate the relative value (aka utility) of each attribute level. For instance, we can determine whether the engine size is more important than brand. We can also determine willingness to pay. For example, we can match the utility scores for engine size to those for price, to see how much more we might expect respondents to pay for a specific engine size.
Each respondent is asked to choose among a specific number of concepts (products) at a time, such as their preference among 3 choices. They do this repeatedly, typically 8-12 times (aka tasks).
The CBC tools here are probably most useful for didactic purposes --- to demonstrate and explain conjoint analysis in a classroom setting. A complete conjoint analysis study can be run with this code and a simple spreadsheet editing program, such as Google Sheets or Microsoft Office 365. (See notes at the end for details on that.) Beyond that, many CBC studies with simple design characteristics can be estimated using the tools.
In this vignette, we will imagine a hypothetical USB flash drive with five attributes: Brand (4 choices), Style (4 choices), Price (5 price points), Color (2 choices), and Size (4 choices). Further, we assume that we will ask for preference among 3 concepts at a time, asking each respondent to judge 12 concepts (sets of concepts), with a total of N=400 respondents
We set up the study by defining those parameters:
set.seed(98103) cbc.tasks <- 12 # trials per respondent cbc.concepts <- 3 # cards per trial N <- 400 # N of respondents cbc.attrs <- c(Brand=4, Style=4, Price=5, Color=2, Size=4)
Next, we can assign labels for the attribute levels. In the present code,
this is a simple list, where you should ascertain that they are ordered
correctly. We specify brands of "Alpha", "Bravo", and so forth; hypothetical
design styles of "Shiny", "Flat", etc; and the other attributes as follow:
cbc.levels <- c("Alpha", "Bravo", "Charlie", "Delta", # Brand "Shiny", "Flat", "Sparkly", "Odd", # Style "$9", "$14", "$19", "$24", "$29", # Price "Blue", "Red", # Color "8GB", "16GB", "32GB", "64GB") # Size
We can create a product concept by randomly selecting one level for each of the attributes. Thus, we might have an Alpha brand Shiny 16GB Blue USB drive for \$19, or a Charlie Flat design, Red 64GB device for \$9. Each task in the survey will present 3 such randomly constructed concepts and ask the respondent to select the single concept that is most preferred.
Now we create an experimental design matrix for the study. The
design matrix specifies which randomly-generated concepts appear together.
To ensure the design is balanced and unbiased, we want each level to appear
approximately the same number of times, and approximately the same number
in combination with every other attribute. This is done with the
generalMNLrandomTab()
function:
cbc.tab <- generateMNLrandomTab(cbc.attrs, respondents=N, cards=cbc.concepts, trials=cbc.tasks ) knitr::kable(head(cbc.tab))
This function iteratively searches randomized designs and retains a design if it has improved balance vs. the previous best design. The resulting design matrix specifies the level of each attribute that might be shown on the survey.
In most cases, you would obtain this design matrix from the survey platform in which the survey was fielded. For instance, Sawtooth Software can export the "tab design" matrix in a compatible format. In that case, you do not need to generate a design matrix here; use that one instead. We generate the design here in order to field a survey for didactic purposes; see below.
We can also convert the "tab style" layout to a fully-extended, dummy coded matrix of 0 and 1 entries:
cbc.des <- convertSSItoDesign(cbc.tab) # dummy coded matrix knitr::kable(head(cbc.des, 9))
In this case, we see that the first product concept is a combination of Brand 2, Style 3, Price level 2, and so forth.
Given the design matrix, we can see how the survey choice tasks might
appear (with minimal instruction). writeCBCdesignCSV()
will produce a
minimal "survey" in CSV format:
writeCBCdesignCSV(head(cbc.tab, 6), cards=3, trials=2, attr.list=cbc.attrs, lab.attrs=names(cbc.attrs), lab.levels = cbc.levels)
By "fielding" such a survey in a spreadsheet editor --- such as Google
Sheets or Microsoft Office --- we can easily demonstrate a CBC survey in
a classroom setting. (To design a better looking and more robust survey
for real respondents, use a survey authoring platform such as Sawtooth
Software or Qualtrics.) We use the digest
package (Eddelbuettel et al, 2018)
to include a hash value for the design matrix as noted at the top of the CSV.
This helps ensure that the collected data match the design matrix when choices
are imported.
The simplest form of estimation is an aggregate multinomial logit model (MNL). This estimates a simple multinomial (choice among options) model for the entire sample. The estimated coefficients are known as part worths ("part" because the values for multiple attributes are added together for the full utility of a concept). In an MNL model, the part worths sum to zero across the levels of each attribute.
To see how this works, we use generateRNDpws()
to make up some zero-sum part worths,
and then pickMNLwinningCards()
to find the preferred concept based on those
part worths for each choice set:
cbc.pws <- generateRNDpws(cbc.attrs) # make up some zero-sum part worths cbc.win <- pickMNLwinningCards(cbc.des, cbc.pws) knitr::kable(head(cbind(cbc.win, cbc.des), 9))
In the first set of choices -- the first three rows -- the second concept
(Brand 1, Style 2, Price 3, etc.) was preferred and selected in the cbc.win
vector.
Now we estimate the part worths based on those "observed" choices, and plot those estimates against the original part worths. We see a near perfect recovery of the part worth values (aside from minor rescaling of the magnitude):
cbc.mnl <- estimateMNLfromDesign(cbc.des, cbc.win, cards=cbc.concepts) plot(cbc.pws, cbc.mnl)
Of course it is more interesting to collect data from real individuals! Most of the time one would do that by fielding a survey online using survey authoring tools such as Sawtooth Software or Qualtrics. Such tools are able to display CBC tasks in a friendly way and require users to make tradeoffs among concepts.
For this vignette, however, we will write out the CSV file, and then read the data back:
csv.filename <- "~/Downloads/testCBC.csv" writeCBCdesignCSV(cbc.tab, filename=csv.filename, # filename="" for console cards=3, trials=12, attr.list=cbc.attrs, lab.attrs=names(cbc.attrs), lab.levels = cbc.levels, overwrite=TRUE) # read the CSV csvfile.in <- readLines(csv.filename)
If respondents had filled in choices, we could estimate the MNL model from them. Because they have not, we will make some random choices and rewrite the CSV:
# Fill in all of the choices with random choices (1, 2, or 3) lines.with.choices <- which(grepl("CHOICE for Trial [0-9]+", csvfile.in)) csvfile.in[lines.with.choices] <- paste(csvfile.in[lines.with.choices], sample(cbc.concepts, length(lines.with.choices), replace = TRUE)) writeLines(csvfile.in, con=csv.filename)
Now we have a CSV that might have been completed by respondents (if they
were answering randomly). We read those data using readCBCchoices()
with
the design matrix (cbc.tab
) and the filename. readCBCchoices()
will check
that the design hash for the file matches the design matrix that we give it.
# get those choices cbc.choices <- readCBCchoices(cbc.tab, filename=csv.filename, cards=3, trials=12, verbose=FALSE)
We use estimateMNLfromDesign()
to estimate the part worths from the observed
data. We plot those and see that -- because the choices were assigned randomly --
the estimates are mostly close to zero, and have no correspondence to the
original part worths:
# estimate partworths (should be near 0 because all choices were random) cbc.mnl2 <- estimateMNLfromDesign(cbc.des, cbc.choices, cards=cbc.concepts, no.output = TRUE) plot(cbc.pws, cbc.mnl2) abline(h=0)
It is important to note that the estimateMNLfromDesign()
function is not
intended for production usage; it is primarily a didactic function whose
NR algorithm is easy to explain (and to read in code). For actual analyses,
we would use hierarchical Bayes estimation instead (Rossi, Allenby, and
McCulloch, 2005).
Hierachical Bayes estimation is typically used for choice models to estimate a mixed effects model. The upper level comprises fixed effects estimates for the sample, while the lower level represents estimates for each individual respondent within the overall distribution. The individual-level estimates are based on the group of multiple observations for each respondent.
If you fielded the survey online, you would commonly receive responses as 1/0
for each row, indicating whether the item was chosen (similar to our
cbc.win
vector above). For exposition here, we will use the winning cards
that were selected on the basis of our simulated part worths, given in cbc.win
,
but add a degree of randomness.
To estimate the hierarchical Bayes model, use the function
estimateMNLfromDesignHB()
. This function is primarily a wrapper that
simplifies the data setup and calls ChoiceModelR::choicemodelr()
(Sermas, 2012). Note that we use the tab format for the design matrix
(as opposed to the 1/0 dummy coded, expanded version).
This is an iterative MCMC estimation routine. For speed,
we specify 2000 total MCMC draws (burn-in and posterior); in practice,
this would typically be 10000s, as the MCMC process will need a longer
burn-in period for convergence.
estimateMNLfromDesignHB()
includes optional arguments to specify the
proportion of the MCMC chain that is regarded as posterior
draws (pitersUsed
), whether to save the draws for each respondent (drawKeep
)
and the proportion of posterior draws to retain (skipping them in the sequence
to avoid autocorrelation; drawKeepK
).
# replace 30% of the perfect "winning" vector with random draws cbc.win2 <- cbc.win cbc.replace <- sample(length(cbc.win2), length(cbc.win2)*0.3) # to replace cbc.win2[cbc.replace] <- sample(3, length(cbc.replace), replace=TRUE) # estimate using the design and winning cards cbc.hb <- estimateMNLfromDesignHB(tmp.des=cbc.tab, tmp.win=cbc.win2, kCards=cbc.concepts, kTrials=cbc.tasks, kResp=N , mcmcIters=2000)
The MCMC process shows a trace plot of the estimates for each attribute.
The estimated part worths may be
extracted from the HB model with the extractHBbetas()
function. We extract those,
add respondent IDs, and plot the values (Wickham, 2016;
Wilke, 2018). We can plot each respondent
as a separate faint point, allowing us to see the distribution of estimates;
this is especially helpful when there are groups of respondents who show differing
sets of preferences:
cbc.est <- data.frame(extractHBbetas(cbc.hb, cbc.attrs)) names(cbc.est) <- cbc.levels cbc.est$ID <- 1:nrow(cbc.est) library(ggplot2) library(reshape2) cbc.m <- melt(cbc.est, id.vars = "ID") library(ggridges) ggplot(data=cbc.m, aes(x=value, y=variable, group=variable)) + geom_density_ridges(scale=0.9, alpha=0, jittered_points=TRUE, rel_min_height=0.005, position="points_sina", point_color = "blue", point_alpha=1/sqrt(N), point_size=2.5) + ylab("Item") + xlab("Relative preference (blue circles=individuals)") + ggtitle("Preference estimates: Distribution of individuals")
The ggridges
plot is very useful to demonstrate the distribution of individuals
in HB estimates. If these were real data, we could interpret them as showing a
strong preference for Sparkly design, Red color, $14 price point, and so forth.
Estimated share of preference (aka "market share") may be estimated using
the marketSim()
function. Suppose we wanted to assess preference for a
Flat 8GB Blue drive at \$9 from Alpha, vs. an Odd 64GB Red drive at \$24
from Bravo. We specify the sets of attributes for each product and compare
them. In this case, we will use first choice preference, where
each individual is regarded as "purchasing" the item with highest preference;
other options in marketSim()
include share of preference at the individual
level, and first choice with randomization:
# cbc.levels <- c("Alpha", "Bravo", "Charlie", "Delta", # Brand 1-4 # "Shiny", "Flat", "Sparkly", "Odd", # Style 5-8 # "$9", "$14", "$19", "$24", "$29", # Price 9-13 # "Blue", "Red", # Color 14-15 # "8GB", "16GB", "32GB", "64GB") # Size 16-19 prod1 <- c(6, 16, 14, 9, 1) # Flat 8GB Blue $9 Alpha prod2 <- c(8, 19, 15, 12, 2) # Odd 64GB Red $24 Bravo usb.pref <- marketSim( cbc.est, # matrix of individual-level utilities list(prod1, prod2), # list of products to compare use.none=FALSE, # we have no "none" column style="first") # estimate share by first-choice approach # see the overall preference share for the two products colMeans(usb.pref)
Between just these two products, we estimate that
r round(colMeans(usb.pref)[1]*100)
% of respondents would prefer product 1,
the $9 flat 8GB drive from Alpha.
To use this package in a classroom setting, I suggest the following:
In classroom settings, I usually find that the individual level plots lead to interesting discussion. Disagreement, such as one respondent feeling that that her answers are not reflected in the distribution, is rare. Such discussion is often helpful to answer questions about the efficacy of choice modeling methods.
The choicetools
package includes support for many other features of CBC
models and related marketing analyses that are beyond the scope of this
vignette. Those include:
Bahna, E., and Chapman, CN (2018). Constructed, Augmented MaxDiff. In B. Orme, ed., Proceedings of the 2018 Sawtooth Software Conference, Orlando, FL.
Chapman, CN, and Feit, EMF (2019). R for Marketing Research and Analytics, 2nd ed. Chapter 13: Choice Modeling. New York: Springer.
Eddelbuettel, Dirk, with contributions by Antoine Lucas, Jarek Tuszynski, Henrik Bengtsson, Simon Urbanek, Mario Frasca, Bryan Lewis, Murray Stokely, Hannes Muehleisen, Duncan Murdoch, Jim Hester, Wush Wu, Qiang Kou, Thierry Onkelinx, Michel Lang, Viliam Simko, Kurt Hornik and Radford Neal. (2018). digest: Create Compact Hash Digests of R Objects. R package version 0.6.18. https://CRAN.R-project.org/package=digest
Rossi, PE, Allenby, GM, and McCulloch, RE (2005). Bayesian Statistics and Marketing. New York: Wiley.
Sermas, Ryan (2012). ChoiceModelR: Choice Modeling in R. R package version 1.2. https://CRAN.R-project.org/package=ChoiceModelR
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag.
Wilke, CO. (2018). ggridges: Ridgeline Plots in 'ggplot2'. R package version 0.5.1. https://CRAN.R-project.org/package=ggridges
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.