This package contains functions for data preparation, parameter estimation, scoring, and plotting for the BG/BB, BG/NBD and Pareto/NBD models.
This project was funded and sponsored by the Wharton Customer Analytics Initiative (wcai.wharton.upenn.edu).
This package implements the BG/BB, BG/NBD and Pareto/NBD models, which capture and project customer purchase patterns in a typical non-contractual setting.
While these models are developed on a customer-by-customer basis, they
do not necessarily require data at such a granular level. The
Pareto/NBD requires a “customer-by-sufficient-statistic” matrix
(CBS), which consists of each customer's frequency, recency (the time
of their last transaction) and total time observed, but the timing
of each and every transaction (other than the last) is not needed by
the model. If, however, you do have the granular data in the form of
an event log (which contains at least columns for customer
identification and the time of each transaction, and potentially more
columns such as transaction amount), this package provides functions
to convert it to a CBS. You can use
dc.ReadLines to read
an event log from a comma-delimited file into a format usable by
this package; it is possible to use read.table or read.csv, but
formatting will be required afterwards. You can then convert the event
log directly to a CBS (for both the calibration and holdout periods)
using dc.ElogToCbsCbt. As the name suggests, this
function also produces a customer-by-time matrix (CBT). This matrix
consists of a row for every customer and a column for every date, and
is populated by a statistic of your choice (reach, frequency, or
spend). It is not necessary for any of the models presented in this
package, but is used as a building block to produce the CBS.
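To make the shape of these objects concrete, the following base R sketch builds a small CBS and a frequency-based CBT by hand from a toy event log. It does not call the package's conversion functions and is not how they are implemented. The data, the variable names (elog, end.cal, cbs, cbt.freq), the column names x, t.x and T.cal, and the choice of weeks as the time unit are all assumptions made for illustration. Frequency is counted here as repeat transactions, that is, excluding each customer's first purchase.

```r
# Toy event log (hypothetical data): one row per transaction.
elog <- data.frame(
  cust = c("a", "a", "a", "b", "c", "c"),
  date = as.Date(c("2024-01-01", "2024-01-15", "2024-02-12",
                   "2024-01-08", "2024-01-22", "2024-01-29"))
)
end.cal <- as.Date("2024-03-04")  # assumed end of the calibration period

# One CBS row per customer, with time measured in weeks since first purchase.
cbs <- t(sapply(split(elog$date, elog$cust), function(d) {
  first <- min(d)
  c(x     = length(unique(d)) - 1,            # frequency: repeat transactions
    t.x   = as.numeric(max(d) - first) / 7,   # recency: time of last transaction
    T.cal = as.numeric(end.cal - first) / 7)  # total time observed
}))

# Frequency CBT: customers in rows, transaction dates in columns.
# table() only creates columns for dates that appear in the log.
cbt.freq <- table(elog$cust, elog$date)
```

Printing cbs shows one row per customer and three columns, which is all the Pareto/NBD and BG/NBD models need; the individual transaction dates in between are discarded.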
The BG/NBD model requires all the same inputs as the Pareto/NBD model.
The BG/BB model requires the same information as the Pareto/NBD model,
but as it models discrete transaction opportunities, this information
can be condensed into a recency-frequency matrix. A recency-frequency
matrix contains a row for every recency/frequency combination in the
given time period, and each row contains the number of customers with
that recency/frequency combination. Since frequency will always be
less than or equal to recency, this matrix will contain n(n+1)/2 + 1
rows at most, with n as the number of transaction opportunities: one
row for customers with no transactions, plus n - x + 1 possible
recencies for each frequency x from 1 to n. (For pooled data, where
customers have varying numbers of transaction opportunities, the
maximum number of rows is the sum of this expression over each unique
number of transaction opportunities.) You can convert a CBS to
recency-frequency matrices.
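The condensation into a recency-frequency matrix can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's own conversion code: the value of n, the toy customer data and the names skeleton, cust and rf are hypothetical. The first step enumerates every feasible recency/frequency pair for n opportunities (a customer with no transactions has recency zero), which gives a direct check on the maximum row count. The second step counts customers in each observed cell.

```r
n <- 6  # assumed number of transaction opportunities

# Enumerate every feasible (frequency, recency) pair:
# x = 0 forces t.x = 0; otherwise x <= t.x <= n.
skeleton <- do.call(rbind, lapply(0:n, function(x) {
  t.x <- if (x == 0) 0 else x:n
  data.frame(x = x, t.x = t.x)
}))
nrow(skeleton)  # maximum number of rows for this n

# Toy per-customer data (hypothetical), condensed by counting the
# customers that share each recency/frequency combination.
cust <- data.frame(x = c(0, 0, 2, 2, 1, 6), t.x = c(0, 0, 5, 5, 3, 6))
cust$custs <- 1
rf <- aggregate(custs ~ x + t.x, data = cust, FUN = sum)
```

Because only observed combinations appear in rf, real data will usually produce fewer rows than the skeleton.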
If you want to test the data contained in the package, or have data
formatted as a customer-by-sufficient-statistic or recency-frequency
matrix, a good starting place would be the parameter estimation
functions. Functions such as bgbb.PlotFrequencyInCalibration then give a check that
the model fits the data in-sample. Further plotting functions,
comparing actual and expected results, are labelled
“pnbd.Plot...”, “bgnbd.Plot...” and “bgbb.Plot...”.
The building blocks of these functions are also provided:
bgbb.ConditionalExpectedTransactions, for example, may be of interest.
This package uses the following conventions:
The time period used to estimate the model parameters is called the calibration period. Users may be accustomed to this being called the estimation period, or simply being referred to as “in-sample”. Function parameter names generally follow this convention: for example, “n.cal” is used to refer to the number of transaction opportunities in the calibration period.
The time period used to validate model performance is called the holdout period. Users may be accustomed to this being called the validation period, or simply being referred to as “out-of-sample”. Function parameters relating to this time period are generally appended with “.star”. For example, n.star is used to refer to the number of transaction opportunities in the holdout period.
As described in the papers referenced below, the BG/BB, BG/NBD and Pareto/NBD models are generally concerned with repeat transactions, not total transactions. This means that a customer's first transaction in the calibration period is usually not part of the data being modelled, because a new customer generally does not show up “on the company's radar” until after their first purchase has taken place. As a result, the modal number of repeat purchases tends to be zero. If your data does not have a relatively large number of customers with zero transactions, but does have a relatively large number of customers with one transaction, and the estimation functions are struggling, the problem is most likely that you are including customers' very first transactions. Some of the data-conversion functions have examples illustrating how to work with data that includes this very first transaction. Note that this does not apply to the holdout period; in the holdout period, we already know about the customer and take all of their previous transactions into account.
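The calibration/holdout and repeat-transaction conventions can be illustrated with a short base R sketch. It is a hand-rolled example under stated assumptions, not the package's data-conversion code: the event log, the cutoff date end.cal and the names in.cal, cust.f, x and x.star are hypothetical (x.star simply follows the “.star” naming convention described above). It also assumes every customer makes their first purchase inside the calibration period.

```r
# Toy event log (hypothetical data): one row per transaction.
elog <- data.frame(
  cust = c("a", "a", "a", "a", "b", "c", "c", "c"),
  date = as.Date(c("2024-01-01", "2024-01-15", "2024-03-11", "2024-03-25",
                   "2024-01-08",
                   "2024-01-22", "2024-01-29", "2024-03-18"))
)
end.cal <- as.Date("2024-03-04")  # assumed calibration/holdout cutoff

in.cal <- elog$date <= end.cal
cust.f <- factor(elog$cust)

# Calibration period: drop the first purchase to get repeat transactions.
x <- tapply(in.cal, cust.f, sum) - 1

# Holdout period: the customer is already known, so every transaction counts.
x.star <- tapply(!in.cal, cust.f, sum)

# With first purchases removed, zero repeat transactions becomes possible.
table(x)
```

If table(x) for your own data shows no zeros at all, that is a sign the first transactions have not been removed.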
Lukasz Dziurzynski, Daniel McCarthy, Edward Wadsworth
Contributors: Peter Fader, Elea McDonnell Feit, Bruce Hardie, Arun Gopalakrishnan, Eric Schwartz, Yao Zhang
Maintainer: Daniel McCarthy <email@example.com>
See www.brucehardie.com for papers, notes, and datasets relating to applied probability models in marketing.
Fader, Peter S., and Bruce G.S. Hardie. “A Note on Deriving the Pareto/NBD Model and Related Expressions.” November. 2005. Web. http://www.brucehardie.com/notes/008/
Fader, Peter S., Bruce G.S. Hardie, and Ka L. Lee. “RFM and CLV: Using Iso-Value Curves for Customer Base Analysis.” Journal of Marketing Research Vol. 42, pp. 415-430. November. 2005. http://www.brucehardie.com/papers.html
Fader, Peter S., and Bruce G.S. Hardie. “Deriving an Expression for P (X(t) = x) Under the Pareto/NBD Model.” September. 2006. Web. http://www.brucehardie.com/notes/012/
Fader, Peter S., and Bruce G.S. Hardie. “Creating an RFM summary using Excel.” December. 2008. Web. http://www.brucehardie.com/notes/022/
Fader, Peter S., Bruce G.S. Hardie, and Jen Shang. “Customer-Base Analysis in a Discrete-Time Noncontractual Setting.” Marketing Science 29(6), pp. 1086-1108. 2010. INFORMS. http://www.brucehardie.com/papers/020/
Jerath, Kinshuk, Peter S. Fader, and Bruce G.S. Hardie. “Customer-Base Analysis on a 'Data Diet': Model Inference Using Repeated Cross-Sectional Summary (RCSS) Data.” June. 2011. Available at SSRN: http://ssrn.com/abstract=1708562 or http://dx.doi.org/10.2139/ssrn.1708562
Fader, Peter S., Bruce G.S. Hardie, and Ka L. Lee. “‘Counting Your Customers’ the Easy Way: An Alternative to the Pareto/NBD Model.” Marketing Science Vol. 24, pp. 275-284. Spring. 2005. http://www.brucehardie.com/papers.html
Fader, Peter S., Bruce G.S. Hardie, and Ka Lok Lee. “Computing P(alive) Using the BG/NBD Model.” December. 2008. Web. http://www.brucehardie.com/notes/021/palive_for_BGNBD.pdf