BiocStyle::markdown()

knitr::opts_chunk$set(fig.width=6, fig.height=5)

Model categorical count data with a Hierarchical Dirichlet Process as described by Teh et al. (Hierarchical Dirichlet Processes, Journal of the American Statistical Association, 2006, 101:476). This R package was adapted from open source MATLAB and C code written by Yee Whye Teh.

library(hdp)

Terminology and method overview

The input data consists of counts of data items in each data category for each sample. The HDP model assumes the data is drawn from a mixture of underlying categorical distributions that are shared (in different mixture ratios) across samples.

The user defines a tree-like structure for the HDP, wherein each DP node draws a mixture of categorical data distributions from its parent node. Data is assigned to particular DP nodes (usually one sample per leaf node), and the structure of the parent DP nodes defines hierarchies of sample-relatedness given some prior knowledge of sample groups.

In each iteration of the posterior sampling process, every data item is assigned to a raw cluster. After posterior sampling, raw clusters are grouped into components. Unlike the raw clusters, the components won't be "too small" (a raw cluster could contain just one data item) nor "too similar" (the distribution of categories in two raw clusters could be about the same, but these raw clusters would then be grouped into one component), and there will be the same number of components across all posterior samples (the number of raw clusters varies across posterior samples). After grouping the raw clusters into global components, any data item that is not assigned to a main component is assigned to component zero. This "zero" component contains the proportion of the dataset with uncertain clustering behaviour, possibly due to noise.

Toy dataset

To illustrate the basic features of the r Githubpkg("nicolaroberts/hdp") package, we consider a toy dataset of categorical count data with 10 samples (rows) and 6 categories (columns). The counts were simulated from two underlying categorical distributions with a different average mixture ratio in the first and last set of five samples.

example_data_hdp

Quick start with basic HDP structure

Initialise hdpState object via hdp_quick_init

Use hdp_quick_init() to initialise a basic default HDP structure with one top parent DP node with no associated data, and one child DP node per sample. Every DP node shares the same concentration parameter, and will automatically be 'activated' (made available for posterior sampling). The base distribution is a uniform Dirichlet with psuedocount 1 in each data category.

set.seed(10)
quick_hdp <- hdp_quick_init(example_data_hdp)
quick_hdp

hdp_quick_init() returns a hdpState object. Notice that it has 11 DP nodes (one parent, and one for each of 10 samples), and that the index of the parent node is 0 for the first DP (the parent is the top node) and 1 for the others. The first DP node (the parent) has no associated data items, while the second DP node has 103 (first sample), the third has 129 (second sample) and so on. With the default settings, the number of initial clusters is 2, and the hyperparameters for the concentration parameter are both 1.

Note that initialising clusters is random, and so the seed is reported for reproducibility. However, hdp_quick_init() does not let the user set a seed, so for reproducible results the seed can be set before running hdp_quick_init() or the HDP can be defined manually (see below) and the seed set via dp_activate().

Run posterior sampling chain

The output of hdp_quick_init() is ready for posterior sampling via hdp_posterior(). The user must specify the number of burn-in iterations, the number of posterior samples to collect, and the number of iterations between each collected sample. Each iteration re-assigns the cluster allocation of every data item, given the current allocations of all other data items.

quick_chain <- hdp_posterior(quick_hdp, burnin=500, n=100, space=40, seed=1234)
quick_chain

hdp_posterior() returns a hdpSampleChain object. Notice that the number of raw clusters varies across the 100 posterior samples collected off the chain.

Assess quality of posterior sampling chain

Theoretically, an infinitely long posterior sampling chain would sample all possible cluster models in proportion to their likelihood. In practice, we aim to have a finite posterior sample set that approximates the true infinite random sampling space.

To assess the choice of burn-in time and number of iterations between collected samples, consider the following set of diagnostic plots (ideally, want no strong trends across posterior samples).

Note that the best approximation of the infinite random sampling space is achieved with multiple independent sampling chains (described in a later section).

plot_lik(quick_chain, bty="L")

plot_numcluster(quick_chain, bty="L")

plot_data_assigned(quick_chain, bty="L")

If the burn-in time was too short, the function cull_posterior_samples() throws away the first ncull posterior samples collected, effectively extending the burn-in time post-hoc.

Extract components from hdpSampleChain

The raw cluster assignments are not a particularly useful description of the dataset because the number of raw clusters varies across posterior samples, and some raw clusters may be very small (e.g. one data item in its own cluster), or have very similar distributions across data categories (effectively describing one underlying process twice).

To provide a more useful summary of the posterior samples, the raw clusters are consolidated into components (constant number across posterior samples).

For a hdpSampleChain object, the hdp_extract_components() function matches up the clusters across posterior samples and groups them into components such that:

quick_chain <- hdp_extract_components(quick_chain)
quick_chain

Note that quick_chain now includes component information, using two components to explain 96% of the data.

The following plot shows the number of data items assigned to each component in each posterior sample.

plot_comp_size(quick_chain, bty="L", lab=c(3, 5, 7))

The following plot shows the mean distribution of data categories for each component, and the 95% credibility intervals (thin black lines). See ?plot_comp_distn for more options.

par(mfrow=c(2,2), mar=c(3, 3, 2, 1))
plot_comp_distn(quick_chain, cat_names=paste0("Categ", 1:6), col="skyblue3")

The following plots show the mean distribution of components at each specified DP node. See ?plot_dp_comp_exposure for more options.

plot_dp_comp_exposure(quick_chain, dpindices=2:6, main_text="First five samples",
                      col=RColorBrewer::brewer.pal(4, "Set3"))
plot_dp_comp_exposure(quick_chain, dpindices=7:11, main_text="Last five samples",
                      col=RColorBrewer::brewer.pal(4, "Set3"))

Note that posterior samples collected off just one finite chain are somewhat correlated with each other, and the variance of the output statistics will tend to be underestimated. This problem is redressed in the next example by using multiple independent sampling chains.

Customised HDP structure

Initialise hdpState object via hdp_init and friends

Use hdp_init() to initialise a custom HDP structure by specifying a number of starting nodes, their parent relationships, the parameters (pseudocounts) of the base distribution, and a set of hyperparameters for the gamma priors over one or more concentration parameters.

In this example, we initialise a HDP with one 'top' DP node off the base distribution, and two children DP nodes off that parent. The two child DPs share a different concentration parameter to that of the parent (hyperparameters for both concentration parameters are rate=2, shape=0.5). The parent DP node draws from the 'base distribution', defined as a uniform Dirichlet with pseudocount 1 in each data category.

my_hdp <- hdp_init(ppindex=c(0, 1, 1), 
                   cpindex=c(1, 2, 2), 
                   hh=rep(1, 6), 
                   alphaa=rep(2, 2), 
                   alphab=rep(0.5, 2))
my_hdp

Further DP nodes can be added to the structure via hdp_adddp(), and further concentration parameters can be added via hdp_addconparam().

Next, we add two more concentration parameters, and ten further DP nodes (five children per parent).

my_hdp <- hdp_addconparam(my_hdp, 
                          alphaa=rep(2, 2), 
                          alphab=rep(0.5, 2))
my_hdp <- hdp_adddp(my_hdp, 
                    numdp=10, 
                    ppindex=rep(2:3, each=5), 
                    cpindex=rep(3:4, each=5))
my_hdp

Data is assigned via hdp_setdata(), with one row of data assigned to each of the DP nodes specified.

my_hdp <- hdp_setdata(my_hdp,
                      dpindex=4:13,
                      data=example_data_hdp)
my_hdp

Run multiple posterior sampling chains

When initialised, the DP nodes are 'heldout' (not available for posterior sampling). Use dp_activate() to activate them and specify the number of starting clusters. To see the effect:

dp_activate(my_hdp, dpindex=1:13, initcc=2, seed=5678)

After dp_activate, the hdpState object will be ready for posterior sampling!

In this example, I run four independent posterior sampling chains, with different dp_activate seeds to start the clustering from a different random starting point each time. At the end, use hdp_multi_chain() to store the results together in a hdpSampleMulti object. This approach is recommended, because the variance of the solution is likely to be underestimated if only one posterior sampling chain is used.

chlist <- vector("list", 4)

for (i in 1:4){

  activated_hdp <- dp_activate(my_hdp, dpindex=1:13, initcc=2, seed=i*1e2)

  chlist[[i]] <- hdp_posterior(activated_hdp, 
                               burnin=2500,
                               n=50,
                               space=100,
                               cpiter=2, 
                               seed=i*1e4)
}

chlist[1:2]

multi <- hdp_multi_chain(chlist)
multi

The constituent chains stored in a hdpSampleMulti object can be accessed via chains().

Diagnostic plots for hdpSampleMulti

To plot diagnostics for the constituent chains stored within a hdpSampleMulti object, use lapply() over chains().

par(mfrow=c(2,2), mar=c(4, 4, 2, 1))
p1 <- lapply(chains(multi), plot_lik, bty="L")
p2 <- lapply(chains(multi), plot_numcluster, bty="L")
p3 <- lapply(chains(multi), plot_data_assigned, bty="L")

Extract components from hdpSampleMulti

For a hdpSampleMulti object, the hdp_extract_components() function matches clusters across posterior samples (considering all chains together), and groups them into components. Recall that data items not assigned to the main components are captured by the 'zero component' for noise and uncertainty.

multi <- hdp_extract_components(multi)
multi

plot_comp_size(multi, bty="L", lab=c(3, 5, 7))

par(mfrow=c(2,2), mar=c(3, 3, 2, 1))
plot_comp_distn(multi, cat_names=paste0("Categ", 1:6), col="skyblue3")

par(mfrow=c(1,1))
plot_dp_comp_exposure(multi, dpindices=4:8, main_text="First five samples",
                      col=RColorBrewer::brewer.pal(3, "Set3"))
plot_dp_comp_exposure(multi, dpindices=9:13, main_text="Last five samples",
                      col=RColorBrewer::brewer.pal(3, "Set3"))

Conditioning on prior knowledge

To condition on prior knowledge about possible components (underlying catagorical data distributions), use hdp_prior_init to initialise a hdpState object with one top parent DP node (active) with no associated data, and one child DP node (frozen) per prior component. The frozen nodes signifying prior components contain some number of pseudo-data items - use these pseudocounts to specify the relative weighting of each prior component compared to the volume of observed data. HDP can simultaneously match data to known priors and discover any new or unspecified components.

In the following example, a toy dataset of 100 samples and 10 data categories was simulated from four underlying components.

# the dataset has 100 rows (samples) and 10 columns (data categories):
dim(example_data_hdp_prior)
# the two provided known priors:
example_known_priors
par(mfrow=c(1,2))
barplot(example_known_priors[,1], main='Known prior 1')
barplot(example_known_priors[,2], main='Known prior 2')

Two of these underlying components are provided and input as prior information with a pseudocount weighting of 1000 data items each.

Prior components are preserved by hdp_extract_components and are prefixed with "P". Any new components identified are prefixed with "N".

# a bit slow to run (few minutes) - load pre-made copy
# hdp_p <- hdp_prior_init(example_known_priors, rep(1000, 2), hh=rep(1, 10),
#              alphaa=c(1,1), alphab=c(1,1))
# hdp_p <- hdp_addconparam(hdp_p, alphaa=c(1,1), alphab=c(1,1))
# hdp_p <- hdp_adddp(hdp_p, 101, c(1, rep(4, 100)), c(3, rep(4, 100)))
# hdp_p <- hdp_setdata(hdp_p, 5:104, example_data_hdp_prior)
# 
# chlist <- vector("list", 4)
# 
# for (i in 1:4){
#   
#   activated_hdp <- dp_activate(hdp_p, dpindex=4:104, initcc=4, seed=i*1e3)
#   
#   chlist[[i]] <- hdp_posterior(activated_hdp, 
#                                burnin=2500,
#                                n=50,
#                                space=100,
#                                cpiter=2, 
#                                seed=i*1e5)
# }
# 
# hdp_p <- hdp_multi_chain(chlist)

# loading pre-made copy
hdp_p

par(mfrow=c(2,2), mar=c(4, 4, 2, 1))
p1 <- lapply(chains(hdp_p), plot_lik, bty="L")
p2 <- lapply(chains(hdp_p), plot_numcluster, bty="L")
p3 <- lapply(chains(hdp_p), plot_data_assigned, bty="L")


hdp_p <- hdp_extract_components(hdp_p)
par(mfrow=c(1,1))
plot_comp_size(hdp_p, bty="L")
par(mfrow=c(2,2))
plot_comp_distn(hdp_p, comp=1:4)
par(mfrow=c(1,1))
plot_dp_comp_exposure(hdp_p, 5:104, col_comp=RColorBrewer::brewer.pal(5, 'Purples'))

Access internal details of hdp object classes

For documentation on hdpState objects, see class?hdpState. Some slots of hdpState can be accessed like so:

# number of data categories
numcateg(quick_hdp)

# number of DP nodes
numdp(quick_hdp)

# number of concentration parameters
numconparam(quick_hdp)

# 'base' distribution above the top DP node
base(quick_hdp)

# parameters of 'base' distribution above top DP node
base_params(quick_hdp)

# concentration parameter details
conparam(quick_hdp)

# DP node detail
dp(quick_hdp)[1:5]

# DP 'state' for each node
# 2 is activated (included in posterior sampling)
# 1 is frozen (conditioned on during posterior sampling)
# 0 is heldout (ignored during postering sampling)
dpstate(quick_hdp)

# index of parent node for each node
ppindex(quick_hdp)

# index of concentration parameter for each node
cpindex(quick_hdp)

# seed used when activating nodes (initialising cluster assignments)
activating_seed(quick_hdp)

# index of frozen nodes containing psuedo-data for prior components,
# only used if initialised via hdp_prior_init
# pseudoDP()

For documentation on hdpSampleChain objects, see class?hdpSampleChain. Some slots of hdpSampleChain can be accessed like so:

# random seed
sampling_seed(quick_chain)

# settings of the posterior sampling chain
hdp_settings(quick_chain)

# instance of the hdpState object at the end of the chain
final_hdpState(quick_chain)

# data likelihood (given model) after every iteration
lik(quick_chain)[1:10]

# number of raw clusters in each posterior sample
numcluster(quick_chain)[1:10]

# concentration parameter values at each posterior sample
cp_values(quick_chain)[1:10]

# List of matrices (one from each posterior sample) counting the category-cluster 
# data assignment across all DP nodes. Number of rows is the number of 
# categories (constant), and number of columns is the number of clusters 
# in that posterior sample (variable).
clust_categ_counts(quick_chain)[1:3]

# List of matrices (one from each posterior sample) counting within-DP cluster 
# assignment (aggregating across data categories). Number of rows is the number of
# DPs (constant), and number of columns is the number of clusters in that posterior sample (variable).
clust_dp_counts(quick_chain)[1:3]

The consituent hdpSampleChain objects within a hdpSampleMulti object can be accessed via chains().

chlist <- chains(multi)
chlist[3:4]

Accessors for the component summary stats are the same for hdpSampleChain and hdpSampleMulti objects.

# Number of components extracted (not including comp zero)
numcomp(multi)

# Propertion of data assigned to main components (not comp zero)
prop.ex(multi)

# List of matrices (one for each component) counting the sample-category data 
# assignment across all DP nodes. Number of rows is the number of posterior samples, 
# and number of columns is the number of data categories.
comp_categ_counts(multi)[[2]][1:10,]

# List of matrices (one for each DP) counting sample-component assignment 
# (aggregating across data categories). Number of rows is the number of posterior 
# samples, and number of columns is the number of components.
comp_dp_counts(multi)[[2]][1:10,]

# List with elements "mean" and "cred.int", containing matrices with the mean 
# (and lower/upper 95% credibility interval) over data categories for each component. 
# Number of rows is the number of components, and number of columns is the 
# number of data categories.
comp_categ_distn(multi)

# List with elements "mean" and "cred.int", containing matrices with the mean 
# (and lower/upper 95% credibility interval) distribution over components for each DP. 
# Number of rows is the number of DPs, and number of columns is the number of components.
comp_dp_distn(multi)$mean
comp_dp_distn(multi)$cred.int[4:7]

Session info

Session information for the system on which this document was compiled:

devtools::session_info()


nicolaroberts/hdp documentation built on May 23, 2019, 5:09 p.m.