hdpGLM: Fit Hierarchical Dirichlet Process GLM
In hdpGLM: Hierarchical Dirichlet Process Generalized Linear Models

hdpGLM

R Documentation

Fit Hierarchical Dirichlet Process GLM

Description

The function estimates a semi-parametric mixture of Generalized Linear Models. It uses a (hierarchical) Dependent Dirichlet Process Prior for the mixture probabilities.

Usage

hdpGLM(
  formula1,
  formula2 = NULL,
  data,
  mcmc,
  family = "gaussian",
  K = 100,
  context.id = NULL,
  constants = NULL,
  weights = NULL,
  n.display = 1000,
  na.action = "exclude",
  imp.bin = "R"
)

Arguments

`formula1`	a single symbolic description of the linear model of the mixture GLM components to be fitted. The syntax is the same as used in the `lm` function.
`formula2`	eihter NULL (default) or a single symbolic description of the linear model of the hierarchical component of the model. It specifies how the average parameter of the base measure of the Dirichlet Process Prior varies linearly as a function of group level covariates. If `NULL`, it will use a single base measure to the DPP mixture model.
`data`	a data.frame with all the variables specified in `formula1` and `formula2`. Note: it is advisable to scale the variables before the estimation
`mcmc`	a named list with the following elements - `burn.in` (required): an integer greater or equal to 0 indicating the number iterations used in the burn-in period of the MCMC. - `n.iter` (required): an integer greater or equal to 1 indicating the number of iterations to record after the burn-in period for the MCMC. - `epsilon` (optional): a positive number. Default is 0.01. Used when `family='binomial'` or `family='multinomial'`. It is used in the Stormer-Verlet Integrator (a.k.a leapfrog integrator) to solve the Hamiltonian Monte Carlo in the estimation of the model. - `leapFrog` (optional) an integer. Default is 40. Used when `family='binomial'` or `family='multinomial'`. It indicates the number of steps taken at each iteration of the Hamiltonian Monte Carlo for the Stormer-Verlet Integrator. - `hmc_iter` (optional) an integer. Default is 1. Used when `family='binomial'` or `family='multinomial'`. It indicates the number of HMC iteration(s) for each Gibbs iteration.
`family`	a character with either 'gaussian', 'binomial', or 'multinomial'. It indicates the family of the GLM components of the mixture model.
`K`	an integer indicating the maximum number of clusters to truncate the Dirichlet Process Prior in order to use the blocked Gibbs sampler.
`context.id`	string with the name of the column in the data that uniquely identifies the contexts. If `NULL` (default) contexts will be identified by numerical indexes and unique context-level variables. The user is advised to pre-process the data to provide meaningful labels for the contexts to facilitate later visualization and analysis of the results.
`constants`	either NULL or a list with the constants of the model. If not NULL, it must contain a vector named `mu_beta`, whose size must be equal to the number of covariates specified in `formula1` plus one for the constant term; `Sigma_beta`, which must be a squared matrix, and each dimension must be equal to the size of the vector `mu_beta`; and `alpha`, which must be a single number. If @param family is 'gaussian', then it must also contain `s2_sigma` and `df_sigma`, both single numbers. If NULL, the defaults are `mu_beta=0`, `Sigma_beta=diag(10)`, `alpha=1`, `df_sigma=10`, `s2_sigma=10` (all with the dimension automatically set to the correct values).
`weights`	numeric vector with the same size as the number of rows of the data. It must contain the weights of the observations in the data set. NOTE: FEATURE NOT IMPLEMENTED YET
`n.display`	an integer indicating the number of iterations to wait before printing information about the estimation process. If zero, it does not display any information. Note: displaying informaiton at every iteration (n.display=1) may increase the time to estimate the model slightly.
`na.action`	string with action to be taken for the `NA` values. (currently, only `exclude` is available)
`imp.bin`	string, either "R" or "Cpp" indicating the language of the implementation of the binomial model.

Details

This function estimates a Hierarchical Dirichlet Process generalized linear model, which is a semi-parametric Bayesian approach to regression estimation with clustering. The estimation is conducted using Blocked Gibbs Sampler if the output variable is gaussian distributed. It uses Metropolis-Hastings inside Gibbs if the output variable is binomial or multinomial distributed. This is specified using the parameter family. See:

Ferrari, D. (2020). Modeling Context-Dependent Latent Effect Heterogeneity, Political Analysis, 28(1), 20–46. doi:10.1017/pan.2019.13.

Ferrari, D. (2023). "hdpGLM: An R Package to Estimate Heterogeneous Effects in Generalized Linear Models Using Hierarchical Dirichlet Process." Journal of Statistical Software, 107(10), 1-37. doi:10.18637/jss.v107.i10.

Ishwaran, H., & James, L. F., Gibbs sampling methods for stick-breaking priors, Journal of the American Statistical Association, 96(453), 161–173 (2001).

Neal, R. M., Markov chain sampling methods for dirichlet process mixture models, Journal of computational and graphical statistics, 9(2), 249–265 (2000).

Value

The function returns a list with elements samples, pik, max_active, n.iter, burn.in, and time.elapsed. The samples element contains a MCMC object (from coda package) with the samples from the posterior distribution. The pik is a n x K matrix with the estimated probabilities that the observation $i$ belongs to the cluster $k$

Examples



## Note: this example is for illustration. You can run the example
## manually with increased number of iterations to see the actual
## results, as well as the data size (n)

set.seed(10)
n = 300
data = tibble::tibble(x1 = rnorm(n, -3),
                      x2 = rnorm(n,  3),
                      z  = sample(1:3, n, replace=TRUE),
                      y  =I(z==1) * (3 + 4*x1 - x2 + rnorm(n)) +
                          I(z==2) * (3 + 2*x1 + x2 + rnorm(n)) +
                          I(z==3) * (3 - 4*x1 - x2 + rnorm(n)) 
                      ) 


mcmc    = list(burn.in = 0,  n.iter = 20)
samples = hdpGLM(y~ x1 + x2, data=data, mcmc=mcmc, family='gaussian',
                 n.display=30, K=50)

summary(samples)
plot(samples)
plot(samples, separate=TRUE)

## compare with GLM
## lm(y~ x1 + x2, data=data,  family='gaussian')

hdpGLM documentation built on June 8, 2025, 10:05 a.m.