Bayesian nonparametric generalized linear models
Description
Fits Dirichlet process mixtures of joint responsecovariate models, where the covariates are continuous while the discrete responses are represented utilizing continuous latent variables. See ‘Details’ section for a full model description.
Usage
1 2 3 4 
Arguments
formula 
a formula defining the response and the covariates e.g. 
family 
a description of the kernel of the response variable. Currently eight options are supported: 1. "poisson", 2. "negative binomial", 3. "generalized poisson", 4. "hyperpoisson", 5. "ctpd", 6. "compoisson", 7. "binomial" and 8. "beta binomial". The first six kernels are used for count data analysis while the last two are used for binomial data analysis. Kernels 3.6. allow for both over and underdispersion relative to the Poisson distribution. See ‘Details’ section for some of the kernel details. 
data 
an optional data frame, list or environment (or object coercible by ‘as.data.frame’ to a data frame) containing the variables in the model. If not found in ‘data’, the variables are taken from ‘environment(formula)’. 
offset 
this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be ‘NULL’ or a numeric vector of length equal to the sample size. One ‘offset’ term can be included in the formula, and if more are required, their sum should be used. 
sampler 
the MCMC algorithm to be utilized. The two options are 
StorageDir 
a directory to store files with the posterior samples of models parameters and other quantities of interest. If a directory is not provided, files are created in the current directory and removed when the sampler completes. 
ncomp 
number of mixture components. Defines where the countable mixture of densities [in (1) below] is truncated.
Even if 
sweeps 
total number of posterior samples, including those discarded in burnin period (see argument 
burn 
length of burnin period. 
thin 
thinning parameter. 
seed 
optional seed for the random generator. 
prec 
precision parameter. Updating the parameters of the response distribution requires a Metropolis  Hastings step, with proposal distributions centered at current values and with precision equal to this argument. It can be of length one (for "poisson" and "binomial" kernels) or of length two (for "negative binomial", "beta binomial", "generalizedpoisson", "hyperpoisson" and "compoisson" kernels) or of length three (for the "ctpd" kernel). 
V 
optional scale matrix V of the prior Wishart distribution assigned to precision matrix T_h. See ‘Details’ section. 
Vdf 
optional degrees of freedom Vdf of the prior Wishart distribution assigned to precision matrix T_h. See ‘Details’ section. 
Mu.nu 
optional prior mean μ_{ν} of the covariance vector ν_h. See ‘Details’ section. 
Sigma.nu 
optional prior covariance matrix Σ_{ν} of ν_h. See ‘Details’ section. 
Mu.mu 
optional prior mean μ_{μ} of the mean vector μ_h. See ‘Details’ section. 
Sigma.mu 
optional prior covariance matrix Σ_{μ} of μ_h. See ‘Details’ section. 
Alpha.xi 
an optional parameter that depends on the specified family.
See ‘Details’ section. 
Beta.xi 
an optional parameter that depends on the specified family.
See ‘Details’ section. 
Alpha.alpha 
optional shape parameter α_{α} of the Gamma prior assigned to the concentration parameter α. See ‘Details’ section. 
Beta.alpha 
optional rate parameter β_{α} of the Gamma prior assigned to concentration parameter α. See ‘Details’ section. 
Turnc.alpha 
optional truncation point c_{α} of the Gamma prior assigned to concentration parameter α. See ‘Details’ section. 
Xpred 
an optional design matrix the rows of which include the covariates x for which the conditional distribution of Yx,D (where D denotes the data) is calculated. These are treated as ‘new’ covariates i.e. they do not contribute to the likelihood. The matrix shouldn't include a column of 1's. 
offsetPred 
the offset term associated with the new covariates 
... 
Other options that will be ignored. 
Details
Function bnpglm
returns samples from the posterior distributions of the parameters of the model:
f(y_i,x_i) = ∑_{h=1}^{∞} π_h f(y_i,x_iθ_h), \hspace{80pt} (1)
where y_i is a univariate discrete response, x_i is a pdimensional vector of continuous covariates, and π_h, h ≥q 1, are obtained according to Sethuraman's (1994) stickbreaking construction: π_1 = v_1, and for l ≥q 2, π_l = v_l ∏_{j=1}^{l1} (1v_j), where v_k are iid samples v_k \simBeta (1,α), k ≥q 1.
The discrete responses y_i are represented as discretized versions of continuous latent variables y_i^*. Observed discrete and continuous latent variables are connected by:
y_{i} = q \iff c_{i,q1} < y^*_{i} < c_{i,q}, q=0,1,2,…,
where the cutpoints are obtained as: c_{i,1} = ∞, while for q ≥q 0, c_{i,q} = c_{q}(λ_{i}) = Φ^{1}\{F(q;λ_i)\}. Here Φ(.) is the cumulative distribution function (cdf) of a standard normal variable and F() denotes an appropriate cdf. Further, latent variables are assumed to independently follow a N(0,1) distribution, where the mean and variance are restricted to be zero and one as they are nonidentifiable by the data. Choices for F() are described next.
For counts, currently six options are supported. First, F(.;λ_i) can be specified as the cdf of a Poisson(H_i ξ_h) variable. Here λ_i=(ξ_h,H_i)^T, ξ_h denotes the Poisson rate associated with cluster h, and H_i the offset term associated with sampling unit i. Second, F(.;λ_i) can be specified as the negative binomial cdf, where λ_i= (ξ_{1h},ξ_{2h},H_i)^T. This option allows for overdispersion within each cluster relative to the Poisson distribution. Third, F(.;λ_i) can be specified as the Generalized Poisson cdf, where, again, λ_i=(ξ_{1h},ξ_{2h},H_i)^T. This option allows for both over and underdispersion within each cluster. The other three options, that also allow for both over and underdispersion relative to the Poisson distribution, are the Hyper Poisson (HP), COMPoisson and the Complex Triparametric Pearson (CTP) kernels. The HP and COMPoisson kernels have 2 parameters and the CTPD kernel has 3 parameters.
For Binomial data, currently two options are supported. First, F(.;λ_i) may be taken to be the cdf of a Binomial(H_i,ξ_h) variable, where ξ_h denotes the success probability of cluster h and H_i the number of trials associated with sampling unit i. Second, F(.;λ_i) may be specified to be the betabinomial cdf, where λ=(ξ_{1h},ξ_{2h},H_i)^T.
Details on all kernels are provided in the tables below. The first table provides the probability mass functions and the mean in the presence of an offset term (which may be taken to be one). The column ‘Sample’ indicates for which parameters the routine provides posterior samples. The second table provides information on the assumed priors along with the default values of the parameters of the prior distributions and it also indicates the function arguments that allow the user to alter these. Lastly, the third tables provides some details on the less frequently used kernels.
Kernel  PMF  Offset  Mean  Sample 
Poisson  \exp(Hξ) (Hξ)^y /y!  H  H ξ  ξ 
Negative Binomial  \frac{Γ(y+ξ_1)}{Γ(ξ_1)Γ(y+1)}(\frac{ξ_2}{H+ξ_2})^{ξ_1}(\frac{H}{H+ξ_2})^{y}  H  H ξ_1/ξ_2  ξ_1, ξ_2 
Generalized Poisson  ξ_1 \{ξ_1+(ξ_21)y\}^{y1} ξ_2^{y} \times  H  Hξ_1  ξ_1,ξ_2 
~~ \exp\{[ξ_1+(ξ_21)y]/ξ_2\}/y!  
Hyper Poisson  \frac{1}{_1F_1(1,ξ_2,ξ_3)} \frac{ξ_3^y}{(ξ_2)_y}  H  H ξ_1 = ξ_3   ξ_1,ξ_2 
~~ (ξ_21) \frac{_1F_1(1,ξ_2,ξ_3)1}{_1F_1(1,ξ_2,ξ_3)}  
CTP  f_0 \frac{(ξ_3+ξ_4 i)_y (ξ_3ξ_4 i)_y}{(ξ_2)_y y!}  H  H ξ_1 = \frac{ξ_3^2+ξ_4^2}{ξ_22ξ_31}  ξ_1, ξ_2, ξ_3 
COMPoisson  \frac{ξ_3^y}{Z(ξ_2,ξ_3)(y!)^{ξ_2}}  H  H ξ_1 = ξ_3 \frac{\partial \log(Z)}{\partial ξ_3}  ξ_1,ξ_2 
Binomial  {N \choose y} ξ^y (1ξ)^{Ny}  N  N ξ  ξ 
Beta Binomial  {N \choose y} \frac{{Beta}{(y+ξ_1,Ny+ξ_2)}}{{Beta}{(ξ_1,ξ_2)}}  N  N ξ_1/(ξ_1+ξ_2)  ξ_1,ξ_2 
Kernel  Priors  Default Values 
Poisson  ξ \sim Gamma(α_{ξ},β_{ξ})  Alpha.xi = 1.0, Beta.xi = 0.1 
Negative Binomial  ξ_i \sim Gamma(α_{ξ_i},β_{ξ_i}), i=1,2  Alpha.xi = c(1.0,1.0), Beta.xi = c(0.1,0.1) 
Generalized Poisson  ξ_1 \sim Gamma(α_{ξ_1},β_{ξ_1})  
ξ_2 \sim TN(α_{ξ_2},β_{ξ_2}) (β_{ξ_2} \equiv st.dev.)  Alpha.xi = c(1.0,1.0), Beta.xi = c(0.1,1.0)  
TN: truncated normal  
Hyper Poisson  ξ_i \sim Gamma(α_{ξ_i},β_{ξ_i}), i=1,2  Alpha.xi = c(1.0,0.5), Beta.xi = c(0.1,0.5) 
CTP  ξ_i \sim Gamma(α_{ξ_i},β_{ξ_i}), i=1,2  
ξ_3 \sim TN(α_{ξ_3},β_{ξ_3}) (β_{ξ_3} \equiv st.dev.)  Alpha.xi = c(1.0,1.0,0.0)  
TN: truncated normal  Beta.xi = c(0.1,0.1,100.0)  
COMPoisson  ξ_i \sim Gamma(α_{ξ_i},β_{ξ_i}), i=1,2  Alpha.xi = c(1.0,0.5), Beta.xi = c(0.1,0.5) 
Binomial  ξ \sim Beta(α_{ξ},β_{ξ})  Alpha.xi = 1.0, Beta.xi = 1.0 
Beta Binomial  ξ_i \sim Gamma(α_{ξ_i},β_{ξ_i}), i=1,2  Alpha.xi = c(1.0,1.0), Beta.xi = c(0.1,0.1) 
Kernel  Notes 
Generalized Poisson  ξ_1 > 0 is the mean and ξ_2 > 1/2 is a dispersion parameter. When ξ_2 = 1, 
the pmf reduces to the Poisson. Parameter values ξ_2 > 1 suggest over  
dispersion and parameter values 1/2 < ξ_2 < 1 suggest underdispersion  
relative to the Poisson.  
Hyper Poisson  ξ_1 > 0 is the mean and ξ_2 > 0 is a dispersion parameter. When ξ_2 = 1, 
the pmf reduces to the Poisson. When ξ_2 > 1 the pmf is overdispersed  
and when ξ_2 < 1 the pmf is underdispersed relative to the Poisson.  
COMPoisson  The mean is ξ_1 (> 0) and the variance approximately ξ_1/ξ_2, so 
similar comments as for the hyper Poisson hold.  
CTPD  Things are a bit more complex here. See RodriguezAvi et al. (2004) for the details. 
Further, joint vectors (y_i^{*},x_{i}) are modeled utilizing Gaussian distributions. Then, with θ_h denoting model parameters associated with the hth cluster, the joint density f(y_{i},x_{i}θ_h) takes the form
f(y_{i},x_{i}θ_h) = \int_{c_{i,y_i1}}^{c_{i,y_i}} N_{p+1}(y_{i}^{*},x_{i}μ_{h},C_h) dy_{i}^{*},
where μ_h and C_h denote the mean vector and covariance matrix, respectively.
The joint distribution of the latent variable y_i^{*} and the covariates x_{i} is
(y_{i}^{*},x_{i}^T)^Tθ_h \sim N_{p+1}≤ft( \begin{array}{ll} ≤ft( \begin{array}{l} 0 \\ μ_h \\ \end{array} \right), & C_h=≤ft[ \begin{array}{ll} 1 & ν_h^T \\ ν_h & Σ_h \\ \end{array} \right] \end{array}\right),
where ν_h denotes the vector of covariances cov(y_{i}^{*},x_{i}θ_h). Sampling from the posterior of constrained covariance matrix C_h is done using methods similar to those of McCulloch et al. (2000). Specifically, the conditional x_{i}y_{i}^{*} \sim N_{p}(μ_h+y_{i}^{*}ν_h, B_h = Σ_h  ν_h ν_h^T) simplifies matters as there are no constraints on matrix B_h (other than positive definiteness). Given priors for B_h and ν_h, it is easy to sample from their posteriors, and thus obtain samples from the posterior of Σ_h=B_h+ν_h ν_h^T.
Specification of the prior distributions:
Define T_h=B_h^{1} = (Σ_{h}  ν_h ν_h^T)^{1}, h ≥q 1. We specify that a priori T_h \sim Wishart_{p}(V,Vdf), where V is a p \times p scale matrix and Vdf is a scalar degrees of freedom parameter. Default values are: V = I_{p}/p and Vdf=p, however, these can be changed using arguments
V
and Vdf.The assumed prior for ν_h is N_p(μ_{ν},Σ_{ν}), h ≥q 1, with default values μ_{ν}=0 and Σ_{ν} = I_{p}. Arguments
Mu.nu
andSigma.nu
allow the user to change the default values.A priori μ_{h} \sim N_p(μ_{μ},Σ_{μ}), h ≥q 1. Here the default values are μ_{μ} = \bar{x} where \bar{x} denotes the sample mean of the covariates, and Σ_{μ} = D where D denotes a diagonal matrix with diagonal elements equal to the square of the observed range of the covariates. Arguments
Mu.mu
andSigma.mu
allow the user to change the default values.
For count data, with
family="poisson"
, a priori we take ξ_{h} \sim Gamma(α_{ξ},β_{ξ}), h ≥q 1. The default values are α_{ξ}=1.0,β_{ξ}=0.1, that define a Gamma distribution with mean α_{ξ}/β_{ξ}=10 and variance α_{ξ}/β_{ξ}^2=100. Defaults can be altered using argumentsAlpha.xi
andBeta.xi
.For count data with
family="negative binomial"
a priori we take ξ_{jh} \sim Gamma(α_{jξ},β_{jξ}), j=1,2, h ≥q 1. The default values are α_{jξ}=1.0,β_{jξ}=0.1, j=1,2. Default values for \{α_{jξ}: j=1,2\} can be altered using argumentAlpha.xi
, and default values for \{β_{jξ}: j=1,2\} can be altered using argumentBeta.xi
.For count data with
family="generalized poisson"
, a priori we take ξ_{1h} \sim Gamma(α_{1ξ},β_{1ξ}), and ξ_{2h} \sim Normal(α_{2ξ},β_{2ξ})I[ξ_{2h} \in R_{ξ_{2}}]. The default values are α_{jξ}=1.0, j=1,2 and β_{1ξ}=0.1, β_{2ξ}=1.0. Default values for \{α_{jξ}: j=1,2\} can be altered using argumentAlpha.xi
, and default values for \{β_{jξ}: j=1,2\} can be altered using argumentBeta.xi
.For count data with
family="hyperpoisson"
a priori we take ξ_{jh} \sim Gamma(α_{jξ},β_{jξ}), j=1,2, h ≥q 1. The default values are α_{1ξ}=1.0, α_{2ξ}=0.5 and β_{1ξ}=0.1, β_{2ξ}=0.5. Default values for \{α_{jξ}: j=1,2\} can be altered using argumentAlpha.xi
, and default values for \{β_{jξ}: j=1,2\} can be altered using argumentBeta.xi
.For count data with
family="ctpd"
, a priori we take ξ_{1h} \sim Gamma(α_{1ξ},β_{1ξ}), ξ_{2h} \sim Gamma(α_{2ξ},β_{2ξ}) and ξ_{3h} \sim Normal(α_{3ξ},β_{3ξ})I[ξ_{3h} \in R_{ξ_{3}}]. The default values are α_{1ξ}=1.0, α_{2ξ}=1.0, α_{3ξ}=0.0 and β_{1ξ}=0.1, β_{2ξ}=0.1, β_{3ξ}=100.0. Default values for \{α_{jξ}: j=1,2\} can be altered using argumentAlpha.xi
, and default values for \{β_{jξ}: j=1,2\} can be altered using argumentBeta.xi
.For count data with
family="compoisson"
a priori we take ξ_{jh} \sim Gamma(α_{jξ},β_{jξ}), j=1,2, h ≥q 1. The default values are α_{1ξ}=1.0, α_{2ξ}=0.5 and β_{1ξ}=0.1, β_{2ξ}=0.5. Default values for \{α_{jξ}: j=1,2\} can be altered using argumentAlpha.xi
, and default values for \{β_{jξ}: j=1,2\} can be altered using argumentBeta.xi
.For binomial data, with
family="binomial"
, a priori we take ξ_{h} \sim Beta(α_{ξ},β_{ξ}), h ≥q 1. The default values are α_{ξ}=1.0,β_{ξ}=1.0, that define a uniform distribution. Defaults can be altered using argumentsAlpha.xi
andBeta.xi
.For binomial data with
family="beta binomial"
, a priori we take ξ_{jh} \sim Gamma(α_{jξ},β_{jξ}), j=1,2, h ≥q 1. The default values are α_{jξ}=1.0,β_{jξ}=0.1. Default values for \{α_{jξ}: j=1,2\} can be altered using argumentAlpha.xi
, and default values for \{β_{jξ}: j=1,2\} can be altered using argumentBeta.xi
. The concentration parameter α is assigned a Gamma(α_{α},β_{α}) prior over the range (c_{α},∞), that is, f(α) \propto α^{α_{α}1} \exp\{α β_{α}\} I[α > c_{α}], where I[.] is the indicator function. The default values are α_{α}=2.0, β_{α}=4.0, and c_{α}=0.25. Users can alter the default using using arguments
Alpha.alpha
,Beta.alpha
andTurnc.alpha
.
Value
Function bnpglm
returns the following:
call 
the matched call. 
seed 
the seed that was used (in case replication of the results is needed). 
meanReg 
if 
modeReg 
if 
Q05Reg 
if 
Q10Reg 
if 
Q15Reg 
if 
Q20Reg 
if 
Q25Reg 
if 
Q50Reg 
if 
Q75Reg 
if 
Q80Reg 
if 
Q85Reg 
if 
Q90Reg 
if 
Q95Reg 
if 
denReg 
if 
denVar 
if 
Further, function bnpglm
creates files where the posterior samples are written. These files are (with all file names
preceded by ‘BNSP.’):
alpha.txt 
this file contains samples from the posterior of the concentration parameters α. The file is arranged in 
compAlloc.txt 
this file contains the allocations or configurations obtained at each iteration of the sampler. It consists of 
MeanReg.txt 
this file contains the conditional means of the response y given covariates x obtained at each iteration of the sampler. The rows represent the 
muh.txt 
this file contains samples from the posteriors of the pdimensional mean vectors μ_h, h=1,2,…,ncomp. The file is arranged in 
nmembers.txt 
this file contains 
nuh.txt 
this file contains samples from the posteriors of the pdimensional covariance vectors ν_h, h=1,2,…,ncomp. The file is arranged in 
Q05Reg.txt 
this file contains the 5% conditional quantile of the response y given covariates x obtained at each iteration of the sampler. The rows represent the 
Q10Reg.txt 
this file contains the 10% conditional quantile of the response y given covariates x obtained at each iteration of the sampler. The rows represent the 
Q15Reg.txt 
this file contains the 15% conditional quantile of the response y given covariates x obtained at each iteration of the sampler. The rows represent the 
Q20Reg.txt 
this file contains the 20% conditional quantile of the response y given covariates x obtained at each iteration of the sampler. The rows represent the 
Q25Reg.txt 
this file contains the 25% conditional quantile of the response y given covariates x obtained at each iteration of the sampler. The rows represent the 
Q50Reg.txt 
this file contains the 50% conditional quantile of the response y given covariates x obtained at each iteration of the sampler. The rows represent the 
Q75Reg.txt 
this file contains the 75% conditional quantile of the response y given covariates x obtained at each iteration of the sampler. The rows represent the 
Q80Reg.txt 
this file contains the 80% conditional quantile of the response y given covariates x obtained at each iteration of the sampler. The rows represent the 
Q85Reg.txt 
this file contains the 85% conditional quantile of the response y given covariates x obtained at each iteration of the sampler. The rows represent the 
Q90Reg.txt 
this file contains the 90% conditional quantile of the response y given covariates x obtained at each iteration of the sampler. The rows represent the 
Q95Reg.txt 
this file contains the 95% conditional quantile of the response y given covariates x obtained at each iteration of the sampler. The rows represent the 
Sigmah.txt 
this file contains samples from the posteriors of the p \times p covariance matrices Σ_h, h=1,2,…,ncomp. The file is arranged in 
SigmahI.txt 
this file contains samples from the posteriors of the p \times p precision matrices Σ_h^{1}, h=1,2,…,ncomp. The file is arranged in 
Th.txt 
this file contains samples from the posteriors of the p \times p precision matrices T_h, h=1,2,…,ncomp. The file is arranged in 
xih.txt 
this file contains samples from the posteriors of parameters ξ_h, h=1,2,…,ncomp. The file is arranged in 
Updated.txt 
this file contains 
Author(s)
Georgios Papageorgiou gpapageo@gmail.com
References
Consul, P. C. & Famoye, G. C. (1992). Generalized Poisson regression model. Communications in Statistics  Theory and Methods, 1992, 89109.
McCulloch, R. E., Polson, N. G., & Rossi, P. E. (2000). A Bayesian analysis of the multinomial probit model with fully identified parameters. Journal of Econometrics, 99(1), 173193.
Papageorgiou, G., Richardson, S. and Best, N. (2014). Bayesian nonparametric models for spatially indexed data of mixed type.
Papaspiliopoulos, O. (2008). A note on posterior sampling from Dirichlet mixture models. Technical report, University of Warwick.
RodriguezAvi, J., CondeSanchez, A., SaezCastillo, A. J., & OlmoJimenez, M. J. (2004). A triparametric discrete distribution with complex parameters. Statistical Papers, 45(1), 8195.
SaezCastillo, A. & CondeSanchez, A. (2013). A hyperpoisson regression model for overdispersed and underdispersed count data. Computational Statistics & Data Analysis, 61, 148157.
Sellers, K. F. & Shmueli, G. (2010). A flexible regression model for count data. Annals of Applied Statistics, 4(2), 943961.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639650.
Shmueli, G., Minka, T. P., Kadane, J. B., Borle, S., & Boatwright, P. (2005). A useful distribution for fitting discrete data: revival of the conwaymaxwellpoisson distribution. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(1), 127142.
Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics Simulation and Computation, 36(1), 4554.
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13  # Bayesian nonparametric GLM with Binomial response Y and one predictor X
data(simD)
pred<seq(with(simD,min(X))+0.1,with(simD,max(X))0.1,length.out=30)
npred<length(pred)
# fit1 and fit2 define the same model but with different numbers of
# components and posterior samples. They both use a slice sampler
# and parameter prec=200 achieves optimal acceptance rate, about 22%.
fit1 < bnpglm(cbind(Y,(EY))~X, family="binomial", data=simD, ncomp=30, sweeps=150,
burn=100, sampler="slice", prec1=c(200), Xpred=pred, offsetPred=rep(30,npred))
fit2 < bnpglm(cbind(Y,(EY))~X, family="binomial", data=simD, ncomp=50, sweeps=5000,
burn=1000, sampler="slice", prec=c(200), Xpred=pred, offsetPred=rep(30,npred))
plot(with(simD,X),with(simD,Y)/with(simD,E))
lines(pred,fit2$medianReg,col=3,lwd=2)
