ZDPMix: Function for posterior sampling of DP mixture of...
In stablemarkets/ChiRP: Chinese Restaurant Process Mixtures of Regressions


View source: R/ZDPMix.r

This function takes in a training data.frame and optional testing data.frame and performs posterior sampling. It returns posterior predictions and posterior clustering for training and test sets. The function is built for zero-inflated, but otherwise continuous, outcomes.

ZDPMix(
  d_train,
  formula,
  d_test = NULL,
  burnin = 100,
  iter = 1000,
  phi_y = c(shape = 5, rate = 1000),
  beta_prior_mean = NULL,
  beta_prior_var = NULL,
  gamma_prior_mean = NULL,
  gamma_prior_var = NULL,
  init_k = 10,
  beta_var_scale = 1000,
  mu_scale = 1,
  tau_scale = 1,
  prop_sigma_z = diag(rep(0.025, nparams))
)

`d_train`	A `data.frame` object with outcomes and model covariates/features. All features must be `as.numeric` - either continuous or binary, with binary variables coded using `1` and `0`. Categorical features are not supported. We recommend standardizing all continuous features. NA values are not allowed, and each row should represent a single subject; longitudinal data is not supported.
`formula`	Specified in the usual way, e.g. for `p=2` covariates, `y ~ x1 + x2`. All covariates - continuous and binary - must be `as.numeric`, with binary variables coded as `1` or `0`. We recommend standardizing all continuous features. NA values are not allowed, and each row should represent a single subject; longitudinal data is not supported.
`d_test`	Optional `data.frame` object containing a test set of subjects with all variables specified in `formula`. All the same rules for `d_train` apply to `d_test`.
`burnin`	integer specifying number of burn-in MCMC draws.
`iter`	integer greater than `burnin` specifying how many total MCMC draws to take.
`phi_y`	Optional. Length two `as.numeric` vector specifying the shape and rate, respectively, of the Inverse Gamma hyper-prior placed on the outcome variance.
`beta_prior_mean`	Optional. If there are `p` covariates, it is a length `p+1` `as.numeric` vector specifying mean of the Gaussian prior on the outcome model's conditional mean parameter vector. Default is regression coefficients from running OLS on positive outcomes.
`beta_prior_var`	Optional. If there are `p` covariates, a length `p+1` `as.numeric` vector specifying variance of the Gaussian prior on the outcome model's conditional mean parameter vector. The full covariance of the prior is set to be diagonal. This vector specifies the diagonal entries of this prior covariance. Default is estimated variances from running OLS on positive outcomes.
`gamma_prior_mean`	Optional. If there are `p` covariates, a length `p+1` `as.numeric` vector specifying mean of the Gaussian prior on the zero probability logistic model's conditional mean parameter vector. Default is a vector of 0s - i.e., null-centered prior mean.
`gamma_prior_var`	Optional. If there are `p` covariates, a length `p+1` `as.numeric` vector specifying variance of the Gaussian prior on the zero probability logistic model's conditional mean parameter vector. Default is vector of 2s - moderately flat on the odds ratio scale.
`init_k`	Optional. integer specifying the initial number of clusters to kick off the MCMC sampler.
`beta_var_scale`	Optional. A multiplicative constant that scales `beta_prior_var`. If you leave `beta_prior_mean` and `beta_prior_var` at their defaults, this constant controls how widely new cluster parameters are dispersed around the observed data parameters; larger values imply a wider distribution.
`mu_scale`	Optional. A numeric scalar that controls how widely new cluster continuous covariate means are dispersed around the empirical covariate mean. Specifically, all continuous covariates are assumed to have a Gaussian likelihood with a Gaussian prior on their means. `mu_scale=2` specifies that the variance of the Gaussian prior is twice as large as the empirical variance.
`tau_scale`	Optional. A numeric scalar that controls how widely new cluster continuous covariate variances are dispersed around the empirical variance. Specifically, all continuous covariates are assumed to have a Gaussian likelihood with an Inverse Gamma prior on their variance. `tau_scale=2` specifies that the rate of the Inverse Gamma prior is twice as large as the empirical variance.
`prop_sigma_z`	Optional. If you specified `p` covariates in `formula`, `p+1` regression parameters are sampled for the probability of the outcome being zero using a Metropolis step. `prop_sigma_z` is a `p+1` by `p+1` covariance matrix for the Metropolis proposal distribution.
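
The prior-related arguments above all expect `p+1` values (or a `p+1` by `p+1` matrix). The following is a minimal base R sketch of how one might build them by hand in order to inspect or adjust them before passing them to `ZDPMix()`. It assumes a hypothetical toy data frame `d_toy` with a single covariate, and it reads "estimated variances from running OLS" as the squared standard errors of the OLS coefficients, which is an assumption. It is not the package's internal code.

```r
set.seed(2)

# Hypothetical toy data: one standardized covariate and a zero-inflated outcome.
n <- 300
x1 <- as.numeric(scale(rnorm(n)))
is_zero <- rbinom(n, 1, prob = plogis(-1 + 0.5 * x1))
y <- ifelse(is_zero == 1, 0, rnorm(n, mean = 100 + 5 * x1, sd = 10))
d_toy <- data.frame(Y = y, X1 = x1)

p <- 1              # number of covariates in Y ~ X1
nparams <- p + 1    # intercept plus p slopes

# OLS on the positive outcomes only, mirroring the documented default.
ols <- lm(Y ~ X1, data = d_toy[d_toy$Y > 0, ])
beta_prior_mean <- as.numeric(coef(ols))
# Assumption: "estimated variances" means the coefficient variances.
beta_prior_var <- as.numeric(diag(vcov(ols)))

# Documented defaults for the zero-probability logistic model.
gamma_prior_mean <- rep(0, nparams)
gamma_prior_var <- rep(2, nparams)

# Diagonal Metropolis proposal covariance, as in the Usage default.
prop_sigma_z <- diag(rep(0.025, nparams))
```

Each object can then be supplied through the argument of the same name, for example `beta_prior_mean = beta_prior_mean`.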

Please see https://stablemarkets.github.io/ChiRPsite/index.html for examples and detailed model and parameter descriptions.

Please see https://arxiv.org/abs/1810.09494 for a methodological reference.

Returns `predictions$train` and `cluster_inds$train`. `predictions$train` is an `nrow(d_train)` by `iter - burnin` matrix of posterior predictions. `cluster_inds$train` is an `nrow(d_train)` by `iter - burnin` matrix of cluster assignment indicators, which can be input into the function `cluster_assign_mode()` to compute posterior mode assignment. `predictions$test` and `cluster_inds$test` are returned if `d_test` is specified.
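
As a rough illustration of working with matrices of this shape, the sketch below computes per-subject posterior means, 95% credible intervals, and a naive most-frequent cluster label. Both matrices are simulated stand-ins (`pred_train` and `clus_train` are hypothetical names) so the code runs without fitting a model. `row_mode()` is a hypothetical base R helper written for illustration; it is not the package's `cluster_assign_mode()` and may give different results.

```r
set.seed(3)
n_subj <- 5     # stand-in for nrow(d_train)
n_draws <- 100  # stand-in for iter - burnin

# Stand-in for predictions$train: one row per subject, one column per draw.
pred_train <- matrix(rnorm(n_subj * n_draws, mean = 100, sd = 10),
                     nrow = n_subj, ncol = n_draws)

# Posterior mean and 95% credible interval for each subject.
post_mean <- rowMeans(pred_train)
post_ci <- t(apply(pred_train, 1, quantile, probs = c(0.025, 0.975)))

# Stand-in for cluster_inds$train: one cluster label per subject per draw.
clus_train <- matrix(sample(1:3, n_subj * n_draws, replace = TRUE),
                     nrow = n_subj, ncol = n_draws)

# Hypothetical helper: most frequently assigned label in each row.
row_mode <- function(m) {
  apply(m, 1, function(r) names(which.max(table(r))))
}
mode_assign <- row_mode(clus_train)
```

With a fitted object `res`, the same summaries would apply to `res$predictions$train` in place of the simulated `pred_train`.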

set.seed(1)
n <- 200  ## per-cluster sample size; generate from a clustered, skewed data distribution
X11 <- rnorm(n = n, mean = 10, sd = 3)
X12 <- rnorm(n = n, mean = 0, sd = 2)
X13 <- rnorm(n = n, mean = -10, sd = 4)

Y1 <- rnorm(n = n, mean = 100 + .5*X11, 20)*(1-rbinom(n, 1, prob = pnorm( -10 + 1*X11 ) ))
Y2 <- rnorm(n = n, mean = 200 + 1*X12, 30)*(1-rbinom(n, 1, prob = pnorm( 1 + .05*X12 ) ))
Y3 <- rnorm(n = n, mean = 300 + 2*X13, 40)*(1-rbinom(n, 1, prob = pnorm( -3 -.2*X13 ) ))

d <- data.frame(X1=c(X11, X12, X13), Y = c(Y1, Y2, Y3))

## standardize the covariate; as.numeric() drops the matrix attributes scale() returns
d$X1 <- as.numeric(scale(d$X1))

ids <- sample(1:600, size = 500, replace = FALSE )
d_train <- d[ids,]
d_test <- d[-ids, ]

res <- ChiRP::ZDPMix(d_train = d_train, d_test = d_test, formula = Y ~ X1,
                     burnin=100, iter=200, init_k = 5, phi_y = c(10, 10000))