fDPMix: Function for fast sampling of DP mixture of Linear...

Description Usage Arguments Details Value Examples

View source: R/FDPMix.r

Description

This function takes in a training data.frame and optional testing data.frame and performs posterior sampling. It returns posterior mean regression line for training and test sets. The function is built for continuous outcomes. This differs from NDPMix in the following ways: NDPMix returns draws from the posterior *predictive* distribution of the outcome, whereas fDPMix() returns the regression line. This will have the same mean as the NDPMix predictions, but lower variance. Finally, the back-end computation is different, with regression line evaluation at point X being evaluated as a weighted average of cluster-specific regression evaluations at X. This is faster than NDPMix which takes a Monte Carlo appraoch: assign X to one of the cluster specific regressions, then draw a predicted outcome from that cluster's regression.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
fDPMix(
  d_train,
  formula,
  d_test = NULL,
  burnin = 100,
  iter = 1000,
  init_k = 10,
  phi_y = NULL,
  beta_prior_mean = NULL,
  beta_prior_var = NULL,
  beta_var_scale = 1000,
  mu_scale = 1,
  tau_x = c(0.05, 2)
)

Arguments

d_train

A data.frame object with outcomes and model covariates/features. All features must be as.numeric - either continuous or binary with binary variables coded using 1 and 0. Categorical features are not supported. We recommend standardizing all continuous features. NA values are not allowed and each row should represent a single subject, longitudinal data is not supported.

formula

Specified in the usual way, e.g. for p=2 covariates, y ~ x1 + x2. All covariates - continuous and binary - must be as.numeric , with binary variables coded as 1 or 0. We recommend standardizing all continuous features. NA values are not allowed and each row should represent a single subject, longitudinal data is not supported.

d_test

Optional data.frame object containing a test set of subjects containing all variables specifed in formula. The same rules that apply to d_train apply to d_test.

burnin

integer specifying number of burn-in MCMC draws.

iter

integer greater than burnin specifying how many total MCMC draws to take.

init_k

Optional. integer specifying the initial number of clusters to kick off the MCMC sampler.

phi_y

Optional. Length two as.numeric vector specifying the mean and variance of the inverse gamma prior on the outcome variance. By default, phi_y[1] is set to var(y), where y is the outcome, so the inverse gamma prior is centered around the marginal empirical variance. With prior variance given by phi_y[2]. By default, phi_y[2]=.5*var(y). This value should be large to be "flat" and small to be "tight" around the marginal empirical variance.

beta_prior_mean

Optional. If there are p covariates, a length p+1 as.numeric vector specifying mean of the Gaussian prior on the outcome model's conditional mean parameter vector. Default is regression coefficients from running OLS on the outcomes.

beta_prior_var

Optional. If there are p covariates, a length p+1 as.numeric vector specifying variance of the Gaussian prior on the outcome model's conditional mean parameter vector. The full covarince of the prior is set to be diagonal. This vector specifies the diagonal enteries of this prior covariance. Default is estimated variances from running OLS on the outcome.

beta_var_scale

Optional. A multiplicative constant that scales beta_prior_var. If you leave beta_prior_mean and beta_prior_var at their defaults, This constant toggles how wide new cluster parameters are dispersed around the observed data parameters.

mu_scale

Optional. An as.numeric, scalar constant that controls how widely distributed new cluster continuous covariate means are distributed around the empirical covariate mean. Specifically, all continuous covariates are assumed to have Gaussian likelihood with Gaussian prior on their means. mu_scale=2 specifies that the variance of the Gaussian prior is twice as large as the empirical variance.

tau_x

Optional numeric length two vector for specifying the inverse gamma prior on each continuous covariate's prior variance. By default, the inverse gamma prior is centered around the empirical variance. tau_x[1] is the positive constant that multiplies this empirical variance. A value of 1 indicates that the inverse gamma is centered around the empirical variance. A value of one half centers the inverse gamma prior around one half the empirical covariate variance. The second argument of tau_x is the prior variance. For example, iftau_x[1]=1 and tau_x[2]=.001, then the inverse gamma prior is tight around the empirical variance. If tau_x[2]=100, then the inverse gamma prior for this covariate is widely distributed around the empirical variance.

Details

We recommend normalizing continuous covariates and outcomes via the scale function before running fDPMix

Please see https://stablemarkets.github.io/ChiRPsite/index.html for examples and detailed model and parameter descriptions.

Value

Returns predictions$train and cluster_inds$train. predictions$train returns an nrow(d_train) by iter - burnin matrix of posterior predictions. cluster_inds$train returns an nrow(d_train) by iter - burnin matrix of cluster assignment indicators, which can be input into the function cluster_assign_mode() to compute posterior mode assignment. predictions$test and cluster_inds$test are returned if d_test is specified.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
set.seed(1)

N = 200
x<-seq(1,10*pi, length.out = N) # confounder
y<-rnorm(n = length(x), sin(.5*x), .07*x )
d <- data.frame(x=x, y=y)
d$x <- as.numeric(scale(d$x))
d$y <- as.numeric(scale(d$y))

plot(d$x,d$y, pch=20, xlim=c(min(d$x), max(d$x)+1 ), col='gray')

d_test = data.frame(x=seq(max(d$x), max(d$x+1 ), .01 ))


res = fDPMix(d_train = d, d_test = d_test, formula = y ~ x,
             iter=100, burnin=50, tau_x = c(.01, .001) )

## in-sample
lines(d$x, rowMeans(res$train), col='steelblue')
lines(d$x, apply(res$train,1,quantile,probs=.05) , col='steelblue', lty=2)
lines(d$x, apply(res$train,1,quantile,probs=.90) , col='steelblue', lty=2)

## out of sample
lines(d_test$x, rowMeans(res$test), col='pink')
lines(d_test$x, apply(res$test,1,quantile,probs=.05) , col='pink', lty=2)
lines(d_test$x, apply(res$test,1,quantile,probs=.90) , col='pink', lty=2)
                

stablemarkets/ChiRP documentation built on July 26, 2021, 2:25 a.m.