Bayesian non-parametric dependent Gaussian process model for time-indexed functional data

Share:

Description

Estimates a collection of time-indexed functions with Gaussian process (GP) formulations where a Dirichlet process mixture allows sub-groupings of the functions to share the same GP covariance parameters. The GP formulation supports any number of additive GP covariance terms, expressing either or both of multiple trend and seasonality.

Usage

1
2
3
gpdpgrow(y, ipr, time_points, gp_cov, sn_order, jitter, gp_shape, gp_rate,
  noise_shape, noise_rate, dp_shape, dp_rate, M_init, lower, upper, sub_size,
  w_star, w, n.iter, n.burn, n.thin, n.tune, progress, b_move, cluster, s)

Arguments

y

A multivariate continuous response, specified as an N x T matrix, where N denotes the number of functions and T, the number of time points per function. Intermittent missing-at-random values are allowed and will be estimated from the posterior predictive distribution. Missing cells should be denoted with NA. The sampling of missing values requires co-sampling the functions, bb, while these functions are marginalized out for sampling under the case of no missing data. So the sampler will run more slowly in the case of intermittent missingness.

ipr

An optional input vector of inclusion probabilities for each observation unit in the case the observed data were acquired through an informative sampling design, so that unbiased inference about the population requires adjustments to the observed sample. Defaults to ipr = rep(1,nrow(y)) indicating an iid sample.

time_points

Inputs a vector of common time points at which the collections of functions were observed (with the possibility of intermittent missingness). The length of time_points should be equal to the number of columns in the data matrix, y. Defaults to time_points = 1:ncol(y).

gp_cov

A vector of length L to select the covariance function for each of L terms. Allowed inputs are c("rq","se","sn"), where "rq" denotes the rational quadratic covariance function, "se", the square exponential, and "sn", seasonality. The seasonality covariance is a quasi-periodic multiplicative combination of a fixed length- scale periodic covariance kernel (where the period length is specified in 'sn_order') and a squared exponential kernel (of varying length-scale) to allow the periodicity in the seasonal term to evolve. e.g. If 4 terms are wanted - 2 trend terms with "se" and 2 seasonality terms, the input would be gp_cov = c("se","se","sn","sn"). Defaults to gp_cov = "rq".

sn_order

A vector of length L_s, the number of terms in gp_cov where "sn" is selected, that denotes the seasonality order for each term; e.g. if the two "sn" terms above are for 3 and 12 month seasonality, respectively, for monthly data, then sn_order = c(3,12). Defaults to sn_order = NULL.

jitter

A scalar numerical value added to the diagonal elements of the T x T GP covariance matrix to stabilize computation. Defaults to jitter = 0.01.

gp_shape

The shape parameter of the Gamma base distribution for the DP prior on the P x N matrix of GP covariance parameters (where P denotes the number of parameters for each of the N experimental units). Defaults to gp_shape = 1

gp_rate

The rate parameter of the Gamma base distribution on GP covariance parameters. Defaults to gp_rate = 1.

noise_shape

The shape parameter of the Gamma base distribution on tau_e, the model noise precision parameter. Defaults to noise_shape = 3.

noise_rate

The rate parameter of the Gamma base distribution on tau_e, the model noise precision parameter. Defaults to noise_rate = 1.

dp_shape

The shape parameter for the Gamma prior on the DP concentration parameter, conc. Defaults to dp_shape = 1.

dp_rate

The rate parameter for the Gamma prior on the DP concentration parameter, conc. Defaults to dp_rate = 1.

M_init

Starting number of clusters of nrow(y) units to initialize sampler. Defaults to M_init = nrow(y).

lower

The lower end of the range to be used in conditionally sampling the GP covariance parameters (kappa,tau_e) in the slice sampler. Defaults to lower = 0.

upper

The upper end of the range to be used in conditionally sampling the GP covariance parameters (kappa,tau_e) in the slice sampler. Defaults to upper = 1e10.

sub_size

Integer vector whose length, n, equals the number of progressively coarser GP covariance matrices to use for tempered sampling steps in an alternative space to sample the GP covariance parameters. Each entry denotes the number of sub-sample time points in T to draw under a latin hypercube design from the N x T data matrix, y to employ for that distribution; for example, suppose T = 300. sub_size = c(100,50), would randomly select first 100 and then 50 time points from y to use for each of the n = 2 distributions. Defaults to sub_size = c(floor(0.25*T),floor(0.1*T)).

w_star

Integer value denoting the number of cluster locations to sample ahead of observations in the auxiliary Gibbs sampler used to sample the number of clusters and associated cluster assignments. A higher value reduces samplin auto-correlation, but increases computational burden. Defaults to w_star = 2.

w

Numeric value denoting the step width used to construct the interval from which to draw a sample for each GP covariance parameter in the slice sampler. This value is adaptively updated in the sampler tuning stage for each parameter to be equal to the difference in the 0.95 and 0.05 sample quantiles for each of 5 block updates. Defaults to w = 1.5.

n.iter

Total number of MCMC iterations.

n.burn

Number of MCMC iterations to discard. gpdpgrow will return (n.iter - n.burn) posterior samples.

n.thin

Gap between successive sampling iterations to save.

n.tune

Number of iterations (before ergodic chain instantiated) to adapt w, separately, for each covariance term, p = 1,...,P. Sets each w_p to lie in the 90 percent credible interval computed from the tuning sample (that is divided into 5 blocks so that w_p is successively updated in each block of runs).

progress

A boolean value denoting whether to display a progress bar during model execution. Defaults to progress = TRUE

b_move

A boolean value denoting whether to sample the GP function, bb, in T x 1 Gibbs steps b_move = TRUE or through elliptical slice sampling. Defaults to b_move = TRUE. Only used in the case there is any intermittent missingness; otherwise bb is marginalized out of the sampler and post-sampled from the its predictive distribution.

cluster

A boolean value denoting whether to employ DP mix model over set of GP functions or to just use GP model with no clustering of covariance function parameters. Defaults to cluster = TRUE

s

An N x 1 integer vector that inputs a fixed clustering, rather than sampling it. Defaults to s = NULL

Value

S3 gpdpgrow object, for which many methods are available to return and view results. Generic functions applied to an object, res of class gpdpgrow, includes:

samples(res)

contains (n.iter - n.burn) posterior sampling iterations for every model parameter

resid(res)

contains the model residuals.

Note

The intended focus for this package are data composed of observed noisy functions (each of length T) for a set of experimental units where the functions may express dependence among the experimental units

Author(s)

Terrance Savitsky tds151@gmail.com Daniell Toth danielltoth@yahoo.com

References

T. D. Savitsky and D. Toth (2014) Bayesian Non-parametric Models for Collections of Time- indexed Functions. submitted to: JRSS Series A (Statistics in Society).

T. D. Savitsky (2014) Bayesian Non-parametric Functional Mixture Estimation for Time-indexed data. submitted to: Annals of Applied Statistics.

T. D. Savitsky (2014) Bayesian Non-Parametric Mixture Estimation for Time-Indexed Functional Data for R. Submitted to: Journal of Statistical Software.

See Also

gmrfdpgrow

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
{
library(growfunctions)

## load the monthly employment count data for a collection of 
## U.S. states from the Current 
## Population Survey (cps)
data(cps)
## subselect the columns of N x T, y, associated with 
## the years 2011 - 2013
## to examine the state level employment 
## levels during the "great recession"
y_short   <- cps$y[,(cps$yr_label %in% c(2011:2013))]

## uses default setting of a single "rational quadratic" covariance
## run for 500 iterations, with half discarded as burn-in to 
## obtain a more useful result.
res_gp               <- gpdpgrow(y = y_short, 
                                 n.iter = 4, 
                                 n.burn = 1, 
                                 n.thin = 1, 
                                 n.tune = 0)  

## Two plots of estimated functions, 
## 1. faceted by cluster 
## 2. fitted functions vs noisy observations
## first plot will plot estimated denoised function, 
## bb_i, for a single (randomly-selected) "state"
fit_plots_gp        <- cluster_plot( object = res_gp,  
                           units_name = "state", 
                           units_label = cps$st, 
                           single_unit = TRUE, 
                           credible = TRUE )
## second plot will randomly select 6 states 
## and plot their estimated denoised functions, bb_i.
## with setting "single_unit = FALSE". 
## (Option "num_plot" may be set to plot 
## any integer number of 
## randomly-selected units.)
fit_plots_gp        <- cluster_plot( object = res_gp,  
                                     units_name = "state", 
                                     units_label = cps$st, 
                                     single_unit = FALSE, 
                                     credible = TRUE )

}

Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker.