Overview. Data transformations are a useful companion for parametric regression models. A well-chosen or learned transformation can greatly enhance the applicability of a given model, especially for data with irregular marginal features (e.g., multimodality, skewness) or various data domains (e.g., real-valued, positive, or compactly-supported data).
Given paired data $(x_i,y_i)$ for $i=1,\ldots,n$, SeBR implements efficient and fully Bayesian inference for semiparametric regression models that incorporate (1) an unknown data transformation
$$ g(y_i) = z_i $$
and (2) a useful parametric regression model
$$ z_i = f_\theta(x_i) + \sigma \epsilon_i $$
with unknown parameters $\theta$ and independent errors $\epsilon_i$.
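To make the two-part model concrete, the following sketch simulates data from it in base R. The specific choices are illustrative assumptions and are not taken from SeBR: a linear $f_\theta(x) = x'\theta$, standard normal errors, and $g = \log$, so that $y = \exp(z)$ is positive and right-skewed.

```r
set.seed(1)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p)   # covariates x_i
theta <- c(1, -0.5, 0.25)         # illustrative coefficients (assumption)
sigma <- 0.5

# (2) latent regression: z_i = x_i'theta + sigma * eps_i
z <- drop(X %*% theta) + sigma * rnorm(n)

# (1) transformation: g(y_i) = z_i with g = log, so y_i = g^{-1}(z_i) = exp(z_i)
g <- log
g_inv <- exp
y <- g_inv(z)

# y is positive and skewed; applying g returns to the Gaussian regression scale
summary(y)
all.equal(g(y), z)
```

In practice $g$ is unknown and must be learned along with $\theta$; fixing $g = \log$ here only serves to show how the two pieces fit together.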
Examples. We focus on the following important special cases:
The linear model:
$$ z_i = x_i'\theta + \sigma\epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} N(0, 1) $$
The transformation $g$ broadens the applicability of this useful class of models, including for positive or compactly-supported data.
The quantile regression model, which replaces the Gaussian errors of the linear model with asymmetric Laplace distribution (ALD) errors,
$$ z_i = x_i'\theta + \sigma\epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} ALD(\tau) $$
to target the $\tau$th quantile of $z$ at $x$, or equivalently (for a monotone increasing $g$), the $\tau$th quantile of $y$ at $x$, obtained by applying $g^{-1}$ to the $\tau$th quantile of $z$. The ALD is often a poor model for real data, especially when $\tau$ is near zero or one. The transformation $g$ offers a pathway to substantially improve model adequacy while still targeting the desired quantile of the data.
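The link between quantiles of $z$ and quantiles of $y$ rests on a standard fact: quantiles are preserved under monotone increasing transformations. A minimal base R check follows, assuming $g = \log$ purely for illustration (the variable names are not part of SeBR):

```r
set.seed(2)
tau <- 0.9
z <- rnorm(1000)       # draws on the latent (transformed) scale
g_inv <- exp           # illustrative inverse transformation; g = log is increasing
y <- g_inv(z)          # corresponding draws on the data scale

# type = 1 uses the inverse empirical CDF, so each quantile is an observed value
q_z <- quantile(z, probs = tau, type = 1, names = FALSE)
q_y <- quantile(y, probs = tau, type = 1, names = FALSE)

# The tau-th quantile of y equals g^{-1} applied to the tau-th quantile of z
isTRUE(all.equal(q_y, g_inv(q_z)))
```

This is why a quantile model fitted on the transformed scale can still be read as a model for the same quantile level of the original data.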
The Gaussian process (GP) model:
$$ z_i = f_\theta(x_i) + \sigma \epsilon_i, \quad \epsilon_i \stackrel{iid}{\sim} N(0, 1) $$
where $f_\theta$ is a GP and $\theta$ parameterizes the mean and covariance functions. Although GPs offer substantial flexibility for the regression function $f_\theta$, this model may be inadequate when $y$ has irregular marginal features or a restricted domain (e.g., positive or compact).
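As a rough illustration of the GP case, the sketch below draws one regression function from a GP prior and pushes the noisy latent values through an inverse transformation. The zero mean, squared exponential kernel, length-scale, and $g = \log$ are all assumptions chosen for the example, not SeBR defaults.

```r
set.seed(3)
n <- 100
x <- seq(0, 1, length.out = n)

# Squared exponential covariance (illustrative kernel and length-scale)
ell <- 0.2
K <- exp(-0.5 * outer(x, x, "-")^2 / ell^2)

# Draw f ~ GP(0, K) at the points x; a small jitter keeps the Cholesky stable
U <- chol(K + 1e-6 * diag(n))        # upper triangular, t(U) %*% U = K + jitter
f <- drop(crossprod(U, rnorm(n)))    # t(U) %*% standard normals

# Latent Gaussian observations, then map to the (positive) data scale
sigma <- 0.1
z <- f + sigma * rnorm(n)
y <- exp(z)   # g = log, so y = g^{-1}(z)
```

Even though $f$ is flexible, $z$ is Gaussian given $f$; the transformation is what lets $y$ be positive and skewed.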
Challenges: The goal is to provide fully Bayesian posterior inference for the unknowns $(g, \theta)$ and posterior predictive inference for future/unobserved data $\tilde y(x)$. We prefer a model and algorithm that offer both (i) flexible modeling of $g$ and (ii) efficient posterior and predictive computations.
Innovations: Our approach (https://doi.org/10.1080/01621459.2024.2395586) specifies a nonparametric model for $g$, yet also provides Monte Carlo (not MCMC) sampling for the posterior and predictive distributions. As a result, we control the approximation accuracy via the number of simulations, but do not require the lengthy runs, burn-in periods, convergence diagnostics, or inefficiency factors that accompany MCMC. The Monte Carlo sampling is typically quite fast.
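The practical appeal of Monte Carlo over MCMC can be seen in a toy conjugate setting. This is an illustration only and not the SeBR algorithm: when the posterior can be sampled directly, the draws are independent, so the Monte Carlo standard error of a posterior mean is simply the standard deviation of the draws divided by $\sqrt{S}$ for $S$ simulations.

```r
set.seed(4)
# Toy data: y_i ~ N(mu, 1) with a flat prior on mu, so mu | y ~ N(mean(y), 1/n)
y <- rnorm(50, mean = 2)
n <- length(y)

mc_summary <- function(S) {
  draws <- rnorm(S, mean = mean(y), sd = 1 / sqrt(n))  # independent posterior draws
  c(estimate = mean(draws), mc_se = sd(draws) / sqrt(S))
}

# More simulations give a smaller Monte Carlo standard error; no burn-in is needed
res <- sapply(c(100, 1000, 10000), mc_summary)
colnames(res) <- c("S=100", "S=1000", "S=10000")
res
```

With correlated MCMC draws, the same calculation would need an effective sample size in place of $S$, which is the inefficiency that direct sampling avoids.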
SeBR
The package SeBR is installed and loaded as follows:
# CRAN version:
# install.packages("SeBR")
# Development version:
# devtools::install_github("drkowal/SeBR")
library(SeBR)
The main functions in SeBR are:
sblm(): Monte Carlo sampling for posterior and predictive inference with the semiparametric Bayesian linear model;
sbsm(): Monte Carlo sampling for posterior and predictive inference with the semiparametric Bayesian spline model, which replaces the linear model with a spline for nonlinear modeling of $x \in \mathbb{R}$;
sbqr(): blocked Gibbs sampling for posterior and predictive inference with the semiparametric Bayesian quantile regression; and
sbgp(): Monte Carlo sampling for predictive inference with the semiparametric Bayesian Gaussian process model.
Each function returns a point estimate of $\theta$ (coefficients), point predictions at some specified testing points (fitted.values), posterior samples of the transformation $g$ (post_g), and posterior predictive samples of $\tilde y(x)$ at the testing points (post_ypred), as well as other function-specific quantities (e.g., posterior draws of $\theta$, post_theta). The calls coef() and fitted() extract the point estimates and point predictions, respectively.
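As a sketch of how these outputs might be used, posterior predictive draws can be summarized into point predictions and prediction intervals. The helper `summarize_pred()` below is hypothetical (it is not part of SeBR) and assumes the draws are stored as a matrix with one row per simulation and one column per testing point. A simulated matrix stands in for post_ypred so that the example runs without a fitted model.

```r
# Hypothetical helper: pointwise summaries of posterior predictive draws
# (assumes rows = Monte Carlo draws, columns = testing points)
summarize_pred <- function(post_ypred, level = 0.90) {
  alpha <- 1 - level
  data.frame(
    mean  = colMeans(post_ypred),
    lower = apply(post_ypred, 2, quantile, probs = alpha / 2, names = FALSE),
    upper = apply(post_ypred, 2, quantile, probs = 1 - alpha / 2, names = FALSE)
  )
}

# Stand-in for fit$post_ypred: 1000 draws at 5 testing points
set.seed(5)
post_ypred <- matrix(rexp(1000 * 5, rate = rep(1:5, each = 1000)), nrow = 1000)
pred <- summarize_pred(post_ypred)
pred
```

With a fitted object `fit` from one of the functions above, the analogous call would be `summarize_pred(fit$post_ypred)`, provided the draws follow the assumed layout.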
Note: The package also includes Box-Cox variants of these functions, i.e., restricting $g$ to the (signed) Box-Cox parametric family $g(t; \lambda) = \{\mbox{sign}(t) \vert t \vert^\lambda - 1\}/\lambda$ with known or unknown $\lambda$. The parametric transformation is less flexible, especially for irregular marginals or restricted domains, and requires MCMC sampling. These functions (e.g., blm_bc()) are primarily for benchmarking.
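For reference, the signed Box-Cox family can be written out directly in base R. The function names below are hypothetical and are not the package's own implementation; the inverse assumes $\lambda > 0$, and the $\lambda = 0$ case is handled as the usual logarithmic limit for $t > 0$.

```r
# Signed Box-Cox transformation g(t; lambda) = (sign(t) * |t|^lambda - 1) / lambda
# (hypothetical helper names; lambda = 0 is the log limit, valid for t > 0)
signed_box_cox <- function(t, lambda) {
  if (lambda == 0) return(log(t))
  (sign(t) * abs(t)^lambda - 1) / lambda
}

# Inverse: solve sign(t) * |t|^lambda = lambda * z + 1 for t
signed_box_cox_inv <- function(z, lambda) {
  if (lambda == 0) return(exp(z))
  u <- lambda * z + 1
  sign(u) * abs(u)^(1 / lambda)
}

# Round trip on values of both signs
t <- c(-2, -0.5, 0.5, 1, 3)
z <- signed_box_cox(t, lambda = 0.5)
all.equal(signed_box_cox_inv(z, lambda = 0.5), t)
```

A single parameter $\lambda$ controls the whole curve, which is the sense in which this family is less flexible than a nonparametric $g$.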
Detailed documentation and examples are available at https://drkowal.github.io/SeBR/.
Kowal, D. and Wu, B. (2024). Monte Carlo inference for semiparametric Bayesian regression. JASA. https://doi.org/10.1080/01621459.2024.2395586