knitr::opts_chunk$set( collapse = TRUE, comment = "#>", cache = FALSE, fig.width = 7, fig.height = 4, out.width = "100%" )
fda.vi implements the novel Bayesian basis function selection method of da Cruz, de Souza, and Sousa (2024) for functional data analysis. The method smooths one or multiple functional curves simultaneously, while accounting for within-curve correlation, by expressing each curve as a linear combination of basis functions (B-splines or Fourier) and using a variational expectation-maximization (VEM) algorithm to select essential basis functions while selectively removing unnecessary ones via sparsity-inducing priors.
Key method features:
# install.packages("devtools") devtools::install_github("steviek16/fda.vi")
Let $y_{ij}$ denote observation $j$ of curve $i$, for $i = 1, \ldots, m$ curves each measured at evaluation points $t_{ij}$, $j = 1, \ldots, n_i$. Each curve is modelled as
$$ y_{ij} = g_i(t_{ij}) + \varepsilon_i(t_{ij}), \qquad g_i(t_{ij}) = \sum_{k=1}^{K} Z_{ki}\,\beta_{ki}\,B_k(t_{ij}) $$
where $B_k(\cdot)$ are $K$ unknown basis functions, $\beta_{ki}$ are the basis coefficients, and $Z_{ki} \in {0,1}$ are inclusion indicators drawn from independent Bernoulli sparsity-inducing priors. The errors $\varepsilon_i(t)$ follow a zero-mean Gaussian process with Ornstein-Uhlenbeck covariance function:
$$ \psi(t, t') = \sigma^2 \exp!\left(-w\,|t - t'|\right) $$
with decay parameter $w > 0$ and variance $\sigma^2 > 0$. The parameter $\tau^2$ controls the regularization of the basis coefficients.
The VEM algorithm iterates between:
until convergence or the maximum number of iterations is achieved.
library(fda.vi) data(toy_curves) # Fit at a single K fit <- vem_fit( y = toy_curves$y, Xt = toy_curves$Xt, K = 8, center = FALSE, scale = FALSE ) summary(fit)
The toy_curves dataset contains three simulated curves generated using
$K = 8$ cubic B-spline basis functions, with known basis coefficients
(basis functions 2 and 5 are not relevant, with corresponding coefficients
set to zero), and Ornstein-Uhlenbeck correlated errors with $\sigma = 0.1$
and $w = 6$.
toy_curves Datasetdata(toy_curves) str(toy_curves)
The dataset is a named list with three elements:
y: a list of 3 numeric vectors, each of length 50, containing the
observed noisy curve valuesXt: a numeric vector of 50 equally spaced evaluation points on $[0, 1]$true_coef: the true basis coefficients used to generate the data,
c(1.5, 0, -1, 0.8, 0, -0.5, 1.2, -0.9) — basis functions 2 and 5 are
not relevant, with corresponding coefficients set to zeroplot(toy_curves$Xt, toy_curves$y[[1]], type = "p", pch = 16, cex = 0.6, col = "steelblue", xlab = "t", ylab = "y(t)", main = "Toy Curves Dataset") for (i in 2:3) { points(toy_curves$Xt, toy_curves$y[[i]], pch = 16, cex = 0.6, col = c("firebrick", "forestgreen")[i - 1]) } legend("topright", legend = paste("Curve", 1:3), col = c("steelblue", "firebrick", "forestgreen"), pch = 16, bty = "n")
When a single integer is passed to K, vem_fit fits the model directly
at that basis size without GCV tuning:
fit <- vem_fit( y = toy_curves$y, Xt = toy_curves$Xt, K = 8, center = FALSE, scale = FALSE )
When a vector of candidate values is passed to K, vem_fit fits the
model at each candidate and selects the $K$ minimizing the mean
generalized cross-validation (GCV) score across all curves:
fit_gcv <- vem_fit( y = toy_curves$y, Xt = toy_curves$Xt, K = c(6, 8, 10, 15) ) fit_gcv$best_K fit_gcv$tuning$gcv_matrix
Setting selection_metric = "per_curve" selects the best $K$ independently
for each curve, returning a composite fit with the results obtained from the
optimal fit per curve:
fit_pc <- vem_fit( y = toy_curves$y, Xt = toy_curves$Xt, K = c(6, 8, 10), selection_metric = "per_curve" ) fit_pc$selected_K fit_pc$is_composite
For periodic functional data, a Fourier basis can be used by setting
basis_type = "fourier":
fit_f <- vem_fit( y = toy_curves$y, Xt = toy_curves$Xt, K = 10, basis_type = "fourier" ) summary(fit_f)
summary(fit)
The summary reports:
coef(fit)
Returns a $K \times m$ matrix of estimated basis coefficients. Inactive
basis functions (PIP $\leq 0.5$) have their coefficients set to zero by
the sparsity-inducing prior. For toy_curves the true zeros at positions 2
and 5 should be recovered:
coefs <- coef(fit) coefs[c(2, 5), ] # should be zero
The posterior inclusion probabilities (PIPs) for all basis functions
across all curves are stored in fit$model$prob and can be inspected
directly:
K <- fit$best_K m <- length(toy_curves$y) pip_mat <- matrix(fit$model$prob, nrow = K, ncol = m) rownames(pip_mat) <- paste0("B", 1:K) colnames(pip_mat) <- paste0("Curve_", 1:m) round(pip_mat, 3)
Predictions based on the vem_fit object. Returns a list of length $m$
(one numeric vector per curve), where each vector has length equal to the
number of evaluation points requested.
# Predictions at original evaluation points preds <- predict(fit) length(preds) # one vector per curve length(preds[[1]]) # same length as Xt # Predictions at a denser grid Xt_new <- seq(0, 1, length.out = 200) preds_new <- predict(fit, newdata = Xt_new)
Estimated curves based on the results from the fit object. The shaded region
is a 95% credible band; use show_CI = FALSE to suppress it.
# Fitted curve with 95% credible band for curve 1 plot(fit, curve_idx = 1)
# All three curves for (i in 1:3) plot(fit, curve_idx = i)
da Cruz, A. C., de Souza, C. P. E., & Sousa, P. H. T. O. (2024). Fast Bayesian basis selection for functional data representation with correlated errors. arXiv:2405.20758. https://arxiv.org/abs/2405.20758
citation("fda.vi")
@misc{dacruz2024vem, title = {Fast {Bayesian} basis selection for functional data representation with correlated errors}, author = {da Cruz, Ana Carolina and de Souza, Camila P. E. and Sousa, Pedro H. T. O.}, year = {2024}, note = {arXiv:2405.20758}, url = {https://arxiv.org/abs/2405.20758} }
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.