SURE.trendfilter: Optimize the trend filtering hyperparameter (with respect to...
In capolitsch/trendfilteringSupp: Optimal one-dimensional data analysis with trend filtering

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/SURE.trendfilter.R

\loadmathjax

SURE.trendfilter estimates the fixed-input mean-squared error of a trend filtering estimator (via Stein's unbiased risk estimate) on a grid of values for the hyperparameter gamma, and returns the full error curve and the optimized trend filtering estimate within a larger list with useful ancillary information.

SURE.trendfilter(
  x,
  y,
  weights,
  k = 2L,
  ngammas = 250L,
  gammas = NULL,
  nx.eval = 1500L,
  x.eval = NULL,
  thinning = NULL,
  optimization.params = trendfilter.control.list(max_iter = 600L, obj_tol = 1e-10)
)

`x`	The vector of observed values of the input variable (a.k.a. the predictor, covariate, explanatory variable, regressor, independent variable, control variable, etc.)
`y`	The vector of observed values of the output variable (a.k.a. the response, target, outcome, regressand, dependent variable, etc.).
`weights`	Currently mandatory. If the uncertainty in the observed outputs is not well understood, use `cv.trendfilter` instead. A vector of weights for the observed outputs. These are defined as `weights = 1 / sigma^2`, where `sigma` is a vector of standard errors of the uncertainty in the observed outputs. `weights` should either have length equal to 1 (corresponding to observations with a constant (scalar) variance of `sigma = 1/sqrt(weights)`) or length equal to `length(y)` (i.e. general heteroskedastic noise).
`k`	The degree of the trend filtering estimator. Defaults to `k = 2` (quadratic trend filtering). Must be one of `k = 0,1,2,3`, although `k = 3` is discouraged due to algorithmic instability (and is visually indistinguishable from `k = 2` anyway).
`ngammas`	Integer. The number of trend filtering hyperparameter values to run the grid search over.
`gammas`	Overrides `ngammas` if passed. A user-supplied vector of trend filtering hyperparameter values to run the grid search over. It is advisable to let the vector be equally-spaced in log-space and provided in descending order. The function output will contain the sorted hyperparameter vector regardless of the input ordering, and all related output objects (e.g. the `errors` vector) will correspond to the sorted ordering. Unless you know what you are doing, it is best to leave this `NULL`.
`nx.eval`	Integer. The length of the equally-spaced `x` grid to evaluate the optimized trend filtering estimate on.
`x.eval`	Overrides `nx.eval` if passed. A user-supplied grid of inputs to evaluate the optimized trend filtering estimate on.
`thinning`	logical. If `TRUE`, then the data are preprocessed so that a smaller, better conditioned data set is used for fitting. When left `NULL`, the default, the optimization will automatically detect whether thinning should be applied (i.e., cases in which the numerical fitting algorithm will struggle to converge). This preprocessing procedure is controlled by the `x_tol` argument of `trendfilter.control.list`.
`optimization.params`	a named list of parameters produced by the glmgen function `trendfilter.control.list` that contains all parameter choices (user-supplied or defaults) to be passed to the trend filtering ADMM algorithm (Ramdas and Tibshirani 2016). See the linked function documentation for full details. No technical understanding of the ADMM algorithm is needed and the default parameter choices will almost always suffice. However, the following three parameters may require some adjustments to ensure that your trend filtering estimate has sufficiently converged: `max_iter`: Maximum iterations allowed for the trend filtering convex optimization. Defaults to `max_iter = 600L`. Increase this if the trend filtering estimate does not appear to have fully converged to a reasonable estimate of the signal. `obj_tol`: The tolerance used in the convex optimization stopping criterion; when the relative change in the objective function is less than this value, the algorithm terminates. Decrease this if the trend filtering estimate does not appear to have fully converged to a reasonable estimate of the signal. x_tol: defines uniqueness or sameness of `x`'s. If we make bins of size `x_tol` and find at least two `x`'s which fall into the same bin, then we thin the data.

This will contain a very detailed description...

An object of class 'SURE.trendfilter'. This is a list with the following elements:

`x.eval`	The grid of inputs the optimized trend filtering estimate was evaluated on.
`tf.estimate`	The optimized trend filtering estimate of the signal, evaluated on `x.eval`.
`validation.method`	"SURE"
`gammas`	Vector of hyperparameter values tested during validation (always returned in descending order).
`errors`	Vector of SURE error estimates corresponding to the descending set of gamma values tested during validation.
`gamma.min`	Hyperparameter value that minimizes the SURE error curve.
`edfs`	Vector of effective degrees of freedom for all trend filtering estimators fit during validation.
`edf.min`	The effective degrees of freedom of the optimally-tuned trend filtering estimator.
`i.min`	The index of `gammas` (descending order) that minimizes the SURE error curve.
`x`	The vector of the observed inputs.
`y`	The vector of the observed outputs.
`weights`	A vector of weights for the observed outputs. These are defined as `weights = 1 / sigma^2`, where `sigma` is a vector of standard errors of the uncertainty in the observed outputs.
`fitted.values`	The optimized trend filtering estimate of the signal, evaluated at the observed inputs `x`.
`residuals`	`residuals = y - fitted.values`.
`k`	The degree of the trend filtering estimator.
`thinning`	logical. If `TRUE`, then the data are preprocessed so that a smaller, better conditioned data set is used for fitting.
`optimization.params`	a list of parameters that control the trend filtering convex optimization.
`n.iter`	Vector of the number of iterations needed for the ADMM algorithm to converge within the given tolerance, for each hyperparameter value. If many of these are exactly equal to `max_iter`, then their solutions have not converged with the tolerance specified by `obj_tol`. In which case, it is often prudent to increase `max_iter`.
`x.scale, y.scale, data.scaled`	for internal use.

Collin A. Politsch, collinpolitsch@gmail.com

Trend filtering with Stein's unbiased risk estimate

Politsch et al. (2020a). Trend filtering – I. A modern statistical tool for time-domain astronomy and astronomical spectroscopy. Monthly Notices of the Royal Astronomical Society, 492(3), p. 4005-4018. [Link]
Politsch et al. (2020b). Trend Filtering – II. Denoising astronomical signals with varying degrees of smoothness. Monthly Notices of the Royal Astronomical Society, 492(3), p. 4019-4032. [Link]

Estimating effective degrees of freedom in trend filtering

Tibshirani and Taylor (2012). Degrees of freedom in lasso problems. The Annals of Statistics, 40(2), p. 1198-1232. [Link]

Stein's unbiased risk estimate

Tibshirani and Wasserman (2015). Stein’s Unbiased Risk Estimate. 36-702: Statistical Machine Learning course notes (Carnegie Mellon). [Link]
Efron (2014). The Estimation of Prediction Error: Covariance Penalties and Cross-Validation. Journal of the American Statistical Association, 99(467), p. 619-632. [Link]
Stein (1981). Estimation of the Mean of a Multivariate Normal Distribution. The Annals of Statistics, 9(6), p. 1135-1151. [Link]

Trend filtering optimization algorithm

Ramdas and Tibshirani (2016). Fast and Flexible ADMM Algorithms for Trend Filtering. Journal of Computational and Graphical Statistics, 25(3), p. 839-858. [Link]
Arnold, Sadhanala, and Tibshirani (2014). Fast algorithms for generalized lasso problems. R package glmgen, Version 0.0.3. [Link] (Implementation of Ramdas and Tibshirani algorithm)

cv.trendfilter, bootstrap.trendfilter

#############################################################################
##                    Quasar Lyman-alpha forest example                    ##
#############################################################################
# A quasar is an extremely luminous galaxy with an active supermassive black 
# hole at its center. Absorptions in the spectra of quasars at vast 
# cosmological distances from our galaxy reveal the presence of a gaseous 
# medium permeating the entirety of intergalactic space -- appropriately 
# named the 'intergalactic medium'. These absorptions allow astronomers to 
# study the structure of the Universe using the distribution of these 
# absorptions in quasar spectra. Particularly important is the 'forest' of 
# absorptions that arise from the Lyman-alpha spectral line, which traces 
# the presence of electrically neutral hydrogen in the intergalactic medium.
#
# Here, we are interested in denoising the Lyman-alpha forest of a quasar 
# spectroscopically measured by the Sloan Digital Sky Survey. SDSS spectra 
# are equally spaced in log10 wavelength space, aside from some instances of 
# masked pixels.

data(quasar_spec)

# head(data)
#
# | log10.wavelength|       flux|   weights|
# |----------------:|----------:|---------:|
# |           3.5529|  0.4235348| 0.0417015|
# |           3.5530| -2.1143005| 0.1247811|
# |           3.5531| -3.7832341| 0.1284383|

SURE.out <- SURE.trendfilter(x = data$log10.wavelength, 
                             y = data$flux, 
                             weights = data$weights)


# Extract the estimated hyperparameter error curve and optimized trend 
# filtering estimate from the `SURE.trendfilter` output, and transform the 
# input grid to wavelength space (in Angstroms).

log.gammas <- log(SURE.out$gammas)
errors <- SURE.out$errors
log.gamma.min <- log(SURE.out$gamma.min)

wavelength <- 10 ^ (SURE.out$x)
wavelength.eval <- 10 ^ (SURE.out$x.eval)
tf.estimate <- SURE.out$tf.estimate


# Plot the results

par(mfrow = c(2,1), mar = c(5,4,2.5,1) + 0.1)
plot(x = log.gammas, y = errors, main = "SURE error curve", 
     xlab = "log(gamma)", ylab = "SURE error")
abline(v = log.gamma.min, lty = 2, col = "blue3")
text(x = log.gamma.min, y = par("usr")[4], 
     labels = "optimal gamma", pos = 1, col = "blue3")

plot(x = wavelength, y = SURE.out$y, type = "l", 
     main = "Quasar Lyman-alpha forest", 
     xlab = "Observed wavelength (Angstroms)", ylab = "Flux")
lines(wavelength.eval, tf.estimate, col = "orange", lwd = 2.5)
legend(x = "topleft", lwd = c(1,2), lty = 1, col = c("black","orange"), 
       legend = c("Noisy quasar Lyman-alpha forest", 
                  "Trend filtering estimate"))