R/local_fit.R
In resemble: Memory-Based Learning in Spectral Chemometrics

Documented in local_fit_gpr local_fit_pls local_fit_wapls

#' @title Local fit functions
#' @name local_fit
#' @aliases local_fit
#' @aliases local_fit_pls
#' @aliases local_fit_wapls
#' @aliases local_fit_gpr
#' @description
#' \loadmathjax
#' These functions define the way in which each local fit/prediction is done
#' within each iteration in the \code{\link{mbl}} function.
#' @usage
#' local_fit_pls(pls_c, modified = FALSE, max_iter = 100, tol = 1e-6)
#'
#' local_fit_wapls(min_pls_c, max_pls_c, modified = FALSE,
#'                 max_iter = 100, tol = 1e-6)
#'
#' local_fit_gpr(noise_variance = 0.001)
#'
#' @param pls_c an integer indicating the number of pls components to be used in
#' the local regressions when the partial least squares (\code{local_fit_pls})
#' method is used.
#' @param min_pls_c an integer indicating the minimum number of pls components
#' to be used in the local regressions when the weighted average partial least
#' squares (\code{local_fit_wapls}) method is used. See details.
#' @param max_pls_c integer indicating the maximum number of pls components
#' to be used in the local regressions when the weighted average partial least
#' squares (\code{local_fit_wapls}) method is used. See details.
#' @param modified a logical indicating whether the modified version of the pls
#' algorithm (Shenk and Westerhaus, 1991 and Westerhaus, 2014). Default is
#' \code{FALSE}.
#' @param tol a numeric value indicating the convergence for calculating the
#' scores. Default is 1-e6.
#' @param max_iter an integer indicating the maximum number of iterations in
#' case \code{tol} is not reached. Defaul is 100.
#' @param noise_variance a numeric value indicating the variance of the noise
#' for Gaussian process local regressions (\code{local_fit_gpr}). Default is
#' 0.001.
#'
#' @details
#' These functions are used to indicate how to fit
#' the regression models within the \code{\link{mbl}} function.
#'
#' There are three possible options for performing these regressions:
#' \itemize{
#'  \item{Partial least squares (pls, \code{local_fit_pls}): It uses the
#'  orthogonal scores (non-linear iterative partial least squares, nipals)
#'  algorithm. The only parameter which needs to be optimized is the number of
#'  pls components.}
#'  \item{Weighted average pls (\code{local_fit_wapls}): This method was
#'  developed by Shenk et al. (1997) and it used as the regression method in the
#'  widely known LOCAL algorithm. It uses multiple models generated by multiple
#'  pls components (i.e. between a minimum and a maximum number of pls
#'  components). At each local partition the final predicted value is a ensemble
#'  (weighted average) of all the predicted values generated by the multiple pls
#'  models. The weight for each component is calculated as follows:
#'
#'  \mjdeqn{w_{j}  =  \frac{1}{s_{1:j}\times g_{j}}}{w_j  =  1/(s_{1:j} xx g_{j})}
#'
#'  where \mjeqn{s_{1:j}}{s_{1:j}} is the root mean square of the 
#'  spectral reconstruction error of the unknown (or target) observation(s) 
#'  when a total of \mjeqn{j}{j} pls components are used and 
#'  \mjeqn{g_{j}}{g_{j}} is the root mean square of the squared regression 
#'  coefficients corresponding to the \mjeqn{j}{j}th pls component (see 
#'  Shenk et al., 1997 for more details).}
#'  \item{Gaussian process with dot product covariance (\code{local_fit_gpr}): 
#'  Gaussian process regression is a probabilistic and non-parametric Bayesian
#'  method. It is commonly described as a collection of random variables which
#'  have a joint Gaussian distribution and it is characterized by both a mean
#'  and a covariance function (Rasmussen and Williams, 2006). The covariance
#'  function used in the implemented method is the dot product. The only
#'  parameter to be taken into account in this method is the noise. In this
#'  method, the process for predicting the response variable of a new sample
#'  (\mjeqn{y_u}{y_u}) from its predictor variables
#'  (\mjeqn{x_u}{x_u}) is carried out first by computing a prediction
#'  vector (\mjeqn{A}{A}). It is derived from a reference/training observations
#'  congaing both a response vector (\mjeqn{Y}{Y}) and predictors (\mjeqn{X}{X}) as follows:
#'
#'  \mjdeqn{A = (X  X^{T} + \sigma^2 I)^{-1} Y}{A = (X  X^T + sigma^2 I)^{-1} Y}
#'
#'   where  \mjeqn{\sigma^2}{sigma^2} denotes the variance of the noise and \mjeqn{I}{I} the
#'   identity matrix (with dimensions equal to the number of observations in
#'   \mjeqn{X}{X}). The prediction of \mjeqn{y_{u}}{y_u} is then done as follows:
#'
#'   \mjdeqn{\hat{y}_{u} = (x_{u}x_{u}^{T}) A}{hat y_{u} = (x_{u} x_{u}^T) A}
#'
#'  }
#'  }
#'  
#'
#' The \code{modified} argument in the pls methods (\code{local_fit_pls()}
#' and \code{local_fit_wapls()}) is used to indicate if
#' a modified version of the pls algorithm (modified pls or mpls) is to be used.
#' The modified pls was proposed Shenk and Westerhaus
#' (1991, see also Westerhaus, 2014) and it differs from the standard pls method
#' in the way the weights of the predictors (used to compute the matrix of
#' scores) are obtained. While pls uses the covariance between  response(s)
#' and predictors (and later their deflated versions corresponding at each pls
#' component iteration) to obtain these weights, the modified pls uses the
#' correlation as weights. The authors indicate that by using correlation,
#' a larger potion of the response variable(s) can be explained.
#'
#' @return An object of class \code{local_fit} mirroring the input arguments.
#' @author \href{https://orcid.org/0000-0002-5369-5120}{Leonardo Ramirez-Lopez}
#' @references
#' Shenk, J. S., & Westerhaus, M. O. 1991. Populations structuring of
#' near infrared spectra and modified partial least squares regression.
#' Crop Science, 31(6), 1548-1555.
#'
#' Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL
#' calibration procedure for near infrared instruments. Journal of Near Infrared
#' Spectroscopy, 5, 223-232.
#'
#' Rasmussen, C.E., Williams, C.K. Gaussian Processes for Machine Learning.
#' Massachusetts Institute of Technology: MIT-Press, 2006.
#'
#' Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding
#' Wachievements in near infrared spectroscopy: my contributions to
#' Wnear infrared spectroscopy. NIR news, 25(8), 16-20.
#' @seealso \code{\link{mbl}}
#' @examples
#' local_fit_wapls(min_pls_c = 3, max_pls_c = 12)
#' @export

## History:
## 28.05.2020 Leo     Hello world!
## 19.07.2020 Leo     arguments pls_max_iter and pls_tol were removed fas they
##                    are only required when modleing for more than one response
##                    variable, i.e. pls2 (which is not implemented for mbl)
## 20.09.2020 Leo     max_iter and tol were added to pls obajects

local_fit_pls <- function(pls_c, modified = FALSE, max_iter = 100, tol = 1e-6) {
  if (missing(pls_c)) {
    stop("'pls_c' must be specified")
  }

  if (length(pls_c) != 1 | !is.numeric(pls_c)) {
    stop(paste0(
      "'pls_c' must be a single numerical ",
      "value specifiying the maximum number of pls components to ",
      "be evaluated"
    ))
  }

  fit_type <- list(
    method = "pls",
    pls_c = pls_c,
    modified = modified,
    max_iter = max_iter,
    tol = tol
  )
  class(fit_type) <- c("local_fit", "list")
  fit_type
}

#' @aliases local_fit
#' @export local_fit_wapls
local_fit_wapls <- function(min_pls_c,
                            max_pls_c,
                            modified = FALSE,
                            max_iter = 100,
                            tol = 1e-6) {
  if (missing(min_pls_c) | missing(max_pls_c)) {
    stop("Both 'min_pls_c' and 'max_pls_c' must be specified")
  }

  if (length(min_pls_c) != 1 | !is.numeric(min_pls_c)) {
    stop(paste0(
      "'min_pls_c' must be a single numerical ",
      "value specifiying the minimum number of pls components to ",
      "be evaluated"
    ))
  }

  if (length(max_pls_c) != 1 | !is.numeric(max_pls_c)) {
    stop(paste0(
      "'max_pls_c' must be a single numerical ",
      "value specifiying the maximum number of pls components to ",
      "be evaluated"
    ))
  }

  if (min_pls_c >= max_pls_c) {
    stop("min_pls_c must be smaller than max_pls_c")
  }

  fit_type <- list(
    method = "wapls",
    pls_c = c(min_pls_c = min_pls_c, max_pls_c = max_pls_c),
    modified = modified,
    max_iter = max_iter,
    tol = tol
  )

  class(fit_type) <- c("local_fit", "list")
  fit_type
}

#' @aliases local_fit
#' @export local_fit_gpr
local_fit_gpr <- function(noise_variance = 0.001) {
  if (length(noise_variance) != 1 | !is.numeric(noise_variance)) {
    stop("'noise_variance' must be a single numerical value")
  }
  fit_type <- list(
    method = "gpr",
    noise_variance = noise_variance
  )
  class(fit_type) <- c("local_fit", "list")
  fit_type
}