R/shenkWest.R
In prospectr: Miscellaneous Functions for Processing and Sample Selection of Spectroscopic Data

Documented in shenkWest

#' @title SELECT algorithm for calibration sampling
#'
#' @description
#' Select calibration samples from a large multivariate data using the SELECT
#' algorithm as described in Shenk and Westerhaus (1991).
#' @usage
#' shenkWest(X,
#'           d.min = 0.6,
#'           pc = 0.95,
#'           rm.outlier = FALSE,
#'           .center = TRUE,
#'           .scale = FALSE)
#' @param X a numeric matrix (optionally a data frame that can
#' be coerced to a numerical matrix).
#' @param d.min a minimum distance (default = 0.6).
#' @param pc the number of principal components retained in the computation
#' distance in the standardized Principal Component space (Mahalanobis distance).
#' If `pc < 1`, the number of principal components kept corresponds to the
#' number of components explaining at least (`pc * 100`) percent of the total
#' variance (default = 0.95).
#' @param rm.outlier logical. If `TRUE`, remove observations with a standardized
#' mahalanobis distance to the center of the data greater than 3
#' (default = `FALSE`).
#' @param .center logical. Indicates whether the input matrix should be centered
#' before Principal Component Analysis. Default set to \code{TRUE}.
#' @param .scale logical. Indicates whether the input matrix should be scaled
#' before Principal Component Analysis. Default set to \code{FALSE}.
#' @author Antoine Stevens
#' @return a `list` with components:
#' \itemize{
#'  \item{'`model`': numeric vector giving the row indices of the input data
#'  selected for calibration}
#'  \item{'`test`': numeric vector giving the row indices of the remaining
#'  observations}
#'  \item{'`pc`': a numeric matrix of the scaled pc scores}
#' }
#' @details
#' The SELECT algorithm is an iterative procedure based on the standardized
#' Mahalanobis distance between observations.
#' First, the observation having the highest number of neighbours within a given
#' minimum distance is selected and its neighbours are discarded. The procedure
#' is repeated until there is no observation left.
#'
#' If the `rm.outlier` argument is set to `TRUE`, outliers will be removed
#' before running the SELECT algorithm, using the CENTER algorithm of
#' Shenk and Westerhaus (1991), i.e. samples with a standardized Mahalanobis
#' distance `>3` are removed.
#' @examples
#' data(NIRsoil)
#' # reduce data size
#' NIRsoil$spc <- binning(X = NIRsoil$spc, bin.size = 5)
#' sel <- shenkWest(NIRsoil$spc, pc = .99, d.min = .3, rm.outlier = FALSE)
#' plot(sel$pc[, 1:2], xlab = "PC1", ylab = "PC2")
#' # points selected for calibration
#' points(sel$pc[sel$model, 1:2], pch = 19, col = 2)
#' # without outliers
#' sel <- shenkWest(NIRsoil$spc, pc = .99, d.min = .3, rm.outlier = TRUE)
#' plot(sel$pc[, 1:2], xlab = "PC1", ylab = "PC2")
#' # points selected for calibration
#' points(sel$pc[sel$model, 1:2], pch = 15, col = 3)
#' @references Shenk, J.S., and Westerhaus, M.O., 1991. Population Definition,
#' Sample Selection, and Calibration Procedures for Near Infrared Reflectance
#' Spectroscopy. Crop Science 31, 469-474.
#' @seealso \code{\link{kenStone}}, \code{\link{duplex}}, \code{\link{puchwein}}
#' @export
#'

shenkWest <- function(X,
                      d.min = 0.6,
                      pc = 0.95,
                      rm.outlier = FALSE,
                      .center = TRUE,
                      .scale = FALSE) {
  if (is.data.frame(X)) {
    X <- as.matrix(X)
  }

  # Compute scores of PCA
  pca <- prcomp(X, center = .center, scale = .scale)
  if (pc < 1) {
    pvar <- pca$sdev^2 / sum(pca$sdev^2)
    pcsum <- cumsum(pvar) < pc
    if (any(pcsum)) {
      pc <- max(which(pcsum)) + 1
    } else {
      pc <- 1
    }
  }
  scores.ini <- scores <- sweep(pca$x[, 1:pc, drop = F], 2, pca$sdev[1:pc], "/") # scaling of the scores

  n <- nini <- 1:nrow(X)
  model <- NULL

  if (rm.outlier) {
    m <- fastDistV(scores, colMeans(scores), "euclid") # squared mahalanobis distance
    m <- m / pc # standardized mahalanobis distance (also called GH, Global H distance)
    idx <- m <= 3
    scores <- scores[idx, , drop = F] # remove samples with H > 3
    n <- n[idx]
  }

  d <- fastDist(scores, scores, "euclid") # NH - Neighbour mahalanobis H distance
  d <- d / pc # standardized mahalanobis distance
  d <- d < d.min # distance treshold

  while (ncol(d) > 1) {
    idx <- which.max(colSums(d))
    knn <- which(d[, idx])
    if (length(knn) < 2) {
      break
    }
    model <- c(model, n[idx])
    n <- n[-knn]
    d <- d[-knn, -knn, drop = F]
  }

  return(list(model = model, test = nini[-n], pc = scores.ini))
}

Any scripts or data that you put into this service are public.

prospectr documentation built on June 22, 2024, 11:08 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

prospectr
Miscellaneous Functions for Processing and Sample Selection of Spectroscopic Data

R/shenkWest.R
In prospectr: Miscellaneous Functions for Processing and Sample Selection of Spectroscopic Data

Defines functions shenkWest

Documented in shenkWest

Try the prospectr package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

prospectr Miscellaneous Functions for Processing and Sample Selection of Spectroscopic Data

R/shenkWest.R In prospectr: Miscellaneous Functions for Processing and Sample Selection of Spectroscopic Data

Defines functions shenkWest

Documented in shenkWest

Try the prospectr package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

prospectr
Miscellaneous Functions for Processing and Sample Selection of Spectroscopic Data

R/shenkWest.R
In prospectr: Miscellaneous Functions for Processing and Sample Selection of Spectroscopic Data