R/dkden.r

Defines functions dkden pkden qkden rkden

Documented in dkden pkden qkden rkden

#' @name kden
#' 
#' @title Kernel Density Estimation, With Variety of Kernels
#'
#' @description Density, cumulative distribution function, quantile function and
#'   random number generation for the kernel density estimation using the kernel
#'   specified by \code{kernel}, with a constant bandwidth specified by either
#'   \code{lambda} or \code{bw}.
#'
#' @inheritParams gpd
#' @inheritParams kernels
#' 
#' @details Kernel density estimation using one of many possible kernels with a
#' constant bandwidth. 
#' 
#' The alternate bandwidth definitions are discussed in the
#' \code{\link[evmix:kernels]{kernels}}, with the \code{lambda} as the default.
#' The \code{bw} specification is the same as used in the
#' \code{\link[stats:density]{density}} function.
#' 
#' The possible kernels are also defined in \code{\link[evmix:kernels]{kernels}} help
#' documentation with the \code{"gaussian"} as the default choice.
#'
#' The density function \code{\link[evmix:kden]{dkden}} produces exactly the
#' same density estimate as \code{\link[stats:density]{density}} when a sequence
#' of \code{x} values are provided, see examples. The latter function is far
#' more efficient in this situation as it takes advantage of the computational
#' savings from doing the kernel smoothing in the spectral domain (using the FFT),
#' where the convolution becomes a multiplication. So even after accounting for applying
#' the (Fast) Fourier Transform (FFT) and its inverse it is much more efficient
#' especially for a large sample size or large number of evaluation points.
#' 
#' However, this KDE function applies the less efficient convolution using the
#' standard definition:
#' \deqn{\hat{f}_(x) = \frac{1}{n} \sum_{j=1}^{n} K(\frac{x - x_j}{\lambda})}
#' where \eqn{K(.)} is the density function for the standard
#' kernel. Thus are no restriction on the values \code{x} can take. For example, in the 
#' \code{"gaussian"} kernel case for a particular \code{x} the density is evaluated as
#' \code{mean(dnorm(x, kerncentres, lambda))} for the density and
#' \code{mean(pnorm(x, kerncentres, lambda))} for cumulative distribution
#' function which is slower than the FFT but is more adaptable.
#' 
#' An inversion sampler is used for random number generation which also rather
#' inefficient, as it can be carried out more efficiently using a mixture representation.
#' 
#' The quantile function is rather complicated as there is no closed form solution,
#' so is obtained by numerical approximation of the inverse cumulative distribution function
#' \eqn{P(X \le q) = p} to find \eqn{q}. The quantile function 
#' \code{\link[evmix:kden]{qkden}} evaluates the KDE cumulative distribution
#' function over the range from \code{c(max(kerncentre) - lambda, max(kerncentre) + lambda)},
#' or \code{c(max(kerncentre) - 5*lambda, max(kerncentre) + 5*lambda)} for normal kernel.
#' Outside of this range the quantiles are set to \code{-Inf} for lower tail and \code{Inf}
#' for upper tail. A sequence of values
#' of length fifty times the number of kernels (with minimum of 1000) is first
#' calculated. Spline based interpolation using \code{\link[stats:splinefun]{splinefun}},
#' with default \code{monoH.FC} method, is then used to approximate the quantile
#' function. This is a similar approach to that taken
#' by Matt Wand in the \code{\link[ks:kde]{qkde}} in the \code{\link[ks:kde]{ks}} package.
#' 
#' If no bandwidth is provided \code{lambda=NULL} and \code{bw=NULL} then the normal
#' reference rule is used, using the \code{\link[stats:bandwidth]{bw.nrd0}} function, which is
#' consistent with the \code{\link[stats:density]{density}} function. At least two kernel
#' centres must be provided as the variance needs to be estimated.
#' 
#' @return \code{\link[evmix:kden]{dkden}} gives the density, 
#' \code{\link[evmix:kden]{pkden}} gives the cumulative distribution function,
#' \code{\link[evmix:kden]{qkden}} gives the quantile function and 
#' \code{\link[evmix:kden]{rkden}} gives a random sample.
#' 
#' @note Unlike most of the other extreme value mixture model functions the 
#'   \code{\link[evmix:kden]{kden}} functions have not been vectorised as
#'   this is not appropriate. The main inputs (\code{x}, \code{p} or \code{q})
#'   must be either a scalar or a vector, which also define the output length.
#' 
#' The kernel centres \code{kerncentres} can either be a single datapoint or a vector
#' of data. The kernel centres (\code{kerncentres}) and locations to evaluate density (\code{x})
#' and cumulative distribution function (\code{q}) would usually be different.
#' 
#' Default values are provided for all inputs, except for the fundamentals 
#' \code{kerncentres}, \code{x}, \code{q} and \code{p}. The default sample size for 
#' \code{\link[evmix:kden]{rkden}} is 1.
#' 
#' Missing (\code{NA}) and Not-a-Number (\code{NaN}) values in \code{x},
#' \code{p} and \code{q} are passed through as is and infinite values are set to
#' \code{NA}. None of these are not permitted for the parameters.
#' 
#' Error checking of the inputs (e.g. invalid probabilities) is carried out and
#' will either stop or give warning message as appropriate.
#' 
#' @references
#' \url{http://en.wikipedia.org/wiki/Kernel_density_estimation}
#' 
#' \url{http://en.wikipedia.org/wiki/Cross-validation_(statistics)}
#' 
#' Scarrott, C.J. and MacDonald, A. (2012). A review of extreme value
#' threshold estimation and uncertainty quantification. REVSTAT - Statistical
#' Journal 10(1), 33-59. Available from \url{http://www.ine.pt/revstat/pdf/rs120102.pdf}
#' 
#' Hu Y. and Scarrott, C.J. (2018). evmix: An R Package for Extreme Value Mixture Modeling, 
#' Threshold Estimation and Boundary Corrected Kernel Density Estimation. Journal of
#' Statistical Software 84(5), 1-27. doi: 10.18637/jss.v084.i05.
#' 
#' Bowman, A.W. (1984). An alternative method of cross-validation for the smoothing of
#' density estimates. Biometrika 71(2), 353-360.
#' 
#' Duin, R.P.W. (1976). On the choice of smoothing parameters for Parzen estimators of
#' probability density functions. IEEE Transactions on Computers C25(11), 1175-1179.
#' 
#' MacDonald, A., Scarrott, C.J., Lee, D., Darlow, B., Reale, M. and Russell, G. (2011).
#' A flexible extreme value mixture model. Computational Statistics and Data Analysis
#' 55(6), 2137-2157.
#' 
#' Wand, M. and Jones, M.C. (1995). Kernel Smoothing. Chapman && Hall.
#' 
#' @author Yang Hu and Carl Scarrott \email{carl.scarrott@@canterbury.ac.nz}.
#'
#' @section Acknowledgments: Based on code
#' by Anna MacDonald produced for MATLAB.
#' 
#' @seealso \code{\link[evmix:kernels]{kernels}}, \code{\link[evmix:kfun]{kfun}},
#' \code{\link[stats:density]{density}}, \code{\link[stats:bandwidth]{bw.nrd0}}
#' and \code{\link[ks:kde]{dkde}} in \code{\link[ks:kde]{ks}} package.
#' 
#' @aliases kden dkden pkden qkden rkden
#' @family  kden 
#' @family  kdengpd
#' @family  gkg
#' @family  bckden
#' @family  bckdengpd
#' @family  fkden
#' 
#' @examples
#' \dontrun{
#' set.seed(1)
#' par(mfrow = c(2, 2))
#' 
#' nk=50
#' x = rnorm(nk)
#' xx = seq(-5, 5, 0.01)
#' plot(xx, dnorm(xx))
#' rug(x)
#' for (i in 1:nk) lines(xx, dnorm(xx, x[i], sd = bw.nrd0(x))*0.05)
#' lines(xx, dkden(xx, x), lwd = 2, col = "red")
#' lines(density(x), lty = 2, lwd = 2, col = "green")
#' legend("topright", c("True Density", "KDE Using evmix", "KDE Using density function"),
#' lty = c(1, 1, 2), lwd = c(1, 2, 2), col = c("black", "red", "green"))
#' 
#' # Estimate bandwidth using cross-validation likelihood
#' x = rnorm(nk)
#' fit = fkden(x)
#' hist(x, nk/5, freq = FALSE, xlim = c(-5, 5), ylim = c(0, 0.6)) 
#' rug(x)
#' for (i in 1:nk) lines(xx, dnorm(xx, x[i], sd = fit$bw)*0.05)
#' lines(xx,dnorm(xx), col = "black")
#' lines(xx, dkden(xx, x, lambda = fit$lambda), lwd = 2, col = "red")
#' lines(density(x), lty = 2, lwd = 2, col = "green")
#' lines(density(x, bw = fit$bw), lwd = 2, lty = 2,  col = "blue")
#' legend("topright", c("True Density", "KDE fitted evmix",
#' "KDE Using density, default bandwidth", "KDE Using density, c-v likelihood bandwidth"),
#' lty = c(1, 1, 2, 2), lwd = c(1, 2, 2, 2), col = c("black", "red", "green", "blue"))
#'
#' plot(xx, pnorm(xx), type = "l")
#' rug(x)
#' lines(xx, pkden(xx, x), lwd = 2, col = "red")
#' lines(xx, pkden(xx, x, lambda = fit$lambda), lwd = 2, col = "green")
#' # green and blue (quantile) function should be same
#' p = seq(0, 1, 0.001)
#' lines(qkden(p, x, lambda = fit$lambda), p, lwd = 2, lty = 2, col = "blue") 
#' legend("topleft", c("True Density", "KDE using evmix, normal reference rule",
#' "KDE using evmix, c-v likelihood","KDE quantile function, c-v likelihood"),
#' lty = c(1, 1, 1, 2), lwd = c(1, 2, 2, 2), col = c("black", "red", "green", "blue"))
#' 
#' xnew = rkden(10000, x, lambda = fit$lambda)
#' hist(xnew, breaks = 100, freq = FALSE, xlim = c(-5, 5))
#' rug(xnew)
#' lines(xx,dnorm(xx), col = "black")
#' lines(xx, dkden(xx, x), lwd = 2, col = "red")
#' legend("topright", c("True Density", "KDE Using evmix"),
#' lty = c(1, 2), lwd = c(1, 2), col = c("black", "red"))
#' }
#' 
NULL

#' @export
#' @aliases kden dkden pkden qkden rkden
#' @rdname  kden

# density function for kernel density estimator
dkden <- function(x, kerncentres, lambda = NULL, bw = NULL, kernel = "gaussian", log = FALSE) {

  # Check properties of inputs
  check.quant(x, allowna = TRUE, allowinf = TRUE)
  check.quant(kerncentres, allowna = TRUE, allowinf = TRUE)
  check.kbw(lambda, bw, allownull = TRUE)
  check.kernel(kernel)
  check.logic(log)

  kernel = ifelse(kernel == "rectangular", "uniform", kernel)
  kernel = ifelse(kernel == "normal", "gaussian", kernel)

  if (any(is.infinite(x))) warning("infinite quantiles set to NA")

  x[is.infinite(x)] = NA # user will have to deal with infinite cases
    
  if (any(!is.finite(kerncentres))) warning("non-finite kernel centres are dropped")

  kerncentres = kerncentres[is.finite(kerncentres)]
  check.quant(kerncentres)
  nk = length(kerncentres)

  if (is.null(lambda) & is.null(bw)) {
    if (nk == 1) {
      stop("Automated bandwidth estimation requires 2 or more kernel centres")
    } else if (nk < 10) {
      warning("Automated bandwidth estimation unreliable with less than 10 kernel centres")
    }
    bw = bw.nrd0(kerncentres)
  }
  lambda = klambda(bw, kernel, lambda)

  d = x # pass through NA/NaN as entered
  
  whichok = which(is.finite(x))
  
  d[whichok] = sapply(x[whichok], FUN = kdenx, kerncentres = kerncentres,
    lambda = lambda, kernel = kernel) 

  if (log) d = log(d)

  d
}

#' @export
#' @aliases kden dkden pkden qkden rkden
#' @rdname  kden

# cumulative distribution function for kernel density estimator
pkden <- function(q, kerncentres, lambda = NULL, bw = NULL, kernel = "gaussian", lower.tail = TRUE) {

  # Check properties of inputs
  check.quant(q, allowna = TRUE, allowinf = TRUE)
  check.quant(kerncentres, allowna = TRUE, allowinf = TRUE)
  check.kbw(lambda, bw, allownull = TRUE)
  check.kernel(kernel)
  check.logic(lower.tail)

  kernel = ifelse(kernel == "rectangular", "uniform", kernel)
  kernel = ifelse(kernel == "normal", "gaussian", kernel)

  if (any(is.infinite(q))) warning("infinite quantiles set to NA")

  q[is.infinite(q)] = NA # user will have to deal with infinite cases

  if (any(!is.finite(kerncentres))) warning("non-finite kernel centres are dropped")

  kerncentres = kerncentres[is.finite(kerncentres)]
  check.quant(kerncentres)
  nk = length(kerncentres)

  if (is.null(lambda) & is.null(bw)) {
    if (nk == 1) {
      stop("Automated bandwidth estimation requires 2 or more kernel centres")
    } else if (nk < 10) {
      warning("Automated bandwidth estimation unreliable with less than 10 kernel centres")
    }
    bw = bw.nrd0(kerncentres)
  }
  lambda = klambda(bw, kernel, lambda)

  p = q # pass through NA/NaN as entered
  
  whichok = which(is.finite(q))
  
  p[whichok] = sapply(q[whichok], FUN = pkdenx, kerncentres = kerncentres,
    lambda = lambda, kernel = kernel) 

  # sometimes due to numerical errors p>1 or p<0
  p = pmax(pmin(p, 1), 0)
  
  if (!lower.tail) p = 1 - p

  p
}

#' @export
#' @aliases kden dkden pkden qkden rkden
#' @rdname  kden

# inverse cumulative distribution function for kernel density estimator
qkden <- function(p, kerncentres, lambda = NULL, bw = NULL, kernel = "gaussian", lower.tail = TRUE) {

  # Check properties of inputs
  check.prob(p, allowna = TRUE)
  check.quant(kerncentres, allowna = TRUE, allowinf = TRUE)
  check.kbw(lambda, bw, allownull = TRUE)
  check.kernel(kernel)
  check.logic(lower.tail)

  kernel = ifelse(kernel == "rectangular", "uniform", kernel)
  kernel = ifelse(kernel == "normal", "gaussian", kernel)

  if (any(!is.finite(kerncentres))) warning("non-finite kernel centres are dropped")

  kerncentres = kerncentres[is.finite(kerncentres)]
  check.quant(kerncentres)
  nk = length(kerncentres)
  
  if (is.null(lambda) & is.null(bw)) {
    if (nk == 1) {
      stop("Automated bandwidth estimation requires 2 or more kernel centres")
    } else if (nk < 10) {
      warning("Automated bandwidth estimation unreliable with less than 10 kernel centres")
    }
    bw = bw.nrd0(kerncentres)
  }
  lambda = klambda(bw, kernel, lambda)

  if (!lower.tail) p = 1 - p

  q = p
  
  # obtain quantile function but interpolation between kernel based CDF estimates
  # - CDF is quick to estimate, interpolation algorithms also quite fast
  # - an alternative solution is to numerical solve to find quantile, but is much slower

  maxp = ifelse(kernel == "gaussian", 5*lambda, lambda)
  
  qrange = range(kerncentres) + maxp * c(-1, 1)
  qk = seq(qrange[1], qrange[2], length.out = min(50 * nk, 1000))

  pk = sapply(qk, FUN = pkdenx, lambda = lambda, kerncentres = kerncentres, kernel = kernel) 

  qfun = splinefun(x = pk, y = qk)

  whichinterp = which((p >= min(pk)) & (p <= max(pk)))
  ninterp = length(whichinterp)
  which0 = which(p < min(pk))
  n0 = length(which0)
  which1 = which(p > max(pk))
  n1 = length(which1)
  
  if (ninterp > 0) q[whichinterp] = qfun(p[whichinterp])

  # if further out than 5 standard deviations from kernel centres then set:
  # lower tail to -Inf and upper tail set to Inf
  if (n0 > 0) q[which0] = -Inf
  if (n1 > 0) q[which1] = Inf
  
  q
}

#' @export
#' @aliases kden dkden pkden qkden rkden
#' @rdname  kden

# random number generation for kernel density estimator
rkden <- function(n = 1, kerncentres, lambda = NULL, bw = NULL, kernel = "gaussian") {

  # Check properties of inputs
  check.n(n)
  check.quant(kerncentres, allowna = TRUE, allowinf = TRUE)
  check.kbw(lambda, bw, allownull = TRUE)
  check.kernel(kernel)

  # Inversion sampling (very inefficient!, but avoids need for RNG's for all different kernels)
  qkden(runif(n), kerncentres, lambda, bw, kernel)
}

Try the evmix package in your browser

Any scripts or data that you put into this service are public.

evmix documentation built on Sept. 3, 2019, 5:07 p.m.