bckden: Boundary Corrected Kernel Density Estimation Using a Variety...

Description Usage Arguments Details Value Boundary Correction Methods Warning Acknowledgments Note Author(s) References See Also Examples

Description

Density, cumulative distribution function, quantile function and random number generation for boundary corrected kernel density estimators using a variety of approaches (and different kernels) with a constant bandwidth lambda.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
dbckden(x, kerncentres, lambda = NULL, bw = NULL,
  kernel = "gaussian", bcmethod = "simple", proper = TRUE,
  nn = "jf96", offset = NULL, xmax = NULL, log = FALSE)

pbckden(q, kerncentres, lambda = NULL, bw = NULL,
  kernel = "gaussian", bcmethod = "simple", proper = TRUE,
  nn = "jf96", offset = NULL, xmax = NULL, lower.tail = TRUE)

qbckden(p, kerncentres, lambda = NULL, bw = NULL,
  kernel = "gaussian", bcmethod = "simple", proper = TRUE,
  nn = "jf96", offset = NULL, xmax = NULL, lower.tail = TRUE)

rbckden(n = 1, kerncentres, lambda = NULL, bw = NULL,
  kernel = "gaussian", bcmethod = "simple", proper = TRUE,
  nn = "jf96", offset = NULL, xmax = NULL)

Arguments

x

quantiles

kerncentres

kernel centres (typically sample data vector or scalar)

lambda

bandwidth for kernel (as half-width of kernel) or NULL

bw

bandwidth for kernel (as standard deviations of kernel) or NULL

kernel

kernel name (default = "gaussian")

bcmethod

boundary correction method

proper

logical, whether density is renormalised to integrate to unity (where needed)

nn

non-negativity correction method (simple boundary correction only)

offset

offset added to kernel centres (logtrans only) or NULL

xmax

upper bound on support (copula and beta kernels only) or NULL

log

logical, if TRUE then log density

q

quantiles

lower.tail

logical, if FALSE then upper tail probabilities

p

cumulative probabilities

n

sample size (positive integer)

Details

Boundary corrected kernel density estimation (BCKDE) with improved bias properties near the boundary compared to standard KDE available in kden functions. The user chooses from a wide range of boundary correction methods designed to cope with a lower bound at zero and potentially also both upper and lower bounds.

Some boundary correction methods require a secondary correction for negative density estimates of which two methods are implemented. Further, some methods don't necessarily give a density which integrates to one, so an option is provided to renormalise to be proper.

It assumes there is a lower bound at zero, so prior transformation of data is required for a alternative lower bound (possibly including negation to allow for an upper bound).

The alternate bandwidth definitions are discussed in the kernels, with the lambda as the default. The bw specification is the same as used in the density function.

Certain boundary correction methods use the standard kernels which are defined in the kernels help documentation with the "gaussian" as the default choice.

The quantile function is rather complicated as there is no closed form solution, so is obtained by numerical approximation of the inverse cumulative distribution function P(X ≤ q) = p to find q. The quantile function qbckden evaluates the KDE cumulative distribution function over the range from c(0, max(kerncentre) + lambda), or c(0, max(kerncentre) + 5*lambda) for normal kernel. Outside of this range the quantiles are set to 0 for lower tail and Inf (or xmax where appropriate) for upper tail. A sequence of values of length fifty times the number of kernels (upto a maximum of 1000) is first calculated. Spline based interpolation using splinefun, with default monoH.FC method, is then used to approximate the quantile function. This is a similar approach to that taken by Matt Wand in the qkde in the ks package.

Unlike the standard KDE, there is no general rule-of-thumb bandwidth for all these estimators, with only certain methods having a guideline in the literature, so none have been implemented. Hence, a bandwidth must always be specified and you should consider using fbckden function for cross-validation MLE for bandwidth.

Random number generation is slow as inversion sampling using the (numerically evaluated) quantile function is implemented. Users may want to consider alternative approaches instead, like rejection sampling.

Value

dbckden gives the density, pbckden gives the cumulative distribution function, qbckden gives the quantile function and rbckden gives a random sample.

Boundary Correction Methods

Renormalisation to a proper density is assumed by default proper=TRUE. This correction is needed for bcmethod="renorm", "simple", "beta1", "beta2", "gamma1" and "gamma2" which all require numerical integration. Renormalisation will not be carried out for other methods, even when proper=TRUE.

Non-negativity correction is only relevant for the bcmethod="simple" approach. The Jones and Foster (1996) method is applied nn="jf96" by default. This method can occassionally give an extra boundary bias for certain populations (e.g. Gamma(2, 1)), see paper for details. Non-negative values can simply be zeroed (nn="zero"). Renormalisation should always be applied after non-negativity correction. Non-negativity correction will not be carried out for other methods, even when requested by user.

The non-negative correction is applied before renormalisation, when both requested.

The boundary correction methods implemented are listed below. The first set can use any type of kernel (see kernels help documentation):

bcmethod="simple" is the default and applies the simple boundary correction method in equation (3.4) of Jones (1993) and is equivalent to the kernel weighted local linear fitting at the boundary. Renormalisation and non-negativity correction may be required.

bcmethod="cutnorm" applies cut and normalisation method of Gasser and Muller (1979), where the kernels themselves are individually truncated at the boundary and renormalised to unity.

bcmethod="renorm" applies first order correction method discussed in Diggle (1985), where the kernel density estimate is locally renormalised near boundary. Renormalisation may be required.

bcmethod="reflect" applies reflection method of Boneva, Kendall and Stefanov (1971) which is equivalent to the dataset being supplemented by the same dataset negated. This method implicitly assumes f'(0)=0, so can cause extra artefacts at the boundary.

bcmethod="logtrans" applies KDE on the log-scale and then back-transforms (with explicit normalisation) following Marron and Ruppert (1992). This is the approach implemented in the ks package. As the KDE is applied on the log scale, the effective bandwidth on the original scale is non-constant. The offset option is only used for this method and is commonly used to offset zero kernel centres in log transform to prevent log(0).

All the following boundary correction methods do not use kernels in their usual sense, so ignore the kernel input:

bcmethod="beta1" and "beta2" uses the beta and modified beta kernels of Chen (1999) respectively. The xmax rescales the beta kernels to be defined on the support [0, xmax] rather than unscaled [0, 1]. Renormalisation will be required.

bcmethod="gamma1" and "gamma2" uses the gamma and modified gamma kernels of Chen (2000) respectively. Renormalisation will be required.

bcmethod="copula" uses the bivariate normal copula based kernesl of Jones and Henderson (2007). As with the bcmethod="beta1" and "beta2" methods the xmax rescales the copula kernels to be defined on the support [0, xmax] rather than [0, 1]. In this case the bandwidth is defined as lambda=1-ρ^2, so the bandwidth is limited to (0, 1).

Warning

The "simple", "renorm", "beta1", "beta2", "gamma1" and "gamma2" boundary correction methods may require renormalisation using numerical integration which can be very slow. In particular, the numerical integration is extremely slow for the kernel="uniform", due to the adaptive quadrature in the integrate function being particularly slow for functions with step-like behaviour.

Acknowledgments

Based on code by Anna MacDonald produced for MATLAB.

Note

Unlike most of the other extreme value mixture model functions the bckden functions have not been vectorised as this is not appropriate. The main inputs (x, p or q) must be either a scalar or a vector, which also define the output length.

The kernel centres kerncentres can either be a single datapoint or a vector of data. The kernel centres (kerncentres) and locations to evaluate density (x) and cumulative distribution function (q) would usually be different.

Default values are provided for all inputs, except for the fundamentals lambda, kerncentres, x, q and p. The default sample size for rbckden is 1.

The xmax option is only relevant for the beta and copula methods, so a warning is produced if this is not NULL for in other methods. The offset option is only relevant for the "logtrans" method, so a warning is produced if this is not NULL for in other methods.

Missing (NA) and Not-a-Number (NaN) values in x, p and q are passed through as is and infinite values are set to NA. None of these are not permitted for the parameters.

Error checking of the inputs (e.g. invalid probabilities) is carried out and will either stop or give warning message as appropriate.

Author(s)

Yang Hu and Carl Scarrott carl.scarrott@canterbury.ac.nz.

References

http://en.wikipedia.org/wiki/Kernel_density_estimation

http://en.wikipedia.org/wiki/Cross-validation_(statistics)

Scarrott, C.J. and MacDonald, A. (2012). A review of extreme value threshold estimation and uncertainty quantification. REVSTAT - Statistical Journal 10(1), 33-59. Available from http://www.ine.pt/revstat/pdf/rs120102.pdf

Bowman, A.W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71(2), 353-360.

Duin, R.P.W. (1976). On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers C25(11), 1175-1179.

MacDonald, A., Scarrott, C.J., Lee, D., Darlow, B., Reale, M. and Russell, G. (2011). A flexible extreme value mixture model. Computational Statistics and Data Analysis 55(6), 2137-2157.

Chen, S.X. (1999). Beta kernel estimators for density functions. Computational Statistics and Data Analysis 31, 1310-45.

Gasser, T. and Muller, H. (1979). Kernel estimation of regression functions. In "Lecture Notes in Mathematics 757, edited by Gasser and Rosenblatt, Springer.

Chen, S.X. (2000). Probability density function estimation using gamma kernels. Annals of the Institute of Statisical Mathematics 52(3), 471-480.

Boneva, L.I., Kendall, D.G. and Stefanov, I. (1971). Spline transformations: Three new diagnostic aids for the statistical data analyst (with discussion). Journal of the Royal Statistical Society B, 33, 1-70.

Diggle, P.J. (1985). A kernel method for smoothing point process data. Applied Statistics 34, 138-147.

Marron, J.S. and Ruppert, D. (1994) Transformations to reduce boundary bias in kernel density estimation, Journal of the Royal Statistical Society. Series B 56(4), 653-671.

Jones, M.C. and Henderson, D.A. (2007). Kernel-type density estimation on the unit interval. Biometrika 94(4), 977-984.

See Also

kernels, kfun, density, bw.nrd0 and dkde in ks package.

Other kden: fbckden, fgkgcon, fgkg, fkdengpdcon, fkdengpd, fkden, kdengpdcon, kdengpd, kden

Other bckden: bckdengpdcon, bckdengpd, fbckdengpdcon, fbckdengpd, fbckden, fkden, kden

Other bckdengpd: bckdengpdcon, bckdengpd, fbckdengpdcon, fbckdengpd, fbckden, fkdengpd, gkg, kdengpd, kden

Other bckdengpdcon: bckdengpdcon, bckdengpd, fbckdengpdcon, fbckdengpd, fbckden, fkdengpdcon, gkgcon, kdengpdcon

Other fbckden: fbckden

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
## Not run: 
set.seed(1)
par(mfrow = c(1, 1))

n=100
x = rgamma(n, shape = 1, scale = 2)
xx = seq(-0.5, 12, 0.01)
plot(xx, dgamma(xx, shape = 1, scale = 2), type = "l")
rug(x)
lines(xx, dbckden(xx, x, lambda = 1), lwd = 2, col = "red")
lines(density(x), lty = 2, lwd = 2, col = "green")
legend("topright", c("True Density", "Simple boundary correction",
"KDE using density function", "Boundary Corrected Kernels"),
lty = c(1, 1, 2, 1), lwd = c(1, 2, 2, 1), col = c("black", "red", "green", "blue"))

n=100
x = rbeta(n, shape1 = 3, shape2 = 2)*5
xx = seq(-0.5, 5.5, 0.01)
plot(xx, dbeta(xx/5, shape1 = 3, shape2 = 2)/5, type = "l", ylim = c(0, 0.8))
rug(x)
lines(xx, dbckden(xx, x, lambda = 0.1, bcmethod = "beta2", proper = TRUE, xmax = 5),
  lwd = 2, col = "red")
lines(density(x), lty = 2, lwd = 2, col = "green")
legend("topright", c("True Density", "Modified Beta KDE Using evmix",
  "KDE using density function"),
lty = c(1, 1, 2), lwd = c(1, 2, 2), col = c("black", "red", "green"))

# Demonstrate renormalisation (usually small difference)
n=1000
x = rgamma(n, shape = 1, scale = 2)
xx = seq(-0.5, 15, 0.01)
plot(xx, dgamma(xx, shape = 1, scale = 2), type = "l")
rug(x)
lines(xx, dbckden(xx, x, lambda = 0.5, bcmethod = "simple", proper = TRUE),
  lwd = 2, col = "purple")
lines(xx, dbckden(xx, x, lambda = 0.5, bcmethod = "simple", proper = FALSE),
  lwd = 2, col = "red", lty = 2)
legend("topright", c("True Density", "Simple BC with renomalisation", 
"Simple BC without renomalisation"),
lty = 1, lwd = c(1, 2, 2), col = c("black", "purple", "red"))

## End(Not run)

evmix documentation built on Sept. 3, 2019, 5:07 p.m.