od.opt.param: Optimal Parameter Values In RaPKod

Description

Uses a heuristic formula to set optimal values for gamma and p.

Usage

od.opt.param(X, K1 = 6, K2 = 50, which.estim = "Gauss", RATIO = 0.1, 
            randomize = TRUE, sub.n = floor(nrow(X)))

Arguments

X

a data frame or an n x d matrix.

K1

universal constant used in the heuristic formula of the optimal parameter gamma.

K2

universal constant used in the heuristic formula of the optimal parameter p.

which.estim

specifies the method used to estimate the parameters: either "Gauss" (default) or "general".

RATIO

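optional parameter used in the estimation method "Gauss": covariance eigenvalues smaller than RATIO times the largest eigenvalue are discarded (see Details).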

randomize

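optional logical parameter used in the estimation method "general": if TRUE, only a subset of size sub.n of the data is used to estimate the cdf F (see Details).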

sub.n

optional parameter used in the estimation method "general" if randomize=TRUE: the size of the subset of the data used to estimate the cdf F.

Details

This function uses a heuristic formula to determine the optimal parameter values gamma and p when a Gaussian kernel is used. The formula has the form gamma = K1 * |f|_2^{2/(d+2)} * n^{1/(d+2)} and p = ceil(K2 * |f|_2^{2/(d+2)} * n^{2/(d+2)}), where n is the number of observations, d is the dimension, |f|_2 is the L2-norm of the density function f of the non-outliers, and ceil(x) denotes the smallest integer greater than or equal to x.
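As a rough illustration of the formula above, the following sketch computes gamma and p from a given value of |f|_2^{2/(d+2)}. The helper name heuristic_params and its argument f2.pw are hypothetical and are not part of RaPKod; the input values are arbitrary.

```r
# Hypothetical helper (not part of RaPKod): applies the heuristic formula
# literally, given f2.pw = |f|_2^{2/(d+2)}, the sample size n and dimension d.
heuristic_params <- function(f2.pw, n, d, K1 = 6, K2 = 50) {
  gamma <- K1 * f2.pw * n^(1 / (d + 2))
  p <- ceiling(K2 * f2.pw * n^(2 / (d + 2)))
  list(gamma.opt = gamma, p.opt = p)
}

# Arbitrary illustrative inputs
res <- heuristic_params(f2.pw = 0.5, n = 150, d = 2)
res$gamma.opt
res$p.opt
```

Note that gamma grows like n^{1/(d+2)} while p grows like n^{2/(d+2)}, so p increases faster with the sample size than gamma does.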

Two methods are proposed to estimate |f|_2 and are specified by the argument which.estim: "Gauss" and "general".

If which.estim="Gauss", the estimation is done as though f were a Gaussian density, which yields |f|_2^{2/(d+2)} = (4*pi)^{-0.5}*exp(0.5*mean(log(1/ev))), where ev are the eigenvalues of the covariance matrix of the non-outlier distribution. Note that eigenvalues smaller than ev[1]*RATIO (where ev[1] is the largest eigenvalue) are discarded to avoid numerical issues.
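The "Gauss" estimate can be sketched in a few lines of base R. This is a literal transcription of the formula above under the assumption that ev is obtained from the sample covariance matrix; the function name gauss_f2_pw is hypothetical, and the package's internal code may differ.

```r
# Hypothetical sketch (not the package's internal code): plug-in estimate of
# |f|_2^{2/(d+2)} under a Gaussian assumption, following the stated formula.
gauss_f2_pw <- function(X, RATIO = 0.1) {
  # Eigenvalues of the sample covariance, returned in decreasing order
  ev <- eigen(cov(as.matrix(X)), symmetric = TRUE, only.values = TRUE)$values
  # Discard eigenvalues smaller than RATIO times the largest one
  ev <- ev[ev >= ev[1] * RATIO]
  (4 * pi)^(-0.5) * exp(0.5 * mean(log(1 / ev)))
}

# Illustration on the numeric columns of the non-setosa iris observations
X <- iris[iris$Species != "setosa", 1:4]
gauss_f2_pw(X)
```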

If which.estim="general", |f|_2 is estimated without any assumption on f. However, this method may fail in very high dimensions because of the curse of dimensionality, since it relies on an estimate of the derivative of F at 0, where F is the cdf of the pairwise distance between two non-outliers. Besides, to shorten the computation time, the optional argument 'randomize' can be set to TRUE, so that only a subset of size sub.n of the data is used to estimate the cdf F.
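The subsampling step described above can be illustrated with base R. The sketch below only builds the empirical cdf of pairwise distances on a subset of size sub.n, assuming Euclidean distances and uniform random subsampling; it does not reproduce how the package estimates the derivative of F at 0.

```r
# Illustrative sketch only: empirical cdf of pairwise distances on a subset.
# Assumes Euclidean distances and uniform subsampling without replacement.
set.seed(1)
X <- as.matrix(iris[iris$Species != "setosa", 1:4])
sub.n <- 50

idx <- sample(nrow(X), sub.n)        # indices of the subset
D <- dist(X[idx, , drop = FALSE])    # sub.n * (sub.n - 1) / 2 pairwise distances
F.hat <- ecdf(as.vector(D))          # empirical cdf of the pairwise distance

# Fraction of pairs closer than 0.5
F.hat(0.5)
```

Using a subset reduces the number of pairwise distances from n(n-1)/2 to sub.n(sub.n-1)/2, which is where the computational saving comes from.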

Value

gamma.opt

optimal value for gamma.

p.opt

optimal value for p.

est.f2.pw

estimate of |f|_2^{2/(d+2)}.

See Also

rapkod

Examples

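library(RaPKod)
data(iris)

## Define a data frame of non-outliers: the 100 non-setosa observations,
## in random order, without the Species column
inliers <- iris[sample(which(iris$Species != "setosa"), 100, replace = FALSE),
                -which(names(iris) == "Species")]

## Compute the heuristic parameter values (default "Gauss" method)
param <- od.opt.param(inliers)

## Display optimal gamma
param$gamma.opt
## Display optimal p
param$p.opt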

RaPKod documentation built on May 2, 2019, 5:58 a.m.