od.opt.param: Optimal Parameter Values In RaPKod
In RaPKod: Random Projection Kernel Outlier Detector


Uses a heuristic formula to set optimal values for gamma and p.

od.opt.param(X, K1 = 6, K2 = 50, which.estim = "Gauss", RATIO = 0.1, randomize = TRUE, sub.n = floor(nrow(X)))

`X`	a data frame or an n x d matrix.
`K1`	universal constant used in the heuristic formula for the optimal parameter gamma.
`K2`	universal constant used in the heuristic formula for the optimal parameter p.
`which.estim`	specifies the estimation method of the parameters: either "Gauss" (default) or "general".
`RATIO`	optional parameter used in the estimation method "Gauss": covariance eigenvalues smaller than RATIO times the largest eigenvalue are discarded.
`randomize`	optional parameter used in the estimation method "general": if TRUE, only a subset of the data is used to estimate the cdf of the pairwise distances.
`sub.n`	optional parameter used in the estimation method "general" if randomize=TRUE: size of the subset of the data.

This function uses a heuristic formula to determine the optimal values of the parameters gamma and p when a Gaussian kernel is used. The formula is gamma = K1 * |f|_2^{2/(d+2)} * n^{1/(d+2)} and p = ceil(K2 * |f|_2^{2/(d+2)} * n^{2/(d+2)}), where n and d are the numbers of rows and columns of X, |f|_2 is the L2-norm of the density function f of the non-outliers, and ceil(x) denotes the smallest integer greater than or equal to x.
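As a minimal sketch of the formula above (not the package's internal code), the following base R helper plugs an estimate of |f|_2^{2/(d+2)} into the two expressions. The function name `heuristic_params` and the argument name `est_f2_pw` are hypothetical, and the input values are made up for illustration.

```r
# Hypothetical helper: evaluates the heuristic formula stated above.
# est_f2_pw is an estimate of |f|_2^{2/(d+2)}; n and d are the numbers of
# rows and columns of the data.
heuristic_params <- function(est_f2_pw, n, d, K1 = 6, K2 = 50) {
  gamma_opt <- K1 * est_f2_pw * n^(1 / (d + 2))
  p_opt <- ceiling(K2 * est_f2_pw * n^(2 / (d + 2)))
  list(gamma.opt = gamma_opt, p.opt = p_opt)
}

# Example with made-up inputs: n = 100 points in d = 2 dimensions.
res <- heuristic_params(est_f2_pw = 0.5, n = 100, d = 2)
res$gamma.opt
res$p.opt
```

Both values grow with n, but the exponents 1/(d+2) and 2/(d+2) shrink as d increases, so in high dimensions the sample size has little influence on the result.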

Two methods are proposed to estimate |f|_2 and are specified by the argument which.estim: "Gauss" and "general".

If which.estim="Gauss", the estimation is done as though f were a Gaussian density, which yields |f|_2^{2/(d+2)} = (4*pi)^{-0.5}*exp(0.5*mean(log(1/ev))), where ev are the eigenvalues of the covariance matrix of the non-outlier distribution. Eigenvalues smaller than ev[1]*RATIO (where ev[1] is the largest eigenvalue) are discarded to avoid numerical issues.
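The "Gauss" expression can be sketched in a few lines of base R. This is an illustration of the formula as stated here, assuming the sample covariance is used to obtain the eigenvalues; the function name `gauss_est_f2_pw` is hypothetical and the package may compute the quantity differently.

```r
# Hypothetical helper following the "Gauss" expression stated above.
gauss_est_f2_pw <- function(X, RATIO = 0.1) {
  X <- as.matrix(X)
  # Eigenvalues of the sample covariance matrix, in decreasing order.
  ev <- eigen(cov(X), symmetric = TRUE, only.values = TRUE)$values
  # Discard eigenvalues smaller than RATIO times the largest one.
  ev <- ev[ev >= ev[1] * RATIO]
  (4 * pi)^(-0.5) * exp(0.5 * mean(log(1 / ev)))
}

# Usage on the non-setosa iris measurements (numeric columns only).
X <- iris[iris$Species != "setosa", 1:4]
est <- gauss_est_f2_pw(X)
est
```

Dropping the smallest eigenvalues keeps log(1/ev) from blowing up when the data lie close to a lower-dimensional subspace.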

If which.estim="general", |f|_2 is estimated without any assumption on f. However, this method may fail in very high dimensions because of the curse of dimensionality, since it relies on an estimate of the derivative of F at 0, where F is the cdf of the pairwise distance between two non-outliers. To shorten the computation time, the optional argument randomize can be set to TRUE, so that only a subset of size sub.n of the data is used to estimate the cdf F.
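To illustrate the subsampling step only, the sketch below builds an empirical cdf of pairwise Euclidean distances from a random subset of the rows. It is not the package's estimator of |f|_2: the function name `pairwise_dist_cdf` and the argument name `sub_n` are hypothetical, and the use of Euclidean distance and of `ecdf` is an assumption made for the example.

```r
# Hypothetical helper: empirical cdf of pairwise Euclidean distances,
# optionally computed on a random subset of sub_n rows.
pairwise_dist_cdf <- function(X, randomize = TRUE, sub_n = nrow(X)) {
  X <- as.matrix(X)
  if (randomize && sub_n < nrow(X)) {
    X <- X[sample(nrow(X), sub_n), , drop = FALSE]
  }
  ecdf(as.vector(dist(X)))
}

set.seed(1)
X <- iris[iris$Species != "setosa", 1:4]
F_hat <- pairwise_dist_cdf(X, randomize = TRUE, sub_n = 50)
# Estimated probability that two points lie within distance 0.5.
F_hat(0.5)
```

With sub_n rows the number of pairwise distances is sub_n * (sub_n - 1) / 2, so halving the subset size cuts the work roughly by a factor of four.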

`gamma.opt`	optimal value for gamma.
`p.opt`	optimal value for p.
`est.f2.pw`	estimate of |f|_2^{2/(d+2)}.

rapkod

library(RaPKod)
data(iris)

## Define a data frame of non-outliers (numeric columns only)
inliers <- iris[sample(which(iris$Species != "setosa"), 100, replace = FALSE),
                -which(names(iris) == "Species")]

## Compute the heuristic parameter values
param <- od.opt.param(inliers)

## Display optimal gamma
param$gamma.opt
## Display optimal p
param$p.opt