rapkod: RaPKod: Random Projections Kernel Outlier Detection


Description

RaPKod is a kernel method for detecting outliers in a given dataset on the basis of a reference set of non-outliers. It maps each tested observation into a kernel space (through a feature map) and then projects it onto a random low-dimensional subspace of that space. Because the distribution of this projection is known when the observation is a non-outlier, RaPKod can control the probability of a false alarm (i.e. labelling a non-outlier as an outlier).

Usage

rapkod(X, given.kern = FALSE, ref.n = NULL, gamma = NULL, p = NULL, alpha = 0.05,
       use.tested.inlier = FALSE, lowrank = "No", r.lowrk = ceiling(sqrt(nrow(X))),
       K1 = 6, K2 = 50)

Arguments

X

either a data frame or an n x d matrix (if given.kern=FALSE), otherwise an n x n kernel matrix (if given.kern=TRUE). In the former case, a Gaussian kernel is used by default.

given.kern

If FALSE (default), each row of X is an observation. Otherwise X is a kernel matrix (in this case, gamma and p must be user-specified).

ref.n

the size of the reference non-outlier dataset. Must be smaller than n.

gamma

the hyperparameter of the Gaussian kernel k(x, y) = exp( - gamma * || x - y ||^2). Set automatically by the program if not specified and given.kern=FALSE.

p

the number of dimensions of the projection made in the kernel space. Set automatically by the program if not specified and given.kern=FALSE.

alpha

the prescribed probability of false alarm error.

use.tested.inlier

If TRUE, each tested observation that is labelled as a non-outlier is appended to the reference dataset of non-outliers (the 'oldest' reference non-outlier is discarded). Set to FALSE by default.

lowrank

if lowrank="No" (default), the full kernel matrix is used. Otherwise, a low-rank approximation of the kernel matrix is computed: if "Nyst", it is approximated through the Nyström method; if "RKS", it is approximated by Random Kitchen Sinks (in this case, X must be a dataset matrix, not a kernel matrix).

r.lowrk

if lowrank="Nyst" or "RKS", specifies the (low) rank of the approximated kernel matrix.

K1

universal constant used in the heuristic formula of the optimal parameter gamma.

K2

universal constant used in the heuristic formula of the optimal parameter p.
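
To give an idea of what the lowrank = "Nyst" option refers to, here is a minimal base R sketch of the textbook Nyström approximation of a Gaussian kernel matrix. It is an illustration only and is not taken from the package source: the value of gamma, the uniform sampling of landmark rows and the helper sqdist are assumptions made for this example. The rank r uses the same formula as the default of r.lowrk.

```r
set.seed(1)
X <- as.matrix(iris[, 1:4])       # numeric features only
n <- nrow(X)
gamma <- 0.5                      # hypothetical kernel hyperparameter
r <- ceiling(sqrt(n))             # same formula as the default of r.lowrk

# Hypothetical helper: squared Euclidean distances between rows of A and rows of B
sqdist <- function(A, B) {
  pmax(outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B), 0)
}

idx <- sample(n, r)                                    # landmark rows, drawn uniformly
C <- exp(-gamma * sqdist(X, X[idx, , drop = FALSE]))   # n x r block of the kernel matrix
W <- C[idx, , drop = FALSE]                            # r x r block between landmarks

# Pseudo-inverse of W through its eigendecomposition (tiny eigenvalues are dropped)
e <- eigen(W, symmetric = TRUE)
keep <- e$values > 1e-10 * max(e$values)
V <- e$vectors[, keep, drop = FALSE]
W_pinv <- V %*% (t(V) / e$values[keep])

# Rank-(at most r) approximation of the full n x n kernel matrix
K_approx <- C %*% W_pinv %*% t(C)
dim(K_approx)
```

Only the n x r block C and the r x r block W are ever evaluated with the kernel, which is the point of the approximation when n is large.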

Details

If given.kern = FALSE, X is a dataset matrix whose first ref.n rows correspond to the reference dataset of non-outliers. The remaining (n - ref.n) observations are tested one by one by RaPKod to determine whether or not they are outliers.

If given.kern = TRUE, X must be an n x n Gram matrix. The kernel used to compute this Gram matrix should be of the form k(x, y) = K(gamma * || x - y ||^2) where K is a positive function. Also note that in this case, the parameters gamma and p must be specified by the user.
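
As an illustration of the given.kern = TRUE case, the following base R sketch builds a Gaussian Gram matrix, which has the required form with K(t) = exp(-t). The value of gamma and the value of p in the commented-out call are arbitrary choices made for this example, not recommendations.

```r
X <- as.matrix(iris[, 1:4])   # numeric features; rows 1..ref.n would be the reference set
gamma <- 0.5                  # hypothetical value, must also be passed to rapkod()

# Pairwise squared Euclidean distances, then the Gaussian kernel
D2 <- as.matrix(dist(X))^2
K <- exp(-gamma * D2)

# Basic sanity checks on the Gram matrix
stopifnot(nrow(K) == nrow(X), isSymmetric(K), all(diag(K) == 1))

# Hypothetical call (requires the RaPKod package; p = 10 is an arbitrary illustration):
# result <- rapkod(K, given.kern = TRUE, ref.n = 50, gamma = gamma, p = 10)
```

The rows and columns of K must follow the same ordering as the observations, so that the first ref.n rows and columns still refer to the reference non-outliers.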

Value

stats

a vector of length (n - ref.n) containing the test statistics for each tested observation.

flag

a vector of length (n - ref.n) indicating which observations have been labelled as an outlier (TRUE in this case).

pv

a vector of length (n - ref.n) containing p-values for each tested observation.

gamma

the optimal value of gamma determined by the program (or the value provided by the user if it was user-specified).

p

the optimal value of p determined by the program (or the value provided by the user if it was user-specified).
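
Since stats, flag and pv only cover the tested observations, it can help to map them back to row indices of X. The sketch below uses a mock result list instead of a real rapkod() call, and assumes that element i of each vector corresponds to row ref.n + i of X (the order in which the observations are tested); thresholding pv at alpha mirrors what the example at the end of this page does.

```r
n <- 10        # hypothetical number of rows of X
ref.n <- 6     # hypothetical size of the reference set
alpha <- 0.05

# Mock object mimicking the documented 'pv' field (not produced by rapkod)
result <- list(pv = c(0.40, 0.01, 0.73, 0.03))

# Rows of X that were tested, in testing order (assumed to be row order)
tested_rows <- (ref.n + 1):n

# Rows whose p-value falls below the prescribed false alarm level
outlier_rows <- tested_rows[result$pv < alpha]
outlier_rows   # 8 10
```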

See Also

od.opt.param

Examples

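library(RaPKod)
data(iris)
set.seed(1)  # make the random sampling reproducible

## Define data frame with non-outliers (the 100 non-setosa flowers, shuffled)
inliers <- iris[sample(which(iris$Species != "setosa"), 100, replace = FALSE),
                names(iris) != "Species"]
## Define data frame with outliers (the 50 setosa flowers)
outliers <- iris[iris$Species == "setosa", names(iris) != "Species"]

## The first ref.n rows of X form the reference set; the remaining rows are tested
X <- rbind(inliers, outliers)

ref.n <- 50
alpha <- 0.05
result <- rapkod(X, ref.n = ref.n, use.tested.inlier = FALSE, alpha = alpha)

## Number of tested non-outliers (they come first among the tested rows)
n.tested.inliers <- nrow(inliers) - ref.n

## False alarm error ratio obtained on tested non-outliers (should be close to alpha)
mean(result$pv[1:n.tested.inliers] < alpha, na.rm = TRUE)
## Missed detection error ratio obtained on tested outliers (should be close to 0)
mean(result$pv[-(1:n.tested.inliers)] > alpha, na.rm = TRUE)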
  

RaPKod documentation built on May 2, 2019, 5:58 a.m.