rapkod: RaPKod: Random Projections Kernel Outlier Detection
In RaPKod: Random Projection Kernel Outlier Detector


RaPKod is a kernel method for detecting outliers in a dataset on the basis of a reference set of non-outliers. It maps each tested observation into a kernel space (through a feature map) and then projects it onto a random low-dimensional subspace of that space. Because the distribution of this projection is known when the observation is a non-outlier, RaPKod can control the probability of a false alarm (i.e. labelling a non-outlier as an outlier).


  rapkod(X, given.kern = FALSE, ref.n = NULL, gamma = NULL, p = NULL, alpha = 0.05,
         use.tested.inlier = FALSE, lowrank = "No", r.lowrk = ceiling(sqrt(nrow(X))),
         K1 = 6, K2 = 50)

`X`	either a data frame or an n x d matrix (if given.kern=FALSE), otherwise an n x n kernel matrix (if given.kern=TRUE). In the former case, a Gaussian kernel is used by default.
`given.kern`	If FALSE (default), each row of X is an observation. Otherwise X is a kernel matrix (in this case, gamma and p must be user-specified).
`ref.n`	the size of the reference non-outlier dataset. Must be smaller than n.
`gamma`	the hyperparameter of the Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2). Set automatically by the program if not specified and given.kern=FALSE.
`p`	the number of dimensions of the projection made in the kernel space. Set automatically by the program if not specified and given.kern=FALSE.
`alpha`	the prescribed probability of false alarm error.
`use.tested.inlier`	If TRUE, each tested observation that is labelled as a non-outlier is appended to the reference dataset of non-outliers (the 'oldest' reference non-outlier is discarded). Set to FALSE by default.
`lowrank`	if lowrank="No" (default), the full kernel matrix is used. Otherwise, a low-rank approximation of the kernel matrix is computed: "Nyst" uses the Nyström method; "RKS" uses Random Kitchen Sinks (in this case, X must be a dataset matrix, not a kernel matrix).
`r.lowrk`	if lowrank="Nyst" or "RKS", specifies the (low) rank of the approximated kernel matrix.
`K1`	universal constant used in the heuristic formula of the optimal parameter gamma.
`K2`	universal constant used in the heuristic formula of the optimal parameter p.
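
The sliding reference set described for `use.tested.inlier` can be pictured with a small base-R sketch. The helper `update_reference()` is hypothetical and not part of the package; it only illustrates the "append the new non-outlier, discard the oldest" rule, assuming the reference rows are stored oldest first.

```r
# Hypothetical helper (not part of RaPKod): first-in, first-out update
# of a reference set stored with the oldest observation in row 1.
update_reference <- function(ref, x, is_outlier) {
  if (is_outlier) {
    return(ref)                      # flagged observations never enter the reference set
  }
  rbind(ref[-1, , drop = FALSE], x)  # drop the oldest row, append the new non-outlier
}

ref <- matrix(1:6, ncol = 2)                         # three reference rows
ref_kept <- update_reference(ref, c(10, 20), TRUE)   # outlier: reference unchanged
ref_new  <- update_reference(ref, c(10, 20), FALSE)  # non-outlier: window slides by one
```

The size of the reference set stays constant under this rule, which is consistent with `ref.n` being fixed by the user.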

If given.kern = FALSE, X is a dataset matrix whose first ref.n rows correspond to the reference dataset of non-outliers. The remaining (n - ref.n) observations are tested one by one by RaPKod to determine whether they are outliers.
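
As an illustration of this row layout only (rapkod performs the split internally), the following sketch separates a toy matrix into its reference and tested parts. The toy data and the variable names `reference` and `tested` are assumptions made for the example.

```r
# Toy data: 8 observations in 2 dimensions; the first ref.n rows play the reference role.
set.seed(1)
X <- matrix(rnorm(16), nrow = 8, ncol = 2)
ref.n <- 5

reference <- X[seq_len(ref.n), , drop = FALSE]   # rows 1, ..., ref.n (assumed non-outliers)
tested    <- X[-seq_len(ref.n), , drop = FALSE]  # rows ref.n + 1, ..., n

nrow(tested)  # n - ref.n, the number of observations that get tested
```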

If given.kern = TRUE, X must be an n x n Gram matrix. The kernel used to compute this Gram matrix should be of the form k(x, y) = K(gamma * ||x - y||^2), where K is a positive function. Note that in this case the parameters gamma and p must be specified by the user.
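
A Gram matrix of the required form can be built in base R. The sketch below uses the Gaussian kernel, i.e. K(t) = exp(-t); the toy data and the values of `gamma`, `ref.n` and `p` are illustrative assumptions, not recommendations.

```r
set.seed(42)
X <- matrix(rnorm(40), nrow = 20, ncol = 2)  # toy data, one observation per row
gamma <- 0.5                                 # illustrative value only

# Squared Euclidean distances between all pairs of rows
D2 <- as.matrix(dist(X))^2

# Gaussian kernel: K(t) = exp(-t) applied to gamma * ||x - y||^2
K <- exp(-gamma * D2)

# With a precomputed Gram matrix, gamma and p have to be passed explicitly, e.g.
# (p = 5 and ref.n = 10 are arbitrary placeholders):
# res <- rapkod(K, given.kern = TRUE, ref.n = 10, gamma = gamma, p = 5)
```

The resulting matrix is symmetric with ones on the diagonal, and its rows and columns follow the same ordering as the rows of X, so the first ref.n rows and columns should still refer to the reference non-outliers.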

`stats`	a vector of length (n - ref.n) containing the test statistics for each tested observation.
`flag`	a vector of length (n - ref.n) indicating which observations have been labelled as an outlier (TRUE in this case).
`pv`	a vector of length (n - ref.n) containing p-values for each tested observation.
`gamma`	the optimal value of gamma determined by the program (or the value provided by the user if it was user-specified).
`p`	the optimal value of p determined by the program (or the value provided by the user if it was user-specified).
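
The returned vectors can be mapped back to rows of X as sketched below. The `result` list here is a mock with invented numbers standing in for the output of rapkod, and the sketch assumes that entry i of each vector refers to row ref.n + i of X, which follows from the row layout described above but is not stated explicitly.

```r
# Mock output with the documented fields; the numbers are invented for illustration.
ref.n  <- 4
result <- list(
  stats = c(0.8, 3.1, 0.5),
  flag  = c(FALSE, TRUE, FALSE),
  pv    = c(0.40, 0.01, 0.62)
)

# Assumption: entry i of each vector corresponds to row ref.n + i of X.
outlier_rows <- ref.n + which(result$flag)  # row indices of flagged observations in X
flag_rate    <- mean(result$flag)           # share of tested observations that were flagged
```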

od.opt.param

library(RaPKod)
data(iris)
set.seed(1)  ## make the random draw of non-outliers reproducible

##Define data frame with non-outliers
inliers = iris[sample(which(iris$Species!="setosa"), 100, replace=FALSE),
                                              -which(names(iris)=="Species")]
##Define data frame with outliers
outliers = iris[which(iris$Species=="setosa"),-which(names(iris)=="Species")]


X = rbind(inliers, outliers)

ref.n = 50
result <- rapkod(X, ref.n = ref.n, use.tested.inlier = FALSE, alpha = 0.05)


##False alarm error ratio obtained on tested non-outliers (should be close to 0.05)
mean(result$pv[1:(nrow(inliers)-ref.n)]<0.05, na.rm = TRUE)
##Missed detection error ratio obtained on tested outliers (should be close to 0)
mean(result$pv[-(1:(nrow(inliers)-ref.n))]>0.05, na.rm = TRUE)