Nothing
#' @title Local Density-Based Outlier Detection using Subsampling with Approximate Nearest Neighbor Search
#' @description This function computes local density-based outlier scores for input data.
#' @param X An n x p data matrix to compute outlier scores
#' @param k A vector of neighborhood sizes, k must be less than nsub
#' @param nsub Subsample size, nsub must be greater than k. Usually nsub = 0.10*n or larger is recommended. Default is nsub = n
#' @param method Character vector specifying the local density-based method(s) to compute. User can specify more than
#' one method. By default all methods are computed
#' @param ldf.param Vector of parameters for method LDF. h is the positive bandwidth parameter and c is a positive scaling constant. Default values are h=1 and c=0.1
#' @param rkof.param Vector of parameters for method RKOF. C is the postive bandwidth paramter, alpha is a sensitiveity parameter in the interval [0,1],
#' and sig2 is the variance parameter. Default values are alpha=1, C=1, sig2=1
#' @param lpdf.param Vector of paramters for method LPDF. cov.type is the covariance parameterization type,
#' which users can specifiy as either 'full' or 'diag'. sigma2 is the positive regularization parameter, tmax is the maximum number of updates, and
#' v is the degrees of freedom for the multivariate t distribution. Default values are cov.type = 'full',tmax=1, sigma2=1e-5, and v=1.
#' @param treetype Character vector specifiying tree method. Either 'kd' or 'bd' tree may be specified. Default is 'kd'.
#' Refer to documentation for RANN package.
#' @param eps Error bound. Default is 0.0 which implies exact nearest neighgour search. Refer to documentation for RANN package.
#' @param searchtype Character vector specifiying kNN search type. Default value is "standard". Refer to documentation for RANN package.
#' @param scale.data Logical value indicating to scale each feature of X using standard noramlization with mean 0 and standard deviation of 1
#' @details Computes the local density-based outlier scores for input data, X, referencing a random subsample of X. The subsampled
#' data set is constructed by randomly drawning nsub samples from X without replacement.
#'
#' Four different methods can be implemented LOF, LDF, RKOF, and LPDF. Each method specified returns densities and relative densities.
#' Methods LDF and RKOF uses guassian kernels, and method LDPF uses multivarite t distribution. Outlier scores returned are positive
#' except for lpde and lpdr which are log scaled densities (natural log). Score lpdr has shown to be highly sensitive to k.
#'
#' All kNN computations are carried out using the nn2() function from the RANN package. Multivariate t densities are
#' computed using the dmt() function from the mnormt package. Refer to specific packages for more details. Note: all
#' neighborhoods are strickly of size k; therefore, the algorithms for LOF, LDF, and RKOF are not exact implementations, but
#' algorithms are similiar for most situation and are equivalent when distance to k-th nearest neighbor is unique. If there are many
#' duplicate data points, then implementation of algorithms could lead to dramatically different (positive or negative) results than those that allow
#' neighborhood sizes larger than k, especially if k is relatively small. Removing duplicates is recommended before computing
#' outlier scores unless there is good reason to keep them.
#'
#' The algorithm can be used to compute an ensemble of outlier scores by using multiple k values
#' and/or iterating over multiple subsamples.
#'
#'
#' @return
#' A list of length 9 with the elements:
#'
#' lrd --An n x length(k) matrix where each column vector represents the local reachabiility denity (LRD) outlier scores for each specifed k value. Smaller values indicate a point in more outlying.
#'
#' lof --An n x length(k) matrix where each column vector represents the local outlier factor (LOF) outlier scores for each specifed k value. Larger values indicate a point in more outlying.
#'
#' lde --An n x length(k) matrix where each column vector represents the local density estimate (LDE) outlier scores for each specifed k value. Smaller values indicate a point in more outlying.
#'
#' ldf --An n x length(k) matrix where each column vector represents the local density factor (LDF) outlier scores for each specifed k value. Larger values indicate a point in more outlying.
#'
#' kde --An n x length(k) matrix where each column vector represents the kernel density estimate (KDE) outlier scores for each specifed k value. Smaller values indicate a point in more outlying.
#'
#' rkof --An n x length(k) matrix where each column vector represents the robust kernel density factor (RKOF) outlier scores for each specifed k value. Larger values indicate a point in more outlying.
#'
#' lpde --An n x length(k) matrix where each column vector represents the local parametric density estimate (LPDE) outlier scores for each specifed k value on log scale. Smaller values indicate a point in more outlying.
#'
#' lpdf --An n x length(k) matrix where each column vector represents the local parametric density factor (LPDF) outlier scores for each specifed k value. Smaller values indicate a point in more outlying.
#'
#' lpdr --An n x length(k) matrix where each column vector represents the local parametric density ratio (LPDR) outlier scores for each specifed k value. Smaller values indicate a point in more outlying.
#' LPDR is typically used to detect groups of outliers.
#'
#' If a method is not specified then returns NULL
#'
#' @references M. M. Breunig, H-P. Kriegel, R.T. Ng, and J. Sander (2000). LOF: Identifying density-based local outliers. In Proc. of ACM
#' International Conference on Knowledge Discovery and Data Mining, 93-104.
#'
#' L. J. Latecki, A. Lazarevic, and D. Pokrajac (2007). Outlier Detection with kernel density functions. In Proc. of Machine Learning and Data
#' Mining in Pattern Recognition, 61-75
#'
#' J. Gao, W. Hu, Z. Zhang, X. Zhang, and O. Wu (2011). RKOF: Robust kernel-based local outlier detection. In Proc. of Advances in Knowledge Discovery and
#' Data Mining, 270-283.
#'
#' K. T. Williams (2016). Local parametric density-based outlier deteciton and ensemble learning with application to malware detection. PhD Dissertation.
#' The University of Texas at San Antonio.
#' @examples
#' # 500 x 2 data matrix
#' X <- matrix(rnorm(1000),500,2)
#'
#' # five outliers
#' outliers <- matrix(c(rnorm(2,20),rnorm(2,-12),rnorm(2,-8),rnorm(2,-5),rnorm(2,9)),5,2)
#' X <- rbind(X,outliers)
#'
#'# compute outlier scores without subsampling for all methods using neighborhood size of 50
#' scores <- ldbod(X, k=50)
#'
#' head(scores$lrd); head(scores$rkof)
#'
#' # plot data and highlight top 5 outliers retured by lof
#' plot(X)
#' top5outliers <- X[order(scores$lof,decreasing=TRUE)[1:5],]
#' points(top5outliers,col=2)
#'
#' # plot data and highlight top 5 outliers retured by outlier score lpde
#' plot(X)
#' top5outliers <- X[order(scores$lpde,decreasing=FALSE)[1:5],]
#' points(top5outliers,col=2)
#'
#'# compute outlier scores for k= 10,20 with 10% subsampling for methods 'lof' and 'lpdf'
#' scores <- ldbod(X, k = c(10,20), nsub = 0.10*nrow(X), method = c('lof','lpdf'))
#'
#' # plot data and highlight top 5 outliers retuned by lof for k=20
#' plot(X)
#' top5outliers <- X[order(scores$lof[,2],decreasing=TRUE)[1:5],]
#' points(top5outliers,col=2)
#'
#' @importFrom stats sd cov.wt sd dt
#' @importFrom RANN nn2
#' @importFrom mnormt dmt pd.solve dmnorm
#'
#'
#' @export
#'
ldbod <- function(X, k = c(10,20), nsub = nrow(X), method = c('lof','ldf','rkof','lpdf'),
ldf.param = c(h = 1, c = 0.1),
rkof.param = c(alpha = 1, C = 1, sig2 = 1),
lpdf.param = c(cov.type = 'full',sigma2 = 1e-5, tmax=1, v=1),
treetype='kd',searchtype='standard',eps=0.0,
scale.data=TRUE)
{
if(is.null(k)) stop('k is missing')
if(!is.numeric(k)) stop('k is not numeric')
# coerce X to class matrix
X <- as.matrix(X)
if(!is.numeric(X)) stop('X contains non-numeric data type')
if(!is.matrix(X)) stop('X must be of class matrix')
k <- as.integer(k)
nsub <- as.integer(nsub)
# check max k less than than nsub-1
kmax <- max(k)
len.k <- length(k)
if(kmax > nsub - 1 ){ stop('k is greater than or equal to nsub') }
if(is.null(nsub)) (nsub <- nrow(X))
nsub <- as.integer(nsub)
kmax <- max(k)
len.k <- length(k)
if(kmax > nsub -1) stop('k is greater than or equal to nsub')
if(min(k) < 2 ) stop('k must be greater than 1')
# number of rows of X
n <- nrow(X)
# number of columns of X
p <- ncol(X)
# check that nsub <= n
if(nsub>n){ nsub=n }
if(nsub<10){ nsub=10 }
# subsample ids without replacement
sub_sample_ids = sample(1:n,nsub)
# scale X
if(scale.data){ X <- scale(X) }
# subsample X, Y will subsample of X
Y <- as.matrix(X[sub_sample_ids,])
# compute distance (euclidean) matrix between X and Y and returns kNN ids and kNN distances #
knn <- RANN::nn2(data = Y,query = X,k = kmax+1,treetype=treetype, searchtype=searchtype, eps=eps)
# kNN id matrix
knn_ids <- knn$nn.idx[,-1]
# kNN distance matrix
knn_dist_matrix <- knn$nn.dists[,-1]
# define storeage matrix for each outlier score ................
if('lof'%in%method){
store_lrd <- matrix(NA,n,len.k)
store_lof <- matrix(NA,n,len.k)
}else{
store_lrd <- NULL
store_lof <- NULL
}
if('ldf'%in%method){
store_lde <- matrix(NA,n,len.k)
store_ldf <- matrix(NA,n,len.k)
}else{
store_lde <- NULL
store_ldf <- NULL
}
if('rkof'%in%method){
store_kde <- matrix(NA,n,len.k)
store_rkof <- matrix(NA,n,len.k)
}else{
store_kde <- NULL
store_rkof <- NULL
}
if('lpdf'%in%method){
store_lpde <- matrix(NA,n,len.k)
store_lpdf <- matrix(NA,n,len.k)
store_lpdr <- matrix(NA,n,len.k)
}else{
store_lpde <- NULL
store_lpdf <- NULL
store_lpdr <- NULL
}
# initialize
ii <- 0
### compute lof and lrd for each k value specified by user
for(kk in k){
ii <- ii+1
# distance to kth nearest neighbor for X to Y
dist_k <- knn_dist_matrix[,kk]
## Reachability distance matrix if outlier score lrd, lof, lde, or ldf has been specified
if(sum(method%in%c('lof','ldf'))>0){
# arrange dist_k relative to subsample data set
dist_k_rel_to_test = t(apply(knn_ids[,1:kk],1,function(x)dist_k[sub_sample_ids[x]]))
# compute reachability distances between X and kNN of X relative to subsample data set
# returns reachability distance matrix of size n x kk
reach_dist_matrix_test <- t(apply( cbind(knn_dist_matrix[,1:kk] , dist_k_rel_to_test),1,function(x){
x1 = x[1:kk]
x2 = x[(kk+1):(2*kk)]
reach_dist = apply(rbind(x1,x2),2,max)
return(reach_dist)
}))
############ compute outlier scores lrd and lof ############
############ compute outlier scores lrd and lof ############
############ compute outlier scores lrd and lof ############
if('lof'%in%method){
# compute local reachability density for each point in X
lrd <- 1/(apply(reach_dist_matrix_test,1,mean)+1e-198)
# compute local outlier factor for each point in X
lof <- apply(knn_ids[,1:kk],1,function(x)mean(lrd[sub_sample_ids[x]]))/lrd
# store lof and lrd for each k
store_lrd[,ii] <- lrd
store_lof[,ii] <- lof
}# end if statement for lof
############ compute outlier scores lde and ldf ############
############ compute outlier scores lde and ldf ############
############ compute outlier scores lde and ldf ############
if('ldf'%in%method){
# paramters
h <- ldf.param[names(ldf.param)=='h']
c <- ldf.param[names(ldf.param)=='c']
lde <- sapply(1:n,function(id){
mean(1/((2*pi)^(p/2))*1/(h*dist_k[sub_sample_ids[knn_ids[id,1:kk]]])^p*
exp(-(.5*reach_dist_matrix_test[id,]^2)/(h*dist_k[sub_sample_ids[knn_ids[id,1:kk]]])^2))+1e-198
})
ldf <- sapply(1:n,function(id){
mean.lde = mean(lde[sub_sample_ids[knn_ids[id,1:kk]]])
ldf = mean.lde/(lde[id]+c*mean.lde)
return(ldf)
})
# store lof and lrd for each k
store_lde[,ii] <- lde
store_ldf[,ii] <- ldf
}# end if statement for ldf
}# end if statement for reachability distance calculations
############ compute outlier scores kde and rkof ############
############ compute outlier scores kde and rkof ############
############ compute outlier scores kde and rkof ############
if('rkof'%in%method){
# parameters
alpha <- rkof.param[names(rkof.param)=='alpha']
C <- rkof.param[names(rkof.param)=='C']
sig2 <- rkof.param[names(rkof.param)=='sig2']
## compute kde for each point in X
kde <- sapply(1:n, function(id){
mean(1/(2*pi)^(p/2)*1/(C*dist_k[sub_sample_ids[knn_ids[id,1:kk]]]^alpha)^2*
exp(-.5*knn_dist_matrix[id,1:kk]^2/(C*dist_k[sub_sample_ids[knn_ids[id,1:kk]]]^alpha)^2))+1e-198
})
## compute wde for each point in X
wde <- sapply(1:n,function(id){
weights = exp(-(dist_k[sub_sample_ids[knn_ids[id,1:kk]]]/min(dist_k[sub_sample_ids[knn_ids[id,1:kk]]])-1)^2/(2*sig2))
weights = weights/sum(weights)
wde = sum(weights*kde[sub_sample_ids[knn_ids[id,1:kk]]])
})
## compute rkof
rkof <- wde/kde
# store kde and rkof for each k
store_kde[,ii] <- kde
store_rkof[,ii] <- rkof
}# end if statement for rkof
############ compute outlier scores lpde, lpdf, and lpdr ############
########### compute outlier scores lpde, lpdf, and lpdr ############
########### compute outlier scores lpde, lpdf, and lpdr ############
if('lpdf'%in%method){
# parameters
cov.type<- lpdf.param[names(lpdf.param)=='cov.type']
tmax <- as.numeric(lpdf.param[names(lpdf.param)=='tmax'])
sigma2 <- as.numeric(lpdf.param[names(lpdf.param)=='sigma2'])
v <- as.numeric(lpdf.param[names(lpdf.param)=='v'])
# max iterations
tmax <- tmax+1
# identity matrix
II <- diag(1,p,p)
# regularization matrix
reg <- sigma2*II
# matrices to store densities and weights
store_dens <- matrix(NA,n,tmax)
store_R <- matrix(NA,n,tmax)
# compute density and weight function, R, for each iteration in 1:tmax
for(t in 1:tmax){
#compute multivarite t density for each point in X referencing subsample data set Y
dens <- sapply(1:n,function(id){
# test point
x = X[id,]
# k nearest neighborhood to compute weighted location and scatter
hood = as.matrix(Y[knn_ids[id,1:kk],])
# define weights
if(t==1){weights = rep(1/kk,kk)}
if(t>1){weights = R[sub_sample_ids[knn_ids[id,1:kk]]] }
# normalize weights to sum to 1
weights <- weights/sum(weights)
# comptue weighted sample mean and sample covariance matrix if cov.type='full'
if(cov.type=='full'){
covwt = cov.wt(hood,wt=weights,method='ML')
center = covwt$center
scatter = covwt$cov+reg
}
# comptue weighted sample mean and sample covariance matrix if cov.type='diag'
if(cov.type=='diag'){
center = colSums(weights * hood)
center.mat = matrix(center,kk,p,byrow=T)
vars = colSums(weights*(hood-center.mat)^2)
scatter = diag(vars,p,p)+reg
}
# compute multivaritae t density with degrees of freedom v
density = mnormt::dmt(x,mean=center,S=scatter,df=v)+1e-198
})
# compute updated weight function, R, for each point in X
R <- sapply(1:n,function(id){
# compute weights
weights = 1/(dist_k[sub_sample_ids[knn_ids[id,1:kk]]]^2+1e-20)
weights = weights/sum(weights)
# comptue ratio of density to weighted neighborhood density
log1p(dens[id])/sum(weights*log1p(dens[sub_sample_ids[knn_ids[id,1:kk]]]))
})
# store all densities and R
store_dens[,t] <- dens
store_R[,t] <- R
## update weight rule
if(t>2){
R <- apply(store_R[,2:t],1,max)
}
}### end t loop
# compute outlier scores lpde, lpdf, lpdr
lpde <- apply(store_dens,1,function(x)log(mean(x[-1])))
lpdf <- apply(store_R,1,function(x)mean(x[-1]))
lpdr <- lpde - log(store_dens[,1])
store_lpde[,ii] <- lpde
store_lpdf[,ii] <- lpdf
store_lpdr[,ii] <- lpdr
}# end if statement for lpdf
}# k loop
# return all outlier scores
return(list(lrd = store_lrd,
lof = store_lof,
lde = store_lde,
ldf = store_ldf,
kde = store_kde,
rkof = store_rkof,
lpde = store_lpde,
lpdf = store_lpdf,
lpdr = store_lpdr
)
)
} # end ldbod function
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.