tran_detct: Anomaly Detection Via Multiple Window Length Scan Statistics

Description

This function applies scan statistic hypothesis tests with multiple window lengths to handle multiple clusters of different magnitudes. The cluster significance level is estimated by the Bonferroni method.

Usage

tran_detct(x, theta_th=1, theta_0 = theta_th, alpha_lvl=0.05,
           loc, HRR_kernel = "triangular",
           hazard_bandwidth=0.1, knn = NULL, est_fun = "pt",
           n_hz_sample = NULL, n_hz_size = NULL,
           pt_int = seq(0,1,by = 0.05),
           seq_theta = seq(0.5, 1, by = 0.05)*theta_0,
           x_unit = 0.01, plot_unit = 1, MLE_unit = 0.01,
           plt_mgn = 0, max_rec = 3, tail_obs = 50)

Arguments

x

A numeric vector of data values to which the hypothesis test is applied.

theta_th

Initial theoretical theta value of hypothesis test.

theta_0

Initial real theta value of hypothesis test. Default value is same as theta_th.

alpha_lvl

Significance level for the hypothesis test with the initial theoretical theta value theta_th.

loc

Lower bound for applying scan statistics. It is also the threshold of fitgpd.

HRR_kernel

A character string giving the smoothing kernel to be used in HRR_pt_est or HRR_sbsp_est. This must partially match one of "gaussian", "rectangular", "triangular" or "knn". Default is "triangular".

hazard_bandwidth

The smoothing bandwidth to be used.

knn

Number of neighbor points to be considered in smoothing for the "knn" kernel.

est_fun

A character string giving the hazard rate ratio estimation function. This must match either "pt" or "sbsp". Default is "pt".

n_hz_sample

Number of replicates if est_fun is "sbsp".

n_hz_size

Resampled size if est_fun is "sbsp".

pt_int

A vector of points at which the hazard rate ratio is estimated.

seq_theta

A vector of theta values passed to hypo_test for cluster detection. This sequence of theta needs to be in order. Default is seq(0.5, 1, by = 0.05)*theta_0/theta_th.

x_unit

A number indicating the uniformization bin width.

plot_unit

A number indicating bin width for histogram in the plot.

MLE_unit

A number indicating the bin width for counting excess.

plt_mgn

Extra margin around clusters shown in the plot.

max_rec

Maximum number of recursions.

tail_obs

Minimum number of observations in the tail required to continue the recursion.

Details

This function implements the method presented in the paper. It may not be as general as ultimate_detct, but it performs better when detecting anomalies in transaction data.

A distinctive characteristic of anomalous clusters in transaction data is that the magnitude of a cluster changes with the price point: clusters at large transaction amounts are commonly smaller than clusters at small transaction amounts. In addition, some clusters at small amounts are not necessarily anomalous. They may come from popular items sold at a specific price, for example $0.99 for a bottle of water or a bar of chocolate.

For this reason, we propose using scan statistics with a smaller window length that scan only the tail of the data, to avoid producing too many false positive clusters at small price points while catching more true positives at large price points.

This function scans for clusters in x above a given price point lower bound loc. After each scan, loc is updated to the largest detected cluster location. The next scan then works on the remaining tail with the new loc and window length. The function stops when the recursion reaches max_rec, when the number of tail observations is less than tail_obs, or when no new clusters are detected.

max_rec should not be too large because, with the Bonferroni method, the significance level of clusters goes up fast. Also, the recursion is a sequential process because the starting point of each scan depends on the results of the previous scan, so the method will take a long time if max_rec is large. Multiple window length scanning will be a topic of future work.
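The stopping rules described above can be illustrated with a minimal sketch. It is not the package's implementation: recursive_scan, width and frac are hypothetical names, a simple bin-count threshold stands in for the scan statistic test, and halving the window after each pass is an assumption made only for illustration.

```r
# Toy recursion: NOT tran_detct itself. A bin is "detected" when it holds at
# least `frac` of the current tail (a stand-in for the scan statistic test).
recursive_scan <- function(x, loc, width = 1, frac = 0.1,
                           max_rec = 3, tail_obs = 50) {
  found <- matrix(numeric(0), ncol = 2,
                  dimnames = list(NULL, c("lower", "upper")))
  passes <- 0
  for (i in seq_len(max_rec)) {              # stop rule 1: at most max_rec passes
    tail_x <- x[x > loc]
    if (length(tail_x) < tail_obs) break     # stop rule 2: tail too short
    bins <- floor((tail_x - loc) / width)
    counts <- table(bins)
    hit <- as.numeric(names(counts)[counts >= frac * length(tail_x)])
    passes <- passes + 1
    if (length(hit) == 0) break              # stop rule 3: no new clusters
    lower <- loc + hit * width
    found <- rbind(found, cbind(lower = lower, upper = lower + width))
    loc <- max(lower) + width                # restart beyond the largest cluster
    width <- width / 2                       # assumed: smaller window on the tail
  }
  list(clusters = found, passes = passes, final_loc = loc)
}

# Deterministic toy data: even background plus two dense spots of unequal size.
x <- c(seq(0, 120, by = 0.5), rep(50.5, 100), rep(100.25, 20))
res <- recursive_scan(x, loc = 30)
res$clusters
```

In this toy data the first pass flags only the large cluster near 50. The smaller cluster near 100 is flagged on the second pass, once the scan restarts on the remaining tail with a narrower window.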
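To see why the number of tests matters, the generic Bonferroni correction can be sketched with base R. How tran_detct counts the tests internally is not stated here, so n_tests and p_raw below are illustrative values only.

```r
# Generic Bonferroni correction (base R); the values are illustrative only.
alpha_lvl <- 0.05
n_tests <- 1:4
per_test_alpha <- alpha_lvl / n_tests   # threshold each single test must meet

# Equivalent view: raw p-values are multiplied by the number of tests.
p_raw <- c(0.004, 0.02, 0.03)
p_adj <- p.adjust(p_raw, method = "bonferroni")

per_test_alpha
p_adj
```

With three tests, only the first raw p-value stays below 0.05 after adjustment, which shows how each extra test makes it harder for a cluster to remain significant.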

Value

This function returns a list with components:

Total

Estimated quantity of clusters

Cluster

A matrix whose first two columns are the boundaries of the clusters and whose third column is the corresponding p-value. Note that clusters are not necessarily mutually exclusive.

Plot

The plot.
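As a hedged illustration of working with the Cluster component, the sketch below uses a hand-made mock list, not real tran_detct output. It assumes only the column layout described above (lower bound, upper bound, p-value), and the 0.05 cutoff is an arbitrary choice.

```r
# Mock result with the documented Cluster layout; values are invented.
res <- list(
  Cluster = matrix(c(100.0, 101, 0.200,
                      50.0,  51, 0.001,
                      50.5,  52, 0.010),
                   ncol = 3, byrow = TRUE)
)

cl <- res$Cluster
cl <- cl[order(cl[, 1]), , drop = FALSE]        # sort by lower boundary

# Keep clusters below an arbitrary p-value cutoff.
sig <- cl[cl[, 3] < 0.05, , drop = FALSE]

# Clusters may overlap: flag rows that start before the previous row ends.
overlaps_prev <- c(FALSE, cl[-1, 1] <= cl[-nrow(cl), 2])

sig
overlaps_prev
```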

Note

Package POT https://cran.r-project.org/package=POT needs to be installed first.

Author(s)

Zhicong Zhao

Examples

set.seed(100)
x <- c(rgamma(4000, 2, 0.05),
       runif(100, 50, 51),
       runif(20, 100, 101))  ## generate data

res_rec <- tran_detct(x, loc = 30, HRR_kernel = "gaussian", est_fun = "sbsp",
                      n_hz_sample = 20, n_hz_size = 50, MLE_unit = 5,
                      x_unit = 0.001,
                      hazard_bandwidth = 0.2)  ## recursive result

zhicongz/AnomDetct documentation built on Dec. 12, 2019, 9:16 a.m.