tran_detct: Anomaly Detection Via Multiple Window Length Scan Statistics
In zhicongz/AnomDetct: Anomaly Detection via Scan Statistics


This function applies the scan statistics hypothesis test with several window lengths to handle multiple clusters of different magnitudes. The cluster significance level is estimated with the Bonferroni method.

tran_detct(x, theta_th=1, theta_0 = theta_th, alpha_lvl=0.05,
           loc, HRR_kernel = "triangular",
           hazard_bandwidth=0.1, knn = NULL, est_fun = "pt",
           n_hz_sample = NULL, n_hz_size = NULL,
           pt_int = seq(0,1,by = 0.05),
           seq_theta = seq(0.5, 1, by = 0.05)*theta_0,
           x_unit = 0.01, plot_unit = 1, MLE_unit = 0.01,
           plt_mgn = 0, max_rec = 3, tail_obs = 50)

`x`	A numeric vector of data values to which the hypothesis test is applied.
`theta_th`	Initial theoretical theta value of the hypothesis test.
`theta_0`	Initial real theta value of the hypothesis test. The default is the same as `theta_th`.
`alpha_lvl`	Significance level for the hypothesis test with the initial theoretical theta value `theta_th`.
`loc`	Lower bound for applying scan statistics. It is also the `threshold` of `fitgpd`.
`HRR_kernel`	A character string giving the smoothing kernel to be used in `HRR_pt_est` or `HRR_sbsp_est`. This must partially match one of "`gaussian`", "`rectangular`", "`triangular`" or "`knn`". Default is "`triangular`".
`hazard_bandwidth`	The smoothing bandwidth to be used.
`knn`	Number of neighbor points considered in smoothing for the "`knn`" kernel.
`est_fun`	A character string giving the hazard rate ratio estimation function. This must match either "`pt`" or "`sbsp`". Default is "`pt`".
`n_hz_sample`	Number of replicates if `est_fun` is "`sbsp`".
`n_hz_size`	Resampled size if `est_fun` is "`sbsp`".
`pt_int`	A vector of points at which the hazard rate ratio is estimated.
`seq_theta`	A vector of theta values passed to `hypo_test` for cluster detection. This sequence of theta values must be in order. Default is `seq(0.5, 1, by = 0.05)*theta_0/theta_th`.
`x_unit`	A number indicating the uniformization bin width.
`plot_unit`	A number indicating bin width for histogram in the plot.
`MLE_unit`	A number indicating the bin width for counting excess.
`plt_mgn`	Extra margin of clusters shown in plot.
`max_rec`	Maximum number of recursions.
`tail_obs`	Minimum number of observations in the tail required to continue the recursion.
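
As a quick check of two defaults listed above, the base R lines below evaluate the `pt_int` and `seq_theta` expressions from the usage signature. Nothing here calls the package, and `theta_0 <- theta_th` simply reproduces the documented default.

```r
# Defaults copied from the usage signature; theta_0 defaults to theta_th = 1.
theta_th <- 1
theta_0  <- theta_th

pt_int    <- seq(0, 1, by = 0.05)               # points where the ratio is estimated
seq_theta <- seq(0.5, 1, by = 0.05) * theta_0   # theta values tried in turn

length(pt_int)
length(seq_theta)
!is.unsorted(seq_theta)   # seq_theta is required to be in order
```

If a custom `seq_theta` is supplied, the same `is.unsorted()` check is a cheap way to confirm the ordering requirement before calling the function.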

This function implements the method presented in the paper. It may not be as general as `ultimate_detct`, but it performs better for anomaly detection in transaction data.

A distinctive characteristic of anomalous clusters in transaction data is that the cluster magnitude changes with the price point: clusters at large transaction amounts are usually smaller than clusters at small transaction amounts. Also, some clusters at small amounts are not necessarily anomalous. They may come from popular items sold at a specific price, for example $0.99 for a bottle of water or a bar of chocolate.

For this reason, we propose scanning only the tail of the data with a smaller window length, to avoid too many false positive clusters at small price points while catching more true positives at large price points.

This function scans for clusters in `x` above a given price point lower bound `loc`. After each scan, `loc` is updated to the largest location among the detected clusters. The next scan then works on the remaining tail with the new `loc` and window length. The function stops when the recursion reaches `max_rec`, when the number of tail observations is less than `tail_obs`, or when no new clusters are detected.
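
The stopping logic described above can be sketched in base R. This is an illustration only, not the package's implementation: `scan_once()` and `recursive_scan()` are hypothetical names, the toy scan replaces the scan statistics hypothesis test with a simple bin-count rule, and halving the window width after each round is an assumption.

```r
# Toy stand-in for one scan (NOT the package's test): split the tail above
# `loc` into bins of width `width` and flag bins that hold at least
# `min_count` points and more than twice (mean of neighbouring bins + 1).
scan_once <- function(x, loc, width, min_count = 10) {
  tail_x <- x[x > loc]
  if (length(tail_x) == 0) return(NULL)
  counts <- tabulate(floor((tail_x - loc) / width) + 1)
  n <- length(counts)
  baseline <- rowMeans(cbind(c(NA_real_, counts[-n]), c(counts[-1], NA_real_)),
                       na.rm = TRUE)
  hit <- which(counts >= min_count & counts > 2 * (baseline + 1))
  if (length(hit) == 0) return(NULL)
  cbind(left = loc + (hit - 1) * width, right = loc + hit * width)
}

recursive_scan <- function(x, loc, width = 1, max_rec = 3, tail_obs = 50) {
  clusters <- NULL
  for (i in seq_len(max_rec)) {            # stop: max_rec rounds done
    if (sum(x > loc) < tail_obs) break     # stop: too few tail observations
    found <- scan_once(x, loc, width)
    if (is.null(found)) break              # stop: no new clusters detected
    clusters <- rbind(clusters, found)
    loc   <- max(found[, "right"])         # restart above the largest cluster
    width <- width / 2                     # assumed: smaller window next round
  }
  clusters
}

set.seed(100)
x <- c(rgamma(4000, 2, 0.05), runif(100, 50, 51), runif(20, 100, 101))
recursive_scan(x, loc = 30)
```

Because each round starts from the `loc` produced by the previous one, the rounds cannot run in parallel, which is the sequential cost the text refers to.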

`max_rec` should not be too large because, with the Bonferroni method, the significance level of the clusters goes up quickly. The recursion is also a sequential process, because the starting point of each scan depends on the results of the previous scan, so the method takes a long time if `max_rec` is large. Multiple window length scanning is a topic for future work.
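
To illustrate how quickly a Bonferroni correction tightens, the sketch below uses plain arithmetic and base R's `p.adjust()`. Counting one test per scan is an assumption made for illustration, and `p_raw` is a made-up value, not output of this package.

```r
alpha_lvl <- 0.05                     # documented default significance level
n_tests   <- 1:5                      # assumed: one test per scan

# Bonferroni: each test is compared against alpha divided by the test count.
per_test <- alpha_lvl / n_tests
data.frame(n_tests, per_test)

# Equivalent view: the adjusted p-value grows with the number of tests.
p_raw <- 0.012                        # made-up raw p-value, illustration only
p_adj <- pmin(1, p_raw * n_tests)
p_adj <= alpha_lvl                    # which test counts still reject

# p.adjust() applies the same rule to a vector of p-values.
p.adjust(rep(p_raw, 3), method = "bonferroni")
```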

This function returns a list with components:

`Total`	Estimated number of clusters.
`Cluster`	A matrix whose first two columns are the boundaries of the clusters and whose third column is the corresponding p-value. Note that clusters are not necessarily mutually exclusive.
`Plot`	The plot.
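
A minimal sketch of reading the returned list, assuming only the structure documented above. The object `res` is built by hand as a stand-in: its boundaries and p-values are placeholders, not output of `tran_detct`, and `Plot` is left as `NULL` because its class is not documented here.

```r
# Hand-built stand-in with the documented components; all values are placeholders.
res <- list(
  Total   = 2,
  Cluster = matrix(c( 50,  51, 0.001,
                     100, 101, 0.020),
                   ncol = 3, byrow = TRUE),
  Plot    = NULL
)

lower   <- res$Cluster[, 1]   # left boundary of each cluster
upper   <- res$Cluster[, 2]   # right boundary of each cluster
p_value <- res$Cluster[, 3]   # corresponding p-value

# Keep clusters below a chosen level; drop = FALSE preserves the matrix shape.
strict <- res$Cluster[p_value < 0.01, , drop = FALSE]
strict

# Clusters need not be mutually exclusive, so test for overlaps explicitly.
ord <- order(lower)
overlap <- any(lower[ord][-1] < upper[ord][-length(ord)])
overlap
```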

The POT package (https://cran.r-project.org/package=POT) must be installed first.
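
Since `loc` is also the `threshold` of `fitgpd`, a short pre-flight check can confirm that POT is available and that observations exist above the threshold. This is a sketch under assumptions: the simulated data and `loc <- 30` mirror the example on this page, and the direct `POT::fitgpd()` call only illustrates the threshold argument, not how `tran_detct` uses it internally.

```r
# Check for POT without installing anything automatically.
has_pot <- requireNamespace("POT", quietly = TRUE)
if (!has_pot) message('POT is missing; run install.packages("POT") first.')

set.seed(100)
x   <- c(rgamma(4000, 2, 0.05), runif(100, 50, 51), runif(20, 100, 101))
loc <- 30

n_exceed <- sum(x > loc)   # observations above the threshold
n_exceed

if (has_pot) {
  # Generalized Pareto fit to the exceedances over loc (default estimator).
  fit <- POT::fitgpd(x, threshold = loc)
  print(fit)
}
```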

Zhicong Zhao

library(AnomDetct)

set.seed(100)
## generate data: gamma background plus two uniform clusters
x <- c(rgamma(4000, 2, 0.05),
       runif(100, 50, 51),
       runif(20, 100, 101))

## recursive result
res_rec <- tran_detct(x, loc = 30, HRR_kernel = "gaussian", est_fun = "sbsp",
                      n_hz_sample = 20, n_hz_size = 50, MLE_unit = 5,
                      x_unit = 0.001,
                      hazard_bandwidth = 0.2)