This function applies scan-statistic hypothesis tests with multiple window lengths to handle multiple clusters of different magnitudes. The cluster significance level is estimated by the Bonferroni method.
tran_detct(x, theta_th = 1, theta_0 = theta_th, alpha_lvl = 0.05,
           loc, HRR_kernel = "triangular",
           hazard_bandwidth = 0.1, knn = NULL, est_fun = "pt",
           n_hz_sample = NULL, n_hz_size = NULL,
           pt_int = seq(0, 1, by = 0.05),
           seq_theta = seq(0.5, 1, by = 0.05) * theta_0,
           x_unit = 0.01, plot_unit = 1, MLE_unit = 0.01,
           plt_mgn = 0, max_rec = 3, tail_obs = 50)
x |
A numeric vector of data values on which the hypothesis test is applied. |
theta_th |
Initial theoretical theta value of hypothesis test. |
theta_0 |
Initial real theta value of hypothesis test. Default value is
same as |
alpha_lvl |
Significance level for the hypothesis test with
the initial theoretical theta value |
loc |
Lower bound for applying scan statistics. It is also the
|
HRR_kernel |
A character string giving the smoothing kernel to be used
in |
hazard_bandwidth |
The smoothing bandwidth to be used. |
knn |
Number of neighboring points to be considered in smoothing for the
" |
est_fun |
A character string giving the hazard rate ratio
estimation function. This must match with either " |
n_hz_sample |
Number of replicates if |
n_hz_size |
Resampled size if |
pt_int |
A vector of hazard rate ratio estimated points. |
seq_theta |
A vector of theta values put in |
x_unit |
A number indicating the uniformization bin width. |
plot_unit |
A number indicating bin width for histogram in the plot. |
MLE_unit |
A number indicating the bin width for counting excess. |
plt_mgn |
Extra margin of clusters shown in plot. |
max_rec |
Maximum number of recursions. |
tail_obs |
Minimum number of observations in the tail required to continue the recursion. |
This function implements the method presented in the paper. It may not be as general as ultimate_detct, but it always performs better in anomaly detection on transaction data.
A characteristic unique to anomalous clusters in transaction data is that the magnitude of a cluster changes with the price point: clusters at large transaction amounts are commonly smaller than clusters at small transaction amounts. Also, some clusters at small amounts are not necessarily anomalous. They may come from popular items sold at a specific price, for example $0.99 for a bottle of water or a bar of chocolate.
In this case, we propose using scan statistics with a smaller window length that scan only the tail part of the data, to avoid obtaining too many false-positive clusters at small price points and to catch more true positives at large price points.
This function scans for clusters on x with a given price-point lower bound loc. After each scan, loc is updated to the largest detected cluster location. The next scan then works on the remaining tail with the new loc and window length. The function stops when the recursion reaches max_rec, when the number of tail observations is less than tail_obs, or when no new clusters are detected.
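The recursion described above can be sketched roughly as follows. This is only an illustration of the control flow, not the package's implementation: `scan_once`, `recursive_scan`, the count threshold, and the `windows` vector (one window length per pass) are hypothetical names and choices, since the text does not say how a single scan is computed or how the window length changes between passes.

```r
# Hypothetical stand-in for ONE scan pass (not the package's scan statistic):
# bins the tail above `loc` and flags bins with more than twice the median count.
scan_once <- function(x, loc, window) {
  tail_x <- x[x > loc]
  breaks <- seq(loc, max(tail_x) + window, by = window)
  counts <- hist(tail_x, breaks = breaks, plot = FALSE)$counts
  hot <- which(counts > 2 * median(counts))
  cbind(lower = breaks[hot], upper = breaks[hot + 1])
}

# Sketch of the recursion: scan, move `loc` past the largest detected
# cluster, then scan the remaining tail again.
recursive_scan <- function(x, loc, windows, max_rec = 3, tail_obs = 50) {
  found <- NULL
  for (i in seq_len(min(max_rec, length(windows)))) {
    if (sum(x > loc) < tail_obs) break        # too few tail observations
    clusters <- scan_once(x, loc, windows[i])
    if (nrow(clusters) == 0) break            # no new clusters detected
    found <- rbind(found, clusters)
    loc <- max(clusters[, "upper"])           # new lower bound for the next pass
  }
  list(clusters = found, loc = loc)
}

# Toy data: an even background plus one artificial spike at 5.25.
x <- c(rep(seq(0.25, 9.75, by = 0.5), each = 50), rep(5.25, 200))
# Assumption: the window shrinks on later passes; the values are arbitrary.
res <- recursive_scan(x, loc = 0, windows = c(1, 0.5, 0.25))
res$clusters
```

Each pass depends on the `loc` produced by the previous one, which is why the passes cannot run in parallel.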
max_rec should not be too large because, with the Bonferroni method, the significance level of clusters goes up fast. Also, the recursion is a sequential process because the starting point of the next scan depends on the results of the previous scan. The method will take a long time if max_rec is large.
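To illustrate why a large max_rec is costly under a Bonferroni correction, here is a small sketch using only base R. It assumes that the number of tests being corrected for equals the number of recursive passes; the function's actual accounting may differ, and the p-values are made up for illustration.

```r
# Assumption: one test per recursive pass, so k passes means k tests.
p_raw <- 0.01                       # illustrative raw p-value of one cluster
n_pass <- 1:6                       # candidate values of max_rec
p_adj <- pmin(1, p_raw * n_pass)    # Bonferroni: multiply by number of tests, cap at 1
adj_table <- data.frame(max_rec = n_pass, p_adjusted = p_adj)
adj_table

# The same correction via base R's p.adjust for three illustrative p-values.
p_three <- p.adjust(c(0.01, 0.02, 0.04), method = "bonferroni")
p_three
```

Under this assumption, a cluster with a raw p-value of 0.01 is no longer significant at the 0.05 level once six passes are corrected for.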
Scanning with multiple window lengths is a topic for future work.
This function returns a list with components:
Total |
Estimated quantity of clusters |
Cluster |
A matrix in which the first two columns are the boundaries of the clusters and the third column is the corresponding p-value. Note that clusters are not necessarily mutually exclusive. |
Plot |
The plot. |
Package POT (https://cran.r-project.org/package=POT) needs to be installed first.
Zhicong Zhao