knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)
library(ggplot2)
library(AnomDetct)
set.seed(20191001)

This vignette is a step-by-step guide to help users use the AnomDetct package. Even though the package includes two wrapper functions, AnomDetct::ultimate_detct() and AnomDetct::tran_detct(), that put all the basic functions of this package together and can be applied to the data directly, it is still worthwhile for users to know what the basic functions are doing underneath. This knowledge gives users more flexibility to tailor the hypothesis testing to their data and enhance the performance of this methodology. This vignette focuses on detecting clusters in the data AnomDetct::sim_df_N, which is a numeric vector.
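As a quick preview, the whole pipeline can be run in one call through the wrapper. This is a minimal sketch: only the x argument is set, assuming the remaining arguments of ultimate_detct() have usable defaults.

# Run the full detection pipeline with default settings
# (assumes the other arguments of ultimate_detct() default sensibly)
res <- AnomDetct::ultimate_detct(x = AnomDetct::sim_df_N)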

x <- AnomDetct::sim_df_N
summary(x)
df_N <- data.frame(x = x)
ggplot2::ggplot(data = df_N, ggplot2::aes(x = x)) + ggplot2::geom_histogram(binwidth = 1)

Data cleaning

This detection method works by monitoring the distances between consecutive order statistics. The data must therefore satisfy the following properties, which the code below checks and enforces:

- no missing values;
- no tied (duplicated) values, since ties collapse the distance between consecutive order statistics to zero;
- sorted in ascending order.

# Check if the data has any missing values.
sum(is.na(x))

# Remove missing values and sort in ascending order.
x <- na.omit(x)
x <- sort(x)

# Check if the data has any duplicated values.
# If so, add small random noise to the original observations,
# much like jittering points to avoid overlapping in ggplot2.
x_backup <- x
if(sum(duplicated(x)) != 0){
  x <- AnomDetct::unif(x = x, unit = 0.01, rd = TRUE)
}
N <- length(x)

Convert the data to a uniform distribution

For any continuously distributed random variable x, the CDF value cdf_x is uniformly distributed on (0,1). Instead of directly studying the behavior of x, focusing on cdf_x standardizes the problem: under the null hypothesis, the transformed data follow the same uniform distribution regardless of the original distribution family.
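As a quick base-R sanity check of this probability integral transform (using stats::rexp as a stand-in distribution; this example is not part of AnomDetct):

# Applying the true CDF to continuous data yields values uniform on (0,1)
y <- stats::rexp(10000, rate = 2)
u <- stats::pexp(y, rate = 2)
stats::ks.test(u, "punif")  # large p-value: consistent with uniformity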

Here, AnomDetct::cdf_convert is the function that converts x to cdf_x. To use this function, users need to provide the distribution family and the corresponding parameters. However, in practice, this information might be unavailable. AnomDetct::fit_dist is designed to handle such cases of incomplete information.

Based on the users' knowledge of the underlying distribution of x, any of the following approaches can be used:

# If the users know exactly what distribution data is from.
# For example, assume x is Weibull(0.6, 7)
cdf_x <- AnomDetct::cdf_convert(x = x, dist_null = "weibull", 
                                shape = 0.6, scale = 7)

# If the users only know distribution family but don't know parameters.
dist_info <- AnomDetct::fit_dist(x = x, dist_null = "weibull")
cdf_x     <- AnomDetct::cdf_convert(x = x, dist_null = dist_info$dist, 
                                    dist_info$para[1], dist_info$para[2])

# If the users know nothing for the underlying distribution.
dist_info <- AnomDetct::fit_dist(x = x)
cdf_x     <- AnomDetct::cdf_convert(x = x, dist_null = dist_info$dist, 
                                    dist_info$para[1], dist_info$para[2])

When dist_null is missing in AnomDetct::fit_dist, the function searches over common continuous distribution families and returns the best-fitting underlying distribution via the Kolmogorov-Smirnov test and maximum likelihood estimation. Distribution misspecification is a concern here and will be discussed later.
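Judging from how dist_info is used above, it carries the selected family in dist and the fitted parameters in para, which can be inspected directly:

# Inspect the fitted family and parameters chosen by fit_dist
dist_info$dist
dist_info$para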

Distribution misspecification means that cdf_x is NOT uniformly distributed on (0,1). In fact, the data AnomDetct::sim_df_N are generated from a non-common distribution. Even though fit_dist suggests the "weibull" distribution, it actually does NOT fit the data very well.

df_N <- data.frame(x = cdf_x)
ggplot2::ggplot(data = df_N, ggplot2::aes(x=cdf_x)) + 
  ggplot2::geom_histogram(binwidth = 0.05)

stats::ks.test(x, stats::pweibull, dist_info$para[1], dist_info$para[2])

The p-value of the Kolmogorov-Smirnov test is only 0.02, so at a significance level of 0.05 the null hypothesis is rejected.

Estimating the hazard rate ratio function

Without the hazard rate ratio adjustment, this cluster detection method is very sensitive to distribution misspecification. This function is designed to eliminate the false positive/negative clusters caused by distribution misspecification.

The estimation uses local polynomial regression with reflection to reduce the bias at boundary points. The package supports different combinations of methods and parameters for users to choose from. From an efficiency perspective, AnomDetct::HRR_pt_est with the "gaussian" kernel is not recommended for large data because it involves solving a high-dimensional linear system.
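For smaller data sets, a sketch of the point estimator is below; the argument names are an assumption, mirroring HRR_sbsp_est, and should be checked against the documentation of AnomDetct::HRR_pt_est:

# A sketch only: argument names assumed to mirror HRR_sbsp_est
# density_est <- AnomDetct::HRR_pt_est(pt_int = seq(0, 1, 0.05), cdf = cdf_x,
#                                      kernel = "gaussian",
#                                      hazard_bandwidth = 0.2)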

pt_int <- seq(0,1,0.05)
density_est <- AnomDetct::HRR_sbsp_est(pt_int = pt_int, cdf = cdf_x, 
                                       kernel = "triangular", 
                                       hazard_bandwidth = 0.2,
                                       n_hz_sample = 50, n_hz_size = 1000)
density_est

ggplot2::ggplot(data = data.frame(x = 0), ggplot2::aes(x = x)) + 
  ggplot2::stat_function(fun = density_est$fhat) + ggplot2::xlim(0,1) +
  ggplot2::ylab("fhat") + ggplot2::xlab("cdf_x")

ggplot2::ggplot(data = data.frame(x = 0), ggplot2::aes(x = x)) + 
  ggplot2::stat_function(fun = density_est$HRR) + ggplot2::xlim(0,1) +
  ggplot2::ylab("HRR") + ggplot2::xlab("cdf_x")

The returned object density_est has two elements, fhat and HRR. fhat will be used to estimate the anomaly quantity, while HRR is the hazard rate ratio used in the cluster detection. Ideally, without distribution misspecification, both fhat and HRR would be constant functions equal to 1.
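Since both elements are returned as R functions, a quick numeric spot check complements the plots:

# Without misspecification, both would return values near 1 everywhere
density_est$fhat(c(0.25, 0.5, 0.75))
density_est$HRR(c(0.25, 0.5, 0.75))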

However, if users know precisely what the underlying distribution is, they do not need to add this adjustment in the detection step, and skipping it can actually improve the power of the test.

# No hazard rate ratio adjustment
fhat_fun <- stats::approxfun(c(0,1), c(1,1))
HRR_fun  <- stats::approxfun(c(0,1), c(1,1))
# density_est <- list(fhat = fhat_fun, HRR  = HRR_fun)

Window length

The window length is best set close to the cluster size. In other words, if users know the rough magnitude of the clusters they want to detect, the scan window length should be set to match it. If not, the package provides the function AnomDetct::MLE_window_lth for selecting a prior window length.

# print(dist_info)
window_MLE <- AnomDetct::MLE_window_lth(x = x, dist_null = dist_info$dist, 
                                        dist_info$para[1], dist_info$para[2])
print(window_MLE)

Theoretical test statistic distributions and P-values

Three parameters determine the distribution of the scan statistic:

- the total number of binary trials;
- the window length;
- the success probability of each trial.

Here, the total number of binary trials is given by the data; the window length is either supplied by the users or computed by AnomDetct::MLE_window_lth; and the success probability of each trial is a function of theta, given in equation \@ref(eq:succ-prob).

\begin{equation}
p = 1 - e^{-\theta}
(\#eq:succ-prob)
\end{equation}
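For example, the success probabilities corresponding to a few theta values follow directly from this formula:

# p = 1 - exp(-theta) for a few theta values
theta_demo <- c(0.5, 0.75, 1)
1 - exp(-theta_demo)  # approximately 0.393, 0.528, 0.632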

Once those three numbers are given, the critical sum of successes determines the p-value of the hypothesis test, and vice versa. In this package, the critical sum of successes is calculated under the condition that theta is 1 and the p-value is no larger than 0.05. Then, with the critical sum of successes fixed, the corresponding p-values are recorded as theta shrinks down to 0.5. This information is used when making the decision of the hypothesis test.

Essentially, theta represents the intensity of a cluster. A cluster that can still be detected under a smaller theta in the hypothesis test has a higher intensity.

Estimating the exact distribution of the scan statistic is an NP-hard problem. This package provides two ways to approximate it:

- Monte Carlo simulation;
- a 1-dependent stationary sequences approximation.

Monte Carlo simulation takes too long for the estimation, while the 1-dependent stationary sequences approximation has its own restrictions. In general, Monte Carlo simulation is not recommended.

# set the test significance level
alpha_lvl <- 0.05

# set the sequence of theta values that users want to test.
theta_th <- 1
seq_theta <- seq(0.5, 1, by = 0.05)*theta_th

# minimum window length required to use the 1-dependent stationary sequences approximation
window_Haiman <- AnomDetct::Haiman_window_lth(N = N-1, p = 1-exp(-theta_th), 
                                              alpha = alpha_lvl, 
                                              lower_wl = 3, upper_wl = 100)

window_lth <- max(window_Haiman, window_MLE)

# the minimum critical sum of successes for the given alpha level under theta_th
min_crit <- AnomDetct::crit_theta_fun(N = N-1, window_l = window_lth, 
                                      p = 1-exp(-theta_th), alpha = alpha_lvl)

# the theoretical p-values and q-values under different theta values
q_values <- mapply(AnomDetct::prob_fun, 1-exp(-seq_theta), 
                   MoreArgs = list(N = N-1, 
                                   k = min_crit, 
                                   m = window_lth))
p_values <- mapply(max, 1 - q_values, 0)

# This can also be done by Monte Carlo simulation
# p_values <- mapply(AnomDetct::prob_fun_mc, 1-exp(-seq_theta), 
#                    MoreArgs = list(N = N-1,
#                                    k = min_crit,
#                                    m = window_lth,
#                                    mc_rep = 1000))

A tricky setting here is the total number of trials. N is the size of the data, but since each trial is conducted on a pair of consecutive order statistics, there are actually only N-1 trials.
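This can be verified directly, since the trials correspond to the gaps between consecutive order statistics:

# N sorted observations yield N - 1 consecutive gaps (trials)
length(diff(x)) == N - 1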

Test statistic and clusters

# Detect clusters under different theta levels
HRR <- density_est$HRR(cdf_x)

clst <- mapply(AnomDetct::hypo_test, seq_theta,
               MoreArgs = list(cdf = cdf_x,
                               HRR = HRR,
                               critical_cnt = min_crit,
                               window_l = window_lth),
               SIMPLIFY = F)

# Cluster visualization
clst_p_values <- mapply(function(x,y)if(nrow(x)>0){cbind(x, p_value = y)}
                        else{cbind(x,p_value = numeric(0L))},
                        clst, p_values,
                        SIMPLIFY = F)

utils::tail(clst_p_values,2)
AnomDetct::clst_plot(x = x, org_x = x, clst = clst_p_values,
                     alpha_lvl = alpha_lvl, unit = 1, plt_mgn = 0)

The red bars in figure \@ref(fig:test-res) are the detected clusters. The deeper a bar extends downward, the smaller the alpha level at which that cluster is detected.

Diagnostic

So far, this method has successfully caught several clusters in figure \@ref(fig:test-res). However, there are also two tiny clusters above 100 that the method missed (false negatives). These missed clusters are caused by an inappropriate window_lth.

The power of a scan statistic is maximized when the scan window length is close to the true cluster size. However, when multiple clusters of dramatically different sizes appear in one sample, a single window length can hardly capture all of them at once. Here is another result using a smaller window_lth.

# Slightly turn down theta_th to allow a smaller valid window length
theta_th   <- 0.8
seq_theta  <- seq(0.5, 1, by = 0.05)*theta_th
window_Haiman <- AnomDetct::Haiman_window_lth(N = N-1, p = 1-exp(-theta_th), 
                                              alpha = alpha_lvl, 
                                              lower_wl = 3, upper_wl = 100)
window_lth <- window_Haiman

# repeat previous steps
min_crit <- AnomDetct::crit_theta_fun(N = N-1, window_l = window_lth, 
                                      p = 1-exp(-theta_th), alpha = alpha_lvl)
q_values <- mapply(AnomDetct::prob_fun, 1-exp(-seq_theta),
                   MoreArgs = list(N = N-1, 
                                   k = min_crit, 
                                   m = window_lth))
p_values <- mapply(max, 1 - q_values, 0)
clst <- mapply(AnomDetct::hypo_test, seq_theta,
               MoreArgs = list(cdf = cdf_x,
                               HRR = HRR,
                               critical_cnt = min_crit,
                               window_l = window_lth),
               SIMPLIFY = F)
clst_p_values_s <- mapply(function(x,y)if(nrow(x)>0){cbind(x, p_value = y)}
                          else{cbind(x,p_value = numeric(0L))},
                          clst, p_values,
                          SIMPLIFY = F)

utils::tail(clst_p_values_s, 2)
AnomDetct::clst_plot(x = x, org_x = x, clst = clst_p_values_s, 
                     alpha_lvl = alpha_lvl, unit = 1, plt_mgn = 0)

All clusters are detected, even though the leftmost cluster is too thin to be seen. Users can set plt_mgn in AnomDetct::clst_plot to manually enlarge the margins of the clusters shown in the plot.

AnomDetct::clst_plot(x = x, org_x = x, clst = clst_p_values_s,
                     alpha_lvl = alpha_lvl, unit = 1, plt_mgn = 0.5)

For the data AnomDetct::sim_df_N, after manually changing theta_th and window_lth, this detection model performs better than the model with default parameters. This is why the method is decomposed into these pieces: it lets users tune each step and use the package smoothly.

Recursive detection

Instead of changing window_lth, a more general way to deal with varying cluster sizes is to scan with multiple window lengths and then merge the results. In this approach, the merged clusters' alpha level can be bounded using Bonferroni's inequality.
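A minimal sketch of that Bonferroni bound (the per-window alpha levels here are illustrative, not values computed by the package):

# With one scan per window length, each at its own alpha level,
# the merged alpha level is at most the sum of the individual levels
alpha_each <- c(0.05, 0.05, 0.05)  # illustrative per-window levels
sum(alpha_each)                    # upper bound on the merged alpha level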

AnomDetct::tran_detct is based on Bonferroni inequalities. However, that function only supports the generalized Pareto distribution as the underlying distribution, and the location of this distribution has no default value. To use AnomDetct::tran_detct, the suggested package POT needs to be installed, and loc, the location parameter, needs to be specified.

# require package POT
if(requireNamespace("POT", quietly = TRUE)){
  foo <- AnomDetct::tran_detct(x = AnomDetct::sim_df_N, loc = 20)
  foo$Plot
}

Common Errors and solutions

Here are some common errors that users might encounter when using this package, along with suggestions for resolving them.

Package POT

foo <- AnomDetct::tran_detct(x = AnomDetct::sim_df_N, loc = 20)

AnomDetct::tran_detct() is the only function in this package that depends on the package POT. The current version does not provide an alternative for this function. Users who have not installed POT can still install AnomDetct and use all the other functions in the package. However, to use AnomDetct::tran_detct(), the package POT is required.
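If POT is missing, installing it once from CRAN resolves the error:

# Install the suggested dependency, then re-run tran_detct()
install.packages("POT")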

Large scanning window length

foot <- AnomDetct::ultimate_detct(x = AnomDetct::sim_df_N, window_lth = 3000)

The length of the 1-dependent sequence depends on length(x)/window_lth. When window_lth is too large relative to length(x), too few blocks remain for the 1-dependent stationary sequences approximation to be valid, which is why the call above with window_lth = 3000 fails.
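A quick check of that ratio for the failing call above:

# Number of 1-dependent blocks available with window_lth = 3000;
# the approximation needs this ratio to be large enough
length(AnomDetct::sim_df_N) / 3000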
