knitr::opts_chunk$set(echo = TRUE, fig.width = 10, fig.height = 7, out.width = '100%', dpi = 100)
library(knitr)
library(dplyr)
library(ggplot2)
library(stranger)
library(tidyr)
In this vignette, we introduce every method available in the stranger package. Note that some methods may require extra packages.
We will work with the iris data frame and use the lucky_odds function.
The currently available wrappers (weird methods) are listed below; one can use the weirds_list function to obtain them.
w <- weirds_list()$detail
kable(w[, c(1:4, 6, 9)], row.names = FALSE)
The following helper functions are introduced to simplify the code appearing in each chunk.
# Scatter plot of two iris measures; flagged anomalies are drawn larger.
anoplot <- function(data, title = NULL) {
  g <- ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species, size = flag_anomaly)) +
    geom_point() +
    scale_size_discrete(range = c(1, 3))
  if (!is.null(title)) g <- g + ggtitle(title)
  return(g)
}
# Print a short description of a weird method from its info list.
infow <- function(method) {
  iw <- get(paste("weird", method, sep = "_"))(info = TRUE)
  cat(paste0("*** weird method ", iw$weird_method))
  cat(paste0("\n", iw$name, " based on function ", iw$foo, " [", iw$package, "]"))
  cat(paste0("\n", "Metric: ", iw$detail, " sorted in ", ifelse(iw$sort == 1, "increasing", "decreasing"), " order."))
}
infow("abod")
iris %>% lucky_odds(n.anom=6, analysis.drop="Species", weird="abod") %>% anoplot(title="abod - default parameters")
Default values for abod parameters:
Parameters from abod (method and n_sample_size): see ?abod.
Extra parameters used in stranger for this weird: none.
Default naming convention for generated metric based on k.
NOTE: this method is not recommended for volumetric data.
From abod help:
Details
Please note that 'knn' has to compute an euclidean distance matrix before computing abof.
Value
Returns angle-based outlier factor for each observation. A small abof respect the others would indicate presence of an outlier.
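To make the quoted definition more concrete, here is a minimal base R sketch of an angle-based outlier factor. It is an illustration under stated assumptions, not the code used by abod or by stranger: the helper name abof_sketch is hypothetical, every pair of other points is used (no 'knn' or sampling shortcut), and the variance of the scaled dot products is left unweighted. As in the help text, a small value relative to the others suggests an outlier.

```r
# Hypothetical helper: unweighted angle-based outlier factor for each row of x.
abof_sketch <- function(x) {
  x <- as.matrix(x)
  vapply(seq_len(nrow(x)), function(i) {
    d <- sweep(x[-i, , drop = FALSE], 2, x[i, ])  # vectors from point i to all others
    g <- d %*% t(d)                               # dot products between those vectors
    n2 <- diag(g)                                 # squared lengths of those vectors
    w <- g / outer(n2, n2)                        # dot product scaled by squared lengths
    # One value per pair of other points; NaN caused by exact duplicates is dropped.
    var(w[upper.tri(w)], na.rm = TRUE)
  }, numeric(1))
}

abof <- abof_sketch(iris[, 1:4])

# Small values are the suspicious ones: list the 6 smallest (mirrors n.anom = 6).
head(order(abof), 6)
```

Working on all pairs costs roughly n^3 operations, which is fine for iris but would not scale; that is presumably why the original function offers alternative methods.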
infow("autoencode")
iris %>% lucky_odds(n.anom=6, analysis.drop="Species", weird="autoencode") %>% anoplot(title="autoencode - default parameters")
Changing some parameters:
iris %>% lucky_odds(n.anom=6, analysis.drop="Species", weird="autoencode",nl=4, N.hidden=c(10,8),beta=6) %>% anoplot(title="autoencode - change network layers structure")
Default values for autoencode parameters:
Parameters from autoencode: rescaling.offset.
Extra parameters used in stranger for this weird: none.
Default naming convention for generated metric based on nl and N.hidden.
From the autoencoder package:
An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation to adjust its weights, attempting to learn to make its target values (outputs) to be equal to its inputs. In other words, it is trying to learn an approximation to the identity function, so as its output is similar to its input, for all training examples. With the sparsity constraint enforced (requiring that the average, over training set, activation of hidden units be small), such autoencoder automatically learns useful features of the unlabeled training data, which can be used for, e.g., data compression (with losses), or as features in deep belief networks.
Here, an autoencoder is trained and then applied to the same data; observations with high residuals are the candidate anomalies.
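The "fit, reconstruct, look at residuals" idea can be sketched without a neural network. The example below swaps in PCA as a linear stand-in for the autoencoder: it projects the data onto two components (the "encoding"), maps it back (the "decoding") and scores each row by its squared reconstruction error. This is only an analogy under that assumption; it is not what autoencode or stranger computes, and the choice of two components and of standardised inputs is arbitrary.

```r
# Standardised numeric columns (Species is dropped, as in the calls above).
x <- scale(as.matrix(iris[, 1:4]))

# PCA on already centred data; keep 2 components as a bottleneck.
pca <- prcomp(x, center = FALSE, scale. = FALSE)
ncomp <- 2

# Encode (scores on the first components), then decode back to the input space.
recon <- pca$x[, 1:ncomp] %*% t(pca$rotation[, 1:ncomp])

# Squared reconstruction error per row: high values are poorly reconstructed.
recon_error <- rowSums((x - recon)^2)

# The 6 rows with the highest residuals (mirrors n.anom = 6).
head(order(recon_error, decreasing = TRUE), 6)
```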
infow("isofor")
iris %>% lucky_odds(n.anom=6, analysis.drop="Species", weird="isofor") %>% anoplot(title="isofor - default parameters")
Default values for isofor parameters -- see ?iForest:
Extra parameters used in stranger for this weird: none.
Default naming convention for generated metric based on nt and phi.
NOTE: this method is not recommended for volumetric data.
From iForest help:
An Isolation Forest is an unsupervised anomaly detection algorithm. The requested number of trees, nt, are built completely at random on a subsample of size phi. At each node a random variable is selected. A random split is chosen from the range of that variable. A random sample of factor levels are chosen in the case the variable is a factor.
Records from X are then filtered based on the split criterion and the tree building begins again on the left and right subsets of the data. Tree building terminates when the maximum depth of the tree is reached or there are 1 or fewer observations in the filtered subset.
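The tree-building description above can be sketched in a few lines of base R. This is a simplified illustration, not the iForest implementation: the function name isolation_depth is hypothetical, only numeric columns are handled, and instead of building each tree once and passing every record through it, a fresh random path is drawn for each record and each tree. The names nt and phi follow the help text; their values here are arbitrary. Records that are isolated after few splits (small average depth) are the suspicious ones.

```r
# Hypothetical helper: number of random splits needed before `point` is alone
# (or the depth limit is hit) in a random tree grown on the rows of x.
isolation_depth <- function(point, x, depth = 0, max_depth = 8) {
  if (nrow(x) <= 1 || depth >= max_depth) return(depth)
  j <- sample(ncol(x), 1)                  # random variable
  rng <- range(x[, j])
  if (rng[1] == rng[2]) return(depth)      # simplification: stop on a constant column
  cut_at <- runif(1, rng[1], rng[2])       # random split within the variable's range
  # Keep only the records that fall on the same side of the split as `point`.
  keep <- if (point[j] < cut_at) x[, j] < cut_at else x[, j] >= cut_at
  isolation_depth(point, x[keep, , drop = FALSE], depth + 1, max_depth)
}

set.seed(1)
x <- as.matrix(iris[, 1:4])
nt <- 100                                  # number of random trees (arbitrary here)
phi <- 64                                  # subsample size per tree (arbitrary here)
max_depth <- ceiling(log2(phi))

avg_depth <- vapply(seq_len(nrow(x)), function(i) {
  mean(replicate(nt, {
    sub <- x[sample(nrow(x), phi), , drop = FALSE]
    isolation_depth(x[i, ], sub, max_depth = max_depth)
  }))
}, numeric(1))

# Shortest average depth first: the 6 most easily isolated rows.
head(order(avg_depth), 6)
```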
infow("kmeans")
iris %>% lucky_odds(n.anom=6, analysis.drop="Species", weird="kmeans") %>% anoplot(title="kmeans - default parameters")
Default values for kmeans parameters -- see ?kmeans:
Extra parameters used in stranger for this weird: none.
Default naming convention for generated metric based on type and centers.
iris %>% lucky_odds(n.anom=6, analysis.drop="Species",weird="kmeans",type="euclidian",centers=8) %>% anoplot(title="kmeans - euclidean - nclusters (centers)=8")
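The vignette does not spell out how a clustering becomes an anomaly score. One common approach, assumed here and not confirmed to be what stranger does, is to score each row by its Euclidean distance to the center of the cluster it was assigned to. The sketch below uses only stats::kmeans; the number of centers, the standardisation and the seed are arbitrary choices made for the illustration.

```r
set.seed(42)                                # kmeans starts from random centers
x <- scale(as.matrix(iris[, 1:4]))          # standardising is a choice of this sketch

km <- kmeans(x, centers = 3, nstart = 10, iter.max = 50)

# Euclidean distance from each row to the center of its own cluster.
own_center <- km$centers[km$cluster, , drop = FALSE]
dist_to_center <- sqrt(rowSums((x - own_center)^2))

# Rows far from their center are the candidates (mirrors n.anom = 6).
head(order(dist_to_center, decreasing = TRUE), 6)
```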
infow("knn")
iris %>% lucky_odds(n.anom=6, analysis.drop="Species", weird="knn") %>% anoplot(title="knn - default parameters")
Default values for knn parameters -- see ?knn:
Parameters from knn: prob, algorithm.
Extra parameters used in stranger for this weird: simplify (for example median); one can also use one's own function, with its name supplied as a string.
Default naming convention for generated metric based on k and simplify.
iris %>% lucky_odds(n.anom=6, analysis.drop="Species",weird="knn",k=8) %>% anoplot(title="knn - k=8")
iris %>% lucky_odds(n.anom=6, analysis.drop="Species",weird="knn",simplify="median") %>% anoplot(title="knn - k=default (10), simplify=median")
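As a rough picture of what k and simplify control, the sketch below computes, for every row, the distances to its k nearest neighbours and summarises them with a function given by name. It is a brute-force base R illustration: knn_metric is a hypothetical helper, and the assumption that simplify aggregates neighbour distances is an interpretation of the text above, not a description of stranger's internals.

```r
x <- as.matrix(iris[, 1:4])

# Full Euclidean distance matrix; a point is not its own neighbour.
d <- as.matrix(dist(x))
diag(d) <- Inf

# Hypothetical helper: summarise the k smallest distances of each row
# with a function supplied by name (for example "mean" or "median").
knn_metric <- function(d, k = 10, simplify = "mean") {
  agg <- match.fun(simplify)
  apply(d, 1, function(row) agg(sort(row)[seq_len(k)]))
}

score_mean <- knn_metric(d, k = 10, simplify = "mean")
score_median <- knn_metric(d, k = 10, simplify = "median")

# Rows whose neighbours are far away are the candidates (mirrors n.anom = 6).
head(order(score_mean, decreasing = TRUE), 6)
```

Because the summary function is looked up by name, a user-defined function can be passed the same way, for instance simplify = "max" or the name of a custom function defined beforehand.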
infow("lof")
iris %>% lucky_odds(n.anom=6, analysis.drop="Species", weird="lof") %>% anoplot(title="lof - default parameters")
Default values for lof parameters -- see ?lof:
Parameters from kNN in the dbscan package: search, bucketSize...
Extra parameters used in stranger for this weird: none.
Default naming convention for generated metric based on minPts.
iris %>% lucky_odds(n.anom=6, analysis.drop="Species",weird="lof",minPts=5, search="linear") %>% anoplot(title="lof - minPts=5 - linear kNN")
From lof help:
LOF compares the local density of an point to the local densities of its neighbors. Points that have a substantially lower density than their neighbors are considered outliers. A LOF score of approximately 1 indicates that density around the point is comparable to its neighbors. Scores significantly larger than 1 indicate outliers.
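The density comparison described above can be written out directly. The following is a brute-force base R sketch of the textbook LOF computation, not the dbscan implementation: lof_sketch is a hypothetical name, neighbourhoods contain exactly k points (ties at the k-th distance are not expanded), and k here only loosely corresponds to minPts.

```r
# Hypothetical helper: local outlier factor from a full distance matrix.
lof_sketch <- function(x, k = 5) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf                            # a point is not its own neighbour
  n <- nrow(d)
  # Indices of the k nearest neighbours of each point (one row per point).
  nn <- matrix(t(apply(d, 1, function(row) order(row)[seq_len(k)])), nrow = n)
  # k-distance: distance to the k-th nearest neighbour.
  kdist <- d[cbind(seq_len(n), nn[, k])]
  # Local reachability density: inverse of the mean reachability distance,
  # where reach-dist(p, o) = max(k-distance(o), d(p, o)).
  lrd <- vapply(seq_len(n), function(i) {
    o <- nn[i, ]
    1 / mean(pmax(kdist[o], d[i, o]))
  }, numeric(1))
  # LOF: mean density of the neighbours relative to the point's own density.
  vapply(seq_len(n), function(i) mean(lrd[nn[i, ]]) / lrd[i], numeric(1))
}

scores <- lof_sketch(iris[, 1:4], k = 5)

# Scores well above 1 point to outliers; list the 6 largest (mirrors n.anom = 6).
head(order(scores, decreasing = TRUE), 6)
```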
infow("mahalanobis")
iris %>% lucky_odds(n.anom=6, analysis.drop="Species", weird="mahalanobis") %>% anoplot(title="mahalanobis - default parameters")
No parameter available.
Default naming convention: mahalanobis.
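For reference, a Mahalanobis-based score can be reproduced with base R alone. The sketch below assumes the metric is the usual squared Mahalanobis distance to the column means, using the sample covariance; whether stranger centers or estimates the covariance in exactly this way is not stated in this vignette.

```r
# Numeric columns only: Species is dropped, as in the calls above.
x <- as.matrix(iris[, setdiff(names(iris), "Species")])

# Squared Mahalanobis distance of each row to the column means.
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Flag the 6 rows with the largest distance (mirrors n.anom = 6).
top6 <- order(d2, decreasing = TRUE)[1:6]
flag_anomaly <- seq_len(nrow(x)) %in% top6
table(flag_anomaly)
```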
infow("pcout")
iris %>% lucky_odds(n.anom=6, analysis.drop="Species", weird="pcout") %>% anoplot(title="pcout - default parameters")
Default values for pcout parameters -- see ?pcout:
Extra parameters used in stranger for this weird: none.
Default naming convention for generated metric based on explvar.
iris %>% lucky_odds(n.anom=6, analysis.drop="Species",weird="pcout", explvar=0.8, crit.M1=1, crit.c1=3) %>% anoplot(title="pcout - custom values")
From pcout help:
Based on the robustly sphered data, semi-robust principal components are computed which are needed for determining distances for each observation. Separate weights for location and scatter outliers are computed based on these distances. The combined weights are used for outlier identification.
infow("randomforest")
iris %>% lucky_odds(n.anom=6, analysis.drop="Species", weird="randomforest") %>% anoplot(title="randomforest - default parameters")
Default values for randomforest parameters -- see ?randomForest:
Extra parameters used in stranger for this weird: none.
Default naming convention for generated metric based on ntree and mtry.
iris %>% lucky_odds(n.anom=6, analysis.drop="Species",weird="randomforest", ntree=10,mtry=2) %>% anoplot(title="randomforest - custom values")
The logical next step is to look at how to work with weird methods: manipulating them, working with metrics (aggregation and stacking), and deriving anomalies. For this, read the vignette Working with weirds.