knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = TRUE, cache = FALSE)
outsiders
is an unsupervised outlier detection package for use with standard multidimensional data. The state.x77
data matrix containing 9 demographic characteristics of US states was used often for algorithm development and evaluation. Making use of the same data set, this vignette will provide code and visual analysis demonstrating methods available in the package.
# document dependencies library(outsiders) # devtools::install_github("dannymorris/outsiders") to install library(dplyr) # data manipualation (CRAN) library(tidyr) # structural formatting and data standardization (CRAN) library(ggplot2) # static data visualization (CRAN) library(knitr) # document printing (eg tables) (CRAN) library(kableExtra) # "kable" (table) formatting (CRAN) library(QuickR) # another personal package with utility functions
state.x77 %>% head() %>% kable()
The standard variable scaling convention is to subtract the variable mean from each observation and divide by standard deviation. Doing so eliminates the undesirable influence of variables measured on larger and wider scales, such as Population (in thousands) compared to HS Grad (percent between 0 and 100).
$$X scaled = (X - mean(X)) / sd(X)) $$
data_scaled <- scale(state.x77)
par(mfrow = c(3, 3), mar = c(2,2,2,2)) for (i in 1:ncol(state.x77)) { plot(density(state.x77[, i]), xlab = "", main = colnames(state.x77)[i]) points(x = state.x77[, i], y = rep(0, 50), col = QuickR::add_alpha("red", 0.5), pch = 21) }
The density plot matrix shows an interesting combination of skewed, normal, and multi-modal distributions. Outliers are evident in univariate visual analysis and are likely evident in the multivariate sense.
principal_components <- princomp(data_scaled) summary(principal_components, loadings = T) pca_scores <- principal_components$scores[, 1:4] pairs(pca_scores, main = "Principal Components 1-4 Explaining \n88% of Total Variation")
Visual analysis of the first four principal components reveals the presence of multivariate outliers along with insights into the spatial arrangement of the data points.
We can use the function chisq_mvn()
to generate ordered multivariate Mahalanobis distances, which follow a chi-squared distribution under the assumption of multivariate normality.
chisq_distances <- outsiders::chisq_mvn(data_scaled) plot(chisq ~ distances, data = chisq_distances, main = "Multivariate Ordered Chi-Squared Distances") abline(lm(chisq ~ distances, data = chisq_distances), lty = 2)
The data appears largely non-normal in the multivariate sense.
The also()
function implements Attribute-Wise Learning for Scoring Outliers (ALSO), which combines supervised and unsupervised learning to score outliers using dependent variable modeling of all features. For each variable in the original data set, a supervised model (classification or regression) is fit using the remaining variables as predictors. Outlier scores reflect the distance of the predicted value for a given observation to its actual value. Outlier scores are then summed to provide a single score suitable for extreme-value analysis.
also_rand_forest <- outsiders::also(data = data_scaled, method = randomForest::randomForest, cross_validate = FALSE, scores_only = TRUE)
We will use this custom function to automate and standardize visualizations of outlier scores again the first four principal components.
plot_score <- function(score, global_title, low = "green4", high = "orange", cex = 1.3, cols = QuickR::add_alpha(mapping, 0.85), ...) { # numeric color scale colors <- colorRamp(c(low, high)) mapping <- rgb(colors(score / max(score)), maxColorValue = 255) # construct base plots build_plot <- function(data, title, xlab, ylab, ...) { plot(data, pch = 19, col = cols, cex = cex, xlab = xlab, ylab = ylab, xaxt = "n", yaxt = "n", ...) title(main = title, line = 0.5) } # set plotting region par(mfrow = c(2,2), mar = c(2,2,3,2), mgp = c(1,0,0)) # build 3 base plots build_plot(pca_scores[, 1:2], title = "PC 1 & 2", xlab = "PC 1", ylab = "PC 2") build_plot(pca_scores[, 2:3], title = "PC 2 & 3", xlab = "PC 2", ylab = "PC 3") build_plot(pca_scores[, 3:4], title = "PC 3 & 4", xlab = "PC 3", ylab = "PC 4") # add global title par(mfrow = c(1,1)) mtext(text = global_title, line = 2) }
plot_score(also_rand_forest, global_title = "ALSO with Random Forest")
also_lm <- outsiders::also(data = data_scaled, method = lm, cross_validate = TRUE, n_folds = 10, scores_only = TRUE) plot_score(also_lm, global_title = "ALSO with OLS")
This section will show two variations of k-nearest neighbors based on standard Euclidean distances: average k-nearest neighbors and harmonic mean of N-nearest neighbors. Both technques use the aggregate_knn()
function to aggregate distances over a selection of k-nearest neighbors.
average_5nn <- outsiders::aggregate_knn(data_scaled, fun = mean, k = 5) harmonic_5nn <- outsiders::aggregate_knn(data_scaled, fun = QuickR::harmonic_mean, nrow(data_scaled))
plot_score(average_5nn, global_title = "Average 5-Nearest Neighbors") plot_score(harmonic_5nn, global_title = "Harmonic Mean N-Nearest Neighbors")
Compared to the average Euclidean 5-nearest neightbors outlier score the harmonic mean variation detects more noise.
The dist_to_centers()
function can be used to find the multivariate distances of data points to centroids (e.g. mean). In this application, we'll generate cluster mean vectors from Ward's hierarchical clustering and calculate the minimum Mahalanobis distance of each data point to each cluster mean vector. The minimum Mahalanobis distance serves as the outlier score.
# Ward's hierarchical clustering pca_distance_mat <- dist(pca_scores) pca_hcl <- hclust(pca_distance_mat, 'ward.D2') plot(pca_hcl) # 3-cluster solution hcl3 <- cutree(pca_hcl, 3) # cluster mean vectors centers <- pca_scores %>% as_tibble() %>% mutate(clus = hcl3) %>% group_by(clus) %>% summarise_all(mean) # minimum Mahalanobis distance of observations to cluster means min_d2c <- outsiders::dist_to_centers(pca_scores, labels = hcl3, centers = centers[, -1]) %>% apply(., 1, min) plot_score(min_d2c, global_title = "Minimum Mahalanobis Distance to Clusters")
PCA rotation forest is an unsupervised ensemble technique that transforms subsamples of the oringinal data into new variables via principal components. All principal components are retained in each iteration, and the outlier scores from each iteration are summed to produce a final outlier score. The function pca_bag()
is used to generate scores.
pca_forest <- outsiders::pca_bag(data = data_scaled, outlier_fun = function(x) mahalanobis(x, colMeans(x), cov(x))) plot_score(pca_forest, global_title = "PCA Rotation Forest Combining Mahalanobis Distance Scores")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.