knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = TRUE,
                      cache = FALSE)

Overview

outsiders is an unsupervised outlier detection package for standard multidimensional data. The state.x77 data matrix, which contains 8 demographic and geographic statistics on the US states, has often been used for algorithm development and evaluation. Using this data set, the vignette provides code and visual analysis demonstrating the methods available in the package.

Document Preparation

# document dependencies
library(outsiders)   # devtools::install_github("dannymorris/outsiders") to install

library(dplyr)       # data manipulation (CRAN)
library(tidyr)       # structural formatting and data standardization (CRAN)
library(ggplot2)     # static data visualization (CRAN)

library(knitr)       # document printing (e.g. tables) (CRAN)
library(kableExtra)  # "kable" (table) formatting (CRAN)

library(QuickR)      # another personal package with utility functions

Demographic Data on US States

state.x77 %>%
    head() %>%
    kable()

Variable Scaling

The standard variable scaling convention is to subtract the variable mean from each observation and divide by the standard deviation. Doing so eliminates the undesirable influence of variables measured on larger and wider scales, such as Population (in thousands) compared to HS Grad (a percentage between 0 and 100).

$$X_{scaled} = \frac{X - \mathrm{mean}(X)}{\mathrm{sd}(X)}$$

data_scaled <- scale(state.x77)
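As a quick check, standardizing by hand with column means and standard deviations reproduces the result of scale() (illustrative only):

# manual standardization: subtract column means, divide by column standard deviations
centered <- sweep(state.x77, 2, colMeans(state.x77), "-")
manual_scaled <- sweep(centered, 2, apply(state.x77, 2, sd), "/")

# agrees with scale() up to attributes
all.equal(as.numeric(manual_scaled), as.numeric(data_scaled))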

Exploratory Univariate Analysis

par(mfrow = c(3, 3),
    mar = c(2,2,2,2))

for (i in 1:ncol(state.x77)) {
    plot(density(state.x77[, i]), xlab = "", main = colnames(state.x77)[i])
    points(x = state.x77[, i], y = rep(0, 50), 
           col = QuickR::add_alpha("red", 0.5), pch = 21)
}

The density plot matrix shows an interesting mix of skewed, normal, and multi-modal distributions. Outliers are evident in the univariate visual analysis and are likely present in the multivariate sense as well.

Multivariate Visualization of Principal Components

principal_components <- princomp(data_scaled)
summary(principal_components, loadings = TRUE)

pca_scores <- principal_components$scores[, 1:4]

pairs(pca_scores, main = "Principal Components 1-4 Explaining \n88% of Total Variation")

Visual analysis of the first four principal components reveals the presence of multivariate outliers along with insights into the spatial arrangement of the data points.

Assessment of Multivariate Normality

We can use the function chisq_mvn() to generate ordered multivariate Mahalanobis distances, which follow a chi-squared distribution under the assumption of multivariate normality.
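For intuition, the same quantities can be constructed by hand with mahalanobis() and chi-squared quantiles. The code below is only a sketch of the idea, not the chisq_mvn() implementation, whose return value is used in the plot that follows.

# sketch: squared Mahalanobis distances compared against chi-squared quantiles
# with p = ncol(data) degrees of freedom
d2 <- mahalanobis(data_scaled, center = colMeans(data_scaled), cov = cov(data_scaled))
qq <- qchisq(ppoints(length(d2)), df = ncol(data_scaled))

plot(qq, sort(d2),
     xlab = "Chi-squared quantiles", ylab = "Ordered squared distances",
     main = "Hand-rolled chi-squared Q-Q plot")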

chisq_distances <- outsiders::chisq_mvn(data_scaled)

plot(chisq ~ distances, data = chisq_distances, 
     main = "Multivariate Ordered Chi-Squared Distances")
abline(lm(chisq ~ distances, data = chisq_distances), lty = 2)

The data appears largely non-normal in the multivariate sense.

Outlier Detection


Attribute-Wise Learning

The also() function implements Attribute-Wise Learning for Scoring Outliers (ALSO), which combines supervised and unsupervised learning by modeling each feature as a function of the others. For each variable in the original data set, a supervised model (classification or regression) is fit using the remaining variables as predictors. Outlier scores reflect the distance between each observation's predicted and actual values. The per-variable scores are then summed to produce a single score suitable for extreme-value analysis.
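To make the scheme concrete, here is a minimal sketch of the ALSO idea using ordinary least squares via lm() and squared prediction errors. It is illustrative only and is not the also() implementation (which, as shown below, also supports cross-validation and other model types).

# ALSO sketch: model each variable on the others and accumulate squared
# prediction errors as per-observation outlier scores
also_sketch <- function(x) {
    x <- as.data.frame(x)
    names(x) <- make.names(names(x))  # guard against spaces in column names
    scores <- numeric(nrow(x))
    for (j in seq_along(x)) {
        fit <- lm(x[[j]] ~ ., data = x[-j])
        scores <- scores + (x[[j]] - fitted(fit))^2
    }
    scores
}

also_sketch_scores <- also_sketch(data_scaled)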

Random Forest

also_rand_forest <- outsiders::also(data = data_scaled,
                                    method = randomForest::randomForest,
                                    cross_validate = FALSE,
                                    scores_only = TRUE)

We will use the following custom function to automate and standardize visualizations of outlier scores against the first four principal components.

plot_score <- function(score, global_title, low = "green4", high = "orange", 
                       cex = 1.3, cols = NULL, ...) {

    # map scores to a numeric color scale running from `low` to `high`
    colors <- colorRamp(c(low, high))
    mapping <- rgb(colors(score / max(score)), maxColorValue = 255)
    if (is.null(cols)) cols <- QuickR::add_alpha(mapping, 0.85)

    # construct base plots
    build_plot <- function(data, title, xlab, ylab, ...) {
        plot(data, pch = 19, col = cols,
             cex = cex, xlab = xlab, ylab = ylab, xaxt = "n", yaxt = "n", 
             ...)
        title(main = title, line = 0.5)
    }

    # set plotting region, reserving an outer margin for the global title
    par(mfrow = c(2, 2),
        mar = c(2, 2, 3, 2),
        mgp = c(1, 0, 0),
        oma = c(0, 0, 2, 0))

    # build 3 base plots
    build_plot(pca_scores[, 1:2], title = "PC 1 & 2", xlab = "PC 1", ylab = "PC 2")
    build_plot(pca_scores[, 2:3], title = "PC 2 & 3", xlab = "PC 2", ylab = "PC 3")
    build_plot(pca_scores[, 3:4], title = "PC 3 & 4", xlab = "PC 3", ylab = "PC 4")

    # add global title in the outer margin
    mtext(text = global_title, outer = TRUE, line = 0.5)
}
plot_score(also_rand_forest, global_title = "ALSO with Random Forest")

Ordinary Least Squares

also_lm <- outsiders::also(data = data_scaled,
                           method = lm,
                           cross_validate = TRUE,
                           n_folds = 10,
                           scores_only = TRUE)

plot_score(also_lm, global_title = "ALSO with OLS")

Proximity-based Scoring


This section shows two variations of k-nearest-neighbor scoring based on standard Euclidean distances: the average distance to the 5 nearest neighbors, and the harmonic mean distance over all N nearest neighbors (i.e. every other observation). Both techniques use the aggregate_knn() function to aggregate distances over the selected nearest neighbors.
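For intuition, the average k-nearest-neighbor score can be computed directly from a pairwise Euclidean distance matrix. The sketch below is illustrative only and is not the aggregate_knn() implementation.

# sketch: average distance to the k nearest neighbors (excluding the point itself)
avg_knn_sketch <- function(x, k = 5) {
    d <- as.matrix(dist(x))  # pairwise Euclidean distances
    apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))  # drop the self-distance of 0
}

avg_knn_sketch_scores <- avg_knn_sketch(data_scaled, k = 5)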

average_5nn <- outsiders::aggregate_knn(data_scaled, fun = mean, k = 5)
harmonic_nn <- outsiders::aggregate_knn(data_scaled, fun = QuickR::harmonic_mean, 
                                        k = nrow(data_scaled))
plot_score(average_5nn, global_title = "Average 5-Nearest Neighbors")
plot_score(harmonic_nn, global_title = "Harmonic Mean N-Nearest Neighbors")

Compared to the average Euclidean 5-nearest-neighbor outlier score, the harmonic mean variation detects more noise.

Cluster-Based Scoring

The dist_to_centers() function finds the multivariate distances of data points to a set of centroids (e.g. cluster means). In this application, we generate cluster mean vectors from Ward's hierarchical clustering and calculate the Mahalanobis distance of each data point to each cluster mean vector. The minimum of these distances, i.e. the distance to the nearest cluster mean, serves as the outlier score.

# Ward's hierarchical clustering
pca_distance_mat <- dist(pca_scores)
pca_hcl <- hclust(pca_distance_mat, 'ward.D2')
plot(pca_hcl)

# 3-cluster solution
hcl3 <- cutree(pca_hcl, 3)

# cluster mean vectors
centers <- pca_scores %>%
    as_tibble() %>%
    mutate(clus = hcl3) %>%
    group_by(clus) %>%
    summarise_all(mean) 

# minimum Mahalanobis distance of observations to cluster means
min_d2c <- outsiders::dist_to_centers(pca_scores, labels = hcl3, 
                                      centers = centers[, -1]) %>%
    apply(., 1, min)


plot_score(min_d2c, global_title = "Minimum Mahalanobis Distance to Clusters")
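For intuition, a similar score can be assembled directly with stats::mahalanobis(). The covariance estimate below (the overall covariance of the PCA scores) is an assumption and may differ from what dist_to_centers() uses internally.

# sketch: squared Mahalanobis distance from every observation to every cluster
# mean, using the overall covariance of the PCA scores (an assumption)
d2c_sketch <- sapply(seq_len(nrow(centers)), function(k) {
    mahalanobis(pca_scores,
                center = unlist(centers[k, -1]),
                cov = cov(pca_scores))
})

# minimum distance to any cluster mean as the outlier score
min_d2c_sketch <- apply(d2c_sketch, 1, min)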

PCA Rotation Forest

PCA rotation forest is an unsupervised ensemble technique that transforms subsamples of the original data into new variables via principal components. All principal components are retained in each iteration, and the outlier scores from each iteration are summed to produce a final outlier score. The function pca_bag() is used to generate the scores.
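A rough sketch of the idea is below: in each iteration, fit principal components on a random subsample, project all observations onto the rotated axes, score them, and accumulate the scores. This is illustrative only and is not the pca_bag() implementation; the subsample fraction and number of iterations are arbitrary choices.

# rotation-forest sketch: rotate with a subsample, score everything, accumulate
pca_forest_sketch <- function(x, n_iter = 50, frac = 0.5, outlier_fun) {
    total <- numeric(nrow(x))
    for (i in seq_len(n_iter)) {
        idx <- sample(nrow(x), size = floor(frac * nrow(x)))
        rotation <- princomp(x[idx, ])$loadings  # rotation fit on the subsample
        rotated <- x %*% rotation                # project all observations
        total <- total + outlier_fun(rotated)
    }
    total
}

sketch_scores <- pca_forest_sketch(data_scaled,
                                   outlier_fun = function(z) mahalanobis(z, colMeans(z), cov(z)))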

pca_forest <- outsiders::pca_bag(data = data_scaled,
                                 outlier_fun = function(x) mahalanobis(x, colMeans(x), cov(x)))

plot_score(pca_forest, 
           global_title = "PCA Rotation Forest Combining Mahalanobis Distance Scores")

