We propose a geometrically motivated framework for outlier detection, which exploits the metric structure of a (possibly high-dimensional) data set and provides a mathematically precise distinction between distributional and structural outliers. Experiments show that the outlier structure of high-dimensional and non-tabular data can be detected, visualized, and quantified using established manifold learning methods and standard outlier scoring. From a theoretical perspective, the decisive advantage of our framework is that the resulting embeddings make subtle but important properties of outlier structure explicit and -- even more importantly -- that these properties become accessible through visualizations of the embeddings. From a more practical perspective, our proposal requires neither prior knowledge nor specific assumptions about the actual data structure, an important aspect since data generating processes are usually inaccessible. This is highly relevant in practice, in particular since a well-established, computationally cheap combination of widely used and fairly simple methods such as (t)MDS and LOF proved to be a strong baseline that yields fairly reliable results without the need for tuning hyperparameters. In addition, the proposed framework has several more general conceptual implications for outlier detection, which are summarized in the following.
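To make this baseline concrete, the following is a minimal sketch of such a pipeline using scikit-learn's `MDS` and `LocalOutlierFactor`; the synthetic data, the two embedding dimensions, and the number of neighbors are illustrative assumptions, not prescriptions from our experiments.

```python
# Minimal sketch of the (t)MDS + LOF baseline: embed, then score.
# X is a placeholder data set; any (n_samples, n_features) array works,
# as does a precomputed dissimilarity matrix (dissimilarity="precomputed").
import numpy as np
from sklearn.manifold import MDS
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # placeholder for high-dimensional data

# 1. Metric MDS: low-dimensional coordinates approximately preserving
#    pairwise (here: Euclidean) distances.
embedding = MDS(n_components=2, dissimilarity="euclidean",
                random_state=0).fit_transform(X)

# 2. LOF on the embedding coordinates; larger scores = more outlying.
lof = LocalOutlierFactor(n_neighbors=20).fit(embedding)
scores = -lof.negative_outlier_factor_
```

Because the scoring operates only on the embedding coordinates, LOF could be swapped for any other off-the-shelf outlier scoring method.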
**Outlier taxonomy.** We propose a clear taxonomy that distinguishes the frequently interchangeably used terms \textit{anomaly} and \textit{outlier} in a canonical way: we regard anomalies as observations from a different data generating process than the majority of the data (i.e. as observations that lie on $\Man$ but not on $\Min$), which can be identified more precisely as structural outliers. Recall that Zimek and Filzmoser [-@zimek2018there, p. 10] refer to such observations as "real" outliers that need to be distinguished from "observations which are in the extremes of the model distribution". Outliers regarded as observations from low density regions of the underlying "normal" data manifold $\Min$, on the other hand, can be identified more precisely as distributional outliers.
Based on our reading of the literature, this distinction is usually not made explicit. Since there is rarely a practical reason to assume that a given data set contains only \textit{distributional} or only \textit{structural} outliers, some of the confusion surrounding the topic [@goldstein2016comparative; @zimek2018there; @unwin2019multivariate] might be due to the fact that such conceptual differences have not been made sufficiently clear.
As outlined, the concept of structural difference is very general. For example, structural differences in functional data may appear as shape anomalies in data mainly characterized by vertical shift variation (see Fig. \ref{fig:outtypes} A), as vertical shift anomalies in data dominated by shape variation, as phase anomalies in data with magnitude variation, as magnitude anomalies in data with phase variation, and so on.
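As a purely hypothetical illustration of the first case, the snippet below generates functional data whose "normal" variation is a vertical shift and adds a single curve with an anomalous shape; all specific curve forms are made up for demonstration.

```python
# Toy functional data: inliers differ only by a random vertical shift,
# the last curve is a structural (shape) outlier with an extra bump.
import numpy as np

t = np.linspace(0, 1, 101)
rng = np.random.default_rng(1)

inliers = np.stack([np.sin(2 * np.pi * t) + rng.normal(scale=0.5)
                    for _ in range(50)])
shape_outlier = np.sin(2 * np.pi * t) + 1.5 * np.exp(-(t - 0.5) ** 2 / 0.005)

curves = np.vstack([inliers, shape_outlier])  # shape (51, 101)
```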
In real unlabeled data, there may not always be a clear distinction between somewhat structurally anomalous observations with "off-manifold" embeddings and merely distributionally outlying observations with embeddings on the periphery of the data manifold, as in the ECG data in Figure \ref{fig:fda-image-real} A. Nevertheless, the theoretical distinction between these two kinds of outliers adds conceptual clarity even if the practical application of the categories may not be straightforward.
**Curse of dimensionality.** As outlined in section \ref{sec:prelims:scope}, outlier detection is often reported to suffer from the curse of dimensionality. For example, @goldstein2016comparative show that most outlier detection methods under consideration break down or perform poorly in a data set with 400 dimensions and conclude that unsupervised outlier detection is not possible in such high dimensions. Some [e.g., @aggarwal2017outlier] attribute this to the fundamental problem that distance functions can lose their discriminating power in high dimensions [@beyer1999nearest], which is linked to the concentration of measure effect [@pestov2000geometry]. However, this effect occurs only under fairly specific conditions [@zimek2012survey], which means that outlier detection does not have to be affected by the curse of dimensionality: in addition to the effects of dependency structures and signal-to-noise ratios [@zimek2012survey], the necessary conditions for concentration of measure are not fulfilled if the intrinsic dimensionality of the data is smaller than the observed dimensionality, or if the data is distributed in clusters that are relatively well separable [@beyer1999nearest]. Exactly these two characteristics are reflected in our framework in the form of (1) the manifold assumption, which implies comparatively low intrinsic dimensionality, and (2) the assumption that structural outliers stem from different manifolds than the rest of the data, i.e., from different "clusters" in $\hdspace$. This has two important consequences: first, the geometric perspective our framework is based on makes these aspects, which are crucial for outlier detection in high-dimensional data, explicit, while a purely probabilistic perspective obscures them. Second, it mitigates many of the problems associated with high-dimensional outlier detection: any outlier detection method that performs well in low dimensions becomes -- in principle -- applicable to nominally high-dimensional and/or complex non-tabular data when applied to suitable low-dimensional embedding coordinates.
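The following small numerical sketch illustrates this argument under simple synthetic assumptions: the relative contrast between pairwise distances nearly vanishes for i.i.d. data whose intrinsic dimensionality equals its observed dimensionality, but not for data that only nominally lives in a high-dimensional space.

```python
# Distance concentration hits data with full intrinsic dimensionality,
# but not data lying on a low-dimensional subspace of the observed space.
# The data-generating choices are illustrative assumptions only.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def relative_contrast(X):
    d = pdist(X)                      # pairwise Euclidean distances
    return (d.max() - d.min()) / d.min()

# (a) 500 points, intrinsically 1000-dimensional: contrast is small.
iid = rng.normal(size=(500, 1000))

# (b) 500 points on a 2-dimensional subspace, observed in 1000 dimensions:
#     distances keep their discriminating power.
basis = rng.normal(size=(2, 1000))
low_intrinsic = rng.normal(size=(500, 2)) @ basis

print(relative_contrast(iid), relative_contrast(low_intrinsic))
```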
In addition, our results show that outlier sub-structure, specifically the differences between distributional and structural outliers, can be detected and visualized with manifold methods. This opens new possibilities for descriptive and exploratory analyses:
**Visualizability of outlier characteristics.** If the embeddings provided by manifold methods are restricted to two or three dimensions, they also provide easily accessible visualizations of the data. In fact, manifold learning is often used in applications specifically to find two- or three-dimensional visualizations that reflect the essential intrinsic structure of the high-dimensional data as faithfully as possible. Consequently, structural and distributional outliers, which are rather glaring data characteristics if the manifolds are well separable, can often be separated clearly even in two- or three-dimensional representations, as long as the embedding is (approximately) isometric with respect to a suitable dissimilarity measure. This is especially important for complex non-tabular or high-dimensional data types such as images or graphs, where at most a few observations can be visualized and perceived simultaneously. In the same vein, substructures and notions of data depth are reflected in the embeddings, which makes the approach also useful as an exploration tool in settings with unclear structure.
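For instance, a two-dimensional embedding can be inspected directly by coloring the embedded observations by their outlier scores; the snippet below is a generic sketch with placeholder inputs, not the exact plots shown in our figures.

```python
# Generic visualization of a 2-D embedding colored by outlier scores.
# `embedding` and `scores` would typically come from an MDS + LOF pipeline
# as sketched earlier; placeholders are used here so the snippet runs alone.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
embedding = rng.normal(size=(200, 2))   # placeholder 2-D embedding
scores = rng.uniform(1, 3, size=200)    # placeholder outlier scores

fig, ax = plt.subplots(figsize=(5, 5))
sc = ax.scatter(embedding[:, 0], embedding[:, 1], c=scores, cmap="viridis")
fig.colorbar(sc, ax=ax, label="outlier score (larger = more outlying)")
ax.set_xlabel("embedding dimension 1")
ax.set_ylabel("embedding dimension 2")
plt.show()
```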
**Generalizability.** Since the central building block of the proposed framework is to capture the metric structure of data sets using distance measures, the framework is very general and applicable to any data type for which distance metrics are available. In Section \ref{sec:exps:qual}, we illustrated this generalizability using high-dimensional as well as non-tabular data; in particular, we applied it to functional, curve, graph, and image data. This also makes the framework very flexible, as one can make use of non-standard and customized dissimilarity measures to emphasize the relevant structural differences in specific situations based on domain knowledge: for example, we represented image data as vectors of pixel intensities and computed distances between those vectors. Dissimilarities between different graphs were captured by constructing their graph Laplacians and computing Frobenius distances between them, and we used a specific elastic depth distance for the spiral curve data, as suggested by earlier results in @steyer2021elastic.
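For the graph case, a sketch of this recipe could look as follows; the random graphs, their common node set, and the MDS settings are illustrative assumptions rather than the exact setup used in our experiments.

```python
# Sketch: represent each graph by its Laplacian, compute pairwise Frobenius
# distances between Laplacians, and embed the precomputed dissimilarities
# with MDS. Assumes all graphs share the same node set so the Laplacians
# are comparable; the graphs below are random toys.
import numpy as np
import networkx as nx
from sklearn.manifold import MDS

graphs = [nx.gnp_random_graph(30, p, seed=i)
          for i, p in enumerate([0.1] * 20 + [0.4] * 2)]
laplacians = [nx.laplacian_matrix(g).toarray() for g in graphs]

# Pairwise Frobenius distances between graph Laplacians.
n = len(laplacians)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = np.linalg.norm(laplacians[i] - laplacians[j],
                                           ord="fro")

# MDS on the precomputed dissimilarity matrix.
emb = MDS(n_components=2, dissimilarity="precomputed",
          random_state=0).fit_transform(D)
```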
Inliers, i.e. observations on $\Min$, may show large "within-class" variation and/or may be spread over several disconnected clusters in some situations. For example, object images on $\Min$, which are structurally similar in terms of the depicted objects' shape, may vary in rotation, scale, or location, and may have different color or texture. In functional data, observations on $\Min$ may show phase and amplitude variation and form clusters due to different shapes. In such settings, $\Min$ can exhibit complex substructure and highly dispersed observations, and it may be hard to distinguish whether separable structures observed in embeddings are due to groups of homogeneous structural outliers or due to multimodality in $\Min$ in which some modes are sparsely sampled. Moreover, in such cases the dispersion of $\Min$ accounts for large parts of the data's variability, and two- or three-dimensional MDS embeddings may not suffice to also faithfully represent structural outliers, since MDS embedding vectors are sorted decreasingly by explained "variance". However, this does not mean that structural outliers are not separable at all; rather, they appear as outliers in higher embedding dimensions, so that higher-order embeddings are required to reflect the outlier structure. To some extent, scatterplot matrices visualizing the dimensions of such higher-order embeddings in a pairwise manner can be used for visualization in such situations [@herrmann2021geometric]. In other settings, however, techniques from multi-view learning such as "distance-learning from multiple views" are likely to yield better results, because different structures (e.g. structure induced by color vs. structure induced by texture) should be "treated separately as they are semantically different" [@zimek2015blind, p. 128].

Note, however, that suitable inductive biases can also be brought to bear in our framework fairly easily. If substantive considerations suggest that specific structural aspects are important, specifying dissimilarity metrics focused on these aspects makes it possible to emphasize the relevant differences. For example, if isolated outliers in functional data (i.e. functions which show outlying behavior only over small parts of the domain, such as isolated peaks) are of most interest, higher-order $L_p$ metrics such as $L_{10}$ will be much more sensitive to such structural differences than the standard $L_2$ distance (see the sketch below). If phase variation should be ignored, the unnormalized $L_1$-Wasserstein or the Dynamic Time Warping (DTW) distance can be used. Such problem-specific distance measures can reduce the number of MDS embedding dimensions necessary for faithful embeddings of structural outliers [@herrmann2021geometric]. In future work, we will investigate these aspects and possible extensions with respect to multi-view learning approaches. Moreover, we will elaborate more on the specifics of other data types, in particular image data.
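To illustrate the sensitivity argument, the following toy comparison (assumed curves, not data from our experiments) contrasts the $L_2$ and $L_{10}$ distances between a reference curve and, respectively, a globally shifted curve and a curve with one isolated peak.

```python
# Toy comparison: relative to a global vertical shift, an isolated peak is
# only mildly conspicuous under L_2 but stands out strongly under L_10.
import numpy as np

t = np.linspace(0, 1, 201)
dt = t[1] - t[0]
f = np.sin(2 * np.pi * t)                                  # reference curve
g_shift = f + 0.3                                          # global shift
g_peak = f + np.where(np.abs(t - 0.5) < 0.02, 3.0, 0.0)    # isolated peak

def lp_dist(a, b, p):
    # Discretized L_p distance between curves evaluated on a common grid.
    return (np.sum(np.abs(a - b) ** p) * dt) ** (1.0 / p)

for p in (2, 10):
    print(f"L_{p}: shift = {lp_dist(f, g_shift, p):.2f}, "
          f"peak = {lp_dist(f, g_peak, p):.2f}")
```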