Detecting atypical observations that deviate substantially from the bulk of the data is an important task in data analysis, with applications in domains such as intrusion detection [@zhang2006anomaly], medical imaging [@fritsch2012detecting], and network analysis [@azcorra2018unsupervised]. The most common terms for this task are outlier detection and anomaly detection, but many other terms are in use [@zimek2018there]. Although there is a vast amount of literature on the topic, there is neither a commonly accepted, precise definition of what exactly constitutes outliers or anomalies, nor agreement on whether these two terms are synonymous. As Unwin [-@unwin2019multivariate, p. 635] puts it:
"Outliers are a complicated business. It is difficult to define what they are, it is difficult to identify them, and it is difficult to assess how they affect analyses."
\noindent Overviews on the topic are given by @zimek2012survey or @goldstein2016comparative from a computer science perspective, and by @rousseeuw2005robust or @unwin2019multivariate from a statistical perspective. @kandanaarachchi2020dimension provide a short summary including both perspectives, while @campos2016evaluation as well as @marques2020internal focus on the evaluation of unsupervised outlier detection. @zimek2018there provide a comprehensive survey bringing together both perspectives with in-depth epistemological discussion.
Here, we focus on unsupervised outlier detection. One way to tackle the problem is to define outliers based on a single probability distribution $P$ assumed to generate the data. An outlier is an observation which deviates from the bulk of the data with respect to $P$. If $P$ allows for a density, outliers are simply observations in low density regions. From this perspective, we have \textit{distributional outliers} whose outlyingness is defined relative to a single probability distribution. On the other hand, outliers are often assumed to be observations generated by a structurally different data generating process than the one generating the "normal" data. From this perspective, we have \textit{structural outliers} whose outlyingness is caused by the differences between the underlying data generating processes.
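As a minimal illustration of the distributional view, the following sketch flags observations whose estimated density falls below a fixed level. The Gaussian kernel density estimate, the function names, the bandwidth, and the level are all assumptions made for this toy example and are not taken from the cited literature.

```python
import math

def kde_density(x, sample, bandwidth):
    """Gaussian kernel density estimate at x, based on a 1-D sample."""
    norm = len(sample) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm

def low_density_outliers(sample, bandwidth, level):
    """Return the observations whose estimated density is below `level`."""
    return [x for x in sample if kde_density(x, sample, bandwidth) < level]

# Toy data (hypothetical): a cluster around zero and one remote observation.
sample = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.3, 0.5, 6.0]
flagged = low_density_outliers(sample, bandwidth=0.5, level=0.15)
print(flagged)
```

With these hand-picked settings only the remote observation is flagged. The result depends on both the bandwidth and the level, which already hints at the practical difficulty of choosing them.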
The two notions are complementary, and both are necessary in order to fully address the challenges of outlier detection. The notion of distributional outliers is easy to define precisely in probabilistic terms, for example, based on minimum level sets [@scott2006learning] or M-estimation [@clemenccon2013scoring], and has yielded a multitude of results and algorithms. Structural outliers, in contrast, are much more difficult to formalize, but also more general, since assuming that all observations are realizations from a single underlying distribution that can be represented by its density is often problematic. In practical terms, detecting distributional outliers requires access to (an estimate of) the underlying density and a suitable (local) density level below which observations are classified as outliers.
Both are infeasible for general, non-tabular data types like shapes, functions or images whose domains frequently do not admit probability densities. In such settings,
outlier detection calls for a geometric perspective, which requires only some metric structure on the data space (i.e., suitable dissimilarity or distance measures) rather than probability densities defined over it.
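To make concrete what "only some metric structure" can mean, the sketch below scores objects using nothing but pairwise dissimilarities. It uses the standard $k$-nearest-neighbor distance score as a stand-in and is not the framework proposed in this paper. The Jaccard distance, the set-valued toy data, and all function names are assumptions chosen for illustration.

```python
def knn_scores(objects, dist, k):
    """Score each object by the distance to its k-th nearest neighbor."""
    scores = []
    for i, a in enumerate(objects):
        # Distances from object i to all other objects, in ascending order.
        d = sorted(dist(a, b) for j, b in enumerate(objects) if j != i)
        scores.append(d[k - 1])
    return scores

def jaccard_distance(a, b):
    """Jaccard distance between two sets; 0.0 if both are empty."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

# Toy data (hypothetical): set-valued observations without a natural density.
objects = [{1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4}, {1, 2, 4}, {7, 8, 9}]
scores = knn_scores(objects, jaccard_distance, k=2)
most_outlying = max(range(len(objects)), key=scores.__getitem__)
print(most_outlying, scores)
```

The scoring function never touches coordinates or densities. Any dissimilarity defined on the data type at hand could be passed as `dist`.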
\indent The rest of the paper is structured as follows. Section \ref{sec:prelims} describes the scope and contribution of the study and outlines its background and related work.
The proposed theoretical framework is defined in section \ref{sec:framework} and its practical relevance is demonstrated in section \ref{sec:exps} using qualitative and quantitative experiments for a variety of data sets of different data types. Section \ref{sec:discussion} discusses our findings and the resulting conceptual implications, before we conclude in section \ref{sec:conclusion}.