Outlier detection for functional data is a challenging problem due to the complex and information-rich units of observation, which can be "outlying" or unusual in many different ways. Functional outliers are often categorized into magnitude and shape outliers [@dai2020functional; @arribas2015discussion, e.g.], whereas Hubert et al. [@hubert2015multivariate] differentiate between isolated and persistent outliers, the latter further subdivided into shift, amplitude and shape outliers. However, neither of these taxonomies yields precise, explicit, fully general definitions, which makes it difficult to theoretically describe, analyze and compare functional outliers. Magnitude outliers, for example, have been defined as functional observations "outlying in some part or across the whole design domain" [@dai2020functional, p. 1] or as "curves lying outside the range of the vast majority of the data" [@arribas2015discussion, p. 2], whereas Hubert et al. [@hubert2015multivariate, p. 3] define isolated outliers as observations which "exhibit outlying behavior during a very short time interval", in contrast to persistent outliers, which "are outlying on a large part of the domain".
To cut through the confusion, we propose a geometric perspective on functional outlier detection based on the well-known "manifold hypothesis" [@ma2011manifold; @lee2007nonlinear]. This refers to the assumption that ostensibly complex, high-dimensional data lie on a much simpler, lower-dimensional manifold embedded in the observation space, and that this manifold's structure can be learned and then represented in a low-dimensional space, often simply called the embedding space. We argue that such a perspective both clarifies and generalizes the concept of functional outliers, without requiring strong assumptions or prior knowledge about the underlying data generating process or its outliers. In terms of theoretical development, the approach allows us to consistently formalize and systematically analyze functional outlier detection in full generality. We also demonstrate that procedures based on this perspective simplify and improve functional outlier detection in practice: the perspective suggests a principled, yet flexible approach for applying well-established, highly performant standard outlier detection methods such as local outlier factors (LOF) [@breunig2000lof] to functional data, based on embedding coordinates obtained via manifold learning or dimension reduction methods. Our experiments show that doing so performs at least on par with existing functional-data-specific outlier detection methods, without the methodological complexity and limited applicability that such specialized methods often entail. Moreover, the lower-dimensional representations serve as an easily accessible visualization and exploration tool that helps to uncover complex and subtle data structures which can neither be sufficiently reflected by one-dimensional outlier scores or labels nor captured by many of the previously proposed 2D diagnostic visualizations for functional outliers.
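To make the practical side of this proposal concrete, the following is a minimal sketch (using an assumed toy data set and scikit-learn's Isomap and LOF implementations as stand-ins for whatever embedding method and detector one prefers, not our exact experimental setup): curves evaluated on a common grid are embedded into a low-dimensional space, and LOF scores are then computed on the embedding coordinates.

```python
# Sketch of the embedding-based workflow: embed curves, then score with LOF.
import numpy as np
from sklearn.manifold import Isomap
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 200)                     # dense, common evaluation grid

# Toy functional data: 95 "common" curves plus 5 vertically shifted outliers.
curves = np.sin(2 * np.pi * np.outer(rng.uniform(0.9, 1.1, 95), t))
outliers = 3 + np.sin(2 * np.pi * np.outer(rng.uniform(0.9, 1.1, 5), t))
X = np.vstack([curves, outliers])              # shape: (100, 200)

# Step 1: learn a low-dimensional representation of the observed curves.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# Step 2: apply a standard outlier detector to the embedding coordinates.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(embedding)            # -1 marks flagged observations
scores = -lof.negative_outlier_factor_         # larger = more outlying

print(np.where(labels == -1)[0])               # typically the 5 shifted curves
```

Any embedding method (e.g., MDS, diffusion maps, UMAP) and any tabular outlier detector could be substituted in this sketch.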
Functional data analysis (FDA) [@ramsay2005functional, e.g.]
focuses on data where the units of observation are realizations of stochastic processes over compact domains.
In many cases, the intrinsic dimensionality of functional data (FD) is much lower than the observed dimensionality. First, while FD are infinite-dimensional in theory, they are high-dimensional in practice -- functional observations are usually recorded on fine and dense grids of argument values. Second, the dominant drivers of differences between functional observations are often comparatively low-dimensional,
so that just a few modes of variation capture most of the structured variability in the data.
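A small numerical illustration of this point (an assumed toy example, not data used in the paper): curves observed on a 500-point grid but generated from just two random coefficients are captured almost entirely by their first two principal components.

```python
# Dense observation grid, but only two effective degrees of freedom per curve.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)                    # 500 observed "dimensions"
a = rng.normal(1.0, 0.2, 200)                 # amplitude coefficient
b = rng.normal(0.0, 0.3, 200)                 # level coefficient
X = a[:, None] * np.sin(2 * np.pi * t)[None, :] + b[:, None]
X += rng.normal(scale=0.01, size=X.shape)     # small measurement noise

pca = PCA(n_components=5).fit(X)
print(pca.explained_variance_ratio_.round(3)) # first two components dominate
```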
However, FD usually contain both amplitude and phase variation, i.e., "vertical" shape or level variation as well as "horizontal" shape variation. These different kinds of variability contribute to the difficulty of precisely defining and differentiating the various forms of functional outliers and of developing methods that can "catch them all", making outlier detection an intensively investigated research topic in FDA. For example, Arribas-Gil and Romo [@arribas2015discussion] argue that the outlier taxonomy proposed by Hubert et al. [@hubert2015multivariate] can be made more precise in terms of expectation functions $f(t)$ and $g(t)$, with $f(t)$ the expectation of a "common" process, see Figure \ref{fig:tax}.
```{=tex}
\begin{figure}
\centering
\includegraphics{images/taxonomies.png}
\caption{Functional outlier taxonomies. Bottom: standard taxonomy. Top: taxonomy as introduced by Hubert et al.~Image taken from Arribas-Gil and Romo \citep{arribas2015discussion}. \label{fig:tax}}
\end{figure}
```
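To make the categories in Figure \ref{fig:tax} concrete, they can be written down directly in terms of a "common" mean function $f(t)$ and an alternative shape $g(t)$. The following toy construction (our own illustration, not taken from the cited works) generates one representative of each type:

```python
# Outlier types relative to a "common" mean f(t) and an alternative shape g(t).
import numpy as np

t = np.linspace(0, 1, 200)
f = lambda s: np.sin(2 * np.pi * s)               # mean of the "common" process
g = lambda s: np.sin(4 * np.pi * s)               # a different shape

common    = f(t)                                  # non-outlying prototype
shift     = f(t) + 2.0                            # persistent shift outlier
amplitude = 3.0 * f(t)                            # persistent amplitude outlier
shape     = g(t)                                  # persistent shape outlier
isolated  = f(t) + 2.0 * (np.abs(t - 0.5) < 0.02) # outlying on a short interval
```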
Despite these attempts, some fundamental issues remain unresolved. The proposed taxonomies do not provide precise definitions, and some of the definitions are to some extent contradictory. Moreover, many outlier scenarios arising from realistic data generating processes are not covered by the described taxonomies at all. As Arribas-Gil and Romo [@arribas2015discussion] themselves point out, settings with phase-varying data (i.e., "horizontal" variability through elastic deformations of the functions' domains) are not sufficiently reflected: functions deviating in terms of phase may be considered shape outliers if only a few such functions occur, but not in settings where all functions display this kind of variation.
In addition, the taxonomy in Figure \ref{fig:tax} provides a reasonable conceptual framework only if the non-outlying data from the "common" data generating process are adequately characterized by their global mean function alone. This cannot be assumed for many real data sets, which often contain highly variable sets of functions that display several modes of phase, shape and/or amplitude variation simultaneously and/or stem from multiple classes with class-specific means and higher moments (see Figure \ref{fig:ecg}, e.g.).
Published research has focused mostly on developing outlier detection methods specifically for functional data, and a multitude of methods has been put forth based on a variety of concepts such as functional data depths [@hernandez2106kernel; @harris2020elastic, e.g.], functional PCA [@sawant2012fpca], functional isolation forests [@staerman2019functional], robust functional archetypoids [@vinue2020robust], or functional outlier metrics like directional outlyingness [@rousseeuw2018measure; @dai2019directional]; many of these are narrowly focused on detecting specific kinds of functional outliers. Dai et al. [@dai2020functional] propose a transformation-based approach to functional outlier detection and claim that sequentially transforming shape outliers, which "are much more challenging to handle", into magnitude outliers makes them easier to detect with established methods [@dai2020functional, p. 2]. The approach allows functional outliers to be defined more precisely in terms of the transformations being used, such as normalizing or centering the functions or taking their derivatives, but practitioners still need to come up with appropriate transformations for the data at hand first.
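The gist of such a transformation-based strategy can be sketched as follows (a deliberately crude toy version using differencing and an IQR rule, not the procedure of Dai et al.): a shape outlier hidden among curves with strong level variation is missed by a simple magnitude rule on the raw curves, but stands out once differencing removes the level variation.

```python
# Transformation idea: turn a shape outlier into a magnitude outlier.
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 300)
levels = rng.normal(0.0, 1.0, 50)             # strong vertical (level) variation
amps = rng.uniform(0.8, 1.2, 50)
X = levels[:, None] + amps[:, None] * np.sin(2 * np.pi * t)[None, :]
X[0] = np.sin(10 * np.pi * t)                 # shape outlier, unremarkable level

def magnitude_flags(curves):
    """Flag curves far from the pointwise median via a simple IQR rule."""
    dist = np.sqrt(np.mean((curves - np.median(curves, axis=0)) ** 2, axis=1))
    q1, q3 = np.percentile(dist, [25, 75])
    return np.nonzero(dist > q3 + 1.5 * (q3 - q1))[0]

print(magnitude_flags(X))                     # typically misses curve 0
print(magnitude_flags(np.diff(X, axis=1)))    # typically flags curve 0
```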
Recently, Xie et al. [@xie2017geometric] introduced a decomposition of functional observations into amplitude, phase and shift components, based on which specific types of outliers can be identified in a more general geometric framework, without necessarily requiring the functional data to be of comparatively low rank. Similar in spirit to our proposal, Hyndman and Shang [@hyndman2010rainbow] used kernel density estimates and half-space depth contours of two-dimensional robustified FPCA scores to construct functional boxplot equivalents and detect outliers, and Ali et al. [@ali2019timecluster] use two-dimensional data representations obtained from manifold methods for outlier detection and clustering. However, both focus on practical aspects and consider neither the theoretical implications and general applicability of embedding-based approaches nor the necessity of higher-dimensional representations.
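A rough sketch of such a bivariate-score approach might look as follows; here plain PCA and a Gaussian kernel density estimate stand in for the robustified FPCA and half-space depth contours used by Hyndman and Shang, so this is an assumed simplification rather than their method.

```python
# Flag curves whose 2D score representation falls into low-density regions.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
t = np.linspace(0, 1, 150)
coef = rng.normal([1.0, 0.0], [0.1, 0.5], size=(60, 2))     # amplitude, level
X = coef[:, [0]] * np.sin(2 * np.pi * t)[None, :] + coef[:, [1]]
X[-1] = np.cos(2 * np.pi * t)                 # an atypical curve

scores = PCA(n_components=2).fit_transform(X) # 2D score representation
density = gaussian_kde(scores.T)(scores.T)    # density estimate at each score
print(np.argsort(density)[:3])                # e.g., the 3 lowest-density curves
```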
The remainder of the paper is structured as follows: We provide the theoretical formalization and discussion of the geometric approach in section \ref{sec:theory}. Based on these theoretical considerations, section \ref{sec:exps} presents extensive experiments. Section \ref{sec:exps:qual-analysis} covers a detailed qualitative analysis of real world ECG data, while section \ref{sec:exps:performance} provides quantitative experiments and systematic comparisons to previously proposed methods on complex synthetic outlier scenarios. We conclude with a discussion in section \ref{sec:dis}.