In HerrMo/fda-geo-out: What the Package Does (One Line, Title Case)

# Discussion {#sec:dis}

Based on a geometrical perspective on functional outlier detection, we define two general types of functional outliers: off-manifold and on-manifold outliers. Our investigation shows that this perspective clarifies the theoretical concepts and improves practical results. From a theoretical perspective, it allows us to formalize functional outlier scenarios in precise and consistent terms, beyond differences in shape, level, or magnitude. This simplifies reasoning about specific outlier settings and provides a fully general theoretical conceptualization of the problem.
From an applied perspective, there are two important consequences. First, as demonstrated by a comprehensive analysis of a complex, real data set of ECG curves, the geometrical approach allows for easily accessible and highly informative visualizations. These are obtained by means of low-dimensional embeddings that reflect the inherent structure of a functional data set in great detail. Such visualizations provide more accurate and complete pictures of the (outlier) structure of functional data. In particular, off-manifold outliers reliably appear as clearly separated (groups of) points in the low-dimensional embeddings.
Second, the proposed approach makes it possible to apply highly developed and performant standard outlier detection methods to functional data, since the geometric structure of the data is captured and reflected in their pairwise distance matrices. Outlier detection and scoring methods that can be applied to distance matrices directly can therefore be used for functional data as well. Furthermore, detection methods requiring tabular inputs can also be applied by using the coordinates obtained with embedding methods as proxy data for the original functions. Our experiments using LOF scores show that the two approaches yield very similar results. This simultaneously simplifies and improves functional outlier detection. It simplifies it because functional data analysis becomes accessible to a broader audience through general outlier detection methods that are widely used in other areas and do not require an understanding of the complex methodological details of functional data methods. It improves the state of the art because many functional outlier methods can only detect specific kinds of functional outliers by design, or fail on more complex, realistic data that are widely dispersed or contain multiple non-outlying subgroups, like the ECG data. Moreover, our proposal is not limited to univariate functional data. Extending it to multivariate functions is straightforward, as long as a suitable dissimilarity measure is available to compute pairwise distances.
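To illustrate how an outlier score can be computed directly from pairwise functional distances, the following minimal sketch approximates $L_2$ distances between curves on a common grid and computes LOF scores from the resulting distance matrix. It is not the implementation used in our experiments: the toy curves, the function names (`l2_distance`, `distance_matrix`, `lof_scores`) and the simplified neighbourhood definition (exactly $k$ neighbours, no duplicate curves) are assumptions made for illustration.

```python
import math

def l2_distance(f, g, dx):
    """Riemann-sum approximation of the L2 distance between two curves
    sampled on a common equidistant grid with spacing dx."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) * dx)

def distance_matrix(curves, dx):
    """Symmetric matrix of pairwise L2 distances."""
    n = len(curves)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = l2_distance(curves[i], curves[j], dx)
    return D

def lof_scores(D, k):
    """Simplified LOF from a precomputed distance matrix.
    Assumption: exactly k neighbours per point (ties cut arbitrarily)
    and no duplicate observations (otherwise a density could be infinite)."""
    n = len(D)
    nbrs = [sorted((j for j in range(n) if j != i), key=lambda j: D[i][j])[:k]
            for i in range(n)]
    k_dist = [D[i][nbrs[i][-1]] for i in range(n)]
    # Local reachability density: inverse mean reachability distance to the neighbours.
    lrd = [k / sum(max(k_dist[j], D[i][j]) for j in nbrs[i]) for i in range(n)]
    # LOF: mean density of the neighbours relative to the point's own density.
    return [sum(lrd[j] for j in nbrs[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical toy data: 20 vertically shifted sine curves plus one curve
# with a different shape (index 20).
m = 101
grid = [j / (m - 1) for j in range(m)]
curves = [[math.sin(2 * math.pi * t) + 0.05 * i for t in grid] for i in range(20)]
curves.append([math.cos(6 * math.pi * t) for t in grid])

D = distance_matrix(curves, dx=1 / (m - 1))
scores = lof_scores(D, k=5)
# Index of the curve with the highest LOF score.
print(max(range(len(scores)), key=scores.__getitem__))
```

Because the scoring function only sees the distance matrix, the same code could be reused with any other dissimilarity measure by swapping out `l2_distance`.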
In this paper, most embeddings were obtained using MDS based on $L_2$ distances. This implies a close similarity to functional bagplots and highest density region (HDR) boxplots [@hyndman2010rainbow], which are based on the first two robust principal component scores. However, this similarity only applies if our geometrical approach is implemented with 2D MDS embeddings based on $L_2$ distances. As outlined, our proposal is limited neither to the $L_2$ metric as a distance measure, nor to MDS as an embedding method, nor to two embedding dimensions. Other metrics and (higher-dimensional) embedding methods can be used, and the conducted experiments indicate that alternative distance measures can further improve performance in specific settings, sometimes considerably. In particular, even non-metric dissimilarity measures may be applicable, as our results based on DTW distances indicate. On the other hand, the results also show that more sophisticated embedding methods such as ISOMAP and UMAP cannot be used as straightforwardly as MDS. Such methods, which do not take the ambient space geometry into account by default, at least require very careful parameter selection. In terms of practical applicability, the $O(n^3)$ time complexity and $O(n^2)$ storage complexity of standard MDS may prove problematic for large data, but generalizations such as Landmark MDS [@de2002global], Pivot MDS [@brandes2006eigensolver] or multilevel MDS exploiting GPU performance [@ingram2008glimmer] scale much better with the number of observations.
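For readers unfamiliar with MDS, the following minimal sketch shows classical (Torgerson) MDS computed from a distance matrix: the squared distances are double centered, and the leading eigenvectors of the resulting matrix, scaled by the square roots of their eigenvalues, give the embedding coordinates. To stay dependency-free, the sketch extracts eigenpairs by power iteration with deflation instead of a full eigendecomposition. This is an illustrative substitution, not the implementation used in our experiments, and the function name `classical_mds` and the toy points are assumptions.

```python
import math

def classical_mds(D, dims=2, iters=200):
    """Classical MDS of a symmetric n x n distance matrix D into `dims` dimensions."""
    n = len(D)
    # Double centering: B = -1/2 * J D^2 J with J = I - (1/n) 1 1^T.
    sq = [[d * d for d in row] for row in D]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration for the dominant eigenpair of the (deflated) matrix.
        # Assumption: B is positive semi-definite, as for Euclidean distances;
        # non-metric dissimilarities can yield negative eigenvalues.
        v = [math.sin(i + 1.0 + d) for i in range(n)]  # deterministic generic start
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue belonging to v.
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam <= 0.0:
            break  # no further positive eigenvalues; remaining coordinates stay 0
        for i in range(n):
            coords[i][d] = math.sqrt(lam) * v[i]
        # Deflation: remove the component just found.
        for i in range(n):
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return coords

# Hypothetical check: points in the plane should be recovered up to
# rotation, reflection and translation, so pairwise distances are preserved.
pts = [(0.0, 0.0), (4.0, 0.0), (0.0, 1.0), (4.0, 1.0), (2.0, 0.5)]
D = [[math.dist(p, q) for q in pts] for p in pts]
emb = classical_mds(D, dims=2)
max_err = max(abs(math.dist(emb[i], emb[j]) - D[i][j])
              for i in range(len(pts)) for j in range(len(pts)))
print(max_err)
```

For functional data, `D` would simply be replaced by the matrix of pairwise functional distances, and the embedding coordinates could then serve as tabular input for a standard outlier detection method.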
Finally, we would argue that existing functional outlier detection approaches mostly lack the principled geometrical underpinning and conceptualization presented here. As outlined, we argue that such a conceptualization is necessary to make functional outlier detection tractable in full generality. Specifically, consider that existing methods typically limit themselves to creating a 1D or 2D representation of each curve (e.g., MBD-MEI, MO-VO, functional bagplots, HDR plots), often based on preconceived notions of the characteristics of functional outliers. Our investigations and experiments suggest that this is often not sufficient for real-world functional outlier detection. First, with modern outlier detection methods there is no reason to limit representations to two dimensions, and the geometrical perspective often strongly suggests otherwise in the case of complex functional data. Even more importantly, it is much more flexible to learn maximally informative low-dimensional representations directly from data than to start with a rather rigid notion of which characteristics to look at and ignore the rest. The latter is likely to lead to results that do not capture the entire (outlier) structure of a given data set, which is essential in real-world unsupervised settings and exploratory analyses.
Based on the theoretical considerations and empirical results outlined above, we conclude that the proposed approach is well suited for both the theoretical conceptualization and the practical implementation of functional outlier detection. In particular, the choice of embedding method should take into account whether the method is able to preserve the extrinsic geometry of the function space, and simple MDS embeddings based on functional distances provide a very strong baseline in that respect. On the basis of this work, we intend to further investigate the implications of the geometrical perspective in future research, such as the effects of other dissimilarity measures, embedding methods, and outlier detection methods.

HerrMo/fda-geo-out documentation built on March 18, 2022, 8:54 a.m.
