In rrrlw/ggtda: 'ggplot2' Extension to Visualize Topological Persistence

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Opening

The ggtda package provides users with ggplot2 layers to visualize persistent homology, especially as part of a topological data analysis (TDA) pipeline. This vignette shows how to use ggtda to construct publication-quality barcode plots, persistence diagrams, and persistence landscapes.

We set up our environment by attaching ggtda. Other packages will be called to generate a simulated data set and to compute persistent homology.

library(ggtda)

Sample data

We use the tdaunif package to generate a sample from an interesting topological space that requires 4 Euclidean dimensions to embed: the real projective plane, denoted $\mathbb{P}^2(\mathbb{R})$.[^circle] The sample of 240 points is generated uniformly from a commonly-used embedding $\mathbb{P}^2(\mathbb{R}) \to \mathbb{R}^4$ of the 2-dimensional manifold into 4-dimensional Euclidean space, with some 4-dimensional gaussian noise introduced.

[^circle]: For a simpler example using a noisy circle, see the README.

set.seed(7480L)
rpp <- tdaunif::sample_projective_plane(n = 240L, sd = 0.01)

We can build some limited intuition about the space by plotting 2-coordinate projections from the point cloud:

pairs(rpp, pch = 16L, cex = .5)

The real projective plane has known homology groups, given here over coefficients in $\mathbb{F}_2$, the field of two elements:

[^reduced]: Ripser computes reduced homology $\tilde{H}_k$, so that all homological features eventually die as the radius of the cover exceeds the diameter of the point cloud. The (non-reduced) degree-0 homology $H_0$ of any connected topological space has rank 1, which would result in an infinite degree-0 feature in persistent homology.

$$ \tilde{H}_k(\mathbb{P}^2(\mathbb{R});\mathbb{F}_2)=\begin{cases} 0 & k=0 \ \mathbb{F}_2 & k=1 \ 0 & k>1 \end{cases} $$

Each group $\tilde{H}k$ indicates the complexity of the space in a different dimension $k$. For example, the real projective plane has no gaps between connected components and no enclosed cells, so its degree-0 and degree-2 homology groups are null.[^reduced] It does, however, have a "hole", or degree-1 feature, which is detected by a single copy of the field of coefficients.[^prime] The ranks of these three groups---$0$, $1$, and $0$---count the features of each degree we should expect to emerge as the most _persistent.

[^prime]: Ripser [@Bauer2021] computes persistent homology using prime field coefficients for any prime characteristic $p<2^{16}$. Over integer coefficients, $\tilde{H}_1(\mathbb{P}^2(\mathbb{R}))=\mathbb{Z}/2\mathbb{Z}$ is cyclic; in our discussion we dodge this issue of torsion.

Persistent homology

We will calculate the persistent homology of our sample using the ripserr package, specifying the prime field p = 2L. We will then plot several visualizations of the resulting persistent pairs using ggtda. Because the space from which the sample was taken is 2-dimensional, we run the calculation up to dimension dim = 2L:

# calculate persistent homology
rpp_phom <- as.data.frame(ripserr::vietoris_rips(rpp, dim = 2L, p = 2L))
# format persistence data for ggplot2
rpp_phom$dimension <- as.factor(rpp_phom$dimension)

In keeping with tidy principles, ripserr returns a data frame whose observational units (one per row) are homological features. We take a quick peek with head() and tail() to see how long the features of this point cloud tend to persist:

# first few features (lowest-dimensional)
head(rpp_phom)
# last few features (highest-dimensional)
tail(rpp_phom)

Since we are interested in persistent features, let us have a look at the most persistent of these pairs:

rpp_pers <- transform(rpp_phom, persistence = death - birth)
head(rpp_pers[with(rpp_pers, order(-persistence)), ], n = 12L)

This table reveals that a single feature dominates the rest, with persistence nearly 0.6 versus just over 0.4 for the next-most persistent features. Nevertheless, graphical visualization would allow us to more easily and effectively interpret the complete collection of features than printing raw data. We will start off by plotting this as a topological barcode.[^filtrations]

[^filtrations]: Although ripserr::vietoris_rips() calculates persistent homology using the Vietoris–Rips filtration, persistent homology from other filtrations (e.g. of Čech complexes) can also be visualized with ggtda. We will therefore label the axes in a generic manner that can then be specialized by users depending on their specific context.

Topological barcode

Topological barcodes plot features as vertically stacked horizontal bars [@Carlsson2004; @Ghrist2008]. Each bar corresponds to a single feature, with the left boundary of the bar corresponding to the ball radius at which the feature appears in the simplicial complex (column "birth" in rpp_phom) and the right boundary of the bar corresponding to the radius at which the feature disappears (column "death"). Creating a topological barcode with ggtda is straightforward using geom_barcode(), with the persistence aesthetics start and end:

# create topological barcode with appropriate theme
ggplot(rpp_phom, aes(start = birth, end = death,
                     group = dimension, colour = dimension)) +
  geom_barcode() +
  xlab("Simplicial complex diameter") +
  theme_barcode()

This layer is coded as a graphical element ("geom") rather than as a statistical transformation ("stat") because the data undergo no computations---only aesthetic mappings---before the barcode is rendered. The group aesthetic orders features by dimension to ensure that all 0-dimensional features a plotted in one group, all 1-dimensional features above them, and so on. The color aesthetic visually discriminates between features of different dimensions. The barcode theme removes gridlines and the vertical axis, since these are irrelevant to reading the plot.

Given that $\mathbb{P}^2(\mathbb{R})$ has first homology group of rank 1, the dominant 1-dimensional feature (or 1-cycle) appears as expected. Other features of, say, secondary persistence also emerge, and we will discuss them later, after we have showcased some alternative representations of these data.

Persistence diagrams

Persistence diagrams plot features as points on a scatterplot [@Edelsbrunner2002]. Each point corresponds to a single feature, so that features with equal births and deaths yield overlapping points. They are generated using stat_persistence() as below, using the same persistence aesthetics. This is indeed a stat layer, in that the coordinates are calculated differently from the data depending on the layout. By default, persistence diagrams take the conventional "diagonal" layout: A feature born at $b$ and dead at $d$ is located at coordinates $(b,d)$.

# create persistence diagram with appropriate theme
ggplot(rpp_phom, aes(start = birth, end = death,
                     shape = dimension, colour = dimension)) +
  stat_persistence() +
  geom_abline(slope = 1, intercept = 0) +
  coord_equal() +
  xlab("Feature appearance") +
  ylab("Feature disappearance") +
  theme_persist()

The reference line $y=x$ often aids the interpretation of diagonal persistence diagrams, so we add one with geom_abline(). (This may become a default annotation in a future version.) Aesthetically, it is also generally pleasing to have the vertical and horizontal axes on a fixed coordinate system with both plotting the same range, achieved with coord_equal(). The shape and colour aesthetics together discriminate between features of different dimensions. This is particularly useful when one of the aesthetics (e.g. color) is rendered ineffective (e.g. for color-blind viewers).

Flat persistence diagrams can also be generated with stat_persistence(), by setting the diagram parameter to "flat" [@Wadhwa2019]. In this layout, the horizontal coordinate is equal to the radius at which each feature appears (column "birth") and the vertical coordinate is equal to the persistence of the feature (the difference between "death" and "birth"). While these dimensions can be understood to lie on the same scale, the usefulness of the plot lies in recognizing the most persistent features by height and their relative appearences by birth, so it is less essential to fix the aspect ratio.

# create flat persistence diagrams with adjustments
ggplot(rpp_phom, aes(start = birth, end = death,
                     shape = dimension, colour = dimension)) +
  stat_persistence(diagram = "flat") +
  xlab("Feature appearance") +
  ylab("Feature persistence") +
  theme_persist()

Persistence landscapes

Persistence landscapes vectorize persistence diagrams as real-valued functions over the parameter range. The convenient statistical properties of landscapes are outside the scope of this package, but the stat_landscape() layer provides elegant visualizations of landscapes derived from single diagrams. Below, we only plot the first 24 levels of each landscape. Whereas persistente diagrams default to the diagonal layout, landscapes default to the "landscape" layout:

# plot a persistent landscape for each dimension
ggplot(rpp_phom, aes(start = birth, end = death,
                     alpha = after_stat(level))) +
  stat_landscape(n_levels = 24) +
  facet_grid(rows = vars(dimension)) +
  coord_equal() +
  scale_alpha_continuous(trans = "reverse") +
  labs(x = NULL, y = NULL) +
  theme_persist() + theme(legend.position = "bottom")

Because landscapes take up more space than diagrams, those for different dimensions are rarely if ever overlaid. Instead, above they are faceted into three panels for ease of comparison.

Interpretation

As mentioned above, the dominant feature detected in the point cloud rpp is a 1-cycle that persists from radius 0.28 to radius 0.85 nearly twice the persistence of any other feature. Still, the flat persistence diagram reveals that some other degree-1 and -2 features stand out from the rest.

One way to convince ourselves that these are artifacts of noise in the sample is to compare their persistence to that of the degree-0 features. The latter are merely disconnected components, which finally disappear when the radius becomes large enough for the filtration to link every sampled point to every other. That the most persistent 1- and 2-cycles (other than the dominant 1-cycle) persist for no more than this "cut radius" of the point cloud suggests that they can plausibly be attributed to noise.

In particular, the most persistent 2-cycle, which stands out clearly in the landscape from its fellow 2-cycles, may be understood as a geometric rather than a topological feature. Each Vietoris–Rips complex of the filtration is built of simplices on points within the associated distance (the radius) of each other---and these distances are in $\mathbb{R}^4$, not in $\mathbb{P}^2(\mathbb{R})$. When the radius can reach between regions of the manifold embedded nearby in 4-space but not completely across, some purely topological features may collapse into features that are artifacts of the embedding. For a more intuitive example, imagine sampling points from a bottle with a narrow spout: A cover that connects nearby points will effectively plug the spout well before it fills the interior of the bottle, producing a persistent degree-2 feature though the manifold has null degree-2 homology.

In real-world settings, we may rarely have topological insight beyond a point cloud itself. So, while we can be wary of homological features of "middling" persistence, we should also not dismiss them.

ggplot2 functionality

Plots produced using ggtda can be handled like any other "ggplot" objects---for example, they can be saved using ggplot2::ggsave(). Running the following line of code will save the most recent plot (the landscapes) as a PNG image file named my-plot.png in the local working directory.