knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Introduction

This document is an introduction to information-geometric techniques for geospatial analysis. The core situation we consider is one in which our data consist of counts in two or more categories, distributed in space. A standard example is demographic census data, in which the categories might be racial groups, and the spatial units tracts. This document does not focus on mathematical details of the methods, which will be explained in a forthcoming paper. Rather, the focus is on the intuition behind various analytical methods, and how to carry them out in order to learn about spatial separation with freely-available software tools. The primary methods we use are implemented in the packages compx for the R programming language. Get compx by running:

# install.packages('devtools') if necessary
devtools::install_github('PhilChodrow/compx')
library(compx)
library(compx)

To prepare data and analyze outputs of compx functions, we will also need the following packages:

library(tidyverse)
library(maptools)
library(igraph)
library(RColorBrewer) # for plots

Data inputs

The functions provided by package compx assume that your data is expressed in two components. The first is a SpatialPolygonsDataFrame (spdf) containing geographic polygons ("tracts"). For illustrative purposes, compx comes bundled with an spdf with the Census tracts for Wayne County, Michigan, which includes the urban core of Detroit, an oft-analyzed city in quantitative segregation studies. The tracts were originally accessed via the R package tigris.

detroit_tracts %>% plot()

The second data input must be a demographic table with class in c('tbl_df', 'tbl', 'data.frame'), which I will refer colloquially to as a "data frame." compx includes an example in detroit_race, giving racial demographic counts for each tract for decennial Censuses 1970, 1980, 1990, 2000, 2010. Due to changing questions and collection methods, only 1990 data and later is comparable with 2010. The demographic data was assembled and normalized by the Project on Diversity and Disparities at Brown University.

detroit_race

Note that detroit_race contains one row per combination of tract, time, and racial group. compx expects your demographic data to be in this tidy (or "long") format. If it's in a different shape, you may want to investigate the dplyr and tidyr packages to format it appropriately.

Three columns are required:

  1. tract, the key relating the data the corresponding spdf. compx assumes that tract matches the GEOID column of spdf@data.
  2. group, the demographic category (such as racial group, in this case).
  3. n, the count of members of group in each tract.

Additionally, you may include an optional column for time t. compx functions will automatically use this column of detected; if you don't want to do temporal analysis, you should delete the $t$ column if you have one. Any additional columns are ignored.

Data Preparation

Before going any further, let's do a bit of cleaning to ensure that we are only analyzing with tracts for which there is a reasonable amount of data available.

tracts_to_use <- detroit_race %>% 
    group_by(tract, t) %>%
    summarise(n = sum(n)) %>%
    filter(n > 100) 

tracts <- detroit_tracts[detroit_tracts@data$GEOID %in% tracts_to_use$tract, ]

f_tracts <- tracts %>%         # for plotting later
    fortify(region = 'GEOID')

The Information Manifold

Our analytical approach is motivated by information geometry. The core notion of information geometry is to view the data as lying on an information manifold, and then use some simple tools of differential geometry to study that manifold's properties.

The fundamental manifold property is the metric tensor $g$. The metric tensor contains complete information on how distances in information space are bent and curved, depending on where we are. It may therefore be used to compute distances. If $p, q, r \in M$, then, provided $p,q,$ and $r$ are sufficiently close, the geodesic distance between $q$ and $r$ is approximately

$$ d(q, r) \approx \sqrt{g_p(q - r, q - r)}\;.$$

When $g = I$, the identity matrix, this formula reduces to $d(q,r) \approx \lVert q-r \rVert$, the Euclidean distance. When $g$ differs substantially from $I$, we get more interesting behavior.

We can view the metric tensor $g$ as encoding the information we need to measure distances based on information, rather than on geography alone. A related and important idea of the metric tensor is that it encodes variation -- places where the components of the metric tensor are large correspond to boundary areas between different demographic clusters.

Computing the Metric Tensor

Because the metric tensor is fundamental to our approach, compx includes a function that computes it directly from tracts and appropriately keyed data. If the keyed data contains no column labeled t, then compx assumes that all data are from the same time period. The simplest case of data looks like this:

race_2010 <- detroit_race %>%
    filter(t == 2010) %>%
    select(-t)                 # remove the t column if not using it

race_2010

Now we can compute the metric tensor. There are a few arguments to specify here. They are all briefly described in the comments below, and more fully in the package documentation. One thing to note is that hessian parameter should be a function that takes in a single vector and returns a matrix. So far, compx comes bundled with appropriate functions for the KL Divergence (DKL_), Euclidean (euc_), and cumulative Euclidean (cum_euc_) matrices; more are in progress.

metric_df <- compute_metric(tracts,             # use these tracts
                            race_2010,          # and these data
                            km = T,             # convert spatial units to km
                            sigma = 100,        # for numerical derivatives
                            hessian = DKL_,     # use the KL divergence to compare distributions
                            smooth = T)         # spatial smoothing (avoids singularities)

In addition to the tracts and data parameters, compute_metric also allows users to specify whether length should be measured in km or degrees long/lat; the divergence function to use when comparing distributions (in c('euc', 'cum_euc', 'DKL')), and sigma, a technical parameter relating to the computation of numerical derivatives required to compute the metric tensor. The smooth parameter controls whether mild spatial smoothing is used to avoid singularities when using the KL divergence.

The output of compute_metric is a data frame, keyed by geoid, which gives the total population of the tract as well as a list-column with the metric tensor in local coordinates. Since we didn't give a time column, there are only two local coordinates x and y.

metric_df %>% head()

We can also take a quick peek at the metric tensor itself. For example, the tensors corresponding to the first two entries of metric_df are:

metric_df$local_g %>% head(2)

The metric tensor is a symmetric matrix. Roughly, $g_{ii}$ encodes the strength of the dependence on the frequency field on the $i$th coordinate. Thus, if we focus on

metric_df$local_g[[1]]

we can compare the entry of r metric_df$local_g[[1]][1,1] in the x component to r metric_df$local_g[[1]][2,2] for the y component to get a sense for the direction in which the demographic distribution is changing most rapidly. In this case, it's the $x$ direction.

Analyzing the Metric Tensor

To visualize the metric structure, it's usually necessary to apply a scalar function to the metric tensor. Good ones are the trace and the square root of the determinant, both of which are usefully interpretable from the information-geometric point of view. Roughly, the trace corresponds to length distortions, and the determinant to volume distortions in information space.

trace_df <- metric_df %>% 
    mutate(trace = map_dbl(local_g, . %>% diag() %>% sum()),
           det   = map_dbl(local_g, . %>% det() %>% abs() %>% sqrt()))
trace_df %>% head()

It's useful to plot these scalars to see how they behave visually. Here's the trace.

f_tracts %>% 
    left_join(trace_df, by = c('id' = 'geoid')) %>% 
    ggplot() + 
    aes(x = long, y = lat, fill = trace, group = group) + 
    geom_polygon() + 
    viridis::scale_fill_viridis(option = 'magma', trans = 'log10') + 
    ggthemes::theme_map() 

ggsave('../test.pdf')

One way to think about the more intense areas with higher traces is that they are areas of transition. If you compare this visualization to, e.g. UVA's Racial Dot Map you'll find that the lighter areas correspond to the predominating white/black boundary, as well as smaller subdivisions.

Incorporating Time

compx tries to make it easy for you to incorporate time into your analysis as well. To do this, you just need to make sure that your data has a t column with appropriate integer or numeric values. Then, compute_metric will work just as before.

race_temporal <- detroit_race %>%
    filter(t %in% c(1990, 2000, 2010))           # now we don't remove the t column

metric_df <- compute_metric(tracts, 
                            race_temporal, 
                            km = T, 
                            sigma = 100, 
                            hessian = DKL_,
                            smooth = T)

This version of metric_df is different in two respects:

metric_df %>% head()

First, there's a t column for time. Second, the metric tensor is now 3 x 3, which reflects the fact that we've added a third coordinate (time).

metric_df$local_g[[1]]

In general, it's not a good idea to directly compare temporal components of the metric tensor to spatial ones, since they have different units. In this case, the spatial components have units of kilometers, but the temporal ones have units of years. However, it's not wrong to interpret this result as saying that, at this point in space and time, the demographic composition changes more when you move a km in space than when you move a year in time.

Just like before, we can apply some scalar functions to extract information about the metric tensor. The spatial trace is the sum of the first two diagonal entries, corresponding to the components of the metric tensor that encode spatial (not temporal) variability.

trace_df <- metric_df %>% 
    mutate(trace = map_dbl(local_g, ~ .[1:2, 1:2] %>% diag() %>% sum()))

f_tracts %>% 
    left_join(trace_df, by = c('id' = 'geoid')) %>% 
    ggplot() + 
    aes(x = long, y = lat, fill = trace, group = group) + 
    geom_polygon() + 
    viridis::scale_fill_viridis(option = 'magma', trans = 'log10') + 
    ggthemes::theme_map() + 
    facet_wrap(~t)

The third component of the metric tensor quantifies the dependence on time. We can also extract that and visualize it:

t_df <- metric_df %>% 
    mutate(temporal = map_dbl(local_g, ~ .[3, 3]))

f_tracts %>% 
    left_join(t_df, by = c('id' = 'geoid')) %>% 
    ggplot() + 
    aes(x = long, y = lat, fill = temporal, group = group) + 
    geom_polygon() + 
    viridis::scale_fill_viridis(option = 'magma', trans = 'log10') + 
    ggthemes::theme_map() + 
    facet_wrap(~t)

We can see that, the predominantly white suburbs toward the west have generally experienced greater demographic change over time than the prodominantly black urban core toward the northeast. We also observe that the temporal component of the metric tensor tends to correlate with the spatial component of the previosu visualization. So, spatial transition areas also tend to be temporal transition areas. This makes sense if you think that the dynamics of population composition are continuous in space, so that change happens at the boundaries.

t_df  %>% 
    mutate(spatial = map_dbl(local_g, ~ .[1:2, 1:2] %>% diag() %>% sum())) %>% 
    ggplot() + 
    aes(x = spatial, y = temporal) + 
    geom_point() + 
    scale_y_continuous(trans = 'log') + 
    scale_x_continuous(trans = 'log') + 
    geom_smooth() + 
    facet_wrap(~t)

Network Analysis

The other main form of analysis enabled by compx is network-based analysis used to quantify and identify structure in demographic variation. Similarly to the hessian argument of compute_metric, here you need to specify a divergence or comparison function between distributions. It should operate on vectors of nonnegative real numbers n and m (note: do NOT assume that these are normalized), and return a real number. It should also be symmetric. An example is the Jensen-Shannon metric, defined below in terms of the KL divergence:

divergence <- function(n,m){
    p <- n / sum(n)
    q <- m / sum(m)
    sqrt(.5 * DKL(p, p+q) + .5*DKL(q, p+q))
}

g <- construct_information_graph(tracts, race_2010, divergence)

g

g is an igraph object with edge and node attributes. The dist edge attribute reflects how information-geometrically dissimilar are the connected nodes, where this dissimilarity is just the information distance between the nodes, calculated using the metric tensor. It's useful to visualize this network. In the plot below, dissimilar tracts have thin edges between them, while very similar ones are joined by thick edges.

edges <- g %>% as_long_data_frame() %>% tbl_df()
nodes <- data_frame(x = V(g)$x, y = V(g)$y)

ggplot() + 
    geom_polygon(aes(x = long, y = lat, group = group), fill = 'firebrick', alpha = .6, data = f_tracts) +
    geom_segment(aes(x = from_x, 
                     y = from_y, 
                     xend = to_x, 
                     yend = to_y, 
                     size = exp(-7 * dist^2)), color = 'black', data = edges) +
    ggthemes::theme_map() +
    scale_size_continuous(range = c(0,1)) + 
    guides(alpha = 'none', color = 'none', size = 'none')

Comparing to the figures above, you might notice that areas with thin or invisible edges correspond to the same "border" areas that we saw highlighted by the spatial trace above.

The Scale of Segregation

One useful way to analyze the spatial structure of segregation is via the eigenspectrum of the Laplacian matrix corresponding to the graph, a method inspired by spectral clustering. In the plot below, large gaps in the spectrum correspond to strong "signals" in the spatial structure; intiutively, it "makes sense" to cut the structure into that many pieces.

A   <- affinity_matrix(g, sigma = 20)        # square exponential affinity with specified sigma
L   <- generalized_laplacian_matrix(A)      # L_{rw} in the tutorial cited above
evL <- eigen(L, symmetric = T)              # compute the spectrum

data.frame(n = 1:30, ev = 1 - rev(evL$values)[1:30]) %>%
    ggplot() +
    aes(x = n, y = ev) +
    geom_line() +
    geom_point() +
    scale_y_continuous(trans = 'log10') + 
    geom_vline(xintercept  = 3.5, linetype = 'dashed') + 
    geom_vline(xintercept  = 8.5, linetype = 'dashed') + 
    geom_vline(xintercept = 10.5, linetype = 'dashed') + 
    theme_bw()

For example, we see that in Detroit with under the KL divergence, there's a large gap after $n = 2$ and then a smaller one at $n = 8$. Of course, it's important to realize that there is some judgment required in identifying these hard cutoffs -- it's not a single cutoff but the spectrum of $A$ that most fully describes the community structure.

Spatial Community Detection

We can try to actually identify clusters based on the affinity matrix $A$. Here we use spectral clustering, which exploits our knowledge of the eigenvalues of $A$. The following code will construct the Laplacian matrix $L$, do $k$-means clustering in the eigenspace of $L$, and then assign the resulting clusters as a new vertex attribute to g.

set.seed(1234)

k <- 8
nreps <- 1000
Z <- evL$vectors[,(ncol(evL$vectors)-k+1):ncol(evL$vectors)]

models <- data_frame(n = 1:nreps) %>% 
    dplyr::mutate(model = map(n, ~ kmeans(Z, centers = k))) %>% 
    mutate(perf = map_dbl(model, ~.$tot.withinss))

model <- models %>% 
    filter(perf == min(perf))
km <- model$model[[1]]

g        <- g %>% set_vertex_attr(name = 'cluster', 
                                  index = colnames(A), 
                                  value = km$cluster)

We've also provided a convenience function that will do the same thing:

g <- g %>% spectral_cluster(sigma = 20, k = k, nreps = nreps)

Either way, now we can plot the result:

# extract as data frame. We already extracted the edges df above. 
nodes    <- data_frame(x = V(g)$x, y = V(g)$y, cluster = V(g)$cluster)

pal <- scales::brewer_pal(palette = 'Set1')
pal <- colorRampPalette(pal(9))
pal <- pal(k)

ggplot() + 
    geom_segment(aes(x = from_x, 
                     y = from_y, 
                     xend = to_x, 
                     yend = to_y, 
                     size = exp(-7 * dist^2)), color = 'black', data = edges) +
    geom_point(aes(x = x, y = y, color = as.character(cluster)), size = 2, data = nodes) + 
    ggthemes::theme_map() +
    scale_size_continuous(range = c(0,1)) + 
    guides(alpha = 'none', color = 'none', size = 'none') + 
    scale_color_manual(values = pal)

This is a reasonably strong result; the clustering has detected the predominating division between the white suburbs to the west and the predominantly black urban core, as well as distinct areas like "Mexicantown", Grosse Point, and Hamtramck. It's worth comparing this clustering to a demographic map of Detroit.

Spatiotemporal Networks

It's also possible to use construct networks at multiple time-slices. Just like before, it's necessary to use data that has a t column with appropriate values.

g <- construct_information_graph(tracts, race_temporal, divergence = divergence)

When t column is provided, the names of g are concatenations of the GEOID of the tract and the corresponding value of $t$. There's an edge between each tract and its corresponding step forward or backward in time.

k     <- 15
sigma <- 20
g     <- g %>% spectral_cluster(sigma = sigma, k = k, nreps = 10)

It's more complex to visualize these networks in a workable way, but not hard to visualize the clusters over time:

# get the edges, only including ones that are "within" a time slice. 
edges <- g %>% 
    as_long_data_frame() %>% 
    tbl_df() %>% 
    mutate(type = ifelse(from_t == to_t, 'spatial', 'temporal')) %>% 
    filter(type == 'spatial') %>% 
    mutate(t = as.integer(stringr::str_sub(from_t, -5)))

# get the nodes, including temporal and cluster info
nodes <- data_frame(x = V(g)$x, 
                    y = V(g)$y, 
                    t = V(g)$t, 
                    cluster = V(g)$cluster)

# plot the clusters, faceting on time
pal   <- brewer.pal(9, name = "Set1") %>%
    colorRampPalette()

ggplot() + 
    geom_segment(aes(x = from_x, 
                     y = from_y, 
                     xend = to_x, 
                     yend = to_y, 
                     size = exp(-sigma * dist^2)), data = edges, color = 'grey40') + 
    geom_point(aes(x = x, y = y, color = as.character(cluster)), size = 1, data = nodes) + 
    facet_wrap(~t) + 
    scale_size_continuous(range = c(0,1)) + 
    ggthemes::theme_map() + 
    guides(color = FALSE, size = FALSE) + 
    scale_color_manual(values = pal(k)) 

Currently, this temporal clustering method has some limitations -- tracts don't change demographics very quickly over the time period we are considering, and so clusters tend to be quite persistent. The result of this persistence is that the "first" clusters that this method tends to find are sometimes not very informative -- they are just joining together a few tracts that haven't changed in time at all. So, it's fair to say that more work here is needed.

However, there's also a substantial amount of signal in the clusters. For example, the clustering detects that the historically white community of Grosse Point (the far eastern tip) has somewhat receded from 1990 to 2010, with its western-most areas being "absorbed" into the predominantly black cluster. This is easy to see if you visualize raw percentages in each tract over time.

Agglomerative Clustering

Spectral clustering is a "global" algorithm, which leads to some good properties and some less-good ones. Agglomerative (hierarchical) clustering is a local algorithm that can lead to more intuitive results in certain cases.

M <- race_temporal %>%
    group_by(group) %>%
    summarise(n = sum(n)) %>%
    select(n) %>% unlist()

divergence <- function(n,m){
    p <- n / sum(n)
    q <- m / sum(m)
    p_bar <- sum(n) / sum(m + n) * p + sum(m) / sum(m + n) * q
    r <- M / sum(M)
    sum(n) / sum(M) * DKL(p, r) +
        sum(m) / sum(M)*DKL(q, r) -
        sum(m + n) / sum(M) * DKL(p_bar, r)
}


a <- info_cluster(g, divergence)

Now we visualize the result...

k <- 10
g <- g %>% set_vertex_attr('cluster', value = a %>% cutree(k))
# get the edges, only including ones that are "within" a time slice. 
edges <- g %>% 
    as_long_data_frame() %>% 
    tbl_df() %>% 
    mutate(type = ifelse(from_t == to_t, 'spatial', 'temporal')) %>% 
    filter(type == 'spatial') %>% 
    mutate(t = as.integer(stringr::str_sub(from_t, -5)))

# get the nodes, including temporal and cluster info
nodes <- data_frame(x = V(g)$x, 
                    y = V(g)$y, 
                    t = V(g)$t, 
                    cluster = V(g)$cluster)

# plot the clusters, faceting on time
pal   <- brewer.pal(9, name = "Set1") %>%
    colorRampPalette()

ggplot() + 
    geom_segment(aes(x = from_x, 
                     y = from_y, 
                     xend = to_x, 
                     yend = to_y, 
                     size = exp(-sigma * dist^2)), data = edges, color = 'grey40') + 
    geom_point(aes(x = x, y = y, color = as.character(cluster)), size = 1, data = nodes) + 
    facet_wrap(~t) + 
    scale_size_continuous(range = c(0,1)) + 
    ggthemes::theme_map() + 
    guides(color = FALSE, size = FALSE) + 
    scale_color_manual(values = pal(k)) 

Note that hierarchical clustering produces different clusters, which reflects two main differences:

Whether these are desirable properties depends of course on the application domain.

This version of hierarchical clustering is greedily information-maximizing, so it's useful to ask how much information is captured by the clustering. This particular clustering uses r k clusters to capture r a$height[length(a$height)] - a$height[length(a$height) - k] nats of information, out of r a$height[length(a$height)] total. So, despite the fact that our data set has r nrow(race_temporal) / n_distinct(race_temporal$group) tracts in it, we can "tell most of the story" with just r k super-tracts.

Reference: White-Black Segregation in Detroit

# construct a data frame of percentages. 
percent_df <- race_temporal %>% 
    filter(t %in% c(1990, 2010)) %>% 
    group_by(tract, t) %>% 
    mutate(percent = n /sum(n)) %>%
    rename(race = group) %>% 
    filter(race %in% c('Black', 'White')) 

# plot the df 
pal <- brewer.pal(9, 'Blues') %>% 
    colorRampPalette()

f_tracts %>%
    left_join(percent_df, by = c('id' = 'tract')) %>% 
    tbl_df() %>% 
    ggplot() +
    aes(x = long, y = lat, group = group, fill = percent) + 
    geom_polygon() + 
    facet_grid(t ~ race) + 
    ggthemes::theme_map() + 
    scale_fill_distiller(palette = 'BuPu', direction = 1) + 
    theme(panel.background = element_rect(fill = 'grey80'))

Future Work

Priorities for future work on compx include refining clustering methods, performance enhancements, and improvements TBD based on suggestions from those in the planning and sociological communities.

Session Information

sessionInfo()


PhilChodrow/compx documentation built on May 8, 2019, 1:34 a.m.