In PhilChodrow/compx: Analyze the spatial complexity of compositional data

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Introduction

This document is an introduction to information-geometric techniques for the analysis of spatial segregation and inequality using the compx package for the R programming language. Get compx by running:

# install.packages('devtools') if necessary
devtools::install_github('PhilChodrow/compx')
library(compx)

library(compx)

The core situation we consider is one in which our data consist of counts in two or more categories, distributed in space. A standard example is demographic census data, in which the categories might be racial groups, and the spatial units tracts. This document does not focus on mathematical details of the methods, which will be explained in a forthcoming paper. Rather, the focus is on the intuition behind various analytical methods, and how to carry them out in order to learn about spatial separation with freely-available software tools.

To prepare data and analyze outputs of compx functions, we will also need the following packages:

library(tidyverse)    # data import, manipulation, and viz. Need dev version of ggplot2 for viz
library(maptools)     # plot the tracts
library(RColorBrewer) # for plots
library(ggthemes)     # for theme_map
library(viridis)      # for color scales
library(units)        # needed by sf
library(sf)           # for io and viz
library(scales)       # for ggplot2 viz

Please note that the geom_sf function used by ggplot2 to visualize maps below is, at the time of writing, only available on the development branch of ggplot2. This version can be obtained by running

devtools::install_github('tidyverse/ggplot2')

Data inputs

The functions provided by package compx assume that your data is expressed in two components. The first is an object of class sf, which is short for Simple Features. These are data.frames that contain a special column -- called geometry -- for encoding spatial information such as polygon shapes. Learn more about the sf package for R at its CRAN and GitHub pages. For illustrative purposes, compx comes bundled with an sf object with the Census tracts for Wayne County, Michigan, which includes the urban core of Detroit, an oft-analyzed city in quantitative segregation studies. The tracts were originally accessed via the R package tigris.

detroit_tracts %>% 
    ggplot() + 
    geom_sf() + 
    theme_map()

The second data input must be a demographic table with class in c('tbl_df', 'tbl', 'data.frame'), which I will refer colloquially to as a "data frame." compx includes an example in detroit_race, giving racial demographic counts for each tract for decennial Censuses 1970, 1980, 1990, 2000, 2010. Due to changing questions and collection methods, only 1990 data and later is comparable with 2010. The demographic data was assembled and normalized by the Project on Diversity and Disparities at Brown University.

detroit_race

Note that detroit_race contains one row per combination of tract, time, and racial group. compx expects your demographic data to be in this tidy (or "long") format. If it's in a different shape (e.g. with a column for each racial group), you should use dplyr and tidyr packages to format it appropriately.

Required Columns

Three columns are required:

tract, the key relating the data the corresponding spdf. compx assumes that tract matches the GEOID column of spdf@data.
group, the demographic category (such as racial group, in this case).
n, the count of members of group in each tract.

Additionally, you may include an optional column for time t. compx functions will automatically use this column of detected; if you don't want to do temporal analysis, you should delete the $t$ column if you have one. Any additional columns are ignored.

Data Cleaning

Before going any further, let's do a bit of cleaning to ensure that we are only analyzing with tracts for which there is a reasonable amount of data available.

tracts_to_use <- detroit_race %>% 
    group_by(tract, t) %>%
    summarise(n = sum(n)) %>%
    filter(n > 100) 

tracts <- detroit_tracts %>% 
    filter(GEOID %in% tracts_to_use$tract)

The Information Manifold

The Animating Idea

Our analytical approach is motivated by information geometry. The core notion of information geometry is to view the data as lying on an information manifold, and then use some simple tools of differential geometry to study that manifold's properties.

The fundamental manifold property is the metric tensor $g$. The metric tensor contains complete information on how distances in information space are bent and curved, depending on where we are. It may therefore be used to compute distances. If $p, q, r \in M$, then, provided $p,q,$ and $r$ are sufficiently close, the geodesic distance between $q$ and $r$ is approximately

$$ d(q, r) \approx \sqrt{g_p(q - r, q - r)}\;.$$

When $g = I$, the identity matrix, this formula reduces to $d(q,r) \approx \lVert q-r \rVert$, the Euclidean distance. When $g$ differs substantially from $I$, we get more interesting behavior.

We can view the metric tensor $g$ as encoding the information we need to measure distances based on information, rather than on geography alone. A related and important idea of the metric tensor is that it encodes variation -- places where the components of the metric tensor are large correspond to boundary areas between different demographic clusters.

Computing the Metric Tensor

Because the metric tensor is fundamental to our approach, compx includes a function that computes it directly from tracts and appropriately keyed data. If the keyed data contains no column labeled t, then compx assumes that all data are from the same time period. The simplest case of data looks like this:

race_2010 <- detroit_race %>%
    filter(t == 2010) %>%
    select(-t)                 # remove the t column if not using it

race_2010

Now we can compute the metric tensor. There are a few arguments to specify here. They are all briefly described in the comments below, and more fully in the package documentation. One thing to note is that hessian parameter should be a function that takes in a single vector and returns a matrix. So far, compx comes bundled with appropriate Hessian functions for the KL Divergence (DKL_), Euclidean (euc_), and cumulative Euclidean (cum_euc_) matrices; more are in progress.

data  <- race_2010 %>%
    filter(tract %in% tracts$GEOID) %>%
    group_by(tract) %>%
    ungroup() 

metric_df <- compute_metric(tracts, 
                            data, 
                            km = T,
                            r_sigma = 10,
                            s_sigma = 1, 
                            smooth = T,
                            hessian = euc_)

In addition to the tracts and data parameters, compute_metric also allows users to specify whether length should be measured in km or degrees long/lat; the divergence function to use when comparing distributions, and sigma, a technical parameter relating to the computation of numerical derivatives required to compute the metric tensor. The smooth parameter controls whether mild spatial smoothing is used to avoid singularities, and is recommended for use.

The output of compute_metric is a data frame, keyed by geoid, which gives the total population of the tract as well as a list-column with the metric tensor in local coordinates. Since we didn't give a time column, there are only two local coordinates x and y.

metric_df %>% head()

We can also take a quick peek at the metric tensor itself. For example, the tensors corresponding to the first two entries of metric_df are:

metric_df$g %>% head(2)

The metric tensor is a symmetric matrix. Roughly, $g_{ii}$ encodes the strength of the dependence on the frequency field on the $i$th coordinate. Thus, if we focus on

metric_df$g[[1]]

we can compare the entry of r metric_df$local_g[[1]][1,1] in the x component to r metric_df$local_g[[1]][2,2] for the y component to get a sense for the direction in which the demographic distribution is changing most rapidly. In this case, it's the $x$ direction. Intuitively, if we stood at the point $(x,y)$ and looked east or west, we'd say more difference from where we were standing than if we looked north or south.

Analyzing the Metric Tensor

To visualize the metric structure, it's usually necessary to apply a scalar function to the metric tensor. Good ones are the trace and the square root of the determinant, both of which are usefully interpretable from the information-geometric point of view. These are:

$$ \begin{align} \text{tr } g &= \sum_i \lambda_i \ \sqrt{|\text{det } g|} & = \sqrt{\prod_i |\lambda_i|} \end{align} \;,$$

where $\lambda_i$ is the $i$th eigenvalue of $g$. While these formulae are a bit abstruse, the formulae actually point out how these measures behave. The trace is large when at least one eigenvalue is large, which occurs when demographic trends are changing rapidly in at least one direction. On the other hand, the determinant is large when all eigenvalues are large, which occurs when demographic trends are changing rapidly in all directions. What this means in practice is:

The trace is tends to be high in cities with long, linear boundaries.
The determinant tends to be high in cities with small, curvy "pockets" of variation.

Let's compute these two quantities:

metric_df <- metric_df %>% 
    mutate(trace = map_dbl(g, . %>% diag() %>% sum()),
           det   = map_dbl(g, . %>% det() %>% abs() %>% sqrt()))
metric_df %>% head()

It's useful to plot these scalars to see how they behave visually. For example, here's the trace:

tracts %>% 
    left_join(metric_df, by = c('GEOID' = 'geoid')) %>% 
    ggplot() + 
    geom_sf(aes(fill = trace), size = .1) + 
    viridis::scale_fill_viridis(trans = 'log10') + 
    ggthemes::theme_map() + 
    theme(legend.position = c(.8, .1)) +
    guides(fill = guide_colorbar(title = expression(italic(j[x]))))

One way to think about the more intense areas with higher traces is that they are areas of transition. If you compare this visualization to, e.g. UVA's Racial Dot Map you'll find that the lighter areas correspond to the predominating white/black boundary, as well as smaller subdivisions.

Correlation of Measures

While the trace and determinant are separate quantities with separate interpretations, they will tend to rather strongly correlated:

metric_df %>%
    ggplot() +
    aes(x = trace, y = det) +
    geom_point() +
    theme_bw() +
    scale_x_continuous(trans = 'log10') +
    scale_y_continuous(trans = 'log10')

Incorporating Time

compx tries to make it easy for you to incorporate time into your analysis as well. To do this, you just need to make sure that your data has a t column with appropriate integer or numeric values. Then, compute_metric will work just as before.

race_temporal <- detroit_race %>%
    filter(t %in% c(1990, 2000, 2010))           # now we don't remove the t column

metric_df <- compute_metric(tracts,
                            race_temporal,
                            km = T,
                            s_sigma = 1,
                            r_sigma = 1,
                            hessian = DKL_,
                            smooth = T)

Note that this computation is likely to take longer, as we now have essentially three times as many data points on which to compute.

This version of metric_df is different in two respects:

metric_df %>% head()

First, there's a t column for time. Second, the metric tensor is now 3 x 3, which reflects the fact that we've added a third coordinate (time).

metric_df$g[[1]]

In general, it's not a good idea to directly compare temporal components of the metric tensor to spatial ones, since they have different units. In this case, the spatial components have units of kilometers, but the temporal ones have units of years. However, it's not wrong to interpret this result as saying that, at this point in space and time, the demographic composition changes more when you move a km in space (especially north-south) than when you move a year in time.

Just like before, we can apply some scalar functions to extract information about the metric tensor. The spatial trace is the sum of the first two diagonal entries, corresponding to the components of the metric tensor that encode spatial (not temporal) variability.

trace_df <- metric_df %>%
    mutate(trace = map_dbl(g, ~ .[1:2, 1:2] %>% diag() %>% sum()))

tracts %>%
    left_join(trace_df, by = c('GEOID' = 'geoid')) %>% 
    ggplot() +
    geom_sf(aes(fill = trace), size = .1) + 
    scale_fill_viridis(option = 'magma', limits = c(1e-4, 5e0), trans = 'log10', oob = squish) +
    theme_map() +
    facet_wrap(~t) +
    theme(legend.position = c(.9,0),
          legend.justification = c(0,0)) +
    guides(fill = guide_colorbar(title = expression(italic(j[x]))))

The third component of the metric tensor quantifies the dependence on time. We can also extract that and visualize it:

t_df <- metric_df %>%
    mutate(temporal = map_dbl(g, ~ .[3, 3]))

tracts %>%
    left_join(t_df, by = c('GEOID' = 'geoid')) %>%
    ggplot() +
    geom_sf(aes(fill = temporal), size = .1) + 
    scale_fill_viridis(option = 'magma', trans = 'log10', limits = c(NA, 5e-2), oob = squish) +
    theme_map() +
    facet_wrap(~t) +
    theme(legend.position = c(.9,0),
          legend.justification = c(0,0)) +
    guides(fill = guide_colorbar(title = expression(italic(j[t]))))

We can see that, the predominantly white suburbs toward the west have generally experienced greater demographic change over time than the prodominantly black urban core toward the northeast. We also observe that the temporal component of the metric tensor tends to correlate with the spatial component of the previosu visualization. So, spatial transition areas also tend to be temporal transition areas. This makes sense if you think that the dynamics of population composition are continuous in space, so that change happens at the boundaries. However, this correlation is relatively weak.

t_df  %>%
    mutate(spatial = map_dbl(g, ~ .[1:2, 1:2] %>% diag() %>% sum())) %>%
    ggplot() +
    aes(x = spatial, y = temporal) +
    geom_point() +
    scale_y_continuous(trans = 'log10') +
    scale_x_continuous(trans = 'log10') +
    geom_smooth() +
    facet_wrap(~t) +
    theme_bw()

Thanks for reading! If you enjoyed this vignette and are interested to learn more about compx, take a look at the next vignette on studying the scales of segregation and finding spatial structure via network analysis

Session Information

sessionInfo()

PhilChodrow/compx documentation built on May 8, 2019, 1:34 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

PhilChodrow/compx
Analyze the spatial complexity of compositional data

In PhilChodrow/compx: Analyze the spatial complexity of compositional data

Introduction

Data inputs

Required Columns

Data Cleaning

The Information Manifold

The Animating Idea

Computing the Metric Tensor

Analyzing the Metric Tensor

Correlation of Measures

Incorporating Time

Session Information

R Package Documentation

Browse R Packages

We want your feedback!

PhilChodrow/compx Analyze the spatial complexity of compositional data

In PhilChodrow/compx: Analyze the spatial complexity of compositional data

Introduction

Data inputs

Required Columns

Data Cleaning

The Information Manifold

The Animating Idea

Computing the Metric Tensor

Analyzing the Metric Tensor

Correlation of Measures

Incorporating Time

Session Information

R Package Documentation

Browse R Packages

We want your feedback!

PhilChodrow/compx
Analyze the spatial complexity of compositional data