rporta23/censusviz: Framework to Visualize Historical Census Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6
)

library(censusviz)
library(ggplot2)
library(dplyr)

About censusviz

The censusviz package provides an interface for exploring and visualizing historical racial demographic census data (1950-2020), sourced from IPUMS, for any county in the United States. It can display the data on leaflet maps, and it also returns the data in a tidy format so that users can build their own visualizations.

This package was inspired by the nepm package.

Getting Started

The example that follows uses Madison County, New York, to illustrate the package's functionality.

Reading in Data

The package uses three data files: tracts_long_all.rda, tracts_demo_joined1.rda, and tracts_demo_joined2.rda. Because the files are large, each one is read directly from a GitHub repository. get_data_wide() combines tracts_demo_joined1.rda and tracts_demo_joined2.rda into data_wide, and get_data_long() loads tracts_long_all.rda.

This code chunk takes about 3 minutes to run, so it is set not to evaluate automatically. Set eval = TRUE if you want access to the full dataset.

data_wide <- get_data_wide()

data_wide contains 2 columns: year and tract_data. There are 8 rows, one for each census year from 1950 to 2020. Each value in the tract_data column is a large shapefile that joins racial demographic data with the spatial data needed to draw the census tract lines for each county in the U.S. This dataset is mainly intended for drawing the census tract lines on the map and for generating the dots that display the spatial distribution of racial demographics.
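To make the nested layout concrete, here is a small stand-in built by hand. It is not the real data_wide: only the year and tract_data columns come from the description above, and the inner column names (tract_id, group_a, group_b) are hypothetical.

```r
# Toy stand-in for data_wide: one row per census year, with a list-column
# holding that year's tract-level table. Inner column names are hypothetical.
census_years <- seq(1950, 2020, by = 10)

toy_wide <- data.frame(year = census_years)
toy_wide$tract_data <- lapply(census_years, function(y) {
  data.frame(
    tract_id = c("A", "B"),
    group_a = c(1200, 800),
    group_b = c(150, 90)
  )
})

# One row per decennial census year, 1950 through 2020
nrow(toy_wide)

# Pull out the tract table for a single year
tracts_2020 <- toy_wide$tract_data[[which(toy_wide$year == 2020)]]
tracts_2020
```

In the real object each tract_data entry would also carry geometry, which this sketch leaves out.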

Filtering Data by State, County

The user can call filter_data_wide() and filter_data_long() with a state and county name to keep only the data for that area. Here we keep only the data for Madison County, New York.

madison_data_wide <- filter_data_wide(data_wide, "New York", "Madison")

madison_data_wide
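As a rough illustration of what filtering a nested table by state and county involves, the sketch below applies the same row filter to every year's tract table. It is not the package's implementation: filter_nested() is a hypothetical helper, and the state and county column names are assumptions.

```r
# Hypothetical helper: keep only rows for one state and county in each
# year's tract table. The column names `state` and `county` are assumptions.
filter_nested <- function(wide, state_name, county_name) {
  wide$tract_data <- lapply(wide$tract_data, function(tracts) {
    keep <- tracts$state == state_name & tracts$county == county_name
    tracts[keep, , drop = FALSE]
  })
  wide
}

# Toy data: the same county name appears in two different states
toy_tracts <- data.frame(
  state = c("New York", "New York", "Ohio"),
  county = c("Madison", "Queens", "Madison"),
  total = c(500, 900, 300)
)
toy_wide <- data.frame(year = c(2010, 2020))
toy_wide$tract_data <- list(toy_tracts, toy_tracts)

toy_madison <- filter_nested(toy_wide, "New York", "Madison")
toy_madison$tract_data[[1]]
```

The toy data shows why both arguments matter: county names repeat across states, so a county name alone would not identify a single area.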

Creating Sampling Dots

After filtering the data, the user can pass the filtered wide data (here, madison_data_wide) to create_dots() to prepare it for display on a leaflet map. create_dots() samples random points within each census tract for each racial category so that each dot represents num_people people (default: 100 people per dot), showing the distribution of demographics across the selected area.

Here the sampling value is set to 100 people per dot. It can be adjusted higher or lower. The smaller the value, the more detailed the map, but the longer create_dots() takes to run, especially for large counties. A larger value carries the risk of underrepresenting smaller populations: the dot count rounds down to zero when a group's population is smaller than the sampling value.
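The rounding behavior can be worked through with a little arithmetic. The sketch below assumes the dot count is the group's population divided by num_people, rounded down. The helper dots_per_group() and the populations are hypothetical, not taken from the package.

```r
# Assumption: dots per group = floor(population / num_people)
dots_per_group <- function(population, num_people = 100) {
  floor(population / num_people)
}

# Hypothetical populations for three groups
population <- c(group_a = 1250, group_b = 340, group_c = 60)

dots_100 <- dots_per_group(population, num_people = 100)
dots_25 <- dots_per_group(population, num_people = 25)

dots_100
dots_25
```

Under that assumption, 100 people per dot gives 12, 3, and 0 dots, so the smallest group disappears from the map. At 25 people per dot the same groups get 50, 13, and 2 dots.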

This process can take anywhere from a few minutes to several hours. The larger the county and the smaller the sampling value, the longer create_dots() takes.

people <- create_dots(madison_data_wide, num_people = 100)
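For readers curious how random points can be drawn inside a tract, here is a minimal sketch using sf::st_sample() on a made-up polygon. Whether create_dots() uses this exact call is an assumption. The unit-square tract, the population, and the floor-based dot count are all illustrative.

```r
library(sf)

set.seed(1)

# Hypothetical tract: a unit square standing in for a real tract polygon
square <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
tract <- st_sfc(square)

# Hypothetical group population and sampling value
population <- 730
num_people <- 100
n_dots <- floor(population / num_people)

# Draw random points inside the polygon, one per dot
dots <- st_sample(tract, size = n_dots)

# Check that every sampled point falls within the tract
all_inside <- all(st_within(dots, tract, sparse = FALSE))
all_inside
```

Repeating this for every tract, racial category, and year is what makes the step slow for large counties or small sampling values.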

Generating Maps

After creating the sampling dots, they can be plotted with base_map(), which creates a blank leaflet map, and add_people(), which adds the sampling dots for a given year.

base_map() %>%
  add_people(2020, people)

add_tracts() can also be used to add the tract lines for the county.

base_map() %>%
  add_tracts(2020, madison_data_wide) %>%
  add_people(2020, people)
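For comparison, a similar dot map can be built directly with the leaflet package. This is only a sketch of the general technique, not the code behind base_map() or add_people(). The coordinates, group labels, and colors below are made up.

```r
library(leaflet)

# Hypothetical dots: coordinates and labels are made up for illustration
toy_dots <- data.frame(
  lng = c(-75.70, -75.65, -75.60),
  lat = c(42.90, 42.92, 42.88),
  race_label = c("Group A", "Group B", "Group A")
)

# Map each label to a color
pal <- colorFactor(c("steelblue", "darkorange"), domain = toy_dots$race_label)

toy_map <- leaflet(toy_dots) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lng, lat = ~lat,
    color = ~pal(race_label),
    radius = 3, stroke = FALSE, fillOpacity = 0.8
  )

toy_map
```

Printing toy_map in an interactive session or a knitted document renders the widget.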

Generating Graphs

The long-format data, filtered with filter_data_long(), can be combined with dplyr and ggplot2 to summarize and graph how the populations in the selected county change over time.

data_long <- get_data_long()
head(data_long)

data_long_filtered_Mad <- filter_data_long(data_long, "New York", "Madison")

# Total population per year and racial category across the county
data_long_filtered_sum_Mad <- data_long_filtered_Mad %>%
  group_by(year, race_label) %>%
  summarize(total = sum(n), .groups = "drop")

ggplot(data_long_filtered_sum_Mad, aes(x = year, y = total, color = race_label)) +
  geom_line() +
  labs(
    title = "Change in Racial Demographics over Time in Madison, NY",
    x = "Year",
    y = "Number of People",
    color = "Race"
  )
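The aggregation step can be checked on a small hand-made table. The sketch below reuses the year, race_label, and n columns from the code above. The tract column and all values are hypothetical.

```r
library(dplyr)

# Toy long-format data: one row per year, tract, and racial category
toy_long <- tibble(
  year = c(2010, 2010, 2010, 2020, 2020, 2020),
  tract = c("A", "B", "A", "A", "B", "B"),
  race_label = c("Group A", "Group A", "Group B", "Group A", "Group A", "Group B"),
  n = c(100, 50, 30, 120, 60, 40)
)

# Collapse tracts so each row is one year and category
toy_sum <- toy_long %>%
  group_by(year, race_label) %>%
  summarize(total = sum(n), .groups = "drop")

toy_sum
```

Each year and category pair collapses to one row, which is the shape geom_line() needs to draw one line per category.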

# Repeat the same summary for Queens County, New York
data_long_filtered_Queens <- filter_data_long(data_long, "New York", "Queens")

data_long_filtered_sum_Queens <- data_long_filtered_Queens %>%
  group_by(year, race_label) %>%
  summarize(total = sum(n), .groups = "drop")

ggplot(data_long_filtered_sum_Queens, aes(x = year, y = total, color = race_label)) +
  geom_line() +
  labs(
    title = "Change in Racial Demographics over Time in Queens, NY",
    x = "Year",
    y = "Number of People",
    color = "Race"
  )