```r
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.width = 7,
    fig.height = 6,
    dpi = 96
)
```
If you are new to R but not to data science, I suggest checking out *R for Data Science*. If you are not coming from a data science perspective, I have heard good things about this book.
hercacstables
If this is your first time using this package, you probably want to register
with the Census.
Registering is important because it lets you make more queries to the API:
the Census limits the number of anonymous API queries that it receives from any
one IP address.
The vignette `vignette("set_up_your_api_key")` will guide you through the
process.
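The details are covered in that vignette, but the general pattern is to store the key in an environment variable rather than in your scripts. The sketch below is a generic illustration only; the variable name `CENSUS_API_KEY` is an assumption, not necessarily what hercacstables itself reads.

```r
# Generic pattern for supplying an API key via an environment variable.
# "CENSUS_API_KEY" is an illustrative name; follow
# vignette("set_up_your_api_key") for the package's actual mechanism.
Sys.setenv(CENSUS_API_KEY = "your-key-goes-here")
key <- Sys.getenv("CENSUS_API_KEY")
nzchar(key)  # TRUE once a key has been set
```

In practice you would put the key in your `.Renviron` file so that it persists across sessions instead of calling `Sys.setenv()` by hand.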
This package helps R users to access data from the United States Census Bureau, especially its American Community Survey.
The American Community Survey (ACS) from the US Census's website provides a vast amount of useful data.
However, it returns those data in a weirdly idiosyncratic way. Even though the output seems tabular, the data are really organized in a tree-like fashion. This package is intended to make it easy to access and use the ACS data with R.
The database has an enormous amount of information, stored in a versatile but
cumbersome format.
Data are stored in hundreds of different tables, called "groups," each of which
breaks one topic down along specific dimensions.
For example, group
B21005
reports counts of people according to their employment status, military service,
and age.
Each row in a group reports a value for a specific slice of the topic.
For example, row 5 in group B21005
gives the number of employed veterans
between the ages of 18 and 34.
The Census combines the group code and row into a variable, `B21005_005E`.
This packing of information gives the Census a very flexible way to report all
of their data.
However, it is challenging for the user because you have to somehow know the
variables that report the data that you are interested in.
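The naming scheme itself is mechanical, though. As a small aside, here is a base-R sketch (not a package function) that splits a variable code back into its group and row:

```r
# Split an ACS variable code like "B21005_005E" into group and row index.
# Illustrative only; hercacstables ships metadata tables that map variables
# to groups and rows for you.
parse_acs_variable <- function(code) {
    parts <- regmatches(code, regexec("^([A-Z0-9]+)_([0-9]{3})E$", code))[[1]]
    list(Group = parts[[2]], Index = as.integer(parts[[3]]))
}

parse_acs_variable("B21005_005E")  # Group "B21005", Index 5
```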
In light of that, hercacstables
is designed to help people make repeated,
efficient use of the Census's API.
There are glossary tables that explicitly connect real-world concepts to
Census variables.
You can then use database-style joins to connect the concepts to the values that
you pull from the API.
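Here is that idea in miniature, with toy data and base R's `merge()` (the real glossary tables and fetched data have more columns than this):

```r
# Toy illustration of joining fetched API values to a concept lookup table.
fetched <- data.frame(
    Variable = c("B11001_001E", "B11001I_001E"),
    Value = c(1000, 200)
)
lookup <- data.frame(
    Variable = c("B11001_001E", "B11001I_001E"),
    Ethnicity = c("All", "Hispanic or Latino")
)
merge(fetched, lookup, by = "Variable")
```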
Many questions that work with Census data follow a common pattern:
> How did [measurement] differ among [demographic groups] and across [geographic levels] in [geographic area] during [span of time]?
A **Measurement** is any of the huge number of things that the Census keeps track of. Measurements are arranged in tables (called "groups") and rows. If you were drawing maps, this would determine the color of each area. Examples include population size (table B01001 and others), median household income (table B19013 and others), types of computers in a household (table B28001), and number of vehicles used while commuting (table B08015).

A **Demographic Group** is a subset of the population that shares specific traits. Demographic groups can be represented either by groups (common when reporting by race or ethnicity) or by rows (most other cases, like age, sex, or veteran status). If you were drawing maps, you would probably make a different version of each map for each demographic group. Examples include the number of Hispanic girls under 5 years old (table B01001I, row 18), the median income of Asian American households (table B19013D, row 1), the number of households with no computer (table B28001, row 11), and the number of vehicles used by women while commuting (table B08015, row 3).

A **Geographic Level** is one step in a Census-defined hierarchy that runs from the whole country down to small blocks. The level that you are interested in is determined by the scale of your question. If you were drawing maps, these would be the areas that appear as different colors. Examples include congressional districts, incorporated cities, school districts, and Census tracts.

A **Geographic Area** is a set of specific instances of one or more geographic levels. This is the full geographic scope of your question. If you were drawing maps, this would determine their scale. Examples include whole states, metropolitan areas, Census-designated places, and counties.
A typical workflow with hercacstables has these steps:

- Find relevant groups by searching the `Universe` and `Description` fields
  of `hercacstables::METADATA_FOR_ACS_GROUPS`.
- Find the specific variables in `hercacstables::METADATA_FOR_ACS_VARIABLES`:
  filter by the `Group` field and examine the `Details` field.
- Work out your geographic levels and areas; you can use
  `hercacstables::METADATA_FOR_ACS_GEOGRAPHIES` for this, too!
- Pull values from the API with `hercacstables::fetch_data()`.
- Map codes back to concepts with a `dplyr::inner_join()` between the fetched
  data and your lookup tables.
- For multiple years, use `purrr::map()` and `purrr::list_rbind()` to download
  all of the data into one data frame.
- Cache the results, for example with `base::saveRDS()`.

Let's say you want to ask this question:
> How did [the number of households] differ between [Hispanic and non-Hispanic people] and across [Manchester, Nashua, and suburban areas] in [Hillsborough County, NH] during [the last 10 years]?
First, find the variables that describe the numbers of households.
One way to do this is to search `hercacstables::METADATA_FOR_ACS_GROUPS`
for groups whose `Universe` is "Households" and whose `Description` contains
"Hispanic" or "Ethnicity."
A group's `Universe` describes what it is measuring, often telling you the
units of its values.
A group's `Description` is a phrase that summarizes what it reports.
```r
#| label: search-for-relevant-groups
hercacstables::METADATA_FOR_ACS_GROUPS |>
    dplyr::filter(
        .data$ACS5,
        stringr::str_detect(.data$Universe, "Household"),
        stringr::str_detect(.data$Description, "Hispanic|Ethnic")
    ) |>
    dplyr::select("Group", "Universe", "Description") |>
    knitr::kable()
```
It looks like our best bet is group "B11001I." It is likely that group "B11001" contains counts of households of any ethnicity. The first row of most groups is the total value across all of the demographic subsets that the group tracks. Since our question does not ask about different household types, we probably just need row one from groups "B11001" and "B11001I." This also gives us a good opportunity to document the ethnicity counted by each group.
```r
#| label: show-relevant-variables
HOUSEHOLD_GROUPS <- c("B11001", "B11001I")

household_variables <- hercacstables::METADATA_FOR_ACS_VARIABLES |>
    dplyr::filter(
        .data$Dataset == "ACS1",
        .data$Group %in% HOUSEHOLD_GROUPS,
        .data$Index == 1
    ) |>
    dplyr::mutate(
        Ethnicity = dplyr::case_match(
            .data$Group,
            "B11001" ~ "All",
            "B11001I" ~ "Hispanic or Latino"
        )
    ) |>
    dplyr::select("Group", "Index", "Variable", "Ethnicity")

knitr::kable(household_variables, align = c("lrll"))
```
The next step is to find the codes that are related to the geographic areas and levels. We need to know the FIPS code for the state, county, and two cities. We also need to know that the geographic level that we're working with is "county subdivision." As the last part of this step, we define a lookup table to translate from the FIPS codes for the different geographies to a human-readable name.
```r
#| label: define-geographic-codes
NEW_HAMPSHIRE <- "33"
COUNTY_LEVEL <- "county"
HILLSBOROUGH_CO <- "011"
CITY_LEVEL <- "county subdivision"
MANCHESTER_NH <- "45140"
NASHUA_NH <- "50260"

geography_definitions <- tibble::tribble(
    ~ FIPS,          ~ Location,
    HILLSBOROUGH_CO, "County-wide",
    MANCHESTER_NH,   "Manchester",
    NASHUA_NH,       "Nashua"
)

geography_definitions |> knitr::kable()
```
This example looks for changes "in the last ten years." What that really means is the ten most recent years for which we have data. We can use the 1-year estimates for our question because all of our geographic areas have more than 65,000 people, the threshold for inclusion in the 1-year ACS. For the 1-year dataset, the latest available year will be one or two years ago. The earliest will be 11 years before that because there was no survey in 2020.
```r
#| label: define-time-range
BUILD_DATE <- Sys.Date()
LATEST_YEAR <- hercacstables::most_recent_vintage("acs", "acs1")
TEN_YEARS_AGO <- LATEST_YEAR - 11L
```
The last ten years available from the Census are, as of `r BUILD_DATE`,
`r TEN_YEARS_AGO` through `r LATEST_YEAR`.
```r
#| label: define-years
YEARS_INCLUDED <- c(TEN_YEARS_AGO:2019, 2021:LATEST_YEAR)
```
We will need to make multiple calls to the API, so it makes sense to create some reusable functions. We need two calls per year. The first one will pull the county-wide household counts. The second one will pull the household counts for each city. Each function should have "year" as its argument so that we can reuse it.
```r
#| label: fetching-functions
generalized_fetch_data <- function(.year, .level, .areas, ...) {
    hercacstables::fetch_data(
        variables = household_variables$Variable,
        year = .year,
        survey_type = "acs",
        table_or_survey_code = "acs1",
        for_geo = .level,
        for_items = .areas,
        state = NEW_HAMPSHIRE,
        ...
    )
}

fetch_county_households <- function(.year) {
    generalized_fetch_data(.year, COUNTY_LEVEL, HILLSBOROUGH_CO)
}

fetch_city_households <- function(.year) {
    generalized_fetch_data(
        .year,
        CITY_LEVEL,
        c(MANCHESTER_NH, NASHUA_NH),
        county = HILLSBOROUGH_CO
    )
}

fetch_example_data <- function(.year) {
    dplyr::bind_rows(
        fetch_county_households(.year),
        fetch_city_households(.year)
    )
}
```
This is where hercacstables
starts to come into its own.
We define the fetching process as few times as possible, just tweaking it for
related cases.
This leads to a lot of code reuse and efficiency, especially for reports that
you just need to update once a year.
```r
#| label: actually-fetch-data
#| cache: true
raw_households <- YEARS_INCLUDED |>
    purrr::map(fetch_example_data) |>
    purrr::list_rbind()
```
```r
#| label: show-the-raw-data
raw_households |>
    dplyr::filter(.data$Year == LATEST_YEAR) |>
    dplyr::mutate(
        dplyr::across(c("Value"), scales::label_comma(accuracy = 1))
    ) |>
    knitr::kable(align = "lllrrrl")
```
The raw data are not very usable. Several of the columns still contain codes rather than human-readable values. The data also do not include explicit values for the suburbs or for non-Hispanic households. The "wrangling" process is where we address these drawbacks.
We should get rid of the columns that are Census database codes and create columns about location and demographics that have human-readable values.
```r
#| label: wrangle-household-data-mapping
households <- raw_households |>
    dplyr::inner_join(
        household_variables,
        by = c("Group", "Index")
    ) |>
    dplyr::mutate(
        FIPS = dplyr::coalesce(.data$`county subdivision`, .data$county)
    ) |>
    dplyr::inner_join(
        geography_definitions,
        by = "FIPS"
    ) |>
    dplyr::select(
        "Location",
        "Year",
        "Ethnicity",
        Households = "Value"
    )

households |>
    dplyr::filter(.data$Year == LATEST_YEAR) |>
    dplyr::mutate(
        dplyr::across(c("Households"), scales::label_comma(accuracy = 1))
    ) |>
    knitr::kable(align = "rllr")
```
Now that the columns are human-readable, we can compute the values that we are
actually interested in.
That computation involves subtracting either the cities' households from the
county's, or the number of Hispanic households from the total number of
households.
This task turns up very frequently when dealing with Census data, so here is a
helper function that you can use or adapt to do it for you:
`hercacstables::subtract_parts_from_whole()`.
That function does not remove the rows that contain the all-groups category.
In our case, we must remove them so that all of the calculations come out
correctly.
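The arithmetic being automated is simple subtraction within each group of rows. A toy sketch of what the helper computes for one location:

```r
# Toy version of the "parts from whole" arithmetic that
# hercacstables::subtract_parts_from_whole() performs for each group of rows.
counts <- c(`County-wide` = 100, Manchester = 30, Nashua = 25)
suburbs <- counts[["County-wide"]] - sum(counts[c("Manchester", "Nashua")])
suburbs  # 45
```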
```r
#| label: wrangle-household-data-computing
households <- households |>
    hercacstables::subtract_parts_from_whole(
        grouping_column = "Location",
        value_column = "Households",
        whole_name = "County-wide",
        part_names = c("Manchester", "Nashua"),
        remainder_name = "Suburbs"
    ) |>
    dplyr::filter(.data$Location != "County-wide") |>
    hercacstables::subtract_parts_from_whole(
        grouping_column = "Ethnicity",
        value_column = "Households",
        whole_name = "All",
        part_names = "Hispanic or Latino",
        remainder_name = "Not Hispanic or Latino"
    ) |>
    dplyr::filter(.data$Ethnicity != "All")

households |>
    dplyr::filter(.data$Year == LATEST_YEAR) |>
    dplyr::mutate(
        dplyr::across(c("Households"), scales::label_comma(accuracy = 1))
    ) |>
    knitr::kable(align = "lrlr")
```
Now we can finally look at trends in the number of households in Hillsborough County, NH, comparing Hispanic and non-Hispanic households across Manchester, Nashua, and the suburbs.
The first step is to make some graphs. These data are four-dimensional: time, location, ethnicity, and number of households. That means we'll need more than one panel. The tables above suggested that the number of Hispanic households is much lower than the number of non-Hispanic households in all three locations. Let's make a two-panel graph in which each panel shows one ethnicity, so that the panels can have separate y-axes. Both panels will have time on the x-axis and number of households on the y-axis, and will designate location with the color of points and lines.
```r
#| label: plot-the-results
#| fig-width: 7
#| fig-height: 6
#| dpi: 96
#| warning: false
households |>
    ggplot2::ggplot(
        ggplot2::aes(
            x = .data$Year,
            y = .data$Households,
            color = .data$Location,
            group = .data$Location
        )
    ) +
    ggplot2::geom_line(linewidth = 2) +
    ggplot2::geom_point(size = 5) +
    ggplot2::scale_x_continuous(
        name = NULL,
        breaks = scales::breaks_width(5),
        minor_breaks = scales::breaks_width(1)
    ) +
    ggplot2::scale_y_continuous(
        name = "Number of households",
        limits = c(0, NA),
        labels = scales::label_comma(accuracy = 1)
    ) +
    ggplot2::scale_color_viridis_d(
        name = NULL,
        guide = ggplot2::guide_legend(position = "top")
    ) +
    ggplot2::facet_grid(
        rows = ggplot2::vars(.data$Ethnicity),
        scales = "free_y"
    )
```
It looks like the number of households is growing, with non-Hispanic households increasing in the suburbs and Hispanic households increasing in the cities.
Let's test this with an ANCOVA.
We'll subtract `r TEN_YEARS_AGO` from the year so that the intercept estimate
gives us the value in `r TEN_YEARS_AGO`, not AD 0.
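Centering a predictor like this changes only the intercept, never the slopes. A toy example with made-up numbers:

```r
# Toy illustration: shifting the year predictor moves the intercept to a
# meaningful baseline year without changing the slope estimate.
years <- 2014:2023
y <- 10 + 2 * (years - 2014)  # exactly linear toy data

fit_raw <- lm(y ~ years)
fit_centered <- lm(y ~ I(years - 2014))

coef(fit_centered)[["(Intercept)"]]  # 10, the value in 2014
coef(fit_raw)[["(Intercept)"]]       # an extrapolation back to AD 0
```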
```r
#| label: statistical-model
household_model <- households |>
    dplyr::mutate(
        Year = .data$Year - TEN_YEARS_AGO,
        Location = factor(
            .data$Location,
            levels = c("Suburbs", "Manchester", "Nashua")
        ),
        Ethnicity = factor(
            .data$Ethnicity,
            levels = c("Not Hispanic or Latino", "Hispanic or Latino")
        )
    ) |>
    lm(Households ~ Year * Ethnicity * Location, data = _)
```
I always like to look at the ANOVA table first to get a 10,000 meter view before I try to interpret specific parameters.
```r
#| label: anova-table
household_model |>
    anova() |>
    broom::tidy() |>
    dplyr::mutate(
        dplyr::across(c("sumsq", "meansq"), scales::label_comma(accuracy = 1)),
        dplyr::across("statistic", \(.) signif(., 4)),
        dplyr::across("p.value", \(.) round(., 4))
    ) |>
    knitr::kable(align = "lrrrrr")
```
It looks like EVERYTHING is significant, so let's look at all of the parameters that are at least marginally significant (p < 0.1).
```r
#| label: significant-parameters
household_model |>
    broom::tidy() |>
    dplyr::filter(.data$`p.value` < 0.1) |>
    dplyr::mutate(
        dplyr::across(
            c("estimate", "std.error", "statistic"),
            \(.) signif(., 4)
        ),
        dplyr::across("p.value", \(.) round(., 4))
    ) |>
    knitr::kable(align = "lrrrr")
```
The number of households in Hillsborough County overall grew from
`r TEN_YEARS_AGO` to `r LATEST_YEAR`.
The number of Hispanic households was much higher in the two cities than in the
suburban parts of the county.
This difference became more pronounced over the decade, for two reasons:
the number of Hispanic households grew more quickly in Nashua and Manchester
than in the suburban areas, and the number of non-Hispanic households grew more
quickly in the suburbs than in the cities.
Isn't it nice when the visual patterns are corroborated by significant
statistics?