#| label: setup #| include: FALSE knitr::opts_chunk$set( collapse = TRUE, comment = "#>", cache = FALSE )
The American Community Survey API returns data in a very terse format. Each response contains four parts: geographic information, a table code, a row number, and a value. The geographic information varies by the level of detail one asks for. The value is a number that may be a population size, number of households, an income in dollars, a percentage, or several other quantities. The table code and row number are what determine the exact meaning of the value.
Generally speaking, all of the values in a table will be reporting the same kind of information. Again, the information might be counts of people or households, incomes or costs in dollars, hours commuting or working, or a percentage of just about anything. The row numbers then let you know what group of people the information is for. The first row always has a value for the entire population of a geographic area. Subsequent rows may have information about a very specific subgroup, or may contain a summary value for a combination of subgroups. For example, in a sex-and-age table, you might find the value for all men and boys (males) in row 2, then the value for boys under 5 in row 3. In short, a lot of information is packed into the two fields of table name and row number.
While packing information is efficient for storing and transmitting data, it
means that users of Census information must unpack things in order to be clear.
The hercacstables
package has many tools and glossary tables that should make
it easier and more convenient to unpack Census data.
This vignette describes some features of hercacstables
that help with
the common and repetitive chores of unpacking racial and ethnic data from
Census API responses.
The Census has several ways that it reports by racial and ethnic identity.
They do not always all agree or line up.
That makes sense because the concepts of race and ethnicity are slippery,
changing across place and time.
The first racial/ethnic system that we will discuss involves ten categories. There are seven categories of race, one of ethnicity, one that combines both, and an "All" category: "Total", "White alone", "Black or African American alone", "American Indian and Alaska Native alone", "Asian alone", "Native Hawaiian and Other Pacific Islander alone", "Some Other Race alone", "Two or More Races", "White alone, not Hispanic or Latino", and "Hispanic or Latino". This is a very coarse and unsophisticated way of characterizing identities. Nevertheless, it occurs throughout the American Community Survey.
#| label: find-subtables-with-ethnicity #| echo: false broad_subtables <- hercacstables::METADATA_FOR_ACS_GROUPS |> dplyr::filter(stringr::str_detect(.data$Group, "I$")) |> dplyr::select("Group", "Description") |> dplyr::mutate( Group = stringr::str_remove(.data$Group, "I$"), Description = .data$Description |> stringr::str_remove("\\(HISPANIC OR LATINO[^)]*\\)") |> stringr::str_squish() )
There are r nrow(broad_subtables)
separate sets of tables that subdivide their
information according to this ten-category scheme.
Each identity is designated by a suffix at the end of a table's name.
In other words, all of the values from a table with one of these suffixes in its
name will pertain to people of just one racial/ethnic category.
Since the case is so common, hercacstables
describes it with the glossary
table hercacstables::RACE_ETHNICITY_SUBTABLES
.
It maps from the suffix in a table's name to the racial/ethnic category the
table describes.
#| label: show-race-ethnicity-subtable-glossary #| echo: false hercacstables::RACE_ETHNICITY_SUBTABLES |> dplyr::select( "Census Race", "Suffix" ) |> knitr::kable()
#| label: find-race-in-table-descriptions #| echo: false tables_that_name_race <- hercacstables::METADATA_FOR_ACS_GROUPS |> dplyr::filter( stringr::str_detect(.data$Group, "\\d$"), stringr::str_detect(.data$Description, stringr::regex("RACE", ignore_case = TRUE)) ) |> dplyr::select( "Group", "Description" ) |> dplyr::left_join( dplyr::count(hercacstables::METADATA_FOR_ACS_VARIABLES, .data$Group, name = "rows"), by = "Group" )
There are also r nrow(tables_that_name_race)
tables that have the word "RACE"
in their description.
#| label: show-the-tables-that-name-race #| echo: false knitr::kable(tables_that_name_race)
They seem to fall into four categories.
There are two tables with 10 rows each, B02001
and B25006
.
These are probably convenience tables that pull data from other race-specific
subtables.
Six tables, B02008
through B020013
, have only one row each.
These seem to report inclusive counts, combining people who claim a
racial identity either as their sole identity or in combination with another.
One table, B03002
, has 21 rows.
This table appears to report detailed information about Hispanic ethnicity and
specific racial identities.
Tables B98013
and B99021
have descriptions that suggest that they report
methodological details.
Each of these categories of table deserves a little more discussion.
Tables B02001
and B25006
deal with counts of individuals and households,
respectively.
They seem to be redundant, showing the first rows of the tables in the B01001
and B11001
, respectively.
We can use hercacstables
to check that.
The first step is to make a table that connects the Census's opaque variable
names to the real-world meanings that we are interested in.
For example, tables B01001*
and B02001
deal with counts of people,
while tables B11001*
and B25006
deal with counts of households.
Similarly, the racial identity being counted is defined either by the subtable
suffix or the row number.
Finally, we are interested in whether the data come from, a subtable or a
convenience table.
We can lay that all out in a way that "maps" from the Census variable to the
real-world meaning.
#| label: check-variables-for-convenience-tables convenience_check_variables <- tibble::tribble( ~ Table, ~ Suffix, ~ Index, ~ Population, ~ Race, "B01001", "A", 1, "People", "White", "B01001", "B", 1, "People", "Black", "B02001", "", 2, "People", "White", "B02001", "", 3, "People", "Black", "B11001", "A", 1, "Households", "White", "B11001", "B", 1, "Households", "Black", "B25006", "", 2, "Households", "White", "B25006", "", 3, "Households", "Black" ) |> dplyr::mutate( Group = paste0(.data$Table, .data$Suffix), Variable = hercacstables::build_api_variable(group_code = .data$Table, race_code = .data$Suffix, item_number = .data$Index), `Table Type` = dplyr::if_else(nchar(.data$Suffix) > 0, "Subtable", "Convenience") )
#| label: show-the-convenience-table #| echo: false knitr::kable(convenience_check_variables)
With our variables defined, we can fetch the data from the Census API. Notice that, since this is an API call, the code block is set to cache its results. API calls are much slower than local functions, so it is usually a good idea to isolate them and run them as few times as possible.
#| label: check-values-for-convenience-tables LATEST_YEAR <- hercacstables::most_recent_vintage("acs", "acs1") convenience_check_values_raw <- hercacstables::fetch_data( convenience_check_variables$Variable, year = LATEST_YEAR, # the most recent one available at the time of writing for_geo = "us", # the entire nation for_items = "*", # all nation-level geographies survey_type = "acs", # as opposed to, e.g. "dec," for Decennial survey data table_or_survey_code = "acs5" # the specific survey is the 5-year ACS. )
Now that we have the raw data, we can check to see if the values from the convenience tables do, in fact, match up with the values from the subtables.
#| label: wrangle-convenience-check-data convenience_check_values <- convenience_check_values_raw |> dplyr::inner_join( convenience_check_variables, by = c("Group", "Index") ) |> dplyr::select( "Population", "Race", "Table Type", "Value" ) |> tidyr::pivot_wider( names_from = "Table Type", values_from = "Value" ) |> dplyr::mutate( Identical = dplyr::if_else(.data$Subtable == .data$Convenience, "Yes", "No") ) convenience_check_values |> dplyr::mutate( dplyr::across(tidyselect::all_of(c("Subtable", "Convenience")), scales::label_comma(accuracy = 1)) ) |> knitr::kable( align = "llrrl" )
We don't have to include any of the confusing "Group," "Index," or "Variable" columns in our final result.
Six others, tables "B02008" through "B02013", show the numbers of people who identified with specific races and ethnicities. The totals from these tables will be larger than the US population because someone who identified with more than one category will be counted in each corresponding table.
These tables correspond to six of the ten broad categories of race/ethnicity, so
they are actually already in RACE_ETHNICITY_SUBTABLES
.
I just hid them before because it would have been confusing.
#| label: show-inclusive-columns-too #| echo: false hercacstables::RACE_ETHNICITY_SUBTABLES |> dplyr::select( "Census Race", "Suffix", "Inclusive Group" ) |> knitr::kable()
The population reported in the inclusive tables should be as large, or larger,
than the population reported in the exclusive tables.
We can use hercacstables
to check this, too.
#| label: inclusive-exclusive-population-variables incl_excl_pop_variables <- hercacstables::RACE_ETHNICITY_SUBTABLES |> dplyr::filter( nchar(.data$`Inclusive Group`) > 0 ) |> dplyr::mutate( `Exclusive Group` = paste0("B01001", .data$Suffix) ) |> dplyr::select( "Census Race", Inclusive = "Inclusive Group", Exclusive = "Exclusive Group" ) |> tidyr::pivot_longer( cols = tidyselect::ends_with("clusive"), names_to = "Type of count", values_to = "Group" ) |> dplyr::mutate( Variable = hercacstables::build_api_variable(.data$Group, 1) ) knitr::kable(incl_excl_pop_variables)
#| label: fetch-inclusive-exclusive-populations #| cache: true incl_excl_pop_values_raw <- hercacstables::fetch_data( variables = incl_excl_pop_variables$Variable, year = 2022, for_geo = "us", for_items = "*", survey_type = "acs", table_or_survey_code = "acs5" )
#| label: wrangle-inclusive-exclusive-populations incl_excl_pop_values <- incl_excl_pop_values_raw |> dplyr::inner_join( incl_excl_pop_variables, by = c("Group") ) |> dplyr::select( "Census Race", "Type of count", "Value" ) |> tidyr::pivot_wider( names_from = "Type of count", values_from = "Value" ) |> dplyr::mutate( Difference = .data$Inclusive - .data$Exclusive, `Percent Multiracial` = .data$Difference / .data$Inclusive ) |> dplyr::arrange( dplyr::desc(.data$Inclusive) ) incl_excl_pop_values |> dplyr::mutate( dplyr::across(tidyselect::all_of(c("Inclusive", "Exclusive", "Difference")), scales::label_comma(accuracy = 1)), `Percent Multiracial` = scales::label_percent(accuracy = 1)( .data$`Percent Multiracial` ) ) |> knitr::kable( align = "lrrrr" )
Table B03002
contains counts of people by Hispanic ethnicity for each of the
ten broad racial identities.
As always, the first row is the total population size.
There are then two groups of rows.
Rows 2-11 count people who are not Hispanic.
Rows 12-21 count people who identify as Hispanic.
Rows 2 and 12 are the total populations of non-Hispanic and Hispanic people.
Rows 10, 11, 20, and 21 contain subgroupings that distinguish between people who
identify as biracial and those who identify as multiracial.
That means that we can take rows one through eight of
hercacstables::RACE_ETHNICITY_SUBTABLES
and map them onto rows in
B03002
.
In fact, this is so useful that it is also included in
hercacstables::RACE_ETHNICITY_SUBTABLES
.
#| label: hispanic-and-broad-race-variables #| echo: false hercacstables::RACE_ETHNICITY_SUBTABLES |> dplyr::filter( nchar(.data$`non-Hispanic`) > 0 ) |> dplyr::select( "Census Race", "non-Hispanic", "Hispanic" ) |> knitr::kable()
Let's use these to look at nationwide trends across racial identities in their percentages of Hispanic ethnicity.
First, we will define our glossary table.
#| label: define-hispanic-and-broad-race-variable hispanic_and_broad_race_variables <- hercacstables::RACE_ETHNICITY_SUBTABLES |> dplyr::filter( nchar(.data$`non-Hispanic`) > 0 ) |> dplyr::select( "Census Race", "non-Hispanic", "Hispanic" ) |> tidyr::pivot_longer( cols = tidyselect::ends_with("Hispanic"), names_to = "Ethnicity", values_to = "Variable" ) |> tidyr::separate_wider_position( cols = "Variable", widths = c(Group = 6, 1, Index = 3, 1), cols_remove = FALSE ) |> dplyr::mutate( Index = as.integer(.data$Index) ) knitr::kable(hispanic_and_broad_race_variables)
Next, we pull r LATEST_YEAR - 2005
years of ACS data.
#| label: fetch-hispanic-and-broad-race-data hispanic_and_broad_race_raw <- c(2005:2019, 2021:LATEST_YEAR) |> purrr::map( ~ hercacstables::fetch_data( hispanic_and_broad_race_variables$Variable, year = ., for_geo = "us", for_items = "*", survey_type = "acs", table_or_survey_code = "acs1" ) )
Then, we put the data into a nice, tidy format.
#| label: tidy-hispanic-and-broad-race-data hispanic_and_broad_race <- hispanic_and_broad_race_raw |> purrr::list_rbind() |> dplyr::inner_join( hispanic_and_broad_race_variables, by = "Index" ) |> dplyr::select( "Census Race", "Ethnicity", "Year", "Value" ) |> tidyr::pivot_wider( names_from = "Ethnicity", values_from = "Value", values_fill = 0 ) |> dplyr::mutate( Total = .data$Hispanic + .data$`non-Hispanic`, `Percent Hispanic` = .data$Hispanic / .data$Total ) hispanic_and_broad_race |> dplyr::slice_sample( n = 1, by = "Year" ) |> dplyr::mutate( dplyr::across(c("Hispanic", "non-Hispanic", "Total"), scales::label_comma(accuracy = 1)), dplyr::across("Percent Hispanic", scales::label_percent(accuracy = 1)) ) |> knitr::kable( align = "lrrrrr" )
Finally, we plot it
#| label: plot-changes-in-hispanic-percentage #| fig-dim: !expr "c(8, 8)" #| dpi: 72 hispanic_and_broad_race |> dplyr::mutate( `Census Race` = stringr::str_to_title(.data$`Census Race`) ) |> ggplot2::ggplot( ggplot2::aes(x = .data$Year, y = .data$`Percent Hispanic`, color = .data$`Census Race`) ) + ggplot2::geom_line( linewidth = 1, lineend = "round", linejoin = "mitre" ) + ggplot2::geom_point( size = 3 ) + ggplot2::scale_x_continuous( name = NULL, labels = scales::label_number(big.mark = ""), limits = c(2005, 2025), breaks = scales::breaks_width(5), minor_breaks = scales::breaks_width(1) ) + ggplot2::scale_y_continuous( name = "Percentage Hispanic", labels = scales::label_percent(accuracy = 1), limits = c(0, 1), breaks = scales::breaks_width(0.2), minor_breaks = scales::breaks_width(0.05) ) + ggplot2::scale_color_discrete( guide = ggplot2::guide_legend( title = NULL, position = "top", nrow = 4 ) ) + ggplot2::theme_minimal()
Tables B98013
and B99021
give information about the Census's data collection
methods.
Statisticians can use to describe how much uncertainty there is in the data
concerning racial identities.
We can skip those.
#| label: show-broad-subtables #| echo: false knitr::kable(broad_subtables)
| Minimum Race/Ethnicity Reporting Category | Definition | |-------------------------------------------|------------| | American Indian or Alaska Native | Individuals with origins in any of the original peoples of North, Central, and South America, including, for example, Navajo Nation, Blackfeet Tribe of the Blackfeet Indian Reservation of Montana, Native Village of Barrow Inupiat Traditional Government, Nome Eskimo Community, Aztec, and Maya. | | Asian | Individuals with origins in any of the original peoples of Central or East Asia, Southeast Asia, or South Asia, including, for example, Chinese, Asian Indian, Filipino, Vietnamese, Korean, and Japanese. | | Black or African American | Individuals with origins in any of the Black racial groups of Africa, including, for example, African American, Jamaican, Haitian, Nigerian, Ethiopian, and Somali. | | Hispanic or Latino | Includes individuals of Mexican, Puerto Rican, Salvadoran, Cuban, Dominican, Guatemalan, and other Central or South American or Spanish culture or origin. | | Middle Eastern or North African | Individuals with origins in any of the original peoples of the Middle East or North Africa, including, for example, Lebanese, Iranian, Egyptian, Syrian, Iraqi, and Israeli. | | Multiracial and/or Multiethnic | Those who identify with multiple race/ethnicity minimum reporting categories. | | Native Hawaiian or Pacific Islander | Individuals with origins in any of the original peoples of Hawaii, Guam, Samoa, or other Pacific Islands, including, for example, Native Hawaiian, Samoan, Chamorro, Tongan, Fijian, and Marshallese. | | White | Individuals with origins in any of the original peoples of Europe, including, for example, English, German, Irish, Italian, Polish, and Scottish. |
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.