NHGIS API Requests"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(vcr)

vcr_dir <- "fixtures"

have_api_access <- TRUE

if (!nzchar(Sys.getenv("IPUMS_API_KEY"))) {
  if (dir.exists(vcr_dir) && length(dir(vcr_dir)) > 0) {
    # Fake API token to fool ipumsr API functions
    Sys.setenv("IPUMS_API_KEY" = "foobar")
  } else {
    # If there are no mock files nor API token, can't run API tests
    have_api_access <- FALSE
  }
}

vcr_configure(
  filter_sensitive_data = list(
    "<<<IPUMS_API_KEY>>>" = Sys.getenv("IPUMS_API_KEY")
  ),
  write_disk_path = vcr_dir,
  dir = vcr_dir
)

check_cassette_names()

# We do not expose detailed pagination options to users, but we do not want
# to save a full record of summary metadata in a .yml fixture for this
# vignette. This helper allows us to request just a few records, which
# we pretend is the full set of records for the purposes of the vignette.
get_truncated_metadata <- function(collection,
                                   type,
                                   page_size = 10,
                                   max_pages = 1,
                                   api_key = Sys.getenv("IPUMS_API_KEY")) {
  url <- ipumsr:::api_request_url(
    collection = collection,
    path = ipumsr:::metadata_request_path(collection, type),
    queries = list(pageNumber = 1, pageSize = page_size)
  )

  responses <- ipumsr:::ipums_api_paged_request(
    url = url,
    max_pages = max_pages,
    delay = 0,
    api_key = api_key
  )

  metadata <- purrr::map_dfr(
    responses,
    function(res) {
      content <- jsonlite::fromJSON(
        httr::content(res, "text"),
        simplifyVector = TRUE
      )

      content$data
    }
  )

  # Recursively convert all metadata data.frames to tibbles and all
  # camelCase names to snake_case
  ipumsr:::convert_metadata(metadata)
}

This vignette details the options available for requesting IPUMS NHGIS data and metadata via the IPUMS API.

If you haven't yet learned the basics of the IPUMS API workflow, you may want to start with the IPUMS API introduction. The code below assumes you have registered and set up your API key as described there.

In addition to NHGIS, the IPUMS API also supports several microdata projects. For details about obtaining IPUMS microdata using ipumsr, see the microdata-specific vignette.

Before getting started, we'll load ipumsr and some helpful packages for this demo:

library(ipumsr)
library(dplyr)
library(purrr)

Basic IPUMS NHGIS concepts

IPUMS NHGIS supports 3 main types of data products: datasets, time series tables, and shapefiles.

IPUMS NHGIS metadata

Of course, to make a request for any of these data sources, we have to know the codes that the API uses to refer to them. Fortunately, we can browse the metadata for all available IPUMS NHGIS data sources with get_metadata_nhgis().

Users can view summary metadata for all available data sources of a given data type, or detailed metadata for a specific data source by name.

Summary metadata

insert_cassette("nhgis-metadata-summary")

To see a summary of all available sources for a given data product type, use the type argument. This returns a data frame containing the available datasets, data tables, time series tables, or shapefiles.

ds <- get_metadata_nhgis(type = "datasets")

head(ds)

We can use basic functions from {dplyr} to filter the metadata to those records of interest. For instance, if we wanted to find all the data sources related to agriculture from the 1900 Census, we could filter on group and description:

ds %>%
  filter(
    group == "1900 Census",
    grepl("Agriculture", description)
  )

The values listed in the name column correspond to the code that you would use to request that dataset when creating an extract definition to be submitted to the IPUMS API.

Similarly, for time series tables:

# Secretly get truncated number of tst records because otherwise the .yml
# fixture becomes very large.

# Make sure that any code that uses this metadata is consistent with the output
# that would be obtained were the entire metadata set loaded!
tst <- get_truncated_metadata("nhgis", "time_series_tables")
tst <- get_metadata_nhgis("time_series_tables")

While some of the metadata fields are consistent across different data types, some, like geographic_integration, are specific to time series tables:

head(tst)

Note that for time series tables, some metadata fields are stored in list columns, where each entry is itself a data frame:

tst$years[[1]]
tst$geog_levels[[1]]

To filter on these columns, we can use map_lgl() from {purrr}. For instance, to find all time series tables that include data from a particular year:

# Iterate over each `years` entry, identifying whether that entry
# contains "1840" in its `name` column.
tst %>%
  filter(map_lgl(years, ~ "1840" %in% .x$name))

For more details on working with nested data frames, see this tidyr article.

eject_cassette("nhgis-metadata-summary")

Detailed metadata

insert_cassette("nhgis-metadata-detailed")

Once we have identified a data source of interest, we can find out more about its detailed options by providing its name to the corresponding argument of get_metadata_nhgis():

cAg_meta <- get_metadata_nhgis(dataset = "1900_cAg")

This provides a comprehensive list of the possible specifications for the input data source. For instance, for the 1900_cAg dataset, we have 66 tables to choose from, and 3 possible geographic levels:

cAg_meta$data_tables
cAg_meta$geog_levels

You can also get detailed metadata for an individual data table. Since data tables belong to specific datasets, both need to be specified to identify a data table:

get_metadata_nhgis(dataset = "1900_cAg", data_table = "NT2")

Note that the name element is the one that contains the codes used for interacting with the IPUMS API. The nhgis_code element refers to the prefix attached to individual variables in the output data, and the API will throw an error if you use it in an extract definition. For more details on interpreting each of the provided metadata elements, see the documentation for get_metadata_nhgis().

Now that we have identified some of our options, we can go ahead and define an extract request to submit to the IPUMS API.

eject_cassette("nhgis-metadata-detailed")

Defining an IPUMS NHGIS extract request

To create an extract definition containing the specifications for a specific set of IPUMS NHGIS data, use define_extract_nhgis().

When you define an extract request, you can specify the data to be included in the extract and indicate the desired format and layout.

Basic extract definitions

Let's say we're interested in getting state-level data on the number of farms and their average size from the 1900_cAg dataset that we identified above. As we can see in the metadata, these data are contained in tables NT2 and NT3:

cAg_meta$data_tables

Dataset specifications

To request these data, we need to make an explicit dataset specification. All datasets must be associated with a selection of data tables and geographic levels. We can use the ds_spec() helper function to specify our selections for these parameters. ds_spec() bundles all the selections for a given dataset together into a single object (in this case, a ds_spec object):

dataset <- ds_spec(
  "1900_cAg",
  data_tables = c("NT1", "NT2"),
  geog_levels = "state"
)

str(dataset)

This dataset specification can then be provided to the extract definition:

nhgis_ext <- define_extract_nhgis(
  description = "Example farm data in 1900",
  datasets = dataset
)

nhgis_ext

Dataset specifications can also include selections for years and breakdown_values, but these are not available for all datasets.

Time series table specifications

Similarly, to make a request for time series tables, use the tst_spec() helper. This makes a tst_spec object containing a time series table specification.

Time series tables do not contain individual data tables, but do require a geographic level selection, and allow an optional selection of years:

define_extract_nhgis(
  description = "Example time series table request",
  time_series_tables = tst_spec(
    "CW3",
    geog_levels = c("county", "tract"),
    years = c("1990", "2000")
  )
)

Shapefile specifications

Shapefiles don't have any additional specification options, and therefore can be requested simply by providing their names:

define_extract_nhgis(
  description = "Example shapefiles request",
  shapefiles = c("us_county_2021_tl2021", "us_county_2020_tl2020")
)

Invalid specifications

An attempt to define an extract that does not have all the required specifications for a given dataset or time series table will throw an error:

define_extract_nhgis(
  description = "Invalid extract",
  datasets = ds_spec("1900_STF1", data_tables = "NP1")
)

Note that it is still possible to make invalid extract requests (for instance, by requesting a dataset or data table that doesn't exist). This kind of issue will be caught upon submission to the API, not upon the creation of the extract definition.

More complicated extract definitions

It's possible to request data for multiple datasets (or time series tables) in a single extract definition. To do so, pass a list of ds_spec or tst_spec objects in define_extract_nhgis():

define_extract_nhgis(
  description = "Slightly more complicated extract request",
  datasets = list(
    ds_spec("2018_ACS1", "B01001", "state"),
    ds_spec("2019_ACS1", "B01001", "state")
  ),
  shapefiles = c("us_state_2018_tl2018", "us_state_2019_tl2019")
)

For extracts with multiple datasets or time series tables, it may be easier to generate the specifications independently before creating your extract request object. You can quickly create multiple ds_spec objects by iterating across the specifications you want to include. Here, we use {purrr} to do so, but you could also use a for loop:

ds_names <- c("2019_ACS1", "2018_ACS1")
tables <- c("B01001", "B01002")
geogs <- c("county", "state")

# For each dataset to include, create a specification with the
# data tabels and geog levels indicated above
datasets <- purrr::map(
  ds_names,
  ~ ds_spec(name = .x, data_tables = tables, geog_levels = geogs)
)

nhgis_ext <- define_extract_nhgis(
  description = "Slightly more complicated extract request",
  datasets = datasets
)

nhgis_ext

This workflow also makes it easy to quickly update the specifications in the future. For instance, to add the 2017 ACS 1-year data to the extract definition above, you'd only need to add "2017_ACS1" to the ds_names variable. The iteration would automatically add the selected tables and geog levels for the new dataset. (This workflow works particularly well for ACS datasets, which often have the same data table names across datasets.)

Data layout and file format

IPUMS NHGIS extract definitions also support additional options to modify the layout and format of the extract's resulting data files.

For extracts that contain time series tables, the tst_layout argument indicates how the longitudinal data should be organized.

For extracts that contain datasets with multiple breakdowns or data types, use the breakdown_and_data_type_layout argument to specify a layout . This is most common for data sources that contain both estimates and margins of error, like the ACS.

File formats can be specified with the data_format argument. IPUMS NHGIS currently distributes files in csv and fixed-width format.

See the documentation for define_extract_nhgis() for more details on these options.

Next steps

Once you have defined an extract request, you can submit the extract for processing:

nhgis_ext_submitted <- submit_extract(nhgis_ext)

The workflow for submitting and monitoring an extract request and downloading its files when complete is described in the IPUMS API introduction.



Try the ipumsr package in your browser

Any scripts or data that you put into this service are public.

ipumsr documentation built on Sept. 12, 2024, 7:38 a.m.