In mnpopcenter/ipumsr: Read 'IPUMS' Extract Files

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)

Warning: New features currently in development

The IPUMS USA microdata extract API is still in testing and is not publicly available. A call for beta testers will go out to all IPUMS USA users in late 2021 or early 2022. Moreover, the ipumsr functions for interacting with the API are only included in the development version of the package on GitHub, not in the version on CRAN. If you are a beta tester, you can download the development version of ipumsr, which includes API functions, using:

if (!require(remotes)) install.packages("remotes")
remotes::install_github("mnpopcenter/ipumsr/ipumsexamples")
remotes::install_github("mnpopcenter/ipumsr", build_vignettes = TRUE)

Overview

The IPUMS USA microdata extract API allows registered IPUMS USA users to define extracts, submit extract requests, and download extract files without visiting the IPUMS website. ipumsr includes functions that help R users
interact with the extract API from their R session.

library(ipumsr)
library(dplyr) # not necessary to use API functions, but used in some examples
library(purrr) # not necessary to use API functions, but used in some examples

Setting up your API key

If you don’t have an IPUMS USA account, register for access.

Once you're registered, you'll need to create an API key.

Once you've created an API key, you can choose to supply it as a function argument whenever interacting with the API, or you can set the value of the IPUMS_API_KEY environment variable to your key.

To set the value of the IPUMS_API_KEY environment variable, you can use:

Sys.setenv(IPUMS_API_KEY = "paste-your-key-here")

Or, if you want R to automatically set the value of that environment variable at the beginning of each session, add code like the following to a file named ".Renviron" in your project directory or user home directory:

IPUMS_API_KEY = "paste-your-key-here"

If you use the ".Renviron" approach with a project that uses Git for version control, be sure to add the ".Renviron" file to ".gitignore" so that your API key is not shared when someone clones your repository.

Defining your extract

The define_extract_micro() function returns an object of class "ipums_extract" which can then be submitted using the submit_extract() function.

extract_definition <- define_extract_micro(
  collection = "usa",
  description = "Extract for API vignette",
  samples = c("us2018a","us2019a"),
  variables = c("AGE","SEX","RACE","STATEFIP"),
  data_format = "fixed_width",
  data_structure = "rectangular",
  rectangular_on = "P"
)

Note that samples are specified using special sample ID codes, which can be browsed here on the IPUMS USA website.

The data_format, data_structure, and rectangular_on arguments have default values which match the values specified in the example above, and you can omit the argument names as long as you maintain the proper order, so you could define the same extract with:

extract_definition <- define_extract_micro(
  "usa",
  "Extract for API vignette",
  c("us2018a","us2019a"),
  c("AGE","SEX","RACE","STATEFIP")
)

Submitting your extract

To submit your extract, use:

submit_extract(extract_definition)

However, like the define_extract_micro() function, the submit_extract() function returns an "ipums_extract" object, and the returned object has been updated to include the extract number, so it can be useful to save that return object by assigning a name to it, like this:

submitted_extract <- submit_extract(extract_definition)

That way, you can use the submitted_extract object as input to check the extract's status, as shown in the next section, or to reference the extract number:

submitted_extract$number

Checking the status of your extract

To retrieve the latest status of an extract, you can use the get_extract_info() function. get_extract_info() returns an "ipums_extract" object with the "status" element updated to reflect the latest status of the extract, and the "download_links" element updated to include links to any extract files that are available for download.

The "status" of a submitted extract is one of "queued", "started", "produced", "canceled", "failed", or "completed". Only "completed" extracts can be downloaded, but "completed" extracts older than 72 hours may not be available for download, since extract files are removed after that time (see discussion of the is_extract_ready() function below).

If you assigned a name to the return value of submit_extract(), as shown above, you could get updated information on the extract, returned as an "ipums_extract" object, with:

submitted_extract <- get_extract_info(submitted_extract)

To print the latest status, you can use:

submitted_extract$status

If you forget to capture the return value of submit_extract(), you can get all information on your most recent extract (including the extract number) with:

submitted_extract <- get_last_extract_info("usa")

get_last_extract_info() is just a convenience wrapper around get_recent_extracts_info_list(), described below.

If you don't have an "ipums_extract" object in your environment that describes the extract you're interested in, and you don't want the most recent extract, you can also query the latest status of an extract by supplying the name of the IPUMS data collection and extract number of the extract, in one of two formats. Here's how you'd get the latest information on IPUMS USA extract number 33:

submitted_extract <- get_extract_info("usa:33")

submitted_extract <- get_extract_info(c("usa", "33"))

Note that in the first format, there are no spaces before or after the colon, and that in both formats, there is no need to zero-pad the extract number -- in other words, use "33", not "00033".

If you want R to periodically check the status of your extract, and only return an updated "ipums_extract" object once the extract is ready to download, you can use wait_for_extract(), as shown below:

downloadable_extract <- wait_for_extract(submitted_extract)

wait_for_extract() also accepts the same "collection:number" and c("collection", "number") specifications shown above:

downloadable_extract <- wait_for_extract("usa:33")

downloadable_extract <- wait_for_extract(c("usa", "33"))

For large extracts that take a long time to produce, or when the IPUMS servers are busy, you may not want to use wait_for_extract(), as it will tie up your R session until the extract is ready to download. Alternatively, you can use the timeout_seconds argument to set the maximum number of seconds you want the function to wait. By default, that argument is set to 10,800 seconds (3 hours).

One additional way to check whether your extract is ready to download is using the is_extract_ready() function. This function accepts either an "ipums_extract" object or a "collection:number" or c("collection", "number") specification, and returns a single TRUE or FALSE value indicating whether the extract is ready to be downloaded.

is_extract_ready(submitted_extract)
is_extract_ready("usa:33")
is_extract_ready(c("usa", "33"))

Note that the API has a limit of 60 requests with the same API key per minute, so you wouldn't want to write a loop that repeatedly uses is_extract_ready() to check your extract status.

Downloading your extract

Once your extract is ready to download, use the download_extract() function to download the data and DDI codebook files to your computer. The download_extract() function returns the path to the DDI codebook file, which can be used to read in the downloaded data with ipumsr functions. By default, the function will download files into your current working directory, but alternative locations can be specified with the download_dir argument.

ddi_path <- download_extract(submitted_extract)

# Assuming you chose either "fixed-width" or "csv" for your extract's "data_format"
ddi <- read_ipums_ddi(ddi_path)
data <- read_ipums_micro(ddi)

Or, using a "collection:number" or c("collection", "number") specification:

ddi_path <- download_extract("usa:33")
ddi_path <- download_extract(c("usa", "33"))

Sharing an extract definition

One exciting feature enabled by the IPUMS USA data extract API is the ability to share a standardized extract definition with other IPUMS users so that they can create an identical extract for themselves. ipumsr facilitates this by offering the functions save_extract_as_json() and define_extract_from_json() to write extract definitions to and read extract definitions from a standardized JSON-formatted file.

To write the definition of your USA extract number 33 to a JSON-formatted file that can be shared with other users, you could use:

usa_extract_33 <- get_extract_info("usa:33")
save_extract_as_json(usa_extract_33, file = "usa_extract_33.json")

Then, you or another user could use that JSON file to create a duplicate "ipums_extract" object with the same definition, and submit it, using:

clone_of_usa_extract_33 <- define_extract_from_json("usa_extract_33.json", "usa")
submitted_extract <- submit_extract(clone_of_usa_extract_33)

Note that the code in the previous chunk assumes that the file is saved in the current working directory. If it's saved somewhere else, replace "usa_extract_33.json" with the full path to the file.

Revising a previous extract

ipumsr also includes a convenience function for revising a previous extract definition, facilitating a "revise and resubmit" workflow. Here's how you would pull down the definition of USA extract number 33 and add a sample and a variable to it:

old_extract <- get_extract_info("usa:33")
new_extract <- revise_extract_micro(
  old_extract,
  samples_to_add = "us2018a",
  vars_to_add = "RELATE"
)

The revise_extract_micro() function returns an "ipums_extract" object that has been modified as requested and has been reset to an unsubmitted state, by stripping the extract number, status, and download links from the original extract. The revised extract can then be submitted with:

newly_submitted_extract <- submit_extract(new_extract)

Getting info on multiple recent extracts {#recent}

You can query the API for the details and status of up to ten recent extracts using the functions get_recent_extracts_info_list() and get_recent_extracts_info_tbl(). The _list version of the function returns a list of "ipums_extract" objects, whereas the _tbl version returns a tibble (enhanced "data.frame") in which each row contains information on one extract.

The list representation is useful if you want to be able to operate on elements as "ipums_extract" objects. For instance, to retrieve your most second-most-recent extract and revise it for resubmission, you could use:

second_most_recent_extract <- get_recent_extracts_info_list("usa")[[2]]
revised_extract <- revise_extract_micro(
  second_most_recent_extract, 
  samples_to_add = "us2010a"
)

Or to download all recent extracts that are ready to download, using purrr::keep() and purrr::map_chr():

ddi_paths <- get_recent_extracts_info_list("usa") %>% 
  keep(is_extract_ready) %>% 
  map_chr(download_extract)

The tibble representation is useful if you want to use functions for manipulating data.frames to find recent extracts matching particular criteria.

recent_usa_extracts_tbl <- get_recent_extracts_info_tbl("usa")

For example, to find extracts with descriptions including the word "occupation", you could use:

recent_usa_extracts %>%  
  filter(grepl("occupation", description))

Filtering on properties such as "samples" or "variables" is a little more complex, because these are stored in list columns, but it is possible. For example, to find extracts including the variable "AGE", you could use purrr::map_lgl() like this:

recent_usa_extracts %>% 
  filter(map_lgl(variables, ~"AGE" %in% .x))

To convert between these two representations, ipumsr provides the functions extract_list_to_tbl() and extract_tbl_to_list(), such that the following is TRUE:

identical(
  extract_list_to_tbl(get_recent_extracts_info_list("usa")),
  get_recent_extracts_info_tbl("usa")
)

Putting it all together, with pipes

The return values of the functions to interact with the API are configured in such a way that you can define, submit, wait for, download, and read in your extract all in one piped expression:

data <- 
  define_extract_micro(
    "usa",
    "Extract for API vignette",
    c("us2018a","us2019a"),
    c("AGE","SEX","RACE","STATEFIP")
  ) %>% 
    submit_extract() %>% 
    wait_for_extract() %>% 
    download_extract() %>% 
    read_ipums_micro()