knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE )
The IPUMS USA microdata extract API is still in testing and is not publicly available. A call for beta testers will go out to all IPUMS USA users in late 2021 or early 2022. Moreover, the ipumsr functions for interacting with the API are only included in the development version of the package on GitHub, not in the version on CRAN. If you are a beta tester, you can download the development version of ipumsr, which includes API functions, using:
if (!require(remotes)) install.packages("remotes") remotes::install_github("mnpopcenter/ipumsr/ipumsexamples") remotes::install_github("mnpopcenter/ipumsr", build_vignettes = TRUE)
The IPUMS USA microdata extract API allows registered IPUMS USA users to define
extracts, submit extract requests, and download extract files without visiting
the IPUMS website. ipumsr includes functions that help R users
interact with the extract API from their R session.
library(ipumsr) library(dplyr) # not necessary to use API functions, but used in some examples library(purrr) # not necessary to use API functions, but used in some examples
If you don’t have an IPUMS USA account, register for access.
Once you're registered, you'll need to create an API key.
Once you've created an API key, you can choose to supply it as a function
argument whenever interacting with the API, or you can set the value of the
IPUMS_API_KEY
environment variable to your key.
To set the value of the IPUMS_API_KEY
environment variable, you can use:
Sys.setenv(IPUMS_API_KEY = "paste-your-key-here")
Or, if you want R to automatically set the value of that environment variable at the beginning of each session, add code like the following to a file named ".Renviron" in your project directory or user home directory:
IPUMS_API_KEY = "paste-your-key-here"
If you use the ".Renviron" approach with a project that uses Git for version control, be sure to add the ".Renviron" file to ".gitignore" so that your API key is not shared when someone clones your repository.
The define_extract_micro()
function returns an object of class "ipums_extract" which
can then be submitted using the submit_extract()
function.
extract_definition <- define_extract_micro( collection = "usa", description = "Extract for API vignette", samples = c("us2018a","us2019a"), variables = c("AGE","SEX","RACE","STATEFIP"), data_format = "fixed_width", data_structure = "rectangular", rectangular_on = "P" )
Note that samples are specified using special sample ID codes, which can be browsed here on the IPUMS USA website.
The data_format
, data_structure
, and rectangular_on
arguments have default
values which match the values specified in the example above, and you can omit
the argument names as long as you maintain the proper order, so you could define
the same extract with:
extract_definition <- define_extract_micro( "usa", "Extract for API vignette", c("us2018a","us2019a"), c("AGE","SEX","RACE","STATEFIP") )
To submit your extract, use:
submit_extract(extract_definition)
However, like the define_extract_micro()
function, the submit_extract()
function
returns an "ipums_extract" object, and the returned object has been updated to
include the extract number, so it can be useful to save that return object
by assigning a name to it, like this:
submitted_extract <- submit_extract(extract_definition)
That way, you can use the submitted_extract
object as input to check the
extract's status, as shown in the next section, or to reference the extract
number:
submitted_extract$number
To retrieve the latest status of an extract, you can use the
get_extract_info()
function. get_extract_info()
returns an "ipums_extract"
object with the "status" element updated to reflect the latest status of the
extract, and the "download_links" element updated to include links to any
extract files that are available for download.
The "status" of a submitted extract is one of "queued", "started", "produced",
"canceled", "failed", or "completed". Only "completed" extracts can be
downloaded, but "completed" extracts older than 72 hours may not be available
for download, since extract files are removed after that time (see discussion of
the is_extract_ready()
function below).
If you assigned a name to the return value of submit_extract()
, as shown
above, you could get updated information on the extract, returned as an
"ipums_extract" object, with:
submitted_extract <- get_extract_info(submitted_extract)
To print the latest status, you can use:
submitted_extract$status
If you forget to capture the return value of submit_extract()
, you can get
all information on your most recent extract (including the extract number) with:
submitted_extract <- get_last_extract_info("usa")
get_last_extract_info()
is just a convenience wrapper around
get_recent_extracts_info_list()
, described below.
If you don't have an "ipums_extract" object in your environment that describes the extract you're interested in, and you don't want the most recent extract, you can also query the latest status of an extract by supplying the name of the IPUMS data collection and extract number of the extract, in one of two formats. Here's how you'd get the latest information on IPUMS USA extract number 33:
submitted_extract <- get_extract_info("usa:33")
or
submitted_extract <- get_extract_info(c("usa", "33"))
Note that in the first format, there are no spaces before or after the colon, and that in both formats, there is no need to zero-pad the extract number -- in other words, use "33", not "00033".
If you want R to periodically check the status of your extract, and only return
an updated "ipums_extract" object once the extract is ready to download, you
can use wait_for_extract()
, as shown below:
downloadable_extract <- wait_for_extract(submitted_extract)
wait_for_extract()
also accepts the same "collection:number"
and
c("collection", "number")
specifications shown above:
downloadable_extract <- wait_for_extract("usa:33")
or
downloadable_extract <- wait_for_extract(c("usa", "33"))
For large extracts that take a long time to produce, or when the IPUMS servers
are busy, you may not want to use wait_for_extract()
, as it will tie up your R
session until the extract is ready to download. Alternatively, you can use the
timeout_seconds
argument to set the maximum number of seconds you want the
function to wait. By default, that argument is set to 10,800 seconds (3 hours).
One additional way to check whether your extract is ready to download is using
the is_extract_ready()
function. This function accepts either an
"ipums_extract" object or a "collection:number"
or c("collection", "number")
specification, and returns a single TRUE
or FALSE
value indicating whether
the extract is ready to be downloaded.
is_extract_ready(submitted_extract) is_extract_ready("usa:33") is_extract_ready(c("usa", "33"))
Note that the API has a limit of 60 requests with the same API key per minute,
so you wouldn't want to write a loop that repeatedly uses is_extract_ready()
to check your extract status.
Once your extract is ready to download, use the download_extract()
function to
download the data and DDI codebook files to your computer. The
download_extract()
function returns the path to the DDI codebook file, which
can be used to read in the downloaded data with ipumsr functions. By default,
the function will download files into your current working directory, but
alternative locations can be specified with the download_dir
argument.
ddi_path <- download_extract(submitted_extract) # Assuming you chose either "fixed-width" or "csv" for your extract's "data_format" ddi <- read_ipums_ddi(ddi_path) data <- read_ipums_micro(ddi)
Or, using a "collection:number"
or c("collection", "number")
specification:
ddi_path <- download_extract("usa:33") ddi_path <- download_extract(c("usa", "33"))
One exciting feature enabled by the IPUMS USA data extract API is the ability to
share a standardized extract definition with other IPUMS users so that they can
create an identical extract for themselves. ipumsr facilitates this by offering
the functions save_extract_as_json()
and define_extract_from_json()
to
write extract definitions to and read extract definitions from a standardized
JSON-formatted file.
To write the definition of your USA extract number 33 to a JSON-formatted file that can be shared with other users, you could use:
usa_extract_33 <- get_extract_info("usa:33") save_extract_as_json(usa_extract_33, file = "usa_extract_33.json")
Then, you or another user could use that JSON file to create a duplicate "ipums_extract" object with the same definition, and submit it, using:
clone_of_usa_extract_33 <- define_extract_from_json("usa_extract_33.json", "usa") submitted_extract <- submit_extract(clone_of_usa_extract_33)
Note that the code in the previous chunk assumes that the file is saved in the
current working directory. If it's saved somewhere else, replace
"usa_extract_33.json"
with the full path to the file.
ipumsr also includes a convenience function for revising a previous extract definition, facilitating a "revise and resubmit" workflow. Here's how you would pull down the definition of USA extract number 33 and add a sample and a variable to it:
old_extract <- get_extract_info("usa:33") new_extract <- revise_extract_micro( old_extract, samples_to_add = "us2018a", vars_to_add = "RELATE" )
The revise_extract_micro()
function returns an "ipums_extract" object that has been
modified as requested and has been reset to an unsubmitted state, by stripping
the extract number, status, and download links from the original extract. The
revised extract can then be submitted with:
newly_submitted_extract <- submit_extract(new_extract)
You can query the API for the details and status of up to ten recent extracts
using the functions get_recent_extracts_info_list()
and
get_recent_extracts_info_tbl()
. The _list
version of the function returns a
list of "ipums_extract" objects, whereas the _tbl
version returns a tibble
(enhanced "data.frame") in which each row contains information on one extract.
The list representation is useful if you want to be able to operate on elements as "ipums_extract" objects. For instance, to retrieve your most second-most-recent extract and revise it for resubmission, you could use:
second_most_recent_extract <- get_recent_extracts_info_list("usa")[[2]] revised_extract <- revise_extract_micro( second_most_recent_extract, samples_to_add = "us2010a" )
Or to download all recent extracts that are ready to download, using
purrr::keep()
and purrr::map_chr()
:
ddi_paths <- get_recent_extracts_info_list("usa") %>% keep(is_extract_ready) %>% map_chr(download_extract)
The tibble representation is useful if you want to use functions for manipulating data.frames to find recent extracts matching particular criteria.
recent_usa_extracts_tbl <- get_recent_extracts_info_tbl("usa")
For example, to find extracts with descriptions including the word "occupation", you could use:
recent_usa_extracts %>% filter(grepl("occupation", description))
Filtering on properties such as "samples" or "variables" is a little more
complex, because these are stored in list columns, but it is possible. For
example, to find extracts including the variable "AGE", you could use
purrr::map_lgl()
like this:
recent_usa_extracts %>% filter(map_lgl(variables, ~"AGE" %in% .x))
To convert between these two representations, ipumsr provides the functions
extract_list_to_tbl()
and extract_tbl_to_list()
, such that the following is
TRUE
:
identical( extract_list_to_tbl(get_recent_extracts_info_list("usa")), get_recent_extracts_info_tbl("usa") )
The return values of the functions to interact with the API are configured in such a way that you can define, submit, wait for, download, and read in your extract all in one piped expression:
data <- define_extract_micro( "usa", "Extract for API vignette", c("us2018a","us2019a"), c("AGE","SEX","RACE","STATEFIP") ) %>% submit_extract() %>% wait_for_extract() %>% download_extract() %>% read_ipums_micro()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.