knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Once you have downloaded an IPUMS extract, the next step is to load its data into R for analysis.
For more information about IPUMS data and how to generate and download a data extract, see the introduction to IPUMS data.
IPUMS extracts will be organized slightly differently for different IPUMS projects. In general, all projects will provide multiple files in a data extract. The files most relevant to ipumsr are:
Both of these files are necessary to properly load data into R. Obviously, the data files contain the actual data values to be loaded. But because these are often in fixed-width format, the metadata files are required to correctly parse the data on load.
Even for .csv files, the metadata file allows for the addition of contextual variable information to the loaded data. This makes it much easier to interpret the values in the data variables and effectively use them in your data processing pipeline. See the value labels vignette for more information on working with these labels.
Microdata extracts typically provide their metadata in a DDI (.xml) file separate from the compressed data (.dat.gz) files.
Provide the path to the DDI file to read_ipums_micro()
to directly
load its associated data file into R.
library(ipumsr) library(dplyr) # Example data cps_ddi_file <- ipums_example("cps_00157.xml") cps_data <- read_ipums_micro(cps_ddi_file) head(cps_data)
Note that you provide the path to the DDI file, not the data file. This is because ipumsr needs to find both the DDI and data files to read in your data, and the DDI file includes the name of the data file, whereas the data file contains only the raw data.
The loaded data have been parsed correctly and include variable metadata
in each column. For a summary of the column contents, use
ipums_var_info()
:
ipums_var_info(cps_data)
This information is also attached to specific columns. You can obtain it
with attributes()
or by using ipumsr helpers:
attributes(cps_data$MONTH) ipums_val_labels(cps_data$MONTH)
While this is the most straightforward way to load microdata, it's often
advantageous to independently load the DDI file into an ipums_ddi
object containing the metadata:
cps_ddi <- read_ipums_ddi(cps_ddi_file) cps_ddi
This is because many common data processing functions have the side-effect of removing these attributes:
# This doesn't actually change the data... cps_data2 <- cps_data %>% mutate(MONTH = ifelse(TRUE, MONTH, MONTH)) # but removes attributes! ipums_val_labels(cps_data2$MONTH)
In this case, you can always use the separate DDI as a metadata reference:
ipums_val_labels(cps_ddi, var = MONTH)
Or even reattach the metadata, assuming the variable names still match those in the DDI:
cps_data2 <- set_ipums_var_attributes(cps_data2, cps_ddi) ipums_val_labels(cps_data2$MONTH)
IPUMS microdata can come in either rectangular or hierarchical format.
Rectangular data are transformed such that every row of data represents
the same type of record. For instance, each row will represent a person
record, and all household-level information for that person will be
included in the same row. (This is the case for cps_data
shown
in the example above.)
Hierarchical data have records of different types interspersed in a single file. For instance, a household record will be included in its own row followed by the person records associated with that household.
Hierarchical data can be loaded in list format or long format.
read_ipums_micro()
will read in long format:
cps_hier_ddi <- read_ipums_ddi(ipums_example("cps_00159.xml")) read_ipums_micro(cps_hier_ddi)
The long format consists of a single tibble
that
includes rows with varying record types. In this example, some rows have a
record type of "Household" and others have a record type of "Person".
Variables that do not apply to a particular record type will be filled
with NA
in rows of that record type.
To read data in list format, use read_ipums_micro_list()
. This
function returns a list where each element contains all the records for
a given record type:
read_ipums_micro_list(cps_hier_ddi)
read_ipums_micro()
and read_ipums_micro_list()
also support partial
loading by selecting only a subset of columns or a limited number of
rows. See the documentation for more details about other options.
Unlike microdata projects, NHGIS and IHGIS extracts provide their data and
metadata files bundled into a single .zip archive. read_ipums_agg()
anticipates this structure and can read data files directly from this
file without the need to manually extract the files:
nhgis_ex1 <- ipums_example("nhgis0972_csv.zip") nhgis_data <- read_ipums_agg(nhgis_ex1) nhgis_data
Within a zipped extract archive, tabular data from aggregate data projects
are typically provided in .csv format. read_ipums_agg()
passes several
arguments on to readr::read_csv()
to allow you to fine-tune the import of
these files. For instance, if the guessed column types are incorrect, use the
col_types
argument to adjust. This is most likely to occur for
columns that contain geographic codes that end up stored as numeric values:
# Convert MSA codes to character format read_ipums_agg( nhgis_ex1, col_types = c(MSA_CMSAA = "c"), verbose = FALSE )
Like microdata extracts, the data include variable-level metadata, where available:
attributes(nhgis_data$D6Z001)
However, variable metadata for aggregate data files are slightly different
than those provided by microdata products. First,
they come from a .txt codebook file (for NHGIS) or a collection of metadata
.csv files (for IHGIS) rather than an .xml DDI file. Codebooks can still be
loaded into an ipums_ddi
object, but fields that do not apply to aggregate
data will be empty. In general, aggregate data codebooks provide only variable
labels and descriptions, along with citation information.
nhgis_cb <- read_nhgis_codebook(nhgis_ex1) # Most useful metadata for NHGIS is for variable labels: ipums_var_info(nhgis_cb) %>% select(var_name, var_label, var_desc)
IHGIS codebooks are structured differently than NHGIS codebooks, so each
collection has its own codebook reader. Both produce an ipums_ddi
object.
ihgis_cb <- read_ihgis_codebook(ipums_example("ihgis0014.zip")) ipums_var_info(ihgis_cb)
By design, aggregate data codebooks are human-readable, and it may be easier
to interpret their contents in raw format. To view the codebook itself without
converting to an ipums_ddi
object, set raw = TRUE
.
nhgis_cb <- read_nhgis_codebook(nhgis_ex1, raw = TRUE) cat(nhgis_cb[1:20], sep = "\n")
For more complicated extracts that include data from multiple data sources, the provided .zip archive will contain multiple codebook and data files.
You can view the files contained in an extract to determine if this is the case:
nhgis_ex2 <- ipums_example("nhgis0731_csv.zip") ipums_list_files(nhgis_ex2)
In these cases, you can use the file_select
argument to indicate which
file to load. file_select
supports most features of the
tidyselect selection language.
(See ?selection_language
for documentation of the features supported in
ipumsr.)
nhgis_data2 <- read_ipums_agg( nhgis_ex2, file_select = contains("nation") ) nhgis_data3 <- read_ipums_agg( nhgis_ex2, file_select = contains("ts_nominal_state") )
The matching codebook should automatically be loaded and attached to the data:
attributes(nhgis_data2$AJWBE001) attributes(nhgis_data3$A00AA1790)
(If for some reason the metadata are not loaded successfully, you can load
it separately with either read_nhgis_codebook()
or read_ihgis_codebook()
.)
file_select
also accepts the full path or the index of the file to
load:
# Match by file name read_ipums_agg( nhgis_ex2, file_select = "nhgis0731_csv/nhgis0731_ds239_20185_nation.csv" ) # Match first file in extract read_ipums_agg(nhgis_ex2, file_select = 1)
IPUMS distributes spatial data for several projects.
Use read_ipums_sf()
to load spatial data from any of these sources as an
sf
object from {sf}
.
read_ipums_sf()
also supports the loading of spatial files within .zip
archives and the file_select
syntax for file selection when multiple internal
files are present.
nhgis_shp_file <- ipums_example("nhgis0972_shape_small.zip") shp_data <- read_ipums_sf(nhgis_shp_file) head(shp_data)
These data can then be joined to associated tabular data. To preserve
IPUMS attributes from the tabular data used in the join, use
an ipums_shape_*_join()
function:
joined_data <- ipums_shape_left_join( nhgis_data, shp_data, by = "GISJOIN" ) attributes(joined_data$MSA_CMSAA)
For aggregate data collections, the join code typically corresponds to the
GISJOIN
variable. However, for microdata collections, the variable name used
for a geographic level in the tabular data may differ from that in the spatial
data. Consult the documentation and metadata for these files to identify
the correct join columns and use the by
argument to join on these
columns.
Once joined, data include both statistical and spatial information along with the variable metadata.
Longitudinal analysis of geographic data is complicated by the fact that geographic boundaries shift over time. IPUMS therefore provides multiple types of spatial data:
Furthermore, some NHGIS time series tables have been standardized such that the statistics have been adjusted to apply to a year-specific geographical boundary.
When using spatial data, it is important to consult the project-specific documentation to ensure you are using the most appropriate boundaries for your research question and the data included in your analysis. As always, documentation for the IPUMS project you're working with should explain the different options available.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.