knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Integrated variables in IPUMS data often have value labels, which attach
text labels to the values taken by a variable (for example the HEALTH variable
has value labels: 1 = "Excellent", 2 = "Very good", etc.). The ipumsr
package
does import the labels, but not as factors, which may be how you were expecting them.
library(ipumsr) ddi <- read_ipums_ddi(ipums_example("cps_00015.xml")) cps <- read_ipums_micro(ddi, verbose = FALSE) cps
The first clue that some of the variables are labelled is the <dbl+lbl>
that appear below STATEFIP
, ASECFLAG
and other variables. The tibble package
prints the variable's type information below the variable name, and this "+lbl" indicates
that the variable uses the labelled
type. The function is.labelled()
will also
tell you if a variable is labelled.
By default the ipumsr package now will attempt to show the value labels when
printing out a data.frame (like the "[Wis~"
on STATEFIP above, for Wisconsin).
If your console supports colorful output, then the labels will be in
a lighter gray than the values. You can turn off this printing behavior by
setting the option options("ipumsr.show_pillar_labels" = FALSE)
)
However, as you can see, there often isn't enough space to see the labels when printing out a full data.frame. So, there are a few options to see the labels directly:
# Printing the variable directly (or a subset) head(cps$MONTH) # Just get the labels ipums_val_labels(cps$MONTH) # or if you're working interactively you can use ipums_view # ipums_view(ddi)
The usual way to connect numeric data to labels in R is in factor
variables.
Though this data type is more native to R, and more widely supported by R code,
it was designed for efficient calculations in linear models, not as a general
purpose value labeling system and so is missing important features that the
value labels provided by IPUMS require.
Factors only allow for integers to be mapped to a text label, and these integers
have to be a count starting at 1. This doesn't work for IPUMS data because often
our variables have specific meanings for the codes. For example, the variable
AGE
uses the value to mean the actual age, but does have labels
for age 0 and the top codes.
head(cps$AGE) cps$AGE_FACTOR <- as_factor(cps$AGE) head(cps$AGE_FACTOR)
It may seem like the new AGE_FACTOR
variable is okay, but it can be very
confusing!
mean(cps$AGE) # mean(cps$AGE_FACTOR) # error because data is a factor not numeric mean(as.numeric(cps$AGE_FACTOR)) # A common mistake # The "more" correct way, but NA because of the text labels mean(as.numeric(as.character(cps$AGE_FACTOR)))
Because the factor variable has to assign values starting at 1, but the AGE variable started at 0, most values were 1 higher than they should have been. Not all values are 1 higher though, because not all values exist in the data, so 85, 90, and 99 are 82, 83 and 84 respectively.
Other variables have special meanings behind certain codes. For example, often missing or NIU values are indicated in IPUMS by values starting with the number 9 that are offset from the typical values. R's factors do not allow for this separation, so the missing codes will be harder to distinguish.
Factors also require that every value be labelled, which is not always true in IPUMS data. In the AGE variable, the only values with labels are 0, 90 and 99. For all other values, there is not additional label information.
Though the labelled class does express all of the meaning provided by IPUMS value labels into R, many R functions cannot use them or even actively remove them from the data.
ipums_val_labels(cps$HEALTH) HEALTH2 <- ifelse(cps$HEALTH > 3, 3, cps$HEALTH) ipums_val_labels(HEALTH2)
Therefore, your first task when importing an IPUMS data set will usually be to convert the labelled values to other data structures. The bad news is that there's no good automatic way to do this; a lot depends on how you plan to use the variables in your analysis and your preferences.
The good news is that the ipumsr package provides several functions to make this process easier.
I think it is easiest to learn them by seeing them in action, so see below for a workflow for bringing in the CPS example extract. For your reference, here is a list of the functions:
as_factor()
(reexported from the haven package)zap_labels()
(also from haven)lbl_na_if()
lbl_collapse()
lbl_relabel()
lbl_add()
lbl_add_vals()
as_factor()
The HEALTH variable is structured just like a factor and so can be converted directly.
The as_factor()
function is the easiest way to do so.
ipums_val_labels(cps$HEALTH) cps$HEALTH <- as_factor(cps$HEALTH)
The ASECFLAG and MONTH variables can also be converted directly (these were included by default by the IPUMS extract engine, but aren't useful here because this data set only has respondents from ASEC in March)
cps$ASECFLAG <- as_factor(cps$ASECFLAG) cps$MONTH <- as_factor(cps$MONTH)
as_factor
works on data.frames by converting every labelled variable to a factor.
However, this can create confusing variables like AGE_FACTOR from above, so this
isn't the best thing to do right away.
zap_labels()
I may decide that for my analysis, the AGE variable is most useful as the numeric values.
The zap_labels()
function removes the labels.
cps$AGE <- zap_labels(cps$AGE)
The top-codes (which are only available on the CPS website) indicate that the value 80 actually indicates 80-84 and 85 indicates 85+, so another option would be to convert them to a factor with age ranges.
lbl_clean()
This extract only contains data from a few states and I don't want the factor to have
a level for the unused ones. The lbl_clean()
function keeps only labels for values
in the current data set and returns a labelled variable which we can convert to a factor.
ipums_val_labels(cps$STATEFIP) cps$STATEFIP <- lbl_clean(cps$STATEFIP) ipums_val_labels(cps$STATEFIP) cps$STATEFIP <- as_factor(cps$STATEFIP)
lbl_na_if()
The INCTOT variable has 2 labelled values that are not actually incomes: 99999998 indicates "Missing" and 99999999 indicate "Not in Universe". On the CPS website, the Universe tab indicates that the Universe for 2016 is respondents age 15+. Let's say for my analysis, I can treat these values as missing.
The lbl_na_if()
function takes the variable and a function that refers to
.val and .lbl (the values and labels respectively) and returns an indicator
of whether to set those values to NA and remove the label. You can also use the
~
notation from the purrr package to create succinct anonymous functions.
The .val and .lbl only refer to values that already have labels, they do not apply
to unlabeled values. See the lbl_add()
and lbl_add_vals()
functions below for
working with unlabeled values.
# Caution: R defaults to printing large numbers like 99999999 in rounded # exponential format (1e+08) but that's not how they are actually stored ipums_val_labels(cps$INCTOT) # All of these are equivalent INCTOT1 <- lbl_na_if(cps$INCTOT, ~.val >= 99999990) INCTOT2 <- lbl_na_if(cps$INCTOT, ~.lbl %in% c("Missing.", "N.I.U. (Not in Universe).")) INCTOT3 <- lbl_na_if(cps$INCTOT, function(.val, .lbl) { is_missing <- .val == 99999998 is_niu <- .lbl == "N.I.U. (Not in Universe)." return(is_missing | is_niu) }) # Change to a factor in the original cps data.frame cps$INCTOT <- lbl_na_if(cps$INCTOT, ~.val >= 9999990) cps$INCTOT <- as_factor(cps$INCTOT)
lbl_collapse()
The EDUC variable provides an example of a common IPUMS practice of grouping categories together by the starting digits. For example, the value 10 indicates "Grades 1, 2, 3, or 4", and 11 - "Grade 1", 12 - "Grade 2", etc.
Let's say that I only care about those more general categories provided
by the first 2 digits. The lbl_collapse()
function allows me to provide a function
that takes .val and .lbl and returns the value to assign it to. If that code is already
used, then the all of the values assigned to it will get that label, otherwise the label
of the smallest value is used. Just like with lbl_na_if()
, the purrr-style compact
syntax using ~
functions is supported.
ipums_val_labels(cps$EDUC) # %/% is integer division, which divides by the number but doesn't keep the remainder cps$EDUC <- lbl_collapse(cps$EDUC, ~.val %/% 10) ipums_val_labels(cps$EDUC) cps$EDUC <- as_factor(cps$EDUC)
lbl_relabel()
Sometimes you may wish to move the labels into new categories. For example, the categories in MIGRATE1 may not quite map what I want to use in my analysis.
The lbl_relabel()
function provides a more flexible way to group existing labelled
values into new ones. It takes a two-sided formula, where the left-hand side is a
label (defined with the lbl()
function) and the right hand side is an expression
that can use .val and .lbl to evaluate to a logical indicating which values should
be assigned to this label.
ipums_val_labels(cps$MIGRATE1) cps$MIGRATE1 <- lbl_relabel( cps$MIGRATE1, lbl(0, "NIU / Missing / Unknown") ~ .val %in% c(0, 2, 9), lbl(1, "Stayed in state") ~ .val %in% c(1, 3, 4) ) ipums_val_labels(cps$MIGRATE1) cps$MIGRATE1 <- as_factor(cps$MIGRATE1)
lbl_add()
and lbl_add_vals()
These functions allow you to create labels for values that aren't already labelled. It's harder to come up with real world examples of when these functions would be useful, but just in case you come across such a situation, here's how they work.
x <- haven::labelled( c(100, 200, 105, 990, 999, 230), c(`Unknown` = 990, NIU = 999) ) lbl_add(x, lbl(100, "$100"), lbl(105, "$105"), lbl(200, "$200"), lbl(230, "$230")) lbl_add_vals(x, ~paste0("$", .))
And now, after converting all those labels to factors, I'm ready for analysis! If you think of any other helper functions that would be useful for dealing with labels please let us know by filing an issue on github.
One implementation detail that may help you understand the lbl_* functions better is that the value labels are stored separately from the actual data. This can be important because it allows for values to exist in the data without labels (such as the non-special codes in the INCTOT variable from the example above) and also for value labels to exist even if they don't exist in the data (like the STATEFIPS that we didn't include in our extract).
The .val
and .lbl
pronouns that are usable in functions like lbl_na_if()
,
lbl_collapse()
and lbl_relabel()
only include these labelled values, not
all values in the dataset. Though this makes many calculations simpler,
because only the labelled values are considered, it can be confusing when you
want to work with the unlabeled values.
For example, considering the INCTOT variable, if there are unlabeled values
that you want to set to NA, you cannot use lbl_na_if()
directly.
# Reload cps data so that INCTOT is a labelled class again cps <- read_ipums_micro(ddi, verbose = FALSE) # Try to set all values above 1000000 to NA test1 <- lbl_na_if(cps$INCTOT, ~.val > 1000000) test1 <- zap_labels(test1) max(test1, na.rm = TRUE) # Didn't work
Instead you should add the value labels with lbl_add_vals()
(or you could use
a function that doesn't use the labels, such as dplyr::na_if()
)
test2 <- lbl_add_vals(cps$INCTOT) test2 <- lbl_na_if(test2, ~.val > 1000000) test2 <- zap_labels(test2) max(test2, na.rm = TRUE)
The haven package vignette 'semantics' has some more details about the motivation and
implementation of the labelled class. You can view it by running the command:
vignette("semantics", package = "haven")
The labelled
package provides other methods for manipulating
value labels. It is not installed by ipumsr, but is available on CRAN via the
following command: install.packages("labelled")
The questionr package includes great
functions for exploring labelled
variables. In particular, the functions
describe
, freq
and lookfor
all print out to console information about the
variable using the value labels. It is also not installed by ipumsr, but can be
installed from CRAN using: install.packages("questionr")
Finally, the foreign and
prettyR packages don't use the
labelled
class data structure from haven (which ipumsr uses), but do have very
similar concepts for attaching value labels. Code designed for these packages
could be adapted for use with the haven labelled class without too much
difficulty.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.