Value Labels in IPUMS data

IPUMS Variable Metadata

IPUMS data come with three primary types of variable-level metadata: variable labels, variable descriptions, and value labels.

The rest of this article will focus on value labels; for more about variable labels and descriptions, see vignette("ipums").

Value labels

ipumsr uses the labelled class from the haven package to handle value labels.

You can see this in the column data types when loading IPUMS data. Note that <int+lbl> appears below STATEFIP, ASECFLAG, and other variables:

library(ipumsr)

ddi <- read_ipums_ddi(ipums_example("cps_00160.xml"))
cps <- read_ipums_micro(ddi, verbose = FALSE)

cps

This indicates that the data contained in these columns are integers but include value labels. You can use the function is.labelled() to determine if a variable is indeed labelled:

is.labelled(cps$STATEFIP)

Some of the labels are actually printed inline alongside their data values, but it can be easier to see them by isolating them:

# Labels print when accessing the column
head(cps$MONTH)

# Get labels alone
ipums_val_labels(cps$MONTH)
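To make the structure concrete, here is a minimal sketch that builds a small labelled vector by hand with haven::labelled(). The month vector and its labels are made up for illustration and are not taken from the CPS extract.

```r
library(haven)

# Hypothetical month codes, with a label for each value
month <- labelled(
  c(1L, 3L, 1L, 2L),
  labels = c(January = 1L, February = 2L, March = 3L)
)

# The underlying data are still integers
typeof(month)

# The labels are stored as a named vector in the "labels" attribute
attr(month, "labels")

is.labelled(month)
```

The values and their labels travel together in one object, which is what the <int+lbl> column type indicates.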

labelled vs. factor

Base R already supports the linking of numeric data to categories using its factor data type. While factors may be more familiar, they were designed to support efficient calculations in linear models, not as a human-readable labeling system for interpreting and processing data.

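Compared to factors, labelled vectors have two main properties that make them more suitable for working with IPUMS data: they can attach labels to arbitrary values rather than only to sequential integers starting at 1, and they do not require every value to be labelled.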

Consider the case of the AGE variable. For many IPUMS products, AGE provides a person's age in years, but certain special values have other interpretations:

head(cps$AGE)

As you can see, the 0 value represents all ages less than 1, and the 90 and 99 values represent ranges of ages rather than exact ages. Coercing AGE to a factor would convert all values of 0 to 1, because the integer codes underlying a factor always start at 1:

cps$AGE_FACTOR <- as_factor(cps$AGE)

age0_factor <- cps[cps$AGE == 0, ]$AGE_FACTOR

# The levels look the same
unique(age0_factor)

# But the values have changed
unique(as.numeric(age0_factor))

Additionally, because not all values exist in the data, high values like 85, 90, and 99 have been mapped to lower values:

age85_factor <- cps[cps$AGE == 85, ]$AGE_FACTOR

unique(as.numeric(age85_factor))

These different representations lead to inconsistencies in calculated values:

mean(cps$AGE)

mean(as.numeric(cps$AGE_FACTOR))
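The same effect can be reproduced without any IPUMS data. The following is a simplified sketch using a base R factor and a made-up age vector; it is not the CPS data used above.

```r
# Hypothetical ages, including the special values 0 and 99
age <- c(0, 3, 85, 99, 85)
age_factor <- factor(age)

# Factor codes are positions in the sorted set of levels, not the ages
as.integer(age_factor)

# The two means therefore disagree
mean(age)
mean(as.integer(age_factor))

# Going through the level text recovers the original numbers
age_recovered <- as.numeric(as.character(age_factor))
identical(age_recovered, age)
```

This recovery only works because the levels here are the numbers themselves. After as_factor() on a labelled vector, the levels for labelled values hold the label text instead, so the original codes cannot be recovered this way.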

Cautions regarding labelled variables

While labelled variables provide the benefits described above, they also present challenges.

For example, you may have noticed that both of the means calculated above are suspect.

In the case of AGE_FACTOR, the values have been remapped during conversion and several are inconsistent with the original data.

In the case of AGE, we have considered all people over 90 to be exactly 90, and all people over 99 to be exactly 99. Labelled variables don't ensure that calculations are correct any more than factors do!

Furthermore, many R functions ignore value labels or even actively remove them from the data:

ipums_val_labels(cps$HEALTH)

HEALTH2 <- ifelse(cps$HEALTH > 3, 3, cps$HEALTH)
ipums_val_labels(HEALTH2)

So, labelled vectors are not intended for use throughout the entire analysis process. Instead, they should be used during the initial data preparation process to convert raw data into values that are more meaningful. These can then be converted to other variable types (often factors) for analysis.

Unfortunately, this isn't a process that can typically be automated, as it depends primarily on the research questions the data will be used to address. However, ipumsr provides several functions to manipulate value labels to make this process easier.

Prepping data with value labels

Convert labelled values to other data types

Use as_factor() once labels have the correct categories and need no further manipulation. For instance, MONTH already has sensible categories, so we can convert it to a factor right away:

ipums_val_labels(cps$MONTH)

cps$MONTH <- as_factor(cps$MONTH)

as_factor() can also convert all labelled variables in a data frame to factors at once. If you prefer to work with factors, you can do this conversion immediately after loading data, and then prepare these variables using techniques you would use for factors.

cps <- as_factor(cps)

# ... further preparation of variables as factors

If you prefer to handle these variables in labelled format, you can use the lbl_* helpers first, then call as_factor() on the entire data frame.
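As a rough sketch of that second workflow, the example below handles labels first and converts the whole data frame afterwards. The toy data frame, its sex column, and the "NIU" label are hypothetical and stand in for a real extract.

```r
library(ipumsr)
library(haven)

# Hypothetical extract with one plain column and one labelled column
toy <- data.frame(id = 1:4)
toy$sex <- labelled(c(1, 2, 2, 9), labels = c(Male = 1, Female = 2, NIU = 9))

# Step 1: handle labels while the column is still labelled
toy$sex <- lbl_na_if(toy$sex, ~ .lbl == "NIU")

# Step 2: convert every labelled column to a factor in one call
toy <- as_factor(toy)

levels(toy$sex)
class(toy$id)
```

Columns that are not labelled, like id here, are left as they were.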

Some variables may be more appropriate to use as numeric values rather than factors. In these cases, you can simply remove the labels with zap_labels().

INCTOT, which measures personal income, fits this description:

inctot_num <- zap_labels(cps$INCTOT)

typeof(inctot_num)

ipums_val_labels(inctot_num)

Note that labelled values are not generally intended to be interpreted as numeric values, so zap_labels() should only be used after labels have been properly handled. For example, in INCTOT, labelled values used to identify missing values are encoded with large numbers:

ipums_val_labels(cps$INCTOT)

If these values are not first converted to NA, they will be treated as legitimate observations and will significantly skew any calculations with this variable.
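The size of the distortion is easy to see on a small example. This sketch uses a made-up income vector with a single hypothetical "N.I.U." code of 999999999, not the actual INCTOT data.

```r
library(ipumsr)
library(haven)

# Hypothetical incomes; the last value is a missing-data code, not an income
income <- labelled(
  c(20000, 35000, 999999999),
  labels = c("N.I.U." = 999999999)
)

# Removing labels too early keeps the code as if it were a real income
income_raw <- zap_labels(income)
mean(income_raw)

# Converting the labelled code to NA first gives a sensible result
income_clean <- zap_labels(lbl_na_if(income, ~ .val == 999999999))
mean(income_clean, na.rm = TRUE)
```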

Create missing values based on value labels

Many IPUMS variables use labelled values to identify missing data. This allows for more detail about why certain observations were missing than would be available were values loaded as NA.

As we saw with INCTOT, value labels identify two types of missing data: values that are legitimately missing and values that are not in the universe (N.I.U.) of observations.

ipums_val_labels(cps$INCTOT)

To convert one or both of these labelled values to NA, use lbl_na_if(). To use lbl_na_if(), you must supply a function to handle the conversion. The function should take a value-label pair as its input and output TRUE for those pairs whose values should be converted to NA.

Syntax for value label functions

Several lbl_* helper functions, including lbl_na_if(), require a user-defined function to handle recoding of value-label pairs. ipumsr provides a syntax to easily reference the values and labels in this user-defined function: the .val argument refers to the labelled values, and the .lbl argument refers to their labels.

For instance, to convert all values equal to 999999999 to NA, we can provide a function that uses the .val argument:

# Convert to NA using function that returns TRUE for all labelled values equal to 999999999
inctot_na <- lbl_na_if(
  cps$INCTOT,
  function(.val, .lbl) .val == 999999999
)

# All 999999999 values have been converted to NA
any(inctot_na == 999999999, na.rm = TRUE)

# And the label has been removed:
ipums_val_labels(inctot_na)

We could achieve the same result by referencing the labels themselves:

# Convert to NA for labels that contain "N.I.U."
inctot_na2 <- lbl_na_if(
  cps$INCTOT,
  function(.val, .lbl) grepl("N.I.U.", .lbl, fixed = TRUE)
)

# Same result
all(inctot_na2 == inctot_na, na.rm = TRUE)

You can also specify the function using a one-sided formula:

lbl_na_if(cps$INCTOT, ~ .val == 999999999)

Note that .val only refers to labelled values. Unlabelled values are not affected:

x <- lbl_na_if(cps$INCTOT, ~ .val >= 0)

# Unlabelled values greater than the cutoff are still present:
length(which(x > 0))

To convert unlabelled values to NA, use dplyr::na_if() instead.
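A minimal sketch of that approach follows. It assumes a plain numeric vector (for instance, one whose labels have already been handled and removed), and both the values and the choice of 230 as an invalid value are made up for illustration.

```r
library(dplyr)

# Hypothetical unlabelled values; suppose 230 is known to be invalid
income <- c(100, 200, 105, 230)

# Every element equal to 230 becomes NA; all other values are unchanged
income_na <- na_if(income, 230)
income_na
```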

Relabel values

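lbl_relabel() can be used to create new value-label pairs, often to recombine existing labels into more general categories. It takes a two-sided formula to handle the relabeling: the left-hand side defines the new value-label pair with the lbl() helper, and the right-hand side is an expression that evaluates to TRUE for the existing value-label pairs that should receive it.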

The function again uses the .val and .lbl syntax mentioned above to refer to values and labels, respectively.

For instance, we could reclassify the categories in MIGRATE1 such that all migration within a state is captured in a single category:

ipums_val_labels(cps$MIGRATE1)

cps$MIGRATE1 <- lbl_relabel(
  cps$MIGRATE1,
  lbl(0, "NIU / Missing / Unknown") ~ .val %in% c(0, 2, 9),
  lbl(1, "Stayed in state") ~ .val %in% c(1, 3, 4)
)

ipums_val_labels(cps$MIGRATE1)

Many IPUMS variables include detailed labels that are grouped together into more general categories. These are often encoded with multi-digit values, where the starting digit refers to the larger category.

For instance, the EDUC variable contains categories for individual grades as well as categories for multiple grade groups:

head(ipums_val_labels(cps$EDUC), 15)

You could use lbl_relabel() to collapse the detailed categories into the more general ones, but you would have to define new value labels for all the categories. Instead, you could use lbl_collapse().

lbl_collapse() uses a function that takes .val and .lbl arguments and returns the new value each input value should be assigned to. The label of the lowest original value is used for each collapsed group. To group by the tens digit, use the integer division operator %/%:

# %/% refers to integer division, which divides but discards the remainder
10 %/% 10
11 %/% 10

# Convert to groups by tens digit
cps$EDUC2 <- lbl_collapse(cps$EDUC, ~ .val %/% 10)

ipums_val_labels(cps$EDUC2)

Relabeling caveats

It is always worth checking that the new labels make sense based on your research question. For instance, in the above example, both "12th grade, no diploma" and "High school diploma or equivalent" are collapsed to a single group as they both have values in the 70s. This may be suitable for your purposes, but for more control, it is best to use lbl_relabel().
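As an illustration of that extra control, the sketch below merges only the categories below a diploma and leaves the diploma category on its own. The codes and labels are hypothetical, loosely modeled on EDUC, and the new value 70 and its label are arbitrary choices.

```r
library(ipumsr)
library(haven)

# Hypothetical education codes and labels (not the real EDUC coding)
educ <- labelled(
  c(60, 71, 73, 81),
  labels = c(
    "Grade 11" = 60,
    "12th grade, no diploma" = 71,
    "High school diploma or equivalent" = 73,
    "Some college but no degree" = 81
  )
)

# Merge only the categories below a diploma; other pairs are left unchanged
educ2 <- lbl_relabel(
  educ,
  lbl(70, "Less than high school diploma") ~ .val %in% c(60, 71)
)

ipums_val_labels(educ2)
```

Unlike grouping by the tens digit, this keeps "High school diploma or equivalent" separate from "12th grade, no diploma".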

Note that lbl_relabel() and lbl_collapse() only operate on labelled values, and are therefore designed for use with fully labelled vectors. That is, if you attempt to relabel a vector that has some unlabelled values, they will be converted to NA.

To avoid this, you can add labels for all values using lbl_add_vals() before relabeling (see below). In general, this shouldn't be necessary, as most partially-labelled vectors only include labels with ancillary information, like missing value indicators. These can typically be handled by other helpers, like lbl_na_if(), without requiring relabeling.
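The difference can be seen on a small, made-up vector in which only one value is labelled. This is a sketch of the behavior described above; the values and labels are hypothetical.

```r
library(ipumsr)
library(haven)

# Hypothetical partially labelled vector: only the value 1 has a label
x <- labelled(c(1, 2, 3), labels = c(One = 1))

# Relabeling directly: the unlabelled values 2 and 3 are lost
dropped <- lbl_relabel(x, lbl(10, "Ten") ~ .val == 1)
is.na(zap_labels(dropped))

# Labelling every value first keeps them through the relabeling
kept <- lbl_relabel(lbl_add_vals(x), lbl(10, "Ten") ~ .val == 1)
zap_labels(kept)
```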

Remove unused value labels

Some variables may contain labels for values that don't appear in the data. Unused levels still appear in factor representations of these variables, so it is often beneficial to remove them with lbl_clean():

ipums_val_labels(cps$STATEFIP)

ipums_val_labels(lbl_clean(cps$STATEFIP))
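The effect on factor levels can be checked with a small example. The vector below is hypothetical and includes a label, "C", for a value that never occurs in the data.

```r
library(ipumsr)
library(haven)

# Hypothetical vector with a label ("C") for a value that never occurs
x <- labelled(c(1, 1, 2), labels = c(A = 1, B = 2, C = 3))

# Without cleaning, the unused label becomes an empty factor level
levels_before <- levels(as_factor(x))
levels_before

# After lbl_clean(), only labels for observed values remain
levels_after <- levels(as_factor(lbl_clean(x)))
levels_after
```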

Add new labels

As mentioned above, value labels are intended to be used as an intermediate data structure for preparing newly-imported data. As such, you're not likely to need to add new labels, but if you do, use lbl_add(), lbl_add_vals(), or lbl_define().

lbl_add() takes an arbitrary number of lbl() placeholders that will be added to a given labelled vector:

x <- haven::labelled(
  c(100, 200, 105, 990, 999, 230),
  c(`Unknown` = 990, NIU = 999)
)

lbl_add(
  x,
  lbl(100, "$100"),
  lbl(105, "$105"),
  lbl(200, "$200"),
  lbl(230, "$230")
)

lbl_add_vals() adds labels for all unlabelled values in a labelled vector with an optional labeller function. (This can be useful if you wish to operate on a partially labelled vector with a function that requires labelled input, like lbl_relabel().)

# `.` refers to each label value
lbl_add_vals(x, ~ paste0("$", .))

lbl_define() makes a labelled vector out of an unlabelled one. Use the same syntax as is used for lbl_relabel() to define new labels based on the unlabelled values:

age <- c(10, 12, 16, 18, 20, 22, 25, 27)

# Group age values into two label groups.
# Values not captured by the right hand side functions remain unlabelled
lbl_define(
  age,
  lbl(1, "Pre-college age") ~ .val < 18,
  lbl(2, "College age") ~ .val >= 18 & .val <= 22
)

Once all labelled variables have been appropriately converted to factors or numeric values, the data can move forward in the processing pipeline.

Other resources

The haven package, which underlies ipumsr's handling of value labels, provides more details on the labelled class. See vignette("semantics", package = "haven").

The labelled package provides other methods for manipulating value labels, some of which overlap those provided by ipumsr.

The questionr package includes functions for exploring labelled variables. In particular, the functions describe(), freq(), and lookfor() print information about a variable to the console using its value labels.

Finally, the foreign and prettyR packages don't use the labelled class, but provide similar functionality for handling value labels, which could be adapted for use with labelled vectors.



ipumsr documentation built on Oct. 20, 2023, 5:10 p.m.