Conversion semantics

library(haven)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

There are some differences between the way that R, SAS, SPSS, and Stata represented labelled data and missing values. While SAS, SPSS, and Stata share some obvious similarities, R is little different. This vignette explores the differences, and shows you how haven bridges the gap.

Value labels

Base R has one data type that effectively maintains a mapping between integers and character labels: the factor. This however, is not the primary use of factors: they are instead designed to automatically generate useful contrasts for linear models. Factors differ from the labelled values provided by the other tools in important ways:

Value labels in SAS are a little different again. In SAS, labels are just special case of general formats. Formats include currencies and dates, but user-defined just assigns labels to individual values (including special missings value). Formats have names and existing independently of the variables they are associated with. You create a named format with PROC FORMAT and then associated with variables in a DATA step (the names of character formats thealways start with $).

labelled()

To allow you to import labelled vectors into R, haven provides the S3 labelled class, created with labelled(). This class allows you to associated arbitrary labels with numeric or character vectors:

x1 <- labelled(
  sample(1:5), 
  c(Good = 1, Bad = 5)
)
x1

x2 <- labelled(
  c("M", "F", "F", "F", "M"), 
  c(Male = "M", Female = "F")
)
x2

The goal of haven is not to provide a labelled vector that you can use everywhere in your analysis. The goal is to provide an intermediate datastructure that you can convert into a regular R data frame. You can do this by either converting to a factor or stripping the labels:

as_factor(x1)
zap_labels(x1)

as_factor(x2)
zap_labels(x2)

See the documentation for as_factor() for more options to control exactly what the factor uses for levels.

Both as_factor() and zap_labels() have data frame methods if you want to apply the same strategy to every column in a data frame:

df <- tibble::data_frame(x1, x2, z = 1:5)
df

zap_labels(df)
as_factor(df)

Missing values

All three tools provide a global "system missing value" which is displayed as .. This is roughly equivalent to R's NA, although neither Stata nor SAS propagate missingness in numeric comparisons: SAS treats the missing value as the smallest possible number (i.e. -inf), and Stata treats it as the largest possible number (i.e. inf).

Each tool also provides a mechanism for recording multiple types of missingness:

Stata and SAS only support tagged missing values for numeric columns. SPSS supports up to three distinct values for character columns. Generally, operations involving a user-missing type return a system missing value.

Haven models these missing values in two different ways:

Tagged missing values

To support Stata's extended and SAS's special missing value, haven implements a tagged NA. It does this by taking advantage of the internal structure of a floating point NA. That allows these values to behave identical to NA in regular R operations, while still preserving the value of the tag.

The R interface for creating with tagged NAs is a little clunky because generally they'll be created by haven for you. But you can create your own with tagged_na():

x <- c(1:3, tagged_na("a", "z"), 3:1)
x

Note these tagged NAs behave identically to regular NAs, even when printing. To see their tags, use print_tagged_na():

print_tagged_na(x)

To test if a value is a tagged NA, use is_tagged_na(), and to extract the value of the tag, use na_tag():

is_tagged_na(x)
is_tagged_na(x, "a")

na_tag(x)

My expectation is that tagged missings are most often used in conjuction with labels (described below), so labelled vectors print the tags for you, and as_factor() knows how to relabel:

y <- labelled(x, c("Not home" = tagged_na("a"), "Refused" = tagged_na("z")))
y

as_factor(y)

User defined missing values

SPSS's user-defined values work differently to SAS and Stata. Each column can have either up to three distinct values that are considered as missing, or a range. Haven provides labelled_spss() as a subclass of labelled() to model these additional user-defined missings.

x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_value = 99)
x2 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_range = c(90, Inf))

x1
x2

These objects are somewhat dangerous to work with in R because most R functions don't know those values are missing:

mean(x1)

Because of that danger, the default behaviour of read_spss() is to return regular labelled objects where user-defined missing values have been converted to NAs. To get read_spss() to return labelled_spss() objects, you'll need to set user_na = TRUE.

I've defined an is.na() method so you can find them yourself:

is.na(x1)

And the presence of that method does mean many functions with an na.rm argument will work correctly:

mean(x1, na.rm = TRUE)

But generally you should either convert to a factor, convert to regular missing vaues, or strip the all the labels:

as_factor(x1)
zap_missing(x1)
zap_labels(x1)


Try the haven package in your browser

Any scripts or data that you put into this service are public.

haven documentation built on July 10, 2023, 2:04 a.m.