In base R, missing values are indicated using the specific value NA. Regular NAs can be used with any type of vector (double, integer, character, factor, Date, etc.).
Other statistical software packages have implemented ways to differentiate several types of missing values.
Stata and SAS have a system of tagged NAs, where NA values are tagged with a letter (from a to z). SPSS allows users to indicate that certain non-missing values should be treated as missing in some analyses (user NAs). The haven package implements tagged NAs and user NAs in order to keep this information when importing files from Stata, SAS or SPSS.
library(labelled)
Tagged NAs are proper NA values with a tag attached to them. They can be created with tagged_na(). The attached tag should be a single letter, lowercase (a-z) or uppercase (A-Z).
x <- c(1:5, tagged_na("a"), tagged_na("z"), NA)
Most R functions treat tagged NAs as regular NAs. By default, they are printed like any other regular NA.
x
is.na(x)
To show/print their tags, you need to use na_tag(), print_tagged_na() or format_tagged_na().
na_tag(x)
print_tagged_na(x)
format_tagged_na(x)
To test whether a given NA is a regular NA or a tagged NA, use is_regular_na() or is_tagged_na().
is.na(x)
is_tagged_na(x)
# You can test for specific tagged NAs with the second argument
is_tagged_na(x, "a")
is_regular_na(x)
Tagged NAs can be defined only for double vectors. If you add a tagged NA to a character vector, it will be converted into a regular NA. If you add a tagged NA to an integer vector, the vector will be converted into a double vector.
y <- c("a", "b", tagged_na("z"))
y
is_tagged_na(y)
format_tagged_na(y)
z <- c(1L, 2L, tagged_na("a"))
typeof(z)
format_tagged_na(z)
By default, functions such as base::unique(), base::duplicated(), base::order() or base::sort() will treat tagged NAs as the same thing as a regular NA. You can use unique_tagged_na(), duplicated_tagged_na(), order_tagged_na() and sort_tagged_na() as alternatives that will treat two tagged NAs with different tags as separate values.
x <- c(1, 2, tagged_na("a"), 1, tagged_na("z"), 2, tagged_na("a"), NA)
x %>% print_tagged_na()
unique(x) %>% print_tagged_na()
unique_tagged_na(x) %>% print_tagged_na()
duplicated(x)
duplicated_tagged_na(x)
sort(x, na.last = TRUE) %>% print_tagged_na()
sort_tagged_na(x) %>% print_tagged_na()
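The paragraph above also mentions order_tagged_na(). A minimal sketch of how it might be used, assuming its default arguments are acceptable: it returns an ordering permutation that can be used to index the vector. The variable name o is only illustrative.

```r
library(labelled)

x <- c(1, 2, tagged_na("a"), 1, tagged_na("z"), 2, tagged_na("a"), NA)

# order_tagged_na() returns a permutation of the indices of x,
# computed so that NAs with different tags are not treated as ties
o <- order_tagged_na(x)
o

# Indexing x with that permutation gives the reordered vector;
# print_tagged_na() shows which tag each NA carries
print_tagged_na(x[o])
```

Working with the permutation rather than the sorted values is useful when the same ordering has to be applied to other columns of a data frame.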
It is possible to define value labels for tagged NAs.
x <- c(1, 0, 1, tagged_na("r"), 0, tagged_na("d"), tagged_na("z"), NA)
val_labels(x) <- c(
  no = 0,
  yes = 1,
  "don't know" = tagged_na("d"),
  refusal = tagged_na("r")
)
x
When converting such a labelled vector into a factor, tagged NAs are, by default, converted into regular NAs (it is not possible to define tagged NAs with factors).
to_factor(x)
However, the explicit_tagged_na option of to_factor() lets you transform tagged NAs into explicit factor levels.
to_factor(x, explicit_tagged_na = TRUE)
to_factor(x, levels = "prefixed", explicit_tagged_na = TRUE)
Tagged NAs can be converted into user NAs with tagged_na_to_user_na().
tagged_na_to_user_na(x)
tagged_na_to_user_na(x, user_na_start = 10)
Use tagged_na_to_regular_na() to convert tagged NAs into regular NAs.
tagged_na_to_regular_na(x)
tagged_na_to_regular_na(x) %>% is_tagged_na()
haven introduced a haven_labelled_spss class to deal with user-defined missing values in a similar way to SPSS. In this case, additional attributes are used to indicate which values should be considered as missing, but such values are not stored as internal NA values. Note that most R functions will not take this information into account. Therefore, you will have to convert missing values into NA if required before analysis. These user-defined missing values can co-exist with internal NA values.
User NAs can be created directly with labelled_spss(). You can also manipulate them with na_values() and na_range().
v <- labelled(c(1, 2, 3, 9, 1, 3, 2, NA), c(yes = 1, no = 3, "don't know" = 9))
v
na_values(v) <- 9
v
na_values(v) <- NULL
v
na_range(v) <- c(5, Inf)
na_range(v)
v
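The example above starts from labelled() and adds user NAs afterwards. As a complement, here is a minimal sketch of declaring them directly in a single labelled_spss() call. The variable name w is only illustrative, and the data mirror the example above.

```r
library(labelled)

# Declare the value labels and the user NA (9) at creation time.
# The trailing NA remains a regular, internal NA.
w <- labelled_spss(
  c(1, 2, 3, 9, 1, 3, 2, NA),
  labels = c(yes = 1, no = 3, "don't know" = 9),
  na_values = 9
)

na_values(w)      # the values declared as user NAs
is_user_na(w)     # TRUE only where the value is 9
is_regular_na(w)  # TRUE only for the internal NA
```

labelled_spss() also accepts an na_range argument for declaring a range of values rather than a discrete set.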
NB: you can also use set_na_range() and set_na_values() for a dplyr-like syntax.
library(dplyr)
# setting value labels and user NAs
df <- tibble(s1 = c("M", "M", "F", "F"), s2 = c(1, 1, 2, 9)) %>%
  set_value_labels(s2 = c(yes = 1, no = 2)) %>%
  set_na_values(s2 = 9)
df$s2
# removing user NAs
df <- df %>% set_na_values(s2 = NULL)
df$s2
Note that is.na() will return TRUE for user NAs. Use is_user_na() to test if a specific value is a user NA and is_regular_na() to test if it is a regular NA.
v
is.na(v)
is_user_na(v)
is_regular_na(v)
For most R functions, user NAs are still regular values.
x <- c(1:5, 11:15)
na_range(x) <- c(10, Inf)
val_labels(x) <- c("dk" = 11, "refused" = 15)
x
mean(x)
You can convert user NAs into regular NAs with user_na_to_na() or user_na_to_regular_na() (both functions are identical).
user_na_to_na(x)
mean(user_na_to_na(x), na.rm = TRUE)
Alternatively, if the vector is numeric, you can convert user NAs into tagged NAs with user_na_to_tagged_na().
user_na_to_tagged_na(x)
mean(user_na_to_tagged_na(x), na.rm = TRUE)
Finally, you can also remove the user NA definitions without converting these values to NA, using remove_user_na().
remove_user_na(x)
mean(remove_user_na(x))