README.md

sentinel

An S3 class that allows different flavors of missing values in numeric vectors.

One can divide measures into two groups: qualitative and quantitative. However, record formats often mix the two. Some of the values are simply interpreted as is: a 2 is a 2. Some of the values are codes which represent qualities instead of numbers: an 8 means the measure's not applicable. These are sometimes called "sentinel values." And, of course, some values are just plain missing.

When handling these data in R, a common idiom is to split the column in twain: a numeric vector for the quantitative and a factor for the qualitative. This is the simplest solution and will often work fine. But it does something risky: it separates linked data. The user must remember to keep them together, and usually does this with clever variable or column names.

Clever is bad. Code with my_data[, paste0(vars, c("_num", "_flag"))] is hard to read. Code with get is hard to follow.
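For reference, the split-column idiom described above might look like this in base R. This is only a sketch: the data, the code-to-label mapping, and the names score_num and score_flag are made up for illustration.

```r
# Illustrative raw column: 98 and 99 are codes, not measurements.
raw   <- c(10, 20, 98, 99, NA)
codes <- c(refused = 98, `not recorded` = 99)

is_code <- raw %in% codes

# Quantitative half: coded entries become NA.
score_num <- ifelse(is_code, NA, raw)

# Qualitative half: a factor saying why each coded entry is missing.
score_flag <- factor(names(codes)[match(raw, codes)], levels = names(codes))

# The two halves are linked only by position and by naming convention.
my_data <- data.frame(score_num, score_flag)
my_data
```

Nothing stops later code from sorting or filtering one column without the other, which is the risk the paragraph above points out.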

The sentinel package offers the sentineled class to bundle numeric values and categorical missing values into a single object.

library(sentinel)

x <- sentineled(
  c(10, 20, 98, 99, NA),
  sentinels = c(98, 99),
  labels    = c("refused", "not recorded")
)
x
## [1] 10             20             <refused>      <not recorded>
## [5] NA            
## sentinel values: "" "refused" "not recorded"

The numbers are numbers, the categories are categorical, and the unknowns are just unknown.

Still a vector

A sentineled object is a vector. When subset, it remains a sentineled object with the same possible sentinel values.

x[1]
## [1] 10
## sentinel values: "" "refused" "not recorded"
x[1:2]
## [1] 10 20
## sentinel values: "" "refused" "not recorded"
x[[3]]
## [1] <refused>
## sentinel values: "" "refused" "not recorded"
x[x < 15]
## [1] 10             <refused>      <not recorded> NA            
## sentinel values: "" "refused" "not recorded"
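
To see how subsetting can keep the set of possible sentinel values intact, here is a minimal base-R sketch. The class name toy_sentineled and its `[` method are hypothetical stand-ins, not the package's implementation, and the sketch only covers plain positional indices.

```r
# Hypothetical toy class: numeric values plus a factor of missingness labels.
x_toy <- structure(
  c(10, 20, NA, NA, NA),
  sentinels = factor(
    c("", "", "refused", "not recorded", NA),
    levels = c("", "refused", "not recorded")
  ),
  class = "toy_sentineled"
)

# Subset the values and the labels together so they stay aligned.
`[.toy_sentineled` <- function(x, i) {
  structure(
    unclass(x)[i],
    sentinels = attr(x, "sentinels")[i],  # factor subsetting keeps all levels
    class = "toy_sentineled"
  )
}

y <- x_toy[1:2]
levels(attr(y, "sentinels"))
```

Because subsetting a factor keeps every level by default, the subset still knows about "refused" and "not recorded" even when none of its elements carry those labels.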

A sentineled vector can be used in arithmetic, with all non-missing values acting like normal numeric values. Where possible, the result is a sentineled object with the appropriate sentinel values.

mean(x, na.rm = TRUE)
## [1] 15
x / 100
## [1] 0.1            0.2            <refused>      <not recorded>
## [5] NA            
## sentinel values: "" "refused" "not recorded"
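
One way arithmetic could carry the labels along is an S3 Ops group method. The sketch below uses a hypothetical toy_sentineled class and assumes the left operand is the toy object and the right operand is a plain number; the package itself may work quite differently.

```r
# Hypothetical toy object: coded entries are stored as NA, labels as a factor.
x_toy <- structure(
  c(10, 20, NA, NA, NA),
  sentinels = factor(
    c("", "", "refused", "not recorded", NA),
    levels = c("", "refused", "not recorded")
  ),
  class = "toy_sentineled"
)

# Sketch only: assumes e1 is the toy object and e2 is a plain numeric.
Ops.toy_sentineled <- function(e1, e2) {
  op     <- get(.Generic, mode = "function")  # the operator being called, e.g. `/`
  result <- op(as.vector(unclass(e1)), e2)    # NA entries simply stay NA
  structure(result, sentinels = attr(e1, "sentinels"), class = "toy_sentineled")
}

scaled <- x_toy / 100
as.vector(unclass(scaled))

# Summaries see only the real numbers once NAs are dropped.
mean(unclass(x_toy), na.rm = TRUE)
```

Storing the coded entries as NA means ordinary NA propagation does most of the work; the method only has to reattach the labels.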

It can even be a column in a data.frame.

data.frame(
  element = c("argon", "boron", "chlorine"),
  mass    = sentineled(c(3, "x", 8), "x", "scale malfunction")
)
##    element                mass
## 1    argon                   3
## 2    boron <scale malfunction>
## 3 chlorine                   8

Using the missing values

The sentinel codes are treated as missing, but the different categories of missing are stored as a factor vector in the "sentinels" attribute of the object. Use the sentinels function to access them.

sentinels(x)
## [1]                           refused      not recorded <NA>        
## Levels:  refused not recorded
x[sentinels(x) != "refused"]
## [1] 10             20             <not recorded> NA            
## sentinel values: "" "refused" "not recorded"

Notice that, for the non-missing values in x, their respective sentinel codes are blanks ("").

as.character(sentinels(x))
## [1] ""             ""             "refused"      "not recorded"
## [5] NA
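
The storage layout described above (numeric values with a factor in the "sentinels" attribute) can be imitated in a few lines of base R. The constructor toy_sentineled below is a hypothetical sketch for illustration, not the package's code.

```r
# Hypothetical constructor mimicking the layout: NA for coded values,
# plus a factor attribute recording the reason.
toy_sentineled <- function(x, sentinels, labels) {
  hit <- match(x, sentinels)        # which code, if any, each value matches
  lab <- ifelse(is.na(x), NA, "")   # "" marks an ordinary value, NA a plain missing
  lab[!is.na(hit)] <- labels[hit[!is.na(hit)]]

  values <- x
  values[!is.na(hit)] <- NA         # coded entries are treated as missing

  structure(
    values,
    sentinels = factor(lab, levels = c("", labels)),
    class = "toy_sentineled"
  )
}

x_toy <- toy_sentineled(
  c(10, 20, 98, 99, NA),
  sentinels = c(98, 99),
  labels    = c("refused", "not recorded")
)
as.character(attr(x_toy, "sentinels"))
```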

It's recommended to use explanatory sentinel levels for all expected types of missing. That way, if a value is shown as just plain NA, it's a sign something went wrong in the analysis.
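
A quick check for unexplained missing values might look like the following sketch. The factor flags is a hand-built stand-in for what sentinels(x) returns in the examples above, so the snippet runs without the package.

```r
# Stand-in for sentinels(x): "" for real values, a label for coded ones,
# and NA where no explanation was recorded.
flags <- factor(
  c("", "", "refused", "not recorded", NA),
  levels = c("", "refused", "not recorded")
)

# Positions that are missing without any recorded reason.
unexplained <- which(is.na(flags))
unexplained
```

In an analysis script, a guard such as stopifnot(length(unexplained) == 0) would stop early when an unlabeled NA slips in.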



WerthPADOH/sentinel documentation built on May 5, 2019, 4:49 p.m.