knitr::opts_chunk$set( collapse = TRUE, error = TRUE, purl = FALSE, comment = "### >" )
While the Introductory vignette gives an overview of core labelr functions, here, we offer an ad hoc dive into a range of miscellaneous special topics and additional functionalities. Let's go.
labelr is not intended for "large" data.frames, which is a fuzzy concept. To give a sense of what labelr can handle, let's see it in action with the NYC Flights 2013 data set: a moderate-not-big data.frame of ~340K rows.
Let's load labelr and the nycflights13 package.
opening_ding <- Sys.time() # to time labelr
library(labelr)
library(nycflights13)
We'll assign the data.frame to one we call df.
df <- flights
nrow(df)
We'll add a "frame label," which describes the data.frame overall.
df <- add_frame_lab(df, frame.lab = "On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.")
Note that the source data.frame (nycflights13::flights) is a tibble. The labelr package coerces augmented data.frames, such as tibbles and data.tables, into "pure" Base R data.frames -- and alerts you that it has done so. The intent is to avoid the dependencies, errors, or inconsistent and unpredictable behaviors that might result from labelr trying to integrate with or make sense of these or other competing, alternative data.frame constructs, which (a) by design behave differently from standard R data.frames in various subtle or not-so-subtle ways and which (b) may continue to evolve in the future.
Let's see what this did.
attr(df, "frame.lab") # check for attribute
get_frame_lab(df) # return frame.lab alongside data.frame name as a data.frame
get_frame_lab(df)$frame.lab
Now, let's assign variable NAME labels.
names_labs_vec <- c(
  "year" = "Year of departure",
  "month" = "Month of departure",
  "day" = "Day of departure",
  "dep_time" = "Actual departure time (format HHMM or HMM), local tz",
  "arr_time" = "Actual arrival time (format HHMM or HMM), local tz",
  "sched_dep_time" = "Scheduled departure times (format HHMM or HMM)",
  "sched_arr_time" = "Scheduled arrival time (format HHMM or HMM)",
  "dep_delay" = "Departure delays, in minutes",
  "arr_delay" = "Arrival delays, in minutes",
  "carrier" = "Two letter airline carrier abbreviation",
  "flight" = "Flight number",
  "tailnum" = "Plane tail number",
  "origin" = "Flight origin airport code",
  "dest" = "Flight destination airport code",
  "air_time" = "Minutes spent in the air",
  "distance" = "Miles between airports",
  "hour" = "Hour of scheduled departure time",
  "minute" = "Minutes component of scheduled departure time",
  "time_hour" = "Scheduled date and hour of the flight as a POSIXct date"
)
df <- add_name_labs(df, name.labs = names_labs_vec)
get_name_labs(df) # show that they've been added
Let's add variable VALUE labels for variable "carrier." Helpfully, a mapping of airlines' carrier codes to their full names ships with the nycflights13 package itself.
airlines <- nycflights13::airlines
head(airlines)
The carrier field of airlines matches the carrier column of df (formerly, flights).
ny_val <- airlines$carrier
The name field of airlines gives us the full airline names.
ny_lab <- airlines$name
Let's use these vectors to add value labels to df. We'll demo add_val1(), which accepts only one variable but allows you to pass its name unquoted.
df <- add_val1(df, var = carrier, vals = ny_val, labs = ny_lab, max.unique.vals = 20 )
(Side note on warnings: The package issues the first in what will become a series of potentially annoying warnings that you are applying value labels to a larger data.frame than labelr was built to handle. There is a reason that this is a warning, not an error: labelr will work on larger data.frames until it doesn't, which is to say that the burdens of computational intensiveness will become a drag on speed and R's in-session memory capacity. In the present case, labelr handles the data.frame just fine, but things take a little longer, and labelr seizes most opportunities to remind you that you're making it work overtime.)
Okay, back to the value-labeling. Our data.frame also has a month variable, expressed in integer terms (e.g., 1 indicates January, 9 indicates September). We will "hand-jam" month value labels, using add_val_labs(). This command is equivalent to add_val1(), except that it requires variable names to be quoted but allows you to supply more than one of them at a time (i.e., you can supply a character vector of variable names). In this case, we'll use it on just one variable.
First, we'll create our vectors of unique values and labels.
ny_month_vals <- c(1:12) # values
ny_month_labs <- c(
  "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
  "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"
) # labels
Note that order is important here: We need to supply exactly as many values as value labels, with each value label being uniquely associated with the value that shares its index. For example, in the above case, ny_month_vals[3] (here, 3) is associated with ny_month_labs[3] (here, "MAR").
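The positional pairing described above can be checked before any labels are applied. The following is a minimal base R sketch, not a view into labelr's internals; the names month_vals, month_labs, and month_map are illustrative and are kept separate from the vignette's own objects.

```r
# Base R sketch (not labelr internals): pair values and labels by position.
month_vals <- 1:12
month_labs <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

# Both vectors must be the same length, and no value may repeat.
stopifnot(length(month_vals) == length(month_labs))
stopifnot(!anyDuplicated(month_vals))

# A named character vector makes the value-to-label mapping explicit.
month_map <- setNames(month_labs, month_vals)
month_map[["3"]] # the label paired with value 3
```

If the two vectors were accidentally supplied in different orders, the mapping would still be built without complaint, which is why a quick look at a few entries is worthwhile.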
Now, let's use these two vectors to add value labels for the variable "month".
df <- add_val_labs(df, vars = "month", vals = ny_month_vals, labs = ny_month_labs, max.unique.vals = 20 )
Finally, we'll use add_quant_labs() to provide numerical range value labels for five quintiles of the variable "dep_time."
df <- add_quant_labs(df, "dep_time", qtiles = 5)
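For intuition about quantile-range labeling, here is a rough base R sketch that bins a numeric vector into quintiles with quantile() and cut(). It is an illustration of the general idea only: it makes no claim about how add_quant_labs() is implemented, and the bin label names below are made up for the example.

```r
# Base R sketch of quantile-range binning (an illustration, not labelr's code).
x <- mtcars$mpg

# Cut points for five equal-probability bins (quintiles).
breaks <- quantile(x, probs = seq(0, 1, by = 0.2), na.rm = TRUE)

# Assign each value to a bin; include.lowest keeps the minimum in the first bin.
# The label strings are hypothetical placeholders.
x_bin <- cut(x,
  breaks = breaks, include.lowest = TRUE,
  labels = c("bin1", "bin2", "bin3", "bin4", "bin5")
)

table(x_bin)
```

Note that cut() returns a factor, whereas labelr keeps the underlying numeric column intact and stores the range labels as meta-data.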
Let's see where these value-labeling operations have left us.
get_val_labs(df)
We can use head() to get a baseline look at select rows and variables.
head(df[c("origin", "dep_time", "dest", "year", "month", "carrier")])
Now, let's do the same for a version of df that we've modified with use_val_labs(), which converts all values of value-labeled variables to their corresponding labels.
df_swapd <- use_val_labs(df)
head(df_swapd[c("origin", "dep_time", "dest", "year", "month", "carrier")])
Instead of replacing values using use_val_labs() -- something we can't directly undo -- it might be safer to simply add "value-labels-on" character variables to the data.frame, while preserving the parent variables. This adds roughly 1M new cells to our df (three new columns of ~337K rows each), but let's throw caution to the wind with add_lab_cols().
df_plus <- add_lab_cols(df, vars = c("carrier", "month", "dep_time"))
head(df_plus[c(
  "origin", "dest", "year",
  "month", "month_lab",
  "dep_time", "dep_time_lab",
  "carrier", "carrier_lab"
)])
We can use flab() to filter df based on month and carrier, even when value labels are "invisible" (i.e., existing only as attributes() meta-data).
# labels are not visible (they exist only as attributes() meta-data)
head(df[c("carrier", "arr_delay")])
# we still can use them to filter (note: we're filtering on "JetBlue Airways",
# ...NOT its obscure code "B6")
df_fl <- flab(df, carrier == "JetBlue Airways" & arr_delay > 20)
# here's what's returned when we filtered on "JetBlue Airways" using flab()
head(df_fl[c("carrier", "arr_delay")])
# double-check that this is JetBlue
head(use_val_labs(df_fl)[c("carrier", "arr_delay")])
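To make the idea of filtering on a label more concrete, here is a small base R sketch that does the same thing by hand: translate the label back to its underlying code, then filter on the code. The toy data.frame and the carrier_labs lookup vector are hypothetical stand-ins for value-label meta-data; this is not how flab() is implemented.

```r
# Base R sketch: filter on a label by translating it back to its value.
# flights_mini and carrier_labs are made-up stand-ins for illustration.
flights_mini <- data.frame(
  carrier = c("B6", "UA", "B6", "DL"),
  arr_delay = c(35, 50, 5, 25)
)
carrier_labs <- c(
  B6 = "JetBlue Airways",
  UA = "United Air Lines Inc.",
  DL = "Delta Air Lines Inc."
)

# Find the code whose label is "JetBlue Airways", then filter on that code.
jetblue_code <- names(carrier_labs)[carrier_labs == "JetBlue Airways"]
late_jetblue <- subset(flights_mini, carrier == jetblue_code & arr_delay > 20)
late_jetblue
```

The returned rows still hold the code "B6", which parallels what we saw with flab(): the filter is expressed in labels, but the data keep their original values.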
How long did this entire NYC Flights session take (results will vary)?
the_buzzer <- Sys.time()
the_buzzer - opening_ding
labelr is not a fan of NA values or other "irregular" values, which are defined as infinite values, not-a-number values, and character values that look like them (e.g., "NAN", "INF", "inf", "Na").
When value-labeling a column / variable, such values are automatically given the catch-all label "NA" (which will be converted to an actual NA in any columns created by add_lab_cols() or use_val_labs()). You do not need (and should not try) to specify this yourself, and you should not try to over-ride labelr on this. If you want to use labelr and your data contain these sorts of values, your options are to accept the default "NA" label or convert these sorts of values to something else before labeling.
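If you want to find such values before labeling, a check along the following lines may help. This is a base R sketch under stated assumptions: is_irregular() is a hypothetical helper, not a labelr function, and the set of character look-alikes it tests is a guess based on the examples above, so labelr's own rule may be broader or narrower.

```r
# Base R sketch of an "irregular value" check. is_irregular() is a
# hypothetical helper for illustration, not part of labelr.
is_irregular <- function(x) {
  if (is.numeric(x)) {
    # is.finite() is FALSE for NA, NaN, Inf, and -Inf
    return(!is.finite(x))
  }
  # For other types, compare upper-cased text against assumed look-alikes
  txt <- toupper(trimws(as.character(x)))
  is.na(x) | txt %in% c("NA", "NAN", "INF", "-INF")
}

is_irregular(c(1, NA, Inf, -Inf, NaN))
is_irregular(c("A", "Na", "inf", NA, "B"))
```

Applying a check like this column by column (for example, with sapply(df, function(col) sum(is_irregular(col)))) gives a quick count of values that would receive the catch-all "NA" label.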
With that said, let's see how labelr handles this, with an assist from our old friend mtcars (packaged with R's base distribution).
First, let's assign mtcars to a new data.frame object that we will besmirch.
mtbad <- mtcars
Let's get on with the besmirching.
mtbad[1, 1:11] <- NA
rownames(mtbad)[1] <- "Missing Car"
mtbad[2, "am"] <- Inf
mtbad[3, "gear"] <- -Inf
mtbad[5, "carb"] <- NaN
mtbad[2, "mpg"] <- Inf
mtbad[3, "mpg"] <- NaN
# add a character variable, for demonstration purposes
# if it makes you feel better, you can pretend these are Consumer Reports or
# ...JD Power ratings or something
set.seed(9202) # for reproducibility
mtbad$grade <- sample(c("A", "B", "C"), nrow(mtbad), replace = TRUE)
mtbad[4, "grade"] <- NA
mtbad[5, "grade"] <- "NA"
mtbad[6, "grade"] <- "Inf"
# see where this leaves us
head(mtbad)
sapply(mtbad, class)
Now, let's add value labels to this unruly data.frame.
mtlabs <- mtbad |>
  add_val1(grade,
    vals = c("A", "B", "C"),
    labs = c("Gold", "Silver", "Bronze")
  ) |>
  add_val1(am,
    vals = c(0, 1),
    labs = c("auto", "stick")
  ) |>
  add_val1(carb,
    vals = c(1, 2, 3, 4, 6, 8), # not the most inspired use of labels
    labs = c(
      "1c", "2c", "3c",
      "4c", "6c", "8c"
    )
  ) |>
  add_val1(gear,
    vals = 3:5, # again, not the most compelling use case
    labs = c(
      "3-speed",
      "4-speed",
      "5-speed"
    )
  ) |>
  add_quant1(mpg, qtiles = 4) # add quartile-based value labels
get_val_labs(mtlabs, "am") # NA values were detected and dealt with
Let's streamline the data.frame with sselect() to make it more manageable.
mtless <- sselect(mtlabs, mpg, cyl, am, gear, carb, grade) # safely select
head(mtless, 5) # note that the irregular values are still here
Notice how all irregular values are coerced to NA when we substitute labels for values with use_val_labs().
head(use_val_labs(mtless), 5) # but they all go to NA if we `use_val_labs`
Now, let's try an add_lab_cols() view.
mtlabs_plus <- add_lab_cols(mtlabs, c("mpg", "am")) # creates, adds "mpg_lab" and "am_lab" cols
mtlabs_plus <- sselect(mtlabs_plus, mpg, mpg_lab, am, am_lab) # select cols
head(mtlabs_plus) # where we landed
What if we had tried to explicitly label the NA values and/or irregular values themselves? We would have failed.
# Trying to Label an Irregular Value (-Inf)
mtbad <- add_val1(
  data = mtcars,
  var = gear,
  vals = -Inf,
  labs = c("neg.inf")
)
# Trying to Label an Irregular Value (NA)
mtbad <- add_val_labs(
  data = mtbad,
  vars = "grade",
  vals = NA,
  labs = c("miss")
)
# Trying to Label an Irregular Value (NaN)
mtbad <- add_val_labs(
  data = mtbad,
  vars = "carb",
  vals = NaN,
  labs = c("nan-v")
)
# labelr also treats "character variants" of irregular values as irregular values.
mtbad <- add_val1(
  data = mtbad,
  var = carb,
  vals = "NAN",
  labs = c("nan-v")
)
Again, labelr handles NA and irregular values and resists our efforts to take such matters into our own hands.
R's concept of a factor variable shares some affinities with the concept of a value-labeled variable and can be viewed as one approach to value labeling. However, factors can manifest idiosyncratic and surprising behaviors depending on the function to which you're trying to apply them. They are character-like, but they are not character values. They are built on top of integers, but they won't submit to all of the operations that integers do. They do some very handy things in certain model-fitting applications, but their behavior "under the hood" can be counter-intuitive or opaque. Simply put, they are their own thing.
So, while factors have their purposes, it would be nice to associate value labels with the distinct values of data.frame variables in a manner that preserves the integrity and transparency of the underlying values (factors tend to be a bit opaque about this) and that allows you to view or use the labels in flexible ways.
And if you wanted to work with a factor, it would be nice if you could add value labels to it without it ceasing to exist and behave as a factor.
With that said, let's see if we can have our label-factor cake and eat it, too, using the iris data.frame that comes pre-packaged with R.
unique(iris$Species)
sapply(iris, class) # nothing up our sleeve -- "Species" is a factor
Let's add value labels to "Species" and assign the result to a new data.frame that we'll call irlab. For our value labels, we'll use "se", "ve", and "vi", which are not adding much new information, but they will help to illustrate what we can do with labelr and a factor variable.
irlab <- add_val_labs(iris,
  vars = "Species",
  vals = c("setosa", "versicolor", "virginica"),
  labs = c("se", "ve", "vi")
)
# this also would've worked
# irlab_dos <- add_val1(iris, Species,
#   vals = c("setosa", "versicolor", "virginica"),
#   labs = c("se", "ve", "vi")
# )
Note that we could have just as (or even more) easily used add_val1(), which works for a single variable at a time and allows us to avoid quoting our column name, if that matters to us. In contrast, add_val_labs() requires us to put our variable name(s) in quotes, but it also gives us the option to apply a common value-label scheme to several variables at once (e.g., Likert-style survey responses). We'll see an example of this type of use case in action in a little bit.
For now, though, let's prove that the iris and irlab data.frames are functionally identical.
First, note that irlab looks and acts just like iris in the usual ways that matter.
summary(iris)
summary(irlab)
head(iris, 4)
head(irlab, 4)
lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
lm(Sepal.Length ~ Sepal.Width + Species, data = irlab) # values are same
Note also that irlab's "Species" is still a factor, just like its iris counterpart/parent.
sapply(irlab, class)
levels(irlab$Species)
But irlab's "Species" has value labels!
get_val_labs(irlab, "Species")
And they work.
head(use_val_labs(irlab))
ir_v <- flab(irlab, Species == "vi")
head(ir_v, 5)
Our take-aways so far? Factors can be value-labeled while staying factors, and we can use the labels to do labelr-y things with those factors. We can have both.
We may want to go further and add the labeled variable alongside the factor version.
irlab_aug <- add_lab_cols(irlab, vars = "Species")
This gives us a new variable called "Species_lab". Let's get select rows of the resulting data.frame, since we want to see all the different species.
set.seed(231)
sample_rows <- sample(seq_len(nrow(irlab)), 10, replace = FALSE)
irlab_aug[sample_rows, ]
sapply(irlab_aug, class)
with(irlab_aug, table(Species, Species_lab))
Caution: Replacing the entire data.frame using use_val_labs() WILL coerce factors to character, since the value labels are character values, not recognized factor levels.
ir_char <- use_val_labs(irlab) # we assign this to a new data.frame
sapply(ir_char, class)
head(ir_char, 3)
class(ir_char$Species) # it's character
Of course, even then, we could explicitly coerce the labels to be factors if we wanted.
ir_fact <- use_val_labs(irlab)
ir_fact$Species <- factor(ir_fact$Species,
  levels = c("se", "ve", "vi"),
  labels = c("se", "ve", "vi")
)
head(ir_fact, 3)
class(ir_fact$Species) # it's a factor
levels(ir_fact$Species) # its levels are the labels
We've recovered.
with(ir_fact, tapply(Sepal.Width, Species, mean))
with(irlab, tapply(Sepal.Width, Species, mean))
with(iris, tapply(Sepal.Width, Species, mean))
Value labels work with ordered factors, too. Let's make a fictional ordered factor that we add to ir_ord. We can pretend that this is some sort of judge's overall quality rating, if that helps.
ir_ord <- iris
set.seed(293)
qrating <- c("AAA", "AA", "A", "BBB", "AA", "BBB", "A")
ir_ord$qrat <- sample(qrating, 150, replace = TRUE)
ir_ord$qrat <- factor(ir_ord$qrat,
  ordered = TRUE,
  levels = c("AAA", "AA", "A", "BBB")
)
Where do we stand with this factor?
levels(ir_ord$qrat)
class(ir_ord$qrat)
Now, let's add value labels to it.
ir_ord <- add_val_labs(ir_ord, vars = "qrat", vals = c("AAA", "AA", "A", "BBB"), labs = c( "unimpeachable", "excellent", "very good", "meh" ) )
Let's add a separate column with those labels as a distinct (character) variable unto itself, existing in addition to (not replacing) "qrat".
ir_ord <- add_lab_cols(ir_ord, vars = "qrat")
head(ir_ord, 10)
with(ir_ord, table(qrat_lab, qrat))
class(ir_ord$qrat)
levels(ir_ord$qrat)
class(ir_ord$qrat_lab)
get_val_labs(ir_ord, "qrat") # labs are still there for qrat
get_val_labs(ir_ord, "qrat_lab") # no labs here; this is just a character var
labelr offers some additional facilities for working with factors and categorical variables. For example, functions add_lab_dummies() (alias ald()) and add_lab_dumm1() (alias ald1()) will generate and assign a dummy (aka binary aka indicator) variable for each unique value label of a value-labeled variable -- factor or otherwise.
Alternatively, lab_int_to_factor() (alias int2f()) allows you to convert a value-labeled integer variable (or other non-decimal-having numeric column) to a factor, while factor_to_lab_int() (alias f2int()) allows you to convert a factor to a value-labeled integer variable. Note that the latter is NOT a straightforward "undo" for the former: the resulting unique integer values and their ordering may differ, as we demonstrate.
First, let's convert a factor to a value-labeled integer.
class(iris[["Species"]])
iris_df <- factor_to_lab_int(iris, Species)
class(iris_df[["Species"]])
head(iris_df$Species)
get_val_labs(iris_df, "Species")
Now, let's value-label an integer and convert it to a factor. Note that our variable is not a strict as.integer() integer, but it's a numeric variable with no decimal values, and that's good enough for lab_int_to_factor().
carb_orig <- mtcars
carb_orig <- add_val_labs(
  data = mtcars,
  vars = "carb",
  vals = c(1, 2, 3, 4, 6, 8),
  labs = c(
    "1c", "2c", # a tad silly, but these value labels will demo the principle
    "3c", "4c",
    "6c", "8c"
  )
)
# carb as labeled numeric
is.integer(carb_orig$carb) # note: carb not technically an "as.integer()" integer
class(carb_orig$carb) # but it IS numeric
has_decv(carb_orig$carb) # and does NOT have decimals; so, lab_int_to_factor() works
levels(carb_orig$carb) # none, not a factor
head(carb_orig$carb, 3)
mean(carb_orig$carb) # compare to carb_to_int (below)
lm(mpg ~ carb, data = carb_orig) # compare to carb_to_int (below)
# note this for comparison to below
(adj_r2_orig <- summary(lm(mpg ~ carb, data = carb_orig))$adj.r.squared)
# compare to counterparts below
AIC(lm(mpg ~ carb, data = carb_orig))
# Make carb a factor
carb_fac <- lab_int_to_factor(carb_orig, carb) # alias int2f() also works
class(carb_fac$carb) # now it's a factor
levels(carb_fac$carb) # like any good factor, it has levels
head(carb_fac$carb, 3)
lm(mpg ~ carb, data = carb_fac) # again: carb is a factor
# compare these model fit stats to counterparts above and below
(adj_r2_fac <- summary(lm(mpg ~ carb, data = carb_fac))$adj.r.squared)
# compare to counterparts above and below
AIC(lm(mpg ~ carb, data = carb_fac))
Note that we can use factor_to_lab_int() to convert "carb" from a factor to a labeled integer variable. However, this is not a straightforward "undo" of what we just did: the resulting labeled integer won't be identical to the "carb" column of mtcars that we started with, because factor_to_lab_int() converts the supplied factor variable's values to sequentially ordered integers (from 1 to k, where k is the number of unique factor levels), ordered in terms of the levels of the factor variable being converted.
# ??"back"?? to integer? Not quite. Compare below to carb_orig above
carb_to_int <- factor_to_lab_int(carb_fac, carb) # alias f2int() also works
class(carb_to_int$carb) # Is an integer
levels(carb_to_int$carb) # NOT a factor
mean(carb_to_int$carb) # NOT the same as carb_orig
identical(carb_to_int$carb, carb_orig$carb) # really!
lm(mpg ~ carb, data = carb_to_int) # NOT the same as carb_orig
# Compare to counterpart calls from earlier iterations of carb (above)
(adj_r2_int <- summary(lm(mpg ~ carb, data = carb_to_int))$adj.r.squared)
AIC(lm(mpg ~ carb, data = carb_to_int))
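The reason the round trip changes the values can be seen with plain base R factors, without any labelr functions. The sketch below mirrors the behavior described above (levels mapped to 1 through k); it is an analogy, not factor_to_lab_int()'s own code, and the object names are chosen to avoid overwriting anything created earlier.

```r
# Base R sketch of why a factor round trip changes the values.
carb_raw <- mtcars$carb # values drawn from 1, 2, 3, 4, 6, 8
carb_as_factor <- factor(carb_raw) # six levels, one per distinct value
carb_codes <- as.integer(carb_as_factor) # level positions, 1 through 6

# Values 6 and 8 come back as positions 5 and 6, so summaries shift.
sort(unique(carb_raw))
sort(unique(carb_codes))
mean(carb_raw) == mean(carb_codes)
```

In base R, as.numeric(as.character(carb_as_factor)) would recover the original numbers because the levels happen to be numeric strings. With labelr's conversion, the levels are the value labels ("1c", "2c", and so on), so there is no comparable shortcut.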
Now, let's quickly demo add_lab_dummies(). To do so, we'll revisit the "Species" column of irlab, our factor variable from iris that we value-labeled a few moments ago. It's still here and still has value labels.
get_val_labs(irlab, "Species")
Let's use add_lab_dummies() to create a dummy variable for each of its labels.
irl_dumm <- add_lab_dummies(irlab, "Species")
head(irl_dumm) # they're there!
tail(irl_dumm) # again, they're there!
We can use add_lab_dumm1() to achieve the same result without quoting the column name. The countervailing advantage of add_lab_dummies() is that it lets you create dummy variables for more than one value-labeled variable at a time (add_lab_dumm1() does not).
irl_dumm2 <- add_lab_dumm1(irlab, Species)
head(irl_dumm2) # again, they're there!
tail(irl_dumm2) # again, they're there!
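For readers who want to see what a one-dummy-per-category layout amounts to, here is a base R sketch that builds indicator columns from the levels of iris's Species factor. It is not labelr's implementation, and the "Species_" column-name prefix is an arbitrary choice for this example rather than the naming scheme the labelr functions use.

```r
# Base R sketch of one-dummy-per-category (not labelr's implementation).
# The "Species_" prefix is an arbitrary naming choice for this example.
ir <- iris
for (lev in levels(ir$Species)) {
  ir[[paste0("Species_", lev)]] <- as.integer(ir$Species == lev)
}

head(ir[c("Species", "Species_setosa", "Species_versicolor", "Species_virginica")], 3)
```

Each row has exactly one dummy equal to 1, which is the defining property of this kind of encoding.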
Functions for adding value labels (e.g., add_val_labs, add_quant_labs and add_m1_lab) will do partial matching if the partial argument is set to TRUE; that is, they will apply labels to every column whose name contains one of the supplied substrings.
Let's use labelr's make_likert_data() function to generate some fake Likert scale-style survey data to demonstrate this more fully.
set.seed(272) # for reproducibility
dflik <- make_likert_data(scale = 1:7) # another labelr function
head(dflik)
We'll put the values we wish to label and the labels we wish to use in stand-alone vectors, which we will supply to add_val_labs in a moment.
vals2label <- 1:7
labs2use <- c(
  "VSD",
  "SD",
  "D",
  "N",
  "A",
  "SA",
  "VSA"
)
Now, let's associate/apply the value labels to ALL vars with "x" in their name and also to var "y3." Note: partial = TRUE.
dflik <- add_val_labs(
  data = dflik, vars = c("x", "y3"), ### note the vars args
  vals = vals2label,
  labs = labs2use,
  partial = TRUE # applying to all cols with "x" or "y3" substring in names
)
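Before relying on partial matching, it can be useful to preview which column names a set of substrings would catch. The base R sketch below does this with grepl(); the column names are hypothetical stand-ins, and the matching rule shown (plain substring containment) is an assumption about what partial = TRUE does, based on the description above.

```r
# Base R sketch of substring matching on column names (illustrative only).
# The column names below are hypothetical stand-ins.
col_names <- c("id", "x1", "x2", "x3", "y1", "y2", "y3")
patterns <- c("x", "y3")

# A column is selected if its name contains any of the patterns.
hits <- vapply(col_names, function(nm) {
  any(vapply(patterns, grepl, logical(1), x = nm, fixed = TRUE))
}, logical(1))

col_names[hits]
```

Short patterns can match more than intended (for example, "x" would also catch a column named "sex" or "tax_rate"), so a preview like this is a cheap safeguard.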
Let's compare dflik with value labels present but "off" to labels "on."
First, present but "off."
head(dflik)
Now, let's "turn on" (use) these value labels.
lik1 <- uvl(dflik) # assign to new object, since we can't "undo"
head(lik1) # we could have skipped previous call by using labelr::headl(dflik)
Yea, verily: All variables with "x" in their name (and "y3") got the labels!
Suppose we want to drop these value labels for a select few, but not all, of these variables. drop_val_labs can get the job done.
dfdrop <- drop_val_labs(dflik,
  c("x2", "y3"),
  partial = FALSE
)
Most of our previously labeled columns remain so; but not "x2" and "y3."
get_val_labs(dfdrop, c("x2", "y3"))
Compare to values for variable "x1" (we did not drop value labels from this one)
get_val_labs(dfdrop, "x1")
Just like we did with add_val_labs(), we also can use a single command to drop value labels from all variables with "x" in their variable names.
dfxgone <- drop_val_labs(dflik,
  c("x"),
  partial = TRUE # note
)
"y3" still has value labels, but now all "x" var value labels are gone.
get_val_labs(dfxgone)
tabl()
Finally, let's get to know labelr's tabl() function, which supports count or proportion tabulations with labels turned "on" or "off" and offers some other functionalities.
set.seed(4847) # for reproducibility
df <- make_demo_data(n = 1000) # make a fictional n = 1000 data set
df <- add_val1(df, # data.frame
  var = raceth, # var to label, unquoted since this is add_val1()
  vals = c(1:7), # label values 1 through 7, inclusive
  labs = c(
    "White", "Black", "Hispanic", # ordered labels for sequential vals 1-7
    "Asian", "AIAN", "Multi", "Other"
  )
)
df <- add_val1(
  data = df,
  var = gender,
  vals = c(0, 1, 2, 3, 4), # the values to be labeled
  labs = c("M", "F", "TR", "NB", "Diff-Term"), # labs order should reflect vals order
  max.unique.vals = 10
)
# label values of var "x1" according to quantile ranges
df <- add_quant1(
  data = df,
  var = x1, # apply quantile range value labels to this var
  qtiles = 3 # first, second, and third tertiles
)
# apply many-vals-get-one-label labels to "edu" (note vals 3-5 all get same lab)
df <- add_m1_lab(df, "edu", vals = c(3:5), lab = "Some College+")
df <- add_m1_lab(df, "edu", vals = 1, lab = "Not HS Grad")
df <- add_m1_lab(df, "edu", vals = 2, lab = "HSG, No College")
# show value labels
get_val_labs(df)
With tabl(), tables can be generated...
...in terms of values
tabl(df, vars = "gender", labs.on = FALSE)
...or in terms of labels
tabl(df, vars = "gender", labs.on = TRUE) # labs.on = TRUE is the default
...in proportions
tabl(df, vars = c("gender", "edu"), prop.digits = 3)
...cross-tab style
head(tabl(df, vars = c("raceth", "edu"), wide.col = "gender"), 20)
...with non-value-labeled data.frames
tabl(iris, "Species") # explicit vars arg with one-var ("Species")
# many-valued numeric vars automatically converted to quantile categories
tabl(mtcars, c("am", "gear", "cyl", "disp", "mpg"),
  qtiles = 4,
  zero.rm = TRUE
)
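For comparison, a long-format count and proportion table can be assembled in base R with table() and as.data.frame(). This sketch only approximates the general shape of such output: it is not tabl() itself, and the column names n and prop are choices made for this example.

```r
# Base R sketch of a long-format count and proportion table (not tabl() itself).
counts <- as.data.frame(
  table(am = mtcars$am, gear = mtcars$gear),
  responseName = "n"
)
counts$prop <- round(counts$n / sum(counts$n), 3)

# Drop combinations that never occur.
counts[counts$n > 0, ]
```

Unlike this sketch, tabl() can also report labels in place of values and bin many-valued numeric columns into quantile categories, as shown above.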
This is the suitably abrupt ending to our choppy, ad hoc overview of some additional labelr capabilities and special topics. Thanks for reading.