library(learnr)
library(tutorial.helpers)
library(tidyverse)

# Without this hack, we fail on GHA! cols() and col_factor() have conflicting
# versions in readr, vroom and scales. Of course, we want the readr versions. We
# could solve this with either:
#   library(conflicted)
#   conflict_prefer("cols", "readr")
# and so on. Or with:
#   library(tidymodels)
#   tidymodels_prefer()
# because this applies to tidyverse functions as well.
# But that all seems like overkill, given that it all works for students
# regardless.

cols <- readr::cols
col_factor <- readr::col_factor

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60,
        tutorial.storage = "local")

x1 <- c("Dec", "Apr", "Jan", "Mar")
x2 <- c("Dec", "Apr", "Jam", "Mar")

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

csv <- "
month,value
Feb,12
Mar,56
Feb,14
Jan,12"
This tutorial covers Chapter 16: Factors from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. forcats is the core Tidyverse package for working with categorical variables, called "factors" in R. Key commands include fct()
for creating factors, fct_reorder()
for changing the order of the levels, and fct_recode()
for recoding factors.
Factors are used for categorical variables --- variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.
Load the tidyverse library.
library(...)
library(tidyverse)
One of the nine core packages within the Tidyverse is forcats, a package dedicated to working with factors. By loading tidyverse, we automatically get access to forcats and the other "core" Tidyverse packages.
Look up the help page for forcats by entering help(package = "forcats")
at the Console. Copy/paste the lines for the first help pages.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 5)
forcats (an anagram of the word "factors") provides a wide range of helpers for working with categorical variables.
Hit "Run Code" to create the variable x1
.
x1 <- c("Dec", "Apr", "Jan", "Mar")
x1 <- c("Dec", "Apr", "Jan", "Mar")
Note that x1 is a character variable. This can lead to all sorts of problems, because months are a good example of a categorical variable: there are exactly 12 possible values.
Run sort()
on x1
.
sort(...)
sort(x1)
Because x1
is a character variable, this sorts alphabetically, which is not what we want. We would prefer that the sort order correspond to the order in which months appear in the calendar.
Hit "Run Code" to create the x2
variable, another character vector. But note the misspelling of "Jan" as "Jam".
x2 <- c("Dec", "Apr", "Jam", "Mar")
x2 <- c("Dec", "Apr", "Jam", "Mar")
Because x1
and x2
are both character vectors, nothing will catch the contradiction between "Jan" and "Jam." Using factors will force us to notice such errors.
To create a factor you must start by creating a list of the valid "levels." Hit "Run Code" to create the month_levels
variable.
month_levels <- c( "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec" )
Note that month_levels
is just another character vector. We need it, however, to create a factor variable.
Run factor(x1, levels = month_levels)
.
factor(x1, ... = month_levels)
factor(x1, levels = month_levels)
The function factor()
is a part of base R, not forcats. It creates a factor variable. Notice how, in addition to the values of x1
being printed, as they are when we print a character variable, we also see the levels, printed out in order.
Wrap factor(x1, levels = month_levels)
within a call to sort()
.
sort(...(x1, levels = ...))
sort(factor(x1, levels = month_levels))
Instead of being sorted in alphabetical order, as before, the values are sorted in the order of the levels, which is almost always what we want when sorting months.
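As a small sketch of the difference, the snippet below sorts the same four months first as a character vector and then as a factor. It uses base R's built-in month.abb constant in place of month_levels; that substitution is an assumption made to keep the example self-contained (both vectors hold the same twelve abbreviations).

```r
# Same values as x1 in the tutorial.
x1 <- c("Dec", "Apr", "Jan", "Mar")

# Character vectors sort alphabetically.
alphabetical <- sort(x1)

# Factors sort by the position of each value in the levels.
# month.abb is base R's "Jan", "Feb", ..., "Dec" (a stand-in for month_levels).
calendar <- sort(factor(x1, levels = month.abb))

alphabetical            # "Apr" "Dec" "Jan" "Mar"
as.character(calendar)  # "Jan" "Mar" "Apr" "Dec"
```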
Run factor()
with two arguments: x2
and levels = month_levels
.
...(x2, ... = month_levels)
factor(x2, levels = month_levels)
Since "Jam" is not one of the levels, factor()
coerces it to be missing, as shown with the <NA>
symbol. One big advantage of working with factors is that you are prevented from using values which are not one of the levels.
Instead of factor()
, we recommend using the fct()
function from the forcats package, precisely because it generates an explicit error rather than a silent conversion to NA
.
Run fct(x2, levels = month_levels)
to see an example of this error.
...(x2, ... = month_levels)
Notice the thorough error message. "Jam" is missing from the levels
as defined in the month_levels
variable.
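The contrast between the two functions can be sketched in a script that keeps running after the error. The tryCatch() wrapper and the names silent, n_missing and msg are illustrative assumptions, not part of the tutorial; month.abb again stands in for month_levels.

```r
library(forcats)

x2 <- c("Dec", "Apr", "Jam", "Mar")

# Base R: the unknown value "Jam" silently becomes NA.
silent <- factor(x2, levels = month.abb)
n_missing <- sum(is.na(silent))

# forcats: the same input raises an error. tryCatch() captures the
# message as text so the rest of the script can continue.
msg <- tryCatch(
  {
    fct(x2, levels = month.abb)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

n_missing  # 1
msg        # the error text produced by fct()
```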
Run factor() on x1.
factor(...)
factor(x1)
Because we did not provide a levels
argument, the values for the levels will be taken from the values of the x1
vector, sorted in alphabetical order.
Run fct() on x1.
fct(...)
fct(x1)
Sorting alphabetically is slightly risky because not every computer will sort strings in the same way. So forcats::fct()
orders by first appearance in the original vector.
Take the code from the previous exercise and use it as an argument to the function levels().
levels(fct(...))
levels(fct(x1))
If you ever need to access the set of valid levels directly, you can do so with levels()
.
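To make the difference in default level order concrete, here is a minimal side-by-side comparison. The variable names base_levels and fct_levels are hypothetical.

```r
library(forcats)

x1 <- c("Dec", "Apr", "Jan", "Mar")

# factor() without levels: sorted unique values.
base_levels <- levels(factor(x1))

# fct() without levels: order of first appearance.
fct_levels <- levels(fct(x1))

base_levels  # "Apr" "Dec" "Jan" "Mar"
fct_levels   # "Dec" "Apr" "Jan" "Mar"
```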
The next exercises will be focusing on the variable csv
. Hit "Run Code" to look at csv
.
csv
Note the \n characters. Each one marks the start of a new line, but they make the raw string hard to read.
Run read_csv() with csv as the argument.
read_csv(...)
read_csv(csv)
The data in csv is now much easier to read and understand. We can see which values belong to the month column and which to the value column, as well as the type of each column.
Currently, the month column is a character variable and the value column is a double variable. Add the argument col_types to read_csv() and set it equal to "cc".
read_csv(csv, ...)
read_csv(csv, col_types = "cc")
col_types = "cc"
changes the variable types of the columns to both be characters. The first c
in cc
corresponds to the first column, and so on.
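A short sketch of the compact col_types string in action. The string csv_text mirrors the tutorial's csv without the leading blank line, and wrapping it in I() (readr's way of marking literal data) is an assumption added so the example does not depend on how readr guesses between file paths and inline text.

```r
library(readr)

# Same rows as the tutorial's csv, written with explicit \n separators.
csv_text <- "month,value\nFeb,12\nMar,56\nFeb,14\nJan,12"

# Default: readr guesses the column types.
guessed <- read_csv(I(csv_text), show_col_types = FALSE)

# "cc": one letter per column, both read as character.
as_text <- read_csv(I(csv_text), col_types = "cc")

sapply(guessed, class)  # month is character, value is numeric
sapply(as_text, class)  # both columns are character
```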
Change the value of col_types
from "cc"
to cols(month = "c")
inside of read_csv()
read_csv(csv, col_types = cols(...))
read_csv(csv, col_types = cols(month = "c"))
Change the value of col_types
to cols(month = "f")
read_csv(csv, col_types = cols(...))
read_csv(csv, col_types = cols(month = "f"))
The month variable is now a variable of factor type. Having month as a factor will allow us to perform certain actions on it later on.
Continue the current pipe to count()
. Set the argument to month
.
... |> count(...)
read_csv(csv, col_types = cols(month = "f")) |> count(month)
While the tibble has all the correct information, it is not easy to read. No one thinks of the months in that order. Luckily, this can be changed.
In read_csv
, change the value of col_types
to cols(month = col_factor(month_levels))
read_csv(csv, col_types = ...) |> count(month)
read_csv(csv, col_types = cols(month = col_factor(month_levels))) |> count(month)
Besides the use of factor()
and fct()
as described earlier, col_factor()
, when used within read_csv()
and similar import functions, is the most common way of creating factor variables.
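The effect of col_factor() with explicit levels can be checked directly on the imported column. This is a sketch under the same assumptions as before: csv_text is a hypothetical copy of the tutorial's data, I() marks it as literal data, and month.abb stands in for month_levels.

```r
library(readr)

csv_text <- "month,value\nFeb,12\nMar,56\nFeb,14\nJan,12"

months_tbl <- read_csv(
  I(csv_text),
  col_types = cols(
    month = col_factor(levels = month.abb),  # all 12 months are valid levels
    value = col_double()
  )
)

# The factor carries every level, including months absent from the data.
levels(months_tbl$month)

# Counts follow calendar order rather than order of appearance.
table(months_tbl$month)[c("Jan", "Feb", "Mar")]
```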
The gss_cat
tibble is a data set in the forcats package. It’s a sample of data from the General Social Survey, a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
Type gss_cat
and hit "Run Code."
gss_cat
gss_cat
There are 9 variables and more than 20,000 observations. Note how the print()
method for tibbles, which is called whenever you just enter the name of a tibble, like gss_cat
, gives the variable types across the top.
Look up the help page for gss_cat
by typing ?gss_cat
at the Console. Copy/paste the Description.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 2)
When referring to a tibble (or other variable) which is part of a package, you can just use the variable name if you have already loaded the package. (Recall that running library(tidyverse)
loads all the Tidyverse libraries, including forcats.) You can also refer to the variable directly using the double colon notation -- ::
-- i.e., forcats::gss_cat
.
When factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with count()
. Pipe gss_cat
to count(race)
.
gss_cat |> ...(race)
gss_cat |> count(race)
The <fct> label below race shows that it is a factor variable, not a character variable.
When working with factors, one common operation is changing the order of the levels. Let's create this plot:
plot1 <- gss_cat |>
  summarize(n = n(),
            tvhours = mean(tvhours, na.rm = TRUE),
            .by = relig) |>
  mutate(relig = fct_reorder(relig, tvhours)) |>
  ggplot(aes(x = tvhours, y = relig)) +
  geom_point() +
  labs(title = "TV Watching and Religious Affiliation",
       subtitle = "Don't Knows watch a lot of TV",
       x = "TV Hours Watched Per Day",
       y = "Religious Affiliation")

plot1
Run glimpse()
on gss_cat
.
glimpse(...)
glimpse(gss_cat)
We will be working with two variables: relig
and tvhours
. relig
is a factor variable reporting religious affiliation, if any. tvhours
is hours per day spent watching TV, on average.
Pipe gss_cat
to summarize(n = n())
gss_cat |> summarize(n = n())
gss_cat |> summarize(n = n())
Note how the letter "n" is used in two ways. First, it is the name of a new variable n
, created via summarize()
. In statistics, it is common for the letter "n" to mean the number of observations. Second, n()
is a function, hence the ()
, which calculates the number of observations. Since there is no .by
argument, the result is a tibble with a single row.
Use the same pipe again, but add .by = relig
as an argument/value pairing to summarize()
.
gss_cat |> summarize(n = n(), .by = relig)
gss_cat |> summarize(n = n(), .by = relig)
The result is a tibble with one row for each level of relig
. (Older R code will often use the group_by()
function when calculating statistics for each level of a factor. You should avoid this approach. Use the .by
argument to summarize()
and similar functions.)
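A minimal sketch of per-group summaries with .by, using a made-up tibble (survey, with invented values, not taken from gss_cat) that is small enough to check by hand. It assumes a dplyr version that supports the .by argument.

```r
library(dplyr)

# Hypothetical toy data for illustration only.
survey <- tibble(
  relig   = c("A", "B", "A", "C", "B", "A"),
  tvhours = c(2, 4, 3, 1, 6, 4)
)

# One row per value of relig; the grouping applies to this call only,
# so the result is an ordinary ungrouped tibble.
by_relig <- survey |>
  summarize(n = n(), tvhours = mean(tvhours), .by = relig)

by_relig
```

Group A has three rows averaging 3 hours, B has two rows averaging 5, and C has a single row with 1.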
Use the same code again, adding another variable creation step to summarize()
: tvhours = mean(tvhours)
.
gss_cat |> summarize(n = n(), tvhours = mean(tvhours), .by = relig)
gss_cat |> summarize(n = n(), tvhours = mean(tvhours), .by = relig)
Note that each argument (or variable creation step) in summarize() must be separated by a comma. Alas, mean() returns NA for any level of relig in which at least one person has a missing value for tvhours.
Modify the pipe by adding na.rm = TRUE as an argument within the mean() function.
gss_cat |> summarize(n = n(), tvhours = mean(tvhours, na.rm = TRUE), .by = relig)
gss_cat |> summarize(n = n(), tvhours = mean(tvhours, na.rm = TRUE), .by = relig)
By default, most statistical functions in R, such as mean(), return NA if even a single input value is NA, because the result would depend on a value that is unknown. Most of these functions have an na.rm argument (short for "NA remove") which allows us to remove any NA values prior to the calculation.
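The behavior is easy to see on a three-element vector; hours is a hypothetical example vector.

```r
hours <- c(2, NA, 4)

with_na    <- mean(hours)                # NA: one input is unknown
without_na <- mean(hours, na.rm = TRUE)  # 3: mean of 2 and 4

with_na
without_na
```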
Continue the pipe with a call to ggplot()
, setting the mapping
argument to aes(x = tvhours, y = relig)
.
gss_cat |> summarize(n = n(), tvhours = mean(tvhours, na.rm = TRUE), .by = relig) |> ggplot(aes(x = tvhours, y = relig))
gss_cat |> summarize(n = n(), tvhours = mean(tvhours, na.rm = TRUE), .by = relig) |> ggplot(aes(x = tvhours, y = relig))
Without a geom function, no data is plotted. But we still get the plotting area and the axis labels. Does the ordering of the religious affiliations on the y-axis seem reasonable?
Add geom_point()
to the pipe. Don't forget that calls to ggplot components are separated by plus signs, not pipes -- by +
not |>
.
gss_cat |> summarize(n = n(), tvhours = mean(tvhours, na.rm = TRUE), .by = relig) |> ggplot(aes(x = tvhours, y = relig)) + geom_point()
gss_cat |> summarize(n = n(), tvhours = mean(tvhours, na.rm = TRUE), .by = relig) |> ggplot(aes(x = tvhours, y = relig)) + geom_point()
It is hard to read this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using fct_reorder(). fct_reorder() takes three arguments:
.f, the factor whose levels you want to modify.
.x, a numeric vector that you want to use to reorder the levels.
.fun, an optional function that’s used if there are multiple values of .x for each value of .f. The default value is median.
Replace y = relig with y = fct_reorder(relig, tvhours) in your pipe.
gss_cat |> summarize(n = n(), tvhours = mean(tvhours, na.rm = TRUE), .by = relig) |> ggplot(aes(x = tvhours, y = fct_reorder(relig, tvhours))) + geom_point()
gss_cat |> summarize(n = n(), tvhours = mean(tvhours, na.rm = TRUE), .by = relig) |> ggplot(aes(x = tvhours, y = fct_reorder(relig, tvhours))) + geom_point()
Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism & Other Eastern religions watch much less.
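Outside of a plot, the effect of fct_reorder() shows up in the level order. The sketch below uses hypothetical vectors group and value; each level has two values, so the summary function matters.

```r
library(forcats)

group <- factor(c("a", "b", "c", "a", "b", "c"))
value <- c(9, 1, 4, 11, 3, 6)

# Default: order levels by the median of value within each level.
# Medians: a = 10, b = 2, c = 5.
by_median <- fct_reorder(group, value)

# Use max instead of median, and put the largest first.
# Maxima: a = 11, b = 3, c = 6.
by_max_desc <- fct_reorder(group, value, .fun = max, .desc = TRUE)

levels(group)        # "a" "b" "c"
levels(by_median)    # "b" "c" "a"
levels(by_max_desc)  # "a" "c" "b"
```

Only the level order changes; the values stored in the factor stay the same.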
As you start making more complicated transformations, we recommend moving them out of aes()
and into a separate mutate()
step. After the summarize()
step, insert this line: mutate(relig = fct_reorder(relig, tvhours)) |>
. Then, change y = fct_reorder(relig, tvhours)
back to y = relig
.
gss_cat |> summarize(n = n(), tvhours = mean(tvhours, na.rm = TRUE), .by = relig) |> mutate(... = fct_reorder(relig, ...)) |> ggplot(aes(x = tvhours, ... = relig)) + geom_point()
gss_cat |> summarize(n = n(), tvhours = mean(tvhours, na.rm = TRUE), .by = relig) |> mutate(relig = fct_reorder(relig, tvhours)) |> ggplot(aes(x = tvhours, y = relig)) + geom_point()
It is almost always better to complete your data transformations before starting your plot.
Finish the plot by adding a title, subtitle, and axis labels. Remember that the plot looks like this:
plot1
... + labs(... = "TV Watching and Religious Affiliation", subtitle = ..., x = ..., ... = "Religious Affiliation")
gss_cat |> summarize(n = n(), tvhours = mean(tvhours, na.rm = TRUE), .by = relig) |> mutate(relig = fct_reorder(relig, tvhours)) |> ggplot(aes(x = tvhours, y = relig)) + geom_point() + labs(title = "TV Watching and Religious Affiliation", subtitle = "???", x = "TV Hours Watched Per Day", y = "Religious Affiliation")
The subtitle of a plot should be the one sentence conclusion/summary/observation with which you most want viewers to come away.
Let's create this graph now.
plot2 <- gss_cat |>
  filter(!is.na(age)) |>
  count(age, marital) |>
  mutate(prop = n / sum(n), .by = age) |>
  ggplot(aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  labs(color = "Marital", x = "Age", y = "Proportion")

plot2
Pipe gss_cat to filter(), with the argument set to !is.na(age).
gss_cat |> filter(...)
gss_cat |> filter(!is.na(age))
This removes every row in which age is NA, so that later calculations by age are not affected by missing values.
Continue the pipe to count()
, with age
and marital
set as the arguments
... |> count(age, ...)
gss_cat |> filter(!is.na(age)) |> count(age, marital)
Continue the pipe to mutate(). Create the variable prop and set it to n/sum(n).
... |> mutate(... = n/sum(n))
Add the argument .by
within the mutate()
function. Set .by
to age
... |> mutate(prop = n/sum(n), .by = ...)
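The .by argument groups the rows by age for this mutate() step only, so sum(n) is calculated within each age. As a result, prop is the share of each marital status among respondents of the same age.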
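A minimal sketch of what .by does inside mutate(), using a made-up tibble of counts (the values are invented, not taken from gss_cat). It assumes a dplyr version that supports .by.

```r
library(dplyr)

# Hypothetical counts for two ages and two marital statuses.
counts <- tibble(
  age     = c(20, 20, 30, 30),
  marital = c("Married", "Never married", "Married", "Never married"),
  n       = c(1, 3, 2, 2)
)

# sum(n) is computed separately for each age, so the proportions
# within each age add up to 1. Row order is unchanged.
with_prop <- counts |>
  mutate(prop = n / sum(n), .by = age)

with_prop$prop  # 0.25 0.75 0.50 0.50
```

Without .by, sum(n) would be the total over all rows (8 here), and the proportions would be shares of the whole data set instead.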
Add ggplot()
to the current pipe. Set x
to age
, y
to prop
and color
to marital
... |> ggplot(aes(x = ..., y = ..., color = ...))
Add geom_line() to the plot with +. Inside the function, set linewidth to 1.
... + geom_line(linewidth = ...)
Add scale_color_brewer() to the plot, with the argument palette set to "Set1".
... + scale_color_brewer(... = "Set1")
Finish the plot with labs(), giving appropriate axis and legend titles.
... + labs(x = ..., y = ..., color = ...)
This graph is confusing to read because the order of the entries in the legend does not match the order of the lines on the plot. We can use fct_reorder2() to solve this problem.
In the ggplot()
function, change color
from marital
to fct_reorder2(marital, age, prop)
... |> ggplot(aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) + ...
Rearranging the legend makes the plot easier to read because the legend colors now match the order of the lines on the far right of the plot. fct_reorder2(f, x, y)
reorders the factor f
by the y
values associated with the largest x
values.
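The rule can be checked on a tiny example. The vectors line, x and y below are hypothetical: two lines, each observed at x = 1 and x = 2.

```r
library(forcats)

line <- factor(c("a", "a", "b", "b"))
x    <- c(1, 2, 1, 2)
y    <- c(5, 1, 2, 9)

# At the largest x (x = 2), line "b" has y = 9 and line "a" has y = 1.
# fct_reorder2() puts the level with the larger y first.
reordered <- fct_reorder2(line, x, y)

levels(line)       # "a" "b"
levels(reordered)  # "b" "a"
```

In a line plot, this is the top-to-bottom order in which the lines end at the right edge, which is why the legend then lines up with them.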
More powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is fct_recode()
. It allows you to recode, or change, the value of each level.
Pipe gss_cat
to count(partyid)
... |> count(...)
gss_cat |> count(partyid)
The levels of partyid
are terse and inconsistent. Let’s tweak them to be longer and use a parallel construction.
Like most rename and recoding functions in the Tidyverse, the new values go on the left and the old values go on the right. Pipe gss_cat
to mutate()
. Within mutate()
, use partyid = fct_recode(partyid, "Republican, weak" = "Not str republican")
to change partyid
.
gss_cat |> mutate( partyid = ...(partyid, "Republican, weak" = ... ) )
Note how the second and seventh values for partyid
have been changed from "Not str republican" to "Republican, weak". fct_recode()
is the easiest way to change the value for a given factor level. Sometimes, as here, we change the value "in place," that is, we replace partyid
with partyid
. Other times, we use mutate()
to create a new variable.
Let's change all the values for partyid
. Here is the mapping from new values to old values:
"Republican, strong"    = "Strong republican",
"Republican, weak"      = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak"        = "Not str democrat",
"Democrat, strong"      = "Strong democrat"

Use these within the call to fct_recode().

gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = ...,
      ...                     = "Not str democrat",
      "Democrat, strong"      = "Strong democrat"
    )
  )
fct_recode()
will leave the levels that aren’t explicitly mentioned as they are, and will warn you if you accidentally refer to a level that doesn’t exist.
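Both behaviors can be seen on a small hypothetical factor, party. The misspelled level name and the tryCatch() wrapper are illustrative additions that capture the warning as text.

```r
library(forcats)

party <- factor(c("Strong republican", "Not str republican", "Independent"))

# New value on the left, old value on the right.
# "Independent" is not mentioned, so it is left unchanged.
recoded <- fct_recode(party,
  "Republican, strong" = "Strong republican",
  "Republican, weak"   = "Not str republican"
)
levels(recoded)

# Referring to a level that does not exist (note the typo) gives a warning.
warning_text <- tryCatch(
  {
    fct_recode(party, "Democrat, weak" = "Not str democrta")
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)
warning_text
```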
To combine groups, you can assign multiple old levels to the same new level. With the same pipe as above, use this mapping:
"Republican, strong"    = "Strong republican",
"Republican, weak"      = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak"        = "Not str democrat",
"Democrat, strong"      = "Strong democrat",
"Other"                 = "No answer",
"Other"                 = "Don't know",
"Other"                 = "Other party"

gss_cat |> mutate( partyid = fct_recode(partyid, ... ) )
Use this technique with care: if you group together categories that are truly different, you will end up with misleading results.
Continue the pipe to count(partyid)
to confirm that the recoding has worked.
... |> count(partyid)
gss_cat |> mutate( partyid = fct_recode(partyid, "Republican, strong" = "Strong republican", "Republican, weak" = "Not str republican", "Independent, near rep" = "Ind,near rep", "Independent, near dem" = "Ind,near dem", "Democrat, weak" = "Not str democrat", "Democrat, strong" = "Strong democrat", "Other" = "No answer", "Other" = "Don't know", "Other" = "Other party")) |> count(partyid)
Read the help page for fct_recode()
for more details.
If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new level, you can provide a vector of old levels. Replace the call to fct_recode() in the previous pipe with this:
fct_collapse(partyid,
  "other" = c("No answer", "Don't know", "Other party"),
  "rep"   = c("Strong republican", "Not str republican"),
  "ind"   = c("Ind,near rep", "Independent", "Ind,near dem"),
  "dem"   = c("Not str democrat", "Strong democrat")
)

gss_cat |>
  ...(
    partyid = fct_collapse(...,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep"   = c("Strong republican", "Not str republican"),
      "ind"   = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem"   = c("Not str democrat", "Strong democrat")
    )
  ) |>
  ...(partyid)
Read the help page for fct_collapse()
for more details. The other_level
argument is sometimes useful.
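A compact sketch of fct_collapse() on a hypothetical six-element factor, small enough to verify the counts by eye.

```r
library(forcats)

party <- factor(c(
  "Strong republican", "Not str republican", "Independent",
  "Ind,near dem", "Strong democrat", "Strong republican"
))

# Each new level takes a vector of the old levels it replaces.
collapsed <- fct_collapse(party,
  rep = c("Strong republican", "Not str republican"),
  ind = c("Independent", "Ind,near dem"),
  dem = "Strong democrat"
)

table(collapsed)  # rep: 3, ind: 2, dem: 1
```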
Sometimes you just want to lump together the small groups to make a plot or table simpler. That’s the job of the fct_lump_*()
family of functions.
Pipe gss_cat
to mutate()
with relig = fct_lump_lowfreq(relig)
as its argument.
gss_cat |> mutate(relig = ...)
fct_lump_lowfreq() is a simple starting point that progressively lumps the smallest categories into “Other”, always keeping “Other” as the smallest category.
Continue the pipe to the function count()
with relig
as its argument.
... |> count(...)
gss_cat |> mutate(relig = fct_lump_lowfreq(relig)) |> count(relig)
In this case it’s not very helpful: it is true that the majority of Americans in this survey are Protestant, but we’d probably like to see some more details!
Instead, we can use fct_lump_n()
to specify that we want exactly 10 groups. Pipe gss_cat
to mutate()
with relig = fct_lump_n(relig, n = 10)
as its argument.
gss_cat |> mutate(relig = ...)
fct_lump_n()
is particularly useful when you have a factor with many levels, but you're only interested in analyzing the most common ones. Without it, analyses can become cluttered and difficult to interpret.
Continue the pipe to count()
. Add relig
as an argument, as well as sort = TRUE
to count()
.
... |> count(relig, ...)
gss_cat |> mutate(relig = fct_lump_n(relig, n = 10)) |> count(relig, sort = TRUE)
Read the documentation to learn about fct_lump_min()
and fct_lump_prop()
which are useful in other cases.
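To see lumping on something smaller than gss_cat, the sketch below applies fct_lump_n() to a hypothetical factor with four levels of decreasing frequency.

```r
library(forcats)

# Hypothetical data: "a" appears 5 times, "b" 3 times, "c" and "d" once each.
answers <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep the 2 most frequent levels; everything else becomes "Other".
lumped <- fct_lump_n(answers, n = 2)

table(lumped)  # a: 5, b: 3, Other: 2
```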
Ordered factors, created with ordered(), imply a strict ordering between levels: the first level is “less than” the second level, which is “less than” the third level, and so on. They say nothing about the size of the differences between levels.
Run this code.
ordered(c("a", "b", "c"))
ordered(c("a", "b", "c"))
You can recognize ordered factors when printing because they use <
between the factor levels. We don't recommend using ordered factors unless you have a compelling reason for doing so.
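One practical consequence of the ordering is that comparison operators work on ordered factors. A brief base R sketch, with a hypothetical vector sizes and explicitly supplied levels:

```r
sizes <- ordered(
  c("low", "high", "medium"),
  levels = c("low", "medium", "high")
)

# Comparisons use the order of the levels, not the alphabet.
below_high <- sizes < "high"
largest    <- as.character(max(sizes))

below_high  # TRUE FALSE TRUE
largest     # "high"
```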
This tutorial covered Chapter 16: Factors from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. forcats is the core Tidyverse package for working with categorical variables, called "factors" in R. Key commands include fct()
for creating factors, fct_reorder()
for changing the order of the levels, and fct_recode()
for recoding factors.
If you want to learn more about factors, read "Wrangling categorical data in R" by Amelia McNamara and Nicholas Horton.