library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(nycflights13)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, tutorial.storage = "local")

treatment <- tribble(
  ~person,            ~treatment, ~response,
  "Derrick Whitmore", 1,          7,
  NA,                 2,          10,
  NA,                 3,          NA,
  "Katherine Burke",  1,          4
)

stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

health <- tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56),
)
This tutorial covers Chapter 18: Missing values from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. The primary focus of this tutorial is to teach you how to use fill() to fill in missing values, coalesce() to replace missing values with another value, and na_if() to replace certain values with a missing value, NA. Additionally, we will look at complete(), which lets you generate explicit missing values from a set of variables, and at how to use anti_join() to find values that are missing when you compare data sets.
Load the tidyverse package with the library()
command.
library(...)
library(tidyverse)
The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.
Run this code in order to create a tibble called treatment
.
treatment <- tribble(
  ~person,            ~treatment, ~response,
  "Derrick Whitmore", 1,          7,
  NA,                 2,          10,
  NA,                 3,          NA,
  "Katherine Burke",  1,          4
)
A common use for missing values is as a data entry convenience. When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward).
Write treatment
and hit "Run Code".
treatment
treatment
Use glimpse()
or View()
for alternative ways to view the data.
Pipe treatment
to fill()
with the argument response
.
treatment |> ...(response)
treatment |> fill(response)
This treatment is sometimes called “last observation carried forward”, or locf for short.
Use the same code, but replace response
with everything()
. Recall that everything()
is a function which returns all the variables in a tibble.
treatment |> fill(...)
treatment |> fill(everything())
You can use the .direction
argument to fill in missing values that have been generated in more exotic ways.
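As a minimal sketch of the .direction argument, the code below fills the same treatment tibble upward instead of downward. It assumes the tidyverse is installed; the name filled_up is only illustrative.

```r
library(tidyverse)

treatment <- tribble(
  ~person,            ~treatment, ~response,
  "Derrick Whitmore", 1,          7,
  NA,                 2,          10,
  NA,                 3,          NA,
  "Katherine Burke",  1,          4
)

# .direction = "up" fills each NA with the next non-missing value below it
# (the default, "down", uses the previous non-missing value above it)
filled_up <- treatment |>
  fill(everything(), .direction = "up")

filled_up
```

Filling upward is probably not what you want for this particular dataset, since the blank names refer to the person above, but it shows how the direction changes which neighbor supplies the value.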
Run this code to assign a vector with a missing value to the variable x
.
x <- c(1, 4, 5, 7, NA)
Many times we will see missing values that actually represent some fixed known value, most commonly 0.
Copy the previous code and use coalesce() from dplyr with x and 0 as arguments to replace the missing values with 0.
x <- c(1, 4, 5, 7, NA)
coalesce(x, ...)
x <- c(1, 4, 5, 7, NA)
coalesce(x, 0)
As we can see, the NA
value turned into a 0
.
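In practice coalesce() is often used inside mutate() on a column rather than on a free-standing vector. The following is a small sketch under the assumption that a missing value in the hypothetical visits column really does mean zero; the clinic tibble and its column names are made up for illustration.

```r
library(tidyverse)

# Hypothetical data: NA in `visits` is assumed to mean "no visits recorded"
clinic <- tibble(
  patient = c("a", "b", "c"),
  visits  = c(2, NA, 5)
)

# Replace each NA in `visits` with 0, leaving other values untouched
clinic_filled <- clinic |>
  mutate(visits = coalesce(visits, 0))

clinic_filled
```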
Run this code to assign a vector to the variable x
.
x <- c(1, 4, 5, 7, -99)
At times we will see the opposite issue where some fixed known value actually represents a missing value.
Copy the previous code and use na_if()
from the dplyr
package and use x
and -99
to replace the -99
with a missing value.
x <- c(1, 4, 5, 7, -99)
na_if(x, ...)
x <- c(1, 4, 5, 7, -99)
na_if(x, -99)
This usually happens when data is generated by some older software that is forced to use a value like 99
or -999
as a missing value.
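To see why this matters, here is a minimal sketch with a hypothetical readings tibble in which -99 is assumed to be the sentinel for a missing temperature. Converting the sentinel to NA first keeps it from distorting a summary such as the mean.

```r
library(tidyverse)

# Hypothetical readings where -99 is assumed to stand for "missing"
readings <- tibble(
  site = c("north", "south", "east"),
  temp = c(21.5, -99, 19.0)
)

# Turn the sentinel into a real missing value
readings_clean <- readings |>
  mutate(temp = na_if(temp, -99))

# The sentinel no longer drags the average down
mean_temp <- mean(readings_clean$temp, na.rm = TRUE)
mean_temp
```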
Before we continue, there’s one special type of missing value that you’ll encounter from time to time: a NaN
(pronounced “nan”), or not a number.
Multiply the pre-written vector by 10.
x <- c(NA, NaN)
x <- c(NA, NaN)
x * ...
x <- c(NA, NaN)
x * 10
As you can see, any mathematical operation on a missing value is still a missing value.
Compare the vector x to the number 1.
x <- c(NA, NaN)
x <- c(NA, NaN)
x == ...
x <- c(NA, NaN)
x == 1
Comparing NaN
with a number will give you NA
because NaN
is not a number, making it an invalid comparison.
Now, copy the code from the previous exercise and run the command is.na()
with the argument x
.
x <- c(NA, NaN)
x == 1
is.na(...)
x <- c(NA, NaN)
x == 1
is.na(x)
In the rare case you need to distinguish an NA
from a NaN
, you can use is.nan(x)
.
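A short side-by-side sketch of the two predicates, using the same vector as the exercise, may make the difference concrete:

```r
x <- c(NA, NaN)

# is.na() treats both NA and NaN as missing
na_flags <- is.na(x)

# is.nan() is TRUE only for NaN
nan_flags <- is.nan(x)

na_flags
nan_flags
```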
Divide 0
by 0
.
.../...
0/0
This mathematical operation yields an indeterminate result which produces NaN
.
Subtract Inf
from Inf
.
...-...
Inf-Inf
This is also an indeterminate mathematical operation.
Multiply 0
and Inf
.
...*...
0*Inf
This also produces NaN
because it is indeterminate.
Use sqrt()
to take the square root of -1
.
sqrt(...)
sqrt(-1)
This again produces NaN, along with a warning, because the square root of a negative number is not defined for real numbers.
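The four operations from the last few exercises can be collected into one vector and checked in a single step. This is only a recap sketch; the element names are illustrative, and suppressWarnings() is used because sqrt(-1) warns that NaNs were produced.

```r
# Each of these operations has no well-defined real result
nan_cases <- c(
  zero_over_zero = 0 / 0,
  inf_minus_inf  = Inf - Inf,
  zero_times_inf = 0 * Inf,
  sqrt_negative  = suppressWarnings(sqrt(-1))
)

# Every element is NaN
is.nan(nan_cases)
```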
Run this code to create a tibble called stocks.
stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
Missing values can also be implicitly missing, if an entire row of data is simply absent from the data.
Write stocks
and hit "Run Code".
stocks
stocks
The price in the fourth quarter of 2020 is explicitly missing, because its value is NA. The price for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.
Pipe stocks to pivot_wider(). Don't worry: this is expected to produce an error.
stocks |> ...()
Note that this caused an error because pivot_wider()
is looking for a names_from
argument.
Copy the previous code, add the names_from argument, and set it equal to qtr. Don't worry: this is also expected to produce an error.
stocks |> pivot_wider(names_from = ...)
Note that this caused an error because the command is looking for a values_from
argument. Make sure to examine the help page for more information by typing ?pivot_wider
into the console.
Copy the previous code and add a values_from
argument setting it equal to price
.
stocks |> pivot_wider(names_from = qtr, values_from = ...)
stocks |> pivot_wider(names_from = qtr, values_from = price)
Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value. For example, if we pivot stocks to put the quarter in the columns, both missing values become explicit.
Pipe stocks
to the command complete()
with the argument year
.
stocks |> ...(year)
stocks |> complete(year)
tidyr::complete()
allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist.
Copy the previous code, but add the argument qtr
to the complete()
command.
stocks |> complete(year, ...)
stocks |> complete(year, qtr)
Typically, you’ll call complete()
with names of existing variables, filling in the missing combinations.
Copy the previous code and set the year
argument to 2019
.
stocks |> complete(year = ..., qtr)
stocks |> complete(year = 2019, qtr)
Sometimes the individual variables are themselves incomplete, so you can instead provide your own data.
Copying the code from the previous exercise, set the year
argument to the year range 2019:2021
.
stocks |> complete(year = ..., qtr)
stocks |> complete(year = 2019:2021, qtr)
For example, you might know that the stocks dataset is supposed to run from 2019
to 2021
, so you could explicitly supply those values for year.
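When you know only that the values should form a regular sequence, tidyr's full_seq() can generate the range for you instead of typing it out. The sketch below uses a hypothetical sales tibble in which 2020 never appears; the tibble and its column names are assumptions for illustration.

```r
library(tidyverse)

# Hypothetical yearly data in which 2020 is implicitly missing
sales <- tibble(
  year   = c(2019, 2021),
  amount = c(10, 12)
)

# full_seq(year, 1) generates every value from min(year) to max(year) in steps of 1
sales_complete <- sales |>
  complete(year = full_seq(year, 1))

sales_complete
```

Note that full_seq() can only fill gaps between the smallest and largest observed values; if the data should extend beyond that range, supply the values yourself as in the exercise.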
Load in the nycflights13
package with library()
.
library(...)
Recall that nycflights13
is a data package that holds all the flight information from the three biggest New York City airports in 2013.
Write flights
to print it out and view the data.
flights
Recall that distinct()
displays all unique rows in the dataset.
Pipe flights
to the command distinct()
with the argument dest
.
flights |> ...(dest)
flights |> distinct(dest)
Note this produces 105 unique flight destinations.
Copy the previous code but change the argument name to faa = dest
to change the column name.
flights |> distinct(faa = ...)
flights |> distinct(faa = dest)
Note that we rename the column because it is called dest in flights but faa in airports, and our next command, anti_join(), needs a common variable to join on.
Copy the previous code and pipe it with anti_join()
with the argument airports
.
flights |> distinct(faa = dest) |> anti_join(...)
flights |> distinct(faa = dest) |> anti_join(airports)
You can often only know that values are missing from one dataset when you compare it to another. anti_join() is a particularly useful tool here because it selects only the rows in flights that don’t have a match in airports.
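Renaming the column is one way to line up the keys. An alternative, sketched here on two hypothetical miniature tables rather than the real nycflights13 data, is to spell out the key pair with join_by(), which assumes dplyr 1.1.0 or later.

```r
library(tidyverse)

# Hypothetical stand-ins for flights and airports
trips <- tibble(dest = c("JFK", "LAX", "XXX", "LAX"))
known_airports <- tibble(
  faa  = c("JFK", "LAX"),
  name = c("John F Kennedy Intl", "Los Angeles Intl")
)

# Keep only the destinations that have no matching faa code
unmatched <- trips |>
  distinct(dest) |>
  anti_join(known_airports, join_by(dest == faa))

unmatched
```

Stating the keys explicitly also avoids the message that dplyr prints when it has to guess the join variables.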
Pipe flights
with the command distinct()
again except with the argument tailnum
.
flights |> distinct(...)
flights |> distinct(tailnum)
Note that this produces all 4,044 unique tail numbers, one for every unique plane.
Copy the previous code and pipe it to anti_join()
with the argument planes
.
flights |> distinct(tailnum) |> anti_join(...)
flights |> distinct(tailnum) |> anti_join(planes)
We can use two anti_join() calls to reveal that we’re missing information for four airports and 722 planes mentioned in flights.
Run this code to assign a tibble to health
.
health <- tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56),
)
Recall from Chapter 17 that factors are used for categorical variables, variables that have a fixed and known set of possible values.
Type health
and hit "Run Code" to view the data.
health
A final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with factors.
Let's count the number of smokers by piping health
to the command count()
with the argument smoker
.
health |> count(...)
health |> count(smoker)
This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is simply empty.
Copy the previous code and add the .drop = FALSE
argument to keep all the groups, even those not seen in the data.
health |> count(smoker, .drop = ...)
health |> count(smoker, .drop = FALSE)
Note how we can now see the yes
row in the smoker
column.
Create a plot with ggplot()
with health
as the argument.
ggplot(...)
Note the plot is empty and we have not established the axes.
Copy the previous code and add an aesthetic mapping using aes()
ggplot(health, ...())
Note that this doesn't change anything yet. If you open the help page by typing ?aes into the console, you will see that aes() takes an x and a y argument.
Copy the previous code and pass in x = smoker
as an argument to the aes()
command.
ggplot(health, aes(...))
As you can see, the x-axis is defined with the smoker
column from health
.
Add the command geom_bar()
to the previous code using a +
.
ggplot(health, aes(x = smoker)) + ...
We encounter a similar issue to before as ggplot2
will also drop levels that don't have any values.
Add the scale_x_discrete()
command to the plot with a +
.
ggplot(health, aes(x = smoker)) + geom_bar() + scale_x_discrete()
This will not do anything right now, but this command allows us to manipulate the x-axis.
Copy the previous code and pass the drop
argument into the scale_x_discrete()
command and assign it to FALSE
.
ggplot(health, aes(x = smoker)) + geom_bar() + scale_x_discrete(... = FALSE)
ggplot(health, aes(x = smoker)) + geom_bar() + scale_x_discrete(drop = FALSE)
You can force levels that don't have any values to display by supplying drop = FALSE
to the appropriate discrete axis.
Type health
and hit "Run Code" to re-familiarize yourself with the tibble.
health
The same issue of empty groups that came up in plots is also commonly seen when using functions like summarize().
Pipe health
to group_by()
with the argument smoker
.
health |> group_by(...)
Note that we strongly urge you to avoid using group_by()
and instead use the .by
argument in summarize()
, but it is necessary in this case.
Copy the previous code and add the argument .drop = FALSE
to the group_by()
command.
health |> group_by(smoker, ... = ...)
Although the effect is not visible in this exercise, .drop = FALSE preserves all factor levels, much as drop = FALSE did for the axis in the previous exercises.
Copy the previous code and continue the pipe with summarize()
with the argument n = n()
.
health |> group_by(smoker, .drop = FALSE) |> ...( n = n() )
Although seemingly redundant, recall that n = n()
is a common and useful summary that will return the number of rows in each group.
Copy the previous code and add mean_age
to the summary using the mean()
function and the argument age
.
health |> group_by(smoker, .drop = FALSE) |> summarize( n = n(), mean_age = ...(...) )
Pay attention to how the mean()
function produced a NaN
in the yes
row. This is because the mean()
command will perform sum(age)/length(age)
which here is 0/0
.
Add min_age
to the summarize()
function and set it equal to the command min()
with the argument age
.
health |> group_by(smoker, .drop = FALSE) |> summarize( n = n(), mean_age = mean(age), min_age = ...(...) )
Note that min() produces Inf in the yes row because it operates on an empty vector.
Add max_age
to the summarize()
function and set it equal to the command max()
with the argument age
.
health |> group_by(smoker, .drop = FALSE) |> summarize( n = n(), mean_age = mean(age), min_age = min(age), max_age = ...(...) )
Note that max() produces -Inf in the yes row because it operates on an empty vector.
Finally, add sd_age
to the summarize()
function and set it equal to the command sd()
with the argument age
.
health |> group_by(smoker, .drop = FALSE) |> summarize( n = n(), mean_age = mean(age), min_age = min(age), max_age = max(age), sd_age = ...(...) )
health |>
  group_by(smoker, .drop = FALSE) |>
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  )
It is important to note that the sd() function produces NA because it operates on a zero-length vector. This behavior is documented on the help page, so visit ?sd for more information.
Use the length()
function on this pre-made vector of missing values.
x1 <- c(NA, NA)
x1 <- c(NA, NA)
length(...)
x1 <- c(NA, NA)
length(x1)
Note that length() returns 2 even though every element is a missing value. The interesting results we got from summarize() came from operating on a zero-length vector, which is a different thing.
Use the length()
function on this pre-made empty vector.
x2 <- numeric()
x2 <- numeric()
length(x2)
Note that length() returns 0 because x2 is an empty vector. There’s an important distinction between empty vectors, which have length 0, and missing values, each of which has length 1.
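The summary results from the earlier exercises can be reproduced directly on an empty vector, without any grouping. This is a minimal sketch; suppressWarnings() is used because min() and max() warn when given no non-missing arguments.

```r
empty <- numeric()

empty_mean <- mean(empty)                   # NaN, effectively 0 / 0
empty_sd   <- sd(empty)                     # NA
empty_min  <- suppressWarnings(min(empty))  # Inf
empty_max  <- suppressWarnings(max(empty))  # -Inf

c(mean = empty_mean, sd = empty_sd, min = empty_min, max = empty_max)
```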
Copy the code from exercise 18 and delete the .drop = FALSE
argument from the group_by()
function.
health |>
  group_by(smoker) |>
  summarize(
    n = n(),
    mean_age = mean(age),
    min_age = min(age),
    max_age = max(age),
    sd_age = sd(age)
  )
Note that this completely removes the yes
row as it is an empty group.
Now, continue the pipe with complete()
and use smoker
as the argument.
health |> group_by(smoker) |> summarize( n = n(), mean_age = mean(age), min_age = min(age), max_age = max(age), sd_age = sd(age) ) |> complete(...)
health |> group_by(smoker) |> summarize( n = n(), mean_age = mean(age), min_age = min(age), max_age = max(age), sd_age = sd(age) ) |> complete(smoker)
Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with complete()
. However, the main drawback of this approach is that you get an NA for the count, even though you know that it should be zero.
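One possible way around the NA count, offered here as a sketch rather than as the tutorial's own solution, is the fill argument of complete(), which takes a named list of values to use in place of NA for the newly created rows. The summary is shortened to two columns to keep the example small.

```r
library(tidyverse)

health <- tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34, 88, 75, 47, 56)
)

# Summarize first, then add the empty "yes" level back with a count of zero
summary_filled <- health |>
  group_by(smoker) |>
  summarize(n = n(), mean_age = mean(age)) |>
  complete(smoker, fill = list(n = 0L))

summary_filled
```

Only n is filled in; mean_age stays NA for the empty group, which is arguably the honest answer since there are no ages to average.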
This tutorial covered Chapter 18: Missing values from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. The primary focus of this tutorial was to teach you how to use fill() to fill in missing values, coalesce() to replace missing values with another value, and na_if() to replace certain values with a missing value, NA. Additionally, we looked at complete(), which lets you generate explicit missing values from a set of variables, and at how to use anti_join() to find values that are missing when you compare data sets.