library(learnr) knitr::opts_chunk$set(echo = TRUE, eval = TRUE)
For this lab, we will follow R for Data Science. A first fundamental step for any statistical analysis is to explore the data and to derive some indicators to better understand the problem. To see why this is important, we will work on a data frame containing all the flights that departed from New York City in 2013. Moreover, we will rely on the tidyverse suits of packages. We will, in particular, focus on the dplyr package.
library(tidyverse) library(nycflights13)
Let us have a look at the data.
flights
We have 19 columns. Each column contain some information, ranging from the year to the time of departure. Notice that for each column, we have a four letter abbreviation. Such abbreviation indicate the type of the variable. The possible types are:
int
for integersdbl
for doubles (real numbers)chr
for character vectors (strings)dttm
for date-times (a date and a time)lgl
for logical (TRUE
or FALSE
values)fctr
for factors, a type that R uses to represent categorical variablesdata
for dates.There are five key operations that are possible by means of the dplyr package:
All the above functions can be used in conjunction with the group_by
function, that changes the scope of the operations from the whole data frame to group-by-group batches.
The syntax is the same for every function:
This homogeneity allows for chaining different operations. Notice that none of the operations operates on the data frame passed as first argument.
Filtering allows to select a subset of the rows (i.e the observations). Imagine, for instance, that we wish to select all the flights that departed on 16/03/2013.
filter(flights, month == 3, day == 16)
We obtain a new tibble!
Comparison can be done by using the usual operators: >
, >=
, <
, <=
, !=
(not equal) and ==
(equal).
Select all the flights that departed between the 12 and the 16 of May (included).
filter(flights, month == 5, day >= 12, day <= 16)
You can store the new data frame by using the <-
operator, as usual.
Computers rely to approximation when dealing with real numbers. Take a look at the following:
sqrt(2) ^ 2 == 2
Weird, right? The imprecisions due to the implementation of numbers mean we cannot directly compare results of operations that are theoretically identical. We resort to the following:
near(sqrt(2)^2, 2)
Multiple arguments of the filter
operations, are treated as clauses related by the "and" operation. We may, however, be interested in finding the flights that have departed on the 11th of March or May.
This is no problem. We have to use the "or" operator: |
("and" is &
).
filter(flights, day == 11, month == 3 | month == 5)
Notice that you might be tempted to write month == 3 | 5
. This does not work. R will compute the operations in order: 3 | 5
is equal to TRUE
, TRUE
is then equal to 1
(FALSE
is 0
) and we will end up with flights that departed in January!
Another options, perhaps easier, is by means of the %in%
operation.
filter(flights, day == 11, month %in% c(11, 12))
Remember that c(11, 12)
is a vector of elements 11 and 12.
Sometimes we are interested in selecting only a number of variables.
This is easily done using the select()
function!
select(flights, year, month, day)
We can select all the columns (inclusive) between two.
select(flights, year:day)
Or, in a similar fashion, remove them.
select(flights, -(year:day))
There is a number of helper functions that can be used with select
:
starts_with("xyz")
: names beginning by "xyz"ends_with("xyz")
: names ending by "xyz"contains("xyz")
: names containing "xyz"Another option is to use the rename()
function, that only changes the name of the variable.
Sometimes new variables have to be computed. To do this, we resort to the mutate()
function.
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time ) mutate(flights_sml, gain = dep_delay - arr_delay, speed = distance / air_time * 60 )
Notice that you can refer to variables you have just created.
mutate(flights_sml, gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours )
The key property of any functions used to create new variables is that it must be vectorised: it must take as input a vector and outputs a vector with the same number of elements.
+
, -
, *
, ... work fine. Remember that they act element-wise!x / sum(x)
gives the proportion of the total.The last operation is summarise()
(or summarize()
).
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
Notice that we have put na.rm = TRUE
. This means that we remove from the data frame all those observations for which we have no value. Why? Because otherwise we would not be able to compute the mean.
summarise(flights, delay = mean(dep_delay, na.rm = FALSE))
summarise()
is usually best employed on sub-groups of the data. For instance, we may be interested in understanding the daily delay, rather than that of 2013.
To this aim, we resort to the group_by
function.
by_day <- group_by(flights, year, month, day) summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
Let us say we are now interested in understanding whether there is a relationship between the distance of the destination and the average delay. Let us do this together!
First of all, we need to group flights by the destination.
by_dest <- group_by(flights, dest)
Then, we have to compute the summary values we need. For each destination and arrival delay, we want to compute the average distance and average arrival delay. The variables are the distance
and arr_delay
variables.
# Mind the NA data!
by_dest <- group_by(flights, dest) delay <- summarise(by_dest, dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) )
Impressive! Yet, we may be interested in computing the numbers of data we have for each destination.
How to do so? We use the n()
function.
by_dest <- group_by(flights, dest) delay <- summarise(by_dest, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) delay
In this way, we can filter out destinations that have too little data points. Let us say we set a threshold at 20. We also initially remove Honolulu for graphical reasons.
delay <- filter(delay, count > 20, dest != "HNL")
We are ready to plot!
ggplot(data = delay, mapping = aes(x = dist, y = delay)) + geom_point(mapping = aes(size = count), alpha = 1/3) + geom_smooth(se = FALSE)
So, we have done three operations:
To do so, we had to create intermediate data frames we are not really interested in. What we care about here are the transformation. To emphasise this, we can employ pipe: %>%
.
Let's have a look at the following.
delays <- flights %>% group_by(dest) %>% summarise( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter(count > 20, dest != "HNL") delays
In this way, we can focus only on the transformation. What a pipe does is basically to rewrite f(x)
as x %>% f()
, g(f(x, y), z)
as x %>% f(y) %>% g(z)
and so on.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.