library(learnr)
knitr::opts_chunk$set(echo = TRUE, eval = TRUE)

Data Transformation

For this lab, we will follow R for Data Science. A fundamental first step in any statistical analysis is to explore the data and derive some indicators to better understand the problem. To see why this is important, we will work on a data frame containing all the flights that departed from New York City in 2013. We will rely on the tidyverse suite of packages and, in particular, on the dplyr package.

Loading the libraries

library(tidyverse)
library(nycflights13)

nycflights13

Let us have a look at the data.

flights

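We have 19 columns. Each column contains some information, ranging from the year to the time of departure. Notice that under each column name there is a three- or four-letter abbreviation, which indicates the type of the variable. The possible types are int (integers), dbl (doubles, i.e. real numbers), chr (character vectors, i.e. strings), dttm (date-times), lgl (logical values, TRUE or FALSE), fctr (factors) and date (dates).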
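
To see where these abbreviations come from, here is a minimal sketch on a toy tibble. The column names and values are made up for illustration and are not taken from flights.

```r
library(dplyr)

# Toy tibble with made-up values: one column per common type.
toy <- tibble(
  id      = 1:3,                          # printed as <int>
  delay   = c(2.5, -1, 10),               # printed as <dbl>
  carrier = c("UA", "AA", "B6"),          # printed as <chr>
  late    = delay > 0,                    # printed as <lgl>
  day     = as.Date("2013-01-01") + 0:2   # printed as <date>
)

# Printing the tibble shows the abbreviation under each column name.
toy

# The underlying R class of each column.
col_types <- sapply(toy, function(col) class(col)[1])
col_types
```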

dplyr basics

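There are five key operations that are possible by means of the dplyr package: filter() picks observations by their values, arrange() reorders the rows, select() picks variables by their names, mutate() creates new variables from existing ones, and summarise() collapses many values into a single summary.

All the above functions can be used in conjunction with the group_by() function, which changes the scope of the operations from the whole data frame to one group at a time.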

The syntax is the same for every function:

  1. The first argument is the data frame (or tibble, in tidyverse parlance)
  2. The subsequent arguments describe the actions to take, using the variable names (without quotes)
  3. The result is a new tibble

This homogeneity allows for chaining different operations. Notice that none of the operations modifies the data frame passed as first argument: each one returns a new tibble.
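
The following sketch illustrates both points on a toy tibble (made-up values, not the flights data): the input is left untouched, and because every verb takes a tibble and returns a tibble, calls can be nested.

```r
library(dplyr)

# Toy data with made-up values.
toy <- tibble(month = c(1, 3, 3, 5), dep_delay = c(4, -2, 15, 7))

# filter() returns a new tibble; `toy` itself is not modified.
march <- filter(toy, month == 3)

nrow(toy)    # still 4 rows
nrow(march)  # 2 rows

# The output of one verb can be the first argument of another.
march_delays <- select(filter(toy, month == 3), dep_delay)
march_delays
```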

Filtering data

Filtering lets us select a subset of the rows (i.e. the observations). Imagine, for instance, that we wish to select all the flights that departed on 16 March 2013.

filter(flights, month == 3, day == 16)

We obtain a new tibble!

Comparisons are made with the usual operators: >, >=, <, <=, != (not equal) and == (equal).

Select all the flights that departed between the 12th and the 16th of May (inclusive).


filter(flights, month == 5, day >= 12, day <= 16)

You can store the new data frame by using the <- operator, as usual.

Comparing real numbers

Computers rely on finite-precision approximations when dealing with real numbers. Take a look at the following:

sqrt(2) ^ 2 == 2

Weird, right? Because numbers are stored with finite precision, we cannot directly compare the results of operations that are theoretically identical. We resort to near(), which checks equality up to a small tolerance:

near(sqrt(2)^2, 2)
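
A short sketch of what is going on. The tol argument and the base R all.equal() comparison are shown only as an aside; the lab itself relies on near() with its default tolerance.

```r
library(dplyr)

x <- sqrt(2)^2

x == 2       # FALSE: the stored value is not exactly 2
x - 2        # a tiny, non-zero difference
near(x, 2)   # TRUE: equal up to a small tolerance

# near() takes a `tol` argument (default: .Machine$double.eps^0.5).
loose <- near(1.0001, 1, tol = 1e-3)   # TRUE, the difference is 1e-4

# Base R alternative for a tolerant comparison.
base_check <- isTRUE(all.equal(x, 2))
```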

Logical operators

Multiple arguments to filter() are combined with "and": a row is kept only if every condition is true. We may, however, be interested in finding the flights that departed on the 11th of either March or May. This is no problem: we use the "or" operator, | ("and" is &).

filter(flights, day == 11, month == 3 | month == 5)

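Notice that you might be tempted to write month == (3 | 5). This does not work. R first evaluates 3 | 5, which is TRUE; in the comparison, TRUE is then converted to 1 (FALSE is 0), and we end up with flights that departed in January! Dropping the parentheses does not help either: since == binds more tightly than |, month == 3 | 5 is read as (month == 3) | 5, which is TRUE for every row, so nothing is filtered by month at all.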
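
A minimal sketch of how R evaluates these expressions, using a made-up vector of months rather than the flights data:

```r
# Made-up vector of month numbers.
month <- c(1, 3, 5, 12)

either <- 3 | 5                     # TRUE: any non-zero number counts as TRUE

wrong_paren <- month == (3 | 5)     # compares with TRUE, i.e. 1: only January
wrong_bare  <- month == 3 | 5       # parsed as (month == 3) | 5: TRUE everywhere
right       <- month == 3 | month == 5

wrong_paren
wrong_bare
right
```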

Another option, perhaps easier, is the %in% operator, which keeps the rows whose value appears in a given vector.

filter(flights, day == 11, month %in% c(3, 5))

Remember that c(3, 5) is a vector with elements 3 and 5.
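
As a quick check on a toy tibble (made-up values), the sketch below shows that %in% keeps the same rows as the equivalent == comparisons joined by |.

```r
library(dplyr)

# Toy data with made-up values.
toy <- tibble(month = c(3, 5, 11, 3), day = c(11, 11, 11, 12))

with_or <- filter(toy, day == 11, month == 3 | month == 5)
with_in <- filter(toy, day == 11, month %in% c(3, 5))

with_or
with_in
```

With more than two or three values, the %in% form stays readable while the chain of | comparisons grows quickly.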

Selecting variables

Sometimes we are interested in only some of the variables. This is easily done using the select() function!

select(flights, year, month, day)

We can select all the columns between two given columns (inclusive).

select(flights, year:day)

Or, in a similar fashion, remove them.

select(flights, -(year:day))

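There are a number of helper functions that can be used within select(): starts_with("abc") matches names that begin with "abc", ends_with("xyz") matches names that end with "xyz", contains("ijk") matches names that contain "ijk", matches("(.)\\1") selects variables whose name matches a regular expression, and num_range("x", 1:3) matches x1, x2 and x3.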

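select() can also rename variables, but it drops every variable that is not explicitly mentioned. A better option is the rename() function, which keeps all the variables and only changes the names of those specified.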
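
A small sketch of select() helpers and rename() on a toy tibble. The column names mimic those of flights, but the values are made up for illustration.

```r
library(dplyr)

# Toy data with made-up values.
toy <- tibble(
  year = 2013, month = 1, day = 1:2,
  dep_delay = c(2, 4), arr_delay = c(11, 20),
  tailnum = c("N001", "N002")
)

delays   <- select(toy, ends_with("delay"))   # dep_delay and arr_delay
dep_only <- select(toy, starts_with("dep"))   # dep_delay

# rename() uses new_name = old_name and keeps every column.
renamed <- rename(toy, tail_num = tailnum)

# select() can rename too, but drops the columns that are not mentioned.
only_tail <- select(toy, tail_num = tailnum)

names(renamed)
names(only_tail)
```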

Adding new variables

Sometimes new variables have to be computed from existing ones. To do this, we resort to the mutate() function, which adds the new columns at the end of the data frame.

flights_sml <- select(flights,
                      year:day,
                      ends_with("delay"),
                      distance,
                      air_time
)

mutate(flights_sml,
       gain = dep_delay - arr_delay,
       speed = distance / air_time * 60
)

Notice that you can refer to variables you have just created.

mutate(flights_sml,
       gain = dep_delay - arr_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours
)

Some creation functions

The key property of any function used to create new variables is that it must be vectorised: it must take a vector as input and return a vector with the same number of elements.
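
A few vectorised functions in action, as a sketch on a toy column (made-up values). The particular functions are just examples; any function that returns one value per input element can be used in the same way.

```r
library(dplyr)

# Toy data with made-up values.
toy <- tibble(x = c(1, 3, 6, 10))

out <- mutate(toy,
  doubled   = x * 2,        # arithmetic is vectorised: 2, 6, 12, 20
  previous  = lag(x),       # value of the previous row: NA, 1, 3, 6
  change    = x - lag(x),   # difference with the previous row: NA, 2, 3, 4
  running   = cumsum(x),    # cumulative sum: 1, 4, 10, 20
  above_avg = x > mean(x)   # mean(x) is 5 and is recycled: FALSE, FALSE, TRUE, TRUE
)
out
```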

Summarising variables

The last operation is summarise() (or summarize()), which collapses a data frame into a single row.

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

Notice that we have set na.rm = TRUE. This removes the missing values (NA) from dep_delay before the mean is computed; the data frame itself is not changed. Why? Because otherwise a single missing value would be enough to make the mean itself NA:

summarise(flights, delay = mean(dep_delay, na.rm = FALSE))
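
The same behaviour on a toy tibble with a single missing value (made-up data), as a minimal sketch:

```r
library(dplyr)

# Toy data with one missing delay.
toy <- tibble(dep_delay = c(5, NA, 10, -3))

with_na    <- summarise(toy, delay = mean(dep_delay))                # NA
without_na <- summarise(toy, delay = mean(dep_delay, na.rm = TRUE))  # (5 + 10 - 3) / 3 = 4

# na.rm only affects the computation: the NA row is still in `toy`.
n_missing <- sum(is.na(toy$dep_delay))   # 1
```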

Grouping by variables

summarise() is usually best employed on sub-groups of the data. For instance, we may be interested in the average delay for each day, rather than for the whole of 2013. To this end, we resort to the group_by() function.

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
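
To make the grouping explicit, here is a sketch on a toy tibble with made-up values. The .groups = "drop" argument is an optional addition that is not used in the lab: it returns an ungrouped result and silences the message about the grouping of the output.

```r
library(dplyr)

# Toy data with made-up values: three distinct (month, day) pairs.
toy <- tibble(
  month     = c(1, 1, 1, 2, 2),
  day       = c(1, 1, 2, 1, 1),
  dep_delay = c(10, 20, 5, NA, 6)
)

by_day <- group_by(toy, month, day)

# One output row per group instead of one row overall.
daily <- summarise(by_day,
  delay = mean(dep_delay, na.rm = TRUE),
  .groups = "drop"
)
daily
```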

Pipelines

Piping operations

Let us say we are now interested in understanding whether there is a relationship between the distance of the destination and the average delay. Let us do this together!

First of all, we need to group flights by the destination.


by_dest <- group_by(flights, dest)

Then, we have to compute the summary values we need. For each destination, we want to compute the average distance and the average arrival delay. These are stored in the distance and arr_delay variables.

# Mind the NA data!
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
                   dist = mean(distance, na.rm = TRUE),
                   delay = mean(arr_delay, na.rm = TRUE)
)

Building the plot...

Impressive! Yet, we may also want to know how many observations we have for each destination. How to do so? We use the n() function, which returns the number of rows in the current group.

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay

In this way, we can filter out destinations that have too few data points. Let us say we set a threshold at 20. For now, we also remove Honolulu for graphical reasons.

delay <- filter(delay, count > 20, dest != "HNL")
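
One detail worth noting is that n() counts rows, including those whose arr_delay is missing. The sketch below uses a toy tibble with made-up values and a hypothetical threshold of 1 (instead of 20); the non_missing column is an illustrative addition that is not part of the lab.

```r
library(dplyr)

# Toy data with made-up values.
toy <- tibble(
  dest      = c("ATL", "ATL", "ATL", "BOS", "BOS", "HNL"),
  arr_delay = c(5, NA, 1, 10, 20, -4)
)

per_dest <- summarise(group_by(toy, dest),
  count       = n(),                      # rows per group, NA rows included
  non_missing = sum(!is.na(arr_delay)),   # rows with a recorded delay
  delay       = mean(arr_delay, na.rm = TRUE)
)

# Hypothetical threshold of 1, chosen only because the toy data is tiny.
kept <- filter(per_dest, count > 1, dest != "HNL")
kept
```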

We are ready to plot!

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(mapping = aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)

Finally pipelines!

So, we have done three operations:

  1. Grouped flights according to the destination
  2. Summarised to compute the average distance, the average delay and the number of observations per destination
  3. Filtered to remove points without enough observations and Honolulu, which is twice as far away as the next closest airport is.

To do so, we had to create intermediate data frames we are not really interested in. What we care about here are the transformations. To emphasise this, we can employ the pipe: %>%. Let's have a look at the following.

delays <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")
delays

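In this way, we can focus only on the transformations. What the pipe basically does is pass the result on its left as the first argument of the function on its right: x %>% f() is equivalent to f(x), x %>% f(y) %>% g(z) is equivalent to g(f(x, y), z), and so on.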
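
The equivalence can be checked directly. The sketch below computes the same summary twice on a toy tibble (made-up distances), once with nested calls and once with the pipe.

```r
library(dplyr)

# Toy data with made-up values.
toy <- tibble(
  dest     = c("ATL", "ATL", "BOS"),
  distance = c(700, 800, 200)
)

# Nested form: read from the inside out.
nested <- summarise(group_by(toy, dest), dist = mean(distance))

# Piped form: read from top to bottom.
piped <- toy %>%
  group_by(dest) %>%
  summarise(dist = mean(distance))

identical(nested$dist, piped$dist)   # TRUE: both give 750 and 200
```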



mascaretti/moxier documentation built on March 17, 2020, 3:57 p.m.