Let's start off by creating a new R script and loading tidyverse
:
library(tidyverse)
Clear everything to make sure there's nothing leftover in our environment
rm(list = ls())
Dplyr makes it easy to "chain" functions together using the pipe operator %>%
. The following diagram illustrates the general concept of pipes where data flows from one pipe to another until all the processing is completed.
The syntax of the pipe operator %>%
might appear unusual at first, but once you get used to it you’ll start to appreciate its power and flexibility.
We're using a dataset of flight departures from Houston in 2011.
| Filename | Description | |-|--------------------------------------------------------------------------------------------------------------------------------------------------| | flights.csv | Flight departures from Houston in 2011 | | weather.csv | Hourly weather | | planes.csv | Metadata for planes | | airports.csv | Metadata for airports |
We're going to use the readr
package which provides improved functions for reading datasets from files. Instead of the usual read.csv()
function, we'll use the read_csv()
function from readr
.
flights <- read_csv("https://raw.githubusercontent.com/altaf-ali/tidydata_tutorial/master/data/flights.csv") weather <- read_csv("https://raw.githubusercontent.com/altaf-ali/tidydata_tutorial/master/data/weather.csv") planes <- read_csv("https://raw.githubusercontent.com/altaf-ali/tidydata_tutorial/master/data/planes.csv") airports <- read_csv("https://raw.githubusercontent.com/altaf-ali/tidydata_tutorial/master/data/airports.csv")
Now let's examine the dataset
flights weather planes airports
Notice that because we used read_csv()
, the data frame we received now prints nicely without having to use the head()
function and does not clutter your screen.
The select
function is used to select columns.
dest
, time
and dist
)flights %>% select(dest, time, dist)
Add the arrival delay (arr_delay
) and departure delay (dep_delay
) columns as well.
flights %>% select(dest, time, dist, arr_delay, dep_delay)
Other ways to do the same
flights %>% select(dest, time, dist, ends_with("delay"))
and ...
flights %>% select(dest, time, dist, contains("delay"))
Select all columns from date
to arr
flights %>% select(date:arr)
Select all except plane
column using the minus sign
flights %>% select(-plane)
The filter()
function returns rows with matching conditions. We can find all flights to Boston (BOS) like this:
flights %>% filter(dest == "BOS")
Let's build on the previous exercise and find all flights to Boston (BOS) and select only the dest
, time
, dist
columns:
flights %>% select(dest, time, dist) %>% filter(dest == "BOS")
Now let's do the filter first and then select the columns
flights %>% filter(dest == "BOS") %>% select(dest, time, dist)
In this case the order doesn't matter, but when using pipes make sure you understand that each function is executed in sequence and the results are then fed to the next one.
Find all flights that match the following conditions:
Here's a brief summary of operators you can use:
Comparison Operators
tribble( ~Operator, ~Description, ~`Example (assume x is 5)`, ~Result, ">", "greater than", "x > 5", "FALSE", ">=", "greater than or equal to", "x >= 5", "TRUE", "<", "less than", "x < 5", "FALSE", "<=", "less than or equal to", "x <= 5", "TRUE", "==", "equal to", "x == 5", "TRUE", "!=", "not equal to", "x != 5", "FALSE" ) %>% knitr::kable()
Logical Operators
tribble( ~Operator, ~Description, "!", "not", "|", "or", "&", "and" ) %>% knitr::kable()
Other Operators
tribble( ~Operator, ~Description, ~`Example (assume x is 5)`, ~Result, "%in%", "check element in a vector", "x %in% c(1, 3, 5, 7)<br>x %in% c(2, 4, 6, 8)", "TRUE<br>FALSE" ) %>% knitr::kable()
The arrange()
function is used to sort the rows based on one or more columns
flights %>% arrange(dest)
The mutate()
function is used to create new variables.
Up until now we've only been examining the dataset but haven't made any changes to it. All our functions so far have simply displayed the results on screen but haven't created or modified existing variables. Let's see how we can create a new variable called speed
based on the distance and duration in the flights dataframe.
In this exercise we're adding a new variable to an existing dataframe so we'll just overwrite the flights
variable with the one that has a speed
column
flights <- flights %>% mutate(speed = dist / (time / 60))
Let's count the number of flights departing each day.
flights %>% group_by(date) %>% summarise(count = n())
Here's a nice little trick. You can use View()
to look at the results of a pipe operation without creating new variables.
flights %>% group_by(date) %>% summarise(count = n()) %>% View()
Of course, often times we'd want to save the summary in a variable for further analysis.
Let's find the average departure delay for each destination
delays <- flights %>% group_by(dest) %>% summarise(mean = mean(dep_delay)) delays
The unite
function is useful for combining multiple columns together. In the example below, we join the carrier
and flight
to create a unique flight_id
column.
flights %>% unite(flight_id, carrier, flight, sep = "-", remove = FALSE) %>% select(date, carrier, flight, flight_id)
The separate
function works the other way around by splitting a single column into multiple columns. Let's split the date
column into separate date
and time
columns.
flights %>% separate(date, c("date", "time"), sep = " ")
date
column into year
, month
, and day
columnsyear
, month
, and day
columns are of type integer (NOT character)separate
for an easy way to do thisAdd the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.