library(learnr) library(tidyverse) knitr::opts_chunk$set(echo = TRUE) tutorial_options(exercise.timelimit = 120, exercise.blanks = "___+") library(gapminder) gapminder$country <- as.character(gapminder$country) gapminder$continent <- as.character(gapminder$continent) data("mtcars")
We'll first load the gampinder data and then fix it so that the country
and continent
columns are text instead of 'factors'
library(gapminder) gapminder$country <- as.character(gapminder$country) gapminder$continent <- as.character(gapminder$continent)
For this practice, try to use the pipe (%>%
) wherever possible to chain together operations, with the general pattern
input_data %>% operation_1 %>% operation_2
If you want to capture the result of the chain of operations and assign it to a new variable (rather than say printing out the result to the console), you can do:
result <- input_data %>% operation_1 %>% ...
Recall that the filter
function is useful for extracting a subset of rows from a data table. Each input to filter
(in addition to the data table itself) specifies a logical condition that will be applied to select rows. If you provide multiple inputs filter(data, A, B)
this selects for 'A AND B'. You can use A | B
to specify 'A OR B'.
Let's practice some filtering. Use the filter
function (and the %>%
) to extract the specified set of rows from the gapminder
table.
The first 5 rows where continent is equal to Asia
and life expectancy is greater than 35 yrs
gapminder %>% filter(continent == "Asia", lifeExp > 35) %>% head(5)
All rows where the country begins with 'A'
(hint: the >
and <
operators also work on strings to compare alphabetical order)
gapminder %>% filter(lifeExp < 'B')
All rows where the continent is Oceania
OR the country is one of Turkey, Uruguay, or Taiwan
# gapminder %>% # filter(continent == "Oceania" | country %in% c('Turkey', 'Uruguay', 'Taiwan'))
Recall that arrange
sorts the rows of a data table based on the values in one or more columns.
select
allows you to select a subset of columns from a table as: select(col1, col2, col3)
.
Sort the rows of the gapminder
table alphabetically using the continent and country columns. First by continent, then by country
gapminder %>% arrange(continent, country)
What were the top 5 years with the highest life expectancy in China? Use the select
function to create a table that contains only the year and lifeExp for these top 5 entries.
gapminder %>% filter(country == "China") %>% arrange(desc(lifeExp)) %>% head(5) %>% select(year,lifeExp)
Recall that mutate
allows you to add new columns to a table. The new columns can also be generated as a function of existing columns in the table (e.g. in_data %>% mutate(col12_sum = old_col1 + old_col2)
).
Add a new variable to the gapminder table called GDP
which has the total GDP for each country
gapminder <- gapminder %>% mutate(GDP = pop * gdpPercap)
#adding GDP to gapminder so the rest will work gapminder <- gapminder %>% mutate(GDP = lifeExp * gdpPercap)
Add a column called high_GDP
to gapminder which has value TRUE if the GDP
is higher than the overall average, and FALSE otherwise. (It's easiest to do this in two steps).
mean_gdp <- mean(gapminder$GDP) #first calculate the overall avg GDP gapminder %>% mutate(high_GDP = GDP > mean_gdp)
Recall that summarise
allows you to compute summary statistics across rows of a table. Use summarise
to compute the mean and median GDP for counties in Asia after 1980. Your result should be a table with a single row and two columns called mean_gdp
and median_gdp
gapminder %>% filter(continent == 'Asia', year >= 1980) %>% summarise(mean_gdp = mean(GDP), median_gdp = median(GDP))
Recall that group_by
allows you to group rows by a variable or combinations of variables, which is typically used in combination with summarise
to compute summary statistics for each group in a table.
Make a table with the mean, median, max, and min life expectancies for Asian countries in the 1970's.
gapminder %>% filter(continent == "Asia", year>=1970, year<1980) %>% group_by(country) %>% summarise(mean_life_exp = mean(lifeExp), median_life_exp = median(lifeExp), max_life_exp = max(lifeExp), min_life_exp = min(lifeExp))
Make a table with the mean lifeExp by country and year
gapminder %>% group_by(country, year) %>% summarise(mean_life_exp = mean(lifeExp))
Challenge: Make a table with the mean lifeExp by country and decade. (Hint: you can use the floor
function to round numbers down to an integer)
# # The floor function will always round down. So when we divide the year by 10 , floor and then multiply back, this is equavalent to stripping it's unit digit. This gives the decade
gapminder %>% mutate(decade = floor(year/10)*10) %>% group_by(country, decade) %>% summarise(mean_life_exp = mean(lifeExp))
Find the five entries in gapminder that had the highest ratio of GDP to life expectancy, using only countries in Asia for years after 1980
gapminder %>% mutate(gpd_to_lifeexp = GDP/lifeExp) %>% filter(continent == "Asia", year > 1980) %>% arrange(desc(gpd_to_lifeexp)) %>% head(5)
Challenge: Make a table which lists the number of samples where each continent had a mean lifeExpectancy greater than 60.
gapminder %>% group_by(continent, year) %>% summarise(mean_life_exp = mean(lifeExp)) %>% filter(mean_life_exp > 60) %>% group_by(continent) %>% summarise(number_of_samples = n())
Some additional practice (optional) using the mtcars
dataset which contains info about cars. This practice is taken from the HBC course here
head(mtcars)
# Turn row name to column and convert to tibble mtcars_tb <- mtcars %>% rownames_to_column(var = "car") %>% as_tibble()
Perform the following data wrangling steps using mtcars_tb
as the starting point. Name the final variable as mtcars_final.
One of the columns is am
. It indicates transmission status, where 0 refers to automatic, and 1 refers to manual. Extract cars with manual transmission status.
We are only interested in these five columns: car
, mpg
, cyl
, wt
, am
. Select only these five columns for further analysis.
Some column names are not intuitive. Rename the cyl
to cylinder, wt
to weight, and am
to transmission.
We want to order our data. Arrange the data first by cylinder in ascending order, and then by mpg in descending order.
Check your result: if you finish the data wrangling successfully, the mtcars_final should have 13 entries with 5 features. The first entry should be Toyota Corolla, and the last entry should be Maserati Bora.
mtcars_final <- mtcars_tb %>% select(car, mpg, cylinder = cyl, weight = wt, transmission = am) %>% filter(transmission == 1) %>% arrange(cylinder, desc(mpg)) mtcars_final
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.