In AshirBorah/cp_bootcamp_r_tutorials:

library(learnr)
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)
tutorial_options(exercise.timelimit = 120, exercise.blanks = "___+")

library(gapminder)
gapminder$country <- as.character(gapminder$country)
gapminder$continent <- as.character(gapminder$continent)
data("mtcars")

Prep

We'll first load the gampinder data and then fix it so that the country and continent columns are text instead of 'factors'

library(gapminder)
gapminder$country <- as.character(gapminder$country)
gapminder$continent <- as.character(gapminder$continent)

For this practice, try to use the pipe (%>%) wherever possible to chain together operations, with the general pattern

input_data %>% 
  operation_1 %>% 
  operation_2

If you want to capture the result of the chain of operations and assign it to a new variable (rather than say printing out the result to the console), you can do:

result <- input_data %>% 
  operation_1 %>% 
  ...

Filter

Recall that the filter function is useful for extracting a subset of rows from a data table. Each input to filter (in addition to the data table itself) specifies a logical condition that will be applied to select rows. If you provide multiple inputs filter(data, A, B) this selects for 'A AND B'. You can use A | B to specify 'A OR B'.

Let's practice some filtering. Use the filter function (and the %>%) to extract the specified set of rows from the gapminder table.

The first 5 rows where continent is equal to Asia and life expectancy is greater than 35 yrs

gapminder %>%
  filter(continent == "Asia", lifeExp > 35) %>%
  head(5)

All rows where the country begins with 'A' (hint: the > and < operators also work on strings to compare alphabetical order)

gapminder %>%
  filter(lifeExp < 'B')

All rows where the continent is Oceania OR the country is one of Turkey, Uruguay, or Taiwan

# gapminder %>% 
#   filter(continent == "Oceania" | country %in% c('Turkey', 'Uruguay', 'Taiwan'))

Arrange/Select

Recall that arrange sorts the rows of a data table based on the values in one or more columns.

select allows you to select a subset of columns from a table as: select(col1, col2, col3).

Sort the rows of the gapminder table alphabetically using the continent and country columns. First by continent, then by country

gapminder %>%
  arrange(continent, country)

What were the top 5 years with the highest life expectancy in China? Use the select function to create a table that contains only the year and lifeExp for these top 5 entries.

gapminder %>%
  filter(country == "China") %>%
  arrange(desc(lifeExp)) %>%
  head(5) %>%
  select(year,lifeExp)

Mutate

Recall that mutate allows you to add new columns to a table. The new columns can also be generated as a function of existing columns in the table (e.g. in_data %>% mutate(col12_sum = old_col1 + old_col2)).

Add a new variable to the gapminder table called GDP which has the total GDP for each country

gapminder <- gapminder %>%
  mutate(GDP = pop * gdpPercap)

#adding GDP to gapminder so the rest will work
gapminder <- gapminder %>%
  mutate(GDP = lifeExp * gdpPercap)

Add a column called high_GDP to gapminder which has value TRUE if the GDP is higher than the overall average, and FALSE otherwise. (It's easiest to do this in two steps).

mean_gdp <- mean(gapminder$GDP) #first calculate the overall avg GDP
gapminder %>%
  mutate(high_GDP = GDP > mean_gdp)

Group-by/summarise

Recall that summarise allows you to compute summary statistics across rows of a table. Use summarise to compute the mean and median GDP for counties in Asia after 1980. Your result should be a table with a single row and two columns called mean_gdp and median_gdp

gapminder %>%
  filter(continent == 'Asia', year >= 1980) %>%
  summarise(mean_gdp = mean(GDP),
            median_gdp = median(GDP))

Recall that group_by allows you to group rows by a variable or combinations of variables, which is typically used in combination with summarise to compute summary statistics for each group in a table.

Make a table with the mean, median, max, and min life expectancies for Asian countries in the 1970's.

gapminder %>%
  filter(continent == "Asia", year>=1970, year<1980) %>%
  group_by(country) %>%
  summarise(mean_life_exp = mean(lifeExp),
            median_life_exp = median(lifeExp),
            max_life_exp = max(lifeExp),
            min_life_exp = min(lifeExp))

Make a table with the mean lifeExp by country and year

gapminder %>%
  group_by(country, year) %>%
  summarise(mean_life_exp = mean(lifeExp))

Challenge: Make a table with the mean lifeExp by country and decade. (Hint: you can use the floor function to round numbers down to an integer)

# # The floor function will always round down. So when we divide the year by 10 , floor and then multiply back, this is equavalent to stripping it's unit digit. This gives the decade

gapminder %>%
  mutate(decade = floor(year/10)*10) %>%
  group_by(country, decade) %>%
  summarise(mean_life_exp = mean(lifeExp))

More practice

more with gapminder

Find the five entries in gapminder that had the highest ratio of GDP to life expectancy, using only countries in Asia for years after 1980

gapminder %>%
  mutate(gpd_to_lifeexp = GDP/lifeExp) %>%
  filter(continent == "Asia", year > 1980) %>%
  arrange(desc(gpd_to_lifeexp)) %>%
  head(5)

Challenge: Make a table which lists the number of samples where each continent had a mean lifeExpectancy greater than 60.

gapminder %>%
  group_by(continent, year) %>%
  summarise(mean_life_exp = mean(lifeExp)) %>%
  filter(mean_life_exp > 60) %>%
  group_by(continent) %>%
  summarise(number_of_samples = n())

mtcars

Some additional practice (optional) using the mtcars dataset which contains info about cars. This practice is taken from the HBC course here

head(mtcars)

# Turn row name to column and convert to tibble
mtcars_tb <- mtcars %>% 
  rownames_to_column(var = "car") %>% 
  as_tibble()

Perform the following data wrangling steps using mtcars_tb as the starting point. Name the final variable as mtcars_final.

One of the columns is am. It indicates transmission status, where 0 refers to automatic, and 1 refers to manual. Extract cars with manual transmission status.
We are only interested in these five columns: car, mpg, cyl, wt, am. Select only these five columns for further analysis.
Some column names are not intuitive. Rename the cyl to cylinder, wt to weight, and am to transmission.
We want to order our data. Arrange the data first by cylinder in ascending order, and then by mpg in descending order.

Check your result: if you finish the data wrangling successfully, the mtcars_final should have 13 entries with 5 features. The first entry should be Toyota Corolla, and the last entry should be Maserati Bora.

mtcars_final <- mtcars_tb %>%
  select(car, mpg, cylinder = cyl, weight = wt, transmission = am) %>%   filter(transmission == 1) %>%
  arrange(cylinder, desc(mpg))
mtcars_final