Exercise 1

Load the world_bank dataset

library("dplyr")
data(world_bank, package = "jrBig")

Convert the data frame into a dplyr data frame

wb = tbl_df(world_bank)
  1. Print the world_bank and wb data frames to screen. What's different?
     ## wb prints only the first 10 rows and shows each column's type
  2. What does the glimpse function do? Hint: just try it on a data frame.
     ## Flips the data frame on its side: one line per column, with its type and first few values
  3. Using wb, complete the following tasks.
  4. filter
    • Filter to retain data from the Country AFG;
    • Filter to retain data for years between 1980 and 1990.
      filter(wb, Country.Code == "AFG")
      ## Strict inequalities exclude 1980 and 1990; use >= and <= to include them
      filter(wb, Year > 1980 & Year < 1990)
  5. select
    • Remove the Year.Code column;
    • Select columns Year and gini.
      ## Dropping the column by name does not depend on its position
      select(wb, -Year.Code)
      select(wb, Year, gini)
  6. mutate
    • Create a new column called Year2010 that is "yes" for rows where Year > 2010 and "no" otherwise.
      mutate(wb, Year2010 = ifelse(Year > 2010, "yes", "no"))
  7. arrange
    • Sort the data by Year in descending order, then by gini.
      arrange(wb, desc(Year), gini)
  8. summarise:
    • Calculate the standard deviation of the gdp_percap column.
      summarise(wb, sd(gdp_percap, na.rm = TRUE))
  9. Bonus function. What does slice do? Try
    • slice(wb, 1:3), slice(wb, 5:10), slice(wb, n())
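    • What does n() do? It returns the number of rows in the current group (here, the whole data frame), so slice(wb, n()) returns the last row.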
  10. Look at the documentation for the sample_n function. Can you sample 100 rows from the data set?
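
The last two tasks can be tried on a small data frame first. The sketch below uses a made-up tibble called toy (hypothetical values, not the real world_bank data) to show what slice, n() and sample_n return.

```r
library("dplyr")

# Hypothetical toy data standing in for wb (not the real world_bank values)
toy = tibble(Year = 2001:2010, gini = seq(30, 39))

slice(toy, 1:3)             # first three rows
last_row = slice(toy, n())  # n() is the row count, so this is the last row

set.seed(1)                 # make the random sample reproducible
samp = sample_n(toy, 5)     # 5 rows drawn without replacement
```

On the full data set the analogous call would be sample_n(wb, 100). Note that sample_n raises an error if you ask for more rows than the data frame has, unless replace = TRUE is set. In dplyr 1.0.0 and later, slice_sample(wb, n = 100) is the recommended equivalent.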

Exercise 2

  1. Calculate the mean gini and gdp_percap for each country; set na.rm = TRUE in the mean function. Hint: group by Country.Code.
gb = group_by(wb, Country.Code)
summarise(gb, mean(gini, na.rm = TRUE), mean(gdp_percap, na.rm = TRUE))
  2. Calculate the median gini and gdp_percap for each country per year.
gb = group_by(wb, Year, Country.Code)
summarise(gb, median(gini, na.rm = TRUE), median(gdp_percap, na.rm = TRUE))
  3. Compare
wb %>% group_by(Year, Country.Code) %>% summarise(gini = median(gini, na.rm = TRUE)) %>% summarise(max(gini, na.rm = TRUE))
     and
wb %>% group_by(Country.Code, Year) %>% summarise(gini = median(gini, na.rm = TRUE)) %>% summarise(max(gini, na.rm = TRUE))
  4. Why are the answers different? What's happening?
    • Solution: Each call to summarise peels the last grouping variable off the group_by statement. The first pipeline therefore returns one maximum per Year, the second one maximum per Country.Code.
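
The peeling behaviour is easier to see on a tiny data set. The following is a minimal sketch using a made-up tibble (toy, with hypothetical country codes AAA and BBB), not the world_bank data.

```r
library("dplyr")

# Hypothetical toy data: two countries, two years, two readings each
toy = tibble(
  Country.Code = rep(c("AAA", "BBB"), each = 4),
  Year = rep(c(2000, 2000, 2001, 2001), times = 2),
  gini = c(30, 32, 34, 36, 40, 42, 44, 46)
)

# Grouped by (Year, Country.Code): the first summarise drops Country.Code,
# so the second summarise returns one maximum per Year
by_year = toy %>%
  group_by(Year, Country.Code) %>%
  summarise(gini = median(gini)) %>%
  summarise(max_gini = max(gini))

# Grouped by (Country.Code, Year): the first summarise drops Year,
# so the second summarise returns one maximum per Country.Code
by_country = toy %>%
  group_by(Country.Code, Year) %>%
  summarise(gini = median(gini)) %>%
  summarise(max_gini = max(gini))
```

Printing by_year and by_country shows that the two results are keyed by different columns, which is why the answers differ.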

Exercise 3

Using the pipe operator, link the following operations together (for the wb data set)

  1. Filter to retain data from the Country AFG;
    • then remove the Year.Code column;
    • then sort the data by Year in descending order, then by gini.
      wb %>% filter(Country.Code == "AFG") %>% select(-Year.Code) %>% arrange(desc(Year), gini)
  2. Filter to retain data for years between 1980 and 1990;

    • then select columns Year and gini;
    • then create a new column called Year2010 that is "yes" for rows where Year > 2010 and "no" otherwise.

      wb %>% filter(Year > 1980 & Year < 1990) %>% select(Year, gini) %>% mutate(Year2010 = ifelse(Year > 2010, "yes", "no"))

Exercise 4

  1. Create an SQLite (or MySQL, PostgreSQL) database.
     db = src_sqlite(path = tempfile(), create = TRUE)
     wb_sqlite = copy_to(db, world_bank, temporary = FALSE)
     wb_sqlite = tbl(db, "world_bank")
     src_desc(db) ## Gives you some details
     src_tbls(db) ## Lists the tables in the DB
  2. Extract the first 50 rows from this table.
     head(wb_sqlite, 50)
  3. Redo exercise 3, but using the database table wb_sqlite instead of wb.
    • Examine the underlying SQL code;
    • Use collect to pull the query result into R.
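
As a rough sketch of the last task, the example below builds a query lazily, prints the generated SQL with show_query and then retrieves the rows with collect. It assumes the DBI, RSQLite and dbplyr packages are installed, connects with DBI::dbConnect rather than src_sqlite, and uses a made-up table called toy rather than world_bank.

```r
library("dplyr")

# Assumption: DBI, RSQLite and dbplyr are installed; toy data, not world_bank
con = DBI::dbConnect(RSQLite::SQLite(), ":memory:")
toy = data.frame(Country.Code = c("AFG", "AFG", "ALB"),
                 Year = c(1985, 1995, 1985),
                 gini = c(30, 31, 32))
toy_db = copy_to(con, toy, name = "toy")

# Build the query lazily; nothing is fetched from the database yet
q = toy_db %>%
  filter(Country.Code == "AFG") %>%
  select(Year, gini) %>%
  arrange(desc(Year))

show_query(q)        # print the SQL that dplyr generated
result = collect(q)  # run the query and pull the rows into R

DBI::dbDisconnect(con)
```

Until collect is called, q is only a description of the query, so the same pipeline can be inspected or extended without touching the data.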

Tip: Check out dplyr's CRAN page.



jr-packages/jrBig documentation built on Jan. 1, 2020, 2:02 p.m.