Exercise 1

Load the world_bank dataset

library("dplyr")
data(world_bank, package = "jrBig")

Convert the data frame into a dplyr data frame

wb = tbl_df(world_bank)
  1. Print the world_bank and wb data frames to screen. What's different?
  2. What does the glimpse function do? Hint: just try it on a data frame.
  3. Using wb, complete the following tasks.
  4. filter
    • Filter to retain data from the Country AFG;
    • Filter to retain data for years between 1980 and 1990.
  5. select
    • Remove the Year.Code column;
    • Select columns Year and gini;
  6. mutate
    • Create a new column called Year2010 that is yes for rows where Year > 2010
  7. arrange
    • Sort the data by Year in descending order and gini
  8. summarise:
    • Calculate the standard deviation of the gdp_percap column.
  9. Bonus function. What does slice do? Try
    • slice(wb, 1:3), slice(wb, 5:10), slice(wb, n())
    • What does n() do?
  10. Look at the documentation for the sample_n function. Can you sample $100$ rows from the data set?

Exercise 2

  1. Calculate the mean gini and gdp_percap for each country; set na.rm=TRUE in the mean function. Hint: group by Country.
gb = group_by(wb, Year)
summarise(gb, mean(gini, na.rm = TRUE), mean(gdp_percap, na.rm = TRUE))
  1. Calculate the median gini and gdp_percap for each country per year.
gb = group_by(wb, Year, Country.Code)
summarise(gb, median(gini, na.rm = TRUE), median(gdp_percap, na.rm = TRUE))

Using the pipe operator, link the following operations together (for the wb data set)

  1. Filter to retain data from the Country AFG;
    • then remove the Year.Code column;
    • then sort the data by Year in descending order and gini
  2. Filter to retain data for years between 1980 and 1990;
    • then select columns Year and gini;
    • create a new column called Year2010 that is yes for rows where Year > 2010
wb = tbl_df(world_bank)
wb %>% 
  filter(Country.Code == "AFG") %>%
  select(-1) %>%
  mutate(Year2010 = Year > 2010)

Compare r wb %>% group_by(Year, Country.Code) %>% summarise(gini = median(gini, na.rm = TRUE)) %>% summarise(max(gini, na.rm = TRUE)) and r wb %>% group_by(Country.Code, Year) %>% summarise(gini = median(gini, na.rm = TRUE)) %>% summarise(max(gini, na.rm = TRUE)) * Why are the answers different? What's happening?

Exercise 3

  1. Create an sql lite (or mysql, pgsql) database r db = src_sqlite(path = tempfile(), create = TRUE) wb_sqlite = copy_to(db, world_bank, temporary = FALSE) wb_sqlite = tbl(db, "world_bank") src_desc(db) ## Gives you some details src_tbls(db) ## Lists the tables in the DB
  2. Extract the top 50 rows from this table.
  3. Redo exercise 3, but using the data base query.
    • Examine the underlying SQL code;
    • Use collect to get the database.

Tip: Check out dplyr's CRAN page.



jr-packages/jrBig documentation built on Jan. 1, 2020, 2:02 p.m.