In jr-packages/jrBig: Jumping Rivers: R for Big Data

Exercise 1

Load the world_bank dataset

library("dplyr")
data(world_bank, package = "jrBig")

Convert the data frame into a dplyr data frame

wb = tbl_df(world_bank)

Print the world_bank and wb data frames to screen. What's different? r ## wb only prints out 10 rows & and gives types
What does the glimpse function do? Hint: just try it on a data frame. r ## Flips the data frame on it's side
Using wb, complete the following tasks.
filter
- Filter to retain data from the Country AFG;
- Filter to retain data for years between 1980 and 1990. r filter(wb, Country.Code == "AFG") filter(wb, Year > 1980 & Year < 1990)
select
- Remove the Year.Code column;
- Select columns Year and gini; r select(wb, -1) select(wb, Year, gini)
mutate
- Create a new column called Year2010 that is yes for rows where Year > 2010 r mutate(wb, Year2010 = ifelse(Year > 2010, "yes", "no"))
arrange
- Sort the data by Year in descending order and gini r arrange(wb, desc(Year), gini)
summarise:
- Calculate the standard deviation of the gdp_percap column. r summarise(wb, mean(gdp_percap, na.rm = TRUE))
Bonus function. What does slice do? Try
- slice(wb, 1:3), slice(wb, 5:10), slice(wb, n())
- What does n() do? Counts
Look at the documentation for the sample_n function. Can you sample $100$ rows from the data set?

Calculate the mean gini and gdp_percap for each country; set na.rm=TRUE in the mean function. Hint: group by Country.

gb = group_by(wb, Year)
summarise(gb, mean(gini, na.rm = TRUE), mean(gdp_percap, na.rm = TRUE))

gb = group_by(wb, Year, Country.Code)
summarise(gb, median(gini, na.rm = TRUE), median(gdp_percap, na.rm = TRUE))

Compare r wb %>% group_by(Year, Country.Code) %>% summarise(gini = median(gini, na.rm = TRUE)) %>% summarise(max(gini, na.rm = TRUE)) and r wb %>% group_by(Country.Code, Year) %>% summarise(gini = median(gini, na.rm = TRUE)) %>% summarise(max(gini, na.rm = TRUE))
Why are the answers different? What's happening?
- Solution With each application of summarise a variable is peeled off the group_by statement.

Using the pipe operator, link the following operations together (for the wb data set)

Filter to retain data from the Country AFG;
- then remove the Year.Code column;
- then sort the data by Year in descending order and gini `r wb %>% select(Country == "AFG") %>% arrange(desc(YEAR), gini)
Filter to retain data for years between 1980 and 1990;
- then select columns Year and gini;
- create a new column called Year2010 that is yes for rows where Year > 2010
r wb %>% filter(Year > 1980 & Year < 1990) %>% select(-1) %>% mutate(Year2010 = ifelse(Year > 2010, "yes", "no"))

Create an sql lite (or mysql, pgsql) database r db = src_sqlite(path = tempfile(), create = TRUE) wb_sqlite = copy_to(db, world_bank, temporary = FALSE) wb_sqlite = tbl(db, "world_bank") src_desc(db) ## Gives you some details src_tbls(db) ## Lists the tables in the DB
Extract the top 50 rows from this table. r head(wb_sqlite, 50)
Redo exercise 3, but using the data base query.
- Examine the underlying SQL code;
- Use collect to get the database.

Tip: Check out dplyr's CRAN page.

jr-packages/jrBig documentation built on Jan. 1, 2020, 2:02 p.m.

Note that we can't provide technical support on individual packages. You should contact the package authors for that.