In jr-packages/jrBig: Jumping Rivers: R for Big Data

Exercise 1

Load the world_bank dataset

library("dplyr")
data(world_bank, package = "jrBig")

Convert the data frame into a dplyr data frame

wb = tbl_df(world_bank)

Print the world_bank and wb data frames to screen. What's different?
What does the glimpse function do? Hint: just try it on a data frame.
Using wb, complete the following tasks.
filter
- Filter to retain data from the Country AFG;
- Filter to retain data for years between 1980 and 1990.
select
- Remove the Year.Code column;
- Select columns Year and gini;
mutate
- Create a new column called Year2010 that is yes for rows where Year > 2010
arrange
- Sort the data by Year in descending order, then by gini
summarise
- Calculate the standard deviation of the gdp_percap column.
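As a rough illustration of the five verbs above, here is a minimal sketch on a small made-up data frame. The `toy` tibble and its values are hypothetical; only the column names are borrowed from the exercise, and the real world_bank data may be laid out differently.

```r
library(dplyr)

# Hypothetical toy data (not real world_bank values)
toy = tibble(
  Country.Code = c("AAA", "AAA", "BBB", "BBB"),
  Year = c(1985, 2012, 1985, 2012),
  Year.Code = c("YR1985", "YR2012", "YR1985", "YR2012"),
  gini = c(30, 35, 40, NA),
  gdp_percap = c(1000, 1500, 2000, 2600)
)

# filter: keep rows that satisfy every condition
only_aaa = filter(toy, Country.Code == "AAA")
eighties = filter(toy, Year >= 1980, Year <= 1990)

# select: drop a column with a minus sign, or keep columns by name
no_code = select(toy, -Year.Code)
year_gini = select(toy, Year, gini)

# mutate: add a column computed from existing ones
flagged = mutate(toy, Year2010 = ifelse(Year > 2010, "yes", "no"))

# arrange: Year descending, ties broken by gini ascending (NA sorts last)
sorted = arrange(toy, desc(Year), gini)

# summarise: collapse the data to a single row
spread = summarise(toy, sd_gdp = sd(gdp_percap))
```

Each verb returns a new data frame and leaves `toy` untouched, which is what makes the verbs easy to chain later with the pipe.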
Bonus function. What does slice do? Try
- slice(wb, 1:3), slice(wb, 5:10), slice(wb, n())
- What does n() do?
Look at the documentation for the sample_n function. Can you sample 100 rows from the data set?
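The row-picking functions can be tried on a small example first. The sketch below uses a hypothetical ten-row tibble called `toy`, so the sample size is 5 rather than 100. In dplyr 1.0.0 and later, `sample_n()` is superseded by `slice_sample()`, although it still works.

```r
library(dplyr)

# Hypothetical toy data with ten rows
toy = tibble(id = 1:10, value = (1:10)^2)

# slice picks rows by position
first_three = slice(toy, 1:3)

# n() is the number of rows in the current group, so this is the last row
last_row = slice(toy, n())

# sample_n draws rows at random, without replacement by default
set.seed(1)
five = sample_n(toy, 5)
```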

Exercise 2

Calculate the mean gini and gdp_percap for each country; set na.rm=TRUE in the mean function. Hint: group by Country.

gb = group_by(wb, Country.Code)
summarise(gb, mean(gini, na.rm = TRUE), mean(gdp_percap, na.rm = TRUE))

Calculate the median gini and gdp_percap for each country per year.

gb = group_by(wb, Year, Country.Code)
summarise(gb, median(gini, na.rm = TRUE), median(gdp_percap, na.rm = TRUE))

Using the pipe operator, link the following operations together (for the wb data set)

Filter to retain data from the Country AFG;
- then remove the Year.Code column;
- then sort the data by Year in descending order, then by gini.
Filter to retain data for years between 1980 and 1990;
- then select columns Year and gini;
- then create a new column called Year2010 that is yes for rows where Year > 2010.

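wb = tbl_df(world_bank)
# Chain 1: AFG only, drop Year.Code, sort by descending Year, then gini
wb %>%
  filter(Country.Code == "AFG") %>%
  select(-Year.Code) %>%
  arrange(desc(Year), gini)
# Chain 2: 1980 to 1990 only, keep Year and gini, add the Year2010 flag
wb %>% filter(Year >= 1980, Year <= 1990) %>%
  select(Year, gini) %>%
  mutate(Year2010 = ifelse(Year > 2010, "yes", "no"))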

Compare

wb %>%
  group_by(Year, Country.Code) %>%
  summarise(gini = median(gini, na.rm = TRUE)) %>%
  summarise(max(gini, na.rm = TRUE))

and

wb %>%
  group_by(Country.Code, Year) %>%
  summarise(gini = median(gini, na.rm = TRUE)) %>%
  summarise(max(gini, na.rm = TRUE))

- Why are the answers different? What's happening?
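One way to explore this is on a tiny made-up table where the results can be checked by hand. The sketch below assumes a hypothetical `toy` tibble with one row per country and year. By default, each `summarise()` call on grouped data drops the last grouping variable, so the second `summarise()` works within whichever variable was listed first.

```r
library(dplyr)

# Hypothetical toy data: two countries, two years
toy = tibble(
  Country.Code = c("AAA", "AAA", "BBB", "BBB"),
  Year = c(2000, 2001, 2000, 2001),
  gini = c(30, 50, 40, 20)
)

# Grouped by Year, then Country.Code: the first summarise() drops
# Country.Code, so the maximum is taken within each Year
by_year = toy %>%
  group_by(Year, Country.Code) %>%
  summarise(gini = median(gini, na.rm = TRUE)) %>%
  summarise(max_gini = max(gini, na.rm = TRUE))

# Grouped by Country.Code, then Year: the first summarise() drops
# Year, so the maximum is taken within each Country.Code
by_country = toy %>%
  group_by(Country.Code, Year) %>%
  summarise(gini = median(gini, na.rm = TRUE)) %>%
  summarise(max_gini = max(gini, na.rm = TRUE))
```

Here `by_year` has one row per Year and `by_country` has one row per Country.Code, which is why the two pipelines disagree even though they group by the same variables.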

Exercise 3

Create an SQLite (or MySQL, PostgreSQL) database

db = src_sqlite(path = tempfile(), create = TRUE)
wb_sqlite = copy_to(db, world_bank, temporary = FALSE)
wb_sqlite = tbl(db, "world_bank")
src_desc(db) ## Gives you some details
src_tbls(db) ## Lists the tables in the DB

Extract the top 50 rows from this table.
Redo exercise 2, but query the database table instead of the in-memory data frame.
- Examine the underlying SQL code;
- Use collect to pull the query result from the database into R.
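The general workflow can be sketched on a small in-memory database. This example is an assumption-laden illustration rather than the exercise's own setup: it connects through `DBI::dbConnect()` instead of `src_sqlite()`, it needs the DBI, RSQLite and dbplyr packages installed, and the `toy` table and its values are hypothetical.

```r
library(dplyr)

# In-memory SQLite database (assumes DBI, RSQLite and dbplyr are installed)
con = DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy data copied into the database as table "toy"
toy = data.frame(
  Country.Code = c("AAA", "AAA", "BBB"),
  Year = c(2000, 2001, 2000),
  gini = c(30, 50, 40)
)
toy_db = copy_to(con, toy, name = "toy", temporary = FALSE)

# Build a lazy query: nothing is run on the database yet
query = toy_db %>%
  filter(Country.Code == "AAA") %>%
  select(Year, gini)

# Print the SQL that dplyr generated for the query
show_query(query)

# Run the query and bring the result into R as a local data frame
result = collect(query)

# head() is translated to a LIMIT clause, giving the first rows
top_rows = toy_db %>% head(2) %>% collect()

DBI::dbDisconnect(con)
```

Until `collect()` is called, `query` is only a description of the work to do, which lets the database do the filtering before any rows are transferred to R.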

Tip: Check out dplyr's CRAN page.

jr-packages/jrBig documentation built on Jan. 1, 2020, 2:02 p.m.
