Exercise 1

Load the package and make a connection

library(sparklyr)
sc = spark_connect(master = "local")

In this exercise, we're going to use the nycflights13 data set. Look at the associated help page ?nycflights13::flights to get an overview of the data set. Load the dplyr package

library(dplyr)

and copy the dataframe across

flights_tbl = copy_to(sc, nycflights13::flights, "flights")

List the available tables using src_tbls(sc).
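As a minimal sketch of this step, the snippet below repeats the connection and copy from above, lists the tables registered with Spark, and then retrieves the copied table again by name. The name flights_ref is a hypothetical variable introduced only for illustration, and overwrite = TRUE is added so the snippet can be re-run in the same session.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance and copy the data across
sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# Names of the tables currently registered with the connection
src_tbls(sc)

# A table can also be referenced by name, without copying it again
flights_ref <- tbl(sc, "flights")
```

Both flights_tbl and flights_ref point at the same Spark table; neither holds the data in R's memory.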

Exercise 2

  1. Using the dplyr filter function, select flights where the air time was greater than 10 hours (air_time is recorded in minutes)
flights_tbl %>% filter(air_time > 10 * 60)
  2. Let's investigate how delay varies with day of the month. The following code groups by day of the month, then works out the mean arrival delay for each day
delay = flights_tbl %>% 
  group_by(day) %>%
  summarise(delay = mean(arr_delay))

Now compare

delay
delay_collect = delay %>% collect()

What's the difference? We can plot the delays easily using ggplot2 (or base graphics)

library(ggplot2)
ggplot(delay_collect, aes(day, delay)) +
  geom_point() +
  geom_smooth()
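To make the comparison between delay and delay_collect concrete, the following sketch inspects both objects. It repeats the setup so that it runs on its own; na.rm = TRUE and overwrite = TRUE are additions made here and are not part of the exercise code.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# This only builds a query; nothing is pulled into R yet
delay <- flights_tbl %>%
  group_by(day) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE))

class(delay)       # a lazy Spark table
show_query(delay)  # the SQL that Spark will run

# collect() executes the query and returns an ordinary local tibble
delay_collect <- collect(delay)
class(delay_collect)
```

Printing delay runs the query and shows only the first few rows, whereas delay_collect holds the full result in R's memory.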
  3. Produce a similar plot for dep_delay

  4. Tricky question: Create a boxplot graphic where the x-axis is day of the week and the y-axis is the delay.
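One possible approach to the boxplot, sketched under some assumptions: a boxplot needs the individual delays rather than a summary, so the relevant columns are collected into R first, and the day of the week is then derived locally from year, month and day. The names box_data, date and weekday are hypothetical, and arr_delay is assumed to be the delay of interest.

```r
library(sparklyr)
library(dplyr)
library(ggplot2)

sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# Select and filter in Spark, then bring the raw delays into R
box_data <- flights_tbl %>%
  select(year, month, day, arr_delay) %>%
  filter(!is.na(arr_delay)) %>%
  collect() %>%
  mutate(
    date = as.Date(ISOdate(year, month, day)),
    weekday = format(date, "%u")  # "1" = Monday, ..., "7" = Sunday
  )

p <- ggplot(box_data, aes(weekday, arr_delay)) +
  geom_boxplot()
print(p)
```

Collecting every row is workable for a data set of this size; for genuinely large data, the quantiles would have to be computed in Spark instead.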

jr-packages/jrBig documentation built on Jan. 1, 2020, 2:02 p.m.