Exercise 1

First, load the sparklyr package and make a Spark connection

library("sparklyr")
sc = spark_connect(master = "local")

Then load the dplyr package

library("dplyr")

In this practical we're going to use the nycflights13 data set. Run

?nycflights13::flights

to get an overview of the data set.

To copy the data frame across to your Spark cluster, use the copy_to function

flights_tbl = copy_to(sc, nycflights13::flights, "flights")

List the available tables using src_tbls(sc).
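
As a minimal sketch of why the table name matters: once a data frame has been copied across under a name, that name can be used to look the table up again with dplyr's tbl(). The block below repeats the setup so that it runs on its own. The variable name flights_ref is only illustrative, and overwrite = TRUE is an addition here so that re-running the code does not fail if the table already exists.

```r
library("sparklyr")
library("dplyr")

# Same setup as above; overwrite = TRUE is an assumption added for re-runs
sc = spark_connect(master = "local")
flights_tbl = copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# Names of the tables currently registered with this connection
src_tbls(sc)

# Look the table up again by its registered name (flights_ref is illustrative)
flights_ref = tbl(sc, "flights")
class(flights_ref)
```

Both flights_tbl and flights_ref point at the same Spark table; neither holds the data in R's memory.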

Exercise 2

  1. Using the filter dplyr function, select flights where the airtime was greater than 10 hours

    flights_tbl %>% filter(air_time > 10 * 60)

  2. Let's investigate how delay varies with day of the month. The following code groups by day of the month, then works out the mean arrival delay for each day

    delay = flights_tbl %>% group_by(day) %>% summarise(delay = mean(arr_delay))

    Now compare delay with

    delay_collect = delay %>% collect()

    What's the difference? We can plot the delays easily with ggplot2 (or base graphics)

    library(ggplot2)
    # Why does delay not work below?
    ggplot(delay_collect, aes(day, delay)) + geom_point() + geom_smooth()

  3. Produce a similar plot for the dep_delay variable.

  4. Tricky question: Create a boxplot graphic where the x-axis is day of the week and the y-axis is the delay.

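    library(lubridate)
    # delay only has the day and delay columns, so collect the date columns first
    flights_df = flights_tbl %>% select(year, month, day, arr_delay) %>% collect()
    flights_df$wday = wday(dmy(paste(flights_df$day, flights_df$month, flights_df$year)), label = TRUE)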
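
For item 2 above, one way to explore the difference between delay and delay_collect is to inspect their classes and the query behind delay. This is a sketch rather than a model answer. It repeats the setup so that it runs on its own, and it adds overwrite = TRUE and na.rm = TRUE (neither is in the exercise code) to avoid an error on re-runs and a warning about missing values.

```r
library("sparklyr")
library("dplyr")

# Same setup as in Exercise 1; overwrite = TRUE is an assumption added for re-runs
sc = spark_connect(master = "local")
flights_tbl = copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# This only describes a query; no rows are pulled into R yet
delay = flights_tbl %>%
  group_by(day) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE))

class(delay)       # a reference to a remote Spark table
show_query(delay)  # the SQL that would be sent to Spark

# collect() runs the query and returns the result as a local tibble
delay_collect = delay %>% collect()
class(delay_collect)
nrow(delay_collect)  # one row per day of the month
```

Because delay_collect is an ordinary local data frame, it can be handed to ggplot2 or base graphics like any other data frame.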

jr-packages/jrBig documentation built on Jan. 1, 2020, 2:02 p.m.