First load the package and make a Spark connection:
```r
library("sparklyr")
sc = spark_connect(master = "local")
```
Then load the `dplyr` package:
```r
library("dplyr")
```
In this practical we're going to use the `nycflights13` data set. Run `?nycflights13::flights` to get an overview of the data set.
To copy the data frame across to your Spark cluster, use the `copy_to` function:
```r
flights_tbl = copy_to(sc, nycflights13::flights, "flights")
```
List the available tables using `src_tbls(sc)`.
Using the `dplyr` function `filter`, select flights where the air time was greater than 10 hours:
```r
flights_tbl %>% filter(air_time > 10 * 60)
```
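A quick way to see what `filter` does on a Spark table is to print the query it generates. The sketch below assumes a local Spark installation (for example one set up with `spark_install()`) and repeats the connection and copy steps so that it runs on its own. The names `long_flights` and `n_long` are illustrative only.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally and can be started from R
sc = spark_connect(master = "local")
flights_tbl = copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# Building the filter does not run it; long_flights is a lazy query
long_flights = flights_tbl %>% filter(air_time > 10 * 60)

# Print the SQL that would be sent to Spark
show_query(long_flights)

# Count the matching rows inside Spark and bring back only the result
n_long = long_flights %>% count() %>% collect()
n_long
```

Only the final `collect()` moves data into R, and here that is a single-row summary rather than the filtered flights themselves.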
Let's investigate how delay varies with day of the month. The following code groups by day of the month, then works out the mean arrival delay for each day:
```r
# na.rm = TRUE makes the removal of missing arrival delays explicit
delay = flights_tbl %>%
  group_by(day) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE))
```
Now compare `delay` with
```r
delay_collect = delay %>% collect()
```
What's the difference? We can plot the delays easily with ggplot2 (or base graphics):
```r
library(ggplot2)
ggplot(delay_collect, aes(day, delay)) +
  geom_point() +
  geom_smooth()
```
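One way to explore the difference between `delay` and `delay_collect` is to look at their classes. The sketch below rebuilds both objects so that it runs on its own, again assuming a local Spark installation. It is only one possible way to inspect them.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally and can be started from R
sc = spark_connect(master = "local")
flights_tbl = copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

delay = flights_tbl %>%
  group_by(day) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE))
delay_collect = delay %>% collect()

# A lazy reference to a query that lives in Spark
class(delay)

# An ordinary local data frame (tibble) held in R's memory
class(delay_collect)
```

Functions that need the data in R, such as `ggplot`, are given the collected version, which is why the plot above uses `delay_collect`.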
3. Produce a similar plot for the `dep_delay` variable.
Tricky question: Create a boxplot graphic where the x-axis is day of the week and the y-axis is the delay.
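```r
library(lubridate)
# `delay` has one row per day of the month and no month or year, so use a local copy of the flights
flights_df = flights_tbl %>% select(year, month, day, arr_delay) %>% collect()
flights_df$wday = wday(dmy(paste(flights_df$day, flights_df$month, flights_df$year)), label = TRUE)
```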
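Putting the pieces together, a boxplot by day of the week might look like the sketch below. It assumes a local Spark installation, collects only the columns it needs, and computes the weekday in R because `lubridate` functions run locally. The names `flights_df` and `p` are illustrative, and `make_date` is used here as an alternative to `dmy(paste(...))`.

```r
library(sparklyr)
library(dplyr)
library(lubridate)
library(ggplot2)

# Assumption: Spark is installed locally and can be started from R
sc = spark_connect(master = "local")
flights_tbl = copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# Select columns and drop missing delays inside Spark, then pull the rows into R
flights_df = flights_tbl %>%
  select(year, month, day, arr_delay) %>%
  filter(!is.na(arr_delay)) %>%
  collect()

# Build a date from its parts and extract the labelled day of the week
flights_df$wday = wday(
  make_date(flights_df$year, flights_df$month, flights_df$day),
  label = TRUE
)

# One box per weekday, showing the spread of arrival delays
p = ggplot(flights_df, aes(wday, arr_delay)) +
  geom_boxplot()
p
```

Replacing `arr_delay` with `dep_delay` throughout gives the corresponding plot for departure delays.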