Load the package and make a connection
library(sparklyr)
sc = spark_connect(master = "local")
In this exercise, we're going to use the nycflights13 data set. Look at the associated help page, ?nycflights13::flights, to get an overview of the data set.
Load the dplyr package
library(dplyr)
and copy the data frame across
flights_tbl = copy_to(sc, nycflights13::flights, "flights")
List the available tables using src_tbls(sc).
Look at the filter dplyr function and use it to select flights where the air time was greater than 10 hours (air_time is recorded in minutes):
flights_tbl %>% filter(air_time > 10 * 60)
Next, calculate the mean arrival delay for each day of the month:
delay = flights_tbl %>% group_by(day) %>% summarise(delay = mean(arr_delay))
Now compare
delay
delay_collect = delay %>% collect()
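One way to inspect the two objects is to look at their classes and at the SQL generated for the remote table. This is only a sketch: it assumes delay and delay_collect exist from the steps above, and the exists() guard is there purely so the block is safe to run on its own.

```r
library(dplyr)

# Assumes delay and delay_collect were created in the steps above.
if (exists("delay") && exists("delay_collect")) {
  print(class(delay))          # remote table: the class vector includes "tbl_spark"
  show_query(delay)            # the SQL sent to Spark when the result is requested
  print(class(delay_collect))  # local tibble: the class vector includes "tbl_df"
}
```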
What's the difference? We can plot the delays easily with ggplot2 (or base graphics)
library(ggplot2)
ggplot(delay_collect, aes(day, delay)) + geom_point() + geom_smooth()
Produce a similar plot for dep_delay
Tricky question: Create a boxplot graphic where the x-axis is day of the week and the y-axis is the delay.
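One possible approach to the boxplot, sketched under some assumptions: the day of the week is derived locally from the year, month and day columns after collecting, and "the delay" is taken to mean arr_delay. The helper add_weekday is hypothetical (it is not part of sparklyr or dplyr), and the exists() guard is only there so the block can be run without a Spark session.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper (not part of the exercise): add an ordered day-of-week
# factor to a local data frame that has year, month and day columns.
add_weekday = function(df) {
  dates = as.Date(paste(df$year, df$month, df$day, sep = "-"))
  # "%u" gives the weekday number as a string: "1" = Monday, ..., "7" = Sunday
  df$weekday = factor(format(dates, "%u"),
                      levels = as.character(1:7),
                      labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
  df
}

# Assumes flights_tbl from the steps above; only the needed columns are collected.
if (exists("flights_tbl")) {
  flights_df = flights_tbl %>%
    select(year, month, day, arr_delay) %>%
    filter(!is.na(arr_delay)) %>%
    collect() %>%
    add_weekday()

  print(ggplot(flights_df, aes(weekday, arr_delay)) + geom_boxplot())
}
```

Collecting before computing the weekday keeps the date handling in plain R. Doing the same step inside Spark would be an alternative, but it depends on which date functions the Spark version in use provides.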