sparkreadr is an extension for sparklyr to read flat files into Spark DataFrames. It uses the Spark package spark-csv.
sparkreadr requires the sparklyr package to run
I recommend the latest stable version of sparklyr available on CRAN
install.packages("sparklyr")
Install the development version of sparkreadr from this Github repo using devtools
library(devtools)
devtools::install_github("emaasit/sparkreadr")
If Spark is not already installed, use the following sparklyr command to install your preferred version of Spark:
library(sparklyr)
spark_install(version = "2.0.0")
The call to library(sparkreadr)
will make the sparkreadr functions
available on the R search path and will also ensure that the
dependencies required by the package are included when we connect to
Spark.
library(sparkreadr)
We can create a Spark connection as follows:
sc <- spark_connect(master = "local")
sparkreadr provides the function spark_read_readr
to read readr data
files into Spark DataFrames. It uses a Spark package called spark-readr.
Here's an example.
mtcars_file <- system.file("extdata", "mtcars.csv", package = "sparkreadr")
mtcars_df <- spark_read_csv(sc, path = mtcars_file, table = "readr_table")
mtcars_df
The resulting pointer to a Spark table can be further used in dplyr statements.
library(dplyr)
mtcars_df %>% group_by(cyl) %>%
summarise(count = n(), avg.mpg = mean(mpg), avg.displacment = mean(disp), avg.horsepower = mean(hp))
Look at the Spark log from R:
spark_log(sc, n = 100)
Now we disconnect from Spark:
spark_disconnect(sc)
Thanks to RStudio for the sparklyr package that provides functionality to create Extensions.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.