rmr2package allows the R to use Hadoop MapReduce.
rmr2is due to two reasons.
SparkRwas released as a separated component of Spark in 2014.
They are still in the process of deciding on the API of
In previous courses, we used
SparkR to access Spark
sparklyrpackage is a better long term bet.
sparklyr package is on CRAN and installed in the usual way
To use the
sparklyr package, you need a local copy of Spark.
Particular versions can then be installed
spark_install(version = "2.0.0")
Running this command will allow you to check that Spark is installed correctly.
You can connect to both local instances of Spark as well as remote clusters. To illustrate the Spark, we'll connect to a local instance
# Typically, you'll want to connect to a remote instance. sc = spark_connect(master = "local")
The returned Spark connection,
sc, provides a remote
data source to (in this case) a local Spark cluster.
Typically you don't read in CSV files or convert existing data frames into Spark;
If you could get objects into R in the first place why are you using Spark?
You can copy R data frames into Spark using the
library("dplyr") data(movies, package="ggplot2movies") movies_tbl = copy_to(sc, movies)
movies_tlb just looks like a standard
Notice that we are using lazy evaluation; we don't know how many rows there are in this data set.
# Doesn't work tail(movies_tbl)
Typically you would have more than a single table.
dplyrjust slots into the work flow.
summarise(movies_tbl, count = n(), no_of_action = sum(action), no_of_animation = sum(animation))
by_action = group_by(movies_tbl, action) summarise(by_action, mean = mean(rating), sd = sd(rating))
sc) implements a DBI interface for Spark.
dbGetQueryto execute SQL commands.
library("DBI") dbGetQuery(sc, "SELECT * FROM movies LIMIT 3")
In this example, we'll perform logistic regression to determine how a movies
Comedy, etc) affects it's overall
m = movies_tbl %>% mutate(good = as.numeric(rating > mean(rating))) %>% ml_logistic_regression(good ~ Action + Animation + Comedy + Drama + Documentary + Romance + Short)
m contains the relevant regression coefficients
movie_types = as.data.frame(diag(7)) colnames(movie_types) = c("Action", "Animation", "Comedy", "Drama", "Documentary", "Romance", "Short")
Then we copy the data frame to our Spark instance
movie_types_tbl = copy_to(sc, movie_types, overwrite=TRUE)
predict(m, movie_types_tbl, type="response")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.