4.1 What is Apache Spark?

4.1 What is Apache Spark?

What is Apache Spark?

The Spark stack

The Spark stack

Does anyone use it?

What about Hadoop?

R and Spark: SparkR

R and Spark: Sparklyr

library("sparklyr")

4.2 A first Spark instance

4.2 A first Spark instance

The sparklyr package is on CRAN and installed in the usual way

install.packages("sparklyr")

To use the sparklyr package, you need a local copy of Spark.

library("sparklyr")
spark_available_versions()

Installing Spark

Particular versions can then be installed

spark_install(version = "2.0.0")

Running this command will allow you to check that Spark is installed correctly.

spark_installed_versions()

Local & Remove clusters

You can connect to both local instances of Spark as well as remote clusters. To illustrate the Spark, we'll connect to a local instance

# Typically, you'll want to connect to a remote instance.
sc = spark_connect(master = "local")

The returned Spark connection, sc, provides a remote dplyr data source to (in this case) a local Spark cluster.

Reading data

Typically you don't read in CSV files or convert existing data frames into Spark;

If you could get objects into R in the first place why are you using Spark?

Example

You can copy R data frames into Spark using the dplyr::copy_to function

library("dplyr")
data(movies, package="ggplot2movies")
movies_tbl = copy_to(sc, movies)

tlb objects

The object movies_tlb just looks like a standard dplyr object.

options(width=55)
movies_tbl

Notice that we are using lazy evaluation; we don't know how many rows there are in this data set.

# Doesn't work
tail(movies_tbl)

All tables

src_tbls(sc)

Typically you would have more than a single table.

Resilient distributed datasets (RDD)

Using dplyr

http://spark.rstudio.com/dplyr.html

Using dplyr

summarise(movies_tbl, 
          count = n(), 
          no_of_action = sum(action), 
          no_of_animation = sum(animation))

group_by

by_action = group_by(movies_tbl, action)
summarise(by_action, 
          mean = mean(rating), 
          sd = sd(rating)) 

as before.

Using SQL

library("DBI")
dbGetQuery(sc, "SELECT * FROM movies LIMIT 3")

Machine learning/Statistcal algorithms

http://spark.rstudio.com/mllib.html

Machine learning/Statistcal algorithms

Machine learning/Statistcal algorithms

In this example, we'll perform logistic regression to determine how a movies classification (Action, Comedy, etc) affects it's overall movie rating.

m = movies_tbl %>%
  mutate(good = as.numeric(rating > mean(rating))) %>%
  ml_logistic_regression(good ~ Action + Animation + 
                             Comedy + Drama + Documentary + 
                             Romance + Short)

Machine learning/Statistcal algorithms

The object m contains the relevant regression coefficients

m

Prediction

movie_types = as.data.frame(diag(7))
colnames(movie_types) = c("Action", "Animation", "Comedy",
                          "Drama", "Documentary", "Romance",
                          "Short")

Then we copy the data frame to our Spark instance

movie_types_tbl = copy_to(sc, movie_types, overwrite=TRUE)

Predict

predict(m, movie_types_tbl, type="response")


jr-packages/jrBig documentation built on April 3, 2018, 6:57 a.m.