library(sparklyr)
library(dplyr)
library(dbplyr)

WindLogics?


Why Spark?


Why sparklyr?


Getting Spark


Getting Spark working

Create a Spark configuration file:

config_name:
  spark.dynamicAllocation.enabled: true
  spark.executor.memory: ?
  spark.yarn.executor.memoryOverhead: ?
  sparklyr.log.console: true
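
You can also set the same options from R instead of (or alongside) the YAML file. A minimal sketch using sparklyr's spark_config() list; the memory values here are placeholders, not recommendations:

conf <- spark_config()
conf$spark.dynamicAllocation.enabled <- "true"
conf$spark.executor.memory <- "4G"                  # placeholder value
conf$spark.yarn.executor.memoryOverhead <- "1024"   # MiB, placeholder value
conf$sparklyr.log.console <- TRUE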

Sometimes you'll need extra options, especially when working on AWS:

  spark.hadoop.fs.s3a.access.key:                           
  spark.hadoop.fs.s3a.secret.key: 
  spark.hadoop.fs.s3a.endpoint: s3.us-east-1.amazonaws.com
  spark.driver.extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
  spark.executor.extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
  sparklyr.shell.driver-java-options: -Dcom.amazonaws.services.s3.enableV4=true
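
With those s3a options in place, Spark can read straight from S3. A minimal sketch, assuming you already have a connection sc (see below); the bucket and path are made up:

flights <- spark_read_csv(sc,
                          name = "flights",
                          path = "s3a://my-bucket/flights/*.csv")  # hypothetical bucket/path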

Getting Spark working, part 2

I've skipped some steps here since there's a whole other talk tonight on setting up R / RStudio on AWS!

# create an HDFS home directory for the rstudio user and give it ownership
sudo -u hdfs hadoop fs -mkdir -p /user/rstudio
sudo -u hdfs hadoop fs -chown rstudio /user/rstudio

The Spark session

conf <- spark_config("some/path/config.yml")

sc <- spark_connect(master = "yarn",
                    spark_home = "/usr/lib/spark",
                    config = conf)

master can be:

- local for prototyping
- yarn for use with a Hadoop cluster managed by the YARN scheduler (standard for AWS EMR)

spark_home is wherever you installed Spark, or /usr/lib/spark if on AWS.
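
Once connected, Spark tables behave like remote dplyr tables. A minimal sketch of the round trip, using mtcars purely for illustration:

cars_tbl <- copy_to(sc, mtcars, "cars", overwrite = TRUE)  # ship a local data frame to Spark

cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%   # translated to Spark SQL by dbplyr
  collect()                            # bring the (small) result back to R

spark_disconnect(sc)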


Demo time!

Thanks to Bruno Rodrigues for pointing me to a large public dataset!

http://www.brodrigues.co/blog/2018-02-16-importing_30gb_of_data/


