library(sparklyr)
library(dplyr)
library(dbplyr)

WindLogics?


Why Spark?


Why sparklyr?


Getting Spark


Getting Spark working

Create a Spark configuration file:

config_name:
  spark.dynamicAllocation.enabled: true
  spark.executor.memory: ?
  spark.yarn.executor.memoryOverhead: ?
  sparklyr.log.console: true
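
You can also set the same options from R instead of (or alongside) the YAML file. A minimal sketch using sparklyr's spark_config() list; the memory values here are placeholders, not recommendations:

conf <- spark_config()
conf$spark.dynamicAllocation.enabled <- "true"
conf$spark.executor.memory <- "4G"                  # placeholder value
conf$spark.yarn.executor.memoryOverhead <- "1024"   # MiB, placeholder value
conf$sparklyr.log.console <- TRUE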

Sometimes you'll need extra options, especially when working on AWS:

  spark.hadoop.fs.s3a.access.key:                           
  spark.hadoop.fs.s3a.secret.key: 
  spark.hadoop.fs.s3a.endpoint: s3.us-east-1.amazonaws.com
  spark.driver.extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
  spark.executor.extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
  sparklyr.shell.driver-java-options: -Dcom.amazonaws.services.s3.enableV4=true
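
With those s3a options in place, Spark can read straight from S3. A minimal sketch, assuming you already have a connection sc (see below); the bucket and path are made up:

flights <- spark_read_csv(sc,
                          name = "flights",
                          path = "s3a://my-bucket/flights/*.csv")  # hypothetical bucket/path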

Getting Spark working, part 2

I've skipped some steps here since there's a whole other talk tonight on setting up R / RStudio on AWS!

# create an HDFS home directory for the rstudio user and give it ownership
sudo -u hdfs hadoop fs -mkdir -p /user/rstudio
sudo -u hdfs hadoop fs -chown rstudio /user/rstudio

The Spark session

conf <- spark_config("some/path/config.yml")

sc <- spark_connect(master = "yarn",
                    spark_home = "/usr/lib/spark",
                    config = conf)

master can be:

- local for prototyping
- yarn for use with a Hadoop cluster managed by the YARN scheduler (standard for AWS EMR)

spark_home is wherever you installed Spark, or /usr/lib/spark if on AWS.
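
Once connected, Spark tables behave like remote dplyr tables. A minimal sketch of the round trip, using mtcars purely for illustration:

cars_tbl <- copy_to(sc, mtcars, "cars", overwrite = TRUE)  # ship a local data frame to Spark

cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%   # translated to Spark SQL by dbplyr
  collect()                            # bring the (small) result back to R

spark_disconnect(sc)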


Demo time!

Thanks to Bruno Rodrigues for pointing me to a large public dataset!

http://www.brodrigues.co/blog/2018-02-16-importing_30gb_of_data/


