options(htmltools.dir.version = FALSE)

library(sparklyr)
library(dplyr)
library(dbplyr)
R&D subsidiary of NextEra Energy
We're hiring!
Spark is a distributed computing framework
If your data won't fit in memory...
sparklyr
- We have custom models that are already implemented in R
- We're already comfortable working in R
- We want to use R package functions that may not exist in other languages
Install sparklyr
install.packages('sparklyr')
Install Spark, by any of:

- brew install apache-spark (if you use Homebrew)
- downloading it from spark.apache.org
- sparklyr::spark_install(...)
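If you go the spark_install() route, you can request a particular Spark version. A minimal sketch (the version number here is only an illustration, pick whichever release you need):

# download and install a specific Spark version for local use
sparklyr::spark_install(version = "2.3.0")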
Amazon Web Services Elastic MapReduce (AWS EMR) clusters can be preconfigured to include Spark by default
Create a Spark configuration file:
config_name:
  spark.dynamicAllocation.enabled: true
  spark.yarn.executor.memory: ?
  spark.yarn.executor.memoryOverhead: ?
  sparklyr.log.console: true
Sometimes you'll need extra options, especially when working on AWS:
  spark.hadoop.fs.s3a.access.key:
  spark.hadoop.fs.s3a.secret.key:
  spark.hadoop.fs.s3a.endpoint: s3.us-east-1.amazonaws.com
  spark.driver.extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
  spark.executor.extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
  sparklyr.shell.driver-java-options: -Dcom.amazonaws.services.s3.enableV4=true
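If you'd rather not maintain a YAML file, the same options can also be set from R: spark_config() returns an ordinary named list that you can modify before connecting. A minimal sketch using two of the settings above:

# start from the defaults and override entries in R
conf <- sparklyr::spark_config()
conf$spark.dynamicAllocation.enabled <- TRUE
conf$sparklyr.log.console <- TRUE
# pass this to spark_connect(config = conf), exactly as with a file-based config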
I've skipped some steps here since there's a whole other talk tonight on setting up R / RStudio on AWS!
You'll also need an HDFS home directory for the rstudio user:

sudo -u hdfs hadoop fs -mkdir -p /user/rstudio
sudo -u hdfs hadoop fs -chown rstudio /user/rstudio
conf <- spark_config("some/path/config.yml")

sc <- spark_connect(master = "yarn",
                    spark_home = "/usr/lib/spark",
                    config = conf)
master can be:

- local, for prototyping
- yarn, for use with a Hadoop cluster managed by the YARN scheduler (standard for AWS EMR)

spark_home is wherever you installed Spark, or /usr/lib/spark if on AWS.
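For quick prototyping before you have a cluster, a local connection needs no config file at all. A minimal sketch using the built-in mtcars data rather than anything cluster-sized:

library(sparklyr)
library(dplyr)

# connect to a local Spark instance
sc <- spark_connect(master = "local")

# copy an R data frame into Spark and query it with ordinary dplyr verbs
mtcars_tbl <- copy_to(sc, mtcars)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()

spark_disconnect(sc)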
Thanks to Bruno Rodrigues for pointing me to a large public dataset!
http://www.brodrigues.co/blog/2018-02-16-importing_30gb_of_data/