4.1 What is Apache Spark?

The Spark stack

Does anyone use it?

What about Hadoop?

SparkR

4.2 A first Spark instance

Next we initialise the Spark cluster and create a SparkContext

## Load the SparkR package, then start a local Spark context
library(SparkR)
sc = sparkR.init(master = "local")
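
The master argument controls where the computation runs; "local" uses a single local worker thread. As a sketch, on a multi-core machine we could instead request, say, four local threads (the number is illustrative)

## Use four local worker threads (illustrative)
sc = sparkR.init(master = "local[4]")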

We can also load additional Spark packages when initialising the context; for example, the spark-csv package allows Spark to read CSV files

sc = sparkR.init(sparkPackages = "com.databricks:spark-csv_2.11:1.0.3")

When we finish our Spark session, we should terminate the Spark context via

sparkR.stop()

4.3 Resilient distributed datasets (RDD)

4.3.1 Example: Moby Dick
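
Although the loading step has been omitted from these notes, the RDD description shown below suggests moby was read in with textFile. A minimal sketch, assuming the SparkR 1.x private RDD API and an illustrative file name, is

## Read the text into an RDD, one element per line (illustrative file name)
## textFile is not exported by SparkR, hence the :::
moby = SparkR:::textFile(sc, "moby_dick.txt")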

In any case, the moby object is an RDD object

R > moby
# MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:-2

Transformation

Example: Transformation
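
The transformation that builds the mobys object used below has also been omitted. A minimal sketch, assuming the SparkR 1.x private RDD API, keeps only the lines mentioning the word Moby

## A transformation: nothing is actually computed at this point
## filterRDD is not exported by SparkR, hence the :::
mobys = SparkR:::filterRDD(moby, function(line) grepl("Moby", line))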

Action

Example: Action

If we want to know how many lines contain the word Moby, we use the count function

## The answer is 77
count(mobys)

Lazy evaluation

Transformations in Spark are lazy: nothing is actually computed until an action, such as count, is called on the RDD. This allows Spark to plan the whole computation before doing any work.

Caching

Spark's RDDs are (by default) recomputed each time you run an action on them. To reuse an RDD in multiple operations, we can ask Spark to persist it via

## There are different levels of storage
persist(mobys, "MEMORY_ONLY")

Alternatively, we could use cache(mobys), which is the same as persist with the default level of storage. If you are not planning on reusing the object, don't use persist.

Summary

To summarise, every Spark session has a similar structure: initialise a Spark context, load the data, apply (lazy) transformations, run actions to obtain results, and finally stop the context.
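
As a minimal sketch (reusing the SparkR 1.x API and the illustrative Moby Dick example from above), a complete session looks something like

library(SparkR)

## 1. Initialise the Spark context
sc = sparkR.init(master = "local")

## 2. Load data into an RDD (illustrative file name)
moby = SparkR:::textFile(sc, "moby_dick.txt")

## 3. Apply (lazy) transformations
mobys = SparkR:::filterRDD(moby, function(line) grepl("Moby", line))

## 4. Run an action; this triggers the actual computation
count(mobys)

## 5. Terminate the Spark context
sparkR.stop()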

4.4 Loading data: creating RDDs

One way of creating an RDD is to distribute an existing R object across the cluster using parallelize

## Turn a local R vector into an RDD (parallelize is not exported, hence the :::)
vec_sp = SparkR:::parallelize(sc, 1:100)

Being lazy again

As before, we don't actually compute vec_sp until it is needed; calling an action such as count forces the evaluation

count(vec_sp)

4.5 Spark dataframes

Suppose we have already initialised a Spark context. To use Spark data frames, we need to create an SQLContext, via

sql_context = sparkRSQL.init(sc)

The SQLContext enables us to create data frames from a local R data frame,

chicks_sp = createDataFrame(sql_context, chickwts) 

or from other data sources, such as CSV files or a 'Hive table' (a CSV example is sketched below). If we examine the newly created object, we get

R > chicks_sp
# DataFrame[weight:double, feed:string]
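
Reading a CSV file, for example, uses the spark-csv package we loaded at initialisation. A sketch (the file path and options are illustrative) is

## Read a CSV file into a Spark DataFrame (illustrative path)
df_csv = read.df(sql_context, "chickwts.csv",
                 source = "com.databricks.spark.csv", header = "true")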

An S3 method for head is also available, so

R > head(chicks_sp, 2)
#  weight      feed
#1    179 horsebean
#2    160 horsebean

gives what we would expect. We can extract columns using the dollar notation, chicks_sp$weight, or using the select function

R > select(chicks_sp,  "weight")
# DataFrame[weight:double]

We can subset or filter the data frame using the filter function

filter(chicks_sp, chicks_sp$feed == "horsebean")

Using Spark data frames, we can also easily group and aggregate data (note the similarity to dplyr syntax). For example, to count the number of chicks in each feed group, we group and then summarise:

## The %>% pipe comes from the magrittr package
chicks_cnt = groupBy(chicks_sp, chicks_sp$feed) %>%
  summarize(count = n(chicks_sp$feed))

Then use head to view the top rows

head(chicks_cnt, 2)
#      feed count
#1   casein    12
#2 meatmeal    11

We can also use arrange to order the data by the most common feed group

arrange(chicks_cnt, desc(chicks_cnt$count))
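
Other aggregation functions work in the same way. For example, a sketch (assuming SparkR's avg column function) of the mean chick weight within each feed group, ordered by the heaviest group first:

## Average weight per feed group, heaviest first
chicks_avg = summarize(groupBy(chicks_sp, chicks_sp$feed),
                       avg_weight = avg(chicks_sp$weight))
head(arrange(chicks_avg, desc(chicks_avg$avg_weight)))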

4.6 Resources


