SparkContext: The 'SparkContext' Class


Description

This class was designed as a thin wrapper around Spark's SparkContext. It is initialized when spark_submit is called and inserted into the workspace as sc. Note that running sc$stop() will end your session. For information on method and type requirements, refer to the Spark javadoc.

Details

Not all methods are implemented due to compatibility conflicts with tidyspark best-practice usage. If you need to use a method not included, try calling it using call_method(sc$jobj, <yourMethod>).
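For example, a SparkContext method that tidyspark does not wrap could be invoked directly on the underlying jobj. A minimal sketch, assuming a session has been started; the method name shown is only illustrative:

spark <- spark_session()
sc <- spark$sparkContext
# call an unwrapped method directly on the underlying Java object
call_method(sc$jobj, "getCheckpointDir")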

Public fields

jobj

the SparkContext Java object

getConf

get the SparkConf
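A minimal sketch of reading a configuration value through the getConf field (mirroring the Examples section below):

spark <- spark_session()
sc <- spark$sparkContext
sc$getConf$get("spark.submit.deployMode")  # read a single configuration entry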

Methods

Public methods


Method new()

Create a new SparkContext

Usage
SparkContext$new(sc = NULL)
Arguments
sc

optional; can instantiate with another SparkContext's jobj.
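A minimal sketch, assuming (as documented above) that new() accepts another SparkContext's jobj; in normal use the context is created by the session and exposed as sc, so new() is rarely called directly:

spark <- spark_session()
sc <- spark$sparkContext
sc2 <- SparkContext$new(sc$jobj)  # wrap the existing Java object in a new R6 instance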


Method print()

Print the SparkContext.

Usage
SparkContext$print()

Method addFile()

Add a file to be downloaded with this Spark job on every node.

Usage
SparkContext$addFile(path, recursive = F)
Arguments
path

string

recursive

boolean
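A minimal sketch, assuming sc is the active SparkContext; the file paths are hypothetical:

sc$addFile("/tmp/lookup.csv")                    # ship a single file to every node
sc$addFile("/tmp/resources", recursive = TRUE)   # ship a whole directory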


Method addJar()

Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.

Usage
SparkContext$addJar(path)
Arguments
path

string
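A minimal sketch, assuming sc is the active SparkContext; the jar path is hypothetical:

sc$addJar("/opt/jars/custom-udfs.jar")  # make the jar available to all future tasks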


Method appName()

Get the app name.

Usage
SparkContext$appName()

Method broadcast()

Broadcast a variable to the executors.

Usage
SparkContext$broadcast(value)
Arguments
value

the variable to broadcast.
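A minimal sketch of broadcasting a small lookup vector; the return value is assumed to be a broadcast handle usable inside tasks:

lookup <- c(a = 1, b = 2, c = 3)
bc <- sc$broadcast(lookup)  # send the value to the executors once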


Method cancelAllJobs()

Cancel all jobs that have been scheduled or are running.

Usage
SparkContext$cancelAllJobs()

Method cancelJobGroup()

Cancel active jobs for the specified group.

Usage
SparkContext$cancelJobGroup(groupId)
Arguments
groupId

string


Method clearJobGroup()

Clear the current thread's job group ID and its description.

Usage
SparkContext$clearJobGroup()

Method defaultMinPartitions()

Default minimum number of partitions for Hadoop RDDs when not given by the user. Note that we use math.min, so defaultMinPartitions cannot be higher than 2.

Usage
SparkContext$defaultMinPartitions()

Method defaultParallelism()

Default level of parallelism to use when not given by the user.

Usage
SparkContext$defaultParallelism()

Method emptyRDD()

Get an RDD that has no partitions or elements.

Usage
SparkContext$emptyRDD()
Returns

RDD


Method isLocal()

is the Spark process local?

Usage
SparkContext$isLocal()
Returns

boolean


Method jars()

Get the JARs that have been added to this SparkContext.

Usage
SparkContext$jars()
Returns

a jobj representing scala.collection.Seq<String>


Method master()

Get the Spark master URL.

Usage
SparkContext$master()
Returns

string


Method parallelize()

Distribute a list (or Scala collection) to form an RDD.

Usage
SparkContext$parallelize(seq, numSlices = 1L)
Arguments
seq

list (or Scala Collection) to distribute

numSlices

number of partitions to divide the collection into

Details

Parallelize acts lazily. If seq is a mutable collection and is altered after the call to parallelize and before the first action on the RDD, the resultant RDD will reflect the modified collection; pass a copy of the argument to avoid this. Avoid using parallelize(Seq()) to create an empty RDD; consider emptyRDD for an RDD with no partitions, or parallelize(Seq[T]()) for an RDD of T with empty partitions.

Returns

RDD
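A minimal sketch, echoing the Examples section below:

an_rdd <- sc$parallelize(list(1:10), numSlices = 4L)  # distribute across 4 partitions
empty  <- sc$emptyRDD()  # preferred over parallelize(list()) for an RDD with no partitions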


Method setCheckpointDir()

Set the directory under which RDDs are going to be checkpointed.

Usage
SparkContext$setCheckpointDir(directory)
Arguments
directory

string, path to the directory where checkpoint files will be stored (must be an HDFS path if running on a cluster)
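A minimal sketch with a hypothetical checkpoint directory:

sc$setCheckpointDir("/tmp/spark-checkpoints")  # use an HDFS path when running on a cluster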


Method setJobDescription()

Set a human-readable description of the current job.

Usage
SparkContext$setJobDescription(value)
Arguments
value

string


Method setJobGroup()

Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.

Usage
SparkContext$setJobGroup(groupId, description, interruptOnCancel)
Arguments
groupId

string

description

string

interruptOnCancel

If TRUE, then job cancellation will result in Thread.interrupt() being called on the job's executor threads. This is useful to help ensure that the tasks are actually stopped in a timely manner, but is off by default due to HDFS-1208, where HDFS may respond to Thread.interrupt() by marking nodes as dead.
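A minimal sketch of grouping, cancelling, and clearing jobs; the group ID and description are arbitrary:

sc$setJobGroup("nightly-etl", "nightly ETL run", interruptOnCancel = FALSE)
# ... trigger one or more actions ...
sc$cancelJobGroup("nightly-etl")  # cancel everything started under that group
sc$clearJobGroup()                # detach this thread from the group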


Method setLocalProperty()

Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool.

Usage
SparkContext$setLocalProperty(key, value)
Arguments
key

string

value

string
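A minimal sketch assigning this thread's jobs to a fair scheduler pool (the pool name is hypothetical):

sc$setLocalProperty("spark.scheduler.pool", "production")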


Method sparkUser()

Get the user running this SparkContext.

Usage
SparkContext$sparkUser()

Method startTime()

Get the time at which the SparkContext was started.

Usage
SparkContext$startTime()

Method stop()

Shut down the SparkContext.

Usage
SparkContext$stop()

Method textFile()

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

Usage
SparkContext$textFile(path, minPartitions)
Arguments
path

string, path to the text file on a supported file system

minPartitions

int, suggested minimum number of partitions for the resulting RDD
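A minimal sketch with a hypothetical path:

lines_rdd <- sc$textFile("hdfs:///data/logs/app.log", minPartitions = 8L)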


Method version()

The version of Spark on which this application is running.

Usage
SparkContext$version()

Method union()

Build the union of a list of RDDs.

Usage
SparkContext$union(rdds)
Arguments
rdds

a list of RDDs or RDD jobjs

Returns

RDD
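A minimal sketch combining two small RDDs:

rdd_a <- sc$parallelize(list(1:5))
rdd_b <- sc$parallelize(list(6:10))
combined <- sc$union(list(rdd_a, rdd_b))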


Method wholeTextFiles()

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.

Usage
SparkContext$wholeTextFiles(path, minPartitions)
Arguments
path

Directory of the input data files; the path can be a comma-separated list of paths.

minPartitions

A suggested minimum number of partitions for the input data.

Returns

RDD
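A minimal sketch with a hypothetical directory; each element of the result is assumed to pair a file name with its contents:

files_rdd <- sc$wholeTextFiles("hdfs:///data/raw", minPartitions = 4L)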


Method clone()

The objects of this class are cloneable with this method.

Usage
SparkContext$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

## Not run: 
spark <- spark_session()
sc <- spark$sparkContext
sc$defaultParallelism()
an_rdd <- sc$parallelize(list(1:10), 4)
sc$getConf$get("spark.submit.deployMode")

spark_session_stop()

## End(Not run)
