SparkContext: The 'SparkContext' Class


Description

This class was designed as a thin wrapper around Spark's SparkContext. It is initialized when spark_submit is called and inserted into the workspace as sc. Note that running sc$stop() will end your session. For information on method and type requirements, refer to the Spark javadoc.

Details

Not all methods are implemented due to compatibility conflicts with tidyspark best-practice usage. If you need to use a method not included, try calling it using call_method(sc$jobj, <yourMethod>).
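For example, a SparkContext method that tidyspark does not wrap could be invoked directly on the underlying jobj. A minimal sketch, assuming a session has been started; the method name shown is only illustrative:

spark <- spark_session()
sc <- spark$sparkContext
# call an unwrapped method directly on the underlying Java object
call_method(sc$jobj, "getCheckpointDir")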

Public fields

jobj

the SparkContext Java object

getConf

get the SparkConf
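A minimal sketch of reading a configuration value through the getConf field (mirroring the Examples section below):

spark <- spark_session()
sc <- spark$sparkContext
sc$getConf$get("spark.submit.deployMode")  # read a single configuration entry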

Methods

Public methods


Method new()

Create a new SparkContext

Usage
SparkContext$new(sc = NULL)
Arguments
sc

optional; can instantiate with another SparkContext's jobj.
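A minimal sketch, assuming (as documented above) that new() accepts another SparkContext's jobj; in normal use the context is created by the session and exposed as sc, so new() is rarely called directly:

spark <- spark_session()
sc <- spark$sparkContext
sc2 <- SparkContext$new(sc$jobj)  # wrap the existing Java object in a new R6 instance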


Method print()

Print the SparkContext.

Usage
SparkContext$print()

Method addFile()

Add a file to be downloaded with this Spark job on every node.

Usage
SparkContext$addFile(path, recursive = F)
Arguments
path

string

recursive

boolean
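A minimal sketch, assuming sc is the active SparkContext; the file paths are hypothetical:

sc$addFile("/tmp/lookup.csv")                    # ship a single file to every node
sc$addFile("/tmp/resources", recursive = TRUE)   # ship a whole directory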


Method addJar()

Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.

Usage
SparkContext$addJar(path)
Arguments
path

string
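A minimal sketch, assuming sc is the active SparkContext; the jar path is hypothetical:

sc$addJar("/opt/jars/custom-udfs.jar")  # make the jar available to all future tasks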


Method appName()

Get the app name.

Usage
SparkContext$appName()

Method broadcast()

Broadcast a variable to the executors.

Usage
SparkContext$broadcast(value)
Arguments
value

the variable to broadcast.
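A minimal sketch of broadcasting a small lookup vector; the return value is assumed to be a broadcast handle usable inside tasks:

lookup <- c(a = 1, b = 2, c = 3)
bc <- sc$broadcast(lookup)  # send the value to the executors once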


Method cancelAllJobs()

Cancel all jobs that have been scheduled or are running.

Usage
SparkContext$cancelAllJobs()

Method cancelJobGroup()

Cancel active jobs for the specified group.

Usage
SparkContext$cancelJobGroup(groupId)
Arguments
groupId

string


Method clearJobGroup()

Clear the current thread's job group ID and its description.

Usage
SparkContext$clearJobGroup()

Method defaultMinPartitions()

Default minimum number of partitions for Hadoop RDDs when not given by the user. Note that we use math.min, so defaultMinPartitions cannot be higher than 2.

Usage
SparkContext$defaultMinPartitions()

Method defaultParallelism()

Default level of parallelism to use when not given by the user.

Usage
SparkContext$defaultParallelism()

Method emptyRDD()

Get an RDD that has no partitions or elements.

Usage
SparkContext$emptyRDD()
Returns

RDD


Method isLocal()

is the Spark process local?

Usage
SparkContext$isLocal()
Returns

boolean


Method jars()

Get the JARs that have been added to this SparkContext.

Usage
SparkContext$jars()
Returns

a jobj representing scala.collection.Seq<String>


Method master()

Get the Spark master URL.

Usage
SparkContext$master()
Returns

string


Method parallelize()

Distribute a list (or Scala collection) to form an RDD.

Usage
SparkContext$parallelize(seq, numSlices = 1L)
Arguments
seq

list (or Scala Collection) to distribute

numSlices

number of partitions to divide the collection into

Details

Parallelize acts lazily. If seq is a mutable collection and is altered after the call to parallelize and before the first action on the RDD, the resultant RDD will reflect the modified collection; pass a copy of the argument to avoid this. Avoid using parallelize(Seq()) to create an empty RDD; consider emptyRDD for an RDD with no partitions, or parallelize(Seq[T]()) for an RDD of T with empty partitions.

Returns

RDD
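A minimal sketch, echoing the Examples section below:

an_rdd <- sc$parallelize(list(1:10), numSlices = 4L)  # distribute across 4 partitions
empty  <- sc$emptyRDD()  # preferred over parallelize(list()) for an RDD with no partitions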


Method setCheckpointDir()

Set the directory under which RDDs are going to be checkpointed.

Usage
SparkContext$setCheckpointDir(directory)
Arguments
directory

string, path to the directory where checkpoint files will be stored (must be an HDFS path if running on a cluster)
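A minimal sketch with a hypothetical checkpoint directory:

sc$setCheckpointDir("/tmp/spark-checkpoints")  # use an HDFS path when running on a cluster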


Method setJobDescription()

Set a human-readable description of the current job.

Usage
SparkContext$setJobDescription(value)
Arguments
value

string


Method setJobGroup()

Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.

Usage
SparkContext$setJobGroup(groupId, description, interruptOnCancel)
Arguments
groupId

string

description

string

interruptOnCancel

If TRUE, then job cancellation will result in Thread.interrupt() being called on the job's executor threads. This is useful to help ensure that the tasks are actually stopped in a timely manner, but is off by default due to HDFS-1208, where HDFS may respond to Thread.interrupt() by marking nodes as dead.
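A minimal sketch of grouping, cancelling, and clearing jobs; the group ID and description are arbitrary:

sc$setJobGroup("nightly-etl", "nightly ETL run", interruptOnCancel = FALSE)
# ... trigger one or more actions ...
sc$cancelJobGroup("nightly-etl")  # cancel everything started under that group
sc$clearJobGroup()                # detach this thread from the group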


Method setLocalProperty()

Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool.

Usage
SparkContext$setLocalProperty(key, value)
Arguments
key

string

value

string
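A minimal sketch assigning this thread's jobs to a fair scheduler pool (the pool name is hypothetical):

sc$setLocalProperty("spark.scheduler.pool", "production")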


Method sparkUser()

Get the user running this SparkContext.

Usage
SparkContext$sparkUser()

Method startTime()

Get the time at which the SparkContext was started.

Usage
SparkContext$startTime()

Method stop()

Shut down the SparkContext.

Usage
SparkContext$stop()

Method textFile()

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

Usage
SparkContext$textFile(path, minPartitions)
Arguments
path

string, path to the text file on a supported file system

minPartitions

int, suggested minimum number of partitions for the resulting RDD
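A minimal sketch with a hypothetical path:

lines_rdd <- sc$textFile("hdfs:///data/logs/app.log", minPartitions = 8L)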


Method version()

The version of Spark on which this application is running.

Usage
SparkContext$version()

Method union()

Build the union of a list of RDDs.

Usage
SparkContext$union(rdds)
Arguments
rdds

a list of RDDs or RDD jobjs

Returns

RDD
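A minimal sketch combining two small RDDs:

rdd_a <- sc$parallelize(list(1:5))
rdd_b <- sc$parallelize(list(6:10))
combined <- sc$union(list(rdd_a, rdd_b))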


Method wholeTextFiles()

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.

Usage
SparkContext$wholeTextFiles(path, minPartitions)
Arguments
path

Directory of the input data files; the path can be a comma-separated list of paths.

minPartitions

A suggested minimum number of partitions for the input data.

Returns

RDD
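A minimal sketch with a hypothetical directory; each element of the result is assumed to pair a file name with its contents:

files_rdd <- sc$wholeTextFiles("hdfs:///data/raw", minPartitions = 4L)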


Method clone()

The objects of this class are cloneable with this method.

Usage
SparkContext$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

## Not run: 
spark <- spark_session()
sc <- spark$sparkContext
sc$defaultParallelism()
an_rdd <- sc$parallelize(list(1:10), 4)
sc$getConf$get("spark.submit.deployMode")

spark_session_stop()

## End(Not run)
