Description Details Public fields Methods Examples
This class was designed as a thin wrapper around Spark's SparkContext. It is initialized when spark_submit is called and inserted into the workspace as sc. Note that running sc$stop() will end your session. For information on method and type requirements, refer to the Spark javadoc.
Not all methods are implemented, due to compatibility and tidyspark best-practice usage conflicts. If you need to use a method that is not included, try calling it with call_method(sc$jobj, <yourMethod>), as sketched below.
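A minimal sketch of that fallback, assuming an active session where sc is the SparkContext; the method name "applicationId" is only an illustration of an unwrapped SparkContext method:
# call an unwrapped method directly on the underlying Java object
call_method(sc$jobj, "applicationId")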
jobj
the SparkContext java object
getConf
get the SparkConf
new()
Create a new SparkContext
SparkContext$new(sc = NULL)
sc
optional; you can instantiate with another SparkContext's jobj.
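A minimal sketch of obtaining a SparkContext object, assuming an active session as in the Examples below:
spark <- spark_session()
sc <- spark$sparkContext            # the usual route
sc2 <- SparkContext$new(sc$jobj)    # wrap an existing context's jobj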
print()
print SparkContext
SparkContext$print()
addFile()
Add a file to be downloaded with this Spark job on every node.
SparkContext$addFile(path, recursive = F)
path
string
recursive
boolean
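A sketch of addFile, assuming sc is an active SparkContext; the file and directory paths are hypothetical:
# ship a single local file to every node working on this job
sc$addFile("data/lookup.csv")
# ship a whole directory
sc$addFile("data/resources/", recursive = TRUE)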
addJar()
Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.
SparkContext$addJar(path)
path
string
appName()
get the app name
SparkContext$appName()
broadcast()
Broadcast a variable to executors.
SparkContext$broadcast(value)
value
the variable to broadcast.
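A minimal sketch of broadcast, assuming sc is an active SparkContext; the value is arbitrary:
# make a small lookup table available to every executor
lookup <- c(a = 1, b = 2, c = 3)
bc <- sc$broadcast(lookup)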
cancelAllJobs()
Cancel all jobs that have been scheduled or are running.
SparkContext$cancelAllJobs()
cancelJobGroup()
Cancel active jobs for the specified group.
SparkContext$cancelJobGroup(groupId)
groupId
string
clearJobGroup()
Clear the current thread's job group ID and its description.
SparkContext$clearJobGroup()
defaultMinPartitions()
Default minimum number of partitions for Hadoop RDDs when not given by the user. Note that we use math.min, so "defaultMinPartitions" cannot be higher than 2.
SparkContext$defaultMinPartitions()
defaultParallelism()
Default level of parallelism to use when not given by the user.
SparkContext$defaultParallelism()
emptyRDD()
Get an RDD that has no partitions or elements.
SparkContext$emptyRDD()
RDD
isLocal()
is the Spark process local?
SparkContext$isLocal()
boolean
jars()
get the JAR files added to this SparkContext
SparkContext$jars()
a jobj representing scala.collection.Seq<String>
master
master()
get the Spark master URL
SparkContext$master()
string
parallelize()
Distribute a list (or Scala collection) to form an RDD.
SparkContext$parallelize(seq, numSlices = 1L)
seq
list (or Scala Collection) to distribute
numSlices
number of partitions to divide the collection into
Parallelize acts lazily. If seq is a mutable collection and is altered after the call to parallelize and before the first action on the RDD, the resultant RDD will reflect the modified collection; pass a copy of the argument to avoid this. Avoid using parallelize(Seq()) to create an empty RDD; consider emptyRDD for an RDD with no partitions, or parallelize(Seq[T]()) for an RDD of T with empty partitions.
RDD
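A short sketch of parallelize, assuming sc is an active SparkContext:
# distribute an R list across 4 partitions
an_rdd <- sc$parallelize(list(1:10), 4L)
# prefer emptyRDD() for an RDD with no partitions or elements
empty <- sc$emptyRDD()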
setCheckpointDir()
Set the directory under which RDDs are going to be checkpointed.
SparkContext$setCheckpointDir(directory)
directory
string, path to the directory where checkpoint files will be stored (must be HDFS path if running in cluster)
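A sketch of setCheckpointDir; the directory is hypothetical and should be an HDFS path when running on a cluster:
sc$setCheckpointDir("/tmp/spark-checkpoints")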
setJobDescription()
Set a human-readable description of the current job.
SparkContext$setJobDescription(value)
value
string
setJobGroup()
Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.
SparkContext$setJobGroup(groupId, description, interruptOnCancel)
groupId
string
description
string
interruptOnCancel
If TRUE, then job cancellation will result in Thread.interrupt() being called on the job's executor threads. This is useful to help ensure that the tasks are actually stopped in a timely manner, but is off by default due to HDFS-1208, where HDFS may respond to Thread.interrupt() by marking nodes as dead.
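A sketch tying setJobGroup, cancelJobGroup, and clearJobGroup together, assuming sc is an active SparkContext; the group ID and description are arbitrary:
# tag every job started by this thread
sc$setJobGroup("nightly-etl", "nightly ETL load", interruptOnCancel = FALSE)
# ... later, cancel everything in that group
sc$cancelJobGroup("nightly-etl")
# and stop tagging subsequent jobs
sc$clearJobGroup()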
setLocalProperty()
Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool.
SparkContext$setLocalProperty(key, value)
key
string
value
string
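A sketch of setLocalProperty using the fair scheduler pool property mentioned above; the pool name is arbitrary:
sc$setLocalProperty("spark.scheduler.pool", "production")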
sparkUser()
get the username of the user running this SparkContext
SparkContext$sparkUser()
startTime()
get the time at which this SparkContext was started
SparkContext$startTime()
stop()
Shut down the SparkContext.
SparkContext$stop()
textFile()
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
SparkContext$textFile(path, minPartitions)
path
string, path to the text file on a supported file system
minPartitions
int, suggested minimum number of partitions for the resulting RDD
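A sketch of textFile; the path is hypothetical and may be local, HDFS, or any other Hadoop-supported URI:
log_lines <- sc$textFile("hdfs:///data/logs/app.log", 4L)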
version()
The version of Spark on which this application is running.
SparkContext$version()
union()
Build the union of a list of RDDs.
SparkContext$union(rdds)
rdds
a list of RDDs or RDD jobjs
RDD
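A sketch of union, assuming sc is an active SparkContext:
rdd1 <- sc$parallelize(list(1:5), 2L)
rdd2 <- sc$parallelize(list(6:10), 2L)
combined <- sc$union(list(rdd1, rdd2))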
wholeTextFiles()
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
SparkContext$wholeTextFiles(path, minPartitions)
path
Directory of the input data files; the path can be a comma-separated list of input paths.
minPartitions
A suggested minimum number of partitions for the input data.
RDD
clone()
The objects of this class are cloneable with this method.
SparkContext$clone(deep = FALSE)
deep
Whether to make a deep clone.
## Not run:
spark <- spark_session()
sc <- spark$sparkContext
sc$defaultParallelism()
an_rdd <- sc$parallelize(list(1:10), 4)
sc$getConf$get("spark.submit.deployMode")
spark_session_stop()
## End(Not run)