Configuration for a Spark application. Used to set various Spark
parameters as key-value pairs.

Most of the time, you would create a SparkConf object with
SparkConf(), which will load values from spark.* Java system
properties as well. In this case, any parameters you set directly on
the SparkConf object take priority over system properties.

For unit tests, you can also call SparkConf(False) to skip
loading external settings and get the same configuration no matter
what the system properties are.

All setter methods in this class support chaining. For example,
you can write conf.setMaster("local").setAppName("My app").

Note

Once a SparkConf object is passed to Spark, it is cloned
and can no longer be modified by the user.
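
A minimal sketch of the pattern described above, using a local master for illustration:

    from pyspark import SparkConf, SparkContext

    # Setters return the SparkConf itself, so calls can be chained.
    conf = (SparkConf()
            .setMaster("local[2]")
            .setAppName("My app")
            .set("spark.executor.memory", "1g"))

    # SparkConf(False) skips loading spark.* system properties (useful in unit tests).
    test_conf = SparkConf(False)

    sc = SparkContext(conf=conf)  # from here on, the conf is cloned and frozen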

Create an Accumulator with the given initial value, using a given
AccumulatorParam helper object to define how to add values of the
data type if provided. Default AccumulatorParams are used for integers
and floating-point numbers if you do not provide one. For other types,
a custom AccumulatorParam can be used.
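
A short sketch using the default integer AccumulatorParam, assuming sc is an active SparkContext:

    # Default AccumulatorParams cover int and float initial values.
    total = sc.accumulator(0)

    # Tasks may only add to the accumulator; the driver reads .value.
    sc.parallelize([1, 2, 3, 4]).foreach(lambda x: total.add(x))
    print(total.value)  # 10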

Add a .py or .zip dependency for all tasks to be executed on this
SparkContext in the future. The path passed can be either a local
file, a file in HDFS (or other Hadoop-supported filesystems), or an
HTTP, HTTPS or FTP URI.
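
For example (the paths and module name are hypothetical), assuming sc is an active SparkContext:

    sc.addPyFile("/path/to/my_helpers.py")
    sc.addPyFile("hdfs:///libs/deps.zip")

    # Modules shipped this way can be imported inside tasks.
    def use_helper(x):
        from my_helpers import transform  # resolved on the executors
        return transform(x)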

Read a directory of binary files from HDFS, a local file system
(available on all nodes), or any Hadoop-supported file system URI
as a byte array. Each file is read as a single record and returned
in a key-value pair, where the key is the path of each file and the
value is the content of each file.

Note

Small files are preferred; large files are also allowed, but
may cause poor performance.
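
A small sketch, assuming sc is an active SparkContext and a hypothetical input directory:

    # Each element is a (path, bytes) pair holding one whole file.
    pairs = sc.binaryFiles("hdfs:///data/images")
    sizes = pairs.mapValues(len)  # e.g. map each file to its size in bytes
    print(sizes.take(3))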

Broadcast a read-only variable to the cluster, returning a
Broadcast (pyspark.broadcast.Broadcast)
object for reading it in distributed functions. The variable will
be sent to each executor only once.
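
A typical use, assuming sc is an active SparkContext:

    # Ship a lookup table to the executors once instead of with every task.
    lookup = sc.broadcast({"a": 1, "b": 2})

    rdd = sc.parallelize(["a", "b", "a"])
    print(rdd.map(lambda k: lookup.value.get(k, 0)).collect())  # [1, 2, 1]

    lookup.unpersist()  # release executor copies; destroy() removes it permanently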

Read an ‘old’ Hadoop InputFormat with arbitrary key and value class from HDFS,
a local file system (available on all nodes), or any Hadoop-supported file system URI.
The mechanism is the same as for sc.sequenceFile.

A Hadoop configuration can be passed in as a Python dict. This will be converted into a
Configuration in Java.

Read an ‘old’ Hadoop InputFormat with arbitrary key and value class, from an arbitrary
Hadoop configuration, which is passed in as a Python dict.
This will be converted into a Configuration in Java.
The mechanism is the same as for sc.sequenceFile.
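
A sketch of both calls, assuming sc is an active SparkContext; the path and configuration key are illustrative, while the input format, key, and value classes are standard Hadoop classes:

    lines = sc.hadoopFile(
        "hdfs:///data/logs",
        inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.Text")

    # hadoopRDD takes the same classes but reads the input path (and any other
    # settings) from a Hadoop configuration passed as a Python dict.
    conf = {"mapreduce.input.fileinputformat.inputdir": "hdfs:///data/logs"}
    lines2 = sc.hadoopRDD(
        inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.Text",
        conf=conf)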

Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class from HDFS,
a local file system (available on all nodes), or any Hadoop-supported file system URI.
The mechanism is the same as for sc.sequenceFile.

A Hadoop configuration can be passed in as a Python dict. This will be converted into a
Configuration in Java.

Read a ‘new API’ Hadoop InputFormat with arbitrary key and value class, from an arbitrary
Hadoop configuration, which is passed in as a Python dict.
This will be converted into a Configuration in Java.
The mechanism is the same as for sc.sequenceFile.
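
The same idea with the ‘new API’ (mapreduce package) classes; the path and configuration key are illustrative:

    records = sc.newAPIHadoopFile(
        "hdfs:///data/logs",
        inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.Text")

    conf = {"mapreduce.input.fileinputformat.inputdir": "hdfs:///data/logs"}
    records2 = sc.newAPIHadoopRDD(
        inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.Text",
        conf=conf)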

Create a new RDD of int containing elements from start to end
(exclusive), increased by step every element. Can be called the same
way as python’s built-in range() function. If called with a single argument,
the argument is interpreted as end, and start is set to 0.
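
For example, assuming sc is an active SparkContext:

    print(sc.range(5).collect())         # [0, 1, 2, 3, 4]  (single argument = end)
    print(sc.range(2, 10, 3).collect())  # [2, 5, 8]        (start, end, step)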

Assigns a group ID to all the jobs started by this thread until the group ID is set to a
different value or cleared.

Often, a unit of execution in an application consists of multiple Spark actions or jobs.
Application programmers can use this method to group all those jobs together and give a
group description. Once set, the Spark web UI will associate such jobs with this group.

If interruptOnCancel is set to true for the job group, then job cancellation will result
in Thread.interrupt() being called on the job’s executor threads. This is useful to help
ensure that the tasks are actually stopped in a timely manner, but is off by default due
to HDFS-1208, where HDFS may respond to Thread.interrupt() by marking nodes as dead.
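
A sketch of grouping and cancelling jobs from separate threads, assuming sc is an active SparkContext (the group id and description are hypothetical):

    import threading

    def run_long_job():
        # All jobs triggered from this thread belong to the group until it changes.
        sc.setJobGroup("nightly_etl", "long-running aggregation", interruptOnCancel=False)
        sc.parallelize(range(10**6)).map(lambda x: x * x).count()

    t = threading.Thread(target=run_long_job)
    t.start()

    # From another thread, everything in the group can be cancelled at once.
    sc.cancelJobGroup("nightly_etl")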

Read a directory of text files from HDFS, a local file system
(available on all nodes), or any Hadoop-supported file system
URI. Each file is read as a single record and returned in a
key-value pair, where the key is the path of each file and the
value is the content of each file.

If use_unicode is False, the strings will be kept as str (encoded
as UTF-8), which is faster and smaller than unicode. (Added in
Spark 1.2)
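
For example (hypothetical directory), assuming sc is an active SparkContext:

    files = sc.wholeTextFiles("hdfs:///data/configs", use_unicode=True)
    # Each element is (file path, full file contents); here we keep name and size.
    names = files.map(lambda kv: (kv[0].rsplit("/", 1)[-1], len(kv[1])))
    print(names.take(2))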

Aggregate the values of each key, using given combine functions and a neutral
“zero value”. This function can return a different result type, U, than the type
of the values in this RDD, V. Thus, we need one operation for merging a V into
a U and one operation for merging two U’s. The former operation is used for merging
values within a partition, and the latter is used for merging values between
partitions. To avoid memory allocation, both of these functions are
allowed to modify and return their first argument instead of creating a new U.
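
A common sketch is computing a per-key average, where V is an int and U is a (sum, count) tuple, assuming sc is an active SparkContext:

    pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 5)])

    sums_counts = pairs.aggregateByKey(
        (0, 0),
        lambda acc, v: (acc[0] + v, acc[1] + 1),   # merge a V into a U (within a partition)
        lambda a, b: (a[0] + b[0], a[1] + b[1]))   # merge two U's (across partitions)

    averages = sums_counts.mapValues(lambda t: t[0] / t[1])
    print(sorted(averages.collect()))  # [('a', 2.0), ('b', 5.0)]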

Mark this RDD for checkpointing. It will be saved to a file inside the
checkpoint directory set with SparkContext.setCheckpointDir() and
all references to its parent RDDs will be removed. This function must
be called before any job has been executed on this RDD. It is strongly
recommended that this RDD is persisted in memory, otherwise saving it
on a file will require recomputation.
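
A minimal sketch, assuming sc is an active SparkContext and a hypothetical checkpoint directory:

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)
    rdd.persist()      # recommended, so checkpointing does not recompute the lineage
    rdd.checkpoint()   # must be called before any action runs on this RDD
    rdd.count()        # the first action materializes and checkpoints the data
    print(rdd.isCheckpointed())  # True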

Aggregate the elements of each partition, and then the results for all
the partitions, using a given associative function and a neutral “zero value.”

The function op(t1, t2) is allowed to modify t1 and return it
as its result value to avoid object allocation; however, it should not
modify t2.

This behaves somewhat differently from fold operations implemented
for non-distributed collections in functional languages like Scala.
This fold operation may be applied to partitions individually, and then
fold those results into the final result, rather than apply the fold
to each element sequentially in some defined ordering. For functions
that are not commutative, the result may differ from that of a fold
applied to a non-distributed collection.
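
For example, assuming sc is an active SparkContext:

    from operator import add

    rdd = sc.parallelize([1, 2, 3, 4], 2)
    # The zero value 0 is applied once per partition and once for the final merge.
    print(rdd.fold(0, add))  # 10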

Merge the values for each key using an associative function “func”
and a neutral “zeroValue” which may be added to the result an
arbitrary number of times, and must not change the result
(e.g., 0 for addition, or 1 for multiplication).
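
For example, assuming sc is an active SparkContext:

    from operator import add

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    print(sorted(pairs.foldByKey(0, add).collect()))  # [('a', 4), ('b', 2)]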

Compute a histogram using the provided buckets. The buckets
are all open to the right except for the last, which is closed.
e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
i.e. 1<=x<10, 10<=x<20, 20<=x<=50. On an input of 1 and 50,
the resulting histogram would be [1, 0, 1].

If your buckets are evenly spaced (e.g. [0, 10, 20, 30]),
the bucket lookup can be switched from an O(log n) search to O(1) per
element (where n is the number of buckets).

Buckets must be sorted, not contain any duplicates, and have
at least two elements.

If buckets is a number, it will generate buckets which are
evenly spaced between the minimum and maximum of the RDD. For
example, if the min value is 0 and the max is 100, given buckets
as 2, the resulting buckets will be [0,50) [50,100]. The number of buckets must
be at least 1. An exception is raised if the RDD contains infinity.
If the elements in the RDD do not vary (max == min), a single bucket
will be used.
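
A short sketch of both forms, assuming sc is an active SparkContext:

    rdd = sc.parallelize([1, 5, 10, 25, 50])

    # Explicit buckets [1,10) [10,20) [20,50]; the last bucket is closed, so 50 counts.
    print(rdd.histogram([1, 10, 20, 50]))  # ([1, 10, 20, 50], [2, 1, 2])

    # A bucket count: two evenly spaced buckets between the RDD's min (1) and max (50).
    buckets, counts = rdd.histogram(2)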

Mark this RDD for local checkpointing using Spark’s existing caching layer.

This method is for users who wish to truncate RDD lineages while skipping the expensive
step of replicating the materialized data in a reliable distributed file system. This is
useful for RDDs with long lineages that need to be truncated periodically (e.g. GraphX).

Local checkpointing sacrifices fault-tolerance for performance. In particular, checkpointed
data is written to ephemeral local storage in the executors instead of to a reliable,
fault-tolerant storage. The effect is that if an executor fails during the computation,
the checkpointed data may no longer be accessible, causing an irrecoverable job failure.

This is NOT safe to use with dynamic allocation, which removes executors along
with their cached blocks. If you must use both features, you are advised to set
spark.dynamicAllocation.cachedExecutorIdleTimeout to a high value.
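
Usage is a single call before the first action, assuming sc is an active SparkContext:

    rdd = sc.parallelize(range(1000)).map(lambda x: x + 1)
    rdd.localCheckpoint()  # truncate the lineage using executor-local storage only
    rdd.count()            # materializes the data; no copy is written to reliable storage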

Set this RDD’s storage level to persist its values across operations
after the first time it is computed. This can only be used to assign
a new storage level if the RDD does not have a storage level set yet.
If no storage level is specified, it defaults to MEMORY_ONLY.
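
For example, assuming sc is an active SparkContext:

    from pyspark import StorageLevel

    rdd = sc.parallelize(range(100))
    rdd.persist(StorageLevel.MEMORY_AND_DISK)  # omit the argument for MEMORY_ONLY
    rdd.count()      # the first action computes and caches the partitions
    rdd.unpersist()  # release the cached data when it is no longer needed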

Can increase or decrease the level of parallelism in this RDD.
Internally, this uses a shuffle to redistribute data.
If you are decreasing the number of partitions in this RDD, consider
using coalesce, which can avoid performing a shuffle.
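
For example, assuming sc is an active SparkContext:

    rdd = sc.parallelize(range(1000), 8)
    wider = rdd.repartition(16)    # shuffles data into more partitions
    narrower = rdd.coalesce(2)     # fewer partitions, usually without a shuffle
    print(wider.getNumPartitions(), narrower.getNumPartitions())  # 16 2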

withReplacement – whether elements can be sampled multiple times (replaced when sampled out)

fraction – expected size of the sample as a fraction of this RDD’s size
without replacement: probability that each element is chosen; fraction must be [0, 1]
with replacement: expected number of times each element is chosen; fraction must be >= 0

seed – seed for the random number generator

Note

This is not guaranteed to provide exactly the fraction specified of the total
count of the given RDD.
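
For example, assuming sc is an active SparkContext:

    rdd = sc.parallelize(range(100))
    without = rdd.sample(False, 0.1, seed=42)  # roughly 10 elements, no duplicates
    with_rep = rdd.sample(True, 2.0, seed=42)  # each element drawn about twice on average
    print(without.count(), with_rep.count())   # approximate, not exact, counts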

Return a subset of this RDD sampled by key (via stratified sampling).
Create a sample of this RDD using variable sampling rates for
different keys as specified by fractions, a key to sampling rate map.
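
For example, assuming sc is an active SparkContext:

    pairs = sc.parallelize([("a", x) for x in range(100)] +
                           [("b", x) for x in range(100)])
    # Sample roughly 20% of the 'a' records and 50% of the 'b' records.
    sampled = pairs.sampleByKey(False, {"a": 0.2, "b": 0.5}, seed=17)
    print(sampled.countByKey())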

Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file
system, using the old Hadoop OutputFormat API (mapred package). Keys/values are
converted for output using either user specified converters or, by default,
org.apache.spark.api.python.JavaToWritableConverter.

Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file
system, using the old Hadoop OutputFormat API (mapred package). Key and value types
will be inferred if not specified. Keys and values are converted for output using either
user specified converters or org.apache.spark.api.python.JavaToWritableConverter. The
conf is applied on top of the base Hadoop conf associated with the SparkContext
of this RDD to create a merged Hadoop MapReduce job configuration for saving the data.

Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file
system, using the new Hadoop OutputFormat API (mapreduce package). Keys/values are
converted for output using either user specified converters or, by default,
org.apache.spark.api.python.JavaToWritableConverter.

Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file
system, using the new Hadoop OutputFormat API (mapreduce package). Key and value types
will be inferred if not specified. Keys and values are converted for output using either
user specified converters or org.apache.spark.api.python.JavaToWritableConverter. The
conf is applied on top of the base Hadoop conf associated with the SparkContext
of this RDD to create a merged Hadoop MapReduce job configuration for saving the data.

Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file
system, using the org.apache.hadoop.io.Writable types that we convert from the
RDD’s key and value types. The mechanism is as follows:

1. Pyrolite is used to convert the pickled Python RDD into an RDD of Java objects.

2. Keys and values of this Java RDD are converted to Writables and written out.
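
A round trip through saveAsSequenceFile illustrates the conversion; the output path is hypothetical and sc is assumed to be an active SparkContext:

    pairs = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])
    pairs.saveAsSequenceFile("hdfs:///tmp/seq_output")  # keys/values become Writables

    # Reading it back yields the original key-value pairs.
    print(sorted(sc.sequenceFile("hdfs:///tmp/seq_output").collect()))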

Zips this RDD with another one, returning key-value pairs with the
first element in each RDD, second element in each RDD, etc. Assumes
that the two RDDs have the same number of partitions and the same
number of elements in each partition (e.g. one was made through
a map on the other).
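
For example, assuming sc is an active SparkContext:

    xs = sc.parallelize(range(5), 2)
    ys = xs.map(lambda x: x * 10)   # same partitioning and per-partition element counts
    print(xs.zip(ys).collect())     # [(0, 0), (1, 10), (2, 20), (3, 30), (4, 40)]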

The ordering is first based on the partition index and then the
ordering of items within each partition. So the first item in
the first partition gets index 0, and the last item in the last
partition receives the largest index.

This method needs to trigger a Spark job when this RDD contains
more than one partition.

Items in the kth partition will get ids k, n+k, 2*n+k, ..., where
n is the number of partitions. So there may exist gaps, but this
method won’t trigger a Spark job, which is different from
zipWithIndex.
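
A quick comparison of the two methods, assuming sc is an active SparkContext:

    rdd = sc.parallelize(["a", "b", "c", "d"], 2)
    print(rdd.zipWithIndex().collect())     # [('a', 0), ('b', 1), ('c', 2), ('d', 3)]
    # Ids are k, n+k, 2*n+k, ... per partition, so they may have gaps.
    print(rdd.zipWithUniqueId().collect())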

Flags for controlling the storage of an RDD. Each StorageLevel records whether to use memory,
whether to drop the RDD to disk if it falls out of memory, whether to keep the data in memory
in a Java-specific serialized format, and whether to replicate the RDD partitions on multiple
nodes. Also contains static constants for some commonly used storage levels, such as MEMORY_ONLY.
Since the data is always serialized on the Python side, all the constants use the serialized
formats.

Destroy all data and metadata related to this broadcast variable.
Use this with caution; once a broadcast variable has been destroyed,
it cannot be used again. This method blocks until destroy has
completed.

A shared variable that can be accumulated, i.e., has a commutative and associative “add”
operation. Worker tasks on a Spark cluster can add values to an Accumulator with the +=
operator, but only the driver program is allowed to access its value, using its value attribute.
Updates from the workers get propagated automatically to the driver program.

While SparkContext supports accumulators for primitive data types like int and
float, users can also define accumulators for custom types by providing a custom
AccumulatorParam object. Refer to the doctest of this module for an example.
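
A sketch of a custom AccumulatorParam for element-wise list addition, assuming sc is an active SparkContext (the class name is illustrative):

    from pyspark import AccumulatorParam

    class VectorAccumulatorParam(AccumulatorParam):
        def zero(self, value):
            # Initial value with the same shape as the supplied vector.
            return [0.0] * len(value)

        def addInPlace(self, v1, v2):
            # Modify and return the first argument to avoid extra allocations.
            for i in range(len(v1)):
                v1[i] += v2[i]
            return v1

    vec = sc.accumulator([0.0, 0.0, 0.0], VectorAccumulatorParam())
    sc.parallelize([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]).foreach(lambda row: vec.add(row))
    print(vec.value)  # [5.0, 7.0, 9.0]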

These APIs intentionally provide very weak consistency semantics;
consumers of these APIs should be prepared to handle empty / missing
information. For example, a job’s stage ids may be known but the status
API may not have any information about the details of those stages, so
getStageInfo could potentially return None for a valid stage id.

To limit memory usage, these APIs only provide information on recent
jobs / stages. These APIs will provide information for the last
spark.ui.retainedStages stages and spark.ui.retainedJobs jobs.
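
A defensive sketch of the status APIs, assuming sc is an active SparkContext (the job group name is hypothetical):

    tracker = sc.statusTracker()

    # Any of these may legitimately return empty lists or None.
    print(tracker.getJobIdsForGroup("nightly_etl"))
    print(tracker.getActiveStageIds())

    info = tracker.getStageInfo(0)
    if info is not None:
        print(info.name, info.numTasks, info.numCompletedTasks)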