Checkpointing in Spark

Checkpointing is a mechanism available in Spark that saves an RDD to a reliable storage system like S3 or HDFS, so that Spark does not have to retain the RDD's entire lineage. When an RDD's lineage of dependencies becomes too long, it is advisable to truncate it with a checkpoint. This is similar to how Hadoop stores intermediate results on disk so that it can recover from failures. Since a checkpoint is written to an external storage location, it can also be reused by other applications.
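The idea above can be sketched in plain Python (no Spark required): the `FakeRDD` class, its method names, and the lineage list are all illustrative stand-ins, not Spark's actual implementation, but they show how a checkpoint replaces a long lineage with data read back from storage.

```python
# A pure-Python sketch of why checkpointing helps: on failure an RDD is
# normally recomputed from its lineage, but a checkpoint replaces that
# lineage with a materialized copy loaded from reliable storage.

class FakeRDD:
    """Toy stand-in for an RDD: data plus the lineage that rebuilds it."""
    def __init__(self, data, lineage):
        self.data = data
        self.lineage = lineage          # list of transformation names
        self.checkpointed = None        # materialized copy, if any

    def map(self, f, name):
        # Each transformation extends the lineage.
        return FakeRDD([f(x) for x in self.data], self.lineage + [name])

    def checkpoint(self):
        # In Spark this writes the partitions to HDFS/S3; here we just
        # keep a materialized copy and truncate the lineage.
        self.checkpointed = list(self.data)
        self.lineage = ["checkpoint"]

    def recover(self):
        # After a failure, a checkpointed RDD is reloaded, not recomputed.
        return self.checkpointed if self.checkpointed else None

base = FakeRDD([1, 2, 3], ["parallelize"])
long_chain = base.map(lambda x: x + 1, "map+1").map(lambda x: x * 2, "map*2")
long_chain.checkpoint()
print(long_chain.lineage)   # ['checkpoint'] -- the long lineage is gone
print(long_chain.recover()) # [4, 6, 8]
```

In real Spark code the equivalent calls are `sc.setCheckpointDir(...)` followed by `rdd.checkpoint()`.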

Checkpointing in Spark Streaming

A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system so that it can recover from failures.

There are two types of checkpointing available:

Metadata checkpointing – saves the information defining the streaming computation (configuration, DStream operations, and incomplete batches) to fault-tolerant storage, so that a failed driver can be restarted.

Data checkpointing – saves the generated RDDs themselves to reliable storage, which is required by stateful transformations that combine data across multiple batches.

To summarize, metadata checkpointing is primarily needed for recovery from driver failures, whereas data (RDD) checkpointing is necessary even for basic functioning if stateful transformations are used.
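The contrast can be illustrated with a small pure-Python simulation. The `checkpoint_store` dict stands in for a fault-tolerant directory on HDFS/S3, and the keys and helper names are illustrative assumptions, not Spark's actual checkpoint file layout.

```python
# A toy model of the two checkpoint kinds and of driver recovery.
checkpoint_store = {}

def checkpoint_metadata(config, pending_batches):
    # Metadata checkpointing: enough to restart a failed driver --
    # configuration plus the batches that were queued but not finished.
    checkpoint_store["metadata"] = {"config": config,
                                    "pending": list(pending_batches)}

def checkpoint_state(state):
    # Data checkpointing: the intermediate state that stateful
    # transformations carry across batches (e.g., running counts).
    checkpoint_store["state"] = dict(state)

def recover_driver():
    # On restart, the new driver reloads both and resumes where it left off.
    meta = checkpoint_store["metadata"]
    return meta["config"], meta["pending"], checkpoint_store.get("state", {})

checkpoint_metadata({"batch_interval_s": 10}, [["a", "b"], ["a"]])
checkpoint_state({"a": 3, "b": 1})
config, pending, state = recover_driver()   # survives a simulated crash
print(config["batch_interval_s"], len(pending), state["a"])  # 10 2 3
```

In real Spark Streaming the same effect comes from `ssc.checkpoint(directory)` together with `StreamingContext.getOrCreate(directory, create_fn)`.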

Significance of Sliding Window operation

In networking, a sliding window controls the transmission of data packets; Spark Streaming borrows the same idea for windowed computations, where transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within that window are combined and operated upon to produce new RDDs of the windowed DStream.
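The mechanism above can be sketched without Spark: each micro-batch is a plain list, and every slide combines the batches inside the window and reduces them into one result. The function name and parameters below are illustrative, loosely mirroring Spark's window length and slide interval.

```python
# Pure-Python illustration of a windowed computation over micro-batches.
def windowed_counts(batches, window_length, slide_interval):
    """Yield word counts over a sliding window of micro-batches."""
    results = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]   # batches in this window
        combined = [w for batch in window for w in batch]  # union the RDDs
        counts = {}
        for word in combined:
            counts[word] = counts.get(word, 0) + 1
        results.append(counts)                      # one result per slide
    return results

batches = [["spark"], ["spark", "hdfs"], ["hdfs"], ["spark"]]
# Window of 2 batches, sliding forward by 1 batch each time.
print(windowed_counts(batches, window_length=2, slide_interval=1))
# [{'spark': 2, 'hdfs': 1}, {'spark': 1, 'hdfs': 2}, {'hdfs': 1, 'spark': 1}]
```

The real API expresses the same thing as, e.g., `pairs.reduceByKeyAndWindow(lambda a, b: a + b, windowDuration, slideDuration)`.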

DStream in Spark

Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets (RDDs) that represents a continuous stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two operations –
• Transformations that produce a new DStream.
• Output operations that write data to an external system.
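Both operation kinds can be mimicked in plain Python, treating the DStream as a list of micro-batches. The log lines and the `external_sink` list are made-up stand-ins for a real source and sink.

```python
# A DStream modeled as a sequence of micro-batches (lists of records).
dstream = [["ERROR db down", "INFO ok"], ["ERROR disk full"]]

# Transformation: filtering each batch produces a new "DStream".
error_stream = [[line for line in batch if line.startswith("ERROR")]
                for batch in dstream]

# Output operation: push every batch to an external system (a list here;
# in Spark this would be foreachRDD or saveAsTextFiles).
external_sink = []
for batch in error_stream:
    external_sink.extend(batch)

print(external_sink)  # ['ERROR db down', 'ERROR disk full']
```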

Different types of transformations on DStreams

Stateless Transformations – Processing of a batch does not depend on the output of previous batches. Examples – map(), reduceByKey(), filter().

Stateful Transformations – Processing of a batch depends on intermediate results from previous batches. Examples – updateStateByKey() and transformations that operate over sliding windows, such as reduceByKeyAndWindow().
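The distinction can be shown with a pure-Python word count over two micro-batches: the stateless pass looks only at the current batch, while the stateful pass carries running totals across batches (roughly what updateStateByKey does in Spark Streaming).

```python
batches = [["a", "b", "a"], ["b", "b"]]

# Stateless: each batch is counted independently of the others.
per_batch_counts = []
for batch in batches:
    counts = {}
    for key in batch:
        counts[key] = counts.get(key, 0) + 1
    per_batch_counts.append(counts)

# Stateful: the totals depend on results carried over from earlier batches.
running = {}
for batch in batches:
    for key in batch:
        running[key] = running.get(key, 0) + 1

print(per_batch_counts)  # [{'a': 2, 'b': 1}, {'b': 2}]
print(running)           # {'a': 2, 'b': 3}
```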