Reference: Spark Streaming Best Practices

This article contains best practices for running Spark Streaming jobs with Qubole.

Executors and Receivers:

In YARN, the same executor can be used for both receiving and processing. Each receiver runs as a long-running task, so it occupies a slot/core for the lifetime of the application. If the executor has free slots/cores, other tasks can run on them.

Optimal Cluster Size:

The best approach is to start with an appropriately sized cluster and minimum number of executors.

The number of executors should be at least equal to the number of receivers.

The number of cores per executor should also be set so that the executor has spare capacity for processing beyond just running its receiver.

The total number of cores allocated to the Spark Streaming application must exceed the number of receivers. Otherwise the system will receive data but not be able to process it.
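The sizing rules above can be captured as a quick sanity check. This is a minimal plain-Scala sketch (no Spark dependency); the object and method names are illustrative, not part of any Spark API:

```scala
// Sizing rules for a Spark Streaming app:
//   1. executors >= receivers (each receiver pins one core on some executor)
//   2. total cores > receivers (spare cores are needed for processing)
object StreamingSizing {
  def hasProcessingCapacity(numExecutors: Int,
                            coresPerExecutor: Int,
                            numReceivers: Int): Boolean = {
    val totalCores = numExecutors * coresPerExecutor
    numExecutors >= numReceivers && totalCores > numReceivers
  }
}
```

For example, 4 executors with 2 cores each can serve 4 receivers (8 cores total, 4 spare for processing), while 2 executors with 1 core each cannot serve 2 receivers (no cores left to process).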

Setting spark.streaming.backpressure.enabled to true:

This enables Spark Streaming to control the receiving rate based on current batch scheduling delays and processing times, so that the system receives data only as fast as it can process it.

The processing side, however, is handled by the Spark engine, and autoscaling is available there.
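A minimal sketch of enabling backpressure when building the streaming context (assumes Spark on the classpath; the app name, batch interval, and initial rate are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("backpressure-demo")
  // Let Spark Streaming throttle receivers based on scheduling delay
  // and processing time of recent batches.
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional: cap the first batch, which backpressure cannot yet limit
  // because no processing-time history exists at startup.
  .set("spark.streaming.backpressure.initialRate", "1000")

val ssc = new StreamingContext(conf, Seconds(5))
```

Both settings can equally be passed as `--conf` options to `spark-submit` instead of being set in code.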

Note: If you want to receive multiple streams of data in parallel in your streaming application, create multiple input DStreams. This creates multiple receivers, which simultaneously receive multiple data streams. But note that each receiver is a long-running task, so it occupies one of the cores allocated to the Spark Streaming application. Therefore, it is important to remember that a Spark Streaming application needs to be allocated enough cores to process the received data as well as to run the receiver(s).
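A sketch of creating multiple input DStreams and unioning them for processing (assumes Spark Streaming on the classpath; the host and port range are illustrative):

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Each socketTextStream creates its own receiver, so numStreams receivers
// run in parallel and each pins one core. The application therefore needs
// strictly more cores than numStreams to make progress on processing.
def unionedInput(ssc: StreamingContext, numStreams: Int): DStream[String] = {
  val streams = (1 to numStreams).map { i =>
    ssc.socketTextStream("source-host", 9000 + i)
  }
  ssc.union(streams) // combine into a single DStream for downstream processing
}
```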

Running a streaming app through Analyze:

Pros:

In Analyze there are no temporary class names; the Scala compiler generates reusable, permanent class names. Saving to a checkpoint and restarting from it therefore does not cause problems.

Cons:

The Tapp layer force-kills the app after 36 hours.

Running a streaming app through the notebook shell interpreter:

Cons:

Notebooks generate temporary code, and those temporary class names get saved into checkpoints; recovery from a checkpoint then fails. The workaround is to wrap all the code in an object.

Running a streaming app through the notebook Scala interpreter:

Cons:

Notebooks generate temporary code, and those temporary class names get saved into checkpoints; recovery from a checkpoint then fails. The workaround is to wrap all the code in an object.
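The object-wrapping workaround can be sketched as follows. Wrapping the streaming logic in a top-level object gives the generated classes stable names, so checkpointed state can be deserialized after a restart. This assumes Spark Streaming on the classpath; the checkpoint path, app name, and batch interval are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Top-level object => stable class names in the checkpoint,
// unlike the temporary classes a notebook interpreter generates.
object CheckpointedApp {
  val checkpointDir = "/checkpoints/my-streaming-app"

  // Builds a fresh context; only invoked when no checkpoint exists.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpointed-app")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)
    // ... define input DStreams and transformations here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if present, otherwise build anew.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```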