Integrating Spark with Kafka and maintaining offsets manually

In development we often use Spark Streaming to read and process Kafka data in real time. Since Spark Streaming 1.3, KafkaUtils has provided two ways to create a DStream:

1. Receiver mode: KafkaUtils.createStream

A Receiver runs as a long-lived task in an Executor, waiting for data. A single Receiver is inefficient, so you have to start several and manually union their streams before processing, which is cumbersome; and if the machine hosting a Receiver goes down, part of the data is lost. To guarantee data safety you must enable the WAL (write-ahead log), which in turn reduces throughput.
The Receiver connects to the Kafka cluster through ZooKeeper using Kafka's high-level API; offsets are stored in ZooKeeper and maintained by the Receiver, while the data itself is stored in Spark executors.
Spark also saves an offset in its checkpoint to avoid losing data during consumption, so there are two copies of the offset, which can become inconsistent.
So, from whatever angle you look at it, Receiver mode is not suitable for use in development.
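The write-ahead-log idea mentioned above can be sketched outside Spark: append every record to durable storage before processing it, so that after a crash the logged records can be replayed instead of being lost. This is a toy, Spark-free illustration (the `WalReceiver` class and file layout are hypothetical, not Spark's actual implementation):

```python
import json
import os
import tempfile

class WalReceiver:
    """Toy receiver: append each record to a write-ahead log (WAL)
    before processing, so a crash never loses received data."""

    def __init__(self, wal_path):
        self.wal_path = wal_path

    def receive(self, record, process):
        # Persist first -- this extra synchronous disk write is exactly
        # the throughput cost the WAL imposes.
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps(record) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        process(record)

    def replay(self, process):
        # Recovery: re-run every logged record through processing.
        count = 0
        with open(self.wal_path) as wal:
            for line in wal:
                process(json.loads(line))
                count += 1
        return count

# Usage: records survive in the log even if processing dies mid-stream.
wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
receiver = WalReceiver(wal_path)
seen = []
for msg in ({"id": 1}, {"id": 2}):
    receiver.receive(msg, seen.append)

recovered = []
replayed = receiver.replay(recovered.append)
```

Note the `fsync` on every record: durability is bought with a disk round-trip per message, which is why enabling the WAL lowers Receiver throughput.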

2. Direct mode: KafkaUtils.createDirectStream

Direct mode connects straight to the Kafka partitions to fetch data, using Kafka's low-level API, and stores and maintains the offsets itself. By default Spark keeps them in the checkpoint, eliminating the inconsistency with ZooKeeper (you can of course also maintain them manually, storing the offsets in MySQL, Redis, etc.). Because data is read directly from each partition, parallelism is greatly improved.
Direct mode is therefore the one to use in development, and combining its characteristics with manual offset management can guarantee exactly-once processing.
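The "maintain offsets yourself" pattern can be sketched without a Spark cluster: read the last committed offset at startup, process a batch, and commit the batch's results together with the new offset in a single transaction, so a restart resumes exactly where the previous run committed. In this minimal sketch sqlite3 stands in for MySQL, and the table and function names are hypothetical:

```python
import sqlite3

def init_db(conn):
    # One row per (topic, partition); "results" stands in for real output.
    conn.execute("""CREATE TABLE IF NOT EXISTS offsets (
                        topic TEXT, partition INTEGER, untilOffset INTEGER,
                        PRIMARY KEY (topic, partition))""")
    conn.execute("CREATE TABLE IF NOT EXISTS results (value TEXT)")

def last_offset(conn, topic, partition):
    row = conn.execute(
        "SELECT untilOffset FROM offsets WHERE topic=? AND partition=?",
        (topic, partition)).fetchone()
    return row[0] if row else 0

def process_batch(conn, topic, partition, records):
    """Write the batch's output and its ending offset in ONE transaction,
    so the stored offset can never get ahead of (or behind) the results."""
    start = last_offset(conn, topic, partition)
    batch = records[start:]          # resume from the committed offset
    with conn:                       # atomic: results + offset together
        conn.executemany("INSERT INTO results VALUES (?)",
                         [(r,) for r in batch])
        conn.execute("""INSERT INTO offsets VALUES (?, ?, ?)
                        ON CONFLICT(topic, partition)
                        DO UPDATE SET untilOffset = excluded.untilOffset""",
                     (topic, partition, start + len(batch)))
    return len(batch)

# Usage: the second call simulates a restart -- it skips what was committed.
conn = sqlite3.connect(":memory:")
init_db(conn)
records = ["a", "b", "c", "d"]
n1 = process_batch(conn, "t", 0, records[:2])   # first run: 2 records
n2 = process_batch(conn, "t", 0, records)       # "restart": only 2 new ones
```

The key design choice is committing output and offset atomically: if the job dies between processing and committing, the offset still points at the last fully committed batch, which is what makes exactly-once possible when Spark's Direct stream hands you the offset ranges.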

The following code demonstrates manually maintaining and committing offsets to a MySQL database (Spark / Kafka 0-10 integration).