Using SimpleConsumer

Why use SimpleConsumer?

The main reason to use a SimpleConsumer implementation is that you want greater control over partition consumption than Consumer Groups give you.

For example, you want to:

Read a message multiple times

Consume only a subset of the partitions in a topic in a process

Manage transactions to make sure a message is processed once and only once

Downsides of using SimpleConsumer

The SimpleConsumer does require a significant amount of work that is not needed in Consumer Groups:

You must keep track of the offsets in your application to know where you left off consuming.

You must figure out which Broker is the lead Broker for a topic and partition.

You must handle Broker leader changes.

Steps for using a SimpleConsumer

Find an active Broker and find out which Broker is the leader for your topic and partition

Determine who the replica Brokers are for your topic and partition

Build the request defining what data you are interested in

Fetch the data

Identify and recover from leader changes

Finding the Lead Broker for a Topic and Partition

The easiest way to do this is to pass a set of known Brokers to your logic, either via a properties file or the command line. These don’t have to be all the Brokers in the cluster; rather, just a set where you can start looking for a live Broker to query for Leader information.

The call to topicsMetadata() asks the Broker you are connected to for all the details about the topic you are interested in.


The loop on partitionsMetadata iterates through all the partitions until we find the one we want. Once we find it, we can break out of all the loops.

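The lookup described above can be sketched as follows using the Kafka 0.8 SimpleConsumer Java API; the port, timeout, buffer size, and the "leaderLookup" client id are illustrative values, and the class name is our own.

```java
import java.util.Collections;
import java.util.List;

import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.consumer.SimpleConsumer;

public class LeaderFinder {
    public static PartitionMetadata findLeader(List<String> seedBrokers, int port,
                                               String topic, int partition) {
        PartitionMetadata returnMetaData = null;
        loop:
        for (String seed : seedBrokers) {
            SimpleConsumer consumer = null;
            try {
                // Connect to a seed Broker only to ask for topic metadata.
                consumer = new SimpleConsumer(seed, port, 100000, 64 * 1024, "leaderLookup");
                TopicMetadataRequest req =
                        new TopicMetadataRequest(Collections.singletonList(topic));
                kafka.javaapi.TopicMetadataResponse resp = consumer.send(req);

                // Iterate partitionsMetadata until we find our partition,
                // then break out of all the loops.
                for (TopicMetadata item : resp.topicsMetadata()) {
                    for (PartitionMetadata part : item.partitionsMetadata()) {
                        if (part.partitionId() == partition) {
                            returnMetaData = part;
                            break loop;
                        }
                    }
                }
            } catch (Exception e) {
                System.out.println("Error communicating with Broker [" + seed + "]: " + e);
            } finally {
                if (consumer != null) consumer.close();
            }
        }
        return returnMetaData; // null if no seed Broker could answer
    }
}
```

The returned PartitionMetadata carries both the leader (leader()) and the replica list (replicas()), which the failover logic later in this document relies on.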

Finding Starting Offset for Reads

Now define where to start reading data. Kafka includes two constants to help: kafka.api.OffsetRequest.EarliestTime() finds the beginning of the data in the logs and starts streaming from there, and kafka.api.OffsetRequest.LatestTime() will only stream new messages. Don’t assume that offset 0 is the beginning offset, since messages age out of the log over time.
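One way to issue that offset request, again following the Kafka 0.8 SimpleConsumer API (the helper name getLastOffset and the fallback return of 0 are our own choices):

```java
import java.util.HashMap;
import java.util.Map;

import kafka.api.PartitionOffsetRequestInfo;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class Offsets {
    // whichTime is kafka.api.OffsetRequest.EarliestTime() or LatestTime().
    public static long getLastOffset(SimpleConsumer consumer, String topic, int partition,
                                     long whichTime, String clientName) {
        TopicAndPartition topicAndPartition = new TopicAndPartition(topic, partition);
        Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo =
                new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
        // Ask for at most one offset at or before the given time.
        requestInfo.put(topicAndPartition, new PartitionOffsetRequestInfo(whichTime, 1));
        kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(
                requestInfo, kafka.api.OffsetRequest.CurrentVersion(), clientName);
        OffsetResponse response = consumer.getOffsetsBefore(request);

        if (response.hasError()) {
            System.out.println("Error fetching offset data. Reason: "
                    + response.errorCode(topic, partition));
            return 0;
        }
        long[] offsets = response.offsets(topic, partition);
        return offsets[0];
    }
}
```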

This method uses the findLeader() logic we defined earlier to find the new leader, except here we only try to connect to one of the replicas for the topic/partition. That way, if we can’t reach any of the Brokers with the data we are interested in, we give up and exit hard.

Since it may take a short time for ZooKeeper to detect the leader loss and assign a new leader, we sleep if we don’t get an answer. In reality ZooKeeper often performs the failover very quickly, so you may never actually sleep.
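The sleep-and-retry behavior can be sketched independently of Kafka. LeaderLookup below is a hypothetical stand-in for the findLeader() logic against the replicas, and the three attempts and one-second sleep are illustrative choices:

```java
import java.util.concurrent.TimeUnit;

public class FindNewLeader {
    // Hypothetical stand-in for the findLeader() call against replica Brokers;
    // returns the leader's host, or null if none is known yet.
    interface LeaderLookup { String findLeader(); }

    static String findNewLeader(LeaderLookup lookup, String oldLeader) throws Exception {
        for (int i = 0; i < 3; i++) {
            String leader = lookup.findLeader();
            // Right after a failure, ZooKeeper may briefly report nothing, or
            // still report the old leader; give it a moment and try again.
            boolean goToSleep = (leader == null)
                    || (oldLeader.equalsIgnoreCase(leader) && i == 0);
            if (goToSleep) {
                TimeUnit.SECONDS.sleep(1);
            } else {
                return leader;
            }
        }
        throw new Exception("Unable to find new leader after Broker failure. Exiting");
    }
}
```

Giving up with an exception after the retries is the "exit hard" behavior described above.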

Note that the ‘readOffset’ asks the last read message what the next offset would be. That way, when the block of messages has been processed, we know where to tell Kafka to start the next fetch.

Also note that we explicitly check that the offset being read is not less than the offset we requested. This is needed because, if Kafka is compressing the messages, the fetch request will return an entire compressed block even if the requested offset isn’t the beginning of that block. Thus a message we saw previously may be returned again.

Note also that we ask for a fetchSize of 100000 bytes. If the Kafka producers are writing large batches, this might not be enough and might return an empty message set. In that case, the fetchSize should be increased until a non-empty set is returned.
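Putting the last two notes together, the read loop can be sketched as follows with the Kafka 0.8 API (the method name fetchOnce and the printing are illustrative; a real consumer would also check fetchResponse.hasError() and run the leader-failover logic above):

```java
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;

import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.message.MessageAndOffset;

public class FetchLoop {
    // Reads one batch starting at readOffset and returns the offset
    // to use for the next fetch.
    static long fetchOnce(SimpleConsumer consumer, String clientName,
                          String topic, int partition, long readOffset)
            throws UnsupportedEncodingException {
        FetchRequest req = new FetchRequestBuilder()
                .clientId(clientName)
                // 100000-byte fetchSize; increase it if an empty set comes back.
                .addFetch(topic, partition, readOffset, 100000)
                .build();
        FetchResponse fetchResponse = consumer.fetch(req);

        for (MessageAndOffset messageAndOffset : fetchResponse.messageSet(topic, partition)) {
            long currentOffset = messageAndOffset.offset();
            // A compressed block can replay messages from before the offset we
            // asked for; skip anything we have already processed.
            if (currentOffset < readOffset) {
                System.out.println("Found an old offset: " + currentOffset
                        + " Expecting: " + readOffset);
                continue;
            }
            // Ask the message itself where the next fetch should start.
            readOffset = messageAndOffset.nextOffset();
            ByteBuffer payload = messageAndOffset.message().payload();
            byte[] bytes = new byte[payload.limit()];
            payload.get(bytes);
            System.out.println(currentOffset + ": " + new String(bytes, "UTF-8"));
        }
        return readOffset;
    }
}
```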