Beginning with Confluent Platform 3.0, we are including Confluent Control Center with Confluent Platform, to make it
easy to monitor the entire Confluent Platform. This web-based application allows you to measure message delivery
end to end, to ensure that every message is delivered from producer to consumer, to measure how long messages take
to be delivered, and to determine the source of any problems in your cluster. To learn more about Control Center,
see Introduction.

In addition, you can monitor individual components of Kafka using Apache Kafka’s internal metrics.
Kafka uses Yammer Metrics for metrics reporting in both the server and the client. Both can be
configured to report stats using pluggable stats reporters that hook into your monitoring system.
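For example, a custom Yammer reporter can be plugged into the broker configuration. The following fragment is a sketch; `com.example.MyMetricsReporter` is a placeholder for your own implementation of the `kafka.metrics.KafkaMetricsReporter` interface:

```properties
# server.properties (illustrative): plug in a custom Yammer metrics reporter.
# com.example.MyMetricsReporter is a placeholder class name.
kafka.metrics.reporters=com.example.MyMetricsReporter
# how often the reporter polls the metrics registry, in seconds
kafka.metrics.polling.interval.secs=10
```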

kafka.server:type=ReplicaManager,name=PartitionCount

Number of partitions on this broker. This should be mostly even across all brokers.

kafka.server:type=ReplicaManager,name=LeaderCount

Number of leaders on this broker. This should be mostly even across all brokers. If not,
set auto.leader.rebalance.enable to true on all brokers in the cluster.
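A monitoring check for leader balance might look like the following sketch. The broker-to-count mapping would come from polling each broker's LeaderCount MBean; the helper name and any alert threshold you apply to its result are assumptions, not part of Kafka:

```python
def leader_skew(leader_counts):
    """Return the relative skew of per-broker leader counts: (max - min) / mean.

    leader_counts: dict mapping broker id -> the value of the
    kafka.server:type=ReplicaManager,name=LeaderCount MBean on that broker.
    A value near 0 means leadership is evenly spread.
    """
    counts = list(leader_counts.values())
    mean = sum(counts) / len(counts)
    return (max(counts) - min(counts)) / mean if mean else 0.0

# A balanced cluster: skew is small.
print(leader_skew({1: 100, 2: 98, 3: 102}))   # -> 0.04
# An unbalanced cluster: one broker holds most of the leaders.
print(leader_skew({1: 250, 2: 25, 3: 25}))    # -> 2.25
```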

kafka.server:type=ReplicaManager,name=IsrShrinksPerSec

If a broker goes down, the ISR for some of the partitions will shrink. When that broker comes back
up, the ISR will expand once the replicas have fully caught up. Other than that, the expected
value for both the ISR shrink rate and expansion rate is 0.

kafka.server:type=ReplicaManager,name=IsrExpandsPerSec

When a broker is brought up after a failure, it starts catching up by reading from the leader.
Once it is caught up, it gets added back to the ISR.
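Since both rates should sit at 0 outside of broker restarts, an alerting sketch can simply look for sustained nonzero samples of either meter. The function name, sample windows, and threshold below are assumptions for illustration:

```python
def isr_churn_alert(shrink_rates, expand_rates, threshold=0.0):
    """Flag sustained ISR churn from sampled per-second rates.

    shrink_rates / expand_rates: recent samples of the IsrShrinksPerSec
    and IsrExpandsPerSec meters. Outside of broker restarts both should
    be 0, so a whole window of nonzero values is worth investigating.
    """
    return all(r > threshold for r in shrink_rates) or \
           all(r > threshold for r in expand_rates)

print(isr_churn_alert([0.0, 0.0, 0.0], [0.0, 0.0, 0.0]))  # -> False
print(isr_churn_alert([0.5, 0.7, 0.4], [0.0, 0.6, 0.5]))  # -> True
```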

kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica

Maximum lag in messages between the follower and leader replicas. This is controlled by the
replica.lag.max.messages config.
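One way to track this is as a fraction of the configured lag budget, since a follower that lags beyond replica.lag.max.messages is dropped from the ISR. This is a sketch; the helper name is an assumption, and 4000 is used here only because it is the historical default for replica.lag.max.messages:

```python
def replica_lag_headroom(max_lag, replica_lag_max_messages):
    """Fraction of the configured lag budget consumed by the slowest follower.

    max_lag: value of the
        kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica MBean.
    replica_lag_max_messages: the broker's replica.lag.max.messages setting.
    Values approaching 1.0 mean a follower is close to being ejected from the ISR.
    """
    return max_lag / replica_lag_max_messages

print(replica_lag_headroom(500, 4000))   # -> 0.125
```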

We expose counts for ZooKeeper state transitions, which can help spot problems, e.g., with
broker sessions to ZooKeeper. The metrics currently show the rate of transitions per second for
each one of the possible states. Here is the list of the counters we expose, one for each possible
ZooKeeper client state:

The server the client is connected to is currently LOOKING, which means that it
is neither FOLLOWING nor LEADING. Consequently, the client can only read the ZooKeeper
state, but not make any changes (create, delete, or set the data of znodes).

The ZooKeeper session has expired. When a session expires, we can have leader changes and
even a new controller. It is important to keep an eye on the number of such events across
a Kafka cluster and if the overall number is high, then we have a few recommendations:

Check the health of your network

Check for garbage collection issues and tune it accordingly

If necessary, increase the session timeout by setting the
value of zookeeper.session.timeout.ms.

records-lag-max

The maximum lag in terms of number of records for any partition in this window. An increasing
value over time is your best indication that the consumer group is not keeping up with the
producers.
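Detecting that upward trend can be as simple as checking whether successive samples of the lag metric keep increasing. This is a sketch; the function name and sampling scheme are assumptions, and in practice you would sample over a longer window to smooth out noise:

```python
def lag_is_growing(lag_samples, min_increase=0):
    """True if consumer lag trends strictly upward across the sampled window.

    lag_samples: successive readings of the consumer's records-lag-max metric.
    A steadily increasing series means the group is not keeping up with the
    producers; a fluctuating series is usually fine.
    """
    return all(b - a > min_increase for a, b in zip(lag_samples, lag_samples[1:]))

print(lag_is_growing([120, 118, 125, 119]))   # -> False (fluctuating, not growing)
print(lag_is_growing([100, 250, 900, 2400]))  # -> True  (falling behind)
```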

fetch-size-avg

The average number of bytes fetched per request.

fetch-size-max

The maximum number of bytes fetched per request.

bytes-consumed-rate

The average number of bytes consumed per second.

records-per-request-avg

The average number of records in each request.

records-consumed-rate

The average number of records consumed per second.

fetch-rate

The number of fetch requests per second.

fetch-latency-avg

The average time taken for a fetch request.

fetch-latency-max

The max time taken for a fetch request.

fetch-throttle-time-avg

The average throttle time in ms. When quotas are enabled, the broker may delay fetch requests
in order to throttle a consumer that has exceeded its limit. This metric indicates how much
throttle time has been added to fetch requests on average.
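Since the throttle time is part of the time a fetch takes, one useful derived number is the share of average fetch latency that is due to throttling rather than waiting for data. This is a sketch combining fetch-latency-avg and fetch-throttle-time-avg; the function name and the sample values are illustrative assumptions:

```python
def throttle_share(fetch_latency_avg_ms, fetch_throttle_time_avg_ms):
    """Fraction of average fetch latency attributable to quota throttling.

    A high share means the consumer is mostly waiting out its quota
    rather than waiting on the broker to return data.
    """
    return fetch_throttle_time_avg_ms / fetch_latency_avg_ms

# e.g. 150 ms of a 200 ms average fetch is throttle delay:
print(throttle_share(200.0, 150.0))  # -> 0.75
```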

join-rate

The number of group joins per second. Group joining is the first phase of the rebalance
protocol. A large value indicates that the consumer group is unstable and will likely
be coupled with increased lag.

join-time-avg

The average time taken for a group rejoin. This value can get as high as the configured session
timeout for the consumer, but should usually be lower.

join-time-max

The max time taken for a group rejoin. This value should not get much higher than the configured
session timeout for the consumer.

sync-rate

The number of group syncs per second. Group synchronization is the second and last phase
of the rebalance protocol. Similar to join-rate, a large value indicates group instability.

sync-time-avg

The average time taken for a group sync.

sync-time-max

The max time taken for a group sync.

heartbeat-rate

The average number of heartbeats per second. After a rebalance, the consumer sends heartbeats
to the coordinator to keep itself active in the group. You can control this using the
heartbeat.interval.ms setting for the consumer. You may see a lower rate than configured
if the processing loop is taking more time to handle message batches. Usually this is OK as
long as you see no increase in the join rate.
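The expected heartbeat rate follows directly from heartbeat.interval.ms, so a check can compare the observed rate against that target. This is a sketch; the function name and the tolerance you would apply to its result are assumptions:

```python
def heartbeat_deficit(observed_rate, heartbeat_interval_ms):
    """Difference between the configured heartbeat rate and the observed one.

    With heartbeat.interval.ms=3000 the consumer should send roughly
    1000/3000 = 0.33 heartbeats per second. A clearly positive deficit
    suggests the poll loop is spending too long on each message batch.
    """
    expected = 1000.0 / heartbeat_interval_ms
    return expected - observed_rate

print(round(heartbeat_deficit(0.33, 3000), 3))  # -> 0.003 (healthy)
print(round(heartbeat_deficit(0.10, 3000), 3))  # -> 0.233 (slow poll loop)
```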

The rate at which this consumer commits offsets to ZooKeeper. This is only relevant if
offsets.storage=zookeeper. Monitor this value if your ZooKeeper cluster is underperforming
due to high write load.
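If the commit write load does become a problem, the old high-level consumer's configuration offers two levers, sketched below. The interval value is an arbitrary example, not a recommendation:

```properties
# consumer.properties (illustrative, old high-level consumer):
# commit offsets to Kafka instead of ZooKeeper to take the write
# load off the ZooKeeper ensemble.
offsets.storage=kafka
# if staying on ZooKeeper, a longer commit interval reduces the write rate
auto.commit.interval.ms=60000
```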