How to Monitor Zookeeper

Update: We hosted a live Hangout on Air with members of the Server Density engineering and ops teams. Among other topics, we discussed how we use Zookeeper here at Server Density. Check out the recorded session at the bottom of this blog post.

Apache Zookeeper works at the zoo—not your usual zoo, but similar—and does what you’d expect. You know, keep your service-oriented architecture nice and clean.

It provides a distributed hierarchical file system that helps with the difficulties of running services on different machines (discovery, registration, configuration, locking, leader election, queueing, etc.). All data is replicated across all nodes, and the leader performs atomic broadcasts to the other servers, which guarantees that changes propagate in a strict order.

Zookeeper nodes (ZNodes) are like files in a hierarchical file system (e.g. /foo/foo1, /bar/taz, /dev/null/full). They can store arbitrary data, and they notify watchers of any event that concerns them.
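To make the ZNode idea concrete, here is a minimal in-memory sketch in Python. It is a toy model, not the Zookeeper client API: the class name ToyZNodeTree, its methods and the event names are all made up for illustration, and real Zookeeper watches have extra semantics that are not modelled here.

```python
from typing import Callable, Dict, List

# A watcher receives (event_type, path). Event names here are illustrative.
Watcher = Callable[[str, str], None]


class ToyZNodeTree:
    """In-memory toy model of a znode tree. NOT the real Zookeeper API."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {"/": b""}
        self._watchers: Dict[str, List[Watcher]] = {}

    def create(self, path: str, data: bytes = b"") -> None:
        # As in a file system, a node can only be created under an existing parent.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._data:
            raise KeyError(f"parent {parent} does not exist")
        if path in self._data:
            raise KeyError(f"{path} already exists")
        self._data[path] = data
        self._notify("created", path)

    def set(self, path: str, data: bytes) -> None:
        if path not in self._data:
            raise KeyError(f"{path} does not exist")
        self._data[path] = data
        self._notify("changed", path)

    def get(self, path: str) -> bytes:
        return self._data[path]

    def watch(self, path: str, watcher: Watcher) -> None:
        # Watchers may be registered on paths that do not exist yet.
        self._watchers.setdefault(path, []).append(watcher)

    def _notify(self, event: str, path: str) -> None:
        for watcher in self._watchers.get(path, []):
            watcher(event, path)


tree = ToyZNodeTree()
tree.create("/foo")
events = []
tree.watch("/foo/foo1", lambda event, path: events.append((event, path)))
tree.create("/foo/foo1", b"hello")
tree.set("/foo/foo1", b"world")
print(events)
```

The watcher registered on /foo/foo1 is called once when the node is created and once when its data changes, which is the behaviour that service discovery and configuration use cases build on.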

Zookeeper can be quite a tricky service to manage. From a client programming point of view, there are plenty of low-level and error-handling pitfalls. That explains the popularity of higher-level API wrappers, like Curator, created by the Netflix team.

With that in mind, here is our very own checklist of best practices, including key Zookeeper metrics and alerts we monitor with Server Density.

Monitoring Zookeeper: Metrics and Alerts

As per previous articles, our general rule of thumb is “collect all possible/reasonable metrics that can help when troubleshooting, alert only on those that require an action from you”. Well, the Zookeeper list that satisfies these criteria is not that long.

Zookeeper process is running

You can use the following script to check whether the server is running:

$INSTALL_PREFIX/zk-server-3/bin/zkServer.sh status

Or, if you run Zookeeper via supervisord (recommended), you can alert on the supervisord resource instead.
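If you would rather probe the server over the network than shell out to zkServer.sh, one option is Zookeeper's "ruok" four-letter command, which a healthy server answers with "imok" on its client port. The sketch below assumes the default client port 2181 and a server on localhost. The function names are our own, and newer Zookeeper versions may require the command to be allowed via the 4lw.commands.whitelist setting.

```python
import socket


def four_letter_word(host: str, port: int, command: str, timeout: float = 5.0) -> str:
    """Send a four-letter command to a Zookeeper server and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        # The server closes the connection once it has sent its reply.
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", errors="replace")


def is_running(host: str = "127.0.0.1", port: int = 2181, timeout: float = 5.0) -> bool:
    """Return True if the server answers 'ruok' with 'imok', False otherwise."""
    try:
        return four_letter_word(host, port, "ruok", timeout).strip() == "imok"
    except OSError:
        # Connection refused, timeout, unreachable host, and so on.
        return False
```

Calling is_running() with no arguments checks a local server on port 2181. Note that "imok" only tells you the process is up and answering. It says nothing about whether the server is part of a healthy quorum.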

System Metrics

| Metric | Comments | Suggested Alert |
| --- | --- | --- |
| Memory usage | Zookeeper should run entirely in RAM. To avoid swapping, the JVM heap size shouldn’t be bigger than your available RAM. | None |
| Swap usage | Watch for swap usage, as it will degrade Zookeeper performance and lead to operations timing out (set vm.swappiness = 0). | When used swap is > 128MB. |
| Network bandwidth | Zookeeper servers can incur high network usage. Keep an eye on this, especially if you notice any performance degradation. Also look out for dropped packet errors. The typical Zookeeper workload is 20% writes, 80% reads. More nodes result in more writes and higher overall traffic. | None |
| Disk usage | Zookeeper data is usually ephemeral and small. Still, we recommend putting dataLogDir on a dedicated partition and watching disk usage. Use the purge task to clean up dataDir and dataLogDir. | When disk is > 85% usage. |
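The two suggested alerts above can be sketched as a small Python check. This is only an illustration of the thresholds, not how Server Density implements them. The function names are our own, the swap reader assumes a Linux /proc/meminfo, and you would pass the path of your own dataLogDir partition to disk_used_fraction.

```python
import shutil
from typing import List

SWAP_ALERT_BYTES = 128 * 1024 * 1024  # alert when used swap is > 128MB
DISK_ALERT_FRACTION = 0.85            # alert when disk is > 85% used


def read_swap_used_bytes(meminfo_path: str = "/proc/meminfo") -> int:
    """Linux only: used swap in bytes, computed as SwapTotal - SwapFree."""
    values = {}
    with open(meminfo_path) as meminfo:
        for line in meminfo:
            key, _, rest = line.partition(":")
            if key in ("SwapTotal", "SwapFree"):
                values[key] = int(rest.split()[0]) * 1024  # values are in kB
    return values["SwapTotal"] - values["SwapFree"]


def disk_used_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the partition holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def evaluate_alerts(swap_used_bytes: int, disk_fraction: float) -> List[str]:
    """Return a human-readable message for each threshold that is exceeded."""
    alerts = []
    if swap_used_bytes > SWAP_ALERT_BYTES:
        alerts.append(f"swap usage is {swap_used_bytes // (1024 * 1024)}MB (> 128MB)")
    if disk_fraction > DISK_ALERT_FRACTION:
        alerts.append(f"disk usage is {disk_fraction:.0%} (> 85%)")
    return alerts
```

On a Linux host, evaluate_alerts(read_swap_used_bytes(), disk_used_fraction("/")) returns an empty list when both metrics are within limits.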

Zookeeper writes its snapshots asynchronously, so it shouldn’t have high IO requirements. The transaction log, however, is synced to disk before a write is acknowledged, so keep an eye on disk IO, especially if your server is shared with other services, say Kafka.

Here is how Server Density graphs disk usage and memory usage. Note the up and down curves created by the purge task:

If you are after more detailed metrics, you can access those through JMX. You could also take the DIY road and go for JMXTrans and Graphite, or use Nagios/Cacti/Ganglia with check_zookeeper.py. Alternatively, you can save time (and preserve your sanity) by choosing a hosted service like Server Density (that’s us!).
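For a quick look at server-level metrics without setting up JMX, Zookeeper servers from version 3.4.0 onwards also answer the "mntr" four-letter command on the client port with tab-separated key/value lines. The sketch below only parses that text. The sample output and its values are made up for illustration, and the parse_mntr name is our own.

```python
from typing import Dict


def parse_mntr(text: str) -> Dict[str, str]:
    """Parse the tab-separated 'key<TAB>value' lines of an mntr reply."""
    metrics: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # skip anything that is not a key/value line
            metrics[key.strip()] = value.strip()
    return metrics


# Illustrative sample only; real values depend on your ensemble.
SAMPLE_MNTR_OUTPUT = (
    "zk_version\t3.4.6\n"
    "zk_avg_latency\t0\n"
    "zk_max_latency\t12\n"
    "zk_outstanding_requests\t0\n"
    "zk_server_state\tfollower\n"
    "zk_znode_count\t42\n"
    "zk_watch_count\t7\n"
)

metrics = parse_mntr(SAMPLE_MNTR_OUTPUT)
outstanding = int(metrics["zk_outstanding_requests"])
print(metrics["zk_server_state"], outstanding)
```

Values arrive as strings, so convert numeric ones before graphing or comparing them against a threshold.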

If you want to test the quality and performance of your Zookeeper ensemble, zk-smoketest, with its zk-smoketest.py and zk-latencies.py scripts, is a great tool to check out.

Zookeeper Management tools

There are not too many management options out there. The folks at Netflix have released Exhibitor, a tool that provides some basic monitoring, log cleanup (for old versions), backup/restore, ensemble configuration and node visualization. There is also zookeeper_dashboard, but it hasn’t been updated in years.

Further reading

Did this article pique your interest in Zookeeper? Nice, keep reading. We found Scott Leberknight’s Zookeeper series of blog posts to be worthwhile. We also like these presentations:

So what about you? Do you have a checklist or any best practices for monitoring Zookeeper? What systems do you have in place and how do you monitor them? Any interesting reads to suggest?

Tech chat: processing billions of events a day with Kafka, Zookeeper and Storm
