Mozilla metrics team technical articles

As documented in THRIFT-601, sending random data to Thrift can cause it to leak memory.
At Mozilla, we use a web load balancer to distribute traffic to our Thrift machines, and the default liveness check it uses is a simple TCP connect. We also had Nagios performing TCP connect checks on these nodes for general alerting.

These connection checks caused the Thrift servers to throw out-of-memory (OOM) errors, sometimes within a few days of being started.

To avoid this, I wrote a test utility that makes a legitimate Thrift API call (it fetches the schema of the .META. table) and reports success if the call completes.

The utility can run from the command line, or it can run as a daemon using the lightweight HTTP server class that ships with the Sun JRE 6. In daemon mode it listens for requests to /thrift/health and reports back the status.
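As a rough illustration of the daemon mode, here is a minimal sketch built on the JDK's com.sun.net.httpserver.HttpServer class. It is not the utility's actual source. The class name HealthEndpointSketch, the BooleanSupplier standing in for the real Thrift call, and the 200/503 status codes are all assumptions made for this example, and it uses Java 8 lambda syntax rather than Java 6.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

public class HealthEndpointSketch {

    /**
     * Starts an HTTP server that runs the supplied check every time
     * /thrift/health is requested. Passing port 0 lets the OS pick a free port.
     * The check is a placeholder for a real Thrift API call.
     */
    public static HttpServer start(int port, BooleanSupplier check) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/thrift/health", exchange -> {
            boolean healthy;
            try {
                healthy = check.getAsBoolean();
            } catch (RuntimeException e) {
                // Treat any failure inside the check as "unhealthy".
                healthy = false;
            }
            byte[] body = (healthy ? "OK\n" : "FAIL\n").getBytes(StandardCharsets.UTF_8);
            // Assumed convention: 200 when healthy, 503 otherwise.
            exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Calling HealthEndpointSketch.start(8080, () -> true) would serve a permanently healthy endpoint on port 8080. A real check would replace the lambda with code that performs the Thrift call and returns false on any error, so that the load balancer sees a failing status instead of opening a bare TCP connection.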

    $ java -jar HbaseThriftTester.jar
    usage: HbaseThriftTester [-timeout <ms>] <mode> <host:port>...
     -check               Immediately checks the following host:port
                          combinations and returns a summary message with an
                          exit value of the number of failures.
     -listen <port>       Run as an HTTP daemon listening on port. Checks the
                          hosts every time /thrift/health URL is requested.
     -timeout <seconds>   Number of seconds to wait for Thrift call to
                          complete
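
The -check behaviour described in the usage text (run each check with a time limit, print a summary, exit with the number of failures) could be sketched roughly as follows. This is an illustration under assumptions, not the tool's code: the class name CheckModeSketch and the Callable placeholders for the Thrift calls are hypothetical, and the timeout is taken in milliseconds here even though the usage text mentions both <ms> and <seconds>.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class CheckModeSketch {

    /**
     * Runs each named check with a time limit, prints one summary line per
     * check, and returns how many checks failed. A check fails if it returns
     * false, throws, or does not finish within timeoutMs milliseconds.
     */
    public static int countFailures(Map<String, Callable<Boolean>> checks, long timeoutMs) {
        ExecutorService pool = Executors.newCachedThreadPool();
        int failures = 0;
        try {
            for (Map.Entry<String, Callable<Boolean>> entry : checks.entrySet()) {
                Future<Boolean> result = pool.submit(entry.getValue());
                boolean ok;
                try {
                    ok = Boolean.TRUE.equals(result.get(timeoutMs, TimeUnit.MILLISECONDS));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    ok = false;
                } catch (Exception e) {
                    // Timeout, or an exception thrown by the check itself.
                    result.cancel(true);
                    ok = false;
                }
                System.out.println(entry.getKey() + ": " + (ok ? "OK" : "FAILED"));
                if (!ok) {
                    failures++;
                }
            }
        } finally {
            // Interrupt any check that is still running.
            pool.shutdownNow();
        }
        return failures;
    }
}
```

A main method could build the map from the host:port arguments and pass the return value to System.exit, which gives Nagios a zero exit status only when every host answered in time.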

The app is bundled with one-jar, so it is easy to call from a Nagios script or something similar. Maybe it will be useful to someone else: just pull down the project and build it with ant.