Apache Ambari - Zeppelin Alert Checks Wrong PID

Feb 27, 2018

Overview

Apache Ambari makes managing distributed systems like Apache Hadoop easier. One of Ambari's capabilities is alerting: alerts can notify administrators and trigger automatic recovery. Ambari can also manage Apache Zeppelin, which includes starting it, stopping it, and alerting when it stops.

Apache Zeppelin Alert Check False Alarms

Over the course of a few weeks, my team received multiple alerts reporting that Apache Zeppelin had stopped. We investigated and found that Apache Zeppelin had never stopped; the alerts were false alarms. False alarms reduce confidence in the alerting system.

Root Cause of Apache Zeppelin Alert Check False Alarms

I tracked the false alarms down to Ambari checking the wrong PID file. Apache Zeppelin creates multiple PID files:

Apache Zeppelin process

Each Zeppelin interpreter

Apache Ambari uses glob.glob(...) to find PID files for the alert check. In our case, Apache Zeppelin runs as the zeppelin user. With that user name, the interpreter PID files end up alphabetically before the Apache Zeppelin process PID file.

The Apache Ambari alert check (0.6.0 and 0.7.0) does not look for the Apache Zeppelin process PID file specifically. Instead, it takes the first match and so relies on the order of the PID files in zeppelin_pid_dir:

pid_file=glob.glob(zeppelin_pid_dir+'/zeppelin-*.pid')[0]
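To illustrate how the first match can be the wrong file, here is a small sketch. The file names and the host name node1 are hypothetical, chosen to resemble an installation that runs as the zeppelin user. The list is sorted explicitly to mimic the alphabetical ordering described above, since glob.glob itself does not guarantee any particular order.

```python
import fnmatch

# Hypothetical PID file names for a host called "node1" running as the zeppelin user.
pid_dir_listing = [
    'zeppelin-zeppelin-node1.pid',                    # Zeppelin server process
    'zeppelin-interpreter-spark-zeppelin-node1.pid',  # Spark interpreter
    'zeppelin-interpreter-md-zeppelin-node1.pid',     # Markdown interpreter
]

# Same pattern as the alert check; sorted() mimics the alphabetical ordering.
matches = sorted(fnmatch.filter(pid_dir_listing, 'zeppelin-*.pid'))

# Taking index 0 selects an interpreter PID file, not the server PID file.
first_match = matches[0]
print(first_match)
```

Every name in the listing matches the pattern, so nothing in the lookup distinguishes the server PID file from the interpreter ones.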

If an interpreter is stopped (which can happen in normal circumstances), the Ambari alert triggers incorrectly even though the Apache Zeppelin process is still running. The Ambari Agent logs showed the wrong PID file being checked.
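A more specific lookup could skip the interpreter PID files before checking whether the process is alive. The following is only a sketch under stated assumptions: it assumes interpreter PID files start with zeppelin-interpreter- and the server PID file does not, and the helper names find_server_pid_file and is_process_running are made up for illustration. It is not the Ambari code and not necessarily what any proposed patch does. The liveness check relies on os.kill with signal 0, which behaves this way on Unix-like systems only.

```python
import errno
import glob
import os


def find_server_pid_file(zeppelin_pid_dir):
    """Return the Zeppelin server PID file path, or None if there is none.

    Assumption: interpreter PID files start with 'zeppelin-interpreter-'
    and the server PID file does not.
    """
    pattern = os.path.join(zeppelin_pid_dir, 'zeppelin-*.pid')
    server_files = sorted(
        path for path in glob.glob(pattern)
        if not os.path.basename(path).startswith('zeppelin-interpreter-')
    )
    return server_files[0] if server_files else None


def is_process_running(pid_file):
    """Return True if the PID recorded in pid_file belongs to a live process."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (IOError, OSError, ValueError):
        return False  # missing, unreadable, or malformed PID file
    if pid <= 0:
        return False  # not a valid single-process PID
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is delivered
    except OSError as err:
        # EPERM means the process exists but is owned by another user.
        return err.errno == errno.EPERM
    return True
```

With a lookup like this, a stopped interpreter no longer affects the result, because its PID file is never considered.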

What is next?

In late January 2018, I created AMBARI-22834 to raise awareness of this issue. Recently, @matthias created PR 304 to address it. We are waiting on an Apache Ambari committer to review and commit this change. Until then, we have adjusted the Apache Ambari alert for Apache Zeppelin locally to reduce the false alarms.