Building the right thing, building it right, fast

Who I am

My name is Jakub Holy and I’m a software craftsmanship enthusiast and a (mainly JVM-based) developer since ~ 2005, consultant, and occasionally a project manager, working currently with Iterate AS in Norway. More under About and #opinion.

Aggregating Error Logs to Send a Warning Email When Too Many of Them – Log4j, Stat4j, SMTPAppender

Our development team wanted to get notified as soon as something goes wrong in our production system, a critical Java web application serving thousands of customers daily. The idea was to let it send us an email when there are too many errors, indicating usually a problem with a database, an external web service, or something really bad with the application itself. In this post I want to present a simple solution we have implemented using a custom Log4J Appender based on Stats4j and an SMTPAppender (which is more difficult to configure and troubleshoot than you might expect) and in the following post I explore how to achieve the same effect with the open-source Hyperic HQ monitoring SW.

The Challenge

We faced the following challenges with the logs:

It’s unfortunately normal to have certain number of exceptions (customers select search criteria yielding no results, temporary, unimportant outages of external services etc.) and we certainly don’t want to be spammed because of that. So the solution must have a configurable threshold and only send an alert when it is exceeded.

The failure rate should be computed for a configurable period (long enough not to trigger an alert because of few-minutes outages yet short enough for the team to be informed ASAP when something serious happens).

Once an alert is send, no further alerts should be send again for some time (ideally until the original problem is fixed), we don’t want to be spammed because of a problem we already know about.

The Solution

We’ve based our solution on Lara D’Abreo’s Stat4J, which provides a custom Log4J appender that uses the logs to compute configurable measures and triggers alerts when they exceed their warning or critical thresholds. It is couple of years old, alpha-quality (regarding its generality and flexibility) open-source library, which is fortunately simple enough to be modified easily for one’s needs.

So we have tweaked Stat4J to produce alerts when the number of alerts exceeds thresholds and keep quiet thereafter and combined that with a Log4J SMTPAppender that listens for the alerts and sends them via e-mail to the team.

Stat4J Tweaking

The key components of Stat4J are the Stat4jAppender for Log4J itself, calculators (measures) that aggregate the individual logs (e.g. by counting them or extracting some number form them), statistics that define which logs to consider via regular expressions and how to process them by referencing a calculator, and finally alerts that log a warning when the value of a statistics exceeds its limits. You can learn more in an article that introduces Stat4J.

We have implemented a custom measure calculator, RunningRate (to count the number of failures in the last N minutes) and modified Stat4J as follows:

We’ve enhanced Alert to support a new attribute, quietperiod, so that once triggered, subsequent alerts will be ignored for that duration (unless the previous alert was just a warning while the new one is a critical one)

We’ve modified the appender to include the log’s Throwable together with the log message, which is then passed to the individual statistics calcualtors, so that we could filter more precisely what we want to count

Finally we’ve modified Alert to log alerts as errors instead of warnings so that the SMTPAppender wouldn’t ignore them

calculator.minuteRate.period (in ms) – count errors over this period, reset the count at its end; a reasonable value may be 10 minutes

alert.TooManyErrorsRecently.warn and .critical – trigger the alert when so many errors in the period has been encountered; reasonable values depend on your application’s normal error rate

alert.TooManyErrorsRecently.quietperiod (in ms) – don’t send further alerts for this period not to spam in a persistent failure situation; the reasonable value depends on how quickly you usually fix problems, 1 hour would seem OK to me

Notice that statistic.RunningErrorRate.first.match is a regular expression defining which logs to count; “.*” would include any log, “your\.package\..*Exception” any exception in the package and so on, you can even specify logs to exclude using a negative lookahead ((?!X))

Log4J Configuration

Now we need to tell Log4J to use the Stat4j appender to count error occurences and to send alerts via email:

#9 Only send ERRORs to Stat4J, we are not interested in less serious exceptions

#14 “alerts” is the log category used by Stat4jAppender to log alerts (the same you would create via Logger.getLogger(“alerts”)); as mentioned, SMTPAppender will without respect to the configuration only process ERRORs and higher

Issues with the SMTPAppender

It is quite tricky to get the SMTPAppender working. Some pitfall:

SMTPAppender ignores all logs that are not ERROR or higher without respect to how you set its threshold

If you specify a non-existing From domain then some recipient’s mail servers can just delete the email as spam (e.g. GMail)

To send emails, you of course need mail.jar (and for older JVMs also activation.jar), here are instructions for Tomcat

And one $100 tip: to debug it, run your application in the debug mode and set a method breakpoint on javax.mail.Transport#send (you don’t need the source code) and when there, set this.session.debug to true to get a very detailed log of the following SMTP communication in the server log.

Sidenote

The fact that this article is based on Log4J doesn’t mean I’d personally choose it, it just came with the project. I’d at least consider using the newer and shiny Logback instead :-).

Conclusion

Stat4j + SMTPAppender are a very good base for a rather flexible do-it-yourself alerting system based on logs and e-mail. You can achieve the same thing out-out-the-box with Hyperic HQ plus much much more (provided that you get your admins to open two ports for it), which I will describe in the next blog post.

Links

An alternative for preventing the SMTPAppender from spamming in persisten failure situations (aside of its built-in buffer size): log4j-email-throttle

EventConsolidatingAppender – announced via mailing list in 2/2011 – “the
purpose of this appender is to consolidate multiple events that are received
by a single logger within a specified number of seconds into a single event;
this single consolidated event is then forwarded to a ‘downstream’ appender”

Share this:

Like this:

LikeLoading...

Related

This entry was posted on October 15, 2011 at 9:28 pm and is filed under j2ee, Tools.
Tagged: log4j, logging, monitoring, ops. You can follow any responses to this entry through the RSS 2.0 feed.
Both comments and pings are currently closed.

One Response to “Aggregating Error Logs to Send a Warning Email When Too Many of Them – Log4j, Stat4j, SMTPAppender”