LMAX Exchange blog - FX industry thought leadership

Looking for heap distress

Recently at LMAX Exchange we’ve had a couple of services suffer from memory leaks.

In both cases, we noticed the problem much later than we’d like. One stricken application started to apply backpressure on (really rather important) upstream components. Another caused an account-related action to be intermittently unavailable.

What we’d like to be able to do is detect when any of our Java services is in acute heap distress as soon as it enters that state. We already have monitoring around causes of backpressure; we want to spot our issue a little earlier on in the causality graph (that’s two ridiculous names you’ve invented just in one paragraph. Enough! – Ed).

Handily, each service already publishes its heap usage and GC times to our home-grown monitoring app. Even better, our alerting application already streams live events from that monitor into some alerting rules, so we have some existing scaffolding to build on top of.

We reworked the way the existing streaming sensors in our alerting app worked until we had something like the following:

Occasionally, a different thread will call sense to see if any faults are currently present.

In our case, the keys we have available to us (per application, per host) are ‘heap.used’, ‘heap.max’ (both in bytes) and ‘gc.pause’. Can we reliably detect heap distress from such a stream of events?
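To make the shape of the problem concrete, here is a minimal sketch of a streaming sensor of this kind. Everything in it is an assumption made for illustration: the `MetricEvent` and `StreamingSensor` names, the `onEvent` method, the plain string fault descriptions and the toy `LongPauseSensor` are hypothetical. Only the `sense` method name and the metric keys come from the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical event shape: one metric sample for one application on one host.
final class MetricEvent {
    final String application;
    final String host;
    final String key;        // "heap.used", "heap.max" or "gc.pause"
    final long value;
    final long timestamp;

    MetricEvent(String application, String host, String key, long value, long timestamp) {
        this.application = application;
        this.host = host;
        this.key = key;
        this.value = value;
        this.timestamp = timestamp;
    }
}

// Hypothetical sensor contract: the streaming thread pushes events in,
// and a different thread occasionally polls for current faults.
interface StreamingSensor {
    void onEvent(MetricEvent event);
    List<String> sense();
}

// Toy implementation: reports every application/host pair whose most recent
// gc.pause value exceeded a limit. Methods are synchronized because onEvent
// and sense are called from different threads.
final class LongPauseSensor implements StreamingSensor {
    private final long pauseLimit;
    private final Map<String, Long> lastPauseByKey = new TreeMap<>();

    LongPauseSensor(long pauseLimit) {
        this.pauseLimit = pauseLimit;
    }

    @Override
    public synchronized void onEvent(MetricEvent event) {
        if ("gc.pause".equals(event.key)) {
            lastPauseByKey.put(event.application + "@" + event.host, event.value);
        }
    }

    @Override
    public synchronized List<String> sense() {
        List<String> faults = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastPauseByKey.entrySet()) {
            if (entry.getValue() > pauseLimit) {
                faults.add(entry.getKey());
            }
        }
        return faults;
    }
}
```

The point of the split is that all state is updated as events stream in, so the polling call only has to read what is already known.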

Bad ideas we considered

Percentage of wall clock time spent in GC

We could work out what percentage of time an application is spending in GC, and alert if it goes above a threshold.

This isn’t too bad an idea. Unfortunately, the ‘gc.pause’ metric is published at most once per second per application, and it reports the maximum pause time seen in that second, not the sum. We could fix this, but we’d have to wait for a software release to get feedback.
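A small worked example shows why a per-second maximum is not enough. This is only a sketch: the `GcTimeEstimate` class is hypothetical, and it assumes pauses are measured in milliseconds, which the text does not state.

```java
final class GcTimeEstimate {
    // Fraction of a window spent paused, using every pause in that window.
    static double fractionFromAllPauses(long[] pausesMillis, long windowMillis) {
        long total = 0;
        for (long pause : pausesMillis) {
            total += pause;
        }
        return (double) total / windowMillis;
    }

    // Fraction computed when only the longest pause in the window is published.
    static double fractionFromMaxPause(long[] pausesMillis, long windowMillis) {
        long max = 0;
        for (long pause : pausesMillis) {
            max = Math.max(max, pause);
        }
        return (double) max / windowMillis;
    }
}
```

With pauses of 200, 150 and 300 ms inside one second, the true figure is 65% of wall clock time, but the published maximum alone suggests 30%. The maximum only ever gives a lower bound.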

Persistent memory usage above a threshold percentage

One thing we see in both of our failing applications is that they see consistent heap usage somewhere near to their max value. We could keep track of how many observations we see above a particular usage threshold, and once that observation count reaches a large enough number, alert.

This would detect errors like the ones we saw, but it would also detect slow-growing heaps in well-behaved applications. Alert code that produces more false alarms than real ones is, in our experience, worse than no alerting at all.
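A sketch of that counting approach makes the false alarm easy to see. The `PersistentUsageCounter` class is hypothetical, and resetting the count when usage drops back below the threshold is an assumption; the text only says that observations above the threshold are counted.

```java
final class PersistentUsageCounter {
    private final double usageThreshold;   // e.g. 0.8 for 80% of heap.max
    private final int requiredObservations;
    private int observationsAboveThreshold;

    PersistentUsageCounter(double usageThreshold, int requiredObservations) {
        this.usageThreshold = usageThreshold;
        this.requiredObservations = requiredObservations;
    }

    // Feed one heap.used / heap.max pair; returns true when an alert would fire.
    boolean observe(long usedBytes, long maxBytes) {
        if ((double) usedBytes / maxBytes > usageThreshold) {
            observationsAboveThreshold++;
        } else {
            observationsAboveThreshold = 0;   // assumption: count consecutive samples only
        }
        return observationsAboveThreshold >= requiredObservations;
    }
}
```

A healthy application whose heap fills slowly to 85, 90 and then 95 out of 100 before a collection takes it back down to 20 trips this alert on the third sample, even though the very next GC reclaims almost everything.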

Alerting if an inferred GC reclaims less than a certain percentage of memory

We can keep hold of the last seen usage, and when a new value arrives, check if it is smaller. If it is, we can infer that a GC has occurred. If the amount of memory freed by that GC is too small, we could alert.

Closer still, but this approach can also generate false positives. A collection can be triggered even when percentage heap usage is low; at that point there might not be enough garbage on the heap to reclaim the required amount.
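Sketched in code, the reclaim check and its weakness look something like this. The `ReclaimDetector` class is hypothetical, and it assumes the reclaimed amount is measured as a fraction of ‘heap.max’, which is one reading of "a certain percentage of memory".

```java
final class ReclaimDetector {
    private final double minReclaimedFraction;   // e.g. 0.2 for 20% of heap.max
    private long lastUsedBytes = -1;             // -1 means no sample seen yet

    ReclaimDetector(double minReclaimedFraction) {
        this.minReclaimedFraction = minReclaimedFraction;
    }

    // Feed one heap.used / heap.max pair; returns true if this sample looks
    // like a GC that freed less than the required fraction of the heap.
    boolean observe(long usedBytes, long maxBytes) {
        boolean fault = false;
        if (lastUsedBytes >= 0 && usedBytes < lastUsedBytes) {
            // Usage went down, so infer that a collection happened in between.
            double reclaimed = (double) (lastUsedBytes - usedBytes) / maxBytes;
            fault = reclaimed < minReclaimedFraction;
        }
        lastUsedBytes = usedBytes;
        return fault;
    }
}
```

With a 20% requirement, a collection that takes a nearly idle heap from 10 to 5 out of 100 is flagged, because there was never 20% of the heap available to free.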

Our eventual solution

Alert if, after an inferred GC, more than x% of the heap is still in use.

Still not perfect, but just about good enough. Why? Well, we see multiple ‘heap.used’ metrics published across a GC cycle, some of which may still be above the warning usage threshold. Those intermediate values could cause false positives.
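The rule above can be sketched for a single application on a single host as follows. The `HeapDistressDetector` name and its methods are hypothetical, and holding the verdict until the next inferred GC is an assumption about how the state is kept.

```java
final class HeapDistressDetector {
    private final double warnFraction;   // the "x%" from the rule, e.g. 0.8
    private long lastUsedBytes = -1;     // -1 means no sample seen yet
    private boolean inDistress;

    HeapDistressDetector(double warnFraction) {
        this.warnFraction = warnFraction;
    }

    // Feed one heap.used / heap.max pair for this application and host.
    void observe(long usedBytes, long maxBytes) {
        if (lastUsedBytes >= 0 && usedBytes < lastUsedBytes) {
            // Usage dropped, so infer a GC and judge what is left afterwards.
            inDistress = (double) usedBytes / maxBytes > warnFraction;
        }
        lastUsedBytes = usedBytes;
    }

    boolean inDistress() {
        return inDistress;
    }
}
```

Feeding it 95, 90 and then 30 out of 100 with an 80% threshold shows the intermediate-value problem: the sample at 90 looks like a GC that left 90% of the heap in use, so the detector reports distress until the sample at 30 arrives and clears it.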

In practice, we found that we call ‘currentFaults’ infrequently enough for this not to be a problem. If it became one, it would be the work of a couple of minutes to create a wrapper for our detector that only reported faults that had been present for multiple seconds.
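Such a wrapper might look roughly like this. It is a sketch only: the `SustainedFault` class is hypothetical, and passing the current time in explicitly is a choice made here to keep the example testable.

```java
final class SustainedFault {
    private final long minDurationMillis;
    private long faultSinceMillis = -1;   // -1 means no fault currently present

    SustainedFault(long minDurationMillis) {
        this.minDurationMillis = minDurationMillis;
    }

    // Feed the wrapped detector's current verdict; returns true only once the
    // fault has been continuously present for at least minDurationMillis.
    boolean update(boolean faultPresent, long nowMillis) {
        if (!faultPresent) {
            faultSinceMillis = -1;
            return false;
        }
        if (faultSinceMillis < 0) {
            faultSinceMillis = nowMillis;
        }
        return nowMillis - faultSinceMillis >= minDurationMillis;
    }
}
```

A fault that appears for one or two samples in the middle of a GC cycle and then clears never survives long enough to be reported.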

Bugs we totally missed

Some applications run on multiple hosts in the same environment

This one was a bit silly.

We run our applications in six or seven different ‘environments’ that we consider important enough to monitor. Each environment might have application ‘x’ deployed on hosts ‘h1’ and ‘h2’. We managed to not test this at all in our spike implementation and immediately generated large numbers of false positives. Whoops.
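The mechanism behind those false positives can be sketched as follows, assuming (the text does not spell this out) that the spike kept its "last seen usage" per application rather than per application and host. The `InferredGcCounter` class and the key format are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class InferredGcCounter {
    private final Map<String, Long> lastUsedByKey = new HashMap<>();
    private int inferredGcs;

    // Record a heap.used sample under the given state key and count a GC
    // whenever usage is lower than the previous sample for the same key.
    void observe(String stateKey, long usedBytes) {
        Long previous = lastUsedByKey.put(stateKey, usedBytes);
        if (previous != null && usedBytes < previous) {
            inferredGcs++;
        }
    }

    int inferredGcs() {
        return inferredGcs;
    }
}
```

If ‘h1’ reports 90 then 91 and ‘h2’ reports 20 then 21, interleaved, a counter keyed by "x" alone sees 90, 20, 91, 21 and infers two collections that never happened. Keyed by "x@h1" and "x@h2", the same samples infer none.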

More mysterious false positives

Having now written enough tests to convince ourselves that we weren’t totally inept, we launched the alerting app locally. We turned the warning threshold down to 10% to see if it spotted anything, and immediately saw some alerts. Clicking each alert (the ‘url’ feature of a fault is used to allow a user to see the data upon which the decision to alert was based) led us to the following chart.

Aha. We appear to have gone back in time. How is this happening, though?

Well, we deliberately use UDP to transmit metrics from our applications to our monitoring. We’re OK with occasionally losing some metrics; we’re definitely not happy if being unable to send them prevents our app from actually running. This is a potentially controversial choice; let’s agree to not discuss it here :-).

When metrics arrive at the monitoring app, it has two jobs. One, to squirrel the data into a time series database on disk. Two, to transmit the data to appropriately subscribed websocket clients.

The chart and raw data above are both served, post hoc, via the on-disk store. The websocket feed is not. The UDP packets of monitoring data were arriving out of order, and the monitoring app was faithfully forwarding those metrics to our alerting app in that same order.

We could probably fix that with some judiciously applied buffering, but given that each event we see is timestamped, we can simply drop events that are timestamped earlier than the last value we saw for a given service/host pair.
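A minimal sketch of that filter, assuming each event carries a numeric timestamp; the `OutOfOrderFilter` class and its key format are hypothetical. Events with a timestamp equal to the last one seen are kept here, on the assumption that several metrics for the same pair can share a timestamp.

```java
import java.util.HashMap;
import java.util.Map;

final class OutOfOrderFilter {
    private final Map<String, Long> lastTimestampByKey = new HashMap<>();

    // Returns true if the event should be processed, false if it is older
    // than the newest event already seen for this service/host pair.
    boolean accept(String service, String host, long timestamp) {
        String key = service + "@" + host;
        Long last = lastTimestampByKey.get(key);
        if (last != null && timestamp < last) {
            return false;   // arrived late; drop it
        }
        lastTimestampByKey.put(key, timestamp);
        return true;
    }
}
```

Running every event through `accept` before it reaches the detector means a late packet can no longer look like a sudden drop in heap usage.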
