The premise

We developed a software that runs on several geographically distant independent “stations” that collect a multitude of environmental measurement data. This data is preprocessed and stuffed into data packages, which are periodically transferred to a control center. The software of this control center, also developed by us, receives the data packages, stores them on disk and in a huge database and extracts the overall state of the measurement network from raw data. If you describe the main task of the network on this level, it sounds nearly trivial. But the real functionality requirements are manifold and the project grew large.

We kept the whole system as modular as necessary to maintain an overall grasp of what is going on where in the system and installed a sufficient automatic test coverage for the most important parts. The system is still under active development, but the main parts of the network are in production usage without real changes for years now.

The symptoms

This might explain that we were very surprised when our customer told us that the control center had lost some data packages. Very soon, it turned out that the control center would randomly enter a state of “denial”. In this state, it would still accept data packages from the stations and even acknowledge their arrival (so the stations wouldn’t retry the transmission), but only write parts of the package or nothing at all to the disk and database. When the control center entered this state, it would never recover from it. But when we restarted the software manually, everything would run perfectly fine for several days and then revert back into denial without apparent trigger.

We monitored the control center with every means on our disposal, but its memory consumption, CPU footprint and threading behaviour was without noticeable problem even when the instance was in its degraded state. There was no exception or uncommon entry logged in the logfiles. As the symptom happened randomly, without external cause and with no chance of reversal once it happened, we soon suspected some kind of threading issue.

The bug

The problem with a threading issue is that you can’t just reproduce the bug with an unit or system test. We performed several code reviews until we finally had a trace. When a data package arrives, a global data processing lock is acquired (so that no two data packages can be processed in parallel) and the content of the package is inspected. This might trigger several network status changes. These change events are propagated through the system with classic observer/listener structures, using synchronous calls (normal delegation). The overall status of the network is translated in a human readable status message and again forwarded to a group of status message listeners. This is a synchronous call again. One of the status message listeners was the software driver for a LED ticker display. This module was a recent addition to the control center’s hardware outfit and used to display the status message prominently to the operators. Inside this LED software driver, some bytes are written to a socket stream and then the driver awaits an answer of the hardware device. To avoid the situation that two messages are sent to the device at the same time, a lock is acquired just before the message is sent. This code attracted our attention. Lets have a look at it:

The main problem here is the object the lock is acquired upon: the reference of lastMessage is mutable! We call this a liquid lock, because the lock isn’t as solid as it should be. It’s one of the more hideous multithreading pitfalls as it looks like everything’s fine at first glance. But this lock doesn’t have a complete “locking” effect because each caller may acquire the lock of a different instance. And a lock with a flawed locking behaviour is guaranteed to fail (in production). The liquid lock is like the bigger brother of the local lock. It isn’t local, but its mutability cause the same problems.

The bug finally turned out to be caused by the liquid lock in the LED display driver that got notified of system message changes when a data package arrived. But only if multiple messages were sent at once to the device, discarding some of the necessary answers in this circumstance or if the connection to the LED hardware would fail in the midst of a transmission, the system would not return from the write attempt. If one thread wouldn’t return to the data package processor, the global data processing lock would not be freed (read the start of this chapter again, this is the most important lock in the system!). And while the data processing lock was still held, all other data packages would be received, but piling up to obtain the lock. But the lock would never be returned from the thread waiting on an answer from a hardware device that had no intention to send another answer. This was when the control center appeared to be healthy but didn’t process any data packages anymore.

The conclusion

If you want to avoid the category of liquid lock multithreading bugs, make sure that all your lock instance references are immutable. Being final is an important property of lock instance references. Avoid to retrieve your locks from notoriously muteable data structures like collections or arrays. The best thing you can do to avoid liquid locks is to “freeze” all your lock instances.

Another insight from this story is that software modules have to be separated threadwise, too. It was a major design flaw to let the data processing thread, while holding the main processing lock, descend down into the deep ends of the LED driver, eventually getting stuck there for infinity. Some simple mechanisms like asynchronous listener notification or producer/consumer queues for pending transmission requests would have helped to confine the effects of the liquid lock bug inside the LED module. Without proper thread separation, it took down the whole software instance.