In 2015/2016 I was the lead engineer on a team that was tasked with building a customer service chat application for WebstaurantStore. You're probably familiar with the idea - you visit an ecommerce site and a little notification pops up prompting you to chat with a customer service agent. We wanted to provide our customers and our Customer Service Representatives (CSRs) with a better chat experience by integrating product and customer data directly into the chat, so we decided to build a system in-house. The system was dubbed "Switchboard."

To understand the "Tantrum Spiral" bug that we encountered, I first need to supply some high-level information about how Switchboard works.

Switchboard at a Glance

One of the main requirements for the chat app is transparent resiliency - if a customer closes the chat accidentally and then clicks on the "Chat Now" button again, their conversation should pick up exactly where they left off (and with the same CSR, if they are available). If the customer already has a chat window open and they click the "Chat Now" button again, both chat windows should remain connected and in sync. Customers can do all sorts of weird things with browser tabs, and we need to make sure that Switchboard Just Works in all of those scenarios.

To that end, Switchboard makes a distinction between "users" and "connections". A user describes the logical "who" - either a customer, a CSR, or an anonymous user - while a connection describes the logical "where" - i.e., where do I send the messages? Users can have multiple simultaneous connections. A connection is essentially an "address" - a unique ID that points to a WebSocket or HTTP connection that's held in memory on one of the app servers. The connection addresses are stored in a Redis set, with the UserID as the key. When a message needs to be sent to a user, we look up their connection addresses in Redis, and then broadcast the message to all of the connections that correspond to those addresses.
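The user/connection split can be sketched roughly as follows. This is a minimal in-memory illustration, not Switchboard's actual code: a Map of Sets stands in for the Redis sets, and the names addConnection and sendToUser, as well as the address format, are assumptions made up for the example.

```javascript
// Stand-in for Redis: userId -> Set of connection addresses.
// In Redis terms, adding an address is roughly SADD and the lookup is SMEMBERS.
const addressesByUser = new Map();

// Stand-in for the sockets held in memory on an app server:
// address -> function that sends a message down that socket.
const localSockets = new Map();

function addConnection(userId, address, send) {
  if (!addressesByUser.has(userId)) addressesByUser.set(userId, new Set());
  addressesByUser.get(userId).add(address);
  localSockets.set(address, send);
}

// Look up every address for the user and deliver the message to each one.
// Returns how many connections actually received the message.
function sendToUser(userId, message) {
  const addresses = addressesByUser.get(userId) ?? new Set();
  let delivered = 0;
  for (const address of addresses) {
    const send = localSockets.get(address);
    if (send) {
      send(message);
      delivered += 1;
    }
  }
  return delivered;
}

// One customer with two browser tabs open: both tabs receive the message.
const tabA = [];
const tabB = [];
addConnection("customer-1", "server-1:conn-a", (m) => tabA.push(m));
addConnection("customer-1", "server-1:conn-b", (m) => tabB.push(m));
const deliveredCount = sendToUser("customer-1", "Hello!");
```

Keying by user rather than by socket is what lets a second tab, or a reopened chat window, join the same conversation without any special handling.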

Switchboard also has a feature that allows CSRs to see the status and chat count of other CSRs in real time. In practice, this means that a status message is broadcast to each CSR every time a customer or CSR connects or disconnects. The status messages look something like this:

This bug gets its name from Dwarf Fortress, a game where you manage a colony of dwarves - telling them where to dig for materials, what to build, what to farm, etc. Occasionally a dwarf will get annoyed with the type of work it's been assigned, and in response the dwarf will throw a tantrum and start knocking over furniture and punching other dwarves. Usually, the misbehaving dwarf is punished for its actions. In some cases, the punishment or even death of the tantrum-throwing dwarf will cause other dwarves to throw a tantrum - and thus the colony spirals out of control, leading to the eventual death of the entire group of dwarves.

One fateful day, the Switchboard app servers suffered a similar demise.

The first evidence of the problem was a wave of disconnects affecting our CSRs. When the chat client is forcibly disconnected from the server, the user sees a message letting them know they've been disconnected. This isn't an uncommon occurrence for customers (especially those on mobile networks), but it was rarely a problem for our CSRs. In any case, the client automatically tries to reconnect to the server, and if it's successful, the user won't know that anything bad has happened.

The office building that houses our CSRs was undergoing some construction at the time, so the wave of disconnects wasn't too unusual. However, when the disconnects continued and increased in frequency, we knew something else was wrong.

A quick glance at our dashboards showed that the network pipe between the app servers and the Redis cluster was completely saturated - not good, and definitely not normal. We saw in the app server logs that calls to Redis were failing, which resulted in an unhandled exception that rolled the app server process (in Node.js the mantra is "fail fast and restart"). When the app server rolled, it forcibly disconnected all clients that were connected to that particular server, and those clients attempted to reconnect to one of the other available app servers.

The next step was trying to figure out why Redis network IO was pegged. We discovered that each CSR had old, inactive connection addresses hanging around in Redis, and with that discovery, all of the pieces started to make sense...

An Unfortunate Series of Events

Here's the sequence of events. An internet disruption caused all CSRs to disconnect from the app servers. When a client disconnected in this fashion, the websocket library we used didn't properly fire a "disconnect" event, which meant that the app server never had a chance to clean up the (now disconnected) connection addresses in Redis.

When the clients automatically reconnected to a different app server, the app server would broadcast status messages to all CSRs, including the old addresses. When an app server tries to publish a message to an address that it doesn't have in memory (i.e., the client is connected to a different app server), it uses Redis pub/sub to publish that message to the other app servers so that whichever app server does have the connection can pass it along.
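The routing rule described above can be sketched like this. It is a simplified model under stated assumptions, not the real implementation: the Bus class stands in for Redis pub/sub, and AppServer, its method names, and the addresses are all hypothetical.

```javascript
// Stand-in for Redis pub/sub: every subscriber sees every published message.
class Bus {
  constructor() {
    this.handlers = [];
    this.published = 0; // how many messages went through "Redis"
  }
  subscribe(handler) {
    this.handlers.push(handler);
  }
  publish(address, message) {
    this.published += 1;
    for (const handler of this.handlers) handler(address, message);
  }
}

// Hypothetical app server: delivers locally when it holds the socket,
// otherwise hands the message to the bus so another server can deliver it.
class AppServer {
  constructor(name, bus) {
    this.name = name;
    this.bus = bus;
    this.sockets = new Map(); // address -> messages received on that socket
    bus.subscribe((address, message) => this.deliverLocal(address, message));
  }
  connect(address) {
    this.sockets.set(address, []);
  }
  deliverLocal(address, message) {
    const inbox = this.sockets.get(address);
    if (!inbox) return false; // held by another server, or by nobody at all
    inbox.push(message);
    return true;
  }
  send(address, message) {
    if (this.deliverLocal(address, message)) return "local";
    this.bus.publish(address, message);
    return "published";
  }
}

const bus = new Bus();
const serverA = new AppServer("a", bus);
const serverB = new AppServer("b", bus);
serverB.connect("conn-live");

const liveRoute = serverA.send("conn-live", "status");   // held by serverB
const ghostRoute = serverA.send("conn-ghost", "status"); // held by nobody
```

In this model the sending server cannot tell a ghost address from a live one on another server, so both cost a publish; the ghost's message is simply dropped by every subscriber.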

By this point, you probably see where this is going. Each time an internet disruption occurred throughout the day, a bunch of "ghost" addresses would pile up in Redis. Eventually there would be enough ghost addresses that the status broadcast messages would saturate the connection between the app servers and Redis. When that happened, the app server would roll - disconnecting all clients and accruing even more ghost addresses - and when the clients tried to reconnect to another app server, the status broadcast would kick off yet again.
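A back-of-the-envelope model shows why the ghosts hurt so quickly. The function below is only a sketch with made-up numbers (the CSR counts and ghost counts are illustrative, not figures from the incident), and it counts only the publishes caused by ghost addresses, so it is a lower bound on the pub/sub traffic.

```javascript
// Publishes caused by ghost addresses when every CSR reconnects once.
function ghostPublishesPerReconnectWave(csrCount, ghostsPerCsr) {
  // Each reconnecting CSR triggers one status broadcast...
  const broadcasts = csrCount;
  // ...and each broadcast is published once per ghost address of every CSR.
  const ghostPublishesPerBroadcast = csrCount * ghostsPerCsr;
  return broadcasts * ghostPublishesPerBroadcast;
}

console.log(ghostPublishesPerReconnectWave(50, 0)); // 0
console.log(ghostPublishesPerReconnectWave(50, 1)); // 2500
console.log(ghostPublishesPerReconnectWave(50, 5)); // 12500
```

The cost grows with the square of the number of CSRs and linearly with the number of ghosts per CSR, and every server roll adds another ghost for each client that was connected to it.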

In other words, one app server would throw a tantrum, which would cause another app server to throw a tantrum, and so on.

For an interim solution, we manually deleted the list of addresses for each CSR from Redis and asked the CSR to log out and log back in to Switchboard. This purged the ghost connections and reduced the number of addresses that the status broadcast was sent to.

For the long term, we ultimately settled on adding TTLs to each address stored in Redis. Each client now sends heartbeat messages to the app server, which updates the TTL for that client/address. If the client disconnects in such a way that the address isn't removed from Redis, it will eventually expire.

We accomplished this by changing the list of addresses from a Set to a Sorted Set with the score indicating the expiration time for that address. When we fetch the addresses from Redis, we first delete any addresses from the set where the score is less than the current time (using ZREMRANGEBYSCORE).
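Here is a minimal sketch of the heartbeat-plus-expiry idea. It uses an in-memory Map in place of the Redis sorted set so it can run on its own; the roughly equivalent Redis commands are noted in comments. The function names, the key format, and the 30-second TTL are assumptions for illustration, since the post does not state the real values.

```javascript
const TTL_MS = 30000; // assumed TTL, not the real Switchboard setting

// Stand-in for one sorted set per user: userId -> Map(address -> expiresAt).
const expiryByUser = new Map();

// Called on connect and on every heartbeat from the client.
// Redis: ZADD addresses:<userId> <now + TTL> <address>
function heartbeat(userId, address, now) {
  if (!expiryByUser.has(userId)) expiryByUser.set(userId, new Map());
  expiryByUser.get(userId).set(address, now + TTL_MS);
}

// Called before sending to a user: prune expired addresses, then read the rest.
// Redis: ZREMRANGEBYSCORE addresses:<userId> -inf (<now>
//        ZRANGE addresses:<userId> 0 -1
function liveAddresses(userId, now) {
  const entries = expiryByUser.get(userId) ?? new Map();
  for (const [address, expiresAt] of entries) {
    if (expiresAt < now) entries.delete(address); // score below the current time
  }
  return [...entries.keys()];
}

heartbeat("csr-1", "conn-old", 0);     // expires at 30000, then the client vanishes
heartbeat("csr-1", "conn-new", 20000); // expires at 50000
heartbeat("csr-1", "conn-new", 40000); // heartbeat pushes expiry to 70000
const remaining = liveAddresses("csr-1", 45000);
console.log(remaining); // [ 'conn-new' ]
```

Pruning at read time means a ghost address never needs a "disconnect" event to be cleaned up: it just stops receiving heartbeats and falls out of the set the next time someone looks.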

Since the fix, Switchboard has been more robust to both internal and external network outages. If you have any questions about Switchboard's design or architecture, feel free to reach out to me @cas002 on Twitter.