In my last post about message queues, I suggested that data contracts - specifically Google Protocol Buffers - can be extremely useful for communicating over queues. Today I wanted to share the process we use at work to build and distribute a C# class library containing compiled protobufs using PowerShell, TFS, and NuGet. We start with raw .proto files and end up with a library distributed via NuGet that we can easily reference in multiple projects.

Protobuf Project Structure

We keep all of the .proto files we share with other teams in a dedicated repo named DataContracts. The structure of the repo looks like this:
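A representative layout, reconstructed from the folder descriptions that follow (the exact file names are assumptions):

```
DataContracts/
├── Src/                  <- raw .proto files
├── Output/
│   └── CSharp/
│       ├── DataContracts.sln
│       ├── DataContracts.csproj
│       ├── packages.config
│       └── Src/          <- generated .cs files
└── Deploy/               <- PowerShell build scripts
```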

The top-level Src folder contains all of the proto files. The Output folder has subfolders for each language we target - in this post, I'm only going to talk about the C# output. Inside the CSharp folder there's a bare-bones Visual Studio solution that includes a class library project with no files, as well as a packages.config file for NuGet dependencies.

The Deploy folder contains all of the PowerShell scripts that we use to compile .proto files into .cs files, validate the nuspec version, and modify the .csproj file.

Process Overview

Here's an overview of the process we use to create the library:

Validate the version number in the nuspec file

Compile .proto files into .cs files

Modify the class library project so that it includes all generated .cs files

Restore NuGet packages

Compile the solution/project

Package the .dlls into a NuGet package

Push the new package to our NuGet server

Validating Version Numbers

We use some basic criteria to validate the version numbers for our NuGet package. If the branch we're building is develop, the version must have an "-alpha" suffix. If the branch is master, the version must NOT have a suffix. This simple check ensures that prerelease code is always tagged with a prerelease version number.

The PowerShell code for accomplishing the validation is pretty straightforward.
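A sketch of what that script might look like - the parameter names and the nuspec handling here are assumptions:

```
param([string]$branchName, [string]$nuspecPath)

# Read the version out of the nuspec file
[xml]$nuspec = Get-Content $nuspecPath
$version = $nuspec.package.metadata.version

# develop builds must be prerelease; master builds must not be
if ($branchName -like "*develop" -and $version -notlike "*-alpha") {
    throw "Builds from develop require an '-alpha' version suffix."
}
if ($branchName -like "*master" -and $version -like "*-*") {
    throw "Builds from master must not have a version suffix."
}
```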

The $branchName variable is passed as an argument to the script. In TFS you can get the current branch name by referencing the build variable $(Build.SourceBranch).

Compiling Proto Files

There's nothing complicated about compiling protobufs into C# classes. We execute protoc.exe and specify --proto_path=Src and --csharp_out=Output\CSharp\Src to set the input and output directories, respectively. The protobuf compiler generates the .cs files and drops them in the Output\CSharp\Src folder. We put them in this separate Src folder so that it's easy to find them all later in the build process.
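Run from the repo root, the invocation looks roughly like this (the location of protoc.exe is an assumption):

```
# Collect every .proto under Src and compile them in one pass
$protoFiles = Get-ChildItem Src -Filter *.proto -Recurse |
    ForEach-Object { $_.FullName }
& protoc.exe --proto_path=(Resolve-Path Src) `
    --csharp_out=Output\CSharp\Src $protoFiles
```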

Modifying the .csproj File

Next, we add the generated .cs files to our empty class library project. We do this by manipulating the .csproj file to add each generated file to the list of files that need to be compiled.

Even the most basic .csproj file is quite verbose. There's some boilerplate at the very top, followed by several PropertyGroup elements that define compilation targets. Below that we have several ItemGroup elements. The first ItemGroup contains references - both to NuGet packages and core assemblies. The second ItemGroup contains the files that need to be compiled - this is the element we need to modify. Finally, the third ItemGroup contains files that don't need to be compiled (such as config files).
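Stripped down, the layout looks something like this (the file names are illustrative):

```
<Project ToolsVersion="12.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup>
    <OutputType>Library</OutputType>
  </PropertyGroup>
  <ItemGroup>
    <!-- 1: assembly and NuGet references -->
    <Reference Include="System" />
  </ItemGroup>
  <ItemGroup>
    <!-- 2: files to compile (the group we modify) -->
    <Compile Include="Src\ExampleMessage.cs" />
  </ItemGroup>
  <ItemGroup>
    <!-- 3: files that aren't compiled -->
    <None Include="packages.config" />
  </ItemGroup>
</Project>
```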

To add our generated .cs files to that second ItemGroup, we use a PowerShell script.

First, we read in the .csproj file as XML and find the second ItemGroup element.

Now we do some funky stuff to clean up the XML that PowerShell generated. When it created the "Compile" elements, it added an extra "xmlns" attribute to each element. We need to remove this attribute from each element or Visual Studio will complain. To do that, we read the file contents back in as a string and then do a simple string replace.
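A sketch of both steps - the project path and key names here are assumptions:

```
# Load the project file and grab the second ItemGroup (the Compile group)
$csprojPath = "Output\CSharp\DataContracts.csproj"
[xml]$csproj = Get-Content $csprojPath
$compileGroup = $csproj.Project.ItemGroup[1]

# Add a <Compile Include="..."> element for each generated file
Get-ChildItem "Output\CSharp\Src" -Filter *.cs | ForEach-Object {
    $compile = $csproj.CreateElement("Compile")
    $compile.SetAttribute("Include", "Src\$($_.Name)")
    $compileGroup.AppendChild($compile) | Out-Null
}
$csproj.Save((Resolve-Path $csprojPath))

# PowerShell serializes the new elements with an empty xmlns attribute;
# strip it with a string replace so Visual Studio doesn't complain
(Get-Content $csprojPath -Raw).Replace(' xmlns=""', '') |
    Set-Content $csprojPath
```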

Most larger systems can benefit in some way from the introduction of queueing. Message queues can be used for asynchronous communication, task buffers, and more - but at what cost? In most cases, the answer is additional complexity with regard to operations, monitoring, and troubleshooting. I wanted to share some lessons I've learned for reducing (or at least preparing for) the additional complexity of message queues.

Message names should be past tense events

Message queues, like HTTP APIs, require a disciplined attention to semantics. Sloppy practices in this area can result in tight coupling and dependency/deployment hell. The easiest way to prevent messages from turning into RPC calls is to restrict the message context to the domain of the publisher. In general, the publisher shouldn't care about the services that are consuming its messages. Naming messages as past tense events (e.g., AccountCreated or PaymentAccepted) frames the message as a notification to other systems.

Don't conflate message content and message routing

When you write an email, the contents of the email usually depend on the intended recipient. It makes sense to assume that same relationship applies to messages sent over a queue. However, that assumption can quickly lead to duplicated code, duplicated messages, and a giant mess. Learn from my mistakes - message content should be independent of message routing. Don't include the names of publishers or subscribers in message names. Messages should be business objects that can stand on their own without the need for additional context (e.g., a purchase order or a payment). Likewise, routing a message shouldn't require inspecting the message's contents. Keep routing keys, intended recipients, etc. out of the message itself - instead, put that routing-related information in message headers, queue names, or some other queue-specific metadata.
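As a hypothetical illustration, a payment message and its routing might be separated like this (all names are made up):

```
# Message body: a self-contained business object
{ "paymentId": "p-1042", "amountCents": 2500, "currency": "USD" }

# Routing: kept in queue-specific metadata, outside the body
exchange:    payments
routing key: payment.accepted
```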

Use data contracts

Having data contracts in place for messages ensures that both the publisher and subscribers understand the data in the same way. By using a library like Google Protocol Buffers, you can also realize performance gains from faster serialization and smaller message size. Protocol buffers have the added benefit of allowing type-safe communication between services written in different languages. Keep these contracts in a separate repo, and make sure to use semantic versioning for all releases. Consider distributing the contracts via some sort of package manager, e.g., NuGet, npm, or Composer.
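For example, a minimal contract might look like this (the message and field names are hypothetical):

```
syntax = "proto3";

package contracts;

// A past-tense event published when a payment clears
message PaymentAccepted {
  string payment_id = 1;
  int64 amount_cents = 2;
  string currency = 3;
}
```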

Create a client library

Creating a client library for interacting with message queues is especially helpful if you have multiple systems and teams communicating with each other via queues. The library should provide methods for creating and connecting to queues using a standardized convention. It can also include audit-related functionality (discussed below), which guarantees consistent and reliable metrics. Having this common layer of abstraction in place provides a relatively small surface area for changes if you want to swap out your queue technology down the road.

Have audit infrastructure in place

Having visibility into the status of your publishers, subscribers, and queues is absolutely paramount for the successful operation of your system. The more hops your data makes, the more important auditing is - queues that span datacenters are a prime target for intermittent interruptions.

There are two types of data that you can collect to give you visibility into your queues - event data and metrics. Event data is related to specific messages. A publisher might log a "published" event that references a message ID and a timestamp. Metrics, on the other hand, provide summary data like the number of messages consumed by a process in the last 30 seconds.

If you don't have many cycles to spend on auditing infrastructure, at the very least you should log event data. As long as the events have timestamps, you can calculate metrics as you aggregate the event logs.

If your system spans multiple datacenters, you might want to create an additional "audit" service as a secondary method of tracking message delivery. This service would simply consume all published messages and keep a log of event data - rather than "published" or "consumed" events, the audit service would produce "observed" events. This data can help identify replication/federation problems by giving you the means to determine the last piece of your pipeline that encountered a given message. You can also track latency with greater detail, i.e., how long it takes messages to replicate from Datacenter A to Datacenter B.

Create a proxy API

Using a message queue doesn't mean that all of your services need to talk to the queue directly. Maybe you use a language that doesn't have an official library for interacting with your queue of choice, or maybe you'd like to accept messages for another system in a queued fashion, but that other system is a scheduled job that can only make HTTP requests. In these cases, you can make a simple HTTP proxy API that takes the payload from an HTTP request and puts it on a queue. In the other direction, the proxy can subscribe to messages from a queue, and then POST to an HTTP endpoint in some other system to notify that system that a message has arrived.

Do you have more tips about implementing message queues that you'd like to share? Tweet me @cas002 to share the lessons you've learned.

In 2015/2016 I was the lead engineer on a team that was tasked with building a customer service chat application for WebstaurantStore. You're probably familiar with the idea - you visit an ecommerce site and a little notification pops up prompting you to chat with a customer service agent. We wanted to provide our customers and our Customer Service Representatives (CSRs) with a better chat experience by integrating product and customer data directly into the chat, so we decided to build a system in-house. The system was dubbed "Switchboard."

To understand the "Tantrum Spiral" bug that we encountered, I first need to supply some high-level information about how Switchboard works.

Switchboard at a Glance

One of the main requirements for the chat app is transparent resiliency - if a customer closes the chat accidentally and then clicks on the "Chat Now" button again, their conversation should pick up exactly where they left off (and with the same CSR, if they are available). If the customer already has a chat window open and they click the "Chat Now" button again, both chat windows should remain connected and in sync. Customers can do all sorts of weird things with browser tabs, and we need to make sure that Switchboard Just Works in all of those scenarios.

To that end, Switchboard makes a distinction between "users" and "connections". A user describes the logical "who" - either a customer, a CSR, or an anonymous user - while a connection describes the logical "where" - i.e., where do I send the messages? Users can have multiple simultaneous connections. A connection is essentially an "address" - a unique ID that points to a websocket or HTTP connection that's held in memory on one of the app servers. The connection addresses are stored in a Redis set, with the UserID as the key. When a message needs to be sent to a user, we look up their connection addresses in Redis, and then broadcast messages to all of the connections that correspond to those addresses.
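In Redis terms, the bookkeeping is roughly the following (the key format and address values are assumptions):

```
SADD user:42:connections "app03:conn-9f2e"   # client connects
SREM user:42:connections "app03:conn-9f2e"   # client disconnects
SMEMBERS user:42:connections                 # find all addresses for user 42
```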

Switchboard also has a feature that allows CSRs to see the status and chat count of other CSRs in real time. In practice, this means that a status message is broadcast to each CSR every time a customer or CSR connects or disconnects. The status messages look something like this:
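A representative payload (the field names are illustrative):

```
{
  "type": "csr-status",
  "csrId": 117,
  "status": "available",
  "activeChats": 3
}
```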

The Tantrum Spiral

This bug gets its name from Dwarf Fortress, a game where you manage a colony of dwarves - telling them where to dig for materials, what to build, what to farm, etc. Occasionally a dwarf will get annoyed with the type of work it's been assigned, and in response the dwarf will throw a tantrum and start knocking over furniture and punching other dwarves. Usually, the misbehaving dwarf is punished for its actions. In some cases, the punishment or even death of the tantrum-throwing dwarf will cause other dwarves to throw a tantrum - and thus the colony spirals out of control, leading to the eventual death of the entire group of dwarves.

One fateful day, the Switchboard app servers suffered a similar demise.

The first evidence of the problem was a wave of disconnects affecting our CSRs. When the chat client is forcibly disconnected from the server, the user sees a message letting them know they've been disconnected. This isn't an uncommon occurrence for customers (especially those on mobile networks), but was rarely a problem for our CSRs. In any case, the client automatically tries to reconnect to the server, and if it's successful, the user won't know that anything bad has happened.

The office building that houses our CSRs was undergoing some construction at the time, so the wave of disconnects wasn't too unusual. However, when the disconnects continued and increased in frequency, we knew something else was wrong.

A quick glance at our dashboards showed that the network pipe between the app servers and the Redis cluster was completely saturated - not good, and definitely not normal. We saw in the app server logs that calls to Redis were failing, which resulted in an unhandled exception that rolled the app server process (in Node.js the mantra is "fail fast and restart"). When the app server rolled, it forcibly disconnected all clients that were connected to that particular server, and those clients attempted to reconnect to one of the other available app servers.

The next step was trying to figure out why Redis network IO was pegged. We discovered that each CSR had old, inactive connection addresses hanging around in Redis, and with that discovery, all of the pieces started to make sense...

An Unfortunate Series of Events

Here's the sequence of events. An internet disruption caused all CSRs to disconnect from the app servers. When a client disconnected in this fashion, the websocket library we used didn't properly fire a "disconnect" event, which meant that the app server never had a chance to clean up the (now disconnected) connection addresses in Redis.

When the clients automatically reconnected to a different app server, the app server would broadcast status messages to all CSRs, including the old addresses. When an app server tries to publish a message to an address that it doesn't have in memory (i.e., the client is connected to a different app server), it uses Redis pub/sub to publish that message to the other app servers so that whichever app server does have the connection can pass it along.

By this point, you probably see where this is going. Each time an internet disruption occurred throughout the day, a bunch of "ghost" addresses would pile up in Redis. Eventually there would be enough ghost addresses that the status broadcast messages would saturate the connection between the app servers and Redis. When that happened, the app server would roll - disconnecting all clients and accruing even more ghost addresses - and when the clients tried to reconnect to another app server, the status broadcast would kick off yet again.

In other words, one app server would throw a tantrum, which would cause another app server to throw a tantrum, and so on.

For an interim solution, we manually deleted the list of addresses for each CSR from Redis and asked the CSR to log out and log back in to Switchboard. This purged the ghost connections and reduced the number of addresses that the status broadcast was sent to.

For the long term, we ultimately settled on adding TTLs to each address stored in Redis. Each client now sends heartbeat messages to the app server, which updates the TTL for that client/address. If the client disconnects in such a way that the address isn't removed from Redis, it will eventually expire.

We accomplished this by changing the list of addresses from a Set to a Sorted Set with the score indicating the expiration time for that address. When we fetch the addresses from Redis, we first delete any addresses from the set where the score is less than the current time (using ZREMRANGEBYSCORE).
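The resulting Redis operations look roughly like this (the key format and timestamps are illustrative):

```
# Heartbeat: refresh the address's expiration time (score = now + TTL)
ZADD user:42:connections 1700000060 "app03:conn-9f2e"

# On fetch: first purge anything whose expiration is in the past...
ZREMRANGEBYSCORE user:42:connections -inf 1700000000

# ...then read the remaining live addresses
ZRANGE user:42:connections 0 -1
```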

After the fix, Switchboard has become more robust to both internal and external network outages. If you have any questions about Switchboard's design or architecture, feel free to reach out to me @cas002 on Twitter.