Michael Zalimeni, Senior Software Engineer

We do a lot of things in Bronto that involve time. Thanks to the nature of commerce marketing, our system has an inherent sense of it. Email deliveries are scheduled, contact data includes dates (e.g. birthday, last order date), and engagement with contacts is often measured by the rate of their interactions (e.g. opens, clicks, website visits).

The more contacts and marketing events we take in, the more work we have to do to keep everything up to date. This particularly affects things like Segments, which often rely on a moving interval of time (e.g. “birthday is today”) to determine contact membership. As you can imagine, with the amount of data created by our clients and their customers, this work begins to add up very quickly.

When you need to defer actions in software, you have two options: maintain a queue of work for later evaluation or hand that work off to something else that will deliver it at the appropriate time. For a long time, we’ve had an implementation of the first option for Segments in the form of regularly scanned HBase tables; however, in our case, this solution eventually ceased to scale well and suffered from the “noisy neighbor” problem. As a result, we needed something that would allow for dealing with future events on a large scale.
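The scanned-table approach above can be sketched in a few lines. This is a hypothetical, in-memory stand-in for the regularly scanned HBase tables (the names are illustrative, not Bronto's actual code); it shows why the pattern stops scaling: every scan touches every stored row, whether or not that row is due.

```python
import time

class ScannedWorkQueue:
    """Illustrative sketch of option one: a store of deferred work
    that a background process scans on a regular interval."""

    def __init__(self):
        self._entries = []  # (due_time, payload) rows, unordered

    def defer(self, due_time, payload):
        self._entries.append((due_time, payload))

    def scan(self, now=None):
        """Full scan: return and remove every entry that is now due.

        Cost grows with the total amount of stored work, not with the
        amount that is actually due -- the scaling problem described
        above, before any "noisy neighbor" effects are considered.
        """
        now = time.time() if now is None else now
        due = [p for t, p in self._entries if t <= now]
        self._entries = [(t, p) for t, p in self._entries if t > now]
        return due
```

A scanner process would call `scan()` on a timer and publish whatever comes back; everything not yet due is re-examined on the next pass.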

Bigger on the Inside


While there were numerous examples of distributed “scheduled event” systems in enterprise software, nothing readily available suited our needs, which included:

Storage of terabytes of data.

Rapidly publishing gigabytes of data at a time.

Reasonable use of/addition to our existing virtual infrastructure.

We also preferred something extensible to tailor to our use cases, but general enough to be used by existing and future services. With these priorities in mind, we decided to build something to meet our needs, and we called it Tardis (yet another sci-fi reference we’ve managed to weave into our work).

Building Tardis

Given our requirement for something capable of very high throughput for both inbound and outbound events, we had a great starting point in the codebase of TattleTail, which had already solved the problem of mass data ingestion and sorting through the use of HDFS and MapReduce jobs.

Building on some of TattleTail’s key components, Tardis ingests and stores future events at a rate high enough to keep up with producers and in a manner enabling efficient access when they become due for publishing. Events due in the near term, which would not necessarily be processed quickly enough by the MapReduce jobs, are stored using reliable queuing in Redis combined with a pre-existing delay mechanism for later retrieval.
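The near-term path works because due events are kept in due-time order, so retrieval only touches what is actually ready. A minimal sketch of that access pattern, using an in-memory heap as a stand-in (in Redis, a sorted set gives the same behavior: add members scored by due timestamp, then range-query up to "now"; the class and method names below are illustrative, not Tardis's API):

```python
import heapq
import time

class DelayQueue:
    """Sketch of the near-term delay pattern: events ordered by due time."""

    def __init__(self):
        self._heap = []  # (due_time, event) min-heap ordered by due_time

    def schedule(self, due_time, event):
        heapq.heappush(self._heap, (due_time, event))

    def pop_due(self, now=None):
        """Return all events whose due time has passed, earliest first.

        Unlike a full table scan, this only inspects the head of the
        ordered structure, so cost tracks the number of due events
        rather than the total number stored.
        """
        now = time.time() if now is None else now
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

Contrast this with the scanned-table model: here, events that are not yet due cost nothing at retrieval time, which is what makes the structure suitable for events falling due in the next few moments.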

On the outbound side, we also had an existing solution – the client library for our homegrown distributed message broker – to provide the publishing throughput we would need. Tardis is already able to publish due events at acceptable rates from a single host; we are currently seeing rates around 150K/s, and expect this to grow with some optimizations that we may undertake in the future.

With a lot of additional code for dealing with date-time manipulation, data compaction and reactive data routing, Tardis is now live in production and undergoing parallel integration testing.

Looking Forward

In the future, we hope to expand Tardis’s feature set to include things like event cancellation. All events sent to Tardis are idempotent, so we expect this to be more of a performance enhancement for Tardis itself than for its downstream consumers.
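The idempotency point is worth unpacking: because consumers can safely receive an event more than once (or receive one that should have been cancelled), cancellation only saves Tardis the work of storing and publishing it. One common way to make consumption idempotent is to deduplicate on a stable event id; this is a hypothetical sketch of that technique, not Tardis's actual consumer code:

```python
class IdempotentConsumer:
    """Handles each event id at most once; redundant or stale
    deliveries are safe no-ops."""

    def __init__(self):
        self._seen = set()   # event ids already handled
        self.handled = []    # payloads actually acted upon

    def handle(self, event_id, payload):
        if event_id in self._seen:
            return False  # duplicate delivery: ignore
        self._seen.add(event_id)
        self.handled.append(payload)
        return True
```

With consumers built this way, an uncancelled event that slips through costs a wasted delivery rather than incorrect behavior, which is why cancellation is framed above as an optimization.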

We also plan to start testing the migration of other features to the use of Tardis, such as Workflow delay nodes, and tune performance in preparation for Cyber Monday.

And we’d love to open-source Tardis one day to help others solve similar problems and invite collaborators to improve on our solution.