Ghosts in the Pipes - Diagnosing and Fixing a 0.005% Discrepancy

As a result of frequently testing in production, this year we successfully completed work to migrate parts of our data-ingestion pipeline to a new architecture. The way in which we implemented this new pipeline exposed an interesting (and ephemeral) bug which I’m going to talk about in this post.

Verify Branch-by-Abstraction

When we’re replacing a system, one of the techniques in our playbook is to run the new system in parallel to the old for a period of time. We multiplex the original input into the new system - this lets us compare both performance and correctness, and once we’re satisfied with both we can perform the switchover. You can read more about Verify Branch-by-Abstraction here.

During this period of work, we were happy with the performance gains but we were identifying a daily discrepancy of around 0.005% globally. It only manifested in high-volume events (such as ad impressions), and there was no perceivable pattern to either the amount that vanished or the time that it happened. We couldn’t write this off as “one of those things” in case the error was systemic and would propagate errors out to financial systems - causing actual monetary losses.

The nature of the pipeline made it easy to isolate where the discrepancy was occurring - the old monolithic application was occasionally reporting processing batches with more events than the new pipeline was reporting, so there must be some weirdness with the inputs into these two separate components.

We examined the (archived) inputs to both pipelines and counted the events, both of which agreed with the old system. We ran the new application on the same input and … it also agreed with the old system. Huh.

Catching it in the Act

We couldn’t begin to fix the problem if we couldn’t prove it was actually happening, let alone why. After a bit of thought, we reasoned that archiving the input may be tampering with the “evidence” - our attempt at exposing the glitch was to move the inputs to a different part of the system and look at those instead.

When our alerting signalled another discrepancy, we looked at the input, ran the new application … and it still agreed with the old system. Despite everything we were being told by our reports, both systems gave the same output and were being given the same input locally, but apparently not in production. What was going on?

The next attempt was the addition of metadata to the IPC format we use - each event additionally recorded which batch of inputs it came from. Once we’d taken a look at the outputs for both processes, our suspicions were confirmed by the new pipeline’s output containing fewer events from batches with discrepancies.

Gotcha.

The Difference between Copy and Move

We took a step back and examined how we were multiplexing our data. The original pipeline moved input batches into a directory for ingestion, and we copied those inputs into a different directory for ingestion. In a moment of clarity, we perceived what was going wrong.

Òn the same device, mv uses rename(), which is an atomic operation. But copying does not have the same guaranteed atomicity - it would be possible for the input to be incompletely copied before the new pipeline began ingesting it. If the input batch had a significant portion of high-volume events (inputs are batched on time, not size), then the window for such a failure to occur would be greatly increased.

The old pipeline used mv for ingestion, and the new pipeline used cp to copy the same input.

We deployed a 5 minute fix (copy the input to a staging location and move it to the new pipeline ingestion when it was done), which eliminated the discrepancy entirely. Both systems were now reporting identical outputs.

Conclusions

We were pleased with the way that our new system turned out, in addition to learning a valuable lesson about multiplexing data across systems in future. Most importantly, being able to trace an event’s journey through the entire system - from ad unit to reporting database - made zeroing in on where the failure was occurring nearly trivial, so we were able to concentrate our efforts and mitigate the issue.