Menu

Identifying the pain points of service integrations

Imagine that you run an educational journal for programmers that has paying customers but is too small to support full-time work.

Together with your friend Huan, you maintain a custom web application that supports the journal and its subscribers. Built on a tight budget, the project mostly consists of glue code around common open source tools combined with a handful of web service integrations.

Over the years, you’ve started to learn that there are considerable costs and risks involved in using code that isn’t under your own control. You’ve been bitten several times by action at a distance that wasn’t accounted for when designing your software, and consequently you’re now more cautious when integrating with external services.

As part of an annual retrospective, you will meet with Huan today to look back on a few of the biggest pain points you’ve had with third-party software integrations.

You’ve made a promise to each other not to turn this discussion into a game of “Who is to blame?”, a perspective that can easily generate more heat than light. Instead, you will focus on guarding against similar problems in future work, and if possible, preventing them from occurring in the first place.

Plan for trouble when your needs are off the beaten path

“How about we start by talking through the bracket stripping issue?” Huan asks with some hesitation. Immediately, your face turns red.

This issue is truly embarrassing, but it did teach you some important lessons the hard way. Now is as good a time as any to reflect upon it.

What exactly was the bracket stripping issue? It was a mistaken assumption that plain-text emails sent via a third-party newsletter service would be delivered without any modifications to their contents whatsoever.

A handful of test emails and a cursory review of the service’s documentation didn’t raise any red flags. But upon testing the second email you planned to send out, you noticed that all occurrences of the [] character sequence were silently deleted from the message.

This is a subtle problem, one that would have minimal impact (if any) on emails that didn’t contain special characters. But because the emails you were delivering had code samples in them, any sort of text transformations were problematic. The fact that the [] character sequence appeared often in code samples made things much worse.

You contacted customer support to find out if they could fix the problem. Their response, summed up: “Yes, this problem exists, and we can’t fix it easily because this strange behavior is fundamentally baked into our email delivery infrastructure.”

One suggested workaround was inserting a space between the opening and closing brackets, which worked for some use cases, but not all. There were some situations where you really did need to have the exact [] character sequence; otherwise, code samples would fail with hard-to-spot syntax errors whenever someone attempted to copy and run them.

As a result, the newsletter service could not be used for sending full-text emails, so you changed up the plan and decided to host the articles on the Web. This is when Huan joined the project (there was no way you could build out the web application you needed and write articles for the journal at the same time).

Looking back, this kind of rushed decision making could have been prevented.

Had the flaw in the newsletter service been discovered before it was rolled out to paying customers, you wouldn’t have been in such a hurry to find a workaround.

As you try to think through what you could have done differently, you struggle to come up with anything. You let Huan know you’re drawing a blank, and ask her what she thinks could have been done better.

She suggests it might have been a good idea to do an email delivery smoke test: “Take a big corpus of sample articles and run them through the service to make sure they rendered as expected. This would have likely caught the bracket stripping issue, and wouldn’t have cost much to build.”

Her suggestion is a good one, but something about it makes you uneasy. You ask yourself why you didn’t think to do something along those lines in the first place. That gets the gears turning in your head, and you start to see the flawed thought process that got you stuck in this mess to begin with:

You: A smoke test would definitely have helped, but it never occurred to me that it was necessary. And that was the real problem.

In theory, we should be approaching every third-party system with distrust until it is proven to be reliable. In practice, time and money constraints often cause us to drive faster than our headlights can see.

Huan: So you’re saying that you ran into problems because you rushed into using a particular solution without carefully evaluating it?

You: Well, it’s pretty clear that I was looking for a shortcut. So I chose a newsletter service based on its reputation. This sounds like a reasonable way of evaluating options quickly, but it isn’t.

Huan: But if the service was so popular, doesn’t that count for something? I feel like I could have just as easily made the same assumption.

You: Do you know that great burger shop down the street from here? The one everyone is always raving about as if it’s the best place ever?

Huan: SuperFunBurger! I love SuperFunBurger. Where are you going with this?

You: What do you think about their fish sandwiches?

Huan: Dunno. Never had them. Didn’t even know they were on the menu, to be honest. Everyone always goes there for the burgers.

Because Huan is used to your way of talking in cheesy riddles, she has no trouble deciphering your main point: in the case of this particular newsletter service, you ordered a fish sandwich, not a burger.

With this issue laid out plainly on the table, you spend a few minutes talking through what could have been done differently:

You could have searched around more to see if others were using the service in the way you planned to use it. It probably would have been hard to find similar examples to your own use case, and that alone would have triggered some warning bells that would have caused you to slow down a bit.

It wasn’t safe to treat this uncommon use case as if it would have just worked. Instead, it would have been better to treat it with the same level of uncertainty as any other source of unknowns. Noticing the risks would have caused you to write more comprehensive tests, and might have led you to consider running a private beta for at least a few weeks before allowing open signups.

The other critical question that was never asked was “what do we do if this service doesn’t work as expected?”—a question that should probably be asked about any important dependency in a software system. Even if you treated it as a pure thought experiment rather than as a way of developing a solid backup plan, asking that question would have left you less surprised when things did fail, and better prepared to respond to the problem.

Remember that external services might change or die

For your next case study, you talk about the time the site’s login system suddenly stopped working, which caught both you and Huan by surprise.

In a weird way, this story relates to the email newsletter issue that you just worked through. The original plan was to deliver articles directly into a subscriber’s inbox so that they could begin reading right away without any extra steps required. Changing this plan complicated things a bit.

Because the problems with the newsletter service caused you to move the articles into a web application—and because you were publishing materials that were meant to be shared with members only—you needed to implement an authentication scheme.

Forcing subscribers to remember a username and password would have been a bad experience, so you opted to use an authentication provider that most subscribers already used daily. This made it possible to share members-only links without subscribers having to remember yet another set of login credentials.

Setting up this feature was painless, and after initial development, you never needed to think about it at all. That is, until the day you tried to send out the 35th issue of your publication and received dozens of email alerts from the application’s exception reporting system.

You were able to get a quick fix out within an hour of seeing the first failure, but your immediate response was predictably incomplete. You then implemented a more comprehensive fix, but that patch introduced a subtle error, which you only found out about a few days later.

The technical reasons for this failure aren’t especially interesting, but you suspect that identifying the weak spots that allowed the problem to happen in the first place will generate some useful insights. To uncover some answers about that, you suggest playing a game of Five Whys.

Huan agrees to play the role of investigator and begins her line of questioning:

Why did the authentication system suddenly fail?

The application depended on a library that used an old version of the authentication provider’s API, which was eventually discontinued. As soon as the provider turned off that API, the login feature was completely broken.

Why were you using an outdated client library?

Authentication was one of the first features implemented in the web application and it worked without any complications. From there forward, that code was totally invisible in day-to-day development work.

No one ever considered the possibility that the underlying API would be discontinued, let alone without getting some sort of explicit notification in advance.

Why did you assume the API would never be discontinued?

The client library worked fine at the time it was integrated, and no one raised concerns about the implementation details or the policies of the service it wrapped.

When this feature was developed, no careful thought was given to the differences between integrating with a third-party library and a third-party web service.

An outdated third-party library will continue to work forever—as long as incompatible changes aren’t introduced into a codebase and its supporting infrastructure. Because of the limited budget, the maintenance policy for the project was to only update libraries when it was absolutely necessary to do so (for things like security patches and other major problems).

A web service is an entirely different sort of dependency. Because a service dependency necessarily involves interaction with a remote system that isn’t under your control, it can potentially change or be discontinued at any time. This wasn’t considered in the project’s maintenance plan.

To the extent this issue was given any thought at all, it was assumed that due to the popularity of the service providing the authentication API, users would be notified if the provider ever decided to make a breaking change or discontinue their APIs.

Why did you assume the API provider would notify its users?

In retrospect, it’s obvious that every company sets its own policies, and unless an explanation of how service changes will be communicated is clearly documented, it isn’t safe to assume you’ll be emailed about breaking changes.

That said, there was another complication that obfuscated things. The client library used in the application wasn’t maintained by the authentication service provider themselves, but instead was built by a third party. It was already an API version behind at the time it was integrated, and from the service provider’s perspective, the tool was a legacy client.

The company that provided the authentication service also had an awkward way of announcing changes, which consisted of a handful of blog posts mixed in with hundreds of others on unrelated topics, and a Twitter account that was only created long after several of their APIs were deprecated. If we had researched how to get notified up front, we would have been better prepared to handle service changes before they could negatively impact our customers.

Why didn’t you research how to get notified about service changes as soon as you built the authentication feature?

This feature was built immediately after the problematic integration with the email newsletter service, and it was thrown together in a hurry as an alternative way to get articles out to subscribers without delays in the publication schedule.

There were hundreds of decisions to make, and none of the work being done under that pressure was carefully considered. In a more relaxed setting, it would have been easy to learn a lesson from the newsletter service problems: third-party systems are not inherently trustworthy, even if they’re popular.

But in that moment, there was still some degree of confidence that third-party services would work without problems, now and forever. And the only real explanation for that line of thinking is a lack of practical experience with service-related failures.

After completing her investigation, Huan concludes that it would be a good idea to audit all of your active projects to see what services they depend upon, and figure out how to be notified about changes for each of them. The two of you agree to set aside some time to work on that soon and continue on with your discussion.

Look for outdated mocks in tests when services change

Although the formal “Five Whys” exercise dug up some interesting points, it didn’t tell the complete story of what went wrong and why. After continued discussion around the topic, you realize that—once again—flawed testing strategies were partially to blame for your problems.

It was easy to spot both the immediate cause of the authentication failure and the solution, because you were just one of many people who had been relying on the deprecated APIs. A web search revealed that an upgrade to your API client was needed—a seemingly straightforward fix.

Knowing it wouldn’t be safe to update an important library without some test coverage, you spot-checked the application’s acceptance test suite and saw that it had decent coverage around authentication—both for success and failure cases.

Feeling encouraged by the presence of those tests, and seeing that the suite was still passing after the library was updated—you assumed that you had gotten lucky and that no code changes were needed when upgrading the client library. Manually testing first in development and then in production gave you even more confidence that things were running smoothly.

A couple days later, an exception report related to the authentication service proved you wrong. It was clear the tests had missed something important. Or at least, that’s the imaginary version of the story you somehow managed to remember in detail many months after it happened.

* * *

You dig through the project’s commit logs to confirm your story, and sure enough, find an update to the mock objects in response to the final round of failures. You mention this to Huan, who doesn’t appear surprised.

“That’s absolutely right,” she says. “We should have had some sort of live test against the real API. Because we didn’t, our tests were never doing as much for us as we thought they were. And if we had a live test that ran before each announcement email, we might have caught this issue before it impacted our customers.”

Your gut reaction tells you that Huan is probably right, but hindsight is 20/20 and testing isn’t cheap.

That said, it would have paid to dig deeper when you spot-checked the test coverage. The mock objects were wired up at a low level, so the acceptance tests looked identical with or without mocks. Because of this, it was easy to forget they were even there.

After upgrading the library and seeing no test failures, and then checking that things were working as expected via manual testing, you assumed that you’d be in the clear. The automated tests would catch any unexpected changes in the client library’s interface, and the manual test would verify that the service itself was working.

For the common use case of subscriber logins, this approach toward testing worked fine, meaning the first fix you rolled out after upgrading the client library did solve the problem for existing customers. It took a few more days to notice that the fix was incomplete; when a new customer tried to sign up, the authentication system kicked up an error.

This was a subtle problem. To complete authentication for an existing subscriber, all that was needed was the unique identifier that corresponded to user records in the database. When creating new accounts, however, you also needed to access an email address provided in the service’s response data.

Upon the upgrade, the data schema had changed, but only for the detailed metadata about the user; the identifier was still in the same place with the same name. For this reason, active subscribers were able to sign in just fine, but new signups were broken until the code was updated to use the new data schema.

Turning this over a few times in your head, you can’t help but acknowledge Huan’s point about the importance of live testing:

You: You’re absolutely right; testing the real authentication service would have helped. But I also wish that I had been more careful when auditing our test coverage.

Huan: What would you have done differently?

You: I wouldn’t just look at the tests; I’d also check the code supporting them.

This would have led me to the configuration file where we mocked out the response data from the authentication service. And if I saw that file, I would have probably wondered whether or not it needed to be modified when we upgraded.

Huan: Sure, that’s a good idea, too. In the future, let’s try to both catch these issues in code review by looking for outdated mocks whenever a service dependency changes, and also having at least minimal automated tests running against the services themselves on a regular schedule.

You: Sounds like a plan!

Expect maintenance headaches from poorly coded robots

For the final discussion topic of the day, Huan suggests discussing something that initially appears to be a bit of a tangent: a situation where a web crawler was triggering hundreds of email alerts within the span of a few minutes.

As you think about the problem, you begin to see why she suggested it: on the open Internet, you don’t need to just worry about your own integrations, it’s also essential to pay attention to the uninvited guests who integrate with you.

You do a quick archaeological dig through old emails, tickets, and commits to reconstruct a rough sequence of events related to the problem:

An initial flood of exception reports hit your inbox at 3:41 AM on a Wednesday. You happened to be awake, and immediately blacklisted the IP address of the crawler as a first step. Once things seemed to have settled down, you sent Huan an email asking her to investigate, and headed off to bed.

When you woke up the next morning, Huan had already discovered the source of the problem and applied a two-line patch that appeared to resolve the issue. An incorrectly implemented query method was causing exceptions to be raised rather than allowing the server to respond gracefully with a 404 status code. This code path would be impossible to reach through ordinary use of the application; it was the result of a very strange request being made by the poorly coded web crawler that was hammering the server.

At 4:16 AM on Friday, you saw a similar flood of exception reports. The specific error and IP address had changed, but the strange requests were an exact match to what you had seen on Wednesday. Although the crawler made hundreds of requests within the span of two minutes, it stopped after that. This is cold comfort, but it could have been worse.

Toward the middle of the day on Friday, you pinged Huan for a status update but never heard back. By then, you had noticed that the problem most likely had come from a minor refactoring she had done after solving the original problem (which you hadn’t even reviewed before the second email flood happened).

At 6:24 AM on Sunday, the bot crawled the site for a third time, triggering the same flood of exceptions that you saw on Friday. At this point, you looked into the issue yourself and made a small fix. You also added a regression test that was directly based on kinds of requests the crawler was making, to ensure that the behavior wouldn’t accidentally break in the future.

On Monday, you traded some emails that clarified what went wrong, why it happened, and also what issues lurked deep beneath the surface that made this problem far more dangerous than it may have first appeared.

Reminiscing about this chain of events is uncomfortable. Seeing a problem happen not once, not twice, but three times is embarrassing, to say the least. It stings a bit to look back on these issues even though many months have passed.

You: I’m sorry for how I handled this one.

These emails make it sound like you messed up, but it’s clear that my failure to communicate was mostly to blame.

Huan: Maybe that was part of it, but I could have been a bit more careful here, too.

I never thought about how the risks to the email delivery service tied into all of this, I just thought you were annoyed to keep getting spammed with alerts.

Your initial response was focused on the surface-level issue: your inbox was getting bombarded with lots of unhelpful email alerts, and that was a nuisance. These errors weren’t even real failures that needed attention; they were the result of a bot doing things that no human would ever think to do.

Taking a higher-level view of the problem, what you had was an exposed endpoint in the application that could trigger the delivery of an unlimited amount of emails. To make matters worse, the email service you were using had a limited amount of send credits available per month, and each of these alert blasts was eating up those credits.

Upon closer investigation, you also noticed that the exception reporting mechanism was using the same email delivery service as the customer notification system. This was a bad design choice, which meant the potential for user impact as a result of this issue was significant.

Left untouched for a couple weeks, your monthly send credits probably would have been exhausted by these email floods. However, it’s likely that you wouldn’t have had to wait that long, because the system was delivering emails just as fast as the bot could make its requests. It is entirely possible that the service provider could have throttled or rejected requests as a result of repeatedly exceeding their rate limits.

This is yet another problem that could be chalked up in part to inexperience: the two of you had seen crawlers do some weird stuff, but neither of you had ever seen them negatively impact service before.

What your email history shows is that at some point you became aware of the potential problems but didn’t explicitly communicate them to Huan. That wasn’t obvious at the time; it only became clear after doing this retrospective today. This serves as a painful but helpful reminder that clear communication is what makes or breaks response efforts in emergency situations.

Remember that there are no purely internal concerns

You spend a few moments reviewing the retrospective notes from the email you wrote many months ago. It includes many useful thoughts that you and Huan have internalized by now:

It’s important to add regression tests for all discovered defects, no matter how small they seem.

It’s important to check for mock objects in test configurations so that you won’t be lulled into a false sense of safety when a test suite passes even though the mocked-out API client is no longer working correctly.

There are risks involved in sharing an email delivery mechanism between an exception reporting system and a customer notification system.

It’s worth looking into using a better exception reporter that rolls up similar failures rather than sending out an alert for each one.

These are all good ideas. It’s a sign of progress that these bits of advice now seem obvious and have been put in place in most of your recent projects. But why weren’t those issues considered in the first place? Why did it take a painful failure to draw attention to them?

You: I think that one common thread behind all of this stuff is that we had a well-intentioned but misguided maintenance process.

Huan: How so?

You: Well, I think the spirit of what we were trying to do was reasonable. We had agreed to treat customer-facing issues as a very high priority no matter how minor they were, and then freed up capacity to do that by underinvesting in some of the internal quality and stability issues.

Huan: So what’s wrong with that? Isn’t that what we still do? I thought this is one of those things you like to write about as if it were a good thing.

You pause for a moment. Then you finally find a way to put into words something that has been banging around in your head throughout this entire retrospective:

“Maybe there’s no such thing as a purely internal concern. Maybe as long as our code talks to and interacts with the outside world, there will always be potential for customer impact when things aren’t working as expected. If we pay more attention to what is happening at the boundaries of our system, and treat any issue that happens there as one worthy of careful attention, we’d probably get a better result.”

The room falls silent for the first time in an hour. You let the dust settle for a few moments; then you and Huan head over to SuperFunBurger to try some of their wonderful fish sandwiches.