Spotting Hidden Dependencies in Incremental Changes

Imagine that you work for a product company that is known for its massive knowledge base of high-quality documentation.

The business is fortunate enough to have very dedicated customers, many of whom also write their own blog posts and articles to share helpful ideas on how to get the most out of your company’s products.

To encourage the ongoing development of community-based learning materials, you’ve been asked to build a public wiki that will be hosted alongside the official knowledge base website.

You would prefer to implement this functionality as a standalone project, but for strategic reasons that haven’t been fully explained, your product manager expects you to integrate the new wiki features into the existing knowledge base. The wiki will live in its own area, but it’ll share the same codebase and infrastructure.

The challenge will be to bring the wiki online without having any negative impact on the existing website. On the surface this seems easy because no old code will need to be modified to support the new features, but deeper issues lurk below the waterline.

There’s no such thing as a standalone feature

You spend a couple days building out a minimal wiki, and the features you build end up looking quite similar to what exists within the knowledge base system. The only major difference between the two tools is that the original system is only used by a handful of trusted administrators, while this new wiki will be editable by anyone who visits your company’s website.

To get some early feedback on your work, you show the wiki to Bill, your product manager. Bill spends three minutes playing around with it, and then turns to you and says, “This looks greaaaat. I’m gonna need you to get this shipped by Friday, OK?”

Rushing this new feature out the door seems like a terrible idea, but you try to make the best you can of a bad situation. You settle down and start to think through what could go wrong once this feature is live in production.

At first glance, it almost feels like there’s not so much to worry about. The wiki lives in its own distinct section of the website, and you avoided modifying any of the code from the original content management system while adding this new feature set. Even if the wiki itself ends up crashing and burning, what is the worst that can happen to the rest of the website?

After a few moments, you notice something worth being concerned about: allowing anyone to create and edit pages without any restrictions is a huge risk from a storage standpoint.

There are many possible attack vectors to be concerned about, ranging from building huge documents to soak up storage capacity, to building tons of tiny documents, to building documents as quickly as possible to overload the storage mechanism itself.

Because both the knowledge base and the wiki use the same storage mechanism, an attack on the wiki could take the knowledge base down along with it. This is an example of an infrastructure-level dependency that isn’t immediately obvious when you’re looking at a newly introduced change to a codebase.

That idea leads you to notice another important point: the same web server hosts both tools. The conversion of Markdown to HTML for the wiki articles is handled in-process, and it’s not especially performant. Someone wishing to disrupt the service wouldn’t even need to wait until storage ran out; processing would grind to a halt as soon as the Markdown converter was overloaded with requests.

In light of these issues, you take a few steps to mitigate risks. They’re nothing fancy, but they should help prevent a catastrophe:

You limit the maximum number of pages to no more than 1,000 documents.

You limit the size of each wiki page to no more than 500 kilobytes.

You move the Markdown processing for the wiki into a work queue, and limit the queue size to 20 pending jobs, raising a “Please try again” error when the queue is overloaded.

You add monitoring to track wiki page creation, deletion, and editing, and you set up alerts for when these events happen more frequently than you’d expect during ordinary operations.

You add availability monitoring for the knowledge base website, pinging it twice per minute to ensure it is still accessible and responding within an acceptable amount of time. This should have been done long ago, but the clear need for improved monitoring makes this a perfect opportunity to add it in.

These measures on their own are not enough to make things completely safe. However, spending an hour to guard against the basic risks that come along with shared infrastructure dependencies is time well spent.
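As a rough sketch of how the first three limits might look in Python: the constant names, the `WikiLimitError` class, and the function signatures are all illustrative assumptions rather than code from the actual system, and the 500-kilobyte cap is interpreted here as 500 × 1,024 bytes.

```python
import queue

MAX_PAGES = 1_000            # cap on the total number of wiki pages
MAX_PAGE_BYTES = 500 * 1024  # cap on a single page (assumes 1 KB = 1,024 bytes)
MAX_PENDING_JOBS = 20        # cap on queued Markdown conversions


class WikiLimitError(Exception):
    """Raised when a request would exceed one of the wiki's safety limits."""


# A bounded queue: put_nowait() raises queue.Full instead of blocking.
render_queue = queue.Queue(maxsize=MAX_PENDING_JOBS)


def check_page_limits(current_page_count, body, is_new_page):
    """Reject writes that would exceed the page-count or page-size limits."""
    if is_new_page and current_page_count >= MAX_PAGES:
        raise WikiLimitError("The wiki has reached its maximum number of pages.")
    if len(body.encode("utf-8")) > MAX_PAGE_BYTES:
        raise WikiLimitError("This page is larger than the maximum allowed size.")


def enqueue_render(page_id):
    """Queue a Markdown conversion, failing fast when the queue is full."""
    try:
        render_queue.put_nowait(page_id)
    except queue.Full:
        raise WikiLimitError("Please try again") from None
```

Failing fast when the queue is full is the important design choice: a burst of wiki edits produces error messages for the people editing, rather than tying up the web server that the knowledge base also depends on.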

Confident that these changes have made your code much less dangerous, you let Bill know that the tool is ready to ship.

If two features share a screen, they depend on each other

A few weeks have gone by and the wiki has survived in production without major headaches.

You were given a new project to work on immediately after the initial version shipped, and so it’s been some time since you’ve had to even think about the wiki. But just this morning, you received an email from Sandi in marketing that will shift your attention back to it for a little while:

Hello programmer friend,

I’m not sure if you’ve been looking at the analytics dashboard for the wiki lately, but we’re definitely seeing some growth in activity.

One thing I noticed when looking at the analytics data is that although we have almost 80 pages in the wiki, most people tend to only visit the articles that are directly linked from our most popular landing pages.

If it wouldn’t be too much trouble, I’d like for you to spend a bit of time working on a new feature that will help customers explore the site.

What I’d like to see is a sidebar that lists the five most popular pages, the five newest pages, the five most recently updated pages, and five randomly selected pages.

We’d like to promote this new feature in our monthly newsletter, which will be sent out in the next couple days. So if you can sneak some time in to work on this before then, that would be excellent.

-Sandi

Adding this new sidebar is a reasonable request, and building it shouldn’t be all that complicated. But as usual, it’s something that you’ve been asked to do in a hurry, and that makes you nervous. Will all this rushed work come back to bite you later?

You could probably tell Sandi that you’d like a little more time to build out the feature, and that wouldn’t create any massive problems for anyone. But before doing that, you decide to do a quick spike and see how far you can get in a single sitting.

Adding the new sidebar won’t require modifying any existing behavior except for the UI for viewing wiki pages. In theory, this seems to be a low-risk change. In practice, you know there’s no such thing.

Looking over Sandi’s request, you realize that listing the five newest pages, the five most recently updated pages, and five random pages will be easy, because all of this information can be pulled with a simple database query. Determining the most popular pages is a more complicated task, so you put it off for now and focus on the low-hanging fruit.
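The three easy lists might boil down to queries along these lines. This is only a sketch using SQLite: the `pages` table, its column names, and the `make_demo_db` helper are assumptions made for illustration, since the story never describes the real schema.

```python
import sqlite3

# Hypothetical schema: pages(title, created_at, updated_at), with ISO 8601
# timestamps stored as text so that they sort chronologically.
SIDEBAR_QUERIES = {
    "newest": "SELECT title FROM pages ORDER BY created_at DESC LIMIT 5",
    "recently_updated": "SELECT title FROM pages ORDER BY updated_at DESC LIMIT 5",
    "random": "SELECT title FROM pages ORDER BY RANDOM() LIMIT 5",
}


def sidebar_lists(conn):
    """Return a dict mapping each sidebar section to a list of page titles."""
    return {
        name: [row[0] for row in conn.execute(sql)]
        for name, sql in SIDEBAR_QUERIES.items()
    }


def make_demo_db():
    """Build a small in-memory database to try the queries against."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (title TEXT, created_at TEXT, updated_at TEXT)")
    rows = [
        (f"Page {i}", f"2024-01-{i:02d}", f"2024-02-{9 - i:02d}")
        for i in range(1, 9)
    ]
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", rows)
    return conn
```

Calling `sidebar_lists(make_demo_db())` returns three lists of five titles each. Sorting the whole table with `ORDER BY RANDOM()` is fine for a wiki with a few dozen pages, though it would deserve a second look on a much larger table.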

You code up these simple queries and dump them into an ugly little sidebar on the right side of the wiki page. It takes about 20 minutes to cobble together, but it looks surprisingly functional. You wrap the whole thing in a feature flipper and make it so that the new sidebar will only be visible to developers. Two minutes later, the feature is live in production and you’re ready to kick the tires.
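A feature flipper does not need to be elaborate. The sketch below is one minimal way to express "visible to developers only"; the flag name, the rule format, and the `feature_enabled` function are hypothetical and are not taken from any particular library.

```python
# Hypothetical flag configuration: each flag lists the groups and the
# individual user IDs that are allowed to see the feature.
FEATURE_FLAGS = {
    "wiki_sidebar": {"groups": {"developers"}, "users": set()},
}


def feature_enabled(flag, user_id, user_groups, flags=FEATURE_FLAGS):
    """Return True if the flag is on for this user; unknown flags are off."""
    rule = flags.get(flag)
    if rule is None:
        return False
    return user_id in rule["users"] or bool(rule["groups"] & set(user_groups))
```

The view code then wraps the sidebar in a single check such as `feature_enabled("wiki_sidebar", user_id, groups)`. Treating unknown flags as off means that a typo in a flag name hides a feature instead of exposing it.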

The first time you visit the wiki, the sidebar looks like it’s working perfectly. It is filled with a list of page links, along with a timestamp that indicates when each page was last updated.

After refreshing the page a couple more times, you hit your first problem: the page completely fails to load, and you get a generic “We’re sorry, but something went wrong” page instead. This seemingly self-contained change managed to break the whole wiki!

You check your email inbox and sure enough, there’s already an exception report waiting to be reviewed. You quickly discover the source of the problem: a handful of old records, created before you started tracking update times, have null values for their “last updated” timestamps.

This wasn’t an issue until a few minutes ago, because those timestamps hadn’t been displayed anywhere in the UI yet. The fix for this issue is easy: use the console to set any null timestamps to the date the wiki was rolled out, and then add a constraint to prevent records from being created with null timestamp fields in the future.
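Here is a sketch of that two-step fix against a SQLite database. The table layout and the rollout date are assumptions for illustration. SQLite cannot add a `NOT NULL` constraint to an existing column, so this version rebuilds the table; many other databases can do the same thing with a single `ALTER TABLE` statement.

```python
import sqlite3


def backfill_and_constrain(conn, rollout_date):
    """Fill in missing timestamps, then forbid NULLs from now on."""
    with conn:  # commits on success, rolls back if any statement fails
        # Step 1: give every legacy record a sensible "last updated" value.
        conn.execute(
            "UPDATE pages SET updated_at = ? WHERE updated_at IS NULL",
            (rollout_date,),
        )
        # Step 2: rebuild the table with a NOT NULL constraint on updated_at.
        conn.execute(
            "CREATE TABLE pages_new ("
            " id INTEGER PRIMARY KEY,"
            " title TEXT NOT NULL,"
            " updated_at TEXT NOT NULL)"
        )
        conn.execute(
            "INSERT INTO pages_new (id, title, updated_at)"
            " SELECT id, title, updated_at FROM pages"
        )
        conn.execute("DROP TABLE pages")
        conn.execute("ALTER TABLE pages_new RENAME TO pages")
```

The order matters: the backfill has to run first, because copying rows that still contain NULL timestamps into the constrained table would fail.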

The lesson to be learned from this failure is that changes to database schemas always require some thought about data consistency. No matter how well isolated components are at the code level, there can still be hidden dependencies at the data layer. This means that a schema update that’s meant to support a feature in one area of the codebase may break other seemingly unrelated features down the line, which is exactly what happened here.

You deploy a quick fix for the timestamp issue and then resume your therapeutic clicking of the browser refresh button. After half a dozen clicks you end up hitting another serious issue, but one that’s very easy to fix.

You initially designed the sidebar to have a flexible width, with the idea that it would be allowed to expand a bit to accommodate longer page titles. But this is a half-baked idea that doesn’t take into account the fact that one of the real wiki pages has the title “How to do something really amazing with the WidgetProFlexinator that you never thought was possible!”

By allowing the sidebar to expand to fit extremely long titles, the page contents themselves are stuffed into a tiny column that’s completely unreadable. This is so silly that it’s laughable, but it also serves as a useful reminder of another subtle dependency: if two features are displayed on the same page, you have to take steps to make sure they don’t interfere with each other.

You set a maximum column width on the sidebar and redeploy. You hit the reload button until you’re fairly confident that you’ve seen every single page in the wiki show up in the sidebar at least once. Things appear to be working fine.

You tweak the feature flipper configuration to enable the sidebar for Sandi’s account, and you send her a quick email to let her know about your progress:

Hi Sandi,

I need to spend some time thinking about the “most popular” list, but we’ve rolled out an experimental sidebar with everything else you asked for. It’s only visible to you and the development team for now, but please try it out and let us know what you think.

-Your humble programmer friend

Within an hour of receiving Sandi’s email, you’ve not only delivered something that she can give feedback on, but you’ve also found and fixed a minor data consistency bug. Feeling satisfied with your progress, you take a break and go out for a walk.

Avoid non-essential real-time data synchronization

When you return to the office, a response from Sandi is already waiting for you:

Hi there!

Functionally, the sidebar looks very close to what we need. Two quick notes, though:

1) The “most popular” list is pretty important, because right now people mostly land on the wiki through organic search or via links to specific pages that get shared on social media. Even though these pages get a lot of visits individually, there currently isn’t anything linking them together and we’d like to fix that.

2) Can you pick any color scheme for the sidebar other than “light brown text on an electric green background?” My own preference would be to match the look and feel of the sidebars from the knowledge base pages, but anything that doesn’t burn the eyes would be an improvement. 😛

Any chance you can take care of these issues and ship by Thursday?

-Sandi

You often make work-in-progress features a bit rough on purpose to prevent others from thinking they’re ready to ship, but she has a point: electric green is a step too far. Before moving on, you take a few minutes to roll out an updated version of the code that replaces the intentionally hideous color scheme with something that looks similar to the knowledge base styling, as Sandi suggested.

You start to think through the popularity ranking feature. To implement it, you’ll need to pull down data from the site’s analytics service. This could be done in real time through a search for the top 5 most visited pages in a specific time period, but this would result in an API call every time a wiki page loaded, which seems pretty wasteful. Even worse, this approach would unnecessarily introduce a hard dependency on an external service.

Your past experience has taught you that external service integrations are often full of headaches, because they can fail in all sorts of weird and unpleasant ways. You have to assume with every service integration that it may be slow to respond, it may reject requests due to rate limiting issues, it may have periods of downtime, it may return empty responses or incorrectly formatted responses, or it may trigger timeout errors. And if none of those things end up happening, it may still find some other way to ruin your day sooner or later.

If there were a genuine need to work with real-time data, you’d have no choice but to invest time and energy into writing robust, fault-tolerant code. But in this case, the popularity ranking would still be reasonably accurate if you simply updated the page visitor counts a few times per day. For that reason, writing a minimal script that will be run as a scheduled job is probably the right approach here.

You write a script that connects to the analytics API, looks up the stats for each page, and then imports the total visitor count for each page into the application’s database. This script will be run by cron every four hours, and if any sort of error occurs or if it fails to complete its task within a reasonable time frame, you’ll be notified. But for the most part, there’s no real consequence to intermittent failures because this code will be running outside of the main application. The worst that can happen upon failure is that the popularity rankings become slightly out of date.

By taking this approach, you’ve reduced the scope of the problem in the application itself to another simple database query, making it no more complicated to implement than the “new pages” and “recently updated pages” features. You’ve also sidestepped the need to add further configuration information or libraries to the main web application, because your script runs in a separate standalone process and only shares information via the database layer.
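The shape of that arrangement can be sketched as follows. Everything here is an assumption made for illustration: the `pages` table with `slug` and `visit_count` columns, and especially `fetch_counts`, which stands in for whatever call the real analytics API provides.

```python
import sqlite3


def import_visit_counts(conn, fetch_counts):
    """Copy visitor counts from the analytics service into the database.

    fetch_counts is a hypothetical callable that returns {slug: visits}.
    Any exception it raises is allowed to propagate, so the scheduled job
    exits with an error and the failure notification fires.
    """
    counts = fetch_counts()
    with conn:  # one transaction: either every count is updated or none is
        conn.executemany(
            "UPDATE pages SET visit_count = ? WHERE slug = ?",
            [(visits, slug) for slug, visits in counts.items()],
        )
    return len(counts)


def most_popular(conn, limit=5):
    """The only part the web application needs: a plain database query."""
    rows = conn.execute(
        "SELECT slug FROM pages ORDER BY visit_count DESC, slug LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows]
```

If the analytics call fails, the database is left untouched and `most_popular` keeps serving the previous counts, which matches the worst case described above: rankings that are slightly out of date.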

Putting all these pieces together takes a couple hours, but by the end of the day you have the functionally complete feature running in production. Sandi takes one more look at it and lets you know that it looks good to her, so you roll it out to a small number of the wiki visitors and check to make sure that nothing bad happens.

After you’re reasonably convinced that the sidebar is working properly, you set aside some time to clean up the code and put it through a proper review before it is officially announced on Thursday. Once that work is done, you roll the change out to everyone. By the start of the following week, Sandi is able to see some interesting changes in the analytics data that indicate that the feature is actually doing what she hoped it would.

Look for problems when code is reused in a new context

It has been three months since you last touched the wiki, and it has been working great for the most part. But today, all of that will change in an instant.

You arrive in the office to find Bill nervously pacing back and forth while talking on the phone. You can only hear one side of the conversation, but it’s obvious that there is something seriously wrong.

“No, of course the wiki isn’t sponsored by an herbal supplement company! We’re not running any sort of advertisements at all.”

“No, super-cheap-pills-for-you.com isn’t a domain the company owns.”

“No, we’re not trying to pull some sort of practical joke, nor are we trying to damage the reputation of the company. I really can’t believe you’re even suggesting that.”

“When did you first get a complaint about this problem? Just this morning? OK, that’s good news. We’ll stop the line and get working on a fix right away.”

Bill ends the call and sits down next to you. He starts trying to explain what is happening, but you’re already one step ahead of him.

“I pulled up the wiki as soon as you mentioned herbal supplements,” you say. “It looks like we’ve got a major issue here: we’re allowing <script> tags and who knows what else in the Markdown files. I’m working on a patch now that will temporarily redirect the entire wiki to a maintenance page, until we can assess the damages.”

As soon as the maintenance page is up, you begin working on a script to detect the presence of HTML tags in the Markdown documents. This will help determine just how many pages have been affected, and what to do about it.

The report reveals that of the 150 pages that currently exist in the wiki, 32 pages use at least some inline HTML. Of those, only 12 use the <script> tag. This could have been a lot worse if the issue hadn’t been caught so quickly.

You generate a comprehensive list of links to match this report, breaking them into three groups: “No HTML,” “HTML without script tags,” and “HTML with script tags.” Bill clicks through the “No HTML” links while you work on the others.
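A detection script of this kind can be quite small. The sketch below is a heuristic written for illustration, not the actual script: the regular expression, the function names, and the idea of passing pages in as a dictionary of slug to Markdown source are all assumptions.

```python
import re

# Matches opening or closing tags such as <div>, </p>, or <script src="...">.
# The lookahead requires whitespace, "/" or ">" right after the tag name, so
# Markdown autolinks like <https://example.com> are not counted as HTML.
TAG_PATTERN = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)(?=[\s/>])[^>]*>")

GROUPS = ("No HTML", "HTML without script tags", "HTML with script tags")


def classify(markdown_source):
    """Return the name of the audit group that a page belongs to."""
    tags = {match.group(1).lower() for match in TAG_PATTERN.finditer(markdown_source)}
    if "script" in tags:
        return "HTML with script tags"
    if tags:
        return "HTML without script tags"
    return "No HTML"


def build_report(pages):
    """Group page slugs by classification; pages maps slug -> Markdown source."""
    report = {group: [] for group in GROUPS}
    for slug, source in sorted(pages.items()):
        report[classify(source)].append(slug)
    return report
```

A regular expression like this will also flag HTML that appears inside code samples. For an audit that is an acceptable trade-off, because a false positive only means that a harmless page gets a manual look.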

Every single page with a <script> tag on it exhibits the same behavior. It shows a modal window that says, “One moment while we redirect you to our sponsor’s website…,” and then it redirects the visitor to super-cheap-pills-for-you.com. This is incredibly irritating, but at least it seems like this is a single incident of abuse rather than a rampant problem.

For all of the documents that are using HTML tags other than <script>, there doesn’t appear to be anything evil going on. Most uses of HTML appear to be from contributors who don’t fully understand the Markdown format and instead stick to using the basic HTML tags that they’re already familiar with. A handful of pages use HTML for more elaborate purposes, like displaying tables or embedding videos from other websites. The embed codes remind you that <iframe> is another tag that could potentially be used for abuse, but at least so far that hasn’t happened.

Bill finishes auditing the Markdown-only documents and doesn’t find any obvious signs of abuse. At this point, the wiki has been rejecting all incoming traffic for about half an hour, but you now have a much better understanding of the problem.

You start working on restoring partial functionality to the wiki to minimize negative customer impact. You first strip the <script> tags from the dozen documents that were infected with them, and then you deploy some code that allows read-only access to the wiki pages. Bill calls the customer support team to notify them about your progress, and for the moment it seems that tensions have eased as a result.

With the immediate crisis averted, you can now start dealing with the underlying cause of the problem: a Markdown processor that might have worked fine for a handful of trusted administrators, but isn’t safe for use by random bots on the Internet.

At its root, this is another hidden dependency issue. You reused a tool that was reasonably configured for one purpose, without considering how that configuration might be harmful when applied in a slightly different context. In doing so, you focused on the superficial similarities of the two use cases rather than their fundamental differences, and that clouded your judgment. This is an example of bad code sharing practices, and it is something to learn from.

Going a bit deeper, the more subtle issue is that by not explicitly disallowing or restricting HTML tags from the start, you implicitly allowed for their use. This undefined behavior led contributors to believe this was an officially supported feature, even though it’s clearly a defect from your perspective.

There is no question that the underlying security risks must be dealt with; it is essential to prevent anonymous visitors from injecting arbitrary JavaScript code into wiki pages. However, you also need to minimize the damaging effects of your repair.

After thinking about the issue, you decide that stripping all HTML tags is not the way to go. Although they represent a small percentage of the total number of pages in the wiki, some of the most popular articles make use of HTML in interesting ways that would be permanently broken by such a coarse-grained change.

You look into HTML sanitizing libraries and eventually find something that’s fit for this purpose: it strips away any <script> tags, restricts <iframe> tags to a whitelist of specific trusted domains, and takes care of other edge cases that might cause issues.
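The parsing and tag stripping are best left to the sanitizing library, but the rule for trusted <iframe> sources is simple enough to sketch on its own. The host names below are hypothetical placeholders, and `is_trusted_iframe_src` is an illustrative helper rather than part of any particular library.

```python
from urllib.parse import urlsplit

# Hypothetical list of trusted embed hosts; the real list would be chosen
# after reviewing which embeds the existing wiki pages actually use.
TRUSTED_IFRAME_HOSTS = {"www.youtube.com", "player.vimeo.com"}


def is_trusted_iframe_src(src):
    """Allow only https URLs whose host exactly matches a trusted entry."""
    try:
        parts = urlsplit(src)
    except ValueError:  # malformed URL
        return False
    return parts.scheme == "https" and parts.hostname in TRUSTED_IFRAME_HOSTS
```

Comparing the parsed host name exactly, instead of checking whether the URL merely contains or starts with a trusted name, keeps out look-alike addresses such as `https://www.youtube.com.evil.example/`.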

To assess the impact of this change on the existing documents, you compare the raw HTML output from the Markdown processor to the sanitized output for each page. Most of the documents are unmodified by the sanitization process, leaving only five pages that need to be manually edited before the new rules can be applied.
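That comparison can be done with a few lines of code. In this sketch, `sanitize` is whatever function the chosen sanitizing library exposes; it is passed in as a parameter because the story does not name the library, and the function names here are assumptions as well.

```python
import difflib


def pages_changed_by_sanitizer(rendered_pages, sanitize):
    """Return the slugs whose rendered HTML is altered by sanitization.

    rendered_pages maps slug -> raw HTML output from the Markdown processor.
    """
    return sorted(
        slug for slug, html in rendered_pages.items() if sanitize(html) != html
    )


def sanitizer_diff(slug, raw_html, sanitized_html):
    """Produce a unified diff showing exactly what the sanitizer removed."""
    return "\n".join(
        difflib.unified_diff(
            raw_html.splitlines(),
            sanitized_html.splitlines(),
            fromfile=f"{slug} (raw)",
            tofile=f"{slug} (sanitized)",
            lineterm="",
        )
    )
```

Running the first function over every page yields the short list that needs manual attention, and the diff for each of those pages shows which tags or attributes would be lost once the new rules go live.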

To make sure that this particular issue never happens again without being noticed, you spend the rest of the afternoon writing tests for all the nefarious examples you can think of. This gives you some amount of satisfaction, but you worry that this won’t be the last case of abuse you’ll ever need to deal with on this wiki project. And that lingering thought keeps you on edge, even as you close up shop for the day.