Month: November 2012

Quality Assurance does not stop after the software receives the “thumbs up” from the QA team. QA must continue while the product is Live! … because QA is not perfect, and real users only exist on a Production system. We need to be humble and accept that our design, development and quality processes will not catch all the issues. Consequently, we must equip ourselves with tools that will allow us to catch these problems in Production as early as possible … rather than “wait for the phone to ring”

When the product exits QA, it simply means that we have we’ve run out of ideas on how to make the system fail. Unfortunately, this does not imply that the system, once in Production, will not fail. If we are successful and get a high volume of traffic, the simple law of large numbers guarantees that our users will find yet-never-thought-of ways to – unintentionally – make the system fail. These are part of the “unknown unknowns” as Mr. Donald Rumsfeld would say. Deploying the product on the production servers, and handing-off (abdicating?) the responsibility to keeping it up to the Ops team shows wishful thinking or naïveté, or both.

Why QA must continue in Production

There are a few categories of issues that one needs to anticipate in Production:

Functional defects: in essence, bugs that neither developers, nor QA caught – while this is the obvious category that comes to mind, it is far from being the only source of issues

User experience (UX) defects: Product works “as spec’d”, but users either can’t figure how to make the product work, or don’t like it. A typical example is a high abandon rate in a purchasing experience, or any kind of work flow, or a feature that’s never used, a button that’s never clicked.
This is not reserved to new products, by improving the layout of a given page, we may have broken another feature on that same page

Performance issues: while we may have run performance, and load tests, in our QA environments, the real world always offers surprises. Furthermore, if we are lucky enough to have the kind of traffic that Google or Facebook have, there is no other way but to test and fine-tune performance in production
Running tests on non-production systems requires to not only simulate the load of the system, but also to simulate the “weight” of existing data (e.g. in database, file system) as well as longevity to ensure that there is no resource leak (memory, threads, etc)

Operational issues: while all cloud applications are typically clustered for high-availability, there are other sources of failure than equipment failure:

External resources, such as partners, data feeds, can fail, or have bugs of their own, or simply not keep up their response time. Sometimes, the partner updates the API without notification.

User-provided data can be mal-formed, or in an unexpected format, or a new data format can be introduced after the launch of the product

System resources can be consumed at an unexpected rate. Databases are notorious for having non-linear response times based on load: as long as the load is under a given threshold response time is high, but once the load exceeds this threshold response time can deteriorate very rapidly.

A couple of examples:

At my previous company, weeks after the product had been launched, we started receiving occasional complaints that some of the user-created videos were not showing up in their timeline. After (reluctantly) poking around in our log files, we did find out that about 10% of the videos that had been uploaded to our site for the past 2 weeks (but not earlier) were not processed properly. Our transcoder simply failed. Worse, it failed silently. The root cause was a minor modification to the video format introduced by Apple after our product was released. Since this failure was occurring for a small fraction of our users, and we had no “operational instrumentation” in our code, it took us a long time to even become aware of it.

Recently, we launched a product that exchanges data with our partner. Their API is well documented, and we tested our product in their sandbox environment, as well as their production environment. However, after launch, we had reports of occasional failures. It turns out that users on our partner’s site were modifying the data in ways that we did not expect, and causing the API to return error codes that we had never seen. Our code duly logged this problem each time it occurred in our log files … among the thousands of other log events generated every minute

Performing QA on Production Systems

As I mentioned, the Google and Facebook of the world, do a lot (if not most) of their QA on Production systems. Because they run hundreds of thousands of servers, they can use a small subset to run tests will live user data. This is clearly a fantastic option.

Similarly, “A/B comparisons” techniques are typically used in Marketing to compare 2 different user experiences, where the outcome (e.g. a purchase) can be measured. The same technique can be applied in testing, e.g. to validate that a fix of an intermittent bug difficult to reproduce does work.

To detect failures, or QoS degradations, with external causes (e.g. partner API times out a lot)

To monitor resource utilization for each service or application – at a finer grain than provided by Operations monitoring tools which are typically at the server level.

The point is that if a user can’t buy a book on our website because our servers crash under load – this is a bug. While the crash is not due to code written incorrectly, it is due to the absence of code warning us that the system was running out of steam … this is still a bug.

In order to monitor quality in Production, we need to:

Clean up the code that writes to log files: eliminate all logging used for code testing, or statements such as “the code should never reach here”. Instead, write messages that will be meaningful to the poor soul who, a few weeks later, will be poring over megabytes of log files on a Sunday night trying to figure out why the system crashed

Use a log aggregation system, like GrayLog2 (open source), so log files from multiple nodes in the same cluster, as well as nodes from different services can (a) be searched from a console and (b) viewed, time-aligned, on a single page (critical for troubleshooting). GrayLog2 can handle hundreds of millions of log events and terabytes of data.

MEASURE: establish a base line for response time, resources consumption, errors – and trigger alerts when the metrics deviate from the baseline beyond a predetermined threshold

Track that core functions – from a user perspective – complete, and log when, and ideally, why, they fail along with key parameters. E.g.: are users able to upload files to our system, are failures related to file size, time of day, location of user, etc?

Log UX and operationally meaningful events to track how users actually use the system, what features are most used and track them over time. These metrics are critical for the Product Management team

Monitor resource utilization and correlate with usage patterns. Quantify key usage parameters in order to scale the right resources in advance of the demand. For example, as traffic grows, the media server and the database servers may grow at the different rates.

Integrate alarms from application errors into the Ops monitoring tools: e.g. too many “can’t connect” errors should trigger an Ops alert that our partner is down – slow response time on a single server in a cluster may indicate the disk is failing

Quality is not a one-time event, it is an everyday activity, because users change their behaviors, partners change their APIs, systems get full and slow down. What used to work yesterday, may not work today, or no longer be good enough for our customers. As a consequence, the concept the “test driven” development must be extended to the Production systems, and our code must be instrumented to provide metrics that confirm that everything works as desired, and alerts when they don’t. But that’s not sufficient, developers and QA engineers must also take the time to look at the data, not just when a fire drill has been called, but also on a regular basis to understand how the system is being used, and how resources are consumed as the system scales, and apply this knowledge to subsequent releases.