Testing Complex Systems

By Gigi Sayfan, June 10, 2014

An overview of a layered approach to testing complex systems.

Mock objects have some important limitations:

It is difficult to mock objects that return complex data structures: Consider an object that returns a complex graph of objects that have various dependencies and relationships. It is very tedious, error-prone, and labor-intensive to construct the canned response necessary for mocking.

Mocks can get out of sync with the object they mock: This is particularly nasty in error-handling situations. Suppose a method used to return an error code when a certain operation failed, but the behavior changed to raise an exception. If the object that calls the method is still expecting an error code and the mock object wasn't updated to raise an exception, then the test will still pass. So, the code will be deployed to production and can run for years until this error condition occurs, and then all hell will break loose.
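
The out-of-sync scenario can be sketched with the standard library's unittest.mock. The names PaymentGateway, PaymentError, and checkout are hypothetical and exist only for this illustration; the assumption is that the gateway used to return False on failure and now raises.

```python
from unittest import mock


class PaymentError(Exception):
    """Raised by the current (hypothetical) gateway when a charge fails."""


class PaymentGateway:
    """Hypothetical dependency. It used to return False on failure; now it raises."""

    def charge(self, amount):
        raise PaymentError("card declined")


def checkout(gateway, amount):
    """Caller that was never updated: it still expects an error code."""
    if not gateway.charge(amount):
        return "payment failed"
    return "ok"


def run_with_stale_mock():
    # The mock still encodes the old contract (return False on failure),
    # so the caller looks correct and a test built on it passes.
    stale = mock.Mock(spec=PaymentGateway)
    stale.charge.return_value = False
    return checkout(stale, 10)


def run_with_updated_mock():
    # The mock now matches the real object, which exposes the
    # exception that checkout() does not handle.
    updated = mock.Mock(spec=PaymentGateway)
    updated.charge.side_effect = PaymentError("card declined")
    return checkout(updated, 10)
```

Note that spec=PaymentGateway only checks that the mocked attributes exist on the real class. It cannot detect a change from "return an error code" to "raise an exception," which is why this kind of drift goes unnoticed.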

Testing Web Applications, Services, and APIs

In the Python world, there are many solid and battle-tested Web frameworks. They cover the entire spectrum from a behemoth like Django to a single-file micro-framework like Bottle. Many of them provide extensions for publishing REST APIs. Here is a quick example of tests built on top of the Flask framework and the Flask-RESTful extension.

Fully testing Web applications is often a difficult task. Sophisticated interfaces can often be tested properly only by humans. Web services that return only data (JSON, YAML, CSV) are easier to test. Flask provides good support for controller testing via an internal test client and its development server. You can read more about Flask and check out some tests for a play server I wrote for details and examples.

HTTP is the transport of choice for many service-oriented distributed systems. As such, there are several libraries that capture HTTP-based interactions and can replay them later. The vcrpy library is a good option for Python.

Testing Data Store Code

Any non-trivial system will have a persistence layer. It could be a relational database, a NoSQL database, or some cloud storage, and more than likely it consists of some combination of these. For many reasons (caching, security, flexibility, and testing), it is recommended that you hide your actual data stores behind a data-access layer. If you do this, then you should be able to mock your data store for most unit and integration tests. However, sometimes you do need to exercise your data store code, too. This will happen in system-wide tests, and also when you have a relational database in the mix (or something like MongoDB) and there is a lot of business logic associated with your data model. In these cases, you will want to have a test data store that you can populate with test data (more on how to do that later).

For relational databases, Python has a great option in the form of SQLite, which has shipped with Python since version 2.5. SQLite is an embedded database that's super fast and can even be instantiated in memory. I have used it successfully to test a lot of data-oriented code.
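
As a minimal sketch of the in-memory approach, the standard sqlite3 module can create a throwaway database for each test. The users table and the count_active_users function are made up for this illustration and stand in for whatever data-access code you want to exercise.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical schema, used only for this example.
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, active INTEGER NOT NULL)"
    )


def count_active_users(conn):
    """The data-access function under test."""
    row = conn.execute("SELECT COUNT(*) FROM users WHERE active = 1").fetchone()
    return row[0]


def make_test_db():
    # ":memory:" creates a private database that disappears when the
    # connection is closed, so every test starts from a clean slate.
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.executemany(
        "INSERT INTO users (name, active) VALUES (?, ?)",
        [("alice", 1), ("bob", 0), ("carol", 1)],
    )
    return conn
```

A test would call make_test_db(), pass the connection to the code under test, and close it afterward. Keep in mind that SQLite's SQL dialect differs from that of other databases, so this works best when the queries stick to portable SQL.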

Some parts of advanced systems are not controlled by code, but rather, are data driven. The code consists of some generic workflow, and the actual logic and the specific operations that take place are derived from external data stored in configuration files, databases, or even a remote service. This kind of system, while very flexible and powerful, is harder to maintain and test. Changes in the behavior of the system are not fully reflected by the latest code in source control, but by a combination of code and data.

Again, you have to be diligent here and make sure that the data-driven code behaves the same across all environments. Ideally, you will develop specialized tools to manage this process.

Testing Data-Intensive Systems

Beyond unit and integration tests, there are system tests. Large distributed systems typically store and manage lots of data. To test a complicated system properly, you want your test to run on data that has similar properties to production data (data statistics, usage patterns, value range, etc.). There are two common approaches for generating test data.

Synthetic Data: Artificial data generated by a program that satisfies some desirable properties for the test. It could contain invalid data to see how the system handles it, or some data for rare use cases that don't show up often in the wild. The benefit of synthetic data is that you have full control. The downside is that it's difficult to maintain, and sometimes it's not clear whether it properly reflects production data.

Production Data Snapshot: With a production snapshot (often from a backup), you know you're running your test on real-world data. The pros and cons are exactly the opposite of the synthetic data approach. You get high-fidelity data, but it's hard to control and test special cases.

It is also possible to combine the two approaches and have a base snapshot of production data, then add some synthetic data for specific use cases.
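
The combined approach might look something like the following sketch. The order records, their fields, and the function names are hypothetical. The snapshot is assumed to be an already-loaded list of records, and a seeded random generator keeps the synthetic part reproducible between runs.

```python
import random


def synthetic_orders(count, seed=42, invalid_ratio=0.1):
    """Generate reproducible fake orders, some of them deliberately invalid."""
    rng = random.Random(seed)  # a fixed seed gives the same data on every run
    orders = []
    for i in range(count):
        amount = round(rng.uniform(1.0, 500.0), 2)
        if rng.random() < invalid_ratio:
            amount = -amount  # invalid on purpose, to exercise error handling
        orders.append({"id": f"synthetic-{i}", "amount": amount, "synthetic": True})
    return orders


def build_test_dataset(snapshot, synthetic_count, seed=42):
    """Start from a production snapshot and append synthetic special cases."""
    return list(snapshot) + synthetic_orders(synthetic_count, seed=seed)
```

Tagging the generated records (here with a "synthetic" field and an id prefix) makes it easy to tell the two sources apart when a test fails.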

Dealing with Deployments, Versions, and Upgrades

You're about to roll out a new version of your system, which includes significant changes. You did your part on the testing front from local tests, through integration tests, and all the way to various system tests in the staging environment. But it doesn't matter how good your tests are. Unless you have an exact replica of your production environment for testing, you can't be totally sure what the impact is going to be. Obviously, you can't risk breaking your production environment, corrupting your data, or making it unavailable or too slow. There are several strategies that will let you test the waters.

First and foremost, you should be able to quickly revert any code and configuration changes at the first sign of trouble. But some changes go beyond the code (particularly in heavily data-driven systems). A good strategy in such cases is to do a multi-phase deployment. First, deploy the new changes side-by-side with the existing system. Whatever new data you collect will be stored in both the "old" way and the "new" way. Accessing data will still be done via the "old" way. Now you have new data accumulating both ways, in parallel, and you can verify everything is working well by comparing old to new. Once you are sure everything is OK, you can roll out another change that switches the data access to the new data and move entirely to the "new" way. You still collect data using the "old" way, too, so if something is wrong, you can just roll back the data access change and no data will be lost or corrupted. Finally, once data access through the "new" way has proven stable, you can phase out the "old" way of storing the data (which may involve migration).
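
The phases described above can be sketched with a small wrapper that writes both ways and reads from one side. DualWriteStore is a hypothetical name, and plain dictionaries stand in for the real old and new stores. In a real system the two stores would likely use different formats, and the comparison would have to translate between them.

```python
class DualWriteStore:
    """Hypothetical wrapper around an old and a new store (plain dicts here)."""

    def __init__(self, old_store, new_store, read_from_new=False):
        self.old_store = old_store
        self.new_store = new_store
        # Phase 1: False (read the "old" way). Phase 2: True (read the "new" way).
        self.read_from_new = read_from_new

    def save(self, key, value):
        # Both phases write both ways, so switching reads back loses nothing.
        self.old_store[key] = value
        self.new_store[key] = value

    def load(self, key):
        store = self.new_store if self.read_from_new else self.old_store
        return store[key]

    def mismatched_keys(self):
        """Keys in the new store whose value differs from (or is missing in) the old one."""
        return sorted(
            k for k in self.new_store if self.old_store.get(k) != self.new_store[k]
        )
```

An empty result from mismatched_keys() over a reasonable period is the signal to flip read_from_new. Rolling back is then just a matter of flipping it back.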

Another method for controlled rollout of new changes is to deploy to a small number of servers first. If there is any problem, roll back and fix. This, of course, requires that the distributed system can work with both old and new nodes at the same time.

Separating Production Code from Test Code

Building testable systems is hard. You must design for testability or at least refactor towards testability. You're in bad shape if your code is littered with lots of conditionals like:

    if TEST_MODE:
        do_test_thingy()
    else:
        do_production_thingy()

You may think it's OK because there is just one global switch TEST_MODE and it is easy to make sure it is always False in production. But the reason it is really bad is that it means that do_test_thingy() is callable from your production code and can be executed unconditionally somewhere else. Your goal should be to eliminate as many issues as possible ahead of time.

Note that it may be workable to have a single switch to select a test configuration file (or something similar) to control the entire behavior in one place. It is often a good idea to have the switch be an environment variable that contains the name of the configuration file:

    import os
    config_file = os.environ.get('ACME_CONFIG_FILE', 'config.py')
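
One possible way to act on that switch, assuming the configuration file is a plain Python file of upper-case settings, is sketched below. The load_config function and the upper-case convention are assumptions made for this illustration. Only the ACME_CONFIG_FILE variable and the config.py default come from the line above.

```python
import os
import runpy


def load_config(default="config.py"):
    """Load settings from the file named by ACME_CONFIG_FILE (or the default)."""
    path = os.environ.get("ACME_CONFIG_FILE", default)
    # run_path executes the file and returns its module-level names as a dict.
    namespace = runpy.run_path(path)
    # Keep only upper-case names, a common convention for settings.
    return {name: value for name, value in namespace.items() if name.isupper()}
```

A test run would then set ACME_CONFIG_FILE to the test configuration before starting the system, and the code itself would contain no test-specific branches. Because the file is executed as Python, it should come only from a trusted location.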

Another bad habit is having various test utilities that may be handy for non-test scenarios. If the boundary is vague, people will use a cool test decorator that retries every operation five times in their production code to protect against slow response from some service. Next thing you know, someone decides to arbitrarily change some knob in the retry decorator to make a test pass (because it's only a test decorator), and suddenly your system hangs for 10 minutes when a server is down instead of failing fast. Mixing production and test code is a slippery slope. The relationship should be unidirectional: Tests use production code. If you have shared code that you want to use in both tests and production code, move that code to a separate non-test module or package. Then, it can be used by anyone.
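
As an illustration of shared code that lives outside the tests, here is a sketch of a retry decorator that could sit in an ordinary utility module. The decorator and its parameters are hypothetical. The point is that every knob is explicit at the call site, so a test can pick its own values without touching what production code uses.

```python
import functools
import time


def retry(attempts=3, delay=0.0, exceptions=(Exception,)):
    """Retry the wrapped function up to `attempts` times, then re-raise."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: fail instead of hanging
                    time.sleep(delay)

        return wrapper

    return decorator
```

Production code might use @retry(attempts=2, delay=0.5) around a service call, while a test uses @retry(attempts=5) around a flaky fixture. Neither imports anything from the other.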

The best practice is: Production code never imports any test code. If multiple tests from multiple packages want to share some test helpers, put this shared test code in a separate test helpers package that only tests will import and use.

Conclusion: Be Reasonable

"Be Reasonable." This seems like good advice in general. As you no doubt realize, proper testing is hard. Typically, you won't be able to fully test everything. If you decide to focus on testing and really get as close as possible to the holy grail of a fully tested system, you may find that you're reaching diminishing returns. It may be very difficult to evolve the system and make changes. Exhaustive tests may slow you down. The tests, themselves, may become so complicated that they will contain their own bugs, and when a test fails, you may have a hard time seeing what the test is complaining about whether it really is a failure.

When you have layered code, consider skipping testing each and every layer. Maybe testing from the outside is good enough. Match your testing level to the criticality of the subsystem under test. Take into account other factors like how easy it is to recover from failures and how easy it is to fix them with and without tests.

Even with all this thoughtful reasoning, I can still assure you that you don't test enough.
