Performance Testing levels of depth

Having inherited a huge mass of performance tests that were developed by many different teams over more than 5 years, one of the first things I had to do was try to figure out what each test did and make some sense of them. As I cataloged each one, exploring what they did and what they were intending to measure, they began to sort themselves out into three main categories based on the ‘depth’ of testing: Micro-Benchmark, Feature/Module, and Product/Real-World tests. These borders are a bit fuzzy but the various levels of testing have some important distinctions.

These levels of depth build upon each other and should generally be executed and evaluated in this order. One hopes that a problem with the creation of your base data object will be quicker to identify, fix, and confirm fixed via a simple Micro-Benchmark level test instead of noticing that nearly all the Product level tests have slowed down.

Micro-benchmark level tests
Micro-benchmark tests are designed to test small sections of code at a very basic level. The most common type of this test runs a small bit of code in a tight loop that repeats many times and measures the elapsed time. These tests should be very focused on a specific aspect of the software and stripped down to avoid other factors influencing the results. Some examples of this sort of test are math functions, object creation, file I/O, database interaction, text/image drawing, screen transitions, etc. We repeat the operation many times so any change in behavior is magnified.
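The tight-loop pattern can be sketched in a few lines. This is a minimal illustration, not a robust harness — `createPoint` is a hypothetical stand-in for whatever small unit of code you want to measure, and a real micro-benchmark would also account for warm-up and JIT effects:

```javascript
// Hypothetical unit under test: a tiny object-creation function.
function createPoint(x, y) {
  return { x, y };
}

// Run the operation in a tight loop and measure elapsed time (Node.js).
function microBenchmark(fn, iterations) {
  const start = process.hrtime.bigint();   // high-resolution timer
  for (let i = 0; i < iterations; i++) {
    fn(i, i + 1);
  }
  const end = process.hrtime.bigint();
  return Number(end - start) / 1e6;        // elapsed milliseconds
}

const elapsedMs = microBenchmark(createPoint, 100000);
console.log(`100,000 object creations took ${elapsedMs.toFixed(2)} ms`);
```

Repeating the operation 100,000 times magnifies any change in behavior: a regression of a few nanoseconds per call becomes a visible shift in the total.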

I’ve labelled these “micro-benchmark” rather than “benchmark” for a couple of reasons. First, to avoid confusion with the general use of the word “benchmark” in testing. Second, to draw attention to the focus on keeping the part tested here very small. Many of the tests one may encounter in a “benchmark” suite (e.g. Octane 2.0) are what I’m calling “micro-benchmark” tests.

These tests are developed with a firm understanding of the underlying code. They may be developed in coordination with the code developer and may even start out as a development tool that an engineer runs while figuring out the best way to implement a feature. Micro-benchmark tests of this sort are very useful for pointing out differences in performance between implementations and between underlying architectures.

Because the measurements made by micro-benchmark tests should be very focused, any change in performance here should be easy to trace back to a cause. They also may provide some protection of the base units of code from later changes.

While most often concerned with time-based measurements (“How many ms does it take to generate 100,000 objects?”), this type of test may also focus on memory use.
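A memory-focused micro-benchmark follows the same shape but reads heap usage instead of a clock. A rough sketch in Node.js (the object shape here is illustrative; real readings are noisy, and running with `--expose-gc` and forcing a collection before each reading gives steadier numbers):

```javascript
// Measure approximate heap growth from creating `count` objects (Node.js).
function measureHeapGrowth(makeObject, count) {
  const before = process.memoryUsage().heapUsed;
  const keep = [];                          // hold references so objects survive
  for (let i = 0; i < count; i++) {
    keep.push(makeObject(i));
  }
  const after = process.memoryUsage().heapUsed;
  return { bytes: after - before, retained: keep.length };
}

const result = measureHeapGrowth((i) => ({ id: i, label: `obj-${i}` }), 100000);
console.log(`~${(result.bytes / 1024 / 1024).toFixed(1)} MB for ${result.retained} objects`);
```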

Feature/Module level tests
Feature/Module level tests are designed to test the performance of specific features and modules in isolation. These types of performance tests are probably best built as part of feature/module development, and the existence of the tests should be part of the definition of feature complete and/or your definition of “done”. These tests do not require as much understanding of the underlying code as a micro-benchmark test — you don’t need to understand the code which anti-aliases text or renders the video in order to measure the fps of a scrolling text box or of a video.

While it is important to measure, understand, and prove performance of a new feature during development, these tests are often considered “protective” once the feature/module is complete. If someone later fixes a bug in the feature/module without fully understanding the code, or changes underlying code the feature/module depends upon, a performance change can be caught.

These tests should try to focus on the feature/module so any significant performance change can be attributed to the feature/module. A test of basic video playback should still try to isolate video from other factors such as text rendering, network performance, etc. that may cause a test to report a drop that isn’t caused by the video playback.

Because these are more involved tests, the performance variance by platform and over time may be much wider and require more thought than an ‘all-out’ micro-benchmark test. A ‘bouncing balls’ type test written for an average desktop browser could perform too poorly to be useful on a mobile device, but also fail to tax the system of a new high-end desktop in a couple of years. There are various ways to try to anticipate this: different ball counts per platform, stepping up the ball count and taking readings at each step, increasing the number of balls until performance degrades below a certain point, etc.
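The “step up the load until performance degrades” approach can be sketched as a simple loop. Everything here is hypothetical — `simulateFps` stands in for a real hook into your rendering test, and the thresholds are placeholders:

```javascript
// Stand-in for a real fps measurement of the 'bouncing balls' scene.
// Pretend the system sustains 60 fps until the workload saturates it.
function simulateFps(ballCount) {
  return Math.min(60, 6000 / ballCount);
}

// Increase the ball count, taking a reading at each step, until fps
// drops below the chosen threshold.
function findDegradationPoint(measureFps, { start = 50, step = 50, minFps = 30 } = {}) {
  const readings = [];
  let balls = start;
  while (true) {
    const fps = measureFps(balls);
    readings.push({ balls, fps });   // keep a reading at each step
    if (fps < minFps) break;         // degraded below threshold: stop
    balls += step;
  }
  return readings;
}

const readings = findDegradationPoint(simulateFps);
console.log(readings[readings.length - 1]); // the first step below 30 fps
```

Recording a reading at every step, rather than only the breaking point, lets you compare the whole curve across platforms and over time.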

This sort of test may measure time (“How long does this transaction take?”) across a broader scope than the micro-benchmark’s tight loop, and may measure different parameters such as frames per second (fps), memory use, load on the system, etc.
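Deriving an fps figure from frame timestamps is straightforward. In this sketch the timestamps are synthetic so the example is self-contained; in a browser test they would come from `requestAnimationFrame` callbacks:

```javascript
// Average fps over a series of frame timestamps (in milliseconds).
function averageFps(frameTimestampsMs) {
  if (frameTimestampsMs.length < 2) return 0;
  const elapsed =
    frameTimestampsMs[frameTimestampsMs.length - 1] - frameTimestampsMs[0];
  return ((frameTimestampsMs.length - 1) / elapsed) * 1000;
}

// 61 synthetic frames spaced ~16.67 ms apart, i.e. a steady 60 fps.
const frames = Array.from({ length: 61 }, (_, i) => i * (1000 / 60));
console.log(averageFps(frames).toFixed(1)); // "60.0"
```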

Product level tests
The Product level tests are designed to replicate real-world use cases. These may focus on a certain area (e.g. a specific graphics pipeline, account creation) but are intended to measure how everything behaves together. These may be tests run against a literally live system, or they may be a more canned and controlled simulation which replicates expected real-world use cases. Though we want to test something akin to a real-world situation, it is still helpful to remove as much noise as possible. When creating a product level test for a helicopter attack game, you will get more reliable, consistent, useful data if you can remove causes of randomness: exclude download time when extraneous, hardcode the random seed or otherwise hardwire the enemy movements, program the player’s moves (or at least require no interaction), etc.
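Hardcoding the random seed can be as simple as swapping `Math.random` for a seeded PRNG. A sketch using the well-known mulberry32 generator (the “enemy path” here is an invented illustration):

```javascript
// mulberry32: a small, fast seeded PRNG. Same seed, same sequence,
// so "random" enemy movements are identical on every test run.
function mulberry32(seed) {
  return function () {
    seed |= 0;
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two runs with the same seed produce the same "enemy path".
const runA = mulberry32(1234);
const runB = mulberry32(1234);
const pathA = Array.from({ length: 5 }, () => Math.floor(runA() * 100));
const pathB = Array.from({ length: 5 }, () => Math.floor(runB() * 100));
console.log(pathA.join() === pathB.join()); // true: the run is reproducible
```

With the randomness pinned down, any change in the measurements points at the code under test rather than at run-to-run variation in the scenario.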

These gestalt tests may be the most important from an end-user and system health point of view, because they really let you know how all the pieces are working together. These are the tests that the Product team should be most concerned with. But they are also generally the hardest for the engineering team to reverse engineer to find the root cause of a change. With a robust set of micro-benchmark and feature/module tests you may already have indicators of where the problem is (and isn’t). If you’re able to keep these easier to deconstruct, you’ll save engineering time and aggravation. Very high-end, complex tests may have simpler versions: for the helicopter attack game we may want one version that tests our sprite animation with little logic and another that is logic-heavy with little animation.

These tests will often measure such things as rendering performance (fps), memory, start-up time, CPU or battery usage, interaction speed, concurrent users, etc.

Bonus level: Meta-level tests
This probably isn’t really a different category from the Product level of testing, but when an example of this cropped up it turned out to require a different measurement mechanism. Rather than building out a specific test, the meta-level performance test metric is a measure of meaningful data gathered while running an entire test suite (e.g. “What does the memory footprint look like after executing all the tests in the test suite?”). It may even be adding performance-type tracking around a suite of non-performance pass/fail tests. This sort of test likely covers a higher-level measurement such as memory usage, battery drain, CPU use, etc. A fall in performance here will likely require a bit of thought as to what the root cause may be.