
Performance, Load and Stress Testing

When we talk about performance, load and stress testing, we sometimes encounter confusion about the three terms. No wonder, since they all involve the same tools and techniques! And to make matters worse, we ourselves often lump them together under the “perf & load” label.

But while all three test methods share similar tools and approaches, their goals are very different indeed.

Similarities

As hinted at above, all three kinds of testing have very similar base requirements. In each case:

There is a system you want to test. It could consist of many machines and many software components, or just one of each, but it’s typically treated as a unit, the application.

You will have some workflows through the application that should be reasonably close to actual (expected) user behaviour.

Additionally, you will need to have a reasonable estimate about the distribution of workflows in overall load. That is, at any given time, will you have more users reading the blog, or more users using the checkout workflow on your web shop component?

Lastly, you will be talking about the number of “virtual users” you want to run the test with. Each virtual user is a software agent that runs through one of the workflows, then picks up another workflow, according to the above distribution.

Given all this information, you can simulate a reasonably realistic workload for the application.
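As an illustration, here is a minimal sketch of how a virtual user might pick workflows according to such a distribution. The workflow names and weights are invented for this example and would come from your own traffic estimates:

```python
import random

# Hypothetical workflow mix for a site with a blog and a web shop.
WORKFLOWS = {
    "read_blog": 0.6,    # most users just read the blog
    "browse_shop": 0.3,
    "checkout": 0.1,     # only a few complete a purchase
}

def pick_workflow(rng):
    """Pick the next workflow for a virtual user, weighted by the distribution."""
    return rng.choices(list(WORKFLOWS), weights=list(WORKFLOWS.values()), k=1)[0]

def simulate_user(rng, iterations):
    """A virtual user is just a loop: run a workflow, then pick the next one."""
    return [pick_workflow(rng) for _ in range(iterations)]
```

Real tools express the same idea with user classes and task weights, but the principle is exactly this: a weighted choice over workflows, repeated per virtual user.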

That’s what you’ll do in all three of these test methods, but that’s also where the similarities end.

Load Testing

Load testing aims to answer the question does the application handle my given workload?

The typical approach to load testing is to derive from business requirements how many users the application needs to handle. That is the number of virtual users you want to reach in your load test.

You then start ramping up virtual users from zero or a handful, until the prescribed number of virtual users is reached. If the application doesn’t produce undue amounts of errors, your load test succeeded.

Note that there is a threshold metric that needs defining, namely: how many errors are reasonable? That can turn out to be the hardest question to answer up front, given that some errors – especially temporary ones – may be entirely forgivable in your use case.
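To make this concrete, here is a tool-agnostic sketch of such a ramp-up. `run_workload` is a hypothetical stand-in for whatever actually drives your virtual users; its simulated capacity limit exists only to make the example runnable:

```python
def run_workload(virtual_users):
    """Hypothetical stand-in: returns the observed error rate at this load level."""
    capacity = 500  # invented capacity limit, purely for illustration
    if virtual_users <= capacity:
        return 0.0
    return (virtual_users - capacity) / virtual_users

def load_test(target_users, ramp_step, max_error_rate):
    """Ramp up to the target user count; succeed if errors stay within budget."""
    users = 0
    while users < target_users:
        users = min(users + ramp_step, target_users)
        if run_workload(users) > max_error_rate:
            return False  # too many errors before reaching the target
    return True           # target reached within the error budget
```

The success criterion is fixed up front: the prescribed number of users, and the error rate you decided is acceptable.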

Stress Testing

Stress testing, by contrast, aims to answer the question what workload can the application handle?

Stress testing can be thought of as exploratory load testing; instead of increasing the load until a number of virtual users is reached, you would instead increase the load until a number of errors is reached.

That’s possibly the exact same threshold metric as given above, but not necessarily. For a hybrid application with a web shop and a blog component, the blog component typically does not contribute to the business revenue. If it produces errors for a number of virtual users, while the web shop component still holds up under the same load, you might not care.

For stress testing, it is therefore especially important to decide which error rate is acceptable for the most critical application components.

Since you are simulating realistic load by letting virtual users run through realistic workflows, another way to look at this is to ask which of the workflows may fail.
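A stress test can be sketched much like a load test, only with the stop condition inverted: instead of stopping at a target user count, you keep increasing load until the error threshold is crossed, and report the last load level the application handled. As before, `run_workload` is a hypothetical stand-in:

```python
def run_workload(virtual_users):
    """Hypothetical stand-in: returns the observed error rate at this load level."""
    capacity = 500  # invented capacity limit, purely for illustration
    if virtual_users <= capacity:
        return 0.0
    return (virtual_users - capacity) / virtual_users

def stress_test(ramp_step, max_error_rate, ceiling):
    """Increase load until errors exceed the threshold; return the last
    user count the application handled within the error budget."""
    users, last_good = 0, 0
    while users < ceiling:
        users += ramp_step
        if run_workload(users) > max_error_rate:
            break  # threshold crossed: we found the limit
        last_good = users
    return last_good
```

The `ceiling` parameter is just a safety stop so the exploration terminates even if the application never produces enough errors.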

Performance Testing

Performance testing aims to answer a compound question that is closely related to the results of the previous types of testing, namely at which workload does my application respond within acceptable limits?

At first glance this looks almost exactly like stress testing; the difference lies in how to treat valid responses that simply take too long for the user to care.

For websites, there are various numbers floating around. Even back in 2011, KISSmetrics reported the following thresholds, which are – more or less – still relevant today:

3% of visitors would abandon a page if it didn’t load in under a second.

16% of visitors would abandon a page if it didn’t load in 1-5 seconds.

30% of visitors would leave after 6-10 seconds.

16% would leave after 11-15 seconds.

15% would leave after 16-20 seconds.

20% would wait for 20+ seconds for a page to load.

In performance testing, your rule of thumb is that anything that takes longer than 10 seconds to load needs improvement; take longer than 15 seconds, and you have lost the customer. Staying below 5 seconds can be considered good.
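That rule of thumb can be captured in a few lines. The bucket boundaries follow the numbers above and are, of course, approximate; the "acceptable" band for the 5–10 second range is our own label:

```python
def classify_response_time(seconds):
    """Rule-of-thumb buckets for page load times; boundaries are approximate."""
    if seconds < 5:
        return "good"
    if seconds <= 10:
        return "acceptable"
    if seconds <= 15:
        return "needs improvement"
    return "customer lost"
```

In a real performance test you would apply such a classification per workflow, not globally, for the reason given next.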

Again, as with stress testing, workflows matter – a user might wait patiently for the checkout page to finish the payment process, but may not want to wait that long for a product page to load.

When to test what?

When you execute one of the tests, chances are that you can report some ballpark numbers for another type of test. For example, a performance test will also tell you whether your expected user numbers can be served.

But the three types of test also exercise slightly different parts of the application, so they make sense at different times during the development process:

Performance tests are mainly tests of the software architecture and algorithms. If both are sufficiently good at processing many user requests in parallel, development can move on. To test this, you don’t need the most powerful deployment environment, nor the largest number of virtual users. What you primarily need is large amounts of test data, as many algorithms slow down significantly as the amount of data grows.

Stress tests are mainly tests of a combination of software and reference hardware. Assuming the software algorithms and architecture are good enough, a limited stress test will allow you to understand how much hardware you would need to serve the number of users your business demands. Stress tests also reveal architectural issues that might have remained hidden under performance test conditions.

Load tests finally verify the numbers extrapolated from stress tests, and must occur under the most realistic circumstances possible. It is highly recommended that you load test a staging environment that is identical in scope, size and data to your future production environment.

Viewed from this angle, the three types of testing appear clearly ordered by scale – from the smallest number of virtual users and test machines to the largest. That view is not entirely wrong, but it may obscure a little what each kind of test aims for.

Literature

Note that other literature subsumes both load testing and stress testing under the performance testing label. That is entirely understandable, as both types of testing attempt to establish data on how the application performs.

However, it is less helpful when discussing, for example, page rendering performance, a particular type of performance metric. Here, you record which part of the page rendering process takes how long – downloading images, CSS, JavaScript, DNS resolution, you name it.

Web page performance testing can be done quite well with a single virtual user executing a single workflow. There are plenty of examples of systems whose behaviour varies here under different load conditions, but the first useful data can be obtained without load or stress testing.

We therefore prefer to think of performance testing as mainly aimed at obtaining data on software behaviour, stress testing as mainly aimed at obtaining data on system behaviour, and load testing as mainly aimed at verifying that set performance goals are reached.

Related Testing

Two other kinds of testing are related to the three above, and should not be forgotten in a good test plan:

Soak Testing

Soak testing or endurance testing measures whether an expected load can be handled by the application over a longer period of time. You would typically want to monitor memory utilization and performance degradation throughout the test.
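As a minimal sketch of such monitoring, the following uses Python's built-in tracemalloc module to sample memory use per iteration. A real soak test would of course run for hours and also watch response times; the workload here is whatever callable you pass in:

```python
import tracemalloc

def soak_test(iterations, workload):
    """Run the workload repeatedly and sample Python-level memory use.

    A steadily rising curve in the returned samples suggests a leak or
    unbounded growth somewhere in the workload."""
    tracemalloc.start()
    samples = []
    for _ in range(iterations):
        workload()
        current, _peak = tracemalloc.get_traced_memory()
        samples.append(current)
    tracemalloc.stop()
    return samples
```

The point is not the tool but the shape of the test: the same expected load, repeated over time, with resource metrics recorded throughout.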

Spike Testing

Spike testing is a particular kind of load test where the load is suddenly increased, or ramped up very steeply. It aims to establish how well the application can respond to sudden load spikes, for example when it goes viral.
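The load profile of a spike test is easy to sketch; the numbers below are arbitrary:

```python
def spike_profile(baseline, peak, total_seconds, spike_start, spike_len):
    """Per-second virtual-user counts: a steady baseline with one sudden spike."""
    return [peak if spike_start <= t < spike_start + spike_len else baseline
            for t in range(total_seconds)]

# Example: 10 users of steady load, jumping to 200 users for 5 seconds.
profile = spike_profile(baseline=10, peak=200, total_seconds=30,
                        spike_start=10, spike_len=5)
```

Unlike the gradual ramp of a load test, the interesting part here is the step change itself: how the application behaves during and immediately after the jump.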

Conclusion

Although all the test methods above share a common set of requirements, and are performed using a common set of tools, they differ greatly in which part of a system they test, and what relevant questions they answer about the system.