Mechanization provided human operators with machinery to assist them with the muscular requirements of work. Whereas automation is the use of control systems and information technologies reducing the need for human intervention. In the scope of industrialization, automation is a step beyond mechanization.

In testing, however, I see those two concepts often dangerously mixed: people tend to mechanize testing, thinking they are automating it. As a result, the estimated value of such “automation” is completely wrong: it will not take risks into account, it will not find important regressions, and it will give a tester a false sense of safety and completeness. Why? Because of the way mechanization is created.

Both automation and mechanization have their value in testing. But they are different, and therefore should always be clearly distinguished. Whereas automation is a result of application risk analysis, based on knowledge of the application and understanding of testing needs (and thus finding tools and ways to cover those risks), mechanization goes from the opposite direction: it looks for out of box tools and easiest ways to mechanize some testing or code review procedures, and makes an opportunistic usage of those methods. Mechanization, for instance, is great for evaluating a module/function/piece of code. It is, if you will, a quality control tool, but it does not eliminate a need for quality assurance testing, either manual or automated.

It’s like making a car: each detail of the car was inspected (probably in mechanized, or very standardized way) and has a “QC” stamp on it. But after all details were assembled, do they know if and how the car will drive? If car manufacturers did, they would not have to pay their test drivers. Yet, test drivers are the ones that provide the most valuable input, that is closest to actual consumer’s experience. In software, testers can play a role of both, quality control personnel, and test drivers. Taking away one of those roles, or thinking that test drive can be replaced with “QC” stamp is not the way to optimize testing. And bare existence of mechanical testing should not be the reason to consider application “risk free” or “regression proof”. Otherwise you may end up in a situation where your car does not drive, and you don’t even know.

Quite frankly I am lucky: developers I worked with (well, most of them) were quite open to interactions with testers, didn’t take the “bad news” testers brought personally, and were able to learn to work with testers efficiently. In other words – they valued services testers provided, which I think is one of the most important factors in this relationship.
However, even those, who valued testers, often pointed that testing lacks some predictability, and the path tester follows looks almost accidental. And this is where I think the major difference between developers and testers is: while developers’ work is algorithmic by nature (define-design-do), and thus it’s possible to develop a fairly detailed plan ahead of time, same level of planning may not be possible when you try to break things. It doesn’t mean that testers always “shoot in the dark”: they have to have a master plan and initial evaluation of risks. But very often the most efficient next step tester takes depends to the large extend on the findings and outcomes of the previous one. And thus it can be difficult for a tester to specify precisely what and how he or she will be testing in several days from now.
Another aspect: developers have a clear finish line. Developer is done, when he or she implemented the required functionality, and it works (whatever this means for particular culture, project or a developer). For testers the finish line is always arbitrary (you can barely run out of test cases), and can only be defined as an abstract moment when tester ran out of important test cases. The term important is very well defined in the context of testing: important to stakeholders, company reputation, etc. But since it doesn’t have an associated precise mathematical formula, testers often have hard time explaining the others when they will be done, causing developers to feel uneasy. It takes some skill and experience (not to mention talent, which I think is the main factor of any work) for tester to choose the best point in time when tester can say “I’m done”. And it takes some skill and experience for developers to trust tester’s judgment. First test your testers, then trust your testers. The first part of this statement can be a subject of a separate post, and the second is described well in Product Bistro: Love Developers, and Trust QA by Mitchell Ashley.

1. Anything reusable must be in a shared location. Personal email accounts, local file folder, or someone’s head are not places where reusable information can be stored. Also the format of the information, stored in a shared location has to be readable by anyone.

2. Store only unique information, that is information that may take time to recover (results of the research, internal information, abstracts from multiple sources, etc.). But do not duplicate or link anything, that is a “Google search away”, unless it’s used so often, that it makes sense to store it in a well known location. Another exception is rare information, or information, located on the site that may disappear or change its contents (e.g. someone’s personal page, forum, etc).

3. When you run into outdated or incorrect information inside the shared repository, update it or at least mark it as such. This will save some time for others, and also will make sure that the quality of the information does not degrade over time.

Two of the most popular (and least pleasant) conversations between testers and developers, go like this:

Testers: “It turns they changed it, but they didn’t bother to tell us!”
Dev: “Oh, we didn’t think it was important!”

Testers: “We told them about this issue long time ago, but they didn’t care”
Dev: “From their explanation it didn’t sound like an important issue”

However those conversations are easily eliminated, by following two simple rules:

Any change that can affect some portion of the code somewhere should cause a discussion with the goal to understand the risks. Change can be “accepted” with no further investigation/testing if it’s not considered to be risky by all sides, or it can cause the need for additional investigation and testing. I don’t call for discussion around each line of code: I’m sure every team can come up with a unit that makes sense for a particular product (e.g. component, module, or any other definition of functional area). And such discussion is especially important when no other changes are done, no extensive testing is planned in the same functional area, and thus regressions can slip undiscovered.

Similarly any issue or concern that is being raised, should not be discarded without understanding how it affects the product, which risks it involves, whether it requires further investigation and testing, etc. This especially applies to those “last day” findings, which often are discarded too fast, because everyone is in “Let’s ship it!” mood.

When it comes to those types of testing, where we measure how application performs in different conditions, and compare its metrics against different other applications or standards, almost each testing culture has its own definitions for the same terms, which is very confusing. Even performance testing itself may mean completely different things to different people. Commonly there are three groups of perceptions:

Performance testing seen as a specific type of testing, separate from other types, such as load, stress, etc. One of the most ardent defenders of this point of view, explains his position in his blog post.

And in the middle there are all those, who believe that performance is an umbrella term for some types of testing, but does not include others. For example I once had a lengthy discussion with a colleague, who believed performance testing to be any type of testing that deals with operations on application (for example load, stress, endurance), but not with operations on data (e.g. volume).

Advocates of each approach have their supporting arguments, of course, and references to literature. So when working in a specific company, it’s always a good idea to either find out an existing or to establish a new common vocabulary. But besides that, does it really matter how you call it? Not at all. Even though I do have my preferences, and I would prefer to have one common umbrella term (and performance testing is seems perfect for this role), and I would like to see an agreement on what each of the other terms means, I am ready to give up on any definitions, if it shifts focus from reaching the testing goals to fighting the terms.

And as opposed to definitions, the goals of performance testing are usually quite clear.

***
The first goal is to make sure anticipated workload can be supported: application will not “break”, and its performance characteristics (time, throughput, or any other measurements relevant for the application) will not degrade below acceptable limits. Such testing is 80% planning, and 20% execution, as environment, transaction/event distribution, amount of operations/events, and volumes of data must be carefully planned to represent typical production environments with maximal possible proximity. This involves two important preparative steps:

Understanding the performance goals, finding the appropriate and relevant metrics, defining typical distribution of the operations, and typical environment, etc. This step involves interviewing multiple stakeholders, getting confusing “wish lists”, or an answer “I don’t know”… All of which could be a topic for the separate post.

Each transaction or operation should be tested by itself (including concurrency test) to eliminate the obvious problems (e.g. functional bugs, memory leaks, or other inefficiencies) within the transaction itself.

Usually first outcome of this type of testing is the necessity to deal with resolvable bottlenecks, e.g. inefficiencies within the environment and application itself. On later stages some further fine tuning can be required (e.g. database / application maintenance procedures and policies, hardware recommendations, etc.). This type of testing completes, when

Application is able to support all anticipated workload scenarios: it doesn’t break, and its performance characteristics are acceptable.

We collected performance characteristics of the application for each of those scenarios. Those characteristics can be used as a benchmark for other tests, or other versions of application.

We can specify hardware requirements and maintenance procedures required to support an anticipated workload.

***
Once we know that anticipated workload can be properly supported, the workload on the application is increased to and beyond the limits, which can be subdivided into two sub-goals:

Finding workload limit: maximal workload application can handle with acceptable performance characteristics. This is the point before application breaks or its performance significantly degrades. This point may exhibit non-optimal performance characteristics, just acceptable.

Finding what happens when workload exceeds the limit, that is the application breaks, its performance characteristics degrade to unacceptable levels, or one of the “unresolvable” bottlenecks is reached

This testing can use the same environment and transaction/event distribution as the load testing, but the amount of operations and volumes of data must be increased gradually to reach and exceed the limit. One of the most important goals of this testing is to make sure that anticipated workload is not dangerously close to workload limit, and that application fails predictably and somewhat gracefully.

Sometimes understanding whether application performs well with anticipated load is impossible, since nobody seems knows what the anticipated load is. In such case, working backwards (i.e.: finding the application limit, and understanding whether it’s acceptable, and close to typical workload) can help.

***
At the same time we can start the lengthiest of all test: test, that verifies whether anticipated workload can be sustained for a long period of time. This type of testing is commonly called endurance testing. “Long” is defined differently for different types of applications, but in enterprise environment we should be talking about at least few weeks. As a basis we can still use the same scenarios as in the first test, but it’s good to extend them with some additional “real environment” features, such as periodical issues with an underlying structure (e.g. lost network connections) and erroneous inputs, if those are not part of the original tests already. This testing can reveal “hidden” issues, like memory corruptions, caused by multiple failures, or insignificant memory leaks, that turn into a real problem over the time. It also allows to find out a magical “number of hours application can run without a failure” measure, so popular with managers.

***
When it’s already known how the anticipated workload is handled, and what are the application limits, the workload can be sharply or slowly increased from regular to maximum and then decreased back to regular, as if it goes through the rush-hour, peak or spike. Another version of this test, takes the workload from none to maximum and then back to idle. The goal of both tests is to observe how the application behaves and how long does it recover after a / peak. This test might be more important for certain applications, where such waves of workload should be expected on regular basis. In such case it might be a good idea to combine this type of testing with testing for long period of time, by creating alternating anticipated-maximal-anticipated-low-… workloads on an application for an extended period of time. Also distribution of the transactions or operations during the spike might be different from distribution of operations with regular workload (for example: in the morning many people try to login, while very few started to do something else).

***
Another common test that a distributed enterprise application may require is a test that determines when to scaling up or scaling out will be more beneficial to handle an increasing workload the hardware on which applications runs. Scaling up (also called vertical scaling) adds more hardware resources to an existing machines (e.g. more memory, faster hard drive, etc.), while scaling out (also called horizontal scaling) adds additional machines and distributes work between them. Here’s an example of such testing performed by Pentaho.

***
Another interesting area is the size and growth rate of the data in application back-end storage (e.g. database, or file system). Naturally this testing is part of all of the above tests, as all of them will produce large quantities of data, and thus resolution of many bottlenecks will require estimation and adaptation for data growth, or we may want to run tests on a large developed database, rather than on an empty one. However this testing can also be done separately, with the goal to provide recommendations specifically targeting DBAs / system administrators, who may not be involved or familiar with application itself.