Improving the Accuracy of Tests by Weighing the Results

Test automation is often hard to do well, especially when testing a complex system where the results may be dynamic and hard to predict precisely. For instance, how would you test a price finder for highly volatile products such as airline tickets, where the data comes from live, third-party systems? While manual testers can interpret such results to decide whether a system is working correctly, the interpretation may be much harder to codify in an automated test. In this week's column, Julian Harty describes an approach that may help you to improve the accuracy of your automated testing even in these volatile environments.

Heuristic Test Oracles

One of the biggest challenges when automating tests is deciding whether the results are good or bad. We could compare against a reference, commonly called a test oracle; however, complete test oracles for complex systems may be extremely expensive and are likely to have their own problems. Doug Hoffman wrote a great article on heuristic test oracles in the March/April 1999 issue of STQE magazine. These are relatively simple, albeit imperfect, models of the behaviors we expect from a system.

Let's see if we can create a heuristic test oracle that is able to cope in complex, volatile environments. The oracle will need to be sufficiently accurate to enable automated tests to score results and determine whether the results fall into one of three categories: good, bad, or unknown.

Weighing Responses to Tests

In software, as in life, there are things we notice that help confirm whether something is satisfactory or unsatisfactory. Let's call these things that affect our judgment of the results "factors." Some factors provide stronger indications than others, so when using these factors to rate results, we assign higher scores (or "weightings") to the stronger indicators, giving them more influence on the overall outcome.

Some factors are positive, and when we detect them we are more likely to view the result as satisfactory. Others are negative and more likely to lead us to view the result as unsatisfactory.
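The overall verdict can then be reduced to a simple rule. Here is a minimal Python sketch, assuming (as this article does) that a positive weighted total means acceptable, a negative total means unacceptable, and anything else is inconclusive:

```python
def classify(total_score):
    # A positive weighted total indicates an acceptable result, a
    # negative total an unacceptable one; zero is treated as
    # inconclusive and left for a human to review.
    if total_score > 0:
        return "good"
    if total_score < 0:
        return "bad"
    return "unknown"
```

The thresholds here are the simplest possible choice; in practice you would calibrate them against real results, as described later in this article.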

As a real-life example of how factors affect our judgment, let's examine the way we judge a meal at a restaurant. Positive indicators might be the décor (a minor factor), the service (a medium factor), the taste of the meal (a strong factor), etc. Negative factors might be dirty crockery or, worse, an insect in the food.

For the airline ticket scenario mentioned in the summary, positive factors might include: structured HTML results with a section heading titled Flights (a minor factor), a Web form such as bookNow (a medium factor), and well-formatted results for the locations and dates requested (a strong factor). Negative indicators might be the lack of the Web form bookNow (a medium factor), or HTTP error codes like the dreaded "HTTP 500 server error" (a major factor). Sometimes the negative factor may override any or all of the positive indicators, such as the server error in this example. Some factors may be structural, such as the HTML elements, and others may be related to data, such as the details of the flights.

For test automation, we need to assign scores to the factors so we can codify them. A good way to start is to use a table (see table 1 below).

Table 1: Factors and Assigned Weightings

| Indicator                  | Weighting | Range                                 | Comments                                                        |
|----------------------------|-----------|---------------------------------------|-----------------------------------------------------------------|
| HTML heading: Flights      | Low       | 0 if missing, +1 if present           |                                                                 |
| bookNow HTML form          | Medium    | -1 if missing, +2 if present          |                                                                 |
| Well-formatted flight data | High      | 0 if missing, +5 if >= 1 result       | We may want to consider an upper limit, based on business rules |
| HTTP error codes           | High      | -10 for either 404 or 500 error codes |                                                                 |
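A table like this translates naturally into code. Below is a minimal Python sketch of Table 1; the factor names and the shape of the `observations` dictionary are illustrative assumptions, not part of any real framework:

```python
# Hypothetical encoding of Table 1: each factor maps an observed
# value to its weighted score.
FACTORS = {
    "flights_heading":   lambda present: 1 if present else 0,
    "book_now_form":     lambda present: 2 if present else -1,
    "formatted_flights": lambda count: 5 if count >= 1 else 0,
    "http_error":        lambda is_error: -10 if is_error else 0,
}

def score_response(observations):
    """Sum the weighted scores for the observed factors."""
    return sum(FACTORS[name](value) for name, value in observations.items())
```

Keeping the weightings in one data structure makes them easy to review and recalibrate without touching the detection code.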

The ranges should be set so an overall negative score corresponds to unacceptable results and an overall positive score indicates acceptable results. A good way to assign weightings is to ask someone to describe the factors that help them decide whether a result is valid or trustworthy. We can also create mock-ups, suggest possible outcomes, and see how they react. For instance, ask "How would you feel if the flight information was written as one long string?"

When constructing our tests, we should try to find out whether we can "peek behind the curtain" to access the underlying data. If so, the data may help strengthen our tests by providing a stronger correlation between the results received and the results expected.

As always, we may choose to ignore some indicators, such as the time and date on the page. If so, record what you've ignored for future reference. This will help you and others to differentiate between these factors and whatever else is provided in the results.

Try to find ways to insert guards (or assertions, as they're known in programming terminology) that validate your tests and help ensure that your tests detect major changes to the system. For example, if the airline ticket Web site's prices are currently limited to US dollars, put in a guard that will fail the tests if other currencies are detected. Deliberately aborting the tests is often preferable to blind execution, which might report erroneous results.
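As a sketch of such a guard, the hypothetical check below aborts the test run when a non-USD currency appears in the page text. The exception name and the list of currency patterns are assumptions for illustration:

```python
import re

class GuardFailure(Exception):
    """Raised to abort a test run when a core assumption is violated."""

def assert_usd_only(page_text):
    # Hypothetical guard: the site currently quotes prices only in
    # US dollars, so any other currency symbol or ISO code means the
    # system has changed and this oracle may no longer apply.
    if re.search(r"EUR|GBP|€|£", page_text):
        raise GuardFailure("Non-USD currency detected; aborting test")
```

Failing loudly here is the point: a raised exception stops the run, rather than letting the weighted scoring quietly misjudge results it was never calibrated for.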

I recommend working with others to review the results of your initial tests to decide whether your tests are trustworthy and to calibrate the weightings if need be. We need to focus on improving our tests. Look for both false positives (tests that claim to pass when they should fail) and false negatives (tests that claim to fail when they should pass). Whenever you find false positives or false negatives, review your test scripts and the weightings to find ways to make the tests more reliable and robust. Even after making improvements, I still expect my tests to miss some problems. After all, they rely on imperfect heuristics; however, that's OK as long as they find the important problems!

Implementation Notes

Pattern matching is a practical way to implement matching the factors. Simple patterns might involve string matching and field comparisons (for example, to verify the HTTP response code field). More complex patterns might rely on regular expressions or structured queries (such as XPATH expressions).

Field comparisons are useful when the data is structured and you are able to reliably predict where the relevant field will be in the response. For HTTP, the response code is easy to locate and easy to match. I recommend always adding a test for the value returned.
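For example, the status code in a raw HTTP status line sits in a fixed, easily parsed position, so extracting and asserting on it takes only a few lines. This sketch assumes a well-formed status line such as `HTTP/1.1 200 OK`:

```python
def status_code(raw_response):
    # Field comparison: the status code is always the second
    # whitespace-separated field of the first line of the response.
    status_line = raw_response.splitlines()[0]
    return int(status_line.split()[1])
```

Checking this field first is cheap and catches whole classes of failure before any of the more expensive content checks run.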

String matching is generally well supported and easier for non-technical users to work with. However, the string might be found elsewhere within the response, so take care to match the string in the correct part of the response.

XPATH is useful for matching structured data and for homing in on the relevant section of data to compare within the response.
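As a sketch, even the limited XPath support in Python's standard library is enough to target a specific section of well-formed markup. The `flights` table id is a hypothetical example, and messy real-world HTML would likely need a more tolerant parser than `xml.etree`:

```python
import xml.etree.ElementTree as ET

def flight_rows(xhtml):
    # Home in on the flight-results table rather than matching
    # strings anywhere in the page; assumes well-formed XHTML.
    root = ET.fromstring(xhtml)
    return root.findall(".//table[@id='flights']/tr")
```

Scoping the match this way avoids false matches when, say, the word "Flights" also appears in a navigation menu elsewhere on the page.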

We can, and generally need to, combine the various pattern-matching techniques to determine whether a factor is present and "correct."

The scoring should be simple to code: if a pattern is matched correctly, return the respective score; otherwise, return zero. For cases such as the presence or absence of Web forms (such as bookNow), return +2 if the form is found and -1 otherwise. More complex scores can be implemented using case statements, etc. The value of some results will affect whether or not you evaluate other results. For instance, if you detect an "HTTP 500" error, you don't need to check the flight data on that page, because the server has already reported an error.
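Putting these pieces together, a scorer with that short-circuit might look like the sketch below. The keys of the `response` dictionary are assumptions for illustration, standing in for whatever your test harness actually captures:

```python
def evaluate(response):
    # An HTTP error overrides everything: the server has already
    # reported failure, so skip the remaining factors entirely.
    if response["status"] in (404, 500):
        return -10
    # Otherwise, accumulate the weighted factor scores from Table 1.
    score = 2 if response["has_book_now"] else -1
    if response["flight_count"] >= 1:
        score += 5
    return score
```

The early return is the key design choice: it keeps a page-level failure from being diluted, or even outweighed, by positive content factors that happen to match.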

By combining heuristic test oracles with the weighting of responses, we can make our automated tests more powerful and reliable, even in volatile environments with complex, dynamic data. The concepts are fairly simple, and you can work with experienced users to determine the factors and weightings. I hope you find these techniques useful.

Let me know how you get on with these suggestions. Your comments and feedback will help to improve the material further.

Thanks to Mike Kelly and Harry Robinson who have already helped to improve this article.

About the author

A senior test engineer at Google, Julian Harty finds ways to test lots of fun products, including the mobile wireless software used by millions of users worldwide. He's been involved in software and online systems for more than twenty years and enjoys working with others to find ways to solve testing challenges productively. A presenter at both STAREAST and STARWEST, Julian has been involved in international conferences and workshops on software testing.