Investigating Software

Tuesday, 16 August 2011

Testing doesn't complete. It might end, it might finish, but it doesn't complete. There's too much to test. If you ever need confirmation of this, test something that's been tested already. Better still, test a piece of software you know has been tested by someone you think is a brilliant tester. A good tester, like you, will still find new issues, ambiguities and bugs.

That's because the complexity of modern software is huge: as well as all the potential code paths of your code, there are all the paths of the underlying code and the near-infinite domain of data it might process. That's part of the beauty of testing: you have to be able to get a handle on this vast test space. That is, review a near-infinite test space in a [very] finite time-frame.

We are unable to give a complete picture of the product to our clients. But we are also free to find new issues that have so far eluded others. In fact the consequences are potentially more dramatic. We will always be sampling a sub-section of the potential code, data and inputs. The unexplored paths will always outnumber the mapped paths. As such, the number of undiscovered issues is always going to be greater than the number already found. Or at least, we will not have the time and/or resources to prove otherwise. As such, it's the tests you haven't run, or even dreamed of, that are probably most significant.

As I learn, I become better equipped to see more issues in the software. My new knowledge allows me to better choose regions of the software's behaviour to examine. I can ask questions that previously I did not even think of. Each new question opens up a new part of that near-infinite set of tests I've yet to complete.

For example, I learned that some Unicode characters can have multiple representations. The representations are equivalent, but one may use two code points to represent a character where another uses a single code point. A good example is the letter A with a grave accent:
À
À

Depending on your browser/OS they might look the same or different. Changing the font might help distinguish between them:ÀÀ

My text editor actually renders them quite differently, even though they are meant to display the same.

Until I knew about this feature of Unicode, I didn't know to ask the right questions. How would the software handle this? Could it correctly treat these as equal? This whole area of testing would not have been examined if I hadn't taken the time to learn about this 'Canonical Equivalence' property of Unicode normalisation.
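To make the idea concrete, here is a minimal Ruby sketch of canonical equivalence. It is not part of my original testing; the constant and method names are my own, and it only assumes Ruby's built-in String#unicode_normalize.

```ruby
# Two ways of writing "À": one precomposed code point, or "A" plus a combining accent.
PRECOMPOSED = "\u00C0"   # U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
DECOMPOSED  = "A\u0300"  # U+0041 followed by U+0300 COMBINING GRAVE ACCENT

# Returns the code points of a string as "U+XXXX" labels.
def codepoint_labels(text)
  text.codepoints.map { |cp| format("U+%04X", cp) }
end

# True when two strings are canonically equivalent,
# i.e. identical once both are normalised to NFC.
def canonically_equivalent?(first, second)
  first.unicode_normalize(:nfc) == second.unicode_normalize(:nfc)
end

puts codepoint_labels(PRECOMPOSED).inspect               # ["U+00C0"]
puts codepoint_labels(DECOMPOSED).inspect                # ["U+0041", "U+0300"]
puts PRECOMPOSED == DECOMPOSED                           # false: different code points
puts canonically_equivalent?(PRECOMPOSED, DECOMPOSED)    # true: same character
```

A plain string comparison sees two different values, which is exactly the trap software can fall into if it never normalises its input.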

This is a situation where I would actively avoid using most test automation until I was clear in my understanding of the potential issues. Therefore I stopped using my previous scripts, and used cURL. The benefit of cURL is that it gives me direct and visible control of what I request from a site/API. It will make the exact request I ask of it, with very little fuss and certainly no frills. I can be sure it's not going to try to encode or interpret what I'm requesting, but rather repeat it verbatim.

This example had an interesting result when used against the Guardian Content API. My first tests included this query to the Guardian's Content Search:

The non-combining query, including the letter À (Capital A with a grave accent or %C3%80 in a single character):

At first glance these two results look fairly similar, but a closer look shows the first response includes a didYouMean field. In theory these two queries should be treated equivalently. This minor difference suggests they were not, but it was also a fairly minor issue. As a tester I knew I had to examine this further, and find out how big, or how bad, this difference could be.

Rather than slip back into automation, I realised that what I needed was an example that demonstrated the potential magnitude of the issue. This was a human problem, or opportunity: I needed an example that would clearly show an issue in one representation of the characters and not in the other. So I needed a query that could be affected by these differences and, if interpreted correctly, would deliver many news results. The answer was Société Générale, a high-profile and recent news story with a non-ASCII, accented company name.
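For anyone wanting to reproduce the two forms of the query, this is a rough Ruby sketch of how the percent-encoded strings could be generated before pasting them into a cURL command. The helper name is my own invention, and the form-style encoding (a space becomes "+") is an assumption rather than a record of the exact requests I made.

```ruby
require "uri"

# The search term, written here with precomposed accented letters.
TERM = "Soci\u00E9t\u00E9 G\u00E9n\u00E9rale"

# Returns the percent-encoded form of a search term in the requested Unicode
# normalisation form: :nfc (single code point per accented letter) or
# :nfd (base letter followed by a combining accent).
def encoded_query(term, form)
  URI.encode_www_form_component(term.unicode_normalize(form))
end

puts encoded_query(TERM, :nfc)   # Soci%C3%A9t%C3%A9+G%C3%A9n%C3%A9rale
puts encoded_query(TERM, :nfd)   # Socie%CC%81te%CC%81+Ge%CC%81ne%CC%81rale
```

The same helper gives %C3%80 for the single code point À and A%CC%80 for the combining form, so the difference between the two requests is visible before anything is sent.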

This response shows that the query found 0 results, and suggested something else.

At this point it looked like there was an issue. But how could I be sure? Maybe the Unicode NFC behaviour was purely hypothetical, and not used in reality. So I needed an oracle, something that would help me decide if this behaviour was a bug. I switched to another news search system, one that generally seems reliable and would be respected in a comparison: Google News.

I used cURL to make two queries to the Google News site, one for each form of the query. This requires a minor tweak, modifying the user-agent of cURL, to stop it being blocked by Google. The results showed that Google returned almost the same results for both versions of "Société Générale". There were some minor differences, but these appeared to be inconsistent and possibly unrelated. The significant feedback from these Google News pages was that Google returns many results for both forms of character representation, and those results are virtually identical. It would therefore appear that there is an issue with the Guardian's handling of these code points.
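I made this comparison by eye, but 'virtually identical' could also be checked with a few lines of code. The following is a hypothetical Ruby sketch: it assumes you have already extracted a list of result identifiers (headlines or URLs, say) from each response, which is not something shown in this post.

```ruby
require "set"

# Summarises how far two lists of result identifiers agree.
# The overlap is the share of distinct results that appear in both lists.
def compare_results(first, second)
  a = first.to_set
  b = second.to_set
  union = a | b
  {
    only_in_first: (a - b).to_a,
    only_in_second: (b - a).to_a,
    overlap: union.empty? ? 1.0 : (a & b).size.to_f / union.size
  }
end

# Usage with placeholder identifiers (not real search results):
summary = compare_results(%w[x y z], %w[y z w])
puts summary[:overlap]   # 0.5
```

An overlap near 1.0 for the oracle and near 0.0 for the system under test would be a strong hint that the two forms are being handled differently.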

Thanks to this investigation, we have learned of another possible limitation in the Guardian Search API, one that could mean a user would not find news related to an important and current news event. This kind of investigation is at the heart of good testing: results learned from testing are quickly analysed, compared with background knowledge and used to generate more and better tests. Tools are selected for their ability to support this process, increasing the clarity of our results without forcing us to write unneeded code in awkward DSLs.

Tuesday, 9 August 2011

In my last post I discussed how test automation could be used to do things that I couldn't easily do unaided. In that example, it executed thousands of news 'content searches' and helped me sort through them. With the help of some simple test automation I found some potential issues with the results returned by the REST API.

In that case, I started out with the aim of implementing a tool. But your testing might not lead you that way; often your own hands-on investigation finds an issue, but you don't know how widespread it is. Is it a one-off curiosity, or a sign of something more widespread?

Again, this is where test automation can help and, if done well, without being an implementation or maintenance burden. Many test automation efforts are blind to the very Agile idea of YAGNI, or You Ain't Gonna Need It. They often presume to know all that needs to be tested in advance, deciding to invest most of their time writing 'tests' blindly against a specification that is as yet unimplemented. This example shows how simple test automation, based on your own feedback during testing, can be very powerful.

The test:

I had the idea that the Guardian Content API might be overly fussy with its inputs. Often software is written [and tested] using canned data that's designed 'to work'. These data inputs usually conform to a happy path, and even then, only a subset of the data that could be considered 'happy path' is used.

Using the Guardian's own HTML GUI (API Explorer), which allows you to easily query the REST API by hand, I tried a few quick tests. These included some random text as well as a few typical characters that are likely to occur in text but which I suspected would not be present in the usual canned test data. For example, a single SPACE character.

That quick test of the SPACE character highlighted an issue. Entering a single SPACE character into the Tags search API explorer appeared to cause the HTML GUI to not return a response. The API Explorer appeared to hang. At that point I didn't know the cause; the issue could have been a problem with this developer GUI, and not the API itself.

A closer examination using Firebug clearly identified the cause as an HTTP 500 error from the server.

I could have reported this one issue: that despite the documentation stating any free text is OK for this field, a simple space character can expose a failure. But using some simple automation I was better able to define the extent and distribution of this issue. For example, is it a general problem with entering single characters? Does it only affect one part of the API?

With a minor script change, my previous Ruby API-tool could report error responses and details of whether a JSON response object was returned by the server. (Though a simple cURL based shell script could have just as easily done the job.) I also wrote a little script to output every ASCII character:

The output of the above was directed to a file, and used as the input for my API-script. The script now reads one ASCII character at a time and uses that character to query the Guardian's Content API. As I had found this issue in the Tags Search, I also ran this script against the Guardian's Tags search API.
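The little character script itself is not reproduced in this post, so here is a rough Ruby sketch of the idea rather than the original. Writing each character in percent-encoded form is my own assumption, made so that control characters such as newline can sit safely one per line in the input file.

```ruby
require "uri"

# Every 7-bit ASCII character, code points 0 to 127.
ASCII_CHARACTERS = (0..127).map { |code| code.chr }

# Percent-encodes each ASCII character, giving one query-ready value per character.
def encoded_ascii_lines
  ASCII_CHARACTERS.map { |char| URI.encode_www_form_component(char) }
end

# Print one encoded character per line, ready to redirect to a file.
puts encoded_ascii_lines
```

Redirecting the output to a file (for example `ruby ascii.rb > ascii.txt`, where both file names are placeholders) gives 128 lines that a query script can read one at a time.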

This is ideal ground for test automation: there are 128 ASCII characters, and I'm examining 2 services, making 256 queries. That's too many to do by hand, but trivial for a simple test automation script. Some characters that are not available in ASCII are nonetheless very common in English, and will therefore be present in the body and headlines of the Guardian's content. A simple example is the € (euro) symbol. The script would also allow me to query the many [potentially over a million] non-ASCII code points if my current testing suggested that it might be fruitful.

The results, when filtered to only show non-HTTP 200 results, clearly indicate the Tags API is less robust than the Content Search API over these inputs. For example, the space character produced no error in Content Search but did in the Tags Search. The same is true of the horizontal tab. Both are characters that might be present in 'any free text'.
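The filtering step can be sketched in a few lines of Ruby. This is an illustration rather than my actual script: the Response struct and its field names are hypothetical, and the sample rows simply restate the space-character behaviour described above.

```ruby
# One recorded response: the service queried, the character sent and the HTTP status.
Response = Struct.new(:service, :character, :status)

# Keeps only the responses whose status is not HTTP 200, grouped by service name.
def failures_by_service(responses)
  responses.reject { |response| response.status == 200 }.group_by(&:service)
end

# Illustrative rows: a space is fine for Content Search but fails in Tags Search.
rows = [
  Response.new("content", " ", 200),
  Response.new("tags", " ", 500)
]
failures_by_service(rows).each do |service, failures|
  puts "#{service}: #{failures.map { |f| f.character.inspect }.join(', ')}"
end
```

Grouping by service puts the two APIs side by side, which is what makes the inconsistency between them stand out.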

The lack of consistency between the two APIs is the most striking factor to me as a tester. The two systems clearly handle these inputs differently. That information is invaluable to testers. We can instantly use this information in our next round of testing, as well as in discussions with programmers and product owners, asking such questions as:

What code is shared between the two services? There are clearly some differences.

How do the two APIs handle these characters in combination, with each other or with 'typical' English words?

As error handling code is itself often flaky, what new bugs can I find in the Tags search API?

How badly will the APIs, in particular the Tags API, handle non-ASCII characters? Should the APIs handle all UTF-8 codepoints without exposing failures?

As far as the product owner is concerned, what is the expected behaviour when a character is not accepted?

Lightweight, easy-to-build test automation that lets the team quickly get a mental model of how the software actually works is clearly valuable. I'm using the computer to do the laborious work it's good at, extending my reach as a tester to help me see the bigger picture. It showed that there is more than just one isolated character not being handled well; in fact the Tags Search API is generally a lot more prone to failure. This more exploratory automation frees me to do analysis and face-to-face communication with team members and product owners. It allows me to adapt quickly to those discussions and to how I see the software behaving, a fundamentally more Agile (and agile) approach.

Friday, 5 August 2011

Have you ever had to test an API that's accessible over the internet? Or even one that's available internally within your organisation? They often take the form of a REST service (or similar) through which other software can easily access information in a machine-readable form.

Even if you are not familiar with these APIs, you've probably heard of or seen the results of them. Some examples of APIs are the Twitter API, Flickr and the Guardian's Open Platform. Some examples of what people have built using the Flickr API are published on the Flickr site. Despite being 'machine-readable' they are often human-readable, greatly helping you test and debug them.

Companies use these APIs to ease the distribution of their content, encourage community and commercial development around their content or to simply provide a clear and documentable line between their role as data-provider and where the consumer's role begins.

When testing an API like the above, many teams slip into the test-automation-binary-world-view. That is, they discard the clever testing approaches they've used before with command-line or Graphical User Interfaces, and start to think only in terms of PASS and FAIL. The knowledge that the 'consumer' is another piece of software seems to cause many teams to assume that they no longer need to learn the application, and the 'testing' becomes a series of rote steps. Read the specification, define some fixtures, write some PASS/FAIL tests and run.

That's good, if you want to create [literally] an executable specification. As well as your company's actual API implementation you now have a mirror image of that API in your test automation. If you have used a tool such as Cucumber, you may now have an ambiguous natural language [style] definition of the rigidly implemented actual API. That might be useful in your organisation for communication, but as far as testing goes, it's fairly limited.

What have you learned? What new avenues for finding problems have you uncovered? To make your tests work and work reliably, you've probably had to incorporate a range of canned data that 'works' with your tests. So once they do work, for this narrow range of code-paths, they will keep 'working' and not revealing any new information about your software, no matter how many times you run them.

As testers we know the vast wealth of bugs are yet to be found. We can write PASS/FAIL tests all day, trying to predict these bugs, and barely scratch the surface. The real 'gotchas', the unexpected bugs, are not going to be in those canned tests. We have to use and explore the software to find them.

These APIs have a common theme: they are for communicating data. We are not just talking about simple unit-level methods that add/subtract/append etc. These methods work with human-generated content; they try to represent the content in a form machines can work with and deliver. This makes these systems 'messy', as people don't think like machines. Users don't know or care what the executable specification says. They use Flickr, write content for the Guardian, or tweet. When software has to handle these messy real-world situations, bugs are born. The domain of possible inputs and range of outputs for these systems is huge, and as testers it's our role to attempt to find the ambiguities, unknowns and bugs that affect the API and the data it needs to work with.

The tools we need to find these issues and bugs are readily accessible and generally free. This simple example, my initial look at the Guardian Content API, shows how to make test automation aid rather than hinder your testing. It isn't a canned test and data, it doesn't provide you with a reassuring green light, but does highlight the value of a more exploratory form of test automation. My example uses the live Guardian Content API, and took a matter of minutes to code.

As a tester, I noticed some good areas to investigate further. For example, from experience I have noticed that times and dates are an area very prone to bugs. So I decided to take a closer look at the date-times returned by the API. They appear to be written in an ISO style.

Some 'tester' questions that immediately ran through my head were: is that always the case? Are the times widely distributed throughout the day or is there a rounding bug?
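Both questions lend themselves to a quick check. The following is a minimal Ruby sketch, not the script I actually ran: it assumes the timestamps are meant to be strict ISO 8601, and the method names are my own.

```ruby
require "time"

# Parses a strict ISO 8601 timestamp, returning nil when the text does not conform.
def parse_iso8601(text)
  Time.iso8601(text)
rescue ArgumentError
  nil
end

# Counts how many parseable timestamps fall on each value of the seconds field.
# A large spike at 0 could hint at rounding somewhere in the pipeline.
def seconds_histogram(timestamps)
  timestamps
    .map { |text| parse_iso8601(text) }
    .compact
    .each_with_object(Hash.new(0)) { |time, counts| counts[time.sec] += 1 }
end

puts parse_iso8601("2011-08-03T06:45:36+01:00").inspect
puts parse_iso8601("not a date").inspect   # nil
```

Over a large sample, a histogram that is roughly flat across 0 to 59 would suggest real timestamps, while a pile-up at zero would be worth a conversation with the programmers.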

To help investigate these questions, I wrote a short Ruby script that takes a list of words, queries the Guardian's Content API and outputs a summary of the results. The results look like this:

200,aa,2011-08-03T06:45:36+01:00,world/2011/aug/03/china-calls-us-debt-manage
200,aa,2011-07-31T18:52:29+01:00,business/2011/jul/31/us-aaa-credit-rating
200,aa,2011-07-31T00:06:35+01:00,world/2011/jul/31/hong-kong-art-culture-china
200,aa,2011-07-29T15:36:00+01:00,world/2011/jul/29/spain-early-election
...

The results fields are:

[200] is the HTTP response code,

[aa] is the word that I queried,

[e.g. 2011-08-03T06:45:36+01:00] is the 'webPublicationDate' of the article,

[e.g. world/2011/aug/03/china-calls-us-debt-manage] is the 'id' of the article.
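Reading one of these summary lines back into its four fields takes very little code. This is a small Ruby sketch with names of my own choosing; it assumes the queried word never contains a comma, which holds for a plain word list.

```ruby
# The four fields of one summary line.
ResultLine = Struct.new(:status, :query, :published, :id)

# Splits a summary line into status, queried word, webPublicationDate and article id.
# The limit of 4 keeps any later commas inside the id field.
def parse_result_line(line)
  status, query, published, id = line.chomp.split(",", 4)
  ResultLine.new(Integer(status), query, published, id)
end

row = parse_result_line("200,aa,2011-07-29T15:36:00+01:00,world/2011/jul/29/spain-early-election")
puts row.status      # 200
puts row.published   # 2011-07-29T15:36:00+01:00
```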

What words do I use in my queries?

For this test, I chose a list of 'known to be good' words that are definitely in the English language. They should bring back a large set of results [from this index of English-language articles]. This word list is easily found; most UNIX systems have a free word list built in. On my Mac it's here on the file-system: /usr/share/dict/web2

The word list contains 234936 entries, too many for a person to manually query, but trivial for our test automation to work with.

I started the script passing in the 'words' list file as an argument. The script takes each word, queries the server, records a summary of the first page of results and continues hour after hour. This is ideal grunt-work for test automation, using the computer to run tasks that a human can't (thousands of boring queries) - when a human can't (half the night).
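The shape of that loop can be sketched as follows. This is a simplified illustration, not my original script: the search step is passed in as a callable so the sketch makes no claims about the API's URL or response format, and the hash keys simply mirror the field names listed earlier.

```ruby
# Queries each word and returns one summary line per result.
# `search` is any callable that takes a word and returns [status, results],
# where results is an array of hashes with "webPublicationDate" and "id" keys.
def summarise(words, search)
  words.flat_map do |word|
    status, results = search.call(word)
    # Record a line even when nothing came back, so failures stay visible.
    rows = results.empty? ? [{}] : results
    rows.map do |result|
      [status, word, result["webPublicationDate"], result["id"]].join(",")
    end
  end
end

# Usage with a stand-in for the real HTTP call (hypothetical; always returns no results):
fake_search = ->(_word) { [200, []] }
puts summarise(%w[aa ab], fake_search)
```

In a real run the callable would wrap an HTTP request, and the words would come from reading the word-list file line by line.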

The Results

When I returned to the script a few hours later, it was still in the 'a...' section of the words but had already recorded over 125,000 results. With the duplicates removed, that was still over 45,000 separate articles.

I looked through the summary results file, and decided that if I extracted the dates and plotted them on a graph, I might get a better visualisation of any anomalies. I used a simple set of UNIX commands to output a list of times in year order. At this point I discovered an issue.
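I did this step with standard UNIX commands, which are not reproduced here. As a rough equivalent, this Ruby sketch (with a method name of my own) removes duplicate articles and sorts the publication times, oldest first.

```ruby
require "time"

# Returns the publication times of the distinct articles in the summary lines,
# oldest first. Lines without a date (for example error responses) are skipped.
def sorted_publication_dates(lines)
  lines
    .map { |line| line.chomp.split(",", 4) }
    .select { |fields| fields.size == 4 && !fields[2].empty? }
    .uniq { |fields| fields[3] }                 # one entry per article id
    .map { |fields| Time.iso8601(fields[2]) }
    .sort
end

lines = [
  "200,aa,2011-08-03T06:45:36+01:00,world/2011/aug/03/china-calls-us-debt-manage",
  "200,aa,2011-07-29T15:36:00+01:00,world/2011/jul/29/spain-early-election"
]
puts sorted_publication_dates(lines).first.iso8601   # 2011-07-29T15:36:00+01:00
```

Looking at the first and last few entries of the sorted list is often enough to spot dates that do not belong.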

When sorted, it became clear there are some odd dates, for example, here are the first ten dates:

"The Content API is a mechanism for selecting and collecting Guardian content. Get access to over 1M articles going back to 1999, articles from today's coverage, tags, pictures and video" http://www.guardian.co.uk/open-platform/faq

This is invaluable data; it has opened up a whole series of questions and hypotheses for testing, e.g.:

Does the 'webPublicationDate' refer to the actual publication date in the newspaper rather than the website? (the above examples might suggest that)

Do the date-filter options in the API, correctly include/exclude these results?

Will other features/systems of the site/API handle these less-expected dates correctly?

Do we need to update the developer documentation to warn them of the large date range possible in the results?

What dates are valid? How do we handle invalid dates etc?

Are results dated from the future (as well as the distant past) also returned? (Such as those subject to a news embargo)

Should I write a script to check all the dates reported match those in the web-page?
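Some of these questions could be turned into a quick check over the dates already collected. This Ruby sketch is one hypothetical way to do it: the 1999 cut-off comes from the FAQ quote above, while treating 1 January 1999 UTC as the boundary, and the category names, are my own assumptions.

```ruby
require "time"

# Assumed start of the archive, based on the FAQ's "going back to 1999".
ARCHIVE_START = Time.utc(1999, 1, 1)

# Sorts a publication timestamp into a rough category for follow-up testing.
def classify_publication_date(text, now = Time.now)
  published = Time.iso8601(text)
  return :before_archive if published < ARCHIVE_START
  return :future if published > now
  :expected
rescue ArgumentError
  :unparseable
end

puts classify_publication_date("2011-08-03T06:45:36+01:00")   # expected
```

Counting how many results land in each category would show whether the odd dates are isolated cases or a pattern worth raising with the product owner.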

These questions are now immediately the basis for the next set of tests, and provide some interesting lines of enquiry for the next discussion with our programmers and product owners.

The interesting thing here is that we used test automation to improve our testing, by achieving something that couldn't practically be done by a person unaided. Also, it is highly unlikely these new questions would have been asked if we had merely created an executable specification. That process assumes that we know the details of the problems in advance. But as we saw in this example, we had no idea in advance that this situation would occur. To suggest we did would likely be an instance of hindsight bias. That is, we believe we 'should' have known what to look for, even though we and others failed to, given all the evidence present at the time.

Writing test automation that provides us with new information and new avenues for investigation is clearly a better way to learn about your API than focusing your time and resources on simple binary PASS/FAIL tests against a set of canned data. The very nature of its open-ended investigation lends itself to the tight loop of investigating, defining a hypothesis and testing that is at the heart of good testing.