Posts Tagged “tests”

This is the second part of my unit testing advice. See the first part on this blog.

If you need any introduction you should really read the first part. I’ll just present the other three ideas I wanted to cover.

Focusing on common cases

This mistake consists of testing only, or mostly, common cases. Such tests rarely fail, and so give a false sense of security. Tests are better when they also include less common cases, which are much more likely to break inadvertently. Common cases not only break far less often, but will probably be caught reasonably fast once someone tries to use the buggy code, so testing them has comparatively less value than testing less common cases.

The best example I found was in the wrap_string tests. The relevant case was the string “A test of string wrapping…”, which wraps not into two lines, but three (the wrapping is done only on spaces, so “wrapping…” is taken as a single unit; in this sense, my test case could have been clearer by using a very long word instead of a word followed by an ellipsis). Most of the inputs we’ll deal with will simply wrap into two lines, but wrapping into three must work too, and it’s much more likely to break if we decide to refactor or rewrite the code in that function while intending to keep the functionality intact.
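To make the idea concrete, here is a hypothetical Python sketch (my own toy greedy wrapper, not Tor’s wrap_string, which is C): the common two-line case next to the rarer three-line case.

```python
def wrap_string(text, width):
    """Naive greedy wrapper: splits on spaces only, so an unbreakable
    "word" longer than width still gets a line of its own."""
    words = text.split(" ")
    lines, current = [], words[0]
    for word in words[1:]:
        if len(current) + 1 + len(word) <= width:
            current += " " + word
        else:
            lines.append(current)
            current = word
    lines.append(current)
    return lines

# Common case: wraps into two lines. A regression here would be
# noticed quickly by anyone using the code.
assert wrap_string("two short lines", 10) == ["two short", "lines"]

# Less common case: "wrapping..." is a single unbreakable unit,
# forcing a third line. Far more likely to break in a rewrite.
assert wrap_string("A test of string wrapping...", 10) == \
    ["A test of", "string", "wrapping..."]
```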

See other examples of this in aa20bce (no tests with more than one consecutive newline, no tests with lines of only non-printable characters), b248b3f (no tests with just dots, no valid cases with more than one consecutive slash, no invalid cases with content other than slashes), 5e771ab (no directories or hidden files), f8ecac5 (invalid hex characters don’t fail, but produce strange behaviour instead; this test actually discovered a bug), 7856643 (broken escaped content) and 87e9f89 (trailing garbage).

Not trying to make the tests fail

This is related to the previous one, but the emphasis is on deliberately choosing tests that we think will fail (either now or in the future). My impression is that people often fail to do this because they are trying to prove that the code works, which misses the point of testing. The point is trying to prove that the code doesn’t work. And hoping that you fail at it, if you will.

The only example I could find was in the strcasecmpend tests. Note how there’s a test checking that the last three characters of the string “abcDEf” (i.e. “DEf”) compare as less than “deg” case-insensitively. That test is almost pointless, because if we made the same comparison case-sensitively (in other words, if the “case” part of the function broke), the test would still pass! It’s much better to compare the strings “abcdef” and “Deg”.
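Here is a hypothetical Python stand-in for strcasecmpend (the real one is C code in Tor) that shows why the second test is stronger: it fails when the case handling breaks, while the first one doesn’t.

```python
def strcasecmpend(s, suffix):
    """Toy stand-in for Tor's strcasecmpend: compare the last
    len(suffix) characters of s against suffix, ignoring case.
    Returns negative/zero/positive like strcmp."""
    tail = s[-len(suffix):].lower()
    suffix = suffix.lower()
    return (tail > suffix) - (tail < suffix)

def strcmpend_broken(s, suffix):
    """The same comparison with the 'case' part broken."""
    tail = s[-len(suffix):]
    return (tail > suffix) - (tail < suffix)

# Weak test: passes even with the broken, case-sensitive version,
# because "DEf" < "deg" holds case-sensitively too ('D' < 'd' in ASCII).
assert strcasecmpend("abcDEf", "deg") < 0
assert strcmpend_broken("abcDEf", "deg") < 0   # the bug slips through!

# Stronger test: "def" vs "Deg" compares *greater* case-sensitively
# ('d' > 'D'), so this assertion fails if case handling breaks.
assert strcasecmpend("abcdef", "Deg") < 0
assert strcmpend_broken("abcdef", "Deg") > 0
```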

Addendum: trying to cover all cases in the tests

There’s another problem I wanted to mention, which I have seen several times before, although not in the Tor tests: making complicated tests that try to cover many/all cases. This seems to stem from the idea that having more test cases is good in itself, when actually more tests are only useful insofar as they increase the chances of catching bugs. For example, if you write tests for a “sum” function and you’re already testing [5, 6, 3, 7], it’s probably pointless to add a test for [1, 4, 6, 5]. A test that would increase the chances of catching bugs would look more like [-4, 0, 4, 5.6] or [].
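As a Python sketch (assuming a trivial sum function), the difference between a redundant test and one that earns its keep:

```python
def my_sum(numbers):
    """Trivial sum, just to illustrate which tests add value."""
    total = 0
    for n in numbers:
        total += n
    return total

# Already covered territory: another list of small positive ints
# exercises exactly the same path as [5, 6, 3, 7].
assert my_sum([5, 6, 3, 7]) == 21
assert my_sum([1, 4, 6, 5]) == 16   # adds essentially nothing

# These actually increase the chances of catching a bug:
assert my_sum([]) == 0                              # empty input
assert abs(my_sum([-4, 0, 4, 5.6]) - 5.6) < 1e-9    # negatives, zero, floats
```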

So what’s wrong with having more tests than necessary? They make the test suite slower, harder to understand at a glance and harder to review. If they don’t improve the chances of catching bugs, why pay that price? But the biggest problem comes when, in trying to cover so many cases, the code ends up generating the test data. Then we have all the above problems, plus the test suite becomes almost as complex as production code: much easier to introduce bugs into, harder to follow, and so on. The tests are our safety net, so we should be fairly sure that they work as expected.

When reviewing tests written by other people, I see patterns in the improvements I would make. As I realised that these “mistakes” are also made by experienced hackers, I thought it would be useful to write about them. The extra push to write about this now was having concrete examples from my recent involvement in Tor, which will hopefully illustrate these ideas.

These ideas are presented in no particular order. Each of them has a brief explanation, a concrete example from the Tor tests, and, if applicable, pointers to other commits that illustrate the same idea. Before you read on, let me explicitly acknowledge that (1) I know that many people know these principles, but writing about them is a nice reminder; and (2) I’m fully aware that sometimes I need that reminder, too.

Tests as spec

Tests are more useful when they show how the code is supposed to behave, safeguarding against future misunderstandings. Thus, it doesn’t matter if you know that the current implementation will pass those tests, or that those test cases don’t add new or different “edge” cases. If the test cases show more clearly how the code behaves (and/or could catch errors in a from-scratch rewrite with a different design), they’re good to have around.

I think the clearest example was the tests for the eat_whitespace* functions. Two of those functions end in _no_nl and eat only initial whitespace, except newlines. The other two eat initial whitespace, including newlines… but also eat comments. The tests from line 2280 on are clearly targeted at the second group, as they don’t really represent an interesting use case for the first. However, without those tests, a future maintainer could have thought that the _no_nl functions were supposed to eat newlines too, and break the code. That produces confusing errors and bugs, which in turn make people afraid of touching the code.
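A hypothetical Python sketch of the idea (the real functions are C helpers in Tor; the exact set of characters eaten here is my assumption): the interesting tests are the ones that pin down the spec.

```python
def eat_whitespace_no_nl(s):
    """Skip leading spaces and tabs, but NOT newlines (a sketch of
    what the _no_nl variants promise; character set is an assumption)."""
    i = 0
    while i < len(s) and s[i] in " \t":
        i += 1
    return s[i:]

assert eat_whitespace_no_nl("  \tfoo") == "foo"    # eats spaces/tabs
assert eat_whitespace_no_nl("\nfoo") == "\nfoo"    # but a newline stops it
assert eat_whitespace_no_nl(" \n foo") == "\n foo"

# Without the last two assertions, a future maintainer could "fix"
# the function to eat newlines too and no test would complain.
```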

See other examples in commits b7b3b99 (escaped ‘%’, negative numbers, %i format string), 618836b (should an empty string be found at the beginning, or not found at all? does “\n” count as beginning of a line? can “\n” be found by itself? what about a string that expands more than one line? what about a line including the “\n”, with and without the haystack having the “\n” at the end?), 63b018ee (how are errors handled? what happens when a %s gets part of a number?), 2210f18 (is a newline only \r\n or \n, or any combination or \r and \n?) and 46bbf6c (check that all non-printable characters are escaped in octal, even if they were originally in hex; check that characters in octal/hex, when they’re printable, appear directly and not in octal).

Testing boundaries

Boundaries of different kinds are a typical source of bugs, and thus are among the best places to aim tests at. It’s also good to test both sides of each boundary, both as documentation and because bugs can appear on either side (and not necessarily on both at once!).

The best example is the tor_strtok_r_impl tests (a function that is supposed to be compatible with strtok_r; that is, it chops a given string into “tokens” separated by any of the given separator characters). In fact, these extra tests discovered an actual bug in the implementation (i.e. an incompatibility with strtok_r). They asked a couple of interesting questions, including “when a string ends in the token separator, is there an empty token at the end?” in the “howdy!” example. This test can also be considered valuable in the “tests as spec” sense, if you consider that the answer to the above question is not obvious and both answers could be considered correct.
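In Python, a toy strtok_r-style tokenizer makes the boundary question easy to state (this is my sketch, not Tor’s code): does a trailing separator yield a trailing empty token?

```python
def tokens(s, seps):
    """strtok_r-style tokenizing (sketch): runs of separators are
    collapsed and, crucially, never produce empty tokens."""
    out, current = [], ""
    for ch in s:
        if ch in seps:
            if current:
                out.append(current)
                current = ""
        else:
            current += ch
    if current:
        out.append(current)
    return out

# Boundary on both sides of "ends in a separator":
assert tokens("howdy friends", " ") == ["howdy", "friends"]
assert tokens("howdy!", "!") == ["howdy"]          # no empty token at the end
assert tokens("!howdy!!friends!", "!") == ["howdy", "friends"]

# Note that Python's str.split answers the question the other way:
assert "howdy!".split("!") == ["howdy", ""]
```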

See other examples in commits 5740e0f (checking that tor_snprintf correctly counts the number of bytes, as opposed to characters, when calculating whether something can fit in a string; also note my embarrassing mistake of testing snprintf, and not tor_snprintf, later in the same commit), 46bbf6c (check that character 21 doesn’t make a difference, but 20 does) and 725d6ef (testing 129 is very good, but even better together with 128; or, in this case, 7 and 8).

Testing implementation details

Testing implementation details tends to be a bad idea. You are usually testing implementation details if you’re not getting the information for the test through the APIs provided by whatever you’re testing: for example, if you test an API that inserts data in a database by checking the database directly, or if you check that the result of a method call is correct by inspecting the object’s internals or calling protected/private methods. There are two reasons why this is a bad idea: first, the more implementation details your tests depend on, the fewer implementation details you can change without breaking your tests; second, your tests become less readable because they’re cluttered with details instead of meaningful code.
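A Python sketch of the difference (the Counter class here is hypothetical): the first pair of assertions tests through the public API and survives any change of internal representation; the last one pokes at internals and does not.

```python
class Counter:
    """Counts occurrences of items. The dict is an implementation
    detail that callers should not rely on."""
    def __init__(self):
        self._counts = {}

    def add(self, item):
        self._counts[item] = self._counts.get(item, 0) + 1

    def count(self, item):
        return self._counts.get(item, 0)

c = Counter()
c.add("x")
c.add("x")

# Good: tests through the public API. Still passes if _counts
# becomes, say, a collections.Counter or a sorted list of pairs.
assert c.count("x") == 2
assert c.count("y") == 0

# Bad: pokes at internals. Breaks on any representation change,
# and clutters the test with details that mean nothing to readers.
assert c._counts == {"x": 2}
```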

The only example of this I encountered in Tor was the compression tests. In this case it wasn’t a big deal, really, but I have seen this before in much worse situations and I feel it illustrates the point well enough. The problem with the deleted line is that it’s not clear what its purpose is (it needs a comment), plus it uses a magic number, so if someone ever changes that number by mistake, it’s not obvious whether the problem is in the code or in the test. Besides, we are already checking that the magic number is correct by calling detect_compression_method. Thus, the deleted memcmp doesn’t add any value and makes our tests harder to read. Verdict: delete!

I hope you liked the examples so far. My next post will contain the second half of the tips.

A couple of weeks ago I discovered LeakFeed, an API to fetch cables from Wikileaks. I immediately thought it would be cool to play a bit with it and create some kind of application. After a couple of failed ideas that didn’t really take off, I decided to exploit my current enthusiasm for Javascript and build something without a server. Other advantages were that I knew Angular, an HTML “power up” written in Javascript (what else?), which I knew would ease the whole process a lot, and that I even got the chance to learn how to use Twitter’s excellent Bootstrap HTML and CSS toolkit.

What I decided to build is a very simple interface to search for leaked cables. I called it LeakyLeaks (see the code on GitHub). Unfortunately the LeakFeed API is quite limited, so I had to scale back my original idea. However, I think the result is kind of neat, especially considering the little effort it took. To build it, I started by writing support classes and functions using Test-Driven Development with Jasmine. Once I had that basic functionality up and running, I started building the interface with Bootstrap and, at that point, integrating the data from LeakFeed with Angular was so easy it’s almost ridiculous. And as LeakFeed can return the data using JSONP, I didn’t even need a server: the whole application is simply a static HTML file with some Javascript code.

All this get-data-from-somewhere-and-display-it is astonishingly simple in Angular. There’s a feature (“resources”) to declare sources of external data: you define the URLs and methods to get the data from those resources, and then simply call some methods to fetch it. E.g. you can get the list of all tags in LeakFeed from http://api.leakfeed.com/v1/cables/tags.json (adding a GET parameter callback with a function name if you want JSONP). Similarly, you can get the list of all offices in LeakFeed from http://api.leakfeed.com/v1/cables/offices.json. In Angular, declaring a resource for all this information takes just a few lines.

Once you have declared it, using it is as simple as calling the appropriate method on the object. That is, you can get the tags by calling this.LeakFeedResourceProxy.getTags(), and the offices by calling this.LeakFeedResourceProxy.getOffices(). And when I say “get the tags”, I mean get a list of Javascript objects: no JSON, text or processing involved. If you assign the result of those calls to a property (say, this.availableOffices), you can show that information directly in the template (in my case through |officename, a custom filter that shows the office names with a special format).

The cool thing is, thanks to Angular’s data binding, anytime the value of that variable changes, the representation in the HTML changes, too. That is, if you assign another value to this.availableOffices the select box will be automatically updated to have the new set of options! But the data binding is two-way, so any changes you make in the UI also get reflected back in the Javascript variables. This further simplifies many tasks and makes programming with Angular a breeze.
There are other nice things about Angular (and many nice things about Bootstrap and Jasmine of course!), but I think that’s enough for today :-)

This post is probably not about what you’re thinking. It’s actually about automated testing.

Different stuff I’ve been reading or otherwise been exposed to in the last few weeks has led me to a sort of funny comparison: code is (or can be) like science. You come up with some “theory” (your code) that explains something (solves a problem)… and you make sure it can be measured and tested, so that people believe your theory and build on top of it.

I mean, something claiming to be science that can’t be easily measured, compared or peer-reviewed would be ridiculous. Scientists wouldn’t believe in it and would certainly not build anything on top of it because the foundation is not reliable.

I claim that software should be the same way, and thus it’s ridiculous to trust software that doesn’t have a good test suite, or even worse, that may not even be particularly testable. Trusting software without a test suite is not that different from taking the word of the developer that it “works on my machine”. Scientists would call untested science pseudo-science, so I am tempted to call code without tests pseudo-code.

Don’t get me wrong: sure, you can test by hand, and hand-made tests are useful and necessary, but they only prove that the exact code you tested, without any changes, works as expected. But you know what? Software changes all the time, so that’s not much help. If you don’t have a way to quickly and reliably measure how your code behaves, every change you make is a leap of faith. And the more leaps of faith you take, the less credible your code is.

The other day I was in a conversation with a developer who was complaining about some feature. He claimed that it was too complex and that it had led to tons of bugs. In the middle of the conversation, he mentioned that the feature had been so buggy that he had ended up writing a lot of unit tests for it. To be honest, I don’t think there were many bugs after those tests were written, and that made me think:

Maybe the testers in his team are doing too good of a job?

Because, you know, if testers are finding enough of “those bugs” (the ones that should be caught and controlled by unit tests, not by testers weeks after the code was originally written), maybe some developers just don’t feel the pressure and never really get that they should be writing tests for their code. If testers are very good, things just work out fine in the end… sort of. And of course, the problem here is the trailing “sort of”.

I know I’m biased, but in my view there are a ton of bugs that should never be seen by anyone but the developers themselves. Testers should deal with more complex, interesting, user-centred bugs: non-trivial cases, suboptimal UIs, implementation disagreements between developers and stakeholders, that kind of thing. It’s simply a waste of time and resources for testers to deal with silly, easy-to-avoid bugs on a daily basis. Not to mention that teams shouldn’t have to wait days or weeks until basic bugs turn up via exploratory testing, or that many of those bugs are much, much quicker to check with unit tests than by creating the whole fixture/environment needed to find them through exploratory testing.

My current conclusion is that pushing on the UI/usability side is not only good for the user: it’s also likely to produce better code, as that code will be, on average, more complex and will have to be better controlled by QA (code review, less ad-hoc design, …) and automated tests. Maybe developers will start hating me for that, but hopefully users will thank me.

I had said that I was going to publish the slides for a couple of talks I gave over the last couple of months, and I just got around to actually doing it, so here they are:

Software automated testing 123, an entry-level talk about software automated testing. Why you should be doing it (if you’re not already), some advice for test writing, some basic concepts and some basic examples (in Perl, but I trust it shouldn’t be too hard to follow even if you don’t know the language).

Taming the Snake: Python unit tests, another entry-level talk, but this time about Python unit testing specifically. How to write xUnit style tests with unittest, some advice and conventions and some notes on how to use the excellent nosetests tool.

Just a quick note about them: the slides shouldn’t be too hard to understand without me talking, but of course you’ll miss some stuff that isn’t written down: some twists, clarifications of what exactly I mean by different things, and whatnot. In particular, the “They. don’t. make. sense. Don’t. write. them” bit refers to tests that don’t have a reliable/controlled environment to run in. I feel really strongly about those, so I wanted to dedicate a few extra seconds to smashing the idea that they’re ok, hence the extra slides :-)

After porting Haberdasher to Rails 2, I had forgotten to run all the test suites I had (unit, functional and acceptance, with Selenium and Selenium on Rails). The bad news is that they didn’t pass. The good news is that making them pass again wasn’t such a big problem.

The functional tests failed because of a stupid change in Rails 2: namely, it seems that you now can’t make more than one request in a single functional test method (bug?). The acceptance tests had some minor problems due to some changes I had made in the interface. The rest worked without problems.

Now that everything is ported and working like a charm, it’s time to make some interesting changes. I had been wanting to use a really cool library called RemixUI, made by my former company, Fotón Sistemas Inteligentes, and these days I finally got the chance to use the latest version. RemixUI is a “web widget” library, similar to DJWidgets, MCWidgets and RemixWidgets (all of them available on the Fotón BerliOS page, but unfortunately obsolete), that makes it much easier to write validation, integrate the client and server sides, improve the interface with Javascript, build reusable widgets/controls, etc.

I haven’t used it that much yet, but I’m really eager to change all the forms and controls in the application to take advantage of the cool stuff RemixUI offers. The problem is that the RemixUI gem is not public yet, so I can’t really release the new version of Haberdasher. I’ll try to convince them to put the gem somewhere public, so I can release Haberdasher and other people can have a look at RemixUI.