Thursday, August 4, 2011

Quite often I see someone put forward the argument that integration testing yields better code coverage than unit testing, and here I want to explain why that simply isn't true. I won't go into detail on why one is better than the other; suffice it to say that both techniques have value.

The most convincing argument comes down to simple mathematics. Let's say we have a class structure A -> B -> C: A depends on B and B depends on C. Then let's say there are 10 code paths through each class, covering all possible branches including exceptions (referred to below as b). Finally, we want 100% code coverage, a pointless but aspirational goal.

If we're unit testing, each class is tested in isolation and the total number of tests will be:

(A * b) + (B * b) + (C * b) = 30

If, instead, we do integration testing and want to get 100% coverage, we have to account for all the possible interactions between the classes, so the formula instead looks like this:

(A * b) * (B * b) * (C * b) = 1000

So even though one integration test might cover more than one unit test, getting the same level of coverage from integration testing takes far more work. Bear in mind that this is a trivial object graph; a real one could require millions or billions of tests.

Friday, June 10, 2011

A spaghetti build, much like spaghetti code, is a hair-pulling experience. Most developers (but not nearly enough) know how to avoid spaghetti code: refactoring, extracting classes and methods from tight loops, removing unexpected side effects. Quite frequently, however, these same principles don't get applied to our build scripts. Here I want to show an example of a spaghetti build and then some techniques for avoiding one.

Here is an example of what I would consider a spaghetti build; regrettably, some of this comes straight from the MS Build documentation:
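Roughly, something like this (a reconstructed sketch; the buildType property and target names are illustrative):

```xml
<Project DefaultTargets="clientBuild"
         xmlns="http://schemas.microsoft.com/developer/msbuild/2003">

  <PropertyGroup>
    <!-- one property drives the whole build -->
    <buildType Condition="'$(buildType)' == ''">client</buildType>
  </PropertyGroup>

  <!-- meta targets that hard-code the order of everything -->
  <Target Name="clientBuild">
    <CallTarget Targets="clean;compile;testUnit" />
  </Target>

  <Target Name="serverBuild">
    <CallTarget Targets="clean;compile;testUnit;testIntegration" />
    <CallTarget Targets="package" Condition="'$(buildType)' == 'server'" />
  </Target>

</Project>
```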

At first glance a build like this might look like a good idea. The entirety of the build process has been simplified down to two targets, all configuration is handled by a single property and life seems pretty simple. The trade-off, however, is flexibility: the above approach has in essence created the "one build to rule them all", an anti-pattern that can result in individual tasks being unable to run in isolation.

Another problem is that it violates the repeatability principle. It is impossible to run a "server" build without being logged in to the server or checking in code changes. A solid build system should instead be repeatable anywhere; no special rules should exist for a CI build. I addressed how to handle properties in this post.

Tasks should declare dependencies

In the above example tasks didn't declare their dependencies because the meta tasks (clientBuild, serverBuild) called them in an explicit order. A cleaner and more flexible approach is to have each task declare its dependencies and let MS Build work out which order to run them in. The testUnit task, for example, will depend on the code having been compiled first, so it should be declared like this:
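In MS Build that is done with the DependsOnTargets attribute; a sketch (the NUnit command line and assembly name are illustrative):

```xml
<Target Name="testUnit" DependsOnTargets="compile">
  <!-- compile is guaranteed to have run before the tests execute -->
  <Exec Command="&quot;$(nunitPath)\nunit-console.exe&quot; UltraProduct.Tests.dll" />
</Target>
```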

A general rule of thumb is that if a task depends on the output of another task then that dependency should be explicitly declared.

Users should declare targets

The package task is an example of where the reverse is true. Before packaging, the unit tests and integration tests need to pass, so it would seem natural to make these dependencies. Package, however, doesn't depend on the output of testUnit or testIntegration, so it should be left up to the user to declare this dependency instead:

>msbuild build.proj /target:testUnit;testIntegration;package

This way the build still fails if the tests fail but someone can be working on the installer without worrying about long build times.

Another common example is code versioning, generating an AssemblyInfo class with the appropriate info. Here is a hypothetical version task to do this:
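A sketch using the AssemblyInfo task from the MSBuild Community Tasks project (the property names are illustrative; any task that writes an AssemblyInfo file will do):

```xml
<Target Name="version">
  <!-- regenerate the shared AssemblyInfo file with the current build number -->
  <AssemblyInfo CodeLanguage="CS"
                OutputFile="GlobalAssemblyInfo.cs"
                AssemblyVersion="$(major).$(minor).0.$(buildNumber)"
                AssemblyFileVersion="$(major).$(minor).0.$(buildNumber)" />
</Target>
```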

It would be tempting to make the compile task dependent on this, but that would trigger a full recompile every time (MS Build only recompiles when files have changed). Release builds are usually the only ones that care about the correct version, so it doesn't make sense to slow down every build for one relatively rare occurrence. The compile task will use the output from the version task but does not depend on it; it simply doesn't care one way or the other.

Anyone who does care about the correct AssemblyInfo being generated should make this explicit:

>msbuild build.proj /target:version;compile

When users declare their targets it keeps the dependency chain simple, shallow and, above all, understandable. Once again I want to stress that the CI server should be considered as just another user, no more or less important than the developers.

Don't call other tasks

MS Build comes equipped with a CallTarget task, which should be avoided whenever possible. Many times I've seen people try to refactor build scripts to conditionally call other tasks. The above example had a "server" build and a "local" build; other variations include "trunk" and "stable" as well as "test" and "release". What they all have in common is that they violate the repeatability principle, and the points I outlined above give you all you need to avoid this anti-pattern.

Another quite common example is having a test task that executes the integration and unit test tasks (to save keystrokes):
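Which tends to look something like this (a sketch of the anti-pattern, not a recommendation):

```xml
<Target Name="test">
  <!-- saves typing, but hides the real dependency chain -->
  <CallTarget Targets="testUnit;testIntegration" />
</Target>
```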

One way these meta tasks come about is when something needs to be repeated more than once. A "stable" build, for example, might produce 32 and 64 bit versions, which would be a waste of time if every developer had to repeat them for every check in. A far easier solution is to enlist the CI server rather than the build system. A typical project might contain the following CI configurations:

UltraProduct Trunk

UltraProduct Test

UltraProduct 1.2 32 bit

UltraProduct 1.2 64 bit

UltraProduct 1.4 32 bit

UltraProduct 1.4 64 bit

The details will differ wildly depending on your project. Some only need a stable and a development version while others will need to maintain several stable versions and a few development branches simultaneously. Scrum developers might decide on a build for each sprint so as not to have people checking in half-implemented features.

Remember, not all builds have to run on every check in: a test build might only need to happen when the testers are ready for a new version, a release build can generally happen overnight and a trunk build doesn't need to pass all integration tests. Each build configuration will have different needs and should be unaffected by other configurations.

These are a few simple rules I've come up with from real world situations, developing build scripts and working with others. If you have suggestions for more I'd like to hear them.

Keeping database structures up to date with other developers, not to mention production and testing environments, is an incredibly common problem, and Fluent Migrator is a seriously good library that helps solve this problem elegantly. In other ecosystems the idea of migrations is well accepted: Ruby has Rails migrations, Java has migrate4j and Python has yoyo migrations, but for some reason the idea has been slow to gain traction in the C# world.

One of the barriers to adoption so far has been documentation, so in this article I want to show how to leverage Fluent Migrator. There are plenty of good articles on writing migrations so I will only touch on that briefly. Instead, I want to show how to use it as an integral part of the build process, how to avoid some common pitfalls and how to integrate it within the greater project lifecycle.

What's Wrong With Everything Else

For a start, I'm going to assume that sharing a database between developers is a bad idea. Someone may need to make large scale changes that disrupt other developers, having long running feature branches is impossible (without another database) and it does nothing to solve the problems you're eventually going to face at deployment time. This is obviously not an optimal solution.

Of the many half-solutions I've seen, simple SQL scripts are the most common. Advanced versions of this have sequentially numbered scripts and some sort of batch process to run them in order: in other words, half of Fluent Migrator. Often they will contain checks to ensure they're not executed multiple times (so the same column doesn't get added twice, for instance): another quarter of Fluent Migrator.

From a simplicity point of view this may seem rather enticing, but it doesn't work nearly as well in practice. This approach isn't very branch/merge friendly and migrations need to be run in an explicitly defined order. SQL is also a particularly verbose language, especially for structural changes; how many of you can remember the syntax to create a table with an index off the top of your head?

Other solutions I've come across, such as Visual Studio database projects, suffer from similar shortcomings and are deeply rooted in the old school "quarterly release" type of project, where a once-off upgrade script is written for every release. One particularly notable "solution" I came across used Data Dude to generate patch scripts. It was so slow, cumbersome and error prone that I still shudder just thinking about it.

With that in mind here are the goals for this solution:

Local - Upgrades should be able to run anywhere, from a developer's PC to the server.

Fast - This is a freebie with Fluent Migrator; most migrations will take seconds from start to finish.

Frequent - Once fast and easy are solved you'll want to run them much more frequently.

Production Ready - Migrations should be deployable at a moment's notice. This is a by-product of the above three.

Step 1 - Backup, Delete, Restore, Obfuscate

The first few requirements are basically "filling in the gaps": handling the tasks that Fluent Migrator doesn't deal with. Nevertheless, they are an important part of the overall process. In these examples I will be using Sqlite because its simple, file-based nature makes it easy to follow the examples without getting lost in the details. Here are the important bits of the database.build file (I'll post the whole file at the end); this is created with the same techniques I outlined here:
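Since Sqlite databases are just files, the tasks mostly reduce to file operations; a sketch (file names and property names are illustrative):

```xml
<PropertyGroup>
  <databaseFile Condition="'$(databaseFile)' == ''">UltraProduct.db</databaseFile>
  <backupFile Condition="'$(backupFile)' == ''">backups\UltraProduct.db</backupFile>
</PropertyGroup>

<Target Name="backup">
  <Copy SourceFiles="$(databaseFile)" DestinationFiles="$(backupFile)" />
</Target>

<Target Name="delete">
  <Delete Files="$(databaseFile)" />
</Target>

<Target Name="restore" DependsOnTargets="delete">
  <!-- start from a known good copy rather than an empty database -->
  <Copy SourceFiles="$(backupFile)" DestinationFiles="$(databaseFile)" />
</Target>

<Target Name="obfuscate">
  <!-- intentionally empty: scrubbing rules vary from project to project -->
</Target>
```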

Backup and delete should be fairly self explanatory, but restore needs a bit more explanation. I've found it more useful to start from a "known good" version of the database rather than starting from scratch every time. For this I strongly favor using backups from production databases. This gives us much more realistic data to work with when we're developing, quantitatively and qualitatively. Failing that, a database with somewhat realistic looking test data will do.

A pleasant side effect of using real production data is that, by the time a release comes around, we have run our data migrations dozens, hundreds or thousands of times. This gives much more confidence in the process and makes releases much less stressful. An emergency patch can also be tested and verified using much the same process as a full release (because it's so quick and easy), so no more cowboy live patches.

Obviously, if we have real data it may need obfuscating due to any number of privacy issues that could pop up. Security on developer workstations probably isn't as thorough as it is on the server, and we don't want people taking sensitive data out of the office. However, I have purposely left the task empty, simply because it will vary so much from project to project.

Step 2 - Writing a Migration

Before any migrations can be written they need a project; a Fluent Migrator script is simply a C# library (dll) after all. Usually it gets named something like YourProjectName.Migrations. Here is my recommended directory structure:
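Something along these lines (the version numbers and file names are illustrative):

```
YourProjectName.Migrations/
  1.0/
    structure/
      01_00_2011_03_14_10_05_CreateProducts.cs
    data/
      01_00_2011_03_15_09_30_SeedCountries.cs
  1.1/
    structure/
    data/
```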

As you can see, each revision gets its own folder; sooner or later you will have enough migrations to warrant this. The other thing of note is that each version has a data and a structure folder. The structure folder will contain the bread and butter of Fluent Migrator: adding/removing tables and columns. The data folder exists to make it easier to find the migrations that create/update/delete any reference data our applications will inevitably have.

Finally some actual code. This example is a simple migration that creates a products table with a name, description and primary key:
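A sketch of such a migration (the namespace, version number and column sizes are illustrative; the fluent API is Fluent Migrator's, but check it against the version you're using):

```csharp
using FluentMigrator;

namespace UltraProduct.Migrations
{
    // version: {major}{minor}{year}{month}{day}{hour}{minute}
    [Migration(101201105021312)]
    public class CreateProducts : Migration
    {
        public override void Up()
        {
            Create.Table("products")
                .WithColumn("id").AsInt32().PrimaryKey().Identity()
                .WithColumn("name").AsString(255).NotNullable()
                .WithColumn("description").AsString(4000).Nullable();
        }

        public override void Down()
        {
            Delete.Table("products");
        }
    }
}
```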

The interesting thing here is the version number, which is probably the least intuitive part of this article. The convention used is: {major}{minor}{year}{month}{day}{hour}{minute}. Major/minor are our product versions; this ensures that migrations are executed in the order the product is developed, which is particularly important if a project has long running branches with features and bug fixes. The rest is the time the migration was created; hours and minutes ensure (almost always) that there are no collisions with other developers writing migrations concurrently.

This numbering scheme will save much frustration at merge time.

Step 3 - Running Migrations

At the time of writing the msbuild task isn't working with the version of Fluent Migrator I'm using, and it also isn't documented, so I'll be using the Exec task coupled with the Fluent Migrator console runner. One other thing to remember is that Fluent Migrator is currently a .NET 3.5 project, so all migrations projects need to be as well. Here is the MS Build task to execute the migrations:
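A sketch (the paths are illustrative, and the switch names are from memory of the 2011-era console runner, so check them against the Migrate.exe you have):

```xml
<Target Name="migrate" DependsOnTargets="compile">
  <Exec Command="tools\FluentMigrator\Migrate.exe /db sqlite /connection &quot;Data Source=$(databaseFile)&quot; /target UltraProduct.Migrations.dll /version $(dbVersion)" />
</Target>
```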

There are quite a few properties here, but the dbVersion property is the most important. The default is set (at the top of the script) to 0, which will run all migrations. Because the precedence rules I outlined in my last article (Practical MS Build - Flexible Configuration) are used, it is easy to migrate to a specific version if needed. From the command line, simply specify a new value for the property:
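For example (the target name and version number are illustrative):

>msbuild database.build /target:migrate /property:dbVersion=101201105021312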

One of the most cited reasons for dismissing Fluent Migrator is that it doesn't handle data migrations. This is throwing the baby out with the bath water. It is true that Fluent Migrator doesn't handle this directly, but it does provide an excellent framework to execute and track such migrations. Because we're pragmatic programmers who use the right tool for the job, we'll want to modify our data with a language made for just that: SQL. Fortunately Fluent Migrator allows us to execute arbitrary SQL; we just need a little bit of organisation and self discipline.

Let's say a migration needs to add a meta column to the products table created above, and that we need to populate it with the description to serve as a placeholder until a real person edits it. The first part is easy: just create another migration which adds the column:
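A sketch, continuing the naming and versioning conventions from above (class and namespace names are illustrative):

```csharp
using FluentMigrator;

namespace UltraProduct.Migrations
{
    [Migration(101201105021944)]
    public class PopulateMetaOnProducts : Migration
    {
        public override void Up()
        {
            Create.Column("meta").OnTable("products").AsString(4000).Nullable();
        }

        public override void Down()
        {
            Delete.Column("meta").FromTable("products");
        }
    }
}
```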

Next we need to create an SQL file to hold our script. Generally I stick to the same naming conventions used for the migration classes, so I created 01_01_2011_05_02_19_44_PopulateMetaOnMigrations.sql in the same directory as the migration classes:

UPDATE products
SET meta = [description]

This is about as straightforward as SQL scripts get. Next, I create a resource file (0101_SqlMigrations.resx) in the version directory and add the above script as a file resource. This compiles the script into the dll, which simplifies things when we need to use our migrations externally (installers etc). The last thing to do is modify the migration class above with a line to execute the file:
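The Up method gains one extra line; Execute.EmbeddedScript is Fluent Migrator's call for running an SQL file that has been compiled in as a resource (check the exact name against the version you're using):

```csharp
public override void Up()
{
    Create.Column("meta").OnTable("products").AsString(4000).Nullable();

    // run the resource-embedded script against the database
    Execute.EmbeddedScript("01_01_2011_05_02_19_44_PopulateMetaOnMigrations.sql");
}
```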

A fast, flexible and complete solution to manage databases. With this approach you will never again fear long running branches, never again have to develop against a database being modified by others and, most importantly, never again dread database upgrades at release time. Once you get used to a solution like this, going back to anything else will seem slow, archaic and error prone.

Friday, April 22, 2011

Any sufficiently complex build script will contain dozens of variables. Database names, IIS sites, compile options and test configurations are good examples of variables that can change from machine to machine, even branch to branch.

Many scripts I've seen rely on various assumptions about the machine they will be running on. Others force everything to be explicitly defined for every build run. Many (TFS) try to do both at once, and the result is a spaghetti-like mess.

This is what I have found to be an effective approach to juggling these variables, avoiding unnecessary surprises yet still remaining flexible.

Declaring the Properties

First of all, a build script should declare all its variables near the top. These values should generally be sensible defaults: values that will work for most developers most of the time. I'll get to what the Condition attributes are for further down; here is a typical PropertyGroup section:
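For example (the property names and defaults are illustrative):

```xml
<PropertyGroup>
  <configuration Condition="'$(configuration)' == ''">Debug</configuration>
  <databaseConnection Condition="'$(databaseConnection)' == ''">Data Source=localhost;Initial Catalog=UltraProduct;Integrated Security=True</databaseConnection>
  <nunitPath Condition="'$(nunitPath)' == ''">C:\Program Files\NUnit 2.5\bin</nunitPath>
</PropertyGroup>
```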

Next is the echo task. This will probably seem repetitive, but we will be overriding these variables from various sources so it's important to have a sanity check, especially when the script is running on a build server. The code itself is straightforward:
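A sketch that simply prints each declared property:

```xml
<Target Name="echo">
  <Message Text="configuration:      $(configuration)" />
  <Message Text="databaseConnection: $(databaseConnection)" />
  <Message Text="nunitPath:          $(nunitPath)" />
</Target>
```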

To run this script open up the Visual Studio command prompt, cd to the project directory and type:

msbuild project.build /target:echo

Overriding Properties

There are going to be many times when we want to override the default values, and this is exactly what the Condition attribute is for. If someone (usually the build server) needs to use a different database, it can be overridden from the command line:
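For example (the connection string is illustrative):

>msbuild project.build /target:echo /property:databaseConnection="Data Source=build01;Initial Catalog=UltraProduct;Integrated Security=True"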

databaseConnection will only be given its default value if the Condition attribute evaluates to true. Because we defined the property on the command line, the condition evaluated to false. This simple bit of code has created a precedence order where command line property > default property. In real world usage this saves a great deal of automagic configuration and/or assumptions about the operating environment.

Developer Configuration

Of course a real world build script could have dozens of properties and typing these into the command line every time would get a bit monotonous. This can be avoided by creating a properties file that will save all our personal settings.

A properties file (I usually call it local.properties) sits in the root directory (along with the main build file) and is just a special MS Build script that will be imported into the main script. A properties file should never be kept in source control and should be explicitly excluded. Here is what a properties file looks like:
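A sketch (the nunitPath value is illustrative):

```xml
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup>
    <nunitPath>D:\tools\NUnit-2.5.10\bin</nunitPath>
  </PropertyGroup>
</Project>
```

The main script then needs an Import element, guarded so the build still works when the file is absent:

```xml
<Import Project="local.properties" Condition="Exists('local.properties')" />
```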

This will import the properties file if it exists, and the properties declared in it will override those in the main script. In this case the properties file overrides the nunitPath property because this developer installed NUnit to a location that differs from the rest of the team. This is no big deal because he only has to specify it once and never think about it again.

Now the precedence order is: command line property > properties file property > default property. This is a good rule that will generally suit everyone. A developer with all the defaults needs no configuration, yet configuration is easy for anyone with different needs.

And that's it, the basis of a flexible, configurable and maintainable build system with MS Build. I intend to write a series of articles on MS Build in the near future as well as a working sample project, as I did with the Frictionless WCF example, so stay tuned.