Monday, 16 May 2011

Testing Drives the Need for Flexible Configuration

If you look at our system’s production configuration settings you would be fooled into thinking that we only need a simple configuration mechanism that supports a single configuration file. Production is always the easy case because things have settled down; it is during testing that the flexibility of your configuration mechanism really comes into play.

I work on distributed systems, which naturally have quite a few moving parts. One of the biggest hurdles to development and maintenance in the past has been that the various components could not be independently configured, so you could not cherry-pick which services to run locally and which to draw from your integration/system test environment. Local (as in on your desktop) integration testing puts the biggest strain on your configuration mechanism, as you can probably only afford to run a few of the services that you might need unless your company also provides those big iron boxes for developer workstations[#].

In the past I’ve found the need to override component settings using a variety of criteria and the following list is definitely not exhaustive, but gives the most common reasons I have encountered:-

Per-Environment

The most obvious candidate is environmental as there is usually a need to have multiple copies of the system running for different reasons. I would hazard a guess that most teams generally have separate DEV, TEST & PROD environments to cover each aspect of the classic software lifecycle. For small systems, or systems with top-notch test coverage the DEV & TEST environments may serve the same purpose. Conversely I have worked on a team that had 7 DEV environments (one per development stream), a couple of TEST environments and a number of other special environments used for regulatory purposes, all in addition to the single production one.

What often distinguishes these environments is the instances of the external services that you can use. Generally speaking all production environments are ring-fenced so that you only have PROD talking to PROD to ensure isolation. In some cases you may be lucky enough to have UAT talking to PROD, perhaps to support parallel running. But DEV environments are often in a sorry state and highly untrusted so are ring-fenced for the same reason as PROD, but this time for the stability of everyone else’s systems.

Where possible I like the non-production environments to be a true mirror of the production one, with the minimum changes required to work around environmental differences. Ideally we’d have infinite hardware so that we could deploy every continuous build to multiple environments configured for different purposes, such as stress testing, fault injection, DR failover etc. But we don’t. So we have to settle for continuous deployment to DEV to run through some basic scenarios, followed by promotion to UAT to provide some stability testing. What this means is that our inputs are often the same as for production, but naturally our outputs have to be different. But you don’t want to have to configure each output folder separately, so you need some variable-based mechanism to keep it manageable.

The Disaster Recovery (DR) environment is an interesting special case because it should look and smell just like production. A common technique for minimising configuration changes during a failover is to use DNS aliases (CNAME records) for the important servers, but that isn’t always foolproof. Kerberos delegation in combination with CNAMEs is a horribly complicated affair. And that’s when you have no control over the network infrastructure.

Per-Machine

Next up is machine specific settings. Even in a homogeneous Windows environment you often have a mix of 64-bit and 32-bit hardware, slightly different hard disk partitioning, or different performance for different services. Big corporations love their “standard builds”, which help minimise the impact, but even those change over time as the hardware and OS change – just look at where user data has been stored in Windows over the last few releases. The ever changing security landscape also means that best practices change and these will, on occasion, have a knock-on effect on your system set up.

By far the biggest use for per-machine overrides though is during development, i.e. when running on the developer’s workstation. While unit testing makes a significant contribution to the overall testing process you still need the ability to easily cobble together a local sandbox in which you can do some integration testing. I believe the DEV environment cannot be a free-for-all and should be treated with almost the same respect as production because if the DEV environment is stable (and running the latest code) you can often reduce the setup time for your integration testing sandbox by drawing on the DEV services instead of running them locally.

Per-Process-Type

Virtually all processes in the system will probably share the same basic configuration, but certain processes will have specific tasks to do and so they may need to be reconfigured to work around transient problems. One of the reasons for using lots of processes (that share logic via libraries) is exactly to make configuration easier because you can use the process name as a “configuration variable”.

The command line is probably the default mechanism most people think of when you want to control the behaviour of a process, but I find it’s useful to distinguish between task specific parameters, which you’ll likely always be providing, and background parameters that remain largely static. This means that when you use the “--help” switch you are not inundated with pages of options. For example a process that always needs an input file will take that on the command line, as it might an optional output folder; but the database that provides all the background data will be defaulted using say an .ini file.
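As a rough illustration of that split, here is a minimal Python sketch. The process, the option names, the [Database] section and the ConnectionString key are all hypothetical and only serve to show the idea: the input file and output folder arrive on the command line, while the database comes from an .ini file.

```python
import argparse
import configparser

# Hypothetical background settings; in practice this text would come from an .ini file.
EXAMPLE_INI = """
[Database]
ConnectionString = Server=DEVDB;Database=Reference
"""

def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical feed importer")
    # Task specific parameters: the ones you supply on (almost) every run.
    parser.add_argument("input_file", help="file to process")
    parser.add_argument("--output-folder", default=".", help="where results are written")
    return parser

def load_background(ini_text):
    # Background parameters: largely static, so they stay out of --help.
    ini = configparser.ConfigParser()
    ini.read_string(ini_text)
    return {"database": ini.get("Database", "ConnectionString", fallback="localhost")}

def main(argv, ini_text=EXAMPLE_INI):
    args = build_parser().parse_args(argv)
    settings = {"input_file": args.input_file, "output_folder": args.output_folder}
    settings.update(load_background(ini_text))
    return settings
```

Calling main(["trades.csv"]) picks up the database from the .ini text without it ever appearing as a command line option.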

Per-User

My final category is down to the user (or service account) under which the process runs. I’m not talking about client-side behaviour which could well be entirely dynamic, but server-side where you often run all your services under one or more special accounts. There is often an element of crossover here with the environment as there may be separate DEV, TEST and PROD service accounts to help with isolation. Support is another scenario where the user account can come into play as I may want to enable/disable certain features to help avoid tainting the environment I’m inspecting, such as using a different logging configuration.

Getting permissions granted is one of those tasks that often gets forgotten until the last minute (unless DEV is treated like PROD). Before you know it you switch from DEV (where everyone has way too many rights) to UAT and suddenly find things don’t work. A number of times in the past I’ve worked on systems where a developer’s account has been temporarily used to run a process in DEV or UAT to keep things moving whilst the underlying change requests bounce around the organisation. Naturally security is taken pretty seriously and so permissions changes always seem to need three times as many signatures as other requests.

Hierarchical Configuration

Although most configuration differences I’ve encountered tend to fall into one specific category per setting, there have been some occasions where I’ve needed to override the same setting based on two categories, say, environment and machine (or user and process). However, because the hardware and software are themselves partitioned (e.g. environment/user), it has usually amounted to the same thing as overriding on just the latter (e.g. machine/process).

What this has all naturally led to is a hierarchical configuration mechanism, something like what .Net provides, but where <machine> does not mean all software on that machine, just my system. It may also take in multiple configuration providers, such as a database, .ini files, registry[*] etc. My current system only uses .config style[$] files at present and on start-up each process will go looking for them in the assembly folder in the following order:-

System.Global.config

System.<environment>.config

System.<machine>.config

System.<process>.config

System.<user>.config

Yes, this means that every process will hit the file-system looking for up to 5 files, but in the grand scheme of things the hit is minimal. In the past I have also allowed config settings and the config filename to be overridden on the command line by using a global command line handler that processes the common settings. This has been invaluable when you want to run the same process side-by-side during support or debugging and you need slightly different configurations, such as forcing them to write to different output folders.
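To make the lookup concrete, here is a minimal Python sketch of the idea rather than the actual implementation. It assumes that files later in the list override earlier ones, that missing files are simply skipped, and that each file holds <add key="..." value="..."/> entries in the <appSettings> style; the function names are made up for the example.

```python
import os
import xml.etree.ElementTree as ET

def candidate_files(folder, environment, machine, process, user):
    """List the config files in search order, least specific first."""
    names = ["Global", environment, machine, process, user]
    return [os.path.join(folder, f"System.{name}.config") for name in names]

def read_app_settings(path):
    """Read <add key="..." value="..."/> entries (the root element is not checked)."""
    root = ET.parse(path).getroot()
    return {entry.get("key"): entry.get("value") for entry in root.iter("add")}

def load_settings(paths, overrides=None):
    """Merge every file that exists; assumes later files win over earlier ones."""
    settings = {}
    for path in paths:
        if os.path.isfile(path):
            settings.update(read_app_settings(path))
    # Command line overrides, if any, are applied last so they beat every file.
    settings.update(overrides or {})
    return settings
```

Because the merge is per key, a machine-level file only needs to contain the one or two settings that differ on that machine.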

Use Sensible Defaults

It might appear from this post that I’m configuration mad. On the contrary, I like the ability to override settings when it’s appropriate, but I don’t want to be forced to provide settings that have an obvious default. I don’t like seeing masses of configuration entries just because someone may need to use them one day – that’s what source code and documentation are for.

I once worked on a system where all configuration settings were explicit. This was intentional according to the lead developer because you then knew what settings were being used without having to rummage around source code or find some (probably out-of-date) documentation. I understand this desire but it made testing so much harder as there was a single massive configuration object to bootstrap before any testable code ran. I shouldn’t need to provide a valid setting for some obscure business rule when I’m trying to test changes to the messaging layer – it just obscures the test.
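A small Python sketch of the alternative, with hypothetical setting names and default values: the obvious defaults live in code, so a test only supplies the settings it actually cares about.

```python
# Hypothetical settings with obvious defaults, kept as strings like an .ini file would.
DEFAULTS = {
    "MessageTimeoutSecs": "30",
    "RetryCount": "3",
}

class Settings:
    """Key/value settings where anything not overridden falls back to a default."""

    def __init__(self, overrides=None):
        self._values = {**DEFAULTS, **(overrides or {})}

    def get(self, key):
        return self._values[key]

    def get_int(self, key):
        return int(self._values[key])

# A messaging-layer test overrides one value and ignores everything else.
test_settings = Settings({"MessageTimeoutSecs": "1"})
```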

Configuration Formats

I’m a big fan of simple string key/value pairs for the configuration format – the old fashioned Windows .ini file still does it for me. Yes, XML may be more flexible but it’s also far more verbose. Also, once you get into hierarchical configurations (such as .Net .config files), its behaviour becomes unintuitive as you have to question whether merging happens at the section level or at the level of individual entries within each section. These little things just make integration/systems testing harder.

I mentioned configuration variables earlier and they make a big difference during testing. You could specify, say, all your input folders individually, but when they are related that’s a real pain when it comes to environmental changes, e.g.

[Feeds]
SystemX=\\Server\PROD\Imports\SystemX
SystemY=\\Server\PROD\Imports\SystemY

One option is to generate your configuration from some sort of template, but I find that a little too invasive. It’s pretty easy to emulate the environment variable syntax so you only have one setting to change:-

[Variables]
SharedData=\\Server\PROD
FeedsRoot=%SharedData%\Imports

[Feeds]
SystemX=%FeedsRoot%\SystemX
SystemY=%FeedsRoot%\SystemY

You can even chain onto the environment variables collection so that you can use %TEMP% and %ProgramFiles% when necessary.
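The expansion itself needs very little code. The following Python sketch is one way it might be done, not the mechanism described above: the expand function is hypothetical, names in the variables collection are looked up first, the process environment is the fallback, and unknown names are left untouched.

```python
import os
import re

_VARIABLE = re.compile(r"%([^%]+)%")

def expand(value, variables, environ=None, _seen=()):
    """Replace %Name% references, resolving nested references recursively."""
    environ = os.environ if environ is None else environ

    def replace(match):
        name = match.group(1)
        if name in _seen:
            raise ValueError(f"circular reference to %{name}%")
        if name in variables:
            # A variable's value may itself refer to other variables.
            return expand(variables[name], variables, environ, _seen + (name,))
        if name in environ:
            # Chain onto the environment for things like %TEMP%.
            return environ[name]
        return match.group(0)  # unknown name: leave the text as it was

    return _VARIABLE.sub(replace, value)

variables = {"SharedData": r"\\Server\PROD", "FeedsRoot": r"%SharedData%\Imports"}
system_x = expand(r"%FeedsRoot%\SystemX", variables, environ={})
```

Switching every feed to another environment is then a matter of changing SharedData alone. Note that lookups in a plain dictionary are case-sensitive, unlike Windows environment variable names.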

[#] Quite how anyone was ever expected to develop solid, reliable, multi-threaded services on a machine with only a single or dual hyper-threaded CPU is quite beyond me. I remember 10 years ago when we had a single dual-CPU box in the corner of the room which was used “for multi-threaded testing”. Things are better now, but sadly not by that much.

[*] Environment variables are great for controlling local processes but are unsuitable when it comes to Windows services because a machine reboot is required when they change. This is because the environment variables that a service process receives are inherited from the SCM (Service Control Manager), so you’d need to restart the SCM as well as the service (it doesn’t notice changes like the Explorer shell does). So, in this scenario I would favour using the Registry instead as you can get away with just bouncing the service.

[$] Rather than spend time up-front learning all about the .Net ConfigurationManager I just created a simple mechanism that happened to use files with a .config extension and that also happened to use the same XML format as for the <appSettings> section. The intention was always to switch over to the real .Net ConfigurationManager, but we haven’t needed to yet – even our common client-side WCF settings use our hierarchical mechanism.