Why Config?

26 Mar 2012

When I first started playing with scala in 2008, I was dismayed by the state of server configuration in the java world. A lot of java servers were still using property files, or worse, XML. XML is meant to be easily parsed by computers, but is really hard for humans to read and edit, and tends to hide useful information in baroque syntax and line noise. The python world was still clinging to Windows-era “INI” files, and the ruby world had invented something called YAML, with its own odd syntax.

[Alex Feinberg pointed out that the use of the term “config” can be overly general. In this post, I’m talking specifically about configuration used to bootstrap a cluster of machines all running the same server code. Shared configuration required by multiple server clusters is a different problem, and obviously not well-served by any solution that only works on the JVM.]

We had gone through many iterations of config file formats at my previous job, as we moved from perl to C++ to java, but it was a very private company, terrified of open source, so we shared none of what we learned. I thought it was time to spread some best practices around, so I wrote “configgy” lazily over a couple of months as I learned scala.

Configgy

The core ideas behind “configgy” were:

A config file should be a text file, primarily readable by humans. It should be unambiguous and have minimal syntax.

A server’s configuration should really just be a set of (string) keys and values. The values can be bools, ints, strings, or lists of strings.

You should be able to take blocks of these key/value sets and nest them, so subsystems can have their own isolated configuration.

The API should be similarly minimal, like a typesafe hashtable, and should allow subsystems to “subscribe” to configuration values and get notified if they’ve changed.

The end result was pretty successful, and we used it at Twitter for several years. An example chunk of a config file might look like this:
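A hypothetical fragment along those lines, with made-up values and assuming configgy's brace-delimited block syntax (only the key names "port", "timeout_msec", and "roll" come from the discussion in this post):

```
port = 22133
timeout_msec = 250

log {
  filename = "/var/log/server.log"
  roll = "daily"
}
```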

Unfortunately, I had gone in the wrong direction, and it took a while for the mounting evidence (and my coworkers) to convince me.

What’s wrong

Some of the problems with configgy show up in the config file example I pasted above:

There’s no schema. “port” should be an int, but there’s no place to declare that. There’s no definition for what should be in the config file at all. What are the keys? What do they do? You have to document it separately in a text file, if you’re really ambitious.

The available types aren’t sufficient. Durations are really common in server configuration because they specify timeouts, and there’s no real support for them. You have to drop sly hints in the field names (like “msec” for milliseconds) and hope people are paying close attention.

Extending the available types will never cover all cases. The “roll” field above can only have a few possible values, but there’s no simple way to define a new enumerated type like that.

Other problems only show up in daily use:

validation: How do you validate that the config file won’t cause a server crash hours after it starts? There’s nothing forcing “timeout_msec” to be an int, so it may throw an exception minutes later, when the code first tries to call .toInt on it.

defaults: What is the default timeout? Is there one? Configgy supported providing a default value in the API, but how do you know what that is when you’re editing the config file — especially if you didn’t write the original code?
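Both of these problems can be sketched with a plain string map. This is not configgy's API; the `settings` map and the `getInt` helper are hypothetical stand-ins for any string-keyed config store.

```scala
// Hypothetical stand-in for a parsed config file: every value is a string.
val settings: Map[String, String] = Map(
  "port" -> "22133",
  "timeout_msec" -> "250ms" // a typo that loads without complaint
)

// Typed lookup with a default that lives only in the calling code.
def getInt(key: String, default: Int): Int =
  settings.get(key).map(_.toInt).getOrElse(default)

val port = getInt("port", 9999)    // parses fine
val retries = getInt("retries", 3) // key is absent; whoever edits the file never sees the 3

// getInt("timeout_msec", 500) throws NumberFormatException,
// but only when the code path that needs the timeout first runs.
```

Nothing fails at load time, and the defaults are invisible from the config file.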

One of the biggest faults should get its own section, because I have a lot to say about it.

Reloading config files

Configgy had a lot of code to support reloading config files on the fly, allowing a server to “subscribe” to a key and change its behavior if a config file was reloaded. It seemed really clever at the time, but experience taught me and my coworkers that it’s a really bad idea in practice.

How often do you change a config file on the fly and ask the server to reload it? And more importantly, when? Murphy’s Law tells us the answer: when something is broken, it’s the middle of the night, and it needs to be fixed immediately.

But because we only did this in a crisis, the code was effectively untested. If you aren’t regularly using some part of a server, you can’t trust it enough to depend on it in a crisis. In a crisis, I want only tools that I’ve used before and am confident in. It only takes a couple of incidents where reloading a config file doesn’t actually fix the server’s behavior before your policy becomes: Fix the config file offline, then restart the server.

The ability to reload configuration became just another moving part: something you had to think about, but would never actually use in a crunch.

This could probably be solved by adding automated testing that changes your config file, asks the server to reload, and then re-runs a suite of tests. But it just didn’t seem worth it. As a practical matter, the server needs to start up cleanly after any kind of unclean shutdown (“kill -9” or a fire), and must be tested to do so, so you don’t need any other feature for reloading the config file. Just change the file and kill the server. Now it’s running with the new config!

How to fix it

If you read my post from last year about patterns, you know where this is heading. There’s one obvious way to define a set of named, type-safe fields: write a scala trait. Your config file can then just be a scala file that you compile and evaluate when the server starts.

Your config trait should be a builder that creates a server from config, like this:
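A minimal sketch of such a trait, under a few assumptions: the `Server` class, the field names, and the default values are made up for illustration, and the standard library's `FiniteDuration` stands in for whatever duration type the server code actually uses.

```scala
import scala.concurrent.duration._

// Hypothetical server; only the constructor arguments matter here.
class Server(val port: Int, val timeout: FiniteDuration)

// The trait is the schema: each field has a name, a type, and a default.
// Extending () => Server makes the config a builder for the server.
trait ServerConfig extends (() => Server) {
  var port: Int = 9999
  var timeout: FiniteDuration = 1.second

  def apply(): Server = new Server(port, timeout)
}
```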

The apply method assembles a Server from the configuration. After that, your config file can be:

new ServerConfig {
  port = 12345
  timeout = 250.milliseconds
}

The important lines look just like the configgy version, and are executed as part of the constructor.

Now you have a schema (the config trait), and every field has a type, declared in the trait and enforced by the scala compiler. If you need a specialized type, like an enum, you can make one. I especially like how readable timeouts become. It’s unambiguous that the duration is specified in milliseconds, and you could use seconds if you want.
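As an illustration of the enum case, here is a sketch of a log-rolling policy as a sealed type. `RollPolicy`, `LogConfig`, and their members are hypothetical names, not taken from any real library.

```scala
// Hypothetical enumerated type for a log-rolling policy.
sealed trait RollPolicy
object RollPolicy {
  case object Never  extends RollPolicy
  case object Hourly extends RollPolicy
  case object Daily  extends RollPolicy
}

// Hypothetical logging section of a config trait.
trait LogConfig {
  var filename: String = "/var/log/server.log"
  var roll: RollPolicy = RollPolicy.Never
}

// In a config file, only a declared policy compiles.
val log = new LogConfig {
  roll = RollPolicy.Daily
  // roll = "dialy"   // rejected by the compiler: a String is not a RollPolicy
}
```

A misspelled policy becomes a compile error instead of a surprise at runtime.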

How does it work?

The key is Eval, a component of util-eval that makes it easier to compile and execute scala code from inside the JVM. Scala already exposes this functionality — the scala compiler runs on the JVM, after all, and the REPL needs to do line-by-line compilation — but the API is arcane and marked with a “No serviceable parts inside” label. The Eval class simplifies it to:

The result of evaluating a config file is a new ServerConfig object (or similar), and calling apply on that will return a fully-initialized Server object, so loading the config file and starting the server boils down to:
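A rough sketch of that startup sequence, assuming the util-eval library is on the classpath and that `Eval` can be applied to a file with the expected result type. The `Server` and `ServerConfig` definitions are hypothetical stand-ins, and the config file itself would need to import whatever names it uses.

```scala
import java.io.File
import scala.concurrent.duration._
import com.twitter.util.Eval // from util-eval; not part of the standard library

// Hypothetical server and config trait, standing in for real ones.
class Server(val port: Int, val timeout: FiniteDuration) {
  def start(): Unit = println(s"listening on port $port")
}

trait ServerConfig extends (() => Server) {
  var port: Int = 9999
  var timeout: FiniteDuration = 1.second
  def apply(): Server = new Server(port, timeout)
}

object Main {
  def main(args: Array[String]): Unit = {
    val configFile = new File(args(0))          // path to the config file
    val eval = new Eval                         // compiles scala source at runtime
    val config = eval[ServerConfig](configFile) // the file evaluates to a ServerConfig
    val server = config()                       // apply() builds the Server
    server.start()
  }
}
```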

If you add some exception handling to log errors, you end up with the code inside RuntimeEnvironment in ostrich, which we use to bootstrap server startup from config files in a deployed server.

Sleight of hand

There are two problems I listed above that aren’t solved by this simple solution: validation and default values. So you have to add a little bit of code to finish up.

If a config file can be compiled and executed, then it’s valid. The result of the evaluation is a config object (ServerConfig in this example) that doesn’t have any side effects and can be safely evaluated at build time. So that’s what we do: the last phase of a build executes the server jar with a special "--validate" option that compiles the config files and exits. If that succeeds, the config files are valid and they won’t crash the server in production.
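One way such a validate mode could be wired up is sketched below. The `run` function, its exit codes, and the `load` parameter (a stand-in for compiling and evaluating the config file) are all assumptions for illustration, not the code ostrich uses.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-ins for the real server and its config builder.
class Server {
  var started = false
  def start(): Unit = { started = true }
}
trait ServerConfig extends (() => Server) {
  def apply(): Server = new Server
}

// Returns a process exit code. `load` stands in for compiling and
// evaluating the config file.
def run(args: Seq[String], load: () => ServerConfig): Int =
  try {
    val config = load()                // fails here if the file is invalid
    if (args.contains("--validate")) 0 // config is good; start nothing
    else { config().start(); 0 }
  } catch {
    case NonFatal(e) =>
      Console.err.println(s"invalid config: ${e.getMessage}")
      1
  }
```

A build step can then fail on any non-zero exit code from the validate run.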

In the example above, all the fields had default values, which is not always what you want. For those cases, we defined a basic Config trait. It allows you to mark a field as required with no default value, or optional, or lazily computed.

Implicits handle the conversion from a normal type to a “required” or “optional” type (optional types just use scala’s Option class), so the config file looks the same.
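The idea can be sketched in a few lines. This is a simplified illustration in Scala 2 style, not the actual Config trait: the `Fields` object, the wrapper class names, and the example `Server` are assumptions, and the real trait does more (such as reporting which required fields are missing).

```scala
import scala.language.implicitConversions

object Fields {
  // A field with no default: reading it before it is set is an error.
  sealed trait Required[+A] { def value: A }
  final class Specified[+A](f: => A) extends Required[A] {
    lazy val value: A = f // computed on first read
  }
  case object Unspecified extends Required[Nothing] {
    def value: Nothing =
      throw new IllegalStateException("required field was never set")
  }
}
import Fields._

trait Config[T] extends (() => T) {
  def required[A]: Required[A] = Unspecified
  def optional[A]: Option[A] = None

  // These let a config file assign plain values to wrapped fields.
  implicit def toSpecified[A](value: => A): Required[A] = new Specified(value)
  implicit def toOption[A](value: A): Option[A] = Some(value)
  implicit def fromRequired[A](req: Required[A]): A = req.value
}

// Hypothetical server with one required and one optional setting.
class Server(val port: Int, val name: Option[String])

trait ServerConfig extends Config[Server] {
  var port = required[Int]    // must be set by the config file
  var name = optional[String] // may be left out
  def apply(): Server = new Server(port, name)
}

val config = new ServerConfig {
  port = 12345
  name = "kestrel"
}
```

In this sketch, building a server from a config that never set `port` throws as soon as `apply` reads the field.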

The Config trait fits completely in one file, with fewer than 100 lines of code (according to cloc). That’s an incredible improvement over configgy.

Postscript

This post is a little overdue, but better late than never. :-)

I wrote this because it was important to me to share the knowledge, not because I did all (or even most) of the work. I carefully avoided naming coworkers while writing this post, because it disturbed the flow, but they all deserve callouts:

John Kalucki first spelled out for me why the implementation of default values was bad. Matt Freels and Ed Ceaser implemented the first draft of the Config class and pulled me in to help iterate on it. Nick Kallen opened my eyes to the dangers of depending on a server’s “shutdown” and “reload” behavior.