Given that software development has so many ties into mathematics, I’m always surprised by how seldom it seems to show up in my day to day life. However, today was a little bit different.

Finding success through failure

One of the items that I dealt with, where some math was useful, was in some test failures that we’ve been seeing at work. The software is run through its automated tests in a number of variations. There are two axes of variation: the style of installation and the operating system it is running on. The number of total variations that get run is then simply the product of the sizes of the two axes. In this case there are 4 and 14, which gives a total of 56 variations.

Each time the entire suite of tests ran, there would be some seemingly random number of them that would fail for a reason that we couldn’t, and still can’t, explain. It was completely random from everything that we could see. However, the number that failed seemed to be somewhat consistent. It always seemed to end up with around 11 of the variations failing. That meant that 11/56 = 0.19642 or ~20% of the variations failed every time.

Now, it is really tempting to take those failed variations and run just them, and none of the others, again. For some reason it feels like you are learning something about the failures by doing that (there are cases in which you might, but let’s not get into that right now). However, the key to this is that which ones fail has been random in every occurrence that we have seen it. If 20% of the original set failed, then the expectation is that 20% of the rerun set would fail.

In my real life situation, those 11 failed variations were rerun. What happened? Well, what should I expect to happen? If this really is a random occurrence, then 20% of the 11 should fail again. 11 * 0.20 = 2.2. So I should expect about 2 of the variations to fail again. What actually happened? One of them failed. In such a small sample set, that seems to fall pretty well inside the realm of probability.

Let’s step back for a bit. In this kind of a situation, how many re-runs would I need to do to get a “fully passing” run. One in which, by rerunning the failures, I eventually get to the point where every variation has passed? This comes about by repeated application of the 20%. In the first run 20% fail. In the second run 20% of the 20% fail, which is 4% of the total. On a third run 20% of the 20% of the 20% fail, which is 0.8% of the total. At that point there is a 99.2% chance that every failed variation has passed.

This is exactly what we saw when the failed variations were re-run. On the third run (the second rerun) everything had magically passed! Everything was good! Unicorns and rainbows!

Except not. This was just the inevitable outcome of rerunning the failed variations, when the failures strike randomly. The 20% chance of failure is still there. It hasn’t gone away.

Mathematics destroys my illusions.

Something was a little fishy in there. Let’s look at some of those numbers again. The first run had a 20% failure rate (11/56 = 0.19642857 … not quite 20% but close enough for our purposes). However, a 20% failure rate predicts that the subsequent run would see 2.2 failures. Since a variation can only pass or fail, I interpret that as either two or three failures. But there was only one failure! What happened?

Let me explain a little bit about what the failure is that is showing up in these tests. Every once in a while the system times out. The client is making a request to the server and the connection appears to just hang, timeout, and then fail. The only hunch that we have is that this is somehow related to load on the system. When the tests are run, there are about 30 of the test variations running at once, in addition to any other test jobs that might be running at the same time. The hypothesis is that all of the running tests are somehow causing things to slow down and, at times, time out. This is bolstered by the fact that when we try to run any one of the tests that shows the failure independent of the entire run of all of the variations we don’t see the errors.

So let’s go back to those numbers. If this really is some sort of load problem on the testing infrastructure, then by rerunning just the smaller number of variations (11), that might have changed the probability of encountering the problem. Instead of it being a 20% probability, which predicts seeing 2 failures, there seems to be a 9% probability. That seems like another data point that points toward this being something related to the load on the test infrastructure.

Recently, puppet got a new way of defining environments. Along with the new environments, came a new way of handling the caching of the environments inside the puppet master. I found out that what puppet is now doing, and was doing in the past, wasn’t well understood by users and so this post is to get rid of some of the mystery.

I’m going to cover what an environment is internal to puppet, what the caching of an environment is, and what it isn’t, and what it used to be. In other post I’ll cover how to evaluate your situation and tune the environment cache and your workflow.

What is an environment?

Seems like such a simple question…but it isn’t. What an environment is in your puppet installation really depends on your workflow. In some workflows the environment is used to represent different stages in a build and deploy pipeline (dev, test, stage, production). In other workflows, I’ve heard of environments used to represent different clusters of servers or data centers. Internal to puppet, the concept of an environment becomes much simpler (but still not very easy). At its absolute simplest, an environment inside puppet is an instance of the Puppet::Node::Environment class. A Puppet::Node::Environment has a name, a set of initial manifests, and a set of modules, each with its own manifests.

In addition to all of these pieces that are directly related to an environment, there are also numerous other pieces that are indirectly related. Custom functions, which are written in ruby, are indirectly associated with the environment. Custom types are in a similar boat. Neither of these are completely under puppet’s control since they are extra ruby code that is loaded into the runtime.

Why does any of this matter? Any time a puppet master needs to construct a catalog of an agent node, it requires having all of the information about an environment on hand. Scanning the filesystem for the relevant directories and files, as well as parsing and validating puppet code, can all take time and put extra load on the server. If you’ve ever investigated a computer that is performing too much IO, then you’ll know how detrimental to overall performance unnecessary disk operations can be.

Improving Performance of Catalog Creation

Given that a puppet environment is needed for creating a catalog, and given that it can be expensive, both in terms of IO operations and in terms of CPU time, what can be done? The answer is to reduce unnecessary disk and cpu operations. But now we are faced with another question. Which operations are unnecessary?

A somewhat simplistic algorithm for what puppet does when creating a catalog is (ignoring lots of variations and other complications):

Every one of these operations are necessary, but the results of some change more often than others. Listing, reading, and parsing the manifest files only needs to happen when the file contents change, or when a file is added or removed. All of the other steps need to happen every time, either because the facts change every time, or because evaluating the program code has side effects which could affect the outcome. For example, custom functions that read and write to databases could have different results on every invocation.

The environment caching in puppet takes the approach of preserving the parsed manifest files between uses of an environment. This allows the master to skip reading, parsing, and validating the manifests every time it needs to build a catalog. The last piece of the caching puzzle is when puppet will evict an environment from the cache and start again.

Environment Cache Eviction

A cache eviction strategy defines when to remove an entry from a cache. Some of the most common strategies are to have a fixed size cache and then use an LRU (Least Recently Used) algorithm to remove entries when more space is needed in the cache. Puppet uses a much simpler cache eviction strategy: time-to-live. When an environment is placed into the cache the time is recorded. After a configured amount of time, the entry will be removed and the environment’s manifests will need to be parsed again. Right now puppet only decides to evict a specific environment cache entry when that specific environment is requested. If an environment is requested one time and never requested again it will remain in the cache indefinitely. This doesn’t lead to memory leaks in most puppet master configurations, however, because passenger, and other RACK servers, will terminate a worker process after a certain number of requests. If this wasn’t the case, then over time many puppet master setups would have an ever growing environment cache, since it is common to use short-lived, development or testing environments. These environments would never be freed from memory and the master process would continually grow in size (this is yet another reason why the WEBrick puppet master should not be used in a production scenario).

Right now the master only supports three eviction strategies:

Always (environment_timeout = 0) – any time the environment is requested the cached information is evicted and recreated.

Timeout (environment_timeout = <duration>) – if the cached information is older that the configured timeout it is evicted and recreated, otherwise the cached information is used as-is.

Never (environment_timeout = unlimited) – the cached information will never be evicted.

Each individual environment can have its own eviction strategy configured and so it can be tuned to how that particular environment will be used.

How did this used to work?

Pre-directory environments, puppet used a slightly different mechanism for refreshing the parsed manifests. Whenever it encountered a new environment it would create a new Puppet::Node::Environment instance. This instance would never be thrown away. Instead the Puppet::Node::Environment would track what manifest files it had parsed and keep track of the mtime of each file as well as the last time the mtime was checked. Whenever catalog was requested it checked if any mtimes were last checked more than filetimeout (a setting that can be specified in puppet.conf) seconds ago. For any of the recorded mtimes that had expired it would stat the corresponding file. If any of the new mtimes are different from the old mtimes, it would throw away all of the parsed manifests and begin parsing them all again.

That system had led to several problems. First, it is much harder to explain and understand what will happen when (not all of the files will timeout at the same time and so changing one file might cause an immediate reparse, while changing another won’t). Secondly, it relied heavily on stat calls, which we had found were often a bottleneck and something we want to avoid as much as possible in production systems. Third, the internal design relied heavily on tracking and changing a lot of state. It was simply harder for us to understand the code and ensure that it was correct.

The new environment caching system is a lot easier for us to work with. There have been some bugs and difficulties as we have tried to refactor the existing code to use the new system, but in the end it will be much more controlled and understandable to everyone.

In a follow up post I’m going to cover how to evaluate the puppet master’s workload for tuning the environment cache.

Every team that I’ve worked with has eventually come to ask the question: when is it done? And they don’t just mean “done”, they mean, when is it “done-done“? Every time, for whatever reason, this seemingly innocuous question becomes a sticking point of a lot of conversations. Sometimes everyone can agree on what it takes, but there ends up being concerns about the feasibility of the definition in practice. Other times there is some dissent about the points of done related to what is actually in the team’s control. All of these are perfectly valid concerns. Some of those concerns point to larger, organization problems that are being faced and some point to fear of tackling what at first looks like a daunting mountain of change to reach “done-done” of knowing when you are “done-done”.

I’m not going to try to solve all of those issues that come up during those conversations, but I’m going to do my part to try to coax them on a little bit and put out what I’ve started using as my baseline definition of “done-done”.

You are done when:

The code has been merged into all relevant branches.

All automated tests pass on all affected branches.

End user documentation about the change has been written.

The feature or bug fix has been demoed to another member of the team, and that person has had the chance to explore the bug fix/feature and find any potential problems.

Packages for all new releases of the product have been made.

It has been confirmed by the customer to be what they wanted.

In essence it comes down to: the change is made (points 1 and 2), it has been communicated to others (points 3 and 5), it has been shown to be working (point 4), and it is what was actually wanted (point 6).

If you want to take it to the next level you can add another point to being “done-done”:

Are you a rational, logical person? When you are given the data about a situation, do you change your mind? I would be willing to bet that you like to think about yourself in that light. I know that I do.

However, I know that I still hold beliefs in spite of some data. If you give me some data about a situation, I’m not very likely to believe it. I might accept it, I might even agree with it, but I won’t likely change my thought process that you, as the presenter of that data, think I should be changing. Why is that?

While sitting around, drinking some tea, and eating some lovely, lovely bread with Jeffrey Fredrick, we had covered topics ranging from ease at work, the results of the Strengths Finder test, beliefs that people hold, estimation and tracking in software development, and the problems people have understanding estimates, when Jeff blurted out (paraphrasing), “I’d love to put together something about naive pitfalls. For instance in selling, a naive way of going about it is believing that people want to hear about your product. They don’t, they want to hear how you’ll solve their problem.”

This brought out a whole list of naive fallacies that we’ve both found that people hold.

Fallacy 1: Scheduling

Belief: The way to get more done is to schedule more stuff.

Reality: The way to get more done is to work on a single thing until it is finished.

Fallacy 2: Persuasion

Belief: The way to persuade people is to give them data.

Reality: The way to persuade people is to agree that they have a certain understanding of something, and then provide a story that leads from their understanding to your understanding.

Fallacy 3: Feedback

Belief: Feedback is something that is given to someone.

Reality: Feedback is something that the person has to seek out.

Fallacy 4: Experience

Belief: The longer you do your job the better you get at it.

Reality: The more you deliberately practice your job, the better you get at your job.

The commonality that I see in all of these is the belief that the data alone will change what belief a person holds. In all cases the person who holds the belief needs to be the one taking themselves down a path that leads them from one belief about the world to another belief.

Take the fallacy of experience as an example. Purely gathering experience is not what produces change. Instead, the change occurs when the person experiencing the situation reflects on what has been occurring, inspects their own belief structures, and finds ways to change themselves to produce different outcomes. The data pointed to the existence something around them, but the person had to find a path to changing their surroundings.

It may be true that I can’t convince you that you are wrong (oh, the irony of linking to an article about confirmation bias here). I can, however, provide you with a story that will maybe lead you down the path of understanding why I think you are wrong.

CITCON is a conference that I’ve been attending since 2006 in London. I showed up because it was “near” where I was living in Germany and sounded like it might be interesting. Boy was it ever interesting! The only European one that I missed was because I had a final exam on the same day as the Brussels conference (damn you, formal methods!).

Well, now that I don’t live in Europe I probably won’t be making it to those CITCONs very much anymore, but there are still the North America ones. And what do you know? The next North America CITCON is right in my back yard here in Portland.

Having just moved to the city I didn’t know much about much of anything. Knowing that Portland has a lot of beer and breweries doesn’t do much when you are trying to find a space that fits a 100 people in an open-spaces style conference, but after some searching (and drinking a few McMenamins beers) we found a place. Next was to get the word out.

AgilePDX people there gave me a lot of advice. The RubyPDX group also get behind it. Various meetups that I crashed put up with my pitches to get people to show up, as well. I used my now tenuous connections to some groups in Seattle to get the word out up there a little.

In doing this I meet a lot of great people and groups that I’m planning on keeping in touch with. One is AgilePDX here in Portland who is going to be hosting another conference that meshes very well with CITCON: Agile Testing Open NW. If you can’t make CITCON, or if you find CITCON great, then I think you’ll probably want to go to ATONW, too.

Another group that I got in contact with, for those of you up in Seattle, is the The Center of the Agile Universe (aka The Fremont Agile Meetup). They are also putting on a very interesting sounding conference, that once again, if you like CITCON or miss it, I think you should attend MAQCon.

I’m really looking forward to this CITCON and hope to meet yet more people in the area.

Looking back on the last five years of my life spent living abroad it is easy for me to dwell on the things that didn’t work out well. That is just the way I seem to be. I’d like to try to spend a little time dwelling a bit on the things that really worked out well, though, because once I start listing them out it will become apparent to me just how many there are. And in the end the events that were good really outweigh the problems that came up.

I learned a lot. Really. All in all, this time has been the most jam-packed learning of my life. In no particular order, I learned:

to speak German. I could speak it ok before I moved there, but living in the country took it to a different level.

various forms of formal analysis. Like many things learned at university you won’t be applying it every day, but it completely changes the way you see the world.

how to give a five minute talk on just about any subject with just a few minutes of preparation.

how relational database systems work. I missed out on this in my undergraduate degree, and boy, was it an eye opener.

how to be (a little bit) comfortable not speaking the language. From being out of my depth with German to taking trips in countries where I don’t speak a lick of the language to not understanding someone’s accent, I’ve learned that it isn’t the end of the world and it’ll work out in the end.

to think about my communication style and how it impacts those around me.

that it is really hard to overdo it on refactoring and unit testing.

the fragility of a team.

that constant improvement is hard but pays off in spades.

the cooperative learning/teaching potential of a group of like minded people meeting every Tuesday for a beer and a chat.

that being a stranger in a strange land can feel like you have no connection to those around you.

that different cats, even sisters who have spent their entire lives together, will end up with completely different personalities.

that ruthlessly tracking down the real cause of an error can take you to very uncomfortable places that you probably wouldn’t expect to find yourself at the outset.

how many amazing people there are in London to learn from.

to understand people with different accents (this was a point of hilarity many times).

to concentrate on the journey and the destination will work itself out.

As well as learning things I also got to know a lot of people. Each one of them ended up being a great friend and helped my wife and I feel like we belonged where ever we were.

I got to do a bit of traveling. Ate some tasty pies in the Lake District. Found hops growing wild on the bank of the Rhein river. Crossed the channel on an overnight ferry (luxury!). Cycled along Loch Ness on a beautifully sunny day. Climbed a rock face in the French Alps (and then ran away when a storm came in and a rock fell a short distance away). Visited Neuschwanstein when it was shrouded in mist which revealed the castle now and then in its fairy tale splendor. Discovered that Marrakech is an amazing oasis of bustling activity on the edge of the desert.

When I left the United States I was a cocky, young programmer who felt that he had the solution to most, if not all, of the problems in software development. Now that I’ve returned to the United States I think that I’m a somewhat less cocky, not-quite-so-young programmer who feels that he has something of value to add, but has learned that he doesn’t know a lot about a lot of things. For helping me start that transformation I am eternally grateful and indebted to all of my Mitstudenten and Professoren from my degree program (every time I walk into an Apple store I think, “my god, I studied with the guy that wrote one of the programs running on all of these iPads!”). Then for giving me the feedback, space, and incentives to continue down this path there is everyone that I met at XTC and everyone I had the privilege and pleasure to work and learn with at TIM Group. If you ever find yourself in London try to make time for some beers with the XTC crowd and drop the guys at TIM Group a line and see if you can join them for a day, you won’t be disappointed.

The other day at work I was implementing a new feature and found myself in a nice little, self-contained bit of code. I was adding the ability to parse a pair of query parameters from an HTTP request into an optional Period (a class used to represent a period of time in the application). After TDDing my functionality into place I reached a fully functional version but cringed every time I looked at it. I normally try to refactor during the TDDing but I just couldn’t see what to do until I had a bit more meat to work with. Once it was all in place I had enough to start chopping away at it.

The code you see below is pretty much the code as it is in the application. I’ve changed a few of the types to remove references to internal structures, but I haven’t changed the logic. It makes use of a modified version of Nat Pryce’s Maybe class and uses Guava, which shows up in some of the interfaces that get implemented.

The first few lines seemed alright. They fairly clearly state that the two parameters needed to be considered optional dates and followed up with some lines declaring that if we have one but not the other then there is an error. The last statement was a horrible monstrosity, though. That said, my first victim was something that can’t even be easily seen in the above code: if neither date is there then the function should return nothing(). After that I took a swipe at the little bit of duplication that I felt was showing up in the two if statements about one or the other date being missing:

At this point I was pretty happy with the first few lines of the function. But that error checking was now an even worse eyesore! Working through the logic I realized that the way it was now structured I should never reach the final, eye-bleed inducing final return statement unless both dates are specified. That bit of information means that all of that obscure bind() and transform() logic is completely unnecessary. The second breakthrough was in noticing that if either of fromDate or toDate end up being empty after the check that neither is known then it is an error.

With those two new realizations I could get rid of bind() and transform() and replace it with just getting the values out of the Maybes or throwing the appropriate exception.

Much better. I find that the inlined fetching of the dates from the Maybes doesn’t lead to the clearest expression of what is going on. So let’s pull out the dates and inline the exception Suppliers. With that the dates can even be given names that give a better clue as to what is special about them at that point in the code.

Finally I decided to pull out the creation and checking of the Period itself. There were two reasons for this: I don’t like having a possibly “invalid” object floating around in a larger function, and there was another bit of code just above this function that was doing the same thing.

This is the point at which I stopped. Next time I go through this code I hope to find a few more things to clean up, though; this point is in no way the final destination. There are a few more possibilities that could be done here that I can see right now. The most obvious one to me was that the combination of parameter value and parameter name is just screaming out for a class. That would give a place to start moving a lot of this validation and transformation functionality that is right now just living in static methods (although the static methods are working pretty well right now).

I hope that this is a good illustration of refactoring in action on a real codebase. Something that I want everyone to get out of this is that not every step of the refactoring took me to a nice next step. This exercise was full of false starts and backslides in the readability of the code. I haven’t shown every dead end that I went down and had to undo out of, but rest assured it happened pretty often. What is important is that I kept trying to figure out how to better express what was happening. As I tried to make it clearer I ended up with a better understanding myself and could try to reflect that in the code.