Focus: Robustness
Description: Set robustness goals, and consider what should happen, and what should not happen, in the face of likely (and less likely) errors.

In the previous post we set up robustness goals, and implemented the first half of the order import functionality, performing downloads of orders via FTP. In this post we are going to complete the example by implementing the second half, which imports the orders into the warehouse system.

What we need to do

An outline of the functionality needed for the import process is as follows:

The download process leaves the order files it has downloaded in a folder where the import process takes over ownership of the files.

We are going to import the orders oldest-first.

If there’s an error processing an order, we’ll signal an error, move the file to the error folder and continue with the next order.

We are going to archive each order file in an archive folder after successful import.
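The outline above maps naturally onto a single processing loop. Here is a minimal sketch in Python (the actual system is .NET; the folder names and the `import_order` and `log_error` callbacks are assumptions for illustration):

```python
import shutil
from pathlib import Path

INCOMING = Path("orders/incoming")   # download process leaves files here
ARCHIVE = Path("orders/archive")     # successfully imported orders
ERRORS = Path("orders/error")        # orders that failed to import

def get_next_order_file():
    """Pick the oldest order file, or None if the folder is empty."""
    files = list(INCOMING.glob("*.order"))
    if not files:
        return None
    return min(files, key=lambda f: f.stat().st_mtime)

def run_import(import_order, log_error):
    """Import orders oldest-first; on error, move the file aside and continue."""
    while (order_file := get_next_order_file()) is not None:
        try:
            import_order(order_file)    # hand the order to the warehouse system
            shutil.move(str(order_file), ARCHIVE / order_file.name)
        except Exception as exc:
            log_error(order_file, exc)  # signal the error
            shutil.move(str(order_file), ERRORS / order_file.name)
```

Note that a failed order is moved out of the incoming folder before the loop continues, so one bad order cannot block the ones behind it.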

Avoiding starvation and false prioritization

Why do we care about the order in which we do the import, and what does that have to do with robustness, you might ask. Aren’t we simply going to import them all, so why care about ordering? The answer is that since we said an import takes a good 30 seconds per order, we may create a so-called “starvation” situation if newer order files get processed before older ones, such that some orders might be delayed “indefinitely”. When orders are few and processing time is abundant, any strategy will do. But when there are enough orders that it will take a significant amount of time to import them all, we will have problems without an ordering strategy.

For example, suppose we’re importing order files in “unspecified” order (which in the case of .NET Directory.GetFiles on NTFS would be alphabetical order), and the naming scheme for the order files happens to be such that, say, the name starts with a two-digit area code. In this case we will inadvertently cause a false prioritization of lower area codes over higher area codes.

To understand what ill effects this might have, let’s say there are 120 new order files for lower area codes, and one older file for a higher area code. At 30 seconds per order, this will delay the order from the higher area code by around an hour, even though it was “first in line” before the other orders arrived! Since we’ve made promises to the customers regarding delivery times, the longer it takes for an order to be imported, the less time warehouse workers will have to pick the ordered goods, and we might not be able to fulfill the delivery in time.
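The false prioritization is easy to see with a toy example. In this Python sketch (the file names and timestamps are invented to match the scenario above), a higher-area-code order that arrived first ends up last when files are taken alphabetically, but first when taken oldest-first:

```python
# Hypothetical names: two-digit area code prefix, paired with an
# arrival time in seconds. The area-code-42 order arrived long before
# the two area-code-07 orders.
orders = [("07-1001.order", 500), ("07-1002.order", 510), ("42-0001.order", 10)]

# Alphabetical order (what an "unspecified" directory listing may give us):
alphabetical = sorted(name for name, _ in orders)

# Oldest-first order (sort by arrival time instead of by name):
oldest_first = [name for name, t in sorted(orders, key=lambda o: o[1])]

# Alphabetically, the oldest order (area code 42) is pushed to the back;
# oldest-first picks it up immediately, as it was "first in line".
```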

Note that this also applies to the downloading process, as it wouldn’t do much good to implement oldest-first import if downloading also falsely prioritizes some orders. However, as we’re assuming that the downloading process is quite fast (orders of magnitude faster than importing), the order it uses is of little concern to us.

Processing errors

The kind of errors we can expect from order processing can roughly be placed in one of three categories:

Errors from the file system (reading, archiving)

Errors due to order contents (format issues, content validation issues)

Errors from the warehouse system API we’re using (validation issues)
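Keeping these three categories apart in code helps decide how each failure should be handled (and logged). A small Python sketch of such a classification; the exception types `OrderFormatError` and `WarehouseApiError` are assumed names, not part of any real API:

```python
class OrderFormatError(Exception):
    """Assumed: raised when order contents fail parsing or validation."""

class WarehouseApiError(Exception):
    """Assumed: raised when the warehouse system API rejects the order."""

def classify_error(exc):
    """Map an exception to one of the three rough error categories."""
    if isinstance(exc, OSError):
        return "file-system"       # reading, archiving
    if isinstance(exc, OrderFormatError):
        return "order-contents"    # format / content validation issues
    if isinstance(exc, WarehouseApiError):
        return "warehouse-api"     # validation issues from the API
    return "unexpected"            # anything we did not plan for
```

The "unexpected" bucket matters: errors that fit none of the anticipated categories are exactly the ones that should bubble up to the top level and trigger a pause, as discussed later.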

Sample code

The presented code is kept as short as possible - possibly even too short - but I find that too much code takes focus from what I actually want to discuss, which isn’t the exact implementation, but the need to achieve robustness by design, not by accident.

Care has to be taken in the implementation of GetNextOrderFile() not to commit to a huge number of files at once, as this can cause starvation: it translates into a huge amount of processing time before the next decision on which files to import. It also should not re-read the directory to find the oldest file on each call, since then (in the case of a single very old, unprocessable, unmovable file) we would try to import the same order over and over again.
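One way to balance these two concerns is to commit to a small batch at a time: re-read the directory only when the batch is exhausted, so newly arrived older files are reconsidered regularly, while a single stuck file cannot be retried in a tight loop within the same batch. A Python sketch under those assumptions (the batch size of 10 is arbitrary):

```python
from pathlib import Path

BATCH_SIZE = 10  # assumption: commit to only a small number of files at a time

class OrderFilePicker:
    """Yield order files oldest-first, one small batch at a time."""

    def __init__(self, folder):
        self.folder = Path(folder)
        self.batch = []

    def get_next_order_file(self):
        """Return the next file in the current batch, refilling it from the
        directory (oldest-first) only when the batch is empty."""
        if not self.batch:
            files = sorted(self.folder.glob("*.order"),
                           key=lambda f: f.stat().st_mtime)
            self.batch = files[:BATCH_SIZE]
        return self.batch.pop(0) if self.batch else None
```

With BATCH_SIZE files per directory read, the worst-case wait before an old file is noticed is bounded, and an unmovable file is retried at most once per batch rather than on every call.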

But why pause?

I include a pause here and there in the sample code. Why is that a good idea? Well, if you’ve ever seen event logs filled with tens of messages a second, all stating the same unexpected error (“Access Denied”, say), you know that when some error bubbles up to the top level, it can be a good thing to allow processing to take a short break, if nothing else to avoid overloading the logs, and possibly other systems.

If your code normally processes two orders a minute, an unforeseen error situation can force it to do one lap around the processing loop much, much quicker than you anticipated, and the code may then start to hammer some other service with requests, because you expected normal processing to provide a natural pause between such calls.

Just a few days ago, I saw a service almost bring a server to its knees because of an “Access Denied” error, which made the service unable to remove a file, so it tried processing the file over and over again. The processing needed for a single file took a significant amount of resources, and doing this over and over - without the slightest pause - really stressed the server. Had there been a top-level pause of even 10 seconds, the service would not have overloaded the server due to this error, and processing would be only very slightly delayed, as it really only pauses when it has no actual processing to do.
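The key property is that the pause only triggers on an idle or failed lap, never on a productive one. A minimal Python sketch of such a top-level loop (the 10-second pause and the `process_one` callback, which returns whether it did useful work, are assumptions; a real loop would also log the caught error):

```python
import time

PAUSE_SECONDS = 10  # assumption: a short top-level breather

def processing_loop(process_one, laps, sleep=time.sleep):
    """Run `laps` iterations; pause only when a lap fails or finds no work.

    Normal processing paces itself, so the pause barely delays real work:
    it only kicks in when an unexpected error bubbles up to the top level
    or the queue is empty, preventing the tight retry loops that flood
    logs and hammer other services.
    """
    for _ in range(laps):
        try:
            did_work = process_one()
        except Exception:
            sleep(PAUSE_SECONDS)   # error lap: take a breather (and log it)
            continue
        if not did_work:
            sleep(PAUSE_SECONDS)   # idle lap: nothing to process
```

Injecting `sleep` as a parameter also makes the pacing behavior trivially testable, which is worth doing for code whose whole point is how it behaves under failure.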

Improvements

An improvement on oldest-first would be to import in time-to-requested-delivery order, but this would be more complicated to get right, as orders then wouldn’t arrive in anything near the order we want to import them in, and we would also have to read the order files to get at this information. Oldest-first is good enough at this point, and certainly more robust than importing in “whatever” order. However, the implementation above does not care about the actual ordering method used, as that is abstracted away in the implementation of the GetNextOrderFile method.

There are likely lots of other improvements to the code, as it’s really just pseudo code.