Focus: Robustness Description: Robustness is not just about exception handling! System architecture and design plays an important role!

In the last post, we introduced an example of an application that clearly needed to be robust. Now we need to figure out how to make design decisions that help us achieve that goal.

A naive implementation of the order import application from the last post - fetching orders from a web shop via FTP, importing them into the local warehouse system - is something like the following pseudocode;

Some may think that making the processing robust simply amounts to adding a try-catch inside the inner loop in order to catch per-file errors, and another try-catch inside the outer loop. While I agree that it would indeed make this implementation considerably more robust, it’s a bit like thinking that adding sprinkles to your vanilla ice-cream makes it a five-star dessert! Its better, but there’s certainly room for improvement!

We need to consider robustness before rushing to do implementation. We need to consider the various error scenarios. For each scenario we need to consider our options to improve robustness in the face of errors, and find an appropriate strategy to handle the error, such as retrying, fallback to an alternate strategy, ignore the error and continue, or throwing an exception, aborting processing.

Let’s start with one of the most likely sources of errors, the network connection to the FTP Server.

Unnecessary dependencies

So, your application might not be able to reach the FTP Server at all times. No amount of try-catch will help you reach it when the network is down. But what would improve robustness, then? Well, if we did not need to reach it at all times!

One important thing to realize is that the naive implementation needs both the FTP Server and the Warehouse system up and running at the same time, in order to import anything. It may seem like they obviously must both be up, because we are in a sense moving orders from the FTP to the warehouse system. But this is not really true, and this misunderstanding has a negative effect on reliability.

Imagine what would happen if for some reason the network connection to the FTP server was unexpectedly lost for a couple of hours. Maybe the FTP Server is located within driving distance, and someone in desperation fills a USB flash drive with hundreds of orders from the web shop and drives to the warehouse, handing over the drive to you. Unfortunately, this would be of no immediate help, as the naive application would not be able to process these orders from the removable drive, as it only knows how to fetch orders from an FTP Server! Sure, if you’re lucky you might be able to do an rather stressful ad-hoc set up of a local FTP Server and re-configure your application to use that, but there are other problems with this design.

This version of the application is also always downloading and importing one order at a time. Let’s assume that downloading is rather quick, and takes no more than a tenth of a second per order, but importing into the warehouse system is rather slow, lets say around 30 seconds per order. If you were told well in advance that the warehouse external connection will be down between 10 to 12 for maintenance, your application would still not be able to do any processing between 10 and 12. Even if there were hundreds of orders waiting at the FTP Server by 9:50, your application would only manage to fetch and process about 20 orders until 10:00, and then processing would stop when the application was unable to fetch another order.

Independent processes

Clearly, we can improve both of these scenarios by re-designing the application into two independent processes, separating downloading from importing. If we do that, the download process would be able to easily download a couple of hundred orders in less than a minute, and you are likely to have an empty order queue at the FTP when the network goes down. Then the importing process could churn away on two orders per minute during the two hours of network maintenance.

Even if we were not that lucky, and hundreds of lunch orders was placed at the web shop at around 11 (temporarily out of reach from your application) the “manual move” USB flash drive trick would then be simple to perform if needed, just a matter of dumping a batch of order files into the application processing folder.

Either way, this design is way more robust than the naive implementation, regardless of how much error handling is put into the latter! Now, we are likely not to even notice shorter periods of downtime.

At an infrastructural level, you may also be able to improve robustness by installing network devices that can fallback to a secondary internet connection, if the primary connection fails. Note that this is no replacement for separating downloading from importing, however, as there are still scenarios where the separation is beneficial, even if the network would be 100% reliable.

Focus: Robustness Description: Make sure the system is able to handle and recover from failures

Robustness is a very important property of a system, especially for unattended and/or long-running operations. You should be able to estimate for each module, component, or function you are building, what the need for robustness is, and the things that are most likely to go wrong. You then have to make sure those things do not compromise your system’s overall functionality in a manner more severe than the error actually calls for.

There’s no reason your system shouldn’t be able to gracefully handle at least all of the common errors, most of the anticipated errors, and even a lot of the unanticipated errors that can, and will, occur at run-time. A small, commonly occurring error should not be able to cause a lot of trouble!

An Example

Let’s say you are employed by the FruutXpress company that is in the business selling fresh fruit with express delivery to customers in your city via their web shop. The customer orders needs to be shipped pronto, and there’s a warehouse full of fruit and 30 workers to pick the incoming customer orders into boxes and onto the delivery trucks.

Your task is implementing a program that imports the orders from new FruutXpress web shop, into the company warehouse system. We are told that the incoming orders are available as XML files, one order per file, on an FTP server for your import program to read. You will use the warehouse system’s rather simple API, that will let you query and import data.

On the surface a rather simple task, right? I mean, how hard can it be, just reading a couple of lines of some text file containing how many of which fruits have been ordered, a delivery address, and not much more, and then importing this into the warehouse system (which then can print out picking lists, shipping labels, and whatever else is needed for the warehouse personnel to do their work).

However, when thinking about it some more, you realize it’s a rather important function that is absolutely critical for the FruutXpress company. If your program would stop working, the customers would still place orders at the web shop, but the orders will never reach the warehouse system, and the delivery trucks and workers would sit idle, because they wouldn’t get any new orders to pick. Or rather, they would be running around screaming about probably having to work overtime in order to fulfill all pending deliveries, and you’d have warehouse management on the phone within minutes complaining!

Clearly you must design and build a really robust import program! It’s important that it does its job well, and only rather catastrophic errors, such as the network being down, or a server crash, should be able to force it to abort processing. It should be able to keep processing orders, even in the face of many different error conditions.

Focus: FunctionalityDescription: Make sure system meets even the unstated requirements!

If you are dealing with a customer with little experience of specifying software requirements, they are likely to only include functional requirements in the specification. They will likely leave out most or all of the non-functional requirements. I’m not talking about things that doesn’t work, I’m referring to a requirements that does not directly specify functionality the system should provide, but instead specify how the system should be, describing a quality of the system. These requirements, even though often unstated, are often crucial to the eventual success of the system.

Most of our other system properties are non-functional in nature, and thus I’m arguing that capturing these non-functional requirements are the key to actually providing the customer with a high-quality system, making it functional by design.

If we’re building a web-shop, there’s likely a requirement to send each order to the warehouse for picking. This is an entirely functional requirement. There could be multiple unstated non-functional requirements hidden in this requirement.

It does not say what delays would be acceptable (from the customer placing the order to it being available for picking in the warehouse) - a non-functional requirement related to performance.

It also might not say anything about the importance of not crashing or stalling upon trying to import an incomplete order (say one without an OK delivery address, or for obsolete items) - a non-functional requirement related to robustness.

This web-shop system can comply completely to the functional requirements, but still infuriate the customer because it doesn’t fulfill the (unstated) non-functional requirements. Consider if it hangs when receiving a single incomplete order, blocking further order processing, needing manual intervention to restart, or if orders from the web-shop are being imported to the warehouse system at a rate far slower than the customers are placing them at the web shop during peak hours, causing unwanted delays in picking.

Obviously, the best would be if all requirements were stated, instead of unstated, but in my experience customers seldom includes non-functional requirements. Of course, that doesn’t stop them from being unhappy when the system does not meet the unstated requirements (“but it’s obvious this shouldn’t take this long!”)

Don’t underestimate the power of the stuffing

Possibly the customer hasn’t even fully considered all these details when you are starting to build the system. Since not all details are known in advance, you (or your team) will have to fill the gaps with some “stuffing”, when you get down to that level of detail. Very often this stuffing corresponds a great deal with any unstated non-functional requirements.

The quality of the stuffing you and your team fill the system with, will be directly proportional to your understanding of the problem (and problem domain), and also to your development experience. Yes, you can ask questions, but if you ask too many, you’ll start getting contradictory answers or annoy your customer who had the hope you would be able to take sensible action on your own.

Domain knowledge and experience will tell you when the customer says one thing but really means another. It will help you decide to make configurable those things that are likely to change (even if the customer didn’t explicitly tell you so). It will help you make the right decisions, and guide your development towards a system that is not merely implementing each requirement, but is functional by design, and where our other system properties have been considered, and added in an amount appropriate for the system you are building.

The importance of experience and domain knowledge should not be underestimated! A team of developers, lacking either experience or domain knowledge are likely to get several things wrong at first. A team lacking both are more or less guaranteed to get most things wrong, and are even unlikely to ever get a moderately complex system to a point where it’s mostly working. This is because the inexperienced team hasn’t yet learned to consider the unstated non-functional requirements, but hopes to achieve full functionality by more or less independently working their way through implementing the requirements to the letter, one by one.

This reminds me of an assignment in programming class at university, where we were implementing a text editor, and had a requirement to warn the user of unsaved changes if he tried to exit. Some students then simply let the editor exit with a message like “Warning, you had unsaved changes” printed on the console.

Focus: Functionality Description: Make sure the system meets the stated requirements!

Surprisingly, when discussing design, it's easy to forget about perhaps the most important design criteria, namely that the system should be able to provide the required functionality! I actually didn’t include this in my first draft of important design criteria.

Perhaps it is thought that the necessary functionality is expressed solely in the requirements specification, and then realized by the implementation, and that the design is in some way orthogonal to this. However, a sloppy system design can ruin even the most well-specified requirements. If nothing else, it might set some unfortunate decisions in stone (well, code), and make other requirements hard or impossible to meet.

Examples

That quick and dirty hack you did a couple of months ago – you know, the one where you remove obviously invalid customer email addresses on the incoming orders from the new web shop (because those caused issues at your end, and the consultants had far more pressing issues than improving email validation rules).

You know the hack doesn’t catch all possible errors, but it clearly improved the situation, and has been running OK since day one. Even if that hack had no other of the desirable properties we are discussing here, it was functional, and provides value, or you would've removed it from production.

And that new web shop, if it isn’t functional – if it does not even provide the basic shopping functionality that your company wanted it to have – then it’s in effect worthless! This holds even if it would have a whole host of other properties (including some less desirable properties such as costing too much).

Obviously, a more complex system can provide value, even though it does not do everything exactly the way the customer wanted. It may have bugs, or a huge misunderstanding that makes one module clearly less useful than intended, but it can still be functional as a whole.

How to achieve functionality by design

The most important tools for achieving functionality are:

A good requirements specification. (Hah! as if that’ll ever happen!)

Good communication with the stakeholders (the customer, the users etc.)

A good understanding of the problem domain. (Domain knowledge.)

General development experience. (The more, the merrier!)

Several releases of the system. (Since you will never get it right the first time!)

It is important to note that a requirements specification, or even communication with the customer, will never in itself give the whole picture. At least, I’ve never been on such a project, and I’m pretty sure I never will. This is because the customer won’t be able to get every tiny detail down on paper, and won’t even be able to tell you everything during meetings.

The more complex the system, the less likely you are to get the whole picture up front before starting development. In all likelihood, no one (including the customer) will even understand the full system completely at that point, so understandably they cannot give you all the details.

It is also important to realize that it is very common that many quite important requirements are not actually stated explicitly in the requirements specification! There is a whole category of requirements that the customer often leave implicit and unstated, but that we should nevertheless understand if we want to build a great system. More on this in the next post!