Monday, April 30, 2007

The old EDA v SOA or SOA 2.0 view is still doing the rounds I note, which isn't surprising given the technology focus of lots of companies. But I thought I'd try very very briefly to explain what the difference is and why from a business perspective it isn't important what the implementation approach is.

Business Scenario

Okay so the scenario will be a standard supply chain where the invoice is sent out at the same time as dispatch from the warehouse. Using the SOA Methodology this gives us two "what"s, a "who" and a few "why"s, one of which goes between the two services. So that is the business view. The next bit is to compare the event based and request based implementations of that single "why" between the services. The direction, from a business perspective, is clearly from warehousing to finance, as that is what works in the business language for this scenario. This however doesn't matter to the technical implementation of the services. The key, as with Christmas SOA, is understanding how the business objectives can be met by technology, not trying to do a one-to-one match.

Request based

So what is the implementation in a request/reply world? Well clearly we have two services to build from above, but the question is what the implementation approach is for the request. The key with a request implementation is that the question is "I need to ask the other service to do something", hence it's all about the destination service in the request. In plain old pseudo code this means

service Finance has SendInvoice capability, which accepts a purchase order and customer details

This is the "traditional" Web Service way of thinking about things (although of course REST would be a post of an invoice as the code implementation). This is fine and dandy and clearly meets the requirements of the overall business scenario.
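As a rough sketch of that request style in Python (the class names, method names and the PurchaseOrder fields are all my own illustrative assumptions, not taken from any real system), the Warehouse has to know about Finance and asks it directly to do something:

```python
from dataclasses import dataclass, field


@dataclass
class PurchaseOrder:
    order_id: str
    customer: str
    amount: float


@dataclass
class Finance:
    """Destination service: it exposes SendInvoice as a capability."""
    sent: list = field(default_factory=list)

    def send_invoice(self, order: PurchaseOrder) -> str:
        invoice_id = f"INV-{order.order_id}"
        self.sent.append(invoice_id)
        return invoice_id


@dataclass
class Warehouse:
    """Consumer: it holds a reference to Finance and asks it to act."""
    finance: Finance

    def dispatch_order(self, order: PurchaseOrder) -> str:
        # The physical dispatch would happen here; then we make the request.
        return self.finance.send_invoice(order)


finance = Finance()
warehouse = Warehouse(finance)
invoice = warehouse.dispatch_order(PurchaseOrder("1001", "Mr Jones", 25.0))
```

The point to notice is the direction of the dependency: Warehouse carries a reference to Finance, which is exactly the "it's all about the destination service" part.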

Event based

In the event based world the question is about what has happened that will trigger a change. In this world it's what the destination service is waiting to happen from the consumer. So in plain old pseudo code terms this means that we have two definitions.

service Warehouse has DispatchOrder capability which sends an order dispatched event on completion

service Finance has SendInvoice capability which listens for dispatched orders

This is the "traditional" Event way of thinking about things. Again this meets the business capability.
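A minimal sketch of the event style, again in Python. The EventBus here is a tiny in-process stand-in that I've made up for illustration (a real system would use whatever eventing infrastructure it has), and the event name "order_dispatched" is likewise an assumption:

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Tiny in-process stand-in for real eventing infrastructure."""

    def __init__(self) -> None:
        self._listeners = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._listeners[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._listeners[event_type]:
            handler(payload)


class Finance:
    """Listens for dispatched orders and sends an invoice for each."""

    def __init__(self, bus: EventBus) -> None:
        self.invoices = []
        bus.subscribe("order_dispatched", self.send_invoice)

    def send_invoice(self, event: dict) -> None:
        self.invoices.append(f"INV-{event['order_id']}")


class Warehouse:
    """Announces what has happened; it knows nothing about Finance."""

    def __init__(self, bus: EventBus) -> None:
        self._bus = bus

    def dispatch_order(self, order_id: str) -> None:
        self._bus.publish("order_dispatched", {"order_id": order_id})


bus = EventBus()
finance = Finance(bus)
Warehouse(bus).dispatch_order("1001")
```

Here Warehouse only knows about the bus, so the dependency has moved into Finance, which decides what it listens for. The business outcome is the same invoice.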

Pseudo Event - Polling for requests

There is a "third" (and in my opinion often technically poor) way of working which is to pretend you are event based when really you are building an event model by polling with requests. In this world the Finance service would continually put a request into Warehousing to ask "have any orders been dispatched", so back in pseudo code we have

service Warehouse has DispatchedOrders capability which returns a list of the recently dispatched orders

In this case it's sort of like Warehouse publishing an RSS feed of dispatched orders and Finance subscribing to that feed, or in old school speak it's like Warehouse dumping a batch file at the end of the day to a shared disk and Finance uploading it. This again meets the business objective.
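The polling variant might look something like this in Python. The "position" cursor is my own assumption about how Finance would avoid invoicing the same order twice; the pseudo code above doesn't say how "recently" is defined:

```python
class Warehouse:
    def __init__(self):
        self._dispatched = []  # append-only log of dispatched order ids

    def dispatch_order(self, order_id):
        self._dispatched.append(order_id)

    def dispatched_orders(self, since=0):
        """Return (orders dispatched after position `since`, new position)."""
        return self._dispatched[since:], len(self._dispatched)


class Finance:
    def __init__(self, warehouse):
        self._warehouse = warehouse
        self._position = 0  # how far through the Warehouse log we have read
        self.invoices = []

    def poll(self):
        """Ask Warehouse "have any orders been dispatched?" and invoice them."""
        orders, self._position = self._warehouse.dispatched_orders(self._position)
        for order_id in orders:
            self.invoices.append(f"INV-{order_id}")


warehouse = Warehouse()
finance = Finance(warehouse)
warehouse.dispatch_order("1001")
warehouse.dispatch_order("1002")
finance.poll()
finance.poll()  # nothing new, so nothing is invoiced twice
```

In practice poll() would sit on a timer, and the polling interval becomes the delay between dispatch and invoice.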

Summary

My point here is that whether you are doing Request/Reply, Eventing or Polling, this is an implementation decision and is irrelevant from the business perspective. The key is to understand the best implementation method that meets the business objectives. This will be based on performance, timeliness and other "ilities", or will just be down to "how the current systems work". This separation of the business objectives from technology implementation is part of the key to making sure that IT moves in step with the business. You need to make IT deliver in business terms, which means understanding the objectives rather than blindly implementing the requirements. Request/Reply, EDA, Polling, BDI and the like are all just technical implementation approaches to meet the business objectives. They are tools in the toolbox, and they are about design rather than architecture.

SOA, as in Business SOA, is about architecture; once you have that, choose the right tool for the implementation job.

The thing that unifies them all however is that the mentality of systems designers should be to think about how to properly fail the system. Planning for failure is about understanding what makes sense, what really is critical and what you can cope with. As systems become more and more distributed and are co-ordinating more and more services it will be reckless to assume that everything will work just how you envisaged it.

Failure will happen. Don't cope with it... plan for it. Sometimes you might even force failures in order to keep the core of the system working. Failure shouldn't be a binary condition.

Friday, April 20, 2007

This one is about how important it is to have perfect accuracy. For some things it's a really bad idea not to have perfect accuracy (banks get a bit upset, for instance), but in other cases it's much better to go with at least a degree of confidence if you can't get perfect information rather than waiting for the perfect information to arrive. Put it this way, if two ships are about to collide there is a reasonable expectation that both will veer to starboard, but there is a chance that one of them won't know they should do this and in fact going to starboard will result in a crash. The perfect information is only available after the crash, therefore it's better to make a smart decision based on what you have available.

In some ways this is related to the time criticality of information, for instance "how accurate does the stock count have to be?". Again, making a system more reliable means understanding the tolerances of the information. So sure you can go and get a super accurate forecast from the great big service running on the massive servers, but if they are down it might be okay to just use a lightweight algorithm to do a 95% accurate calculation. It might be the best thing to go off and get the timestamp from one of the Atomic Clocks, but if you can't get it then it's probably okay to use your local machine clock.

This is similar to the timeliness of information but is different as it is not about how time impacts information but is purely about information accuracy. What this means is understanding what can be done when you don't have the time or opportunity to get the perfect information. So this is where you can't run a Monte Carlo Simulation because the connection is down. It's where it's okay to say the population of France is "around 60 million" when looking at some high level marketing. It is also about understanding that it is okay to say "around 2 quid" when you are buying one element, but when buying a million you need to know the price exactly.

In one sense this is about knowing when it's okay to reference Wikipedia for geographical information and when it's better to talk to the Ordnance Survey. It's about understanding when you can replace a product and the customer won't care, and when doing so will result in a legal case.

Information doesn't always have to be 100% accurate, but coping with variable information accuracy is one of the most challenging problems of reliability. This isn't about a simple proxy or configuration changes, this is about different types of operation given the currently available information quality.
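To make the idea a bit more concrete, here is a hedged Python sketch of "use the best information you can actually get, as long as it is good enough for what you are doing". Everything in it (the Estimate type, the accuracy numbers, the two forecast functions) is made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    value: float
    accuracy: float  # 1.0 means "as good as we can get"


def best_available(sources, required_accuracy):
    """Try sources from most to least accurate and return the first one
    that answers and still meets the caller's tolerance."""
    for fetch, accuracy in sources:
        if accuracy < required_accuracy:
            break  # everything further down the list is too rough
        try:
            return Estimate(fetch(), accuracy)
        except ConnectionError:
            continue  # this source is down, try the next best one
    raise RuntimeError("no source meets the required accuracy")


def big_forecast_service():
    # Hypothetical: the great big service on the massive servers is down.
    raise ConnectionError("forecast servers are down")


def lightweight_forecast():
    # Hypothetical: a cheap local calculation that is roughly 95% accurate.
    return 1040.0


# Sources are listed from most accurate to least accurate.
sources = [(big_forecast_service, 1.0), (lightweight_forecast, 0.95)]
marketing = best_available(sources, required_accuracy=0.9)
```

The same call with required_accuracy=1.0 raises instead of returning the rough answer, which is the "banks get a bit upset" case: the caller states the tolerance, and the operation changes with the quality of information available.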

Let's put it this way... don't start with variable information accuracy to make your SOA environment more reliable, this is for those with budget and real determination.

Well when I wrote the post on the importance of understanding the minimum operating requirements I didn't think I'd have a great real-world example of what happens when people don't decouple their critical from their non-critical and enable the essential to continue when there are issues with the unimportant....

According to El Reg the folks at Blackberry say that their outage was due to a non-critical caching upgrade. This is exactly what I mean about enabling businesses to turn off the non-essential and really planning systems to degrade successfully rather than just switching off when anything goes wrong.

And these folks know networking... so I think we can all guess what the average application is like.

Thursday, April 19, 2007

Sure it's great to have the latest prices, or the latest stock figure, but what do we mean by latest? What it was yesterday? An hour ago? A second ago? And sure it's great when newer systems are able to give more accurate information than the old batch processing overnight information in the data warehouse solutions of a few years ago. But just because you now can get the information in "real"-time doesn't mean that if you can't get it right now you have to keel over and die. Sure if you are doing a market trade on the stock-market it's essential to have the price as accurately as possible, but what about if you can't get to the product catalogue for a car maker, is it okay to show the same price as yesterday?

Caching is your friend in this world, and this isn't just caching in the sense of "don't ask again for 10 minutes" à la HTTP, it's much more about "this information is valid for 1 hour, but feel free to ask me if I'm available". This is another form of failure tree. In this model the request is made for the "real"-time information; if that succeeds it is cached, if it fails a check is made to the cache to see if the information is in there and if that information is still viable. Understanding the impact of this cached information, for instance a cached stock-level might mean that a customer won't get their product in 3 days time, is the important piece here.
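A minimal Python sketch of that failure tree, assuming a single cached value and treating ConnectionError as "the real-time source is unavailable". The class name and the injectable clock are my own choices for illustration:

```python
import time


class FallbackCache:
    """Always try the live source first; on failure, serve the cached value
    if it is still within its validity window."""

    def __init__(self, fetch, valid_for_seconds, clock=time.time):
        self._fetch = fetch
        self._valid_for = valid_for_seconds
        self._clock = clock
        self._value = None
        self._stored_at = None  # None means nothing has been cached yet

    def get(self):
        try:
            self._value = self._fetch()
            self._stored_at = self._clock()
            return self._value, "live"
        except ConnectionError:
            if self._stored_at is None:
                raise  # nothing cached, so the failure has to propagate
            age = self._clock() - self._stored_at
            if age > self._valid_for:
                raise  # cached value is no longer viable
            return self._value, "cached"
```

Note that get() returns where the value came from as well as the value itself. That is deliberate: the caller needs to know it is working from cached information so it can deal with the business impact, such as warning the customer about the delivery date.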

The way I tend to think about these cases is in two basic scenarios. The first is easy, namely "it's in real time", and that is what everyone is striving towards today. The second however is what people are moving away from, which is "imagine if it was still nightly batch". You can include finer degrees of granularity if you need to but I find considering it in those two ways helps you understand what information is really time critical and what information is just time convenient.

The goal here is then to design the system so that either the time of information is irrelevant (unlikely) or the system has a decent set of tolerances around what different time-scales mean. I'd recommend in this state having a set of bands which indicate the boundaries and detailing the impacts at each of those levels. So for example

10-120 minutes delayed - 75% probability of accuracy problems, inform the customer they will get a later email confirming status if it changes

120+ minutes delayed - Pot luck on whether it works, tell the customer to expect an email in the next 24 hours with their delivery date.
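Bands like these are easy to turn into code. A small sketch in Python, where the action names are invented for illustration and the treatment of anything under 10 minutes as needing no special handling is my assumption, since the bands above don't cover it:

```python
def staleness_action(age_minutes):
    """Map the age of the information to what we tell the customer."""
    if age_minutes < 10:
        # Assumption: no band is given below 10 minutes, so treat as normal.
        return "proceed_as_normal"
    if age_minutes < 120:
        # 10-120 minutes: likely accuracy problems, promise a follow-up.
        return "email_if_status_changes"
    # 120+ minutes: pot luck, so commit only to a later confirmation.
    return "email_delivery_date_within_24h"
```

Writing the bands down this explicitly forces the conversation about what each boundary means for the business.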

Not having real-time information shouldn't mean failure in lots of applications, but too often people take the easy "something failed so I should" route out. This again is different to the failure cases mentioned previously and the concept of minimal operation. Those deal with failures and have a mitigation, this is about coping with sub-optimal working.

Wednesday, April 18, 2007

I've been having some chats in the last few days about how SOA and SaaS work together and how SaaS will actually pan out (as opposed to the hype), and one thing that has become really obvious is that SaaS isn't really SaaS. What I mean is that no-one is talking about buying Software when they buy SaaS, they are looking to buy a Service, whether that is a word processor, Salesforce automation, invoicing, accounts or whatever. People are looking to buy a business service which they access. They don't care how that is created, they don't care at all about the technology, it's just the delivery to them that matters.

SaaS really is about Service as a Service, that is providing a business service whole and in itself in the way that you want. Software as a Service is the IT view, but it isn't IT that is buying SaaS.

Okay so it's "nice" to have the preferred form of address for a customer when you put up the order response page. But should you display an "order not processed" just because the service that tells you if it's "Mr Jones" or "Steve Jones" isn't available, or should you just say "Dear Customer" instead? Well clearly for something like that the answer is go with "Dear Customer".

This isn't limited however to just trivial data formatting elements; it can apply to things that appear to be much more critical. The key here is to understand how to aggressively degrade the application. What this means isn't just planning for when a call fails, but actively making things fail and reducing the operation of the system to a minimal subset, and ensuring that the reliability of that subset is maximised.

With the failure lists above it was a case of something failing and then the calling service coping. With this approach it's actually a case of deliberately not accessing services and operating as if they never existed. Why would you want to do this? To reduce the risk of knock on effects from failures in terms of system resources, live/dead locks and data errors. If you can shut down into a "safe" mode of operation you can at least keep the lights on and keep the core business running.

As an example, take a system that actually runs the core production line at a drug company, where there are a number of systems that can change what is being made and to which the production line reports. The critical factor is just keeping the line operating, as every hour it's not working means lost revenue for the company. Here you could look at the safe mode as designing the system so it can operate successfully without any links to external systems. This could involve log files being shifted after hours, or even tapes being couriered to another data centre. It means understanding how long this form of operation can continue and giving IT and the business time to put in place contingency plans. The point here is that the manufacturing line is the minimum operating condition. If there is any risk that external factors could force it to operate below peak efficiency then these should be ruthlessly shut down and the core allowed to continue to operate.

This sort of approach is very important when dealing with 3rd party systems where there could be legal or trust issues that require you to shut down access in a hurry, either because you feel the remote system has been compromised or because they are not meeting their SLAs.

The difference between minimal operation and planning for failure is that the services might not actually have failed, but an operational decision is made to work without them. This is the critical difference. Planning for failure is about coping with failure that happens operationally and implementing a mitigation plan. Deliberately degrading the system is a business and technical decision where it is decided that the risk of failure or information error outweighs the benefits.

Another example of this would be when a system has to handle dramatically increased loads due to an external event, maybe a surge in demand due to an overly successful marketing campaign or some external problem that has caused a surge in exception conditions. Rather than the system taking the classic "normal" two step IT solution to this problem:

Specify the hardware to a level so large it bankrupts the company

When the system falls over because the company refused to go into bankruptcy go "na, na, told you so"

There is another choice here which is to enable the business to have the "oh shit" moment and then start turning off pieces that they currently decide are non-critical and just operate the core that is required to handle this unexpected event. This might mean, for an issue, concentrating resources on the support functions and for a marketing campaign it might mean taking the business decision to take the orders and batch up the payments at the end of the day or preventing people from searching for products as 95% of people are looking for the specific element from the marketing campaign.

Designing a system so elements can be deliberately failed is a big change in the way applications are built today, but in a distributed SOA environment it is going to become more and more important.

Old ways won't work. So what does it take to actually do this in a system? First off it means that you need to understand for each service what it calls and have a simple "Shall I call" check that can be toggled at runtime (not exactly difficult these days). If the answer is "don't" then you need to have a mitigation plan in place. Stage 1 is to put the check in place, and have no mitigation. This is the cheap first way of enabling your system to adapt in future as such challenges become your operational reality. The important bit is to really understand what the actual core operation is for your business so you can start planning for that, and not creating an SOA environment that is rich and dynamic when everything is fine, and when there are issues it's buggered.
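The "Shall I call" check really can be that simple. A sketch in Python, using the preferred-form-of-address example from earlier as the mitigation; the class name, the service name "customer-preferences" and the lookup function are all hypothetical:

```python
class CallSwitches:
    """Runtime toggles answering "shall I call?" for each downstream service."""

    def __init__(self):
        self._disabled = set()

    def disable(self, service):
        self._disabled.add(service)

    def enable(self, service):
        self._disabled.discard(service)

    def shall_i_call(self, service):
        return service not in self._disabled


def greeting(customer_id, switches, lookup_preferred_name):
    """Use the preferred form of address if we are allowed to ask for it
    and the service answers; otherwise fall back to "Dear Customer"."""
    if switches.shall_i_call("customer-preferences"):
        try:
            return f"Dear {lookup_preferred_name(customer_id)}"
        except ConnectionError:
            pass  # planning for failure: the call was made but failed
    # Minimal operation: we were told not to call, or the call failed.
    return "Dear Customer"
```

When the switch is off the downstream service is never touched at all, which is the difference between deliberately degrading and merely coping with a failure. In a real deployment the switches would be driven from operational configuration rather than held in memory.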

It's not a tough question to ask the business... but it's one I've rarely seen asked.

Tuesday, April 17, 2007

At some stage in a service's lifecycle one of the interactions that it makes will fail, therefore start from the assumption that it will fail and start thinking of the impact and cost of that failure. There are four basic scenarios for failure of an invocation:

Catastrophic - Failure of the invocation renders the consumer invalid, in other words future calls to the consumer (not the service) are invalid as a result of this failure. It's a really bad design position to get into when this happens but there are some cases where it's possible (for example hardware failure), mostly it's down to the stupidity of the people doing the design.

Temporal Failure - Failure of the invocation renders the current processing of the consumer invalid. This is a very normal scenario, for instance when you run out of database connections or when a network request fails. This is the standard for most systems, it just propagates the failure all over the network.

Degraded Service - Failure means the consumer can continue but is not operating at optimal capacity. This is a great place to be as it means the service is coping with failure and not propagating it around the network.

DCWC - Don't care, Won't Care - This is where the invocation was an embellishment or optimisation that can be safely ignored. An example here would be the introduction of a global logging system where the backup is to local file. From a consumer's perspective there is no change in QoS or operation. There are some operational management implications but these have well defined workarounds and are not related to the core business operation.
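One cheap way to start using these four categories is to write down, for each invocation a service makes, which category its failure falls into today, and then list the ones that still propagate failure. A sketch in Python, with invented dependency names:

```python
from enum import Enum


class FailureImpact(Enum):
    CATASTROPHIC = 1  # the consumer itself becomes invalid
    TEMPORAL = 2      # the current processing is lost
    DEGRADED = 3      # the consumer carries on below optimal capacity
    DCWC = 4          # don't care, won't care


def propagating(dependencies):
    """Names of dependencies whose failure still propagates to callers."""
    bad = {FailureImpact.CATASTROPHIC, FailureImpact.TEMPORAL}
    return sorted(name for name, impact in dependencies.items() if impact in bad)


# Hypothetical classification for an order service.
order_service = {
    "transactional-system": FailureImpact.TEMPORAL,
    "stock-level-lookup": FailureImpact.DEGRADED,
    "global-logging": FailureImpact.DCWC,
}
todo = propagating(order_service)
```

Whatever ends up in that list is where the design work needs to happen.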

The goal of a high-availability SOA is to get everything into the Degraded or DCWC categories. This means planning the workaround right from the start. Part of the question is whether the high-availability is really business justified. If it is then it's time to start understanding how to fail.

Primary, Secondary and Beyond

The easiest way to provide a fallback is to have redundant primary nodes. This is a very common solution for hardware failure (clusters) and can also be used to solve network connection issues. What might be done here is to have multiple primary services which are hosted in different environments, meaning that if one environment fails the operation is seamless. The trouble is that redundant primary is often prohibitively expensive as it can require complex back-end synchronisation of data with all of the problems that this brings.

Next up is the idea of having secondary, or greater, services or routes. As a simple example let's take an order service that is trying to submit the order to the actual transactional system.

Submit request via WS over internet - Failure on exception

Submit request via WS over VPN - Failure on exception

Submit request via WS over JMS - Failure on exception to connect

Log to file - Failure on exception to connect

Here what we have is a variable QoS. The second route is slower, but uses the same technology and would be expected to return the same responses. The JMS and log to file routes are both long term operational elements and so any expected responses cannot be assumed to happen within the required time period. So for order submission this might mean a lack of confirmation of the order and no ability to commit to a delivery time. That is looking at Primary/Secondary/etc from a network connectivity perspective but this isn't the only way to consider it. Other examples could be from a business perspective in terms of supplier preferences

Here we have a great financial deal with GeraldCash, and indeed we expect them to work all the time (hence Criticality is high) and them not being available will reduce the margin on a transaction and they will have to pay penalties. Next up is FredPay, its cheap enough but not the best from a reliability perspective, finally from a technical delivery perspective we have the Bank of Money, very reliable but the most expensive by a mile. If everything fails and we don't want to lose the transaction we could try and route it to the call centre for some offline processing.

So planning for failure means understanding what else could be done and what other calls could be made that would deliver the functionality required but within different bounds to the original request.
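As a hedged sketch of what such a fallback list might look like in code, here is the order submission example in Python. The Route type, the "synchronous" flag and the route names are my own illustrative assumptions; the flag is there to capture the variable QoS, namely whether a confirmation can be expected in time:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    name: str
    send: Callable[[dict], None]
    synchronous: bool  # False: no confirmation within the required time


def submit(order, routes):
    """Walk the routes in order and return the first one that accepts the
    order, along with the failures encountered on the way."""
    failures = []
    for route in routes:
        try:
            route.send(order)
            return route, failures
        except (ConnectionError, TimeoutError) as exc:
            failures.append((route.name, exc))
    raise RuntimeError(f"all routes failed: {[name for name, _ in failures]}")


def down(order):
    # Hypothetical: both web service routes are unreachable.
    raise ConnectionError("unreachable")


queued = []  # stands in for a message queue
routes = [
    Route("ws-internet", down, synchronous=True),
    Route("ws-vpn", down, synchronous=True),
    Route("jms", queued.append, synchronous=False),
    Route("log-file", queued.append, synchronous=False),
]
route, failures = submit({"order_id": "1001"}, routes)
```

The caller gets back which route worked, so when route.synchronous is False it knows not to promise the customer a confirmed delivery time.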

Planning for failure means it's critical to protect your service from those that it is calling. Using a proxy pattern is an absolute minimum operational requirement, because it's in that proxy (potentially via some infrastructure elements) that this failover can be done. You do not want the core business logic worrying about the failover tree, it just needs to be able to cope with the changing capabilities that are available to it. This means that you need to design the system to have a set of failure modes even if you can't see that being needed today. This doesn't mean you build the failure modes at this time, but that you put the basics of the framework in place to enable it. And is a proxy really that much of an overhead? Nope, I don't think so either.
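To show how little overhead that proxy is, here is a minimal Python sketch using the payment supplier example. The provider functions and the offline queue are stand-ins I've invented; only the supplier names come from the example above:

```python
class PaymentProxy:
    """What the business logic sees: a single pay() call. The failover
    order lives in here, not in the business logic."""

    def __init__(self, providers, offline_queue):
        self._providers = providers  # list of (name, charge) in preference order
        self._offline_queue = offline_queue

    def pay(self, amount):
        for name, charge in self._providers:
            try:
                charge(amount)
                return name
            except ConnectionError:
                continue  # try the next preferred supplier
        # Everything failed: don't lose the transaction, route it offline.
        self._offline_queue.append(amount)
        return "call-centre"


def unavailable(amount):
    # Hypothetical: this supplier cannot be reached right now.
    raise ConnectionError("supplier unavailable")


def accept(amount):
    # Hypothetical: this supplier takes the payment.
    return None


offline = []
proxy = PaymentProxy(
    [("GeraldCash", unavailable), ("FredPay", accept), ("Bank of Money", accept)],
    offline,
)
used = proxy.pay(42.0)
```

The business logic only ever calls proxy.pay(). Stage 1 could be a proxy with a single provider and no fallback at all; the point is that the place to add the failure modes already exists.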

Planning for failure should be a minimum for an SOA environment. This means planning for both functional and system failures and understanding the risks and mitigations that they present.