Saturday, August 13, 2011

Reining in Unruly Systems

At work we have a variety of critical systems that have grown pretty organically over the last 15 years. They have all sorts of deficiencies that directly impact maintenance and new releases. Thinking about how to improve the situation, I worked out some major objectives and the steps to get there.
While a lot's been written on the subject and undoubtedly everyone thinks about this stuff, I'm just winging it here; so cut me some slack if any of this seems obvious. If you have any specific suggestions for good reads on this topic, feel free to let me know!

We Reap What We Sow

Organic growth is sometimes the right thing, particularly if you are just getting started. However, once you get any semblance of size or maturity, your APIs, libraries and systems ought to be pretty solid. For the systems with which I interact at work we no longer have any excuses. Here are some characteristics of the situation:

* poor documentation,
* terrible API design,
* tight coupling between even disparate components in any given system,
* poor interoperability between systems,
* high dependence on individuals,
* a LOT of duplication of effort,
* focus on systems rather than services,
* weak processes surrounding the management of the systems (especially the release process).

A few years back we made a push to implement ITIL v3, but we didn't really have any top-level vision on how to accomplish this past our customer support. Poor tool implementation has hampered even that effort. That's a shame because I think ITIL's a great approach that would have put us on the path to recovery.

Ultimately we are sleeping in a bed that we made for ourselves. We have more problems than just the technical ones, particularly with regard to leadership, communication and accountability; real people problems. So any solution to our trouble will involve more than technology. I could go on and on, but for now I'll just focus on a mostly technical approach.

What to Do About It

To get out of our mess we would need to get a better understanding of our systems, both internal and external, and how they are used. With that in hand we would build a central system that exposes all the old functionality through a well-designed API. At first the API would simply wrap calls to the old systems. However, since the new API would be loosely coupled, with high cohesion, we could go component by component and slowly replace the old systems as needed. Apparently the timbot agrees with me:

The most successful projects I've seen and been on *did* rewrite all the code routinely, but one subsystem at a time. This happens when you're tempted to add a hack, realize it wouldn't be needed if an entire area were reworked, and mgmt is bright enough to realize that hacks compound in fatal ways over time. The "ain't broke, don't fix" philosophy is a good guide here, provided you've got a very low threshold for insisting "it's broke" <0.4 wink>.
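To make the wrap-first, replace-later idea a bit more concrete, here's a minimal sketch. Everything in it is made up for illustration: `legacy_lookup_customer`, `new_lookup_customer` and `CustomerComponent` are hypothetical names, not anything from our actual systems.

```python
from typing import Callable, Dict


def legacy_lookup_customer(raw_id: str) -> str:
    """Stand-in for a call into an old system (hypothetical)."""
    return "CUST|" + raw_id.strip().upper()


def new_lookup_customer(customer_id: str) -> Dict[str, str]:
    """Stand-in for a rebuilt implementation (hypothetical)."""
    return {"id": customer_id.strip().upper()}


def legacy_backend(customer_id: str) -> Dict[str, str]:
    # Phase 1: wrap the old call and translate its odd return format.
    _, value = legacy_lookup_customer(customer_id).split("|", 1)
    return {"id": value}


class CustomerComponent:
    """New API component; callers never see which backend answers."""

    def __init__(self, backend: Callable[[str], Dict[str, str]]) -> None:
        self._backend = backend

    def get(self, customer_id: str) -> Dict[str, str]:
        return self._backend(customer_id)


component = CustomerComponent(legacy_backend)
before = component.get(" ab12 ")

# Phase 2: swap in the rebuilt backend; the API that callers use is unchanged.
component = CustomerComponent(new_lookup_customer)
after = component.get(" ab12 ")
```

The point is only that `get` keeps the same shape while the thing behind it changes, so one subsystem can be rewritten without touching its callers.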

So first off we would have to figure out what we already have. I'm not talking about a fine description of inputs and outputs on functions or a breakdown of service level agreements. Rather, we need to get a broad view of all our systems (internal and external): what they are supposed to do, who manages them and especially how they are used. What interfaces do they provide to people? What remote APIs do they expose to the network? How are both of those actually used? We need a high level view of everything.

How would we do this? With hard work! We'd have to talk to everybody to find out what they work on and what systems they know about. We would find out at a high level what they know about the different systems and what those systems do. Then we'd cross-reference everything to make sure the details line up. The result would be a list of systems with all that detail.

In addition, the whole time we'd have been building a rudimentary tree of classifications for system functionality, and a graph of system interactions. The classifications would likely not correspond to any particular system, but would often overlap across many existing systems. The graph would be a one-to-many mapping of system to system for the classifications in our hierarchy. At this point we would finish up our initial effort, filling in as much info as we can from the interviews.
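Here's one rough way the tree and the graph could be held in memory while the interviews are going on. The class name, the `record` helper and all the system names ("crm", "ledger", "mailer") are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, List, Set


@dataclass
class Classification:
    """One node in the functionality tree; several systems may overlap on it."""

    name: str
    systems: Set[str] = field(default_factory=set)
    children: List["Classification"] = field(default_factory=list)

    def add(self, child: "Classification") -> "Classification":
        self.children.append(child)
        return child

    def all_systems(self) -> Set[str]:
        """Every system providing this functionality or any sub-functionality."""
        found = set(self.systems)
        for child in self.children:
            found |= child.all_systems()
        return found


# classification name -> source system -> systems it talks to (one to many)
Interactions = DefaultDict[str, DefaultDict[str, Set[str]]]
interactions: Interactions = defaultdict(lambda: defaultdict(set))


def record(classification: str, source: str, target: str) -> None:
    """Note that `source` calls `target` for the given kind of functionality."""
    interactions[classification][source].add(target)


# Hypothetical findings from a round of interviews.
root = Classification("everything")
accounts = root.add(Classification("accounts", systems={"crm"}))
accounts.add(Classification("billing", systems={"crm", "ledger"}))
record("billing", "crm", "ledger")
record("billing", "crm", "mailer")
```

Note how "crm" shows up under more than one classification; that overlap is exactly the kind of thing the tree is meant to surface.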

Next we would go to all the systems we found and look at their logs to see who/what's accessing them, how, and for what. If the logs don't tell enough we'll have to get creative. This information should be enough to fill in any gaps in our functionality classification hierarchy and our system interaction graph.
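As a sketch of the log pass, assume (and this is purely an assumption) that each log line looks like `timestamp client method path`. Real logs will differ from system to system, so the parsing would need adjusting for each one. `summarize_access` is a made-up helper.

```python
from collections import Counter
from typing import Iterable, Tuple

AccessKey = Tuple[str, str, str]  # (client, method, path)


def summarize_access(lines: Iterable[str]) -> "Counter[AccessKey]":
    """Count who calls what, assuming 'timestamp client method path' lines."""
    counts: "Counter[AccessKey]" = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that don't match the assumed format
        _, client, method, path = parts
        counts[(client, method, path)] += 1
    return counts


# Invented sample lines in the assumed format.
sample = [
    "2011-08-01T09:00:00 reports-host GET /orders/list",
    "2011-08-01T09:00:05 reports-host GET /orders/list",
    "2011-08-01T09:01:00 billing-cron POST /invoices/run",
    "garbled line",
]
usage = summarize_access(sample)
```

Each distinct (client, method, path) key is a candidate edge for the interaction graph, and the counts give a hint about which edges matter most.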

Finally, with the hierarchy and the graph, we would design an API that would provide all the same functionality. The classification hierarchy would map directly to the API components. The guiding principle here would be high cohesion and low coupling. That means it probably won't look a lot like the old systems. And that would be just fine.
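If the hierarchy really does map straight onto API components, generating the component layout could be as simple as walking the tree. A small sketch, with an invented hierarchy and a hypothetical `component_paths` helper:

```python
from typing import Any, Dict, List

Tree = Dict[str, Any]  # classification name -> subtree of subclassifications


def component_paths(tree: Tree, prefix: str = "") -> List[str]:
    """Return one API component path per classification, parents first."""
    paths: List[str] = []
    for name in sorted(tree):
        path = f"{prefix}/{name}"
        paths.append(path)
        paths.extend(component_paths(tree[name], path))
    return paths


# Invented classification hierarchy, just to show the shape.
classifications: Tree = {
    "accounts": {"billing": {}, "contacts": {}},
    "notifications": {},
}
paths = component_paths(classifications)
```

Here `paths` lists a component for every classification and a subcomponent for every subclassification, with no reference to which old system used to do the work.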

Here's a summary of the above process:

* Perform interviews
* Cross-reference and verify
* Compose rough functionality classifications and graph of system interactions
* Inspect logs
* Finalize the graph of system interactions and validate the classifications
* Design an API component/subcomponent per classification/subclassification

The New Hotness

So, the effort of getting a clear understanding of our systems is nothing to sneeze at. However, the critical pieces here are the new API and the system that will provide it. That's why I'll spend fewer words on them. <wink>

The system would have to be kept simple. Its only purpose is to expose the new API. Here's a stab at a feature list:

* fast enough,
* load balancing with failover,
* one team to be exclusively in charge of the system,
* a central access control system,
* easy to manage security,
* hard to give the wrong person the wrong access,
* pluggable API component framework,
* easy to build new API components for the system,
* easy to install new API components into the system.
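To give a feel for the pluggable framework and the central access control from the list above, here's a bare-bones sketch. `ApiSystem` and its methods are hypothetical, and it leaves out everything about speed, load balancing and failover.

```python
from typing import Callable, Dict, Set

Handler = Callable[[str], str]


class ApiSystem:
    """Toy host for API components with one central place for access rules."""

    def __init__(self) -> None:
        self._components: Dict[str, Handler] = {}
        self._grants: Dict[str, Set[str]] = {}

    def install(self, name: str, handler: Handler) -> None:
        """Plug a new component in under a unique name."""
        if name in self._components:
            raise ValueError(f"component already installed: {name}")
        self._components[name] = handler

    def grant(self, user: str, component: str) -> None:
        """Give one user access to one component, and nothing else."""
        if component not in self._components:
            raise KeyError(f"no such component: {component}")
        self._grants.setdefault(user, set()).add(component)

    def call(self, user: str, component: str, request: str) -> str:
        """Dispatch a request, but only if the user was explicitly granted."""
        if component not in self._grants.get(user, set()):
            raise PermissionError(f"{user} may not use {component}")
        return self._components[component](request)


system = ApiSystem()
system.install("echo", lambda request: request.upper())
system.grant("alice", "echo")
reply = system.call("alice", "echo", "hello")
```

Access is denied by default and granted per component, which is one simple way to make it hard to give the wrong person the wrong access.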

Having completed our new API design and our new API system, we would now get to work implementing the API as pluggable components that wrap the old systems. Once those were done and released, we could start looking at rebuilding the functionality of the old systems, one API component at a time.

Conclusion

Naturally there is a lot more I could have said, but I'll stop here. However, here are some tidbits that I think would be feasible given the above API system:

With regard to the new API and the API framework, it would probably be best to get in everything that matters, make it work, and then boil it down to trim the fat. Only then release it. It's a lot easier to add on to an API than to take away!

Some systems fill a supporting role to other systems. This includes security, logging and notifications. These supporting systems could be provided special treatment in the API framework such that other API components could more easily plug into them.

We would run a separate instance of the system for our internal systems.

This single system would also host a repo for client libraries to interface with the new API, written for a variety of languages.

URLs to the API would look like: https://remote.<company>.com/component/...

A lot could be done to let ITIL drive the efforts, design and API system I have described here.

A separate effort would be to model internal processes in the company and build effective tools around those processes. These would be done as API components for the system. Ideally those processes would be as automated as possible.
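For the URL shape mentioned in the tidbits above, here's a rough sketch of how the system might pull the component name out of a request URL. The host `remote.example.com` is just a stand-in for the real one, and `split_api_url` is a made-up helper name.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_api_url(url: str) -> Tuple[str, str]:
    """Split an API URL into (component, rest of the path)."""
    path = urlsplit(url).path
    component, _, rest = path.lstrip("/").partition("/")
    if not component:
        raise ValueError(f"no component in URL: {url}")
    return component, rest


# Placeholder host and path, only to show the shape of the result.
component_name, remainder = split_api_url(
    "https://remote.example.com/billing/invoices/42"
)
```

The first path segment picks the component, and whatever follows is handed to that component to interpret however it likes.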

When I get a chance, I'll be following up with a description of a model for modeling systems, services and processes (I like to call it SSP). Stay tuned.