Monday, April 30, 2007

I've been launching off of Todd Biske's blog roll into the world of SOA and EDA blogging. I'm actually kind of saddened that my voyage into the world of infrastructure automation has pulled me so far from a world in which I was an early practitioner. (My career at Forte Software introduced me to service-oriented architectures and event-based systems long before even Java took off.) I love what the blogging world is doing for software architecture (and has been doing for some time now), and I feel like a kid in a candy store with all the cool ideas I've been running across.

The article is actually very interesting to me from a Service Level Automation perspective. Jack captures his thoughts on the importance of building agile software architectures in the following paragraph:

Everything is moving toward on-demand business where service providers react to impulses - events - from the environment. To excel in a competitive market, a high level of autonomy is required, including the freedom to select the appropriate supporting systems. This magnified degree of separation creates a need for agility; a loose coupling between services so as to support continuous, unimpeded augmentation of business processes in response to the changing composition of the organizational structure.

(Emphasis mine.)

The only thing I would change about Jack's statement above is replacing the words "a loose coupling between services" with "a loose coupling between services, and between services and infrastructure", and changing "composition of the organizational structure" to "composition of the organizational structure and infrastructure environment". (Some may have issues with the latter, but I don't mean that services should be written with specific technology in mind--just the opposite; they should be written with an eye towards technology independence.)

This is why I have been emphasizing lately the need to view the "measure" activity through the lens of both business and technical measures. Some of the business events thrown by an EDA may very well indicate the need to change the infrastructure configuration (e.g. if the stock market sees a 20% rise in volume in a matter of three minutes, someone may want to add capacity to those trading systems). However, the technical events from a software system (e.g. thread counts or I/O latency) may also indicate the need to change infrastructure configuration on the fly.
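The idea above can be sketched in a few lines. This is a hypothetical illustration only: the event names, thresholds, and the "add_capacity" action are all invented, and a real SLA environment would evaluate far richer policies.

```python
# Hypothetical sketch: both business and technical metric events can
# drive an infrastructure response. All event names and thresholds
# here are invented for illustration.

def evaluate_event(kind, name, value):
    """Return a provisioning action (or None) for a single metric event."""
    # Business event: a sharp, fast rise in trading volume suggests
    # adding capacity to the trading systems.
    if kind == "business" and name == "volume_change_pct" and value >= 20.0:
        return "add_capacity"
    # Technical event: I/O latency beyond its goal also warrants a
    # change to the infrastructure configuration, on the fly.
    if kind == "technical" and name == "io_latency_ms" and value > 250.0:
        return "add_capacity"
    return None
```

The point is simply that both kinds of event funnel into the same response vocabulary; the infrastructure does not care whether the trigger was a business pattern or a technical one.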

I wish I could spend more time collaborating with SOA architects and "tacticians". In fact, I have been speaking with Ken Oestreich about exactly this. If you are in the SOA space, and interested in talking about how SOA, EDA and SLA interconnect, let me know by commenting below. (Be sure to let me know how to contact you.) At the very least, think about how infrastructure will measure the performance of your software systems as you start your next development iteration.

Thursday, April 26, 2007

What is the biggest hurdle to adopting Service Level Automation--or even dynamic computing in general--in today's data centers?

Is it technology? Nope. Most servers, storage and even network equipment can be managed reasonably well today by several vendors, with varying degrees of dynamic, policy-based provisioning. Several critical monitoring interfaces are also now standard in everything from power controllers to OSes to applications.

Is it infrastructure architecture? Not really, with one caveat. As long as an architecture has been built from the ground up to be easily managed and changed, with real attention paid to dependency management and virtualization where appropriate, most data centers are excellent candidates for automation--which is itself a small leap away from utility computing.

Is it software architecture? Nope. I talked about this before, but SLA systems are just your basic event processing architecture specialized to data center resource optimization. The really good ones (*ahem*) can do this without adding any proprietary agentry to the managed software payload. In other words, what ends up running in your data center is almost exactly what you would have run without automation. There is little evidence on the application host that it is being managed at all.

Then what is it? One word: culture. The overwhelming obstacle that I see in the data center market today is fear of rapid change.

It is true of the sys admins, though they get the value of automation right away. They just need to see everything work before they trust it.

It's true of the storage admins, though storage virtualization is gaining ground. Unfortunately, this doesn't yet translate to accepting constant and sometimes rapid, somewhat arbitrary change within their domain.

It is most true of the network guys. Networks are the last bastion of the relatively static "diagram", mapping each component of the network architecture exactly with an eye to controlling change. The idea of switching VLANs on the fly, reconfiguring firewalls on demand, or even not knowing which server is assigned which IP address without looking at a management UI is scary as hell for the average network administrator.

And who can blame any of them? The history of commodity computing in data centers is littered with bad results from untracked changes, or badly managed application rollouts. Add to that the subconscious (or even conscious) fear that they are being replaced by software, and you get staunch resistance to changing times.

What everyone is missing here, though, is the key differentiation between planning for change, and executing it. No one in the entire industry is arguing that data center administrators should stop understanding exactly how their data centers work, what can go wrong, and how to mitigate risk. Cassatt (and I'm sure its competitors) spends significant time with each customer, even in pilot projects, making sure the data center design, software images, and service level definitions result in well understood behavior in all use cases.

But once those parameters are set, and the target architectures, resources and service levels are defined, it's time to let a policy-based automation environment take over execution. A Service Level Automation environment is going to make optimal decisions about resource allocation, network and storage provisioning and event handling, and do it in a fraction of the time that it would take a single human (let alone a team of humans). And, as noted above, once provisioning takes place, the applications, networks and storage run just as if a human had done the same provisioning.

(By the way, none of this breaks with ITIL standards. It just moves execution of key elements from human hands to digital hands. It also requires real integration between the SLA environment and configuration management, asset management, etc.)

All of this reminds me of the paradigm shift(s) that the software development industry went through from the highly planned, statically defined waterfall development methods of the early years to the always moving, but always well defined world of agile development methodologies. It's been painful to change the software engineering culture, but hasn't it been worth it for those that have found success? And, isn't it absolutely necessary for the highly decoupled and interdependent world of SOA?

Data center operations is about to undergo the same pain and upheaval. Developers, be kind and help your brethren through the cultural shift they are experiencing. Perhaps some of you in the agile methods field can begin to work out variations of your methods for data center planning and execution? Perhaps we should integrate data center planning activities into our "product-based" approaches?

Are you ready for this shift? Is your organization? What can you do today (architecturally and culturally) to ready your team for the coming utility computing revolution?

Friday, April 20, 2007

This is the second post in my series providing a brief overview of the three critical assumptions of a Service Level Automation environment. Today I want to focus on the ways in which the metrics gathered from the "measure" capabilities of an SLA environment are evaluated to determine if and what action should be taken by the "response" capabilities.

Let me first acknowledge that my discussion of the measure capabilities included some analysis of how simple metrics are combined to create complex metrics. This is one piece of the analysis puzzle, and is a critical one to acknowledge. Ideally, all software and hardware systems would be designed to intelligently communicate the metrics that matter most to determine service levels. Where this consolidation occurs depends on the requirements of the environment:

Centralized approach: Gather fundamental data from target systems into a central metrics processor and consolidate metrics there. The advantage here is having one place to maintain consolidation rules. The disadvantage is increased network traffic.

Decentralized approach: Gather fundamental data at or near the target systems, and do any analysis necessary to consolidate that data into a simplified composite metric locally. Send the composite metric to the core rules engine (which I will discuss next).
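A minimal sketch of the decentralized approach might look like the following. Everything here is assumed for illustration: the metric names, the 0-100 normalization, and the weighted-average consolidation rule are invented, not drawn from any particular product.

```python
# Hypothetical sketch of decentralized consolidation: raw samples are
# collapsed into one composite metric near the source, so only a single
# number crosses the network to the central rules engine.

def consolidate(samples, weights):
    """Collapse raw samples (metric name -> value, each normalized to a
    0-100 scale) into one composite score via a weighted average."""
    total_weight = sum(weights.values())
    score = sum(samples[name] * w for name, w in weights.items())
    return score / total_weight

# Illustrative data a local collector might gather on one host.
samples = {"cpu_headroom": 80.0, "io_headroom": 60.0, "thread_headroom": 90.0}
weights = {"cpu_headroom": 2.0, "io_headroom": 1.0, "thread_headroom": 1.0}

composite = consolidate(samples, weights)  # this one value is what gets sent
```

The trade-off mirrors the bullets above: the consolidation rules now live on (or near) each target system rather than in one central place, in exchange for far less metrics traffic.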

Metrics consolidation is not really the core analytics function of a Service Level Automation architecture, however. The key functions are actually the following:

Are metrics being received as expected? (A negative answer would likely indicate a failure in the target component, or in the communication chain with that component.)

Are the metrics within the business and IT service level goals set for them?

If metrics are outside of established service level goals, what response should be taken by the SLA environment?

Given my recent reading into complex event processing (CEP), this seems like at least a specialized form of event processing to me. The analysis capabilities of an SLA environment must constantly monitor the incoming metrics data stream, look for patterns of interest (namely goal violations, but who knows...) and trigger a response when conditions dictate.
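The three questions above can be phrased as a single analysis step over each incoming metric sample. This is a hypothetical sketch only: the goal values, heartbeat timeout, and response names are invented for illustration, and a real engine would match far richer patterns.

```python
# Hypothetical sketch of the three analysis questions, applied to one
# incoming metric sample. Goals, timeouts, and response names are
# invented for illustration.

HEARTBEAT_TIMEOUT_S = 30.0
GOALS = {"io_latency_ms": (0.0, 250.0)}  # (min, max) service level goal

def analyze(metric, value, last_seen, now):
    # 1. Are metrics being received as expected?
    if now - last_seen > HEARTBEAT_TIMEOUT_S:
        return "investigate_component_failure"
    # 2. Are the metrics within their service level goals?
    low, high = GOALS[metric]
    if low <= value <= high:
        return None  # no response needed
    # 3. Outside the goal: trigger a response.
    return "reprovision"
```

Running this over the metrics stream continuously, and watching for the non-None results, is exactly the "look for patterns of interest and trigger a response" loop of an event processing system.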

The great thing about this particular EP problem is that well designed solutions can be replicated to all data centers using similar metrics and response mechanisms (e.g. power controllers, OSes, switch interfaces, etc.). Since there are actually relatively few components in the data center stack to be managed (servers [physical, virtual, application, etc.], network and storage), the rule set required to provide basic SLA capabilities is replicable across a wide variety of customer environments.

(That's not to say the rule set is simple...it's actually quite complex, and can be affected by new types of measurement and new technologies to be managed. Buy is definitely preferred over build in this space, but some customizability is always necessary.)

Finally, I'd also like to point out that there is a similar analysis function at the response end as at the measure end. Namely, it is often desirable for the response mechanism to take a composite action request and break it into discrete execution steps. The best example I can think of for this is a "power down" action sent from the SLA analysis environment to a server. Typical power controllers will take such a request, signal to the OS that a shutdown is imminent, whereupon the OS will execute any scripts and actions required before signalling that OS shutdown is complete. At that time, the power controller turns off power to the server.
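The power-down example can be sketched as a simple decomposition table. The step names here are invented for illustration; actual power controllers and OSes have their own protocols for this handshake.

```python
# Hypothetical sketch of response-side decomposition: one composite
# action request from the analysis layer is broken into the discrete,
# ordered steps that would actually be executed. Step names are invented.

def decompose(action):
    """Map a composite action to an ordered list of discrete execution steps."""
    if action == "power_down":
        return [
            "signal_os_shutdown_imminent",   # controller warns the OS
            "run_os_shutdown_scripts",       # OS quiesces its services
            "confirm_os_shutdown_complete",  # OS acknowledges completion
            "cut_power",                     # controller removes power
        ]
    raise ValueError("unknown composite action: " + action)
```

The analysis layer only ever sees "power_down"; the ordering and error handling of the discrete steps stay the responsibility of the response mechanism.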

As with measure, I will use the label "analyze" to reflect future posts expanding on the analysis concept. As always, I welcome your feedback and hope you will join the SLA conversation.

Monday, April 16, 2007

I recently set up Google Alerts to inform me about references to Service Level Automation on the web. There were many articles returned this week, (many of which involved Cassatt), but I found two additional articles of note. Each makes mention of Service Level Automation, and represents the growing interest in this approach.

The first is from the March 2006 issue of ACM Queue, entitled "Under New Management". The article was written by Duncan Johnston-Watt, the founder of Enigmatec. Johnston-Watt does an excellent job of outlining basic issues around one possible architecture for an autonomic data center. As expected for Enigmatec, it's a policy-automation-focused approach, and is, in fact, one of the few articles from a policy engine vendor that I have seen where the term Service Level Automation is used correctly.

Unfortunately, I don't necessarily agree that Johnston-Watt's architecture is optimal for enterprise data centers. (It requires development of process automation flows to "optimize operational processes"--a significant amount of work that is prone to introducing new inefficiencies. It is also agent based, which alters the footprint of the software stacks being run in the data center, and can negatively affect the execution and architecture of the applications being managed.) All in all, though, there is some excellent information here for those thinking about Service Level Automation holistically, across the entire data center.

The other article, entitled "Virtually Speaking: Xen Achieves Higher Enterprise Consciousness", was published April 6, 2007 on ServerWatch. In the last few paragraphs of the article, uXcomm's acquisition of Virtugo is covered. In it, uXcomm claims the combination of their Xen management tools and Virtugo's VMware tools "fills a gap not just in uXcomm's portfolio but in the virtual landscape as well. Until now [...] there was a gap between service-level automation offerings and performance management products."

Hmmm. Not sure how providing SLA for only virtual servers counts as filling gaps...but, even so, I hope uXcomm is aware that everyone in this space realizes the need for resource optimization includes VM performance management. I guess my question would be, what is uXcomm doing about marrying Service Level Automation to the rest of the data center?

As a side note, I know that I owe two more articles on my Service Level Automation Deconstructed series. I am working on the "Analyze" overview now, but have discovered some interesting technology to discuss here that I am reading up on now.

Wednesday, April 11, 2007

I just finished rereading a science book that has been tremendously influential on how I now think of software development, data center management and how people interact in general. Complexity: The Emerging Science at the Edge of Order and Chaos, by M. Mitchell Waldrop, was originally published in 1992, but remains today the quintessential popular tome on the science of complex systems. (Hint: READ THIS BOOK!)

John Holland (as told in Waldrop's history) defined complex systems as having the following traits:

Each complex system is a network of many "agents" acting in parallel

Each complex system has many levels of organization, with agents at any one level serving as the building blocks for agents at a higher level

Complex systems are constantly revising and rearranging their building blocks as they gain experience

All complex adaptive systems anticipate the future (though this anticipation is usually mechanical and not conscious)

Complex adaptive systems have many niches, each of which can be exploited by an agent adapted to fill that niche

Now, I don't know about you, but this sounds like enterprise computing to me. It could be servers, network components, software service networks, supply chain systems, the entire data center, the entire IT operations and organization, etc. What we are all building here is self organizing...we may think we have control, but we are all acting as agents in response to the actions and conditions imposed by all those other agents out there.

A good point about viewing IT as a complex system can be found in Johna Till Johnson's Network World article, "Complexity, crisis and corporate nets". Johna's article articulates a basic concept that I am still struggling to verbalize regarding the current and future evolution of data centers. We are all working hard to adapt to our environments by building architectures, organizations and processes that are resistant to failure. Unfortunately, the entire "ecosystem" is bound to fail from time to time. And there is no way to predict how or when. The best you can do is prepare for the worst.

One of the key reasons that I find Service Level Automation so interesting is that it provides a key "gene" to the increasingly complex IT landscape: the ability to "evolve" and "heal" the physical infrastructure level. Combine this with good, resilient software architectures (e.g. SOA and BPM) and solid feedback loops (e.g. BAM, SNMP, JMX, etc.) and your job as the human "DNA" gets easier. And, as the dynamic and automated nature of these systems gets more sophisticated, our IT environments get more and more self organizing, learning new ways to optimize themselves (often with human help) even as the environment they are adapting to constantly changes.

In the end, I like to think that no matter how many boneheaded decisions corporate IT makes, no matter how many lousy standards or products are introduced to the "ecosystem", the entire system will adjust and continually attempt to correct for our weaknesses. In the end, despite the rise and fall of individual agents (companies, technologies, people, etc.), the system will continually work to serve us better...at least until that unpredictable catastrophic failure tears it all down and we start fresh.

Monday, April 09, 2007

I was analyzing the Google Analytics for my blog recently, and noticed a small but measurable amount of traffic originating from "biske.com". Now, my Analytics are really never very exciting, so seeing an unknown domain explicitly called out was too much to ignore. I followed the domain, and arrived (after seeing a very nice family photo) at Todd Biske's Outside the Box blog.

But why was I getting references to my site from his? I kept looking, and to my delight rested my eyes on my name in his blogroll...WOO HOO!!! The first such link that I know of! By the way, follow some of those other blogroll links...there you will find a huge wealth of knowledge about SOA, BPM and the culture shock the distributed systems community is feeling as a result (with some calm, collected voices providing solid advice). Don't think SOA and SLA are related? You'll find lots of evidence to the contrary among the posts of Todd and his colleagues.

About Me

James Urquhart is a widely experienced enterprise software field technologist. James started his career programming a manufacturing job tracking system on the Macintosh (circa 1991), and slowly expanded his experience to include distributed systems architectures, online community and identity systems, and most recently utility computing and cloud computing architectures. He has held positions in pre and post sales services, software engineering, product marketing, and program management for the online developer communities of one of the largest developer sites in the world. His admittedly schizophrenic background is driven by a desire to work with technologies that are disruptive, but that simplify computing overall.

James is also an avid blogger. His primary blog, recently renamed "The Wisdom of Clouds" (http://blog.jamesurquhart.com), is focused on utility computing, cloud computing and their effect on enterprises and individuals.

In addition to his online work, James is the father of two children: a son, Owen; and a daughter, Emery; and the husband of the perfect friend and wife, Mia. James lives in Alameda, CA, and plays rock and bluegrass guitar.