ITIL, or the Information Technology Infrastructure Library, has fascinated me ever since I started in the industry. Early in my career I heard a colleague say that "ITIL is the holy grail of structured IT management," and from that day forward my interest was piqued. At the time the term held a sort of mystical quality. It conjured images of an IT utopia where everything functioned like a well-oiled machine, services ran themselves, and large service delivery teams were a thing of the past. Perceptions can be very misleading to the untrained mind.

Today, more than 20 years after the birth of ITIL, most Fortune 500 companies, industry leaders, and large IT shops have implemented some form of ITIL, some very successfully, some not. The most elusive part of ITIL for some organizations is that it is designed as a framework to be implemented according to the relevant details of the business in question. Very rarely do you see corporations, even competitors in the same industry, implement it in the same way. In addition, ITIL is frequently viewed as an end state rather than a consistent change in approach or a foundational change in the culture of service management. This tends to complicate matters and is exacerbated by a lack of understanding at various employee and management levels, including those responsible for setting the tone and driving policy.

The fact that ITIL is in its third version and that so many organizations have adopted it has done little to demystify it for the organizations that have not, even though it has been around in some form since the 1980s. ITIL is moving right along into its third decade.

ITIL has deep roots extending into the foundations of Six Sigma and TQM, and interestingly enough was influenced in many areas by Dr. W. Edwards Deming (you should recognize the Deming cycle), who contributed profoundly not only to TQM and Six Sigma, but to the methods and ultimate success of the post-WWII restructuring of Japan and its industrial complex. I'm sure Deming never imagined in the 1950s how far his ideas on quality and continual improvement would reach into the information age and beyond.

There is an innumerable body of articles floating around citing the reasons organizations fail to implement ITIL. The reasons are as diverse as the businesses that try to implement the framework. This is not surprising to me, considering the sheer number of practitioners out there who view ITIL as a religion, a stark contrast against the backdrop of execs and managers who have only a vague view of ITIL and what it means to their company.

In this series we are going to cover ITIL in detail from several different perspectives. We are going to look at the failures, the successes, and what makes ITIL the most profound and confusing framework for IT service management today.

Over the years I have encountered many services that relied on monitoring systems that were monolithic or leaned too heavily on a single approach. Take, for example, synthetic transactions, where a client somewhere submits a transaction to a generic API in a complex multi-tiered system. In one instance, the team responsible for running the service, armed with only server and synthetic transaction monitoring, was convinced without a doubt (they had already drunk from the 5-gallon bucket of Kool-Aid) that their high availability numbers meant the service was running great. The service in question was just converting to an ITIL/MOF process, and the team was perplexed by the high number of recurring incidents. They were further flummoxed by incidents that did not show up in monitoring and alarming, but instead came from end users, developers, and execs at the company.

From an outside perspective it's relatively intuitive to see what was happening. They had huge holes in the monitoring infrastructure they relied on for an internal picture of service health. On top of that, running an incident management team without fully integrating a problem management flow into your process is like trying to bail out a boat with a one-foot hole in its bottom using a thimble. Monitoring, incident management, and continual service improvement have to be performed together like a trio of finely tuned instruments. If you can't coordinate these three instruments as a conductor, you are doomed to failure. Without getting too deep into this trio, I wanted to touch on how important they are to each other. The rest of this article will focus on holistic monitoring; I will leave ITIL/MOF, incident management, and problem management for a future series.

Synthetic Transactions

First and foremost, you must take a broad approach to monitoring. Synthetic transactions are great for getting a rough picture, but they are a sample in time; they do not represent the actual quality of service users are experiencing, nor do they reflect the actual state of the service as a whole. By nature, synthetic transactions will not hit every server in every tier of the system, thanks to the complexity of top-tier load balancing, middle-tier hashing and load balancing, and back-tier partitioning and replica state.
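To make the idea concrete, here is a minimal sketch of a synthetic transaction probe. The endpoint URL and latency budget are hypothetical placeholders, not part of any real service; the point is that a single probe reduces to one coarse verdict, which is exactly why it can miss so much.

```python
import time
import urllib.request
import urllib.error

# Hypothetical endpoint and latency budget; substitute your own.
PROBE_URL = "https://api.example.com/v1/health"
LATENCY_SLO_MS = 500

def classify_probe(status_code, latency_ms, slo_ms=LATENCY_SLO_MS):
    """Reduce one synthetic transaction to a coarse health verdict."""
    if status_code is None or status_code >= 500:
        return "FAIL"
    if latency_ms > slo_ms:
        return "DEGRADED"
    return "OK"

def run_probe(url=PROBE_URL, timeout=5.0):
    """Submit one synthetic transaction and time it end to end."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        status = e.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        status = None    # no answer at all
    latency_ms = (time.monotonic() - start) * 1000.0
    return classify_probe(status, latency_ms), latency_ms
```

Note that one probe exercises one path through one load-balanced stack at one moment; a green result says nothing about the replica or partition the probe happened not to touch.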

Organic Transaction Monitoring (True QoS)

To broaden monitoring effectiveness in a multi-tiered system, organic transaction monitoring and an associated BI infrastructure should be used alongside synthetic transaction monitoring. Within the system, the quality of every single transaction should be collected, aggregated, and loaded into a BI warehouse for near-real-time reporting. This type of monitoring is light years beyond relying solely on synthetic data and will give you true insight into how individual customers, partners, and consumers of your service perceive the quality of delivery. Knowing at every level of the system how long a transaction took, and what its result was, is a huge advantage. You can visually see patterns in the data that tell you more about your service in real time than weeks of testing would. You can SEE traffic patterns. You have now transformed yourself from the guy who watches red and green lights on a screen into the Systems Engineering equivalent of Cypher sitting at the console, watching the Matrix code and deciphering blondes, brunettes, and redheads…

Server HW Monitoring

No approach to monitoring is complete without an effective toolset to monitor the health of the servers you run your service on; preventing impact to the service is another important component of monitoring. A great example is a SQL farm representing the backend of a major global-scale service with insufficient server / preventative monitoring. When you engineer redundancy into your services, you sometimes take that safety net for granted. Take the case of a SQL farm with redundant partitioned data stores and multiple RAID 10 arrays on each server. Bad drive on one side of a RAID 10 array? No problem: you have redundancy, the server and service continue to hum along, and your availability is wonderful. But if you are not alerting on the dirty bit / drive failure, and you are relying on datacenter staff to identify it via a walkthrough, you are sitting on an availability time bomb. The MTBF rating on a drive is not a bulletproof vest for your service.

I can't tell you how many times I have seen a multi-drive failure in a RAID 10 array (on both stripes) because there was no alarm and the DC folks didn't see the red light on the drive and replace / rebuild. The larger the service, the more prone you are to errors like this. With targeted and carefully thought-out server monitoring and alarming you can avoid this pitfall. This is a single example, but there are many others the server monitoring tool in your bag should solve. Make sure you have monitoring cases for all your hardware. Alerting on specific events is a great strategy to start with: have an event catalog for the HW platform and OS you are running on and use it as the basis for this monitoring and alarming.
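The event-catalog approach can be sketched in a few lines. The event IDs below are invented for illustration; the real ones come from your hardware vendor's and OS's event catalogs.

```python
# Hypothetical event IDs; pull the real ones from your HW/OS event catalog.
ALERT_EVENTS = {
    "RAID_MEMBER_FAILED": "critical",      # array degraded: one side of a stripe lost
    "RAID_ARRAY_DEGRADED": "critical",
    "DISK_PREDICTIVE_FAILURE": "warning",  # SMART-style predictive failure
}

def triage_hw_events(events):
    """Match raw hardware events against the alert catalog.

    events: iterable of dicts like {"host": ..., "event_id": ...}.
    Returns the alerts that should page someone immediately, instead
    of waiting for a datacenter walkthrough to spot a red drive light.
    """
    alerts = []
    for ev in events:
        severity = ALERT_EVENTS.get(ev["event_id"])
        if severity:
            alerts.append({"host": ev["host"],
                           "event_id": ev["event_id"],
                           "severity": severity})
    return alerts
```

The design point is that the catalog is explicit and reviewed, so a degraded RAID 10 array raises a page the moment the first stripe member fails, not after the second one does.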

We are done! Our service now has synthetic transaction monitoring, organic QoS monitoring with transactional BI, and server monitoring, and our availability looks fantastic.

You are not done; you are about halfway there. WHAT?! you say. "Justin, what are you talking about? We are doing great! Availability is at an all-time high, incidents are down dramatically, and the exec team is throwing a party for us!" You are definitely not done… and at the risk of depressing you, you will never be 100% done with monitoring if you are using a continual service improvement methodology. Let's continue.

Capacity & Bottleneck Monitoring (Performance Monitoring)

As the feature team innovates and releases code at breakneck speed to keep up with the competition, little accurate consideration is given to how new features change the performance profile of the service. Even saying the words "performance profile of the service" is an immense simplification of the gravity of this topic. In actuality, each tier and each component of your service has to be considered individually and evaluated in isolation. Capacity and performance bottlenecks will be vastly different for each component, and hitting a bottleneck will have far-reaching effects throughout your system. The bottleneck on your web tier could be CPU; on your API or cache tier it could be memory; on your SQL/file tier it could be controller throughput, disk I/O, memory, CPU, network, or disk space. In fact, you could have almost any bottleneck at any tier or component type. What makes this worse is that you need to understand the primary and secondary bottlenecks for each individual component in order to plan a monitoring scheme that alerts you to approaching bottlenecks.

A classic case: a production site on V1 of the service has plenty of headroom in each tier and component type from the standpoint of bottlenecks. V2 gets deployed, and you notice that bug fixes to existing features, changes to existing features, and new-feature adoption by the user base are now eating up 35% more CPU in the API tier; the cache hit ratio has dropped on the cache tier, which in turn is driving IOPS and disk queue length up on the SQL/file tier, getting you dangerously close to your alert threshold, or worse, an actual bottleneck. Surely you have a test team that does benchmarking, performance testing, breakpoint / bottleneck testing, failover and recovery, and so on.

Too often, service operators rely on QA results and "performance certifications" and fail to consider that there are far too many variables in test environments to know with confidence what will happen when the bits are deployed. To be confident in QA results in these areas you need stellar parity in data distribution, server ratios, load profiles, configuration, isolation, and scenarios. Even if you have all of this, the results will almost never be close, because the new feature sets the test team is simulating load for have not yet been adopted by a production user base; the load run in test for new features is basically a guess. As the production operator or engineering team responsible for the production site, you need to rely on your own assessment and monitoring here.

Some rules to live by for capacity and bottlenecks

Rule 1 – You must understand the primary and secondary bottlenecks for each service tier and component type

Rule 2 – Ensure you have sufficient monitoring in place to identify trends that will ultimately lead to a breach of those maximums

Rule 3 – Never deploy (anything – hotfixes, releases, patches, etc.) broadly without piloting and seeing the ACTUAL impact to the service as a whole (each component, each tier)

Rule 4 – Never let a bottleneck occur; your capacity planning and service expansion should run far enough in advance to avoid violations
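Rule 2 is the one teams most often skip, so here is a minimal sketch of trend-based breach projection: fit a straight line to recent utilization samples and project when it crosses the known maximum. The sample shape and the use of a simple least-squares line are assumptions; real capacity data is noisier and often seasonal, so treat this as the idea, not the implementation.

```python
def days_until_breach(samples, limit):
    """Fit a least-squares line to (day, utilization) samples and
    project when it crosses `limit`.

    Returns days from the last sample until the projected breach, or
    None if the trend is flat or improving (no breach projected).
    """
    n = len(samples)
    xs = [s[0] for s in samples]
    ys = [s[1] for s in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    if denom == 0:
        return None  # need at least two distinct sample times
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / denom
    if slope <= 0:
        return None  # flat or decreasing: no projected breach
    intercept = mean_y - slope * mean_x
    breach_day = (limit - intercept) / slope
    return max(breach_day - xs[-1], 0.0)
```

Alerting on "projected breach in under N days" instead of "limit exceeded" is what turns capacity monitoring from a post-mortem tool into a planning tool, which is the whole point of Rule 4.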

Up to this point we have talked about monitoring methods that sit geographically near the system; everything we have considered thus far is judged from the point of being physically at or near it. Things get more complicated when we ask what happens outside our sphere of influence, beyond the parts of the system under our control. What happens when our system is geo-distributed across several US datacenters and we start getting reports of availability issues from users in Canada, China, India, or anywhere else in the world? Our initial reaction is, "Surely you can't expect me to monitor and control the internet globally?" While we certainly can't control it, we do have ways to monitor and influence it. If a large number of users in Canada are paying for your service and at times cannot reach it reliably due to any number of network issues between you and them, do you think they will digest the technical significance of a flapping route or a routing loop in a poorly peered transit network? They will not; they will blame your service for the hit to availability and reliability.

With this in mind, it's critical to understand where your client base and partner base are and to monitor the actual experience those customers have from their regions. If you find recurring problems or patterns in specific geographic areas, it may make sense to work directly with transit networks, ISPs, and backbone providers to alleviate the issues where possible. Place remote monitoring in key areas according to the distribution of your customers, so you can see what they see and solve the problems. If your application has a client that customers use, consider instrumenting it to report end-to-end QoS back to you. If your application is web-based, synthetic transactions from regional monitoring nodes may be the best option.

Network Monitoring

Many teams have separate networking organizations that don't interact tightly with the teams responsible for operating the service. In addition, many global-scale services have taken the time to build networks with little to no SPOFs: it's hard to think of a global-scale service that does not have redundant L2 and L3, redundant load balancing, an HSRP implementation, and strong redundant peering at the network core. Even so, it's critical to expose network monitoring results and events to the team responsible for the service. Without this information, the service delivery team is blind to network risks and events that could cause a hit to availability. If a top-of-rack switch is throwing errors and could take out an entire rack of API servers, or worse, SQL servers, the service delivery team could mitigate the risk by shunting traffic away from the VIPs serving that rack, or by failing the SQL servers over to replicas in a different rack, colocation, or even datacenter. Network monitoring design, implementation, and alarms should be exposed to the service operations team so it can mitigate risk.
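As a sketch of what "exposing network events to the service team" can mean in practice, the snippet below takes per-rack switch counter samples (shape and threshold invented for illustration) and flags racks whose error rate warrants pre-emptive traffic shunting or SQL failover.

```python
def racks_at_risk(switch_samples, error_rate_limit=0.001):
    """Flag racks whose top-of-rack switch is throwing errors, so the
    service team can shunt traffic or fail over replicas before the
    switch fails outright.

    switch_samples: iterable of dicts like
      {"rack": "r12", "rx_errors": 120, "rx_packets": 1_000_000}
    Returns a list of (rack, error_rate) for racks over the limit.
    """
    risky = []
    for s in switch_samples:
        if s["rx_packets"] == 0:
            continue  # idle interface: no rate to compute
        rate = s["rx_errors"] / s["rx_packets"]
        if rate > error_rate_limit:
            risky.append((s["rack"], rate))
    return risky
```

The counters themselves would come from the networking team's tooling; the point is that the service operations team consumes them and acts, rather than learning about the switch after the rack goes dark.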

In closing

Any monitoring strategy should take a multipronged approach to cover as much of the surface area of the service as is possible and economical. Ultimately, the right monitoring approach for your service will be based on what your service delivers, your availability goals, and your customers' expectations. Your system may have many other considerations and areas not covered in this discussion where you require special-case monitoring to ensure you have an accurate picture of your service as a whole. In fact, we did not cover:

• Version monitoring

• Configuration monitoring

• Change monitoring

• Environmental monitoring

• Backup monitoring

• Security monitoring

as well as a slew of other monitoring topics. If there is sufficient interest, I will likely cover these topics in a follow-up. If you have questions on methods or on implementing a specific type of monitoring, feel free to ask; I'm happy to help where I can.


I have spent the better part of the last two decades at Microsoft, fulfilling a variety of technical and management roles in IT, R&D, and Systems Engineering. Over the last 10 years I have gained extensive experience working on large distributed SQL and multi-tiered systems with a global-scale user base in the hundreds of millions, from both Test/QA and System & Service Engineering perspectives, and through both technical and project management lenses.

I want to share the experience I've had at Microsoft and provide insight into key tech topics that are critical components of our industry today. Data is exploding globally, and the information revolution is experiencing exponential growth. At such an exciting time of growth and change, there is a lot to talk about. From my vantage point I see not only exciting change, but monumental missteps and failure. Everything and everyone is fair game!

I look forward to writing over the coming months and hope you enjoy the content.