Category Archives: Uptime Institute


This week the Data Center industry got the terrible news it knew might be coming for some time: Ken Brill, founder of the Uptime Institute, had passed away. Many of us knew that Ken had been ill for quite a while and, although it may sound silly, were hoping he could somehow pull through. Even as ill as he was, Ken was still sending and receiving emails and staying in touch with the industry that, quite frankly, he helped give birth to.

I was recently asked about Ken and his legacy for a Computerworld article, and it really caused me to stop and rethink his overall legacy and gift to the rest of us in the industry. Ken Brill was a pioneering, courageous, tenacious visionary who, through his own force of will, saw the inefficiencies in a nascent industry and helped craft it into what it is today.

Throughout his early career Ken saw the absolute siloing of information, best practices, and approaches that different enterprises were developing around managing their mission-critical IT spaces. While certainly not alone in the effort, he became the strongest voice and champion to break down those walls, help others through the process, and build a network of people who would share these ideas with each other. Before long an industry was born, sewn together through his sometimes delicate, sometimes not-so-delicate cajoling and, through it all, his absolute passion for the Data Center industry at large.

In that effort he also created and propagated the language the industry now uses as commonplace. Seeing a huge gap in how people communicated and compared mission-critical capabilities, he became the herald of the Tiering system, which essentially normalized those conversations across the Data Center industry. While some (including myself) have come to think it is time to redefine how we classify our mission-critical spaces, we all have to pay homage to the fact that Ken's insistence and drive for the Tiering system created a place and a platform to even have such conversations.

One of Ken's greatest strengths was his adaptability. For example, Ken and I did not always agree. I remember an Uptime Fellows meeting in Arizona back in 2005 or 2006. In that meeting I started talking about the benefits of modularization and reduced infrastructure requirements augmented by better software. Ken was incredulous, and we had significant conversations around the feasibility of such an approach. At another meeting we discussed the relative importance (or non-importance) of a new organization called The Green Grid and whether Uptime should closely align itself with those efforts. Through it all Ken was ultimately adaptable. Whether it was giving those ideas light for conversation among the rest of the Uptime community via audio blogs or other means, Ken was there to have a conversation.

In an industry where complacency has become commonplace, where people rarely question established norms, it was always comforting to know that Ken was there acting the firebrand, causing the conversation to happen. This week we lost one of the ‘Great Ones’ and I for one will truly miss him. To his family my deepest sympathies, to our industry I ask, “Who will take his place?”

This week my teams have descended upon the Uptime Institute Symposium in Santa Clara. The moment is extremely bittersweet for me, as this is the first Symposium in quite some time that I have been unable to attend. With my responsibilities at AOL expanding beginning this week, there was simply too much going on for me to make the trip out. It's a downright shame, too. Why?

We (AOL) will be featured in two key parts of Symposium this time around for some incredibly groundbreaking work happening at the company. The first is a recap of the incredible work going on in the development of our own cloud platforms. Last year, you may recall, we were asked to talk about some of the wins and achievements we accomplished with the development of our cloud platform. The session was extremely well received, and we were asked to come back, one year on, to discuss how that work has progressed even further. Aaron Lake, the primary developer of our cloud platforms on my Infrastructure Development Team, will be talking about the continued success, new features and functionality, and the launch of our ATC Cloud-Only Data Center. It's been an incredible breakneck pace for Aaron and his team, and they have delivered world-class capabilities for us internally.

Much of Aaron's work has also enabled us to win the Uptime Institute's first annual Server Roundup Award. I am especially proud of this particular honor, as it is the result of an amazing amount of hard work within the organization on a problem faced by companies all over the planet. Essentially this is operations hygiene at huge scale: getting rid of old servers, driving consolidation, moving platforms to our cloud environments, and more. This talk will be led by Julie Edwards, our Director of Business Operations, and Christy Abramson, our Director of Service Management. Together these two teams led the effort to drive our "Operational Absurdities" and "Power Hog" programs. We have also sent along Lee Ann Macerelli and Rachel Paiva, the primary project managers instrumental in making this initiative such a huge success. These "Cowgirls" drove an insane amount of work across the company, resulting in over 5 million dollars of unforecasted operational savings and proving that there is always room for good operational practices. They even starred in a funny internal video to celebrate their win, which can be found here using the AOL Studio Now service.

If you happen to be attending Symposium this year feel free to stop by and say hello to these amazing individuals. I am incredibly proud of the work that they have driven within the company.

I have received a ton of emails after my post about our Uptime Institute Server Roundup Award asking me about our "Power Hog" award. In case you missed it, part of our internal analysis was going through and identifying inefficient servers and systems, and we motivated the owners of those systems to migrate their installations to the cloud infrastructure we built out. You definitely knew you were in trouble when a Power Hog award arrived on your desk. I guess we were not above using shame as a tactic. So for those of you who were interested in seeing our illustrious(?) award, I thought I would share a photo of one.

Today the Uptime Institute announced that AOL won the Server Roundup Award. The achievement has already gotten some press (at Computerworld, PCWorld, and related sites), and I cannot begin to tell you how proud I am of my teams. One of the more personal transitions and journeys I have made since my experience scaling the Microsoft environments from tens of thousands of servers to hundreds of thousands of servers has been truly understanding the complexity of a problem most larger, established IT departments have been dealing with for years. In some respects scaling infrastructure, while incredibly challenging and hard, is in large part a uni-directional problem space. You are faced with growth, more growth, and then even more growth. All sorts of interesting things break when you get to big scale. Processes, methodologies, and technologies all quickly fall by the wayside as you climb ever up the ladder of scale.

At AOL I faced a multi-directional problem space: as a company and as a technology platform we were still growing, but added to that were 27 years of what I call "Cruft". I define "Cruft" as years of built-up technology, processes, politics, fiscal orphaning, and poor operational hygiene. This cruft can act as a huge boat anchor and a barrier to an organization driving agility in its online and IT operations. On top of this cruft, a layer of what can best be described as lethargy, or perhaps apathy, can develop and add even more difficulty to the problem space.

One of the first things I encountered at AOL was the cruft. In any organization, everyone always wants to work on the new, cool, interesting things, mainly because they are new and interesting, out of the norm. Essentially the fun stuff! But the organization's ability to really drive the adoption of new technologies and methods was always slowed, gated, or in some cases altogether prevented by years of interconnected systems, lost owners, servers of unknown purpose lost to distant historical memory, and the like. This I found in healthy populations at AOL.

We initially set about building a plan to attack this cruft: to earnestly remove as much of it as possible and drive the organization toward agility. Initially we called this list of properties, servers, equipment, and the like the Operations $/-\!+ list. As this name was not very user-friendly, it migrated into a series of initiatives grouped under the name Ops-Surdities. These programs attacked different types of cruft and, at a high level, fell into three main categories:

The Absurdity List – A list of projects, properties, and applications that had questionable value, no owner, no direction, or the like, but were still drawing load and resources from our data centers. The plan here was to develop action plans for each item that appeared on this list.

Power Hog – An effort to audit our data center facilities, equipment, and the like, looking for inefficient servers, installations, and/or technology and migrating them to newer, more efficient platforms or our AOL Cloud infrastructure. You knew you were in trouble when a bronze pig trophy appeared on your desk or in your office: you were marked.

Ops Hygiene – The sometimes tedious task of tracking down older machines and systems that may have been decommissioned in the past, marked for removal, or fully depreciated but never truly removed. Pure vampiric load. You may or may not be surprised how much of this exists in modern data centers; it's a common issue I have discussed with most data center management professionals in the industry.
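To make the Ops Hygiene idea concrete, here is a minimal sketch of flagging candidate vampiric load by cross-referencing last-login dates with average utilization. The hostnames, record fields, and thresholds are hypothetical; our actual effort was driven by our CMDB and asset systems, not a script like this.

```python
from datetime import datetime, timedelta

# Hypothetical inventory records; in practice these would be exported
# from a CMDB or asset-management system.
inventory = [
    {"host": "web-legacy-01", "last_login": "2009-03-12", "avg_cpu_pct": 1.2},
    {"host": "db-prod-04",    "last_login": "2012-04-01", "avg_cpu_pct": 61.0},
    {"host": "batch-old-07",  "last_login": "2008-11-30", "avg_cpu_pct": 0.4},
]

def flag_vampiric(records, today, idle_days=365, cpu_threshold=5.0):
    """Flag hosts with no recent logins and near-zero utilization."""
    flagged = []
    for r in records:
        last = datetime.strptime(r["last_login"], "%Y-%m-%d")
        idle = (today - last) > timedelta(days=idle_days)
        if idle and r["avg_cpu_pct"] < cpu_threshold:
            flagged.append(r["host"])
    return flagged

# Candidates for a decommissioning review, not automatic shutdown.
print(flag_vampiric(inventory, datetime(2012, 5, 1)))
# → ['web-legacy-01', 'batch-old-07']
```

The output is a review list, because as noted below, no amount of technology replaces the human legwork of confirming what is still required.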

So here we are, on a timeline measured in under a year, having been told all along the way by "crufty old-timers" that we would never make any progress, and my teams have decommissioned almost 10,000 servers from our environments. (The number is actually greater now, but the submission deadline for the award was earlier in the year.) What an amazing accomplishment. What an amazing team!

So how did we do it?

As we will be presenting this in a lot more detail at the Uptime Symposium, I am not going to give away all of our secrets in a blog post. That gives you a good reason to head to the Uptime event and ask the primary leaders of this effort, in person, how they did it. It may be a good use of that travel budget your company has been sitting on this year.

What I will share is some guidelines on approach and some things to be wary of if you are facing similar challenges in your organization.

FOCUS AND ATTENTION

I cannot tell you how many people I have spoken with who have tried to go after cruft like this time and time again and failed. One of the key drivers of success, in my mind, is ensuring that there is focus and attention on this kind of project at all levels, across all organizations, and most importantly from the TOP. Too often executives give out blind directives with little to no follow-through and assume this kind of thing gets done. They are generally unaware of the natural resistance to this kind of work in most IT organizations. Having motivated, engaged, and focused leadership on these types of efforts goes an extraordinarily long way toward making headway here.

BEWARE of ORGANIZATIONAL APATHY

The human factors that stack up against a project like this are impressive. While people may not be openly in revolt over such projects, there is a natural resistance to getting things done. This work is not sexy. This work is hard. This work is tedious. It likely means going back and touching equipment and kit that has not been messed with for a long time. You may have competing organizational priorities that place this kind of work at the bottom of the workload priority list. In addition to having executive buy-in and focus, make sure you have some really driven people running these programs. You are looking for CAN DO people, not MAKE DO people.

TECHNOLOGY CAN HELP, BUT IT'S NOT YOUR HEAVY LIFTER

Probably a bit strange for a technology blog to say, but it's true. We have an incredible CMDB and asset system at AOL, and it was hugely helpful in really getting to the bottom of the list. However, no amount of technology will perform the myriad tasks required to actually make material movement on this kind of work. Some of it requires negotiation, some of it requires strength of will, and some of it takes pure persistence in running these issues down: working with the people, understanding what is still required and what can be moved. This requires people. We had great technologies in place for knowing where our stuff was, what it did, and what it was connected to. We had great technologies, like our Cloud, to ultimately move some of these platforms to. However, you need to make sure you don't go too far down the people path either. I have a saying in my organization: there is a perfect number of project managers and security people in any organization, the number at which work output and value delivered are highest. What is that number? It depends, but you definitely know when you have one too many of each.

MAKE IT FUN IF YOU CAN

From the brass pigs to minor celebrations each month as we worked through the process, we ensured that the attention given the effort was not negative. Sure, it can be tough work, but at the end of the day you are substantially investing in the overall agility of your organization. It's something to be celebrated. In fact, at the completion of our aggressive goals, the primary project leads involved made a great video (which you can see here) to highlight and celebrate the win. Everyone had a great laugh and a ton of fun doing what was ultimately a tough grind of work. If you are headed to Symposium, I strongly encourage you to reach out to my incredible project leads. You will be able to recognize them from the video... without the mustaches, of course!

Last week I once again had the pleasure of speaking at the Uptime Institute's Symposium. As one of the premier events in the Data Center industry, it is definitely a must-attend conference for a view into what's new, what's changing, and where we are going as an industry. Having attended the event numerous times in the past, this year I set out on my adventure with a slightly different agenda.

Oh sure, I would definitely attend the various sessions on technology, process, and approach. But this time I was also going with the intent to listen equally to the presenters and to the scuttlebutt, side conversations, and hushed whispers of the attendees. Think of it as a cultural experiment in being a professional busybody. As I wove my way from session to session, I grew increasingly anxious that while the topics were of great quality and discussed much-needed areas of improvement in our technology sector, most of them were issues we have covered, talked about, and been dealing with as an industry for many years. In fact, I was hard-pressed to find anything of real significance in the "new" category. These thoughts were mirrored in those side conversations and hushed whispers I heard around the various rooms as well.

One of the new features of Symposium is that the 451 Group has opted to expand the scope of the event to be more far-reaching, covering all aspects of the issues facing our industry. It has brought in speakers from Tier 1 Research and other groups who have added incredible depth to the conference. With that depth came some really good data. In many respects the data reflected (in my interpretation) that while technology and processes are improving in small pockets, our industry ranges from slow to act to outright stagnant. Despite mountains of data showing energy-efficiency benefits, the resulting cost benefits, and the like, we are just not moving the proverbial ball down the field.

In a purely unscientific poll, I was astounded to find that some of the most popular sessions were directly related to folks who have actually done something. Those who took new technologies (or old technologies) and put them into practice were roundly more interesting than the more generic technology conversations, giving very specific attention to how they accomplished the tasks at hand, what they learned, and what they would do differently. Most of these "favorites" were not in the topics of "bleeding edge" thought leadership but specifically in the implementation of technologies and approaches we have talked about at the event for many years. If I am honest, one of the sessions that surprised me the most was our own. AOL had the honor of winning an IT Innovation Award from Uptime, and as a result the teams responsible for driving our cloud and virtualization platforms were invited to give a talk about what we did, what the impact was, and how it all worked out. I was surprised because I was not sure how many people would come to this side session or find the presentation relevant. Of course I thought it was relevant (we were, after all, going to get a nifty plaque for the achievement), but to my surprise the room was packed full, ran out of chairs, and had numerous people standing for the presentation. During the talk we had a good round of questions from the audience, and afterwards we were inundated with people coming up to dig into more details. We had many comments on the usefulness of the talk because we were sharing real-life experience making the kinds of changes that we as an industry have been talking about for years. Our talk and adoption of technology even got a little coverage in the industry press, such as Data Center Dynamics.

Another session that got incredible reviews was the presentation by Andrew Stokes of Deutsche Bank, who guided the audience through their adoption of a 100% free-air-cooled data center in the middle of New York City. Again, the technology here was not new (I had built large-scale facilities using it in 2007), but Andrew and the folks at Deutsche Bank actually went out and did something. Not someone building large-scale cloud facilities, not some new experimental type of server infrastructure: someone servicing the IT equipment that everyone uses, in a fairly standard facility, who actually went ahead and did something innovative. They put into practice something that others have not. Backed by facts, data, and real-life experience, the presentation went over incredibly well and was roundly applauded by those I spoke with as one of the most eye-opening presentations of the event.

By listening to the audiences, the hallway conversations, and the multitude of networking opportunities throughout the event, a pattern started to emerge, a pattern that reinforced the belief I was already forming: despite a myriad of talks on very cool technology, applications, and evolving thought-leadership innovations, the most popular and most impactful sessions centered on those folks who actually did something, not with the new bleeding-edge technologies, but with those recurring themes that have carried from Symposium to Symposium over the years. Air-side economization? Not new. Someone (outside Google, Microsoft, Yahoo, etc.) doing it? Very new, very exciting. It is what I am calling the Innovation of ACTION: actually doing those things we have talked about for so long.

While this Innovation of Action had many people buzzing at the conference, there was still a healthy population of people downplaying those technologies, downplaying their own ability to do those things, and restating the perennial dogmatic chant that these types of things (essentially any new idea post-2001, in my mind) would never work for their companies.

This got me thinking (and a little upset) about our industry. If you listen to those general complaints and combine them with the data showing we have been mostly stagnant in adopting these new technologies, we really only have ourselves to blame. There is a pervasive defeatist attitude among a large population of our industry who view anything new with suspicion or surround it with the fear that it will ultimately take their jobs away, even when the technologies or "new things" aren't very new anymore. This phenomenon is clearly visible in any conversation around "the Cloud" and its impact on our industry. The data center professional should be front and center in any conversation on this topic but more often than not self-selects out of it, viewing it as more of an application thing, or more of an IT thing than a data center thing. Which is of course complete bunk. Listening to attendees complain that the Cloud is going to take their jobs away, or that only big companies like Google, Amazon, Rackspace, or Microsoft would ever need them in the future, was driving me mad. As my keynote at Uptime was to be centered around a Cloud survival guide, I had to change my presentation to account for what I was hearing at the conference.

In my talk I tried to focus on what I felt were the emerging camps at the conference. For the first, I put up a slide prominently featuring Eeyore (of Winnie the Pooh fame) and captured many of the quotes I had heard at the conference about how the Cloud and new technologies were something to be mistrusted rather than an opportunity to help drive the conversation. I then stated that we as an industry were an industry of donkeys, a characterization that seems to be backed up by the data. I have to admit, I was a bit nervous calling a room full of perhaps the most dedicated professionals in our industry a bunch of donkeys, but I always call it like I see it.

I contrasted this with those willing to evolve their thinking and embrace the Innovation of Action, highlighting the Cloud example of Netflix. When Netflix moved heavily into the cloud, they clearly wanted to evolve past the normal IT environment and build real resiliency into their product. They did so by creating a rogue process (on purpose) called the Chaos Monkey, which randomly shut down processes and wreaked havoc in their environment. At first the Chaos Monkey was painful, but as they architected around those impacts, their environments got stronger. This was no ordinary IT environment. This was something similar, but new. The Chaos Monkey creates action, results in action, and on the whole moves the ball forward.
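The Chaos Monkey idea can be sketched in a few lines. This is a toy illustration only: Netflix's real tool terminates cloud instances, and the service names and replica counts below are invented. The point is simply that random failure, injected on purpose, tests whether an architecture keeps serving.

```python
import random

random.seed(7)  # deterministic demo runs

# Hypothetical fleet: service name -> number of healthy replicas.
services = {"frontend": 3, "api": 3, "billing": 2}

def chaos_monkey(fleet):
    """Randomly kill one replica of one service, as a real failure would."""
    victim = random.choice(sorted(fleet))
    fleet[victim] -= 1
    return victim

def still_serving(fleet, min_replicas=1):
    """A resilient architecture keeps serving while every service has replicas."""
    return all(n >= min_replicas for n in fleet.values())

# Inject three random failures and check whether the fleet survives each one.
for _ in range(3):
    killed = chaos_monkey(services)
    print(f"killed a replica of {killed}; still serving: {still_serving(services)}")
```

A fleet with a single replica of anything fails this test immediately, which is exactly the kind of weakness the exercise is designed to expose before a real outage does.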

Interestingly, after my talk I literally had dozens of people come up and admit they had been donkeys, offering to reconnect next year to demonstrate what they had done to evolve their operations.

My challenge to the audience at Uptime, and ultimately my challenge to you, the industry, is to stop being donkeys. Let's embrace the Innovation of Action and evolve into our own versions of Chaos Monkeys. Let's do more to put the technologies and approaches we have talked about for so long into action. Next year at Uptime (and across a host of other conferences), let's highlight those things that we are doing. Let's put our Chaos Monkeys on display.

As you contemplate your own job – whether IT or Data Center professional….Are you a Donkey or Chaos Monkey?

This May, I once again have the distinct honor of presenting at the Uptime Institute’s Symposium. This year it will be held in Santa Clara, CA from May 9 through the 12th. This year my primary topic is entitled ‘Preparing for the Cloud: A Data Center Survival Guide.’ I am really looking forward to this presentation on two fronts.

First, it will allow me to share some of the challenges, observations, and opportunities I have seen over the last few years and package them up for data center operators and IT professionals in a way that is truly relevant to preparing for the impact on their production environments. The whole "cloud" industry is now rife with competing definitions, confusing marketing, and a broad spectrum of products and services meant to cure all ills. To your organization's business leaders, the cloud means lower costs, quicker time to market, and an opportunity to streamline IT operations and reduce or eliminate the need for home-run data center environments. But what is the true impact on the operational environments? What plans do you need to have in place to ensure this kind of move can be successful? Is your organization even ready to make this kind of move? Is the nature of your applications and environments "Cloud-Ready"? There are some very significant things to keep in mind when looking into this approach, and many companies have not thought them all through. My hope is that this talk will help prepare professionals with the necessary background and questions to ensure they are armed with the correct information to be an asset to the conversation within their organizations.

The second front is to really dig into the types of services available in the market and how to build an internal scorecard to ensure that your organization is approaching the analysis as a true apples-to-apples comparison. So often I have heard horror stories of companies caught up in the buzz of the Cloud and pursuing devastating cloud strategies that end up far more expensive than what they had to begin with. The cloud can be a powerful tool and approach to serve the business, but you definitely need to go in with both eyes wide open.

I will try to post some material in the weeks ahead of the event to set the stage for the talk. As always, if you are planning on attending Symposium this year, feel free to reach out to me if you see me walking the halls.

Are you a data center professional who doubts that carbon legislation is going to happen, or who thinks this initiative will never get off the ground? This afternoon President Obama plans to outline his intention to assess a cost for carbon consumption at a conference highlighting his economic accomplishments to date. The backdrop to this, of course, is the massive oil rig disaster in the Gulf.

As my talk at the Uptime Institute Symposium highlighted, this type of legislation will have a big impact on data center and mission-critical professionals. Whether you know it or not, you will be front and center in assisting with the response, collection, and reporting required to react to this kind of potential legislation. When I questioned the audience in attendance during my talk, it was quite clear that most of those in the room were vastly ill-prepared and ill-equipped for this kind of effort.

If passed, this type of legislation is going to cause a severe reaction inside organizations to ensure that they are in compliance, and it will likely lead to a huge increase in spending on collecting and reporting energy information. For many organizations this will be a significant expense.

The US House of Representatives has already passed a version of this, known as the Waxman-Markey bill. You can bet there will be a huge amount of pressure to get a Senate version passed and out the door in the coming weeks and months.

This should be a clarion call for data center managers to step up, raise awareness within their organizations about this pending legislation, and take a proactive role in establishing a plan for a corporate response. Take an inventory of your infrastructure and assess what you will need to begin collecting this information. It might even be wise to get a few quotes to ballpark what it might take to bring your organization up to the task. It's probably better to start doing this now than to be told by the business to get it done.
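As a starting point for that inventory exercise, a back-of-the-envelope estimate of energy use and carbon-cost exposure can be sketched as follows. Every figure here (server count, average draw, PUE, emission factor, carbon price) is an illustrative assumption, not a number from any bill or standard:

```python
HOURS_PER_YEAR = 8760

def annual_kwh(server_count, avg_draw_watts, pue):
    """Facility-level energy: IT draw scaled by PUE, converted to kWh/year."""
    return server_count * avg_draw_watts / 1000.0 * pue * HOURS_PER_YEAR

def carbon_cost(kwh, kg_co2_per_kwh=0.5, usd_per_ton_co2=20.0):
    """Tons of CO2 and the dollar exposure at an assumed carbon price."""
    tons = kwh * kg_co2_per_kwh / 1000.0
    return tons, tons * usd_per_ton_co2

# Illustrative fleet: 2,000 servers at 300 W average draw, PUE of 1.8.
kwh = annual_kwh(server_count=2000, avg_draw_watts=300, pue=1.8)
tons, cost = carbon_cost(kwh)
print(f"{kwh:,.0f} kWh/yr -> {tons:,.0f} t CO2 -> ${cost:,.0f}/yr exposure")
# → 9,460,800 kWh/yr -> 4,730 t CO2 -> $94,608/yr exposure
```

Even a rough number like this gives a data center manager something concrete to bring to the business when arguing for metering and reporting investment now, rather than after a mandate arrives.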