Recently in Technical Management Category

It's a great way to get the team--and not just ops--offsite to experience a healthier way of dealing with and learning from failure.

If you are in the NYC area, this is a great opportunity to learn how to make postmortems an integral part of how you improve reliability and prevent future outages.

When we wrote our "how to do postmortems" section of the upcoming The Practice of Cloud System Administration, we asked Dave for advice because we respect his expertise. Now you can get a full day of training directly from Yulia and Dave!

A user recently asked for a lot of disk space. Not just a lot of disk space, but an amount growing at an astounding rate per month. (Not big for some places, but bigger than my current employer was used to providing.) It was an archive that would start large and grow in leaps and bounds. It had to be actual disk (not tape or other offline technology) because the data would be accessed constantly.

He joked that what he really wanted was infinite disk space. I replied, "I can give you infinite storage," and I wasn't joking.

He told me to prove it, so I explained:

Your data will start large and grow quickly. If I can install storage capacity faster than you need it, the result is indistinguishable from actually having infinite storage. I can afford it because the equipment will come from your budget, so it isn't my problem (directly). And if this project is as valuable to the company as you say it is, your management will gladly fund it. (If they don't fund it, then we know the value of the project is not what you think it is.)

To stay ahead of your capacity demands I need to know your predicted growth rate. You'll have to provide predictions based on engineering estimates, but I'll have monitoring that gives you "ground truth" as confirmation and helps you make better predictions.
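The arithmetic behind "install capacity faster than you need it" fits in a few lines. This sketch is mine, not anything from the actual conversation; the function name, units, and a linear growth rate are all assumptions:

```python
def weeks_until_order(used_tb, capacity_tb, growth_tb_per_week, lead_time_weeks):
    """Weeks from now when the next capacity order must be placed so that
    hardware arrives before existing capacity is exhausted.
    Assumes linear growth; all names and units are hypothetical."""
    weeks_until_full = (capacity_tb - used_tb) / growth_tb_per_week
    # Order early enough to cover the vendor's delivery lead time.
    return max(0.0, weeks_until_full - lead_time_weeks)

# 100 TB used of 200 TB, growing 10 TB/week, 4-week lead time:
weeks_until_order(100, 200, 10, 4)  # -> 6.0 (order in six weeks)
```

Monitoring supplies `used_tb` and the observed growth rate; the user's engineering estimates refine `growth_tb_per_week` for the future.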

The key is communication and cooperation. The biggest barriers to a project are often funding and capacity planning. If you can solve those two issues everything else falls into place and we can do anything.

That was true when sysadmins were providing megabytes of storage in the 90s and it is true when providing petabytes of storage today.

In an effort to help the less technical community understand what Wikimedia Foundation's systems administrators do, Sumana Harihareswara wrote some very interesting blog posts. They're interesting to technical people too.

It is particularly interesting how she expresses the value of what we do to the Wikimedia managers and donors. There's also some information in there about how Wikimedia Foundation Ops uses Puppet, Nagios, and Ganglia.

For the first time ever Google discusses our "DiRT" (Disaster Recovery Test) procedure. This is the week of hell where systems are taken down with little or no notice to verify that all the failure protection systems work.

Oh yeah... and the funny sidebar at the end was written by me :-)

Enjoy!

P.S. (I take credit for cajoling Kripa into writing the article. I think she did a bang-up job! Go Kripa!!)

I moderated a discussion with Jesse Robbins, Kripa Krishnan, and John Allspaw about Learning to Embrace Failure. This is the first time you'll see Google reveal what they've been doing since 2006. Read the entire discussion in the new issue of ACM Queue magazine: Resilience Engineering: Learning to Embrace Failure

Participants include Jesse Robbins, the architect of GameDay at Amazon, where he was officially called the Master of Disaster. Robbins used his training as a firefighter in developing GameDay, following similar principles of incident response. He left Amazon in 2006 and founded the Velocity Web Performance and Operations Conference, the annual O'Reilly meeting for people building at Internet scale. In 2008, he founded Opscode, which makes Chef, a popular framework for infrastructure automation. Running GameDay operations on a slightly smaller scale is John Allspaw, senior vice president of technological operations at Etsy. Allspaw's experience includes stints at Salon.com and Friendster before joining Flickr as engineering manager in 2005. He moved to Etsy in 2010. He also recently took over as chair of the Velocity conference from Robbins.
Google's equivalent of GameDay is run by Kripa Krishnan, who has been with the program almost from the time it started six years ago. She also works on other infrastructure projects, most of which are focused on the protection of users and their data.

This is the 2nd of 3 articles on the subject that I'm involved with. Part 1 was published last week. Part 3 is kind of a surprise and will be out in less than a month. Watch my blog for the announcement.

Here's a good strategy to improve the reliability of your systems: Buy the most expensive computers, storage, and network equipment you can find. It is the really high-end stuff that has the best "uptime" and "MTBF".

Wait... why are you laughing? There are a lot of high-end, fault-tolerant, "never fails" systems out there. Those companies must be in business for a reason!

Ok.... if you don't believe that, let me try again.

Here's a good strategy to improve the reliability of your systems: Any time you have an outage, find who caused it and fire that person. Eventually you'll have a company that only employs perfect people.

Wait... you are laughing again! What am I missing here?

Ok, obviously those two strategies won't work. System administration is full of examples of both. At the start of "the web" we achieved high uptimes by buying Sun E10000 computers costing megabucks because "that's just how you do it" to get high performance and high uptimes. That strategy lasted until the mid-2000s. The "fire anyone who isn't perfect" strategy sounds like something out of an "old school" MBA textbook. There are plenty of companies that seem to follow that rule.

We find those strategies laughable because the problem is not the hardware or the people. Hardware, no matter how much or how little you pay for it, will fail. People, no matter how smart or careful, will always make some mistakes. Not all mistakes can be foreseen. Not all edge cases are cost effective to prevent!

Good companies have outages and learn from them. They write down those "lessons learned" in a post-mortem document that is passed around so that everyone can learn. (I've written about how to do a decent postmortem before.)

If we are going to "learn something" from each outage and we want to learn a lot, we must have more outages.

However (and this is important) you want those outages to be under your control.

If you knew there was going to be an outage in the future, would you want it at 3am Sunday morning or 10am on a Tuesday?

You might say that 3am on Sunday is better because users won't see it. I disagree. I'd rather have it at 10am on Tuesday so I can be there to observe it, fix it, and learn from it.

In school we did this all the time. It is called a "fire drill". During the first fire drill of the school year we usually did a pretty bad job. However, the second one was much better. The hope is that if there were a real fire, it would happen after we'd gotten good at it.

Wouldn't you rather just never have fires? Sure, and when that is possible let me know. Until then, I like fire drills.

Wouldn't you rather have computer systems that never fail? Sure, and when that's possible let me know. Until then I like sysadmin fire drills.

Different companies call them different things. Jesse Robbins at Twitter calls them "GameDay" exercises. John Allspaw at Etsy refers to "resilience testing" in his new article on ACM Queue. Google calls them something else.

The longer you go without an outage, the more rusty you get. You actually improve your uptime by creating outages periodically so that you don't get rusty. It is better to have a controlled outage than to wait for the next outage to find you out of practice.

Fire drills don't have to be visible to the users. In fact, they shouldn't be. You should be able to fail over a database to the hot spare without user-visible effects.

Systems that are fault tolerant should be periodically tested. Just like you test your backups by doing an occasional full restore (don't you?), you should periodically fail over that database server, web server, RAID system, and so on. Do it in a controlled way: plan it, announce it, make contingency plans, and so on. Afterwards write up a timeline of what happened, what mistakes were made, and what can be done to improve things next time. For each improvement file a bug. Assign someone to hound people until the bugs are all closed. Or, if a bug is "too expensive to fix", have management sign off on that decision. I believe that being unwilling to pay to fix a problem ("allocate resources" in business terms) is equal to saying "I'm willing to take the risk that it won't happen." So make sure they understand what they are agreeing to.
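The bookkeeping above (one bug per improvement, hounded until closed or formally signed off) can be sketched in a few lines. The class and field names here are invented for illustration, not from any real ticket system:

```python
class ActionItem:
    """One improvement identified in a fire-drill write-up (hypothetical structure)."""
    def __init__(self, description, owner):
        self.description = description
        self.owner = owner
        self.closed = False
        # Set to a manager's name only if they explicitly accept the risk
        # of not fixing it.
        self.risk_accepted_by = None

def needs_hounding(items):
    """Action items that are neither fixed nor formally accepted as risk."""
    return [i for i in items if not i.closed and i.risk_accepted_by is None]
```

The point of the second function is that an item leaves the list only two ways: someone fixes it, or a named manager owns the decision not to.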

Most importantly: have the right attitude. Nobody should be afraid to be mentioned in the "lessons learned" document. Instead, people should be rewarded, publicly, for finding problems and taking responsibility to fix them. Give a reward, even a small one, to the person who fixes the most bugs filed after a fire drill. Even if the award is a dorky certificate to hang on their wall, a batch of cookies, or getting to pick which restaurant we go to for the next team dinner, it will mean a lot. Receiving this award should be something that can be listed on the person's next performance review.

The best kind of fire drill tests cross-team communication. If you can involve 2-3 teams in planning the drill you have the potential to learn a lot more. Does everyone involved know how to contact each other? Is the conference bridge big enough for everyone? If the managers of all three teams have to pretend to be unavailable during the outage, are the three teams able to complete the drill?

My last bit of advice is that fire drills need management approval. The entire management chain needs to be aware of what is happening and understand the business purpose of doing all this.

John's article has a lot of great advice about explaining this to management, what push-back you might expect, and so on. His article, Fault Injection in Production, is so well written even your boss will understand it. (Ha ha, a little boss humor there.)

In the last few weeks I've written about ways to get peers to adopt a technology you like, and how to get your managers to adopt it too. Today I'd like to point out some "non-traditional" strategies you might employ when those fail. This list was created when talking with a reader about how to get approval for installing a trouble-ticket system.

Often the non-technical push-back is against the entire concept of ticket systems and nothing will be "good enough". In that case, don't bring a knife to a gun fight. In fact, find a way to avoid the fight entirely.

The Art of War and other strategy books would suggest alternate strategies like these:

Privately confront the primary dissenter directly: talk privately with the person to find the reasons behind their actions and settle those issues. Enlist them as a supporter.

Go around the dissenter entirely: set up the ticket system of your own choosing for a project they are not involved in; when it is successful, it will be politically difficult not to expand its use to all projects.

Go over the dissenter's head: get the dissenter's boss on board.

Leverage influential people: If there is someone that the dissenter feels walks on water and can do no wrong, get an endorsement from that person.

Act faster: install something and put it into action before they can push back.

Act slower: are there benefits to putting off the decision? For example, will the dissenter retire or change jobs soon? (You may not be allowed to know that they are on the way out. If your boss smiles knowingly when you ask, maybe they know something you don't know.)

Produce more data: Gather data and produce charts that show undeniably that you are right (don't show a single chart that disagrees; if the dissenter doesn't have the raw data, they can't make those charts).

Produce less data: Work in secret to build the system.

The power of crowds: Can you get a lot of other people on board such that the dissenter is outvoted?

The Power of the Demo: Are they rejecting a system they haven't actually used? Install your preferred solution on a VM and give demos to likely supporters. (The secret to a successful demo is doing at least 5 dress rehearsals)

Divide and conquer: Find out where the opposition isn't in agreement with each other and play one side against the other.

Isolate dissent: Identify the dissenters and exclude them from the process (find a politically viable justification for this).

Overload the dissenter: Give them so much other work to do that they don't have time to dissent; or put so much of the research on their shoulders that they ask to be taken out of the decision process.

Reduce the choices: Don't show 15 different models and hope they pick the one you want. Only show options that you will accept.

Give too many choices: Show so many potential products that they are overwhelmed; declare your expertise and recommend the one you want.

Selective comparison: Show 1 really awful system followed by a perfect demo of your system. (In a related note: At a singles bar always stand next to an ugly person.)

Force a "win": Get agreement to default to your solution if a decision isn't made by a certain date ("because we can't delay ProjectX"). Make sure you've given them more work than can be accomplished by that date so as to trigger the default.

Make the dissenter think they are making the decision: If you ask a child "what do you want for dinner?" they'll ask for ice cream. If you ask, "Should we have hamburgers or hotdogs?" they'll think they're making the decision even though you've already made it for them. (Worst of all: don't list choices one at a time; they'll keep saying "no" until you run out of choices: "Do you want hamburgers?" "No." "Do you want hotdogs?" "No." "Umm... well, we have ice cream." "Yes!")

Take advantage of emergencies: In an emergency the normal decision process goes away. Can you create a situation (or wait for one) where you can get permission to install RT or OTRS "just for this one emergency" and then take advantage of the fact that "nothing is more permanent than a temporary solution"?

Bullies only respect other bullies: Declare that your solution is the ONLY solution and brow-beat anyone that disagrees.

Discredit the enemy: If the dissenter is always going to find reasons to reject something, don't try to deal with the points they bring up; discredit the dissenter's opinions. ("He isn't a real stake-holder, why should we listen to him?" "He rejects anything new, remember the time....", "He won't even be using the system, why is he causing trouble for us?")

Running code beats vaporware: a running system beats the theory that it won't work.

Avoid the issue: Find another project to work on that will make you a success; leave this "can't win" situation to co-workers that are suckers.

These strategies could work, or they could get you fired. Proceed with caution. Work with your boss, or, if your boss is the problem, confer with peers.

Please post comments with your suggestions and experiences. (This website now supports OpenID and other systems.)

At Bell Labs in the 1990s I felt that the president spent most of his time alternating between two activities: Canceling projects because "if it was a good idea, why don't we hear that the competition is doing it?" and complaining that the competition had just released a product that we hadn't thought to create. The truth was that we had, but he canceled it for the former reason.

The article recommends a better way to encourage innovation: "we can turn to a third form of logic: abductive logic, the logic of what could be. To use abduction, we need to creatively assemble the disparate experiences and bits of data that seem relevant in order to make an inference--a logical leap--to the best possible conclusion."

As system administrators we often put down extremely new ideas. Centralized file servers were a bad idea, until everyone else was doing it. The web was "too much bandwidth and should be blocked." WiFi can't be made secure. Cloud computing is "untested."

Sometimes I am concerned that we get burnt out and forget that while it is our job to measure risk, we do this to find creative ways to mitigate it, lest we find ourselves using it to justify stopping innovation.

Here's a good New Years Resolution: Make an effort to take the logical leap to see what could be.

My networking friends balked and said that this isn't a valid rule for networks. I think that violations of this rule are so rare they are hard to imagine. Or, better stated, networking people do this so naturally that it is hard to imagine violating this rule.

On a mailing list recently someone asked, "Does anyone have any recommendations for useful metrics to measure the performance of the systems team? (i.e. not the systems themselves)"

Fundamentally you want to define an SLA and then demonstrate that you are meeting it (or how close you are to meeting it, with improvement over time). The problem is how do you define an SLA? Here are some example metrics:

90% of all tickets will be closed in 3 days (measure the number of tickets that are older than 3 days)

VPN and remote access services up 99.99% of the time (measure uptime outside of scheduled maintenance windows)

New users have accounts/machines/etc. within n days of their start (preferably n=-1)

IMAP latency below n microseconds (measure how long it takes to do a simulated login, read of 100 messages, and log out)

I prefer measuring things that can be measured automatically. All of the above can be. Asking humans to take manual measurements is a burden and error prone.
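For example, the ticket-age metric above can be computed automatically against whatever your ticket system exports. The `(ticket_id, opened_at)` tuple shape here is invented for the sketch; a real system would supply this via its own query API:

```python
from datetime import datetime, timedelta

def tickets_over_sla(tickets, now, max_age_days=3):
    """Return IDs of open tickets older than the SLA threshold.
    tickets: list of (ticket_id, opened_at) pairs -- a hypothetical shape."""
    cutoff = now - timedelta(days=max_age_days)
    return [tid for tid, opened_at in tickets if opened_at < cutoff]

# With the "90% closed in 3 days" SLA, this list should stay near empty:
now = datetime(2024, 1, 10)
tickets_over_sla([("T1", datetime(2024, 1, 1)),
                  ("T2", datetime(2024, 1, 9))], now)  # -> ["T1"]
```

Run nightly, this produces exactly the "number of tickets older than 3 days" figure without anyone remembering to count by hand.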

I recently started a new assignment where I was supposed to write down the number of open tickets at the beginning and end of the day, and keep count of how many tickets I had completed. Oh brother. As you can imagine, I failed. There wasn't a single day that I remembered to collect all three data points. Eventually I found a script that extracts this data from our ticket system.

Some things that can't be automatically measured:

Customer happiness. Yes, you can send out surveys, but I don't think that's accurate. People don't respond to surveys unless they are dissatisfied with you or are compulsive survey-takers. It is better to give people a way to tell a manager that they were unhappy so that the team can be "educated". The problem becomes: how do I ask for that kind of feedback from our users? Sometimes it helps to disguise it in the form of a survey. A single-question survey ("On a scale of 1 to 5, how did we do?") followed by a big, big, optional comment box. The rank data you collect might be useful if your boss likes pretty graphs (especially if you graph over long periods of time). The real value will be in the comments. Listen to the comments you get and make sure the person who made the comment gets a personal phone call or visit, not to defend or explain, but to ask for their suggestions on how you could do better. Angry customers want to be listened to more than anything else. In fact, they want to be listened to even more than they want the problem fixed. (Oh, you'll get compliments too. Print them out and put them on the wall for everyone to see!)

"Time to Return to Service" i.e. when there is an outage (dead disk, dead router, etc.) how long before you were able to return the service to an operational state. Don't measure this. Measuring that distracts engineers from building systems that prevent outages (RAID, redundant routers, and so on). If you instead measure uptime you are driving good behavior without micromanaging. If I was measured on my "return to service" times, I'd stop building systems with RAID or redundant routers so that I can have a lot of outages and tons of data to show how good I am at swapping in new hardware. That disk that you paid for shouldn't be sitting in a box next to the computer, it should be part of a RAID system that automatically recovers when there is an outage.

My last recommendation is controversial. You should penalize people that beat their SLA too well. If the SLA says there will be 99.9% uptime, and I provide 99.999% uptime, then I am probably doing one of two bad things: either I'm paying for redundancy that is wasteful or I'm avoiding important system upgrades and therefore impeding innovation. If I am hovering around 99.9% by +/- 0.1% then I've demonstrated that I can balance uptime with budget and innovation. If management complains about outages but I'm still at 99.9%, then they need to change the SLA and be willing to fund the resources to achieve it, or accept the intangible costs of a slower rate of upgrades. They may back down or they may choose one of the other options. That's fine. If you think about it, the essential role of management is to set goals and provide resources to meet those goals. By working to hit (not exceed) your SLA you are creating an environment where they can perform their essential role, whether they realize it or not. Similarly, if they want to save money you can respond with scenarios that include fewer upgrades (higher risk of security problems, less productivity due to the opportunity cost of lacking new features) or by accepting a lower SLA due to an increase in outages.
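A quick way to see how much slack an SLA actually gives you is to compute its downtime budget. This is standard availability arithmetic; the function name and the 30-day period are my own choices:

```python
def downtime_budget_minutes(sla_percent, period_days=30):
    """Minutes of downtime an availability SLA permits per period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100.0)

# 99.9% over 30 days allows about 43 minutes of downtime;
# 99.999% allows only about 26 seconds.
downtime_budget_minutes(99.9)    # ~43.2
downtime_budget_minutes(99.999)  # ~0.43
```

That gap between 43 minutes and 26 seconds is the redundancy you are paying for, or the upgrades you are avoiding, when you overshoot a 99.9% target.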

[Hal is Founder/CEO of Deer Run Associates. This article originally appeared on his blog Righteous IT.]

Reliving the last story from my days at the mid-90's Internet skunkworks reminded me of another bit of tactical IT advice I learned on that job, one which has become a proven strategy that I've used on other engagements. I call it "Queue Inversion Week".

One aspect of our operations religion at the skunkworks was, "All work must be ticketed" (there's another blog post behind that mantra, which I'll get to at some point). We lived and died by our trouble-ticketing system, and ticket priority values generally drove the order of our work-flow in the group.

The problem that often occurs in organizations in this situation, however, is what I refer to as the "tyranny of the queue". Everybody on the team is legitimately working on the highest-priority items. However, due to limited resources in the Operations group, there are lower priority items that tend to collect at the bottom of the queue and never rise to the level of severity that would get them attention. The users who have submitted these low-priority tickets tend to be very understanding (at least they were at the skunkworks) and would wait for weeks or months for somebody in my group to get around to resolving their minor issues. I suspect that during those weeks/months the organization was actually losing a noticeable amount of worker productivity due to these "minor" issues, but we never quantified how much.

What did finally penetrate was a growing rumble of unhappiness from our internal customers. "We realize you guys are working on bigger issues," they'd tell me in staff meetings, "but after a few months even a minor issue becomes really irritating to the person affected." The logic was undeniable.

I took the feedback back to my team and we started kicking around ideas. One solution that had a lot of support was to simply include time as a factor in the priority of the item: after the ticket had sat in the queue for some period of time, the ticket would automatically be bumped up one priority level. The problem is that when we started modeling the idea, we realized it wouldn't work. All of the "noise" from the bottom of the queue would eventually get promoted to the point where it would be interfering with critical work.

Then my guy Josh Smift, who basically "owned" the trouble ticketing system as far as customization and updates were concerned, had the critical insight: let's just "invert" the queue for a week. In other words, the entire Operations crew would simply work items from the bottom of the queue for a week rather than the top. It was simple and it was brilliant.
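The inversion itself is trivial to express, which is part of why it beat the aging-priority scheme: no modeling, no noise promotion, just flip the sort for a week. A sketch (the tuple shape and priority convention are invented for illustration):

```python
def work_order(tickets, invert=False):
    """Order tickets for the team to work through.
    Normally highest priority first; during a Queue Inversion Week,
    lowest first. tickets: list of (ticket_id, priority) pairs,
    where a larger number means more urgent (hypothetical convention)."""
    return sorted(tickets, key=lambda t: t[1], reverse=not invert)

tickets = [("vpn-down", 5), ("squeaky-chair", 1), ("slow-printer", 3)]
work_order(tickets)               # normal week: vpn-down first
work_order(tickets, invert=True)  # inversion week: squeaky-chair first
```

Contrast with the rejected auto-promotion idea: there, low-priority noise eventually mixes in with critical work every week; here, the inversion is bounded to a declared slack week.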

So we looked at the project schedule and identified what looked like a "slack" week and declared it to be "Queue Inversion Week." We notified our user community and encouraged them to submit tickets for any minor annoyances that they'd been reluctant to bring up for whatever reason.

To say that "Queue Inversion Week" was a raging success is to put it mildly indeed. Frankly, all I wanted out of the week was to clear our ticket backlog and get our customers off our backs, but the whole experience was a revelation. First, the morale of my Operations team went through the roof. Analyzing the reasons why, I came to several conclusions:

It got my folks out among the user community and back in touch with the rest of the company, rather than being locked up in the data center all day long. The people who my folks helped were grateful and expressed that to my team, which makes a nice change from the usual, mostly negative feedback IT Ops people tend to get.

The tickets from the bottom of the queue generally required only the simplest tactical resolutions. Each member of my team could resolve dozens of these items during the week (in fact, a friendly competition arose to see who could close the most tickets), and feel terrific afterwards because there was so much concrete good stuff they could see that they'd done.

Regardless of what outsiders think, I believe most people in IT Operations really want to help the people who are their customers. It's depressing to know that there are items languishing at the bottom of the queue that will never get worked on. This week gave my team an excuse to work on these issues.

I think I can reasonably claim that Queue Inversion Week also had a noticeable impact on the morale of the skunkworks as a whole. After all, many of the annoying problems that our users had been just doing work-arounds for were now removed as obstacles. Like a good spring cleaning, everybody could breathe a little easier and enjoy the extra sunshine that appeared through the newly cleaned windows.

We repeated Queue Inversion Week periodically during my tenure at the skunkworks, and every time it was a positive experience that everybody looked forward to and got much benefit from. You can't necessarily have it happen on a rigid schedule, because other operational priorities interfere, but any time it looks like you have a little "slack" in the project schedule coming up and the bottom of your queue is full of little annoying tasks, consider declaring your own "Queue Inversion Week" and see if it doesn't do you and your organization a world of good.

Tom will be teaching two half-day tutorials: "Time Management for System Administrators" and "Interviewing and Hiring System Administrators". This is a rare opportunity to see these talks presented in the Ohio area. Register soon!

With the economy in a down-turn, time management is key to being efficient at what you do. With hiring budgets being slashed, it is important that the people you do hire are top notch. Both of these tutorials are intended for new and experienced system administrators and IT managers alike.

The sixth annual Ohio LinuxFest will be held on October 10-11, 2008 at the Greater Columbus Convention Center in downtown Columbus, Ohio. Hosting authoritative speakers and a large expo, the Ohio LinuxFest welcomes Free and Open Source Software professionals, enthusiasts, and anyone who wants to take part in the event. The Ohio LinuxFest is a free, grassroots conference for the Linux/Open Source Software/Free Software community.

I've registered, I've booked my hotel. Are you going to LISA 2008?
On Thursday I will be doing a 90-minute open Q&A session about Time Management. Feel free to stop by and ask me anything.
On Friday I will be presenting my newest talk titled, "System Administration and The Economics of Plenty". When we start to see how plentiful the world is, we think about our roles as system administrators differently. It affects everything from how we set policy to how we do our jobs.
Register online today!
I hope to see you there!

In system administration we have to keep many lists: lists of users, lists of machines, lists of IP addresses, and so on. The only way to keep information from growing stale is to make sure key processes are driven off of the live database.

Here are three different techniques I've seen used:

Level 1: Periodically gather the information. A spreadsheet is simple and great for this. Once a year you collect information, and then you spend 364 days with out-of-date information. I've seen this in a number of places. At Lucent they hired a company to document "everything with a power plug" once a year. The information was put into a big read-only database that everyone ignored. I wonder how much they paid for this "service".

Level 2: Automatic collection. You know that, at least for the machines you know about, data is being collected and is, hopefully, staying up to date. If the process is automated, you can run it weekly or daily. Machines can stay hidden if the "discovery" software isn't very good, or if someone wants them to stay hidden.

Level 3: Actively-used data. Rather than just storing the data, if you actively use it then you know it is up to date because people are dedicated to keeping it up to date. They receive a benefit, not just you. If the inventory is used to drive software upgrades, then people will complain when they are "left behind" and you'll know to add them to the inventory. If patches only go to machines in the inventory, then sysadmins are compelled to keep the list accurate so they aren't left dealing with security flaws.

Level 3 is a self-correcting system, which saves time and assures far greater accuracy than other solutions.

A company with limited IP address space found itself constantly emailing its engineers begging them to return unused IP addresses that had been allocated to them. Nobody listened. When they changed the IP allocation process to be a "rental agreement" that required semiyearly confirmation that the IP address was still in active use (owners received email with a link to confirm the ownership), suddenly the list became much more accurate.
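The "rental agreement" mechanic is easy to sketch: record when each allocation was last confirmed, and flag anything unconfirmed past the rental period. The data shape below is invented for illustration, not the actual system that company used:

```python
from datetime import datetime, timedelta

def stale_allocations(leases, now, max_age_days=182):
    """IPs whose owner has not reconfirmed use within the rental period
    (about six months). leases: {ip: last_confirmed_datetime} -- a
    hypothetical shape."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(ip for ip, confirmed in leases.items() if confirmed < cutoff)

now = datetime(2024, 12, 31)
stale_allocations({"10.0.0.1": datetime(2024, 1, 1),
                   "10.0.0.2": datetime(2024, 12, 1)}, now)  # -> ["10.0.0.1"]
```

The stale list drives the reclamation emails; clicking the confirmation link just updates the timestamp, which is exactly the Level 3 self-interest loop: confirm or lose the address.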

The pressure to keep the data accurate should come from the person's own self-interest, not from you. Employees are quick to demand corrections to any inaccuracies related to payroll, right? The payroll department has an incentive to pay everyone their accurate salary. The employee has an incentive to make sure they are paid the correct amount, and to make sure their home address and such are accurate. I once saw a company try to send holiday cards to each employee. A secretary was about to blast email to everyone asking for their home address. Since the email wasn't going to say why she needed their address (the card was a surprise), I was sure it was going to cause nothing but a big flap about privacy. Instead I encouraged her to simply get permission to use people's home addresses as listed in the payroll system. While monthly paychecks were direct-deposit, bonuses and tax info were sent by paper mail. Everyone kept that database extremely accurate.

What do you use to keep inventories and other lists of information up to date?

Tom and Strata will be teaching and speaking at LISA 2006 in Washington D.C., Dec 3-9, 2006. This is one of our favorite conferences of the year because it is so darn useful. Get your boss to send ya. This year it is in Washington D.C., which makes it easy to get to for all the east-coasters who usually don't get to go.

In addition, we will be hanging out in what is known as "the hallway track". In fact, if you haven't attended LISA before, you should know that a lot of the educational value is the people you meet. Tom says, "Early in my career a lot of what I learned was from the conversations in the hallway."