The Limoncelli Test: 32 Questions for Your Sysadmin Team

12 Comments

Awesome list; I especially like your piece on testing restores, which is someplace where a lot of IT shops fall short. I'd like to add one point, which I always bring up to people claiming that tape is dead and that disk-to-disk is the wave of the future.

One scenario for backups that people often forget is "disgruntled admin has trashed all data." Malice on the part of an IT administrator has gained a lot more attention as something that actually happens, largely because of Terry Childs locking out the city of San Francisco from managing their network back in 2008. It's generally the hardest to deal with, as in most organizations, the backup/storage administrators hold all the keys to the kingdom.

It's different from the case where the building burned down, because WAN replication to a remote site will usually save you in that case. However, a rogue storage admin can usually trash the backup copy just as easily as they can trash the master, unless the software has some WORM capabilities that cannot be disabled.

Intruders are just as dangerous. WebHostingTalk experienced a situation in 2009 where their backups were allegedly trashed by an intruder before the intruder started deleting posts out of their live database.

For certain kinds of data, especially data under regulatory compliance requirements, it's extremely important to keep the data offline where nobody, not even the person primarily responsible, can get to it without setting off red flags.

This is a wonderful test/check list for all IT shops.
That said, I prefer a more holistic (top-down) approach like ITSM (IT Service Management). It is a huge methodology, but it can be adopted gradually, filling the gaps with a list like this and adapting it to the organization's size and idiosyncrasies.

I would like to see more advice on best practices for managing virtualization/cloud installations; there isn't much literature out there on security, backup, configuration management, and monitoring in this environment. Do you think the cloud is a generalization of a local rack of servers, or a different ball game?

I could certainly recommend ITSM and ITIL, but then the person reading my blog would be lost for 6 months or more, possibly never to return. I wanted a checklist people can read in one sitting.

Best practices for virtualization/cloud are a hot topic. The cloud practices overlap with a rack of local servers, but there is a lot more to it. I'm not an expert outside of my little virtualization echo chamber (http://code.google.com/p/ganeti). I'm sure people with big VMware clusters have different experiences than Ganeti users.

The one-hour limit on root passwords seems a bit arbitrary and in some cases violates the canary principle. When there are tens of thousands of highly disparate systems, it is rarely a good idea to change anything all at once (in one hour). That may not be enough time to realize that the seemingly innocent root password change is actually triggering a previously unknown bug that is causing outages on significant numbers of systems. It is important to manage authentication (including the password of last resort) and to be able to change it in a reliable manner. But if one says "Change a hundred passwords per minute until done", we wouldn't finish in an hour, or even a few hours. Too much focus on "magic numbers" like an hour can actually hurt, as it prevents fairly sane approaches from being considered because they're not quite fast enough.
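
To put rough numbers on the "hundred passwords per minute" point, here is a minimal Python sketch. The fleet sizes (10,000 and 50,000 hosts) are assumptions standing in for "tens of thousands", and the doubling canary schedule in `canary_batches` is a hypothetical illustration, not something the original checklist or this comment prescribes.

```python
import math


def minutes_at_fixed_rate(hosts: int, per_minute: int) -> int:
    """Minutes needed to change `hosts` passwords at a steady rate."""
    return math.ceil(hosts / per_minute)


def canary_batches(hosts: int, first_batch: int = 10, growth: int = 2):
    """Yield batch sizes that start small and grow each round.

    Hypothetical schedule: a bad change hits only a few systems
    first, leaving time to notice trouble before the big batches.
    """
    remaining, size = hosts, first_batch
    while remaining > 0:
        batch = min(size, remaining)
        yield batch
        remaining -= batch
        size *= growth


# Assumed fleet sizes; the comment only says "tens of thousands".
for hosts in (10_000, 50_000):
    minutes = minutes_at_fixed_rate(hosts, per_minute=100)
    print(f"{hosts} hosts at 100/min: {minutes} min ({minutes / 60:.1f} h)")

# Batch sizes for a small example fleet of 1,000 hosts.
print(list(canary_batches(1_000)))
```

At a steady 100 per minute, 10,000 hosts take 100 minutes and 50,000 take a little over eight hours, so a fixed-rate rollout misses a one-hour target well before any pause to watch the canaries is factored in.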

When it comes to virtualization, we pointed out that virtualization doesn't change the rules, because it isn't new. It's been around for decades, even in Unix. We just called it something else and didn't brand all the virtualization techniques under the newfangled "cloud" name. (Try telling a mainframer that virtualization is brand new technology and see what you get back.)

Mark: My take on the password change questions is that they are "can", not "do". That is, would the answer to "We need to change every root password right now" be "oh man, this is going to take days" or "we can change them in an hour; does the risk we are mitigating outweigh the risk of making the change quickly?"

Tom: I think something you have missed, although you have hit a number of aspects of it, is "does the department have a service and system go-live checklist?", i.e. a set of criteria that a service or system is tested against before it is considered supportable and ready to go live.

This is a great list. It's hard to add to what is already great, but more could be said about empowering.

For example: do staff have the appropriate resources and authority to perform their jobs with excellence and efficiency, including all necessary and reasonable HW, SW, training, talent, objectives, decisions, support, creativity, etc.?

It's not as objective as your questions, but I'm sure you can communicate the topic better than I. A point this important seems worth additional consideration.