10/20/2010

85 Operations Rules to Live By

This was originally posted in June of 2007 and I have not updated it since; however, this list is still pretty applicable. I'll update it soon...

1) Capacity First - Optimizations Later - breaking this rule will guarantee downtime. Do not optimize under the stress of downtime - focus on capacity first.

2) Make sure you have every net available to catch you – Postgres examples - WAL Files, Slony Replication, Snapshot technologies, Disk based DB versioning (spinoff of Snapshots)

3) Do not ‘Optimize’ problems into your architecture. Often things created to solve problems turn into operational albatrosses to maintain later. Make sure that tools developed in engineering for operations are handed off completely. It usually doesn't work to go back to eng for further development. More importantly, the change requests can be disruptive to well scheduled engineering plans.

4) Keep it simple. Keep it simple, because you are smart. Do not make it overly complex because you can.

5) Caching should be used very sparingly, and really only to protect resources that are hard to scale horizontally. If, however, you can scale something horizontally, it is rarely wise or prudent to add a caching layer. If used, it should be to gain performance for the end user, not to gain capacity for a web site; otherwise, you have just created another bottleneck with very unclear limits of capacity. Caching layers can have negative impacts on the system as a whole. Cascading failures often come as the result of a failed caching layer.

6) Don’t code everything in house; do not buy everything from vendors - use the right tool for the business at the right time

7) Negotiate - the only way to negotiate with real strength is to have done the homework of creating options that are viable so that you would be willing to walk from your preferred vendor if necessary. Do not bluff.

8) Always N+1 and if N=1, do not under any circumstances use +1 for anything but waiting for N to fail. When the redundant server is used to 'share' the load, each server has to stay below 50% capacity, and that 49% ceiling is rarely managed. There are many times when opportunities to have N+2 arise – take advantage of them.
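A rough sketch of the arithmetic behind rule 8, in Python. The function name and the even-load-spreading assumption are mine, not part of the original list. It computes the highest per-server utilization at which a pool that shares load can still lose a server without pushing the survivors past 100%.

```python
def max_safe_utilization(total_servers: int, failures_tolerated: int = 1) -> float:
    """Highest per-server utilization (0..1) that still absorbs the loss of
    `failures_tolerated` servers, assuming load spreads evenly (an assumption)."""
    survivors = total_servers - failures_tolerated
    if survivors < 1:
        raise ValueError("need at least one surviving server")
    # The total load must still fit on the survivors after the failure.
    return survivors / total_servers


if __name__ == "__main__":
    for servers in (2, 3, 5):
        print(servers, f"{max_safe_utilization(servers):.0%}")
```

With two servers sharing the load the ceiling is 50% each, which is where the 49% figure comes from. Adding a third server (N+2 when N=1) raises that ceiling considerably.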

9) Data loss is never a risk the company can take - this has been universally true. The cost of lost data far exceeds the cost to ensure data cannot be lost

10) Parallelize whenever and wherever you can - this is an important way of thinking when it comes to multiple locations. For example, if MogileFS were set up to be location aware and needed to replicate its data in real time, it should work so that each MogileFS server can replicate its data to the entire load balanced farm of MogileFS servers at the other end. Implement many to many wherever possible.

11) RTFM - To this day, I will read the entire manual of a pair of RAID cards to see what the subtle differences are. The devil is in the details. Do your homework.

12) Know your bottlenecks and know how to spot them - every layer - know if you are blocking on disk, RAM, or CPU. It is usually that simple.

13) Have a regular capacity management process - Do not react. Be proactive. Knowing where your soft spots are is critical to staying ahead of the capacity curve.

14) Don’t promote failures and do not fear change

15) Do not breathe your own exhaust. Do not think that the output of your work should be the motivator for how you do things in the future.

16) Ops people that code should write ops tools, not application software

17) The value of a project manager, tech writer, and financial analyst in the ops organization should not be underestimated. They will more than pay for themselves.

18) Monitor EVERYTHING - alert only on what is actionable, record the rest for trend information

19) Have a regular process to look at trend data everywhere

20) Do not make the monitoring so noisy that it becomes useless
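One way to picture rules 18 and 20 together is the small Python sketch below. The `Check` type, its `actionable` flag, and the example metric names are all hypothetical. Every reading is recorded for trending, but only readings that are both over threshold and actionable turn into alerts.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    value: float
    threshold: float
    actionable: bool  # hypothetical flag: can someone act on this right now?


def route(checks):
    """Return (alerts, trends): everything is recorded, few things page."""
    alerts, trends = [], []
    for check in checks:
        trends.append((check.name, check.value))  # record everything
        if check.actionable and check.value >= check.threshold:
            alerts.append(check.name)  # alert on actionable only
    return alerts, trends


if __name__ == "__main__":
    sample = [
        Check("disk_used_pct", 93.0, 90.0, actionable=True),
        Check("cache_hit_pct", 99.0, 50.0, actionable=False),
        Check("cpu_pct", 40.0, 85.0, actionable=True),
    ]
    print(route(sample))
```

Keeping the recording path unconditional and the alerting path narrow is what stops the pager from becoming noise.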

21) Ensure your monitoring system is so simple EVERYONE in the company can easily use it. It is surprising how often ops metrics turn into business metrics, marketing metrics, sales metrics, etc.

22) Do postmortems only if the people that can make changes are there. Otherwise, it is a waste of time.

23) Publish your postmortems. Attach the event data to the postmortem so people can easily look at a Keynote for example and connect the incident to the data.

24) Assign people to be point people for every bit of technology.

25) Assign backup people to those primary people.

26) Hire constantly - even when you do not have headcount - always be hiring.

27) Be your own harshest critic. You can always improve no matter how smart you are or think you are.

28) Compare yourself to as many companies as possible. Look outside of the company.

29) Pick one Tradeshow/Conference, only one, per year and go. If the one meets more than once a year fine, but pick one and only one.

30) Buy what you need, not what you want. Never ever, take off your corporate hat and leave the “what’s easiest and safest for me” hat on.

31) Do what is best for the business, always, even if that means you should not be there.

33) You should not get more than one or two times to fail. There is goodness in some fear, but manage the difference between chronic mistakes and honest mistakes.

34) Be ruthless - your competitors are.

35) Think of the work you do as something you want to sign your name with when done. That also means finish the job.

36) Be available for others.

37) Partner with startups - give them your expertise and scale and you will be rewarded with free product, sometimes for life.

38) Capacity is a business/product issue. This means the net cost per page/post/login, etc. must be visible to make the right business/product decisions.

39) Always beat the budget. The Operations group is usually the largest spender of discretionary money. Revenue targets are often missed, but the Operations group usually has many ways to defer spending, etc.

40) Just because something sucked in the past does not mean it will suck today or in the future - try things and have the tools to test

41) Documentation - document everything - well. Make it so new people do not have to talk to anyone to learn everything

43) Create gigantic plotter size drawings of the logical flows of each of your products

44) Faq-O-Matic, Wiki, something where people can easily post “this is how you fix this” and make it very findable. This is where a tech writer comes in handy, but the idea is to make incremental, even if informal, documentation easier along the way.

45) Make sure everyone, truly everyone is totally replaceable

46) Most people get more done at home than at the office; some people do not.

47) Bundle your orders - you get to ask for the most discounts, contract terms, etc when you have batched your hardware orders. Ask for everything - price locks, spares kits, lease terms, everything before they get the PO.

48) Develop long term relationships with your vendors - make sure you can call them at your next job

49) Give everyone in ops everything they could ever use to make them useful remotely - Treos, EVDO cards, dual 24” LCD panels, ANYTHING and EVERYTHING. With talented people, the costs will be more than paid back by effective remote employees. Remember that Ops (and Eng) are power users who understand what a pixel is and will make the most of screen real estate.

50) Be a complete stick in the mud with IT standards. Until the Mac runs Office 2k7 and Outlook, you have to run some form of Windows. Period. Unless it is *all* Macs - it destroys office productivity for meeting/calendaring, contact management, mail lists, etc. If an employee is willing to stake their job on running an instance of XP on Parallels, fine. This is very rare. Update: this rule is now dated/moot, and stick in the mud isn't necessarily the best approach. This list is very '2007'.

51) Have a streamlined purchasing process - know your budget, make sure you get to manage to it. Get the 'actuals' report from Finance. There is a gap between technical driven budgets/reports and financially driven budgets/reports. A good Ops manager can create models that bridge those different views into COGS. A CFO who understands is a supportive CFO and helps drive business decisions.

52) The weekly meetings must have continuity. Items from previous meetings committed to then must have accountability.

53) Create a separate escalation system to involve engineering in problems with code that affects operations negatively. The idea is that ops issues that escalate into engineering will get lost in either an ops ticketing system or an engineering ticketing system. Create a separate ticket queue and manage to a cleaner and simpler list.

54) Incorporate Operations staff in every stage of a feature or product development from the design stage so that scalability, monitoring, and reliability are baked in to production. This also makes sure operations is responsible for hardware procurement, monitoring systems in place, run books written, etc and that it launches on time and according to operational standards.

55) Practice for being a real company - Sarbanes, WebTrust, SAS 70, Visa, Banking, etc. Do not forget that if you are successful you are going to have to deal with this stuff. It is easier earlier in the process, even if it is nothing more than awareness and knowledge vs. any real change early on. Deploy a ticketing/task tracking tool. USE IT. Put change control/change management into the same system. USE IT. Keep putting info in there. It helps to figure out “what changed in the last week”, as an example.

56) Do not make it hard to have redundancy or multiple locations. Things are hard in the beginning, but do not slow your success down with bad architecture that does not allow true scale and reliability.

57) Oracle Standard Edition (MS SQL Server Std.) is affordable. If you can constrain your use of Oracle to Standard Edition, there is no reason that a viable business even a small startup cannot use it.

58) Postgres and MySQL are free for a reason. If you do not really care about transactional integrity, MySQL is fine. Until the mandatory chaining of the words ‘Vacuum’ and ‘Postgres’ is broken, Postgres represents an unpredictable, usually negatively surprising database.

59) Capacity should be measured at the peak second of the day +20-30%. Unless you are a vmotion guru that is the way it goes.
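The sizing arithmetic in rule 59 can be sketched as follows. This is a minimal illustration, and the function name, the integer-percent parameter, and the sample counts are assumptions. It takes per-second request counts, finds the peak second, and adds the headroom on top.

```python
def required_capacity(per_second_counts, headroom_pct=25):
    """Peak one-second count plus headroom_pct percent, rounded up.

    Integer arithmetic is used so the result is exact for whole-number inputs.
    """
    if not per_second_counts:
        raise ValueError("need at least one sample")
    peak = max(per_second_counts)  # the busiest single second, not an average
    return (peak * (100 + headroom_pct) + 99) // 100


if __name__ == "__main__":
    samples = [120, 340, 900, 410]  # made-up requests per second
    for pct in (20, 30):
        print(pct, required_capacity(samples, pct))
```

The point of using the peak second rather than a daily or hourly average is that averages hide exactly the moment when the site falls over.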

60) Read as many trade magazines as possible. They are usually free; the only ‘price’ to be paid is to answer their somewhat exhaustive surveys. The value of what is new is huge. Have them delivered to your home. The chances of them being read at work are next to nil.

61) Care about security. Engineers should not be in production. Engineers should do code review. There should be separation of duties in ops. There should be people in ops that can easily revoke access to other ops staff. Create an employee handbook that states there will be serious consequences for breaches of security. Build everything from the ground up with the security and privacy of your customer’s data in mind (physical, logical and functional). If you had your customer in a courtroom and you were on the stand, would you feel good about what you had done to make sure the company’s and customers’ data was protected with vigor and diligence?

62) Control your access points. First, people must be able to do their jobs; secondly, make sure you know exactly where they are coming in. Implement two factor authentication methods.

63) Keystroke logging is critical for bastion/gateway hosts that people must go through to get to production hosts. This is harder with Windows; however, TS gateways help, and hopefully there will be auto screen recording options soon.

64) Make sure you have many redundant ways to get to production if things go badly. Do not assume the corporate VPN solution is going to get you to production when the corporate network breaks. Put VPN devices directly in production.

65) Use LDAP for auth. Even if you have just 10 or more machines and you are copying passwd and shadow files around, you need LDAP (AD).

66) Do not discount how helpful a Windows Server 2003(2008) R2 machine can be in a UNIX environment. Just because you do not know it, does not make it bad. Learn it.

67) Do not waste people’s time with ineffective wireless solutions. People are mobile. They want to work on couches, in meetings, un-docked, what have you. Make sure wireless is solid.

68) There will always be people that put in extra effort and time - make sure you tear up their vacation request forms. There will always be people that do not put in the extra time, be diligent about getting their forms. Operations is asked to make a huge sacrifice of personal time on a regular basis. Manage ‘comp’ time by gut and reward the people that are woken up at 3am, are online quickly and happy to fix something.

69) Implement an RDBMS for tracking all your production assets. Have a centralized, then replicated source of truth for everything. Assets, Staff, Networks, Contracts, everything. Replicate this database offsite. This means a working copy somewhere, not daily backups to tape.

70) Automate as many processes as safe to do so (e.g. OS or product releasing, pushing of files, grep of logs, etc.)

71) Automation should always be using the Operations’ RDBMS (Source of Truth).
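As a hedged illustration of rules 69 through 71, the sketch below uses Python's built-in `sqlite3` module as a stand-in for the Operations RDBMS. The `hosts` table, its columns, and the host names are invented for the example. The automation asks the database which machines to push to instead of keeping its own list.

```python
import sqlite3


def build_inventory(conn):
    """Create and seed a tiny, hypothetical asset table."""
    conn.execute(
        "CREATE TABLE hosts ("
        "hostname TEXT PRIMARY KEY, role TEXT NOT NULL, state TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO hosts VALUES (?, ?, ?)",
        [
            ("web01", "web", "in production"),
            ("web02", "web", "in service"),
            ("db01", "db", "in production"),
        ],
    )
    conn.commit()


def push_targets(conn, role):
    """Hosts of the given role that are taking traffic, per the source of truth."""
    rows = conn.execute(
        "SELECT hostname FROM hosts "
        "WHERE role = ? AND state = 'in production' ORDER BY hostname",
        (role,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_inventory(conn)
    print(push_targets(conn, "web"))
```

A real deployment would point at the replicated production database rather than an in-memory one; the shape of the query is the part that matters here.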

72) There are usually three states of a machine - Out of Service, In Service, and In Production. In Service is the state where it is being brought up to speed with cfengine, rsync, what have you. In Production is when it is taking traffic. There needs to be a state where a machine can have information collected from or tests run against without taking production load.
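The three states in rule 72 map naturally onto a small state machine. The Python sketch below is one possible reading: the state names come from the rule, but the allowed transitions and the helper names are assumptions. The key property is that a machine cannot reach In Production without passing through In Service, where it can be inspected and tested without taking load.

```python
from enum import Enum


class MachineState(Enum):
    OUT_OF_SERVICE = "out of service"
    IN_SERVICE = "in service"        # being built or tested, no traffic
    IN_PRODUCTION = "in production"  # taking traffic


# Assumed transitions: production is only reachable via IN_SERVICE.
ALLOWED = {
    MachineState.OUT_OF_SERVICE: {MachineState.IN_SERVICE},
    MachineState.IN_SERVICE: {
        MachineState.IN_PRODUCTION,
        MachineState.OUT_OF_SERVICE,
    },
    MachineState.IN_PRODUCTION: {MachineState.IN_SERVICE},
}


def transition(current, target):
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


def safe_for_testing(state):
    """True when tests can run without touching production load."""
    return state is MachineState.IN_SERVICE


if __name__ == "__main__":
    state = transition(MachineState.OUT_OF_SERVICE, MachineState.IN_SERVICE)
    print(state.value, safe_for_testing(state))
```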

73) Respect your log data. Have methods to harvest logs off machines you need to take out of service/rebuild, etc.

74) If you are growing fast and don’t have a lot of time to ‘optimize’, lock stuff down as you go - when you have a process or procedure that works, don’t change or modify it until absolutely necessary at a later stage. Essentially - default to locked down, grow, and then revisit only when necessary.

75) You will never avoid an Ops engineer blowing away a critical part of your infrastructure by mistake - it happens - so take some of the most critical parts of your infrastructure and run a “what if Joe Random does a ‘rm -rf /’ on our XYZ box?” exercise against them.

76) Try to keep stuff fun and interesting for your team - if they are not enjoying what they are doing, they will go find that buzz elsewhere. Let the team know they are also responsible for this and this is not just a management task. Bosses need the buzz too.

77) The true value in having the ability to deliver on 99.999% uptime is being able to stay nimble. It means you have the ability to utilize redundancy whenever you need to. This allows physical changes, location changes of equipment, code changes, rollback changes, etc. The value of this is huge to a company, more than the value to the customers.

78) If you can deliver 99.999% - Give the customers a 100% SLA
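To put a number on the 99.999% in rules 77 and 78, here is a small Python sketch of the downtime budget. The function name and the 365-day year are assumptions. `Decimal` is used so the percentages are handled exactly instead of picking up floating point noise.

```python
from decimal import Decimal

SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_pct: str, period_days: int = 365) -> Decimal:
    """Seconds of downtime permitted over the period at the given availability.

    availability_pct is passed as a string (e.g. "99.999") to keep it exact.
    """
    unavailable_fraction = (Decimal(100) - Decimal(availability_pct)) / Decimal(100)
    return unavailable_fraction * period_days * SECONDS_PER_DAY


if __name__ == "__main__":
    for pct in ("99.9", "99.99", "99.999"):
        seconds = allowed_downtime_seconds(pct)
        print(pct, seconds, "seconds per year")
```

Five nines works out to 315.36 seconds, a little over five minutes a year, which is why the rule stresses being able to use redundancy for changes rather than budgeting for outages.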

79) Never drown the ability to release software in process. Drown yourself in the ability to rollback or failover to an older version of code. You will never ‘process out’ failures. It is futile. You can however do tremendous things to cover your butt when things go bad. CYA = Staying Nimble = Successful Company.

80) Keep in mind the why’s and what-for’s of what you’re building for the customer every step of the way - whatever you set out to deploy for the end users, keep that at the forefront throughout and design everything (infrastructure, processes and staffing) for the purpose of delivering the best of breed service or product.

81) Do it right the first time. Rarely do you get the chance to go back and redo things. If you do, it comes at a very big cost to the company. Take the hit on work the first go-round.

82) Reach out to partners, allies, and similar companies in the industry and see how they run Operations. Chances are they will have the same challenges as your organization and maybe have better solutions too. Do not be afraid to share your experiences or processes, as they will probably reciprocate.

83) Hire people that are so good that they make you nervous about holding onto your current job. This goes beyond hiring ‘A’ players. Hire people that you admire and want to learn from. Hire people that you would be willing to work for someday.

84) IT and Operations are two completely different beasts. A solid Operations manager/exec can manage corporate IT, but rarely does a traditional IT person have the abilities to manage Internet Operations functions.

85) When starting a new job, or each year at your current job, fight for budget. That doesn’t mean utilize the squeaky wheel approach, it means have a good documented case based on historical information. If you are evaluating a new job, make sure you know *exactly* what your budget is and what it is based on. Also, ensure that you have the ability to change the budget should the landscape change.