6.
Good Engineers
• Detail Oriented
• Aspire to be operational engineers
• Stubborn
• Can steer their inner ADD
– Interrupt driven
• Not the same as good developers

7.
Danger signs
• Thinks operations is a path to
development engineering
– Fire them
• Want people dedicated to the task
• A good operations engineer should
spend some time in development
• A good development engineer MUST
spend some time in operations

30.
Something is wrong
• Don’t worry, data warehouse

31.
tcpdump / Wireshark
• If you suspect the network
• Don’t just suspect
• LOOK AT IT
• tcpdump / Wireshark will tell you
– If your packets are lost, delayed or
corrupted
– If your windowing is wrong

32.
Rule 4: Divide and Conquer
• Look at the problems in turn
• Split between people
• Go in the order you suspect is the most
likely

33.
Rule 5:
Change one thing at a time
• I cannot stress this enough
• IF YOU DO NOT THEN YOU HAVE
FAILED TO IDENTIFY THE PROBLEM

34.
Rule 6:
Keep an audit trail
• You might be making things worse
• Good for the root cause analysis
• Have your shell log all commands
– Good practice anyway
• Version control
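One way to keep an audit trail for scripted changes is sketched below. It is a minimal illustration, not the setup used in the talk: the helper name `run_logged` and the default log file `audit.log` are assumptions made for the example. The command is written to the log before it runs, so the entry survives even if the command hangs or takes the machine down.

```python
import datetime
import shlex
import subprocess


def run_logged(command, log_path="audit.log"):
    """Log a command with a timestamp, run it, then log its exit status.

    `run_logged` and `audit.log` are hypothetical names for this sketch.
    """
    argv = shlex.split(command)

    # Write the entry first: if the command breaks the box, the log still
    # shows what was attempted and when.
    started = datetime.datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{started} RUN {command}\n")

    result = subprocess.run(argv, capture_output=True, text=True)

    finished = datetime.datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{finished} EXIT {result.returncode} {command}\n")
    return result
```

Calling `run_logged("uptime")` would append one RUN line and one EXIT line to the log. Keeping the log file itself under version control gives a history for the root cause analysis.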

35.
Rule 9:
If you didn’t fix it, it ain’t fixed
• You must do something to fix a problem
• Or it will bite you again
• And again
• And again
• They don’t just appear and disappear
• Except BGP route convergence :)

38.
Complexity kills
• Design against it
• Reuse components
• Define standards
• Have a few images that all machines
look like - reimage machines every now
and then for the heck of it.
– EC2 forces you to do this

40.
MTTR
Mean Time To Recovery
• Important
• No one cares if you fail once a minute
– If you recover in 50 ms
• If you are down 1 minute a week, you
are still going to hit 4 nines (99.99%)
• Failures happen, plan how to deal with
them
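The four-nines claim above can be checked with a few lines of arithmetic. This is a small sketch; the function names are made up for the example, and it assumes availability is measured over a seven-day week.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800


def availability(downtime_seconds_per_week):
    """Fraction of the week the service is up."""
    return 1.0 - downtime_seconds_per_week / SECONDS_PER_WEEK


def downtime_budget(target):
    """Seconds of downtime per week allowed by an availability target."""
    return (1.0 - target) * SECONDS_PER_WEEK


# One minute of downtime per week, as on the slide.
print(f"availability: {availability(60):.4%}")
# Weekly downtime allowed by a 99.99% target.
print(f"budget at four nines: {downtime_budget(0.9999):.2f} s")
```

One minute a week is about 99.9901% availability, so it just fits: the four-nines budget is roughly 60.5 seconds per week.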

53.
Your datacenter
• Keep it tidy
– Label things, keep cables as short as possible
– Have a switch in each rack
• If you are small without dedicated DC staff
you need
– Remote control power switches
– Remote console!

54.
Virtualization
• Please use it
• Managing becomes much easier
• Power consumption
• Need a new test box
– The requestor can have it in minutes

55.
Power consumption
• Maybe not as important in Europe
• 8 core machines are more efficient than
1 core
• But memcache uses 1 core and all RAM
• Get more RAM and virtualise

56.
Our network admin boxes
• 1 Xen CPU for Vyatta
• 1 Xen CPU for LVS
• 1 Xen CPU for Squid - CARP
• 1 Xen CPU for Squid
• 1 Xen CPU for Monitoring
• 1 Xen CPU for network tasks
• We can have more of these and a loss of one
affects us less

57.
Vyatta
• Open source router
– Really like it
– No need to use Cisco

60.
Squid
• As a reverse web accelerator
• 90 % of our hits served from RAM in less than
1 ms
• Same as Wikipedia
• We only use RAM cache (unlike Wikipedia)
• Cached per user
• If not cacheable - cache for a second to
reduce backend effect
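The "cache for a second" idea can be illustrated outside Squid. The sketch below is not Squid configuration; it is a hypothetical in-process `MicroCache` class, keyed per user and URL, that shows why even a one-second lifetime caps the backend at about one request per key per second, however many hits arrive.

```python
import time


class MicroCache:
    """Tiny TTL cache: serve a stored response until it is `ttl_seconds` old.

    Hypothetical helper for illustration; the clock is injectable for testing.
    """

    def __init__(self, ttl_seconds=1.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (time stored, value)

    def get(self, key, fetch):
        """Return the cached value for `key`, calling `fetch()` only on a miss."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # still fresh: the backend is not touched
        value = fetch()  # missing or expired: one backend call
        self._store[key] = (now, value)
        return value
```

With a key such as `(user_id, url)`, a burst of 100 identical requests inside one second reaches the backend once; the next request after the entry expires triggers a single refresh.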

67.
EC2 and S3
• We save all our binlogs to S3
• We save database dumps to S3
• We have monitors running from EC2
• We plan to build a data warehouse
cluster on EC2

68.
EC2 Requires Automation
• Machine is blank when you bring it up
• Download database dump from S3 and
replicate up - automatically
• Use Puppet
• Amazon saves you hardware
headaches
– But complexity is still a problem