SREcon: Performance Checklists for SREs 2016

04 May 2016

When Netflix is down, minutes matter, and there's little time for traditional performance engineering. At SREcon16 Santa Clara I gave the closing address on performance checklists for SREs. Checklists are vital for this kind of work, and are often implemented at Netflix as custom dashboards of selected metrics.

This was my first talk about my SRE work at Netflix, where I've joined the on-call rotation for the Core incident response team. I began by summarizing the difference between performance engineering (where I spend most of my time, and which is the team I'm on) and SRE incident response for performance issues.

I summarized a dozen checklists in the talk, as well as methodologies to derive them. They are roughly sorted in intended order of use: starting with cloud-wide dashboards and ending with Linux-specific checklists.

The first two checklists are our Performance and Reliability Engineering (PRE) Triage Checklist, a shared document, and then predash, a custom dashboard. These are Netflix-specific, and show how we begin this type of analysis. I thought for a moment that they were too specific to Netflix, but wanted to include them anyway for completeness.

I've reproduced the Linux checklists below, which should be implemented as GUI dashboards. Check the presentation for eight other checklists.

Another short checklist. Won't solve everything. ext4slower/dist and bioslower/latency are from bcc/BPF tools.

8. Linux Network Checklist

1. sar -n DEV,EDEV 1 → at interface limits? or use nicstat

2. sar -n TCP,ETCP 1 → active/passive load, retransmit rate

3. cat /etc/resolv.conf → it's always DNS

4. mpstat -P ALL 1 → high kernel time? single hot CPU?

5. tcpretrans → what are the retransmits? state?

6. tcpconnect → connecting to anything unexpected?

7. tcpaccept → unexpected workload?

8. netstat -rnv → any inefficient routes?

9. check firewall config → anything blocking/throttling?

10. netstat -s → play 252 metric pickup

The tcp* tools (tcpretrans, tcpconnect, tcpaccept) are from bcc/BPF tools.
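As a rough illustration of the retransmit rate check, here is a minimal sketch that computes a retransmit percentage from the Tcp counters in /proc/net/snmp. The function name tcp_retrans_pct is hypothetical, and the figure is a since-boot average (RetransSegs over OutSegs), not the per-second interval rate that sar -n ETCP 1 prints, so treat it as a first sanity check only.

```shell
#!/usr/bin/env bash
# Sketch only: tcp_retrans_pct is a hypothetical helper, not an existing tool.
# It reads /proc/net/snmp-format text on stdin and prints the since-boot
# TCP retransmit percentage: 100 * RetransSegs / OutSegs.
tcp_retrans_pct() {
    LC_ALL=C awk '
        # First "Tcp:" line is the header: map each counter name to its column.
        $1 == "Tcp:" && !seen {
            for (i = 2; i <= NF; i++) col[$i] = i
            seen = 1
            next
        }
        # Second "Tcp:" line holds the values in the same column order.
        $1 == "Tcp:" && seen {
            out = $(col["OutSegs"])
            re  = $(col["RetransSegs"])
            if (out > 0) printf "%.2f\n", 100 * re / out
            else print "0.00"
            exit
        }
    '
}
```

On a Linux host this could be run as `tcp_retrans_pct < /proc/net/snmp`. For an interval rate, take two samples of the counters and divide the deltas instead.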

9. Linux CPU Checklist

1. uptime → load averages

2. vmstat 1 → system-wide utilization, run queue length

3. mpstat -P ALL 1 → CPU balance

4. pidstat 1 → per-process CPU

5. CPU flame graph → CPU profiling

6. CPU subsecond offset heat map → look for gaps

7. perf stat -a -- sleep 10 → IPC, LLC hit ratio

htop can do 1-4. I'm tempted to add execsnoop for short-lived processes (it's in perf-tools or bcc/BPF tools).
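The first four items of the CPU checklist are plain command-line tools, so they are easy to run in sequence during an incident. The following is a minimal sketch under some assumptions: run_check and cpu_checklist are hypothetical helper names, each sampling tool is limited to five one-second samples so the run finishes, and a missing tool is skipped with a note (mpstat and pidstat come from the sysstat package, which may not be installed).

```shell
#!/usr/bin/env bash
# Sketch only: run_check and cpu_checklist are hypothetical helpers.

# Run one checklist command with a header line, or note that it is missing.
# Usage: run_check "<what to look for>" <command> [args...]
run_check() {
    local purpose=$1
    shift
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "## SKIP: $1 not installed ($purpose)"
        return 0
    fi
    echo "## $* -> $purpose"
    "$@"
}

# CPU checklist items 1-4, each bounded to five one-second samples.
cpu_checklist() {
    run_check "load averages" uptime
    run_check "system-wide utilization, run queue length" vmstat 1 5
    run_check "CPU balance" mpstat -P ALL 1 5
    run_check "per-process CPU" pidstat 1 5
}
```

Calling `cpu_checklist` prints each command's output under its header. The remaining items (flame graphs, heat maps, perf stat) need more setup and interpretation, so they are left out of this sketch.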

For more about SRE at Netflix, see my colleague Jonah Horowitz's talk Netflix: 190 Countries and 5 CORE SREs. We're also hiring SREs (keep an eye on Netflix jobs). For other talks about SRE (Site Reliability Engineering), see the SREcon16 program.

This was my first SREcon and I found it very useful and informative, particularly to see what SRE really means to different companies. Thanks to USENIX and the organizers for a great conference!