I attended DevOpsDaysTO back in September. The 2-day event was held in a cool venue in downtown Toronto, with roughly 300 attendees (hard to believe from looking at this picture…I got there early). Half of the attendees work for small companies (under 100 total employees).

Here are the key messages that resonated with me from the presentations and discussions I took part in during the event. I have attempted to be terse and accurate in each section; my apologies to the presenters if I have misquoted them, or if their message is misrepresented or out of context.

Paul Osman (500px):

Paul discussed initiatives and practices that he recommends for improving service robustness and controlling the environment. He introduced the concept of “micro-services” as a means to improve system availability: single points of failure can be eliminated by refactoring complex services into a collection of smaller services. Paul stressed that measuring and monitoring your environment is essential. Taking care to choose the right things to monitor goes a long way toward keeping you well informed of system status. Nothing is worse than learning about a system outage from your customers: ideally, your monitoring system should always alert you first. That way, you have a chance to deal with the outage before customers notice it! It is also important to build a "circuit breaker" that can take your environment offline and bring it back online quickly and reliably. It is the sort of feature whose importance you won’t appreciate until you need it! Paul also advocates running “game day” exercises: deliberately stress the production system in an attempt to make things fail *while you're watching*.
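To make the "circuit breaker" idea a bit more concrete, here is a minimal sketch of how I picture it: a single switch that every request checks before doing any work. This is my own illustration, not something Paul showed; the names `KillSwitch` and `handle_request` are hypothetical, and a real environment would need the flag to live somewhere shared (a config service, a load balancer setting) rather than in one process.

```python
import threading


class KillSwitch:
    """Thread-safe offline/online flag for a service (hypothetical helper)."""

    def __init__(self):
        self._offline = threading.Event()

    def take_offline(self):
        self._offline.set()

    def bring_online(self):
        self._offline.clear()

    @property
    def is_offline(self):
        return self._offline.is_set()


def handle_request(switch, handler, request):
    """Return (status, body); answer 503 without calling the handler while offline."""
    if switch.is_offline:
        return 503, "service temporarily offline"
    return 200, handler(request)
```

The point of keeping it this small is that flipping the switch is a one-line operation you can rehearse ahead of time, which is exactly when you find out whether it works.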

Doug Barth (PagerDuty) - Injecting failures

Doug experienced failures during service rollout. His team found that many bugs hide in the “exception” code paths. These paths are difficult to test in Dev/QA environments, and are often left for customers to stumble onto. PagerDuty’s solution was to test failure modes in production. “Failure Fridays” is a weekly one-hour meeting at PagerDuty where dev and ops get together to attack the Production system in a carefully crafted manner. The list of attacks is discussed in advance in order to identify the victim(s) for the current week's attacks. Before beginning the attacks, they open the relevant monitoring dashboards, since failures should be accurately portrayed there (if not, that in itself is a valuable discovery to make!). Attacks start small (a single host), then grow to include more systems. Each attack is given 5 minutes at most, and is stopped as soon as an aspect of the service breaks. Detailed documentation is key: recording times for log correlation, tracking discoveries made during the meeting, and publishing the TODOs discussed during the meeting.
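The attack procedure described above (start small, escalate, cap the duration, stop at the first sign of breakage, write everything down) can be sketched as a small driver loop. This is only my interpretation, not PagerDuty's tooling: `run_attack`, its parameters, and the idea of observing each stage for a fixed `stage_seconds` are all assumptions, and `inject`, `revert` and `is_healthy` are callables you would have to supply yourself.

```python
import time


def run_attack(stages, inject, revert, is_healthy,
               max_seconds=300, stage_seconds=60, poll_seconds=5,
               clock=time.monotonic, sleep=time.sleep):
    """Escalate an attack stage by stage and stop at the first failed health check.

    stages: list of host lists, smallest first, e.g. [["a"], ["a", "b"]].
    Returns a list of (seconds_since_start, event, detail) tuples that can
    later be correlated with service logs.
    """
    log = []
    attacked = []
    start = clock()

    def record(event, detail):
        log.append((round(clock() - start, 1), event, detail))

    try:
        for hosts in stages:
            # Only attack hosts that an earlier stage has not already hit.
            for host in hosts:
                if host not in attacked:
                    inject(host)
                    attacked.append(host)
                    record("inject", host)
            # Observe this stage, never exceeding the overall time budget.
            stage_end = min(clock() + stage_seconds, start + max_seconds)
            while clock() < stage_end:
                if not is_healthy():
                    record("service_broke", list(attacked))
                    return log
                sleep(poll_seconds)
            if clock() >= start + max_seconds:
                record("time_limit", list(attacked))
                break
    finally:
        # Always undo the attack, most recent host first.
        for host in reversed(attacked):
            revert(host)
            record("revert", host)
    return log
```

Putting the revert step in a `finally` block means the attack is undone whether it ends because the service broke, the time ran out, or the script itself crashed.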

iptables was used to simulate packet drops (please double-check command usage and parameters before trying anything like this; these are dangerous commands):

iptables -I INPUT 1 -p tcp --dport 9160 -j DROP

iptables -I INPUT 1 -p tcp --dport 7000 -j DROP

iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP

iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP

You can slow the system down with the tc command (same disclaimer as above; man pages are your friend!):

tc qdisc add dev eth0 root netem delay 500ms 100ms loss 5%
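Since every injected failure also has to be undone, it can help to generate each command together with its inverse. The sketch below is my own addition, not something from the talk: the helper names are hypothetical, and it only builds argument lists without executing anything (running them would require root, for example via `subprocess.run`). It relies on `iptables -I` inserting at position 1 by default and on `iptables -D` deleting the rule that matches the same specification.

```python
def drop_rule(chain, port, action="-I"):
    """Build argv that inserts ("-I") or deletes ("-D") a TCP DROP rule."""
    if chain not in ("INPUT", "OUTPUT"):
        raise ValueError(f"unsupported chain: {chain}")
    if action not in ("-I", "-D"):
        raise ValueError(f"unsupported action: {action}")
    # Inbound packets are matched on destination port, outbound on source port.
    port_flag = "--dport" if chain == "INPUT" else "--sport"
    return ["iptables", action, chain, "-p", "tcp", port_flag, str(port), "-j", "DROP"]


def isolate_ports(ports):
    """Return (inject, revert) command lists that drop TCP traffic on the given ports."""
    pairs = [(chain, port) for chain in ("INPUT", "OUTPUT") for port in ports]
    inject = [drop_rule(chain, port, "-I") for chain, port in pairs]
    revert = [drop_rule(chain, port, "-D") for chain, port in pairs]
    return inject, revert


def slow_network(dev="eth0", delay="500ms", jitter="100ms", loss="5%"):
    """Return (inject, revert) argv for a netem delay/loss qdisc on one device."""
    inject = ["tc", "qdisc", "add", "dev", dev, "root", "netem",
              "delay", delay, jitter, "loss", loss]
    revert = ["tc", "qdisc", "del", "dev", dev, "root", "netem"]
    return inject, revert
```

For example, `isolate_ports([9160, 7000])` yields the four drop rules and the four matching deletions, so the cleanup step is never written by hand under pressure.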

Results? PagerDuty found issues in areas such as: large files on ext3 volumes, failure to restart due to a bad /etc/fstab file, high latency from a network-isolated cache, low service capacity due to a lost data center, and missing alerts/metrics in their monitoring system. The cultural impact was high: the exercise encourages knowledge sharing, highlights untestable systems, and keeps failure handling on everyone’s mind.

Lisa van Gelder (Cyrus Innovation):

Speed of delivery was a key topic in this presentation. Continuous delivery can be achieved by removing the bottlenecks that slow you down. While this seems an obvious statement to make, a detailed analysis of the delivery process can help you see where the biggest delays occur (and hence give you an opportunity to speed up or eliminate these bottlenecks). Doubling the frequency of releases can reduce some risks, as it reduces the code churn in each release. With less code being introduced per release, the rollback process is more straightforward, and individual bugs are easier to track down.

How do you know if your production service is working properly? One answer is “canary testing” (a reference to the canaries miners used as an early warning of gas leaks underground). If your “canary” is an automated user of your service, you can monitor it to ensure it’s always functioning as expected. If the canary fails for some reason, that might be an indication of a problem with your service.
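As a rough illustration of the canary idea, the check below runs one scripted "user" action and raises an alert on any deviation. It is a sketch under my own assumptions: `check_canary` is a hypothetical name, `probe` stands in for whatever your automated user does (log in, place an order, fetch a page), and `notify` stands in for your alerting system.

```python
def check_canary(probe, expected, notify):
    """Run the canary's scripted action once; alert and return False on any deviation."""
    try:
        actual = probe()
    except Exception as exc:
        # A crash in the canary is itself a signal worth alerting on.
        notify(f"canary raised {exc!r}")
        return False
    if actual != expected:
        notify(f"canary got {actual!r}, expected {expected!r}")
        return False
    return True
```

In practice you would call this on a schedule (every minute, say) and chart the results alongside your other monitoring.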

More ways to do performance testing: 1) “Soak test”: a load test that is run for a longer period of time (e.g. 2-3 hours of constant load). 2) “Dark launch”: an early release of a new version, which provides early quality feedback on the release before customers are expecting to see it.

Always make it easy to roll back a release. If you can't roll back, the risk of making a release is that much higher (since there’s no going back if a bug is discovered). DB changes can often cause rollback problems. A good way to deal with this is to separate the release of DB changes from the release of code changes. E.g. release 1 introduces the new schema to the DB, and release 2 introduces the code that uses the new schema. This practice pushes developers to build in extra resiliency, and improves your ability to perform reliable rollbacks.
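Here is a small sketch of that two-release pattern using an in-memory SQLite database. The `users` table, the `email` column and the function names are made up for illustration; the point is only that the schema change in release 1 is additive, so the old code keeps working both before release 2 ships and after release 2 is rolled back.

```python
import sqlite3


def old_code_insert(conn, name):
    # Code from before release 2: knows nothing about the new column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def release_1_schema(conn):
    # Schema-only release: a new nullable column, so existing code is unaffected.
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT")


def new_code_insert(conn, name, email):
    # Release 2: code that starts using the new column.
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

old_code_insert(conn, "alice")                       # before any release
release_1_schema(conn)                               # release 1: schema only
old_code_insert(conn, "bob")                         # old code still works
new_code_insert(conn, "carol", "carol@example.com")  # release 2: new code
old_code_insert(conn, "dave")                        # release 2 rolled back: still works

rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()
```

Rolling back release 2 here needs no DB change at all, which is what makes the rollback reliable.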