There are a lot of great things about the cloud, but the “destroy and rebuild” philosophy, which is really good for building a continuous delivery pipeline, really sucks when applied to troubleshooting production problems. When your application goes haywire, the most valuable engineering skill is not the ability to bring up a copy of your system, or even knowledge of your technology stack (although that doesn’t hurt). It is the skill of understanding and solving problems.

Finding the root cause of an issue and mitigating it with minimal disruption to production is a must-have skill for engineers responsible for managing and maintaining production systems, which nowadays includes ops, DBAs, and devs alike. In this talk I will cover the skills required to troubleshoot complex systems, the traits that prevent engineers from being successful at troubleshooting, and some techniques, tips, and tricks for troubleshooting complex systems in production.

There are plenty of materials on getting development and operations to work together. More conversations are happening around the inclusion of other technology groups, such as DBAs and QA testers, in DevOps processes. That said, DevOps conversations have been largely devoid of talk about BizOps’ place at the table. The goal of any tech-centric group is not to build and/or architect the best technology, but rather to effectively support the business. Yet many of those groups are either not privy to, or don’t bother understanding, the business goals and the overarching effects of the technical decisions they make. In this talk I’ll discuss key areas and feedback points in every DevOps process fit for the inclusion of business units, in order to align technology and business goals and make your life easier.

With the emergence of the DevOps approach to application development, deployment, and management, developers are getting more and more involved in day-to-day system operations. Lately, a popular point of view holds that developers should be included in the on-call rotation on equal footing with sysadmins. While I don’t fully subscribe to that mentality, there are certain processes every organization must implement to get developers involved in the production operation of the software they built. In this talk I’ll walk through different aspects of operational on-call responsibilities and discuss ways in which developers should (and should not) be involved in the operation of production systems.

I gave a talk at DevOpsDays Denver about the collaboration of testing, monitoring, and production troubleshooting.

Identifying and fixing issues in new code before deploying it to production is important for every software development cycle. However, relying on traditional testing methods in the age of Internet-scale, data-driven problems may prove incomplete. Identifying and fixing issues in production quickly is crucial, but it requires insight into usage patterns and trends across the whole architecture and application logic. In this talk I touch on the inefficiencies of some of the most common testing methods, provide real-world examples of discovering odd edge cases with monitoring, and offer recommendations on top-down metric instrumentation to help DevOps organizations identify and act on business-affecting problems.
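To make “top-down metric instrumentation” a bit more concrete, here is a minimal sketch of the idea: instrument business-level events right next to the technical code paths, so that alerting can key off what the business actually cares about. The metric names and the in-memory collector below are hypothetical stand-ins for a real client such as StatsD or Prometheus.

```python
from collections import Counter

# Hypothetical in-memory metrics collector; in production this would be
# a StatsD or Prometheus client shipping to a real metrics backend.
metrics = Counter()

def track(name):
    """Increment a named counter."""
    metrics[name] += 1

def place_order(order, charge_card):
    """Process an order, recording business-level metrics along the way."""
    track("orders.received")      # demand signal
    try:
        charge_card(order["amount"])
    except Exception:
        track("payments.failed")  # lost revenue, not just a stack trace
        raise
    track("orders.completed")     # fulfilled demand

# Illustrative payment backend that declines large charges.
def flaky_charge(amount):
    if amount > 100:
        raise RuntimeError("card declined")

place_order({"amount": 50}, flaky_charge)
try:
    place_order({"amount": 500}, flaky_charge)
except RuntimeError:
    pass

print(metrics["orders.received"],
      metrics["orders.completed"],
      metrics["payments.failed"])  # → 2 1 1
```

Alerting on the ratio of `orders.completed` to `orders.received` can surface edge cases that per-host CPU graphs and generic error-rate dashboards miss entirely.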

If you want to catch more ~~fish~~ bugs, use more hooks.
— George Allen, Sr. + me

The concept of automation is a big part of the DevOps approach, to the point where some people (incorrectly) define DevOps exclusively as automation. But while there are a lot of tools and talks around automating the build->test->deploy pipeline, few talk about utility automation for the intermediate steps of that process.
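As one way to picture “utility automation for intermediate steps,” here is a minimal sketch of a hook registry that runs extra checks between pipeline stages. All names here are illustrative and not any particular CI tool’s API.

```python
from collections import defaultdict

# Hypothetical hook registry: callables attached to named pipeline stages.
hooks = defaultdict(list)

def hook(stage):
    """Decorator registering a function to run when the given stage completes."""
    def register(fn):
        hooks[stage].append(fn)
        return fn
    return register

def run_stage(stage, context):
    """Run all hooks for a stage; any exception would abort the pipeline."""
    for fn in hooks[stage]:
        fn(context)

@hook("build")
def record_artifact_size(ctx):
    # Utility step: track artifact growth between builds.
    ctx["artifact_size"] = len(ctx["artifact"])

@hook("test")
def warn_on_slow_suite(ctx):
    # Utility step: flag a creeping test-suite runtime before it hurts.
    if ctx.get("test_seconds", 0) > 600:
        ctx.setdefault("warnings", []).append("test suite over 10 minutes")

# Pipeline driver: build -> test -> deploy, with hooks after each step.
ctx = {"artifact": b"\x00" * 1024, "test_seconds": 900}
for stage in ("build", "test", "deploy"):
    run_stage(stage, ctx)

print(ctx["artifact_size"], ctx["warnings"])
# → 1024 ['test suite over 10 minutes']
```

The point of the pattern is that the build->test->deploy skeleton stays untouched while teams bolt on the side-channel automation (metrics, notifications, sanity checks) that usually lives in nobody’s pipeline definition.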

I’ve talked at length about the importance of business process monitoring alongside system monitoring, but in discussions I found that sometimes an overview and simple examples are not enough to convince people of the benefits of this approach. Business owners think they don’t need to know anything about the operational performance of their systems as long as they have their numbers, and engineers often don’t feel they need to invest time in understanding, in detail, the business they are supporting, finding the examples shown too “common sense.”

“OMG, Facebook is DOWN!!!” was the cry of millions when Facebook was unavailable for about 3 hours because of network issues. Given the nature of Facebook’s service, the downtime did not have any long-lasting effects on its user base. In fact, some say productivity significantly increased during the 3-hour window without access to Facebook. The bottom line is that the unavailability of a social networking service doesn’t negatively impact its users (the ego and reputation of the service aside). The question is: does the same hold true for companies leveraging Facebook, or other social networks like Twitter, Flickr, FourSquare, etc., in their daily operations?

Mistakes happen. People who claim they can produce a bug-free product are lying either to you or to themselves, and it is debatable which is worse. Anyone who has worked in the tech industry for a few years has a couple of horror stories up their sleeve about THE mistake. Some of those stories are amusing (in retrospect, of course), some are pretty disturbing, but all of them clearly demonstrate one point: there is no perfection. As the systems of today become more and more complex, it is virtually impossible to avoid all mistakes and implement a bug-free solution. Once you accept that as an axiom, the emphasis shifts from the question “How do I avoid all mistakes?” to “How do I minimize the impact of a mistake?”