Apps Behaving Badly

What do you do when something goes horribly wrong in production? Of course we hope it never happens, but sometimes mistakes occur or something unexpected comes up, your servers start chewing through memory and failing to complete connections, and everything goes to hell.
At the Guardian, our CMS incorporates a number of architectural decisions that allow us to recover from almost all forms of failure. We’ll detail how some of these work and why we made them work the way we did.
Once you’ve managed to patch the system into a state from which it can recover, the next vital task is to work out why the failure happened and how to fix it. We use a particular method when addressing serious site failures, and there are a number of tools and approaches that you can use after the fact to reconstruct what happened and trace back in time.

Michael Brunton-Spall

Guardian News and Media

Michael Brunton-Spall is the Developer Advocate for the Guardian. He has worked at the Guardian for three years, helping to build and scale the website. He has spent a lot of time helping to set up and run the platform team that manages internal, behind-the-scenes performance and scalability issues.
As a Developer Advocate, Michael speaks at conferences, organises conferences, supports users of the APIs and delivers training.

Lisa van Gelder

Guardian News and Media

Lisa van Gelder is one of the Guardian’s senior web developers. Lisa has been developing software for 12 years and has been involved in building and scaling the Guardian’s main website as well as the comments system. Lisa has worked closely with Operations to diagnose and debug apps in production and is experienced in supporting the cleanup and diagnosis of major performance issues.