Tuesday, May 5, 2015

Chaos Monkey

Chaos Monkey is a great concept introduced by Netflix: it deliberately creates random failures in cloud server environments so that problems surface early and the system can be tested against unexpected failures. This helps make sure the system can recover from the kinds of failure that can happen in production.

The good news is that they didn't keep it secret. They published the tool, including its source, on GitHub. It's written in Java, so it is a little difficult for the .NET community to extend. There are several .NET ports of the tool, or attempts to implement the same idea.

The problem

Our company is no different from any other production environment: some issues are first reported in production, or only happen there. We have an online audit system with many backend processing components. The backend queueing system is built so that any application track developer can add a new queue type and write its associated handler. Although there are guidelines for backend developers on rolling back properly, developers sometimes neglect them to meet deadlines, and that code goes to production. In production we then end up with wrong data states. Some of the causes are unexpected IIS / AppPool recycling, database timeouts or outages, and so on. Since the application does not expect those states, it has no way to recover from them. In the end, support has to run SQL statements manually on the production servers to correct the state.

Ideally, QA should test scenarios such as an IIS reset while the backend services are running. But QA has limitations. It's very difficult to make sure they cover every scenario. Ensuring randomness, and tracing an abnormal event so that it can be reproduced in dev, is difficult. It is also hard to create abnormal scenarios manually when the tests run overnight.
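One way to get randomness that can still be traced and replayed is to derive the whole fault schedule from a single seed and log that seed with the test run. The following is only a minimal sketch in Python, not part of our tool: the fault names, `build_schedule` and its parameters are all hypothetical.

```python
import random

# Hypothetical fault names, based on the failure causes described above.
FAULTS = ["iis_reset", "app_pool_recycle", "database_timeout"]


def build_schedule(seed, count, max_gap_seconds=600):
    """Return (offset_seconds, fault) pairs that depend only on the seed."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    schedule = []
    offset = 0
    for _ in range(count):
        # Wait a random gap, then pick a random fault to inject.
        offset += rng.randint(1, max_gap_seconds)
        schedule.append((offset, rng.choice(FAULTS)))
    return schedule


if __name__ == "__main__":
    seed = 20150505  # log this value together with the overnight test run
    for offset, fault in build_schedule(seed, count=5):
        print(f"t+{offset}s: {fault}")
```

Running `build_schedule` again with the logged seed produces the same sequence, so an abnormal event seen in an overnight QA run could be replayed in dev.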

Solution

All of this leads to a testing strategy in which the issues / abnormal scenarios expected in production need to be created in dev/QA environments, so that we can see how the system behaves under those abnormal events and, where needed, change the application flow to recover from those states.

This is the point at which we started looking at the existing Chaos Monkey solution. It seemed suitable for us, but we could not use it straight away because Chaos Monkey is targeted at the cloud, while our production deployments are in-house.

Why extensible

So we tried to extend the tool. But since it's in Java and most of our developers are familiar with the .NET ecosystem, we started looking for a .NET port of Chaos Monkey. Unfortunately, we were not able to find much. There are some, but they are also targeted at the cloud. To restart a local IIS server, we had to write a lot of code.

So finally I decided to take one as a base and make it extensible to meet our scenario. I forked the ChaosMonkey implementation on GitHub by Simonmnro and added my own changes to make it pluggable.
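To illustrate what "pluggable" means here, the idea can be sketched as a small core that only knows a plugin interface, with each disruption living in its own class. This is a hedged illustration in Python, not the code of the fork itself: `ChaosAction`, `Monkey`, `register`, `unleash` and the two example actions are hypothetical names, and the actions only report what they would do.

```python
import random
from abc import ABC, abstractmethod


class ChaosAction(ABC):
    """Hypothetical plugin interface: one kind of disruption."""

    name = "unnamed"

    @abstractmethod
    def execute(self):
        """Cause the disruption and return a short description for the log."""


class RestartLocalIis(ChaosAction):
    name = "restart_local_iis"

    def execute(self):
        # A real plugin would restart IIS here; this sketch only reports it.
        return "would restart local IIS"


class RecycleAppPool(ChaosAction):
    name = "recycle_app_pool"

    def __init__(self, pool):
        self.pool = pool

    def execute(self):
        return f"would recycle app pool {self.pool}"


class Monkey:
    """Core that knows nothing about specific disruptions."""

    def __init__(self, seed=None):
        self._actions = {}
        self._rng = random.Random(seed)

    def register(self, action):
        # Plugins are keyed by name; registering the same name again replaces it.
        self._actions[action.name] = action

    def unleash(self):
        # Pick one registered action at random and run it.
        name = self._rng.choice(sorted(self._actions))
        return name, self._actions[name].execute()


if __name__ == "__main__":
    monkey = Monkey(seed=1)
    monkey.register(RestartLocalIis())
    monkey.register(RecycleAppPool("AuditPool"))  # "AuditPool" is a made-up pool name
    print(monkey.unleash())
```

With this shape, supporting an in-house scenario means writing one more `ChaosAction` subclass and registering it, without touching the core.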