
3 Powerful Platforms, 1 Automated Disaster Avoidance Response

DevOps Gameday at LISA ’10 demonstrated how a tough problem can be solved using the strengths of three powerful software platforms.

Our scenario was as follows: on an otherwise healthy network, an EC2 cloud server goes down, degrading overall network performance. Zenoss network monitoring, using a simple health-check ping, detects the failure and automatically invokes a shell script that calls into Opscode’s Chef system. Chef then takes down the failed server and builds a new one as a replacement.
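A minimal sketch of that glue script is below. The script structure, the “role[base]” run list, and the knife-ec2 invocations are assumptions for illustration, not the demo’s actual code; adapt them to your setup.

```shell
#!/bin/sh
# Hypothetical sketch of the glue script Zenoss invokes on failure.
# Usage: replace_server.sh <failed-ip> <ec2-instance-id>

# One ping with a 2-second timeout; non-zero exit means the host is down.
still_down() {
    ! ping -c 1 -W 2 "$1" > /dev/null 2>&1
}

# Tear down the failed instance, then build a replacement carrying the
# same run list ("role[base]" is an assumed role name, not from the demo).
replace_server() {
    knife ec2 server delete "$2" --yes &&
    knife ec2 server create --run-list "role[base]"
}

# Re-check before acting, so a transient blip doesn't rebuild a live box.
if [ "$#" -eq 2 ] && still_down "$1"; then
    replace_server "$1" "$2"
fi
```

The re-check before acting is the important design choice: the monitor’s alert and the rebuild are decoupled, so a false alarm costs one extra ping rather than a terminated instance.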

For this new server to be of value, the world must know it is there. That is where DynECT Managed DNS comes in: using our REST API, it automatically routes users to the new server by replacing the failed IPv4 A record.
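With the DynECT REST API, that update is a short sequence of HTTPS calls. The paths below follow Dyn’s API v3 conventions; the zone, record ID, token, and credentials are placeholders.

```shell
# 1. Open a session; the JSON response contains an Auth-Token
#    that authenticates the calls that follow.
curl -s -X POST https://api.dynect.net/REST/Session/ \
  -H "Content-Type: application/json" \
  -d '{"customer_name":"XXXXX","user_name":"XXXXX","password":"XXXXX"}'

# 2. Point the A record at the replacement server's address.
curl -s -X PUT "https://api.dynect.net/REST/ARecord/example.com/www.example.com/<record-id>/" \
  -H "Content-Type: application/json" -H "Auth-Token: <token>" \
  -d '{"rdata":{"address":"203.0.113.10"}}'

# 3. Publish the zone so the change goes live.
curl -s -X PUT "https://api.dynect.net/REST/Zone/example.com/" \
  -H "Content-Type: application/json" -H "Auth-Token: <token>" \
  -d '{"publish":true}'
```

Nothing is visible to resolvers until the explicit publish in step 3, so a script can stage several record changes and push them atomically.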

This self-healing system worked beautifully, showcasing how a system administrator could save hours of manual work and eliminate long-lasting poor network performance by setting up a few scripts that leverage these powerful tools. With the proof of concept now a reality, the door has been opened to expand this scenario.

Large-Scale Networks

Instead of a simple two-server-and-master-database setup, these same concepts can be expanded to monitor a large-scale network. Server viability would be judged by better monitoring metrics than a simple ping test, such as processor load, response-time thresholds, or whatever metric matters most to the implementer. Responses can then be tiered: a single failed server might require a system administrator to reply to a page before the replacement starts, while a large-scale outage, such as a 10% loss of servers network-wide, triggers an automatic response.
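That tiered policy can be sketched in a few lines of shell. The 10% cutoff matches the example above; the function names and output strings are invented for illustration.

```shell
#!/bin/sh
# Sketch of the tiered escalation described above. The 10% cutoff comes
# from the article; the function names and output strings are invented.

# True (exit 0) when failures exceed the auto-replace threshold percentage.
mass_outage() {
    total="$1"; failed="$2"; threshold_pct="${3:-10}"
    [ $(( failed * 100 )) -gt $(( total * threshold_pct )) ]
}

# Decide the response for a given fleet size and failure count.
handle_failure() {
    if mass_outage "$1" "$2"; then
        echo "auto-replace"    # large-scale outage: rebuild immediately
    else
        echo "page-admin"      # single failure: page an admin to confirm
    fi
}
```

For example, `handle_failure 100 25` chooses automatic replacement, while `handle_failure 100 1` pages an administrator instead.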

This would save users from ever seeing a degradation in service and allow for 100% uptime no matter what unexpected catastrophe may hit. Best of all, because the solution is built on automation and open source tools, it can be tailored to each company’s individual needs.

The LISA conference spurred a great deal of discussion around these concepts and quite a bit of buzz among attendees. The hope is that a community of users will take this idea and run with it, creating their own uses that they can share with others. The base code from the demo is available on GitHub, making community code sharing easy.

Getting Started In Seven Easy Steps

1. Have an EC2 account with the correct private keys to access it via SSH.

2. Register at Zenoss and at Opscode so that you can download their respective tools.

3. Have an Ubuntu server to use as the host/monitor (the demo ran everything on one). Download and install Zenoss monitoring and Opscode’s Chef on it; Opscode in particular has a great ‘Getting Started’ wiki to help you set that up.

4. Add the cookbooks available at this GitHub repo to your Chef install (using Chef’s knife with the cookbook upload command) along with the roles, remembering to change the XXXXX placeholders in the base role to the correct credentials for your Dyn and WordPress setup.
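The cookbook and role upload might look like the following; the role file name is an assumption, so substitute whatever the repo actually ships.

```shell
# Upload every cookbook in your local repository to the Chef server,
# then load the edited base role (file name assumed).
knife cookbook upload --all
knife role from file roles/base.rb
```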

5. You should now be able to bring up a new server from the knife command line, filling in all the correct parameters to the knife EC2 create call.

6. Once that has been verified, add the script and template from the “scripts” directory to the bin directory of the Zenoss monitoring install.
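The knife EC2 create call might look like this; the AMI ID, flavor, and key names are placeholders for your own.

```shell
# AMI, flavor, and key names below are placeholders for your own setup.
knife ec2 server create \
  --image ami-XXXXXXXX \
  --flavor m1.small \
  --ssh-key my-ec2-key \
  --identity-file ~/.ssh/my-ec2-key.pem \
  --run-list "role[base]"
```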

7. In Zenoss, set up your alert and alert command to call the script with the IP address and EC2 instance ID, which is stored as the hardware ID tag of the Zenoss device.
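In the alert command itself, Zenoss TALES expressions can supply those two arguments. The script path is a placeholder, and ${dev/hw/tag} is an assumption for the hardware ID tag attribute; check where your install actually stores the instance ID.

```shell
# Hypothetical Zenoss alert command; ${dev/...} are TALES expressions
# expanded by Zenoss at alert time. Path and tag attribute are assumed.
/path/to/zenoss/bin/replace_server.sh ${dev/manageIp} ${dev/hw/tag}
```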


Kevin Gray is an employee at Oracle Dyn Global Business Unit, a pioneer in managed DNS and a leader in cloud-based infrastructure that connects users with digital content and experiences across a global internet.