Atlanta Data Center Power Outage

Jul 10, 2013

The recent power failure has reinforced two things for us: an amazing customer base has gathered around this company, and a terrific, dedicated team is in place here. As some of you already know, JaguarPC recently experienced an extreme power loss event that resulted in downtime for our clients’ servers.

WHAT HAPPENED?

The event began on June 29th with an initial power failure that was quickly resolved. A second power failure occurred on June 30th. It was mitigated when the standby generator was activated, and servers began coming back online. The standby generator system then failed, leaving the data center without power. A roll-in generator was then activated and power was restored.

In our setups, the system is supposed to fail over to battery backup for the few seconds to minutes it takes the generators to start. The battery backups then hand off to the generators, which carry the load until grid power returns, at which point the system fails back to the normal power grid. Obviously, this is not what transpired during this event.
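
For readers who want the intended failover order spelled out, it can be sketched as a simple priority rule: use the grid when it is up, otherwise the generator once it is running, otherwise the battery backup. This is an illustrative sketch only, not the data center's actual control logic; the names `Source` and `select_source` and the three boolean inputs are hypothetical.

```python
from enum import Enum


class Source(Enum):
    GRID = "grid"
    GENERATOR = "generator"
    BATTERY = "battery"
    NONE = "no power"


def select_source(grid_ok: bool, generator_ready: bool, battery_charged: bool) -> Source:
    """Pick the power source in the intended priority order (hypothetical model)."""
    if grid_ok:
        return Source.GRID          # normal operation, or fail-back after an outage
    if generator_ready:
        return Source.GENERATOR     # generator carries the load until the grid returns
    if battery_charged:
        return Source.BATTERY       # battery bridges the gap while the generator starts
    return Source.NONE              # every layer has failed


# Intended sequence: (grid_ok, generator_ready, battery_charged) at each step.
timeline = [
    (True, False, True),    # normal operation
    (False, False, True),   # grid fails, generator still starting
    (False, True, True),    # generator is running
    (True, True, True),     # grid power is back
]
sources = [select_source(*step) for step in timeline]
```

In this model, the outage described above corresponds to the last branch: with the grid down, the generator failed and no battery capacity left, no source remains.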

WHAT WAS DONE TO RECTIFY THE EVENT?

Once all systems were restarted, it soon became evident that the cloud had been damaged. Our Chief Technical Officer immediately deployed our disaster recovery plan and started a series of procedures. This plan was already in place and set out how an event of this type would be handled should it ever occur. Full backups were on hand, an emergency team was sent to the data center immediately to replace the disks on our SANs, and our backup and restoration team was called in to get servers and data back online as soon as possible.

The cloud went down because the multiple outages caused the A and B power feeds to cycle on and off many times, which led to multiple disk failures. Our disaster recovery (DR) plan, which was in place for this very scenario, worked smoothly, and data restoration efforts began in earnest.

Our emergency team was able to restore the disks, rebuild the cloud and begin its restoration within a few hours. Our staff, our disaster recovery plan and its supporting systems performed very well and covered everything this event produced.

Most of the servers had been restored by early Monday morning. The exceptions were a couple of hypervisors on the cloud, which the backup team approached with resourceful and diligent tactics to retrieve all the data.

WHAT IS BEING DONE TO ENSURE THIS TYPE OF EVENT WON’T OCCUR AGAIN?

We are currently working on a plan to make the setup more fault tolerant. The network design issue that took down other data centers when a core network cabinet failed at another location is being actively addressed. We are also increasing capacity, improving network response times and providing greater redundancy.

We are also adding layers of UPS (Uninterruptible Power Supply) and more fault-tolerant arrays to our SANs for the cloud. These changes will allow us to absorb more simultaneous failures. They will also let us power the system down cleanly if that is ever needed again, rather than leaving it exposed to a major power failure. Essentially, we will make sure that more systems are in place so this type of power failure can never happen again.

100% UPTIME NETWORK COMPENSATION IS BEING OFFERED

An event of this magnitude certainly doesn’t sit well with our customers, but many of you have been beyond patient and understanding, and for that we express our sincere thanks. In light of the recent event, affected clients are being offered an SLA credit under the 100% Uptime Guarantee. Please submit a ticket to our Customer Service Department and we will calculate the outage and credit your account. We ask for your patience as we process the requests. We can assure you that compensation will be issued, but it may take some time to work through all of the requests.

We do hope this summary helps our clients understand the sequence and nature of the events that transpired. It should also give you peace of mind that this is unlikely to happen again, because of the measures we are taking and the technology we are continually adding to our systems. It is our goal to continue to create the best hosting environment possible.