How our new backup system saved 24+ hours of downtime

Remember when we announced our new infrastructure in October last year? One part of that innovation, of which we were particularly proud, was our backup/restore system, built in-house. A few days ago this system was put to its first critical real-life test, and the results were impressive. We were able to restore 3 times more data, 7 times faster, than in the previous such event, when we were still using the old backup solution. Here is how we did it.

How often do we need massive backup restores?

The short answer is: very rarely. Having a highly redundant infrastructure with multiple SSDs in RAID almost eliminates the need for such restores. Normally, when an SSD fails, it is seamlessly replaced with a new piece of hardware without any noteworthy downtime or data loss. And disk failures are very common: for a provider of our size, it is normal to see such events on an almost daily basis. However, every now and then, an unfortunate coincidence of several hardware and software failures at once can make the standard hardware replacement impossible. These are the times when we need to restore all the accounts that were on the damaged instance from our backup copies.

Previously, before our new backup system.

The previous time we needed to make a full backup restore of a whole shared hosting server was more than a year ago. Back then we were using R1Soft backup, which is among the most popular in our industry. Hosting providers like us use this software for two main reasons. First, it is quite reliable. We've almost never had any serious issues with missing or corrupt backups. And second, it is very lightweight and does not create significant load on the production servers while creating the backups (a resource-intensive process that takes place every day). With these two features R1Soft works perfectly 99% of the time -- when it creates the backups and when individual backup copies are needed.

However, on the rare occasions when a full restore of multiple accounts is necessary, R1Soft has one serious drawback -- the recovery process is painfully slow and the affected sites can experience a prolonged outage. In the event in question, all our affected accounts were down for 28 hours. It took this long for two reasons. First, R1Soft does not allow simultaneous restores from and to multiple locations. All the data needs to be recovered through one single network interface, and this is slow. Second, the recovery cannot be incremental, so the server instance is down during the whole restore process. All affected sites can only come back online at the same time, after all the data has been transferred from the backup server to the production machine. Therefore, even the smallest website could not be brought back up until we had restored the full server.

Most shared hosting providers would hardly consider this story a serious problem that requires further action once the restore is over. After all, only a single machine was affected and all customers got their websites back without data loss. The downtime of the sites was also almost negligible on a yearly basis: 28 hours is just 0.3% of the year. However, at SiteGround, we were quite unhappy with the duration of the issue and were determined to prevent this from repeating in the future.

And now, after our new backup system.

That’s how we set our minds on creating our own backup system to guarantee a faster restore process, and our talented DevOps department started working on it. We launched the new solution in October 2015, but it wasn’t until just a few days ago that we had to use it in an event similar to the one described above. Unlike R1Soft, the solution we used back then, our own system makes distributed backups and allows simultaneous restores from multiple backup instances to multiple production servers. Thus, we were now able to recover 4TB of data (nearly three times more than the previous time) in just 4 hours, compared to the 28 hours from the story above. Moreover, our system allows incremental recovery: the first accounts were up just a few minutes after the issue was identified, and the longest downtime (about 4 hours) affected only a few individual sites. This brought down the average downtime for all affected accounts to less than 2 hours, compared to 28 hours before. Quite an impressive improvement, isn’t it? But...
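To make the difference between the two restore models more concrete, here is a minimal Python sketch. It is a toy model, not our actual backup system: the account sizes, the throughput figure, the number of streams and the "smallest accounts first" ordering are all invented for illustration. It only shows why restoring over several streams, and bringing each account back as soon as its own data has arrived, lowers both the average and the longest downtime.

```python
from dataclasses import dataclass


@dataclass
class RestoreStats:
    average_hours: float  # mean downtime across all accounts
    longest_hours: float  # downtime of the last account to come back


def all_or_nothing(sizes_gb, gb_per_hour):
    """One stream; every account stays down until the whole transfer ends."""
    total = sum(sizes_gb) / gb_per_hour
    return RestoreStats(average_hours=total, longest_hours=total)


def incremental_parallel(sizes_gb, gb_per_hour, streams):
    """Several streams; an account is back once its own data has landed."""
    busy_until = [0.0] * streams  # time at which each stream becomes free
    finish_times = []
    for size in sorted(sizes_gb):  # assumption: smallest accounts go first
        i = busy_until.index(min(busy_until))  # pick the next free stream
        busy_until[i] += size / gb_per_hour
        finish_times.append(busy_until[i])
    return RestoreStats(
        average_hours=sum(finish_times) / len(finish_times),
        longest_hours=max(finish_times),
    )


# Hypothetical account sizes in GB and a hypothetical per-stream throughput.
sizes = [10, 20, 30, 40, 100, 200]
old = all_or_nothing(sizes, gb_per_hour=100)
new = incremental_parallel(sizes, gb_per_hour=100, streams=2)
print(f"all-or-nothing: avg {old.average_hours:.2f} h, max {old.longest_hours:.2f} h")
print(f"incremental:    avg {new.average_hours:.2f} h, max {new.longest_hours:.2f} h")
```

In this toy run every account waits 4 hours in the all-or-nothing model, while in the incremental model the smallest account is back after a few minutes and only the largest one waits the longest.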

Can it get even faster?

Yes, it can! In our latest massive restore case, we were actually not able to use the InfiniBand network connectivity between our backup servers and the production ones, as planned for such cases. Thus the data was transferred through the standard 1 Gbit/s network instead of over the 10 Gbit/s InfiniBand connection. This, we found, was due to a dormant hardware issue that we were able to discover only during an actual restore. However, we have already made sure that this will not be an issue next time, which will make the restore even faster.
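For a rough feel of what the faster link means, the small Python sketch below computes the theoretical minimum transfer time at line rate. It is only back-of-the-envelope arithmetic under stated assumptions: decimal units (1 TB = 10^12 bytes), no protocol or disk overhead, and a hypothetical `parallel_links` parameter, since the number of network paths actually used during the restore is not stated here.

```python
def transfer_hours(data_tb, link_gbit_per_s, parallel_links=1):
    """Lower bound on transfer time in hours, assuming full line rate.

    Assumptions: decimal units, zero overhead, and perfectly even use of
    `parallel_links` identical links (a hypothetical parameter).
    """
    bits = data_tb * 1e12 * 8  # terabytes -> bits
    bits_per_second = link_gbit_per_s * 1e9 * parallel_links
    return bits / bits_per_second / 3600


# 4 TB over a single link at each of the two speeds mentioned above.
print(f"1 Gbit/s:  {transfer_hours(4, 1):.1f} h")   # roughly 8.9 h
print(f"10 Gbit/s: {transfer_hours(4, 10):.1f} h")  # roughly 0.9 h
```

Real restores are slower than these line-rate figures because of storage and protocol overhead, but the tenfold ratio between the two links still carries over to the network part of the job.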

Another thing is that with the new system we can theoretically restore to an unlimited number of production instances simultaneously. In practice, however, we are limited not by the backup system itself but by the way our DNS system works at the moment. We had three instances affected by the issue, and each of them had its own DNS. Thus we had to restore to exactly three new instances using the old IPs, so that the domain names that are not registered with us could continue to work as before and would not experience additional downtime due to DNS propagation time. To avoid this limitation in the future, we plan to work on a brand new central DNS and/or proxy system.

Our backup system story is just another example of how we approach problems. We are never satisfied to just fix the immediate issue and forget about it until the next time. We take each problem as a challenge that needs a unique solution. And if such a solution does not exist at that time, we never shy away from inventing it ourselves.

I have been with SiteGround since it was born and it has always amazed me to watch this company grow and develop its unique personality. My rewarding and challenging job is to help SiteGround communicate its strengths in the best way possible, learn from its mistakes and become a better person, oops, I meant a better brand!

I was quite grateful for the daily backups a couple of weeks ago, when a configuration problem with my Drupal installation meant that upgrading a few modules completely clobbered 4 websites. I decided that my knowledge was not up to diagnosing exactly what happened, so I chose to restore from the previous backup. The man-machine interface is intuitive and the process completed very quickly.

I still have to work out how to overcome the config issue but at least the sites work again.

This is great communication! Many companies would shy away from admitting that real life applies to them too. But telling us the real story only builds confidence with us as your customers. Much appreciated and thank you for sharing!

I could not be happier with the service and the confidence in your ability to recover me if I ever need it. I am with you because my former company left me really hanging when THEIR server crashed. When they finally did get me back up, the site had no resemblance to what they had built less than a year before, and no one seemed to have any idea what my site had been 10 months previous! I really should have sued them, but was, and still am, too busy trying to restore my business! I still look for an expert in VirtueMart who can help me!

We have been with you for several years, resell your services, and routinely recommend you to our customers. Your customer service and tech support has always been exceptional. It is great to know that you are also developing new systems and solutions to help improve that service even more! Thank you for making our work day less stressful, and a little easier!!!

You are the best web host I've ever used and my previous one was very good so that's saying a lot. You aren't the best because you don't make mistakes — you're the best because of how you handle your mistakes and customer problems. You could have done what every other web host does and said, "That's how it works. There's nothing we can do about it. Your site will be up in 28 hours. Take a chill pill." Instead, you felt your customers' pain and said, "How can we do it better?" And you did it! Awesome.

I love the approach. I love even more the transparency that goes with sharing your process with the world. Most companies pretend problems never happen so as not to undermine customer confidence. Any sensible person knows this for the fallacy that it is.

I love companies that show the world the challenges that they face and how hard they work to resolve them and avoid them in the future.

I will echo the sentiment from above... I am a proud and happy client.

Very nicely done! I am in the middle of completing phase 1 of a municipal fiber optic network.
I have been preaching the "3 R's" = Redundancy, Restoration, and Resiliency through our first phase. It is great to hear about SiteGround's commitment to keeping things up and running with the ability to bounce back when the proverbial stuff hits the fan.

THIS is just one reason I love SiteGround! The hosting company I previously used went down for long periods of time (several days in some instances) and all they could say was they had "every man on deck" working on the problem. PLEASE, SiteGround, NEVER SELL OUT TO ENDURANCE INTERNATIONAL!!!

Stories like this are why I'm such an enthusiastic customer and promote your service whenever we bid on a project. You are a big selling point for us when we write our proposals. I sleep better at night knowing my two VPS's with you will be working when I wake up. 🙂

This really shows how dedicated you are in providing the best service for your customers.
I've tried a couple of hosts, but none comes even near to the amazing service that SiteGround offers. You are the only one I can recommend wholeheartedly! I'm grateful that I've found you!

Since we moved our customers' sites to you, we have finally found a competent and reliable hosting provider, in terms of uptime, speed, and the competency of your tech support. Go on like this. Congratulations.
Andrea Gallucci, MD DIGITHAI Software Group.

People have already described how happy and secure they feel with your solution. What else can I say!
I'm really proud to be one of your new customers... keep it up. Keep your eyes on the future, with visionary solutions that can help us all and protect against downtime and losses.

I have to say, you really can't appreciate a good web host until you've had a bad one. My previous web host provided little support and my Magento installation ran painfully slow. I was apprehensive about switching to SiteGround initially because of my bad experience. Since I made the switch to SiteGround, support has been excellent, Magento runs at least 5 - 10 times faster than before, and I have experienced virtually no downtime. My previous host was down several times (sometimes for days) during the year and a half or so that I had them.

I've been recommending SG since I moved all my sites to them, and my recommendations are not only because I am a customer, but also because I believe every business owner deserves a good home for their business website, with a hosting company that not only provides the best hosting plans, hardware, and support, but also thinks ahead by looking at various ways to improve their service and make our life easier. Did I mention that I've been with previous hosting companies and SG is the only hosting company that goes above and beyond the call of duty to make sure everything is running smoothly, and provides assistance even when the issue is not their doing?

Because of our reviews, comments and recommendations of SG all over the web, we are constantly contacted by people asking us why we strongly recommend SG, and whether we are getting some huge kickbacks from SG. We tell them that our reviews are honest and unbiased, and that they can try SG for themselves and find out how good SG (the whole team) is.

Your story does concern me. 4TB!!! As word gets out you will have to start thinking about 1PB or even 1EB. God help us if it even gets to 1ZB or even 1YB. Keep up the great work and you will get so much bigger. Well done everyone.

Hi, Dan and thanks for the great question 🙂 On our shared hosting servers our clients may use the backup restore tool in cPanel to manage their backups and restore data. For more details check this tutorial:

On our cloud servers, however, only our support team can restore a backup for you. We are in the process of implementing the same backup/restore tool for our cloud customers and once it is ready it will be installed on all cloud servers.

I was one of the customers who had a website on the affected equipment, and I think I must have been near the end of the back-up process, but well done. I'm glad I wasn't at the end of a 28 hour process! I'm always impressed by SG.

Glad to know the progress SG is making, and I am quite happy that I moved to SiteGround in January 2016 to host my blog, Ease Bedding.

Now I want my blog to be more secure with daily backups, which I don't have on the StartUp plan. The second problem I am facing is a CPU usage increase. Please let me know how I can solve this problem, so that I can get daily backups and have no worries about the CPU increase.

Hello Jaswinder, thank you for the kind words! While on the StartUp plan, you could take advantage of daily backups by signing up for our Backup subscription service: https://ua.siteground.com/daily_backup.htm
Alternatively, you could upgrade to a GrowBig or GoGeek plan. Both include the Backup subscription by default, allow more CPU executions, and give you many additional features.
