Head in the cloud – feet on the ground

In the last couple of years we have been witnessing a tendency of clients moving their in-house IT systems to the cloud. We argue that restoring lost data, whether by employees through their IT department or by IT departments through data-recovery labs, is becoming a non-trivial task. Moreover, individuals relying on SaaS providers (e.g. GMail, Facebook, Twitter, Salesforce) are unaware of the risks of losing their cloud data and find themselves contacting data-recovery labs for assistance, despite the latter’s inability to assist in such matters. We discuss several key factors that customers of such cloud services should check against their own requirements, and elaborate on some real-life examples where restoring and recovering the lost data is challenging.

Introduction

In the last couple of years we have been witnessing a trend of moving internal IT systems to the "cloud": the delivery of computing as a service rather than a product, whereby shared resources, software, information and systems are provided as a utility over the internet.

The main motivations for this process are to increase the efficiency of the IT department through cost savings and improved management. The typical and reasonable assumption is that the availability of the data in the cloud will be as good as the availability of the company’s systems before the move to the cloud. Indeed, cloud service providers carefully define the SLA for the availability of the cloud based service in their offering, but what about the availability of the data and its backup (and restore) policy?

Theory

The typical CIO and IT manager are well aware of the complexities of backing up complex IT systems. These are compounded by the difficulties of restoring onto a live system, performing the restore in minimum time, and periodically testing the validity of the backups. In fact, the complex nature of the backup and restore processes is one of the motivations for moving to the “cloud” in the first place. The complexity is driven by various elements:

The size of the data to back up and restore is growing very fast.

The frequency of the backup required by the users is growing – as the data is changing and accumulating more quickly.

The duration of the backup is growing, and the backup process is becoming more complex.

The backup process of a live system ("Hot backup") is adding further complexity to the underlying system and the backup process.

The growing costs of backup software licenses and equipment.

The act of migrating the data from the organization to the cloud service provider (together with the responsibility to back up the data and restore it when necessary) might give a false sense of safety. It is somehow assumed that all the difficulties related to backing up and restoring the data are suddenly gone, and that the service provider will overcome these issues perfectly.

Unfortunately, this is not the case.

We have encountered numerous cases in which cloud service providers suffered data loss without being able to recover properly: they either took many hours to bring the system and its data back online, or failed to restore some of the user data altogether. Several such cases have recently been in the news (see below).

On some occasions, the restoration procedure fails and data-recovery companies are asked to assist in recovering the client’s data. Such cases, however, introduce further complexity. Since shared resources are used to serve multiple clients, utilizing the services of data-recovery labs might affect other users’ resources (e.g. when some storage components need to be taken apart and examined) and as such might be more harmful than useful.

The inherent difficulty in designing, implementing and testing the backup and restore of a typical SaaS (cloud) provider is increased further by the variance in the requirements of their different customers. For example, in the case of a data-loss incident (human error, virus, physical malfunction, sabotage etc.) the cloud service provider is required to restore the latest working backup (assuming such a backup really exists), say a backup that was made twelve hours prior to the incident. But is a twelve-hour-old backup good enough for the client? The answer depends heavily on the customer’s expectations. While a small domestic company might be okay with losing twelve hours’ worth of emails (or not even notice the missing emails), a large multinational company will surely notice the missing emails, some of which might be critical to its workings.

Is the backup and restore policy of the cloud service you are using adequate for your company’s needs? Here are some categories you might want to look into:

- Data Retention Period – for how long is the backed-up data kept by the service provider? A day, a week, a month, a year? The right answer depends, of course, on the nature of the data, its rate of change, its importance, regulation and many other factors, all of which are specific to your organization. But what is the cloud provider actually doing?

- Backup Frequency - how frequently does the backup take place? Is it an ongoing backup? Every hour? Every 12 hours? Once a week? The growing rate and complexity of the data, combined with the desire to save all the information together with the meta-data, makes a high-frequency backup rather challenging. Advanced data storage devices contain sophisticated mechanisms to ease this task (for example snapshots and storage virtualization), but they are not a complete solution: they are complex to manage, costly, and might not scale well.

- Backup policy and disaster recovery plan (DRP) - cloud service providers that hold valuable or sensitive data and are aware of the risk of losing information prepare for possible data loss by putting in place a set of backup procedures and a disaster recovery plan covering a common set of disaster scenarios. These plans are prepared by a domain expert and should be in line with the customer’s requirements, especially regulatory requirements (if any exist), so that the customer can recover all their data and, even more importantly, so that the restore time is minimized. Typically, in the case of a disaster the panic and confusion are great, and the time between the disaster and the successful restore is critical. Since these restore processes are complex, the cloud service provider should rehearse the recovery process to find and mitigate possible errors in it.

Other factors should also be taken into consideration: is there an off-site backup of the data (and how far away is it)? Which backup and storage technology is the cloud service provider using, and how reliable is it? These factors can help evaluate the maturity of the cloud service provider, and it is highly recommended to verify them with the service provider, to make sure that the company’s demands for data availability are matched by the service provider’s capabilities.
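The categories above can be written down as a simple comparison between what a provider offers and what a company needs. The following is a minimal sketch only: the class names, fields and example figures are assumptions made for illustration, and a real evaluation would rest on the provider's actual SLA and documentation.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class BackupPolicy:
    """What the (hypothetical) cloud provider says it does."""
    backup_interval: timedelta   # time between two consecutive backups
    retention_period: timedelta  # how long each backup is kept
    offsite_copy: bool           # is a copy stored off-site?
    restore_tested: bool         # is the restore process rehearsed?


@dataclass(frozen=True)
class Requirements:
    """What the (hypothetical) customer needs."""
    max_data_loss: timedelta     # largest tolerable window of lost data
    min_retention: timedelta     # shortest acceptable retention period
    needs_offsite_copy: bool


def policy_gaps(policy: BackupPolicy, req: Requirements) -> list[str]:
    """Return the categories in which the policy falls short of the requirements."""
    gaps = []
    # Worst case: the incident happens just before the next backup,
    # so up to one full backup interval of data is lost.
    if policy.backup_interval > req.max_data_loss:
        gaps.append("backup frequency")
    if policy.retention_period < req.min_retention:
        gaps.append("retention period")
    if req.needs_offsite_copy and not policy.offsite_copy:
        gaps.append("off-site copy")
    if not policy.restore_tested:
        gaps.append("restore testing")
    return gaps


# Illustrative figures only: a provider backing up every twelve hours.
provider = BackupPolicy(
    backup_interval=timedelta(hours=12),
    retention_period=timedelta(days=30),
    offsite_copy=True,
    restore_tested=True,
)
small_company = Requirements(
    max_data_loss=timedelta(hours=24),
    min_retention=timedelta(days=7),
    needs_offsite_copy=False,
)
multinational = Requirements(
    max_data_loss=timedelta(hours=1),
    min_retention=timedelta(days=365),
    needs_offsite_copy=True,
)

print(policy_gaps(provider, small_company))   # []
print(policy_gaps(provider, multinational))   # ['backup frequency', 'retention period']
```

The same provider policy is adequate for one customer and inadequate for the other, which is why the comparison has to be made against each company's own requirements.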

Practice

In this section, we analyze common data-loss cases and compare the possible solutions when the service is delivered as a cloud based service with the “traditional” approach of in-house IT systems and services. All of the cases shown are real-life cases of real people from the last year (2011-2012). We are witnessing more and more such cases, as the shift to cloud based services is on the rise.

“… I’m using cloud based service for my company’s web site (they have templates for flash and nice content). I accidentally overwrote parts of my site content with some old dummy content. I didn’t save the site content on my computer. How do I restore the site content back? “

Fig 1: Web site collision

In the traditional IT world, the answer for such a case is relatively straightforward: contact your content/web admin, and ask her to restore the disk’s content/site content to the latest backup. The backup software usually provides a simple interface which allows for quick detection of the modified files and allows restoring them to their original location.

In the cloud based service world however, things might be trickier. Not all web services save backups as snapshots “per user”, not all of them provide the user with the functionality of selecting which files to restore (selective restore), and finally not all of them give the ability to restore pages derived from templates owned by the site. In this case the restore operation has to be executed manually, on a per-page basis, during a long downtime of the site.
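To make "selective restore" concrete, here is a minimal sketch of what backup software does in the traditional case: detect which files differ from the backup and restore only those. It assumes the backup is a plain directory tree mirroring the live site, which is an illustrative simplification; the function names are hypothetical and do not refer to any particular backup product.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file's content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def modified_files(backup_dir: Path, live_dir: Path) -> list[str]:
    """List backed-up files that are missing from, or differ in, the live tree.

    Files that exist only in the live tree are not reported.
    """
    changed = []
    for backup_file in sorted(backup_dir.rglob("*")):
        if not backup_file.is_file():
            continue
        rel = backup_file.relative_to(backup_dir)
        live_file = live_dir / rel
        if not live_file.is_file() or file_digest(live_file) != file_digest(backup_file):
            changed.append(rel.as_posix())
    return changed


def selective_restore(backup_dir: Path, live_dir: Path, rel_paths: list[str]) -> None:
    """Copy only the chosen files from the backup to their original location."""
    for rel in rel_paths:
        target = live_dir / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(backup_dir / rel, target)
```

A user who overwrote a single page would call `modified_files` to find it and `selective_restore` to bring back just that page, leaving the rest of the site untouched. A cloud service that offers neither per-user snapshots nor this kind of selection leaves only the manual, page-by-page route.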

“ … Someone took over my email account and deleted everything !”

Fig 2: Account hijack

In the traditional IT world, the email content is backed up and can usually be restored to the last backup relatively quickly. The password of the account will be reset and the user can quickly get back to normal operation (several hours of email might be lost, but that’s usually acceptable). If several hours’ loss is unacceptable, or the backup is not working altogether, a data-recovery company might be contacted and asked to assist in recovering the last backup, or recovering from the media that contained the email data before it was deleted.

In the cloud this might be much more complicated. Most email providers by default will not allow you to restore deleted emails (that is, if they were “permanently” deleted by deleting them from the trash). Organizations and individuals can purchase archiving services (for example from vendors like http://www.google.com/postini/ for Gmail, or from other third-party vendors) to overcome this issue, but in many cases they do not, as they are unaware of the risks. If the organization is subject to regulations (e.g. SOX, PCI etc.) or is in the midst of a legally binding process (such as E-Discovery during a trial), having no email archive means having no access to deleted emails.

Finally, a hijacked email account that is not part of a domain (or that is an admin account in a domain) might be difficult, if not impossible, to regain. Some methods exist to recover a hijacked account, but if the proper measures have not been taken beforehand, chances are that your email account, with its data, is gone forever, as there might not be a way to distinguish the real owner from the “new” one.

“ … If you've deleted a message permanently, by clicking Delete Forever in your Spam or Trash, you won’t be able to recover the message using the Gmail interface. In the past, users have reported that they are missing all of their messages as a result of unauthorized access. If your account was compromised and you would like us to investigate whether recovery is possible, please first complete this process to secure your account and then file a report. “ (taken from the official Google web site).

“… I lost my password to my cloud based ERP account, cannot recover it, I need to approve some invoices fast. What can I do?”

Fig 3: Lost password

Traditional IT world: use the password-reset procedure. A quick call to the help desk resets your password (authenticating via phone if nothing else works). If in urgent need of support, escalate via phone.

Cloud based world: Indeed, no data is lost at the cloud based service; the only thing that needs recovering is the user’s password. However, instead of talking to internal IT, the user now needs to deal with the service provider in an out-of-band procedure (and most providers offer email support only). This is not simple if there is a difference in service hours, a serious language barrier, or strict provider procedures that are not part of the customer’s corporate culture. When time is scarce, as in the case above, the restore process can be quite painful.

“ … I’m using a cloud based CRM. We have data with cross relationship. I accidently added some garbage data, and noticed only after few weeks. How can I clean it? …”

Fig 4: CRM case

First, let’s explain this scenario. The main issue here is data corruption by the user, in most cases accidental. From the service provider’s standpoint, the system was working just as it should have. From the user’s standpoint, however, a restore is required. The situation is trickier still since cross relationships are involved. We will use a naïve example to illustrate this. Think of a “customer” table, where each customer has a car of some model, and the model is taken from the “car models” table. This is the simplest form of a “primary key” <-> “foreign key” relationship. Now assume that the user added some corrupted (garbage) car models, and then added some customers with cars of these models.

After some time (usually days to weeks) the corrupted data is discovered. Now the user is faced with a problem: how to roll back, and which data? They cannot simply restore the car models table, since some customer records link to it. A lot of data has already been changed in the system (added/removed/changed), so a full rollback is not an option either. The only approach left is a manual, tedious analysis and repair of the relevant data using current and older snapshots of the entire data.
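The customer and car-model scenario can be reproduced in a few lines. The sketch below uses an in-memory SQLite database purely for illustration; the table names, column names and sample rows are assumptions, and a real CRM schema is far larger. It shows why the garbage model cannot simply be removed while customer records still reference it.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create the two-table example with one valid and one garbage car model."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE car_models (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE customers (
            id           INTEGER PRIMARY KEY,
            name         TEXT NOT NULL,
            car_model_id INTEGER NOT NULL REFERENCES car_models(id)
        );
    """)
    conn.executemany(
        "INSERT INTO car_models (id, name) VALUES (?, ?)",
        [(1, "Sedan"), (2, "asdf-garbage")],
    )
    conn.executemany(
        "INSERT INTO customers (id, name, car_model_id) VALUES (?, ?, ?)",
        [(1, "Alice", 1), (2, "Bob", 2)],
    )
    return conn


def try_delete_model(conn: sqlite3.Connection, model_id: int) -> bool:
    """Try to delete a car model; return False if other records still link to it."""
    try:
        conn.execute("DELETE FROM car_models WHERE id = ?", (model_id,))
        return True
    except sqlite3.IntegrityError:
        return False


def customers_using(conn: sqlite3.Connection, model_id: int) -> list[str]:
    """Names of the customers whose car points at the given model."""
    rows = conn.execute(
        "SELECT name FROM customers WHERE car_model_id = ? ORDER BY id",
        (model_id,),
    )
    return [name for (name,) in rows]
```

Here `try_delete_model(conn, 2)` is refused until every customer returned by `customers_using(conn, 2)` has been repaired by hand, which is the manual analysis described above in miniature.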

This is an oversimplified example of course. Think of a typical CRM system, where the tables lying “underneath” the system are complex, rich with fields, and contain a mesh of cross relations. The restore task in a real life case of data corruption is non-trivial at best.

In both the “traditional” IT and the cloud based world this is not an easy task. But whereas in the “traditional IT” world, the IT department will provide you with complete backups from different time periods (depending on the retention policy defined by the organization to be the right one for it), the cloud provider will not grant you this data. It might show you some history of changes for certain fields or tables, but these would be limited and difficult to work with and will not allow you to repair a large set of errors.

Similar recovery scenarios will also be required when intentional data deletion (be it due to a malicious user or a cleanup process gone awry) or data corruption occurs in your CRM data (e.g. when an integrated system with a bug introduces unexpected data or otherwise corrupts existing data).

For complete snapshots of your CRM data you would have to use third-party tools such as OwnBackup - http://www.ownbackup.com - which provides nightly snapshots of, e.g., Salesforce CRM data elements.
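Once complete snapshots from different dates are available, the analysis typically starts by comparing them. The sketch below assumes each snapshot has been exported to a mapping from record ID to field values; this is a hypothetical format chosen for illustration, not the export format of any particular vendor or tool.

```python
from typing import Any

Record = dict[str, Any]
Snapshot = dict[int, Record]  # hypothetical export format: record ID -> field values


def diff_snapshots(
    old: Snapshot, new: Snapshot
) -> tuple[Snapshot, Snapshot, dict[int, tuple[Record, Record]]]:
    """Compare two snapshots and return (added, removed, changed) records."""
    # Records present only in the newer snapshot.
    added = {key: new[key] for key in new.keys() - old.keys()}
    # Records present only in the older snapshot.
    removed = {key: old[key] for key in old.keys() - new.keys()}
    # Records present in both, paired as (old value, new value) when they differ.
    changed = {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    return added, removed, changed
```

Comparing the last snapshot taken before the garbage data was entered with the current one narrows the search to the records that were added or changed in between. Deciding which of those changes are legitimate still requires human judgment.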

Google's Gmail had a glitch introduced that caused 30,000 users or so to lose email, chat and contacts from their Gmail accounts. The cause appears to be a bug in a software update.

Fig 5: Gmail is down

For various reasons, some of which are mentioned above, cloud based services are subject to malfunction and downtime, and in some cases to data loss. Examples of cloud-service malfunctions resulting in data loss are not as rare as people might think. Here are some recent examples:

Summary

The growing dependency of modern companies on digital information, combined with the trend of moving IT systems to the cloud, requires a close inspection of the backup and restore policy of the cloud based service and its vendor. It is highly recommended that a customer of such a cloud based service verify with the vendor that their data availability requirements are matched by the abilities of the provider. An alternative is to use third-party backup solutions that match the needs of the customer and ensure that a backup of their own is available should the need arise.