(Note: This summary was performed prior to Keith Townsend joining VMware as a full-time employee. VMware’s SRM solution was not included as part of the research. This work is a draft of the research intended for publishing by a major research firm. While incomplete, The CTO Advisor believes the content adds value to the overall discussion of modern data protection.)

Summary

Data has become the most important asset within the enterprise. Enterprise IT carries the burden of protecting that asset and ensuring it remains available and accessible. As with any asset, the ability to allocate and move data where it is needed is critical. However, data has gravity, which means it cannot be moved to where it is needed as quickly as monetary assets can.

Data protection products have evolved from the backup category to a category that enables the protection and mobility of data. This report will examine the essential functions of modern-day data protection products and provide context to the relationship between data protection and data management.

Report Key Findings

- Backup or data protection is a crucial enabler for advanced data management functions such as copy data management and regulatory compliance.

- Data deduplication (dedupe) has emerged as one of the most critical junctions between data protection and data management. Dedupe powers data mobility and enables cloud-based disaster recovery.

- Data protection extends backup sets to practical use for disaster recovery, with recovery point objectives (RPO) of 15 minutes or less and recovery time objectives (RTO) approaching 15 minutes when leveraging automation.

Relationship to Data Management

Data protection is a critical application within the broader data management category. Due to the increasing realization that data fuels business decisions and capability, data management has become a vital part of an enterprise IT strategy. Data management enables a number of business capabilities within the enterprise. Modern data protection products perform critical functions within the data management strategy. Data management is inclusive of the following capabilities:

- Copy data management

- Data mobility

- Access control

- Backup and recovery

- Disaster recovery

- Data lifecycle management

While this report centers on data protection, many of the capabilities highlighted are a direct result of the innovations in data protection products.

These data management-focused advancements in data protection enable storage teams, business continuity teams, and cloud architects to leverage backups in ways not previously available. For example, cloud architects are leveraging the tiering and data mobility features of data protection products to back up data to public cloud storage. That data mobility enables operations teams and developers to take advantage of cloud-based resources for on-demand compute, including machine learning and disaster recovery.

There’s a wide range of capability across the data protection products covered in this report. Some of the newer products aim to cover the majority of data management capabilities. Some of the traditional players in this space look toward a best-of-breed approach to integrating data protection and data management. This report will review the techniques and resulting data management capabilities of modern data protection solutions.

Defining Data Protection

Data protection is a broad term. Software-defined storage companies, security firms, and software companies all lay claim to the phrase. For this report, data protection is the process and capability that enables the backup and recovery of data stored on primary storage systems. The related topic of protecting data in transit over IP data networks is outside the scope of this report. Products covered in this report fall into three categories. Some products may offer versions in more than one category.

Software only – Software-only solutions include traditional applications such as CommVault, Veritas, and Avamar. Software-only solutions may also be installed on Infrastructure as a Service platforms.

Software as a Service (SaaS) – Cloud-based backup includes platforms such as Druva Inc., Rubrik Datos, and Carbonite. SaaS solutions differ in that the SaaS provider maintains the infrastructure.

Appliances – All-in-one appliances that include both the storage and the software required for data protection. Products include Datrium DVX, Cohesity, and Rubrik Inc.

With so many products on the market, it’s essential to define a baseline for a modern data protection suite beyond the obvious backing up and restoring of data.

Metadata

Metadata is the foundation of modern data protection platforms. Metadata is an essential enabler to leveraging the explosion in the amount and importance of data. By mining metadata of backup sets, enterprises gain the capability of near real-time identification of the location of data and understanding who has accessed the data. In the most advanced systems, metadata allows for indexing the content of data distributed across the globe.

The appliance-based data protection products researched for this report use secondary storage as the target for backup data. The exception is Datrium DVX, which backs up to a primary data platform. By leveraging secondary storage, data protection solutions can create and store advanced metadata attributes for the majority of enterprise data.

Appliance-based solutions store the metadata on a flash layer within the secondary storage platform. Keeping the metadata entirely on flash allows real-time metadata queries, which in turn enable data management capabilities not generally associated with data protection. Software-only solutions may achieve similar performance by separating the location of metadata from that of backup data. SaaS solutions hide the complexity of the infrastructure, and with it the separation of metadata and backup data.

The EU General Data Protection Regulation (GDPR) provides a potential use case for metadata from data protection platforms. By collecting backup metadata and centralizing access to it, internal regulators can use custom applications to identify personally identifiable information (PII) or other in-scope data stored on remote file shares. Administrators can create manual or automated processes to remove data as requested by EU citizens. Advances in backup recovery may also make it possible to mask data in existing backups so that in-scope data is never recovered in violation of GDPR.
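As a rough illustration of the kind of custom application described above, the following sketch searches a centralized backup metadata catalog for entries that reference a given data subject. The `CatalogEntry` schema, the `index_terms` field, and the `find_in_scope` function are hypothetical names invented for this example; they do not correspond to any vendor's API.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class CatalogEntry:
    """Hypothetical metadata record kept for one backed-up file."""
    path: str
    site: str
    accessed_by: Set[str] = field(default_factory=set)
    index_terms: Set[str] = field(default_factory=set)  # terms from content indexing


def find_in_scope(catalog: List[CatalogEntry], subject_id: str) -> List[CatalogEntry]:
    """Return catalog entries whose indexed content mentions the data subject."""
    needle = subject_id.lower()
    return [e for e in catalog if needle in {t.lower() for t in e.index_terms}]


catalog = [
    CatalogEntry("/shares/hr/payroll.xlsx", "frankfurt",
                 {"alice"}, {"jane.doe@example.com", "salary"}),
    CatalogEntry("/shares/eng/design.docx", "austin",
                 {"bob"}, {"architecture"}),
    CatalogEntry("/shares/sales/leads.csv", "london",
                 {"carol", "alice"}, {"Jane.Doe@example.com"}),
]

hits = find_in_scope(catalog, "jane.doe@example.com")
for entry in hits:
    print(entry.site, entry.path, sorted(entry.accessed_by))
```

The query touches only metadata, never the backup data itself, which is why keeping the metadata on a fast tier matters for this kind of request.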

Data Deduplication

Data deduplication (dedupe) is one of the most critical features of any modern data protection platform. Dedupe can power disaster recovery replication and enables the use of public cloud storage for data tiering and archive. Dedupe also provides the foundation for advanced data management capabilities such as data mobility.

Dedupe is a feature expected in enterprise-class storage systems. Backup software has had the concept of the synthetic-full backup for over a decade. A synthetic-full backup uses backup metadata to create a full backup catalog from the incremental backups taken after an initial full backup. As primary data continues to grow at an unprecedented rate, the need to tier or move enterprise data to the cloud has emerged. Modern data protection products go further by implementing object- and block-level dedupe.

Not all dedupe performs equally or takes the same approach to data reduction. Some vendors have integrated metadata capability with the dedupe engine to create 'global dedupe.' Datrium DVX is an example. Datrium uses a cryptographic hash to identify data chunks. The capability enables efficient data transfer between sites, which allows the backup to be used for disaster recovery (DR) replication.

Other vendors, such as Veritas, use an agent/media server approach to deduplication. Products in this category leverage forever-full backups to limit the amount of data transferred to the media server. Once the data transfers to the media server, a dedupe algorithm is applied there. The approach reduces storage usage on the media server but is less efficient for DR or for moving data to the public cloud.

SaaS providers' approaches vary. Some providers leverage agent-side processing to implement real-time dedupe, while others dedupe once the data transfers to the cloud service.
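The synthetic-full idea can be sketched in a few lines. The example below assumes a simplified catalog that maps each file path to a version identifier, with `None` in an incremental marking a deletion. That format and the `synthetic_full` function are illustrative assumptions, not how any particular backup product stores its catalog.

```python
from typing import Dict, List, Optional

# An incremental catalog maps file path -> stored version identifier.
# A value of None marks a file deleted since the previous backup (an assumption
# made for this sketch).
Incremental = Dict[str, Optional[str]]


def synthetic_full(initial_full: Dict[str, str],
                   incrementals: List[Incremental]) -> Dict[str, str]:
    """Build a full catalog from an initial full plus incrementals (oldest first)."""
    merged = dict(initial_full)  # copy so the original full stays untouched
    for inc in incrementals:
        for path, version in inc.items():
            if version is None:
                merged.pop(path, None)  # file was deleted
            else:
                merged[path] = version  # file was added or changed
    return merged


full = {"/db/a.dat": "v1", "/db/b.dat": "v1", "/etc/app.conf": "v1"}
monday = {"/db/a.dat": "v2"}
tuesday = {"/db/b.dat": None, "/db/c.dat": "v1"}

catalog = synthetic_full(full, [monday, tuesday])
print(catalog)
```

Only metadata is merged here. No data is read again from the primary system, which is what keeps the load of a synthetic full low compared with a fresh full backup.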
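To make the hash-based approach concrete, here is a minimal sketch of content-addressed dedupe applied to replication. It assumes fixed-size chunks and SHA-256 as the hash; both are choices made for illustration, and `ChunkStore` and `replicate` are hypothetical names rather than a description of Datrium's or any other vendor's implementation.

```python
import hashlib
from typing import Dict

CHUNK_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


class ChunkStore:
    """Stand-in for a remote site that stores chunks keyed by their hash."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}

    def has(self, chunk_id: str) -> bool:
        return chunk_id in self.chunks

    def put(self, chunk_id: str, data: bytes) -> None:
        self.chunks[chunk_id] = data


def replicate(data: bytes, remote: ChunkStore, size: int = CHUNK_SIZE) -> int:
    """Send only the chunks the remote site lacks; return bytes transferred."""
    sent = 0
    for offset in range(0, len(data), size):
        chunk = data[offset:offset + size]
        chunk_id = hashlib.sha256(chunk).hexdigest()
        if not remote.has(chunk_id):  # identical content hashes to the same id
            remote.put(chunk_id, chunk)
            sent += len(chunk)
    return sent


dr_site = ChunkStore()
first = replicate(b"AAAABBBBCCCCAAAA", dr_site)   # repeated AAAA is sent once
second = replicate(b"AAAABBBBDDDDAAAA", dr_site)  # only DDDD is new
print(first, second, len(dr_site.chunks))
```

Because the chunk identifier depends only on content, the second backup transfers just the changed chunk even though it is a different backup set. That property is what makes dedupe useful for replication between sites.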

Disaster Recovery

With the advancements in dedupe and the economics of public cloud, the use of backup products to power replication for disaster recovery is on the rise. Products such as Veeam and Zerto have long championed the use of secondary data as the engine for disaster recovery. In traditional disaster recovery designs, engineers replicate data from a tier 1 storage array located in the production data center to a tier 1 storage array located in a DR facility.

The tier 1 to tier 1 replication provides a high service level. However, the model is inefficient from a cost perspective. Organizations must maintain two tier 1 storage arrays and the remaining infrastructure to recover from a primary site failure. Co-location services help to pare down the cost of the design. Regardless, replication-based DR remains out of reach for many organizations.

Cloud-based DR appeals to cost-conscious operations, even for legacy applications not designed for efficient use of public cloud. Public cloud is best suited to dynamic workloads, and DR is one such workload. Logically, it makes sense to pay for DR only during a DR event or test. Modern data protection solutions enable this pay-as-you-go use of public cloud for DR.

With only a few exceptions, each product researched can replicate data to the public cloud. As described in the section on deduplication, the products vary in efficiency. They also differ in the infrastructure required to use the public cloud as a backup target. Some solutions require you to deploy a media server or a virtualized appliance in the public cloud. SaaS solutions hide this complexity, as they use the public cloud as the primary target for backups.

The type of cloud storage available for replication varies between providers, as does the format in which the data resides. To keep recurring costs low, some providers support object storage as a medium. Others require the use of block storage attached to a media server or virtual appliance. While block storage carries a higher per-GB cost than object storage, it reduces the recovery time objective (RTO) when hydrating the data.

Regardless of the backend storage, each provider stores backup data in a proprietary format. At the low end of the options, providers require administrators to restore data to a block target. More advanced solutions expose backups over IP-based storage protocols. For example, an administrator can make a recovery point available via NFS for an application running on an Amazon EC2 instance to access.

Beyond merely replicating backup sets to the cloud, some solutions integrate directly with public cloud providers to enable the orchestration needed for virtual machine recovery. These advanced capabilities allow automated recovery of a supported virtual machine image. An example is Rubrik CloudOn, which gives the option to manually restore a virtual machine image as an EC2 instance or to automate the recovery of a critical workload at set intervals. Even with this level of automation, leveraging the public cloud as a DR site requires a great deal of planning and testing.

While cloud-based DR is a popular checklist item, many companies still want the ability to leverage idle resources for DR. Most products support site-to-site replication. Recovery point objectives (RPO) remain a weakness of this space. Except for products that leverage tier 1 to tier 1 replication, such as Datrium DVX, an RPO of 15 minutes is the most aggressive RPO provided by these vendors.
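A short sketch can show what a 15-minute RPO means in practice: the data lost in a failure is the time between the last usable recovery point and the moment of failure. The function names and the timestamps below are invented for illustration and are not measurements from any product.

```python
from datetime import datetime, timedelta
from typing import List


def data_loss_window(recovery_points: List[datetime],
                     failure_time: datetime) -> timedelta:
    """Time between the latest recovery point before the failure and the failure."""
    usable = [p for p in recovery_points if p <= failure_time]
    if not usable:
        raise ValueError("no recovery point exists before the failure")
    return failure_time - max(usable)


def meets_rpo(recovery_points: List[datetime],
              failure_time: datetime,
              rpo: timedelta) -> bool:
    """True if the data lost in this failure stays within the RPO."""
    return data_loss_window(recovery_points, failure_time) <= rpo


# Hypothetical recovery points replicated every 15 minutes.
points = [
    datetime(2018, 6, 1, 10, 0),
    datetime(2018, 6, 1, 10, 15),
    datetime(2018, 6, 1, 10, 30),
]
failure = datetime(2018, 6, 1, 10, 41)

loss = data_loss_window(points, failure)
print(loss)
print(meets_rpo(points, failure, timedelta(minutes=15)))
print(meets_rpo(points, failure, timedelta(minutes=5)))
```

In this example the 15-minute replication schedule satisfies a 15-minute RPO but not a 5-minute one, which is the gap that tier 1 to tier 1 replication is typically used to close.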