3 reasons to embrace horizontal scaling for secondary storage

While hyperconvergence and scale-out architectures have offered horizontal scaling for primary storage, it is now time to look at the benefits of horizontal scaling for secondary storage applications such as backup and archive.

Nutanix and other technology vendors popularized the concept of hyperconvergence—combining compute and storage in the same box—in an attempt to do away with silos in virtualization environments.

For virtualization workloads, hyperconvergence is also extending into backup. By 2020, 30 percent of organizations will have replaced traditional backup applications with storage- or hyperconverged integrated systems for the majority of backup workloads, according to a 2017 Gartner study.

The technology of hyperconvergence isn’t as important as its value—horizontal scaling even as data continues to grow in the enterprise.

Much in the same way hyperconvergence has transformed storage for virtual machines, scale-out architectures introduce horizontal scaling to unstructured file data. Common in cloud solutions, true scale-out architecture adds nodes that work together to provide higher performance than could be achieved with just one large monolithic system.
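As a rough illustration of what "adding nodes that work together" means for file data, here is a minimal Python sketch. It assumes a hypothetical placement rule (hash the file path, take it modulo the node count) and made-up file names; it is not how any particular vendor's product places data.

```python
import hashlib
from collections import Counter


def assign_node(path: str, node_count: int) -> int:
    """Map a file path to a node index with a stable hash (hypothetical placement rule)."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def busiest_node_load(paths, node_count: int) -> int:
    """Return the number of files held by the most heavily loaded node."""
    counts = Counter(assign_node(p, node_count) for p in paths)
    return max(counts.values())


# Made-up file names standing in for an unstructured data set.
paths = [f"/projects/scan_{i:06d}.tif" for i in range(100_000)]

for nodes in (1, 4, 16):
    print(nodes, "node(s): busiest node holds", busiest_node_load(paths, nodes), "files")
```

Each added node shrinks the share of files any single node has to handle. Simple modulo placement moves most files whenever the node count changes, so real scale-out systems generally use more careful schemes such as consistent hashing.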

While many technology companies acknowledge the scalability and silo-busting perks of hyperconvergence and scale-out architecture, much of the horizontal scaling conversation has focused on the primary tier.

As a technology executive for a secondary storage vendor, I’ve seen how scale-out architecture helps bust silos in backup and archive, enabling enterprises to scale horizontally as their data grows.

Here are three reasons to consider horizontal scaling when seeking ways to back up and archive file data.

Reason #1: Billions of files

Traditional backup software was designed in an era when data volumes on the order of millions of files were considered huge. With unstructured data sets now often spanning billions or tens of billions of files, traditional backup software has hit some very practical limits. Enterprises frequently segment backups into silos to more easily manage backup catalogs or to meet backup windows.

If your organization doesn’t produce or collect a lot of data, disk-to-disk replication probably works just fine. But for the growing number of enterprises with hundreds or sometimes thousands of file systems, each primary storage silo results in another secondary storage silo to manage.

Reason #2: File data is much bigger

File data remains the missing piece of the horizontal scaling puzzle. Today, most of the attention on horizontal scaling is on virtualization workloads.

In “hyperconverged vendors focus on data protection,” Taneja Group Founder Arun Taneja contrasts Nutanix’s focus with that of companies like Cohesity and Rubrik, which are applying the principles of hyperconvergence to secondary storage for virtualized workloads.

“Data protection as a discipline has been sleeping for decades,” writes Taneja, “but data management convergence may be the wakeup call it needs.”

Trends including the Internet of Things (IoT), artificial intelligence, and real-time data are transforming data management requirements even further. By 2025, annual data creation will surge to 163 zettabytes, according to an April 2017 Seagate and IDC report on the evolution of data.

The volume of unstructured data in the form of images, videos, and other data that can’t easily be managed with existing block storage workflows has grown rapidly. While most industry estimates put unstructured data at around 80 percent of overall data, we have had conversations with many enterprises that estimate that 95 percent of their data volume may be unstructured. Because of this, effective secondary storage for file data is critical.

Reason #3: File data doesn’t dedupe well

In a circular quandary, much of the economic benefit of hyperconverged storage systems relies on deduplication, but a lot of file data doesn’t dedupe well. When dedupe is built into your storage system and the data cooperates, such as when one virtual machine backup dedupes well against another VM’s, you gain significant storage savings.

Secondary storage for virtualization or for database workflows is often priced assuming you’ll use deduplication, and these economics generally don’t apply to unstructured file data.
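To make the dedupe gap concrete, here is a minimal Python sketch of fixed-size chunk deduplication. Everything in it is an assumption for illustration: the 4 KB chunk size, the two "VM images" that share 90 percent of their blocks, and the random bytes standing in for unique, already-compressed media files. Production dedupe engines typically use more sophisticated, variable-size chunking.

```python
import hashlib
import os

CHUNK = 4096  # assumed chunk size in bytes


def dedupe_ratio(data: bytes, chunk_size: int = CHUNK) -> float:
    """Return total chunks divided by unique chunks under fixed-size chunking."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    unique = {hashlib.sha256(c).digest() for c in chunks}
    return len(chunks) / len(unique)


# Hypothetical VM backups: two images that share the same base OS blocks.
base_os = os.urandom(CHUNK * 90)
vm_a = base_os + os.urandom(CHUNK * 10)
vm_b = base_os + os.urandom(CHUNK * 10)

# Hypothetical file data: unique media content (random bytes as a stand-in).
media = os.urandom(CHUNK * 200)

print(f"VM backups:  {dedupe_ratio(vm_a + vm_b):.2f}x")  # 200 chunks / 110 unique = 1.82x
print(f"Media files: {dedupe_ratio(media):.2f}x")        # 200 chunks / 200 unique = 1.00x
```

The same amount of raw data yields very different savings. Pricing that assumes the first ratio will not hold for data that behaves like the second.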

Some executives respond by saying that unlike oil and gas, biotech or high-performance computing, their industries don’t generate very much unstructured data. My response? Just wait. Machine learning and artificial intelligence will generate data across a broad range of industries that may not currently struggle with data storage.

Hyperconvergence was intended to be a silo-buster and has successfully eliminated silos in the primary tier. Unstructured file data — which is about 80 percent of overall data volume — is largely not a primary use case for hyperconverged systems in use today. But if your enterprise is committed to more efficiently managing your data, it definitely should be.

This article is published as part of the IDG Contributor Network.