Big Unstructured Data and the Case for Optimized Object Storage

Paul Speciale

The world of data storage is marked by ongoing innovation, but major paradigm shifts in technology are much less frequent.

The last major transition in storage paradigm came in the 1990s, with the introduction of network storage, including Storage Area Network (SAN) and Network Attached Storage (NAS) technology, to augment Direct Attached Storage (DAS).

Today the industry is at an inflection point: the growth of big unstructured data is motivating the need for storage systems that are optimized for extremely large-scale data, large in:

capacity (Petabytes and beyond),

number of objects (millions to billions of files),

retention period (long-term preservation of this type of data, with very high data durability and very low cost of ownership).

Traditional storage paradigms like SAN and NAS certainly have their place in the big data ecosystem. Object storage, however, has emerged as an exceptional fit for big unstructured data applications, specifically those dealing with rich media. Because object storage systems eliminate one of the fundamental components of storage systems to date, the file system, they represent a genuine paradigm shift that can remove previous limitations to scalability and efficiency while still retaining the data semantics of file systems. They open the door to new levels of management simplicity with big unstructured data.

Storage Area Networks and Transactional Applications

Each of the existing storage technologies achieved a paradigm shift to meet the evolving business demands of its time. The introduction of SANs continued the presentation of storage as opaque disk blocks (as in DAS), but shifted the topology and management of storage to dedicated storage networks.

This allowed storage servers and arrays to be managed centrally, and connected to multiple servers, thereby reducing the cost of managing multiple “islands of data”. The applications that SAN served well at the time, and continues to serve well today, tend to be transactional enterprise applications such as ERP, CRM and other structured database applications.

Given the structured (tabular) views of the data in these applications, SANs are optimized for fast random lookups, reads and writes on small data records. More recently, the growth of virtualization has seen SAN technology used very effectively for the data stores created and managed by hypervisors. Once again, the access patterns to this data tend to be random, and depend on fast lookups of small amounts of data.

The Introduction of Network Attached Storage for Unstructured Data

A major shift in storage paradigm was sparked in the early 1990s, driven by the growth of unstructured (file-based) data in corporate applications such as office automation, design automation and collaborative engineering. This file-based data tends to be created, reused, modified and shared, often by teams of collaborative workers. File types ranged from spreadsheets, word processing documents and presentation graphics to CAD and other design files, some of which were megabytes in size.

Vendors of file-sharing technologies such as Novell, and then dedicated NAS vendors, made the argument that managing this file-based data in block-based storage systems was needlessly complex, and could be dramatically simplified with file-based management and access protocols such as NFS and CIFS.

NAS preserved the centralized management paradigm of network storage, but further enhanced sharing of the data by multiple users. NAS systems have become optimized for fast file serving, and can typically perform effectively where files are small (kilobytes) to moderate (gigabytes) in size.

Access patterns tend to be a random mix of file operations such as lookups, reads, writes as well as metadata operations. Over time, NAS has broadened its application support to include databases (typically smaller than those in SAN), as well as support for hypervisor data stores.

Because corporations previously viewed big unstructured data as a burden, and therefore a cost, they turned to the lowest-cost media available for storing it: tape. Now that they realize there is tremendous business value in unstructured data, they understand that keeping it dormant and hard to access on tape is a highly inefficient choice.

The need to actively access and mine these large-scale archives makes tape unwieldy at best. In order to unlock the value and “reactivate” this data, what is really needed is a scalable, online disk-based “active archive” that makes it possible to access large amounts of unstructured data with very low latency, and potentially very high levels of throughput. Tape will certainly remain a component of the overall data storage puzzle, but for the reasons cited above alone, it is hard to argue that it will become the active archive repository of choice.

Big Analytics Data

Today’s Big Data trend seems to represent a similar shift in the underlying data, although it is one of not only data volume, but also variety (data types) and velocity (the three “V’s” of Big Data, as per the classic definition). The growth of Big Data for Analytics has been occurring for well over a decade: applications in a wide variety of fields such as energy research, agriculture, biotech and finance have collected analytics data in the form of billions of small log files and stored them in structured data repositories such as relational databases.

A new growth driver, in the form of online (cloud-based) applications and social media, has driven the transactional rates of this type of data to a point where traditional databases no longer fit the bill. Scalability beyond single database servers and the ability to process billions of records in parallel are now requirements. As a result, a new breed of specialized “semi-structured” databases (e.g., Hadoop) and technologies (MapReduce) has come into use to provide much more scalable and optimized solutions for this Big Analytics Data.

The Emergence of Big Unstructured Data

Big Unstructured Data represents the other part of the Big Data wave: mountains of rich media data, primarily images and video files, but also large documents and archives, are driving unstructured data growth. Most estimates point to unstructured data as the major component of data growth in the next decade, representing up to 80% of the overall data volume.

An often-cited study by IDC argues that while data volumes will grow by a factor of 35, the number of qualified people to manage these systems will remain nearly flat (1.4 times growth). In other words, each of them will need to manage roughly 25 times more data (35 divided by 1.4), or data storage systems will need to become much more automated and manageable.

The media and entertainment industry is leading the way in unstructured data growth, and the proliferation of its digital media is a key market driver behind new storage paradigms. With new 4K high-definition video formats representing data rates of multiple terabytes per hour, and double that for 3D formats, this industry is clearly driving all three aspects of the big data definition described above. In media applications, Petabytes have become common, with individual movie projects now requiring tens of Petabytes, and television productions having also generated Petabytes of video in just the last few years.
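The "multiple terabytes per hour" figure can be sanity-checked with simple arithmetic. The sketch below is an illustration only: the frame size (4096 x 2160), 3 bytes per pixel (8-bit RGB), 24 frames per second, and the absence of any compression are all assumptions chosen for the example, not figures from a specific production format.

```python
def uncompressed_video_tb_per_hour(width, height, bytes_per_pixel, fps):
    """Return the uncompressed data rate in decimal terabytes per hour."""
    bytes_per_frame = width * height * bytes_per_pixel
    bytes_per_hour = bytes_per_frame * fps * 3600
    return bytes_per_hour / 1e12


# Assumed parameters: 4096x2160 frames, 8-bit RGB, 24 frames per second.
rate_2d = uncompressed_video_tb_per_hour(4096, 2160, 3, 24)
rate_3d = 2 * rate_2d  # stereoscopic 3D carries two image streams

print(f"4K 2D: {rate_2d:.2f} TB/hour")  # 4K 2D: 2.29 TB/hour
print(f"4K 3D: {rate_3d:.2f} TB/hour")  # 4K 3D: 4.59 TB/hour
```

Higher bit depths or frame rates push these numbers up further, while compressed delivery formats are far smaller.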

Unstructured data growth is also occurring in enterprises, with the need to archive years of digitized corporate documents, backup payloads, image and design data files. In specific vertical industries such as healthcare, we have seen medical imaging drive tremendous data volumes in the form of files.

Industries such as biopharmaceuticals can now sequence genomics data in days, where that process used to require months. It is very clear that mainstream corporations are already dealing with Petabyte scale data, with multiples of growth expected.

The explosion of personal digital photography is another driver of both volume and velocity and illustrates one of the key tenets of object storage: it obviates the need for the traditional file system folder hierarchy. Managing file system hierarchies to store pictures by Year, by Month, by Event, is becoming impossible with the explosion in digital photos.

As a result, new smart applications such as Google’s Picasa have emerged that manage the image data in large-scale object repositories, instead of file system “trees”. This shifts the data management intelligence into the application to sort and categorize the data by default, and eliminates the need for users to actively manage file systems.
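The difference between a folder tree and an object repository can be sketched in a few lines. This is a toy in-memory model, not the design of any particular product: the class `FlatObjectStore`, its `put`, `get` and `find` methods, and the metadata fields `year` and `event` are all hypothetical names introduced for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)


class FlatObjectStore:
    """Toy object store: one flat namespace of IDs, no directories."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        """Store data with descriptive metadata and return its object ID."""
        object_id = uuid.uuid4().hex
        self._objects[object_id] = StoredObject(data, dict(metadata))
        return object_id

    def get(self, object_id):
        """Retrieve an object's data directly by ID, with no path lookup."""
        return self._objects[object_id].data

    def find(self, **criteria):
        """Return IDs of objects whose metadata matches every criterion."""
        return [
            oid for oid, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


store = FlatObjectStore()
beach = store.put(b"jpeg-bytes-1", year=2012, event="beach trip")
party = store.put(b"jpeg-bytes-2", year=2012, event="birthday")
older = store.put(b"jpeg-bytes-3", year=2009, event="birthday")

# The application queries by metadata; nobody maintains Year/Month/Event folders.
print(len(store.find(year=2012)))         # 2
print(len(store.find(event="birthday")))  # 2
```

The same photo can be found by year or by event without being filed under either, which is the organizing work a directory hierarchy would otherwise push onto the user.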

Avoiding File System Limitations

File systems themselves, as the classical repository for unstructured data, can become a limitation at this scale. File system limits such as the number of files per directory, the number of directories, and the depth of the file system tree must all be planned very carefully. Moreover, the need to scale a file system beyond the limits of a single host has led to the development of both distributed file systems and scale-out NAS systems. Limitations still exist in these technologies, but more importantly, the fundamental problem remains: data must be actively organized and managed to fit into the file system structure.

Object Storage systems combine the properties that appear to fit the requirements painted above: scalability to today’s tens of Petabytes, and tomorrow’s Exabytes of unstructured data, all presented as a single uniform pool of data. They offer direct, online access to data through object-based protocols for the emerging new breed of intelligent applications, with low latency and high throughput. While some Object Storage systems utilize file mirroring (replication) for data protection, Optimized Object Storage systems offer advanced data protection schemes (based on Erasure Coding) to provide the highest levels of data durability for massive amounts of unstructured data with greatly reduced storage overhead. Erasure Coding approaches the data protection problem in a fundamentally different manner: rather than keeping whole copies, it stores encoded fragments from which the data can be rebuilt even when some fragments are lost. With Erasure Coding, data can be archived for extremely long durations, with the problem of managing data integrity and susceptibility to data loss being statistically nearly eliminated.
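To make the contrast with replication concrete, the sketch below implements the simplest possible erasure code, a single XOR parity fragment, which can rebuild any one lost fragment. It is an illustration under stated assumptions, not the scheme used by any particular product: production systems generally rely on stronger codes, such as Reed-Solomon, that survive several simultaneous losses. The function names and the 10+6 layout in the overhead comparison are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, k):
    """Split data into k equal fragments and append one XOR parity fragment."""
    fragment_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(fragment_len * k, b"\0")
    fragments = [padded[i * fragment_len:(i + 1) * fragment_len] for i in range(k)]
    return fragments + [reduce(xor_bytes, fragments)]


def decode(fragments, original_len):
    """Rebuild the original data; at most one fragment may be None (lost)."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost fragment")
    fragments = list(fragments)
    if missing:
        survivors = [f for f in fragments if f is not None]
        # XOR of all surviving fragments equals the missing one.
        fragments[missing[0]] = reduce(xor_bytes, survivors)
    return b"".join(fragments[:-1])[:original_len]


def storage_overhead(data_fragments, total_fragments):
    """Raw capacity consumed per byte of user data."""
    return total_fragments / data_fragments


payload = b"big unstructured data"
stored = encode(payload, k=4)
stored[2] = None  # simulate losing the disk that held fragment 2
assert decode(stored, len(payload)) == payload

print(storage_overhead(1, 3))    # three full copies: 3.0
print(storage_overhead(10, 16))  # hypothetical 10 data + 6 coding fragments: 1.6
```

The overhead figures show why erasure coding is attractive at Petabyte scale: with a code that can rebuild from any 10 of 16 fragments, the hypothetical 10+6 layout would tolerate six lost fragments at 1.6 times raw capacity, where three-way replication tolerates two lost copies at 3 times.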

For many industries, there has been talk of 7-year, 10-year and even 100-year archives. Preserving data durability for these types of durations introduces new complexities that must be solved. Will traditional RAID technologies work on data at this level of scale, or will we incur the wrath of nearly continuous RAID rebuilds? What about other issues that arise with Big Data sized volumes, such as preserving the integrity of data, especially on high-density (but very low cost) disk drives, where data corruption due to disk-related bit errors (e.g., bit rot) becomes more likely?
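A rough calculation shows why bit errors matter more as volumes grow. The sketch below assumes an unrecoverable read error rate of 1 in 10^14 bits, a figure of the kind commonly quoted on disk drive datasheets, and treats errors as independent; both are simplifying assumptions for illustration, and the 4 TB drive size is likewise hypothetical.

```python
import math


def expected_errors(bytes_read, bit_error_rate):
    """Expected number of unrecoverable bit errors when reading bytes_read."""
    return bytes_read * 8 * bit_error_rate


def prob_at_least_one_error(bytes_read, bit_error_rate):
    """Probability of one or more errors, assuming independent bit errors."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed via log1p/expm1 to avoid rounding loss.
    return -math.expm1(bits * math.log1p(-bit_error_rate))


ASSUMED_ERROR_RATE = 1e-14  # assumption: 1 unrecoverable error per 10^14 bits

# Reading one full 4 TB drive end to end, as a RAID rebuild must do.
p_rebuild = prob_at_least_one_error(4 * 10**12, ASSUMED_ERROR_RATE)
print(f"{p_rebuild:.0%}")  # 27%

# Reading one Petabyte in total.
print(f"{expected_errors(10**15, ASSUMED_ERROR_RATE):.0f}")  # 80
```

Under these assumptions, a single full-drive read already has a meaningful chance of hitting an error, and a Petabyte-scale archive should expect errors as a matter of routine, which is why data integrity has to be handled by the storage system itself.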

Finally, Optimized Object Storage systems are available that are independent of the underlying hardware platform. This is a key enabler for migrating data across newer generations of processors, disk drives and network fabrics, an essential requirement for archives that are intended to live for the duration of human lifetimes. This hardware independence also makes it possible to host Object Storage on the most efficient hardware platform available at any given time. Commercial Object Storage systems are shipping today that leverage not only high-density, low-cost disk drives, but also low-power components, including processors similar to those used in notebook computers or tablets. This can drive down the cost of power to the lowest possible levels, up to 70% lower than current generations of storage systems.

In the last few years, public cloud-based storage services such as Amazon’s S3 have grown to manage hundreds of millions of files for millions of users, and are being used in a wide variety of file-sharing, video-sharing, imaging, and backup applications. Systems such as S3 are indeed Object Storage systems deployed in public clouds. They offer SLAs, security and performance that are acceptable for many applications.

Commercial Optimized Object Storage systems are now available for corporations that wish to leverage object storage internally for their own dedicated applications. These systems offer enterprise-class SLAs for data durability and availability, plus much higher levels of performance than is available over public clouds. They are now being deployed in a wide variety of Petabyte scale (and beyond) unstructured data applications in media & entertainment, science & higher education, government, and enterprise environments.

Addressing Petabyte to Exabyte Unstructured Data

A storage technology shift isn’t likely to happen just because the technology is interesting. Shifts to new styles of storage system have occurred when they coincide with a significant shift in application requirements. In other words, it’s all about the data, preservation of the data and access to that data. For the new world of Petabyte to Exabyte scale unstructured data, Optimized Object Storage promises to provide a dramatically simplified approach to data management, at an order of magnitude higher scale, while making it possible to manage these systems with the same number of qualified storage administrators we have available today.

About the Author

Paul Speciale, vice president of products for Amplidata, has over 20 years of technology industry experience, in domains including cloud computing, enterprise data storage and database management.

He has held senior product marketing roles at Savvis, Q-layer, agami, and Zambeel, as well as senior positions in systems engineering and product management at Object Design Inc, IBM and Oracle. Paul holds Master’s and Bachelor’s degrees in Applied Mathematics with specialization in numerical computing from UCLA.