The Right Storage Option Is Important for Big Data Success

Greg Schulz, author of several books on storage, is founder of the StorageIO Group, an IT industry analyst consultancy whose blog can be found at storageioblog.com and at twitter.com/storageio.

Big Data is becoming an important tool for agencies, and as the data itself grows dramatically, the storage and data management solutions that agencies employ become more critical. As agencies deal with challenges such as implementing analytics and getting a handle on massive data files, they also must find the best fit among various storage options.

Use Metadata and Policy Management

Some agencies with Big Data storage needs will focus on obtaining a large amount of capacity at a relatively low cost. For some applications, an important attribute of storage solutions and services is their metadata capabilities, including support for flexible, user-defined metadata.

Another enabling capability is policy management, which can use metadata for implementing or driving functions such as how long to retain data, when and how to securely dispose of it, and where to keep it (along with application-related information). This adds some flexible structure to unstructured data without the limits or constraints associated with structured data management.
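As a rough illustration of how metadata can drive policy, the sketch below maps a user-defined tag to a retention period and a storage location, then decides whether an item should be retained or securely disposed of. The tag name record_type, the policy table and the location names are all hypothetical, invented for this example rather than taken from any particular product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class StoredObject:
    name: str
    created: date
    metadata: dict  # user-defined tags, e.g. {"record_type": "surveillance-video"}


# Hypothetical policy table: record_type -> (retention in days, preferred location)
POLICIES = {
    "surveillance-video": (90, "tape-archive"),
    "financial-record": (7 * 365, "disk-primary"),
}
DEFAULT_POLICY = (365, "disk-primary")  # assumed fallback for untagged items


def evaluate(obj: StoredObject, today: date) -> dict:
    """Return the action the policy table implies for one object."""
    record_type = obj.metadata.get("record_type")
    retention_days, location = POLICIES.get(record_type, DEFAULT_POLICY)
    expires = obj.created + timedelta(days=retention_days)
    if today >= expires:
        return {"action": "secure-dispose", "expired_on": expires}
    return {"action": "retain", "location": location, "expires_on": expires}
```

Because the decision depends only on tags attached to each item, an agency could change retention rules by editing the policy table without restructuring the underlying unstructured data.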

Find the Right Medium

Finding the right storage medium can help an agency meet its needs. Hard disk drives (HDDs) have been a popular approach to providing a balance of performance, capacity, density and cost-effectiveness for many applications. This trend should continue as agencies retain more data for longer periods.

Big Data can also benefit from today’s solid-state drive solutions that use dynamic random-access memory or NAND flash memory — or a combination of both — to support bandwidth needs. SSDs can also be used to store metadata and other frequently accessed items.

Tape continues to play a number of roles in Big Data. These include transporting large amounts of data in a timely manner and providing an archive or master gold backup of data kept on a disk.

Reduce Big Data Footprint

Deduplication is not always an effective technique for maximizing Big Data storage capacity. Agencies should consider different tools, technologies and techniques to lessen the impact of storing and protecting their ever-growing data sets.

For example, a Big Data project could use archiving or automated tiering to migrate some data to a slower or lower-cost tier of storage, such as tape, that resides online, near-line or offline.
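One simple way to picture automated tiering is a rule that places data according to how recently it was accessed. The following is a minimal sketch under that assumption; the 30-day and one-year thresholds and the tier names are illustrative placeholders, and real tiering products may weigh other factors such as access frequency or file size.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds, ordered from hottest to coldest; tune to the workload.
TIERS = [
    (timedelta(days=30), "ssd-or-fast-disk"),  # accessed within the last 30 days
    (timedelta(days=365), "capacity-disk"),    # accessed within the last year
]
COLDEST_TIER = "tape"  # everything older migrates here


def choose_tier(last_access: datetime, now: datetime) -> str:
    """Pick a storage tier from the time elapsed since the last access."""
    age = now - last_access
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    return COLDEST_TIER
```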

Another option to reduce the data footprint is to rethink how, when, where and why data is protected. A further technique is compression (real-time or time-deferred), which can apply different algorithms to reduce storage demands.
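To see how the choice of algorithm affects storage demand, a quick comparison can be run with the compressors in Python's standard library. This is only a sketch: the sample data is invented, and results vary widely with the kind of data being stored, so no particular ratio should be expected.

```python
import bz2
import lzma
import zlib

# Each entry pairs a compressor with its matching decompressor.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compare(data: bytes) -> dict:
    """Return compressed size as a fraction of the original, per algorithm."""
    if not data:
        raise ValueError("need at least one byte of data")
    ratios = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(data)
        # Verify the round trip so a ratio is never reported for corrupted output.
        if decompress(packed) != data:
            raise RuntimeError(f"{name} round trip failed")
        ratios[name] = len(packed) / len(data)
    return ratios


if __name__ == "__main__":
    sample = b"sensor=42;status=OK\n" * 5000  # made-up, highly repetitive data
    for name, ratio in compare(sample).items():
        print(f"{name}: {ratio:.2%} of original size")
```

Data that is already compressed, such as most video and image formats, typically shrinks very little, which is one reason to test against representative files before committing to an approach.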

Protecting, Preserving and Serving Big Data

40,026: number of exabytes (an exabyte is a billion gigabytes) of data expected to be produced worldwide in 2020, 14 times the amount created in 2012

SOURCE: IDC Digital Universe 2012

Protecting Big Data requires basic reliability, availability and serviceability — capabilities such as redundant power, cooling, controllers, nodes and interfaces. Agencies also should ensure data integrity and durability by conducting background data scrubs to detect parity or protection errors and bit-rot, among other inconsistencies. These background checks should be transparent to normal running operations and should correct inconsistencies before they expand into problems.
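The detection half of a background scrub can be sketched as follows, assuming a checksum is recorded for each block when it is written and compared again later. The block identifiers and the use of SHA-256 are assumptions made for illustration; storage systems generally implement scrubbing internally with their own parity or checksum schemes.

```python
import hashlib


def checksum(block: bytes) -> str:
    """Checksum recorded at write time and recomputed during a scrub."""
    return hashlib.sha256(block).hexdigest()


def scrub(blocks: dict, expected: dict) -> list:
    """Return the IDs of blocks whose contents no longer match their recorded checksum."""
    return sorted(
        block_id
        for block_id, data in blocks.items()
        if checksum(data) != expected.get(block_id)
    )
```

This sketch only detects inconsistencies. Correcting them requires a redundant copy or parity data from which the damaged block can be rebuilt.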

Agencies also should revisit RAID levels to optimize their Big Data storage solution. Factors to consider include how many drives are in a RAID pool or group, and chunk or I/O size, as well as the sizes and types of devices being used, which may be optimized for smaller amounts of data.
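The effect of RAID level and group size on capacity can be estimated with simple arithmetic. The sketch below uses the standard capacity rules for common RAID levels and assumes identical drives in a single group, ignoring hot spares, formatting overhead and vendor-specific variations. The eight-drive, 4-terabyte example is illustrative.

```python
# Minimum drive counts for each supported level.
MINIMUM_DRIVES = {"raid0": 2, "raid5": 3, "raid6": 4, "raid10": 4}


def usable_tb(level: str, drives: int, drive_tb: float) -> float:
    """Approximate usable capacity in TB for one RAID group of identical drives."""
    if level not in MINIMUM_DRIVES:
        raise ValueError(f"unsupported level: {level}")
    if drives < MINIMUM_DRIVES[level]:
        raise ValueError(f"{level} needs at least {MINIMUM_DRIVES[level]} drives")
    if level == "raid10" and drives % 2:
        raise ValueError("raid10 needs an even number of drives")
    data_drives = {
        "raid0": drives,        # striping only, no protection
        "raid5": drives - 1,    # one drive's worth of parity
        "raid6": drives - 2,    # two drives' worth of parity
        "raid10": drives // 2,  # every drive is mirrored
    }[level]
    return data_drives * drive_tb


if __name__ == "__main__":
    drives, size_tb = 8, 4.0
    for level in MINIMUM_DRIVES:
        usable = usable_tb(level, drives, size_tb)
        print(f"{level}: {usable:.0f} TB usable of {drives * size_tb:.0f} TB raw")
```

Comparing raw and usable figures this way makes the capacity cost of each protection level visible before a configuration is chosen.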

Consider Storage System Options

Some Big Data solutions used for analytics employ clusters or grids of industry-standard x86 or IA-64 servers with internal or dedicated storage, along with application software.

Big Data applications can also leverage existing storage systems that are optimized for different uses. Some storage systems intended for traditional high-performance computing can be a good fit for bandwidth-intensive concurrent or parallel access applications using block or file access methods.

Storage solutions with object access (including HTTP, XML and the Cloud Data Management Interface, or CDMI) are also an option for Big Data storage needs such as video, audio, image, surveillance, seismic or geographic data, among other applications with large files or items to store. Object storage systems support variable sizes and different types of data, ranging from kilobytes to gigabytes.

General Big Data storage tips:

Use intelligent power management solutions that do not compromise performance.

Leverage tools and techniques to reduce the data footprint.

Keep an eye on the amount of total raw versus usable storage with different solutions.

Review storage setups (including RAID or protection) for areas that might be optimized.

If concerned with long HDD rebuild times, revisit and address why drives are failing.

Use a mix of SSD, HDD and tape storage, where applicable, to stretch the budget.

The many different facets of Big Data applications have various storage requirements. Knowing an agency’s needs and options can support data growth while minimizing budget growth.

Small HDDs with Big Improvements

Manufacturers are making significant advances with hard disk drives, including 3.5-inch form factor drives with 4-terabyte capacity (which should have even greater capacity in the future), along with increased capacity and faster 2.5-inch HDDs.

Some newer, 10,000 rpm 2.5-inch HDDs offer performance similar to (or better than) that of older, 15,000 rpm 3.5-inch devices. Other HDD improvements include shingled magnetic recording and heat-assisted magnetic recording, which increase the areal density of a device (the number of bits stored in a given physical space on a disk platter). HDDs continue to grow in capacity and add features, keeping them relevant for Big Data environments.

Put It Here

In general, storage options for Big Data include:

Storage dedicated to servers using internal or external devices

Storage shared among servers via sharing software

Storage using block, file and object, or using an application programming interface (API) that is accessed online, near-line or offline