Blogs

Tony Pearson is a Master Inventor and Senior IT Architect for the IBM Storage product line at the
IBM Systems Client Experience Center in Tucson Arizona, and featured contributor
to IBM's developerWorks. In 2016, Tony celebrates his 30th year anniversary with IBM Storage. He is
author of the Inside System Storage series of books. This blog is for the open exchange of ideas relating to storage and storage networking hardware, software and services.
(Short URL for this blog: ibm.co/Pearson )

My books are available on Lulu.com! Order your copies today!
Featured Redbooks and Redpapers:

Safe Harbor Statement: The information on IBM products is intended to outline IBM's general product direction and it should not be relied on in making a purchasing decision. The information on the new products is for informational purposes only and may not be incorporated into any contract. The information on IBM products is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any features or functionality described for IBM products remains at IBM's sole discretion.

Tony Pearson is a an active participant in local, regional, and industry-specific interests, and does not receive any special payments to mention them on this blog.

Tony Pearson receives part of the revenue proceeds from sales of books he has authored listed in the side panel.

Tony Pearson is not a medical doctor, and this blog does not reference any IBM product or service that is intended for use in the diagnosis, treatment, cure, prevention or monitoring of a disease or medical condition, unless otherwise specified on individual posts.

Wrapping up my week's theme of storage optimization, I thought I would help clarify the confusion between data reduction and storage efficiency. I have seen many articles and blog posts that either use these two terms interchangeably, as if they were synonyms for each other, or as if one is merely a subset of the other.

Data Reduction is LOSSY

By "Lossy", I mean that reducing data is an irreversible process. Details are lost, but insight is gained. In his paper, [Data Reduction Techniques", Rajana Agarwal defines this simply:

"Data reduction techniques are applied where the goal is to aggregate or amalgamate the information contained in large data sets into manageable (smaller) information nuggets."

Data reduction has been around since the 18th century.

Take for example this histogram from [SearchSoftwareQuality.com]. We have reduced ninety individual student scores, and reduced them down to just five numbers, the counts in each range. This can provide for easier comprehension and comparison with other distributions.

The process is lossy. I cannot determine or re-create an individual student's score from these five histogram values.

This next example, complements of [Michael Hardy], represents another form of data reduction known as ["linear regression analysis"]. The idea is to take a large set of data points between two variables, the x axis along the horizontal and the y axis along the vertical, and find the best line that fits. Thus the data is reduced from many points to just two, slope(a) and intercept(b), resulting in an equation of y=ax+b.

The process is lossy. I cannot determine or re-create any original data point from this slope and intercept equation.

In this last example, from [Yahoo Finance], reduces millions of stock trades to a single point per day, typically closing price, to show the overall growth trend over the course of the past year.

The process is lossy. Even if I knew the low, high and closing price of a particular stock on a particular day, I would not be able to determine or re-create the actual price paid for individual trades that occurred.

Storage Efficiency is LOSSLESS

By contrast, there are many IT methods that can be used to store data in ways that are more efficient, without losing any of the fine detail. Here are some examples:

Thin Provisioning: Instead of storing 30GB of data on 100GB of disk capacity, you store it on 30GB of capacity. All of the data is still there, just none of the wasteful empty space.

Space-efficient Copy: Instead of copying every block of data from source to destination, you copy over only those blocks that have changed since the copy began. The blocks not copied are still available on the source volume, so there is no need to duplicate this data.

Archiving and Space Management: Data can be moved out of production databases and stored elsewhere on disk or tape. Enough XML metadata is carried along so that there is no loss in the fine detail of what each row and column represent.

Data Deduplication: The idea is simple. Find large chunks of data that contain the same exact information as an existing chunk already stored, and merely set a pointer to avoid storing the duplicate copy. This can be done in-line as data is written, or as a post-process task when things are otherwise slow and idle.

When data deduplication first came out, some lawyers were concerned that this was a "lossy" approach, that somehow documents were coming back without some of their original contents. How else can you explain storing 25PB of data on only 1PB of disk?

(In some countries, companies must retain data in their original file formats, as there is concern that converting business documents to PDF or HTML would lose some critical "metadata" information such as modificatoin dates, authorship information, underlying formulae, and so on.)

Well, the concern applies only to those data deduplication methods that calculate a hash code or fingerprint, such as EMC Centera or EMC Data Domain. If the hash code of new incoming data matches the hash code of existing data, then the new data is discarded and assumed to be identical. This is rare, and I have only read of a few occurrences of unique data being discarded in the past five years. To ensure full integrity, IBM ProtecTIER data deduplication solution and IBM N series disk systems chose instead to do full byte-for-byte comparisons.

Compression: There are both lossy and lossless compression techniques. The lossless Lempel-Ziv algorithm is the basis for LTO-DC algorithm used in IBM's Linear Tape Open [LTO] tape drives, the Streaming Lossless Data Compression (SLDC) algorithm used in IBM's [Enterprise-class TS1130] tape drives, and the Adaptive Lossless Data Compression (ALDC) used by the IBM Information Archive for its disk pool collections.

Last month, IBM announced that it was [acquiring Storwize. It's Random Access Compression Engine (RACE) is also a lossless compression algorithm based on Lempel-Ziv. As servers write files, Storwize compresses those files and passes them on to the destination NAS device. When files are read back, Storwize retrieves and decompresses the data back to its original form.

As with tape, the savings from compression can vary, typically from 20 to 80 percent. In other words, 10TB of primary data could take up from 2TB to 8TB of physical space. To estimate what savings you might achieve for your mix of data types, try out the free [Storwize Predictive Modeling Tool].

So why am I making a distinction on terminology here?

Data reduction is already a well-known concept among specific industries, like High-Performance Computing (HPC) and Business Analytics. IBM has the largest marketshare in supercomputers that do data reduction for all kinds of use cases, for scientific research, weather prediction, financial projections, and decision support systems. IBM has also recently acquired a lot of companies related to Business Analytics, such as Cognos, SPSS, CoreMetrics and Unica Corp. These use data reduction on large amounts of business and marketing data to help drive new sources of revenues, provide insight for new products and services, create more focused advertising campaigns, and help understand the marketplace better.

There are certainly enough methods of reducing the quantity of storage capacity consumed, like thin provisioning, data deduplication and compression, to warrant an "umbrella term" that refers to all of them generically. I would prefer we do not "overload" the existing phrase "data reduction" but rather come up with a new phrase, such as "storage efficiency" or "capacity optimization" to refer to this category of features.

IBM is certainly quite involved in both data reduction as well as storage efficiency. If any of my readers can suggest a better phrase, please comment below.

To avoid overwhelming people with too many features and functions, IBM decided to keep things simple for the first release. Let's take a look:

Frames

The base frame (2231-IA3) supports a single collection, from as small as 3.6 TB to as large as 72 TB of usable capacity. You can attach one expansion frame (2231-IS3) that holds two additional collections, 63 TB usable capacity for each collection. Disk capacity is increased in eight-drive (half-drawer) increments of 3.6 TB usable capacity each. A full configured IA system (304 drives, 1 TB raw capacity per drive) provides 198 TB usable capacity.

Of course, that is just the disk side of the solution. Like its predecessor, the IBM System Storage DR550, the IA v1.1 can also attach to external tape storage to store and protect petabytes (PB) of archive data. Hundreds of different IBM and non-IBM tape drives and libraries are supported, so that this can be easily incorporated into existing tape environments.

Protection Levels

Each collection can be configured to one of three protection levels: basic, intermediate, and maximum.

Basic protection provides RAID protection of data using standard NFS group/user controls for access to read and write data. This can be useful for databases that need full read/write access. Users can assign expiration dates, but in Basic mode they can delete the data before the expiration date is reached.

Intermediate adds Non-Erasable Non-Rewriteable (NENR) protection against user actions to delete or modify protected data. However, similar to IBM N series "Enterprise SnapLock", intermediate mode allows authorized storage admins to clean up the mess, increase or reduce retention periods, and delete data if it is inadvertently protected. I often refer to this as "training wheels" for those who are trying to work out their workflow procedures before moving on to Maximum mode.

Maximum provides the strictest NENR protection for business, legal, government and industry requirements, comparable to IBM N series "Compliance SnapLock" mode, for data that traditionally were written to WORM optical media. Data cannot be deleted until the retention period ends. Retention periods of individual files and objects can be increased, but not decreased. Retention Hold (often referred to as Litigation Hold) can be used to keep a set of related data even longer in specific circumstances.

You can decide to upgrade your protection after data is written to a collection. Basic mode can be upgraded to Intermediate mode, for example, or Intermediate mode upgraded to Maximum.