Monthly Archives: January 2014

(Excerpt from original post on the Taneja Group News Blog)

Storage experts know that there are two ways to handle crushing data growth – the kind of growth that exceeds our traditional scale-up storage array capabilities (in one way or another). The bad way is to keep plopping down more copies of those arrays, which tends to spiral OPEX out of control – there isn’t as much OPEX efficiency at scale as we might naively think.

(Excerpt from original post on the Taneja Group News Blog)

I just had the pleasure of sitting with Rob Whiteley from Riverbed and Michelle Tidwell from IBM to discuss their jointly validated solution that combines IBM enterprise-class storage in the data center with Riverbed Granite in branch locations. The result is as if you had enterprise-class storage in each branch location, while all data is actually managed and protected in the data center. Each branch also gains resiliency to network failures and local storage performance.

An IT industry analyst article published by SearchStorage.

Raw capacity numbers are becoming less useful as deduplication, compression and application-aware storage provide more value than sheer capacity.

Whether clay pots, wooden barrels or storage arrays, vendors have always touted how much their wares can reliably store. And invariably, the bigger the vessel, the more impressive and costly it is, both to acquire and manage. The preoccupation with size as a measure of success implies that we should judge and compare offerings on sheer volume. But today, the relationship between physical storage media capacity and the effective value of the data “services” it delivers has become much more virtual and cloudy. No longer does a megabyte of effective storage mean a megabyte of real storage.

Most array vendors now incorporate capacity-optimizing features such as thin provisioning, compression and data deduplication. But now it looks like those vendors might just be selling you megabytes of data that aren’t really there. I agree that it’s the effective storage and resulting cost efficiency that counts, not what goes on under the hood or whether the actual on-media bits are virtual, compacted or shared. The type of engine and the gallons in the tank are interesting, but it’s the speed and distance you can go that matter.

Corporate data that includes such varied things as customer behavior logs, virtual machine images and corporate email that’s been globally deduped and compressed might deflate to a twentieth or less of its former glory. So when a newfangled flash array only has 10 TB of actual solid-state drives, but based on an expected minimum dedupe ratio is sold as a much larger effective 100+ TB, are we still impressed with the bigger number? We know our raw data is inherently “inflated” with too many copies and too little sharing. It should have always been stored “more” optimally.
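The arithmetic behind that "effective" number can be sketched in a few lines. This is only an illustration: the function names and the purchase price are hypothetical, and the reduction ratios are assumptions rather than figures from any vendor.

```python
def effective_capacity_tb(raw_tb: float, reduction_ratio: float) -> float:
    """Effective capacity implied by raw capacity and an assumed reduction ratio.

    A reduction_ratio of 10 means "10:1", i.e. ten logical TB per stored TB.
    """
    if reduction_ratio <= 0:
        raise ValueError("reduction_ratio must be positive")
    return raw_tb * reduction_ratio


def cost_per_effective_tb(price: float, raw_tb: float, reduction_ratio: float) -> float:
    """Price divided by effective capacity (price is a hypothetical input)."""
    return price / effective_capacity_tb(raw_tb, reduction_ratio)


if __name__ == "__main__":
    raw_tb = 10.0  # the 10 TB of actual solid-state drives from the example above
    for ratio in (3, 10, 20):  # assumed ratios; 10:1 gives the quoted 100 TB
        print(f"{ratio}:1 -> {effective_capacity_tb(raw_tb, ratio):.0f} TB effective")
```

The same 10 TB of flash looks like 30 TB or 200 TB depending on the ratio assumed, which is why the quoted effective capacity is only as trustworthy as the ratio behind it.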

But can we believe that bigger number? What’s hard to know, although perhaps it’s what we should be focusing on, is the reduction ratio we’ll get with our particular data set, as deflation depends highly on both the dedupe algorithm and the content…
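One rough way to get a feel for the ratio on your own data is to run a sample through a simple dedupe-plus-compress pass. The sketch below is a deliberately naive stand-in, assuming fixed-size 4 KB chunks, SHA-256 fingerprints and zlib compression; real arrays use their own chunking and compression schemes, so treat the result as an indication and not a prediction. The function name is hypothetical.

```python
import hashlib
import zlib


def estimate_reduction_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Estimate the logical-to-stored size ratio for a sample of data.

    Each unique fixed-size chunk is counted once, at its zlib-compressed size.
    Metadata overhead (fingerprint tables, block references) is ignored.
    """
    if not data:
        raise ValueError("sample must not be empty")
    seen = set()
    stored_bytes = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        if digest in seen:
            continue  # duplicate chunk: only a reference would be stored
        seen.add(digest)
        stored_bytes += len(zlib.compress(chunk))
    return len(data) / stored_bytes


if __name__ == "__main__":
    repetitive = b"A" * 4096 * 100  # 100 identical chunks
    print(f"repetitive sample: {estimate_reduction_ratio(repetitive):.0f}:1")
```

Running the same function over already-compressed or encrypted content should return a ratio close to 1:1, which illustrates how strongly the outcome depends on the content itself.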

An IT industry analyst article published by SearchStorage.

Using Hadoop to drive big data analytics doesn’t necessarily mean building clusters of distributed storage; a good old array might be a better choice.

The original architectural design for Hadoop made use of relatively cheap commodity servers and their local storage in a scale-out fashion. Hadoop’s original goal was to enable cost-effective exploitation of data that was previously not viable. We’ve all heard about big data volume, variety, velocity and a dozen other “v” words used to describe these previously hard-to-handle data sets. Given such a broad target by definition, most businesses can point to some kind of big data they’d like to exploit.

Big data is growing bigger every day, and storage vendors with their relatively expensive SAN and network-attached storage (NAS) systems are starting to work themselves into the big data party. They can’t simply leave all that data to server vendors filling boxes with commodity disk drives. Even if Hadoop adoption is just in its early stages, the competition and confusing marketing noise are ratcheting up.

In a Hadoop scale-out design, each physical node in the cluster hosts both local compute and a share of data; it’s intended to support applications, such as search, that often need to crawl through massively large data sets. Much of Hadoop’s value lies in how it effectively executes parallel algorithms over distributed data chunks across a scale-out cluster.

Hadoop is made up of a compute engine based on MapReduce and a data service called the Hadoop Distributed File System (HDFS). Hadoop takes advantage of high data “locality” by spreading big data sets over many nodes using HDFS, farming out parallelized compute tasks to each data node (the “map” part of MapReduce), followed by various shuffling and sorting consolidation steps to produce a result (the “reduce” part).
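The map, shuffle and reduce steps described above can be illustrated with a single-process word count. This is a toy sketch in plain Python, not Hadoop code: the function names are hypothetical, and each string in `chunks` stands in for a block of data held on one data node.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one local chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks: List[str]) -> Dict[str, int]:
    # In a real cluster each map call would run on the node holding the chunk.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["big data", "Big storage"]))
```

The point of the structure is that `map_phase` needs only its own chunk, so the work can run where the data already lives; only the much smaller intermediate pairs have to move during the shuffle.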

Commonly, each HDFS data node will be assigned DAS disks to work with. HDFS then replicates each block of data across the data nodes, usually keeping two or three copies. Each replica lands on a different server node, with the second replica placed on a different “rack” of nodes to help avoid rack-level loss. Replication obviously takes up more raw capacity than RAID, but it also has advantages, such as avoiding rebuild windows.
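Both points, the capacity overhead and the rack-aware placement, can be sketched briefly. The code below is a simplified illustration and not HDFS's actual placement code: the function names are hypothetical, node selection is deterministic where a real cluster would choose among eligible nodes, and the RAID figure assumes an 8+2 parity layout picked purely for comparison.

```python
from typing import Dict, List


def replicated_raw_tb(usable_tb: float, copies: int) -> float:
    """Raw capacity consumed when every block is stored `copies` times."""
    return usable_tb * copies


def raid_raw_tb(usable_tb: float, data_disks: int, parity_disks: int) -> float:
    """Raw capacity for a parity RAID group, e.g. 8 data + 2 parity disks."""
    return usable_tb * (data_disks + parity_disks) / data_disks


def place_replicas(racks: Dict[str, List[str]], writer: str, copies: int = 3) -> List[str]:
    """Pick nodes for one block: writer first, then a node on another rack."""
    rack_of = {node: rack for rack, nodes in racks.items() for node in nodes}
    if writer not in rack_of:
        raise ValueError(f"unknown node: {writer}")
    placement = [writer]
    # Second replica goes to a different rack to survive a rack-level loss.
    remote = [n for rack, nodes in racks.items()
              if rack != rack_of[writer] for n in nodes]
    if copies >= 2 and remote:
        placement.append(remote[0])
    if len(placement) == 2:
        # Further replicas prefer the second replica's rack, then any other node.
        second_rack = rack_of[placement[1]]
        candidates = [n for n in racks[second_rack] if n not in placement]
        candidates += [n for n in rack_of
                       if n not in placement and n not in candidates]
    else:
        candidates = [n for n in rack_of if n not in placement]
    for node in candidates:
        if len(placement) >= copies:
            break
        placement.append(node)
    return placement


if __name__ == "__main__":
    racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas(racks, "n1"))
    print(replicated_raw_tb(100, 3), raid_raw_tb(100, 8, 2))
```

Under these assumptions, 100 TB of usable data needs 300 TB raw with three copies versus 125 TB with the 8+2 parity group, which is the capacity trade-off the paragraph describes.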

So if HDFS readily handles the biggest of data sets in a way native to the MapReduce style of processing, uses relatively cheap local disks and provides built-in “architecture-aware” replication, why consider enterprise-class storage? …
