Posts from 2017-09-14

Trends in data growth are downright scary. Per Barry M. Ferrite, as well as leading IT analysts, data is on pace to grow from current levels to more than 60 zettabytes (ZBs) by 2020 and to more than 163 ZBs by 2025. Driving this data growth are three trends: the digitization of information formerly stored in analog formats (the so-called digital democracy), the mobile commerce phenomenon, and the Internet of Things.

This tsunami of new data will challenge large organizations and those that are data intensive. One prominent cloud architect has noted that the current manufacturing output of the disk industry in terms of capacity is around 780 exabytes per year. The flash industry produces approximately 500 exabytes of capacity annually. Even with forecasted capacity improvements, the world will still confront a severe capacity shortage by 2020.

Tape could help fill the void, with demonstrations of over 300 TB per cartridge coming from IBM and tape manufacturers. How soon tape drives and cartridges supporting these capacities will come to market remains to be seen.

DMI has begun using an avatar, Barry M. Ferrite, "your trusted storage AI", to provide entertaining and informative public service announcements about storage technology and data management. This follows a series of "edutainment" videos we made in 2012-2013 to talk about the state of storage industry infighting at that time.

Each episode of Storage Wars was a mash-up of Star Wars and Annoying Orange. For their "historical value," here is our version of Storage Wars -- Episodes IV, V and VI (labeled Storage Wars, Storage Wars 2 and Storage Wars 3 for YouTube).

Hope you enjoy the trip down memory lane. DMI will be creating more edutainment videos in the future to teach storage fundamentals.

Barry M. Ferrite responded last May to inquiries from many DMI members regarding how to bend the cost curve of storage, which currently accounts for between 30 and 70 cents of every dollar spent annually on IT hardware. He talked about the secondary market, a place where you can buy used hardware at a fraction of the price of new gear and build out your capacity without breaking the bank.

Barry introduced us to ASCDI, an organization for secondary market equipment sales that imposes a code of ethics on members to ensure that consumers get the products they were promised and in good working order. Have a listen.

DMI thanks Joe Marion of ASCDI for offering his perspective for this video.

Scary as it seems, this is actually not an uncommon question from novice IT personnel, especially those who have been taught their trade at schools offered by hypervisor computing vendors or flash technology companies. Yet, tape storage is coming back into vogue in industrial clouds, large data centers and certain vertical industry markets.

Barry M. Ferrite, our trusted storage AI, offered this public service announcement on LTO-7 tape about a year ago to help acquaint newbies with the merits of tape technology. He will likely revisit this subject shortly with the release of LTO-8.

A little over a year ago, DataCore Software's late Chief Scientist, Ziya Aral, released a groundbreaking piece of technology he called adaptive parallel I/O that showed the way to alleviate RAW I/O congestion causing applications, especially virtual machines running in hypervisor environments, to run slowly.

Demonstrations of the effectiveness of adaptive parallel I/O in reducing latency and boosting the performance of VMs exposed the silliness of arguments by leading hypervisor vendors that slow storage was to blame for poor VM performance. Storage was not the problem; the decreasing rate at which I/Os could be placed onto the I/O bus (RAW I/O speed) was the problem.

The problem was that hypervisor vendors didn't want to place blame where it belonged -- with hypervisors and how they use logical cores in multi-core processors. In truth, the error of such an assertion (that storage was responsible for application performance) could be shown just by looking at queue depths on the hosting server. If the queue depth was deep, then slow storage I/O was to blame. Conversely, if queue depths were shallow, as they typically are in the hypervisor computing settings we've seen, then the problem lay elsewhere.
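The queue-depth heuristic described above can be sketched in a few lines of Python. This is an illustrative sketch only -- the function name and the threshold value are our assumptions, not vendor guidance or any standard tool's behavior.

```python
# Hypothetical sketch of the queue-depth diagnostic: a persistently deep
# device queue suggests storage can't keep up with requests; a shallow
# queue suggests the device is starved and the bottleneck is upstream
# (e.g., the rate at which I/Os reach the bus). The threshold of 8 is an
# illustrative assumption.

def diagnose_bottleneck(queue_depth_samples, deep_threshold=8.0):
    """Classify a likely bottleneck from sampled device queue depths."""
    avg_depth = sum(queue_depth_samples) / len(queue_depth_samples)
    if avg_depth >= deep_threshold:
        return "storage"    # requests pile up waiting on the device
    return "upstream"       # device is starved; look at the I/O path

# Shallow queues, as typically observed under hypervisors:
print(diagnose_bottleneck([0.4, 1.1, 0.7, 0.9]))  # -> "upstream"
```

In practice the samples would come from an OS monitoring tool rather than a hard-coded list, but the classification logic is the same.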

Aral and DataCore showed that RAW I/O speeds were to blame and they provided a software shim that converts unused logical CPU cores into a parallel I/O processing engine to resolve the problem. Here is our avatar, Barry M. Ferrite, reviewing the technology in its early days -- at about the same time as Star Wars Episode VII was about to be released.

Since the initial release of Adaptive Parallel I/O technology, DataCore has steadily improved its results as measured by the Storage Performance Council, reaching millions of I/Os per second in SPC benchmarks...on commodity servers from Lenovo and other manufacturers.

In a couple of public service announcements made last year, Barry M. Ferrite, DMI's "trusted storage AI," warned of a coming Z-Pocalypse (zettabyte apocalypse). Archiving is the only solution for dealing with the data deluge.

These PSAs provided some "edutainment" to help folks get started with their archive planning. We hope it helps...

Continuing on this message, Barry returned in the next PSA with this additional information...

Amusing but serious, we hope to add more guidance from Barry in the future on the topics of archive and data management.

The data management market today comprises many products and technologies, but comparatively few that include all of the components and constructs enumerated above. To demonstrate the differences, we surveyed the offerings of vendors that frequently appear in trade press accounts, industry analyst reports and web searches. Our list originally included the following companies:

Avere Systems*

Axaem

CTERA*

Clarity NOW Data Frameworks

Cloudian HyperStore

Cohesity*

Egnyte

ElastiFile*

Gray Meta Platform

IBM*

Komprise

Nasuni*

Panzura*

Primary Data*

QStar Technologies*

Qubix

Seven10

SGI DMF

SGL

ShinyDocs

StarFish Global

StorageDNA*

STRONGBOX Data Solutions*

SwiftStack Object Storage*

Talon*

Tarmin*

Varonis

Versity Storage Manager

Only a subset of these firms responded to our requests for interview (denoted with asterisks) which we submitted by email either to the point of contact identified on their websites or in press releases. After scheduling interviews, we invited respondents to provide us with their “standard analyst or customer product pitch” – usually delivered as a presentation across a web-based platform – and we followed up with questions to enable comparisons of the products with each other. We wrote up our notes from each interview and submitted them to the vendor to ensure that we had not misconstrued or misunderstood their products.
These interviews were updated to ensure their accuracy when comments were received back from the respondents. Following are those discussions.

Ideally, a data management solution will provide a means to monitor data itself – the status of data as reflected in its metadata – since this is how data is instrumented for management in the first place. Metadata can provide insights into data ownership at the application, user, server, and business process level. It also provides information about data access and update frequency and physical location.

A real data management solution will offer a robust mechanism for consolidating and indexing this file metadata into a unified or global namespace construct. This provides uniform access to file listings to all authorized users (machine and human) and a location where policies for managing data over time can be readily applied.
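The global namespace idea above can be made concrete with a small sketch. This is not any vendor's implementation -- the class and field names are our assumptions -- but it shows the core move: harvesting file metadata from multiple storage systems into one index that policies can query uniformly.

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative sketch of a global namespace: metadata records from
# several storage systems are merged into one index, keyed by a
# namespace-qualified path, so that authorized users and policy engines
# see a single consolidated file listing.

@dataclass
class FileRecord:
    path: str            # path on the source system, e.g. "/proj/a.dat"
    owner: str
    size_bytes: int
    last_access: datetime

class GlobalNamespace:
    def __init__(self):
        self._index = {}

    def ingest(self, source, records):
        """Merge metadata harvested from one storage system."""
        for rec in records:
            self._index[f"{source}:{rec.path}"] = rec

    def files_owned_by(self, owner):
        """Uniform query across all ingested systems."""
        return [key for key, rec in self._index.items()
                if rec.owner == owner]

ns = GlobalNamespace()
ns.ingest("nas1", [FileRecord("/proj/a.dat", "alice", 4096,
                              datetime(2017, 1, 1))])
print(ns.files_owned_by("alice"))  # -> ['nas1:/proj/a.dat']
```

A real implementation would crawl file systems and object stores continuously and persist the index, but the unified-view principle is the same.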

That suggests a second function of a comprehensive or real data management solution. It must provide a mechanism for creating management policies and for assigning those policies to specific data to manage it through its useful life.

A data management policy may offer simplistic directions. For example, it may specify that when accesses to the data fall to zero for thirty days, the data should be migrated off of expensive high performance storage to a less expensive, lower performance storage target. However, data management policies can also define more complex interrelationships between data, or they may define specific and granular service changes to data that are to be applied at different times in the data lifecycle. Initially, for example, data may require continuous data protection in the form of a snapshot every few seconds or minutes in order to capture rapidly accruing changes to the data. Over time, however, as update frequency slows, the protective services assigned to the data may also need to change -- from continuous data protection snapshots to nightly backups, for example. Such granular service changes may also be defined in a policy.
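The simple thirty-day policy described above could be encoded as follows. This is a minimal sketch under assumed names -- the tier labels and function signature are ours for illustration, not part of any product.

```python
from datetime import datetime, timedelta

# Hypothetical encoding of the simple policy in the text: if a file has
# not been accessed for 30 days, recommend migrating it from expensive
# high-performance storage to a cheaper, lower-performance target.
# Tier names are illustrative assumptions.

def placement_for(last_access, now, idle_limit=timedelta(days=30)):
    """Return the recommended storage tier for a file."""
    if now - last_access >= idle_limit:
        return "capacity_tier"      # cheap, lower-performance storage
    return "performance_tier"       # expensive, fast storage

now = datetime(2017, 9, 14)
print(placement_for(datetime(2017, 6, 1), now))   # -> "capacity_tier"
print(placement_for(datetime(2017, 9, 10), now))  # -> "performance_tier"
```

The more granular lifecycle policies described above (continuous snapshots giving way to nightly backups) would follow the same pattern, returning a protection-service assignment instead of a tier.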

The policy management framework provides a means to define and use the information from a global namespace to meet the changing storage resource requirements and storage service requirements (protection, preservation and privacy are defined as discrete services) of the data itself. The work of provisioning storage resources and services to data, however, anticipates two additional components of a data management solution.

In addition to a policy management framework and global namespace, a true data management solution requires a storage resource management component and a storage services component. The storage resource management component inventories and tracks the status of the storage that may be used to provide hosting for data. This component monitors the responsiveness of the storage resource to access requests as well as its current capacity usage. It also tracks the performance of various paths to the storage component via networks, interconnects, or fabrics.

The storage services management component performs roughly the same work as the storage resource manager, but with respect to storage services for protection, preservation and privacy. This management engine identifies all service providers, whether they operate on dedicated storage controllers, as part of a software-defined storage stack on a server, or as stand-alone third-party software products. The service manager identifies the load on each provider to ensure that no one provider is overloaded with too many service requests.
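The load-balancing decision the service manager makes can be sketched simply. The provider names and the load metric below are hypothetical; a real manager would track outstanding requests per provider and route new service work accordingly.

```python
# Minimal sketch of load-aware provider selection: route the next
# protection-service request to the provider currently handling the
# fewest outstanding requests. Provider names are made up for the
# example.

def least_loaded(providers):
    """providers: dict mapping provider name -> outstanding requests."""
    return min(providers, key=providers.get)

loads = {"ctrl-A": 12, "sds-B": 3, "backup-C": 7}
print(least_loaded(loads))  # -> "sds-B"
```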

Together with the policy management framework and global namespace, storage resource and storage service managers provide all of the information required by decision-makers to select the appropriate resources and services to provision to the appropriate data at the appropriate time in fulfillment of policy requirements. That is an intelligent data management service – with a human decision-maker providing the “intelligence” to apply the policy and provision resources and services to data.

However, given the amount of data in even a small-to-medium-sized business computing environment, human decision-makers may be overwhelmed by the sheer volume of data management work that is required. For this reason, cognitive computing has found its way into the ideal data management solution.

A cognitive computing engine – whether in the form of an algorithm, a Boolean logic tree, or an artificial intelligence construct – supplements manual methods of data management and makes possible the efficient handling of extremely large and diverse data management workloads. This cognitive engine is the centerpiece of “cognitive data management” and is rapidly becoming the sine qua non of contemporary data management technology and a key differentiator between data management solutions in the market.

Data management means different things to different people. To most, it is a term used to describe the deliberate movement of data between different data storage components during the useful life of the data itself. The rationale for such movement is often helpful in differentiating data management products from one another.

For example, data may be moved to decrease storage costs. Different storage devices may be grouped together by performance and cost characteristics to define “tiers” of storage infrastructure.

Data that is accessed and updated frequently may be best hosted in the highest performance (most costly) tiers, while data that is older and less frequently accessed or updated may be more economically hosted on less performant and less expensive tiers. A product that tracks data access and modification frequency and that moves less active data to slower tiers automatically may be termed a data management solution, though such products are more appropriately termed hierarchical storage management or HSM products.

Similarly, data may be migrated between storage devices to level or optimize the load placed on specific devices or interconnecting links. This may be done to improve overall access performance by introducing target parallelism or simply to scale overall capacity more efficiently. It may also provide a means to enable the decommissioning of certain storage products when they have reached end of life by providing a way to migrate or copy their data to alternative or newer storage with minimal operator intervention. Again, this may be termed data management, but it is actually infrastructure management or scale out architecture.

Moving or copying data between storage platforms may also be performed in order to preserve certain data assets that, while they are rarely accessed or updated, must be retained for legal, regulatory or business reasons. The target “archival” storage may comprise very low cost, very high capacity media such as tape. Technically speaking, this is an archive rather than a data management product.

Archive may be part of data management, but it is not necessarily a data management solution unto itself.

Data management is a policy-driven exercise. True data management involves the active placement of data across infrastructure based on business policy coupled to the business context and value of the data itself. While the expense of the storage target and the frequency of access and modification of the data can – and should – also serve as variables in the determination of policies for how the data should be hosted, what services the data is provided to ensure its protection, preservation and privacy, and when it should be moved or discarded, real data management considers data value, not just storage capacity and cost. Its goal is more than improving capacity allocation efficiency; data management strives to improve capacity utilization efficiency. Data management is business centric, not storage centric.

Many of the products surfaced in web searches as "data management solutions" do not deliver business-value-centric management at all. Some are simply HSM, migration, or archival products. Others have an underlying agenda: to move data out of one architectural or topological model into another. For example, several firms are terming as data management solutions products that are intended to bridge on-premises hosted data into a cloud service model. Others are, under the covers, seeking to move file system data into object storage system models, or hard disk hosted data into solid state storage products leveraging non-volatile memory chips.

While potentially useful migratory tools, these are not necessarily what a consumer is seeking when trying to place data under greater business control so it can be shared more efficiently, used in analytical research more readily, or governed in accordance with the latest legal or regulatory mandate.

Following on this thread, we will look at what we gleaned about the vendors whose products surface in a web search engine query on the term "data management." If you are a vendor or user of any of these products, please expand our research with additional information.