Why Storage Needs More Structure

The storage industry has been hyping the growth of unstructured data for years. But unstructured data without a structured means of finding, accessing, and retrieving files and objects is basically useless. Many storage administrators are now struggling, since organizing unstructured data after the fact is complex and difficult to scale. That is starting to change with a few relatively new open source offerings.

Typically, “unstructured” refers to a mix of file types that get amalgamated into one giant data store. For example, productivity suite files (documents, spreadsheets, presentations), music files, videos and photos all fall into this unstructured category. With the availability of lower-cost storage devices like SATA drives, it became economically feasible for companies to keep more and more files online. And with the surge in freely hosted, consumer-generated content, a market was born for file systems that could scale to nearly infinite capacity. Global namespaces and clustered file systems addressed the desire to have a single data store, but often at other, hidden costs.

Think about it this way: With a single global data store comes a single global index. Combine more files with more simultaneous requests and you have a bottleneck. The system cannot find the files quickly enough, because the index gets overloaded. To address this issue, many system architects deploy a database like MySQL to serve as a separate reference for the file system. Each file is assigned a unique identification number, and the database maps that ID to a specific location on the file system. Application requests that carry the ID can then go directly to the file, bypassing directory crawls and boosting performance. But even that solution can only go so far.
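The ID-to-location lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name file_index, the helper functions, and the example path are all hypothetical.

```python
import sqlite3


def open_index():
    """Create an in-memory index (sqlite3 standing in for a MySQL server)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_index ("
        " file_id INTEGER PRIMARY KEY,"  # unique ID handed to the application
        " path TEXT NOT NULL)"           # physical location on the file system
    )
    return conn


def register_file(conn, path):
    """Record a file's location and return the unique ID assigned to it."""
    cur = conn.execute("INSERT INTO file_index (path) VALUES (?)", (path,))
    conn.commit()
    return cur.lastrowid


def locate_file(conn, file_id):
    """Resolve an ID to a path with one primary-key lookup, no directory crawl."""
    row = conn.execute(
        "SELECT path FROM file_index WHERE file_id = ?", (file_id,)
    ).fetchone()
    return row[0] if row else None


# Hypothetical usage: the path below is made up for illustration.
conn = open_index()
photo_id = register_file(conn, "/data/vol07/3f/a9/photo_1234.jpg")
print(locate_file(conn, photo_id))
```

The application stores only the ID; the file system is touched once, at the resolved path. The catch is that every request now depends on the database, which is exactly where the scaling problem moves next.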

Scaling MySQL is one of the most valued skills in running large web applications, and it is by no means a simple feat. Combine that with the frequent need for caching, and you now have a three-part solution (file system, separate database referencing, and caching) to meet the simple objective of quick and efficient file retrieval for large amounts of data. Why not start fresh with an approach that combines the need for structured access with a large file store? A few enterprising folks have done just that with an open source project called Cassandra. Billed as a “distributed storage system for managing structured data while providing reliability at a massive scale,” Cassandra essentially fuses the database referencing model with the near-infinite capacity of the global and clustered file systems.
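To make the three-part arrangement concrete, here is a minimal sketch of the caching layer sitting in front of the database reference. Everything in it is an assumption for illustration: a plain dict stands in for the database index, a small in-process LRU cache stands in for a real cache tier, and the class names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache; a stand-in for a real cache tier."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used


class FileLocator:
    """Resolve file IDs to paths: check the cache first, then the database."""

    def __init__(self, index, cache):
        self.index = index    # dict standing in for the database table
        self.cache = cache
        self.db_lookups = 0   # counts how often the "database" is hit

    def locate(self, file_id):
        path = self.cache.get(file_id)
        if path is not None:
            return path
        self.db_lookups += 1
        path = self.index.get(file_id)
        if path is not None:
            self.cache.put(file_id, path)
        return path


# Hypothetical usage with made-up IDs and paths.
locator = FileLocator({1: "/data/a.jpg", 2: "/data/b.jpg"}, LRUCache(capacity=1))
locator.locate(1)
locator.locate(1)  # served from the cache, no database hit
print(locator.db_lookups)
```

Even in this toy form, three moving parts have to agree with one another (file system, index, cache), which is the operational burden the paragraph above describes.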

Cassandra was initially developed at Facebook, where it was intended to solve the problem of Inbox search. The engineers there wanted something that was fast, reliable, and could handle significant throughput of both write and read requests. And since the messaging never stops at Facebook, they needed a system that could both store all of the data and provide rapid, predictable results for search queries. Cassandra implements a database-style approach to managing structured data with a flexible data model that can be dynamically controlled. It bases this on a table with rows, each identified by a key, and a corresponding series of column families. A more in-depth explanation can be found here.
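The shape of that data model (a table of rows, each identified by a key, with columns grouped into column families) can be pictured with nested dictionaries. This is a toy illustration only, not Cassandra's API: the Table class, its methods, and the Inbox-search family and column names are hypothetical, and the sketch assumes column families are declared up front while columns inside a family are added freely.

```python
class Table:
    """Toy model: row key -> column family -> column name -> value."""

    def __init__(self, column_families):
        # Assumption: the set of column families is fixed when the table is created.
        self.column_families = set(column_families)
        self._rows = {}

    def _check_family(self, family):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")

    def insert(self, row_key, family, column, value):
        """Set one column; columns within a family need no prior declaration."""
        self._check_family(family)
        row = self._rows.setdefault(
            row_key, {cf: {} for cf in self.column_families}
        )
        row[family][column] = value

    def get(self, row_key, family):
        """Return a copy of all columns in one family of one row."""
        self._check_family(family)
        if row_key not in self._rows:
            return {}
        return dict(self._rows[row_key][family])


# Hypothetical Inbox-search layout: one row per user, one column per search term.
inbox = Table(["terms", "senders"])
inbox.insert("user42", "terms", "lunch", ["msg-1", "msg-7"])
inbox.insert("user42", "terms", "deadline", ["msg-3"])
inbox.insert("user42", "senders", "alice", ["msg-1"])
print(inbox.get("user42", "terms"))
```

A search for a term then becomes a lookup by row key and column name rather than a scan, which is the property that makes this layout attractive for high-volume reads and writes.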

Other potential applications beyond Inbox search include recommendation engines, targeted advertising, and content search, particularly when you combine many concurrent input and output requests against the same data set. Earlier this year, the project was accepted into the Apache incubator program. A project page also exists at Google Code.

Three other open source projects are also bringing more structure back to the overall system: HBase, built on the Hadoop core; Hypertable, modeled after Google’s Bigtable; and OpenNeptune. These solutions deliver rapid access to individual records within extremely large data sets. All aim (or claim) to reach petabyte scale, with up to billions of rows and millions of columns. Combined, these projects represent a turn of the tide for companies relying on big data: success will be based not on how much data they can retain, but on how quickly and efficiently they can use that data.