Accumulo: Why The World Needs Another NoSQL Database

If you’ve been unable to keep up with all the competing NoSQL databases that have hit the market over the last several years, you’re not alone. To name just a few, there’s HBase, Cassandra, MongoDB, Riak, CouchDB, Redis, and Neo4j.

To that list you can add Accumulo, an open source database originally developed at the National Security Agency. You may be wondering why the world needs yet another database to handle large volumes of multi-structured data. The answer is, of course, that not one of these NoSQL databases has yet checked all the feature/functionality boxes that most enterprises require before deploying a new technology.

In the Big Data world, that means the ability to handle the three V’s (volume, variety and velocity) of data, the ability to process multiple types of workloads (analytical vs. transactional), and the ability to maintain ACID (atomicity, consistency, isolation and durability) compliance at scale. With each new NoSQL entrant, hope springs eternal that this one will prove the NoSQL messiah.

So what makes Accumulo different from all the rest? According to proponents, Accumulo is capable of maintaining consistency even as it scales to thousands of nodes and petabytes of data; it can both read and write data in near real-time; and, most importantly, it was built from the ground up with cell-level security functionality.

It’s the third feature – cell-level security – that has the Big Data community most excited. Accumulo is being positioned as an all-purpose Hadoop database and a competitor to HBase. While HBase, like Accumulo, is able to scale to thousands of machines while maintaining a relatively high level of consistency, it was not designed with any security, let alone cell-level security, in mind.

As a result, for much of its existence administrators were forced to deploy HBase in isolation, either granting users carte-blanche access to the database or no access at all. There was no middle ground. Out of necessity, HBase user Trend Micro developed table-level security for HBase, which the company unveiled to the open source community at HBaseCon in May. Still, it was a time-consuming, complex project, and table-level security does not allow the granularity of control that enterprises in highly regulated industries – namely finance, government and healthcare – seek and/or require.

Why is Cell-Level Security Important?

Cell-level security – that is, the ability to assign user access permissions down to the level of the individual cell – is significant to Big Data scenarios because it allows administrators to extend the access and functionality of a given database to the maximum number of users while still remaining in compliance with applicable privacy and security regulations.

Consider the following (admittedly simplified) scenario at a healthcare-related enterprise. The enterprise collects and stores large volumes of patient data, both clinical and financial, from a variety of internal and external sources in a NoSQL database. HIPAA and other regulations stipulate that non-approved personnel may not access certain data that makes up this large data store. Now assume the enterprise would like to let its team of Data Scientists loose on the database to find patterns and develop predictions that will help patient outcomes.

Without cell-level security capabilities, each Data Scientist would have to meet the threshold of an “approved” user under HIPAA even if he or she only intended to analyze a subset of non-sensitive data stored in the database. Even with table-level security, certain users could be prohibited from accessing important but non-regulated data if it resides in the same table as just one cell containing regulated data. Alternatively, the sensitive data might remain sequestered in an isolated database, where it is unavailable for analysis or use by applications.

With cell-level security, users can be allowed access to the database and all tables within it, but with specified, regulated cells hidden from anyone not authorized to see them. Each member of the Data Scientist team in our healthcare scenario, for example, would be allowed to mine the data stored in the NoSQL database with the exception of data covered by HIPAA or other regulations, which would essentially be blacked out from his or her view. Thus, cell-level security would allow more users access to the database, theoretically resulting in faster and/or more enlightening analysis.
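To make the healthcare scenario concrete, here is a minimal sketch in plain Python of how cell-level filtering can work. It is a toy model, not Accumulo's actual API (Accumulo itself is written in Java). The idea it illustrates is that each cell carries a visibility expression built from labels joined by & (and) and | (or), and a scan returns only the cells whose expression is satisfied by the labels the reader holds. The names Cell, is_visible and scan, the labels phi, finance and research, and the sample rows are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    visibility: str  # label expression; empty string means unrestricted
    value: str

# A token is a label (letters, digits, underscore) or one of & | ( )
_TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")

def _tokenize(expr):
    expr = expr.strip()
    tokens, pos = [], 0
    while pos < len(expr):
        match = _TOKEN.match(expr, pos)
        if match is None:
            raise ValueError(f"bad visibility expression: {expr!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def is_visible(expr, auths):
    """Return True if the set of labels `auths` satisfies `expr`."""
    tokens = _tokenize(expr)
    if not tokens:
        return True  # unlabeled cells are visible to everyone
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_term()
            result = result and rhs
        return result

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError(f"incomplete expression: {expr!r}")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError(f"missing ')' in: {expr!r}")
            pos += 1
            return result
        if tok in {"&", "|", ")"}:
            raise ValueError(f"unexpected {tok!r} in: {expr!r}")
        return tok in auths  # a bare label is satisfied if the reader holds it

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"trailing tokens in: {expr!r}")
    return result

def scan(cells, auths):
    """Return only the cells the reader is authorized to see."""
    return [c for c in cells if is_visible(c.visibility, auths)]

# Hypothetical patient record: one row, four cells, different labels per cell.
cells = [
    Cell("patient-001", "visit:date", "", "2012-08-01"),
    Cell("patient-001", "clinical:diagnosis", "phi", "example diagnosis"),
    Cell("patient-001", "billing:balance", "phi&finance", "120.00"),
    Cell("patient-001", "study:cohort", "research|phi", "B"),
]

scientist_view = scan(cells, {"research"})
auditor_view = scan(cells, {"phi", "finance"})
```

In this sketch a Data Scientist holding only the research label gets back the visit date and study cohort cells, while a reader holding both phi and finance gets back all four. The regulated cells are simply absent from the first result, which is the "blacked out" behavior described above, and both readers query the same table.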

Introducing Sqrrl

The combination of consistency at scale, fast read/writes and cell-level security makes Accumulo an intriguing entrant to the NoSQL community. As mentioned earlier, the database got its start at the NSA, where the agency’s technologists began efforts to reverse engineer Google BigTable in 2008, as told in this Wired magazine account.

The agency needed a scalable, secure database with which to store and analyze the never-ending stream of data it collects as part of its mandate “to understand the secret communications of our foreign adversaries.1” BigTable was the best database it could find to meet its requirements, but even it lacked cell-level security. As such, engineers at the NSA developed and natively incorporated cell-level security capabilities into their reverse-engineered version of BigTable, and Accumulo was born.

Three years later, in 2011, the NSA open sourced Accumulo, submitting the database to the Apache Software Foundation for review. And this week, a start-up called Sqrrl emerged from stealth mode to commercialize Accumulo for use in the enterprise. The company is aiming Accumulo at enterprises in healthcare and financial services. Both are highly regulated industries that require rigorous controls over data access and security.

When he left Vertica earlier this year, Lynch vowed to help fund no fewer than 20 Big Data start-ups in the Boston area, and he is well on his way. With Sqrrl now in the fold, the Atlas Big Data portfolio includes at least six Boston-based start-ups, including Hadapt, Hopper and DataXu. Wikibon CEO and Chief Analyst David Vellante and I visited the Atlas office in May, where we interviewed Lynch as well as the founders of a number of these Atlas-funded Big Data start-ups inside theCUBE. You can watch all the videos, including Wikibon’s analysis of the Big Data market, here.

Wikibon caught up earlier today with Lynch, who said Atlas is inundated with pitches for funding from Big Data start-ups, but “Accumulo is the only commercially viable version of BigTable on the market.”

Returning to the theme struck earlier, there is no shortage of NoSQL databases from which to choose to support Big Data projects. While this can make navigating the NoSQL landscape an exercise in confusion for CIOs and IT managers tasked with implementing Big Data Analytics inside the enterprise, in the long term I believe a large and active NoSQL community will help, not hurt, the adoption of Big Data. While many NoSQL databases will flame out, the best of the best will emerge even stronger thanks to the competition.

As for Accumulo, its ability to support cell-level security, maintain consistency at scale and perform fast read/writes puts the database and Sqrrl in an advantageous position when pitching enterprise customers, particularly those in highly regulated industries. As with any start-up in this space, Sqrrl’s challenge is to continue enhancing Accumulo and to develop and execute a business model that allows the company to successfully monetize Accumulo while remaining true to the database’s open source roots.

Sqrrl is currently in the midst of relocating from Washington to Cambridge’s Kendall Sq., just a stone’s throw from MIT to the north and Boston’s financial district to the south. This should make its job of attracting and retaining talented engineering, sales, marketing and business development staff that much easier. A brilliant team of founders and $2 million in seed funding don’t hurt either.