Cloudera sends in the auditors – for Hadoop

Techies need tools to manage cranky Hadoop clusters, and business managers need to manage and report on access to data stored in Hadoop to appease cranky auditors. And so, as part of an update to its CDH4 stack on Tuesday at the Strata conference in San Francisco, Cloudera is previewing a new data visualization and auditing tool that adds this much-needed feature to its big data muncher. The update also includes better data archiving and tweaked Hadoop cluster management tools.

The new data visualization and auditing tool is called Cloudera Navigator 1.0, and it will control freak and document access to data stored in the Hadoop Distributed File System, in the HBase key-value store that rides on top of HDFS, and in the Hive data warehousing layer that overlays HDFS for ad-hoc querying.

Cloudera Navigator is a data-discovery tool, helping analysts figure out what data is being stored in a Hadoop cluster, what formats the data is stored in and where, and how the data got into the system in the first place.

First and foremost, however, it has auditing capabilities that keep track of who did what inside the system, much as other enterprise applications have been doing for many years now. Cloudera Navigator, explains Charles Zedlewski, vice president of products at the company, verifies which users and groups have access to what files and directories in a Hadoop cluster and allows audit tracking to be turned on for each kind of Hadoop service individually.

Cloudera Navigator also has a dashboard that auditors can query to see who has access to what data, and there is an export feature that can take all of the audit information and port it out so it can be sucked into Security Information and Event Management (SIEM) tools.
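To make the audit-and-export idea concrete, here is a minimal sketch in Python. It is not Cloudera Navigator's API or schema: the `AuditEvent` record, its field names, and the helper functions are all hypothetical, invented only to show the kind of "who touched what" query and line-per-event export a SIEM feed might rely on.

```python
import json
from dataclasses import dataclass, asdict


# Hypothetical audit record. The field names are assumptions for
# illustration, not Cloudera Navigator's actual schema.
@dataclass
class AuditEvent:
    timestamp: str   # ISO 8601 time of the access
    user: str
    service: str     # e.g. "hdfs", "hbase", "hive"
    operation: str   # e.g. "open", "get", "query"
    resource: str    # path or table that was touched
    allowed: bool    # whether the access was permitted


def who_touched(events, resource_prefix):
    """Return the sorted list of users who accessed anything under a prefix."""
    return sorted({e.user for e in events if e.resource.startswith(resource_prefix)})


def export_json_lines(events):
    """Serialize events as one JSON object per line for a downstream collector."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in events)


events = [
    AuditEvent("2013-02-26T10:00:00Z", "alice", "hdfs", "open", "/data/claims/2013.csv", True),
    AuditEvent("2013-02-26T10:05:00Z", "bob", "hive", "query", "/data/claims/summary", True),
    AuditEvent("2013-02-26T10:07:00Z", "carol", "hbase", "get", "/data/retail/orders", False),
]

print(who_touched(events, "/data/claims"))
print(export_json_lines(events))
```

The per-service `service` field mirrors the idea that auditing can be switched on for each Hadoop service individually; a real export format would be whatever the SIEM tool in question expects.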

The lack of such tools keeps auditors up at night, and we won't think too hard about how excited they get when they see Hadoop being brought under their watchful eyes.

But the fact remains that various regulations – Sarbanes-Oxley, HIPAA, PCI, Basel II, and so forth – have very strict rules about demonstrating that data is only available to those who are entitled to it. And that is why, says Zedlewski, healthcare, financial, and retail companies have been lining up to beta test Cloudera Navigator.

Cloudera Navigator doesn't mine the actual data, but rather the metadata that is created as information is poured into the Hadoop system. So you cannot do data discovery or auditing on information that is already in a Hadoop cluster, but you can do it for any new information you suck into it or spit out after munching it.

The data discovery side of the tool is important for ease of use as Hadoop clusters scale, too. "The very act of making Hadoop more of a self-service kind of program is more of a challenge on a petabyte-class system than on a terabyte system," says Zedlewski. You could do data discovery on the raw data in a small cluster, perhaps, but on a petabyte-scale Hadoop cluster with thousands of nodes, you might have 10,000 tables but the metadata only weighs in at a few gigabytes of capacity.

Cloudera has also cooked up a new feature called Enterprise BDR, which is short for backup and disaster recovery, that takes the replication features inherent in HDFS as well as HBase and the Hive metastore and coordinates and orchestrates them so you can do backup and recovery on a remote Hadoop cluster. Right now, says Zedlewski, companies have to do a lot of scripting themselves to take the asynchronous replication features in HDFS and Hive and the synchronous replication used in HBase and keep all the data and metadata in synch on a backup cluster.

Beep beep beep. . . .

For those people with Oracle relational databases, Cloudera Enterprise BDR is analogous to Oracle's Data Guard, which is used to keep backup copies of production databases. Zedlewski says that failover between Hadoop clusters has not been automated, and the recovery time objective for failover, given this and the complexity of replication at these different layers in a Hadoop setup, is 30 minutes to an hour, not minutes or seconds. Cloudera is currently working on snapshotting for HDFS and HBase, and that could close the recovery window.

Cloudera is not providing pricing for either the Navigator or Enterprise BDR features, but Zedlewski says it is a "small incremental charge" that will add tens of per cent to a Cloudera support license charge, not double or triple it.

And finally, with the update to the CDH4 stack, Cloudera Manager 4.5 is being kicked out, and you can do rolling updates of the nodes in the cluster rather than having to take the cluster down for four to eight hours to upgrade the nodes in a typical 100-node setup. You can update the Hadoop software more frequently and apply security and other patches as needed without taking the cluster down.
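For readers who have not seen one, a rolling update boils down to upgrading a few nodes at a time and checking them before moving on. The Python sketch below shows that general pattern only; it is not how Cloudera Manager implements it, and the `rolling_update` function, the `upgrade` and `is_healthy` callbacks, and the batch size of five are all assumptions made for illustration.

```python
def rolling_update(nodes, upgrade, is_healthy, batch_size=1):
    """Upgrade nodes in small batches so the rest of the cluster keeps serving.

    Stops at the first batch that contains an unhealthy node, leaving the
    remaining nodes untouched on the old software.
    """
    done = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            upgrade(node)
        failed = [n for n in batch if not is_healthy(n)]
        if failed:
            raise RuntimeError(f"halting rollout, unhealthy nodes: {failed}")
        done.extend(batch)
    return done


# Toy 100-node cluster; "old" and "new" stand in for software versions.
nodes = [f"node{i:03d}" for i in range(100)]
versions = {n: "old" for n in nodes}


def upgrade(node):
    versions[node] = "new"


def is_healthy(node):
    return versions[node] == "new"


upgraded = rolling_update(nodes, upgrade, is_healthy, batch_size=5)
print(len(upgraded), "nodes upgraded, at most 5 out of service at once")
```

The trade-off sits in `batch_size`: a smaller batch keeps more of the cluster serving at any moment, while a larger one finishes the rollout in fewer passes.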

Now all Cloudera needs to do is coordinate the rolling updates of Hadoop with rolling updates of Linux, Java, and other elements underneath Hadoop that also need to be patched in a rolling fashion.

Cloudera may not provide official pricing, but it says that depending on features it costs anywhere from $2,000 to $4,000 per node for a support contract for the CDH4 stack and the Cloudera Manager.
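As a back-of-the-envelope check on what those figures mean for a 100-node cluster, the short Python sketch below does the arithmetic. The 20 per cent uplift is purely an assumed stand-in for "tens of per cent", and it is not clear from Cloudera's statements whether the $2,000 to $4,000 range already includes the add-on features, so the uplift is kept as a separate, optional parameter.

```python
def support_cost(nodes, per_node_dollars, uplift_pct=0):
    """Support bill in whole dollars, with an optional percentage uplift."""
    return nodes * per_node_dollars * (100 + uplift_pct) // 100


NODES = 100               # the "typical" cluster size mentioned in the article
ASSUMED_UPLIFT_PCT = 20   # assumption: one possible reading of "tens of per cent"

low = support_cost(NODES, 2000)
high = support_cost(NODES, 4000)
low_plus = support_cost(NODES, 2000, ASSUMED_UPLIFT_PCT)
high_plus = support_cost(NODES, 4000, ASSUMED_UPLIFT_PCT)

print(f"base support: ${low:,} to ${high:,}")
print(f"with assumed {ASSUMED_UPLIFT_PCT}% uplift: ${low_plus:,} to ${high_plus:,}")
```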

Biz is booming, and Project Impala is impending

Cloudera is privately held and has raised $141m in five rounds of venture funding from a slew of investors, and must be itching to go public or be acquired for some outrageous multiple.

Zedlewski is not about to comment on any of that, but he did say that Cloudera was "steadily moving out of the startup phase" and now has 320 employees and has more than doubled its bookings and revenues in the past year. The company currently has more than 150 paying customers.

EMC's Pivotal Initiative made a big splash ahead of the Strata conference, launching its Hawq SQL database overlay for HDFS, which is a direct competitor to the Project Impala real-time, parallel query extensions that Cloudera cooked up to speed up Hive.

"Impala is going great guns, and we think we will be able to get it to general availability in a month or two," says Zedlewski. ®