Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.

Points of integration, e.g. taking information about security-oriented roles from the platform and feeding them to the role-based security that is specific to Cloudera Enterprise.

Features new in this week’s release of Cloudera Director include:

An API for job submission.

Support for spot and preemptable instances.

High availability.

Kerberos.

Some cluster repair.

Some cluster cloning.

I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Windows Azure, VMware and OpenStack.

As for porting, let me start by noting:

Shared-nothing analytic systems, RDBMS and Hadoop alike, run much better in the cloud than they used to.

Even so, it seems that the future of Hadoop in the cloud is to rely on object storage, such as Amazon S3.

That makes sense in part because:

The applications where shared nothing most drastically outshines object storage are probably the ones in which data can just be filtered from disk — spinning-rust or solid-state as the case may be — and processed in place.

By way of contrast, if data is being redistributed a lot then the shared nothing benefit applies to a much smaller fraction of the overall workload.

The latter group of apps are probably the harder ones to optimize for.

But while it makes sense, much of what’s hardest about the ports involves the move to object storage. The status of that is roughly:

Cloudera already has a lot of its software running on Amazon S3, with Impala/Parquet in beta.

Object storage integration for Windows Azure is “in progress”.

Object storage integration for Google GCP is “to be determined”.

Security for object storage — e.g. encryption — is a work in progress.

When I asked about particularly hard parts of porting to object storage, I got three specifics. Two of them sounded like challenges around having less detailed control, specifically in the area of consistency model and capacity planning. The third I frankly didn’t understand,* which was the semantics of move operations, relating to the fact that they were constant time in HDFS, but linear in size on object stores.

*It’s rarely obvious to me why something is O(1) until it is explained to me.

Naturally, we talked about competition, differentiation, adoption and all that stuff. Highlights included:

In general, Cloudera’s three big marketing messages these days can be summarized as “Fast”, “Easy”, and “Secure”.

Notwithstanding the differences as to which parts of the Cloudera stack run on premises, on Amazon AWS, on Microsoft Azure or on Google GCP, Cloudera thinks it’s important that its offering is the “same” on all platforms, which allows “hybrid” deployment.

In general, Cloudera still sees Hortonworks as a much bigger competitor than MapR or IBM.

Cloudera fondly believes that Cloudera Manager is a significant competitive advantage vs. Ambari. (This would presumably be part of the “Easy” claim.)

In particular, Cloudera asserts it has better troubleshooting/monitoring than the cloud alternatives do, because of superior drilldown into details.

Cloudera’s big competitor on the Amazon platform is Elastic MapReduce (EMR). Cloudera points out that EMR lacks various capabilities that are in the Cloudera stack. Of course, versions of these capabilities are sometimes found in other Amazon offerings, such as Redshift.

Cloudera’s big competitor on Azure is HDInsight. Cloudera sells against that via:

Cloudera tries to deposition competitors as being good mainly at these kinds of jobs.

This can be reasonably said to be the original sweet spot of Hadoop and MapReduce — which fits with Cloudera’s attempt to portray competitors as technical laggards.

Cloudera observes that these workloads tend to call for “transient” jobs. Lazier marketers might trot out the word “elasticity”.

BI (Business Intelligence) and “analytics”, by which Cloudera seems to mainly mean Impala and Spark.

“Application delivery”, by which Cloudera means operational stuff that can’t be allowed to go down. Presumably, this is a rough match to what I — and by now a lot of other folks as well — call short-request processing.

While I don’t agree with terminology that says modeling is not analytics, the basic distinction being drawn here makes considerable sense.

Comments

I can explain the “semantics of move operations, relating to the fact that they were constant time in HDFS, but linear in size on object stores.”
Moving a file in a regular file system is just a change in the metadata. It not only happens in constant time, it is effectively instant.
As a result, software designers use it widely. For example, we write files to some directory and, when all of them are ready, we just move them to the final location, thus avoiding partial results in case of failure. Among others, Hadoop MapReduce takes this approach…
In an object store, the whole file path, e.g. s3://bucket_name/dir1/dir2/file_name.txt, is actually one key, which determines on which server the file has to be stored. So changing anything in the name will almost always require a physical move of the file to a new server…
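
The write-then-move pattern described in this comment can be sketched roughly as follows, for an ordinary local file system. This is only an illustration: the function name `commit_outputs` and the `_staging_` prefix are made up here, and it is not how Hadoop's own output committers are implemented.

```python
import os
import tempfile

def commit_outputs(results, final_dir):
    """Write all files to a staging directory, then publish them with one rename.

    `results` maps file names to text. `final_dir` must not exist yet.
    (Hypothetical helper, for illustration only.)
    """
    parent = os.path.dirname(os.path.abspath(final_dir))
    # Stage next to the destination so both live on the same file system.
    staging = tempfile.mkdtemp(prefix="_staging_", dir=parent)
    for name, text in results.items():
        with open(os.path.join(staging, name), "w") as f:
            f.write(text)
    # The rename only changes directory metadata; no file contents are copied,
    # so readers see either no output directory or the complete one.
    os.rename(staging, final_dir)
    return final_dir
```

If the job fails before the final rename, only the staging directory is left behind, so nothing partial shows up under the destination name. That guarantee is what gets lost when rename turns into copy-and-delete.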

Regarding rename, yes, object stores often map keys to physical storage locations. Imagine a hashed scheme, where the key is hashed to determine what server has the value.
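
To make the "imagine a hashed scheme" idea concrete, here is a toy sketch. It assumes placement is simply a hash of the key modulo the number of servers; real object stores, S3 included, do not publish their placement logic, so `server_for` and `rename_moves_data` are hypothetical names for illustration only.

```python
import hashlib

def server_for(key, num_servers):
    """Map an object key to a server index by hashing the whole key (toy scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers

def rename_moves_data(old_key, new_key, num_servers):
    """True if renaming would place the object on a different server."""
    return server_for(old_key, num_servers) != server_for(new_key, num_servers)
```

Under this assumption, changing any part of the key, even just a "directory" component, changes the hash, so with many servers the object usually has to be physically copied to a different one.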

David also identified the other big issue with object stores and rename: directory rename. In HDFS, directory rename is an atomic O(1) metadata-only operation. S3 doesn’t have directories, though users often still name their keys “/foo/bar/baz” for familiarity (and since S3 supports prefix listings). This means though that a “directory” rename in S3 involves renaming each key individually, which is O(n) metadata, and O(n) data since rename is really a copy. Apps can run into issues here since the rename is not atomic; blobs will start showing up in the destination part way through a directory rename.
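
The cost and the partial visibility of a "directory" rename can be illustrated with a toy in-memory store, where a plain dict stands in for the object store. This is a sketch under that assumption, not the S3 API; `rename_prefix` is a made-up name.

```python
def rename_prefix(store, old_prefix, new_prefix):
    """Rename every key under old_prefix, one object at a time.

    Yields (new_key, total_bytes_copied) after each object, so a caller can
    inspect the store in its intermediate, partially renamed state.
    """
    copied_bytes = 0
    # Materialize the key list first; the dict is modified inside the loop.
    for key in sorted(k for k in store if k.startswith(old_prefix)):
        new_key = new_prefix + key[len(old_prefix):]
        store[new_key] = store[key]  # stands in for a copy that costs O(object size)
        del store[key]               # then the old key is deleted
        copied_bytes += len(store[new_key])
        yield new_key, copied_bytes
```

Stepping through the generator shows the problem: after the first step, some objects already appear under the new prefix while others still sit under the old one, and the total bytes copied grows with the amount of data under the prefix.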

Netflix wrote a small open-source metadata layer on top of S3 called s3mper to fix the partial visibility issue for their MapReduce workflows.