1. There are multiple ways in which analytics is inherently modular. For example:

Business intelligence tools can reasonably be viewed as application development tools. But the “applications” may be developed one report at a time.

The point of a predictive modeling exercise may be to develop a single scoring function that is then integrated into a pre-existing operational application.

Conversely, a recommendation-driven website may be developed a few pages — and hence also a few recommendations — at a time.

Also, analytics is inherently iterative.

Everything I just called “modular” can reasonably be called “iterative” as well.

So can any work process of the nature “OK, we got an insight. Let’s pursue it and get more accuracy.”

If I’m right that analytics is or at least should be modular and iterative, it’s easy to see why people hate multi-year data warehouse creation projects. Perhaps it’s also easy to see why I like the idea of schema-on-need.

Like many other analytic offerings, Datameer is meant to be “self-service”, for line-of-business business analysts, and includes some “data preparation”. Datameer also has had some data profiling since Datameer 4.0.

The main way of interacting with Datameer seems to be visual analytic programming. However, I Datameer has evolved somewhat away from its original spreadsheet metaphor.

Datameer’s primitives resemble those you’d find in SQL (e.g. JOINs, GROUPBYs). More precisely, that would be SQL with a sessionization extension; e.g., there’s a function called GROUPBYGAP.

Many of the companies I talk with boast of freeing business analysts from reliance on IT. This, to put it mildly, is not a unique value proposition. As I wrote in 2012, when I went on a history of analytics posting kick,

Most interesting analytic software has been adopted first and foremost at the departmental level.

People seem to be forgetting that fact.

In particular, I would argue that the following analytic technologies started and prospered largely through departmental adoption:

Fourth-generation languages (the analytically-focused ones, which in fact started out being consumed on a remote/time-sharing basis)

Electronic spreadsheets

1990s-era business intelligence

Dashboards

Fancy-visualization business intelligence

Planning/budgeting

Predictive analytics

Text analytics

Rules engines

What brings me back to the topic is conversations I had this week with Paxata and Metanautix. The Paxata story starts:

… that are meant to be used by business analysts rather than ETL (Extract/Transform/Load) specialists or other IT professionals …

… where what Paxata means by “data preparation” is not specifically what a statistician would mean by the term, but rather generally refers to getting data ready for business intelligence or other analytics.

Metanautix seems to aspire to a more complete full-analytic-stack-without-IT kind of story, but clearly sees the data preparation part as a big part of its value.

If there’s anything new about such stories, it has to be on the transformation side; BI tools have been helping with data extraction since — well, since the dawn of BI. Read more

I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger — but at my request we cycled through a bunch of other topics as well. Company-specific notes include:

Hortonworks founder J. Eric “Eric14″ Baldeschwieler is no longer at Hortonworks, although I imagine he stays closely in touch. What he’s doing next is unspecified, except by the general phrase “his own thing”. (Derrick Harris has more on Eric’s departure.)

John Kreisa still is at Hortonworks, just not as marketing VP. Think instead of partnerships and projects.

~250 employees.

~70-75 subscription customers.

Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:

Hortonworks evidently sees a lot of MicroStrategy and Tableau, and some Platfora and Datameer, the latter two at around the same level of interest.

Accumulo is a big deal in the Federal government, and has gotten a few health care wins as well. Its success is all about security. (Note: That’s all consistent with what I hear elsewhere.)

I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.

Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:

It’s been in preview/release candidate/commercial beta mode for weeks.

Q3 is the goal; H2 is the emphatic goal.

Yahoo’s been in production with YARN >8 months, and has no MapReduce 1 clusters left. (Yahoo has >35,000 Hadoop nodes.)

The last months of delays have been mainly about sprucing up various APIs and protocols, which may need to serve for a similar multi-year period as Hadoop 1’s have. But there also was some YARN stabilization into May.

Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and it’s more than a year between Cloudera starting supporting them and when Hortonworks is offering Hadoop 2.0.

Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include: Read more

Cloudera changed CEOs last week. Tom Reilly, late of ArcSight, is the new guy (I don’t know him), while Mike Olson’s titles become Chairman and Chief Strategy Officer. Mike told me Friday that Reilly had secretly been working with him for months.

Mike shared good-sounding numbers with me. But little is for public disclosure except the stat >400 employees.

There are always rumors of infighting at Cloudera, perhaps because from earliest days Cloudera was a place where tempers are worn on sleeves. That said, Mike denied stories of problems between him and COO Kirk Dunn, and greatly praised Kirk’s successes at large-account sales.

Cloudera now self-identifies pretty clearly as an analytic data management company. The vision is multiple execution engines – MapReduce, Impala, something more memory-centric, etc. – talking to any of a variety of HDFS file formats. While some formats may be optimized for specific engines – e.g. Parquet for Impala – anything can work with more or less anything.*

Mike told me that Cloudera didn’t have any YARN users in production, but thought there would be some by year-end. Even so, he thinks it’s fair to say that Cloudera users have substantial portions of Hadoop 2 in production, for example NameNode failover and HDFS (Hadoop Distributed File System) performance enhancements. Ditto HCatalog.

*Of course, there will always be exceptions. E.g., some formats can be updated on a short-request basis, while others can only be written to via batch conversions.

Everybody else

There’s a widespread belief that Hortonworks is being shopped. Numerous folks – including me — believe the rumor of an Intel offer for $700 million. Higher figures and alternate buyers aren’t as widely believed.

Views of MapR market traction, never high, are again on the downswing.

IBM Big Insights seems to have some traction.

In case there was any remaining doubt — DBMS vendors are pretty unanimous in agreeing that it makes sense to have Hadoop too. To my knowledge SAP hasn’t been as clear about showing a markitecture incorporating Hadoop as most of the others have … but then, SAP’s markitecture is generally less clear than other vendors’.

Folks I talk with are generally wondering where and why Datameer lost its way. That still leaves Datameer ahead of other first-generation Hadoop add-on vendors (Karmasphere, Zettaset, et al.), in that I rarely hear them mentioned at all.

I visited with my client Platfora. Things seem to be going very well.

My former client Revelytix seems to have racked up some nice partnerships. (I had something to do with that. :))

Datameer is designed to let you do simple stuff on large amounts of data, where “large amounts of data” typically means data in Hadoop, and “simple stuff” includes basic versions of a spreadsheet, of BI, and of EtL (Extract/Transform/Load, without much in the way of T).

That’s all still mainly true, although with the recent Datameer 2.0:

You can run Datameer and the underlying Hadoop on a desktop or workgroup group.

There are some infographics pretty-picture-drawing capabilities, which will surely delight those who like vector-based HTML 5 pictures of coffee cups, saucers and macaroons.

No doubt Datameer has been generally enhanced on multiple fronts.

In essence, Datameer has two positionings.

One is “OK, you’ve got Hadoop — now wouldn’t you like to do something useful with it?” That can include both business intelligence and ETL.

Beyond that, Datameer founder/CEO Stefan Groschupf’s core argument is that schema-on-read is really, really useful, even at the cost of absorbing a potentially large performance hit. In other words, he’s making a case for a form of non-relational BI.

I’ve chatted with Datameer a couple of times recently, mainly with CEO Stefan Groschupf, most recently after XLDB last Tuesday. Nothing I learned greatly contradicts what I wrote about Datameer 1 1/2 years ago. In a nutshell, Datameer is designed to let you do simple stuff on large amounts of data, where “large amounts of data” typically means data in Hadoop, and “simple stuff” includes basic versions of a spreadsheet, of BI, and of EtL (Extract/Transform/Load, without much in the way of T).

Stefan reports that these capabilities are appealing to a significant fraction of enterprise or other commercial Hadoop users, especially the EtL and the BI. I don’t doubt him.

Elder care issues have flared up with a vengeance, so I’m not going to be blogging much for a while, and surely not at any length. That said, my first post about Datameer was never going to be very long, so lets get right to it:

Datameer offers a business intelligence and analytics stack that runs on any distribution of Hadoop.

Datameer is still building a lot of features that it talks about, for target release in (I think) the fall.

Datameer’s pride and joy is its user interface. Very laudably for a software start-up, Datameer claims to have spent considerable time with professional user interface designers.

Datameer includes 124 functions one can use in these formulae, ranging from math stuff to text tokenization.

Datameer does some straight BI, with 4 kinds of “visualization” headed for 20 kinds later. But if you want to do hard-core BI, use Datameer to dump data into an RDBMS and then use the BI tool of your choice. (Datameer’s messaging does tend to obscure or even contradict that point.)

Rather, Datameer seems to be designed for the classic MapReduce use cases of ETL and heavy data crunching.