When vendors talk about the integration of advanced analytics into database technology, confusion tends to ensue. For example:

Aster Data is generally an exception to this rule, as it should be, since that integration is at the core of its positioning. Even so, in the last paragraph of that link, I called Aster out for what at that time was some product-description nonsense, specifically in an area that many vendors explain confusingly, namely …

… the distinction between three kinds of parallelization.

If you do something entirely in SQL on an MPP system that parallelizes SQL — then it’s parallel!

If you have a parallelization framework such as SQL or MapReduce that can invoke the same function on every node — well, then that’s parallel!

Many algorithms — including almost every important statistical one — have to be explicitly coded to be parallel if they’re actually going to run in parallel. The seminal paper on parallel data mining shows that such parallelization is, in many important cases, straightforward — but somebody still has to take the trouble to actually do it.
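To make that third case concrete, here’s a minimal sketch of the “summation form” trick that paper relies on: each worker computes partial sufficient statistics over its shard of the data, and the partials are simply added up before a final solve. The names and the two-feature least-squares example are mine, purely for illustration; a real MPP or MapReduce deployment would run the shard step on separate nodes, with threads standing in here for portability.

```python
# Illustrative sketch of summation-form parallelism for least squares:
# each shard contributes partial X'X and X'y, which sum across shards.
from concurrent.futures import ThreadPoolExecutor

def partial_stats(shard):
    # Accumulate X'X (2x2) and X'y (2-vector) over one shard of (x, y) pairs.
    xx = [[0.0, 0.0], [0.0, 0.0]]
    xy = [0.0, 0.0]
    for x, y in shard:
        row = (1.0, x)  # intercept term plus one feature
        for i in range(2):
            xy[i] += row[i] * y
            for j in range(2):
                xx[i][j] += row[i] * row[j]
    return xx, xy

def fit(shards):
    # Combine per-shard statistics by addition -- the step that makes the
    # algorithm embarrassingly parallel -- then solve the 2x2 normal equations.
    XX = [[0.0, 0.0], [0.0, 0.0]]
    XY = [0.0, 0.0]
    with ThreadPoolExecutor() as pool:
        for xx, xy in pool.map(partial_stats, shards):
            for i in range(2):
                XY[i] += xy[i]
                for j in range(2):
                    XX[i][j] += xx[i][j]
    # Cramer's rule for the 2x2 system XX * beta = XY.
    det = XX[0][0] * XX[1][1] - XX[0][1] * XX[1][0]
    b0 = (XY[0] * XX[1][1] - XX[0][1] * XY[1]) / det
    b1 = (XX[0][0] * XY[1] - XY[0] * XX[1][0]) / det
    return b0, b1
```

The point is structural: the per-shard work never needs to see another shard’s data, which is exactly the property somebody has to deliberately engineer into an algorithm before an MPP system can exploit it.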

Netezza TwinFin i-Class was renamed/repackaged/repriced before it ever shipped. Even so, when Tim Young or Phil Francisco tries to recall exactly what the “i” stands for, comedy ensues. And the post I promised to write about Netezza TwinFin i-Class in June (as per the last sentence of this post) hasn’t happened yet, for reasons other than lack of interest on my part.

SAS/DBMS integration tends to be a multi-year process, with in-database scoring coming long before in-database modeling. The drip-drip-drip of big-company PR over that time period can be quite bewildering …

… especially since SAS partners in some cases are shipping home-grown in-database modeling long before SAS gives it to them.

Comments

7 Responses to “It can be hard to analyze analytics”

Sam Madden on
October 11th, 2010 10:14 pm

I’m not sure how you decided that the above-referenced paper is the “seminal paper on parallel data mining,” but it looks much more like a very high-level survey of how to implement some machine learning methods in MapReduce.

Specifically, I believe there are two problems with this statement:

1) I would generally say data mining = unsupervised machine learning, and the methods described in the above paper are not entirely unsupervised.

2) There is a very large community of machine learning researchers working on parallelization, and they by and large are not focused on doing this in MapReduce. This research was going on long before 2006.

So it’s probably not accurate to characterize this paper as either “seminal” or “data mining”.

I also think you are wrong that all of these “advanced analytics” algorithms have to be “explicitly coded to be parallel”. Cohen et al (including a number of Greenplum folks and Joe Hellerstein from UC Berkeley) did a really nice job of showing how a bunch of these algorithms can be implemented in parallel in a SQL engine in their 2009 VLDB paper:

– I really enjoyed working with Greenplum and FOX/MySpace to develop and write up their experiences with high end statistical methods implemented in parallel using old-fashioned SQL. This was so healthy: one of those times where I saw surprising new things in the field that I could map back to the research world and increase the positive feedback loops between the two.

– We published this in the industrial track of the VLDB conference in 2009. (Which was, I believe, the one time that you and I met?)

– You happily state that you “always resisted anything from Joe or Greenplum with the ‘MAD’ label”. I.e. you didn’t read the technical material I wrote up for a major conference, nor any of Greenplum’s marketing materials about analytics.

– You assert that Greenplum “went to the other extreme and didn’t talk about its advanced analytics capabilities at all”.

Huh?

I’m especially puzzled because I had this same odd conversation with folks at Aster who I know are smart and conscientious. Putting my machine learning hat on, I would have to posit correlation through a hidden variable.

Anyhow, let’s chalk it up to an oversight. I encourage you to read the paper, you might find it useful. Parts of it are a bit technical, but if you made it through the Chu paper on machine learning with MapReduce you’ll certainly be fine. And as a followup, let me recommend the work of Daisy Wang on doing declarative Bayesian inference for information extraction, e.g. her papers in the last VLDB or ICDE that show how the Viterbi algorithm can be boiled down into a couple dozen lines of recursive SQL.
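For readers curious what that Viterbi claim amounts to, the algorithm is just a dynamic program over states and observations, which is why it compresses so well into recursive SQL. Below is a toy in-memory rendition in Python; the structure (not the SQL formulation itself, and not any probabilities, which are made up here) is what carries over to the declarative version.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = probability of the best state path ending in state s
    # after observation t; back[t][s] remembers the predecessor.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Trace back the most probable path from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

Each row of the V table depends only on the previous row, which is precisely the recurrence a recursive SQL query can express.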

I predict we will be seeing sophisticated analytics written in a whole bunch of ways in the coming years, including but not limited to SQL, MapReduce and parallel extensions to current stat and scientific computing packages. SQL will be important for a large number of establishment customers, so it’s important to keep up with the ways that people are bending it to their will to do analytics. It’s also nice to have MapReduce out there as an alternative syntax, and it’s great to see the open source community rallying around Hadoop. I like seeing lots of tools in the belt, and lots of communication between the people working in the space. That’s how we all make progress.

I think the problem — at least for me — was seeing an over-the-top marketing slogan (the “MAD” stuff) attached to academic work — or was it commercial work? — the separation, if any, wasn’t clear.

The BS siren was buzzing so — well, so MADdeningly — that it was hard to concentrate on the substance.

I didn’t get past the apparent claims that rapid application development as applied to analytics was a unique discovery of yours — or was it of Greenplum’s — or was it of your client’s?

Easiest to just put it aside and go on to other things.

If “MAD” was about advanced analytics rather than general agility, that was very unclear in Greenplum’s marketing, or even in internal discussions to the extent I was privy to same.

It all sounded like a Greenplum marketing program that they’d let slide, much like their brief emphasis on MapReduce.

————————————————–

Even on reviewing the paper, I don’t get it. There’s a lot of preachiness that boils down to “Don’t believe what you hear at TDWI” — not that that’s bad advice, but it’s hardly novel — plus a few quick paragraphs in Section 5 saying “Doing statistics is a good idea.”

Probably I should (and should have) paid more attention to Section 5, and ignored the stuff around it.

Bottom line: It’s not that I didn’t read the paper, it’s just that I have difficulty identifying anything novel or instructive in it.

In fairness, I should note that that’s my reaction to a lot of papers. Obviously, I’m not the target audience.

>Bottom line: It’s not that I didn’t read the paper, it’s just that I
>have difficulty identifying anything novel or instructive in it.

Well I know I can’t please everyone, which is fine.

But seriously, you knew all that stuff already? Data-parallel implementations of the conjugate gradient method? 10-line implementations of the bootstrap in SQL? I sure didn’t — Brian Dolan showed me those.

I thought it was pretty cool that a statistician in the field was (a) pulling this off at scale in standard SQL, and (b) willing to share it with the community. I figured many folks would learn from it like I did, and get some use out of it. The lessons aren’t system-specific either — you can use this stuff on any good shared-nothing engine. Still seems instructive to me (and to lots of folks I talk to).
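The bootstrap really is that compact, and it parallelizes for free: every replicate is an independent resample-and-compute, which is the property the SQL formulation exploits. Here’s a hedged Python rendition to show the shape of it — the function and parameter names are mine, not anything from the paper:

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, reps=1000, alpha=0.05, seed=0):
    # Each replicate resamples the data with replacement and recomputes
    # the statistic. Replicates share nothing, so the loop below is
    # trivially data-parallel.
    rng = random.Random(seed)
    estimates = sorted(
        stat([rng.choice(data) for _ in data]) for _ in range(reps)
    )
    # Percentile confidence interval from the sorted replicate estimates.
    lo = estimates[int(reps * alpha / 2)]
    hi = estimates[int(reps * (1 - alpha / 2))]
    return lo, hi
```

In a shared-nothing SQL engine the same idea becomes a join against a table of replicate IDs plus a grouped aggregate, which is roughly where the “10 lines” come from.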

On the warehouse architecture side, hats off to you for being immune to the conventional wisdom of the data warehousing crowd. On this one I’m sure you’re on solid ground, and we could sit down over beers some time and joke about that stuff, or maybe cry about how it pigeonholed the SQL vendors and IT shops during a critical decade. (“Schema Good! Real Data Bad!”) But it’s still not widely-held conventional wisdom that dirty data is good data, or that you should strive to support unstructured data, its extracted features, and the computational methods all in the same environment. I think it’s important for the industry that folks counter the old DW message with some memorable messages, especially if they’re presented side-by-side with the brass tacks of algorithmics and experience in the field.

It’s a shame that the acronym and UrbanDictionary quote distracted you. Or maybe the fact that the marketing folks at GP thought it was cute and ran with it after the paper appeared. Still, you should be used to marketing folks and be able to cut to the technical stuff, right?