Until I did all this recent research on data warehousing, I didn’t realize just how big a role data mining plays in driving the whole thing. Basically, there are three things you can do with a data warehouse – classical BI, “operational” BI, and data mining. If we’re talking about long-running queries, that’s not operational BI, and it’s not all of classical BI either. The rest is data mining. Indeed, if you think back to what you know of the customer bases at data warehouse appliance vendors Netezza and DATallegro, there are a lot of credit-reporting-data types of users – i.e., data miners. And it’s hard to talk about uses for those appliances very long without SAS extracts and the like coming up.

That was just the analysis. There’s also data mining scoring. In data mining scoring you substitute numbers for values in a table, and then do a row-by-row weighted sum of what results. Or else you do this in real time, for single rows, if that’s your preferred way of deploying things. Just about everybody agrees this is better done “in the DBMS” than in an extract file. Indeed, since the batch version of this is table-scan-to-the-max, scoring turns out to be ideally suited for data warehouse appliances and other MPP/shared-nothing products. (That doesn’t – and shouldn’t – stop Oracle from making scoring integration part of its data mining value-added pitch.)
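To make the substitute-then-weighted-sum step concrete, here’s a minimal Python sketch of batch scoring. All the names and numbers (CODES, WEIGHTS, the sample rows) are illustrative inventions, not taken from any particular product:

```python
# Sketch of data mining scoring: substitute numeric codes for the
# raw values in each row, then take a weighted sum of the codes.
# The codes and weights here are made up for illustration.

CODES = {"region": {"north": 1.0, "south": 2.0},
         "segment": {"retail": 2.0, "wholesale": 4.0}}
WEIGHTS = {"region": 0.5, "segment": 0.25}

def score_row(row):
    """Score a single row -- the real-time deployment style."""
    return sum(WEIGHTS[col] * CODES[col][val] for col, val in row.items())

def score_table(rows):
    """Batch scoring: one full pass over the table (a table scan)."""
    return [score_row(r) for r in rows]

rows = [{"region": "north", "segment": "retail"},
        {"region": "south", "segment": "wholesale"}]
print(score_table(rows))  # [1.0, 2.0]
```

The batch version is one pass over every row, which is why it parallelizes so naturally on shared-nothing hardware: each node can scan and score its own slice of the table independently.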

I further think scoring could be particularly well suited for FPGA-based pattern matching. But I’m not aware of Netezza doing anything in this direction. On the other hand, they may think that the huge projections inherent in data mining scoring – i.e., as a byproduct of variable reduction – mean that the FPGA is anyhow already more than pulling its weight.

The wild card here is the attempt by companies like KXEN and Verix to change the rules of data mining. (Verix currently runs on Oracle, by the way.) KXEN, in particular, would like data mining to be done in a lot more, but probably a lot smaller, processing runs than it is today.

Comments

[…] It sounds as if the product is optimized for data mining and generic OLAP alike. Indeed, SAS Intelligence Storage is used to power both SAS’s data mining and other advanced analytics, and also its more conventional BI suite. […]

[…] Deep analysis and decision support. Routine, scheduled reporting was covered in my first two categories. But this third one is where the bulk of ad hoc query and data mining falls. Generally, it’s where lots of specialized and/or calculation-intensive analytic technology comes into play. It’s also where the drilldown aspect of standard reporting shows up. Also, this is the area driving much of the recent transformation and disruption in the data warehouse market, because different kinds of BI need different kinds of data warehousing technology. […]

“…to change the rules of data mining. (Verix currently runs on Oracle, by the way.) KXEN, in particular, would like data mining to be done in a lot more, but probably a lot smaller, processing runs than it is today.”

You have been reading too much literature from bloated companies selling over-priced tools. Data mining has been done on the desktop for years. I work as a data miner for an international bank and spend much more time on my (admittedly, beefy) PC than I do on the UNIX machine.

The issue isn’t so much whether the traditional tools and data sets of full-time data miners happen to fit on desktop machines in a certain enterprise. (Although I’m curious — what tools do you use, on what kinds of data sets, and what kind of warehouse/mart did they emerge from?)

Rather, KXEN is trying to do a few things:

A. Divide problems into more, smaller, simpler models.
B. Make data mining accessible to people who aren’t data mining experts.
C. Take out a lot of steps from the data mining process, such as variable reduction.

And Verix is trying to outsource the whole data mining task altogether.
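As an illustration of point C, here’s one generic flavor of variable reduction: drop near-constant predictors, then drop one member of each highly correlated pair. This is a hypothetical sketch of the kind of step such tools aim to automate away, not KXEN’s (or anyone’s) actual algorithm:

```python
# Generic sketch of variable reduction. Illustrative only; the
# thresholds and method are inventions for this example.

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def reduce_variables(columns, var_floor=1e-6, corr_cap=0.95):
    """columns maps predictor name -> list of numeric values."""
    # Step 1: drop predictors with (almost) no variance.
    kept = {}
    for name, vals in columns.items():
        mean = sum(vals) / len(vals)
        if sum((v - mean) ** 2 for v in vals) / len(vals) > var_floor:
            kept[name] = vals
    # Step 2: of each highly correlated pair, keep only the first.
    names, dropped = list(kept), set()
    for i, a in enumerate(names):
        if a in dropped:
            continue
        for b in names[i + 1:]:
            if b not in dropped and abs(pearson(kept[a], kept[b])) > corr_cap:
                dropped.add(b)
    return [n for n in names if n not in dropped]

cols = {"x1": [1, 2, 3, 4], "x2": [2, 4, 6, 8],   # x2 = 2 * x1
        "flat": [5, 5, 5, 5], "x3": [4, 1, 3, 2]}
print(reduce_variables(cols))  # ['x1', 'x3']
```

Shrinking several hundred candidate predictors down this way is exactly the kind of manual, expertise-heavy step that the “fewer, smaller, simpler models” pitch proposes to fold into the tool.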

From my conversations with the folks at KXEN, I’d say your point ‘B’ is their real goal, although it is questionable to what extent this is possible. I can tell you that KXEN representatives who pushed point ‘B’ got a skeptical reception at KDD2006 (myself included).

I work at an international bank where I construct statistical models of customer behavior. My last project involved 200,000 records with several hundred candidate predictors. The raw data was housed in a relational database and a local data warehouse, which were accessed via SQL and related querying tools. I completed this project using my tool of choice, MATLAB, running on a Windows PC sporting an AMD Athlon64 FX-53 and 2GB RAM (I have recently upgraded to a Windows PC with an Intel Core 2 Extreme X6800 and 4GB RAM).

As to the cost issue, which was my original point: Neither PC I mentioned cost over US$3000 at the time of purchase (and that includes the monitor). I use MATLAB, which is a little less than $2,000 new (less than $3,000 with typical analytical options), and much less than that (a few hundred bucks) for the annual subscription. The database would have been there anyway; incremental cost: $0. The data extraction software I wouldn’t count, since any business analyst would have it anyway, but I admit that could be debated. The most expensive part of this process? Me. Hiring a qualified nerd to do this work is not cheap.

Yes, you’re right about KXEN. They’re trying to be classic “disruptors.” And KDD2006 — well, that was a conference for the putative disruptees. I don’t recall talking with or about KXEN there, but I have zero difficulty believing everything you’re suggesting about their reception there.

As for your core point — I see what you mean. But let me throw another set of questions back at you: Where did those several hundred candidate predictors come from? Are they ALL from transactional data that HAD to be recorded in the ordinary course of business anyway? Or is there a cost to accumulating the info in the first place?

Thanks for the good discussion,

CAM

Greg on
December 21st, 2006 12:28 pm

Late contribution, but I have to ask: is this from the technical journal of “Duh!” or “Dee dee dee!”?

Why would anyone store such voluminous amounts of data, in such ridiculous formats, if they did not intend to mine the data store?

The data I deal with comes from several sources. Much of it would have been recorded anyway: administrative items and transactions. Some of it comes from other sources, and is purchased largely for analytical purposes: credit bureau data, etc.

I’m not sure where you’re going with this, though, since the answer would be the same whether we used our current process, KXEN or any other analytical tool.