I have a small blacklist of companies I won’t talk with because of their particularly unethical past behavior. Actian is one such; they evidently made stuff up about me that Josh Berkus gullibly posted for them, and I don’t want to have conversations that could be dishonestly used against me.

That said, Peter Boncz isn’t exactly an Actian employee. Rather, he’s the professor who supervised Marcin Zukowski’s PhD thesis that became Vectorwise, and I chatted with Peter by Skype while he was at home in Amsterdam. I believe his assurances that no Actian personnel sat in on the call.

In other news, Peter is currently working on and optimistic about HyPer. But we literally spent less than a minute talking about that.

Before I get to the substance, there’s been a lot of renaming at Actian. To quote Andrew Brust,

… the ParAccel, Pervasive and Vectorwise technologies are being unified under the Actian Analytics Platform brand. Specifically, the ParAccel technology … is being re-branded Actian Matrix; Pervasive’s technologies are rechristened Actian DataFlow and Actian DataConnect; and Vectorwise becomes Actian Vector.

and

Actian … is now “one company, with one voice and one platform” according to its John Santaferraro

The bolded part of the latter quote is untrue — at least in the ordinary sense of the word “one” — but the rest can presumably be taken as company gospel.

All this is by way of preamble to saying that Peter reached out to me about Actian’s new Vector Hadoop Edition when he blogged about it last June, and we finally talked this week. Highlights include:

There have been many recent announcements about how data integration/ETL (Extract/Transform/Load) vendors are going to work with MapReduce. Most of what they say boils down to one or more of a few things:

Hadoop generally stores data in HDFS (Hadoop Distributed File System). ETL vendors want to be able to extract data from or load it into HDFS.

Syncsort thinks different sort algorithms should be usable with Hadoop. Consequently, it plans to contribute technology to the community to make sort pluggable into Hadoop. (However, Syncsort is keeping its own sort technology proprietary.)

Syncsort is considering replicating some Hive functionality, starting with joins, hopefully running much faster. (However, Syncsort’s basic Hadoop support is a quarter or three away, so any more advanced functionality would probably come out in 2012 or beyond.)

SnapLogic fondly thinks that its generation of MapReduce jobs is particularly intelligent.

In my first post-fire briefing, I had a long-scheduled dinner with the Pervasive DataRush folks. Much of DataRush’s positioning, feature evolution, and so on remain To Be Determined. Most existing customers and applications remain To Be Disclosed. What’s more, DataRush is a technology to accelerate applications that

1. Need to be parallelized

2. Should run on SMP rather than shared-nothing hardware

and Pervasive hasn’t done a great job of explaining where #2 applies.

That said, there’s at least one use case for which DataRush should clearly be considered today. Suppose you have a messy ETL/data transformation task that requires custom code. Then I see three main choices:

Write the code within the confines of an off-the-shelf ETL tool.

Write the code to run on an analytic DBMS platform, ideally an MPP/shared-nothing one.

Use something like DataRush (and I’m not familiar with any good alternatives to DataRush).

I’ve made a few references to Pervasive DataRush in the past, but I’ve never gotten around to seriously writing it up. I’ll now try to make partial amends. The key points about Pervasive DataRush are:

DataRush grew out of Pervasive Software’s ETL business, as the underpinnings for a new data transformation tool they were building.

DataRush is a Java framework for doing parallel programming automagically.

Unlike most modern parallelization technologies, DataRush is focused on single SMP (Symmetric MultiProcessing) boxes rather than loosely-coupled grids.

Both Pervasive Software and Cast Iron Systems told me recently of fairly pure cloud offerings. In this, they’re joining Informatica, which started offering Salesforce.com integration-as-a-service back in 2006. So far as I can tell, the three vendors are doing somewhat different things.

Many MPP data warehousing vendors have told me their products are used for ELT (Extract/Load/Transform) instead of ETL (Extract/Transform/Load). I.e., needed data transformations are done on the MPP system, rather than on the — probably SMP — system the data comes from.* If the data transformation is being applied on a record-by-record basis, then it’s automatically fully parallelized. Even if the transforms are more complex, considerable parallel processing may still be going on.

*Or it’s some of each, at which point it’s called ETLT — I bet you can work out what that stands for.
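The record-by-record case can be sketched in a few lines of Python. This is only an illustration, not any vendor's implementation: the `transform` function, the field names, and the use of a thread pool to stand in for MPP nodes are all assumptions made for the example. The point it shows is that when each output row depends on exactly one input row, the data can be split into partitions and transformed independently, with no coordination between partitions.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(record):
    """Hypothetical per-record transform: clean up a name, convert cents to dollars."""
    return {
        "name": record["name"].strip().upper(),
        "amount": record["amount_cents"] / 100,
    }


def split_into_partitions(records, n_partitions):
    """Cut the records into contiguous chunks, one per (simulated) node."""
    size = max(1, -(-len(records) // n_partitions))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def transform_partition(partition):
    """Work done on one node: it only ever looks at its own records."""
    return [transform(r) for r in partition]


def parallel_transform(records, n_partitions=4):
    """Transform every partition independently, then concatenate in order."""
    partitions = split_into_partitions(records, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = list(pool.map(transform_partition, partitions))
    return [row for part in results for row in part]
```

Because no partition needs to see another partition's rows, the result is the same as running `transform` over the records one at a time. Transforms that aggregate or join across records lose that property and need some data movement between nodes, which is why the more complex cases are only partly parallel.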

Call me slow on the uptake if you like, but it’s finally dawned on me that outsourced data marts are a nontrivial segment of the analytics business. For example:

I was just briefed by Vertica, and got the impression that data mart outsourcers may be Vertica’s #3 vertical market, after financial services and telecom. Certainly it seems like they are Vertica’s #3 market if you bundle together data mart outsourcers and more conventional OEMs.

When Netezza started out, a bunch of its early customers were credit data-based analytics outsourcers like Acxiom.

After nagging DATAllegro for a production reference, I finally got a good one: TEOCO. TEOCO specializes in figuring out whether inter-carrier telecom bills are correct. While there’s certainly a transactional invoice-processing aspect to this, the business seems to hinge mainly around doing calculations to figure out correct charges.

I was talking with Pervasive about Pervasive DataRush, a beta product that lets you do super-fast analytics on data even if you never load it into a DBMS in the first place. I challenged them for use cases. One user turns out to be an insurance claims rule-checking outsourcer.

One of Infobright’s references is a French CRM analytics outsourcer, 1024 Degres.

1010data has built up a client base of 50-60, including a number of financial and retail blue-chippers, with a soup-to-nuts BI/analysis/columnar database stack.

I haven’t heard much about Verix in a while, but their niche was combining internal sales figures with external point-of-sale/prescription data to assess retail (especially pharma) microtrends.

Via Data Integrator, Pervasive is a leader in the low-cost integration market, with revenue split about 50/25/25 between direct sales, ISVs, and SaaS. Pervasive fondly believes that its products cost half as much as Cast Iron’s, and wind up taking no more installation effort when you factor in Pervasive’s broader capabilities in areas such as workflow. However, there’s some doubt as to whether this is apples-to-apples. Cast Iron does include hardware, after all, and as Pervasive itself points out, Cast Iron will bundle some professional services into a sale if you ask nicely.

For very high-end applications, the list of viable database management systems is short. Scalability can be a problem. (The rankings of most scalable alternatives differ in the OLTP and data warehouse realms.) Extreme levels of security can be had from only a few DBMS. (Oracle would have you believe there’s only one choice.) And if you truly need 99.99% uptime, there are only a few DBMS you should even consider.

But for most applications at any enterprise – and for all applications at most enterprises – super high-end DBMS aren’t required. There are relatively few applications that wouldn’t run perfectly well on PostgreSQL or EnterpriseDB today. Ingres and Progress OpenEdge aren’t far behind (they’re a little lacking in datatype support). Ditto InterSystems Caché, although the nonrelational architecture will be off-putting to many. And to varying degrees, you can also do fine with MySQL, Pervasive PSQL, MaxDB, or a variety of other products – or for that matter with the cheap or free crippled versions of Oracle, SQL Server, DB2, and Informix.

What’s more, these mid-range database management systems can have significant advantages over their high-end brethren.

Pervasive Software has a long history – 25 years, in fact, as they’re emphasizing in some current marketing. Ownership and company name have changed a few times, as the company went from being an independent startup to being owned by Novell to being independent again. The original product, and still the cash cow, was a linked-list DBMS called Btrieve, eventually renamed Pervasive PSQL as it gained more and more relational functionality.

Pervasive Summit PSQL v10 has just been rolled out, and I wrote a nice little white paper to commemorate the event, describing some of the main advances over v9, primarily for the benefit of current Pervasive PSQL developers. In one major advance, Pervasive made the SQL functionality much stronger. In particular, you now can have a regular SQL data dictionary, so that the database can be used for other purposes – BI, additional apps, whatever. Apparently, that wasn’t possible before, although it had been possible in yet earlier releases. Pervasive also added view-based security permissions, which is obviously a Very Good Thing.