I talked with Exasol today (at 5:00 am!) and of course want to blog about it. For clarity, I’d like to start by comparing and contrasting the fundamental data structures at Vertica, ParAccel, and Exasol, and that feels like it should be a separate post. So here goes.

Exasol, Vertica, and ParAccel all store data in columnar formats.

Exasol, Vertica, and ParAccel all compress data heavily.

Exasol, Vertica, and ParAccel all, perhaps to varying extents, operate on in-memory data in compressed formats, although ParAccel decompresses at least some data when it gets to RAM.

ParAccel and Exasol write data to what amounts to the in-memory part of their basic data structures; the data then gets persisted to disk. Vertica, however, has a separate in-memory data structure to accept data and write it to disk.

Vertica is a disk-centric system that doesn’t rely on there being a lot of RAM.

ParAccel can be described that way too; however, in some cases (including on the TPC-H benchmarks), ParAccel recommends loading all your data into RAM for maximum performance.

Exasol is totally optimized for the assumption that queries will be run against data that has already been loaded into RAM.
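To make the compressed-operation point above concrete, here is a toy sketch in Python. It is not any vendor’s actual format (the function names and data are invented for illustration); it just shows why a columnar engine can aggregate run-length-encoded data without decompressing it back into rows.

```python
def rle_encode(column):
    """Compress a column into (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1            # extend the current run
        else:
            runs.append([v, 1])          # start a new run
    return [(v, n) for v, n in runs]

def sum_where(runs, predicate):
    """SELECT SUM(value) WHERE predicate(value), evaluated on the
    compressed representation: one check per run, not per row."""
    return sum(v * n for v, n in runs if predicate(v))

# Low-cardinality columns (typical of warehouse fact tables) compress well.
amount = [10, 10, 10, 7, 7, 10, 10]
runs = rle_encode(amount)
print(runs)                               # [(10, 3), (7, 2), (10, 2)]
print(sum_where(runs, lambda v: v > 8))   # 50
```

The same idea is why compression helps the in-memory systems twice over: the column takes less RAM, and the scan touches fewer items.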

Interesting, since Vertica is much more open in “revealing” what goes on behind the scenes, compared to the other two, whose inner workings are very tough to glean from their websites. I learned a lot more from your site, as well as from another blog called fulltablescan. That is one reason why I have a soft spot for Vertica. BTW, what do you think about Pentaho? I did not see any comments from you anywhere on your site!

Perhaps I should have said that most conclusions drawn from TPCs are jokes. I wouldn’t say that TPCs provide no evidence for any claim at any time.

But if you think about it, in the post I was mainly illustrating what TPCs did NOT show — namely, great disk-centric performance for ParAccel. They may have it, but the TPCs don’t show that, because the TPCs weren’t done on a disk-centric configuration.

I haven’t talked w/ Pentaho. Both Lance Walter and I have been guilty at various times over the past year of being slow getting back to each other. I’m the guiltier of the two.

CAM

Dominika on
August 14th, 2008 5:52 am

WRT [possible] great disk-centric performance for ParAccel:

I’m quite certain they don’t have it, at least not yet. That is why they only used memory-based configurations at the very bottom of the scale factors (100GB, 300GB, and 1000GB).

It would seem that the memory-based solutions (ParAccel, Exasol) are only effective if all of the required data is in memory. For example, take this customer benchmark from Exasol. The first run of the queries took 20 minutes, compared to 24 minutes on the customer’s existing system. The explanation from Exasol is that during the first run EXASolution “completely reorganized the internal data and performed internal optimizations, e.g. it generated an index”. Of course, the subsequent runs took significantly less time, but let’s be realistic: that is not a new trick. Both DB2 and Oracle have features (Query Patroller and Results Cache, respectively) that can simply return the result if a given query is run more than once. IMO, so-called customer benchmarks where the same queries are executed more than once are quite unimpressive.
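The “just return the result” trick described above can be sketched generically. This is an illustrative toy, not Oracle’s or DB2’s implementation (the class and function names are invented), but it shows why timing the second execution of the same queries proves little:

```python
class ResultCache:
    """Memoize query results: a cold run pays full cost, an identical
    re-run is answered from memory without touching the engine."""

    def __init__(self, execute):
        self._execute = execute   # the real (slow) query engine
        self._cache = {}

    def query(self, sql):
        if sql not in self._cache:
            self._cache[sql] = self._execute(sql)  # cold run
        return self._cache[sql]                    # warm run: dict lookup

calls = []
def slow_engine(sql):
    calls.append(sql)             # stand-in for minutes of real work
    return f"result of {sql}"

db = ResultCache(slow_engine)
db.query("SELECT count(*) FROM cdrs")   # executes for real
db.query("SELECT count(*) FROM cdrs")   # served from cache
print(len(calls))                       # 1
```

A benchmark that re-runs identical queries is therefore largely measuring the cache, not the engine.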

Now, if my data warehouse or data mart follows the usual pattern (nightly bulk load, then automated KPI reports or similar), is there any advantage to a product like ParAccel or Exasol? I don’t know the answer, but I am interested to know if you do.

WRT the validity of TPC-H, or conclusions drawn from it:

Whether or not you think TPC-H is valid, there are audited and validated metrics in the full disclosure reports that would probably allow you to cross-check some of the metrics you report on. For instance, in your post on TEOCO you wrote:

“Oracle couldn’t get the load time for 100 million call detail records (CDRs) below 24 hours”

This full disclosure report shows that an Oracle database was able to load the data for the entire 30TB scale factor (almost 260 billion rows) in just over 16 hours. Loading data is not rocket science, but it appears that with TEOCO there was a bit of PEBKAC going on. That also seems to be confirmed by Paul’s comment. Would you agree, Curt?

But yeah, I’d say there was something quite confusing about how the statement was framed. With the numbers that far out of whack, the task has to have been something very different from what we commonly think of as “load”.

I’m pretty sure that, say, Exasol, ParAccel, QlikView, and SAP BI Accelerator all do a much better job than row-based DBMS’ caches do. Compression lets you put more in RAM. Convincing the cache to preload exactly what you want isn’t always as straightforward as running the right query at the right time. Etc.

Pentaho sells open source business intelligence tools. They use the Mondrian ROLAP server, which relies on a back-end DBMS, but Pentaho does not itself provide database technology.

Therefore Pentaho is literally not comparable to Exasol, ParAccel, or Vertica. Pentaho is a different genre of product.

Balaji on
August 15th, 2008 6:12 pm

Seth, thanks very much for having a great website. I got to know that Clareos Crosscut is in fact the ParAccel Analytic Database and, as usual, googled around and found a technical architecture doc for Crosscut. Amazing. But correct me if I am wrong. Now, since I know a lot more about Vertica (i.e., C-Store), I can truly compare Vertica/C-Store with ParAccel/Clareos Crosscut. Regarding your comment on Pentaho, that was a very generic question I asked Curt, and I know very well that Pentaho does not belong to this genre of columnar DBs; again, a very generic question. I am fascinated by Pentaho because of all the material their website provides, as well as its partnerships with both Vertica and ParAccel.
My 2 cents: ParAccel and Vertica are on a collision course, because both are great products with superb engineering brains behind them. Now who will blink first?

[…] Last spring, DATAllegro user John Devolites of TEOCO told me of troubles his firm had had loading CDRs (Call Detail Records) into Oracle, and how those had been instrumental in his eventual adoption of DATAllegro. That claim was contemptuously challenged in a couple of comment threads. […]

Interested to know more about your views on these columnar DBs versus SybaseIQ. Even though everyone bags Sybase, it seems some are holding tightly to SybaseIQ, as they have had it for aeons, in comparison to these newcomers.