I visited Greenplum in early April, and talked with them again last night. As I noted in a separate post, there are a couple of subjects I won’t write about today. But that still leaves me free to cover a number of other points about Greenplum, including:

After much prodding, Greenplum finally gave me clear list pricing. Greenplum perpetual licenses list at $16K/core or $70K/terabyte. Annual maintenance is 22% of the purchase price. Alternatively, one can buy an annual subscription on either basis, at 50% of the perpetual license price. Of course, that's just list. Quantity discounts are de rigueur.

Greenplum had about 65 paying customers at the end of Q1. I’ve forgotten how that jibes with a figure of 50 customers last August.

Greenplum claims rich functionality in standard SQL. In particular, Greenplum says “lots” of customers are using SQL 2003 OLAP. Greenplum further says it has “comprehensive” SQL-92 and SQL-99 support.

Greenplum Release 3.3 has “more flexible” compression, which Greenplum bravely asserts is now fairly close to columnar compression in effectiveness. (Aster Data and other row-based vendors make similar claims.)

Greenplum Release 3.3 contains a few performance enhancements for analytics: fixing an OLAP edge case that wasn’t previously parallelized — relevant buzzwords include grouping, aggregates, and DISTINCT, apparently in combination with each other — and speeding up sorts.

Greenplum’s data loading story goes something like this:

Greenplum has an external tables facility that, in principle, could be used to index and query on tables outside Greenplum. It’s almost never actually used for that. However, external tables is the main way to load data into Greenplum from another relational DBMS.

A huge benefit of loading Greenplum via external tables is that you can load in parallel without passing the data through the master node.

Another benefit is that you can do ETL by building a view on the foreign database, then loading that view verbatim into Greenplum. (I guess this is an exception to Greenplum’s ELT orientation.)

In addition, Greenplum has something called Scatter/Gather, which puts daemons on the hosts holding flat files, allowing the files to be loaded into Greenplum in parallel.

Like many data warehouse DBMS vendors, Greenplum tells you that if update volumes are high enough, you should bang them into something else and then feed the data warehouse in microbatches. Greenplum’s recommendations for the “something else” are PostgreSQL or file systems. Apparently, this is happening at some Greenplum telco customers. In one case, latency is only 15 seconds.
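The microbatch pattern described above can be sketched as follows. This is a purely illustrative sketch under my own assumptions — `MicrobatchLoader`, the thresholds, and `load_to_warehouse` are hypothetical names, not Greenplum or PostgreSQL APIs:

```python
import time
from collections import deque


class MicrobatchLoader:
    """Hypothetical sketch: buffer high-volume updates in a staging area
    (standing in for PostgreSQL or a file system) and flush them to the
    warehouse in batches every few seconds, rather than per-row."""

    def __init__(self, flush_interval_secs=15, max_batch=10_000):
        self.buffer = deque()
        self.flush_interval = flush_interval_secs
        self.max_batch = max_batch
        self.last_flush = time.monotonic()

    def ingest(self, row):
        """Accept one incoming update; flush if a batch is full or stale."""
        self.buffer.append(row)
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        """Move the buffered rows into the warehouse as one batch."""
        batch, self.buffer = list(self.buffer), deque()
        self.last_flush = time.monotonic()
        self.load_to_warehouse(batch)

    def load_to_warehouse(self, batch):
        # In practice this would be a parallel bulk load (e.g. via
        # external tables), not row-by-row INSERTs.
        print(f"loaded batch of {len(batch)} rows")
```

With a 15-second flush interval, the staging latency matches the figure the telco example cites.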

In general, Greenplum asserts that very little work is done at the Greenplum master node, and the Greenplum master node isn’t a bottleneck.

Greenplum proudly promises that its customers will never have to do dump/restore for any release, even the big more-than-point ones that only come around every few years.

Greenplum added some Greenplum-awareness features to the pgAdmin III Postgres administration tool, which seems to be the most widely used tool with Greenplum today.

Greenplum says it has 10-gigabit switches running in the lab, but doesn’t need them. For now it’s sticking with its “handful of commodity 1-gigabit switches” strategy.

Greenplum MapReduce news and commentary include:

I’ve only ever gotten a single clear example of Greenplum MapReduce production use. But multiple Greenplum users are actively developing in MapReduce, judging by their dialog with the company.

Greenplum 3.3 has some MapReduce ease-of-use/programming upgrades, in low-glitz areas such as error-handling.

Greenplum’s current MapReduce language support is: Perl, Python, R, and C. Java didn’t make it into Release 3.3.

Greenplum agrees with the MapReduce skeptics’ claim that you can in principle do anything in UDFs (User-Defined Functions) you can in MapReduce, but believes that sometimes doing it in MapReduce turns out to be easier.

One point of comparison: a couple of months ago, a now-Vertica customer benchmarked Vertica and one of the aforementioned DBs, and a deciding factor was the relative amount of storage hardware required. 1TB of web app event data compressed to 200GB in Vertica (an 80% reduction). The same data “compressed” to greater than 1TB in the other. I think in the end, the competitor’s DB was 8x larger than the Vertica physical DB size. 8x less storage = faster performance (less I/O) and, more obviously, lots less hardware when you’re managing dozens of TBs (uncompressed) of data.

In our lab testing we’ve seen fast block-compression schemes achieve up to approximately two-thirds of the theoretical maximum compression rate for typical datasets (i.e., if the theoretical max is 6x compression, the best fast compression schemes will achieve approximately 4x). We see roughly the same compression (give or take 10-20%) whether the data is laid out in rows or in an idealized columnar representation.
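The arithmetic behind these figures is easy to check. A short sketch, using only the numbers quoted in the comments above:

```python
def compression_ratio(raw_bytes, compressed_bytes):
    """Return the compression ratio; e.g. 5.0 means 5x smaller."""
    return raw_bytes / compressed_bytes


# Vertica figure from the comment: 1 TB of event data compressed to 200 GB.
raw, compressed = 1000, 200  # in GB
ratio = compression_ratio(raw, compressed)
reduction = 1 - compressed / raw
print(f"{ratio:.0f}x compression, {reduction:.0%} reduction")
# prints "5x compression, 80% reduction"

# "Two-thirds of the theoretical maximum": a 6x max would yield ~4x achieved.
theoretical_max = 6.0
achieved = theoretical_max * 2 / 3
print(f"approx {achieved:.0f}x of a {theoretical_max:.0f}x maximum")
# prints "approx 4x of a 6x maximum"
```

Note that an 80% reduction is a 5x ratio; the “8x” figure in the earlier comment compares the competitor’s expanded database to Vertica’s, not raw data to compressed.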

In other words, the storage layout of the data makes far less difference than people appreciate, and columnar storage doesn’t provide any magic loophole to defeat entropy.

When you use “entropy” in the context of “compression”, do you basically mean “Kolmogorov complexity”? Anyhow, how is it calculated in PRACTICE? I.e., how do you know what the theoretical maximum is for a given dataset?

A few points: we’d like to see the master node concept go away altogether. The problem is that, as the system grows, the limit becomes the number of threads/connections that can reasonably be maintained on the head node (inbound and internal), and the cost of the failover that results. Most customers with this problem will be using a PG session pooler, but that comes with its own problems. This problem is not unique to GP, and it is a very tough architectural and implementation problem; of the majors, I think only Teradata has a solution.

On the compression front, two factors ultimately influence how well compression works. The more structured the data is, the more effective an auto-codification scheme like Vertica’s. The more random and unknown the data, the more likely it is that standard block/dictionary schemes will do better.

Ben Werther’s comment shows a widespread misunderstanding about data and entropy. Bodies of data do not have entropy. Only models of data have entropy. Bear with me. Suppose you compress English text by Huffman-coding individual characters. You would get, say, 4 bits per character. Then compress the same data using Huffman-coded digrams; you’ll get something lower, like 3 bits. So which is the entropy of the data? Neither! You have two figures but the data didn’t change. What changed was the model of the data. That is a very important and not-always-recognized distinction because information theory does not deal with modeling — only with encoding modeled data.
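The model-dependence point can be demonstrated directly: the same text yields different “entropy” figures under an order-0 (single-character) model and an order-1 (digram) model. A minimal illustrative sketch, not anything from the post itself:

```python
import math
from collections import Counter


def unigram_entropy(text):
    """Bits per character under an order-0 model (characters independent)."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def bigram_entropy(text):
    """Bits per character under an order-1 model: H(X_t | X_{t-1})."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    n = len(text) - 1
    h = 0.0
    for (a, b), count in pairs.items():
        p_pair = count / n           # joint probability of the digram
        p_cond = count / firsts[a]   # probability of b given preceding a
        h -= p_pair * math.log2(p_cond)
    return h


sample = "the quick brown fox jumps over the lazy dog " * 50
print(unigram_entropy(sample))  # order-0 estimate, around 4 bits/char
print(bigram_entropy(sample))   # lower: the richer model captures more structure
```

Neither number is “the” entropy of the text; each is the entropy of a particular model of it, which is exactly the distinction the comment is making.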

Now Kolmogorov complexity. That concept has little practical value in database compression because you can’t use it to quantify anything. It’s really philosophical more than anything else.
