
Finding table skew in Greenplum is very important. If you end up with a bad distribution of records across segments, one node ends up doing much more work than the others. Unfortunately, Greenplum firmly supports No Child Left Behind: a query will only be as fast as its slowest segment. Thus it is extremely important to have an even distribution across all the segments. A good way to check this is to use the hidden column gp_segment_id. A simple count query grouped by this column will let you know how well your data is spread across nodes.
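As a minimal sketch, the count query looks something like the following. The table name impressions is a made-up placeholder for illustration; substitute whatever table you want to check.

```sql
-- Count rows per segment using the hidden gp_segment_id column.
-- "impressions" is a hypothetical table name used for illustration.
SELECT gp_segment_id, count(*) AS row_count
FROM impressions
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```

Roughly equal row counts on every segment mean the distribution key is doing its job. One segment with a much larger count is the skew you are looking for.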

So I’m pulling a sample of 500 tuples from an example advertising impression data set.

First I think I’ll distribute it on the business unit. The results are:

A little bit better, but that doesn’t work so well either. The data is spread across the segments but segment 2 is holding much more data than everybody else. This will make for some hot spotting when I query the data. Next up maybe I can try by ip.
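One way to try a different distribution key without touching the original table is to copy the sample into a new table and rerun the count. This is only a sketch: the table names impressions and impressions_by_ip and the column name ip are assumptions made for illustration.

```sql
-- Copy the sample into a new table distributed on the candidate key.
-- impressions, impressions_by_ip and ip are hypothetical names.
CREATE TABLE impressions_by_ip AS
SELECT *
FROM impressions
DISTRIBUTED BY (ip);

-- Recheck how the rows landed on each segment.
SELECT gp_segment_id, count(*) AS row_count
FROM impressions_by_ip
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```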

Ah, that looks much better. Of course, this could still end up skewed if our traffic came heavily from a certain country, or if data arrived much more often from a specific network segment. So I would need to keep watching it to see if skew develops over time. It looks like this will work for now.
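For watching skew over time, a single number is easier to track than a full list of counts. The sketch below compares the busiest segment to the average. The table name impressions_by_ip is again a hypothetical placeholder, and the ratio is just one possible measure of skew, not an official Greenplum metric.

```sql
-- Ratio of the largest per-segment row count to the average.
-- A value near 1.0 suggests an even spread; larger values suggest skew.
-- Caveat: segments holding zero rows do not appear in the GROUP BY,
-- so this can understate skew when some segments are empty.
SELECT max(row_count) AS max_rows,
       avg(row_count) AS avg_rows,
       max(row_count)::numeric / avg(row_count) AS skew_ratio
FROM (
    SELECT gp_segment_id, count(*) AS row_count
    FROM impressions_by_ip
    GROUP BY gp_segment_id
) AS per_segment;
```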

ERROR: could not find segment file to use for inserting into relation table (64749). (appendonlywriter.c:569) SQL state: XX000

Which essentially means game over. Dump your table and recreate it, because you won’t be able to put any more data into it. Luckily you can still pull the existing data out. This seems to happen only to append-only compressed tables, in both 3.x and 4.x. It’s supposed to be fixed in an upcoming patch release. It’s still enough to make George Bush sad.