Stability and finding insight

Since the rollout of Greenplum 4.2, combined with having Greenplum on-site for a little training, our clusters' stability has improved over the last year. The main issues we run into at this point are a lack of space, as people want to throw more and more on the clusters, and the occasional bad query that gets into the system and causes problems. To help with the space issues, some simple jobs have been moved to a Hadoop cluster, which lets us use Greenplum for the more complex data analytics functions. By removing some of the rollup functionality we have cut a decent number of table scans out of the system, which resulted in a definite increase in responsiveness.

The other problem we run into is queries getting into the system that cause resource contention. We have a liberal access policy, and as it's a startup environment, change is constant. The tools for tracking down a rogue query haven't been exactly outstanding so far. With the acquisition of MoreVRP I'm hoping a blessed, bottled, and better solution for this will come out. Currently I gather process stats, drop them to a local filesystem, and drag that onto a local share to run pivot tables on it and look for issues. There are some in-database Greenplum queries that do this, but of course when the DB is having issues, using it to troubleshoot why it is having issues doesn't work very well.

For the past couple of weeks we've been doing some work with OpenTSDB, and it looks very promising. The tcollector system is easy to work with, and the amount of data we have been able to throw at a single TSDB instance is impressive. The display ability of OpenTSDB is its weak point, and we haven't found a good display layer. Using the current tools, though, I am able to produce the following graph, which shows memory utilization by query on each segment. Stats are collected every minute, so a lot of queries don't show, but our sub-minute queries aren't really our problem children. As I perfect the filters, I think this will be extremely helpful in tracking down, in real time, the skewing queries and memory hogs that get dropped on the system.
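To give a feel for the collection side, here is a rough sketch of what a tcollector-style collector for this could look like. It is not the script we actually run. It assumes that Greenplum backend process titles start with "postgres:" and carry "con<N>" (session) and "seg<N>" (segment) tokens, and the metric name greenplum.proc.rss_kb and the tag names are made up for illustration. The only part it relies on from tcollector is the output convention: one "metric timestamp value tag=value" line per data point on stdout.

```python
import re
import subprocess
import time

METRIC = "greenplum.proc.rss_kb"  # hypothetical metric name

# Assumption: backend process titles contain "con<N>" and "seg<N>" tokens.
SESSION_RE = re.compile(r"\bcon(\d+)\b")
SEGMENT_RE = re.compile(r"\bseg(-?\d+)\b")


def format_datapoints(ps_output, timestamp):
    """Sum resident memory per (segment, session) from `ps -eo rss,args`
    output and return OpenTSDB-style data point lines."""
    totals = {}
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # header row or malformed line
        rss_kb, args = int(parts[0]), parts[1]
        if not args.startswith("postgres:"):
            continue  # not a database backend
        session = SESSION_RE.search(args)
        segment = SEGMENT_RE.search(args)
        if not (session and segment):
            continue  # backend without query context
        key = (segment.group(1), session.group(1))
        totals[key] = totals.get(key, 0) + rss_kb
    return [
        "%s %d %d segment=%s session=%s" % (METRIC, timestamp, total, seg, con)
        for (seg, con), total in sorted(totals.items())
    ]


def collect_once():
    """Take one sample of the local process table and print the data points."""
    out = subprocess.check_output(["ps", "-eo", "rss,args"])
    for line in format_datapoints(out.decode("utf-8", "replace"), int(time.time())):
        print(line)
```

A collector script would call collect_once() once a minute on each segment host. Summing by session rather than by PID keeps the number of distinct tag values down, which matters because every tag combination becomes its own time series in OpenTSDB.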