A few months ago, CEO Mayank Bawa of Aster Data commented to me on his surprise at how “profound” the relationship was between design choices in one aspect of a data warehouse DBMS and choices in other parts. The word choice in that was all Mayank, but the underlying thought is one I’ve long shared, and that I’m certain architects of many analytic DBMS share as well.

For that matter, the observation is no doubt true in many other product categories as well. But in the analytic database management arena, where there are literally 10-20+ competitors with different, non-stupid approaches, it seems particularly valid. Here are some examples of what I mean.

Hash distribution. In shared-nothing or shared-not-very-much database architectures, multiple processors pull data off disk in parallel. Ideally, each long-running query retrieves an almost identical amount of data at each node. That way, each node is done at the same time, with no wasteful waiting.

Consequently, data should be distributed more or less randomly across the nodes. That can be done through “round-robin” allocation — each node takes a turn in strict order receiving new records or blocks. Or it can be done by hashing on a particular key — in essence, by assigning data to different disks depending on the value in some particular field or combination of fields.
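To make the two placement schemes concrete, here's a toy sketch in Python (illustrative only; the node count, table, and hash function are made up, not any vendor's mechanism):

```python
import hashlib

NODES = 4

def round_robin_node(row_index: int) -> int:
    # Each node takes a turn in strict order.
    return row_index % NODES

def hash_node(key) -> int:
    # Node chosen by hashing a field value, so equal keys always co-locate.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NODES

rows = [{"customer_id": i % 7, "amount": i} for i in range(1000)]

rr_placement = [round_robin_node(i) for i in range(len(rows))]
print([rr_placement.count(n) for n in range(NODES)])  # → [250, 250, 250, 250]

# Round-robin scatters a given customer's rows; hashing puts them on one node.
rr_nodes = {round_robin_node(i) for i, r in enumerate(rows) if r["customer_id"] == 5}
hashed_nodes = {hash_node(r["customer_id"]) for r in rows if r["customer_id"] == 5}
assert len(rr_nodes) == NODES and len(hashed_nodes) == 1
```

Round-robin balances load exactly, while hashing guarantees that all rows sharing a key land together, which is what the join optimization below depends on.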

Hash distribution is a wonderful optimization. For most large tables, there's an obvious join key that will be relevant to a significant fraction of all long-running queries. Pre-hashing on that key saves a huge step in the execution of hash joins involving that key, and hence can provide a significant reduction in the total query processing workload. Nor is this benefit confined to single-fact-table or single-primary-key schemas. When different kinds of data are stored in the same warehouse, each large fact table can be hash-distributed on its own key.
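A minimal sketch of why that pre-hashing helps (hypothetical tables and a stand-in hash; no real DBMS works this literally): if both tables are distributed on the join key, every node can complete its share of the join without shipping a single row to another node.

```python
NODES = 4

def node_of(key: int) -> int:
    # Stand-in for a real distribution hash function.
    return key % NODES

orders = [(oid, oid % 50) for oid in range(200)]        # (order_id, customer_id)
customers = [(cid, f"name{cid}") for cid in range(50)]  # (customer_id, name)

# Both tables pre-distributed on customer_id: matching rows share a node.
order_parts = {n: [o for o in orders if node_of(o[1]) == n] for n in range(NODES)}
cust_parts = {n: [c for c in customers if node_of(c[0]) == n] for n in range(NODES)}

# Each node joins purely locally; no rows cross the (hypothetical) network.
local_join = []
for n in range(NODES):
    lookup = dict(cust_parts[n])
    local_join += [(oid, cid, lookup[cid]) for oid, cid in order_parts[n]]

assert len(local_join) == len(orders)   # every order found its customer locally
```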

For almost all databases on almost all shared-nothing vendors' systems, hash distribution is the way to go. Even so, a couple of products don't even bother supporting it. Oracle Exadata isn't going to perform joins of that kind anyway until data is moved from the storage to the database tier, so hash distribution has no benefit in Exadata's multi-tier architecture. Kognitio, while not having such a clean proof of why hash distribution is utterly beside the point, thinks that in its silicon-centric approach the costs of violating strict randomness outweigh the benefits.

Indexing alternatives. More generally, analytic DBMS differ from OLTP DBMS in that they're optimized to run more table scans and fewer updates and pinpoint queries. I've written about that many times, even coining the phrase index-light to encapsulate the story. The general idea is that if you're retrieving a lot of rows per query, it becomes inefficient to keep spinning the disk to ensure you get only the rows you want. You get a lot more bytes/second doing sequential reads than random ones, so if a sufficiently large fraction of the rows are ones you actually want, it's better to just scan them all.
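A quick back-of-envelope calculation makes the index-light argument concrete. The disk numbers below are assumptions for illustration, not measurements:

```python
# Illustrative, assumed numbers: when does a full scan beat
# index-driven random reads?
SEQ_MB_PER_SEC = 100.0   # assumed sequential scan throughput
SEEK_MS = 8.0            # assumed latency per random read
ROW_BYTES = 200
TABLE_ROWS = 50_000_000

scan_seconds = TABLE_ROWS * ROW_BYTES / (SEQ_MB_PER_SEC * 1e6)

def random_read_seconds(selectivity: float) -> float:
    # Pessimistically, one seek per matching row.
    return TABLE_ROWS * selectivity * SEEK_MS / 1000.0

# With these numbers, scanning the whole table takes 100 seconds, and random
# reads lose as soon as more than 0.025% of the rows qualify.
assert scan_seconds == 100.0
assert random_read_seconds(0.00025) == scan_seconds
```

Real systems prefetch and batch index reads, so the true break-even point varies, but the basic shape (scans win at surprisingly low selectivities) holds.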

If you're going to follow an extreme form of that approach (e.g. Netezza, DATAllegro), you might as well have huge block sizes for your data (1 megabyte+). If you think indexes of various kinds will actually be useful a reasonable fraction of the time, you might go with smaller sizes, such as the 128K that Teradata and HP (Neoview) favor.

Meanwhile, columnar vendor Vertica recreates some of the benefits of indexes by storing the same column in multiple sort orders. And that leads me to the next point.

High availability/failover alternatives. Most analytic DBMS mirror the data on the fly. But strategies differ. Some just rely on a storage vendor's technology; others build in their own forms of redundancy.

Particularly interesting is Vertica’s approach. Not only does Vertica allow multiple copies of the data to each be used for querying; it encourages the storage of the same columns in different sort orders, with the optimizer obviously choosing to query the copy that’s sorted in the way most useful for a specific query’s execution plan.
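A toy sketch of the idea (this is not Vertica's actual storage format; the tables and query shapes are invented): keep the same rows in two sort orders, and answer each range query from whichever copy is sorted usefully.

```python
import bisect

# The same 1,000 rows, stored redundantly in two sort orders ("projections").
rows = [(d, (d * 7) % 500) for d in range(1000)]   # (date_key, store_key)
by_date = sorted(rows, key=lambda r: r[0])
by_store = sorted(rows, key=lambda r: r[1])

def range_query(projection, col, lo, hi):
    # Binary search only works because this copy is sorted on `col`.
    keys = [r[col] for r in projection]
    return projection[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]

# The planner would route each query to the usefully sorted copy.
jan = range_query(by_date, 0, 0, 30)         # date-range query: date-sorted copy
store_42 = range_query(by_store, 1, 42, 42)  # store lookup: store-sorted copy

assert all(0 <= d <= 30 for d, _ in jan)
assert all(s == 42 for _, s in store_42)
```

Each copy can also double as a failover replica, which is the point of Vertica's approach as described above.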

Redundancy and failover strategies are tightly tied to other administration issues too. For example, Aster Data and other vendors brag, with varying degrees of emphasis, that a new node can be added to a system, and the whole thing reconfigures itself automagically with zero downtime. Similarly, different systems respond differently to node failure, in terms of metrics such as time to reestablish normal operation, performance hit (if any) after normal operation resumes, performance hit before normal operation resumes, and time window (if any) during which redundancy is lost — so that a second failure would crash the whole system.

Bottom line: There never will be an analytic DBMS that simultaneously possesses all highly desirable architectural attributes for the product category.

Comments

Oracle has long had hash partitioning as an option – it is used in many DWHs that I am aware of.
The choice of partitioning scheme is (or should be) dependent on what the design is trying to achieve – is the partitioning there to facilitate database maintenance (perhaps data lifecycle management), to enhance query and batch performance, or both? Hash can be very good if the prime query dimension can use the same hash – not so good if queries are run at higher levels of aggregation and all of the partitions get hit (ignoring any parallel access benefits), and a problem if you need to “age out” old data.

I don’t understand why it is worthy of comment that “hash partitioning has no benefit [on Oracle Exadata]” as a follow-on from “For almost all databases on almost all shared-nothing vendors’ systems, hash partitioning is the way to go”. Oracle Exadata is shared storage, so a method that is a virtual no-brainer for shared-nothing is not so much of a no-brainer for Exadata.

Besides that issue, hash partitioning is typically used in Oracle data warehousing even without the Exadata enhancement. It allows more efficient distribution of fact table rows to the parallel query slaves that perform a hash join on the partitioning key, because each PQ slave that reads fact table rows from disk only has to transmit those rows to a single consuming PQ slave, thus greatly reducing intra-slave messaging (and CPU load). Range-hash composite partitioning is the usual technique, with the range partitioning providing partition pruning (typically on a date column, which can also benefit ILM considerations) and the hash subpartitioning providing more efficient intra-slave messaging.
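The intra-slave messaging argument can be sketched abstractly (a toy model, not Oracle's actual PQ implementation):

```python
CONSUMERS = 8
rows = list(range(10_000))   # join keys

def dest(key: int) -> int:
    # Stand-in for the join-key hash that picks a consumer slave.
    return key % CONSUMERS

# Unpartitioned fact table: each producer slave reads an arbitrary slice,
# so its output fans out across every consumer slave.
chunk = len(rows) // CONSUMERS
producer_slices = [rows[i * chunk:(i + 1) * chunk] for i in range(CONSUMERS)]
fanout = [len({dest(k) for k in sl}) for sl in producer_slices]

# Hash-subpartitioned on the join key: producer i reads only rows whose
# hash is i, so everything it emits targets a single consumer.
aligned_slices = [[k for k in rows if dest(k) == i] for i in range(CONSUMERS)]
aligned_fanout = [len({dest(k) for k in sl}) for sl in aligned_slices]

assert all(f == CONSUMERS for f in fanout)   # each producer talks to all 8
assert all(f == 1 for f in aligned_fanout)   # each producer talks to just 1
```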

You state that hash partitioning in an Exadata environment is not useful. You appear to know something we don’t. So, please enlighten us on what you mean. If hash partitioning is not useful in an Exadata environment I presume you have evidence that one or more of the following are true:

1) Query performance in an Exadata environment is not improved with the use of hash partitioning

2) Query performance in an Exadata environment is degraded with the use of hash partitioning

3) Queries malfunction in an Exadata environment when using hash partitioning

I know the correct answer to this quiz, but I’m eager to learn your deep, dark secret on the matter.

Disclaimer: The views expressed in this comment are my own and do not necessarily reflect the views of Oracle. The views and opinions expressed by others on this comment thread are theirs, not mine.

Brian on
February 2nd, 2009 12:37 pm

@Curt

Can you elaborate on why you perceive hash partitioning to be useful for Oracle w/o Exadata storage, but not useful with Exadata storage? Isn’t the database grid doing exactly the same work in each case?

I am quite certain your #2 statement is incorrect, but I think it is worthwhile to understand how you arrived there.

1. Talking about Exadata in isolation from the Oracle DBMS layer does not seem to me to be very enlightening. As far as the complete “Oracle Database Machine” system is concerned it is shared storage, with each Oracle database instance having access to every disk.
Decomposing the system to a small enough unit that you can demonstrate shared-nothingness does not seem to be a useful classification.

2. I just don’t follow you there. In both cases the benefit comes from reduced intra-slave messaging, independently of the storage solution. Surely it makes just as much sense on Exadata as it does on a SAN, NetApp, or dedicated storage.

On reviewing my notes, it seems I got it a bit wrong. I’m sorry. They indeed distinctly say there are hash partitions in Exadata, but they are striped via ASM.

OK — please enlighten me: How DO hash partitions give significantly better query performance than round-robin partitioning in a system where localization of query processing at particular nodes is not one of the benefits?

The point of the Exadata architecture is to have parallel access to disk, thus removing a classical I/O bottleneck in scan-heavy queries, without simply moving that bottleneck upstream. So I think the distinctions I’m drawing and terminology I’m using are useful for a significant fraction of the people who consider that architecture. If you’re not in that fraction, so be it.

One way for you to avoid these sorts of foibles in the future is to bear in mind that every feature Oracle supports on non-Exadata storage (e.g., SAN, NAS, DAS) is also supported with Exadata. Conversely, conventional storage does not support any of the Exadata-specific features.

Hash partitioning is a good way to evenly place data that would otherwise suffer skew. Hash partitioning in Oracle (with or without Exadata) offers the same benefit enjoyed by shared-nothing architectures. Unlike a shared-nothing architecture, however, with Oracle all CPUs in the database grid have 100% parallel access to all the data regardless of which partitioning strategy is being used. The only way to get equal access to all data from all CPUs in a shared-nothing architecture is to replicate all data on all nodes. That reminds me: your readers are growing tired of you stating matter-of-factly that Oracle doesn’t offer parallel access from CPU to disk. In fact, it is the opposite. Shared disk means shared disk. Shared nothing means shared nothing. How do you suppose a shared-nothing database is going to offer parallel access for all CPUs to all disk? Sure, shared-nothing offers parallel access between any CPU and a *portion* of the data, but not all data.

My error, apparently, was in stating that in Exadata there’s no benefit to hash partitioning that wouldn’t also be provided by round-robin. I didn’t see anything in your explanation of the benefits of hash partitioning that wouldn’t also be conferred by round-robin. In fact, you referred to reducing skew, and hash partitioning — as you surely knew even before I alluded to it in the post above — is INFERIOR to round-robin at eliminating skew, although there’s debate as to how significant this difference is.

Would you perhaps be so kind as to take another try at explaining the difference?

As for your snide and incorrect comment about the source of my error — well, I won’t lecture you about how to avoid such mistakes in the future, because doing so might be obnoxious and condescending.

Of course round-robin is superior to hash partitioning at reducing skew. That’s probably because round-robin *eliminates* skew. I’ll tell you how significant the difference is. One is absolute and the other isn’t…at least not without a perfect hash which would be impossible without foreknowledge of all values being inserted. So it is fair to say that round-robin is somewhere between 0% and 100% more effective at reducing skew than hash partitioning.
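The difference is easy to quantify in a toy simulation (illustrative numbers, not any product's behavior):

```python
import hashlib
import random

NODES, ROWS = 8, 100_000
random.seed(1)
keys = [random.getrandbits(64) for _ in range(ROWS)]

# Round-robin: balance is exact by construction.
rr_counts = [ROWS // NODES] * NODES   # 100,000 rows over 8 nodes

# Hashing: balance is only statistical, even with well-behaved unique keys.
h_counts = [0] * NODES
for k in keys:
    h_counts[int(hashlib.md5(str(k).encode()).hexdigest(), 16) % NODES] += 1

print(max(rr_counts) - min(rr_counts))   # → 0
print(max(h_counts) - min(h_counts))     # small but (almost surely) nonzero
assert max(h_counts) - min(h_counts) > 0
```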

The major difference is that a round-robin approach cannot support any degree of partition pruning whereas hash partitioning can. There are other benefits that I won’t list because Oracle documentation is available to anyone with web access.

As for the other — yeah, I was using the term “hash partitioning” too narrowly. Once again, I’m sorry for the excitement.

Daniel Abadi on
February 2nd, 2009 8:02 pm

One thing to be aware of is that “partition pruning” sometimes means “no parallelism”.

SELECT sum(sales)
FROM table
WHERE store_id = 5

If you hash on store_id, only one partition is involved in answering the query, which means the query runs at the speed of the one disk which contains that partition. If you use round-robin partitioning and have an index on store_id on each partition, the query runs at the speed of all the disks reading (in parallel) the (much smaller) number of 'store_id = 5' tuples from the index.

Of course, if you don’t have an index on store_id on each node, you’d probably prefer to just get one disk involved in the query, even if all disks can be scanned in parallel.
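A stylized cost model of Daniel's example (assumed disk and selectivity numbers, purely illustrative):

```python
DISKS = 10
ROWS = 1_000_000
MATCH = 0.01   # assume 1% of rows have store_id = 5

# Hash-distributed on store_id: every matching row lives on one disk,
# which must read them all by itself.
pruned_time = ROWS * MATCH   # rows read on the one busy disk

# Round-robin with a local index on each partition: matches are spread
# evenly, and all disks read their share in parallel.
indexed_time = (ROWS * MATCH) / DISKS

assert indexed_time == pruned_time / DISKS   # 10x from parallel index reads
```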

Oracle said that even hash partitions are striped across disks, courtesy of ASM. That was in my notes right next to the observation that hash partitions don’t have their usual somewhat-pseudo-random distribution benefit in the case of Oracle, because the data’s already pseudo-randomly distributed via other mechanisms. (Those are, of course, the notes I should have checked before incorrectly saying Oracle doesn’t do hash partitioning in Exadata at all.)

Yes, I think that understanding the role of ASM is critical in some ways to understanding the reason why we use hash partitioning in Oracle, or rather it helps to explain what we do not use it for. Because, as you say, ASM spreads the data over all the available devices, hash partitioning is not used to associate data with particular storage devices in the way that it is for other platforms.

Rather, it is a logical method of subdividing the data to allow more efficient processing. As I mentioned above it is used to reduce intra-slave messaging on hash joins, but Daniel’s comment on indexing reminds me that it also enables parallel index range scans. To take a simple example, one might partition sales transactions according to the day of the sale, and then hash each day of data into 64 subpartitions. A local index on “transaction dollar amount” could then be scanned in parallel to isolate all sales with a transaction dollar amount in a particular range of values, for example “more than $1,000”, _if_ the optimizer estimated that to be more efficient than scanning the table partition of the entire day of sales.
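Here's a toy model of that range-hash scheme (illustrative names and numbers, not Oracle syntax):

```python
DAYS, SUBPARTS = 30, 64

sales = [{"day": d, "txn": t, "amount": (t * 37) % 5000}
         for d in range(DAYS) for t in range(1000)]

def placement(row):
    # (range partition by day, hash subpartition by transaction id)
    return (row["day"], row["txn"] % SUBPARTS)

# Query: sales over $1,000 on day 7.  Range pruning skips 29 of 30 day
# partitions; the 64 subpartitions of day 7 can be index-scanned in parallel.
pruned = [r for r in sales if r["day"] == 7]
subparts_scanned = {placement(r)[1] for r in pruned}
hits = [r for r in pruned if r["amount"] > 1000]

assert len(pruned) == len(sales) // DAYS   # only 1 of 30 partitions read
assert len(subparts_scanned) == SUBPARTS   # 64 parallel scan units remain
assert all(r["amount"] > 1000 for r in hits)
```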

Just an observation from the sideline…I think the confusion stems from your use of the phrase hash partitioning. I think you should have used the phrase hash distribution as you are discussing the physical locality of data vs. the logical grouping of data.

For example, in Netezza you distribute data to the SPUs by using the DDL phrase distribute on random or you can use a phrase of distribute on [hash] (column(s)). Likewise DB2 has DDL clauses to do both data distribution (DISTRIBUTE BY) and logical grouping (PARTITION BY & ORGANIZE BY).

ghassan salem on
February 3rd, 2009 4:29 pm

Curt,
Oracle has something called a partition-wise join, which works when you join 2 equi-hash-partitioned tables (i.e. same partitioning key, as well as same number of partitions) on the partitioning key. And in a parallel query, this is done in parallel. So you can get the benefits of a hash join much as you would in a shared-nothing system.
Also, bear in mind that in Oracle, you can range partition a table on some column(s), and hash-partition it on another column. So you get the benefits of range partitioning (e.g. ILM, fast purge of old data, partition pruning, …) as well as partition-wise joins.

Have a look at the doc, it will make your posts when writing about Oracle less error-prone.

Actually, what would help me avert errors would be to post in less haste. As to WHY I rushed this post and several others up between the time I submitted the first draft of the article and the time it was finally posted … well, in the interest of peace and comity, I won’t spell that out.

So now that we’ve finished trudging through the terminology, it should make a lot more sense now why I said that the effectiveness of Oracle hash partitioning to handle data skew is between 0% and 100%.

The moral of the story is that hash partitioning is only as effective at handling skew as the data being loaded allows. For example, it isn’t much good at evenly loading, say, 42 partitions if the partition key is something like a gender column. I know you are perfectly aware of how a hash function works, but I wanted to put that example out for the casual reader. I hope you’ll indulge me on that…
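A quick illustration of the low-cardinality pitfall (toy data, made-up partition count):

```python
import hashlib

PARTS = 42

def node(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % PARTS

rows = ["M"] * 60_000 + ["F"] * 40_000   # a two-valued "gender" partition key
counts = [0] * PARTS
for g in rows:
    counts[node(g)] += 1

# At most two of the 42 partitions ever receive a row; the rest sit empty.
nonempty = [c for c in counts if c]
assert len(nonempty) <= 2
assert sum(counts) == len(rows)
```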

P.S. On a tangent regarding terminology… I recall that in Informix DSA version 6, table partitions were called fragments. The default placement was round-robin. I found it strange then, as I do now, that it was considered a good thing to have a database residing in “round-robin fragmented storage”, considering the generally negative connotation of the word fragment in database-land.