As discussed in my previous post, it’s a silly myth to suggest a bitmap index should only be used for so-called “low cardinality” columns, else the resultant index would be “huge”. I thought it might be worth a quick refresher with a simple little demonstration of why such claims are nonsense. There is in fact no limit to the number of distinct values for which a column could be considered a candidate for a bitmap index.

I’ve already shown how a bitmap index on a column with 10,000 distinct values could potentially be smaller than one on a column with just 4 distinct values, even with the same data type and size. The number of distinct values in a column is but one small consideration; the number of rows in the table and the average ratio of repeated values are just as important. The other important consideration that can significantly impact the size of a bitmap index is the distribution and clustering of the data within the table, as we shall see.

Using the same demo as the previous post, a reminder of the size of our bitmap index on a column with 10,000 distinct values in a 1 million row table.

Absolutely it does. As there are on average just 100 rows per distinct value, that’s a small enough selectivity for the CBO to use the bitmap index on its own. Note the query has used just 6 consistent gets to return the 100 rows of data.

Let’s now drop the bitmap index and see how a regular B-Tree index would compare and perform:

The first thing we notice is that the B-Tree index is significantly larger than the equivalent bitmap index: 2,090 leaf blocks, whereas the bitmap index was only 56 leaf blocks. So it’s not the bitmap index that’s so-called “huge” here on a column with 10,000 distinct values, but the corresponding B-Tree index. Notice also that the Clustering Factor of the index is quite good at 10,895 in a 1 million row table.

The query also uses the index, but as the B-Tree index is somewhat larger, the overall number of consistent reads has increased from 6 up to 8. So not only is the bitmap index substantially smaller despite having 10,000 distinct values, but it’s also more efficient to use as a result.

A key reason why the bitmap index is so small and compact is that the effective Clustering Factor of the indexed column is excellent. The data was inserted into the table in CODE column order, so all the rows with the same CODE value are grouped, ordered and clustered together within the table. Within the bitmap index, this means there are few, large groupings of zeros (0) that can be compressed extremely efficiently.

For example, for the first CODE value of 1, the bitmap sequence would look something like:

111111 …. 1110000000000000000000000000000000……..000000

with the 100 values of 1 (true) grouped together, followed only by a series of effectively 999,900 zeros (0 representing false). The 999,900 zeros can be compressed back to almost nothing at all. Note there could actually be somewhat more false (0) values than actual rows in the table, but that’s a discussion for another day.

The next value of 2 would have a bitmap sequence something like:

00000000….0001111111…11111100000000000000000000…0000

with 100 zeros followed by 100 1s followed by effectively 999,800 zeros, with again the two groupings of zeros compressed down to almost nothing at all.

And so on …

Let’s now create a different table that contains the identical data, but this time with the CODE values effectively randomized throughout the table. The Clustering Factor of the CODE column in this case will be terrible:

The first thing we notice is that the bitmap index is now substantially larger than it was previously. It’s gone from just 56 leaf blocks all the way up to 520 blocks, almost 10 times larger than before. The reason is all to do with the clustering of the data, as values for CODE are now littered all over the table and are no longer grouped together.

The bitmap sequence for the first CODE value of 1 would now look something like:

000100000000010000000001000000……00010000

with the 1s now littered all over the place. This means it’s far less efficient to compress the groups of zeros, as there are now substantially more such groupings than before. The index is now 10 times larger as a result.
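This compression effect can be sketched with a toy run-length encoder. This is a simplification for illustration only (Oracle's actual byte-aligned bitmap compression works differently), using the same figures as the demo: 1,000,000 rows and 100 rows per CODE value.

```python
# Toy run-length sketch of why clustering matters (a simplification:
# Oracle's actual byte-aligned bitmap compression differs). Figures
# mirror the demo: 1,000,000 rows, 100 rows per CODE value.
from itertools import groupby
import random

ROWS = 1_000_000
PER_VALUE = 100  # rows per distinct CODE value

def runs(bitmap):
    # Number of (value, length) runs a run-length encoder must store.
    return sum(1 for _ in groupby(bitmap))

# Clustered: rows inserted in CODE order, so the 1-bits sit together.
clustered = [1] * PER_VALUE + [0] * (ROWS - PER_VALUE)

# Scattered: the same 100 1-bits littered randomly through the table.
random.seed(42)
scattered = [0] * ROWS
for pos in random.sample(range(ROWS), PER_VALUE):
    scattered[pos] = 1

print(runs(clustered))   # 2: one run of 1s, one huge run of 0s
print(runs(scattered))   # roughly 200: each stray 1 breaks up the 0s
```

With the clustered ordering there are only two runs to store for the value, while the scattered ordering fragments the zeros into roughly two hundred runs, which is exactly why the index blows out by an order of magnitude.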

Although the CBO is still using the bitmap index on its own, the performance of the query has deteriorated substantially with the number of consistent gets increasing from 6 all the way up to 103.

So the bitmap index is now nowhere near as efficient. A bitmap index isn’t that great with large numbers of distinct values if the associated clustering is poor, and so the B-Tree index is the way to go after all, right?

Well, just wait a minute. If the clustering is poor for a bitmap index, the clustering will likewise be poor for a corresponding B-Tree index. Most of these consistent reads are due to reading the data out of the table, not from reading the index directly. So the performance of using an equivalent B-Tree index is likely to be impacted as well.

The first thing to note here is that the B-Tree index is still 2,090 leaf blocks in size. Even the now far less efficient bitmap index, at 520 leaf blocks, is still only approximately 25% the size of the B-Tree index. So the 10,000 distinct value bitmap index, even when it’s as inefficient as possible, still can’t be described as “huge” here, as it’s still only a 1/4 the size of the corresponding B-Tree index. With a Clustering Factor of 999,992 on a 1 million row table, it doesn’t really get much worse than that, and yet the bitmap index on a 10,000 distinct value column is still way ahead of the B-Tree index.

As we can see, the performance of the B-Tree index has likewise deteriorated with such a bad Clustering Factor. At 105 consistent gets, it’s still worse than the corresponding bitmap index, which only needed 103 consistent gets.

So even with a bitmap index that is as inefficient as it can be, on a column that has 10,000 distinct values in a table of only 1 million rows, it still outperforms the corresponding B-Tree index.

It’s a complete myth and utter nonsense that a bitmap index is only suitable for low cardinality columns and would become “HUGE” and be 100s of times slower if created on a so-called high cardinality column.

Let me repeat: There is no limit to the number of distinct values in a column for it to be considered for a Bitmap index.

Thank you for this blog and all the information on the behaviour of indexes you provide. I have one question concerning the last sentence: “There is no limit to the number of distinct values in a column for it to be considered for a Bitmap index.” So what about the most extreme case? I guess a bitmap index on a unique column would be bigger than a corresponding b*tree index (and contain a lot of 0-values and a single 1-value for every bitmap, so the order of the table would have no impact on the size of this index); indeed I did not only guess but also built a test case to verify the assumption. So my question is: could you imagine a reason for defining a bitmap index on a unique column?

To start, Oracle doesn’t support an actual Unique Bitmap index. That’s because a Unique index must be more efficient as a B-Tree than as a bitmap, because a B-Tree index (on one column) only has to store:

The indexed value + 1 byte
The rowid
2 Bytes of overhead

However, a corresponding bitmap needs to store:

The indexed value + 1 byte
2 x rowids (a start and an end rowid)
The bitmaps for the index entry

So a bitmap index has to be larger and less efficient than the B-Tree by definition.
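A back-of-the-envelope sketch of that per-entry arithmetic for a unique column. The byte figures here are illustrative assumptions (a 5-byte value plus a 1-byte length, 6-byte rowids, 2 bytes of entry overhead), not exact Oracle internals:

```python
# Back-of-the-envelope sketch of the entry arithmetic above for a
# UNIQUE column. Byte figures are illustrative assumptions: a 5-byte
# value plus 1 length byte, 6-byte rowids, 2 bytes of entry overhead.
VALUE = 5 + 1
ROWID = 6
OVERHEAD = 2

btree_entry = VALUE + ROWID + OVERHEAD

# The bitmap entry stores the value, a start and an end rowid, and the
# bitmap itself: at least 1 byte even when only a single bit is set.
bitmap_entry = VALUE + 2 * ROWID + 1

print(btree_entry, bitmap_entry)  # 14 19
```

Whatever the exact byte counts on a given platform, the bitmap entry carries two rowids plus a bitmap where the B-Tree entry carries one rowid, so for a unique column the bitmap entry always comes out larger.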

You need the column to have, on average, a number of repeated values for it to be considered for a bitmap index; if the column is unique or very close to being unique, then you’ve gone too far the other way.

That’s why a column that has say 10 million distinct values needs to be in a table with (say) 100 million rows for it to be considered. If the table only has 10.0001 million rows, then it would be too unique to be considered as a bitmap.
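That rule of thumb can be sketched as a simple ratio; the cut-offs implied here are illustrative, not Oracle-defined thresholds:

```python
# Rough sketch of the repeats-per-value rule of thumb above.
# The cut-offs implied are illustrative, not Oracle thresholds.
def avg_repeats(rows, distinct):
    return rows / distinct

print(avg_repeats(100_000_000, 10_000_000))  # 10.0: worth considering
print(avg_repeats(10_000_100, 10_000_000))   # ~1.0: effectively unique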

Allow me a word of caution: the reluctance to use bitmap indexes for high-cardinality columns is not so much to do with the size of the index.

It’s mostly to do with what happens to the index when it is subjected to changes – update/insert/delete.
Fact is: data changes. Few of us have the luxury of static data to deal with, even in DW environments.

That’s where the problems usually show up with bitmap. I agree entirely that the size argument for high-cardinality is not as relevant *if* care is taken with data distribution of said column.
But in the field, very few tables lend themselves to that sort of arrangement.

What I often observe is what I call pseudo-organized distribution: someone loads a table with a given initial distribution and then things go pear-shaped as updates to the table change that distribution.
The result is a horrible salad where part of the table’s data matches the physical arrangement that favours one type of index whereas the rest is a worst-case example.
Then it all becomes a maintenance nightmare with frequent “re-orgs” thrown in.

I think there might be a place here to investigate the effect of methods of storage that enforce a given physical organization: IOTs and clustering. Then the dba may be guaranteed that future data volatility will not stuff up a carefully arranged distribution to favour one or other type of indexing.

Thanks for this great series on bitmap indexing, very thought provoking.

It depends somewhat on the distribution of values as I’ve shown, but it also depends very much on the size of the table. In my example above, it doesn’t really matter what the distribution might be, even in the worst case scenario, the bitmap index is considerably better than a corresponding B-Tree.

If a table is large (say 100 million rows), then a bitmap index on a column with 1 million distinct values would likely outperform a B-tree, regardless of the distribution.

If a table is subjected to too much DML and the DML is concurrent, then the locking implications alone rule out the bitmap index, no matter how tiny it might be in comparison.

Certainly, if the index is borderline and dependent on data clustering to be effective, then the bitmap index is likely more trouble than it’s worth if you need to reorg all the time. The exception of course is DW tables which are purged and repopulated; there it might not be such an issue ensuring the data is populated in a specific order that benefits a particularly important index.

Noons
Structuring the data order is not the complete answer. In a DW with typically many bitmaps on (say) the dimensional keys of the fact table, it is unlikely that the optimal ordering for one column is the same ordering for another column. True, correlation between columns is often a help (customers shop at nearby stores, and certain products are sold in certain countries, for example), but in the main we are going to get the odd big bitmap along with the small.

There are things in the DW management process we can do to help with the problem of updates over time, especially if you have paid for the luxury of partitioning, where we can devise low user impact ways of keeping the bitmaps as optimal as possible.

Thanks, Pete. I’m finding partitioning more and more useful in our DW, since we finally managed to grab the licence from management!

With partitions and sub-partitions, we now got a few of our mid-size tables to not need a lot of the previous indexing at all!
Turns out with a little bit of query predicate juggling, we can make 90% of the monthly processing hit a single or couple of sub-partition(s), which we are happy to full scan in most cases as it is sufficiently small.

Before, we were indexing all over the place to get to the same working set size!
The improvement in disk space, I/O overhead and maintenance is absolutely overwhelming. And the management is over the moon with the execution time reductions!
We still need indexing, of course. But it is much reduced and a lot more efficient, now. I’m now investigating the use of bitmaps for further improvements, hence the interest Richard’s series has created!

Because the number of records to be retrieved is very low: just 100 out of 1 million. So the selectivity is a tiny 0.01%, and it’s far cheaper to use the index to retrieve 100 rows than to perform a FTS.
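That selectivity arithmetic, as a quick sketch (the 6 consistent gets are from the demo above; the 10,000 block table is the hypothetical figure used later in this discussion):

```python
# The selectivity arithmetic: 100 rows out of 1,000,000, versus a full
# table scan of a hypothetical 10,000 block table.
rows_returned = 100
table_rows = 1_000_000
selectivity = rows_returned / table_rows
print(f"{selectivity:.2%}")  # 0.01%

index_gets = 6       # consistent gets via the bitmap index in the demo
fts_blocks = 10_000  # blocks a full table scan would have to read
print(fts_blocks / index_gets)  # the index path is vastly cheaper
```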

The other advantage of using a bitmap index on high cardinality columns is that you increase the chance of the CBO actually using the index on its own.

The optimizer has an indication of the data scatter for a b-tree index in the clustering_factor; but the clustering_factor is meaningless in a bitmap index – it’s just a copy of the num_rows figure. So for data scatter when doing the bitmap arithmetic the optimizer assumes (approximately) that 20% of the data required will be well clustered and 80% will be widely scattered. (That’s mentioned somewhere in the Oracle Wait Interface book – and I think I demonstrated it in CBO-F).

So if you’ve got data which is very well clustered (better than 20/80) the optimizer may use a b-tree index when it ignores a bitmap index, and if you’ve got data that is badly clustered (worse than 80/20) then the optimizer may use a bitmap when it ignores a b-tree index.
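The 20/80 assumption described above can be illustrated with a toy blended estimate. To be clear, this is purely illustrative and is NOT Oracle's actual bitmap costing formula; the 100 rows per block figure is also an assumption:

```python
# Purely illustrative model of the 20/80 assumption described above;
# this is NOT Oracle's actual bitmap costing formula. Assume a
# hypothetical 100 rows per block for the well-clustered portion.
def blended_table_visits(rows, rows_per_block=100):
    well_clustered = (0.2 * rows) / rows_per_block  # packed blocks
    scattered = 0.8 * rows                          # one visit per row
    return well_clustered + scattered

# For the 100-row demo the 80 scattered visits dominate, however well
# the data is really clustered.
print(round(blended_table_visits(100)))  # 80
```

The point of the sketch is that the scattered 80% dominates the estimate regardless of the data's true clustering, which is why very well clustered data can favour the b-tree and very badly clustered data can favour the bitmap.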

It seems that most people have data with very poor clustering when comparing b-tree indexes to bitmap indexes.

Just to clarify for those peeking in, the num_rows referred to by Jonathan is the index num_rows, not the table num_rows. So indeed, the actual CF of a bitmap index is somewhat meaningless, and that’s why I only listed it in the above demo when referencing the b-tree index. Therefore Oracle has to make assumptions regarding the actual CF of a bitmap index that might be entirely wrong.

However, getting back to the question raised by Henish, if the selectivity of an index is really low, the CF has reduced impact anyway. For 100 rows, whether the CF is actually high or low, it doesn’t really matter either way, as it’s likely cheaper than performing a FTS on a (say) 10,000 block table.

BTW, I’m currently in the Qantas lounge in Sydney, about to embark on a 24 hour total journey to the US, and so am on my third glass of champagne, as I need a little artificial courage due to my dislike of flying. If nothing of what I’ve said makes any sense, I hope this goes close to explaining why 😉

I should have made it clear that my comments were addressed at the comment about turning b-tree indexes into bitmap indexes.

It’s a comment I’ve seen often enough as the wrong answer to a problem that I’d like to put it into the “index myth” list – except that it often works.

The point, which I appreciate is not one that you need to be told, is that:
“because it’s not a bitmap” is NEVER the correct answer to the question “why isn’t Oracle using my index”, but some people have been persuaded that it is.

[…] far away — with my own slides. Though, I did manage to focus on bitmap indexes part and the myth of bitmap indexes not working well for columns with high cardinality. Very interesting conclusions. I’m still […]

“Confusing Dutch for German has got to be one of the funniest things I done in a long time :)”

You must be really boring! 🙂

Google translation of the Dutch shows the final paragraph as:

For dessert, a tip for the hard-core Oracle performance tuners, a tip that many query bakers really ought to know. It is a myth that bitmap indexes are only suitable for columns with a limited set of values. Richard Foote shows in this blog that 10,000 different values in a table with 1,000,000 rows can be queried just as efficiently with a bitmap index as with a ‘normal’ b-tree index, even when the conditions for a bitmap index are rather unfavourable. Tie a knot in your ears!

That’s a whole new area of discussion (and one I probably should go into one day) !!

Basically, the manner in which the use of bitmap indexes is costed is somewhat different from b-tree indexes, primarily because the clustering factor is meaningless in the manner it’s captured for bitmap indexes. As such, costings of bitmap index accesses are not listed within the execution plan.

A topic worthy of a blog entry on its own (see Jonathan Lewis’s comments for a summary of the issue).

[…] hamper the effective compression capabilities of the index (I’ve discuss the impact of the Clustering Factor on the effectiveness of Bitmap Index compression previously). Therefore, it might well be beneficial to more accurately determine the number of […]