Bitmap Indexes With Many Distinct Column Values (Wots…uh the deal) February 1, 2008

In the seemingly never ending list of 8 things one may not have known about indexes, Number 3 stated:

“Bitmap Indexes can be extremely useful and beneficial even if the column contains thousands of distinct values”.

On the surface, this may seem like a somewhat strange thing to suggest to many folk. It seems to be stated in numerous places that Bitmap indexes are only really beneficial with columns that have low numbers of distinct values. For example, a column called GENDER (I was going to use another word but I have to deal with enough spam messages as it is :) ) has only a few distinct values, so it would be perfect for a Bitmap Index.

Columns that have say 5 or 10 or maybe 20 distinct values should all be OK. But what if a column has 100 distinct values, that might just be pushing it. A column with 1000 distinct values would obviously be totally inappropriate. I would have to be some kind of deranged fool for even contemplating and suggesting a column with 10,000 distinct values, right !!

A Bitmap Index is actually a B-Tree index in it’s basic structure and shape. It has index branch blocks that point to index leaf blocks that have all the necessary index information stored in an ordered and sorted manner. However, in the leaf blocks, a conventional B-Tree index basically stores the indexed column values followed by its corresponding rowid. A bitmap index differs considerably and basically stores in the leaf blocks for each distinct column, the column value followed by a starting and ending rowid that specifies a range of possible rowids within the table followed by a series of bits that denotes for each possible rowid within the range whether the row contains the column value (1) or not (0). If the index entry is larger than roughly half the index block size, another bitmap “piece” is created for the index entry, specifying a different range of rowids with corresponding bitmaps.

The “biggest” component of the index entry is thus this series of bits. But most of the values will be zeros (as a specific row can only have at most the one value of the column) and all these zeros can be compressed quite efficiently by Oracle within the index entry.

So having lots of distinct column values means having lots of index entries with lots of rowid ranges with lots of bitmaps. So a column with anything approaching 10,000 values would be totally inappropriate, right ?

However, the equivalent Bitmap Index required just 56 leaf blocks and an equality query returning the same 100 rows does so with just 11 consistent reads.

Ummm, perhaps bitmaps indexes don’t require such low numbers of distinct column values to be useful after all …

A few points to ponder on from this specific example.

The B-Tree index had to store 1,000,000 index values, once for each and every not null row in the parent table. The Bitmap Index only had to store the index values 10,000 times, once for each unique occurrence of the column value (although there may be times when this ratio may be higher)

The B-Tree index had to stored 1,000,000 rowids, once for each and every index row entry. The Bitmap Index only had to store a pair of rowid values for each unique occurrence of the column value (although there may be times when Bitmap Indexes need to store more than one pair of rowids per index value).

If the rows in the table are clustered together based on the index column value, it means the zeros in the bitmap index can be nice and continuous within the bitmap string and so will compress nicely. Therefore, the manner in which the rows are ordered in the table will have a direct impact in how efficient the Bitmap Index can be compressed.

There’s lots and lots of interesting things to discuss about Bitmap Indexes. For now, just note the next time you read Bitmap Indexes should only be used for columns with few distinct values, you may want to get some clarification on what is meant exactly by “few” …

Share this:

Like this:

Related

What about concurrency in your example?. I don’t see a big deal using a BMI with a medium-high cardinality, if what you`re looking for is improving logical queries ( and OR maybe ). You`re making a sort of matrix. But what a about concurrency?, it`s a big point.

How well does this concept of density scale? Say I had a data warehouse with a table that had 50 billion rows and a 5 million distinct values in a column for a density of 1/10,000. Should I just expect it to fall down dead, or could it still work?

That would be 50 billion rowids that have to be stored in a B-tree index. With partitioned tables, these rowids could suddenly become more expensive too btw if the index were to be global as we now have to store 10 bytes for the rowid. That would be 50 billion index values that would need to stored as well (assuming no nulls although nulls are stored in the bitmap index).

With the Bitmap index, you may only have a (relatively few) 5 million rowid pairs although this would depend on how the data is ordered and the efficiency of the compression. With a table that large, you would have multiple bitmap pieces per index entry but still there’s a 1/10000 ratio to catch up.

In a Warehouse, you also have the advantages of the potentially efficient execution plans open to you with multiple Bitmap indexes and Star Transformations without expensive “on-the-fly” bitmap transformations.

I’m planning on discussing the importance of column ordering and the usefulness of MINIMIZE RECORDS_PER_BLOCK to the efficiency of bitmap indexes “sometime”.

BTW, what you’ve done is a perfect example of why I like to show these little demos and illustrations. I show why and how I might suggest something might work. You can then run the same thing in your own environments and then experiment and play around and see for yourself whether something is applicable or not :)

moving slightly away from the point…
In isolation an index on GENDER may not be a good bitmap index key choice – probably (?) just two key values and depending on data source potentially similar row counts for each key value (OK, men in convents will have a low man count) So the index could be potentially selecting say a third to two thirds of the table (sweeping generalisation alert) and the CBO might decided that visiting the table without using the index to be more efficient.
But in association with other bitmap indexes on the same table it could be useful – we can AND or OR bitmaps together; so a GENDER bitmap could be useful to find the woman that lives in Sydney who shops at a certain supermarket and uses a certain cellphone provider and is aged between 40 and 49, which is how I was alleged to have met my wife :-)

I start the Bitmap Index section in my seminar introducing two related myths. One, as discussed, that Bitmap Indexes can only be used for columns with very small numbers of distinct values. And two, that Bitmap Indexes can generally be used in isolation with such columns.

Just one little comment, beware the possible ramifications of OR conditions. In your example, if you had said someone who “lives in Sydney who shops at a certain supermarket and uses a certain cellphone provider and is aged between 40 and 49″ OR is a woman, then “OK Computer” it ain’t :)

OR tends to be best used with multiple values from the same index and not between indexes – or indeed I would have a larger choice of wife than I did.

I think OR works well with “bought product A OR product B” queries where the product is stored in a single bitmap indexed column. I certainly would not look at “is RED or FEMALE” unless I had a very strong business case… but stranger business cases do exist…

We have an implementation of bitmap index that work very well with columns that have millions of distinct values and the software is being distributed under LGPL. If you can use a C++ program for your application, you can try it at .

The key to keep bitmap indexes in ORACLE is compression — ORACLE owns a large number of compression pattens if one evidence of this. With some compression methods, one can actually prove with theoretical analysis about the sizes of compressed bitmap indexes, e.g., http://doi.acm.org/10.1145/1132863.1132864 . Some Sybase and IBM products have take a different way of dealing with this size issue caused by high number of distinct values, http://doi.acm.org/10.1145/253260.253268 . The Wikipedia article on bitmap index, http://en.wikipedia.org/wiki/Bitmap_index, captures some of the key points. If you have to stay with ORACLE, my observation is that its compression method is actually pretty effective. Of course, personally, I would like to see them adopt a better one.

Provide some details as to why it requires tuning, such as the execution plan
Provide details on the size of the table
Provide details of the cardinalities of the columns
Provide details of available indexes

But most important of all, provide a SQL statement that actually makes sense and will compile. The code you supplied makes references to a table alias sl which is not in the FROM list.

Recently, I was looking into (but not working on, officially) rationale behind using B*Tree indexes (many of them on different combinations of columns) on one of the tables in our application. Being a follower of your blog, I knew you would have something helpful and I found this. Great explanation. Now, coming back to ours table, although this is an OLTP-type of system, this particular table gets populated/updated at regular intervals using a batch job (which translates and derives data before loading/merging it into this table). Once populated, it is queried extensively based on various (but fixed set of) combinations of columns. Recently, the application started facing huge performance issues, either during data upload or data querying. Basically, many (composite) indexes (normal and function-based) was identified as one of the biggest reasons for upload process being slow (i.e. missing SLAs). Reading your article (again) made me wonder if Bitmap indexes on individual columns instead of composite B*Tree indexes, might help. As you have proved, the size of the Bitmap indexes is much less than corresponding B*Tree indexes. I am though not sure if upload process would benifit or have detrimental effect due to Bitmap indexes. I am told that the upload process has been designed/tuned for efficiency. Would you have any pointers?

Indexes slow table changes as the indexes need to be updated as well. Depending on your database version, bitmap indexes are likely to be even more problematic as they can cause contention issues and mucho redo/undo and can bloat out in size considerably as well (especially for pre 10g bitmap indexes, less so since 10g).

Depending on usage and requirements, it might be worth investigating making non-critical indexes unsable and rebuild them again after the load. You might end up with a faster overall completion times and more efficient index structures as well.

If your application is OLPT and if you have any additional concurrent DML after the data loads, then I would consider bitmap indexes. The locking issues especially can likely cripple your application.

Thanks for your reply and apologies for my late comment.
The database version is 10.2.0.4 so I believe a bitmap index exploding in size due to many DML operations (and corresponding redo/undo generation) is not much to worry about (although we will definitely have to test and prove it).
Funny thing is, one of the proposed (and implemented) solutions to address severe performance issues is making many existing B*Tree indexes unusable before data load and rebuilding them later. The main driving factors behind suggesting bitmap indexes were
a) As said earlier, this table gets populated/updated using batch job that run at regular intervals. So there is no concurrent DML activity to worry about. In the sense of how data is maintained in this table, it is more similar to the way a DataWarehouse table is maintained than a Transactional table is maintained in OLTP system.
b) Once loaded, this table is extensively accessed using predicates on combination of some (or many) columns in this table. In the current setup, the designer seems to have proposed (and implemented) multiple composite B*Tree indexes that match with WHERE clauses of the queries against this table. This has ensured that all queries use indexes and get executed in sub-second. But it has also resulted in one column being included in multiple indexes and that too at different locations. This effectively means when such a column value is updated, 3-4 indexes will need to be updated.
I was hoping that single column bitmap indexes (on selected columns) instead of multiple composite B*Tree indexes will have following benifits:
a) By definition and due to being on one column, these indexes will be much smaller in size and hence more efficient to rebuild, if required.
b) Since most of the queries against the table have multiple columns in WHERE clause, bitmap indexes will be more effective for AND/OR operations
c) During data load, updating a column (that has a bitmap index) will require updating only a single index.
Hope my thinking is not completely off-the-track here.

Apologies for another message. But I was also wondering why does Oracle decide to use B*Tree index instead of Bitmap index, even when cost of B*Tree access is higher than the Bitmap access? I observed this on the table in question.

Again apologies for late reply (blame it on my vacation…)
I will see if I can produce a working example. The main issue in producing such example is generating the test data. At present, I am experiencing this behavior on a CHARACTER data that is meaningful and hence the indexes on the same have a pattern (used space, clustering factor etc.). I am not sure how I can produce large volume of meaningful CHARACTER test data so that corresponding indexes follow the same pattern.

No, I think you’re probably going down the right track, IF IF IF, you indeed don’t have concurrent DML on this table.

With all these things, the best thing to do is to test it all out and see if indeed you do get the expected benefits. Certainly having fewer indexes will speed up data loads, having bitmap indexes will potentially increase flexibilty.

Your own tests and timings will confirm whether you’re going down the right path.

With your other question, a bit more detail would certainly be helpful.

Thanks for your inputs. I am bit confused as what more details are needed and for which other question. Am I correct in assuming that your above (latest) reply is intended to be for my explanation of reasons for possibly using bitmap indexes above?

Some obstacles I’ve left behind, with other instead, I’m still fighting.
Studying the bitmap index and trying to understand how to calculate the cost, I found some threads where was explained, that a good approximation it’s given whereas the 80% of clustered data and 20% of scattered data, like so: