Introduction

After searching the internet for a .NET implementation of a WAH-compressed BitArray and not finding one, I decided to write one and publish this article so the .NET community is not left out; all the available implementations are in Java. This ties into the fairly advanced topic of bitmap indexing techniques for databases, and I needed it for my RaptorDB document data-store database engine.

What Is It?

A BitArray is a data structure in the .NET library for storing true/false values, or bits of data, in a compact form within an array of Int32 values. A WAH, or word-aligned hybrid, BitArray is a run-length-compressed version of a BitArray which saves a lot of space and memory. The existing Java implementations essentially duplicate the functionality of a BitSet, that is the AND, OR, NOT, and XOR operations, on top of the compressed internal storage format.

In my implementation, I defer the bit-level functionality to the BitArray itself and just add compression and decompression routines. This is much faster than the Java approach, at the expense of memory usage. To overcome this, I have also added a FreeMemory method to release the BitArray contents and keep only the compressed contents. Arguably, if you are using 100 million bits, a full implementation performs better than mine, but for most of our use cases we are at most in the tens of millions of bits.

The original method was invented at the US Department of Energy's Berkeley Lab; the project, named FastBit, is used for high energy physics experiments. You can see it here: FastBit site.

Why Should I Care?

"So what?", you ask. Well, as mentioned before, BitArrays are used in an indexing technique called bitmap indexes (Wiki) and in column-based databases, which store data in columns instead of rows. An example you might know is Microsoft's PowerPivot for Excel, which can process millions of rows in seconds. Interestingly, Microsoft has only recently announced the implementation of bitmap indexes in the upcoming SQL Server platform, post 2008 R2; they have long been in use by other RDBMS vendors like Oracle.

In the worst case, you will get N/31 extra bits encoded, roughly a 3% increase in size over the original: one flag bit is spent for every 31 data bits, so, for example, 10 million bits pick up about 322,600 extra bits, or roughly 40 KB.

What You Get

WAHBitArray is essentially the same as the standard BitArray in the .NET Framework, with the following additions:

FreeMemory(): This will first compress the internal BitArray then free the memory it uses.

GetCompressed(): This will compress the current BitArray then return it as an array of uint.

CountOnes(), CountZeros(): will count the respective bits in the array.

GetBitIndexes(bool): will return an enumeration (via yield) of the positions of bits with the given value; for example, if the bit array contains 10001..., this will return the integers 0, 4, ... if the bool parameter was true, and 1, 2, 3, ... if it was false.

Get(), Set(): Methods implemented with auto resizing, so they never throw out-of-range exceptions (see the usage sketch below).
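Putting these together, here is a short usage sketch. It is illustrative only; the exact signatures and return types in the source may differ slightly.

var bits = new WAHBitArray();
bits.Set(0, true);                    // auto-resizes, never throws
bits.Set(4, true);
bool past = bits.Get(100);            // safe read past the current length -> false

var ones = bits.CountOnes();          // -> 2
foreach (int i in bits.GetBitIndexes(true))
    Console.WriteLine(i);             // -> 0, 4

uint[] packed = bits.GetCompressed(); // WAH-compressed words
bits.FreeMemory();                    // release the raw bits, keep the compressed form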

Points of Interest

The BitArray class is sealed by Microsoft, so inheriting from it was not possible.

BitArray throws an exception on bit operations if the lengths of the two BitArrays are not equal; WAHBitArray resizes both to the length of the larger before operating.

The BitArray must be resized in increments of 32 bits; otherwise the compression bits get mangled.

Version 2.0

For extra speed in compressing and uncompressing the bits, and because the .NET Framework implementation does not give access to the internal data structures of the BitArray, I had to rewrite all the BitArray functionality in WAHBitArray.

Using Reflector to examine the internal implementation of the BCL BitArray, one can see the following:
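The relevant private fields look like this (quoted from memory of the reference source, so treat the exact names as approximate):

public sealed class BitArray : ICollection, ICloneable
{
    private int[] m_array;   // the bits, packed 32 per int
    private int m_length;    // the logical number of bits
    private int _version;    // bumped on change to invalidate enumerators
    // ...
}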

Now, with access to the internal uint[] bits, the compression method takes 31-bit blocks of data instead of single bits. This is done in the Take31Bits() method, which finds the two adjacent uint values in the _uncompressed list and does some bit manipulation.
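A minimal sketch of that idea follows; the exact signature and bookkeeping are my assumptions, since the point is only that a 31-bit block can straddle two adjacent 32-bit words:

private uint Take31Bits(uint[] data, int index)
{
    long offset = (long)index * 31;          // bit offset of the 31-bit block
    int word = (int)(offset >> 5);           // offset / 32
    int shift = (int)(offset & 31);          // offset % 32

    ulong both = data[word];
    if (word + 1 < data.Length)
        both |= (ulong)data[word + 1] << 32; // bring in the adjacent word

    return (uint)((both >> shift) & 0x7FFFFFFF); // keep only 31 bits
}

With the data exposed as plain uint[] arrays, the logical operations also work on them directly; the And() method, for example: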

public WAHBitArray And(WAHBitArray op)
{
    this.CheckBitArray();                 // check the bit array is uncompressed
    uint[] ints = op.GetUncompressed();   // get the values
    FixSizes(ints, _uncompressed);        // make the sizes the same
    for (int i = 0; i < ints.Length; i++)
        ints[i] &= _uncompressed[i];      // do the AND operation
    return new WAHBitArray(false, ints);  // return a new object
}

The compression and uncompression routines were rewritten to operate directly on the uint[] arrays.
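The general shape of such a compression routine, assuming the common WAH word layout (MSB = 0 marks a literal word carrying 31 payload bits; MSB = 1 marks a fill word whose bit 30 holds the fill value and whose low 30 bits hold the run length in 31-bit groups), is sketched below. This illustrates the technique, not the actual source:

private static List<uint> Compress31(IEnumerable<uint> words31)
{
    var output = new List<uint>();
    uint runValue = 0;   // fill value of the current run: 0 or 1
    uint runCount = 0;   // length of the current run in 31-bit groups

    foreach (uint w in words31)              // each w holds 31 payload bits
    {
        if (w == 0 || w == 0x7FFFFFFF)       // all zeros or all ones
        {
            uint v = (w == 0) ? 0u : 1u;
            if (runCount > 0 && v != runValue)
            {
                FlushRun(output, runValue, runCount);
                runCount = 0;
            }
            runValue = v;
            runCount++;                      // extend the run
        }
        else
        {
            if (runCount > 0)
            {
                FlushRun(output, runValue, runCount);
                runCount = 0;
            }
            output.Add(w);                   // literal word, MSB already 0
        }
    }
    if (runCount > 0)
        FlushRun(output, runValue, runCount);
    return output;
}

private static void FlushRun(List<uint> output, uint runValue, uint runCount)
{
    // fill word: MSB set, bit 30 = fill value, low 30 bits = group count
    // (a real implementation would split runs longer than 2^30 groups)
    output.Add(0x80000000u | (runValue << 30) | runCount);
}

Uncompression is the mirror image: literal words emit their 31 bits as-is, and fill words emit the stored number of all-zero or all-one 31-bit groups.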

Version 2.5

This version is ported back from the work done in RaptorDB and is an overhaul of all the routines, focusing on multi-threaded access and concurrency issues. A very important lesson learned was to lock at the source of the resource being used, as opposed to at a higher level; this takes care of concurrency issues at the lowest level and saves a lot of headaches and debugging.
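As a minimal illustration of that lesson (the names here are hypothetical, not the actual RaptorDB code), the lock lives inside the lowest-level accessors, so no caller has to remember to take it:

private readonly object _lock = new object();

public bool Get(int index)
{
    lock (_lock)                     // concurrency handled at the resource itself
    {
        CheckBitArray();
        return internalGet(index);   // hypothetical unsynchronized worker
    }
}

public void Set(int index, bool value)
{
    lock (_lock)
    {
        CheckBitArray();
        internalSet(index, value);   // hypothetical unsynchronized worker
    }
}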

Much of the code has been reformatted and optimized. I am very confident in the correctness of this version, as it is being tested to its maximum in RaptorDB.

History

Initial release v1.0: 22 June 2011

Update v1.1: 24 June 2011

Bit operations now return a WAHBitArray instead of BitArray

A bit operation will take either a WAHBitArray or a BitArray as the input


About the Author

Mehdi first started programming when he was 8 on a BBC+ 128K machine in 6512 processor language; after various hardware and software changes, he eventually came across .NET and C#, which he has been using since v1.0. He is formally educated as a systems analyst and industrial engineer, but his programming passion continues.

* Mehdi is the 5th person to get 6 out of 7 Platinums on CodeProject (13th Jan '12)
* Mehdi is the 3rd person to get 7 out of 7 Platinums on CodeProject (26th Aug '16)

Comments and Discussions

Thanks for sharing this. Also, the And, Or, and Xor methods can be factored out as below.

The doWork function takes two uint sets of the same size and an operation (&, |, ^, ...). The first set of uints is updated with the result of the operation applied to its elements and the elements of the uncompressed set.
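Something along these lines (a sketch reconstructing the suggestion; doWork and the lambdas are the commenter's idea, not the published source):

private WAHBitArray DoWork(WAHBitArray op, Func<uint, uint, uint> operation)
{
    this.CheckBitArray();                 // make sure this side is uncompressed
    uint[] ints = op.GetUncompressed();   // get the other side's values
    FixSizes(ints, _uncompressed);        // make the sizes the same
    for (int i = 0; i < ints.Length; i++)
        ints[i] = operation(ints[i], _uncompressed[i]);
    return new WAHBitArray(false, ints);
}

public WAHBitArray And(WAHBitArray op) { return DoWork(op, (a, b) => a & b); }
public WAHBitArray Or(WAHBitArray op)  { return DoWork(op, (a, b) => a | b); }
public WAHBitArray Xor(WAHBitArray op) { return DoWork(op, (a, b) => a ^ b); }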

The code is more readable and maintainable this way. If the calculation has to be updated, modifying the doWork function suffices, rather than changing it in three different places.

Since you mention that Microsoft has only just announced bitmap indexing in SQL Server, I thought I'd point out that they are certainly no strangers to the technique. Bitmap indexing has been a part of Microsoft Access since the very first version (over 20 years!), and MSSQL has always used it internally on the fly (JOIN processing, parts of index scans, etc.).

Good job! I already implemented a bit array as a bitset class, based on an old work done by someone here or there (I don't remember). This bitset has been used extensively by me, so I am used to it, and I've implemented several performance improvements; let me tell you about some of them:

You should implement GetHashCode() and Equals() to allow this class to be used as a key in a Dictionary or Hashtable. Also, ToString() should be overridden (useful for debugging).

Here is a not-so-lightweight comparison; with a little more work you could compare int-by-int when both arrays are compressed or uncompressed, but in a hurry this should work fine (and actually does)!
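A sketch of what that could look like; the method names follow the article (GetUncompressed), while the overrides themselves reconstruct the suggestion:

public override bool Equals(object obj)
{
    var other = obj as WAHBitArray;
    if (other == null) return false;

    uint[] a = this.GetUncompressed();    // not-so-lightweight: decompress both
    uint[] b = other.GetUncompressed();
    int shared = Math.Min(a.Length, b.Length);

    for (int i = 0; i < shared; i++)
        if (a[i] != b[i]) return false;
    // any trailing words in the longer array must be zero
    for (int i = shared; i < a.Length; i++) if (a[i] != 0) return false;
    for (int i = shared; i < b.Length; i++) if (b[i] != 0) return false;
    return true;
}

public override int GetHashCode()
{
    uint[] a = this.GetUncompressed();
    int last = a.Length - 1;
    while (last >= 0 && a[last] == 0) last--;   // ignore trailing zero words
    int h = 17;                                 // simple rolling hash
    for (int i = 0; i <= last; i++)
        h = h * 31 + (int)a[i];
    return h;
}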

It is often necessary to check whether the bit array is empty (e.g., after an And() operation); this calls for a property, say 'Zero', which is good to implement as an extremely fast routine, like this one:
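For example (a sketch, assuming access to the uncompressed words):

public bool Zero
{
    get
    {
        foreach (uint w in this.GetUncompressed())
            if (w != 0) return false;   // bail out on the first set bit
        return true;
    }
}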

Even though I did not look inside the compression scheme, the same check may not even need to decompress the array; working on the compressed form would be even faster!

Another useful one is the 'cardinality', or OnesCount().

You can set up a simple lookup table of 256 bytes, storing in each entry the precomputed number of one bits in that byte value, then simply add up the table entries for all bytes (making just 4 lookups and additions for each integer in the _uncompressed array). The current method makes bit-length additions (it loops through every bit).
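A sketch of that table-based count (the table build and method body are illustrative):

private static readonly byte[] _bitCounts = BuildTable();

private static byte[] BuildTable()
{
    var t = new byte[256];
    for (int i = 0; i < 256; i++)
        for (int b = i; b != 0; b >>= 1)
            t[i] += (byte)(b & 1);      // count the one bits of byte value i
    return t;
}

public long CountOnes()
{
    long count = 0;
    foreach (uint w in _uncompressed)   // 4 lookups per 32-bit word
        count += _bitCounts[w & 0xFF]
               + _bitCounts[(w >> 8) & 0xFF]
               + _bitCounts[(w >> 16) & 0xFF]
               + _bitCounts[(w >> 24) & 0xFF];
    return count;
}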

Sorry, I'm not very clear about this at the moment. If you are trying to find the intersection of several large sets over a huge base set, you might need to jump ahead a large distance in one set or the other; that is, you might have a large gap in one set but lots of members in the other. I think you would need some sort of hierarchy on top of the bit arrays to make this fast. Or I might be thinking about this in completely the wrong way.

Good stuff. You posted this a couple of days before I finally made the decision to open source my implementation of Word Aligned Hybrid Bit Vectors and a search framework I built on top of them. As an FYI: a major restriction of my implementation of WAH Vectors is that writing to compressed vectors can only occur on the least significant 31 bits. Check it out: http://softwarebotanysun.codeplex.com. I still need to get more code coverage with the unit tests, and documentation and examples are non-existent; however, I will be working on them over the next couple of weeks as well as a series of blog posts.

Regarding the patent, I just took a look. The patent filing makes it sound like the compression scheme and logical operations are all covered. However, FastBit is licensed under LGPL (and so is Software Botany Sunlight) which states in the preamble:

"Finally, software patents pose a constant threat to the existence of any free program. We wish to make sure that a company cannot effectively restrict the users of a free program by obtaining a restrictive license from a patent holder. Therefore, we insist that any patent license obtained for a version of the library must be consistent with the full freedom of use specified in this license."

... and later in section 11:

"For example, if a patent license would not permit royalty-free redistribution of the Library by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Library."

So yeah, I think it's safe to make derivative works as long as they are licensed under the LGPL or the GPL. Regardless, I haven't used Sunlight for anything commercial yet; it is more or less a pet project for enjoyment and an opportunity to use C# features like unsafe and dynamic. I should definitely put a warning up on CodePlex, though, for users who may want to use it commercially. I'll also see if I can get hold of anyone who knows how to read legalese and whatnot.

"As an FYI: a major restriction of my implementation of WAH Vectors is that writing to compressed vectors can only occur on the least significant 31 bits."

Taking the original data in multiples of 31 bits seems a bit icky. A relatively easy way to improve efficiency would be to process output data in groups of 32 words (128 bytes), formatted as 31 data words plus one mode word (effectively 31 x 33 bits). The first 31 words of each group would hold the 32 "data" bits of each logical output word, and the last word would hold the "mode" bit for each of the preceding 31 words. That approach would allow input words that aren't 0x00000000 or 0xFFFFFFFF to be copied directly from source to destination without any bit-shifting or masking.

If word[5] of the array is supposed to represent 32 bits of literal data, then bit 5 of word[31] would be zero. If word[5] is supposed to represent something else, then bit 5 of word[31] would be set, and the bits of word[5] would indicate what exactly it is supposed to be. A really simple approach would be to use bit 31 to indicate whether the word represents a run of 0x00000000 or of 0xFFFFFFFF, and the bottom 31 bits to indicate how many such words it represents. Since most runs aren't going to be anywhere near 2 billion words long, it may be helpful to let some bits in the word say something about the data that follows the run. For example, one could use bits 31-30 to select one of four types of run (a decoding sketch follows the list):

A run of 0 to a billion words of zero

A run of 0 to a billion words of ones (0xFFFFFFFF)

A run of 0-255 BITS of zeroes, 0-127 BITS of ones, 0-255 BITS of zeroes, 0-127 BITS of ones, and enough zeroes to pad out any partial word.

A run of 0-255 BITS of ones, 0-127 BITS of zeroes, 0-255 BITS of ones, 0-127 BITS of zeroes, and enough ones to pad out any partial word.

That's just a simple example of how things could be done; note that any four runs comprising 130 bits or less (and many combinations of four runs totaling 256 bits or more) could be stored in a single word.
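To make the proposal concrete, decoding under this scheme might look like the following sketch (every name here is hypothetical, following the description above):

// bit i of group[31], the mode word, tells whether group[i] is literal data
static bool IsLiteral(uint[] group, int i)
{
    return ((group[31] >> i) & 1) == 0;
}

static void DecodeSpecial(uint w)
{
    uint type = w >> 30;            // bits 31-30 select one of four run types
    uint count = w & 0x3FFFFFFF;    // room for runs up to ~1 billion words
    switch (type)
    {
        case 0: /* emit 'count' words of 0x00000000 */ break;
        case 1: /* emit 'count' words of 0xFFFFFFFF */ break;
        case 2: /* unpack the zero-led bit runs described above */ break;
        case 3: /* unpack the one-led bit runs described above */ break;
    }
}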

Thanks for the most interesting article. Did you by any chance do some performance testing on it? In particular, it would be nice to know the speed difference between the WAHBitArray and the BitArray for AND, OR, NOT, and XOR. I have been using BitArrays as representations of ranges of integers (line segments), and your code might be very useful for compressing them. All in all, this has some definite potential. Thanks again.

Wow again; I thought this was an offbeat topic, but it seems people are actually using bit arrays. I never would have guessed line segments!?

The performance is the same, I'm afraid, as all the computation is done by the standard BitArray. If you follow the link to the FastBit site, you will find a research article there with some performance tests. The long and short of it is that for low densities of values their way is faster, but for high densities it's slower.

Cheers

It's the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist

Hi, I put a description of some of the software in a previous post on this thread.

I am rewriting it all at the moment. I currently have a programming language written in C++ which knows how to talk to standard SQL databases and create the bitmapped data from them, and a GUI and bitmap engine written in C# and WinForms which allows the user to query the bitmapped database.

I am replacing the programming language with an XML-based declarative language (although it also supports embedded C# routines) for creating the bitmapped database; the GUI I am rewriting in WPF, which is a bit of a pain, but I am mostly there.

I have had similar systems for over 12 years - this is the 5th iteration.

Somebody explained bitmapped indexes to me at Tech-Ed ages ago over a few beers; I worked out how to implement them myself and went from there. It is only in recent years that I have come across WAH and BBC etc., so it's nice that the standard ways aren't fundamentally different from my own hand-rolled stuff. I can do counts on a 100-million-row database in 1/10th of a second, so I must be doing something right.

Great job! Bitmap indexes were new to me too, ever since project Gemini (which became PowerPivot from Microsoft) intrigued me immensely and began the journey to find out. It's going into the document store version of RaptorDB.

It's the man, not the machine - Chuck Yeager
If at first you don't succeed... get a better publicist

They are interesting to work with and very satisfying because they can be so fast. I like micro-optimising; in most programming tasks it isn't worth it, but in this sphere it can have some huge benefits. Profilers are your friend.

I have been keeping an eye on your RaptorDB work, it looks very promising indeed.

Are there any patents on WAH? I seem to recall the BBC algorithm being patented.