Introduction

After searching the internet for a .NET implementation of a WAH compressed BitArray and not finding one, I decided to write one and publish it here so that the .NET community is not left out, as all the available implementations are in Java. This ties into the fairly advanced topic of bitmap indexing techniques for databases, and I needed it for my RaptorDB Document data-store database engine.

What Is It?

A BitArray is a data structure in the .NET library for storing true/false values (bits) in a compact form within an array of Int32 values. A WAH, or word-aligned hybrid, BitArray is a special run-length compressed version of a BitArray which saves a lot of space and memory. All the implementations that exist in Java essentially duplicate the functionality of a BitSet, that is, the AND, OR, NOT, and XOR operations, on top of the compressed internal storage format.

In my implementation, I delegate the bit operations to the standard BitArray and just add compression and decompression routines. This is much faster than the Java way, at the expense of memory usage. To overcome this, I have also added a FreeMemory method to release the BitArray contents and keep only the compressed contents. Arguably, if you are using 100 million bits, then a full implementation is more performant than mine, but for most of our use cases we are at most in the tens of millions of bits.

The original method was invented at Lawrence Berkeley National Laboratory (Berkeley Lab) of the US Department of Energy; it is part of a project named FastBit and is used for high-energy physics experiments; you can see it here: FastBit site.

Why Should I Care?

So what, you ask? Well, as mentioned before, BitArrays are used in an indexing technique called bitmap indexes (Wiki) and in column-based databases, which store data in columns instead of rows. An example you might know is Microsoft's PowerPivot for Excel, which can process millions of rows in seconds. Interestingly, Microsoft has only recently announced the implementation of bitmap indexes in the upcoming SQL Server platform, post 2008 R2. They have long been in use by other RDBMS vendors like Oracle.

In the worst case, where no runs can be compressed, every 31 bits of input take up a full 32-bit word, so for N bits you will get N/31 more bits encoded, or about a 3% increase over the original size.
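To make that arithmetic concrete, here is a small sketch that computes the worst-case encoded size. The class and method names are made up for illustration and are not part of WAHBitArray; the only assumption is the one stated above, that each incompressible 31-bit group costs one 32-bit word.

```csharp
using System;

static class WahWorstCase
{
    // Worst case: no runs to compress, so each group of 31 input bits
    // occupies one full 32-bit word (1 flag bit + 31 payload bits).
    public static long WorstCaseBits(long bitCount)
    {
        long words = (bitCount + 30) / 31; // ceil(bitCount / 31)
        return words * 32;
    }

    // Relative growth over the raw bit count; approaches 1/31 for large inputs.
    public static double WorstCaseOverhead(long bitCount)
    {
        return (double)(WorstCaseBits(bitCount) - bitCount) / bitCount;
    }
}
```

For 10 million bits, this works out to 322,581 words, or roughly 3.2% more than the raw bits.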

What You Get

WAHBitArray is essentially the same as the standard BitArray in the .NET Framework, with the following additions:

FreeMemory(): This will first compress the internal BitArray then free the memory it uses.

GetCompressed(): This will compress the current BitArray then return it as an array of uint.

CountOnes(), CountZeros(): will count the respective bits in the array.

GetBitIndexes(bool): will return an enumeration using yield of the respective bit position; for example, if the bit array contains 10001... this will return integers 0,4,... if the bool parameter was true, and 1,2,3,... if bool was false.

Get(), Set(): Methods implemented with auto resizing and no exceptions.
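As an illustration of the GetBitIndexes(bool) behaviour described above, here is a minimal sketch over a raw uint[] using yield. It is not the WAHBitArray source: the static helper, its extra parameters, and the assumption that bit i lives in word i / 32 at position i % 32 are all mine.

```csharp
using System.Collections.Generic;

static class BitIndexSketch
{
    // Lazily yields the positions whose bit equals 'ones'
    // (true -> set bits, false -> clear bits).
    // 'length' is the number of valid bits in 'bits'.
    public static IEnumerable<int> GetBitIndexes(uint[] bits, int length, bool ones)
    {
        for (int i = 0; i < length; i++)
        {
            bool isSet = (bits[i >> 5] & (1u << (i & 31))) != 0;
            if (isSet == ones)
                yield return i;
        }
    }
}
```

Because the method uses yield, nothing is computed until the caller enumerates the result, so a foreach over a large array can stop early without scanning the rest.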

Points of Interest

The BitArray class is sealed by Microsoft, so inheriting from it was not possible.

BitArray throws an exception on bit operations if the lengths of the two BitArrays are not equal; WAHBitArray makes them both the size of the larger one before the operation.

BitArray must be resized in increments of 32, otherwise it mangles the compression bits.

Version 2.0

For extra speed in compressing and uncompressing the bits, and because the .NET Framework implementation does not give access to the internal data structures of the BitArray, I had to rewrite all the BitArray functionality in WAHBitArray.

Using Reflector to see the internal implementation of the BCL BitArray one can see the following snippets:

Now, with access to the internal uint[] bits, the compression method reads the data in 31-bit blocks instead of bit by bit. This is done in the Take31Bits() method, which finds the two adjacent uint values in the _uncompressed list that hold the block and combines them with some bit manipulation.
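A minimal sketch of that kind of extraction is shown here. It is not the article's Take31Bits(): the helper name, its signature, and the layout assumption (bit i stored in word i / 32 at position i % 32, least significant bit first) are illustrative only.

```csharp
static class Take31Sketch
{
    // Returns the 31 bits starting at bit position 'start',
    // packed into the low 31 bits of the result.
    // Bits past the end of the array are treated as zero.
    public static uint Take31(uint[] data, int start)
    {
        int word = start >> 5;   // index of the first word involved
        int offset = start & 31; // bit offset inside that word
        if (word >= data.Length) return 0;

        // Join the two adjacent words into one 64-bit window.
        ulong window = data[word];
        if (word + 1 < data.Length)
            window |= (ulong)data[word + 1] << 32;

        // Shift the wanted bits down and keep only 31 of them.
        return (uint)(window >> offset) & 0x7FFFFFFFu;
    }
}
```

Since offset is at most 31, the 31 wanted bits always fit inside the 64-bit window, so two words are enough for any starting position.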

With the bits held in uint[] arrays, a bit operation such as And looks like this:

    public WAHBitArray And(WAHBitArray op)
    {
        this.CheckBitArray();               // check the bit array is uncompressed
        uint[] ints = op.GetUncompressed(); // get the values
        FixSizes(ints, _uncompressed);      // make the sizes the same
        for (int i = 0; i < ints.Length; i++)
            ints[i] &= _uncompressed[i];    // do the AND operation
        return new WAHBitArray(false, ints); // return a new object
    }
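For readers who want to try the idea in isolation, here is a self-contained sketch of an AND over two raw uint[] arrays of different lengths. It is an assumption about how the size fixing could behave (the shorter operand is treated as zero-padded), not the article's FixSizes; the class and method names are hypothetical.

```csharp
using System;

static class AndSketch
{
    // ANDs two uncompressed word arrays of possibly different lengths.
    // The shorter operand is treated as if padded with zero words, so the
    // result has the length of the longer one and no exception is thrown.
    public static uint[] And(uint[] left, uint[] right)
    {
        int max = Math.Max(left.Length, right.Length);
        int min = Math.Min(left.Length, right.Length);
        uint[] result = new uint[max]; // words beyond 'min' stay 0 (x AND 0 = 0)
        for (int i = 0; i < min; i++)
            result[i] = left[i] & right[i];
        return result;
    }
}
```

Unlike the method above, this version leaves both inputs untouched and returns a new array.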

The compression and decompression routines were rewritten to operate directly on the uint[] arrays.
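To give a feel for what such routines do, here is a rough sketch of WAH-style compression and decompression. It is not the WAHBitArray code. It assumes the commonly described word layout (top bit 0 = literal word carrying 31 bits; top bit 1 = fill word, whose next bit is the fill value and whose low 30 bits count the repeated 31-bit groups), and for brevity it takes its input already split into 31-bit groups. All names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class WahSketch
{
    const uint FillFlag = 0x80000000u;  // top bit set: word is a fill (run) word
    const uint FillValue = 0x40000000u; // next bit: the run consists of ones
    const uint CountMask = 0x3FFFFFFFu; // low 30 bits: run length in 31-bit groups
    const uint AllOnes = 0x7FFFFFFFu;   // a 31-bit group with every bit set

    // Compresses a sequence of 31-bit groups (each value must be <= 0x7FFFFFFF).
    public static List<uint> Compress(IEnumerable<uint> groups)
    {
        var output = new List<uint>();
        foreach (uint g in groups)
        {
            if (g != 0 && g != AllOnes)
            {
                output.Add(g); // mixed bits: store as a literal word
                continue;
            }
            uint fill = FillFlag | (g == AllOnes ? FillValue : 0u);
            int last = output.Count - 1;
            // Extend the previous fill word if it has the same value and room left.
            if (last >= 0 && (output[last] & ~CountMask) == fill
                          && (output[last] & CountMask) < CountMask)
                output[last]++;
            else
                output.Add(fill | 1u); // start a new run of length 1
        }
        return output;
    }

    // Expands compressed words back into 31-bit groups.
    public static List<uint> Decompress(IEnumerable<uint> words)
    {
        var groups = new List<uint>();
        foreach (uint w in words)
        {
            if ((w & FillFlag) == 0)
            {
                groups.Add(w); // literal word: copy as is
                continue;
            }
            uint value = (w & FillValue) != 0 ? AllOnes : 0u;
            for (uint n = w & CountMask; n > 0; n--)
                groups.Add(value);
        }
        return groups;
    }
}
```

Long runs of identical bits collapse into a single word, while mixed data falls back to one literal word per 31 bits, which is the worst case.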

Version 2.5

This version is a back-port of the work done in RaptorDB: an overhaul of all the routines focusing on multi-threaded access and concurrency issues. A very important lesson learned was to lock at the source of the resource being used rather than at a higher level; this takes care of concurrency issues at the lowest level and saves a lot of headaches and debugging.

Much of the code has been reformatted and optimized. I am very confident in the correctness of this version as it is being tested to its maximum in RaptorDB.
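As a sketch of what locking at the source can look like, the class below takes the lock inside its own Get and Set methods, so callers never have to coordinate access themselves. This is an illustration under my own assumptions (the class name, the List<uint> storage, and the single private lock object), not the RaptorDB or WAHBitArray code.

```csharp
using System.Collections.Generic;

class LockedBits
{
    private readonly object _lock = new object();
    private readonly List<uint> _words = new List<uint>();

    // Sets or clears a bit, growing the storage when needed.
    public void Set(int index, bool value)
    {
        lock (_lock) // the lock lives next to the data it protects
        {
            int word = index >> 5;
            while (_words.Count <= word) _words.Add(0); // auto-resize
            uint mask = 1u << (index & 31);
            if (value) _words[word] |= mask;
            else _words[word] &= ~mask;
        }
    }

    // Reads a bit; positions beyond the current size read as false.
    public bool Get(int index)
    {
        lock (_lock)
        {
            int word = index >> 5;
            return word < _words.Count && (_words[word] & (1u << (index & 31))) != 0;
        }
    }
}
```

Keeping the lock private to the class means no caller can forget to take it, which is the point of handling concurrency at the lowest level.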

History

Initial release v1.0: 22 June 2011

Update v1.1: 24 June 2011

Bit operations now return a WAHBitArray instead of BitArray

A bit operation will take either a WAHBitArray or a BitArray as the input

About the Author

Mehdi first started programming when he was 8, on a BBC+128k machine in 6512 processor language. After various hardware and software changes, he eventually came across .NET and C#, which he has been using since v1.0.
He is formally educated as a system analyst and industrial engineer, but his passion for programming continues.

* Mehdi is the 5th person to get 6 out of 7 Platinums on CodeProject (13th Jan'12)