Introduction

After searching the internet for a .NET implementation of a WAH compressed BitArray and not finding one, I decided to write one and publish it here so that the .NET community is not left out, since all the existing implementations are in Java. This ties into a fairly advanced topic, bitmap indexing techniques for databases, and I needed it for my upcoming RaptorDB document data-store database engine.

What is it?

A BitArray is a data structure in the .NET library for storing true/false values, or bits of data, in a compact form within an array of Int32 values. A WAH, or word-aligned hybrid, BitArray is a run-length compressed version of a BitArray that saves a lot of space and memory. All the implementations that exist in Java essentially duplicate the functionality of a BitSet, that is the AND, OR, NOT and XOR operations, working directly on the compressed internal storage format.
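For readers who have not used it, here is a small sketch of the standard System.Collections.BitArray operations mentioned above. The BitArrayDemo class and its Render helper are made up for this illustration and are not part of any library.

```csharp
using System;
using System.Collections;

public static class BitArrayDemo
{
    // Hypothetical helper: renders a BitArray as a string of 0s and 1s, index 0 first.
    public static string Render(BitArray bits)
    {
        var chars = new char[bits.Length];
        for (int i = 0; i < bits.Length; i++)
            chars[i] = bits[i] ? '1' : '0';
        return new string(chars);
    }

    public static void Main()
    {
        var a = new BitArray(new[] { true, false, true, false });  // 1010
        var b = new BitArray(new[] { true, true, false, false });  // 1100

        // And, Or, Xor and Not modify the instance they are called on,
        // so each operation here works on a fresh copy of a.
        Console.WriteLine(Render(new BitArray(a).And(b))); // 1000
        Console.WriteLine(Render(new BitArray(a).Or(b)));  // 1110
        Console.WriteLine(Render(new BitArray(a).Xor(b))); // 0110
        Console.WriteLine(Render(new BitArray(a).Not()));  // 0101
    }
}
```

Note that these methods change the BitArray in place, which is why the sketch copies a before each call.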

In my implementation I defer the functionality of the BitArray to the BitArray itself and just add compression and decompression routines. This is much faster than the Java way, at the expense of memory usage. To overcome this I have also added a FreeMemory method that releases the BitArray contents and keeps only the compressed contents. Arguably, if you are using hundreds of millions of bits, a full implementation performs better than mine, but for most of our use cases we are at most in the tens of millions of bits range.

The original method was invented at the Berkeley Lab of the US Department of Energy, in a project named FastBit, and is used for high energy physics experiments. You can see it here: FastBit site

Why should I care?

So what, you ask? Well, as mentioned before, BitArrays are used in an indexing technique called bitmap indexes (wiki) and in column-based databases, which store data in columns instead of rows. An example you might know is Microsoft's PowerPivot for Excel, which can process millions of rows in seconds. Interestingly, Microsoft has only recently announced the implementation of bitmap indexes in the upcoming SQL Server platform after 2008 R2. They have long been in use by other RDBMS vendors such as Oracle.

In the worst case, where nothing can be run-length compressed, every 31 bits of data take up a full 32-bit word, so N bits are encoded with N/31 extra bits, or about a 3% increase in size over the original.
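To make the 31-bit arithmetic concrete, here is a minimal sketch of WAH-style compression in C#. It is not the WAHBitArray source: the WahSketch class, its constants and the bit layout (top bit marks a fill word, the next bit holds the fill value, the low 30 bits count 31-bit groups, and bits are packed most significant first) are assumptions based on the general WAH scheme, and the actual implementation may differ.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class WahSketch
{
    // Assumed word layout, for illustration only.
    const uint FillFlag  = 0x80000000; // top bit set: the word is a run ("fill")
    const uint FillValue = 0x40000000; // second bit: the run is all ones, else all zeros
    const uint MaxRun    = 0x3FFFFFFF; // low 30 bits: number of 31-bit groups in the run
    const uint AllOnes   = 0x7FFFFFFF; // a 31-bit group with every bit set

    public static List<uint> Compress(BitArray bits)
    {
        var words = new List<uint>();
        for (int start = 0; start < bits.Length; start += 31)
        {
            // Pack the next 31 bits (zero-padded at the end) into one group.
            uint group = 0;
            for (int i = 0; i < 31 && start + i < bits.Length; i++)
                if (bits[start + i]) group |= 1u << (30 - i);

            if (group != 0 && group != AllOnes)
            {
                words.Add(group); // mixed bits: store the group as a literal word
                continue;
            }

            uint fill = FillFlag | (group == AllOnes ? FillValue : 0u);
            int last = words.Count - 1;
            // Extend the previous word if it is a fill of the same value with room left.
            if (last >= 0 && (words[last] & ~MaxRun) == fill && (words[last] & MaxRun) < MaxRun)
                words[last]++;
            else
                words.Add(fill | 1u); // start a new run of length one
        }
        return words;
    }

    public static void Main()
    {
        // 3,100 zero bits collapse into a single fill word covering 100 groups.
        Console.WriteLine(Compress(new BitArray(3100)).Count); // 1

        // Alternating bits never form a run: 100 literal words, i.e. 3,200 bits for 3,100.
        var noisy = new BitArray(3100);
        for (int i = 0; i < noisy.Length; i += 2) noisy[i] = true;
        Console.WriteLine(Compress(noisy).Count);              // 100
    }
}
```

The second case is the worst case described above: 3,100 bits grow by 3,100/31 = 100 bits. A matching decompression routine would also need the original bit length, because the last group is padded with zeros.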

What you get

WAHBitArray is essentially the same as the standard BitArray in the .NET Framework, with the following additions:

FreeMemory() : this will first compress the internal BitArray, then free the memory it used.

GetCompressed() : this will compress the current BitArray, then return it as an array of uint.

CountOnes(), CountZeros() : will count the respective bits in the array.

GetBitIndexes(bool) : will return an enumeration, using yield, of the respective bit positions. For example, if the bit array contains 10001... this will return the integers 0,4,... if the bool parameter was true, and 1,2,3,... if it was false.
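As a rough illustration of the yield-based enumeration and the counting described above, the sketch below works on a plain BitArray. The BitIndexSketch class and its method signatures are hypothetical stand-ins, since the internals of WAHBitArray are not shown here.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class BitIndexSketch
{
    // Lazily yields the position of every bit whose value equals 'ones'.
    public static IEnumerable<int> GetBitIndexes(BitArray bits, bool ones)
    {
        for (int i = 0; i < bits.Length; i++)
            if (bits[i] == ones)
                yield return i;
    }

    // Counts the set bits by walking the enumeration above.
    public static int CountOnes(BitArray bits)
    {
        int count = 0;
        foreach (int index in GetBitIndexes(bits, true)) count++;
        return count;
    }

    public static void Main()
    {
        var bits = new BitArray(new[] { true, false, false, false, true }); // 10001
        Console.WriteLine(string.Join(",", GetBitIndexes(bits, true)));  // 0,4
        Console.WriteLine(string.Join(",", GetBitIndexes(bits, false))); // 1,2,3
        Console.WriteLine(CountOnes(bits));                               // 2
    }
}
```

Because the enumeration is lazy, a caller that only needs the first few positions does not pay for scanning the whole array.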

About the Author

Mehdi first started programming when he was 8, on a BBC+128k machine in 6512 processor language. After various hardware and software changes he eventually came across .NET and C#, which he has been using since v1.0.
He is formally educated as a system analyst Industrial engineer, but his programming passion continues.

* Mehdi is the 5th person to get 6 out of 7 Platinum's on Code-Project (13th Jan'12)
* Mehdi is the 3rd person to get 7 out of 7 Platinum's on Code-Project (26th Aug'16)

Thanks for sharing this. Also, the And, Or and Xor methods can be factored into a single doWork function.

The doWork function takes two uint sets having the same size and an operation (&, |, ^ ...). The first set of uints is updated with the value of the operation done on its elements and the elements of the uncompressed set.

The code is more readable and maintainable this way. If the calculation has to be updated, modifying the doWork function suffices, rather than modifying it in three different places.
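The commenter's own code is not reproduced here. A minimal sketch of the idea, with a hypothetical BitOps class and a DoWork signature guessed from the description above, might look like this:

```csharp
using System;

public static class BitOps
{
    // Hypothetical shared helper: applies 'op' to each pair of words
    // and stores the result back into 'target'.
    static void DoWork(uint[] target, uint[] other, Func<uint, uint, uint> op)
    {
        if (target.Length != other.Length)
            throw new ArgumentException("Both arrays must have the same length.");
        for (int i = 0; i < target.Length; i++)
            target[i] = op(target[i], other[i]);
    }

    // Each public operation is now a one-line call into the shared loop.
    public static void And(uint[] target, uint[] other) { DoWork(target, other, (x, y) => x & y); }
    public static void Or(uint[] target, uint[] other)  { DoWork(target, other, (x, y) => x | y); }
    public static void Xor(uint[] target, uint[] other) { DoWork(target, other, (x, y) => x ^ y); }

    public static void Main()
    {
        var a = new uint[] { 0xC, 0xF0 };
        BitOps.Xor(a, new uint[] { 0xA, 0xFF });
        Console.WriteLine(a[0] + "," + a[1]); // 6,15
    }
}
```

Passing the operation as a delegate adds a small per-word call cost compared with three hand-written loops, so whether the trade is worth it depends on how hot these methods are.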