Introduction

This article is version 2 of my previous article found here (http://www.codeproject.com/Articles/190504/RaptorDB).
I had to write a new article because in this version I completely
redesigned and re-architected the original, so it no longer fit
the previous article. In this version I have done away with
the B+tree and hash index in favor of my own MGIndex structure,
which is for all intents and purposes superior, and the performance
numbers speak for themselves.

What is RaptorDB?

Here is a brief overview of all the
terms used to describe RaptorDB:

Embedded: You can use RaptorDB inside
your application as you would any other DLL, and you don't
need to install services or run external programs.

NoSQL: A grassroots movement to replace
relational databases with storage systems that are more relevant
and specialized to the application in question. These systems
are usually designed for performance.

Persisted: Any changes made are stored on
hard disk, so you never lose data on power outages or crashes.

Dictionary: A key/value storage system much
like the implementation in .NET.

Features

Very fast performance (typically 2x the insert and 4x the
read performance of RaptorDB v1)

Extremely small footprint at ~50 KB.

No dependencies.

Multi-threaded support for reads and writes.

Data pages are separate from the main tree structure, so they can
be freed from memory if needed, and loaded on demand.

Automatic index file recovery on non-clean shutdowns.

String Keys are UTF8 encoded and limited to 60 bytes if not
specified otherwise (maximum is 255 chars).

Support for long string Keys with the RaptorDBString class.

Duplicate keys are stored as a WAH Bitmap Index for optimal
storage and speed of access.

Two modes of operation: Flush immediate and Deferred (the
latter being faster at the expense of possible data loss on a
non-clean shutdown).

Enumerating the index is supported.

Enumerating the storage file is supported.

Removing keys is supported.
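To make the embedded, dictionary-like usage concrete, here is a minimal quick-start sketch. The exact factory/constructor name and the `Set` method are assumptions on my part (the `Get` and `Shutdown` calls follow the interface described later in this article); check the shipped source for the precise signatures.

```csharp
using System;
using System.Text;

class QuickStart
{
    static void Main()
    {
        // Hypothetical open call: the real entry point may differ.
        var db = RaptorDB.RaptorDB<Guid>.Open(@"c:\RaptorDbTest\docs", false);

        Guid key = Guid.NewGuid();
        db.Set(key, Encoding.UTF8.GetBytes("hello world"));   // assumed Set(T, byte[])

        byte[] value;
        if (db.Get(key, out value))                           // from the interface section
            Console.WriteLine(Encoding.UTF8.GetString(value));

        db.Shutdown();                                        // close files cleanly
    }
}
```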

Why another data structure?

There is always room for improvement, and the constant need for faster
systems compels us to create new ways of doing things. MGIndex
is no exception to this rule. Currently MGIndex outperforms a B+tree
by a factor of 15x on writes and 21x on reads, while keeping the
main feature of a B+tree structure: its disk friendliness.

The problem with a B+tree

Theoretically a B+tree lookup is O(log k N), that is, log
base k of N. For typical values of k, which are above 200
for example, a B+tree should outperform any binary tree because
it uses fewer operations. However, I have found the following
problems which hinder performance:

Pages in a B+tree are usually implemented as a list or array
of child pointers, so while finding the position for a value
is an O(log k) operation, the insert then has to shift
children around in the array or list, which is time
consuming.

Splitting a page in a B+tree has to fix up parent nodes and
children, so it effectively locks the tree for the duration;
parallel updates are therefore very difficult and have spawned a
lot of research articles.

Requirements of a good index structure

So what makes a good index structure? Here is what I consider the
essential features of one:

Page-able data structure:

Easy loading and saving to disk.

Memory can be freed under memory constraints.

On-demand loading for optimal memory usage.

Very fast insert and retrieve.

Multi-thread-able and parallel-able usage.

Pages should be linked together so you can do range queries
by going to the next page easily.

The MGIndex

MGIndex takes the best features of a B+tree and improves upon
them, while at the same time removing the impediments. MGIndex is
also extremely simple in design, as the following diagram shows:

As you can see, the page list is a sorted dictionary of the first
key from each page, along with the associated page number and page item
count. A page is a dictionary of key and record number pairs.
This format yields a semi-sorted key list: within a page the
data is not sorted, but pages are in sort order relative to each other.
So a look-up for a key just compares against the first keys in the page list to
find the page required, then gets the key from that page's dictionary.

MGIndex is O(log M) + O(1), M being N / PageItemCount
[PageItemCount = 10000 in the Globals class]. This means that
you do a binary search in the page list in O(log M) time and get the value
in O(1) time within a page.
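The lookup path described above can be sketched as follows. This is an illustrative simplification, not the shipped code: the class and member names are mine, and the real implementation tracks page numbers and item counts rather than storing pages directly in the page list.

```csharp
using System;
using System.Collections.Generic;

// Sketch of the MGIndex lookup: a sorted page list maps each page's
// first key to a page, and a page is a plain dictionary of
// key -> record number pairs (unsorted within the page).
class MGIndexSketch<T> where T : IComparable<T>
{
    readonly SortedList<T, Dictionary<T, int>> _pageList =
        new SortedList<T, Dictionary<T, int>>();

    public void AddPage(T firstKey, Dictionary<T, int> page)
    {
        _pageList.Add(firstKey, page);
    }

    public bool TryGet(T key, out int recordNumber)
    {
        recordNumber = -1;
        if (_pageList.Count == 0) return false;

        // O(log M): binary search the page list for the page whose
        // first key is the largest one <= the search key.
        IList<T> firstKeys = _pageList.Keys;
        int lo = 0, hi = firstKeys.Count - 1, page = -1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            if (firstKeys[mid].CompareTo(key) <= 0) { page = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        if (page < 0) return false;

        // O(1): hash lookup inside the page's dictionary.
        return _pageList.Values[page].TryGetValue(key, out recordNumber);
    }
}
```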

RaptorDB starts off by loading the page list and is good to go from there; pages are loaded on demand, based on usage.

Page Splits

When a page gets full and reaches PageItemCount,
MGIndex will sort the keys in the page's dictionary, split the data
into two pages (similar to a B+tree split), and update the page list by
adding the new page and changing the first keys as needed. This ensures
the sorted page progression.
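The split step can be sketched like this (illustrative only, with my own names; the shipped code operates on its internal page and page-list types):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: when a page reaches PageItemCount, sort its keys, keep the
// lower half in place, move the upper half to a new page, and register
// the new page's first key in the page list.
static class PageSplitSketch
{
    public static void Split(
        Dictionary<int, long> fullPage,
        SortedList<int, Dictionary<int, long>> pageList)
    {
        var sortedKeys = fullPage.Keys.ToList();
        sortedKeys.Sort();                  // the sorting cost page splits pay
        int half = sortedKeys.Count / 2;

        var newPage = new Dictionary<int, long>();
        foreach (int k in sortedKeys.Skip(half))
        {
            newPage[k] = fullPage[k];       // move upper half to the new page
            fullPage.Remove(k);
        }

        // Registering the new first key keeps pages in sorted order
        // relative to each other.
        pageList.Add(sortedKeys[half], newPage);
    }
}
```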

Interestingly, the processor architecture plays an important role here,
as you can see in the performance tests, since splitting is directly
related to the key sorting time; the Core iX processors seem to be very
good in this regard.

Interesting side effects of MGIndex

Here are some interesting side effects of MGIndex:

Because the data pages are separate from the page list
structure, locking is easy to implement and is isolated within a
page rather than the whole index; not so for normal trees.

Splitting a page when full is simple and does not require a
tree traversal for node overflow checking as in a b+tree.

Main page list updates are infrequent and hence the locking
of the main page list structure does not impact performance.

The above make the MGIndex a really good candidate for
parallel updates.

The road not taken / the road taken and doubled back!

Originally I used an AA tree, found here (http://demakov.com/snippets/aatree.html), for the page
structures, it being an extremely good and simple structure to
understand. After testing it against the internal .NET
SortedDictionary (which is a Red-Black tree structure), it proved
slower and so was scrapped (see the performance comparisons).

I decided against using SortedDictionary for the pages as it
was slower than a normal Dictionary, and for the purposes of a key/value
store the sorted-ness was not needed and could be handled in
other ways. You can switch to SortedDictionary in the code
at any time if you wish; it makes no difference to the
overall code other than letting you remove the sorting in the page
splits.

I also tried an assortment of sorting routines, like dual-pivot
quicksort, timsort, and insertion sort, and found that they were all slower
than the internal .NET quicksort routine in my tests.

Performance Tests

In this version I have compiled a list of computers which I
have tested on, and below are the results.

As you can see you get a very noticeable performance boost with the new Intel Core iX processors.

Comparing B+tree and MGIndex

For a measure of relative performance of a b+tree, Red/Black tree and MGIndex I have compiled the following results.

Times are in seconds.

B+Tree : the index code from RaptorDB v1.
SortedDictionary : the internal .NET implementation, which is said to be a Red/Black tree.

Really big data sets!

To really put the engine under pressure I did the following tests on huge data sets (times are in seconds, memory is in Gb) :

These tests were done on an HP ML120 G6 system with 12 GB RAM and 10k RAID
disk drives, running Windows Server 2008 R2 64-bit. For a measure of
relative performance against RaptorDB v1, I have also included a 20 million
item test with that engine.

I refrained from running the get test over 100 million records as it
would require a huge in-memory array to store the Guid keys for finding
them later; that is why there is an NT (not tested) in the table.

Interestingly the read performance is relatively linear.

Index parameter tuning

To get the most out of RaptorDB you can tune some parameters specific to your hardware.

PageItemCount : controls the size of each page.

Here are some of my results:

I have chosen 10000 as a good compromise for both reads and writes;
you are welcome to tinker with this on your own systems and see what
works better for you.
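As a sketch of how the tuning might look in code: the article says PageItemCount lives in the Globals class, but the exact namespace and member name here are assumptions, so verify them against the shipped source.

```csharp
// Hypothetical tuning sketch: set the page size before creating or
// opening the database so that new pages are built at the new size.
Globals.PageItemCount = 10000;   // default; try e.g. 5000-20000 per your hardware
```

Smaller pages split (and sort) more often but keep less in memory per page; larger pages do the reverse.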

Performance Tests v2.3

In v2.3, a single simple change of converting internal classes to structs yielded huge performance improvements of 2x+ and at least 30% lower memory usage. You are pretty much guaranteed to get 100k+ insert performance on any system.

Some of the tests above were run 3 times because the computers were in use at the time (not cold booted for the tests), so the initial results were off. The HP G4 laptop is just astonishing.

I also re-ran the 100 million test on the last server in the above list, and here are the results:

As you can see in the above test, the insert time is 4x faster (although the computer's specs do not match the HP system tested earlier) and, incredibly, the memory usage is half that of the previous test.

RaptorDBGuid is a special engine which will MurMur2 hash the
input Guid for lower memory usage (4 bytes as opposed to 16 bytes);
this is useful if you have a huge number of items to store.
You can use it in the following way :
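A hedged sketch of that usage follows; the constructor signature is assumed from the class name in the text, so check the shipped source for the exact parameters.

```csharp
using System;
using System.Text;

class GuidExample
{
    static void Main()
    {
        // Hypothetical constructor: path to the data files is assumed.
        var db = new RaptorDB.RaptorDBGuid(@"c:\RaptorDbTest\hashedguids");

        Guid key = Guid.NewGuid();   // internally MurMur2-hashed to 4 bytes
        db.Set(key, Encoding.UTF8.GetBytes("payload"));

        byte[] value;
        if (db.Get(key, out value))
            Console.WriteLine(Encoding.UTF8.GetString(value));

        db.Shutdown();
    }
}
```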

RaptorDB interface

Get(T, out string)

Gets the key's value into the string output parameter; returns true if the key was found.

Get(T, out byte[])

Gets the key's value into the byte array output parameter; returns true if the key was found.

RemoveKey(T)

Removes the key from the index.

EnumerateStorageFile()

Returns all the contents of the main storage file as an IEnumerable<KeyValuePair<T, byte[]>>.

Enumerate(fromkey)

Enumerates the index from the key given.

GetDuplicates(T)

Returns the main storage file record numbers of the duplicates of the specified key, as an IEnumerable<int>.

FetchRecord(int)

Returns the value from the main storage file as byte[]; used with GetDuplicates and Enumerate.

Count(includeDuplicates)

Returns the number of items in the database index, counting the duplicates also if specified.

SaveIndex()

Allows the immediate save to disk of the index (the engine will automatically save in the background on a timer).

Shutdown()

Closes all files and stops the engine.
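A short sketch tying several of the interface methods above together; it assumes an already-open engine instance `db` and an existing key `someKey` (both hypothetical setup, not shown here):

```csharp
// Walk every record ever written to the storage file (append-only log).
foreach (var kv in db.EnumerateStorageFile())
    Console.WriteLine("{0} -> {1} bytes", kv.Key, kv.Value.Length);

// Fetch all stored versions of one key via the duplicates bitmap.
foreach (int recnum in db.GetDuplicates(someKey))
{
    byte[] data = db.FetchRecord(recnum);
    // process data ...
}

Console.WriteLine(db.Count(true));   // item count including duplicates
db.SaveIndex();                      // force an index flush to disk now
```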

Non-clean shutdowns

In the event of a non-clean shutdown, RaptorDB will
automatically rebuild the index from the last indexed item to
the last inserted item in the storage file. This feature also
enables you to delete the mgidx file and have RaptorDB rebuild
the index from scratch.

Removing Keys

In v2 of RaptorDB, removing keys has been added with the following caveats:

Data is not deleted from the storage file.

A special delete record is added to the storage file, which tracks deletes and also helps with index rebuilding when needed.

Data is removed from the index.
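The caveats above can be seen in a short sketch, assuming an already-open engine instance `db` and a previously stored `key` (hypothetical setup):

```csharp
// Removing a key hides it from the index, but the value bytes stay
// in the append-only storage file.
db.RemoveKey(key);

byte[] value;
bool found = db.Get(key, out value);   // false after removal

// EnumerateStorageFile() will still yield the old record, plus the
// special delete record appended for index-rebuild tracking.
```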

Unit Tests

The following unit tests are included in the source code (the output folder for all the tests is C:\RaptorDbTest ):

Duplicates_Set_and_Get : This test will generate 100 duplicates of 1000 Guids and fetch each one (This tests the WAH bitmap subsystem).

Enumerate : This test will generate 100,001 Guids and enumerate the index from a predetermined Guid and show the result count (the count will differ between runs).

Multithread_test : This test will create 2 threads inserting 1,000,000 items and a third thread reading 2,000,000 items with a delay of 5 seconds from the start of insert.

One_Million_Set_Shutdown_Get : This test will do the above but shutdown and restart before reading.

RaptorDBString_test : This test will create 100,000 1kb string keys and read them from the index.

Ten_Million_Optimized_GUID : This test will use the RaptorDBGuid class which will MurMur hash 10,000,000 Guids, writing and reading them.

Ten_Million_Set_Get : The same as 1 million test but with 10 million items.

Twenty_Million_Optimized_GUID : The same as 10 million test but with 20 million items.

Twenty_Million_Set_Get : The same as 1 million test but with 20 million items.

StringKeyTest : A test for normal string keys of max 255 length.

RemoveKeyTest : A test that removing keys works properly between shutdowns.

File Formats

File Format : *.mgdat

Values are stored in the following
structure on disk:

File Format : *.mgbmp

Bitmap indexes are stored in the following format on disk.
The bitmap row is variable in length and will be reused if
the new data fits in the record size on disk; if not, another
record will be created. For this reason, a periodic index
compaction might be needed to remove unused records left over
from previous updates.

File Format : *.mgidx

The MGIndex index is saved in the following format as shown below:

File Format : *.mgbmr , *.mgrec

The rec file is a series of long values written
to disk with no special formatting.
These values map record numbers to offsets in the
bitmap index file and the docs storage file.


About the Author

Mehdi first started programming when he was 8, on a BBC+ 128k machine in 6512 processor language; after various hardware and software changes he eventually came across .NET and C#, which he has been using since v1.0.
He is formally educated as a systems analyst and industrial engineer, but his programming passion continues.

* Mehdi is the 5th person to get 6 out of 7 Platinums on CodeProject (13th Jan'12)

We have been using RaptorDB KV v2 for string lookups with Guids as keys (so as a Dictionary).
To give you a bit of context; the strings are actually complex objects that we create/calculate during the night and then serialize to a string.
Basically a typical scenario is that a user does a search query which yields a set of guids from a SQL query (potentially > 100 000). With these guids we do the lookups in raptordb kv to get the appropriate objects, still serialized as strings.
I noticed that the first time we do a lookup for a set of guids (say, 20 000 different guids) it can sometimes take 2 minutes to get all of the strings (and I notice there is a lot of I/O going on during this entire period). If I rerun the same query for those 20 000 guids after it, I get them near instantly (0.5 sec for all 20 000).

Does this sound familiar to you at all, and do you perhaps have any tips to avoid the "cold query" performance hits? I haven't dug into the code just yet but I figured it wouldn't hurt to ask first

Also just out of curiosity; do you plan to release any updates to this project in the future?

Firstly, RaptorDB has to load index pages from disk into memory; for Guids these are random in nature, so you end up loading a lot of pages into memory (keys are sorted, but the random nature of Guids touches different pages). For 20,000 keys you will have at most 3 pages (each page has a capacity of 10,000 but will on average be filled to 65-80%), so the loading of the pages should be very quick, but getting each value will involve a disk seek for the data.

2 minutes seems a bit much, since 10 million key seeks take around the same time on the test systems I've worked with (although the values were simple 16-byte items).

There are a few enhancements that I have planned to post back to the KV version from the doc version, hopefully time allowing.

Hmm. If the actual size of the values potentially has an impact on the disk I/O, that could be the cause.
A bit of metrics about the store:
425,632 key/value pairs.
All the values are base64 strings of at least 1500 chars, though most of them weigh in at around 2000 to 2600 characters.

If you are interested I'd be happy to try and make you a self contained sample based on my store.

Hi
I isolated a reproducible scenario in my application where retrieving ~16,000 guids took more than 100 seconds the first time, but after that the same retrieval would only take a few hundred milliseconds.
This is running inside IIS; when I recycled my application pool, the first call would most of the time (but not always!) take the mentioned 100 seconds, and just be VERY quick afterwards again.
I ran a profile for one of the times where it took so long and this is how it looks: https://dl.dropboxusercontent.com/u/17011386/raptordbprofile.png
--> you can see it spends most of its time in actual I/O

Anyway, I used the same store in a unit test where I retrieved the same set of 16,000 guids, but for whatever reason I cannot reproduce the "cold hit" performance penalty I suffer from in the actual application... The longest it takes in the unit test is around 2 sec the first time, but on subsequent runs I get the same figures as the normal times in the actual application (a few hundred ms for the 16,000 guids).
Maybe it has to do with restrictions that are put on the IIS application pool the application runs in? I'll need to investigate further on my end, but it seems to be a non-RaptorDB issue after all...

PS: these are the test results: (1 unit test retrieves the 16000 kv pairs three times)
First time: 00:00:01.2366249
Second time: 00:00:00.3089830
Third time: 00:00:00.2779498

--> I notice that after running this test for the first time, a new file is added next to my store, called: "store.0001.2014-10-01"

After that when I run the unittest again, the difference between first and second time is actually smaller
First time: 00:00:00.3292366
Second time: 00:00:00.2767409
Third time: 00:00:00.2796290

I'm not a web guy, and I've had problems getting to grips with IIS and its pools and application life cycle, or lack thereof, since IIS decides to kill your app whenever it likes (bad for an embedded database engine which has stuff in memory and, at least in my case, didn't get a heads up to flush to disk!).

I have tried methods like RemoveKey and Delete, but it is not working. --- EnumerateStorageFile() always returns all versions of keys and values.

Then I assume that:
1. RaptorDB gains very, very high speed in indexing & querying. But one coin always has two faces: RaptorDB is not suitable for a highly dynamic data environment ---- changing values frequently.
2. But in an application with a relatively static data environment ---- where only a small amount of data changes ---- we can rebuild the whole database periodically ----- to reduce the size.

RaptorDB's storage file is append-only, so it will always grow, even for the same key items.

EnumerateStorageFile() will give you whatever has happened, like a log [of changes].

1. Yes, currently RaptorDB does not have in-place [storage-wise] change support (you would lose history if it did).
2. "Compactification" of the storage file is a bit tricky and requires "application" knowledge [what to keep and what to lose].

Currently there is no other way to clean up, and it will be a "stop the world" event (at least partially, to switch data files).

Side note:

Strings and objects are tricky beasts since they do not have a fixed length, and updates will create a lot of unused space which you have to manage. It may be possible to have in-place updates (storage-wise) for fixed-length data types like int, long, etc. (and lose history too), but that is a whole other can of worms.