Abstract

There is no shortage of collection classes in the .NET Framework. I would suggest that the reason for this is because no one type of collection is good at everything. Each specializes in a certain task and addresses a certain requirement (or limitation) of computing. In the following article I'll present an implementation of a list style collection called a "gap buffer" that has excellent performance when inserting or removing items at or near the same relative index.

One of the rules of programming that I've tried to live by over the years is that whenever possible—use the simplest tool for the job. Start with what's available and has been proven to work before writing your own. With the wealth of classes and objects supplied in the .NET Framework, I'm hard pressed to ever encounter a need that hasn't already been addressed. However, occasionally that happens and it happened to me recently when I was using a List<T> to manage a large collection of objects.

The List<T> is a fine piece of engineering and in nine out of ten cases it's exactly what I need, but on a recent project I hit a wall. Have you ever tried to perform numerous insert or remove operations on a large List<T>? Depending on the number of elements being managed, the performance can range from "acceptable" to "my grandmother is more likely to finish a game of Risk™ first". This led me to investigate different types of collections and memory management techniques that would yield better performance when exchanging elements in the middle of the collection.

The "gap buffer", as I'll explain shortly, was developed to address this exact need. I didn't invent it and I'm not sure who did. I do know that it has been around for a long time as most well engineered things are. It became clear that this was the collection type that I was looking for, but look as I may, I couldn't find a .NET implementation of one. From that investigation was born the GapBuffer<T> class; a .NET collection that implements the principles of a gap buffer and is a drop in replacement for a List<T> when repetitive insert and remove performance is critical.

To understand how the generic GapBuffer<T> works, it's helpful to first have a clear understanding (or review) of its nearest cousin: the aforementioned List<T>.

The List<T> collection is what you might call a "dynamic array". A student of computer science will note that the words "dynamic" and "array" don't usually go together. When an array is created, the size of the array must be specified up front. Once the appropriate amount of memory has been allocated, the size of the array cannot change. The reason for this is because as an operating system dishes out a memory request, it will very likely have memory allocated to a different purpose just before and just after the memory block given for the array. Block after block, like a row of books on a bookcase, the objects in memory are given their appropriate space. When that array reaches its limits and needs to expand, the chances of its being empty space on either side are practically zero and there's simply no adjacent space available. It's like if you were to remove the paperback version of 'The Legend of Sleepy Hollow' from a full bookshelf and try to replace it with the hardback version of 'War and Peace', it simply wouldn't fit. The only solution is to put 'War and Peace' somewhere else and that's exactly how the List<T> works. Ignoring for a moment some of the more specific mechanics of a managed collection we can draw a simple analogy. Consider the following sentence fragment:

In this example we'll pretend that the amount of total space is akin to the size of an array and the words "In a hole in the ground" occupy some of that space. We'll also pretend that, like an array, we decided how much space we needed up front and, as you can see, we have a little extra left over. Seeing that we have a little extra room we'll add a few more words:

Letter-by-letter the available space begins to fill until eventually the limit is reached. Like an array that's been filled to capacity there's no more room. In that case we need to create a new space. Once the operating system allocates us some more memory and the new space has been created, the existing contents can be copied over and the phrase completed:

It may look like we simply added more space but as I stated before this is not possible. We had to create an entirely new space and copy what we had in step two into step three. The process of filling the available space and then making new space is repeated as more and more words are added. Hence, our example is a bit like a "dynamic array" or List<T>. If you were to run Reflector and peer inside of the List<T> class, you would see that it's really only a fancy wrapper around a run-of-the-mill array. What makes it so useful is that all the work of jockeying around the memory required to maintain the contents as it grows (or shrinks) is automatically handled for you. Below is a basic diagram of the essential components of the List<T> class:

As you can see, the Count property of the List<T> is not the actual size of the internal array but rather how much of the internal array is being used. What the consumer of the collection sees is just a subset of the internal array. As more items are added to the List<T>, the Count property is updated to reflect the actual amount of the internal array being used. When the internal array is filled to capacity a new, larger array must be made; the contents of the current internal array are then copied into the new array; the new array then becomes the internal array for the List<T> and the old one is discarded. By all accounts this technique is quite effective. While there is some overhead incurred when the internal array reaches capacity and a new one must be created, there is little to no work necessary to add new items into the empty portion of the internal array. This tradeoff is acceptable because it allows us not having to state in advance how much space we're going to need. It will grow as necessary.

Like Kryptonite to Superman, every superhero or collection has its weakness and the List<T> is no different. While appending items to the end of the list requires very little work, inserting items somewhere in the middle of the list requires a great deal of work. The reason is because adding an item to the end of the list means simply placing it in the next unused index and increasing the Count. However, there is no unused space in the middle of a List<T>. Back-to-back, shoulder-to-shoulder the items in a List<T> are lined up. So how are we going to wedge another item in? The only way is to copy the entire contents of the internal array after the point in question over one space. There's no way around it. Every time we want to place a new item somewhere in the middle of the array, everyone after that point has to shift over to make room for the new guy. This effect gets worse as the number of the items in the List<T> increases and the insertion point gets closer to the beginning of the collection. It's a little like cutting in a line of New Yorkers waiting to see a Yankees game. You cut in front of one or two guys at the back of the line and you might not get hurt, but cutting in front of a couple hundred (or thousand) and the game isn't all that you'll be missing.

A gap buffer attempts to overcome the overhead of inserting items into the middle of the collection by placing the empty portion of the array in the middle instead of at the end; hence the name "gap buffer" referring to the "gap" in the middle of the "buffer". To be more specific, the gap itself is not fixed to any one position at all. At any time it could be in the middle of the buffer or just as easily at the beginning, the end, or anywhere between the two. Here's what the GapBuffer<T> looks like:

As you can see, a gap buffer uses the same pattern of wrapping an internal array, but in this case the unused portion of the array is not fixed at one end. The job of the collection is to combine the used portions of the internal array into one continuous view of the contents. In purpose and in practice it works very much like a List<T>. New items are inserted into the empty portion of the internal array and when the array is filled to capacity, a new one is made. However, as previously noted, when an item is inserted into the beginning of a List<T>, all the items after that point need to be moved down each time this operation is performed because a List<T> keeps all the empty portion of the internal array at the end. In a GapBuffer<T> this shifting of contents to make room still needs to happen—but only once. After the insertion point is established and the gap is moved, the empty portion of the internal array will now start at that point. All subsequent insertions to the same position require no extra work at all because there is now ample space at that position to insert the new items. Using the same illustration we used with the List<T>, the GapBuffer<T> works a little like this:

The gap of unoccupied space is located in the middle (or in this case, closer to the beginning) of the block allocated for the phrase. While a sentence like this would obviously look a little strange with so much space in the middle, a gap buffer works by combining the contents so that it appears as though it's one continuous block of text and would read "In the ground lived a hobbit." Now it's a snap if we wanted to insert some text after the word "in". Instead of having to relocate a huge portion of the contents, the gap is already at the insertion point and more text can be filled in:

So long as the unused portion of space is where we want life is great. No extra work was necessary to make room for the newly added text. To then insert some text elsewhere in the phrase, the gap needs to be moved. Isn't this the same problem the List<T> had? Well yes, but to a much lesser extent. If we wanted to place some text right after the word "ground" the gap only needs to move a few spaces from where it's currently located. If we were to take our simple example and expand it to say we have the entire text of 'The Hobbit' book following this first sentence, the methodology of a List<T> would buckle trying to move over everything to place some new text here. But with a gap buffer, only a few items are moved:

Once the gap has been placed, letter by letter, the new text can be inserted without ever having to incur any more overhead:

As with a List<T>, the process of filling the empty space continues until at which point, the buffer becomes full and the contents must be copied into a new, larger buffer. The work of the collection is to make this transparent to the consumer. While there may be a big gaping hole in the middle of the array when accessing the collection it appears as though it's one continuous phrase or block.

Note: The simple examples above are not intended to advocate that either a List<T> or a GapBuffer<T> should be the structure of choice when storing and manipulating text and are given only by way of analogy. Traditionally a gap buffer is well suited to this purpose. However, a discussion about string handling in .NET and the use of a gap buffer for storing text is outside the scope of this article.

One can immediately see the potential benefits of this approach. If all we ever wanted was to add items to the end of the collection (or internal array) we would keep the empty portion at the end, as does the List<T>. If in another case we knew that we might want to insert items at the beginning of the collection we would prefer to have the empty portion of the array at the beginning. Also if we wanted to insert in the middle of the collection we would want the gap to be in the middle and so on. The gap buffer makes this possible by not fixing the unused portion of the internal array to any one place, but rather allowing it to adjust to the most optimal place.

At worse case we have to move the gap from the very end of the buffer to the very beginning in which case we would incur the same performance penalty that a List<T> does, but if the next operation were somewhere in the middle we would only have to shift half the contents of the internal buffer. Better still, what if we performed a subsequent operation only a few indexes over from the previous, or maybe even inserted an item at the same index as last time? In that case the gap wouldn't need to shift more than a few items or even not at all. We would simply be filling up the unused space of the gap and none of the performance penalty associated with shifting over the contents of the internal buffer would be incurred. This is why it's said that a gap buffer offers increased performance when repetitious operations occur at or near the same relative index. The closer we are to the last operation, the less overhead there is in moving the gap.

To implement a GapBuffer<T> there are a couple bits of information that we need to keep track of. Whereas the List<T> only needs to maintain a variable that represents the actual count (or rather, where the used part of the internal array ends and the unused portion begins), the gap buffer needs two variables: one to specify the index of where the gap starts and another to specify where the gap ends. The total size of the gap can then be found by subtracting where the gap ends from where it begins. A simple example of this is the Count property which is calculated by subtracting the size of the gap from the total capacity of the internal buffer; what's left over is how much of the internal array is being used:

The real magic of maintaining the internal array in the GapBuffer<T> collection comes from the private PlaceGapStart method. Before each modification to the collection, the PlaceGapStart method is called to move the gap into the insertion (or deletion) position. This will make sure that the available space is where we want to make changes. To move the gap is simply a matter of moving the items of the internal array towards where the gap is located, thereby closing it up and opening a gap in the new location. Like when Doc. Brown is trying to stretch the wire from the clock tower to the light post at the street below; when he pulls the connection closed at one spot it opens up at another. Fortunately for us, this is the effect we want and it is accomplished programmatically like this:

There's a little bit of math involved here, but don't be intimidated by it. Some basic logic determines which way the gap needs to move (if at all), and then some simple calculations determine how many items we need to copy. Once we copy the items that need to move from one side of the gap to the other we also must clear any portion of the gap that has lingering copies of items by calling Array.Clear. If we didn't there might still be copies of items within the gap, and even though they would not be visible to the consumer of the collection, that reference would be enough to keep them from being garbage collected. Once all the items that are affected by the shifting of the gap are copied to their new position, the _gapStart and _gapEnd variables are updated to represent the new gap position. With the hard work of writing the PlaceGapStart method in place, inserting an item into the collection becomes a breeze:

First, the gap is moved to the point of insertion using the PlaceGapStart method. Then we ensure that there is room enough for the new item (and if not, create a new, larger internal array). Once those checks and balances have been done, we can set the item at the specified index to the new value and increase the gap start. By increasing the gap start by one, we're also shrinking the total gap size by one and, in effect, saying that the gap now begins after the item just inserted. The benefits of the gap buffer are realized when we can continually insert new items at the start of the gap and there is no need to shift the contents of the internal buffer or increase its size. Removing an item is even easier:

By placing the gap just before the item to remove and then increasing the size of the gap by one, we basically expand the gap right over the item to be removed. To ensure that the removed item is a candidate for garbage collection we must also reset the item at that position to the generic type's default.

All other common collection functions are done in a similar manor. Most of the time these operations are done in two phases: once for the content before the gap and once again for the content after. Sometimes all that is needed is a little arithmetic as is the case when implementing the iterator. By making a simple adjustment to the index we can have it point to the contents before or after the gap. Any time the index is greater than or equal to _gapStart we know that we must account for the gap and add its size to the desired index. Once that adjustment has been made, it's business as usual and all this takes place unbeknownst to the collection consumer:

To ensure that theory translates into practice I've written a simple performance test to determine the real gains of using a GapBuffer<T> versus a List<T>. I measured the time it took to insert or remove a number of items either at random or at a fixed position (index zero). This should represent the full spectrum of performance gain a GapBuffer<T> can offer. I then did the same for a List<T> and compared the results. In every test the number of items either removed or inserted was 200,000. Here are the results:

Inserting Items Performance Test

Test

List<int>

GapBuffer<int>

Performance Gain

Inserting at random index:

7,985ms

7,459ms

7%

Inserting at fixed index (zero):

16,144ms

35ms

460%

Removing Items Performance Test

Test

List<int>

GapBuffer<int>

Performance Gain

Removing at random index:

8,700ms

7,458ms

17%

Removing at fixed index (zero):

16,360ms

2ms

8179%

As you can see, the performance gain ranges from marginal to dramatic! It also demonstrates the conclusions we made earlier: the more items that need to be shifted around, the more work is necessary and consequently, the more time it will take. Inserting or removing at the first index in a collection represents the absolute extreme of this principle, causing maximum impact on the List<T>. On the other hand the GapBuffer<T> excels at this task. Each time the List<T> needs to insert an item at index zero, every other item in the collection must move and the performance will get worse and worse as the collection grows. With the GapBuffer<T> we know that the gap will arrive at index zero during the first operation and stay there for every subsequent operation, requiring no extra work to continually add items at that index. Even when the index is not consistent and is randomly located, we can see that the GapBuffer<T> offers slightly better performance because it's probable that when inserting items at random, the next index of operation may be nearby and therefore not require a large amount of items to be moved. Removing items works in almost the same way and the results are naturally similar.

Real world performance will be somewhere between the random and index specific tests. When used as a substitute for a List<T>, where there is little repetition in the index used, a very modest performance gain will still be realized. If on the other hand, an application plays to the strengths of the GapBuffer<T> an enormous performance gain can be realized. Every collection has its strength and the GapBuffer<T> makes quick work of repetitive tasks.

In all other common collection operations (such as appending and iterating) I found the List<T> to perform slightly better than the GapBuffer<T>. This is not surprising because, however small it is, there is a little more overhead associated with maintaining the gap in the middle of the array. There are additional method calls to check the placement of the gap, perform adjustments on its location, and to work around it when iterating through the collection. That being said, I didn't find that performance loss to be significant nor did that performance loss outweigh the benefits that the GapBuffer<T> offers. That leads me to conclude that unless one is absolutely sure all work on the collection will be appending and iterating, it's more likely that the GapBuffer<T> will be a better choice, even if the number of insert and remove operations is comparatively low.

I've included the complete source for my implementation of a generic gap buffer along with the project I used to do performance testing. I've taken great care to unit test the code to ensure that it runs perfectly. As I mentioned above, it should be a drop-in replacement for most cases where a List<T> is currently being used. It should go without saying that the source code requires .NET 2.0 and greater to compile. There is no restriction on the use of the code and I only ask that appropriate recognition be given to the author.

At the beginning I stated that the reason there are so many collections available to .NET programmers is because there is no one collection that is perfect in everything. Each has been designed for a specific need and so has the GapBuffer<T>. With it comes some added internal complexity when compared to a List<T> but also offers dramatic performance increases when used for what it's been designed. Should the need ever arise where inserting into a random or repetitive place in a collection becomes frequent, I would highly recommend considering the GapBuffer<T>.

Version 1.1 - Modified the code and article to properly clear the gap of copied items and ensure they are subject to garbage collection.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Comments and Discussions

In this article you talk about a gap buffer with only one gap, in the middle, but it occurs to me that it should be pretty easy to extend the buffer to support a second gap at the end. That way, if you remove an item from the end, a special end-gap appears and the "main" gap, in the middle, does not have to move. If the user later adds an item at the end, the end-gap disappears again. If the user adds an item at the end when there is no end-gap, it's easy to enlarge the array to create an end-gap. This would be an efficient way to cover a scenario in which some changes occur at the end of a list, while others occur in the middle.

I have a strong suspicion two gaps would be useful, although I can't actually think of a real-world scenario for it. At least this feature would allow GapBuffer to consistently match List's performance for end-inserts.