Introduction

Suppose you have to perform a case-insensitive text search over a Unicode string. You probably know that the Boyer-Moore algorithm is one of the most efficient general-purpose algorithms for such a task. This article shows how to implement the Boyer-Moore algorithm for Unicode in C#.

Background

The efficiency of a search algorithm can be estimated by the number of symbol comparisons it takes to scan the whole text for a pattern match. The Boyer-Moore algorithm is efficient because it avoids some unnecessary comparisons and produces longer shifts of the pattern along the text. With Boyer-Moore, it is also easy to perform efficient case-insensitive searches.

I will not go into the details of the Boyer-Moore algorithm, as you can find them elsewhere. It is interesting to note, however, that there are several implementations of the algorithm's main idea that differ in complexity. Simple implementations that use one-dimensional shift tables are not very efficient when searching for complex patterns that contain several repeating sub-patterns. The implementation I show here uses a two-dimensional shift table.

The best way to explain an algorithm is to explain the data structures it uses, and the best way to explain a data structure is to demonstrate it with a simple example. The following is a simple example of the two-dimensional shift table for Boyer-Moore. Suppose we search a string that consists only of the characters a, b, c, d, e (in any number and sequence) for the pattern "abdab" (note that the sequence "ab" occurs twice in this pattern). The shift table for the pattern "abdab" would look like this:

The number of columns in the table is equal to the number of characters in the pattern, and the number of rows is equal to the number of characters in the charset. As comparison in Boyer-Moore starts from the last character of the pattern, the table is built from right to left. Each table cell contains the value by which the pattern should be shifted, given that all the previously compared pattern characters (those to the right of the current character) match the corresponding text characters. When we compare pattern characters to substring characters from right to left, we find the shift value in the cell whose column corresponds to the pattern character and whose row corresponds to the substring character.

The shift values are calculated for every character that may occur in the string being searched. You may have noticed that zero shifts occur in the cells where the pattern character matches the charset character. This means that while the pattern matches the substring, the pattern shouldn't be shifted. If we scan the table from right to left and get a zero shift in every column for the corresponding substring characters, we have found a substring that matches the pattern.

Other shift values are calculated so as to produce maximum shifts without skipping sequences that might be parts of a matching substring. The shift of five, for example, is the "full" shift of the five-character pattern, and the shift of three reflects the fact that the last two characters of the pattern coincide with the first two (so if the last two characters of the substring match the pattern while others do not, we shift the pattern by three, and not by five, because these two characters may be the beginning of a matching substring). From a broader perspective, we may view the shift table as a representation of a finite automaton whose states are the values stored in the table.
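To make the rules above concrete, here is a minimal sketch that computes a two-dimensional shift table for a small charset. It is not the code from the demo project: the class name ShiftTableDemo and the methods Build and Shift are hypothetical, and the rule used for non-zero cells (take the smallest shift that does not contradict any text character already known) is an assumption that is consistent with the description above but may not match the original table cell for cell.

```csharp
using System;

// Hypothetical illustration class; not part of the article's demo project.
public static class ShiftTableDemo
{
    // Returns table[row][column]: row is an index into charset,
    // column is an index into pattern.
    public static int[][] Build(string pattern, string charset)
    {
        var table = new int[charset.Length][];
        for (int r = 0; r < charset.Length; r++)
        {
            table[r] = new int[pattern.Length];
            for (int j = 0; j < pattern.Length; j++)
                table[r][j] = Shift(pattern, j, charset[r]);
        }
        return table;
    }

    // Shift for column j when the text character under it is c and all
    // pattern characters to the right of j have already matched.
    public static int Shift(string pattern, int j, char c)
    {
        if (pattern[j] == c) return 0;           // match: do not shift
        int m = pattern.Length;
        for (int s = 1; s < m; s++)
        {
            // The pattern moved right by s must agree with every text
            // character we already know (positions j .. m-1).
            bool consistent = true;
            for (int i = Math.Max(j, s); i < m && consistent; i++)
            {
                char known = (i == j) ? c : pattern[i];
                consistent = pattern[i - s] == known;
            }
            if (consistent) return s;
        }
        return m;                                // "full" shift
    }
}
```

Calling ShiftTableDemo.Build("abdab", "abcde") under these assumptions gives the rows 0 3 3 0 1 for a, 3 0 3 5 0 for b, 3 3 3 5 5 for c, 3 3 0 5 2 for d, and 3 3 3 5 5 for e. The zero cells, the full shift of five, and the shift of three all appear as described above.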

It is easy to build a shift table with the number of rows equal to the number of charset characters if the charset is single-byte. For the Unicode charset, however, such a table would be very large and inefficient. The problem is easy to solve once we notice that table rows differ from each other only for the characters that actually appear in the pattern. In the table for "abdab", the rows for the characters c and e contain the same set of shift values, and this set would be valid for any other character that doesn't occur in the pattern. So we can store the table rows corresponding to characters found in the pattern in some fast-access structure, say a hash table, and keep a separate set of shift values for all other characters that might appear in the string.
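The sparse layout described here might be sketched as follows. The class name SparseShiftTable and its members are hypothetical, a generic Dictionary<char, int[]> stands in for whatever hash table the demo uses, and the shift rule (smallest shift that does not contradict the known text characters) is an assumption rather than the author's exact construction.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration class; not part of the article's demo project.
public sealed class SparseShiftTable
{
    // One row per distinct pattern character.
    private readonly Dictionary<char, int[]> patternCharShifts =
        new Dictionary<char, int[]>();
    // One shared row for every character that is not in the pattern.
    private readonly int[] otherCharShifts;

    public SparseShiftTable(string pattern)
    {
        foreach (char c in pattern)
            if (!patternCharShifts.ContainsKey(c))
                patternCharShifts[c] = BuildRow(pattern, c);
        otherCharShifts = BuildRow(pattern, null);
    }

    // Number of rows actually stored, including the shared row.
    public int RowCount => patternCharShifts.Count + 1;

    public int GetShift(char textChar, int column) =>
        patternCharShifts.TryGetValue(textChar, out int[] row)
            ? row[column]
            : otherCharShifts[column];

    private static int[] BuildRow(string p, char? c)
    {
        var row = new int[p.Length];
        for (int j = 0; j < p.Length; j++) row[j] = Shift(p, j, c);
        return row;
    }

    // c == null stands for "any character that is not in the pattern".
    private static int Shift(string p, int j, char? c)
    {
        if (c == p[j]) return 0;
        for (int s = 1; s < p.Length; s++)
        {
            bool consistent = true;
            for (int i = Math.Max(j, s); i < p.Length && consistent; i++)
                consistent = (i == j) ? c == p[i - s] : p[i] == p[i - s];
            if (consistent) return s;
        }
        return p.Length;
    }
}
```

For the pattern "abdab" this stores four rows (a, b, d and the shared row) regardless of how large the charset is, so a lookup for any Unicode character costs one hash probe.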

The code

Now that we have discussed how the shift table is built and how it works, let's have a look at the code that builds it. The demo project contains a BMSearcher class that performs the actual search. The class's constructor takes a pattern and builds a shift table for it.

The constructor stores the table in a PatternCharShifts object, while shift values for characters not present in the pattern are stored in an OtherCharShifts array. The PatternCharShifts member is declared protected; the reason for this will be shown later. BMSearcher also has a GetTable method that returns the string representation of the table built by the constructor (the rows in the table are separated by line breaks). This method is used in the attached demo.

The code shown above uses System.Collections.Hashtable for storing shifts, as I did in the first version of the demo. I have since reimplemented the demo with a custom hash table class, BMHashTable, to avoid the performance penalty caused by boxing the arguments. It doesn't make any difference to the algorithm itself.

Let's now look at the Search method that does the actual search. This method takes two arguments: the string to search and the position in that string where the search should start. The method returns the index of the first matching substring at or after Pos. The index is relative to the beginning of the string. The method returns -1 if no match is found.
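A search routine with the contract described above might look roughly like the sketch below. It is not the BMSearcher source: the class name SketchSearcher is hypothetical, Dictionary<char, int[]> replaces the demo's hash table, and the table construction is an assumed rule (smallest shift consistent with the characters already compared). Only the member names PatternCharShifts and OtherCharShifts and the Search(text, pos) contract are taken from the text.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration class; not the BMSearcher from the demo project.
public class SketchSearcher
{
    protected readonly Dictionary<char, int[]> PatternCharShifts =
        new Dictionary<char, int[]>();
    protected readonly int[] OtherCharShifts;
    protected readonly int PatternLength;

    public SketchSearcher(string pattern)
    {
        if (string.IsNullOrEmpty(pattern))
            throw new ArgumentException("Pattern must not be empty.", nameof(pattern));
        PatternLength = pattern.Length;
        foreach (char c in pattern)
            if (!PatternCharShifts.ContainsKey(c))
                PatternCharShifts[c] = BuildRow(pattern, c);
        OtherCharShifts = BuildRow(pattern, null);
    }

    // Returns the index of the first match at or after pos, or -1.
    public int Search(string text, int pos)
    {
        if (pos < 0) throw new ArgumentOutOfRangeException(nameof(pos));
        int start = pos;
        while (start <= text.Length - PatternLength)
        {
            int shift = 0;
            // Walk the columns right to left until a non-zero shift appears.
            for (int j = PatternLength - 1; j >= 0 && shift == 0; j--)
            {
                int[] row;
                if (!PatternCharShifts.TryGetValue(text[start + j], out row))
                    row = OtherCharShifts;
                shift = row[j];
            }
            if (shift == 0) return start;   // every column gave zero: match
            start += shift;
        }
        return -1;
    }

    private static int[] BuildRow(string p, char? c)
    {
        var row = new int[p.Length];
        for (int j = 0; j < p.Length; j++) row[j] = Shift(p, j, c);
        return row;
    }

    // c == null stands for "any character that is not in the pattern".
    private static int Shift(string p, int j, char? c)
    {
        if (c == p[j]) return 0;
        for (int s = 1; s < p.Length; s++)
        {
            bool consistent = true;
            for (int i = Math.Max(j, s); i < p.Length && consistent; i++)
                consistent = (i == j) ? c == p[i - s] : p[i] == p[i - s];
            if (consistent) return s;
        }
        return p.Length;
    }
}
```

With this sketch, new SketchSearcher("abdab").Search("xxabdabxx", 0) returns 2, and searching "abcab" returns -1. Because Dictionary<char, int[]> is generic over char, the lookup does not box the key, which is the same concern the custom hash table addresses.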

The BMSearcher class performs a case-sensitive search. Implementing a case-insensitive search with the Boyer-Moore algorithm is quite easy - all we have to do is modify our shift table. We need to add new rows to the table so that it contains rows representing the pattern characters in both upper and lower case. For example, if we have the pattern "Abc", we build the table as described above and then add three rows for a, B, and C to implement the case-insensitive search. Since the shift values should be the same for the same characters in different cases, we can build a table for the case-sensitive search and then simply copy rows for the case-complement characters. If the shift table is modified this way, we can use the same search routine for the case-insensitive search.

This technique is implemented in a class CIBMSearcher derived from BMSearcher. Here is the constructor for the class:

This constructor calls the base constructor for building case-sensitive Boyer-Moore table and then, if CaseSensitive value is false, adds new rows to the table.
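The row-copying idea can be sketched as below. This is not the CIBMSearcher source: the class name CaseInsensitiveSketch is hypothetical, and it makes two assumptions of its own. First, the pattern is folded to lower case before the table is built, so that a pattern containing both cases of one letter cannot produce two conflicting rows. Second, only simple one-to-one case mappings (char.ToLowerInvariant and char.ToUpperInvariant) are handled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration class; not the CIBMSearcher from the demo project.
public sealed class CaseInsensitiveSketch
{
    private readonly Dictionary<char, int[]> rows = new Dictionary<char, int[]>();
    private readonly int[] otherRow;
    private readonly int length;

    public CaseInsensitiveSketch(string pattern)
    {
        if (string.IsNullOrEmpty(pattern))
            throw new ArgumentException("Pattern must not be empty.", nameof(pattern));
        // Assumption: fold the pattern to one case before building rows.
        var folded = new char[pattern.Length];
        for (int i = 0; i < folded.Length; i++)
            folded[i] = char.ToLowerInvariant(pattern[i]);
        string p = new string(folded);
        length = p.Length;

        foreach (char c in p)
            if (!rows.ContainsKey(c)) rows[c] = BuildRow(p, c);
        otherRow = BuildRow(p, null);

        // Copy each row to its case-complement character (shared array).
        foreach (char c in new List<char>(rows.Keys))
        {
            char upper = char.ToUpperInvariant(c);
            if (upper != c && !rows.ContainsKey(upper)) rows[upper] = rows[c];
        }
    }

    // Same right-to-left scan as in the case-sensitive search.
    public int Search(string text, int pos)
    {
        if (pos < 0) throw new ArgumentOutOfRangeException(nameof(pos));
        for (int start = pos; start <= text.Length - length; )
        {
            int shift = 0;
            for (int j = length - 1; j >= 0 && shift == 0; j--)
            {
                int[] row;
                if (!rows.TryGetValue(text[start + j], out row)) row = otherRow;
                shift = row[j];
            }
            if (shift == 0) return start;
            start += shift;
        }
        return -1;
    }

    private static int[] BuildRow(string p, char? c)
    {
        var row = new int[p.Length];
        for (int j = 0; j < p.Length; j++) row[j] = Shift(p, j, c);
        return row;
    }

    // c == null stands for "any character that is not in the pattern".
    private static int Shift(string p, int j, char? c)
    {
        if (c == p[j]) return 0;
        for (int s = 1; s < p.Length; s++)
        {
            bool consistent = true;
            for (int i = Math.Max(j, s); i < p.Length && consistent; i++)
                consistent = (i == j) ? c == p[i - s] : p[i] == p[i - s];
            if (consistent) return s;
        }
        return p.Length;
    }
}
```

Folding first matters for patterns such as "Membership", where M and m both occur: if the table were built on the original pattern, the rows for M and m would already exist with different shift values, and copying one over the other would lose information.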

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

The Boyer-Moore algorithm requires setup. That's part of the trade-off. If you are searching smaller strings, I wouldn't bother. If, instead, you are writing an application that may be searching potentially megabytes of data (as my application does), then Boyer-Moore is fantastic! I'm looking at updating it to deal with Unicode characters.

Put the word, Membership (upper case M in position 1) in textbox1.
Put the word, membership (lower case m in position 1) in the richtextbox.
Leave the case-sensitive checkbox unchecked.
Click the find button.

My assumption is that it would find the all lower case word membership and highlight it in the richtextbox, but it doesn't. This particular word seems to cause problems, I've tried several others but they all work correctly.

Can anyone verify that they are seeing the same behavior? If so, does anyone have an idea of how to fix it?

You can do it much the same way as for case-insensitive search. Suppose you look for "forme" in French and want the searcher to find "formé" as well. You let the searcher build the shift table for "forme" and then add a row for the é character with the same shift values as for e.
Regards, LeSeul

Your specialized algorithm for Unicode strings is indeed very efficient. Unfortunately, your implementation leaves a few things to be desired as far as 'efficiency' goes.

In particular, the fact that you box every character you compare is a huge performance hit, in terms of both speed and memory.

Boxing (wrapping a value type in an object 'box' so classes like Hashtable can use it) allocates a new object on the heap. Not only does this take time, but it also uses up 12 to 16 bytes for every comparison. The memory is released immediately. But it still brings the next garbage collection that much closer, and it flushes part of the cache.

You can get a 300%-500% improvement if you move from a general Hashtable to a specialized version with char keys.

Hi Jeffrey
Thanks for your comment.
Actually I knew that boxing is an issue but I concentrated on the algorithm itself and didn't want to make an example complex...
But Einstein was right so I implemented custom HT. I've already sent an update to the CP team, should appear here soon.