There are many times when programmers need to search for a substring, for example when parsing text. This is commonly referred to as searching for a needle (substring) in a haystack (the string to search in). The most straightforward way to do this is by using search functions that your language provides:

C: strchr()/memchr(), strstr()/memmem()

C++: string::find()

Ruby: String#index or regular expressions

Python: string.find() or regular expressions

However those functions are usually implemented in a naive way. They usually go through every index in the haystack and try to compare the substring at that index with the given needle. Thus, given a needle of length M and a haystack of length N, those algorithms need O(M*N) steps in the worst case. This is acceptable for small haystacks, but becomes more and more problematic as the the haystack grows.

In this article we’ll examine smarter algorithms, in particular Boyer-Moore and its variants. Finally we’ll present the reader with an optimized Boyer-Moore-Horspool implementation in C++. This implementation differs from most others in that it’s heavily optimized and is capable of accepting haystack input in a streaming manner.

Before we move on, it should be noted that Python’s string.find(), Ruby regular expressions and glibc’s implementation of memmem() actually can use smarter algorithms when conditions are right, but that is besides the main point of this article.

Demonstration of a naive substring search algorithm

Smarter algorithms

Are there faster algorithms for searching needles? Why yes, computer science has already found the answer decades ago. Here are two algorithms that can find needles in less than O(M*N) time:

The ideas behind these algorithms are that, when a mismatch occurs during an iteration, you can sometimes intelligently skip several characters or you can start the next substring match at a higher position in order to avoid re-matching characters unnecessarily. Both algorithms are characterized by the need to precompute some preparation data. It is this data that allows them to search for needles in a way faster than O(M*N).

Knuth-Morris-Pratt simulates how a finite state automata accepts input symbols. During its precomputation phase it builds this automata, which requires O(M) time. Its main search algorithm, which runs this automata, has a worst case time complexity of O(N).

Boyer-Moore is similar, but in its precomputation phase it needs to build two tables, a so-called good suffix table and a bad suffix table. These tables are built from the needle contents. Its main search algorithm requires 3*N steps in the worst case, thus having a worst case time complexity of O(N).

More on Boyer-Moore

Boyer-Moore is a very peculiar algorithm because a longer needle makes it faster! This counter-intuitive property is caused by the fact Boyer-Moore matches the last needle character first and matches from right to left. If the last needle character doesn’t match the haystack then it knows it can skip M characters. When one of the other needle characters mismatch as well, it can usually skip more than 1 character based on the character which mismatched; this information is retrieved from the tables that had to be built in the precomputation phase. Thus, Boyer-Moore usually finishes in sublinear time.

Turbo Boyer-Moore takes more space but needs 2*N steps in the worst case instead of the original’s 3*N steps.

Boyer-Moore-Horspool uses less space (only requires the good suffix table) and has a simpler inner loop with less constant overhead per iteration. Its average performance is O(N) but it has a worst-case performance of O(M*N).

Boyer-Moore-Horspool itself also has variants. In Timo Raita’s 1992 paper “Tuning the Boyer-Moore-Horspool String Searching Algorithm” he asserts that text is rarely random and that there usually exist strong dependencies between characters. Thus there are smarter ways to match the needle: instead of matching it backwards from the last character, one should first match the last character, then the first character, then the middle one, etc. The advantage of this heuristic becomes obvious if we consider an example in which we search for the needle “hello world” in a haystack that often contains the word “world” in combination with a word other than “hello”, e.g. “beautiful world”, “my world”, “another world”. When the algorithm matches the last needle character “d”, instead of wasting time matching “worl” as well, it can match the first needle character “h” and immediately determine that there is no match. This heuristic can drastically improve Boyer-Moore-Horspool’s average performance.

Theory vs practice

Which algorithm should one use, Boyer-Moore, Turbo Boyer-Moore or Boyer-Moore-Horspool? In theory Turbo Boyer-Moore is an attractive alternative because of its worst-case performance, but real-world experience has shown that theory is rarely the same as practice. Clearly, this calls for benchmarks.

Various implementations of Boyer-Moore (and variants) exist in a variety of languages, but for the purpose of benchmarking we will want to use C or C++ because they allow the most optimal implementations. (Or assembly, but let’s not be too l33t today.) However there seem to be few good C/C++ implementations out there that are:

Readable.

Correct.

Tested.

Optimal.

After much digging I’ve found one C++ implementation that most matches our requirements. Its author, Joel Yliluoma, has agreed to open source his code under the MIT license, which I’m very grateful for. He had implemented Boyer-Moore, Turbo Boyer-Moore and Boyer-Moore-Horspool with some of the ideas by Timo Raita 1992. We’ve extended his work by adding unit tests, writing documentation and adding further optimizations. Using this extended code we’ve set up a benchmark suite. The results are as follows.

The benchmark is conducted on a MacBook Pro with OS X Snow Leopard and Intel Core 2 Duo 2.4 Ghz. For comparison reasons we’ve included benchmark results for strstr() and memmem() as well. strstr() is omitted in benchmark 1 because it cannot handle arbitrary random data. We’ve implemented memmem() ourselves because memmem() is a GNU extension and doesn’t exist on OS X. The test program is compiled with -O2 optimizations.

Benchmark 1: random data

This benchmark searches for the needle “I have control\n” in a 200 MB file containing random binary data, 10 iterations.

Boyer-Moore

1191 ms

Boyer-Moore-Horspool (Joel’s original implementation)

787 ms

Boyer-Moore-Horspool (our implementation)

729 ms

Turbo Boyer-Moore

1745 ms

memmem

2956 ms

As expected, memmem() is the slowest of the bunch, being more than twice as slow as Boyer-Moore, but actually doesn’t perform *that* badly.
Turbo Boyer-Moore, despite its name, performs poorly compared to the original.
Boyer-Moore-Horspool is significantly faster than everything else. Our optimized implementation is slightly faster than Joel’s original.

Benchmark 2: Alice in Wonderland

This benchmark searches for the needle “I have control\n” in a special compilation of Alice in Wonderland. We took the HTML version from Project Gutenberg and multiplied its contents many times in order to generate a 200 MB Alice in Wonderland HTML file. This benchmark ran for 10 iterations. This is probably a much better real-world test than the random binary blob in benchmark 1.

Boyer-Moore

1734 ms

Boyer-Moore-Horspool (Joel’s original implementation)

1101 ms

Boyer-Moore-Horspool (our implementation)

1080 ms

Turbo Boyer-Moore

2683 ms

strstr

2116 ms

memmem

3047 ms

The results are very similar to those of benchmark 1. Again, strstr() and memmem() are the slowest of the bunch, and our implementation of Boyer-Moore-Horspool the fastest.

Benchmark 3: triggering bad performance

In this benchmark we attempt to trigger bad peformance in Boyer-Moore-Horspool. We created a 200 MB file containing nothing but newlines, in which we look for the needle “I have control\n\n”. Notice the extra newline at the end of the needle! This causes the algorithm to skip slowly. The benchmark ran for 10 iterations.

Boyer-Moore

1515 ms

Boyer-Moore-Horspool (Joel’s original implementation)

12425 ms

Boyer-Moore-Horspool (our implementation)

9098 ms

Turbo Boyer-Moore

2281 ms

strstr

1820 ms

memmem

2744 ms

Here, Boyer-Moore-Horspool is significantly worse than even memmem() despite the same theoretical worst-case complexity! This is because memmem() has a much simpler inner loop with lower constant overhead. We also see that our optimizations to Boyer-Moore-Horspool really help.

Conclusion

Different substring searching algorithms have significantly different performance characteristics. For small haystacks it doesn’t really matter which one you use, but for larger ones you should definitely consider smarter algorithms like Boyer-Moore. Using these smarter algorithms is not without cost however: because they require preparation (creating an FSA or building tables) you must ask yourself whether the cost of preparation is acceptable for your use cases.

Turbo Boyer-Moore is disappointing, its name doesn’t do it justice. In academia constant overhead doesn’t matter, but here we see that it matters a lot in practice. Turbo Boyer-Moore’s inner loop is so complex that we think we’re better off using the original Boyer-Moore.

We think that for most cases Boyer-Moore-Horspool is the best choice. Despite its theoretical worst-case performance, its worst case is seldomly triggered; we had to spend effort on generating a bad input file and picking a bad needle on purpose. Its inner loop is simple and easy to understand. Its simplicity also results in better CPU cache utilization and branching behavior. On average Boyer-Moore-Horspool performs so well that for us, the choice is clear. Boyer-Moore-Horspool is an excellent choice for things like parsing protocol headers and multipart MIME data. In case of multipart MIME data, most browsers generate fairly long boundary strings, allowing Boyer-Moore-Horspool to skip over non-boundary data very quickly.

The code

Most string search algorithm implementations require the entire haystack data to be in memory. In contrast, we’ve written a “streaming” Boyer-Moore-Horspool implementation which allows one to feed the haystack data piece-of-piece. This is especially useful when parsing large amounts of protocol data received over the network without keeping all data in memory.

Our implementation is heavily optimized for both memory usage and CPU utilization as well as reusability. During the development of this implementation we’ve applied the highest degree of C++ kung-fu in order to get every last bit of efficiency out of the computer.

All data structures are optimized for size and locality of reference. Smaller size means less memory usage, which in turn means less CPU cache usage. On modern systems CPU cache utilization is one of the most important performance factors.

The context structure can be allocated in a single memory allocation, irregardless of the length of the needle. This minimizes memory allocation overhead, both in time (e.g. by allocating on the stack) and in space (by avoiding malloc() bookkeeping overhead). The algorithm does not need any more memory allocations than this initial one. The code also does not dictate a specific way to allocate context structure; the user is free to choose where to allocate it.

We’ve also applied some of the ideas by Timo Raita 1992, but our implementation differs from Joel’s original implementation through minimizing function call overhead. Joel’s implementation calls memcmp(); ours first compares the first character using inline code.

The poor performance of Turbo Boyer-Moore might be due to cache locality issues. If you jump from checking the end of the needle to the start of the needle, then there’s a good chance that you’ll cause a cache miss – which has the same cost as several hundred instructions. A modern Turbo Boyer-Moore would probably work backwards within the same cache line before jumping to the start of the needle. A couple of carefully chosen cache-preload instructions could make Turbo Boyer-Moore. For a good overview of cache effects, google “gallery of processor cache effects”.

http://www.codedanger.com/caglar Caglar Gulcehre

Snort and Bro are also using Boyer-Moore algorithm for signature-matching. I’ve tested them some time ago and they are pretty fast as well.

http://gedmin.as Marius Gedminas

It would be interesting to include Python’s str.find() in the benchmark, as I’ve heard it was highly optimized. And why not include Ruby’s version as well?

Yea I just read about Python’ts str.find(). turns out, they decided to use the same boyer-moore-horspool algorithm as well. Don’t knock an implementation of a ‘find’ unless you know that ‘find’ is slow. Maybe Ruby, C, C++ are all naive, but Python sure isn’t.

Hi Phusion,
two days ago I hit your well presented web-page, get on.
Like you I am interested in search techniques, but using C.

Tom Payne said:
“… If you jump from checking the end of the needle to the start of the needle, then there’s a good chance that you’ll cause a cache miss …”
Right, I found the same thing after making all kind of tries-and-errors two months ago. My tweak workarounds this problem by picking the BYTE/DWORD with lower address first.

In my view, a real wolf/programmer should enter and huff-and-puff all old piggy-houses i.e. implementations, I mean without true knowledge everything regardless how good behaves remains inferior compared to a master touch.
As for me I am a program-mess-er who needs help from a wolf.
Anyway, I was fortunate to calibrate the beautiful-in-its-simplicity BMH by fetching the lower bytes first, which resulted in Railgun_Quadruplet revision 7.

There are great differences between compilers, CPUs and patterns used, your test is performed using an old CPU (mine is even older) and one compiler, it is not enough at all.
More tweaks are needed, everything is in constant evolvement, for example the old architectures (with no fast unaligned DWORD fetch capability) AFAIK all the AMD’s and Intel’s CPUs before i5/i7 demand one approach of how to deal with BYTEs, WORDs, DWORDS whereas the new ones allow very different coding as in Railgun which extracts cheaply 4-bytes from the haystack thus feinting your ‘\n\n’ pattern.

Regards

P.S.
Hongli Lai, Kung-Fu masters bow before a Dao conduit, i.e. Dao rules, Kung-Fu follows i.e. Dao style is needed not Kung-Fu style, remember how the supreme Kung-Fu master bowed before the Buddha(being a Dao’s conduit) conduit in ‘Kung-Fu Hustle’ movie, wow, having not watched it is a nasty movie-miss. I am sure you will see the point immediately. Best wishes.

Please run the benchmark on some modern CPUs like Haswell & Vishera. I still don’t know what impact there will be if the 64KB lookup table become BITwise i.e. 8KB, I mean that the few instructions doing this could hurt (or not) the speed.

@vkaku
I don’t think so, Mischa Sandberg has a superfast solution for some pathological cases, using SSE2 and the superb BNDM. In my view, a simple heuristic deciding whether the pattern’s alphabet is too small (e.g. 4 distinct chars ACGT as in DNA sequences) would be a good shot. Seeing the alphabet is too small we can switch say to BNDM – a gorgeous etude.

Hello, we are Phusion. We provide amazing products and services
for web apps written in Ruby, Python, Node.js and Meteor.

“Phusion” and “Phusion Passenger” are registered trademarks of Phusion. “Rails”, “Ruby on Rails” and the Rails logo are registered trademarks of David Heinemeier Hansson. All other trademarks are property of their respective owners.