Devblog fueled by Christophe De Troyer

Menu

Pattern Matching Algorithms

Hi there!

I’ve implemented some pattern matching algorithms in C#. They were part of a course I took at the university I study at. They are therefore hardly optimized for real life usage. They do represent the conceptual idea of the algorithms.

Brute force method

The brute force method is quite simple. We align our pattern with the text and every time we have a mismatch we shift our pattern one step to the right. This is very bad performance wise. Worst case we will match every letter of our text with every letter of our pattern, which equals to 0(np*nt). Imagine this for a text of 1m characters and pattern of 100.000 characters. This method is good enough for very small text and patterns but as you’ll see, the complexity of other algorithms is hardly bigger. The code is quite straight forward and given below.

Knuth Morris Pratt

This algorithm is rather difficult to explain in a simple blog post. So bear with me as I try. The algorithm fully depends on the sigma-function. This function will return the largest prefix of the pattern, that is also a suffix of the part of the pattern that we have already matched. This is quite crucial to understand. Take a look at the image below.

As you can see we have mismatch at the character ‘x’ and ‘a’. Using the bruteforce method we would just shift our pattern one step to the right. So the first ‘a’ would align with the second ‘b’ of our text. What KMP will do, is calculate the sigmafunction. You can see in the image, a suffix of the last ‘a’ in our pattern is ‘ab’ for example. ‘aab’ is one too but you’ll get it in a minute. This ‘ab’ is also a prefix of our pattern. Meaning, if we shift our pattern just the length of that pre- and suffix, we don’t have to match those characters anymore. Because we know for certain that all the characters before our mismatch matched. So we will shift the pattern as you can see in the image. And this is the idea behind the sigma function. Once you got it, it’s a very simple algorithm.

The code:

private static int[] SigmaTable(string pattern)
{
//We will calculate the longest prefix of 0-&gt;ip that is also a prefix of the pattern
var np = pattern.Length;
var sigmatable = new int[np];
var ip = 2;
var k = 0;
while (true)
{
if (ip &gt;= np) //We have shifted out of our pattern, we are done.
break;
if (pattern[k] == pattern[ip - 1])//Our prefix extends
{
sigmatable[ip] = (k + 1);
ip++;
k++;
continue;
}
if (k &gt; 0)//We have a mismatch, so we need to see if we have a smaller prefix that we can use
//We already calculated this, so just get it.
{
k = sigmatable[ip];
continue;
}
//We have a mismatch, but K = 0, so we dont have a prefix.
sigmatable[ip] = 0;
ip++;
}
sigmatable[np - 1] = k;
sigmatable[0] = -1;
return sigmatable;
}

Quick Search

This is my personal favorite for two reasons: 1) it’s fast and 2) it’s everything but difficult!

This algorithm too, will do some preprocessing, although not as difficult to explain as KMP. What we will do is, for every character in the pattern, store the left most location in the pattern. E.g “abcabcd” will have 1 for a, 2 for b, 3 for c and 7 for d. Now, when we have a mismatch, what we will do is take the first character that comes after the entire pattern in the text. So in our picture above, we would take ‘y’. We then check to see if this character exists in our pattern. If it does not, we don’t even have to try matching, we skip it entirely, i.e shift the beginning of our pattern past that character (see the performance?). If it does exist, we shift our pattern the value of our table to the right and start matching from scratch.
We don’t know if our first characters will match, but we do know there is going to match one. The performance boost this algorithm gets is from the parts where we can skip characters that don’t exist in our pattern.