Introduction

This article is about the Knuth-Morris-Pratt (KMP) algorithm. KMP is a string matching algorithm that finds the occurrences of a pattern in a text in O(n) time after O(m) preprocessing time, where n is the text length and m is the pattern length.

Background

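The KMP algorithm first calculates a transition array from the pattern. When a mismatch occurs during the search, this array tells how far the pattern can be shifted without re-comparing characters that are already known to match.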
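To make the transition array concrete, here is a minimal sketch of the textbook prefix-function computation. The names PrefixSketch and Compute are hypothetical and chosen for illustration; this is not the article's PrefixArray class, and it assumes the array stores prefix lengths.

```csharp
using System;

// Hypothetical helper for illustration; not the article's PrefixArray class.
public static class PrefixSketch
{
    // Returns pi, where pi[i] is the length of the longest proper prefix
    // of pattern[0..i] that is also a suffix of pattern[0..i].
    public static int[] Compute(string pattern)
    {
        int[] pi = new int[pattern.Length];
        int k = 0; // length of the prefix matched so far

        // Index 0 always maps to 0, so the loop starts at 1.
        for (int i = 1; i < pattern.Length; i++)
        {
            // Fall back to shorter candidates until the next character fits.
            while (k > 0 && pattern[k] != pattern[i])
            {
                k = pi[k - 1];
            }
            if (pattern[k] == pattern[i])
            {
                k++;
            }
            pi[i] = k;
        }
        return pi;
    }
}
```

For the pattern "ababaca", this sketch produces 0, 0, 1, 2, 3, 0, 1.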

Implementation

The PrefixArray class takes a string parameter, the pattern, and is responsible for calculating the prefix function and returning an array that contains the transition indexes.

In calculating the prefix array, care has been taken to give maximum performance; hence, the general implementation has been tweaked in several places.

The code above shows computing the prefix function; the loop in the code iterates through the pattern and calculates the prefix function for each index. (Note: the iteration starts from 1 as we know that the transition at index 0 is 0.)

The temp array represents all the characters from the 0th index of the pattern to the index of the current loop. This character array is passed into the GetPrefixLength() function, which actually computes the prefix function.

This function takes in the array we discussed as a parameter, and also the first character of the pattern (charToMatch).

The array is iterated and searched for a match with the first character of the pattern (the first character needs to be matched because every prefix starts with it). If the first character exists in the array, then we calculate the longest suffix that is a prefix of the pattern.

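Because the array is scanned from left to right, the first position at which the rest of the array matches the beginning of the pattern gives us the longest suffix that is also a prefix of the pattern.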
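The steps described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the original source: the class name PrefixBySuffixScan, the Compute method, and the exact signature of GetPrefixLength are guesses based on the description, and the scan starts at index 1 so that only proper suffixes are considered.

```csharp
using System;

// Hypothetical reconstruction of the approach described in the text.
public static class PrefixBySuffixScan
{
    // temp holds pattern[0..i]; charToMatch is the first pattern character.
    // Returns the length of the longest proper suffix of temp that is
    // also a prefix of temp, or 0 if there is none.
    public static int GetPrefixLength(char[] temp, char charToMatch)
    {
        for (int start = 1; start < temp.Length; start++)
        {
            // A candidate suffix must begin with the first pattern character.
            if (temp[start] != charToMatch)
            {
                continue;
            }
            int length = temp.Length - start;
            bool matches = true;
            for (int j = 0; j < length; j++)
            {
                if (temp[start + j] != temp[j])
                {
                    matches = false;
                    break;
                }
            }
            // Scanning left to right, the first full match is the longest.
            if (matches)
            {
                return length;
            }
        }
        return 0;
    }

    public static int[] Compute(string pattern)
    {
        int[] transition = new int[pattern.Length];
        // Index 0 stays 0, so the iteration starts from 1.
        for (int i = 1; i < pattern.Length; i++)
        {
            char[] temp = pattern.Substring(0, i + 1).ToCharArray();
            transition[i] = GetPrefixLength(temp, pattern[0]);
        }
        return transition;
    }
}
```

Note that a direct scan like this one re-examines characters for every index, so it does more work than the linear-time textbook computation; it is shown only to illustrate the idea.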

The search loop iterates through the text looking for the pattern. If the current character of the text matches the character at the current pattern index (the pattern character that should be matched in this iteration), then there is a match, and we increment the pattern index and continue.

If there is a mismatch, we look up the transition index for the current pattern index and check whether the character at the transition index + 1 matches the text character being compared (this is done so we don’t need to unnecessarily match the character again). If there is a match, we move the pattern index to the transition index; if not, we reset the pattern index to 0.

Next, we see if the pattern index is equal to the pattern length; if so, then we have a match. This code snippet is shown below:

//A complete match, if k is
//equal to pattern length
if(k == patternArray.Length)
{
    //Add it to our result
    result.Add(i - (patternArray.Length - 1));
    //Set k as if the next character is a mismatch
    //therefore we don’t miss out any other containing pattern
    k = transitionArray[k - 1];
}
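For context, here is a minimal, self-contained sketch of a complete search loop built around the match handling in the snippet above. It is not the article's KMPUtil source: the names KmpSearchSketch, FindAll, and BuildTransitions are hypothetical, it returns a List<int> instead of an ArrayList, and on a mismatch it uses the textbook repeated fallback (a while loop) in place of the single transition check described earlier.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; not the article's KMPUtil class.
public static class KmpSearchSketch
{
    // Textbook prefix function: entry i is the length of the longest
    // proper prefix of pattern[0..i] that is also its suffix.
    private static int[] BuildTransitions(string pattern)
    {
        int[] transitionArray = new int[pattern.Length];
        int k = 0;
        for (int i = 1; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[k] != pattern[i])
            {
                k = transitionArray[k - 1];
            }
            if (pattern[k] == pattern[i])
            {
                k++;
            }
            transitionArray[i] = k;
        }
        return transitionArray;
    }

    // Returns the zero-based start index of every occurrence of pattern in text.
    public static List<int> FindAll(string pattern, string text)
    {
        var result = new List<int>();
        if (pattern.Length == 0)
        {
            return result; // assumption: an empty pattern yields no matches
        }

        int[] transitionArray = BuildTransitions(pattern);
        int k = 0; // number of pattern characters matched so far

        for (int i = 0; i < text.Length; i++)
        {
            // On a mismatch, fall back through the transition array.
            while (k > 0 && pattern[k] != text[i])
            {
                k = transitionArray[k - 1];
            }
            if (pattern[k] == text[i])
            {
                k++;
            }
            // A complete match, if k is equal to the pattern length.
            if (k == pattern.Length)
            {
                result.Add(i - (pattern.Length - 1));
                // Continue as if the next character were a mismatch,
                // so overlapping occurrences are not missed.
                k = transitionArray[k - 1];
            }
        }
        return result;
    }
}
```

With this sketch, KmpSearchSketch.FindAll("ab", "abhsdsabsbabaa") yields the indexes 0, 6, and 10, and overlapping occurrences are reported as well: FindAll("aa", "aaaa") yields 0, 1, and 2.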

Using the code

In order to use the code, all you have to do is build the files in the provided source and call the static method GetAllOccurences(string pattern, string text) of the KMPUtil class like this:

KMPUtil.GetAllOccurences("ab", "abhsdsabsbabaa");

This method returns an ArrayList of the zero-based indexes where the pattern occurs in the text.

Something that could make it even better is a comparison to traditional string matching approaches such as regular expressions. I would like to know your opinion on the advantages of this approach over other approaches.