This algorithm is typically used when we want to search for multiple pattern strings in a text, e.g. when detecting plagiarism or as a primitive way of detecting code duplication, but my initial version only lets you search for one pattern.

For a text of length n and p patterns of combined length m, the average and best-case running time is O(n+m) with O(p) space, but the worst-case time is O(nm).

On line 2 we compute the initial hash value for the pattern and for the first m characters of the text, where m is the number of characters in the pattern.

We then work through the text, comparing the pattern hash with our current version of the text hash each time.

If they match then we check that the characters in those positions also match, since two different strings can hash to the same value, and if they do then we’re done.
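To see why that character check matters, here's a small standalone illustration (not from my actual solution) of two different strings hashing to the same value. It uses the same Horner-style hashing scheme but with a deliberately tiny modulus of 101 so that a collision is easy to exhibit:

```haskell
-- Horner-style hash, base 256, reduced by modulus q at each step.
-- q = 101 is deliberately tiny so a collision is easy to find;
-- a real implementation would use a large (ideally random) prime.
hashMod :: Int -> String -> Int
hashMod q = foldl (\acc c -> (acc * 256 + fromEnum c) `mod` q) 0

main :: IO ()
main = do
  print (hashMod 101 "abc")  -- 90
  print (hashMod 101 "ac-")  -- 90: different string, same hash value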

Line 7 is the interesting line, because if we recalculate the hash from scratch each time then it will take O(m) time, which means the whole algorithm will take O(nm) time since it’s called in a loop.

We therefore need to use a rolling hash which will allow us to work out the value of the next hash from the current one.

For example if we’re searching for a three letter pattern in the text ‘markus’ then on our first iteration of the loop we’ll have hash(“mar”) and on the second iteration we need to work out hash(“ark”).

The hash of “mar” already contains the hash value of “ar” so we need to remove the hash value of the “m” and then add the hash value of “k” to get our new hash value.

I then use scanl to apply a reduction over that collection, passing in the hash of the previous m characters (3 in this example) each time. I used scanl instead of foldl so that I could see the value of the hash on each iteration rather than only at the end.
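As a quick illustration of that difference (not specific to this algorithm), scanl retains every intermediate accumulator value whereas foldl only produces the last one:

```haskell
main :: IO ()
main = do
  -- scanl keeps every intermediate accumulator, starting with the seed...
  print (scanl (+) 0 [1, 2, 3])  -- [0,1,3,6]
  -- ...whereas foldl only gives you the final result
  print (foldl (+) 0 [1, 2, 3])  -- 6
```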

Next I use a zip to pair each letter in the text with its index, and then I look for the first entry in the collection which matches the pattern hash and has the same sequence of characters.

The mapToMaybe is used to grab the index of the match, and then the last bit of the line returns ‘-1’ if there is no match.

I’m assuming that scanl is lazily evaluated and will therefore, in this case, only evaluate up to the point where a match is found – if that assumption’s wrong then this version of Rabin-Karp is very inefficient!

First we remove the hash of the first character – which has a value of ‘r^(m-1) * ascii char’ from our hash function – and then we multiply the whole thing by ‘r’ to push each remaining character up by one position

e.g. the 2nd character would initially have a hash value of ‘r^(m-2) * ascii char’. We multiply by ‘r’ so it now has a hash value of ‘r^(m-1) * ascii char’.

Then we add the ascii value of the next character along and we have our new hash value.
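My actual definitions are in the hpaste linked at the end, but a minimal sketch consistent with the description above might look like this. Here globalR = 256 reproduces the values shown below; the particular globalQ is an assumption on my part (any prime larger than these hashes gives the same results for these inputs):

```haskell
-- A sketch, not the post's actual code: globalQ's value is an assumption.
globalR :: Int
globalR = 256        -- base: 256 reproduces the hash values shown below

globalQ :: Int
globalQ = 16777619   -- assumption: some hardcoded large prime

-- hash of the first m characters, Horner-style: the i-th character
-- (from the left) ends up weighted by r^(m-1-i)
hash :: String -> Int -> Int
hash s m = foldl (\acc c -> (acc * globalR + fromEnum c) `mod` globalQ) 0 (take m s)

-- roll the hash forward: remove the outgoing character's r^(m-1) term,
-- shift everything up one position, then add the incoming character
reHash :: Int -> Char -> Char -> Int -> Int
reHash h old new m =
  ((h - fromEnum old * rm) * globalR + fromEnum new) `mod` globalQ
  where rm = (globalR ^ (m - 1)) `mod` globalQ

main :: IO ()
main = do
  print (hash "mar" 3)                     -- 7168370
  print (hash "ark" 3)                     -- 6386283
  print (reHash (hash "mar" 3) 'm' 'k' 3)  -- 6386283
```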

We can compare the results we get from using hash and reHash to check it’s working:

> hash "mar" 3
7168370
> hash "ark" 3
6386283

> reHash 7168370 'm' 'k' 3
6386283

I hardcoded ‘globalQ’ to make life a bit easier for myself but in a proper implementation we’d randomly generate it.

‘globalR’ would be constant, and I wanted it to be available to the hash and reHash functions without needing to be passed explicitly, which is why I’ve partially applied hash’ and reHash’.
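Putting the pieces together, here's a self-contained sketch of how such a rabinKarp function could look (repeating the helpers so it compiles on its own). Again, globalR = 256 and this globalQ are my assumptions rather than necessarily the values in the linked solution, and I've skipped the hash’/reHash’ partial-application step for brevity:

```haskell
import Data.List (find)
import Data.Maybe (fromMaybe)

globalR, globalQ :: Int
globalR = 256        -- base for the rolling hash
globalQ = 16777619   -- assumption: some hardcoded large prime

-- hash of the first m characters, Horner-style
hash :: String -> Int -> Int
hash s m = foldl (\acc c -> (acc * globalR + fromEnum c) `mod` globalQ) 0 (take m s)

-- roll the hash: drop the outgoing character's r^(m-1) term,
-- shift up one position, add the incoming character
reHash :: Int -> Char -> Char -> Int -> Int
reHash h old new m =
  ((h - fromEnum old * rm) * globalR + fromEnum new) `mod` globalQ
  where rm = (globalR ^ (m - 1)) `mod` globalQ

-- index of the first occurrence of pattern in text, or -1
rabinKarp :: String -> String -> Int
rabinKarp text pattern
  | m > length text = -1
  | otherwise       = fromMaybe (-1) (fst <$> find matchAt candidates)
  where
    m           = length pattern
    patternHash = hash pattern m
    -- pair each outgoing character with the incoming one, m positions later
    slides      = zip text (drop m text)
    -- scanl rolls the hash one window at a time, keeping every intermediate value
    hashes      = scanl (\h (old, new) -> reHash h old new m) (hash text m) slides
    candidates  = zip [0 ..] hashes
    -- verify the characters too, since different strings can share a hash
    matchAt (i, h) = h == patternHash && take m (drop i text) == pattern

main :: IO ()
main = do
  print (rabinKarp "markusaerelius" "sae")   -- 5
  print (rabinKarp "markusaerelius" "blah")  -- -1
```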

We can then run the algorithm like so:

> rabinKarp "markusaerelius" "sae"
5

> rabinKarp "markusaerelius" "blah"
-1

My whole solution is available on hpaste and, as usual, if you see any ways to improve this code please let me know.