Text Processing: Divide and Conquer

Another day another generic text processing problem that many developers have had to solve before. I have a list of patterns and need to find if they exist in a group of files. If I did not need to do complex post processing then I could just use the command line like so

grep -ri -f patterns files/

and be done with it, alas the world is a cruel place :) So I cooked up a simple program that slurps a file in and does the comparison and tracks the necessary stats. The main problem with this approach is that it is too slow, I had over 4,000 patterns to match against every file. I used perl -d:NYTProf to confirm my theory. Here is the basic report I got from nytprofhtml:

1,125,978 matches in 3 megabytes of sample text files took 6 minutes and 26 seconds. This is way too slow so, let us optimize.

Divide

I started by looking at the patterns file and the majority of them are string literals which is good news. With a little preprocessing I now have all the string literals in a hash whose key is the first letter of the pattern and the value is a hash containing the patterns as keys and the values are coderefs to do whatever processing is needed. Now instead of having to run all of the literal matches every time I can test the first letter and use that to limit the matching scope. The best case scenario is 'x' which only has 1 pattern to test and the worst is 's' with over 700 patterns.

index instead of m//

Now you might be asking yourself why didn't I just use index since I have string literals? Index is designed for fast string searching with literal matches. Lets take a look at a simple example which loads 4,000 patterns and matches it against the entire text of Bram Stoker's Dracula which is a 16,248 line, 836kb file.

Is the sample program and it takes 2 minutes, 10 seconds to run. The bulk of the time is spent doing 4,000 match operations. Now lets try the same thing with index instead.

This program processes the entire Dracula text in 4.1 seconds.

The reason I didn't use index is that some of the literal patterns match as substrings of other words so I get a bunch of false positives. This does not happen with m// because I used \b to match up word boundaries. This was a precautionary measure because with that many input patterns I had to assume there would be some overlap and it turns out there is. A gentle look at simple string operations is a good read if this is new material and chapter 9 of "Mastering Algorithms with Perl" goes into depth about Boyer-Moore which Perl and grep both use.

I need tokens

This is what I had:

Which was fine but now with the need to test the first letter of each word, I need tokens. One of the advantages to making this change is I can now do text normalization on it, specifically removing duplicate tokens so less match attempts need to be made. List::MoreUtils makes this a breeze.

I read in the whole source file and strip out all the characters I do not care about. Then I split it into tokens using whitespace as the separator. Next I filter out any empty or one letter tokens. uniq from List::MoreUtils returns a list with no duplicate values. I was surprised that between List::MoreUtils and List::Util there was no function for getting the shortest or longest string in a list. Then again it is a trivial task to implement them using reduce as shown above.

Conquer

Going back to the initial 3mb of text files this updated script takes under 3 seconds to execute, that is less than 1% of the baseline time. I can now run this script 127 times in the time it took for the first version to complete one run.

More optimization

The more I used this little application on larger and larger datasets the more I wondered if I could get a little more speed out of it. I started by increasing my sample data size to 26 megabytes to simulate a more normal workload. Then I took a look at that foreach loop to see if I could speed things up. I created a simple benchmark script to compare five variations on a solution and then time them.

Method1 turned out to be the fastest but what does this translate to in real work? Method5 takes 33 seconds to process 26mb of text while method1 only takes 22 seconds, a 33.33% reduction in execution time. For much larger datasets I am looking into doing things in parallel however the overhead of forking is greater than the amount of work that needs to be done on a per file basis.

2 Comments

Hi, Kirk. Here’s an alternative approach that you may find interesting. I don’t have exactly the same data set as you to test against, but I reconstructed a similar case, and this takes about 205ms (on my machine, under 5.16.0) compared to 324s for your initial approach of looping over each sample word. That's about 0.06% of the original running time, so it may turn out to be faster than what you currently have.

The reason the obvious approach is so slow is that, if you have N words and M megabytes of text, the regex engine has to search those M megabytes N times. However, if we can construct a regex that matches all the subpatterns as it goes, then we can reduce the runtime by a factor of about N. That is, if we're looking for the words "foo" and "bar", we want to construct a regex like /\b(?:foo|bar)\b/.

In versions of Perl before 5.10, that wouldn't have helped, because the regex engine implemented alternation by trying the whole of the first alternative, then the whole of the second, then the whole of the third, and so on — for each position in the string. So, in this case, the running time of the regex would be O(N*M). But 5.10 uses an Aho–Corasick automaton to optimise cases like this: if the branches are (or begin with) fixed strings, then the regex engine can churn through those in time independent of the number of alternations. In this case, since all the words are fixed strings, that'll work out perfectly.

The other wrinkle, though, is that your task also needs you to know which words matched. Since all the words are fixed strings, the easy way to do that is to capture each match. (If any subpatterns weren't just fixed strings, you could use 5.10's (*MARK:text) regex feature, but there's no need for that complexity here.)

Now the keys of %seen should be precisely those words in @patterns which were found in $content. (And each corresponding value is the count of how many times that word occurred, which may be helpful in some situations.)

I think this technique is a lot more straightforward than the other approaches you've tried; I hope you find it helpful!

... at the top of Aaron's (most excellent) suggestion. Although while on 5.8 this will be much faster, from 5.10 onwards it will be slower, so not much win. It would be much more so if the patterns were regular expressions themselves.

Leave a comment

Name

Email Address

URL

Remember personal info?

Comments
(You may use HTML tags for style)

About Kimmel

I like writing Perl code and since most of it is open source I might as well talk about it too. @KirkKimmel on twitter