Clovis_Sangrail has asked for the
wisdom of the Perl Monks concerning the following question:

Hello Perl Monks,

I use Perl to generate daily audit reports from sets of Journal Log Files produced by GT.M, an implementation of the MUMPS database/language. Each Journal line of interest includes a Username, a Global Variable and a description of the transaction on it. The report just presents a listing and count of the Global Variable modifications, broken out by Username.

The customer wanted the capability to ignore some Globals that were not of interest. They can edit a file of such Globals, and I read that file and build an Inclusive-Or type of Regex that I pass to the Perl program as a command-line parameter. The program matches the Global Variable name from each Journal line against that Regex, and skips the line if it matches.
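For context, a minimal sketch of that approach, assuming the skip file holds one global name per line (the names here are inline stand-ins for the customer-edited file, not the poster's actual data):

```perl
use strict;
use warnings;

# Stand-ins for the names read from the customer-edited skip file.
my @globals = ('ACCT', 'AUDIT', 'TMP');

# Build the Inclusive-Or regex; quotemeta escapes any metacharacters.
my $skip_re = join '|', map { quotemeta } @globals;

sub should_skip {
    my ($global) = @_;
    return $global =~ /^(?:$skip_re)$/ ? 1 : 0;
}

print should_skip('AUDIT') ? "skip\n" : "keep\n";   # prints "skip"
```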

But I did not realize just how popular this capability would be! I figured there would only ever be a few such Globals to skip, but the Customer has entered 54 of them so far, and they say there will be more! The Regex that I give to the Perl program is now about 750 characters long, and some of the bigger banks being audited produce over a million lines of Journal each day.

The reports for those banks do take noticeably longer to produce than when the system first went online, and I don't have much knowledge of or feel for the performance of the Perl Regex engine. Is it linear, like will it take ten times as long to match against a 600-character Regex as against a 60-character one?

I realize that this is just the sort of thing that enterprising Perl students study via test programs, and I may do that sort of thing. But I also do want to be able to tell the folks who sign my check that I am asking around, too.

The reports for those banks do take noticeably longer to produce than when the system first went online

That sounds as if lots of stuff might have been changed in between. Run a profiler over the script(s) and see where the time is actually spent.

I don't have much knowledge of or feel for the performance of the Perl Regex engine. Is it linear, like will it take ten times as long to match against a 600-character Regex as against a 60-character one?

In general, it doesn't depend much on the length of the regex, but on the amount of backtracking and searching that the regex engine has to do.

If it's just a big alternation of constant strings, and you use perl 5.10.0 or newer, the trie optimization in the regex engine should handle that case very well (sub-linear even). If your regex grows too big, try increasing ${^RE_TRIE_MAXBUF} -- but only if it's the regex that's actually slow.
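As a sketch of what that looks like in practice (the global names and the buffer size here are invented for illustration):

```perl
use strict;
use warnings;

# On perl 5.10.0+, a big alternation of literal strings is compiled
# into a trie, so matching does not slow down linearly with the number
# of alternatives. Raise the trie buffer if the alternation is huge;
# the default is 65536 bytes.
${^RE_TRIE_MAXBUF} = 1024 * 1024;

my @skip = map { "GLOBAL$_" } 1 .. 500;           # invented names
my $alt  = join '|', map { quotemeta } @skip;
my $re   = qr/^(?:$alt)$/;

print 'GLOBAL250' =~ $re ? "skip\n" : "keep\n";   # prints "skip"
```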

And as already mentioned, if you can solve your problem through a hash lookup, that would be even better.
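If the skip list is always exact names (no patterns), the hash version is about as simple as it gets; a sketch with invented names:

```perl
use strict;
use warnings;

# Exact-name skipping with a hash instead of a regex: one O(1) lookup
# per journal line, no matter how many globals are listed.
my %skip = map { $_ => 1 } ('ACCT', 'AUDIT', 'TMP');   # invented names

for my $global ('ACCT', 'CUST') {
    next if exists $skip{$global};
    print "report: $global\n";                          # prints "report: CUST"
}
```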

If you really need regexes, then put them in an array instead of joining them into a single regex; looping over the array will generally perform better than one big joined regex (and put the most likely things to match first, if possible):
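The original node's code isn't shown here, so this is a hedged stand-in for the array-of-regexes idea, with invented placeholder patterns:

```perl
use strict;
use warnings;

# Keep the patterns in an array, most frequently matched first,
# and stop at the first hit.
my @skip_res = (
    qr/^ACCT/,     # invented placeholder patterns
    qr/^AUDIT/,
    qr/TMP$/,
);

sub skip_global {
    my ($name) = @_;
    for my $re (@skip_res) {
        return 1 if $name =~ $re;
    }
    return 0;
}

print skip_global('ACCT123') ? "skip\n" : "keep\n";   # prints "skip"
```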

I'm not a big regex user, so my comments may not reflect what others have experienced.

A few years ago (2003), we saw an explosion in spam on our email machines, to more than 100K emails per day per machine. We were using MailScanner to process the email, and found that it couldn't keep up with the quantity we were receiving. So I wrote a preprocessor in Perl, and the quickest and dirtiest trick was to search for 'unique' phrases in the body of the email to identify email that was 'known' spam before passing the result to MailScanner. The original was about 300 lines of script. Since then it has grown to 5,000+ lines and was split into 2 persistent scripts. The average email machine now processes more than 1,000,000 emails per day. I use 'Time::HiRes' to time the 'while' loop that tests for spam identified within the body of the email. The basic test is:
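(The actual loop isn't shown in this node, so the following is a hedged reconstruction from the description; the names '$body', '@BD_data' and '$looptime' follow the text, but the details are assumptions.)

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Stand-in data: phrases ordered by spam frequency, most common first.
my @BD_data = ('viagra', 'act now', 'free money');
my $body    = "Limited offer: FREE MONEY if you act now!";

my $t0      = [gettimeofday];
my $spam    = 0;
my $lc_body = lc $body;            # lowercase once, outside the loop
for my $phrase (@BD_data) {
    if (index($lc_body, $phrase) >= 0) {
        $spam = 1;
        last;                      # stop at the first hit
    }
}
my $looptime = tv_interval($t0);
printf "spam=%d looptime=%.6fms\n", $spam, $looptime * 1000;
```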

In testing I tried to use a regex figuring I could include the 'lc' as part of the regex. All benchmarks showed the regex to be much slower than using 'lc' with 'index'.
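A sketch of how one might reproduce that comparison with the Benchmark module (the body and phrases are invented, not the poster's data):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $body    = ('some ordinary text ' x 50) . 'free money offer';
my @phrases = ('viagra', 'free money', 'act now');

# Case-insensitive match folded into the regex via /i ...
sub check_regex {
    for my $p (@phrases) {
        return 1 if $body =~ /\Q$p\E/i;
    }
    return 0;
}

# ... versus lowercasing once and using index().
sub check_index {
    my $lc = lc $body;
    for my $p (@phrases) {
        return 1 if index($lc, $p) >= 0;
    }
    return 0;
}

cmpthese(20_000, { regex => \&check_regex, lc_index => \&check_index });
```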

Why this is important to you is that the '$body' averaged 10KB and the '@BD_data' usually had more than 1K elements. And the clients on the email machines that had problems were banks and the '$looptime' rarely exceeded 100ms. '@BD_data' is ordered by the frequency of spam activity, so the most common 'spam' term is first.