Biggest Regex In The World

There are two common pattern matching problems that appear simple on the surface but turn out to be very complex once you think about them: matching emails and URIs in free-form text. Everyone has written a URL or email validation script at one point or another, and I’m willing to bet that 90% of the validation scripts out there are just plain wrong.

It appears that you need exactly 7579 characters to pattern match every possible legal URL out there. Or possibly even more, because this one doesn’t actually account for https:// addresses. And you thought this was an easy problem that could be solved with a one-liner. Shame on you!

In all fairness, how often do we really need the regex of doom though? In most cases (not all, mind you) something as simple as “give me all strings that start with http:// and are delimited by spaces on both sides” will work almost as well, and probably much faster.
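For illustration, here is a minimal Python sketch of that simple approach. The names SIMPLE_URL and find_urls are made up for this example, and the pattern makes no attempt to validate anything: it just grabs every whitespace-delimited token that starts with http://.

```python
import re

# Hypothetical helper, not a validator: a token counts as a "URL" if it
# starts with http:// and is bounded by whitespace (or the string edges).
# (?<!\S) means "not preceded by a non-space character".
SIMPLE_URL = re.compile(r"(?<!\S)http://\S+")

def find_urls(text):
    """Return every whitespace-delimited token that starts with http://."""
    return SIMPLE_URL.findall(text)

sample = "See http://example.com/page and http://example.org/a?b=1 for details"
print(find_urls(sample))
```

Note that trailing punctuation stuck to a URL (a period at the end of a sentence, a closing parenthesis) ends up inside the match. That is the price of skipping the regex of doom, and it is often an acceptable one.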

Let’s face it: who wants something like that sitting in their codebase? You can’t read it, you can’t verify that it works via code inspection, and regenerating the regex from scratch with the Perl script included on the linked page is probably the only way you can maintain it. Trying to modify it by hand is just asking for a one-way trip to Painsville, NJ (that’s the fabled fictional town that invented brain pain, if you didn’t know).

I never claimed that the regexp covers all possible schemes. It didn’t 10 years ago, and it certainly doesn’t now. Assuming you have the Regexp::Common module installed, you can get a similar regexp with the following line of code: perl -MRegexp::Common -E 'say $RE{URI}'

My current RegEx is 157k. It searches text for ancient references, which means it has to include all the various abbreviations for those references. It resides on Google App Engine and can index 100k in less than 2 seconds, but the RegEx portion of that takes only a few milliseconds; the rest of the time goes to processing each of the found objects. It got too big to handle by hand, so I had to write code to generate the RegEx from a flat file.
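Generating a regex from a flat file might look roughly like the Python sketch below. This is not the commenter’s actual code: the file layout (one abbreviation per line, # for comment lines), the function names, and the sample abbreviations are all assumptions invented for the example.

```python
import re

def load_terms(lines):
    """Collect non-empty, non-comment lines (assumed layout: one term per line)."""
    terms = set()
    for line in lines:
        term = line.strip()
        if term and not term.startswith("#"):
            terms.add(term)
    return terms

def build_pattern(terms):
    """Compile one alternation that matches any of the given terms."""
    # Longest terms first, so "Gen." is tried before its prefix "Gen".
    ordered = sorted(terms, key=lambda t: (-len(t), t))
    alternation = "|".join(re.escape(t) for t in ordered)
    # The lookarounds stop a term from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")

# Stand-in for the flat file; these abbreviations are placeholders.
flat_file = ["# book abbreviations", "Gen", "Gen.", "Genesis", "Ex", "Exod", ""]
pattern = build_pattern(load_terms(flat_file))
print(pattern.findall("See Gen. 1:1 and Exod 3."))
```

In real use, load_terms could be handed an open file object instead of a list, since it only iterates over lines. The flat file then becomes the thing you edit and review, and the giant regex is just a build artifact.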
