Isn't that true of any code construct? s/regex/arrays/ and the question makes just as much sense? Or am I missing something?
–
zigdon Sep 29 '08 at 21:29

I guess I'm getting at the impenetrability of some of the longer ones. I've seen regexes used to parse content in contexts that make no sense to me, such as filtering JavaScript out of HTML; that doesn't seem particularly useful, as things change regularly in that area.
–
Rich Bradshaw Sep 29 '08 at 21:31

@Rich, these are all good points. But how are any of these points MORE applicable to regexes than to any code? Regex is a language. Use it appropriately, write good "programs" and you'll be fine, misuse it and you'll run into trouble. No different than C++, Java, or Python.
–
Wedge Sep 29 '08 at 22:39

Any code I've seen that uses Regexes tends to use them as a black box:

If by black box you mean abstraction, that's what all programming is, trying to abstract away the difficult part (parsing strings) so that you can concentrate on the problem domain (what kind of strings do I want to match).

even a small change can often result in a completely different regex.

That's true of any code. As long as you test your regex to make sure it matches the strings you expect, ideally with unit tests, you should be confident in changing it.
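As a minimal sketch of that kind of safety net (the pattern and the sample strings are invented for illustration, not taken from the question), a handful of assertions can pin down what a regex must and must not match before anyone edits it:

```python
import re

# Hypothetical pattern: an ISO-style date such as 2008-09-29.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def is_date(text):
    """Return True only if the whole string looks like YYYY-MM-DD."""
    return DATE_RE.fullmatch(text) is not None

# Strings the pattern is expected to accept and to reject.
SHOULD_MATCH = ["2008-09-29", "1999-12-31"]
SHOULD_NOT_MATCH = ["2008-9-29", "29-09-2008", "2008-09-29 21:29", ""]

def run_checks():
    """Raise AssertionError if the pattern's behaviour has drifted."""
    for s in SHOULD_MATCH:
        assert is_date(s), f"expected a match: {s!r}"
    for s in SHOULD_NOT_MATCH:
        assert not is_date(s), f"expected no match: {s!r}"
    return True
```

If a later change to `DATE_RE` breaks one of these cases, `run_checks()` fails immediately instead of the bug surfacing in production.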

Edit: please also read Jeff's comment to this answer about production code.

changing them in production code should NEVER make you feel comfortable. Rather, they should be changed on your test server (which should always be identical to your production server, except where your test code is different), tested, and pushed to your prod server. Changing prod code: Bad, Mkay?
–
Jeff Oct 1 '08 at 15:15

It really comes down to the regex. If it's this huge monolithic expression, then yes, it's a maintainability problem. If you can express them succinctly (perhaps by breaking them up), or if you have good comments and tools to help you understand them, then they can be a powerful tool.

Nothing beats a good comment, even with a simple regular expression. Not all members of my team understand them so a good comment explaining what it is doing (sometimes with a key) is invaluable for maintenance.
–
Jeff Yates Sep 29 '08 at 23:18

I don't know which language you're using, but Perl - for example - supports the x flag, so spaces are ignored in regexes unless escaped, so you can break it into several lines and comment everything inline:

To someone that knows regexes, those comments are equivalent to "i++; // Adds one to i"
–
Zan Lynx Sep 29 '08 at 22:02

I doubt jkramer was suggesting those as the exact comments, but merely pointing out the ability to do that. (taking examples too literally)--
–
Tanktalus Sep 29 '08 at 22:16

Well yes, but I do see code, Perl especially, commented in this way. Instead of using comments explaining regex basics or using "unless" after a keyword, or using short-circuit evaluation of "or", people need to learn Perl syntax.
–
Zan Lynx Sep 30 '08 at 17:07

It only seems like magic if you don't understand the regex. Any number of small changes in production code can cause major problems, so that is not a good reason, in my opinion, to avoid regexes. Thorough testing should point out any problems.

>> It only seems like magic if you don't understand the regex. << I think that's the point of Rich's question. Complex regex strings can be very opaque and difficult to understand, not to mention debug.
–
Michael Burr Sep 29 '08 at 21:40

Agree with Mike B. Some coding god may immediately understand a page-long regex, but the power of the regex comes with a price for most regular developers :)
–
OregonGhost Sep 29 '08 at 21:51

I think the underlying point that you're getting at here is most programmers don't understand regexes very well. They're an extremely important tool. This is an educational deficiency, not a coding deficiency.
–
rmeador Sep 29 '08 at 21:59

@Mike: The same can be said of any complex code. The difference is that developers are trained to understand the code. They also need to be trained to understand regexes; it's a similar skill, so it shouldn't be too difficult.
–
tloach Nov 11 '08 at 20:24

Small changes to any code in any language can result in completely different results. Some of them even prevent compilation.

Substitute regex with "C" or "C#" or "Java" or "Python" or "Perl" or "SQL" or "Ruby" or "awk" or ... anything, really, and you get the same question.

Regex is just another language, Huffman coded to be efficient at string matching. Just like Java, Perl, PHP, or especially SQL, each language has strengths and weaknesses, and you need to know the language you're writing in when you're writing it (or maintaining it) to have any hope of being productive.

Edit: Mike, regexes are Huffman coded in that common things to do are shorter than rarer things. A literal match of text is generally a single character (the one you want to match). Special characters exist, and the common ones are short. Special constructs, such as (?:), are longer. These are not the same things that would be common in general-purpose languages like Perl, C++, etc., so the Huffman coding was targeted at this specialisation.

Complex regexes are fire-and-forget for me. Write it, test it, and when it works, write a comment saying what it does and we're fine.

In many cases, however, you can break regular expressions down into smaller parts, and maybe write some well-documented code that combines them. But if you find a multi-line regex in your code, you had better not be the one who must maintain it :)

Sounds familiar? That's more or less true of any code. You don't want to have very long methods, you don't want to have very long classes, and you don't want to have very long regular expressions, though methods and classes are by far easier to refactor. But in essence, it's the same concept.

Regular expressions are actually quite expensive, but you are right, they are powerful.
–
camflan Sep 29 '08 at 21:32

Regular expressions are expensive until you start using them multiple times. Anything dealing with strings is expensive, but a regular expression will probably work better than looping through each string, seeing if it contains text, then doing the next thing you want to match on.
–
Darren Kopp Sep 29 '08 at 21:35

Expensive how? If you only need a match/no match response then they're O(N); otherwise they can be exponential, but so would the equivalent non-RE way of searching for the same thing: en.wikipedia.org/wiki/…
–
tloach Nov 11 '08 at 20:31
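On the point about reuse, a small Python sketch (the pattern and function name are made up for illustration): the pattern is compiled once and the compiled object is then applied to every string, so the compilation cost is paid a single time rather than per line.

```python
import re

# Hypothetical pattern, compiled once at import time and reused for every line.
ERROR_RE = re.compile(r"\bERROR\b")

def count_error_lines(lines):
    """Count lines containing the whole word ERROR, reusing the compiled pattern."""
    return sum(1 for line in lines if ERROR_RE.search(line))
```

Whether this beats hand-written string scanning depends on the pattern and the data, so it is worth measuring before assuming either way.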

Defining named patterns

Some regular expressions use identical subpatterns in several places. Starting with Perl 5.10, it is possible to define named subpatterns in a section of the pattern so that they can be called up by name anywhere in the pattern. The syntax for this definition group is (?(DEFINE)(?<name>pattern)...). An insertion of a named pattern is written as (?&name).

The example below illustrates this feature using the pattern for floating point numbers that was presented earlier on. The three subpatterns that are used more than once are the optional sign, the digit sequence for an integer and the decimal fraction. The DEFINE group at the end of the pattern contains their definition. Notice that the decimal fraction pattern is the first place where we can reuse the integer pattern.
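Python's standard `re` module has no DEFINE group, so the following sketch swaps in a different technique to get a similar effect: the three reusable pieces are ordinary named strings that are spliced into the final pattern. The float grammar used here is an assumption for illustration, not the exact pattern the text refers to.

```python
import re

# Reusable fragments, playing the role of the named subpatterns.
SIGN = r"[+-]?"        # optional sign
INT = r"\d+"           # digit sequence for an integer
FRAC = rf"\.{INT}"     # decimal fraction, reusing the integer fragment

# Assumed grammar: sign, then integer with optional fraction (or a bare
# fraction), then an optional exponent that reuses SIGN and INT.
FLOAT_RE = re.compile(
    rf"{SIGN}(?:{INT}(?:{FRAC})?|{FRAC})(?:[eE]{SIGN}{INT})?"
)

def is_float(text):
    """Return True if the whole string matches the assumed float grammar."""
    return FLOAT_RE.fullmatch(text) is not None
```

As in the Perl version, the fraction fragment is the first place where the integer fragment gets reused.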

When used consciously, regular expressions are a powerful mechanism that spares you from lines and lines of text-parsing code. They should of course be documented correctly and tracked so that you can verify whether the initial assumptions are still valid, and update them accordingly if not. Regarding maintenance, IMHO it is better to change a single line of code (the regular expression pattern) than to have to understand lines and lines of parsing code, or whatever else would serve the regular expression's purpose.

There are a lot of possibilities to make RegEx more maintainable. In the end it's just a technique a (good?) programmer has to learn when it comes to major (or sometimes even minor) changes. If there weren't some really good pros, no one would bother with them because of their complex syntax. But they are fast, compact and very flexible in doing their job.

For .NET people, the "Linq to RegEx" library or the "Readable Regular Expressions Library" could be worth a look. They make regexes easier to maintain and also easier to write. I used both of them in my own projects where I knew the HTML source code I analysed with them could change at any time.

But trust me: once you cotton on to them, they can even be fun to write and read. :)

I have a policy of thoroughly commenting non-trivial regexes. That means describing and justifying each atom that doesn't match itself. Some languages (Python, for one) offer "verbose" regexes that ignore whitespace and allow comments; use this whenever possible. Otherwise, go atom by atom in a comment above the regex.
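A short sketch of what that looks like with Python's `re.VERBOSE` flag (the phone-number pattern is a made-up example, not from the question): whitespace in the pattern is ignored and each atom gets its own inline comment.

```python
import re

# Hypothetical example: a North American style phone number.
PHONE_RE = re.compile(r"""
    \(? (\d{3}) \)?   # area code, parentheses optional
    [\s.-]?           # optional separator: space, dot, or hyphen
    (\d{3})           # exchange
    [\s.-]?           # optional separator
    (\d{4})           # line number
""", re.VERBOSE)

def parse_phone(text):
    """Return (area, exchange, line) as strings, or None if there is no match."""
    m = PHONE_RE.fullmatch(text)
    return m.groups() if m else None
```

Note that whitespace inside a character class such as `[\s.-]` is still significant in verbose mode, which is why the separator is spelled with `\s` rather than a literal space.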

The problem is not with the regexes themselves, but rather with their treatment as a black box. As with any programming language, maintainability has more to do with the person who wrote it and the person who reads it than with the language itself.

There's also a lot to be said for using the right tool for the job. In the example you mentioned in your comment to the original post, a regex is the wrong tool to use for parsing HTML, as is mentioned rather frequently over on PerlMonks. If you try to parse HTML in anything resembling a general manner using only a regex, then you're going to end up either doing it in an incorrect and fragile manner, writing a horrendous and unmaintainable monstrosity of a regex, or (most likely) both.

Your question doesn’t seem to pertain to regular expressions themselves, but only the syntax generally used to express regular expressions. Among many hardcore coders, this syntax has come to be accepted as pretty succinct and powerful, but for longer regular expressions it is actually really unreadable and unmaintainable.

Some people have already mentioned the “x” flag in Perl, which helps a bit, but not much.

I like regular expressions a lot, but not the syntax. It would be nice to be able to construct a regular expression from readable, meaningful method names. For example, instead of this C# code:

This is just a quick idea; I know there are other, unrelated maintainability issues with this (although I would argue they are fewer and more minor). An extra benefit of this is compile-time verification.

Of course, if you think this is over the top and too verbose, you can still have a regular expression syntax that is somewhere in between, perhaps...

This is still a million times more readable and only twice as long. Such a syntax can easily be made to have the same expressive power as normal regular expressions, and it can certainly be integrated into a programming language’s compiler for static analysis.

I don’t really know why there is so much opposition to rethinking the syntax for regular expressions even when entire programming languages are rethought (e.g. Perl 6, or when C# was new). Furthermore, the above very-verbose idea is not even incompatible with “old” regular expressions; the API could easily be implemented as one that constructs an old-style regular expression under the hood.
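As a rough Python sketch of that last point, here is a builder with readable method names that produces an old-style regex under the hood. Every class and method name below is invented for illustration; it is not an existing library, and being Python it offers no compile-time checking.

```python
import re

class Pattern:
    """Hypothetical fluent builder that emits an ordinary regex string."""

    def __init__(self, source=""):
        self.source = source

    def _add(self, piece):
        # Return a new builder so partial patterns can be reused safely.
        return Pattern(self.source + piece)

    def literal(self, text):
        """Match the given text exactly, escaping any metacharacters."""
        return self._add(re.escape(text))

    def digits(self, count=None):
        """Match one or more digits, or exactly `count` digits."""
        return self._add(r"\d+" if count is None else rf"\d{{{count}}}")

    def optional(self, other):
        """Match another Pattern zero or one times."""
        return self._add("(?:" + other.source + ")?")

    def compile(self):
        """Hand the accumulated source to the normal regex engine."""
        return re.compile(self.source)

# Usage: a YYYY-MM-DD date, spelled out with method names.
DATE = (Pattern()
        .digits(4).literal("-")
        .digits(2).literal("-")
        .digits(2)
        .compile())
```

Because `compile()` just returns a normal compiled pattern, code written this way can sit alongside conventional regexes.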

Regex has been referred to as a "write only" programming language for sure. However, I don't think that means you should avoid them. I just think you should comment the hell out of their intent. I'm usually not a big fan of comments that explain what a line does (I can read the code for that), but regexes are the exception. Comment everything!

I usually go to the extent of writing a scanner specification file. A scanner, or rather the output of a "scanner generator", is essentially an optimized text parser. Since I usually work with Java my preferred tool is JFlex (http://www.jflex.de), but there are also Lex (often paired with the parser generator YACC) and several others.

Scanners work on regular expressions that you can define as macros. Then you implement callbacks when the regular expressions match part of the text.

When it comes to the code, I have a specification file containing all the parsing logic. I run it through the scanner generator tool of choice to generate the source code in the language of choice. Then I just wrap all that into a parser function or class of some sort. This abstraction makes it easy to manage all the regular expression logic, and it performs very well. Of course, it is overkill if you are working with just one or two regexps, and it easily takes at least 2-3 days to learn what the hell is going on. But if you ever work with, say, 5 or 6 or 30 of them, it becomes a really nice feature: implementing parsing logic starts to take only minutes, and the rules stay easy to maintain and easy to document.
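To give a feel for the idea of named rules plus callbacks, here is a tiny hand-rolled scanner in Python. It is only an illustration of the concept under assumed token rules; it is not what JFlex generates, and a generated scanner would be considerably faster.

```python
import re

# Hypothetical token rules: (name, regex) pairs playing the role of macros.
RULES = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=]"),
    ("SKIP", r"\s+"),
]
# One combined pattern; each rule becomes a named group.
MASTER_RE = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in RULES))

def scan(text, callbacks):
    """Call callbacks[name](lexeme) for each token and collect the results."""
    results = []
    pos = 0
    while pos < len(text):
        m = MASTER_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}: {text[pos]!r}")
        name = m.lastgroup  # name of the rule that matched
        if name in callbacks:
            results.append(callbacks[name](m.group()))
        pos = m.end()
    return results
```

For example, `scan("x = 42 + y", {"NUMBER": int, "IDENT": str.upper, "OP": lambda s: s})` invokes one callback per token and silently drops whitespace, since no callback is registered for `SKIP`.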

I use them in my apps, but I keep the actual regex in the configuration file. That way, if the source text I'm parsing (an email, for example) changes format for some reason, I can quickly update the config to handle the change without rebuilding the app.
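A minimal sketch of that setup in Python, assuming an INI-style config with a `[patterns]` section (the section name, key, and pattern are all hypothetical; a real app would read the text from a file on disk):

```python
import configparser
import re

# Hypothetical config contents, inlined here so the example is self-contained.
CONFIG_TEXT = r"""
[patterns]
order_id = Order\s+No\.\s*(\d+)
"""

def load_patterns(config_text):
    """Compile every entry in the [patterns] section into a regex object."""
    # interpolation=None keeps '%' in a pattern from being treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return {name: re.compile(rx) for name, rx in parser["patterns"].items()}
```

One caveat with this design: a malformed pattern in the config only shows up when `re.compile` raises `re.error` at load time, so it is worth loading and checking the patterns at startup rather than on first use.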