Thing is: if there is more than one way to solve a problem, the PHP programmer (I usually browse the PHP tag on StackOverflow) will ask for help on the solution involving regular expressions.

Even when it will be less economical, even when the PHP manual suggests using str_replace instead of any preg_* or ereg_* function when no fancy substitution rules are required.

Does somebody have a clue about why this happens?

Don't get me wrong, some of my best friends are regular expressions and I don't despise Perl. What I don't get is why there is no looking for alternatives whatsoever, even when the overkill is obvious (regex to switch strings) or the code complexity rises exponentially (regex for getting data from HTML in PHP).
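To make the "regex to switch strings" case concrete, here is a minimal sketch. The sample text and replacement are made up for illustration; str_replace, preg_quote and preg_replace are the standard PHP functions.

```php
<?php
// Literal substitution: no pattern syntax, nothing to escape.
$text = 'Price: 10 USD (approx.)';

$plain = str_replace('(approx.)', '(estimated)', $text);

// The same swap with a regex needs the literal escaped first,
// otherwise "(", ")" and "." are read as metacharacters.
$pattern = '/' . preg_quote('(approx.)', '/') . '/';
$viaRegex = preg_replace($pattern, '(estimated)', $text);

var_dump($plain === $viaRegex); // both calls produce the same string
```

The regex version does the same job with one extra step (escaping) that is easy to forget.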

Because they're cryptic, so you want to be part of the exclusive kewl kidz' club? And mostly because they provide a short way of expressing a match or extraction, which is what they're made for. Sure, for dummy cases custom parsing is better, but dev time favors writing a quick regex.
– haylem, Mar 5 '12 at 2:31

17 Answers

Because on a subconscious level they feel like an entire smart program that can accomplish a lot of its own accord while being encompassing and self-adjusting (think patterns).

This is why people immediately believe regular expressions will solve any of their text-based tasks, somehow not thinking it might be overkill and not realizing it might be underkill (parsing languages with it).

When the only tool you have is a regex, every problem looks like ^((?>[a-zA-Z\d!#$%&'*+\-/=?^_{|}~]+\x20*|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*"\x20*)*(?<angle><))?((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[(((?(?<!\[)\.)(25[0-5]|2[0-4]\d|[01]?\d?\d)){4}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])(?(angle)>)$

I dunno... I think this pretty much sums up the whole of it. If you know regex, and don't know about the other methods, why would you go looking? You've already got a tool that, if done correctly, will handle the job. Until they stumble across the simpler method or are told about it, regex will be the catch-all method, even if more complex than it need be.
– Aeo, Dec 21 '10 at 13:55

@Tom O'Connor I think it's something close to the Regex for matching an RFC 2822 email address, but I had to take out a couple of characters because they were wreaking havoc with the markdown.
– glenatron, Dec 21 '10 at 15:25

In earlier phases of my career (i.e. pre-PHP), I was a Perl guru, and one major aspect of Perl gurudom is mastery of regular expressions.

On my current team, I'm literally the only one of us who reaches for regex before other (usually nastier) tools. Seems like to the rest of the team they're pure magic. They'll wheel over to my desk and ask for a regex that takes me literally ten seconds to put together, and then be blown away when it works. I don't know--I've worked with them so long, it's just natural at this point.

In the absence of regex-fluency, you're left with combinations of flow-control statements wrapping strstr and strpos statements, which gets ugly and hard to run in your head. I'd much rather craft one elegant regex than thirty lines of plodding string searching.
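As a rough illustration of that contrast, here is a sketch of one small task done both ways. The task (pulling an uppercase level out of a line like "[ERROR] disk full") and both function names are hypothetical, not taken from my team's code.

```php
<?php
// Hypothetical task: pull the level out of a line such as "[ERROR] disk full".

function levelWithStrpos(string $line): ?string
{
    // The line must start with an opening bracket.
    if (strpos($line, '[') !== 0) {
        return null;
    }
    $close = strpos($line, ']');
    if ($close === false) {
        return null;
    }
    $level = substr($line, 1, $close - 1);
    // Accept only a non-empty run of ASCII uppercase letters.
    $upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    if ($level === '' || strspn($level, $upper) !== strlen($level)) {
        return null;
    }
    return $level;
}

function levelWithRegex(string $line): ?string
{
    return preg_match('/^\[([A-Z]+)\]/', $line, $m) === 1 ? $m[1] : null;
}

var_dump(levelWithStrpos('[ERROR] disk full') === levelWithRegex('[ERROR] disk full'));
```

Whether the one-liner is easier to maintain is exactly what the comments below argue about.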

I'm curious: do you read regexp's as fluently as you write them?
– peterchen, Dec 21 '10 at 16:56

I hope you're holding regular regex training sessions and/or documenting the hell out of your code; otherwise you are creating a support nightmare for your coworkers. The time you saved by writing that regex may be lost a hundred times over by people trying to make sense of what that "elegant regex" is doing.
– Jeff Knecht, Dec 21 '10 at 20:11

So great. You can hear the tug-of-war between loving and hating regexes right here in these comments.
– Dan Ray, Dec 21 '10 at 20:50

@Ben Lee: I guess so - OTOH, I've never encountered a commented regex in the wild. Some of the problems with regexes may be based on an attitude of coolness.
– peterchen, Feb 15 '12 at 16:31

On the contrary. People are parroting the "regex are evil" meme way too often IMO. It's obvious that preg_match is overused in PHP, but it's less obvious that it's oftentimes sensible to do so (in PHP).

I would go so far as to conjecture that using the string functions is yet another micro-optimization in PHP land. There are many of them, many are useful, and they are usually the better choice. But you shouldn't shun preg_match in favour of multiple strpos and if chains, because in practice it turns out that libpcre is often faster than PHP can execute a loop looking for string alternatives, for example.

As a recent example made me realize, testing if a string is all-lowercase:

if ($string == strtolower($string))

Is more readable than:

if (!preg_match("/[A-Z]/", $string))

And you would assume the first must be faster, since it's all-PHP. But in reality the regex only looks over the string once, and can abort the negated condition as soon as it finds an uppercase letter. The strtolower() approach, however, looks over the string twice. First strtolower() makes a duplicate of the string by iterating over each letter, comparing and lowercasing it. Then the == iterates over the original and the copy again, comparing them once more.

So that's not an obvious case. And to be objective the first one is often faster, since you normally just compare short strings. But it's imperative to not go blindly by the assumption that PHP string functions are always advisable over regular expressions.
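For anyone who wants to try the two checks side by side, here is a small sketch with both wrapped in hypothetical helper functions. One deliberate deviation from the snippets above: it uses === rather than ==, because with == PHP compares two numeric-looking strings such as "1E3" and "1e3" as numbers and reports them equal. It makes no claim about which variant is faster; measure on your own data.

```php
<?php
// Both checks, wrapped in hypothetical helpers so they can be compared.

function isLowerViaCopy(string $s): bool
{
    // Strict === avoids numeric-string juggling ("1E3" == "1e3" is true).
    return $s === strtolower($s);
}

function isLowerViaRegex(string $s): bool
{
    // preg_match() returns 0 when no ASCII uppercase letter is found.
    return preg_match('/[A-Z]/', $s) === 0;
}

// The two helpers should agree on each of these ASCII inputs.
foreach (['hello world', 'Hello', '1E3', ''] as $s) {
    var_dump(isLowerViaCopy($s) === isLowerViaRegex($s));
}
```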

(I'm tempted to add another rant about @bobince's fun answer regarding xhtml-regexes, and how it's recently often linked in a very unhelpful manner. And the more objective answers below go ignored.)

I agree with your example; still, in this particular case, I would prefer strtolower() anyway: in non-critical code, even such a big (relative to the other implementation) execution time optimization is insignificant - unless you want to evaluate the lower-case-ness of a huge text file, but I can't imagine a case in which that would be useful.
– cbrandolino, Dec 21 '10 at 12:56

@cbrandolino: No discussion there. This stuff should only ever be relevant and evaluated for nested loops, where it might make a factual difference.
– mario, Dec 21 '10 at 12:59

+1 For the fact people always bash them, far more than they are supported.
– Orbling, Dec 21 '10 at 13:47

As one of the "regexp bashers": It's fun to see a one-liner more or less express what "manual" string parsing needs 30 lines for. However, maintenance suffers in most realistic examples. In addition, when trying to apply them to unvalidated input, generating suitable diagnostics for rejected input requires additional acrobatics. For me, it's the prototypical "write only" code - cool for quick scripts, sucks for long-living apps.
– peterchen, Dec 21 '10 at 16:54

Anybody who isn’t writing all his regexes in /x mode to allow for whitespace for the elbowroom of cognitive chunking, and for comments to explain why things are being done, should of course have his ears boxed. But for real regexes of reasonable complexity, you need to consider applying top-down design via grammatical regexes. Once you have seen the light, you’ll never go back to /@#$^^@#$^&&*)@#/.
– tchrist, Mar 4 '12 at 23:52
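For readers who have not seen /x mode, a minimal PHP sketch follows. The YYYY-MM-DD pattern is a made-up example, not something from this thread. The x modifier makes PCRE ignore whitespace in the pattern and treat # as the start of a comment. One thing to watch in PHP: the delimiter (here /) must not appear inside the comments.

```php
<?php
// The same date pattern twice: compact, then spread out with the x modifier.
$compact = '/\A(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\z/';

$commented = '/
    \A
    (\d{4})                     # year: four digits
    -
    (0[1-9] | 1[0-2])           # month: 01 to 12
    -
    (0[1-9] | [12]\d | 3[01])   # day: 01 to 31, not checked against the month
    \z
/x';

// Both patterns accept and reject the same inputs.
foreach (['2012-03-04', '2012-13-04', '2012-3-4'] as $date) {
    var_dump(preg_match($compact, $date) === preg_match($commented, $date));
}
```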

Regular expressions are very attractive because they are the best tool for parsing a regular language.

They have the following advantages:

They are concise. It generally takes a lot more code to parse a specific regular language using a specific algorithm that you have come up with than with a regexp.

They are quick to use. It generally takes a lot more time to write a parser for a specific regular language using a specific algorithm that you have come up with than with a regexp.

They are easy. Once you learn the set of special characters and their meanings, it is easy to compose a regexp (although a little harder to read them). Regexps are languages themselves - a useful trait because our species has evolved to be very good at language.

They are fast. Once compiled, they can match a string of length N in O(N) time.

They are flexible. They can match any regular language and a lot of our data is expressed as a regular language.

They are ubiquitous. Most programming languages have basic regexp support - either through external libraries or embedded into the language itself. There is also not too much variation between the regexp languages themselves.

This makes them attractive for situations to which they are suited, but people may use them in contexts where they are not the best tool, because they:

Don't understand that what they are matching can't be expressed using a regexp (eg. HTML).

Are lazy (in a bad way) - they know a tool and recognise that it isn't the best tool for what they are doing but it will work without problems 95% of the time and takes 95% of the effort of learning a particular parser or writing one from scratch.
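The HTML point can be seen with a tiny sketch. The markup below is hypothetical, and the patterns are the classic non-recursive kind people usually reach for: the lazy version breaks on nesting, the greedy version breaks on siblings. A real HTML parser (for example PHP's DOM extension) is the usual alternative.

```php
<?php
// Hypothetical markup: one nested pair of divs, one pair of siblings.
$nested   = '<div>outer <div>inner</div> tail</div>';
$siblings = '<div>a</div><div>b</div>';

// Lazy quantifier: stops at the first closing tag, so nesting breaks it.
preg_match('/<div>(.*?)<\/div>/s', $nested, $lazy);

// Greedy quantifier: runs to the last closing tag, so siblings break it.
preg_match('/<div>(.*)<\/div>/s', $siblings, $greedy);

var_dump($lazy[1]);   // string(16) "outer <div>inner"
var_dump($greedy[1]); // string(13) "a</div><div>b"
```

Neither capture is the content of a well-formed div, which is the "will work 95% of the time" trap in miniature.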

Hmmm, I can only guess. Maybe some people have experienced that 30 lines of their code were replaced by a 20-character-long regex, so it feels wrong to them to use anything else instead when regexes can be used.

In terms of our evolutionary history that stands to reason. We were matching patterns long before we were defining grammars or discovering syllogisms.
– glenatron, Dec 21 '10 at 11:35

I disagree, programming involves logic and pattern matching, two areas. Regexps are very good at pattern matching and should be used for such tasks. To say "I don't like them" is to throw away a good tool for a particular job.
– Orbling, Dec 21 '10 at 13:48

I think the ubiquity of regex is due to the ubiquity of strings. The string is the simplest data structure, the first one that most of us learn. Since all of our code is written in symbolic form, it is natural for a programmer to consider modelling something in symbolic form. But if our programming language offers any resistance when we try to extend its syntax for our clever new symbolic forms, they all end up between quotes. The relational data model has SQL. The XML data model has XQuery. But what about the humble string data model? Regex!

Just yesterday, I was looking over the API for a shiny new Javascript framework that supports HTML5 game development. It has a declarative mechanism for describing the main subsystems that your game would need. How does one specify those features? JSON? Fluent dot notation? An array? Nope -- a string containing a comma- and whitespace-separated list of feature names. I wonder how it parses that list... ?
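My guess at how such a list gets parsed, sketched in PHP rather than JavaScript to stay with the thread's language. The feature names are invented, and I have not looked at the framework's source. The regex-free variant shown next to it assumes every name is followed by a comma, so the two would differ on names separated by whitespace alone.

```php
<?php
// Hypothetical feature string in the style described above.
$features = "audio, input ,physics\n  , tilemap";

// Split on any run of commas and whitespace; drop empty pieces.
$names = preg_split('/[\s,]+/', $features, -1, PREG_SPLIT_NO_EMPTY);

// Regex-free variant: split on commas, trim each piece, drop empty ones.
$manual = array_values(
    array_filter(array_map('trim', explode(',', $features)), 'strlen')
);

print_r($names);
var_dump($names === $manual);
```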

Because you can see the whole thing at once. By being able to see the whole thing, it can be easier to work with, and that's always nice. It's sort of like the reason that many C++ programmers still use printf-type statements: It's not typesafe (though gcc at least can check types on printf statements), and it's not pretty, but boy is it compact and usable.

If it's a simple enough regex, then they often ARE the best way to do things - their compact form and many capabilities make them perfect for certain tasks. The problem comes when you make the regex so complicated that you can't read it anymore, or when you're using a complex regex to do something that could be more quickly done via simple string operations.

Regex, like any other powerful tool, must be used in proper moderation - not too much, not too little. And unless performance is a big concern, a single regex may at times be quicker to write and easier to debug than a series of string operations.

Hmm, the current answers center too much on technical aspects, and the readability pros/cons (which is an important point). So let me try to shift it a bit more onto the PHP environment/community:

PHP is Perl's little stepsister. And regular expressions are an integral part of Perl (they invented that stuff, didn't they?). That's one reason why regexps are pervasive in PHP too.

The use case of PHP is coincidentally not much unlike the use case for regular expressions. PHP is structurally used for glueing together HTML pages. And regexps work on text. (what WReach said)

Micro-optimization. As mentioned before: people frequently choose between regexps and PHP string functions based on perceived speed. A core problem in PHP circles, not specific to regexps.

Regular expressions are built-in. In Python, in Java, in C#, in Ruby? there is availability, but a deterrent in having to load an extra module. And see how in PHP or Javascript where it's a core feature, the usage pattern differs. Another exhibit: CSS where it's getting more frequently used.

The PHP manual is at fault. It often is. Regular expressions are easily discoverable, and I postponed this fun fact because it's boring in its obviousness: all the damn tutorials and PHP introduction books always teach about regular expressions, but fail to educate on use cases.

The string API in PHP was designed by the same people that brought you magic quotes and the namespace \ separator. It's encompassing, better than Java, but not glamorous in its entirety. Particularly if strings could double as objects (see Python), string functions might outdo regexps.

But those are just side notes. I believe it's mostly perceptual and technical reasons anyway that lead to overusing and/or shunning regular expressions in general. Yet PHP and its userbase have a few properties which compound it, which is why we see more questions on SO about it [citation needed!] and why they are "morbidly attractive" there.

I like regular expressions in general. I find them easier to read/understand than the 20 lines of code I would have to replace them with. Short regular expressions are quickly read and understood and they are relatively easy to maintain (if the expression changes you only have one line to change versus looking through the 20 lines of code to make the change). There are times where they are misused but so are many other things.

The reason you probably see so much abuse of them is that you're browsing the PHP section of StackOverflow. As I am sure you are aware, there are a lot of, umm, immature PHP programmers out there.

-1 For deciding that all programmers like to be obscure, and then not considering any other possible explanation. ...Stating why you think they are ugly or incomprehensible would have helped.
– Macneil, Dec 21 '10 at 14:06

@Macneil - Please, (although yes, my thoughts are along that line), unless you're quoting me don't state that I said/decided on something I didn't (the first part of your comment). As far as your question, you find them beautiful?! ... I don't. And since this is a subjective site, and that is a subjective opinion, I don't have to nor wish to elaborate on it. Nor will I try, for that matter.
– Rook, Dec 21 '10 at 14:14

@Rook - I think most people look at a complex regular expression, decide all regular expressions are ugly, and then stop thinking. The fact is, they're a very elegant and expressive tool if you can set down your prejudice about them. BTW, by your own logic, a lot of programmers can't do algebra, so algebra is probably inherently evil and should be abolished since it's clearly not very understandable.
– Dan Ray, Dec 21 '10 at 14:15

Man is a tool-using creature, and regular expressions are powerful tools. A nice metaphor for regular expressions is a meat slicer from a deli. If you want paper-thin slices of turkey, corned beef, etc., it's just the thing. However, you need skilled hands to use it, because you can cut yourself really badly with it and you won't feel a thing until you see the blood. What I mean by this is that the big problem with regular expressions is that getting them slightly off means you match something you shouldn't, or vice versa, and you don't find out until it causes an issue further along in the process.
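A small sketch of "slightly off", using a hypothetical rule that the input must be exactly a four-digit PIN. The three patterns look nearly identical but accept different inputs. The second one trips people up because, without the D modifier, $ in PCRE also matches just before a trailing newline.

```php
<?php
// Intended rule (hypothetical): the input must be exactly four digits.
$loose  = '/\d{4}/';      // unanchored: four digits anywhere will do
$dollar = '/^\d{4}$/';    // "$" also matches just before a trailing newline
$strict = '/\A\d{4}\z/';  // anchored to the real start and end

foreach (["1234", "12345", "x1234y", "1234\n"] as $input) {
    printf(
        "%-8s loose=%d dollar=%d strict=%d\n",
        addcslashes($input, "\n"),
        preg_match($loose, $input),
        preg_match($dollar, $input),
        preg_match($strict, $input)
    );
}
```

Only the first input should pass all three. The loose pattern accepts every one of them, and the $ version quietly lets the trailing newline through.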

Regular expressions are very attractive because they wield power. You can do a very complicated piece of work in very few characters.

The problem is that the standard regular expression construct is not Turing-complete which means that there are programs you simply cannot implement with a regular expression, and people don't KNOW that when they are lured by the apparent power of regular expressions.

This - I guess - is the reason for the jwz-quote of "now they have two problems".

I would guess that Perl regular expressions are Turing-complete, but apparently it has not been decisively proved or disproved yet.

Because it's an efficient way to program a finite state machine, which is a powerful tool when it applies. It's basically its own language for programming FSMs, which is helpful if you know the language, annoying if you don't.
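To show what the regex is standing in for, here is a sketch of a hand-written state machine next to the equivalent pattern. The task (recognising optionally signed integers) and both function names are made up for illustration.

```php
<?php
// Hypothetical recogniser for optionally signed integers, e.g. "-42".
// States: 0 = start, 1 = sign seen, 2 = at least one digit seen (accepting).
function isSignedInt(string $s): bool
{
    $state = 0;
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $c = $s[$i];
        $isDigit = strpos('0123456789', $c) !== false;
        if ($state === 0 && ($c === '+' || $c === '-')) {
            $state = 1;      // a sign is only allowed as the first character
        } elseif ($isDigit) {
            $state = 2;      // any digit moves to (or stays in) the accepting state
        } else {
            return false;    // no transition defined: reject
        }
    }
    return $state === 2;
}

// The same machine written as a pattern.
function isSignedIntRegex(string $s): bool
{
    return preg_match('/\A[+-]?[0-9]+\z/', $s) === 1;
}

foreach (['-42', '7', '+', '4-2', ''] as $s) {
    var_dump(isSignedInt($s) === isSignedIntRegex($s));
}
```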

In my experience, regexes are like an ancient art, something obscure. Some people resent them because they can't understand the sorcery involved, and maybe because nobody will explain it to them. I haven't heard of universities teaching them for something less trivial than matching an e-mail.
Then there are the mystical inner workings: since most people don't understand them, they must be slow.
And getting them to work fine on the first try is always a challenge for newcomers.

The same thing can be said about Perl, awk, Linux, and everything that has no shiny buttons or nice colored syntax.
So, it's like added complexity to "trivial tasks", just throw some loops, splits, a switch, some magic and that's it, something that might work.
But well, if you are on the other side of the road, regexes are beautiful cookie cutters that look like signal noise without any nasty loops or more stuff to debug. I like them also for the flexibility they provide. When the pattern to match changes, you just change the regex, not the algorithm, or tool/whatever, and it's nice and working again. And since they are a magical string, you can put it outside the sourcecode if you wish.
And another thing that makes me think of Perl: if you write a regex that's 20+ chars long, it feels like you accomplished a lot. At least for me, it's just so neat and compact. I'm also a lazy programmer; I don't like writing a lot of code with nice indentation and comments and adding some bugs to the mix.