Prologue

Today I'd like to talk about the use of regular expressions to parse
and modify HTML. Or rather, the misuse.

I'm going to try to convince you that it's a very bad idea to use
regexes for HTML. And I'm going to introduce you to Nokogiri, my new
best friend and life companion, who can do this job way better, and
nearly as fast.

For those of you who just want the meat without all the starch:

You don't parse Ruby or YAML with regular expressions, so don't do it with HTML, either.

If you know how to use Hpricot, you know how to use Nokogiri.

Nokogiri can parse and modify HTML more robustly than regexes, with less of a performance penalty than rendering Markdown or Textile.

Nokogiri is 4 to 10 times faster than Hpricot performing the typical HTML-munging operations benchmarked.

The Scene

On one of the open-source projects I contribute to (names will be
withheld for the protection of the innocent; this isn't Daily WTF),
I came across the following code:

In case it's not clear, the goal of this method is to insert a
<span> element inside the link, converting hyperlinks from

<a href='http://foo.com/'> Foo! </a>

to

<a href='http://foo.com/'> <span> Foo! </span> </a>

for CSS styling.

The Problem

Look, I love regexes as much as the next guy, but this regex is
seriously busticated. If there is more than one <a> tag on a line,
only the final one will be spanified. If the tag contains an embedded
newline, nothing will be spanified. There are probably other unobvious
bugs, too, and that means there's a code smell here.
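To make those two failure modes concrete, here's a minimal sketch. The pattern below is a hypothetical stand-in that I made up to reproduce the symptoms described above; it is not the project's actual regex, and the method name naive_spanify is likewise invented for illustration.

```ruby
# Hypothetical stand-in for the kind of regex described above.
# This is NOT the project's actual code; it only shows the same symptoms.
def naive_spanify(html)
  # Both (.*) groups are greedy, and "." does not match a newline by default.
  html.gsub(/<a(.*)>(.*)<\/a>/) { "<a#{$1}><span>#{$2}</span></a>" }
end

one_line  = "<a href='/x'>X</a> <a href='/y'>Y</a>"
multiline = "<a href='/x'>\nX\n</a>"

# The greedy first group swallows the first link whole, so only the
# final link on the line gets a <span>.
puts naive_spanify(one_line)
# prints: <a href='/x'>X</a> <a href='/y'><span>Y</span></a>

# The link text spans a newline, so the pattern never matches and the
# input comes back untouched.
puts naive_spanify(multiline) == multiline
# prints: true
```

Making the groups non-greedy and adding the /m flag would patch these two cases, but that's exactly the whack-a-mole game: nested markup, a ">" inside an attribute value, or comments would each need another patch.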

Sure, the regex could be fixed to work in these cases. But does a
trivial feature like this justify the time spent writing test cases
and playing whack-a-mole with regex bugs? Code smell.

Let's look at it another way: If you were going to modify Ruby code
programmatically, would you use regular expressions? I seriously doubt
it. You'd use something like ParseTree, which understands all of
Ruby's syntax and will correctly interpret everything in context, not
just in isolation.

What about YAML? Would you modify YAML files with regular expressions?
Hells no. You'd slurp it with YAML.load(), modify the in-memory
data structures, and then write it back out.
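For the record, that round trip looks something like the sketch below. It uses YAML.load from Ruby's standard library, which returns plain hashes and arrays; the config document and its keys are made up for this illustration.

```ruby
require 'yaml'

# Hypothetical config document, invented for this illustration.
source = <<~DOC
  name: demo
  port: 3000
DOC

config = YAML.load(source)    # parse into a plain Ruby Hash
config['port'] = 8080         # modify the in-memory data structure
updated = YAML.dump(config)   # serialize it back out as YAML text

puts updated
```

No pattern matching against the raw text, and no guessing about quoting or indentation: the parser and the emitter handle the syntax.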

Why wouldn't you do the same with HTML, which has its own nontrivial
(and DTD-dependent) syntax?

Regular expressions just aren't the right tool for this job. Jamie
Zawinski said it best:

Some people, when confronted with a problem, think "I know,
I'll use regular expressions." Now they have two problems.

Why, God? Why?

So, what drives otherwise intelligent people (myself included) to whip
out regular expressions when it comes time to munge HTML?

My only guess is this: A lack of worthy XML/HTML libraries.

Whoa, whoa, put down the flamethrower and let me explain myself. By
"worthy", I mean three things:

Now, Hpricot is pure genius. It's pretty fast, and the API is
absolutely delightful to work with. It supports CSS as well as XPath
queries. I've even used it (with feed-normalizer) in a Rails
application, and it performed reasonably well. But it's still much
slower than regexes. Here's a (totally unfair) sample benchmark
comparing Hpricot to a comparable (though buggy) regular expression
(see below for a link to the benchmark gist):

For an html snippet 2374 bytes long ...
                      user     system      total        real
regex * 1000      0.160000   0.010000   0.170000 (  0.182207)
hpricot * 1000    5.740000   0.650000   6.390000 (  6.401207)
it took an average of 0.0064 seconds for Hpricot to parse and operate on an HTML snippet 2374 bytes long

For an html snippet 97517 bytes long ...
                      user     system      total        real
regex * 10        0.100000   0.020000   0.120000 (  0.122117)
hpricot * 10      3.190000   0.300000   3.490000 (  3.502819)
it took an average of 0.3503 seconds for Hpricot to parse and operate on an HTML snippet 97517 bytes long
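If you want to take similar measurements yourself, a harness of roughly this shape will do. This is a minimal sketch using the standard library's Benchmark module, not the code from the gist; the helper name average_seconds and the sample workload are made up for illustration.

```ruby
require 'benchmark'

# Run the given block `iterations` times and return the average
# wall-clock ("real") seconds per run.
def average_seconds(iterations)
  tms = Benchmark.measure { iterations.times { yield } }
  tms.real / iterations
end

# Hypothetical workload, invented for this sketch: a small HTML snippet
# and a simple regex substitution over it.
snippet = "<p><a href='/x'>X</a></p>\n" * 100
avg = average_seconds(1000) { snippet.gsub(/<a([^>]*)>/) { "<a#{$1}><span>" } }

printf("average: %.6f seconds per run\n", avg)
```

The "average" lines in the output above are the same arithmetic: the "real" column divided by the iteration count.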

So, historically, I haven't used Hpricot everywhere I could have, and
that's because I was overly cautious about performance.

Get On With It, Already

Oooooh, if only there were a library with libxml2's speed and Hpricot's
API. Then maybe people wouldn't keep trying to use regular expressions
where an HTML parser is needed.

Check out the full benchmark, comparing the same operation (spanifying
links and removing possibly unsafe tags) across regular expressions,
Hpricot and Nokogiri:

For an html snippet 2374 bytes long ...
                      user     system      total        real
regex * 1000      0.160000   0.010000   0.170000 (  0.182207)
nokogiri * 1000   1.440000   0.060000   1.500000 (  1.537546)
hpricot * 1000    5.740000   0.650000   6.390000 (  6.401207)
it took an average of 0.0015 seconds for Nokogiri to parse and operate on an HTML snippet 2374 bytes long
it took an average of 0.0064 seconds for Hpricot to parse and operate on an HTML snippet 2374 bytes long

For an html snippet 97517 bytes long ...
                      user     system      total        real
regex * 10        0.100000   0.020000   0.120000 (  0.122117)
nokogiri * 10     0.310000   0.020000   0.330000 (  0.322290)
hpricot * 10      3.190000   0.300000   3.490000 (  3.502819)
it took an average of 0.0322 seconds for Nokogiri to parse and operate on an HTML snippet 97517 bytes long
it took an average of 0.3503 seconds for Hpricot to parse and operate on an HTML snippet 97517 bytes long

Wow! Nokogiri parsed and modified blog-sized HTML snippets in under 2
milliseconds! This performance, though significantly slower than
regular expressions, is fast enough for me to consider using it
in a web application server.
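If you'd like to check the arithmetic, the per-run averages and the Nokogiri-versus-Hpricot ratios fall straight out of the "real" column. The sketch below only re-derives figures from the benchmark output above; the variable names are mine.

```ruby
# "real" (wall-clock) seconds copied from the benchmark output above.
real = {
  small: { iterations: 1000, nokogiri: 1.537546, hpricot: 6.401207 },
  large: { iterations: 10,   nokogiri: 0.322290, hpricot: 3.502819 }
}

real.each do |size, t|
  per_run = t[:nokogiri] / t[:iterations]   # average seconds per parse-and-modify
  speedup = t[:hpricot] / t[:nokogiri]      # how many times faster than Hpricot
  printf("%-5s nokogiri: %.4f s per run, %.1fx faster than hpricot\n", size, per_run, speedup)
end
```

That works out to roughly 4.2x on the 2374-byte snippet and roughly 10.9x on the 97517-byte one.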

Hell, that's as fast as (faster than, actually) BlueCloth or RedCloth
can render Markdown or Textile of similar length. If you can justify
using those in your web application, you can certainly afford the
overhead of Nokogiri.

And as for usability, let's compare the regular expressions to the Nokogiri operations: