Sometimes people imply that we've forgotten, or that we don't know how
to properly manage our codebase. Those people are super fun to respond
to!

We've gone back and forth a couple of times over the past few years,
but the current policy of Team Nokogiri is to not provide a
gemspec in the Github repo. This is a conscious choice, not an
oversight.

But You Didn't Answer the Question!

Ah, I was hoping you wouldn't notice. Well, OK, let's do this, if
you're serious about it.

I'd like to start by talking about risk. Specifically, the risk
associated with using a known-unstable version of Nokogiri.

Risk

The risk associated with a Nokogiri bug could be loosely defined by
answering the questions:

"How likely is it that a bug exists?" (probability)

"How severe will the consequences of a bug be?" (impact)

Probability

The master branch should be considered unstable. Team Nokogiri are
not 10-foot-tall code-crunching robots; we are humans. We make
mistakes, and as a result, any arbitrary commit on master is likely
to contain bugs.

Just as an example, Nokogiri master was unstable for about five
months between November 2011 and March 2012. It was unstable not
because we were sloppy, or didn't care, but because the fixes were
hard and unobvious.

When we release Nokogiri, we test for memory leaks and invalid memory
access on all kinds of platforms with many flavors of Ruby and lots of
versions of libxml2. Because these tests are time-consuming, we don't
run them on every commit. We run them often when preparing a release.

If we're releasing Nokogiri, it means we think it's rock solid.

And if we're not releasing it, it means there are probably bugs.

Impact

Nokogiri is a gem with native extensions. This means it's not pure
Ruby -- there's C or Java code being compiled and run, which means
that there's always a chance that the gem will crash your application,
or worse. Possible outcomes include:

leaking memory

corrupting data

making benign code crash (due to memory corruption)

So, then, a bug in a native extension can have a much worse downside
than you might think. It's not just going to do something unexpected;
it's possibly going to do terrible, awful things to your application
and data.

Nobody wants that to happen. Especially Team Nokogiri.

Risk, Redux

So, if you accept the equation

risk = probability x impact

and you believe me when I say that:

the probability of a bug in unreleased code is high, and

the impact of a bug is likely to be severe,

then you should easily see that the risk associated with a bug in
Nokogiri is quite high.

Part of Team Nokogiri's job is to try to mitigate this risk. We have a
number of tactics that we use to accomplish this:

we respond quickly to bug reports, particularly when they are possible memory issues

we review each other's commits

we have a thorough test suite, and we test-drive new features

we discuss code design and issues on a core developer mailing list

we use valgrind to test for memory issues (leaks and invalid
access) on multiple combinations of OS, libxml2 and Ruby

we package release candidates, and encourage devs to use them

we do NOT commit a gemspec in our git repository

Yes, that's right, the absence of a gemspec is a risk mitigation
tactic. Not only does Team Nokogiri not want to imply support for
master, we want to actively discourage people from using
it. Because it's not stable.

But I Want to Do It Anyway

One option is to email the nokogiri-talk list and ask for a release
candidate to be built. We're pretty accommodating if there's a bugfix
that's a blocker for you. And if we can't release an RC, we'll tell
you why.

And in the end, nothing is stopping you from cloning the repo and
generating a private gemspec. This is an extra step or two, but it has
the benefit of making sure developers have thought through the costs
and risks involved; and it tends to select for developers who know
what they're doing.
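If you do go that route, a private gemspec is only a few lines. Here's a
minimal sketch; the file name, version string, globs, and metadata are
placeholders for you to fill in from your clone, not anything Team
Nokogiri ships:

# private.gemspec -- a hypothetical, minimal private gemspec
Gem::Specification.new do |s|
  s.name       = "nokogiri"
  s.version    = "1.5.0.mybuild.20120401"  # placeholder prerelease version
  s.summary    = "private build of Nokogiri from git"
  s.authors    = ["Your Name Here"]
  s.files      = Dir["lib/**/*.rb", "ext/**/*.{c,h,rb}"]
  s.extensions = ["ext/nokogiri/extconf.rb"]
end

Build it with `gem build private.gemspec`, install the resulting .gem
file, and you own the consequences.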

In Conclusion

Team Nokogiri takes stability very seriously. We want everybody who
uses Nokogiri to have a pleasant experience. And so we want to make
sure that you're using the best software we can make.

Please keep in mind that we're trying very hard to do the right thing
for all Nokogiri users out there in Rubyland. Nokogiri loves you very
much, and we hope you love it back.

I want to go to there.

The bet revolved around a real-world use case (Paul and I both work at Benchmark Solutions, a stealth financial market data startup in NYC).

You can view the data structure at the Official Fairy-Wing Throwdown Repo™, https://github.com/flavorjones/fairy-wing-throwdown, but the summary is that it's 54K when serialized as JSON, and is composed (mostly) of an array of key-value stores (i.e., hashes).

Because I wanted to not just win, but to destroy Paul, I implemented the same parsing task using Nokogiri’s DOM parser, SAX parser, and Reader parser, expecting that code complexity and performance would correlate, somehow. In my mind, the graph looked like this:

[line chart: expected performance, in imaginary units]

But I was shocked and dismayed to see the real results:

[line chart: actual performance, in records processed per second]

What the WHAT?

Yes, that’s right. My payback for increasing the complexity of the code was a reduction in performance. The DOM parser was way faster than either the Reader or SAX parsers.
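For flavor, here's roughly what the three approaches look like. These are sketches against a made-up <record> element, not the actual code from the throwdown repo:

require 'nokogiri'

# DOM: parse the whole document into a tree, then query it
doc = Nokogiri::XML(xml)
ids = doc.xpath('//record').map { |node| node['id'] }

# SAX: stream parse events through a handler object
class RecordHandler < Nokogiri::XML::SAX::Document
  attr_reader :ids
  def initialize; @ids = []; end
  def start_element(name, attrs = [])
    @ids << Hash[attrs]['id'] if name == 'record'
  end
end
handler = RecordHandler.new
Nokogiri::XML::SAX::Parser.new(handler).parse(xml)

# Reader: pull nodes one at a time, keeping only what you need
ids = []
Nokogiri::XML::Reader(xml).each do |node|
  next unless node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
  ids << node.attribute('id') if node.name == 'record'
end

The DOM version is the shortest of the three, and, as the chart shows, it was also the fastest on a 54K document.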

Chart Notes

The “expected performance” line chart is in imaginary units.

The “actual performance” line chart renders performance in number of records processed per second, so bigger is better. The Saikuro and Flog scores were normalized to their values for #transform_via_dom.

The “DOM parser on various platforms” bar chart renders total benchmark runtime, so smaller is better.

When developing Nokogiri, the most valuable tool I use to track down memory-related errors is Valgrind. It rocks! Aaron and I run the entire Nokogiri test suite under Valgrind before releasing any version.

I could wax poetic about Valgrind all day, but for now I'll keep it brief and just say: if you write C code and you're not familiar with Valgrind, get familiar with it. It will save you countless hours of tracking down heisenbugs and memory leaks some day.

In any case, I've been meaning to package up my utility scripts and tools for quite a while. But they're so small, and it's so hard to make them work for every project ... it's looking pretty likely that'll never happen, so blogging about them is probably the best thing for everyone.

Basics
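The obvious first move (a sketch; substitute any script of yours for myscript.rb) is to run the interpreter directly under Valgrind:

valgrind ruby myscript.rb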

Oooh! But that's not actually what you want. The Matz Ruby Interpreter does a lot of funky things in the name of speed, like using uninitialized variables and reading past the ends of malloced blocks that aren't on an 8-byte boundary. As a result, something as simple as require 'rubygems' will give you 3800 lines of error messages (see this gist for full output).

Without going too far off-topic, I should just mention that those "leaks" aren't really leaks; they're characteristic of how the Ruby interpreter manages its internal memory. (You can see this by running the example with --leak-check=full.)

Rakified!

Here's an easy way to run Valgrind on your gem's existing test suite. This rake task assumes you've got Hoe 1.12.1 or higher.
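It goes something like this. Treat it as a reconstruction rather than the verbatim task; in particular, it assumes your Rakefile's Hoe instance is available as HOE and that Hoe (1.12.x-era) provides make_test_cmd:

# --partial-loads-ok and --undef-value-errors silence the spurious
# interpreter warnings described above
VALGRIND_BASIC_OPTS = "--num-callers=50 --error-limit=no " +
                      "--partial-loads-ok=yes --undef-value-errors=no"

desc "run the test suite under valgrind"
task "test:valgrind" do
  cmdline = "valgrind #{VALGRIND_BASIC_OPTS} ruby #{HOE.make_test_cmd}"
  puts cmdline
  system cmdline
end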

Those basic options will give you a decent-sized stack walkback on errors, will make sure you see every error, and will skip all the BS output mentioned above. You can read Valgrind's documentation for more information, and to tune the output.

If you're not testing a gem, or don't have Hoe installed, try this for Test::Unit suites:
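Something like the following, reusing VALGRIND_BASIC_OPTS from above and assuming your tests live in test/test_*.rb (again, a sketch):

desc "run Test::Unit files under valgrind"
task "test:valgrind" do
  Dir.glob("test/test_*.rb").each do |test_file|
    system "valgrind #{VALGRIND_BASIC_OPTS} ruby -Itest #{test_file}"
  end
end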

Prologue

Today I'd like to talk about the use of regular expressions to parse
and modify HTML. Or rather, the misuse.

I'm going to try to convince you that it's a very bad idea to use
regexes for HTML. And I'm going to introduce you to
Nokogiri, my new
best friend and life companion, who can do this job way better, and
nearly as fast.

For those of you who just want the meat without all the starch:

You don't parse Ruby or YAML with regular expressions, so don't do it with HTML, either.

If you know how to use Hpricot, you know how to use Nokogiri.

Nokogiri can parse and modify HTML more robustly than regexes, with less penalty than formatting Markdown or Textile.

Nokogiri is 4 to 10 times faster than Hpricot at the typical HTML-munging operations benchmarked below.

The Scene

On one of the open-source projects I contribute to (names will be
withheld for the protection of the innocent; this isn't the
Daily WTF), I came across the following code:
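It isn't reproduced verbatim here, but it was essentially a one-line gsub of this shape (a reconstruction, not the original code):

def spanify_links(text)
  # if there's more than one <a> on a line, the greedy .* means only one
  # (mangled) replacement happens; and since /./ doesn't match newlines,
  # tags containing an embedded newline aren't touched at all
  text.gsub(/<a href=(.*)>(.*)<\/a>/, '<a href=\1><span>\2</span></a>')
end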

In case it's not clear, the goal of this method is to insert a
<span> element inside the link, converting hyperlinks from

<a href='http://foo.com/'> Foo! </a>

to

<a href='http://foo.com/'> <span> Foo! </span> </a>

for CSS styling.

The Problem

Look, I love regexes as much as the next guy, but this regex is
seriously busticated. If there is more than one <a> tag on a line,
only the final one will be spanified. If the tag contains an embedded
newline, nothing will be spanified. There are probably other unobvious
bugs, too, and that means there's a
code smell here.

Sure, the regex could be fixed to work in these cases. But does a
trivial feature like this justify the time spent writing test cases
and playing whack-a-mole with regex bugs? Code smell.

Let's look at it another way: If you were going to modify Ruby code
programmatically, would you use regular expressions? I seriously doubt
it. You'd use something like
ParseTree, which
understands all of Ruby's syntax and will correctly interpret
everything in context, not just in isolation.

What about YAML? Would you modify YAML files with regular expressions?
Hells no. You'd slurp it with YAML.parse(), modify the in-memory
data structures, and then write it back out.
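In miniature, that round trip looks something like this (a sketch; the file name and key are invented, and I'm using YAML.load for brevity where the prose says YAML.parse):

require 'yaml'

data = YAML.load(File.read('config.yml'))  # slurp into plain Ruby objects
data['timeout'] = 30                       # modify the in-memory structure
File.open('config.yml', 'w') { |f| f.write(data.to_yaml) }  # write it back out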

Why wouldn't you do the same with HTML, which has its own nontrivial
(and DTD-dependent) syntax?

Regular expressions just aren't the right tool for this job. Jamie
Zawinski said it best:

Some people, when confronted with a problem, think "I know,
I'll use regular expressions." Now they have two problems.

Why, God? Why?

So, what drives otherwise intelligent people (myself included) to whip
out regular expressions when it comes time to munge HTML?

My only guess is this: A lack of worthy XML/HTML libraries.

Whoa, whoa, put down the flamethrower and let me explain myself. By
"worthy", I mean three things: it has to be fast, the API has to be
pleasant to work with, and it has to support real queries (XPath and
CSS).
Now, Hpricot is pure
genius. It's pretty fast, and the API is absolutely delightful to work
with. It supports CSS as well as XPath queries. I've even used it
(with feed-normalizer) in
a Rails application, and it performed reasonably well. But it's still
much slower than regexes. Here's a (totally unfair) sample benchmark
comparing Hpricot to a comparable (though buggy) regular expression
(see below for a link to the benchmark gist):

For an html snippet 2374 bytes long ...
                      user     system      total        real
  regex * 1000      0.160000   0.010000   0.170000 (  0.182207)
  hpricot * 1000    5.740000   0.650000   6.390000 (  6.401207)

it took an average of 0.0064 seconds for Hpricot to parse and operate on an HTML snippet 2374 bytes long

For an html snippet 97517 bytes long ...
                      user     system      total        real
  regex * 10        0.100000   0.020000   0.120000 (  0.122117)
  hpricot * 10      3.190000   0.300000   3.490000 (  3.502819)

it took an average of 0.3503 seconds for Hpricot to parse and operate on an HTML snippet 97517 bytes long

So, historically, I haven't used Hpricot everywhere I could have, and
that's because I was overly cautious about performance.

Get On With It, Already

Oooooh, if only there were a library with libxml2's speed and Hpricot's
API. Then maybe people wouldn't keep trying to use regular expressions
where an HTML parser is needed.

Check out the full benchmark,
comparing the same operation (spanifying links and removing
possibly-unsafe tags) across regular expressions, Hpricot and
Nokogiri:

For an html snippet 2374 bytes long ...
                      user     system      total        real
  regex * 1000      0.160000   0.010000   0.170000 (  0.182207)
  nokogiri * 1000   1.440000   0.060000   1.500000 (  1.537546)
  hpricot * 1000    5.740000   0.650000   6.390000 (  6.401207)

it took an average of 0.0015 seconds for Nokogiri to parse and operate on an HTML snippet 2374 bytes long
it took an average of 0.0064 seconds for Hpricot to parse and operate on an HTML snippet 2374 bytes long

For an html snippet 97517 bytes long ...
                      user     system      total        real
  regex * 10        0.100000   0.020000   0.120000 (  0.122117)
  nokogiri * 10     0.310000   0.020000   0.330000 (  0.322290)
  hpricot * 10      3.190000   0.300000   3.490000 (  3.502819)

it took an average of 0.0322 seconds for Nokogiri to parse and operate on an HTML snippet 97517 bytes long
it took an average of 0.3503 seconds for Hpricot to parse and operate on an HTML snippet 97517 bytes long

Wow! Nokogiri parsed and modified blog-sized HTML snippets in under 2
milliseconds! Though still significantly slower than regular
expressions, that's fast enough for me to consider using it in a web
application server.

Hell, that's as fast as (faster, actually, than) BlueCloth or RedCloth
can render Markdown or Textile of similar length. If you can justify
using those in your web application, you can certainly afford the
overhead of Nokogiri.

And as for usability, let's compare the regular expressions to the Nokogiri operations:
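The Nokogiri side is just a parse, a CSS query, and a serialization. Here's a sketch of the spanifying operation (not the exact benchmark code; see the gist for that):

require 'nokogiri'

doc = Nokogiri::HTML(html)
doc.css('a').each do |link|
  link.inner_html = "<span>#{link.inner_html}</span>"  # wrap the link's contents
end
html = doc.to_html

No whack-a-mole required: multiple links on a line and embedded newlines just work, because the parser understands the document's structure.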

Yesterday was a big day, and I nearly missed it, since I spent nearly all of the sunlight hours at the wheel of a car. Nine hours sitting on your butt is no way to ... oh wait, that's actually how I spend every day. Just usually not in a rental Hyundai. Never mind, I digress.

It was a big day because Nokogiri was released. I've spent quite a bit of time over the last couple of months working with Aaron Patterson (of Mechanize fame) on this excellent library, and so I'm walking around, feeling satisfied.

"What's Nokogiri?" Good question, I'm glad I asked it.

Nokogiri is the best damn XML/HTML parsing library out there in Rubyland. What makes it so good? You can search by XPath. You can search by CSS. You can search by both XPath and CSS. Plus, it uses libxml2 as the parsing engine, so it's fast. But the best part is, it's got a dead-simple interface that we shamelessly lifted from Hpricot, everyone's favorite delightful parser.
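A taste of the interface (a sketch, with a made-up page and selectors; assumes the nokogiri gem is installed):

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://example.com/'))
doc.xpath('//h2/a').each { |link| puts link['href'] }  # search by XPath
doc.css('p.summary').each { |para| puts para.text }    # search by CSS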

I had big plans to do a series of posts with examples and benchmarks, but right now I'm in DST Hell and don't have the quality time to invest.

So, as I am wont to do, I'm punting. Thankfully, Aaron was his usual prolific self, and has kindly provided lots of documentation and examples.

At my company, Pharos, we're about to
launch a new product which will contain sensitive data for multiple
firms in a single database. This is essentially a lightweight version
of our flagship product, which was built for a single client.

Of course, as a result, I had to refactor like crazy to get rid of the
implicit "one-firm" assumption that was built into the code and database
schemas.

The essential task was to add "firm_id" to each of the private table
schemas, and then make sure that all the code that accesses the model
specifies the firm in the query. The two access idioms in wide use
were (unsurprisingly):

results = ClassName.find(:all, :conditions => [....])

and

results = ClassName.find_by_entity_id_and_hour(...)

I was able to make minimal changes to the code by supporting the
following new idioms through a mixin (the mixin code is at the end of
the article):
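They looked something like this ("for_firm" is a placeholder name here, standing in for the post's actual scoping method):

results = ClassName.for_firm(firm_id).find(:all, :conditions => [....])

and

results = ClassName.for_firm(firm_id).find_by_entity_id_and_hour(...)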

(I found the second idiom easier to produce (and its diff easier to read) than:

ClassName.find_by_firm_id_and_entity_id_and_hour(firm_id, ...)

but really, that's a matter of taste.)

But I was still nervous. What if I missed an instance of a database
lookup that wasn't specifying firm, and as a result one client saw
another client's records? That would be a Really Bad Thing™, and I
want to explicitly make sure that can't happen. But how?

After a half hour of poking around and futzing, I came up with a
find()-and-friends implementation that will check with_scope
conditions as well as the :conditions parameter to the find() call:
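The original implementation isn't reproduced here, but the idea is easy to sketch. This reconstruction assumes Rails 2.x-era ActiveRecord, where the protected class method scope(:find, :conditions) returns the merged with_scope conditions; the module name, error message, and example model are mine, not the article's:

module FirmCheckedFinder
  def self.included(base)
    base.extend ClassMethods
  end

  module ClassMethods
    # refuse to run any find() that doesn't mention firm_id, either in
    # the surrounding with_scope conditions or in :conditions itself
    def find(*args)
      options = args.last.is_a?(Hash) ? args.last : {}
      firm_checked = scope(:find, :conditions).to_s.include?("firm_id") ||
                     options[:conditions].inspect.include?("firm_id")
      raise "find() does not constrain firm_id!" unless firm_checked
      super
    end
  end
end

class PrivateRecord < ActiveRecord::Base
  include FirmCheckedFinder
end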