Parsing XML documents with CSS selectors

March 31, 2010

HTML and XML documents are the bread and butter of web developers. On a
day-to-day basis, you probably create a lot of HTML documents. And odds are you also
need to parse some from time to time: because you consume a web service and
want to extract some information, or because you want to gather data from
scraped web pages, or just because you want to write functional tests for a
website. Retrieving the document is quite easy, but how do you navigate
through it to extract the information you need?

PHP already comes with a lot of useful tools for parsing XML documents:
SimpleXML, DOM, and XMLReader, just to name a few. But as soon as you
need to extract information deeply embedded in the document structure, things
are not as easy as they should be. Of course, XPath is your best friend when
you need to select elements, but the learning curve is really steep. Even
expressions that should be easy can be complex. As an example, here is the
XPath expression to retrieve all h1 tags that have a foo class:

h1[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]

The XPath expression is complex because a tag can have several classes.
Consider, for instance, three h1 tags whose class attributes are "foo",
"foo bar", and "foobar": the expression should match the first two h1 tags,
but not the third one.
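
To see this behaviour concretely, here is a minimal sketch using PHP's built-in DOM extension. The markup is a made-up sample (an assumption, not taken from a real page), and the expression is prefixed with // so that it looks at every h1 in the document.

```php
<?php
// Hypothetical sample markup: two h1 tags carry the "foo" class, one does not.
$html = <<<HTML
<html><body>
  <h1 class="foo">First</h1>
  <h1 class="bar foo baz">Second</h1>
  <h1 class="foobar">Third</h1>
</body></html>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);

$xpath = new DOMXPath($document);

// The expression from above, prefixed with "//" to search the whole document.
$expression = "//h1[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]";

// Collect the text of every matching h1.
$matched = [];
foreach ($xpath->query($expression) as $node) {
    $matched[] = $node->textContent;
}

print_r($matched);
```

A naive contains(@class, 'foo') would also match the third tag, because "foobar" contains "foo"; padding both the attribute value and the class name with spaces is what avoids that.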

Of course, everybody knows that doing the same with a CSS selector is a piece
of cake:

h1.foo

For Symfony 2 functional tests, I wanted a way to leverage the power and
expressiveness of CSS selectors with the tools we already have in PHP. The
first idea that came to my mind was to convert a CSS selector to its XPath
equivalent. But is it possible? The answer is a resounding 'YES'.

As John Resig wrote in a blog post some time ago about the same topic: "The
biggest thing to realize is that CSS Selectors are, typically, very short - but
woefully underpowered, when compared to XPath."

Writing a tokenizer, a parser, and a compiler able to convert CSS selectors to
XPath is no trivial task. So, instead of reinventing the wheel, I looked for
some existing libraries. I didn't have to look for long before stumbling upon
lxml, a Python library. The lxml.cssselect module of lxml does exactly this.
So, I took the time to translate the Python code to PHP,
added some unit tests, and voilà, the Symfony 2 CSS Selector component was
born.
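
To get a feel for what such a conversion involves, here is a toy sketch that handles only the simplest selectors (tag, tag.class, tag#id, .class, and #id). The function name toyCssToXpath is hypothetical and this is not the component's code: the real thing relies on a full tokenizer and parser to cover combinators, attributes, and pseudo-classes. The descendant-or-self:: prefix is a choice made for this sketch so that the result can be evaluated from any context node.

```php
<?php
/**
 * Toy converter (hypothetical helper, NOT the component's implementation).
 * Supports only "tag", "tag.class", "tag#id", ".class" and "#id".
 */
function toyCssToXpath(string $selector): string
{
    // Optional tag name, optionally followed by one ".class" or "#id" part.
    $pattern = '/^([a-zA-Z][\w-]*)?(?:([.#])([\w-]+))?\z/';
    if ($selector === '' || !preg_match($pattern, $selector, $m)) {
        throw new InvalidArgumentException("Unsupported selector: $selector");
    }

    // No tag name means "any element".
    $tag = ($m[1] ?? '') !== '' ? $m[1] : '*';
    $xpath = 'descendant-or-self::' . $tag;

    $kind = $m[2] ?? '';
    if ($kind === '.') {
        // Class test: pad with spaces so "foo" does not match "foobar".
        $xpath .= sprintf("[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]", $m[3]);
    } elseif ($kind === '#') {
        $xpath .= sprintf("[@id = '%s']", $m[3]);
    }

    return $xpath;
}

echo toyCssToXpath('h1.foo'), "\n";
```

Even this tiny subset needs a pattern and two special cases, which gives an idea of why porting an existing, well-tested implementation was the more attractive option.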

symfony 1 has an sfDomCssSelector class, but it does not convert the CSS
selector to XPath. It does the job nicely but it is limited to very simple
CSS selectors and it cannot easily be used with standard XML tools.

The Symfony 2 CSS Selector component does only one thing, and it tries to do
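it well: converting CSS selectors to XPath expressions. Using it is dead
simple: you hand it a CSS selector and get back an XPath expression that you
can use with the XML tools PHP already provides.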
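
As an illustration, here is a hedged sketch of that workflow. It assumes the present-day standalone package (symfony/css-selector, installed with Composer) and its CssSelectorConverter class, which lives under a different namespace than the 2010 code described in this post. So that the sketch still runs where the package is not installed, it falls back to a hand-written XPath expression; the sample markup is made up.

```php
<?php
// Assumption: symfony/css-selector may be installed via Composer next to this script.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require $autoload;
}

$converterClass = 'Symfony\Component\CssSelector\CssSelectorConverter';
if (class_exists($converterClass)) {
    // Let the component translate the CSS selector into XPath.
    $expression = (new $converterClass())->toXPath('h1.foo');
} else {
    // Fallback so the sketch runs without the package: a hand-written equivalent.
    $expression = "descendant-or-self::h1[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]";
}

// Hypothetical sample document.
$document = new DOMDocument();
$document->loadHTML('<html><body><h1 class="foo">Hello</h1><h1 class="bar">Bye</h1></body></html>');

// The XPath expression plugs straight into the standard DOM tools.
$titles = [];
foreach ((new DOMXPath($document))->query($expression) as $node) {
    $titles[] = $node->textContent;
}

print_r($titles);
```

The point of the design is visible here: the component only produces a string, and everything after that is plain DOMXPath.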

This new CSS Selector component will be used in Symfony 2 for functional
tests (but as you will see in the coming weeks, in a very different way than
what we had in symfony 1).

The code is unit-tested and has good code coverage, so feel free to use it
(the code is on GitHub: http://github.com/fabpot/symfony under the
Symfony\Components\CssSelector namespace) and send me some feedback.

Stay tuned!

Discussion

stereoscott — March 31, 2010 20:29 #1

Fabien, we all appreciate how much you have given the PHP community, and this looks like a great tool for both functional tests and for parsing out content from existing XHTML documents. Thank you, and well done!

Rich — March 31, 2010 20:44 #2

Thanks Fabien, I wish you had posted this a few months ago - I used to parse XML with a YML file which described the different nodes. Of course this solution is much more elegant.

Tom Boutell — March 31, 2010 21:17 #3

Sweet! Like many people, I'm sure, I've written half-assed reimplementations of this in the past. Thanks for doing the DRY thing and tracking down a solid existing implementation and porting it.

Toby — April 01, 2010 00:31 #4

This approach looks pretty similar to http://www.fluentdom.org/ which realizes a jQuery style access approach to the DOM tree.

Sebastian Golasch — April 01, 2010 01:32 #5

First of all, thanks to Fabien for another really useful component.
I'll give it a short try tomorrow.
But one question bothers me: what about performance and/or benchmarks?
Are there any first experiences and/or short comments about this topic?

@Toby
FluentDOM seems to work like a wrapper around the built-in XML functions of PHP, while the CssSelector component 'only' converts CSS expressions to XPath expressions.

It enables you to use the built-in XML functions of PHP directly.
No need to learn a new API,
less error prone because there is no library between you and your good old XML functions,
lower W.T.F factor while refactoring your 'xml-parsing' code, and so on...

I'm still curious about what's cooking at the 'symfony component kitchen' right now and in the near future...

kiang — April 01, 2010 02:32 #6

Just another similar solution: http://framework.zend.com/manual/en/zend.dom.query.html

Marijn Huizendveld — April 01, 2010 04:39 #7

Sometimes I wonder, are you a robot? Seriously...:-)

Anyway, congratulations!

Fabien — April 01, 2010 07:13 #8

@Toby: FluentDOM is very different as it only uses XPath and not CSS selectors.

@kiang: Zend_Dom_Query has very limited support for CSS. It is only able to parse simple CSS selectors. The Symfony 2 component supports the whole CSS specification (with just a few exceptions listed on the lxml.cssselect documentation page). Just have a look at the code and you will understand what I'm talking about ;)

Andris — April 01, 2010 10:15 #9

Another jQuery port worth mentioning would be phpQuery (has crazy stuff like plugins and event triggers).

Jordi Boggiano — April 01, 2010 11:05 #10

Great! I will let the dompdf guys know; hopefully they can replace their parser with this. It would mean less code to maintain, especially since the current one does not like CSS3 selectors too much.