Native XML Extensions

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C’s Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.
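
To make that concrete, here is a minimal sketch of loading broken markup with DOM and querying it with XPath. The function name extractLinks and the sample markup are my own, purely for illustration.

```php
<?php
// Hypothetical helper: collect every href from possibly broken HTML.
function extractLinks(string $html): array
{
    // Collect libxml parse warnings internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html); // uses libxml's HTML parser, which tolerates bad markup

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $hrefs = [];
    foreach ($xpath->query('//a[@href]') as $link) {
        $hrefs[] = $link->getAttribute('href');
    }
    return $hrefs;
}

// Deliberately sloppy input: unclosed <p> and <li> tags.
$html = '<p>First<p>Second <a href="/one">one</a><ul><li><a href="/two">two</a></ul>';
print_r(extractLinks($html));
```

Switching libxml to internal error handling is optional, but without it real-world pages tend to flood your log with warnings.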

It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you’ll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language’s DOM API.

The XMLReader extension is an XML Pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are that using XMLReader to parse broken HTML is less robust than using DOM, where you can explicitly tell it to use libxml’s HTML Parser Module.
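
For well-formed XML, the cursor style looks roughly like this. It is only a sketch: the function name readTitles and the title element are assumptions made up for the example.

```php
<?php
// Hypothetical helper: pull the text of every <title> element out of an XML string.
function readTitles(string $xml): array
{
    $reader = new XMLReader();
    $reader->XML($xml); // parse from a string; open() would read from a file or URL

    $titles = [];
    // read() moves the cursor forward one node at a time.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
            $titles[] = $reader->readString(); // text content of the current element
        }
    }
    $reader->close();

    return $titles;
}

$xml = '<books><book><title>A</title></book><book><title>B</title></book></books>';
print_r(readTitles($xml));
```

Because the reader only ever holds the current node, this approach suits large documents that you do not want to load into memory in one go.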

3rd Party (libxml based)

Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface, but it is heavily tuned for server-side use.

fDOMDocument extends the standard DOM to use exceptions on all errors instead of PHP warnings or notices. It also adds various custom methods and shortcuts for convenience and to simplify the usage of DOM.

3rd Party (not libxml based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd party libs go down this route. Some of them are listed below.

WebServices

The YQL Web Service enables applications to query, filter, and combine data from different sources across the Internet. YQL statements have a SQL-like syntax, familiar to any developer with database experience.

ScraperWiki’s external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.

Regular Expressions

Last and least recommended, you can extract data from HTML with Regular Expressions. In general using Regular Expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding a space somewhere, can make the Regex fail when it’s not properly written. You should know what you are doing before using Regex on HTML.
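
A small sketch of that brittleness, with a made-up pattern and two helper names (hrefViaRegex, hrefViaDom) that exist only for this illustration. The pattern is typical of what you find on the web, not taken from any particular snippet.

```php
<?php
// Naive pattern: assumes href is the first attribute and uses double quotes.
function hrefViaRegex(string $html): ?string
{
    return preg_match('/<a href="([^"]*)">/', $html, $m) === 1 ? $m[1] : null;
}

// The same extraction via DOM, which does not care about attribute order.
function hrefViaDom(string $html): ?string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $link = $dom->getElementsByTagName('a')->item(0);
    return $link instanceof DOMElement ? $link->getAttribute('href') : null;
}

$plain   = '<a href="/page">link</a>';
$changed = '<a class="nav" href="/page">link</a>'; // someone added a class attribute

var_dump(hrefViaRegex($plain));   // matches
var_dump(hrefViaRegex($changed)); // no match: the pattern expected href first
var_dump(hrefViaDom($changed));   // still finds the attribute
```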

HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught them with each new Regex you write. Regex are fine in some cases, but it really depends on your use case.

You can write more reliable parsers, but writing a complete and reliable custom parser with Regular Expressions is a waste of time when the aforementioned libraries already exist and do a much better job of this.

Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use XPath statements, which are sometimes faster. Typical jQuery methods like ->children() and ->text() and particularly ->attr() also simplify extracting the right HTML snippets. (And they already have their SGML entities decoded.)

$qp->xpath("//div/p[1]"); // get first paragraph in a div

QueryPath also allows injecting new tags into the stream (->append), and later output and prettify an updated document (->writeHTML). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and even extract data from HTML microformats (XFN, vCard).

$qp->find("a[target=_blank]")->toggleClass("usability-blunder");

phpQuery or QueryPath?

Generally QueryPath is better suited for manipulation of documents, while phpQuery also implements some pseudo AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because it has fewer features overall).

Yes, you can use simple_html_dom for the purpose. However, I have worked quite a lot with simple_html_dom, particularly for web scraping, and have found it to be too vulnerable. It does the basic job, but I wouldn’t recommend it anyway.

I have never used curl for this purpose, but from what I have learned, curl can do the job much more efficiently and is much more solid.

We have created quite a few crawlers for our needs before. At the end of the day, it is usually simple regular expressions that do the job best. While the libraries listed above are good for the purpose they were created for, if you know what you are looking for, regular expressions are a safer way to go, as you can also handle non-valid HTML/XHTML structures, which would fail if loaded via most of the parsers.

One general approach I haven’t seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it.
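
Roughly, with PHP's tidy extension, that could look like the sketch below. The function name htmlToXhtml is hypothetical, and the two config options are just the ones I would assume you need; check the Tidy option reference for your version.

```php
<?php
// Hypothetical helper: repair arbitrary HTML into XHTML using ext/tidy.
function htmlToXhtml(string $html): string
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The tidy extension is not installed.');
    }

    $config = [
        'output-xhtml'     => true, // emit XHTML rather than HTML
        'numeric-entities' => true, // avoid named entities an XML parser may not know
    ];

    $xhtml = tidy_repair_string($html, $config, 'utf8');
    if ($xhtml === false) {
        throw new RuntimeException('Tidy could not repair the input.');
    }
    return $xhtml;
}

// After the repair step, a plain XML parser can take over.
if (extension_loaded('tidy')) {
    $doc = new DOMDocument();
    $doc->loadXML(htmlToXhtml('<p>one<p>two'));
    echo $doc->getElementsByTagName('p')->length, " paragraphs\n";
}
```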

But to your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ — it’s a modified version of the Readability algorithm, which is designed to extract just the textual content (not headers and footers) from a page.

This sounds like a good task description of W3C XPath technology. It’s easy to express queries like “return all href attributes in img tags that are nested in <foo><bar><baz> elements.” Not being a PHP buff, I can’t tell you in what form XPath may be available. If you can call an external program to process the HTML file you should be able to use a command line version of XPath.
For a quick intro, see http://en.wikipedia.org/wiki/XPath.
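
For what it's worth, PHP ships XPath support through DOMXPath, so no external program is needed. A minimal sketch of the query described above, run on a made-up XML document (the function name nestedImgHrefs is mine):

```php
<?php
// Hypothetical helper: href attributes of img elements nested in foo/bar/baz.
function nestedImgHrefs(string $xml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);

    $xpath = new DOMXPath($dom);
    $values = [];
    // The trailing /@href selects the attribute nodes themselves.
    foreach ($xpath->query('//foo/bar/baz//img/@href') as $attr) {
        $values[] = $attr->value;
    }
    return $values;
}

$xml = '<root>'
     . '<foo><bar><baz><img href="inside.png"/></baz></bar></foo>'
     . '<img href="outside.png"/>'
     . '</root>';
print_r(nestedImgHrefs($xml));
```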

QueryPath is good, but be careful of “tracking state”: if you don’t realise what it means, you can waste a lot of debugging time trying to find out what happened and why the code doesn’t work.

What it means is that each call on the result set modifies the result set in the object. It’s not chainable like in jQuery, where each link is a new set. You have a single set, which is the result of your query, and each function call modifies that single set.

To get jQuery-like behaviour, you need to branch before you do a filter/modify-like operation. That way it mirrors what happens in jQuery much more closely.

“$results” now contains the result set for “input[name='forename']”, NOT the original query “div p”. This tripped me up a lot. What I found was that QueryPath tracks the filters and finds and everything else that modifies your results, and stores them in the object. You need to do this instead:

$forename = $results->branch()->find("input[name='forename']");

Then $results won’t be modified and you can reuse the result set again and again. Perhaps somebody with much more knowledge can clear this up a bit, but it’s basically like this from what I’ve found.
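
To show the idea without depending on QueryPath itself, here is a toy model of a stateful result set. The ResultSet class is entirely hypothetical and is NOT how QueryPath is implemented; it only mimics the behaviour described above.

```php
<?php
// Toy stand-in for a stateful query object (not QueryPath's real code).
final class ResultSet
{
    /** @var string[] */
    private array $items;

    public function __construct(array $items)
    {
        $this->items = $items;
    }

    // Narrows the set IN PLACE and returns the same object.
    public function filter(callable $keep): self
    {
        $this->items = array_values(array_filter($this->items, $keep));
        return $this;
    }

    // Returns an independent copy, so later filters leave the original alone.
    public function branch(): self
    {
        return clone $this;
    }

    public function toArray(): array
    {
        return $this->items;
    }
}

$isInput = fn (string $tag): bool => $tag === 'input';

$results = new ResultSet(['p', 'input', 'p']);
$inputs  = $results->branch()->filter($isInput);

print_r($inputs->toArray());  // only the inputs
print_r($results->toArray()); // original set is untouched thanks to branch()
```

Calling filter() directly on $results, without branch(), would shrink $results itself, which is the surprise described above.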

A few months ago I wrote a library that can help you work with parsing HTML5 code in PHP. It extends the native DOMDocument library, fixes some bugs and adds some new features (innerHTML, querySelector(), …)
It’s available at https://github.com/ivopetkov/html5-dom-document-php
I hope it will be useful for you too.