XSLT for parsing HTML? No, use R instead

How to parse data from HTML?

For example, suppose the information we want is inside an HTML
table:

<h1>Here is the price</h1><table><tr><td>1.0</td></tr></table>

What I am looking for sits in the table after the <h1> containing the
text "Here is the price", and in this particular example, I am
interested in the first cell of the table.

I want to retrieve 1.0 from the previous HTML.
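With a tool that understands XPath, the whole selection fits in one
expression (a sketch of the idea):

```xpath
//h1[text()="Here is the price"]/following-sibling::table[1]/tr/td[1]
```

This finds an h1 with the wanted text anywhere in the document, moves
to the first table that follows it, and picks the first cell of the
first row.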

Regular expression matching is not a good way to find the right value.
Regular expressions cannot understand the hierarchical nature of XML
tags, and they can go totally wrong.

It is better to do this using some system that understands the
hierarchical XML structure.

XSLT

I have tried to use XSLT (Extensible Stylesheet Language
Transformations) for this purpose several times. It has always failed.
I tried again today for several hours, but it turned out to be
impossible, again.

As a warning, I explain here why this is so painful.

In theory it should work like this: I write a simple XSLT program
file that describes what to match using powerful XPath expressions,
and when the XPath matches, I print what was found. I write a rule like:

<xsl:template match='h1[text()="Here is the price"]'><xsl:value-of select="following-sibling::table[1]/tr/td"/></xsl:template>

In the ideal world, this matches h1 tags anywhere in the document
whose text content is "Here is the price", selects the associated
table/tr/td, and prints the text of that node.

XPath can even index. It allows selecting the 5th row and the 4th
column, or even the last column (last()).
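For example, such index expressions look like this (assuming a plain,
namespace-free table):

```xpath
//table/tr[5]/td[4]
//table/tr[5]/td[last()]
```

The first selects the 4th cell of the 5th row; the second selects the
last cell of the same row.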

This is approximately valid XSLT. In an ideal world you could run it using the command:

xsltproc select.xslt target.html

And this nicely prints the wanted value.

But no, this has never worked. Today I spent several hours trying to
understand why. I have a theory, but still no perfect answer.

I have a feeling that several people have tried this, but because there
are no discussions in forums, they have failed without understanding why.

The most important answer is on Stack Overflow, but it has only four
votes. All the discussions of XSLT are years old, so hopefully we are
getting rid of XSLT.

The most important hint was that HTML has a separate namespace. You
must prefix all HTML tags with a prefix that is defined in the
stylesheet header. After that, the basic examples started to work,
somehow.
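Concretely, this means declaring the XHTML namespace in the stylesheet
and prefixing every element in the match patterns (a sketch; the
prefix name h is arbitrary):

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml">
  <xsl:template match="h:h1[text()='Here is the price']">
    <xsl:value-of select="following-sibling::h:table[1]/h:tr/h:td[1]"/>
  </xsl:template>
</xsl:stylesheet>
```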

Then the next problem is that the default behavior of XSLT is
completely stupid: it prints all non-matching XML content without the
tags. The only way I figured out to prevent that was to make my
templates match all the tags from the root down to the tag I am
looking for.

You must not use the handy "//" to match "any level" of tags.
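For reference, standard XSLT has a shorter workaround that I did not
get to test in this setup: an empty template that overrides the
built-in rule copying text nodes through, so all non-matching text is
silenced:

```xml
<xsl:template match="text()"/>
```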

To really understand why this is so painful, you should fully
understand the XSLT execution model. It is confusing, and I have not
yet fully understood it. Read this, and try to explain what "A node
is processed by finding all the template rules with patterns that
match the node, and choosing the best amongst them" means.

Then you have problems with constructs like "&quot;" in the text. They
cause all kinds of errors. You also have to figure out how to print a
newline character to the output, and how to get rid of the XML header
in the output.
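For the record, those last two have standard answers in XSLT (general
XSLT knowledge, not tested in this particular setup): xsl:output
controls the header and xsl:text can emit a literal newline:

```xml
<!-- at the top level: text output has no <?xml?> header at all -->
<xsl:output method="text"/>

<!-- inside a template: print a newline character -->
<xsl:text>&#10;</xsl:text>
```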

This all means that I gave up. XSLT is not a reasonable way to do this.

There may be other XML tools that can interpret XPath expressions, but
I did not find any that looked at all supported.

R

I found out that there is a very simple way to do this, and it handles
even very complex cases very nicely.

In the statistical programming language R there is a library called
XML that can sniff HTML tables directly from web pages. They turn into
a list of R data frames (a handy table format).

For example, these four lines of code retrieve all of today's
hourly electricity prices from the German electricity exchange:
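A sketch of the approach with the XML package's readHTMLTable. The
inline HTML here is a hypothetical stand-in for the real exchange page
(whose URL is not reproduced), so the table and column indices differ
from the real page's 6th table and 9th column:

```r
library(XML)  # provides htmlParse() and readHTMLTable()

# Hypothetical stand-in for the exchange page: date in the first row,
# every second row empty, prices in between.
page <- "<html><body><table>
  <tr><td>2013-01-07</td></tr>
  <tr><td></td></tr>
  <tr><td>31.45</td></tr>
  <tr><td></td></tr>
  <tr><td>28.90</td></tr>
</table></body></html>"

tables <- readHTMLTable(htmlParse(page, asText = TRUE), header = FALSE)
column <- tables[[1]][[1]]        # on the real page: tables[[6]][[9]]
# Drop the date row, coerce factors via characters to numbers,
# and let na.omit remove the empty rows that became NA.
prices <- na.omit(suppressWarnings(as.numeric(as.character(column[-1]))))
```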

The latest German hourly prices are in the 6th table on the page, in
the 9th column. The first row contains the date and the rest contain
prices. Every second row is empty, and those are read as the value NA
(not available); na.omit removes them. By default the prices are read
not as numbers but as "factors" (~enumerations). They can be converted
to characters and then to numbers.