
HTML is the structured markup language used for pages on the World Wide Web. Because it is structured, information can be extracted from pages programmatically. But because the schema describing HTML is treated more as a suggestion than a rule, a parser needs to be very forgiving of errors.

For Haskell, one such parser is the html-conduit parser, part of the relatively lightweight xml-conduit package. This tutorial will walk you through the creation of a simple application: seeing how many hits we get from bing.com when we search for "School of Haskell".
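Fetching the page itself can be done with the http-conduit package. A minimal sketch, assuming http-conduit's simpleHttp and an illustrative search URL:

```haskell
-- A sketch of fetching the search page; the exact URL is an
-- illustration, not necessarily bing's current query format.
import Network.HTTP.Conduit (simpleHttp)
import qualified Data.ByteString.Lazy as L

main :: IO ()
main = do
    -- simpleHttp returns the response body as a lazy ByteString
    page <- simpleHttp "http://www.bing.com/search?q=school+of+haskell"
    print (L.length page)
```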

Now that we have the page contents, we need to find the data we're interested in. Examining the page, we see that it's in a span tag, with the id of count. The html-conduit package can parse the data for us. After doing so, we can use operators from the Text.XML.Cursor module to pick out the data we want.

Text.XML.Cursor provides operators inspired by the XPath language. If you are familiar with XPath expressions, these will come naturally. If not - well, they are still fairly straightforward. We extract the page as before, then use parseLBS to parse the lazy ByteString that it returns, and then fromDocument to create the cursor. The $// operator is similar to the // syntax of XPath, and selects all the nodes in the cursor that match the findNodes expression. The &| operator applies the extractData function to each node in turn, and the resulting list is passed to processData.
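The whole pipeline can be sketched against a fixed HTML fragment, so it runs without network access; the span/id structure here mimics the bing results page as described above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Parse a fragment, build a cursor, then select and extract with the
-- XPath-like operators from Text.XML.Cursor.
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (attributeIs, child, content, element,
                        fromDocument, ($//), (&|))
import Control.Monad ((>=>))
import qualified Data.Text as T

main :: IO ()
main = do
    let cursor = fromDocument $ parseLBS
            "<html><body><span id=\"count\">1,230,000 results</span></body></html>"
        -- $// selects matching descendants; &| maps extraction over them
        results = cursor $// (element "span" >=> attributeIs "id" "count" >=> child)
                         &| (T.concat . content)
    putStrLn (T.unpack (head results))
```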

The findNodes function uses element "span" to select the span tags. Then >=> composes that with the next selector, attributeIs "id" "count", which selects for - you guessed it - elements with an id attribute of count. Since id attributes are supposed to be unique, that should be our element. The text we want is actually in the text node that is a child of the element we found, so we use child to select that node.
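As a sketch, findNodes can be written and checked against a fragment containing a decoy span, to confirm that only the id="count" element's text child is selected:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, attributeIs, child, element,
                        fromDocument, ($//))
import Control.Monad ((>=>))

-- Select span elements, keep only those with id="count", then
-- move to their child text nodes.
findNodes :: Cursor -> [Cursor]
findNodes = element "span" >=> attributeIs "id" "count" >=> child

main :: IO ()
main = do
    let cursor = fromDocument $ parseLBS
            "<html><body><span id=\"other\">no</span>\
            \<span id=\"count\">42 results</span></body></html>"
    -- only the id="count" span's text child should match
    print (length (cursor $// findNodes))
```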

The extractData function uses the content function to extract the actual text from the node we found. Since content returns a list of Text values, extractData applies Data.Text.concat to join them into a single Text.
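A small standalone sketch of extractData, run over the text children of a p element in a fixed fragment:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, child, content, element,
                        fromDocument, ($//), (&|))
import Control.Monad ((>=>))
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- content yields a [Text] for the cursor; concat joins the pieces.
extractData :: Cursor -> T.Text
extractData = T.concat . content

main :: IO ()
main = do
    let cursor = fromDocument $ parseLBS
            "<html><body><p>School of Haskell</p></body></html>"
    mapM_ TIO.putStrLn (cursor $// (element "p" >=> child) &| extractData)
```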

Finally, we process that data - a list of the results of extractData - with processData. Since we want the text from the first element in the list we are passed, we use head before printing it. The resulting value has type Text, so Data.Text.unpack turns it into a String for putStrLn.
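processData itself is a one-liner; here it is sketched with a literal list standing in for the extracted results:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- Print the first extracted result, converting Text to String.
processData :: [T.Text] -> IO ()
processData = putStrLn . T.unpack . head

main :: IO ()
main = processData ["1,230,000 results", "ignored"]
```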

As a second example, let's extract the list of URLs from the search. These are simply a tags wrapped in h3 tags. So we change findNodes to find those tags, and extractData to fetch the href attribute. Finally, we process the resulting list by using mapM_ to pass each value to Data.Text.IO.putStrLn, printing each URL on its own line rather than using unpack to turn it into a String. This requires changing the imports a bit. In this case, rather than using a qualified import to avoid conflicts with the Prelude, we import the Prelude explicitly and hide the conflicting functions. All these changes are highlighted.
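The second example can be sketched as a complete program; the h3/a structure, the href attribute, and the URL reflect bing's result markup as described above, and may not match the live page:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Hide the Prelude's putStrLn so Data.Text.IO's version can be
-- imported unqualified, as discussed in the text.
import Prelude hiding (putStrLn)
import Data.Text.IO (putStrLn)
import Network.HTTP.Conduit (simpleHttp)
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (Cursor, attribute, child, element,
                        fromDocument, ($//), (&|))
import Control.Monad ((>=>))
import qualified Data.Text as T

-- a tags wrapped in h3 tags
findNodes :: Cursor -> [Cursor]
findNodes = element "h3" >=> child >=> element "a"

-- fetch the href attribute instead of the text content
extractData :: Cursor -> T.Text
extractData = T.concat . attribute "href"

-- print each URL on its own line
processData :: [T.Text] -> IO ()
processData = mapM_ putStrLn

main :: IO ()
main = do
    page <- simpleHttp "http://www.bing.com/search?q=school+of+haskell"
    let cursor = fromDocument (parseLBS page)
    processData (cursor $// findNodes &| extractData)
```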

This tutorial did not cover error handling. Given the nature of HTML, errors are common, and the HTML parser deals with that as well as it can. If you're using XML, then the above tools will work - just use the appropriate parser from xml-conduit and the tools described above. If you need to detect errors in your XML, you might want to look at the XML parsing with validation tutorial.