How to Use Html5lib in Python

Written by mike wallace

The Python programming language can work with HTML 5 web pages using the "html5lib" library. This library lets you write Python scripts that parse an HTML 5 page into a tree structure, a hierarchical view of the page's elements. You access those elements with a tree walker, which "walks" along the connections between tree nodes and can traverse the entire tree. With Python and "html5lib" you can open, parse, and print out an HTML 5 web page.
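To make the idea of a tree and a walk over it concrete, here is a minimal sketch that uses the standard library's xml.dom.minidom rather than html5lib itself, so it runs without installing anything. The page string and the helper name "outline" are illustrative assumptions, and minidom only accepts well-formed markup, unlike an HTML 5 parser.

```python
from xml.dom import minidom


def outline(node, depth=0):
    """Return one indented line per element, parents before children."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        depth += 1
    # Text nodes have no children, so the recursion stops there.
    for child in node.childNodes:
        lines.extend(outline(child, depth))
    return lines


# A tiny, well-formed page (hypothetical content for illustration).
page = "<html><head><title>Demo</title></head><body><p>Hi <b>there</b></p></body></html>"
document = minidom.parseString(page)
print("\n".join(outline(document)))
```

Each level of indentation in the printed outline corresponds to one step down the tree, which is the same hierarchy a tree walker moves through.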

Skill level:

Easy

Things you need

Python 3.2 programming language with the "html5lib" module installed

Instructions

1

Open the IDLE text editor from the Python directory in Program Files (or Applications on a Macintosh) and create a new file. A blank source code file opens.

2

Import the "html5lib" module by writing the following statements at the top of the source code file:

import html5lib

from html5lib import treebuilders, treewalkers, serializer

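import urllib.request

3

Create a new HTML 5 parser, which you will use to read an HTML website. The parser must build a "dom" tree so that it matches the "dom" tree walker created in step 6. Declare the new parser by writing the following:

parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))

4

Open a website by passing its address into the urllib.request.urlopen function (in Python 3, urllib.request replaces the older urllib2 module). For example, if you want to open "www.website_address.com," write the following:

URL = urllib.request.urlopen("http://www.website_address.com").read()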
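The download step can also be wrapped in a small function. The following is a minimal sketch, assuming Python 3, where urlopen lives in urllib.request; the function name "fetch_html" is hypothetical, and a "data:" URL stands in for a real address so that the example runs without a network connection.

```python
import urllib.request


def fetch_html(url):
    """Download a page and return its raw bytes."""
    # The with-block closes the connection once the body has been read.
    with urllib.request.urlopen(url) as response:
        return response.read()


# A data: URL (an assumption for offline use) carrying "<p>Hello</p>".
html = fetch_html("data:text/html,%3Cp%3EHello%3C/p%3E")
print(html)
```

The function returns bytes, which the parser's parse method accepts directly, so the result can be handed to step 5 unchanged.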

5

Pass the website into the HTML 5 parser to receive a tree representation. Save this representation into a variable named "tree" by writing the following statement:

tree = parser.parse(URL)

6

Create a tree walker like this:

tree_walker = treewalkers.getTreeWalker("dom")

7

Walk through the tree using the tree walker. The tree walker returns a stream of tokens describing what it finds in the HTML 5 page. To walk through the tree, write the following:

stream = tree_walker(tree)

8

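Serialise the stream so that you can easily output it to the console. This takes two statements: one that creates an HTML serializer from the "serializer" module imported in step 2, and one that passes the stream to it.

Then loop over the serialised output with a "for" statement. Indent the line immediately after the "for" statement and write a print function, like this: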

print(element)
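Put together, the parsing, walking, serialising, and printing steps might look like the sketch below. It is one plausible reading of the steps, not necessarily the author's exact code: the names "html_serializer", "output", and "elements" are assumptions, the omit_optional_tags=False option is a choice made here to keep every tag visible, and a literal string stands in for the downloaded page so the example runs offline.

```python
import html5lib
from html5lib import treebuilders, treewalkers, serializer

# Build a DOM tree so that it matches the "dom" tree walker used below.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))

# A literal string stands in for the downloaded page (an assumption).
tree = parser.parse("<p><strong>Hello</strong> World</p>")

# Walk the tree to obtain a stream of tokens.
tree_walker = treewalkers.getTreeWalker("dom")
stream = tree_walker(tree)

# The two serialising statements: create a serializer, then run it.
html_serializer = serializer.HTMLSerializer(omit_optional_tags=False)
output = html_serializer.serialize(stream)

# Collect the serialised pieces, then print them one per line.
elements = list(output)
for element in elements:
    print(element)
```

Because the parser fills in the structure that HTML 5 requires, the printed pieces include tags such as html, head, and body even though the input string did not contain them.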

11

Execute the program by pressing F5. The script will open and then parse an HTML 5 web page. The script then serialises the tree structure of the page and outputs it to the console. The output will vary depending on the web page chosen, but may look something like this: