Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4
also works on Python 3.x. Beautiful Soup 4 is faster, has more
features, and works with third-party parsers like lxml and
html5lib. You should use Beautiful Soup 4 for all new projects.

Beautiful Soup is an HTML/XML parser for Python that can turn even invalid
markup into a parse tree. It provides simple, idiomatic ways of
navigating, searching, and modifying the parse tree. It commonly saves
programmers hours or days of work. There's also a Ruby port called Rubyful Soup.

This document illustrates all major features of Beautiful Soup
version 3.0, with examples. It shows you what the library is good for,
how it works, how to use it, how to make it do what you want, and what
to do when it violates your expectations.

A Beautiful Soup constructor takes an XML or HTML document in the
form of a string (or an open file-like object). It parses the document
and creates a corresponding data structure in memory.

If you give Beautiful Soup a perfectly-formed document, the parsed
data structure looks just like the original document. But if there's
something wrong with the document, Beautiful Soup uses heuristics to
figure out a reasonable structure for the data.

Note that BeautifulSoup figured out sensible places to put the
closing tags, even though the original document lacked them.

That document isn't valid HTML, but it's not too bad either. Here's
a really horrible document. Among other problems, it's got a <FORM>
tag that starts outside of a <TABLE> tag and ends inside the <TABLE>
tag. (HTML like this was found on a website run by a major web
company.)

The last cell of the table is outside the <TABLE> tag; Beautiful
Soup decided to close the <TABLE> tag when it closed the <FORM>
tag. The author of the original document probably intended the <FORM>
tag to extend to the end of the table, but Beautiful Soup has no way
of knowing that. Even in a bizarre case like this, Beautiful Soup
parses the invalid document and gives you access to all the data.

The BeautifulSoup class is full of web-browser-like
heuristics for divining the intent of HTML authors. But XML doesn't
have a fixed tag set, so those heuristics don't apply. So
BeautifulSoup doesn't do XML very well.

Use the BeautifulStoneSoup class to parse XML
documents. It's a general class with no special knowledge of any XML
dialect and very simple rules about tag nesting. Here it is in action:

The most common shortcoming of
BeautifulStoneSoup is that it doesn't know about
self-closing tags. HTML has a fixed set of self-closing tags, but with
XML it depends on what the DTD says. You can tell
BeautifulStoneSoup that certain tags are self-closing by
passing in their names as the selfClosingTags argument to
the constructor:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf")
soup.contents[0]
# u'\u3053\u308c\u306f'
soup.originalEncoding
# 'utf-8'
str(soup)
# '\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'

# Note: this bit uses EUC-JP, so it only works if you have cjkcodecs
# installed, or are running Python 2.4.
soup.__str__('euc-jp')
# '\xa4\xb3\xa4\xec\xa4\xcf'

Beautiful Soup uses a class called UnicodeDammit to
detect the encodings of documents you give it and convert them to
Unicode, no matter what. If you need to do this for other documents
(without using Beautiful Soup to parse them), you can use
UnicodeDammit by itself. It's heavily based on code from
the Universal Feed Parser.

If you're running an older version of Python than 2.4, be sure to
download and install cjkcodecs and
iconvcodec, which make Python capable of supporting
more codecs, especially CJK codecs. Also install the chardet
library, for better autodetection.

Beautiful Soup tries the following encodings, in order of priority,
to turn your document into Unicode:

An encoding you pass in as the fromEncoding argument
to the soup constructor.

An encoding discovered in the document itself: for instance, in an
XML declaration or (for HTML documents) an http-equiv
META tag. If Beautiful Soup finds this kind of encoding within the
document, it parses the document again from the beginning and gives
the new encoding a try. The only exception is if you explicitly
specified an encoding, and that encoding actually worked: then it will
ignore any encoding it finds in the document.

An encoding sniffed by looking at the first few bytes of the
file. If an encoding is detected at this stage, it will be one of the
UTF-* encodings, EBCDIC, or ASCII.
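The priority order above can be sketched in plain Python. This is not Beautiful Soup's own code: guess_encoding is a hypothetical helper written for Python 3 with only the standard library, the sniffing step looks at byte-order marks only, and trying UTF-8 before Windows-1252 is an assumption of the sketch.

```python
import codecs
import re


def guess_encoding(data, from_encoding=None):
    """Return (text, encoding_name) for a byte string.

    Hypothetical helper: candidates are tried in priority order and the
    first one that decodes without error wins.
    """
    candidates = []
    # 1. An encoding the caller passed in explicitly.
    if from_encoding:
        candidates.append(from_encoding)
    # 2. An encoding declared in the document (XML declaration or META tag).
    declared = re.search(br"(?:encoding|charset)=[\"']?([A-Za-z0-9_.:-]+)",
                         data[:1024])
    if declared:
        candidates.append(declared.group(1).decode("ascii"))
    # 3. An encoding sniffed from the first few bytes (byte-order marks only).
    for bom, name in ((codecs.BOM_UTF8, "utf-8-sig"),
                      (codecs.BOM_UTF16_LE, "utf-16"),
                      (codecs.BOM_UTF16_BE, "utf-16")):
        if data.startswith(bom):
            candidates.append(name)
    # 4. Last resorts (the UTF-8 step is an assumption of this sketch).
    candidates.extend(["utf-8", "windows-1252"])
    for name in candidates:
        try:
            return data.decode(name), name
        except (LookupError, UnicodeDecodeError):
            continue
    # Windows-1252 leaves a few bytes undefined, so decode leniently.
    return data.decode("windows-1252", "replace"), "windows-1252"
```

Feeding it the EUC-JP bytes '\xa4\xb3\xa4\xec\xa4\xcf' with no hint lands on Windows-1252, which is the wrong guess described in this section. Passing "euc-jp" as from_encoding gets the right answer.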

Beautiful Soup will almost always guess right if it can make a
guess at all. But for documents with no declarations and in strange
encodings, it will often not be able to guess. It will fall back to
Windows-1252, which will probably be wrong. Here's an EUC-JP example
where Beautiful Soup guesses the encoding wrong. (Again, because it
uses EUC-JP, this example will only work if you are running Python 2.4
or have cjkcodecs installed):

If you give Beautiful Soup a document in the Windows-1252 encoding
(or a similar encoding like ISO-8859-1 or ISO-8859-2), Beautiful Soup
finds and destroys the document's smart quotes and other
Windows-specific characters. Rather than transforming those characters
into their Unicode equivalents, Beautiful Soup transforms them into
HTML entities (BeautifulSoup) or XML entities
(BeautifulStoneSoup).

To prevent this, you can pass smartQuotesTo=None into the soup
constructor: then smart quotes will be converted to Unicode like any
other native-encoding characters. You can also pass in "xml" or "html"
for smartQuotesTo, to change the default behavior of BeautifulSoup
and BeautifulStoneSoup.
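Here's a rough sketch of the three behaviors in standard-library Python 3. The table SMART_CHARS and the function convert_smart_quotes are hypothetical names, the table covers only four characters, and the real conversion happens inside the parser rather than in a helper like this.

```python
# A few Windows-1252 bytes mapped to (HTML entity name, Unicode code point).
# Hypothetical, deliberately incomplete table.
SMART_CHARS = {
    0x91: ("lsquo", 0x2018),
    0x92: ("rsquo", 0x2019),
    0x93: ("ldquo", 0x201C),
    0x94: ("rdquo", 0x201D),
}


def convert_smart_quotes(data, smart_quotes_to="html"):
    """Decode Windows-1252 bytes, optionally turning smart quotes into entities."""
    pieces = []
    for byte in data:  # iterating over bytes yields integers in Python 3
        if smart_quotes_to is not None and byte in SMART_CHARS:
            name, codepoint = SMART_CHARS[byte]
            if smart_quotes_to == "html":
                pieces.append("&%s;" % name)          # named HTML entity
            else:
                pieces.append("&#x%X;" % codepoint)   # numeric XML entity
        else:
            # Everything else becomes plain Unicode.
            pieces.append(bytes([byte]).decode("windows-1252", "replace"))
    return "".join(pieces)
```

With the default, '\x93Hi\x94' comes out as "&ldquo;Hi&rdquo;". With None you get the Unicode curly quotes themselves.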

You can turn a Beautiful Soup document (or any subset of it) into a
string with the str function, or the prettify or renderContents
methods. You can also use the unicode function to get the whole
document as a Unicode string.

The prettify method adds strategic newlines and spacing to make
the structure of the document obvious. It also strips out text nodes
that contain only whitespace, which might change the meaning of an XML
document. The str and unicode functions don't strip out text nodes
that contain only whitespace, and they don't add any whitespace
between nodes either.

When you call __str__, prettify, or
renderContents, you can specify an output encoding. The
default encoding (the one used by str) is UTF-8. Here's
an example that parses an ISO-8859-1 string and then outputs the same
string in different encodings:

If the original document contained an encoding declaration, then
Beautiful Soup rewrites the declaration to mention the new encoding
when it converts the document back to a string. This means that if you
load an HTML document into BeautifulSoup and print it
back out, not only should the HTML be cleaned up, but it should be
transparently converted to UTF-8.

So far we've focused on loading documents and writing them back
out. Most of the time, though, you're interested in the parse tree:
the data structure Beautiful Soup builds as it parses the document.

A parser object (an instance of BeautifulSoup or
BeautifulStoneSoup) is a deeply-nested, well-connected data
structure that corresponds to the structure of an XML or HTML
document. The parser object contains two other types of objects: Tag
objects, which correspond to tags like the <TITLE> tag and the <B>
tags; and NavigableString objects, which correspond to strings like
"Page title" and "This is paragraph".

There are also some subclasses of NavigableString (CData,
Comment, Declaration, and ProcessingInstruction), which
correspond to special XML constructs. They act like
NavigableStrings, except that when it's time to print them out they
have some extra data attached to them. Here's a document that includes
a comment:

from BeautifulSoup import BeautifulSoup
import re
hello = "Hello! <!--I've got to be nice to get what I want.-->"
commentSoup = BeautifulSoup(hello)
comment = commentSoup.find(text=re.compile("nice"))
comment.__class__
# <class 'BeautifulSoup.Comment'>
comment
# u"I've got to be nice to get what I want."
comment.previousSibling
# u'Hello! '
str(comment)
# "<!--I've got to be nice to get what I want.-->"
print commentSoup
# Hello! <!--I've got to be nice to get what I want.-->

Now, let's take a closer look at
the document used at the beginning of the documentation:

SGML tags have attributes: for instance, each of the <P> tags in
the example HTML above has an "id"
attribute and an "align" attribute. You can access a tag's attributes
by treating the Tag object as though it were a dictionary:

In the example above, the parent
of the <HEAD> Tag is the <HTML> Tag. The parent of the <HTML>
Tag is the BeautifulSoup parser object itself. The parent of the
parser object is None. By following parent, you can move up the
parse tree:

With parent you move up the parse tree. With contents you move
down the tree. contents is an ordered list of the Tag and
NavigableString objects contained within a page element. Only the
top-level parser object and Tag objects have
contents. NavigableString objects are just strings and can't
contain sub-elements, so they don't have contents.

In the example above, the
contents of the first <P> Tag is a list containing a
NavigableString ("This is paragraph "), a <B> Tag, and another
NavigableString ("."). The contents of the <B> Tag is a list
containing a NavigableString ("one").

For your convenience, if a tag has only one child node, and that
child node is a string, the child node is made available as
tag.string, as well as tag.contents[0].
In the example above,
soup.b.string is a NavigableString representing the Unicode string
"one". That's the string contained in the first <B> Tag in the parse
tree.

soup.b.string
# u'one'
soup.b.contents[0]
# u'one'

But soup.p.string is None, because the first <P> Tag in the
parse tree has more than one child. soup.head.string is also None,
even though the <HEAD> Tag has only one child, because that child is a
Tag (the <TITLE> Tag), not a NavigableString.

These members let you skip to the next or previous thing on the
same level of the parse tree. In the
document above, the nextSibling of the <HEAD> Tag is the
<BODY> Tag, because the <BODY> Tag is the next thing directly
beneath the <html> Tag. The nextSibling of the <BODY> tag is
None, because there's nothing else directly beneath the <HTML>
Tag.

Some more examples: the nextSibling of the first <P> Tag is the
second <P> Tag. The previousSibling of the <B> Tag inside the
second <P> Tag is the NavigableString "This is paragraph". The
previousSibling of that NavigableString is None, not anything
inside the first <P> Tag.

These members let you move through the document elements in the
order they were processed by the parser, rather than in the order they
appear in the tree. For instance, the next of the <HEAD> Tag is
the <TITLE> Tag, not the <BODY> Tag. This is because, in
the original document, the <TITLE>
tag comes immediately after the <HEAD> tag.

Where next and previous are concerned, a Tag's contents come
before its nextSibling. You usually won't have to use these members,
but sometimes it's the easiest way to get to something buried inside
the parse tree.

You can iterate over the contents of a Tag by treating it as a
list. This is a useful shortcut. Similarly, to see how many child
nodes a Tag has, you can call len(tag) instead of
len(tag.contents). In terms of the
document above:

It's easy to navigate the parse tree by acting as though the name
of the tag you want is a member of a parser or Tag object. We've
been doing it throughout these examples. In terms of the document above, soup.head gives us the first
(and, as it happens, only) <HEAD> Tag in the document:

soup.head
# <head><title>Page title</title></head>

In general, calling mytag.foo returns the first child of mytag
that happens to be a <FOO> Tag. If there aren't any <FOO> Tags
beneath mytag, then mytag.foo returns None.
You can use this to traverse the parse tree very quickly:

You can also use this to quickly jump to a certain part of a parse
tree. For instance, if you're not worried about <TITLE> tags in weird
places outside of the <HEAD> tag, you can just use soup.title to get
an HTML document's title. You don't have to use soup.head.title:

soup.title.string
# u'Page title'

soup.p jumps to the first <P> tag inside a document, wherever it
is. soup.table.tr.td jumps to the first column of the first row of
the first table in the document.

These members are actually aliases for the method named first, covered below. I mention them here because
the alias makes it very easy to zoom in on an interesting part of a
well-known parse tree.

An alternate form of this idiom lets you access the first <FOO> tag
as .fooTag instead of .foo. For instance, soup.table.tr.td could
also be expressed as soup.tableTag.trTag.tdTag, or even
soup.tableTag.tr.tdTag. This is useful if you like to be more
explicit about what you're doing, or if you're parsing XML whose tag
names conflict with the names of Beautiful Soup methods and members.

Incidentally, the two methods described in this section (findAll
and find) are available only to Tag objects and the top-level
parser objects, not to NavigableString objects. The methods defined
in Searching Within the
Parse Tree are also available to NavigableString objects.

This doesn't look useful, but True is very useful when
restricting attribute values.

You can pass in a callable object which
takes a Tag object as its only argument, and returns a
boolean. Every Tag object that findAll encounters will be passed
into this object, and if the call returns True then the tag is
considered to match.

As with the name argument, you can pass different kinds of objects as
the value of a keyword argument to impose different restrictions on the
corresponding attribute. You can pass a string, as seen above, to
restrict an attribute to a single value. You can also pass a regular
expression, a list, a dictionary, the special values True or None, or a
callable that takes the attribute value as its argument (note that the
value may be None). Some examples:

The special values True and None are of special
interest. True matches a tag that has any value for the given
attribute, and None matches a tag that has no value for the
given attribute. Some examples:
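To make the rules concrete, here's a minimal sketch of how one attribute restriction might be checked. value_matches is a hypothetical function, not part of Beautiful Soup, and the real matching logic may differ in its details.

```python
import re


def value_matches(value, match_against):
    """Check one attribute value against one restriction.

    `value` is the attribute's value, or None if the tag lacks the attribute.
    """
    if match_against is True:
        return value is not None            # any value at all will do
    if match_against is None:
        return value is None                # the attribute must be absent
    if callable(match_against):
        return bool(match_against(value))   # the value may be None here
    if hasattr(match_against, "search"):    # a compiled regular expression
        return value is not None and match_against.search(value) is not None
    if isinstance(match_against, (list, tuple, set, dict)):
        return value in match_against       # for a dictionary, checks the keys
    return value == match_against           # a plain string: exact match


# An "align" attribute with the value "center":
value_matches("center", re.compile("^c"))    # True
value_matches("center", ["left", "right"])   # False
# A tag with no "align" attribute at all:
value_matches(None, True)                    # False
value_matches(None, None)                    # True
```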

If you need to impose complex or interlocking restrictions on a
tag's attributes, pass in a callable object for name, as seen above, and deal with the Tag
object.

You might have noticed a problem here. What
if you have a document with a tag that defines an attribute called
name? You can't use a keyword argument called name because the
Beautiful Soup search methods already define a name argument. You
also can't use a Python reserved word like for as a keyword
argument.

Beautiful Soup provides a special argument called attrs which you
can use in these situations. attrs is a dictionary that acts just
like the keyword arguments:

You can use attrs if you need to put restrictions on attributes
whose names are Python reserved words, like class, for, or
import; or attributes whose names are non-keyword arguments to the
Beautiful Soup search methods: name, recursive, limit, text,
or attrs itself.

The attrs argument would be a pretty obscure feature
were it not for one thing: CSS. It's very useful to search for a tag
that has a certain CSS class, but the name of the CSS attribute,
class, is also a Python reserved word.

You could search by CSS class with soup.find("tagName", {
"class" : "cssClass" }), but that's a lot of code for such a
common operation. Instead, you can pass a string for attrs instead
of a dictionary. The string will be used to restrict the CSS class.

text is an argument that lets
you search for NavigableString objects instead of Tags. Its value
can be a string, a regular expression, a list or dictionary, True or
None, or a callable that takes a NavigableString object as its
argument:

If you use text, then any values you give for name and the
keyword arguments are ignored.

recursive is a boolean
argument (defaulting to True) which tells Beautiful Soup whether to
go all the way down the parse tree, or whether to only look at the
immediate children of the Tag or the parser object. Here's the
difference:

When recursive is false, only the immediate children of the
<HTML> tag are searched. If you know that's all you need to search,
you can save some time this way.

Setting the limit argument lets you
stop the search once Beautiful Soup finds a certain number of matches.
If there are a thousand tables in your document, but you only need the
fourth one, pass in 4 to limit and you'll save time. By default,
there is no limit.

Okay, now let's look at the other search methods. They all take
pretty much the same arguments as findAll.

The find method is almost exactly like findAll, except that
instead of finding all the matching objects, it only finds the first
one. It's like imposing a limit of 1 on the result set, and then
extracting the single result from the list.
In terms of the document above:

In general, when you see a search method with a plural name (like
findAll or findNextSiblings), that method takes a limit argument
and returns a list of results. When you see a search method that
doesn't have a plural name (like find or findNextSibling), you
know that the method doesn't take a limit and returns a single
result.
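The relationship between the plural and singular methods can be sketched over an ordinary Python list. find_all and find here are hypothetical stand-ins that search a flat list with a predicate; they are not the Beautiful Soup methods, which walk a parse tree.

```python
def find_all(items, predicate, limit=None):
    """Return matching items in order, stopping once `limit` are found."""
    results = []
    for item in items:
        if predicate(item):
            results.append(item)
            if limit is not None and len(results) >= limit:
                break  # stop early: this is where limit saves time
    return results


def find(items, predicate):
    """Return the first matching item, or None if nothing matches."""
    results = find_all(items, predicate, limit=1)
    return results[0] if results else None


def is_even(n):
    return n % 2 == 0


find_all([1, 2, 3, 4, 5, 6], is_even, limit=2)   # [2, 4]
find([1, 2, 3, 4, 5, 6], is_even)                # 2
find([1, 3, 5], is_even)                         # None
```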

Previous versions of Beautiful Soup had methods like first,
fetch, and fetchPrevious. These methods are still there, but
they're deprecated, and may go away soon. The total effect of all
those names was very confusing. The new names are named consistently:
as mentioned above, if the method name is plural or refers to All,
it returns multiple objects. Otherwise, it returns one object.

The methods described above, findAll and find, start at a
certain point in the parse tree and go down. They recursively iterate
through an object's contents until they bottom out.

This means that you can't call these methods on NavigableString
objects, because they have no contents: they're always the leaves of
the parse tree.

But downwards isn't the only way you can iterate through a
document. Back in Navigating the
Parse Tree I showed you many other ways: parent, nextSibling,
and so on. Each of these iteration techniques has two corresponding
methods: one that works like findAll, and one that works like
find. And since NavigableString objects do support these
operations, you can call these methods on them as well as on Tag
objects and the main parser object.

Why is this useful? Well, sometimes you just can't use findAll or
find to get to the Tag or NavigableString you want. For
instance, consider some HTML like this:

There are a number of ways to navigate to the <LI> tag that contains
the data you want. The most obvious is this:

soup('li', limit=2)[1]
# <li>The data you want</li>

It should be equally obvious that that's not a very stable way to get
that <LI> tag. If you're only scraping this page once it doesn't
matter, but if you're going to scrape it many times over a long
period, such considerations become important. If the irrelevant list
grows another <LI> tag, you'll get that tag instead of the one you
want, and your script will break or give the wrong data.

soup('ul', limit=2)[1].li
# <li>The data you want</li>

That's a little better, because it can survive changes to the
irrelevant list. But if the document grows another irrelevant list at
the top, you'll get the first <LI> tag of that list instead of the one
you want. A more reliable way of referring to the ul tag you want
would better reflect that tag's place in the structure of the
document.

When you look at that HTML, you might think of the list you want as
'the <UL> tag beneath the <H1> tag'. The problem is that the tag isn't
contained inside the <H1> tag; it just happens to come after it. It's
easy enough to get the <H1> tag, but there's no way to get from there
to the <UL> tag using findAll and find, because those methods only
search the contents of the <H1> tag. You need to navigate to the
<UL> tag with the next or nextSibling members:

But that's more trouble than you should need to go through. The
methods in this section provide a useful shorthand. They can be used
whenever you find yourself wanting to write a while loop over one of
the navigation members. Given a starting point somewhere in the tree,
they navigate the tree in some way and keep track of Tag or
NavigableString objects that match the criteria you specify. Instead of
the first loop in the example code above, you can just write this:

The loops are replaced with calls to findNextSibling and
findNext. The rest of this section is a reference to all the methods
of this kind. Again, there are two methods for every navigation
member: one that returns a list the way findAll does, and one that
returns a scalar the way find does.
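To show what these methods save you from writing, here's a sketch of the while loop hiding behind a sibling search. Node, link_siblings, and the two find functions are hypothetical stand-ins built on a bare nextSibling member. They match on tag name only, which is far less than the real methods accept.

```python
class Node(object):
    """A stand-in for a page element with a name and a nextSibling member."""
    def __init__(self, name):
        self.name = name
        self.nextSibling = None


def link_siblings(nodes):
    """Chain the nodes together through their nextSibling members."""
    for left, right in zip(nodes, nodes[1:]):
        left.nextSibling = right
    return nodes


def find_next_siblings(start, name, limit=None):
    """Plural form: follow nextSibling and collect the matching nodes."""
    results = []
    node = start.nextSibling
    while node is not None:
        if node.name == name:
            results.append(node)
            if limit is not None and len(results) >= limit:
                break
        node = node.nextSibling
    return results


def find_next_sibling(start, name):
    """Singular form: the first match, or None."""
    results = find_next_siblings(start, name, limit=1)
    return results[0] if results else None


h1, first_ul, p, second_ul = link_siblings(
    [Node("h1"), Node("ul"), Node("p"), Node("ul")])
find_next_sibling(h1, "ul") is first_ul   # True
```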

These methods repeatedly follow an object's parent member,
gathering Tag or NavigableString objects that match the criteria you
specify. They don't take a text argument, because there's no way any
object can have a NavigableString for a parent. In terms of the document above:

Now you know how to find things in the parse tree. But maybe you
want to modify it and print it back out. You can just rip an element
out of its parent's contents, but the rest of the document will
still have references to the thing you ripped out. Beautiful Soup
offers several methods that let you modify the parse tree while
maintaining its internal consistency.

The replaceWith method extracts one page element and replaces it
with a different one. The new element can be a Tag (possibly with a
whole parse tree beneath it) or a NavigableString. If you pass a
plain old string into replaceWith, it gets turned into a
NavigableString. The navigation members are changed as though the
document had been parsed that way in the first place.

The Tag class and the parser classes support a method called
insert. It works just like a Python list's insert method: it takes
an index to the tag's contents member, and sticks a new element in
that slot.

This was demonstrated in the previous section, when we replaced a
tag in the document with a brand new tag. You can use insert to
build up an entire parse tree from scratch:

An element can occur in only one place in one parse tree. If you
give insert an element that's already connected to a soup object, it
gets disconnected (with extract) before it gets connected
elsewhere. In this example, I try to insert my NavigableString into
a second part of the soup, but it doesn't get inserted again. It gets
moved:

If you're getting errors that say:
"'ascii' codec can't encode character 'x' in position y: ordinal not in range(128)",
the problem is probably with your Python installation rather than with
Beautiful Soup. Try printing out the non-ASCII characters without
running them through Beautiful Soup and you should have the same
problem. For instance, try running code like this:

If this works but Beautiful Soup doesn't, there's probably a bug in
Beautiful Soup. However, if this doesn't work, the problem's with your
Python setup. Python is playing it safe and not sending non-ASCII
characters to your terminal. There are two ways to override this
behavior.

The easy way is to remap standard output to a converter that's
not afraid to send ISO-Latin-1 or UTF-8 characters to the terminal.

codecs.lookup returns a number of bound methods and
other objects related to a codec. The last one is a
StreamWriter object capable of wrapping an output
stream.
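Here's a small sketch of that lookup, written for Python 3 with only the standard library. An in-memory buffer stands in for the terminal so the result can be inspected. To remap real output you would wrap sys.stdout (or sys.stdout.buffer on Python 3) the same way.

```python
import codecs
import io

# codecs.lookup returns a tuple-like object; its last item is the
# StreamWriter class for the codec.
stream_writer = codecs.lookup("utf-8")[-1]

# Wrap a byte stream. The BytesIO buffer is a stand-in for the terminal.
buffer = io.BytesIO()
out = stream_writer(buffer)

# Text written to the wrapper is encoded before it reaches the stream.
out.write(u"\xe9")            # LATIN SMALL LETTER E WITH ACUTE
encoded = buffer.getvalue()   # the two bytes of the UTF-8 encoding
```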

The hard way is to create a sitecustomize.py file
in your Python installation which sets the default encoding to
ISO-Latin-1 or to UTF-8. Then all your Python programs will use that
encoding for standard output, without you having to do something for
each program. In my installation, I have a
/usr/lib/python/sitecustomize.py which looks like this:

Remember, even if your terminal display is restricted to ASCII, you
can still use Beautiful Soup to parse, process, and write documents in
UTF-8 and other encodings. You just can't print certain strings with
print.

Beautiful Soup can handle poorly-structured SGML, but sometimes
it loses data when it gets stuff that's not SGML at all. This is
not nearly as common as poorly-structured markup, but if you're
building a web crawler or something you'll surely run into it.

If your document starts a declaration and never finishes it,
Beautiful Soup assumes the rest of your document is part of the
declaration. If the document ends in the middle of the declaration,
Beautiful Soup ignores the declaration totally. A couple of examples:

The parse tree built by the BeautifulSoup class offends my
senses!

Beautiful Soup will never run as fast as ElementTree or a
custom-built SGMLParser subclass. ElementTree is written in C, and
SGMLParser lets you write your own mini-Beautiful Soup that only
does what you want. The point of Beautiful Soup is to save programmer
time, not processor time.

The search methods described above are driven by generator
methods. You can use these methods yourself: they're called
nextGenerator, previousGenerator, nextSiblingGenerator,
previousSiblingGenerator, and parentGenerator. Tag and parser
objects also have childGenerator and recursiveChildGenerator
available.

Here's a simple example that strips HTML tags out of a document by
iterating over the document and collecting all the strings.
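The idea can be sketched without Beautiful Soup at all. In this toy version, Node is a hypothetical stand-in for a Tag, and recursive_children plays the role that recursiveChildGenerator plays for real Tag objects: it yields every descendant in document order.

```python
class Node(object):
    """A stand-in for a Tag: a name plus an ordered list of children."""
    def __init__(self, name, contents):
        self.name = name
        self.contents = contents


def recursive_children(node):
    """Yield every descendant of `node` in document order (depth first)."""
    for child in node.contents:
        yield child
        if isinstance(child, Node):
            for descendant in recursive_children(child):
                yield descendant


# Roughly: <div>You <i>bet</i> <a>BeautifulSoup</a> rocks!</div>
doc = Node("div", ["You ", Node("i", ["bet"]), " ",
                   Node("a", ["BeautifulSoup"]), " rocks!"])

# Keep the strings, drop the tags.
text = "".join(e for e in recursive_children(doc) if isinstance(e, str))
# 'You bet BeautifulSoup rocks!'
```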

MinimalSoup is a subclass of BeautifulSoup. It knows most
facts about HTML like which tags are self-closing, the special
behavior of the <SCRIPT> tag, the possibility of an encoding mentioned
in a <META> tag, etc. But it has no nesting heuristics at all. So it
doesn't know that <LI> tags go underneath <UL> tags and not the other
way around. It's useful for parsing pathologically bad markup, and for
subclassing.

ICantBelieveItsBeautifulSoup is also a subclass of
BeautifulSoup. It has HTML heuristics that conform more
closely to the HTML standard, but ignore how HTML is used in the real
world. For instance, it's valid HTML to nest <B> tags, but in the real
world a nested <B> tag almost always means that the author forgot to
close the first <B> tag. If you run into someone who actually nests
<B> tags, then you can use ICantBelieveItsBeautifulSoup.

BeautifulSOAP is a subclass of
BeautifulStoneSoup. It's useful for parsing documents
like SOAP messages, which use a subelement when they could just use an
attribute of the parent element. Here's an example:

When the built-in parser classes won't do the job, you need to
customize. This usually means customizing the lists of nestable and
self-closing tags. You can customize the list of self-closing tags by
passing a selfClosingTags argument
into the soup constructor. To customize the lists of nestable tags,
though, you'll have to subclass.

The most useful classes to subclass are MinimalSoup (for HTML)
and BeautifulStoneSoup (for XML). I'm going to show you how to
override RESET_NESTING_TAGS and NESTABLE_TAGS in a subclass. This
is the most complicated part of Beautiful Soup and I'm not going to
explain it very well here, but I'll get something written and then I
can improve it with feedback.

When Beautiful Soup is parsing a document, it keeps a stack of open
tags. Whenever it sees a new start tag, it tosses that tag on top of
the stack. But before it does, it might close some of the open tags
and remove them from the stack. Which tags it closes depends on the
qualities of the tag it just found, and the qualities of the tags in the
stack.

The best way to explain it is through example. Let's say the stack
looks like ['html', 'p', 'b'], and Beautiful Soup encounters a <P>
tag. If it just tossed another 'p' onto the stack, this would imply
that the second <P> tag is within the first <P> tag, not to mention
the open <B> tag. But that's not the way <P> tags work. You can't
stick a <P> tag inside another <P> tag. A <P> tag isn't "nestable" at
all.

So when Beautiful Soup encounters a <P> tag, it closes and pops all
the tags up to and including the previously encountered tag of the
same type. This is the default behavior, and this is how
BeautifulStoneSoup treats every tag. It's what you get when a
tag is not mentioned in either NESTABLE_TAGS or
RESET_NESTING_TAGS. It's also what you get when a tag shows up in
RESET_NESTING_TAGS but has no entry in NESTABLE_TAGS, the way the
<P> tag does.

Let's say the stack looks like ['html', 'span', 'b'], and
Beautiful Soup encounters a <SPAN> tag. Now, <SPAN> tags can contain
other <SPAN> tags without limit, so there's no need to pop up to the
previous <SPAN> tag when you encounter one. This is represented by
mapping the tag name to an empty list in NESTABLE_TAGS. This kind of
tag should not be mentioned in RESET_NESTING_TAGS: there are no
circumstances when encountering a <SPAN> tag would cause any tags to
be popped.

Third example: suppose the stack looks like ['ol','li','ul']:
that is, we've got an ordered list, the first element of which
contains an unordered list. Now suppose Beautiful Soup encounters a
<LI> tag. It shouldn't pop up to the first <LI> tag, because this new
<LI> tag is part of the unordered sublist. It's okay for an <LI> tag
to be inside another <LI> tag, so long as there's a <UL> or <OL> tag
in the way.

That is: <TD> tags can be nested within <TR> tags. <TR> tags can be
nested within <TABLE>, <TBODY>, <TFOOT>, and <THEAD> tags. <TBODY>,
<TFOOT>, and <THEAD> tags can be nested in <TABLE> tags, and <TABLE>
tags can be nested in other <TABLE> tags. If you know about HTML
tables, these rules should already make sense to you.

One more example. Say the stack looks like ['html', 'p', 'table']
and Beautiful Soup encounters a <P> tag.

At first glance, this looks just like the example where the stack
is ['html', 'p', 'b'] and Beautiful Soup encounters a <P> tag. In
that example, we closed the <B> and <P> tags, because you can't have
one paragraph inside another.

Except... you can have a paragraph that contains a table,
and then the table contains a paragraph. So the right thing to do is
to not close any of these tags. Beautiful Soup does the right thing:

What's the difference? The difference is that <TABLE> is in
RESET_NESTING_TAGS and <B> is not. A tag that's in
RESET_NESTING_TAGS doesn't get popped off the stack as easily as a
tag that's not.
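The four examples can be replayed with a short simulation. This is a simplified sketch of the rules as described here, not the parser's actual code: push_tag is a hypothetical function, the stack holds plain tag names, and the two tables are cut down to the handful of tags these examples use.

```python
# Abbreviated, illustrative tables. The real ones are much larger.
NESTABLE_TAGS = {
    "span": [],            # nests freely
    "table": [],
    "ul": [],
    "ol": [],
    "li": ["ul", "ol"],    # may nest if a list tag is in the way
}
RESET_NESTING_TAGS = {"p", "table", "ul", "ol", "li"}


def push_tag(stack, name):
    """Close whatever the new start tag implies, then push it on the stack."""
    triggers = NESTABLE_TAGS.get(name)
    pop_to = None
    for i in range(len(stack) - 1, -1, -1):   # walk from the top down
        open_tag = stack[i]
        if open_tag == name and triggers is None:
            pop_to = i        # non-nestable: close up to and including it
            break
        if (triggers is not None and open_tag in triggers) or (
                triggers is None and name in RESET_NESTING_TAGS
                and open_tag in RESET_NESTING_TAGS):
            pop_to = i + 1    # stop here: close only the tags above it
            break
    if pop_to is not None:
        del stack[pop_to:]
    stack.append(name)
    return stack


push_tag(["html", "p", "b"], "p")         # ['html', 'p']
push_tag(["html", "span", "b"], "span")   # ['html', 'span', 'b', 'span']
push_tag(["ol", "li", "ul"], "li")        # ['ol', 'li', 'ul', 'li']
push_tag(["html", "p", "table"], "p")     # ['html', 'p', 'table', 'p']
```

In the last call, the open <TABLE> tag stops the search before it ever reaches the earlier <P> tag, so nothing gets closed.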

Okay, hopefully you get the idea. Here's the NESTABLE_TAGS for
the BeautifulSoup class. Correlate this with what you know about
HTML, and you should be able to create your own NESTABLE_TAGS for
bizarre HTML documents that don't follow the normal rules, and for
other XML dialects that have different nesting rules.

Since you're subclassing anyway, you might as well override
SELF_CLOSING_TAGS while you're at it. It's a dictionary that maps
self-closing tag names to any values at all (like
RESET_NESTING_TAGS, it's actually a list in the form of a
dictionary). Then you won't have to pass that list in to the
constructor (as selfClosingTags) every time you instantiate your
subclass.

When you parse a document, you can convert HTML or XML entity
references to the corresponding Unicode characters. This code converts
the HTML entity "&eacute;" to the Unicode character LATIN SMALL
LETTER E WITH ACUTE, and the numeric entity "&#101;" to the Unicode
character LATIN SMALL LETTER E.

That's if you use HTML_ENTITIES (which is just the string
"html"). If you use XML_ENTITIES (or the string "xml"), then only
numeric entities and the five XML entities ("&quot;",
"&apos;", "&gt;", "&lt;", and "&amp;") get
converted. If you use ALL_ENTITIES (or the list ["xml", "html"]),
then both kinds of entities will be converted. This last one is
necessary because &apos; is an XML entity but not an HTML
entity.
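Here's a sketch of the three modes using only the Python 3 standard library. convert_entities and its mode strings are hypothetical stand-ins for the constants described above, and the HTML entity table comes from html.entities rather than from Beautiful Soup.

```python
import re
from html.entities import name2codepoint

# The five XML entities.
XML_ENTITY_NAMES = {"quot": '"', "apos": "'", "gt": ">", "lt": "<", "amp": "&"}

ENTITY_RE = re.compile(r"&(#\d+|#[xX][0-9a-fA-F]+|\w+);")


def convert_entities(text, mode="html"):
    """Replace entity references with characters. mode: 'html', 'xml', or 'all'."""
    def replace(match):
        ref = match.group(1)
        if ref.startswith("#"):
            # Numeric entities are converted in every mode.
            try:
                if ref[1] in "xX":
                    return chr(int(ref[2:], 16))
                return chr(int(ref[1:]))
            except (ValueError, OverflowError):
                return match.group(0)   # not a valid code point: leave it
        if mode in ("xml", "all") and ref in XML_ENTITY_NAMES:
            return XML_ENTITY_NAMES[ref]
        if mode in ("html", "all") and ref in name2codepoint:
            return chr(name2codepoint[ref])
        return match.group(0)           # unknown entity: leave it alone
    return ENTITY_RE.sub(replace, text)


convert_entities("&eacute;&#101;", "html")   # u'\xe9e'
convert_entities("&apos;&eacute;", "xml")    # "'&eacute;"
convert_entities("&apos;&eacute;", "all")    # u"'\xe9"
```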

If you tell Beautiful Soup to convert XML or HTML entities into the
corresponding Unicode characters, then Windows-1252 characters (like
Microsoft smart quotes) also get transformed into Unicode
characters. This happens even if you told Beautiful Soup to convert
those characters to entities.

Beautiful Soup does pretty well at handling bad markup when "bad
markup" means tags in the wrong places. But sometimes the markup is
just malformed, and the underlying parser can't handle it. So
Beautiful Soup runs regular expressions against an input document
before trying to parse it.

By default, Beautiful Soup uses regular expressions and replacement
functions to do search-and-replace on input documents. It finds
self-closing tags that look like <BR/>, and changes them to look like
<BR />. It finds declarations that have extraneous whitespace, like
<! --Comment-->, and removes the whitespace: <!--Comment-->.

If you have bad markup that needs fixing in some other way, you can
pass your own list of (regular expression, replacement function)
tuples into the soup constructor, as the markupMassage argument.

Let's take an example: a page that has a malformed comment. The
underlying SGML parser can't cope with this, and ignores the comment
and everything afterwards:

Oops, we're still missing the <BR> tag. Our markupMassage
overrides the parser's default massage, so the default
search-and-replace functions don't get run. The parser makes it past
the comment, but it dies at the malformed self-closing tag. Let's add
our new massage function to the default list, so we run all the
functions.
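
The override-versus-extend problem can be sketched with the re module
alone. The two default patterns below are approximations of the fixes
described earlier (the real list lives on the parser class and may
differ in detail), and DEFAULT_MASSAGE, fix_comment and apply_massage
are names invented for this sketch.

```python
import copy
import re

# Approximations of the two default fixes: "<BR/>" becomes "<BR />", and
# "<! --Comment-->" becomes "<!--Comment-->".
DEFAULT_MASSAGE = [
    (re.compile(r"(<[^<>]*)/>"), lambda m: m.group(1) + " />"),
    (re.compile(r"<!\s+([^<>]*)>"), lambda m: "<!" + m.group(1) + ">"),
]

# A custom fix: a comment opened with "<!-" gets the missing second dash.
fix_comment = [(re.compile(r"<!-([^-])"), lambda m: "<!--" + m.group(1))]

def apply_massage(markup, massage):
    """Run each (regular expression, replacement function) pair in order."""
    for pattern, func in massage:
        markup = pattern.sub(func, markup)
    return markup

bad = "<foo><!-This comment is malformed.-->bar<br/>baz</foo>"

# Passing only the custom list repairs the comment but leaves "<br/>" alone.
only_custom = apply_massage(bad, fix_comment)

# Copying the defaults and extending the copy runs every fix.
combined = copy.copy(DEFAULT_MASSAGE)
combined.extend(fix_comment)
both = apply_massage(bad, combined)
```

With the library itself, a list like combined is what you would pass
to the soup constructor as markupMassage.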

Recall that all the search methods take more or less the same
arguments. Behind the scenes, your arguments to a search method get
transformed into a SoupStrainer object. If you call one of the methods
that returns a list (like findAll), the SoupStrainer object is made
available as the source property of the resulting list.

Yeah, who cares, right? You can carry around a method call's
arguments in many other ways. But another thing you can do with
SoupStrainer is pass it into the soup constructor to restrict the
parts of the document that actually get parsed. That brings us to the
next section:

Beautiful Soup turns every element of a document into a Python
object and connects it to a bunch of other Python objects. If you only
need a subset of the document, this is really slow. But you can pass
in a SoupStrainer as the
parseOnlyThese argument to the soup constructor. Beautiful Soup
checks each element against the SoupStrainer, and only if it matches
is the element turned into a Tag or NavigableString, and added to
the tree.

If an element is added to the tree, then so are its
children, even if they wouldn't have matched the SoupStrainer
on their own. This lets you parse only the chunks of a document that
contain the data you want.
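
The "match once, keep the whole subtree" rule can be sketched with the
standard library's HTMLParser (Python 3). This is not how Beautiful
Soup is implemented. SubtreeCollector and its wanted callback are
hypothetical, and the sketch assumes well-nested markup in which every
tag is explicitly closed.

```python
from html.parser import HTMLParser

class SubtreeCollector(HTMLParser):
    """Keep elements accepted by `wanted`, plus everything inside them."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted   # called as wanted(tag_name, attribute_dict)
        self.depth = 0         # greater than 0 while inside a kept subtree
        self.kept = []         # flat record of what would be built

    def handle_starttag(self, name, attrs):
        # Inside a kept subtree, children are kept without being checked.
        if self.depth or self.wanted(name, dict(attrs)):
            self.depth += 1
            self.kept.append(("tag", name))

    def handle_endtag(self, name):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.kept.append(("text", data))

doc = '<p>skip me</p><a href="/x"><b>keep</b> this</a><i>skip</i>'
collector = SubtreeCollector(lambda name, attrs: name == "a")
collector.feed(doc)
collector.close()
```

The <b> tag would never match the filter on its own, but it ends up in
collector.kept because it sits inside a matching <a> tag. Everything
outside the <a> tag is never recorded at all.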

Here are several different ways of parsing the document into soup,
depending on which parts you want. All of these are faster and use
less memory than parsing the whole document and then using the same
SoupStrainer to pick out the parts you want.

There is one major difference between the SoupStrainer you pass
into a search method and the one you pass into a soup
constructor. Recall that the name argument can take a function whose argument is a Tag
object. You can't do this for a SoupStrainer's name, because
the SoupStrainer is used to decide whether or not a Tag object
should be created in the first place. You can pass in a function for a
SoupStrainer's name, but it can't take a Tag object: it can only
take the tag name and a map of arguments.

When Beautiful Soup parses a document, it loads into memory a large,
densely connected data structure. If you just need a string from that
data structure, you might think that you can grab the string and leave
the rest of it to be garbage collected. Not so. That string is a
NavigableString object. It's got a parent member that points to a
Tag object, which points to other Tag objects, and so on. So long
as you hold on to any part of the tree, you're keeping the whole thing
in memory.

The extract method breaks those connections. If you call
extract on the string you need, it gets disconnected from the rest
of the parse tree. The rest of the tree can then go out of scope and
be garbage collected, while you use the string for something else. If
you just need a small part of the tree, you can call extract on its
top-level Tag and let the rest of the tree get garbage collected.
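
Here's a small sketch of why that works, using a made-up Node class in
place of Tag and NavigableString. The real extract method does more
bookkeeping than this one; the sketch models only the parent and
children links.

```python
import gc
import weakref

class Node:
    """Hypothetical stand-in for a parse tree element."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def extract(self):
        """Detach this node from its parent and return it."""
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None
        return self

def build():
    root = Node("html")
    body = Node("body", root)
    text = Node("the string you want", body)
    # Hand back the leaf and a weak reference that can watch the root die.
    return text, weakref.ref(root)

# Holding only the leaf still keeps the root alive, through .parent.
leaf, root_ref = build()
gc.collect()
kept_alive = root_ref() is not None

# After extract(), nothing you hold points at the rest of the tree.
leaf.extract()
gc.collect()
freed = root_ref() is None
```

The weak reference watches the root without keeping it alive. It stays
live while the leaf still has a parent, and goes dead once the leaf has
been extracted and the collector has run.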

This works the other way, too. If there's a big chunk of the
document you don't need, you can call extract to rip it out
of the tree, then abandon it to be garbage collected while retaining
control of the (smaller) tree.

If extract doesn't work for you, you can try
Tag.decompose. It's slower than extract but more thorough. It
recursively disassembles a Tag and its contents, disconnecting every
part of a tree from every other part.

Matt Croydon got Beautiful Soup 1.x to work on his Nokia Series 60
smartphone. C.R. Sandeep wrote a real-time currency converter for the
Series 60 using Beautiful Soup, but he won't show us how he did it.

Here's a short script from jacobian.org to fix the metadata on music
files downloaded from allofmp3.com.

Mike Foord didn't like the way Beautiful Soup can change HTML if
you write the tree back out, so he wrote HTML Scraper. It's
basically a version of HTMLParser that can handle
bad HTML. It might be obsolete with the release of Beautiful Soup 3.0,
though; I'm not sure.

That's it! Have fun! I wrote Beautiful Soup to save everybody
time. Once you get used to it, you should be able to wrangle data out
of poorly-designed websites in just a few minutes. Send me email if
you have any comments, run into problems, or want me to know about
your project that uses Beautiful Soup.

--Leonard

This document (source) is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Saturday, December 07 2013, 20:02:22 Nowhere Standard Time and last built on Tuesday, November 20 2018, 00:00:01 Nowhere Standard Time.