When Beautiful Soup was first released in 2004, the state of HTML
parsing in Python was appalling. Over the past eight years, things
have improved so dramatically that Beautiful Soup's HTML parser is no
longer a competitive advantage. I don't want to duplicate other
people's work, so I'm getting Beautiful Soup out of the parser
business. Beautiful Soup's job is now to provide a Pythonic
screen-scraping API on top of a data structure created by a
third-party parser.

This will be Beautiful Soup 4, and I've been planning it for
years. With help from Thomas Kluyver and Ezio Melotti, I've now met
the three main goals of Beautiful Soup 4:

Make a single codebase that works under Python 2 and Python 3.

Stop using SGMLParser (removed in Python 3) and make it possible to
swap out one parser for another.

Support two major Python parsers (lxml and html5lib) as well as
Python's (not currently very good) batteries-included parser,
html.parser.
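
To make the parser-swapping goal concrete, here is a minimal sketch. It assumes the bs4 package as it was eventually released, where the parser is chosen by a name passed as the second constructor argument; the beta discussed here may spell this differently. The first_bold helper and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

MARKUP = "<p>Hello, <b>world</b></p>"


def first_bold(markup, parser_name):
    """Parse markup with the named parser and return the text of the first <b> tag."""
    soup = BeautifulSoup(markup, parser_name)
    return soup.b.get_text()


# html.parser ships with Python, so this call needs no extra install.
# Passing "lxml" or "html5lib" instead should work the same way once
# those packages are installed.
result = first_bold(MARKUP, "html.parser")
print(result)
```

The calling code only ever touches the Beautiful Soup API, so changing parsers means changing one string.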

The first version of BS4 is almost ready for release, and I'd like you
to test it out, if you haven't already. I still need to fix some
things, in particular some performance problems. But note that even with the
performance problems, BS4 is faster than BS3 across the board.

On Python 2 or Python 3 you can install the BS4 beta with this command:

easy_install beautifulsoup4

There are three major things I'd like your feedback on before
completing the release.

Hall of Fame

The BS3 documentation lists open-source projects that use Beautiful
Soup. I stopped maintaining this list many years ago because there are
hundreds of these projects, and since most of them are
screen-scrapers, they're pretty ephemeral.

I'd like to bring this feature back as a "hall of fame", featuring
applications of Beautiful Soup that grab a reader's attention. People
who used Beautiful Soup in a high-profile way or to tackle a big
issue. Projects that are interesting to hear about even if the
software doesn't work anymore, or uses an old version of Beautiful
Soup, or if Beautiful Soup was used internally and the public only saw
the results.

My bias is towards projects having to do with space, science,
journalism, politics and social justice. Here are some examples so you
know the kind of thing I'm thinking of:

"Movable Type", a work of digital art on display in the lobby of the
New York Times building, uses Beautiful Soup to scrape New York Times
feeds.

If you did anything of this sort, or know of someone who did, I'd
like to hear about it.

Do you prefer lxml or html5lib?

Right now, the parser ranking goes lxml, html5lib, html.parser. I like
lxml because it's incredibly fast and it can parse anything. But I'd
like to see what you think of the trees it generates. Would html5lib,
with its web-browser-like heuristics, be a better default?
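
One way to form an opinion is to feed the same broken markup to each parser and look at the trees side by side. The sketch below assumes the released bs4 package and skips any parser that isn't installed; parse_with_each and the sample markup are hypothetical names chosen for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound

INVALID_MARKUP = "<a></p>"  # deliberately broken: a stray closing tag


def parse_with_each(markup, parser_names=("lxml", "html5lib", "html.parser")):
    """Return {parser name: serialized tree} for every parser that is available."""
    trees = {}
    for name in parser_names:
        try:
            trees[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # bs4 raises this when the requested parser is not installed.
            continue
    return trees


trees = parse_with_each(INVALID_MARKUP)
for name, tree in trees.items():
    print(f"{name:12} {tree}")
```

Each parser repairs invalid markup in its own way, so the printed trees will generally differ. Running this against markup you actually scrape is probably the quickest way to see which default you'd prefer.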

substitute_html_entities

BS3 had a number of overlapping and inconsistent ways of turning
HTML/XML entities into Unicode characters, and possibly turning
Microsoft smart quotes into HTML entities at the same time. In BS4,
all this stuff is gone. HTML and XML entities are *always* converted
into Unicode characters.
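
A small sketch of what "always converted" means in practice, assuming the released bs4 package and the built-in html.parser (the sample markup is invented for illustration):

```python
from bs4 import BeautifulSoup

# Named, numeric, and XML entities all appear in this markup.
soup = BeautifulSoup("<p>caf&eacute; &amp; cr&#232;me</p>", "html.parser")

# By the time you see the text, every entity is a plain Unicode character.
text = soup.p.get_text()
print(text)
```

No option needs to be set to get this behavior; the entities are gone as soon as the tree is built.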

This is great, but there's one problem: output. If you want to turn
those Unicode characters back into entities when outputting as a
string, you need to call soup.encode(substitute_html_entities=True),
which is a little clunky. I'm thinking of adding an
output_html_entities attribute that you can set on a soup or tag to
control whether this substitution happens. Do you like this idea?

I think I also need to ensure that characters like "&" and "<" are
always converted to XML entities on output, even though this will hurt
performance a bit.
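
For reference, here is roughly how the output side looks in the bs4 package as eventually released. This is an assumption about later versions, not the beta API: the released package controls entity substitution with a formatter argument rather than the substitute_html_entities flag mentioned above, and its default output escapes "&" and "<". The sample markup is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf&eacute; &amp; 1 &lt; 2</p>", "html.parser")

# Default output: "&" and "<" are escaped so the result is still valid markup,
# but other Unicode characters are left alone.
minimal = soup.p.decode()

# formatter="html" additionally turns Unicode characters back into
# named HTML entities where one exists.
as_entities = soup.p.decode(formatter="html")

print(minimal)
print(as_entities)
```

The two results should differ only in how the accented character is written.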

Conclusion

What you install with easy_install beautifulsoup4 is a beta
release. If I hear of a problem soon, there's still time to fix it,
even if it means a major change to the API. So please try it out and
give me feedback.

I'm using BeautifulSoup to take a book created with a remote coauthor on Google Docs and turn it into an eBook for Kindle and ePub readers. I've gone from knowing virtually nothing about BS to being a big fan; it is becoming my go-to for all problems XML/HTML.