Hpricot is over.

After years without a proper maintainer for one of why's jewels, it has been
decided to finally close the book on hpricot. Most users have migrated to alternatives
and there is simply no time or energy to continue with the current codebase.

If you feel that you have the time and wish to take it over, I suggest you instead
think about making the hpricot-like API within nokogiri 100% compatible; that is a
better use of your time.

But if you still feel like "No damnit, I wanna work on hpricot itself still!" then fork
this repo and start work. Send @evanphx or @nicksieger a message if you feel like you
want to take over the gem name with new releases under the hpricot name.

Thanks to _why for all the fun. We'll never forget it.

Now back to your original README content...

Hpricot, Read Any HTML

Hpricot is a fast, flexible HTML parser written in C. It's designed to be very
accommodating (like Tanaka Akira's HTree) and to have a very helpful library
(like some JavaScript libs -- jQuery, Prototype -- give you). The XPath and CSS
parser, in fact, is based on John Resig's jQuery.

Also, Hpricot can be handy for reading broken XML files, since many of the same
techniques can be used. If a quote is missing, Hpricot tries to figure it out.
If tags overlap, Hpricot works on sorting them out. You know, that sort of
thing.

Please read this entire document before making assumptions about how this
software works.

An Overview

Let's clear up what Hpricot is.

Hpricot is a standalone library. It requires no other libraries. Just Ruby!

While priding itself on speed, Hpricot works hard to sort out bad HTML and
pays a small penalty in order to get that right. So that's slightly more important
to me than speed.

If you can see it in Firefox, then Hpricot should parse it. That's
how it should be! Let me know the minute it's otherwise.

Primarily, Hpricot is used for reading HTML and tries to sort out troubled
HTML by having some idea of what good HTML is. Some people still like to use
Hpricot for XML reading, but remember to use the Hpricot::XML() method for that!

See COPYING for the terms of this software. (Spoiler: it's absolutely free.)

If you have any trouble, don't hesitate to contact the author. As always, I'm
not going to say "Use at your own risk" because I don't want this library to be
risky. If you trip on something, I'll share the liability by repairing things
as quickly as I can. Your responsibility is to report the inadequacies.

Installing Hpricot

You may get the latest stable version from RubyForge. Win32 binaries,
Java binaries (for JRuby), and source gems are available.

Loading Hpricot Itself

Load an HTML Page

The Hpricot() method takes a string or any IO object and loads the
contents into a document object.

doc = Hpricot("<p>A simple <b>test</b> string.</p>")

To load from a file, just get the stream open:

doc = open("index.html") { |f| Hpricot(f) }
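
Since Hpricot() accepts any IO object as well as a string, an in-memory StringIO should behave like an open file. The sketch below assumes the hpricot gem is installed. The require is guarded so the script still runs (and simply skips the parse) where it is not. The variable names are illustrative only.

```ruby
require 'stringio'

begin
  require 'hpricot' # assumes the gem is installed
rescue LoadError
  warn "hpricot is not installed; skipping the example"
end

html = "<p>A simple <b>test</b> string.</p>"

if defined?(Hpricot)
  from_string = Hpricot(html)               # parse a String directly
  from_io     = Hpricot(StringIO.new(html)) # parse an IO-like object

  # Both documents should contain the same <b> element.
  puts from_string.at("b").inner_text
  puts from_io.at("b").inner_text
end
```

Handing over a StringIO is handy in tests, where you want the IO code path without touching the filesystem.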

To load from a web URL, use open-uri, which comes with Ruby:

require 'open-uri'
# On Ruby 3.0 and later, call URI.open instead of open for URLs.
doc = open("http://qwantz.com/") { |f| Hpricot(f) }

Hpricot uses an internal buffer to parse the file, so the IO will stream
properly and large documents won't be loaded into memory all at once. However,
the parsed document object will be present in memory, in its entirety.
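
The idea of working through a fixed-size buffer can be pictured with plain Ruby. This is only a sketch of the general technique, not Hpricot's actual scanner: the `each_chunk` helper and the 16 KB buffer size are assumptions made up for illustration. It reads an IO in pieces, so only one chunk of the raw input is held at a time.

```ruby
require 'stringio'

# Hypothetical helper, not part of Hpricot: read an IO in fixed-size
# chunks and yield each one to the caller.
def each_chunk(io, buffer_size = 16 * 1024)
  # IO#read(length) returns nil once the end of the stream is reached.
  while (chunk = io.read(buffer_size))
    yield chunk
  end
end

# A 40,007-byte document held in memory, standing in for a large file.
io = StringIO.new("<p>" + ("x" * 40_000) + "</p>")

sizes = []
each_chunk(io) { |chunk| sizes << chunk.bytesize }

puts sizes.inspect # prints [16384, 16384, 7239]
```

A real parser would feed each chunk to its scanner rather than just measuring it. The point is that the raw input never has to sit in memory in one piece, even though the finished document tree does.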

Hpricot Fixups

When loading HTML documents, you have a few settings that can make Hpricot more
or less intense about how it gets involved.

:fixup_tags

Really, there are so many ways to clean up HTML and your intentions may be to
keep the HTML as-is. So Hpricot's default behavior is to keep things flexible:
it makes sure all the tags are opened and closed, but ignores any validation problems.

As of Hpricot 0.4, there's a new :fixup_tags option which will attempt
to shift the document's tags to meet XHTML 1.0 Strict.

doc = open("index.html") { |f| Hpricot(f, :fixup_tags => true) }

This doesn't quite meet the XHTML 1.0 Strict standard; it just tries to follow
the rules a bit better. For example, if Hpricot finds a paragraph in a link, it's
going to move the paragraph below the link, or up and out of other elements
where paragraphs don't belong.

If an unknown element is found, it is ignored. Again, :fixup_tags.

:xhtml_strict

So, let's go beyond just trying to fix the hierarchy. The
:xhtml_strict option really tries to force the document to be an XHTML
1.0 Strict document. Even at the cost of removing elements that get in the way.

doc = open("index.html") { |f| Hpricot(f, :xhtml_strict => true) }

What measures does :xhtml_strict take?

Shift elements into their proper containers just like :fixup_tags.

Remove unknown elements.

Remove unknown attributes.

Remove illegal content.

Alter the doctype to XHTML 1.0 Strict.
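
To see what these options actually change, it can help to run the same markup through each mode and compare the serialized output. This is a rough sketch that assumes the hpricot gem is installed (the require is guarded so the script still runs without it). The sample markup and the `parse_modes` helper are made up for illustration, and no particular output is promised here.

```ruby
begin
  require 'hpricot' # assumes the gem is installed
rescue LoadError
  warn "hpricot is not installed; skipping the example"
end

# Made-up sample: a paragraph nested inside a link, plus an unknown element.
SAMPLE = '<a href="/"><p>Inside a link</p></a><blarg>unknown</blarg>'

# Hypothetical helper: parse the same markup once per mode and return
# the serialized result for each, keyed by a label.
def parse_modes(html)
  {
    "default"       => Hpricot(html).to_html,
    ":fixup_tags"   => Hpricot(html, :fixup_tags => true).to_html,
    ":xhtml_strict" => Hpricot(html, :xhtml_strict => true).to_html
  }
end

if defined?(Hpricot)
  parse_modes(SAMPLE).each do |mode, output|
    puts "#{mode}: #{output}"
  end
end
```

Comparing the three lines side by side is a quick way to check whether a given option is too aggressive for your documents before committing to it.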

Hpricot.XML()

The last option is the :xml option, which makes some slight variations
on the standard mode. The main difference is that :xml mode won't try to output
tags which are friendlier for browsers. For example, if an opening and closing
br tag is found, XML mode won't try to turn that into an empty element.

XML mode also doesn't downcase the tags and attributes for you. So pay attention
to case, friends.

The primary way to use Hpricot's XML mode is to call the Hpricot.XML method: