
Trying to parse out some tricky HTML? Add your question to this page and we’ll see if we can track down a simpler path to it.

Update innerHTML of an ill-formed DOM

Q: How do I update the innerHTML of the following snippet?

If I do a doc.search('#myH1').innerHTML, it shows only “Some good text here” and shows the <ul> as a sibling of the <h1> tag rather than as its child. I know that a <ul> cannot be inside an <h1>. Is there a way to update the innerHTML of the <h1> section completely, i.e. removing the <ul> tags and updating the innerHTML?

Finding all images in an HTML email

Q: How do I find all images in an HTML email?

I’m testing HTML emails to make sure all the image paths are properly formed and the images exist. I currently know how to find images if they are <TD BACKGROUND=...> or <IMG SRC=...>. But if my email template maker guy adds an image elsewhere my test code will phail. Fetching all images will solve my problem without me having to talk to a human. Thank you!

A: Simple! Use the img CSS selector: (doc/"img").

So, to throw an error when an image is found:

unless (doc/"img").empty?
  raise Exception, "no images allowed"
end

Extracting multiple children from a table

Q: I am new to Ruby, Rails and Hpricot, and don’t understand most of the XPath or CSS stuff! I have managed to get Hpricot to scrape through a document to find the section that I want, but now I am stuck with a table which is in the form.

…and I really need to be able to have these in the form ["Field 1", "Field 2", "Field 3", "Field 4", "Field 5"] for each row (there will be many rows). I tried telling it to remove the first child to get rid of the first contents, however it seems to go through all the code and also removes Field 4. Is anybody able to help me do that, please?

A: This might not be optimal, but it seems to get the job done for what you want:

Iterating over XML

Q: I have an XML product feed that has some nodes that are always named the same, and some that can be different for different products. I know how to parse nodes when I know their names. How do I parse nodes when I don’t know in advance what they will be called? These “dynamic” nodes are always the children of a given node — how do I parse dynamically just for one node?

A: Looking for a solution to this same problem, I came up with traversing the document using #containers: “Return all children of this node which can contain other nodes. This is a good way to get all HTML elements which aren’t text, comment, doctype or processing instruction nodes.”

doc.at(:parent_of_dynamic_nodes).containers.each do |node|
  # process node
end

Selecting only Immediate Children

Q: So, I’ve got an Hpricot::Elem, whose HTML looks like:

<ul>
<ul>
<li>A</li>
</ul>
<li>B</li>
<li>C</li>
</ul>

How does one find only its immediate li children (i.e. B and C, but not A)? For example, e.search("li") problematically gives me all of e’s descendants, not just immediate children. I want something like e.search("./li"), but that totally doesn’t work.

A: There are two possible selectors which may be used. The XPath selector would be /li. The CSS selector would be >li. Neither selector should have spaces in it. Spaces will trip up Hpricot 0.5.

When you continue a search from an element, that element is treated as a root node.

Excepting the First

Q: I know I can select the first div element with the expression div.test:first-child, but how do I select the other two elements? I’d like to remove any divs of the test class which aren’t first children.

Preceding / Following Children?

Q: How do I find C to D? I suppose I somehow have to use preceding-sibling, but I can’t seem to figure out how…

A: An easy way is to use #following_siblings. (The opposite is #preceding_siblings)

c_and_d = doc.at('//a[@name="articlestart"]').following_siblings

Follow-up to ‘Preceding/Following Children’ (text nodes)

Q: How do you solve the preceding question when the tags are interspersed with text nodes? For example,

<C>...</C>
Some text
<tag> </tag>
More text
Even more text
<D>...</D>

A: Use the next method instead of next_sibling in the previous code snippet. That will get text nodes as well as container nodes. Note that this will also include comments. next_node is an alias for next.

Q: I am attempting to use pure XPath if at all possible, however I am willing to hear other suggestions even though I may have to rethink my design a little. There is enough other code on the page to make it difficult. You can see here:

doc/"a[@id=p-*]"

Would be the ideal statement.

A: Try the following:

doc/"a[@id^=p-]"

The ^= operator matches the beginning of the attribute value.

Build a larger tree from several fragments

Q: What’s the best way to combine several HTML fragments into one tree, without just concatenating the strings?

Suppose I have fragment 1:

<p>This is a paragraph la la la.</p>

and fragment 2:

<ul><li>This is a test.</li><li>This is only a test.</li></ul>

What’s the best way to combine them into an Hpricot doc that contains:

<html>
<p>This is a paragraph la la la.</p>
<ul><li>This is a test.</li><li>This is only a test.</li></ul>
</html>

… without flattening to strings, concatenating them, and reparsing?

I’d like to stay in the Hpricot domain if possible. It seems to me that it’s much faster to just join the trees than to round-trip through the emitter and parser, and I’m also concerned about what would happen if some of the input is bogus and produces invalid nesting.

A: …

Parsing invalid HTML

Q: What’s the best way to parse invalid HTML?

I am trying to extract the body of an HTML page, http://www.c2.com/cgi/wiki?AtsUserStories, with doc_content = doc.search('html/body'). The problem is that the page doesn’t have the <html> and </html> tags. That kind of problem happens to me a lot: pages that don’t have </body>, or where <head> comes before <html>. I thought Hpricot already dealt with that kind of problem, but that isn’t happening here.

So, how can I deal with that kind of problem? Thanks!

A: On pages that I’ve been using this on, I’ve just tried to make it valid HTML as much as possible with regular expressions before throwing it into Hpricot. For instance, if it’s missing the <html></html> tags like on that page, insert them first before bringing it in. You could insert a </body> the same way by just searching for a </html>, or inserting a </body></html> if both are missing.

If you can find where the body is supposed to be, I’d try to insert that with a regexp. I’ve even removed some tags that were mostly useless to the output rendering (some unmatched tags) because I couldn’t get to the source. Ended up with a happy Hpricot.
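A minimal sketch of that pre-cleaning step in plain Ruby, with no Hpricot calls. The helper name wrap_missing_tags is hypothetical, and the regular expressions only cover the simple case where the <html> or <body> tags are missing entirely; a page with a <head> but no <body>, or with misplaced tags, would need more targeted fixes.

```ruby
# Hypothetical helper: add <body> and <html> wrappers when a page omits them.
# Only handles tags that are absent altogether, not misplaced or unclosed ones.
def wrap_missing_tags(html)
  fixed = html.dup
  # No <body> tag at all: treat the whole input as the body.
  fixed = "<body>#{fixed}</body>" unless fixed =~ /<body[\s>]/i
  # No <html> tag at all: wrap everything in one.
  fixed = "<html>#{fixed}</html>" unless fixed =~ /<html[\s>]/i
  fixed
end

page = "<h1>Ats User Stories</h1><p>Some text</p>"
puts wrap_missing_tags(page)
```

The returned string can then be passed to Hpricot in place of the raw page, so that a search such as 'html/body' has something to match.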

Outputting HTML instead of XHTML

[This question wasn’t formatted correctly, I’ve taken a guess at the intention in transferring over the wiki page]

Q: Hpricot seems to output XHTML instead of HTML by default. Is there a way to force HTML?

For example:

Hpricot('<br>').to_s

returns <br /> and not <br> like I wanted.

A: …

Hpricot and character encoding

Q: Hpricot (or perhaps it’s Ruby in general?) seems to struggle with character encoding. When using Hpricot with documents that contain “funny” characters such as `, the results are wonky. Does anyone have any advice on how to deal with this?
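One thing that may help, assuming Ruby 1.9 or later and assuming you know the encoding the document was actually served in: convert the input to UTF-8 before parsing it. The sketch below uses only Ruby String methods; ISO-8859-1 as the source encoding is an assumption for illustration, and whether this cures the wonky results depends on where the mismatch really occurs.

```ruby
# Bytes as they might arrive from an ISO-8859-1 page: "café" with a single-byte é.
# The source encoding here is an assumption; use whatever the page really declares.
raw = "caf\xE9".dup.force_encoding('ISO-8859-1')

# Transcode to UTF-8 so everything downstream sees one consistent encoding.
utf8 = raw.encode('UTF-8')

puts utf8                  # café
puts utf8.valid_encoding?  # true
```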

How to find elements’ relative position?

Q: Given two elements A and B, how do I tell which comes before the other? If I could use something like the start position of each element in the HTML document, I could compare the positions and figure it out, but there’s no such feature, or is there?

Warning while using :last-child

Q: Hpricot is throwing a warning, but doing what I want when I use :last-child. What gives?

A: I had the same problem. When I changed to last-of-type instead of last-child the warning went away. Of course that won’t work if you really need the last child of any type and not the last p.

Starting from Scratch

Q: My code should output a snippet of HTML for inclusion in the middle of a document. Does Hpricot have a role? Can I use it to build my snippet in the abstract (using code that knows nothing about HTML, but enough about Hpricot), and then can I call Hpricot to output the HTML of the snippet?

Getting a hold of malformed data that isn’t in an element.

Q: I’m dealing with data on HORRIBLY designed HTML pages, using tables and presentational elements for everything. What’s the best method to grab data from after a known element?

How can I get ahold of, say, the ‘volume’ of 0.1? It’s outside any element except the TD itself – I guess I’m looking for a pseudo-selector of some sort for :test combined with a selector for :after, so I can grab whatever text is directly after a given <b> but before any later element.

A: I would pull out the text of the td with Hpricot, then just make a regular expression to try to get the volume, since it’s not labelled any other particular way.
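A rough sketch of the regular-expression half of that idea. The original markup isn’t shown here, so the td_text string and its "Volume:" label format are assumptions; it stands in for whatever text you pull out of the td with Hpricot.

```ruby
# Hypothetical text of the <td>, as it might look once extracted.
td_text = "Volume: 0.1 Weight: 2.5"

# Capture the number that follows the "Volume" label (colon optional).
match = td_text.match(/volume\s*:?\s*(\d+(?:\.\d+)?)/i)

# nil when the label is not present, otherwise the number as a Float.
volume = match ? match[1].to_f : nil

puts volume.inspect
```

Adjust the label and number pattern to match the real page; the point is only that the value is identified by the text around it rather than by an element.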

Hpricot as data extractor

Q: Hpricot seems like a very cool tool, but I just ran into it and don’t grok Hpricot (yet).

I’m trying to extract data from a text field populated by tinymce so I can then translate it for pdf output. I have snippets of code that will let me extract sets of data like the table header row values (column headers), the table row row values (the data), items in an ordered or unordered list, etc.

What I’ve done is walk the html stored in the db using doc.each_child with a recursive method that recognizes “key” elements using elem.stag.name to direct actions in the method. It does pay to look behind the doc and scrounge around in the actual Hpricot code itself (I’m curious why some sections in code are marked nodoc by various authors – that’s usually where I find my questions answered).

Parsing XML whose tag names conflict with XPath axes

Q: I have an XML (gnucash data, to be precise) which has one tag called <act:parent>. I want to extract the content of this tag, but if I use the XPath "//act:parent" I find nothing.

I assume that this is a conflict with the parent axis. I don’t know if this is a bug in Hpricot, or if Hpricot is behaving correctly and I should somehow “escape” these “reserved words”. I’m pretty new to XPath and XML.

Finding an XML element that may appear in multiple namespaces

Q: I am downloading a series of WSDLs and I need to download all imported files such as additional WSDLs and schemas. The import element appears in both the WSDL and schema namespaces, so it can be in at least two namespaces. Add to that the arbitrary namespace prefixes, and the elements can appear under any imaginable namespace prefix (x, xs, xsd, auntdahliasankle, etc.). I need to be able to select all of the import elements in an XML file regardless of the namespace prefix.

I did a search on the attributes location and schemalocation and pulled back all elements that had those attributes. It worked but lacked umph. Namespaces are eeeeeeeeeeeevil.

Getting namespaced element

Q: I want to get a namespaced element (e.g. <evil:Tag>FAIL</evil:Tag>).

A: You can use the %() method: doc.%('evil:tag'). Beware that Hpricot seems to downcase tag names. Credit: Garrick van Buren.

Accessing pages by HTTP POST Request

Q: I’m trying to access a page that requires a POST variable, e.g. some URL http://www.someserver.com/somefile.html, and I want to pass a variable named foo with the value “bar” using an HTTP POST request (I already tried substituting an HTTP GET request, which didn’t work).
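One possible approach is to fetch the page yourself with Ruby’s standard Net::HTTP library and hand the response body to Hpricot. The sketch below only builds the POST request for the URL and the foo=bar variable from the question; actually sending it needs network access, so that part is left commented out.

```ruby
require 'net/http'
require 'uri'

uri = URI.parse('http://www.someserver.com/somefile.html')

# Build a POST request carrying the form variable foo=bar.
request = Net::HTTP::Post.new(uri.path)
request.set_form_data('foo' => 'bar')

puts request.method  # POST
puts request.body    # foo=bar

# Sending the request and parsing the result (not run in this sketch):
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
# doc = Hpricot(response.body)
```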

Matching and iterating over multiple possible XML elements

(@doc/"item | entry").each do |stuff|
  # what you'll choose to parse…
end

Why can some FeedBurner feeds contain the first element or the other?

Finding all text elements within a html document

Q: How can I find the inner_html of all text elements (e.g. h1, h2, …, p, label, …)?
It’s easy to get all the elements when the HTML elements aren’t nested, but as soon as you nest elements (which you surely will) it gets tough.