This is correct behavior for XML files, but it can cause problems if you are trying to use an NSXMLParser to monkey with XHTML/HTML.

I was using an NSXMLParser to modify an XHTML webpage from Simple Wikipedia, and it was turning “#include &lt;stdio&gt;” into “#include <stdio>”, which then displayed as “#include ”, because WebKit thought <stdio> was a tag.
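The same effect is easy to reproduce outside Cocoa. Here is a minimal sketch in Python rather than Objective-C, using the standard library's `xml.sax` as a stand-in for NSXMLParser (both are event-driven parsers that hand you already-decoded character data). The `Collector` class and `parsed_text` helper are names I made up for illustration.

```python
import xml.sax
from xml.sax.saxutils import escape


class Collector(xml.sax.ContentHandler):
    """Collects the character data the parser reports (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser has already decoded entities such as &lt; by this point.
        self.chunks.append(content)


def parsed_text(xml_source):
    """Return all character data in xml_source, as the parser delivers it."""
    handler = Collector()
    xml.sax.parseString(xml_source.encode("utf-8"), handler)
    return "".join(handler.chunks)


source = "<p>#include &lt;stdio&gt;</p>"
text = parsed_text(source)            # "#include <stdio>"

# Writing the decoded text straight back produces markup with a bogus tag.
naive = "<p>" + text + "</p>"         # "<p>#include <stdio></p>"

# Re-escaping the text before writing it restores the original markup.
safe = "<p>" + escape(text) + "</p>"  # "<p>#include &lt;stdio&gt;</p>"
```

The point is only that whatever the parser decodes on the way in has to be re-encoded on the way out if the result is going to be read as markup again.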

Solution: Better Tools

For scraping/reading a webpage, XPath is the best choice. It is faster and less memory-intensive than NSXMLParser, and very concise. My experience with it has been positive.

Frankly, that code scares me. I worry I’m not escaping something I should be. I’ve learned that I don’t have the experience of the teams who wrote HTML libraries, so it’s dangerous to try to recreate their work.

(UPDATED 2009-05-26: And indeed, I screwed up. I was replacing & with &amp;, and that was causing trouble. While my “fix” of not converting & seems to work on one website, it will not work in general.)
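To illustrate how easy hand-written escaping is to get wrong, here is a small Python sketch of one common pitfall: the order of the replacements. This is not necessarily the mistake described in the update above. The two hand-written functions are hypothetical, and `xml.sax.saxutils.escape` is the standard library routine they are compared against.

```python
from xml.sax.saxutils import escape


def escape_wrong_order(text):
    # Escaping '&' last also re-escapes the ampersands that the
    # earlier replacements just introduced.
    return text.replace("<", "&lt;").replace(">", "&gt;").replace("&", "&amp;")


def escape_by_hand(text):
    # Escaping '&' first avoids the double-escaping problem.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


sample = "if (a < b && c > d)"

broken = escape_wrong_order(sample)  # "if (a &amp;lt; b &amp;&amp; c &amp;gt; d)"
manual = escape_by_hand(sample)      # "if (a &lt; b &amp;&amp; c &gt; d)"
library = escape(sample)             # same result as escape_by_hand
```

Even the correct hand-written version only covers three characters in text content. Attribute values, and text that is already partly escaped, need different handling, which is a good argument for leaning on a library.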

I would like to experiment with using JavaScript instead of an NSXMLParser, but at the moment I have a working (and surprisingly compact) NSXMLParser implementation, and much less familiarity with JavaScript than with Objective-C. Compiled Obj-C code should also be faster than JavaScript. So I’m sticking with what I have, at least until I’ve gotten Prometheus 1.0 out the door.