Applications like NewsMac Pro need to be able to decode these entities and translate them to the appropriate characters. Straightforward, you might think, but it actually isn’t. A character can be encoded in several ways: with a textual name, as before, but also with a decimal or hexadecimal value. In NewsMac Pro I used to use NSAttributedString’s initWithHTML method, but for whatever reason it seems to lock up under Tiger, so I had to find an alternative solution. I thought I’d post the following code to help out other developers, because if you go searching on this topic you will most likely find people telling you to use the NSAttributedString method.
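To make the three entity forms concrete, here is a minimal sketch in plain C, which compiles as-is inside an Objective-C project. It is not the NewsMac Pro code: the function names `decode_entities`, `entity_codepoint` and `encode_utf8`, and the deliberately tiny `kNamedEntities` table, are assumptions made for illustration. It handles named (`&eacute;`), decimal (`&#233;`) and hex (`&#xE9;`) entities and writes the result as UTF-8.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical lookup table: only a handful of named entities.
   A real table would need the full HTML entity list. */
struct named_entity { const char *name; unsigned long codepoint; };
static const struct named_entity kNamedEntities[] = {
    { "amp", 0x26 },  { "lt", 0x3C },    { "gt", 0x3E },   { "quot", 0x22 },
    { "nbsp", 0xA0 }, { "iexcl", 0xA1 }, { "cent", 0xA2 }, { "pound", 0xA3 },
    { "copy", 0xA9 }, { "eacute", 0xE9 },
};

/* Write code point cp as UTF-8 into out; return the number of bytes
   written, or 0 if cp is not a valid, non-zero Unicode scalar value. */
static size_t encode_utf8(unsigned long cp, char *out) {
    if (cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x80) { out[0] = (char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* body points just past '&' and len is the number of characters before ';'.
   Returns the code point, or 0 if the entity is not recognised. */
static unsigned long entity_codepoint(const char *body, size_t len) {
    if (len == 0) return 0;
    if (body[0] == '#') {
        /* Numeric form: "#233" (decimal) or "#xE9" (hex). */
        int hex = (len > 1 && (body[1] == 'x' || body[1] == 'X'));
        size_t i = hex ? 2 : 1;
        unsigned long cp = 0;
        if (i >= len) return 0;                 /* no digits at all */
        for (; i < len; i++) {
            char c = body[i];
            unsigned long d;
            if (c >= '0' && c <= '9') d = (unsigned long)(c - '0');
            else if (hex && c >= 'a' && c <= 'f') d = (unsigned long)(c - 'a' + 10);
            else if (hex && c >= 'A' && c <= 'F') d = (unsigned long)(c - 'A' + 10);
            else return 0;                      /* stray character */
            cp = cp * (hex ? 16 : 10) + d;
            if (cp > 0x10FFFF) return 0;        /* out of Unicode range */
        }
        return cp;
    }
    /* Named form: linear search of the small table above. */
    for (size_t i = 0; i < sizeof kNamedEntities / sizeof kNamedEntities[0]; i++) {
        const char *name = kNamedEntities[i].name;
        if (strlen(name) == len && memcmp(name, body, len) == 0)
            return kNamedEntities[i].codepoint;
    }
    return 0;
}

/* Return a newly malloc'd UTF-8 copy of src with recognised entities decoded.
   Unrecognised entities and bare '&' are copied through unchanged.
   The caller frees the result; NULL is returned if allocation fails. */
char *decode_entities(const char *src) {
    size_t n = strlen(src), o = 0;
    /* A decoded entity is never longer than its source text,
       so n + 1 bytes is always enough. */
    char *out = malloc(n + 1);
    if (out == NULL) return NULL;
    for (size_t i = 0; i < n; ) {
        const char *semi = (src[i] == '&') ? strchr(src + i, ';') : NULL;
        if (semi != NULL) {
            size_t len = (size_t)(semi - (src + i + 1));
            size_t wrote = encode_utf8(entity_codepoint(src + i + 1, len), out + o);
            if (wrote > 0) { o += wrote; i += len + 2; continue; }
        }
        out[o++] = src[i++];
    }
    out[o] = '\0';
    return out;
}
```

Because the string is decoded in a single pass, `&amp;lt;` comes out as the literal text `&lt;` rather than being decoded twice. From Cocoa, the result could be wrapped with `+[NSString stringWithUTF8String:]` and then freed.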

This probably isn’t the most elegant bit of code ever, but it serves its purpose:

2 Comments on “Removing entities from HTML in Cocoa”

The codes array currently does not map properly because its first four entries are @"&amp;", @"&lt;", @"&gt;" and @"&quot;", which are not the equivalents of 160, 161, 162 and 163. This also causes the other entries to be off by 4.

I’d fixed that bug long ago but forgotten to update the blog post. I’ve updated the code to use the built-in CFXMLCreateStringByUnescapingEntities() function to decode the basic entities before tackling everything else. It’s too bad this function doesn’t do all the work!