Native Applescript HTML entity decoder

NOTE: In the following discussion, an ASCII 0 null character has been inserted between "&" and the remaining characters of HTML entities to prevent the HTML entities from being decoded into their original Unicode characters by the web browser. This won't affect the "Open this Scriptlet in your editor" sections, whose Applescript code may be copied and used as is.

The handler can decode the following:
- Any valid decimal or hexadecimal numeric character entity
- All 252 character reference entities in the HTML 4 entity set
- The five predefined XML entities &quot;, &amp;, &apos;, &lt;, and &gt;, representing the double quotation mark, ampersand, apostrophe, less-than sign, and greater-than sign (Unicode code points 34, 38, 39, 60, and 62, respectively), which apart from &apos; are a subset of the HTML 4 entity set

The handler can process input text strings up to at least 1,000,000 characters in length and containing up to at least 8000 HTML entities (the limits of testing thus far). It takes advantage of several features to maximize execution speed: Applescript's text item delimiters for searching and replacing, the reconstruction of characters from their Unicode code points via the character id ... specifier, and hashed access to a master property list of character reference entities.
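To illustrate two of the techniques just mentioned (text item delimiters and the character id specifier), here is a minimal sketch that decodes decimal numeric entities only. It is not the handler itself: decodeDecimalEntities is a hypothetical name, the loop handles one entity at a time rather than using the hashed lookup, and malformed entities are simply left untouched.

```applescript
-- Sketch only: decode decimal numeric entities such as "&#8230;".
-- decodeDecimalEntities is a hypothetical name, not the thread's handler.
on decodeDecimalEntities(theText)
	if theText is "" then return ""
	set savedTids to AppleScript's text item delimiters
	try
		-- Splitting on "&#" puts each candidate entity at the start of items 2..n.
		set AppleScript's text item delimiters to "&#"
		set chunks to text items of theText
		set resultText to item 1 of chunks
		repeat with i from 2 to (count chunks)
			set chunk to item i of chunks
			set semicolonOffset to offset of ";" in chunk
			set decoded to missing value
			if semicolonOffset > 1 then
				try
					-- Rebuild the character from its Unicode code point.
					set codePoint to (text 1 thru (semicolonOffset - 1) of chunk) as integer
					set decoded to character id codePoint
				end try
			end if
			if decoded is missing value then
				-- Not a well-formed decimal entity: restore the original text.
				set resultText to resultText & "&#" & chunk
			else if semicolonOffset < (length of chunk) then
				set resultText to resultText & decoded & (text (semicolonOffset + 1) thru -1 of chunk)
			else
				set resultText to resultText & decoded
			end if
		end repeat
	on error errMsg number errNum
		-- Always restore the delimiters before passing the error on.
		set AppleScript's text item delimiters to savedTids
		error errMsg number errNum
	end try
	set AppleScript's text item delimiters to savedTids
	return resultText
end decodeDecimalEntities

decodeDecimalEntities("Wait&#8230; 2 &#60; 3")
-- result: "Wait… 2 < 3"
```

The string-to-integer coercion is deliberately loose here; a stricter version would check that the text between "&#" and ";" consists of digits only.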

The handler takes an Applescript record as its input argument with one required and two optional boolean properties:

Required property:
- htmlString - the text string to be decoded

Optional boolean properties:
- condenseWhitespaces
  - true - condenses consecutive space, tab, return, and/or linefeed characters (ASCII 32, 9, 13, and 10) in the input text string into a single space character as per the HTML 5 standard
  - false (the default value if the property is omitted) - preserves white space characters exactly as they appear in the input string as per the HTML 4 standard
- decodeAngleBracketsPerHtml5
  - true (the default value if the property is omitted) - decodes the character reference entities &lang; and &rang; as mathematical left and right angle brackets (Unicode code points 10216 and 10217) as per the HTML 5 standard
  - false - decodes &lang; and &rang; as left- and right-pointing angle brackets (Unicode code points 9001 and 9002) as per the HTML 4 standard (now deprecated in the current Unicode standard)
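As a rough sketch of how optional record properties like these can be given default values, the snippet below relies on AppleScript's record concatenation, in which properties of the left-hand record take precedence. The helper name resolveArguments is hypothetical, and this is only one possible approach, not necessarily how the handler reads its arguments.

```applescript
-- Sketch only: fill in defaults for the optional properties.
-- resolveArguments is a hypothetical helper name.
on resolveArguments(handlerArgument)
	-- Defaults as described above for the two optional properties.
	set defaults to {condenseWhitespaces:false, decodeAngleBracketsPerHtml5:true}
	-- In a record concatenation the left-hand record wins for duplicate
	-- property names, so caller-supplied values override the defaults.
	return handlerArgument & defaults
end resolveArguments

set args to resolveArguments({htmlString:"Fish &amp; chips", condenseWhitespaces:true})
-- args holds htmlString, condenseWhitespaces (true, from the caller)
-- and decodeAngleBracketsPerHtml5 (true, from the defaults).
```

A call to the handler with only one optional property supplied would then look like decodeHtml({htmlString:"Fish &amp; chips", condenseWhitespaces:true}).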

The handler's return value is the decoded text string.

Several HTML decoding solutions are available on the Mac OS X platform, including Cocoa's NSAttributedString class used in association with the NSHTMLTextDocumentType attribute, the decode_entities function of Perl's HTML::Entities module, the unescape function of Python's HTMLParser module, and PHP's html_entity_decode function. The current handler offers the following potential advantages:
(1) Offers more granular control over whitespace handling than the alternative solutions (to my knowledge, the Cocoa solution always implements HTML 5 behavior and condenses consecutive whitespace characters)
(2) Does not strip HTML tags, in contrast with the Cocoa solution which strips HTML tags
(3) For input text strings with fewer than about 250 HTML entities to be decoded:
- Executes faster than the Perl, Python, and PHP solutions for input text strings smaller than about 100,000 characters in length
- Executes at speeds similar to the Perl, Python, and PHP solutions for input text strings greater than about 100,000 characters in length
(4) For input text strings with fewer than about 25 HTML entities to be decoded:
- Executes faster than the Cocoa solution for input text strings smaller than about 10,000 characters in length
- Executes at speeds similar to the Cocoa solution for input text strings greater than about 10,000 characters in length

The following are known limitations of the current handler:
(1) Does not decode incompletely formed HTML entities, specifically those lacking the trailing semicolon character (this mimics the behavior of the PHP decoder)
(2) Does not recognize the over 2000 new named character references in the HTML 5 standard, which represent a superset of the HTML 4 entity set and which seem thus far at least to be largely unimplemented on web pages (the hashed property list required to handle this many items would exceed Applescript's list size limit; NOTE: the current handler does recognize the newer entities - in fact, all HTML entities - if they are coded in decimal or hexadecimal numeric form)
(3) Executes more slowly than the alternative solutions for input text strings with more than about 250 HTML entities to be decoded (for example, execution times to decode a Wikipedia article ~900,000 characters in length and containing ~4550 HTML entities: Perl ~0.5 sec, Cocoa ~1.5 sec, current Applescript handler ~2.5 sec)
(4) Slows down significantly if the condenseWhitespaces input argument is set to true and the input text string contains a substantial number of long consecutive sequences of whitespace characters

(* Notes:
- "&" is separated from the remaining characters of the HTML entity in the initial "set" statements to prevent decoding by the web browser
- ellipsis1, ellipsis2, and ellipsis3 are the character, decimal numeric, and hexadecimal numeric forms of the same ellipsis character (Unicode code point 8230)
- Angle brackets are Unicode code points 10216 and 10217 in the first and third output strings, and Unicode code points 9001 and 9002 in the second and fourth output strings
*)

set |⌘| to current application
-- Get an NSMutableString version of the input string.
set str to |⌘|'s class "NSMutableString"'s stringWithString:(str)
-- If not condensing white spaces, replace them with HTML equivalents.
if (not condenseWhitespaces) then
	-- (The concatenations shown here are only needed when displaying the script code on a Web site. The entities themselves can be used otherwise.)
	tell str to replaceOccurrencesOfString:(space) withString:("&" & "nbsp;") options:(0) range:({0, its |length|()})
	tell str to replaceOccurrencesOfString:(tab) withString:("&" & "#9;") options:(0) range:({0, its |length|()})
	tell str to replaceOccurrencesOfString:("\\R") withString:("<br />") options:(|⌘|'s NSRegularExpressionSearch) range:({0, its |length|()})
end if
-- Derive an NSData object from the HTML string and an NSAttributedString from that.
set HTMLData to str's dataUsingEncoding:(|⌘|'s NSUTF8StringEncoding)
set attributedStr to |⌘|'s class "NSAttributedString"'s alloc()'s initWithHTML:(HTMLData) documentAttributes:(missing value)
-- Read off the decoded string from the NSAttributedString.
set decodedString to attributedStr's |string|()
-- Any angle brackets in the result are HTML5 interpretations. Replace them with the other type if required.
if (not decodeAngleBracketsPerHtml5) then
	set decodedString to decodedString's stringByReplacingOccurrencesOfString:(character id 10216) withString:(character id 9001)
	set decodedString to decodedString's stringByReplacingOccurrencesOfString:(character id 10217) withString:(character id 9002)
end if

-- Return the final result as AppleScript text.
return decodedString as text
end decodeHtml

Re: Native Applescript HTML entity decoder

Thanks, Nigel, and wow, what a creative way to preserve whitespaces with the ASObjC decoder! That was one of the problems that prompted me to write an Applescript solution. The other side-effect of the ASObjC decoder is that it strips away HTML tags. I often use those tags as handles for regular expression searches of downloaded web pages. The only Cocoa solution I could find involves Core Foundation's CFXMLCreateStringByUnescapingEntities function. If that could be bridged to Applescript, that might be yet another good solution.

Re: Native Applescript HTML entity decoder

Thanks for the link, DJ. Is there any way of executing the CFXMLCreateStringByUnescapingEntities function from an Applescript script directly or indirectly that does not involve a full-fledged Cocoa application?

Re: Native Applescript HTML entity decoder

bmose wrote:

The other side-effect of the ASObjC decoder is that it strips away HTML tags. I often use those tags as handles for regular expression searches of downloaded web pages.

I suppose it depends on what your ultimate aim is. In a script I use to get the content of MacScripter thread pages, formatted in a certain way as plain text, I use the tags to identify the sections I want to edit, do the edits, then delete all irrelevant tags and run whatever's left through an NSAttributedString. If your aim's just to convert HTML entities but leave the tags in place, you might get away with entitising (ouch!) the tag brackets too in the script above:
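A minimal sketch of that idea, offered as an assumption about how it might look rather than as the code from the original post: turn the tag brackets into entities before the string is handed to NSAttributedString, so that they come back out as literal "<" and ">" characters instead of being treated as markup. The sample string is made up for illustration.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Sketch only: entitise the tag brackets so the tags survive decoding as text.
set str to current application's NSMutableString's stringWithString:"<p>Fish &amp; chips</p>"
-- (As above, the concatenations are only needed when displaying the code on a Web site.)
tell str to replaceOccurrencesOfString:"<" withString:("&" & "lt;") options:0 range:{0, its |length|()}
tell str to replaceOccurrencesOfString:">" withString:("&" & "gt;") options:0 range:{0, its |length|()}
str as text
-- result: "&lt;p&gt;Fish &amp; chips&lt;/p&gt;"
```

In the decoding script, these two lines would have to run before the whitespace replacements; otherwise the "<br />" tags inserted for line breaks would be entitised as well and would show up as literal text in the result.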

Re: Native Applescript HTML entity decoder

bmose wrote:

Thanks for the link, DJ. Is there any way of executing the CFXMLCreateStringByUnescapingEntities function from an Applescript script directly or indirectly that does not involve a full-fledged Cocoa application?

You can wrap it in a framework. There are several open source third-party frameworks with categories on NSString to do what you want -- you could build your own framework from one of those.

Re: Native Applescript HTML entity decoder

Nigel Garvey wrote:

If your aim's just to convert HTML entities but leave the tags in place, you might get away with entitising (ouch!) the tag brackets too

Thanks for another creative entitization (:lol:) suggestion. I tend to parse as you do: use tags as search handles -> extract the desired text -> decode HTML entities. My focus on preserving HTML tags is more for robustness so that that option is available for some future need.

Shane Stanley wrote:

There are several open source third-party frameworks with categories on NSString

That sounds like a great solution for the current task and also a powerful tool in general. I don't have experience creating frameworks. Could you possibly suggest a specific link that might be particularly helpful for someone like myself climbing up the learning curve?

Re: Native Applescript HTML entity decoder

bmose wrote:

I don't have experience creating frameworks. Could you possibly suggest a specific link that might be particularly helpful for someone like myself climbing up the learning curve?

If you've used Xcode at all, it's pretty simple. Assuming you have tracked down suitable Objective-C files, you create a new project in Xcode, choose macOS and Cocoa Framework as the template, then add your Objective-C .h and .m files to the project. The only settings you may need to change are for deployment target, and what headers are exposed (Build Phases -> Headers, and make public what you want exposed).

So in theory you can go to something like this <https://github.com/mwaterfall/MWFeedParser>, download it, copy NSString+HTML.h, NSString+HTML.m, GTMNSString+HTML.h and GTMNSString+HTML.m to your project (plus the required copyright attributions), Build (For Profiling), put it in ~/Library/Frameworks and use it like this:

Applescript:

use framework "Foundation"
use framework "NameOfFramework"
use scripting additions

on decodeHtml(handlerArgument)
	set str to htmlString of handlerArgument
	set str to current application's NSString's stringWithString:str
	return str's stringByDecodingHTMLEntities() as string
end decodeHtml

Because the code is a category that extends NSString rather than adding a new class, that's it.

However... that's not a totally good idea. The problem is that when you use that in scripts run from app menus, you're loading the framework into the host app. And it's bad form to add categories like that -- it's probably safe, but there's a small element of risk. I wouldn't distribute it that way. You're better off changing the categories to new classes with your own prefix. It's more work to call them that way, but it's safer. (This is what I do with SMSForder in BridgePlus).

Re: Native Applescript HTML entity decoder

Thanks for the link and the very helpful instructions. It opens up so many possibilities. I appreciate your safety advice about categories and will try to get into the habit of making my own classes from the start.