Parsing, Traversing, And Mutating HTML With ColdFusion And jSoup

Earlier this week, James Moberg asked me if I had ever used the jSoup HTML Parser with ColdFusion. Until then, I had never even heard of it; all of my experimentation with HTML parsing in ColdFusion has been done with TagSoup. Now that ColdFusion 10 allows for custom, per-application Java libraries, however, playing with JAR files is incredibly simple. So, I decided to throw it in an Application.cfc and try it out. And, let me say, it is awesome! jSoup uses a jQuery-like syntax to allow for method chaining and effortless CSS-oriented DOM traversal.

NOTE: At the time of this writing, ColdFusion 10 was in public beta.

In an earlier post, I demonstrated how to use ColdFusion 10 with TagSoup to parse "dirty" HTML into valid XML documents. This was a relatively involved process and resulted in a Document Object Model (DOM) that required XPath queries for data extraction. jSoup's "dirty" HTML parsing is much simpler; and, like jQuery, it results in a fully-encapsulated DOM representation that exposes methods for effortless DOM traversal, data extraction, and element mutation.

To demonstrate the power of jSoup, I'm going to make an HTTP request to my Tumblr blog and extract the posts that are image-based, including the IMG source and a link to the post. In order to do this, I need to configure my Application.cfc ColdFusion Framework component to load the jSoup Java library. Luckily, the jSoup library is fully contained in a single JAR file with no dependencies.

Application.cfc - Our ColdFusion Framework Component

<cfscript>
// NOTE: CFScript tag included for Gist color-coding only. Remove!

component
	output="false"
	hint="I define the application settings and event handlers."
	{

	// Define our standard Application settings.
	this.name = hash( getCurrentTemplatePath() );
	this.applicationTimeout = createTimeSpan( 0, 0, 1, 0 );

	// Define our per-application Java library settings. Here, we
	// are telling it to load JAR and CLASS files in the lib directory
	// that is located in our application root. In this case, we are
	// loading the JSoup 1.6.2 Class for parsing, traversing, and
	// mutating HTML Documents.
	this.javaSettings = {
		loadPaths: [
			"./lib/"
		],
		loadColdFusionClassPath: true
	};

}

// NOTE: CFScript tag included for Gist color-coding only. Remove!
</cfscript>

This loads the "jsoup-1.6.2.jar" JAR file contained within the local "lib" directory. ColdFusion 10 makes it that easy!
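As a quick sanity check that the library is available, here is a minimal sketch that parses a string of "dirty" HTML rather than a remote page. The markup is hypothetical and exists only for illustration; the calls used (parse(), body(), html(), select(), and text()) are part of the jSoup API, but treat the snippet as a sketch rather than a definitive recipe.

```cfm
<cfscript>

	// Create the jSoup class (static methods, so no init() is needed).
	jSoupClass = createObject( "java", "org.jsoup.Jsoup" );

	// Hypothetical "dirty" markup: the P and B tags are never closed.
	dirtyHtml = "<div class='post'><p>Hello <b>world</div>";

	// parse() builds a complete document (html / head / body) around
	// the fragment. The javaCast() helps ColdFusion pick the
	// single-String overload of the method.
	dom = jSoupClass.parse( javaCast( "string", dirtyHtml ) );

	// Output the normalized markup that jSoup generated for the body.
	writeOutput( "<pre>" & htmlEditFormat( dom.body().html() ) & "</pre>" );

	// Output the combined text of the matched elements.
	writeOutput( dom.select( "div.post p" ).text() );

</cfscript>
```

If createObject() throws a class-not-found error here, the JAR file is most likely not inside a directory listed in this.javaSettings.loadPaths.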

With the jSoup JAR file loaded, I can now parse my Tumblr blog. In the following code, notice that we're using jSoup to make the actual HTTP request to the Tumblr blog; in addition to HTML parsing, access, and mutation, jSoup also provides methods for making full-feature HTTP requests (GET and POST) including headers and cookie values.

<cffunction
	name="renderLinkView"
	returntype="string"
	output="false"
	hint="I render the link View for the given link and image.">

	<!--- Define arguments. --->
	<cfargument name="href" />
	<cfargument name="imageSource" />

	<!--- Render the link view. --->
	<cfsavecontent variable="local.view">
		<cfoutput>

			<a href="#arguments.href#" target="_blank">
				<img src="#arguments.imageSource#" height="100" />
			</a>

		</cfoutput>
	</cfsavecontent>

	<!--- Return the link view. --->
	<cfreturn local.view />

</cffunction>

<cfscript>

	// Create our JSoup class. The class mostly has static methods
	// for parsing so we don't need to initialize it.
	jSoupClass = createObject( "java", "org.jsoup.Jsoup" );

	// Create a connection to the Tumblr blog and execute a GET HTTP
	// request on the connection. Hello muscular women!
	dom = jSoupClass.connect( "http://bennadel.tumblr.com" )
		.get()
	;

	// Get all of the posts that have an image as the primary media
	// element. From there, we can subsequently select both the image
	// and the link to the blog post.
	//
	// NOTE: If you have a space around your inner selector, jSoup
	// will throw an unexpected token error:
	// == Could not parse query '': unexpected token at '' ==
	posts = dom.select( "div.post:has(div.media img)" );

	// Loop over the blog posts to generate the images and links.
	for ( post in posts ){

		// Once we have a node within the document, select() requests
		// on the node will be relative to the given node within the
		// Document Object Model.

		// Get the link element. This is the immediate child of the
		// current post.
		link = post.select( "> a" );

		// Get the media image for the post.
		image = post.select( "div.media img" );

		// Render the link. Notice that we are preceding the
		// attribute name with "abs:". This gets jSoup to return the
		// absolute URL for the attribute value. If we did not have
		// it and the URL was relative, it would return only the
		// relative value.
		writeOutput(
			renderLinkView(
				link.attr( "abs:href" ),
				image.attr( "abs:src" )
			)
		);

	}

</cfscript>

As you can see, the retrieval of the remote HTML and the parsing of it into a DOM is basically one line of code. Then, using the select() and attr() methods (think find() and attr() in jQuery), we can easily move around the DOM, extracting all the relevant information we desire. When we run the above code, we get the following collection of Female Muscle photos:

I think jSoup just became my go-to library for HTML parsing in ColdFusion. I don't think I could come up with a way to make this process any easier. I want to give a huge thanks to James Moberg for bringing this Java library to my attention.
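The demo above only reads from the DOM. As a minimal sketch of the mutation side, the following parses a hypothetical fragment, changes attributes and classes on the matched links, and then outputs the updated markup. The markup and the example.com base URI are assumptions made for illustration. Note that a literal pound sign has to be doubled inside a CFML string, so the ID selector "#links a" is written as "##links a".

```cfm
<cfscript>

	jSoupClass = createObject( "java", "org.jsoup.Jsoup" );

	// Hypothetical markup, used only for illustration.
	html = "<div id='links'><a href='/about'>About</a> <a href='/contact'>Contact</a></div>";

	// The second argument is the base URI that jSoup uses when
	// resolving relative URLs (such as with the "abs:" prefix).
	dom = jSoupClass.parse(
		javaCast( "string", html ),
		javaCast( "string", "http://example.com/" )
	);

	// ID selector: "##" is the CFML escape for a literal pound sign.
	links = dom.select( "##links a" );

	// Like jQuery, attr( name, value ) and addClass() are applied to
	// every matched element and return the collection for chaining.
	links
		.attr( "target", "_blank" )
		.addClass( "external" )
	;

	// Replace each relative href with its absolute form.
	for ( link in links ){

		link.attr( "href", link.attr( "abs:href" ) );

	}

	// Output the mutated markup.
	writeOutput( "<pre>" & htmlEditFormat( dom.body().html() ) & "</pre>" );

</cfscript>
```

Because the changes are made on the live DOM, serializing the document (or any element within it) with html() reflects all of the mutations made up to that point.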

Any idea of the easiest way to support jSoup on pre-CF10 servers? Is it back to using the JavaClassLoader.cfc?

Really digging the increased performance jSoup seems to give over TagSoup. Not sure if it's actually due to the different *.jar or because of the extra CFC instantiation in my TagSoup/CF9 to jSoup/CF10 comparison.

I don't believe this will execute JavaScript. But, I've heard about something called PhantomJS which is a "headless" WebKit engine which apparently will help with actions like that. I haven't had a chance to look into it yet.

@Matthew,

Oooh, hmm. I know that the JavaLoader is supposed to be doing the same thing (or at least in theory). I don't know enough about Java to know what kind of error that would be. Perhaps ColdFusion is trying to decorate a Java object (to look/act like a Struct) and cannot translate some underlying property access on the Java object? Really, I have no idea - sorry!

Incredibly useful, thanks Ben. Just playing around with it a bit. Is it just my Friday numbness, or is there a possible issue composing useful ID selectors in CFML for this library, i.e. how do you escape the hash sign for an ID selector?

@John, just drop the JAR file into the following directory on the server: c:\JRun4\servers\cfusion\cfusion-ear\cfusion-war\WEB-INF\lib\ The path varies depending on your own CF installation and whether or not you went multi-instance. Then you can simply instantiate it like this: <cfset jsoup = createObject("java","org.jsoup.Jsoup")>

Hi. I'm trying to use jsoup to sanitize user-submitted HTML. Regex just doesn't cut it. When I ask jsoup to add some extra attributes to its whitelist I get this error: "The addAttributes method was not found."

The addTags method throws the same error if I give it a string rather than an array, so I made sure I'm giving addAttributes an array, but it doesn't help.

I also tried converting the CF array into a Java array using JavaCast which makes no difference either.

Great example. Does anybody know how to do exactly the same thing, but export the result as an image rather than as XML? I need to render an HTML file or URL to a JPEG file with CF10 as quickly as possible, and I need to be able to choose some extras like the output pixel size.

I am the co-founder and lead engineer at InVision App, Inc — the world's leading prototyping,
collaboration & workflow platform. I also rock out in JavaScript and ColdFusion 24x7 and I dream about
promise resolving asynchronously.