Converting XHTML To Text-Only Version Using ColdFusion And XSLT

The other day, I was having a discussion about sending emails using ColdFusion. At one point, the conversation turned to email format. To me, in this day and age, it seems silly to even worry about text-only versions of emails. I mean really - are there even any clients anymore that can't handle HTML formatting? I think even BlackBerrys can handle HTML formatted emails. As such, I generally have no problem building apps that only send out HTML versions.

But, I did think it would be a fun exercise to come up with a way to take XHTML content for emails and automatically convert it into a text-only version. I really love writing and working with XML and it just seemed that XML Transformations using XSLT would be the right tool for the job. The following demo is what I came up with after a little bit of trial and error. I'm no XSLT expert (far from it), so it's not perfect. But, considering that this is automatically created, "just in case" content, I think it's pretty good:

<!--- Save HTML content. --->
<cfsavecontent variable="strHTML">

	<h1>
		Thank you for your purchase!
	</h1>

	<p>
		Invoice number: <strong>12345</strong><br />
		Price: <strong>$19.95</strong>
	</p>

	<hr />

	<h2>
		Purchased Products
	</h2>

	<table cellspacing="5" border="1">
	<tr>
		<td>
			Muscle Girls Gone Wild
		</td>
		<td>
			$10.95
		</td>
	</tr>
	<tr>
		<td>
			Female Muscle - The Definitive Guide
		</td>
		<td>
			$9.00
		</td>
	</tr>
	</table>

	<hr />

	<p>
		If you have any questions about your order please
		contact us at
		<a href="mailto:orders@amazon.com">orders@amazon.com</a>.
	</p>

</cfsavecontent>

<!--- Define the XSLT --->
<cfsavecontent variable="strXSLT">

	<?xml version="1.0" encoding="ISO-8859-1"?>
	<xsl:transform
		version="1.0"
		xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

		<!--- Store variable for new line. --->
		<xsl:variable
			name="new-line"
			select="'&#10;'"
			/>

		<!--- Store variable for double-new line. --->
		<xsl:variable
			name="new-lines"
			select="concat( $new-line, $new-line )"
			/>

		<!---
			Match the root node plus any nodes that are not
			matched specifically by the templates defined
			below.
		--->
		<xsl:template match="*">
			<xsl:apply-templates select="text()|*" />
		</xsl:template>

		<!--- For all text nodes, output trimmed value. --->
		<xsl:template match="text()">
			<xsl:value-of select="normalize-space( . )" />
		</xsl:template>

		<!--- Denote primary header with hrule. --->
		<xsl:template match="h1">
			<xsl:apply-templates select="text()|*" />
			<xsl:value-of select="$new-line" />
			<xsl:text>---------------------------------</xsl:text>
			<xsl:value-of select="$new-lines" />
		</xsl:template>

		<!--- Denote secondary headers with hash marks. --->
		<xsl:template match="h2|h3|h4|h5">
			<xsl:text>## </xsl:text>
			<xsl:apply-templates select="text()|*" />
			<xsl:value-of select="$new-lines" />
		</xsl:template>

		<!--- Turn block level elements into text-only. --->
		<xsl:template match="p|blockquote|li">
			<xsl:apply-templates select="text()|*" />
			<xsl:value-of select="$new-lines" />
		</xsl:template>

		<!--- Add new line after table. --->
		<xsl:template match="table">
			<xsl:apply-templates select="*" />
			<xsl:value-of select="$new-line" />
		</xsl:template>

		<!--- Put each table row on its own line. --->
		<xsl:template match="tr">
			<xsl:apply-templates select="*" />
			<xsl:value-of select="$new-line" />
		</xsl:template>

		<!--- Bracket table values. --->
		<xsl:template match="td">
			<xsl:value-of select="'[ '" />
			<xsl:apply-templates select="text()|*" />
			<xsl:value-of select="' ]'" />
		</xsl:template>

		<!---
			Strip out any inline tags (and start them off with
			an initial space so that nested and sibling tags don't
			get concatenated text).
		--->
		<xsl:template match="strong|em|span|a">
			<xsl:text> </xsl:text>
			<xsl:value-of select="text()" />
		</xsl:template>

		<!---
			Replace hrule with manual dashes.
			NOTE: template also named for manual execution.
		--->
		<xsl:template match="hr" name="hr">
			<xsl:text>. . . . . . . . . . . . . . . . .</xsl:text>
			<xsl:value-of select="$new-lines" />
		</xsl:template>

		<!--- Replace break tag with new line. --->
		<xsl:template match="br">
			<xsl:value-of select="$new-line" />
		</xsl:template>

	</xsl:transform>

</cfsavecontent>

<!---
	Convert the HTML to text only. As we are doing this,
	we need to wrap the HTML in a root node so that the XML
	document we parse is well formed.
--->
<cfset strTextOnly = XmlTransform(
	("<data>" & strHTML & "</data>"),
	Trim( strXSLT )
	) />

<!---
	Strip out the XML declaration that the transform
	puts at the start of the output.
--->
<cfset strTextOnly = Trim(
	REReplace(
		strTextOnly,
		"<[^>]*>",
		"",
		"one"
		)
	) />

<!--- Output the text-only version. --->
<cfset WriteOutput( strTextOnly ) />
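
The REReplace() step above exists only to remove the leading XML declaration from the transformed string. A possible alternative, sketched here on the assumption that the XSLT processor behind XmlTransform() honors the standard XSLT 1.0 xsl:output element, is to ask for plain-text output in the stylesheet itself. The variable names strTextXSLT and strPlain are made up for this illustration, and the stylesheet is cut down to a single template to keep it short:

```cfm
<!--- Hypothetical, trimmed-down stylesheet that declares text output. --->
<cfsavecontent variable="strTextXSLT">
	<?xml version="1.0" encoding="ISO-8859-1"?>
	<xsl:transform
		version="1.0"
		xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

		<!--- Request plain-text serialization (no XML declaration). --->
		<xsl:output method="text" />

		<!--- For all text nodes, output trimmed value. --->
		<xsl:template match="text()">
			<xsl:value-of select="normalize-space( . )" />
		</xsl:template>

	</xsl:transform>
</cfsavecontent>

<!--- Transform a tiny sample document with the text-output stylesheet. --->
<cfset strPlain = XmlTransform(
	"<data><p>Thank you for your purchase!</p></data>",
	Trim( strTextXSLT )
	) />

<cfset WriteOutput( Trim( strPlain ) ) />
```

If the assumption holds, the same xsl:output line could be added near the top of the full stylesheet and the regular-expression cleanup dropped; if it doesn't, the REReplace() approach still works as written.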

As you can see, the HTML would need to be stored in some sort of content buffer and it would have to be XHTML compliant such that it could be parsed using XmlParse(). My HTML content doesn't happen to have any special characters (ex: ampersand); but, if it did, I assume they would have to be escaped prior to XML parsing. Once the XHTML is parsed, I then use ColdFusion's XmlTransform() and the given XSLT document to create the following output (copied from rendered page source):

If you have any questions about your order please contact us at orders@amazon.com.

For an automated process, I think that's pretty cool! I'm not sure I would even bother putting this into an application; but, if I needed to, it's nice to see that automatically converting HTML email content into text-only content is a rather straightforward task.
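
On the special-character point raised above, here is a minimal sketch of escaping bare ampersands before the content is wrapped in the root node. It assumes that ColdFusion's regular expression engine supports negative lookahead, and the names strRawHTML, strSafeHTML, and strXmlInput are hypothetical, invented just for this example:

```cfm
<!--- Hypothetical sample content containing a bare ampersand. --->
<cfset strRawHTML = "<p>Tom & Jerry &amp; friends &##169;</p>" />

<!---
	Escape any ampersand that does not already start one of the
	five predefined XML entities or a numeric character reference.
	NOTE: the pound sign is doubled because this is a ColdFusion
	string literal.
--->
<cfset strSafeHTML = REReplace(
	strRawHTML,
	"&(?!(amp|lt|gt|quot|apos|##[0-9]+|##[xX][0-9a-fA-F]+);)",
	"&amp;",
	"all"
	) />

<!--- Wrap in a root node, as in the demo, and confirm that it parses. --->
<cfset strXmlInput = ("<data>" & strSafeHTML & "</data>") />
<cfset xmlData = XmlParse( strXmlInput ) />
```

One side effect of this approach: named HTML entities that XML does not predefine (such as &nbsp;) get escaped too, so they would show up literally in the text-only output unless they are replaced with numeric references first.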

The problem isn't so much if e-mail clients can or can't read HTML-based messages, it's the darn rendering engines running within the email clients or webmail clients that make it a horrendous effort to create equal looking e-mails ;-)

mein gott! you forget that many people prefer to turn off html email rendering, not merely to avoid their bandwidth sucking acquaintances spamming them with twinkling eFun, but so that slimy requests for evil embedded links can't be used by noxious, pestilent spammers to verify their manky catalogues of addresses and domains.

I'm glad to know that this has helped. Your question about the & is a most excellent one. In fact, I think I know where the answer is; but I have never looked into it before. On a different post, Eric Stevens said something about creating an XML entity to work with these kinds of values:

to the top of the HTML file. I tried this before I wrote my previous comment, but I forgot that the actual HTML content has to be wrapped in <data> tags. I ended up wrapping the DOCTYPE inside the <data> tags which, of course, won't work.
