A myriad of markup systems

It’s hard to avoid the legions of custom markup systems out there these days. Every Wiki has its own syntactical quirks, while packages like Markdown, Textile, BBCode (in dozens of variants) and reStructuredText offer easy ways of hooking markup conversion into existing applications. When it comes to being totally over-implemented and infuriatingly inconsistent, markup systems are rapidly catching up with template packages. Never one to miss out on an opportunity to reinvent the wheel, I’ve worked on several of each ;)

My most recent markup handling attempt has just been published as part of my SitePoint article on Bookmarklets (cliché). It’s a structured markup language in a bookmarklet: activate the bookmarklet to convert the text in any textarea on a page to XHTML. The syntax is ridiculously simple, and serves my limited needs just fine:

= This is a header

Here is a paragraph.

* This is a list of items
* Another item in the list

Converts to:

<h4>This is a header</h4>

<p>Here is a paragraph.</p>

<ul>
<li>This is a list of items</li>
<li>Another item in the list</li>
</ul>

The algorithm is simple, and easily portable to any language you care to mention:

1. Normalise newlines to \n, for cross-platform consistency.

2. Split the text on double newlines, to create a list of blocks.

3. For each block:

   a. If it starts with an equals sign, wrap it in header tags.

   b. If it starts with an asterisk, split it into lines, make each a list item (stripping off the asterisk at the start of the line if required) and glue them all together inside a <ul>.

   c. Otherwise, wrap it in a <p> tag provided it doesn’t have one already.

4. Glue everything back together again with a couple of newlines, to make the underlying XHTML look pretty.
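To show how little code those steps need, here is a rough Python sketch of them. It is not the bookmarklet's own JavaScript: the function name expand_shorthand and the header_tag parameter are made up for illustration, and the h4 default is simply taken from the example output above.

```python
import re


def expand_shorthand(text, header_tag="h4"):
    """Convert the shorthand to XHTML (illustrative sketch, hypothetical name)."""
    # Normalise newlines to \n, for cross-platform consistency.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Split on runs of two or more newlines to get a list of blocks.
    blocks = re.split(r"\n{2,}", text.strip())

    output = []
    for block in blocks:
        block = block.strip()
        if not block:
            continue
        if block.startswith("="):
            # Header: drop the equals sign and wrap in header tags.
            content = block[1:].strip()
            output.append(f"<{header_tag}>{content}</{header_tag}>")
        elif block.startswith("*"):
            # List: one item per line, stripping a leading asterisk if present.
            items = []
            for line in block.split("\n"):
                line = line.strip()
                if line.startswith("*"):
                    line = line[1:].strip()
                items.append(f"<li>{line}</li>")
            output.append("<ul>\n" + "\n".join(items) + "\n</ul>")
        elif re.match(r"<p[\s>]", block, re.IGNORECASE):
            # Already wrapped in a paragraph tag: leave it alone.
            output.append(block)
        else:
            output.append(f"<p>{block}</p>")

    # Glue the blocks back together with a blank line between them.
    return "\n\n".join(output)


if __name__ == "__main__":
    print(expand_shorthand("= This is a header\n\nHere is a paragraph.\n\n* One\n* Two"))
```

The sketch only looks at the first character of each block, so a block that mixes a header line with list lines is treated as one header. That matches the steps as described, but a sturdier version would want to handle that case.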

The bookmarklet comes in two flavours: Expand HTML Shorthand (the full version) and Expand HTML Shorthand IE, which drops header support in order to fit within IE’s crippling 508-character limit. A more capable bookmarklet could be built using the import-script-stub method described in my article, but the implementation of such a thing is left as an exercise for the reader (I’ve always wanted to say that).

Incidentally, markup systems that allow inline styles share a very common bug that proves extremely difficult to fix: improperly nested tags. Say you have a system where *text* is bold and _text_ is italic; what happens when the user enters _italic*italic-bold_bold*? Most systems (and that includes Markdown, Textile and my home-rolled Python solution) use naive regular expressions for inline markup processing and will output badly formed XHTML: <em>italic<strong>italic-bold</em>bold</strong>. Truly solving this problem requires a context-sensitive parser, which is an unpleasantly large amount of effort to fix what looks like a simple bug.
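The bug is easy to reproduce. The Python sketch below is an assumption about what "naive regular expressions" looks like in practice, not the code of any of the packages named above. The names naive_inline and is_well_nested are hypothetical, and the checker only understands simple attribute-free tags.

```python
import re


def naive_inline(text):
    """Apply bold, then italic, with independent regex passes (the buggy approach)."""
    text = re.sub(r"\*(.+?)\*", r"<strong>\1</strong>", text)
    text = re.sub(r"_(.+?)_", r"<em>\1</em>", text)
    return text


def is_well_nested(html):
    """Check that simple, attribute-free tags close in the reverse order they opened."""
    stack = []
    for closing, name in re.findall(r"<(/?)(\w+)>", html):
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


if __name__ == "__main__":
    result = naive_inline("_italic*italic-bold_bold*")
    print(result)
    print(is_well_nested(result))
```

Because each pass knows nothing about the tags the other one produces, neither can notice that the italic span closes inside the bold span. A checker like is_well_nested can detect the problem afterwards, but it cannot repair it.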