Mon, 29 Mar 2010

I maintain the websites for several clubs. No surprise there -- anyone
with even a moderate knowledge of HTML, or just a lack of fear of
it, invariably gets drafted into that job in any non-computer club.

In one club, the person in charge of scheduling sends out an elaborate
document every three months in various formats -- Word, RTF, Excel, it's
different each time. The only regularity is that it's always full of
crap that makes it hard to turn it into a nice simple HTML table.

This quarter, the formats were Word and RTF. I used unrtf to turn
the RTF version into HTML -- and found a horrifying page full of
lines like this:

I've put the actual page content in bold; the rest is just junk,
mostly doing nothing, mostly not even legal HTML,
that needs to be eliminated if I want
the page to load and display reasonably.

I didn't want to clean up that mess by hand! So I needed some regular
expressions to clean it up in an editor.
I tried emacs first, but emacs makes it hard to try an expression then
modify it a little when the first try doesn't work, so I switched to vim.

The key to this sort of cleanup is non-greedy regular expressions.
When you have a bad tag sandwiched in the middle of a line containing
other tags, you want to remove everything from the <font
through the next > -- but no farther, or else you'll delete
real content. If you have a line like

<td><font size=3>Hello<font> world</td>

you only want to delete through the <font>, not through the </td>.

In general, you make a regular expression non-greedy by adding a ?
after the wildcard -- e.g. <font.*?>. But that doesn't work
in vim. In vim, you have to use \{M,N} which matches
from M to N repetitions of whatever immediately precedes it.
You can also use the shortcut \{-} to mean the same thing
as *? (0 or more matches) in other programs.

Using that, I built up a series of regexp substitutes to clean up
that unrtf mess in vim: