My life in DocBook ligature hell

I am working on a new project. The result will be a document. I want to publish it in several formats, including HTML, PDF, and Word. (Hot tip: if you rename an RTF file to end in DOC, you’re done. Don’t make DOC files; they suck. Make RTF files, which any editor can read. You just have to trick your users and Word into using RTFs by renaming them.)

The accepted standard these days for formatting a document into PDF, RTF and HTML is DocBook. In fact, the modern implementation of DocBook is just “XML + XSL (+ FO)”. You could do pretty much the same thing with XML and XSL yourself, but then your document would not be compatible with someone else’s formatting instructions, and you’d be redoing a bunch of work other people already did.

So DocBook is interesting, but by default it was making my document ugly. Really, really ugly. I just had to do something about it, and that led me into playing with fonts and layout. The fonts were easy, because there’s really good info in DocBook XSL: The Complete Guide. Farting around finding free TTF fonts was a pain in the butt, but I eventually paired Futura with Garamond, on the advice of the nifty (and niftily named) Esperfonto.

Since I was obsessing about the way the page looked, I decided it would be nice to turn on whatever magic DocBook has available to make printed pages look nice. After all, if DocBook is good enough for O’Reilly, it ought to be good enough for me. I’d already noticed that DocBook was making my quotes curve (as long as you use the hideously verbose <quote></quote> tags). Then I looked into getting my apostrophes to curve.

Hello? Tap tap. Is this thing on?

Turns out the number one rule of the DocBook Club is you don’t talk about the DocBook Club. There are plenty of mailing lists and stuff, but there are no answers. Eventually, it seemed to me that the actual answer is “insert the Unicode symbol for them yourself”.

Wow, that’s seriously fucking stupid. Not even Word is that stupid… and I thought Word had the market cornered on stupid.
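For what it’s worth, “insert the Unicode symbol yourself” does not have to mean retyping every apostrophe by hand. Here is a minimal sketch in Python of how it could be scripted. The function name `curl_apostrophes` is made up for illustration, and the rule is deliberately naive: it only curls a straight apostrophe that sits between two word characters, so trailing apostrophes (as in dogs') are left alone. It also does not know anything about XML, so it would need to be kept away from markup and attribute values.

```python
import re

# Hypothetical helper: a straight apostrophe with a word character on both
# sides (don't, O'Reilly) becomes U+2019 RIGHT SINGLE QUOTATION MARK.
_APOSTROPHE_RE = re.compile(r"(?<=\w)'(?=\w)")


def curl_apostrophes(text: str) -> str:
    """Replace in-word straight apostrophes with the curved U+2019."""
    return _APOSTROPHE_RE.sub("\u2019", text)


if __name__ == "__main__":
    print(curl_apostrophes("Don't make DOC files if O'Reilly wouldn't."))
```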

So I gave up on curved apostrophes for a bit and went to go turn on ligatures. Since you are probably like me and don’t know your ligatures from your descenders, check out this great picture:

Click on the picture to go learn more. It’s a great website — lots of typo geekery and puns… my kind of stuff. There’s also cool stuff on ligatures there.

So after much gnashing of teeth and experimenting, here’s what I’m coming to see about DocBook, PDF and HTML:

Ligatures work, and can be quite beautiful.

Ligatures make searching in the PDF document fail mysteriously, because the underlying word is gone, and what remains are the characters that get displayed. “Difficult” is a different word than “Di ffi-ligature cult”. Because PDF is all about the flash, no one has gotten around to fixing this. It would take rocket science. What should have happened is that about 20 years ago someone stood up and said, “Wait! Words are words, and instructions for showing words are not words, and when the two are different, both have to be preserved!” And in fact, MacOS seems to have this… watch this video to see that Macs can add ligatures “on top of” underlying text without forgetting what the text itself is:

Ligatures don’t work with all fonts. That makes them tricky to use in PDF (but doable) and really dangerous to do in HTML (where there’s no guarantee what font you’ll get).

DocBook has no support whatsoever for ligatures, just as support is absent for curved apostrophes. It just boils down to “type the Unicode”.
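The search problem in the list above is easy to reproduce with plain strings, no PDF required. This is only a sketch of the idea, not a claim about how any particular PDF viewer searches. It assumes the ligature was typed in as the Unicode character U+FB03, and the helper name `fold_ligatures` is invented for the example. Unicode’s NFKC normalization happens to map the Latin ligature characters back to their separate letters, which is roughly the “remember the underlying word” step that is missing.

```python
import unicodedata

plain = "difficult"
ligated = "di\ufb03cult"  # U+FB03 LATIN SMALL LIGATURE FFI replaces "ffi"


def fold_ligatures(text: str) -> str:
    """Hypothetical helper: undo compatibility characters such as ligatures."""
    return unicodedata.normalize("NFKC", text)


# A naive substring search misses the word once the ligature is in place.
found_raw = plain in ligated

# Folding the text first brings the original letters back.
found_folded = plain in fold_ligatures(ligated)

print(found_raw, found_folded)  # prints: False True
```

Note that the ligated string is two characters shorter than the plain one, which is exactly why a character-by-character search gives up.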

So, now I’m contemplating one of two ways forward:

Just forget them altogether, which would be a shame.

Write a script called “ligify” that finds all the ligatures and puts them in, and only let it run for the PDF generation, not the HTML generation. (Bad interactions between nXML mode and XML’s stupid macros mean I already have to do my own macro expansion with sed anyway.)
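As a rough idea of what the second option might look like, here is a sketch of “ligify” in Python. None of this exists yet: the table, the regular expressions, and the function are all assumptions made for illustration. It covers only the five common f-ligatures, leaves anything that looks like a tag or an entity reference untouched, and tries the three-letter sequences first so that “ffi” is not split into “ff” plus “i”.

```python
import re

# Longest sequences first so "ffi" and "ffl" win over "ff", "fi", and "fl".
LIGATURES = [
    ("ffi", "\ufb03"),
    ("ffl", "\ufb04"),
    ("ff", "\ufb00"),
    ("fi", "\ufb01"),
    ("fl", "\ufb02"),
]
_LIG_MAP = dict(LIGATURES)
_LIG_RE = re.compile("|".join(re.escape(src) for src, _ in LIGATURES))

# Tags and entity references are captured so re.split() hands them back
# at the odd positions of the result, where they can be skipped.
_SKIP_RE = re.compile(r"(<[^>]*>|&[^;\s]+;)")


def ligify(xml_text: str) -> str:
    """Swap f-letter sequences for ligature characters in text content only."""
    pieces = _SKIP_RE.split(xml_text)
    for i in range(0, len(pieces), 2):  # even positions are plain text
        pieces[i] = _LIG_RE.sub(lambda m: _LIG_MAP[m.group(0)], pieces[i])
    return "".join(pieces)


if __name__ == "__main__":
    print(ligify('<para role="office">A difficult office</para>'))
```

This is far from complete. It does not understand comments, CDATA sections, or verbatim elements where ligatures would be wrong, and it assumes the chosen PDF font actually contains these glyphs. It would only be run on the copy of the source that feeds the PDF build, never the HTML one.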

I’m becoming an expert in DocBook publishing, which would be a nice thing to have on my resume, except that I fear I’d slit my wrists if I were doing this stuff every day. How can it be that SGML has existed for 20 years, and it’s STILL THIS BAD? This is clearly an example of Leonard’s Law (software will be as bad as its users will tolerate).