~ A blog about the complex relation between computers and history

Tag Archives: Word

The sudden realization that the new MS Word format, .docx, is called Office Open XML for a reason made me spend the whole day trying to figure out how these XSL transformations actually work, and whether they could be used to convert these new .docx files into something more edi(ta)ble.

It turned out that XSL transformations are, in principle, a pretty simple thing to do, just like a friend had told me. Here’s an example of how to convert a .docx file to LaTeX, in its crudest form:

First, you need to break open the .docx file. It is basically a simple zipped archive, so an ‘unzip testdoc.docx’ should do the trick; you’ll end up with several files and sub-directories, of which only the directory called ‘word’ is necessary for this test.
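If you would rather do the unpacking from a script than with ‘unzip’, something along these lines should work. It is only a minimal sketch using Python’s standard zipfile module; the function name extract_word_parts and the output directory are made up for illustration.

```python
import zipfile


def extract_word_parts(docx_path, out_dir):
    """Extract only the members under 'word/' from a .docx archive.

    extract_word_parts is a hypothetical helper name, not part of any library.
    Returns the list of archive member names that were extracted.
    """
    extracted = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # A .docx also holds '[Content_Types].xml', 'docProps/', etc.;
            # only the 'word' directory matters for this test.
            if name.startswith("word/"):
                archive.extract(name, out_dir)
                extracted.append(name)
    return extracted
```

Calling extract_word_parts("testdoc.docx", "unpacked") would then leave the main text in unpacked/word/document.xml.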

You can save the stylesheet in a file called docxtolatex.xsl in the ‘word’ directory. Then, in that directory, run ‘xsltproc docxtolatex.xsl document.xml’, and you’ll have your screen full of the document, in LaTeX markup.

You’ll notice that this XSLT only converts bold, italics and footnotes. But then again, that’s often all I need to convert…
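For those who prefer a script to XSLT, the same crude mapping (bold, italics, footnotes and nothing else) can be sketched in Python. This is not the stylesheet discussed above, only an illustration of the idea. It assumes the usual WordprocessingML layout (runs in w:r, formatting in w:rPr, text in w:t, footnote bodies in word/footnotes.xml); all function names are made up, and LaTeX special characters are not escaped.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml and word/footnotes.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _is_on(rpr, tag):
    """True if a toggle property such as w:b or w:i is switched on."""
    if rpr is None:
        return False
    el = rpr.find(f"w:{tag}", NS)
    if el is None:
        return False
    # <w:b/> means on; <w:b w:val="0"/> or "false" means explicitly off.
    return el.get(f"{{{W}}}val", "true") not in ("0", "false")


def run_to_latex(run, footnotes):
    """Convert one w:r element into a LaTeX fragment."""
    parts = []
    for child in run:
        if child.tag == f"{{{W}}}t":
            parts.append(child.text or "")
        elif child.tag == f"{{{W}}}footnoteReference":
            note_id = child.get(f"{{{W}}}id")
            parts.append("\\footnote{" + footnotes.get(note_id, "") + "}")
    text = "".join(parts)
    if not text:
        return ""
    rpr = run.find("w:rPr", NS)
    if _is_on(rpr, "i"):
        text = "\\textit{" + text + "}"
    if _is_on(rpr, "b"):
        text = "\\textbf{" + text + "}"
    return text


def paragraphs_to_latex(parent, footnotes):
    """Return one LaTeX string per w:p found under parent."""
    return [
        "".join(run_to_latex(r, footnotes) for r in p.findall("w:r", NS))
        for p in parent.iter(f"{{{W}}}p")
    ]


def docx_xml_to_latex(document_xml, footnotes_xml=None):
    """Convert the contents of document.xml (and optionally footnotes.xml)."""
    footnotes = {}
    if footnotes_xml:
        for note in ET.fromstring(footnotes_xml).findall("w:footnote", NS):
            paras = paragraphs_to_latex(note, {})
            text = " ".join(p for p in paras if p).strip()
            footnotes[note.get(f"{{{W}}}id")] = text
    body = ET.fromstring(document_xml).find("w:body", NS)
    # Blank line between paragraphs, as LaTeX expects.
    return "\n\n".join(paragraphs_to_latex(body, footnotes))
```

Feed it the contents of word/document.xml and word/footnotes.xml and print the result. Everything that is not bold, italic or a footnote comes out as plain text.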

For years now, I’ve been using a venerable old tool, rtf2latex, to convert documents from MS Word to LaTeX. For my purposes it has served well: the purpose being mostly to transform submitted articles into something that can then be made to conform to my LaTeX style for the journal whose layout I’m doing. In practice the needs are: keep the footnotes intact, and let italics, bold and underline survive the translation. Nothing else is needed, as I trust LaTeX to do the rest.

I’ve felt strangely uneasy about using rtf2latex lately, though. The fact that it is no longer available in Debian has made me look for alternatives. Also, I still had to translate the Word documents I received from .doc to RTF first. wvWare seemed the proper alternative, but it does not work with footnotes at all, and its web page says that its use for this purpose is deprecated in favour of Abiword.

I use Abiword occasionally; it is your typical Gnome program: not too complicated, works well, and is nice to look at. But I was never able to get this conversion function working until today, when I realised that I also need to install abiword-plugins… how stupid of me. Now the conversion from MS .doc to LaTeX works well, although the resulting document is slightly too fancy for me. I’d be happy with something that preserves only the logic of the markup and discards all of the funny spaces that are used to make it look like a Word document. (Why on earth would anyone want that?)

But I guess I can finally stop worrying about accidentally removing rtf2latex from my system. A replacement has been found! And although the web page and release history might make it seem that Abiword is a dead project, the traffic on the development mailing list demonstrates that the project is very much alive. I guess we will have version 2.6 someday (not that there’s really anything wrong with 2.4.6).