I'm a software development engineer in Microsoft Office and have been working mostly on the RichEdit editor since 1994. In this blog I focus on mathematics in Office, along with some posts on RichEdit and the early Windows days.

Science and Nature have difficulties with Word 2007 mathematics

Science and Nature, two premier science publications, are having difficulties with Word 2007’s elegant new mathematics facility. Part of the reason is a misunderstanding about Word’s MathML support, which I hope this post will help to rectify. And part of it is that the new facility represents mathematical text in a way that Word itself understands. Such mathematical text can differ dramatically from text entered using the Equation Editor and MathType, which use embedded OLE objects opaque to Word. Since this second area is primarily responsible for the choices made in Word 2007, I discuss it first.

As soon as mathematical text is represented in a way that Word itself understands, things are both simpler and more complicated. Things are simpler because Word’s user interfaces, formatting commands, object model, etc., can be used directly with mathematical text. Things are more complicated because this convergence in user interfaces allows users to insert Word-oriented features into math zones such as

· Images

· Revision markings

· Footnotes and comments

· Elaborate formatting and styles, …

The file format needs to be general enough to express such material faithfully. Unfortunately, MathML 2.0 isn’t able to handle embedded XML namespaces and as such simply isn’t general enough to represent Word 2007 technical documents. Accordingly, we had to develop an XML approach that is general enough, and so we created OMML (Office MathML), which can be embedded in Word’s primary XML, WordprocessingML, and vice versa.

Office 2007 also ships XSLTs to convert OMML to MathML (omml2mml.xsl) and MathML to OMML (mml2omml.xsl). These XSLTs are used, for example, by Word for MathML clipboard support. They are stored in the subdirectory C:\Program Files\Microsoft Office\Office12. Naturally the MathML resulting from OMML in this way is missing content like images, revision markings, footnotes, etc., but for many purposes that’s acceptable. It just isn’t acceptable in the Word docx format, since this format has to reproduce exactly what the user created. The docx format and OMML are international standards and are thoroughly documented as noted in previous blog posts.
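Before omml2mml.xsl can be applied outside of Word, the math zones have to be pulled out of the docx package. The following is only a minimal sketch of that first step, not how Word does it: it assumes the main document part sits at the conventional location word/document.xml, and the function name extract_omml is made up for illustration. It returns each m:oMath element as an XML string, which could then be handed to any XSLT processor together with omml2mml.xsl.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of OMML elements such as m:oMath.
OMML_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def extract_omml(docx_path):
    """Return every m:oMath element of the main document part as an XML string.

    docx_path may be a file name or a binary file-like object.
    extract_omml is a hypothetical helper name, not part of any Office API.
    """
    with zipfile.ZipFile(docx_path) as package:
        # Assumption: the main part is stored as word/document.xml.
        root = ET.fromstring(package.read("word/document.xml"))
    # iter() walks the whole tree, so math zones inside tables etc. are found too.
    return [
        ET.tostring(zone, encoding="unicode")
        for zone in root.iter(f"{{{OMML_NS}}}oMath")
    ]
```

Images, revision markings, and other Word content inside a math zone come along as ordinary child elements of the extracted OMML; whether they survive the later conversion to MathML is a separate matter, as noted above.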

One of the very nice features of XML is that it can be translated relatively easily from one kind of XML to another. David Carlisle has used this flexibility to advantage in converting Word’s HTML to HTML with embedded MathML. Word’s HTML contains the math zones in two formats: OMML in comments and images. David’s program extracts the OMML, uses the omml2mml.xsl to convert to MathML and puts it all back together. Admittedly David is a magician, but he proves it can be done :-)
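The extraction step of such a pipeline might look roughly like the sketch below. This is not David’s program. It assumes that Word’s HTML wraps each math zone’s OMML in a conditional comment of the form <!--[if gte msEquation 12]> … <![endif]-->; that pattern and the function name omml_from_word_html are assumptions made for illustration and should be checked against real Word output.

```python
import re

# Assumption: each math zone's OMML sits inside a conditional comment
#   <!--[if gte msEquation 12]> ...OMML... <![endif]-->
OMML_COMMENT = re.compile(
    r"<!--\[if gte msEquation 12\]>(.*?)<!\[endif\]-->",
    re.DOTALL,  # OMML may span several lines
)


def omml_from_word_html(html):
    """Return the OMML fragments found in conditional comments, in document order."""
    return [fragment.strip() for fragment in OMML_COMMENT.findall(html)]
```

Each returned fragment would still need its namespace declarations supplied before an XSLT processor could run omml2mml.xsl on it, and the resulting MathML would then replace the comment and its fallback image.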

The bottom line is that Word 2007’s new math facility is a huge improvement over past approaches. But anytime such big improvements occur, there can be, and evidently are, problems with upgrading. I think the trouble is well worth it in both user convenience and the marvelous typographic quality. I’ve been doing technical word processing since the late 1960s and Word 2007’s mathematical capabilities still amaze me. Not that it’s finished; we do have a number of features to add…

We are in the process of contacting Nature and Science to understand their difficulties better and hopefully to offer solutions. The new docx math format is substantially richer typographically than earlier formats and should be considerably more valuable for a publisher. Admittedly when you have an infrastructure that works, the easiest thing is to just keep using it. But the thoroughly documented docx format should provide a much more faithful conversion path than that of the earlier doc format and MathML is readily extracted from it. In addition, we have more exciting things in mind. It’s a great time to be a scientist, engineer, mathematician, or student of those disciplines.

Excellent to hear that! I guess a lot has to do with the third-party apps they use in the downstream processing of documents and their compatibility with the docx format. I don’t really know, but presumably most DTP programs are not yet compatible with docx? Or are they? But sorting these things out with publishers would be incredibly helpful.

Also, quite a number of journals use automated systems for paper submission, where you essentially upload a doc file and then they create a pdf out of it on the server. They will need to invest quite heavily to update those pieces of software, I guess… I came across http://www.editorialmanager.com quite often, maybe working with them to add docx support would be helpful?

since Adobe didn’t want Microsoft to ship it with Office 2007. But once that’s installed, you can create the pdf directly from Word and then post it on the web. I’ve been using this facility for over a year now with my Unicode Technical Note #28 (http://www.unicode.org/notes/tn28/UTN28-PlainTextMath-v2.pdf)

That is not the sort of functionality these editorial systems provide. Here is how it works: an author uploads his paper, which often has to be uploaded as a number of separate documents (one with the title, abstract and author details, one without any clue of who wrote it, one with the figures, one with the tables, etc.). The backend system then automatically combines these documents into a number of PDFs: one where everything is combined for the author to submit, one for the editors, and one for the reviewers (from which, for example, the parts with the author name are excluded). All of this runs on backend servers, and essentially the software that runs there would need to be changed to accept docx files as input. Obviously it is a major investment to update these backend systems; after all, the software would need to understand almost the entire docx spec in order to compile it into PDF.

What might help a lot would be a code component provided by you guys that handles some of that conversion. Have you, for example, considered a component (DLL) to which you can feed a docx file and which spits out PDF data? You must have all the code around, and making it available to third-party developers might make it much simpler to modify such backend systems to deal with the new file formats.

Word 2007’s object model does have the method Document.ExportAsFixedFormat, which enables a program to export PDF from Word. To see this, launch Word, type Alt+F11 to get to Visual Basic, then choose View/Object Browser and click on Document. Clicking further on ExportAsFixedFormat shows this method’s prototype; its ExportFormat argument, of type WdExportFormat, can take either wdExportFormatPDF or wdExportFormatXPS.
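For a desktop script, a call to this method might be sketched as follows. The helper name export_fixed_format is made up for illustration, and the word_app parameter is assumed to be a running Word.Application automation object, for instance one obtained on Windows through the third-party pywin32 package. The numeric values are those of the WdExportFormat and WdSaveOptions enumerations, written out because a late-bound script does not see the named constants.

```python
WD_EXPORT_FORMAT_PDF = 17    # wdExportFormatPDF
WD_EXPORT_FORMAT_XPS = 18    # wdExportFormatXPS
WD_DO_NOT_SAVE_CHANGES = 0   # wdDoNotSaveChanges


def export_fixed_format(word_app, source_path, output_path,
                        export_format=WD_EXPORT_FORMAT_PDF):
    """Open source_path in Word and export it as PDF (default) or XPS.

    word_app is assumed to be a Word.Application automation object;
    export_fixed_format itself is a hypothetical helper, not a Word API.
    """
    document = word_app.Documents.Open(source_path)
    try:
        # First two parameters of ExportAsFixedFormat: OutputFileName, ExportFormat.
        document.ExportAsFixedFormat(output_path, export_format)
    finally:
        # Always close the document, without saving, even if the export fails.
        document.Close(WD_DO_NOT_SAVE_CHANGES)


# Possible usage on a Windows desktop with Word and pywin32 installed
# (paths are placeholders):
#   import win32com.client
#   word = win32com.client.Dispatch("Word.Application")
#   export_fixed_format(word, r"C:\docs\paper.docx", r"C:\docs\paper.pdf")
#   word.Quit()
```

Passing the application object in, rather than creating it inside the function, lets one Word instance be reused across several documents.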

But automating Word via the object model on the server seems a sure way to kill any scalability… That suggestion might work for small desktop apps, but never for a server app that needs to convert documents. The guidance from MS itself very clearly advises against using the automation object model in server apps.

I have not been posting as frequently as I would have liked, but I plan to correct this soon. Meanwhile, here are several useful OpenXML links (focused on WordprocessingML and Word 2007): End user downloads Compatibility packs for older ver