I'm a software development engineer in Microsoft Office and have been working mostly on the RichEdit editor since 1994. In this blog I focus on mathematics in Office along with some posts on RichEdit and the early Windows days.

Math Find/Replace and Rich Text Searches

A number of readers have inquired how to Find/Replace mathematical expressions in Word 2007. This post shows how it could be done nicely, although unfortunately this functionality didn’t make it into Word 2007. A previous post shows how to find simple variables in a math zone. The basic idea of finding more complex expressions is to use a rich-text search.

A rich-text search matches one or more rich-text properties in addition to matching the associated plain text according to various options. The basic algorithm for a rich-text search is to loop on a plain-text search followed by tests for the desired rich text properties. If a plain-text hit also satisfies the rich-text property tests, then a desired rich-text search hit is found.

To illustrate this approach, consider searching for a mathematical expression. This functionality ships in Office 2007’s RichEdit control, although it’s not used by any applications to date and it’s only partially described in a Microsoft Confidential document. In math built-up format, as distinguished from math linear format, mathematical objects like fraction and subscript are represented by a start delimiter, the first argument, an argument separator if the object has more than one argument, the second argument, etc., with the final argument terminated by an end delimiter. For example, the fraction a over b is represented in built-up format by {fraca|b} where {frac is the start delimiter, | is the argument separator, and } is the end delimiter. Similarly the subscript object a_b is represented by {suba|b}. Here the start delimiter is the same character for all math objects and is the Unicode character U+FDD0 in RichEdit (Word uses a different character). The kind of the object is specified by a rich-text object-name property associated with the start delimiter. So in plain text, the built-up forms of the fraction and subscript are identical if the fraction arguments are the same as their subscript counterparts. In the example here, a plain-text search for {fraca|b} matches {suba|b} as well as {fraca|b}.

Searches generally deal with plain text only, so a search for a fraction would match any object with two arguments if the arguments are the same as those of the fraction. A rich-text search is able to match only fractions when searching for fractions and only subscripts when searching for subscripts. In general, only one kind of built-up object is matched.

This is accomplished by executing an iterative loop with a plain-text search followed by a check on the object-name property for each math object in the search text. So long as the checks for object names fail and there's more text to search in the target text, the loop iterates. If there’s no more text to search, the search fails. If a plain-text match occurs and each math object has the same name in the search text as its counterpart in the target text, then the loop is exited with a successful search. Else iteration continues.
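The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RichEdit's actual implementation: the RichText class, the make helper, and the find_rich function are hypothetical names, the object-name property is modeled as a dictionary keyed by the position of each start delimiter, and | and } stand in for the real argument separator and end delimiter, as in the notation used earlier in this post.

```python
from dataclasses import dataclass, field

START = "\uFDD0"  # start delimiter that the post gives for RichEdit
SEP = "|"         # stand-in for the argument separator (post's notation)
END = "}"         # stand-in for the end delimiter (post's notation)


@dataclass
class RichText:
    """Plain text plus one rich-text property: the name of each math object."""
    text: str = ""
    names: dict = field(default_factory=dict)  # index of START -> object name


def make(*parts):
    """Build RichText from plain strings and (name, arg1, arg2, ...) tuples."""
    rich = RichText()
    for part in parts:
        if isinstance(part, str):
            rich.text += part
        else:
            name, *args = part
            rich.names[len(rich.text)] = name
            rich.text += START + SEP.join(args) + END
    return rich


def find_rich(target, pattern, start=0):
    """Return the index of the first rich-text hit at or after start, or -1."""
    pos = start
    while True:
        hit = target.text.find(pattern.text, pos)  # plain-text search
        if hit < 0:
            return -1                              # no more text to search
        # Check the object name of every math object in the search text
        # against its counterpart at the same offset in the target text.
        if all(target.names.get(hit + offset) == name
               for offset, name in pattern.names.items()):
            return hit                             # rich-text hit
        pos = hit + 1                              # names differ, keep looping


target = make("x=", ("sub", "a", "b"), "+", ("frac", "a", "b"))
fraction = make(("frac", "a", "b"))
print(target.text.find(fraction.text))  # plain-text search stops at the subscript
print(find_rich(target, fraction))      # rich-text search skips to the fraction
```

In this sketch the plain-text search reports the subscript at index 2, while the rich-text search passes over it and reports the fraction at index 8.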

This kind of search is a special case of fuzzy rich-text searches in which a single rich-text property has to match for certain text runs as well as having a plain-text match for the whole search string. Other kinds of rich text searches require more or all rich-text properties in the search and target texts to match. For years Microsoft Word has offered one such kind of rich-text search: it requires the uniform character formatting of the whole source string to match that of a target hit as well as having a plain-text match. The math search discussed here differs in that only the math-object start-delimiter name property of each object needs to match.

Find/Replace combines this Find process with the option to replace the found expression with the mathematical expression entered into the Replace text control. A cool way to enter the desired Find and Replace strings in the Find/Replace dialog text fields is to type Alt+= to turn on the math zone in these fields and then type the desired math expressions using the linear format as in Word 2007. The text controls need to be RichEdit rich-text controls to do this. Or you can paste the desired math expressions into the Find and Replace text fields. I'll give a more complete specification for how this is done in RichEdit in a later post.

This comment is really not related to this post, but I would still be interested in your opinion.

A lot of (i.e. almost all) scientific journals do not accept docx files for submission. This is sort of expected, given how new it is. But they don’t even accept files that are saved into the old Word 2003 format that contain equations that are created with the new equation editor of Word 2007. See Science magazine as just one example of many:

This essentially means that the new equation editor can’t be used for scientific work at this time… Are you trying to mitigate this problem in any way? If I understood the discussion correctly, there is code in Word 2007 that can export equations to MathML. Maybe you could release an add-in that would for example take a docx as input and then output MathML representations for all equations in the document so that magazines could use those for their further workflow? Are you trying to engage with scientific magazines to help them with this problem?

I have to say I am a bit surprised that this shows up now. I would have assumed that you got in touch with science magazines before doing the new equation editor and would have had some solution to this problem…

> Maybe you could release an add-in that would for example take a docx as input and then output MathML representations

I think that the tools to make that are all available. If you peek into the XML output from Word, you should be able to pull out all the OOXML math with a simple XPath query. To convert it to MathML, MS have an XSL stylesheet which may be in the Office distribution (but I didn’t see it in the beta I had), but is available linked from here:
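For what it's worth, the extraction step might look roughly like the Python sketch below. It assumes the main document part lives at word/document.xml inside the docx package and that equations are m:oMath elements in the Office math namespace; extract_omml is a hypothetical helper name. The conversion to MathML is not shown, since it needs the stylesheet plus an XSLT processor, which the Python standard library does not include.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by Office math (OMML) elements in docx files.
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def extract_omml(docx_file):
    """Return every m:oMath element of a docx as an XML string.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as package:
        # Assumption: the main document part is word/document.xml.
        root = ET.fromstring(package.read("word/document.xml"))
    # The "simple XPath query": every oMath element at any depth.
    equations = root.findall(".//m:oMath", {"m": M_NS})
    return [ET.tostring(eq, encoding="unicode") for eq in equations]
```

Calling extract_omml("paper.docx") (a hypothetical file name) would give one XML string per equation, each of which could then be fed to the stylesheet.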

Sorry to take so long to reply. I’m working hard on the next version 🙂

First, Alt+= isn’t enabled in Office 2007 versions of RichEdit. The testers hadn’t tested it enough to release it. The hot key will work in later versions as it does in Word 2007. You can send messages to RichEdit to turn math zones on and off. For example, EM_SETCHARFORMAT can turn on a math zone if the wParam has SCF_ONLYCFEFFECTS set (0x0200), the CHARFORMAT2::dwEffects has CFE_MATH (0x10000000) set, and the CHARFORMAT2::dwMask has CFM_MATH (0x10000000) set.
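As a rough illustration of that message, here is a Python ctypes sketch. Several things in it are assumptions rather than documented behavior: the SCF_ONLYCFEFFECTS name and value are simply taken from the text above, OR-ing in SCF_SELECTION (to target the current selection) is a guess at a sensible target, the structure layout is transcribed from the public richedit.h as far as I can tell, and math_zone_request and set_math_zone are hypothetical helper names. The hwnd argument must be the window handle of a RichEdit control that supports math zones.

```python
import ctypes
import sys

WM_USER = 0x0400
EM_SETCHARFORMAT = WM_USER + 68  # message number from richedit.h
SCF_SELECTION = 0x0001           # assumption: apply to the current selection
SCF_ONLYCFEFFECTS = 0x0200       # name and value as quoted in the post
CFM_MATH = 0x10000000            # mask bit quoted in the post
CFE_MATH = 0x10000000            # effect bit quoted in the post


class CHARFORMAT2W(ctypes.Structure):
    """Unicode CHARFORMAT2 layout; only cbSize, dwMask, dwEffects are used."""
    _fields_ = [
        ("cbSize", ctypes.c_uint32),
        ("dwMask", ctypes.c_uint32),
        ("dwEffects", ctypes.c_uint32),
        ("yHeight", ctypes.c_int32),
        ("yOffset", ctypes.c_int32),
        ("crTextColor", ctypes.c_uint32),
        ("bCharSet", ctypes.c_ubyte),
        ("bPitchAndFamily", ctypes.c_ubyte),
        ("szFaceName", ctypes.c_uint16 * 32),  # 32 UTF-16 code units
        ("wWeight", ctypes.c_uint16),
        ("sSpacing", ctypes.c_int16),
        ("crBackColor", ctypes.c_uint32),
        ("lcid", ctypes.c_uint32),
        ("dwReserved", ctypes.c_uint32),
        ("sStyle", ctypes.c_int16),
        ("wKerning", ctypes.c_uint16),
        ("bUnderlineType", ctypes.c_ubyte),
        ("bAnimation", ctypes.c_ubyte),
        ("bRevAuthor", ctypes.c_ubyte),
        ("bUnderlineColor", ctypes.c_ubyte),
    ]


def math_zone_request(on=True):
    """Build the wParam and CHARFORMAT2 that turn a math zone on or off."""
    cf = CHARFORMAT2W()
    cf.cbSize = ctypes.sizeof(CHARFORMAT2W)
    cf.dwMask = CFM_MATH                   # only the math effect is affected
    cf.dwEffects = CFE_MATH if on else 0   # set or clear the math effect
    return SCF_SELECTION | SCF_ONLYCFEFFECTS, cf


def set_math_zone(hwnd, on=True):
    """Send EM_SETCHARFORMAT to a RichEdit window (Windows only)."""
    if sys.platform != "win32":
        raise OSError("EM_SETCHARFORMAT can only be sent on Windows")
    wparam, cf = math_zone_request(on)
    send = ctypes.windll.user32.SendMessageW
    send.argtypes = [ctypes.c_void_p, ctypes.c_uint,
                     ctypes.c_size_t, ctypes.c_void_p]
    send.restype = ctypes.c_ssize_t
    # EM_SETCHARFORMAT returns a nonzero value on success.
    return send(hwnd, EM_SETCHARFORMAT, wparam, ctypes.addressof(cf)) != 0
```

Whether a given RichEdit build honors the request depends on its version, so the return value of set_math_zone is worth checking.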

Re talking with publishers, Word 2007’s math and the underlying math components are a tour de force. We were lucky to be able to ship them at all. Even though Word 2007’s math isn’t yet as interoperable as one would like, it is all openly documented. As David Carlisle points out and has proved in his post (see my blog on web math, http://blogs.msdn.com/murrays/archive/2007/04/15/creating-math-web-documents-using-word-2007.aspx), one can get the math from Word 2007 in a standard form. You can also readily make PDFs from Office 2007 applications like Word 2007. I’m currently working on improving LaTeX interoperability. As they say, Rome wasn’t built in a day 🙂

But why not just include a MathML representation of equations in docx files? Just as an additional representation, not the one used for loading etc. It seems that almost everything required to do so is present already in any case (i.e. the transform from OMML to MathML).

The upside would be huge: Publishers could use their existing and working tools and would not have to completely rework their publishing process. Pointing out that OMML is open and documented is nice, but does not help to solve the problem that Word 2007 cannot be used for scientific publications at this point. It would also be nice if MS could acknowledge that publishers are not starting from scratch with respect to equations. If you compare the cost for you to include MathML export with the cost for every publisher to change their publishing process to include OMML support, one would just hope that you simply include MathML export by default in docx.

This would really be similar to the way OLE objects are stored. There you have the binary representation that is used to edit it, but there is always also a simple bitmap stored in the file, so that a program that cannot handle the binary BLOB can still render the thing. Why not do exactly the same with MathML?