Wednesday, February 23, 2011

Example of Extracting Comments From a Microsoft Word Document
This post shows one way to grab all the comments from a Microsoft Word 2007/2010 document and display them as HTML. The method relies on the fact that, starting with Word 2007, a document is a ZIP package of XML files and associated resources (such as images) arranged in a virtual directory structure. You can "crack" one open simply by changing the extension from .docx to .zip. This packaging is part of the Office Open XML standard, ECMA-376. The package concept as described on MSDN is analogous to a filing cabinet. There is a good tutorial on the open format on this Office training page. The ZIP-based packaging is not limited to the Word format we are dealing with here: it also applies to Excel spreadsheets (.xlsx), PowerPoint presentations (.pptx) and XPS documents (.xps).
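To see the package structure without renaming anything, the same ZIP archive can be read programmatically. The following is a minimal sketch in Python using only the standard library; the function name `list_package_parts` is made up for illustration and is not part of any Office tooling.

```python
import zipfile


def list_package_parts(docx_path):
    """Return the sorted names of all parts (files) inside the package.

    docx_path may be a file path or a binary file-like object. A .docx is an
    ordinary ZIP archive, so zipfile can open it without changing the extension.
    """
    with zipfile.ZipFile(docx_path) as package:
        return sorted(package.namelist())
```

Calling `list_package_parts("Software Spec.docx")` on a document with comments should include entries such as `word/comments.xml` in the returned list.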

We thought this was kind of interesting when we first learned about it, and we looked for a way to exploit the format. What we came up with is a scenario in which you want to get the comments out of a document. With that in mind, let's begin.

Suppose you have the document "Software Spec.docx" whose comments contain, at most, text and hyperlinks, and you want to extract all the comments and hyperlinks. First we need to get at the comments, which are stored as an XML file inside the package. To do this, change the extension to .zip and then unzip the file so that you end up with a directory looking like this:

If you go into the unzipped folder you are at the top level folder:

If you go into the word directory, you should see something that looks like the following image:

This directory contains the file comments.xml, which holds the comments. But we are also assuming that the comments have hyperlinks in them, so we need to go one level further, into the _rels folder.

In the _rels folder there is a comments.xml.rels file that contains the hyperlinks that are used in the comments. Together the comments.xml and the comments.xml.rels can be used to get what we want.
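To make the relationship between the two files concrete, here is a rough Python sketch that joins them: each hyperlink in comments.xml carries a relationship id, and comments.xml.rels maps that id to the actual URL. This is not the XSLT approach used in this post, just an illustration of how the two files fit together. The function names are hypothetical; the namespace URIs are the standard WordprocessingML and package relationship namespaces.

```python
import xml.etree.ElementTree as ET

# Standard Office Open XML namespaces.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def hyperlink_targets(rels_xml):
    """Map relationship Id -> Target from the content of comments.xml.rels."""
    root = ET.fromstring(rels_xml)
    return {rel.get("Id"): rel.get("Target")
            for rel in root.iter(f"{{{PKG_REL}}}Relationship")}


def comments_with_links(comments_xml, rels_xml):
    """Return one (author, text, [urls]) tuple per comment (hypothetical helper)."""
    targets = hyperlink_targets(rels_xml)
    result = []
    for comment in ET.fromstring(comments_xml).iter(f"{{{W}}}comment"):
        author = comment.get(f"{{{W}}}author")
        # Concatenate every text run in the comment, in document order.
        text = "".join(t.text or "" for t in comment.iter(f"{{{W}}}t"))
        # Each w:hyperlink points at a relationship through its r:id attribute.
        urls = [targets.get(link.get(f"{{{R}}}id"))
                for link in comment.iter(f"{{{W}}}hyperlink")]
        result.append((author, text, urls))
    return result
```

Both arguments are the XML content of the respective files, for example as read from the unzipped folder.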

To get the comments out we'll use an XSL transform on the two XML comment files to transform XML to HTML. So the basic strategy is this:

1. Copy comments.xml, as is, into a separate directory (we'll call it the transform directory) where we'll do the transformation. A location outside the unzipped folder is best to start with.

2. In the transform directory put the comments.xml.rels file.

3. In the transform directory create a transform.xslt file and put the content shown below in it.

4. Finally, in the comments.xml file, add the line that references the transform.xslt file.

<?xml-stylesheet type="text/xsl" href="transform.xslt"?>

5. From Windows Explorer you should be able to open the comments.xml file in a browser, and the comments will be transformed. Of course there are many other ways to apply the transform, but letting the browser do the work is easy for demonstration purposes. At the time of this post this worked in IE 9 and Firefox 3.6 but not Chrome. (We didn't investigate in detail, but it may be related to this security issue.)
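The file-shuffling steps above can also be scripted. The Python sketch below is one possible way to do it, under a few assumptions: the helper names `add_stylesheet_reference` and `prepare_transform_dir` are made up for this example, the parts are assumed to be UTF-8, and transform.xslt still has to be placed in the output directory by hand.

```python
import zipfile
from pathlib import Path

STYLESHEET_PI = '<?xml-stylesheet type="text/xsl" href="transform.xslt"?>'


def add_stylesheet_reference(xml_text, pi=STYLESHEET_PI):
    """Insert the stylesheet processing instruction after the XML declaration."""
    if xml_text.startswith("<?xml "):
        end = xml_text.index("?>") + 2  # end of the <?xml ...?> declaration
        return xml_text[:end] + "\n" + pi + xml_text[end:]
    return pi + "\n" + xml_text  # no declaration: put the instruction first


def prepare_transform_dir(docx_path, out_dir):
    """Copy comments.xml (with the stylesheet line) and comments.xml.rels to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(docx_path) as package:
        # Assumption: UTF-8 content; "utf-8-sig" also drops a byte order mark.
        comments = package.read("word/comments.xml").decode("utf-8-sig")
        # Raises KeyError if the comments have no relationships (no .rels part).
        rels = package.read("word/_rels/comments.xml.rels")
    (out / "comments.xml").write_text(add_stylesheet_reference(comments),
                                      encoding="utf-8")
    (out / "comments.xml.rels").write_bytes(rels)
    return out
```

For example, `prepare_transform_dir("Software Spec.docx", "transform")` would leave the two XML files in a `transform` folder, ready to be opened in the browser once transform.xslt is added next to them.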

Above we specified that we were dealing only with comments and the hyperlinks in them. More generally, comments in a Word file can contain images, SmartArt and a lot more. To make the comment extraction method given here more robust, you would need to modify the XSLT to take all of this into account. For example, if you inserted a SmartArt Graphic into a comment, the word\comments.xml file would reference a relationship in the word\_rels\comments.xml.rels file, which would reference word\diagrams\data1.xml (for example), which might in turn reference another file such as word\diagrams\drawing1.xml. The point is that it can get quite complicated, and all the paths need to be followed to reconstruct the comments as they appear in the document.
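Following such a chain comes down to two path rules from the packaging convention: the relationships for a part live in a _rels folder beside it, named after the part plus ".rels", and an internal relationship target is resolved relative to the folder of the part that references it. A small Python sketch of those two rules follows; the helper names are hypothetical, paths use forward slashes as they do inside the ZIP, and external targets such as hyperlink URLs should not be passed through it.

```python
import posixpath


def rels_part_for(part_name):
    """Return the package path of the relationships file for part_name."""
    directory, filename = posixpath.split(part_name)
    return posixpath.join(directory, "_rels", filename + ".rels")


def resolve_target(source_part, target):
    """Resolve an internal relationship Target against the referencing part."""
    if target.startswith("/"):
        # Absolute target: already relative to the package root.
        return target.lstrip("/")
    base = posixpath.dirname(source_part)
    return posixpath.normpath(posixpath.join(base, target))
```

With these, `rels_part_for("word/comments.xml")` gives `word/_rels/comments.xml.rels`, and a target of `diagrams/data1.xml` found there resolves to `word/diagrams/data1.xml`, whose own relationships file can then be looked up the same way.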

Note that we have to deal with two XML files, comments.xml and comments.xml.rels, with one transform. We do this by using the XSLT document function. Notice that in transform.xslt there is this line: