Tuesday, March 8, 2011

C++ Console Application to Get Comments from a Microsoft Word File

The goal of this post is to show how to construct a C++ console application that extracts comments from a Word document. It builds on a previous post, Getting Comments from a Microsoft Word File: Leveraging the OPC Format, which showed how to extract comments from a Microsoft Word document (2007 or later) by changing the extension of the Word document and accessing the files directly in the ZIP structure. In this post, we take the Word document as is and use a console application written in C++/COM that uses the OPC API to access the comments directly. The code shown here was run in Visual Studio 2010 on Windows 7.

The key to the console application logic is understanding the document parts of the Word XML format. When we cracked open the Word ZIP file, we could get the comments file directly. With the API, we have to follow the pattern the API sets out. The pattern for a Word document is discussed here on MSDN and here. The important point is that the main document part (../word/document.xml) is the main part in the package, and the comments part (../word/comments.xml) has a relationship to the main document part that can be used to obtain the comments. On our first try, we kept trying to get the comments part directly from the package relationships, which didn't work. However, once we got the document part from the package (see the FindPartByRelationshipType method in the program below), we could use the same logic to get the comments part from the document part.
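To make the two-step lookup concrete, here is a minimal sketch that models it with a plain in-memory map. It is not the OPC API and not the code from this post: RelationshipMap, FindTarget, and FindCommentsPart are hypothetical names chosen for illustration, and an empty source string is assumed to stand for the package root. The two relationship type strings are the standard ones for Word packages.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a package's relationships (NOT the OPC API).
// Key: (source part, relationship type). Value: target part name.
// Assumption: an empty source string represents the package root.
typedef std::map<std::pair<std::string, std::string>, std::string> RelationshipMap;

const std::string kOfficeDocumentRel =
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
const std::string kCommentsRel =
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments";

// (source, relationshipType) -> target part.
// Returns false when the source has no relationship of that type.
bool FindTarget(const RelationshipMap& rels,
                const std::string& source,
                const std::string& relType,
                std::string& target)
{
    RelationshipMap::const_iterator it = rels.find(std::make_pair(source, relType));
    if (it == rels.end()) {
        return false;
    }
    target = it->second;
    return true;
}

// Step 1: package root -> main document part.
// Step 2: main document part -> comments part.
bool FindCommentsPart(const RelationshipMap& rels, std::string& commentsPart)
{
    std::string documentPart;
    if (!FindTarget(rels, "", kOfficeDocumentRel, documentPart)) {
        return false;
    }
    return FindTarget(rels, documentPart, kCommentsRel, commentsPart);
}
```

With a map that holds one root relationship to /word/document.xml and one relationship from that part to /word/comments.xml, FindCommentsPart succeeds, while asking the root directly for the comments relationship fails. That failure mirrors the mistake we made on our first try.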

A crucial part of the console application is the set of definitions for content types and part-to-part relationship types. These are defined in the header file (ExtractComments.h) for this application. For example, the content type of the comments part is application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml.
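As a rough illustration of what such definitions might look like, here is a sketch of a small set of constants. The constant names, the opc_defs namespace, and the HasContentType helper are hypothetical and are not taken from ExtractComments.h; the strings themselves are the standard content types and relationship types for a Word package. Wide strings are used on the assumption that they will be handed to Windows APIs.

```cpp
#include <string>

// Hypothetical names for illustration; not the actual contents of ExtractComments.h.
namespace opc_defs {

// Content types of the two parts we care about.
const wchar_t kDocumentContentType[] =
    L"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";
const wchar_t kCommentsContentType[] =
    L"application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml";

// Relationship type: package -> main document part.
const wchar_t kOfficeDocumentRelationshipType[] =
    L"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

// Relationship type: main document part -> comments part.
const wchar_t kCommentsRelationshipType[] =
    L"http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments";

// Simplified check (exact string comparison) that a part found through a
// relationship really has the content type we expect before we read it.
inline bool HasContentType(const std::wstring& actual, const wchar_t* expected)
{
    return actual == expected;
}

}  // namespace opc_defs
```

Checking the content type after following a relationship is a cheap guard against reading the wrong part, for example a part that is reachable by the same relationship type but holds something other than comments.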

Note: In this console application we did not deal with the fact that comments in a Word document can contain more than just text. In the previous post we did handle hyperlinks as an example of non-text content in comments. The same handling would need to be added to this code. Specifically, ECMA-376 Part 1, which covers the docx format, gives the details of what a comment can contain, and that includes charts, diagrams, hyperlinks, images, video, and embedded content.

The code shown here was built starting from the samples provided with the OPC SDK Samples for Windows 7. In particular, we started from the SetAuthor project inside AllOPCSamples.zip and changed the SetAuthor program to suit our purpose. The console application takes a file name as an argument. In Visual Studio, set the file name under the configuration properties of the project as shown below.

The code is shown below, along with links for downloading it. Before getting to the code, here is a sketch of its pseudo-logic. We use the syntax (x,y) -> z to mean that x and y are used to return z. It is a bit simplistic, but it helps clarify what is coming in and what is going out.