Introduction & Background

With the advent of the Microsoft Office 2007 Open XML formats, Office report generation changed fundamentally: it no longer depends on Office itself, and it is open to any programming language capable of reading compressed archives and manipulating XML.

In this article, I'm going to illustrate an SOA approach for generating docx reports in a distributed environment, where MS Office 2007 needs to be installed only on the developer machine (not on the production server). The application is composed of the following parts:

An ASP.NET web application

An IIS-hosted WCF service

A business tier

A data access tier

A database

The scope of this article is limited to the top two tiers.

By using the Open XML SDK (now at version 2.0), it's possible to programmatically read and write Office Open XML packages, that is, to read and write Office files without using Office COM objects. This approach is fast, light on resources, and stable. The WCF service in this application must be able to create docx reports from an existing docx template and some database data serialized as XML.

Docx files are constructed in a modular way: each one is a compressed archive of parts, most of them XML. To see this for yourself, rename a docx file, changing its extension to ".zip", and open it. The part that we're interested in is called Custom XML. The best approach for manipulating data within a docx file is to bind content controls to custom XML parts.
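The modular structure can also be inspected from code instead of renaming the file. The following is a minimal sketch in Python using only the standard library; it is not part of the original application, and the sample archive it builds is a hypothetical stand-in with typical docx part names rather than a real Word document.

```python
import io
import zipfile


def list_package_parts(docx_bytes: bytes) -> list:
    """Return the names of all parts stored in a docx package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.namelist()


def build_sample_package() -> bytes:
    """Build a tiny stand-in archive (hypothetical content, not a valid document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
        package.writestr("customXml/item1.xml", "<report/>")
    return buffer.getvalue()


if __name__ == "__main__":
    for part_name in list_package_parts(build_sample_package()):
        print(part_name)
```

For a real template, the bytes read from the docx file on disk can be passed to `list_package_parts` in place of the stand-in.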

The need for such a system arose in my company when the head office requested a data report, accessible through the web, that had to be formatted exactly the way they wanted. They supplied a sample docx document containing sample data, and they expected those reports to be generated automatically. With this system, there's no need to painfully replicate a Word format in HTML, because the system input is the docx template itself. The future of this application involves PDF conversion of the docx reports, which would eliminate the need to have MS Office installed anywhere in the system.

1. Generating a docx Template Document

The first thing to do is to build a docx document that defines the layout of the reports, using Word 2007 or later. This document contains static parts (text blocks, images, and so on) and dynamic parts that depend on the data. At first, we build and format the docx file as we expect it to look with the dynamic data in it. Then, when we're happy with the way it looks, it's time to add the content controls. On the Word ribbon, go to the Developer tab (if you don't see it, it has to be enabled in the Word options first). This tab offers several content controls, such as rich text, plain text, image, etc. We now need to replace the sample static data that we've put into the document with the appropriate content controls.

2. Creating Custom XML Parts

Using Word 2007, we can put content controls into a docx document, but we can't bind those controls to custom data. To do this, we either need to modify the XML files inside the docx archive by hand, or follow the much simpler approach of using a tool like the Word 2007 Content Control Toolkit (WCCT). At this point, our docx document doesn't contain any custom XML parts. We can create one with WCCT: open the docx document in WCCT and, on the right panel, click "Create a new Custom XML part". The custom XML part is created, and it shows up in the "Bind view" tab. On the left side of the window, we can see references to the content controls that we've inserted in the file. The "Edit view" tab of the right panel lets us edit the XML. The XML that we create has to be syntactically valid, and its nodes need to correspond to the content controls in the document.
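As an illustration of what such a custom XML part might look like, here is a small Python sketch. The element names (`report`, `title`, `author`, `date`) and their values are hypothetical and only stand in for whatever your content controls need. The helper parses the XML, which fails with a `ParseError` if the syntax is broken, and lists the leaf nodes that would be candidates for binding.

```python
import xml.etree.ElementTree as ET

# Hypothetical custom XML part: one leaf node per content control.
SAMPLE_CUSTOM_XML = """<report>
  <title>Quarterly report</title>
  <author>Jane Doe</author>
  <date>2010-01-31</date>
</report>"""


def leaf_paths(xml_text: str) -> list:
    """Parse the XML and return a slash-separated path for every leaf element."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on broken syntax
    paths = []

    def walk(element, prefix):
        path = prefix + "/" + element.tag
        children = list(element)
        if not children:
            paths.append(path)
        for child in children:
            walk(child, path)

    walk(root, "")
    return paths


if __name__ == "__main__":
    for path in leaf_paths(SAMPLE_CUSTOM_XML):
        print(path)
```

Keeping one leaf node per content control makes the later binding step a simple one-to-one mapping.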

When we've finished creating the XML, it's a good idea to have WCCT check it by clicking the "Check Syntax" button. We're now ready to go back to the "Bind view", where the XML nodes we've just inserted appear in a tree structure. Now the fun part begins: binding the XML nodes to the content controls is as easy as drag-and-drop. Select one of the nodes on the right panel, and drag it onto the reference to one of the content controls of the document. Repeat this operation until all the content controls have been bound to data. When you're done, save the file and click the Preview button to open the document in Word. Notice how the custom XML data has replaced the text inside the content controls.
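Under the hood, a binding is stored in word/document.xml as a `w:dataBinding` element inside the properties (`w:sdtPr`) of the content control, with the XPath of the bound node in its `w:xpath` attribute. The Python sketch below lists the bindings found in a document part, which can help to verify that every control has been bound. The sample fragment, its aliases, and its placeholder store ID are hypothetical; a real document.xml carries much more markup.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical fragment: one bound and one unbound content control.
SAMPLE_DOCUMENT_XML = """
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:sdt>
      <w:sdtPr>
        <w:alias w:val="Title"/>
        <w:dataBinding w:xpath="/report[1]/title[1]"
                       w:storeItemID="{00000000-0000-0000-0000-000000000000}"/>
      </w:sdtPr>
      <w:sdtContent><w:p><w:r><w:t>Sample title</w:t></w:r></w:p></w:sdtContent>
    </w:sdt>
    <w:sdt>
      <w:sdtPr><w:alias w:val="Author"/></w:sdtPr>
      <w:sdtContent><w:p><w:r><w:t>Sample author</w:t></w:r></w:p></w:sdtContent>
    </w:sdt>
  </w:body>
</w:document>
"""


def qualified(name: str) -> str:
    """Return the namespace-qualified form ElementTree uses for w: names."""
    return "{" + W_NS + "}" + name


def list_bindings(document_xml: str) -> List[Tuple[str, Optional[str]]]:
    """Return (alias, xpath) per content control; xpath is None when unbound."""
    root = ET.fromstring(document_xml)
    results = []
    for properties in root.iter(qualified("sdtPr")):
        alias = properties.find(qualified("alias"))
        binding = properties.find(qualified("dataBinding"))
        name = alias.get(qualified("val")) if alias is not None else "(unnamed)"
        xpath = binding.get(qualified("xpath")) if binding is not None else None
        results.append((name, xpath))
    return results


if __name__ == "__main__":
    for name, xpath in list_bindings(SAMPLE_DOCUMENT_XML):
        print(name, "->", xpath or "not bound")
```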

3. Building the WCF Service

The WCF service replaces the custom XML inside the docx template with XML data produced by the business logic. Using the Open XML SDK, this is actually very easy; the service does it in a method called replaceCustomXML.
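The core idea of the replacement step can be sketched without the SDK. The Python example below is not the article's replaceCustomXML method, which is written against the Open XML SDK; it is a language-neutral stand-in that treats the docx as a zip archive and swaps one part. The part name `customXml/item1.xml` is an assumption about where the custom XML part is stored; check the archive of your own template.

```python
import io
import zipfile

# Assumed location of the custom XML part inside the package.
CUSTOM_XML_PART = "customXml/item1.xml"


def replace_custom_xml(template_bytes: bytes, new_xml: str,
                       part_name: str = CUSTOM_XML_PART) -> bytes:
    """Return a copy of the docx template with one part replaced by new_xml."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as source:
        if part_name not in source.namelist():
            raise KeyError("part not found in template: " + part_name)
        with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
            for name in source.namelist():
                if name == part_name:
                    # Swap in the XML produced by the business logic.
                    target.writestr(name, new_xml.encode("utf-8"))
                else:
                    # Copy every other part unchanged.
                    target.writestr(name, source.read(name))
    return output.getvalue()
```

Because the content controls are bound to the custom XML part, the document picks up the new values the next time it is opened; the template's layout is left untouched.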

4. Building the ASP.NET Client

The ASP.NET client has a template.xml file which replicates the structure of the custom XML part in the server's docx template. Ideally, there would be a web page which automatically generates the input controls by mirroring the structure of the XML template file. After the data has been entered, the web client must compose an XML document which follows the structure of the existing template.xml but replaces the sample data with the values entered by the user. The XML string is then sent to the WCF service, which returns the bytes of the docx file. These bytes can then either be saved as a docx file on the server, or sent directly to the client through HTTP.
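The composition step on the client can be sketched as follows, again in Python rather than in the client's own ASP.NET code. The template content and the element names are hypothetical, and matching user input to leaf elements by tag name is a simplifying assumption that only works when tag names are unique within the template.

```python
import xml.etree.ElementTree as ET

# Hypothetical template.xml content mirroring the custom XML part.
TEMPLATE_XML = "<report><title>Sample title</title><author>Sample author</author></report>"


def fill_template(template_xml: str, values: dict) -> str:
    """Return the template XML with leaf element text replaced by user values."""
    root = ET.fromstring(template_xml)
    for element in root.iter():
        is_leaf = len(element) == 0
        if is_leaf and element.tag in values:
            element.text = values[element.tag]
    # Special characters in the values are escaped during serialization.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(fill_template(TEMPLATE_XML, {"title": "Q1 & Q2 results"}))
```

The returned string is what would be sent to the service; elements without a matching user value keep their sample text.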

5. Points of Interest

Using the Open XML SDK 2.0 is a piece of cake, and it's a revolutionary approach to generating MS Office based reports.

The best approach to putting custom data into a docx document is to bind content controls to XML. Microsoft initially took a different approach which permitted more flexibility (associating an XML schema with documents), but it was stopped due to patent infringement issues. Word Content Control Toolkit makes life a lot easier when it comes to binding custom XML to content controls.

The Office Open XML format makes it possible to generate MS Office documents without interfacing with MS Office components, which in turn makes it possible to build distributed applications.

More to come: The future of this demo application includes adding PDF conversion. By doing this, the need of having MS Office installed somewhere in the system is totally eliminated, because PDF becomes the document exchange format.

About the Author

I've been involved in object-oriented software development since 2006, when I graduated in Information and TLC Engineering. I've been working for several software companies / departments, mainly on Microsoft and Sun/Oracle technologies. My favourite programming language is C#, next comes Java.
I love design patterns and when I need to resolve a problem, I try to get the best solution, which is often not the quickest one.

"On the best teams, different individuals provide occasional leadership, taking charge in areas where they have particular strengths. No one is the permanent leader, because that person would then cease to be a peer and the team interaction would begin to break down. The structure of a team is a network, not a hierarchy ..."
My favourite team work quotation by DeMarco - Lister in Peopleware

Microsoft seems to have been the one to get the worst out of it so far. They've had to make changes to MsWord.
The strategy I've illustrated in my article is in fact not the original Microsoft approach, but the modified one due to patent infringements. The original approach involved linking a schema to the document, then inserting XML Nodes defined in the schema onto the document surface, having a lot more flexibility. The approach I've illustrated is based on Content Controls instead. I learnt about all this by posting questions on the msdn forums and I got some really good answers from various Microsoft MVPs.

Thanks Erion, I was able to read the project files now. It did not work anonymously, but I was able to log on with my own google docs account and then download it. I'm looking forward to seeing how you accomplished this - it looks a lot easier than how I've been modifying the XML natively...

A possible work-around would be to insert all the images you will need (if they are not too many and not too big) into the docx template file and make them invisible. When the file is saved (by Word 2007 or above), the images are renamed and packed into the docx archive. To locate them inside the archive, rename the docx file by adding a ".zip" extension, which lets you explore its contents.
Once you have located the images and understood how they have been named, you can then programmatically modify the reference to the image file inside the custom XML part, following my tutorial.