Basics of OpenXML (Word 2007) for Beginners

Introduction

Since Microsoft unveiled OpenXML with Office 2007, many people have started to check whether they can take advantage of it. However, if you search newsgroups, communities, and forums, you will find that many consider it complicated and difficult to study and implement.

I think it is not difficult, but lengthy and somewhat complex.

ODF vs. OpenXML: This is another point of debate. ODF (OpenDocument Format) is simpler than MS OpenXML. Both follow the same XML + Zip approach. However, there is a lack of help, tutorials, and support for Open Source technologies compared with MS products.

This article: Here I try to explain the basics of OpenXML programming to help beginners. I have worked with Word 2007, and hence I will cover only the Word 2007 part. However, OpenXML implementations are quite similar across Office components.

This is not my new invention, but I am putting the basic facts scattered over the internet in one place.

Basics

Let's start with the basics. A Word file with the extension DOCX is actually a compressed archive (Zip) of XML files organized in folders and subfolders. These files are linked to each other through relations.

The following figure shows the files inside a DOCX:

To view these files, just open the DOCX file with WinZip (or any other archive tool you have). Everything (with some exceptions such as images and ActiveX controls) is stored as XML. You need to remember the following keywords: Package, Parts, Relations.

Package: Package is nothing but your DOCX file. This zip file is called a Package.

Parts: Parts are nothing but files in the Package. E.g., the area where you type (after opening Word) is the main document part. If you insert an image, it will be another part. Everything is managed in parts (numbering [bullets], images, styles, settings etc.). If you want to insert/delete/retrieve images, then you have to play with ImageParts (a sub-class of part) and so on.

Relations: The parts are linked with relations. The main relations are maintained in the .rels file inside the _rels folder within a package. Of course, this file contains XML tags too. There are other relation files in the word/_rels folder. These hold sub-relationships. E.g., if you include an image for a bullet (picture bullets), then you can find the numbering.xml.rels file in this folder. There are many other relation files and it is hard to list all of them.
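The three keywords can be seen without any SDK. The following is a minimal sketch using Python's standard zipfile module (used here only for illustration; it is not the Microsoft SDK). It builds a tiny stand-in package in memory, so the part contents are made-up sample data and much smaller than a real DOCX, and the helper names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the .rels files in a package.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOC_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/officeDocument"
)


def build_sample_package() -> bytes:
    """Build a stripped-down stand-in for a DOCX (sample data, not a valid document)."""
    root_rels = (
        f'<Relationships xmlns="{REL_NS}">'
        f'<Relationship Id="rId1" Type="{OFFICE_DOC_TYPE}" Target="word/document.xml"/>'
        "</Relationships>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")  # placeholder content
        zf.writestr("_rels/.rels", root_rels)
        zf.writestr("word/document.xml", "<document/>")  # placeholder content
    return buf.getvalue()


def list_parts(package: bytes) -> list[str]:
    """Return the names of all files (parts) stored in the package."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.namelist()


def read_relationships(package: bytes, rels_path: str) -> dict[str, str]:
    """Map each relation ID in a .rels file to the target it points to."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        root = ET.fromstring(zf.read(rels_path))
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.findall(f"{{{REL_NS}}}Relationship")
    }


if __name__ == "__main__":
    package = build_sample_package()
    print(list_parts(package))
    print(read_relationships(package, "_rels/.rels"))
```

To try it on a real file, read the DOCX bytes from disk and pass them to list_parts and read_relationships instead of the sample package.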

Relation IDs: Each relation has a unique ID. This ID is referred to in the referencing part and in the relation file. With the help of this ID, Office looks up the referenced part and displays it accordingly. For instance, add a new image to your document, then save it. Open it with WinZip. Open document.xml, look for the w:drawing tag, then inside that, look for the a:blip tag and its r:embed attribute. The value of this attribute will look like rId2. Then, open the document.xml.rels file and search for rId2; you will find the path of the image in the package!
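The same lookup can be done in a few lines of code. This is a hedged sketch in Python's standard library (not the SDK): the two XML strings are trimmed, made-up samples (a real w:drawing has several more levels of nesting around a:blip), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
NS = {"w": W_NS, "a": A_NS, "rel": REL_NS}

# Trimmed sample of word/document.xml (real files nest a:blip much deeper).
SAMPLE_DOCUMENT = (
    f'<w:document xmlns:w="{W_NS}" xmlns:a="{A_NS}" xmlns:r="{R_NS}">'
    "<w:body><w:p><w:r><w:drawing>"
    '<a:blip r:embed="rId2"/>'
    "</w:drawing></w:r></w:p></w:body></w:document>"
)

# Trimmed sample of word/_rels/document.xml.rels.
SAMPLE_RELS = (
    f'<Relationships xmlns="{REL_NS}">'
    f'<Relationship Id="rId1" Type="{R_NS}/styles" Target="styles.xml"/>'
    f'<Relationship Id="rId2" Type="{R_NS}/image" Target="media/image1.jpeg"/>'
    "</Relationships>"
)


def embedded_image_targets(document_xml: str, rels_xml: str) -> list[str]:
    """Follow every a:blip r:embed ID to the target listed in the relation file."""
    rels = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).findall("rel:Relationship", NS)
    }
    targets = []
    for drawing in ET.fromstring(document_xml).iterfind(".//w:drawing", NS):
        for blip in drawing.iterfind(".//a:blip", NS):
            # Attributes in a namespace are keyed as "{namespace}name".
            rel_id = blip.get(f"{{{R_NS}}}embed")
            if rel_id in rels:
                targets.append(rels[rel_id])
    return targets


if __name__ == "__main__":
    print(embedded_image_targets(SAMPLE_DOCUMENT, SAMPLE_RELS))
```

The target is relative to the folder of the referencing part, so media/image1.jpeg here means word/media/image1.jpeg inside the package.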

Do it

To deal with DOCX programmatically and to simplify programming, you may want to download the SDK provided by Microsoft (Microsoft SDK for OpenXML Formats; an SDK 2.0 is also available). The final release is not out yet. Download it, add a reference to it in your project, and import it.

Remember, to navigate within this XML, you need an XmlNamespaceManager with the required namespaces added to it. The required namespaces are declared on the root element of the document.xml file. If you want to add paragraphs, add child nodes to xmlDoc (the XML document loaded from the main document part, mainDoc) and then save it back to the part (xmlDoc.Save(mainDoc.GetStream())).
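To show the idea of namespace-aware navigation and adding a paragraph outside .NET, here is a small sketch with Python's xml.etree.ElementTree, where a prefix-to-namespace dictionary plays the role that XmlNamespaceManager plays in .NET. The sample XML is a made-up, trimmed document.xml and the helper names are hypothetical. The new paragraph is placed before w:sectPr because that element, when present, has to stay the last child of w:body.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}  # prefix map used for lookups

# Keep the "w:" prefix when the tree is written back out.
ET.register_namespace("w", W_NS)

# Trimmed sample of word/document.xml.
SAMPLE_DOCUMENT = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>First</w:t></w:r></w:p>"
    "<w:sectPr/>"
    "</w:body></w:document>"
)


def w(tag: str) -> str:
    """Return the fully qualified name of a WordprocessingML tag."""
    return f"{{{W_NS}}}{tag}"


def append_paragraph(document_xml: str, text: str) -> str:
    """Add a paragraph with a single run of text at the end of the body."""
    root = ET.fromstring(document_xml)
    body = root.find("w:body", NS)

    # A paragraph (w:p) holds a run (w:r), which holds the text (w:t).
    paragraph = ET.Element(w("p"))
    run = ET.SubElement(paragraph, w("r"))
    ET.SubElement(run, w("t")).text = text

    # Insert before the section properties if they exist, else append.
    sect_pr = body.find("w:sectPr", NS)
    if sect_pr is not None:
        body.insert(list(body).index(sect_pr), paragraph)
    else:
        body.append(paragraph)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(append_paragraph(SAMPLE_DOCUMENT, "Hello"))
```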

Add New Part

To add a new part, you can use the AddPart and AddNewPart methods of the WordprocessingDocument and MainDocumentPart classes. These are generic methods and you need to specify which part you want to add. The method returns the part you added, and then you can play with that.

Adding an image part this way puts the image in the package. Remember, it will not be displayed in your document unless you manually add paragraphs and the required nodes to mainDoc's XML. After running such code, open the package with WinZip and check that the image has been added under the media folder. Also, check the relation file document.xml.rels and search for media/image; you will find that a new relation tag has been added and a new unique ID has been created for that image.

You can iterate through the parts using the Parts property of a part class. Try a for-each loop and check each part in Debug mode (put a breakpoint inside the for-each loop). [Check the mainDoc.Parts property].
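To see what the new relation tag looks like, the sketch below adds one by hand using Python's standard library rather than the SDK. It only covers the relation file; the image bytes themselves would also have to be written under word/media. The ID scheme used here (next free rIdN) is an assumption made for the sketch and is not necessarily how the SDK picks its IDs; the sample XML and function name are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
IMAGE_TYPE = f"{R_NS}/image"

# Write the relation file back with its default namespace (no prefix).
ET.register_namespace("", REL_NS)

# Trimmed sample of word/_rels/document.xml.rels.
SAMPLE_RELS = (
    f'<Relationships xmlns="{REL_NS}">'
    f'<Relationship Id="rId1" Type="{R_NS}/styles" Target="styles.xml"/>'
    f'<Relationship Id="rId2" Type="{IMAGE_TYPE}" Target="media/image1.jpeg"/>'
    "</Relationships>"
)


def add_image_relationship(rels_xml: str, target: str) -> tuple[str, str]:
    """Append an image relation with an unused ID; return (new XML, new ID)."""
    root = ET.fromstring(rels_xml)

    # Collect the numbers of existing IDs of the form rId<number>.
    matches = (re.fullmatch(r"rId(\d+)", rel.get("Id", "")) for rel in root)
    used_numbers = [int(m.group(1)) for m in matches if m]

    # Assumption for this sketch: take the next number after the highest one.
    new_id = f"rId{max(used_numbers, default=0) + 1}"

    ET.SubElement(
        root,
        f"{{{REL_NS}}}Relationship",
        {"Id": new_id, "Type": IMAGE_TYPE, "Target": target},
    )
    return ET.tostring(root, encoding="unicode"), new_id


if __name__ == "__main__":
    updated_xml, image_id = add_image_relationship(SAMPLE_RELS, "media/image2.png")
    print(image_id)
    print(updated_xml)
```

The returned ID is the value that an a:blip r:embed attribute in document.xml would need in order to display the image.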

Delete existing part

You can see the ID of the part in document.xml. Once you have the ID of the part, call the GetPartById method of mainDoc. This will return the part that you want to delete. Then, call the DeletePart method. This will delete the part and also update the relation file (document.xml.rels).
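For comparison, here is roughly what deleting a part means at the package level, sketched with Python's zipfile instead of the SDK. A zip entry cannot be removed in place, so the sketch copies everything except the deleted part into a new archive and drops the matching relation. The sample package is made-up data, the helper names are hypothetical, and the sketch assumes the relation target is a relative path inside the word folder.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
RELS_PATH = "word/_rels/document.xml.rels"

ET.register_namespace("", REL_NS)


def build_sample_package() -> bytes:
    """Build a tiny stand-in package with one style part and one image part."""
    rels = (
        f'<Relationships xmlns="{REL_NS}">'
        f'<Relationship Id="rId1" Type="{R_NS}/styles" Target="styles.xml"/>'
        f'<Relationship Id="rId2" Type="{R_NS}/image" Target="media/image1.png"/>'
        "</Relationships>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<document/>")  # placeholder content
        zf.writestr("word/styles.xml", "<styles/>")  # placeholder content
        zf.writestr("word/media/image1.png", b"not a real image")
        zf.writestr(RELS_PATH, rels)
    return buf.getvalue()


def delete_part(package: bytes, rel_id: str) -> bytes:
    """Return a copy of the package without the part behind rel_id."""
    with zipfile.ZipFile(io.BytesIO(package)) as src:
        root = ET.fromstring(src.read(RELS_PATH))
        rel = next((r for r in root if r.get("Id") == rel_id), None)
        if rel is None:
            raise KeyError(f"no relation with ID {rel_id}")

        # Assumption: the target is relative to the word/ folder.
        part_path = posixpath.normpath(posixpath.join("word", rel.get("Target")))
        root.remove(rel)

        out = io.BytesIO()
        with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
            for name in src.namelist():
                if name == part_path:
                    continue  # skip the deleted part
                # Write the updated relation file, copy everything else as-is.
                data = ET.tostring(root) if name == RELS_PATH else src.read(name)
                dst.writestr(name, data)
    return out.getvalue()


if __name__ == "__main__":
    smaller = delete_part(build_sample_package(), "rId2")
    with zipfile.ZipFile(io.BytesIO(smaller)) as zf:
        print(zf.namelist())
```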

Saving Document

WordprocessingDocument.Close() automatically saves and closes the document. You don't need to save it explicitly.

Final Words

You need to work hard to understand OpenXML. Debugging and some R&D will help you know it better.


About the Author

He is just an ordinary developer, curious about new technologies and fond of detailed analysis of How Stuff Works! He likes debugging, trouble-shooting and making out the application of difficulties. Finding other ways to achieve the result is his passion!
