Tuesday, 6 March 2012

Copying the textual data from an OpenXML-based docx file was fairly straightforward: you just have to copy all the child nodes of the //body node from the source to the destination.

Conceptually, this goes along the lines of:

1.) Load the source document's MainDocumentPart.Document into an XmlDocument.
2.) Load the destination document's MainDocumentPart.Document into another XmlDocument.
3.) Locate the //body node in each XmlDocument.
4.) Copy every child node of the source //body node to the destination document's //body node. A node that belongs to one XmlDocument cannot be added directly to another, so call ImportNode first and then add the imported node.
5.) Once all child nodes are copied, save the XmlDocument back to the MainDocumentPart.Document.
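
The five steps above can be sketched outside .NET as well. The following is a minimal illustration in Python, using the standard library's ElementTree in place of XmlDocument. The function name append_body is made up for this example, and it works on the raw XML text of the main document part rather than on MainDocumentPart. ElementTree has no ImportNode, so a deep copy of each child plays that role here.

```python
import copy
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by the w:document and w:body elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def append_body(source_xml: str, dest_xml: str) -> str:
    """Copy every child of the source w:body to the end of the destination w:body.

    Hypothetical helper: both arguments are the XML text of a main document part.
    """
    source_root = ET.fromstring(source_xml)   # step 1
    dest_root = ET.fromstring(dest_xml)       # step 2

    # Step 3: locate the body node in each document.
    source_body = source_root.find(f"{{{W_NS}}}body")
    dest_body = dest_root.find(f"{{{W_NS}}}body")
    if source_body is None or dest_body is None:
        raise ValueError("both documents need a w:body element")

    # Step 4: copy each child; deepcopy stands in for ImportNode.
    for child in list(source_body):
        dest_body.append(copy.deepcopy(child))

    # Step 5: serialize the merged document so it can be written back.
    return ET.tostring(dest_root, encoding="unicode")
```

The returned string is what would be saved back into the destination's main document part.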

All seems fine, except that images and hyperlinks wouldn't appear in the destination document when it is opened. The following additional steps are needed to get this working:

1.) For each image part in the source, add a new image part to the destination. This makes sure the final document has the following entries (rename the docx to zip and check):
a.) The Word\Media folder has the images as separate files.
b.) The Word\relations XML has an entry for each image, with its target pointing to the appropriate file in Word\Media.
c.) The content types XML in the root has an entry for this specific file type, say jpg.

Once these three entries appear correctly, your image should show up in the final document. Note that once AddPart<ImagePart>() is done, you have to Close() the Package explicitly. This makes sure the relation and content type entries above get saved into the target document. This is a critical step: just saving the MainDocumentPart.Document to a FileStream is not enough.

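2.) Similar to images, for each hyperlink in the source, add a new hyperlink relation to the destination. This creates the corresponding relationship entry in the destination. An external hyperlink targets a URL rather than a file inside the package, so unlike an image it needs no Word\Media file or content type entry.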
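
The "rename the docx to zip and check" step can be automated. The sketch below is in Python with only the standard library, not the OpenXML SDK, and the function name check_relationships is made up for this example. It assumes the usual package layout, where the relations XML lives at word/_rels/document.xml.rels and the content types XML at [Content_Types].xml. It reports image relations whose media file or content type entry is missing, and lists the hyperlink relations it finds.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
OFFICE_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def check_relationships(docx_file):
    """Return (problems, hyperlinks) for a .docx path or binary file object.

    problems   - one message per image relation with a missing file or content type
    hyperlinks - {relation id: target URL} for every hyperlink relation
    """
    problems, hyperlinks = [], {}
    with zipfile.ZipFile(docx_file) as package:
        names = set(package.namelist())

        # Content types declared per extension (Default) or per part (Override).
        types_root = ET.fromstring(package.read("[Content_Types].xml"))
        default_exts = {d.get("Extension", "").lower()
                        for d in types_root.findall(f"{{{CT_NS}}}Default")}
        overrides = {o.get("PartName")
                     for o in types_root.findall(f"{{{CT_NS}}}Override")}

        rels_root = ET.fromstring(package.read("word/_rels/document.xml.rels"))
        for rel in rels_root.findall(f"{{{PKG_REL_NS}}}Relationship"):
            rel_id, target = rel.get("Id"), rel.get("Target")
            if rel.get("Type") == OFFICE_REL + "/hyperlink":
                hyperlinks[rel_id] = target
                continue
            if rel.get("Type") != OFFICE_REL + "/image":
                continue
            if rel.get("TargetMode") == "External":
                continue  # linked image, nothing stored inside the package

            # Targets are relative to the word/ folder unless they start with "/".
            if target.startswith("/"):
                part = target.lstrip("/")
            else:
                part = posixpath.normpath(posixpath.join("word", target))
            if part not in names:
                problems.append(f"{rel_id}: {part} missing from the package")
            ext = part.rsplit(".", 1)[-1].lower()
            if ext not in default_exts and "/" + part not in overrides:
                problems.append(f"{rel_id}: no content type entry for .{ext}")
    return problems, hyperlinks
```

An empty problems list for the merged document means entries a.), b.) and c.) are all in place for every embedded image.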

Points of interest

If you are copying from multiple source documents, make sure that while copying the child nodes, the relation-id of the content concerned (say an image blip or a hyperlink) is temporarily updated to a unique id, so that it does not conflict with the same relation-id in a different source file.

Additionally, when adding the related image part or hyperlink, make sure its id is the same as the new id that you created. If everything goes well, when you save the Package, the OpenXML SDK would rename all relation-ids to be sequential and also update all references in the document content with the sequential ids it generated.
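
The temporary renaming can be sketched as follows, again in Python with ElementTree rather than the SDK. The function name make_ids_unique and the "source tag plus old id" naming scheme are assumptions for this example; any scheme works as long as the ids are unique across source files. The returned mapping gives the new id to use when adding the matching image part or hyperlink relation to the destination.

```python
import xml.etree.ElementTree as ET

# Namespace of the r:embed, r:link and r:id attributes that carry relation-ids.
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_ATTRS = [f"{{{R_NS}}}{name}" for name in ("embed", "link", "id")]


def make_ids_unique(body, source_tag):
    """Rewrite relation-id attributes under `body` in place.

    `source_tag` is a hypothetical per-source prefix such as "src1".
    Returns {old_id: new_id} so parts and relations can be added under the new ids.
    """
    mapping = {}
    for element in body.iter():
        for attr in REL_ATTRS:
            old_id = element.get(attr)
            if old_id is None:
                continue
            # The same old id always maps to the same new id within one source.
            new_id = mapping.setdefault(old_id, f"{source_tag}_{old_id}")
            element.set(attr, new_id)
    return mapping
```

Calling this once per source document, with a different source_tag each time, keeps an rId1 from the first file apart from an rId1 in the second.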