Word XML’s Context Free Chunks: Building a document from multiple pieces of content

This is a somewhat obscure feature that I like to point out every now and then. It’s great if you are interested in building up a Word document from multiple pieces of content, which is a common scenario I’ve talked with a number of folks about over the years. In the design of WordprocessingML it was clear that we should make it easier to bring document fragments with rich formatting into an existing Word XML document without having to do a bunch of extra cleanup work around style name conflicts, etc. This is why we created the context free chunk element: it lets people insert a block of content where all the style and list definitions are defined locally for that chunk rather than for the entire document.

Example Scenario:

The simplified version of the scenario is that you want the ability to dynamically generate a document by bringing in content from other sources. The contents of the document could be based on a number of outside factors, such as who is authoring it, what they are writing about, what the conditions of the market are, etc. For example, imagine you work for an investment bank and you want a solution that automatically generates a report template based on the company, industry, and analyst that is going to write the report. In order to do this, you need to bring in content from all over the place and use it to create the document. There may be disclosure clauses that you want to insert, a chart that shows historical financial figures, a boilerplate description of the company, etc.

Basics

This is just one example of something I’ve talked to a number of people about supporting. Rather than dig into the scenarios more though, I want to talk about one bit of functionality we had in the WordprocessingML schema for Office 2003 that was designed to address this case. I was planning to talk about this a bit later as I wanted to talk about more intro-level stuff first, but thanks to a great post last week by John Durant, I figured I would briefly describe it now.

One of the difficult problems with almost any document format is deciding how to move new content in. This sounds like it should be easy, but there are a number of issues to deal with. In the WordprocessingML schema, all styles and list definitions are declared at the beginning of the file. Then in the content below, the various objects (tables, paragraphs, text runs) can reference those styles. This is a very common model (similar to CSS in HTML), which means there isn’t a ton of repeated data. The problem comes with adding new content somewhere in the body though. When you add new content, if you can’t define everything local to that content then you need to parse the style definitions of the source document, make sure that all the styles referenced in your new content are already declared in the target, and guarantee there are no collisions.
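To make that check concrete, here is a minimal Python sketch of what a merge tool would have to do before moving content across. It assumes both files are full Word 2003 XML documents with their styles declared under w:styles at the root, and the helper names (fingerprint, style_table, compare_styles) are made up for this illustration. It only reports problems; it doesn't fix them.

```python
import xml.etree.ElementTree as ET

# Namespace of the Office 2003 WordprocessingML schema.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}


def fingerprint(elem):
    """Whitespace-insensitive summary of an element, used to compare definitions."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(fingerprint(child) for child in elem),
    )


def style_table(root):
    """Map each styleId declared under w:styles to a fingerprint of its definition."""
    return {
        style.get(f"{{{W}}}styleId"): fingerprint(style)
        for style in root.findall("w:styles/w:style", NS)
    }


def compare_styles(target_root, source_root):
    """Return (missing, conflicting) style ids for merging source into target."""
    target, source = style_table(target_root), style_table(source_root)
    # Styles the source uses that the target has never heard of.
    missing = sorted(set(source) - set(target))
    # Styles that share an id but are defined differently in the two files.
    conflicting = sorted(
        sid for sid in set(source) & set(target) if source[sid] != target[sid]
    )
    return missing, conflicting
```

Anything in the first list has to be copied into the target’s style section, and anything in the second list has to be renamed or reconciled by hand. That is exactly the bookkeeping the rest of this post is about avoiding.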

Target Document:

Let’s say you had a file that looked like this:

Introduction

First item in the list

Second item in the list

The XML for that might look something like this (I’m just going to use shorthand so this isn’t really a valid Word XML file):

Getting the result we want:

So, if our goal is to add the content from the source into the content of the target to create a new document, we need to worry about a couple of things. The first is that our source document uses the “h2” style, but that isn’t defined in our target document. This means we’ll need to update the style information for the target. The second problem is that our source document uses a list with id “1” that has “a. b. c.” styled numbers. In the target document, there is also a list with id “1”, but its number style is different. If we don’t fix this problem up, then the two list items in the source would belong to the same list that’s already in the target, and you would end up with this:

Introduction

1. First item in the list
2. Second item in the list

Disclaimer

It is important to understand the following issues:

3. Something confusing
4. Something else confusing

Obviously that isn’t what we want. To correct this, we would need to modify the source document so that the list uses a different id, and then add the list definition to the top of the target document.
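Here is a rough Python sketch of that manual fix-up for the list ids. It assumes the real Word 2003 markup rather than the shorthand above: list definitions as w:listDef elements with a w:listDefId attribute, list instances as w:list elements with a w:ilfo attribute and a w:ilst child, and paragraphs pointing at a list through w:listPr/w:ilfo. The function name renumber_lists is made up for this example, so treat it as an illustration rather than a complete merge tool.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}


def q(name):
    """Qualified name in the WordprocessingML namespace."""
    return f"{{{W}}}{name}"


def renumber_lists(source_root, target_root):
    """Shift every list id in source_root past the highest id used in target_root."""
    ilfo_offset = max(
        (int(lst.get(q("ilfo"))) for lst in target_root.iter(q("list"))), default=0
    )
    def_offset = 1 + max(
        (int(d.get(q("listDefId"))) for d in target_root.iter(q("listDef"))),
        default=-1,
    )

    # List definitions (the "a. b. c." versus "1. 2. 3." formatting lives here).
    for list_def in source_root.iter(q("listDef")):
        list_def.set(q("listDefId"), str(int(list_def.get(q("listDefId"))) + def_offset))

    # List instances, plus their pointer back to the definition.
    for lst in source_root.iter(q("list")):
        lst.set(q("ilfo"), str(int(lst.get(q("ilfo"))) + ilfo_offset))
        for ilst in lst.findall("w:ilst", NS):
            ilst.set(q("val"), str(int(ilst.get(q("val"))) + def_offset))

    # Paragraphs that reference a list instance.
    for ref in source_root.findall(".//w:listPr/w:ilfo", NS):
        ref.set(q("val"), str(int(ref.get(q("val"))) + ilfo_offset))
```

After this runs, the source’s w:lists children can be appended to the target’s w:lists without the two lists being merged into one. You would still have to copy the definitions across yourself.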

Easier way:

Of course there is an easier way to do all of this. Building up a document from multiple parts was an important scenario for us. Because of that, we created an element in our schema called a cfChunk. The cfChunk allows you to create a temporary “mini-document”. You can place a cfChunk within an existing WordprocessingML file, and then within that cfChunk you can make new style and list definitions that apply locally to that chunk. When Word opens the file, we’ll then merge that content with the rest of the file, and take care of any conflicts. This is similar to what happens when you copy content from one document and paste it into another. If the style names match, then we’ll inherit the definitions from the target document. If the style from the source doesn’t yet exist, we’ll create it. The cfChunk is one of those pieces of functionality that’s rarely talked about but is extremely useful. I think the main reason it isn’t talked about is that for someone to see the benefits of it, they need to already understand the basics of WordprocessingML.
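As a rough illustration, here is a Python sketch that turns a whole source document into a cfChunk. The function name make_cf_chunk is made up for this example. It assumes the chunk carries its local w:lists and w:styles first and the block content after them, mirroring the order those sections have in a full document, and it drops the source’s section properties. Check the schema before relying on that ordering.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}
ET.register_namespace("w", W)  # keep the familiar w: prefix when serializing


def make_cf_chunk(source_root):
    """Wrap a source document's definitions and body content in a w:cfChunk."""
    chunk = ET.Element(f"{{{W}}}cfChunk")

    # Local definitions: they apply to this chunk only (assumed order: lists, styles).
    for name in ("lists", "styles"):
        definitions = source_root.find(f"w:{name}", NS)
        if definitions is not None:
            chunk.append(copy.deepcopy(definitions))

    # Block-level content, minus the source's own section properties.
    body = source_root.find("w:body", NS)
    for block in body:
        if block.tag != f"{{{W}}}sectPr":
            chunk.append(copy.deepcopy(block))
    return chunk
```

The returned element can be serialized with ET.tostring and dropped into the body of the target document, leaving the id clashes for Word to sort out when it opens the file.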

Go ahead and try it out for yourself. You can imagine taking a template with a bunch of placeholder XML tags and posting it up on the server. Then your solution could just grab all the pieces of content you need, wrap them in a cfChunk tag, and swap them out with the placeholder XML tags in your template. I’m really pushing to extend this functionality for the new schemas in Word 12, so let me know if you find it useful or if there is some other kind of behavior you’d like to see added.
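The placeholder swap could look something like the following Python sketch. The placeholder element (a slot element with a name attribute in the namespace urn:example:placeholder) and the function name fill_placeholders are invented for this example; your template would use whatever custom tags you defined. The chunks are assumed to be cfChunk elements you have already built.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
PH = "urn:example:placeholder"  # hypothetical namespace for the template's placeholders


def fill_placeholders(template_root, chunks):
    """Replace each <slot name="..."> placeholder with the chunk registered for it.

    chunks maps a placeholder name to a w:cfChunk element.
    Returns the names that were filled, in document order.
    """
    slot_tag = f"{{{PH}}}slot"
    filled = []
    # ElementTree has no parent pointers, so walk every element as a possible parent.
    for parent in list(template_root.iter()):
        for index, child in enumerate(list(parent)):
            name = child.get("name")
            if child.tag == slot_tag and name in chunks:
                chunk = copy.deepcopy(chunks[name])  # the same chunk may fill many slots
                chunk.tail = child.tail
                parent[index] = chunk
                filled.append(name)
    return filled
```

Placeholders with no matching chunk are left in place, so you can run the function again once the missing content is available.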

This is great stuff and exactly what we have been looking for. Please continue to push this in the new schemas for Word 12. I would be very interested in you expanding on the extended functionality you allude to.

I would like to see you write about or point us to some guidelines for deleting an arbitrary chunk/block of text from a WordML document (not using Word) such that the document will stay well-formed and valid. In some sense this is the opposite of <cfChunk>.

An issue related to deletion is whether Word will clean up un-referenced styles, fonts, etc. when it opens a document that does not refer to them any longer (because a chunk/block was deleted from the document).

One problem I ran into in the past was copying content between Word documents created with different localized versions of Office. "Heading 1" in the English version was "Hoofding 1" in the Dutch version. I remember Word didn’t adjust the names in the merged document, so there was now a style named "Hoofding 1" in an English document.

Hey Brad, are you curious about how to delete any random selection of text? Or something that is more structured.

In answer to your question about whether we clean up styles, the answer is no. When we open the file even if the style isn’t in use we keep it around because it may actually be part of the template and used later. People often create a simple template that has a bunch of predefined styles that aren’t currently in use, but they are kept around so that people using the template can take advantage of them. You’ll need to delete the style if you don’t want to use it.

A tool that might be cool for someone to build would be one that cleans up all the styles by deleting any style or list definition that isn’t in use. I’ve seen similar tools that use Word’s object model, but it would also be cool to do it using WordML.
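A first cut of such a tool could be sketched in Python like this. It assumes styles are referenced through w:pStyle, w:rStyle and w:tblStyle in the body, that styles depend on each other through w:basedOn, w:next and w:link, and that default styles carry w:default="on". The function name prune_unused_styles is made up for this example. It only handles styles, not list definitions, and it ignores references that list levels make to paragraph styles.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}


def q(name):
    """Qualified name in the WordprocessingML namespace."""
    return f"{{{W}}}{name}"


def prune_unused_styles(root):
    """Delete style definitions nothing refers to; return the removed style ids."""
    styles = root.find("w:styles", NS)
    if styles is None:
        return []
    by_id = {s.get(q("styleId")): s for s in styles.findall("w:style", NS)}

    # Always keep the defaults, plus anything the body refers to directly.
    keep = {sid for sid, s in by_id.items() if s.get(q("default")) == "on"}
    body = root.find("w:body", NS)
    if body is not None:
        for tag in ("pStyle", "rStyle", "tblStyle"):
            keep.update(ref.get(q("val")) for ref in body.iter(q(tag)))

    # Follow basedOn/next/link so that a kept style never loses its dependencies.
    pending = list(keep)
    while pending:
        style = by_id.get(pending.pop())
        if style is None:
            continue
        for tag in ("basedOn", "next", "link"):
            for ref in style.findall(f"w:{tag}", NS):
                dependency = ref.get(q("val"))
                if dependency not in keep:
                    keep.add(dependency)
                    pending.append(dependency)

    removed = [sid for sid in by_id if sid not in keep]
    for sid in removed:
        styles.remove(by_id[sid])
    return removed
```

Given the point above about templates shipping styles for later use, this is something to run only on final output, not on the template itself.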

Ignace, I’ve seen that issue before. It’s pretty tough given that many style names are user defined, so we couldn’t really try to do a translation. For the styles that are predefined though I thought there was something smarter that we did, but maybe that isn’t the case. I’ll check it out.

"…are you curious about how to delete any random selection of text? Or something that is more structured."

My answer:

No more random than what you can select within Word. Doing this using the OM is simple; what I want is some guidelines for doing it directly on WordML, either through the XML DOM or some other way.

Of course there are the obvious things to consider, being sure the resulting document is left schema valid, etc. However, I suspect that internally Word is following some algorithm for modifying its wordML presentation when text is deleted using the OM. We would like to match this working on wordML directly in our own code – running on a server.

When we create documents on the server we not only need to insert arbitrary chunks of text (cfChunk) into an existing document, but we also need to delete arbitrary chunks of text. The areas of text to be conditionally deleted are designated by special markers in <t> elements. The markers may be in the same <t>, <t>s in different runs, or <t>s in different paragraphs. I want to process this wordML directly to carry out these potential deletions but make sure I end up with a schema valid wordML document when I am done.

Brad – your question on deleting text is actually fairly easy to do with XSLT. Since Word is a series of <w:p> tags you only need to remove the <w:t> content and/or <w:p> tags between your marks. If you want to maintain the page layout just remove the <w:t> content and leave the <w:p> tags in place. Of course this is the simple case; next you will ask about images, tables, lists… These will get more complicated. How to solve this one depends on what specifically you are trying to accomplish.
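The same idea, sketched in Python rather than XSLT. The marker strings [[DEL]] and [[/DEL]] and the function name strip_marked_text are invented for this example, since the actual markers were never shown. It covers only the simple case: text in <w:t> elements inside <w:p> elements, with no attempt to handle images, tables or lists.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"


def q(name):
    """Qualified name in the WordprocessingML namespace."""
    return f"{{{W}}}{name}"


def strip_marked_text(root, start="[[DEL]]", end="[[/DEL]]", drop_emptied=True):
    """Remove the text between start and end markers, markers included.

    The markers may sit in the same w:t, in different runs, or in different
    paragraphs. Paragraphs emptied by the deletion are removed if drop_emptied.
    """
    deleting = False
    emptied = set()
    for paragraph in root.iter(q("p")):
        texts = list(paragraph.iter(q("t")))
        changed = False
        for t in texts:
            remaining, kept = t.text or "", []
            while remaining:
                if deleting:
                    _, marker, remaining = remaining.partition(end)
                    deleting = not marker  # stop deleting once the end marker is seen
                else:
                    head, marker, remaining = remaining.partition(start)
                    kept.append(head)
                    deleting = bool(marker)
            new_text = "".join(kept)
            changed = changed or new_text != (t.text or "")
            t.text = new_text
        if changed and not any(t.text for t in texts):
            emptied.add(paragraph)

    if drop_emptied:
        for parent in list(root.iter()):
            for child in list(parent):
                if child in emptied:
                    parent.remove(child)
```

Pass drop_emptied=False to keep the emptied <w:p> tags in place, which is the option described above for maintaining the page layout.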

Context Free Chunks <w:cfChunk> are very useful for constructing a document from pre-existing documentettes, or little documents. It reminds me of the old XML hub-and-spoke document strategy where you have a top level document that is mainly composed of a series of file entity references. This strategy was, and in some cases still is, in use for some home brew and more simplistic XML Content Management Solutions where the content was, oddly enough, chunked into high level document fragments. It also is very much like doc/sub-doc in Word. While this has its place and is really an excellent feature of WordML, it is after all only a one-way process. I’m sure that the reverse process is possible by taking a range and copying it out. Is there a ‘save range to file’ function in the API or do you have to do the dance of copy, paste into a new document, save and close? Just curious.

I do have an application for the one-way construction and was wondering if there is a limit to the number of chunks that can be inserted. I’ve been testing some scenarios and I find that I can nest cfChunks inside of a cfChunk; is there a limit on the depth of these as well? Another interesting observation is that when I chunk in outline numbered lists they are all separate instances, but when I chunk in bulleted lists they are all the same list instance. Maybe it’s numbered versus non-numbered that creates this behavior. It would be useful to be able to specify whether the lists should be a single list instance or separate ones. More scenario testing required…

This brings me to a point about documentation on how to employ the various features in WordML. While the schema documentation is quite accurate it does not explain what the values and ranges of various attributes are or how to use them. I find that I need to reverse engineer most things to get the information I’m looking for. Sometimes this is actually fun and other times it’s frustrating. There is a wealth of information in the various MSDN articles and blogs but they’re not a substitute for an interface spec.

Hey David, one way to grab the XML to use in the cfChunks is to use the .xml property that you get off the range object. Just grab the range you want to reuse and then you can get the XML of that range.

A common way I’ve seen this done is if you have a document that has some structure applied with the custom XML support, you can quickly access any of those XML nodes using an XPath. SelectSingleNode will return a range object, and you can then just grab the XML of that range object and store it separately for re-use later.

I agree with you that we need to provide better documentation this time around (with Office 12). There is actually a ton of great content for Office 2003 XML, but it’s fairly spread out.

This works beautifully! Saves me eons of time in making sure that the inference tool actually kept the styles I needed (which it nearly never did). We need a WordML book — if I only had time to write one.

This is excellent information, and exactly what I was looking for (I am creating one Word document from other Word documents). My only question is how to use this in conjunction with embedded objects?

It seems Word saves embedded objects in one XML element (<w:docOleData>) under the root <w:wordDocument> element. How can this element be created when there are multiple embedded objects, and how can they correctly be referenced later in the document in the appropriate <o:OLEObject> elements?

It’s a shame I can’t use the new "12" format where this would not be a problem!

I am also looking to merge several Word documents with embedded images/objects. I noticed the rID{#} tags and also the rsidR tag, which has a number that appears to be unique per document. It seems the document.xml.rels file should also contain the rsidR as an element so I could place "duplicate" rIDs inside this file and move my embedded images to the media directory. I am contemplating the best way to merge a minimum of 7 documents being worked on by 7 different people.

I would appreciate your feedback on whether we are approaching our situation in a recommended fashion. We have a 100+ document that we need to dynamically populate with data. We have broken the document into logical chapters. Each chapter has its own custom schema. We perform multiple xml/XSLT transforms then save to chapter##.xml. We would now like to merge all the chapters into one master document. Would cfchunk be the recommended approach?

I played around with the cfChunk tag, but yes, I am still looking for a "Word Merge" or "Word Append" solution. So far I think it might have to be custom, but still bugging the product team and a few others for support.

So far I have:

Users use a standard template.

Users use the standard styles in the template.

Users place "placeholders" for images that will be inserted by "me" later.

It looks like I will need to:

Open the document.xml.rels files of the source document.

Find the "max" Relationship Id="rId?" tag in the document.xml.rels file.

Open the document to append and build a mapping of its rIds to the max rId + 1 from the source.

Replace all the rIds in the document.xml and document.xml.rels.

Determine any Targets in the document.xml.rels not in the source document.

Copy them to the source document.
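The rId bookkeeping in the steps above could be sketched in Python roughly as follows. It assumes the relationship namespaces of the final Office Open XML format, which may differ in pre-release builds, and it assumes every reference from document.xml is an attribute in the officeDocument relationships namespace (r:id, r:embed and so on). The function names max_rid and remap_rids are made up for this example. Copying the target files between the two packages is left out.

```python
import re
import xml.etree.ElementTree as ET

# Namespaces as published in the final Open XML format (an assumption here).
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def max_rid(rels_root):
    """Highest N among Relationship ids of the form rIdN (0 if there are none)."""
    numbers = []
    for rel in rels_root.iter(f"{{{PKG_REL}}}Relationship"):
        match = re.fullmatch(r"rId(\d+)", rel.get("Id", ""))
        if match:
            numbers.append(int(match.group(1)))
    return max(numbers, default=0)


def remap_rids(base_rels, extra_rels, extra_document):
    """Renumber the appended document's rIds so they start after the base's max.

    Updates extra_rels and extra_document in place and returns {old id: new id}.
    """
    next_number = max_rid(base_rels) + 1
    mapping = {}
    for rel in extra_rels.iter(f"{{{PKG_REL}}}Relationship"):
        old_id = rel.get("Id")
        mapping[old_id] = f"rId{next_number}"
        rel.set("Id", mapping[old_id])
        next_number += 1

    # Rewrite every relationship-namespace attribute that points at an old id.
    prefix = f"{{{DOC_REL}}}"
    for elem in extra_document.iter():
        for name, value in list(elem.attrib.items()):
            if name.startswith(prefix) and value in mapping:
                elem.set(name, mapping[value])
    return mapping
```

With the ids no longer clashing, the renumbered Relationship elements can be appended to the base document’s document.xml.rels, and any targets they name that are missing from the base package can be copied over.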

This does not handle the more complex items like OLE objects and such. I have a limited scope and think I can get away with this approach (though they might “throw” in a PowerPoint slide). The cfChunk tag does not seem to handle the external objects since their references are in the document.xml.rels file.

The documents I am working with also “contain” content types, and I am still looking at what (if anything) I will need to do to them.

We’ve been using cfChunk pretty successfully (building WordML through XSLT). One problem we’ve hit is embedding page orientation changes (we have a landscape page we’d like to include); Word seems to ignore them! Is there any way the contents of a cfChunk can change the orientation of a page? Currently it’s looking like we’ll need to change the orientation in the ‘master’ document (outside the cfChunk).

To solve my Word Append problem, I actually resorted to the Microsoft.Office.Interop.Word DLL (not to derail the blog). This does not really take advantage of the XML and cfChunk tag, but it worked for appending documents together and keeps the formatting along with embedded images and such. I will be testing to see how robust it is, but I put it in a loop of 1000 iterations running on a VPC and it ran successfully. The document was fairly complex and I opened and closed Word each time to see how well it handled the volume.

private void button1_Click(object sender, EventArgs e)
{
    Microsoft.Office.Interop.Word.ApplicationClass oWord = null;
    Microsoft.Office.Interop.Word.Document oDocument = null;
    Microsoft.Office.Interop.Word.Document oDocumentDest = null;
    object saveChanges = false;
    object missing = System.Reflection.Missing.Value;
    object fileName = @"C:\Source.docx";
    object isVisible = false; // change to true to see Word working
    object readOnly = false;
    object destFilename = @"C:\Destination.docx";

    try
    {
        // Copy the document to preserve the styles/headers/etc. set in it (this uses the source file as a template)