Search and Replace Text in an Open XML WordprocessingML Document

A reoccurring question around Open XML is how to search and replace text in a word-processing document. There have been several attempts at presenting example code to do this, however, until now I have not seen any examples that correctly implement this. This post presents some example code that implements a correct algorithm to search and replace text.

The first challenge is handle the case when the text you are searching for spans runs with different formatting. A simple example will demonstrate the problem. You want to replace ‘Hello World’ with ‘Hi World’. If, in the document, the word ‘World’ is bolded, then the markup will look something like this:

Even though the search text spans runs, the algorithm should find the text and replace it. The next challenge is to define exactly the semantics of searching and replacing text if the text that you are searching for spans runs with different formatting. In short, the replaced text takes on the run formatting of the run that contains the first character of the search string. An example makes this clear. In the following sentence, the first four characters of the word ‘include’ are bolded:

On the Insert tab, the galleries include items.

If you replace ‘include’ with ‘do not include’, then the sentence should be formatted like this:

On the Insert tab, the galleries do not include items.

The replaced text takes on the formatting of the ‘i’ character of include, which was bolded.

Here is a short screen-cast that walks through the algorithm and the code.

Search and Replace Algorithm

It certainly would be possible to carefully define an algorithm to search for text that spans runs, noting where the searched text intersects bookmarks, comments, and the like. However, this algorithm would be pretty complicated, and to be done properly, a test team would need to write extensive test specs, and supply a plethora of sample documents that exercise all edge cases. It is a non-trivial project.

However, there is another approach that we can take that is pretty simple, easy to test, and yields the correct results in all cases. The algorithm consists of:

Concatenate all text in a paragraph into a single string, and search for the search string in the concatenated text. If the search text is found, then continue with the following steps.

Iterate through all runs in the paragraph, and break all runs into runs of a single character. There are a variety of special characters, such as carriage return, hard tab, break, and the non-breaking hyphen character. Normally, these special characters will coexist in runs with text elements. When breaking runs into runs of a single character, these special characters should also be placed into their own run. At the end of this process, no run will contain more than a single character, whether it is a character of text, or one of the special characters that is represented by an XML element.

After breaking runs of text into multiple runs of single characters, it is then pretty easy to iterate through the runs looking for a string of runs that match the characters in the search string.

If the algorithm finds a string of runs that match the search string, then it inserts a new run into the document. This new run contains the run properties of the first run in the string of runs that match the search string. In addition, the algorithm deletes the set of single-character runs that matched the search string. This process is repeated until no strings of runs are found that match the search string.

After the algorithm replaces the single-character runs with a new run containing the replacement text, then the algorithm coalesces adjacent runs with the same formatting into a single run.

Algorithm Walk-through

It will be helpful to walk through an example, and examine the markup at each step in the process. The following paragraph contains the text, “See this markup.” The letters ‘th’ in the word ‘this’ is bolded. We want to change the word ‘this’ to the word ‘the’.

The algorithm can then iterate through the runs, finding the series of runs where the text of the runs matches ‘t’, ‘h’, ‘I’, ‘s’. The algorithm then inserts a new run containing the replace text, taking the run properties from the run that contained the ‘t’ in the search string, which indicates that the run is bolded. It also removes the single character runs that match the search string. The adjusted markup looks like this.

Additional Notes

There are a few additional notes worth mentioning about this algorithm.

This algorithm only works for paragraphs that do not contain tracked revisions. While it is certainly possible to write this functionality for content that contains tracked revisions, it significantly complicates the algorithm. The code as written checks for the existence of tracked revisions (using the code presented in Using XML DOM to Detect Tracked Revisions in Open XML WordprocessingML Documents), and throws an exception if they exist.

If revision tracking is turned on for a document, the correct functionality would be to create the revision tracking markup, which is beyond the scope of this example. If revision tracking is turned on, the example code throws an exception.

While my favorite way to write these types of algorithms is to use LINQ to XML, to make this code more widely applicable, I used System.Xml.XmlDocument, which is an implementation of XML DOM. This makes it easier to translate this code to a variety of other platforms, such as PHP or Java.

The code searches and replaces text in the main document part, all headers, all footers, the endnote part, and the footnote part.

I'm trying to use the code to dynamically populate a template with data values from another source and want to 'mark' the places in the document which need to be replaced by using string patterns like !!replace this!! or **replace this** but am having difficulty.

I'm a bit embarrassed - you found a bug in the code. The code would not find the string to replace if the search string matched the string that was at the very end of the paragraph. Line 115 should read:

Hi again Eric, I have one other questions about the search and replace code. I have found that it does not work if the text to be replaced is inside of a Text Box. I have tried stepping through the code to work out why but have been unsuccessful. Do you know why this would be?

Thanks again Eric. It works for me now. This SearchAndReplace class has been invaluable to a project I'm working on. I'm in the transition from Lotus Notes developer to Sharepoint developer and resources like this website have made things alot easier.

I've been having a play with the code sample, and noticed that it doesn't seem to correctly replace instances of the search string in a document after they're used in a text box. So if, for example, I have the following document:

hello world

[text box] hello world [/text box]

hello world

...and am replacing the string 'hello' with 'hi', the result I get is...

hi world

[text box] hi world [/text box]

hello world

I'm pretty sure this is not the intended behaviour! Otherwise it's a very helpful article, thanks :)

I am not seeing the symptoms you mention, but I am sure that it is due to me not creating the document properly. Would you kindly send me a sample document where text is not being replaced properly? You can either post it here on the forum, or send it as a message on OpenXMLDeveloper.org, or send to my email eric at ericwhite.com.

First of all Thank you very much for providing the very usefull knowledge and code for office automation using OpenXML.

I am Sanjeev, just started using OpenXml for editing the office document. I found your very usefull article and code to search and replace the text in word files. It is very good and very usefull for me to complete the tasks. Thank you for that.

I am using Search and replace functionality by placing some place holders in the documents. I succeeded to replace the place holders, but only thing I am not able do is to replace the place holders with the text having multiple lines.

I tried by adding "W:br" node in the XmlElement, but still it is not giving the text on next line.

Sanjeev, I encountered a similar problem trying to insert a <w:br /> in between lines of text. My approach was not as elegant as yours since I was just adding the "<w:br />" string at the end of the text as I built the multiple lines. Through my debugging, I found that the SearchAndReplace was converting the "<" and ">" to "<" and ">" and I couldn't figure out how to prevent it. What I ended up doing was using a Regex.Replace to put back the "<" and ">" characters. Probably not the best way of doing things, but I had to get it working. Perhaps Eric or someone else here can provide a better solution.