Retrieving the Text of the Paragraphs – VB

Our next goal is to retrieve the text of the paragraphs in the document. Text is stored in the “t” nodes that are contained in “r” nodes that are children of the paragraph node. Text may be broken up into multiple “t” nodes, so we have to concatenate all of the text in the “t” nodes.

Even though we could modify our query to include the code to extract the text of each paragraph, for demonstration purposes, we’re going to approach the problem in a different way. We’re going to write a new query that uses our first query as its source. Due to lazy evaluation, this is basically as efficient as if we were to simply modify the first query. The approach creates more short-lived objects on the heap, but if it makes our code clearer, that is a good tradeoff.
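As a rough sketch of what such a layered query can look like: the sample document, the ParagraphNode member of the first query's projection, and this particular StringConcatenate implementation are all assumptions made for illustration, not the listings from the earlier topics.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Runtime.CompilerServices
Imports System.Text
Imports System.Xml.Linq

Module LocalExtensions
    ' Assumed helper: appends the projected string of every item in the sequence.
    <Extension()>
    Public Function StringConcatenate(Of T)(source As IEnumerable(Of T),
                                            projection As Func(Of T, String)) As String
        Dim sb As New StringBuilder()
        For Each item In source
            sb.Append(projection(item))
        Next
        Return sb.ToString()
    End Function
End Module

Module Program
    Sub Main()
        Dim w As XNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

        ' Hypothetical stand-in for the main document part of a real package.
        Dim doc = XDocument.Parse(
            "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" &
            "<w:body><w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p></w:body>" &
            "</w:document>")

        ' First query: one item per paragraph. Nothing is evaluated yet.
        Dim paragraphs = doc.Root.Element(w + "body").Elements(w + "p").
            Select(Function(p) New With {.ParagraphNode = p})

        ' Second query: uses the first query as its source and adds the text,
        ' concatenating every "t" node found under the paragraph's "r" children.
        Dim paraWithText = paragraphs.Select(Function(para) New With {
            .ParagraphNode = para.ParagraphNode,
            .Text = para.ParagraphNode.Elements(w + "r").Elements(w + "t").
                StringConcatenate(Function(e) e.Value)})

        ' Both queries run here, in a single pass over the document.
        For Each para In paraWithText
            Console.WriteLine(para.Text)   ' prints: Hello, world
        Next
    End Sub
End Module
```

Neither query does any work until the For Each loop starts pulling items, which is why stacking one query on top of the other costs little more than writing a single, larger query.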

The above code uses the StringConcatenate aggregate operator that we showed in the aggregation topic.

One of the features of Open XML is that a user can turn on the “Track Changes” feature, and the document will then record every change to the text. The above code works only if there are no tracked changes, because inserted text is stored in runs nested inside a w:ins element rather than directly under the paragraph. However, it is easy to modify our code so that we retrieve the correct text for each paragraph regardless of whether there are tracked changes. To do this, we need to find all of the children of the w:p element that have the name w:r or w:ins, and ignore all other elements. We can modify the last of the three above queries, as follows:
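A minimal sketch of that modification, under the same caveats as before: the sample paragraph and the StringConcatenate helper are assumptions made for illustration, and only the text-extraction expression is shown rather than the full projection. The sample includes a w:del element to show that deleted text is skipped.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Runtime.CompilerServices
Imports System.Text
Imports System.Xml.Linq

Module LocalExtensions
    ' Assumed helper: appends the projected string of every item in the sequence.
    <Extension()>
    Public Function StringConcatenate(Of T)(source As IEnumerable(Of T),
                                            projection As Func(Of T, String)) As String
        Dim sb As New StringBuilder()
        For Each item In source
            sb.Append(projection(item))
        Next
        Return sb.ToString()
    End Function
End Module

Module Program
    Sub Main()
        Dim w As XNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

        ' Hypothetical paragraph with one tracked insertion and one tracked deletion.
        Dim para = XElement.Parse(
            "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" &
            "<w:r><w:t>Hello </w:t></w:r>" &
            "<w:ins><w:r><w:t>big </w:t></w:r></w:ins>" &
            "<w:del><w:r><w:delText>small </w:delText></w:r></w:del>" &
            "<w:r><w:t>world</w:t></w:r>" &
            "</w:p>")

        ' Keep only the w:r and w:ins children, then collect the "t" nodes
        ' beneath them. Everything else, including w:del, is ignored.
        Dim text = para.Elements().
            Where(Function(z) z.Name = w + "r" OrElse z.Name = w + "ins").
            Descendants(w + "t").
            StringConcatenate(Function(e) e.Value)

        Console.WriteLine(text)   ' prints: Hello big world
    End Sub
End Module
```

Descendants is used instead of Elements because the "t" nodes sit one level deeper under w:ins than they do under w:r.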

This approach introduces a small issue. In LINQ to XML, all names are atomized; that is, if two XName objects have the same namespace and the same local name, they are the same instance. It takes a little bit of work for the implicit conversion operator in LINQ to XML to atomize a name, and in certain scenarios that work can be a significant percentage of processor time. You can easily minimize this by creating the names once, ahead of time. This post describes atomization in more detail. So if we pre-atomize our names, our query will execute faster, at least in theory. In practice, I can’t say that I’ve ever been in a situation where this made a difference, but when processing huge files, it might. In any case, when I have code like this, I generally pre-atomize my XName objects:
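Here is a small sketch of the idea. The variable names (w_r, w_ins, w_t) and the sample paragraph are hypothetical, and String.Join stands in for the StringConcatenate operator to keep the example short.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Xml.Linq

Module Program
    Sub Main()
        Dim w As XNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

        ' Look each name up once, outside the query, instead of once per element.
        Dim w_r As XName = w + "r"
        Dim w_ins As XName = w + "ins"
        Dim w_t As XName = w + "t"

        ' Hypothetical paragraph containing a tracked insertion.
        Dim para = XElement.Parse(
            "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" &
            "<w:r><w:t>Hello </w:t></w:r>" &
            "<w:ins><w:r><w:t>big </w:t></w:r></w:ins>" &
            "<w:r><w:t>world</w:t></w:r>" &
            "</w:p>")

        ' The lambda now compares against the pre-built names; no name lookup
        ' happens inside the query.
        Dim pieces As IEnumerable(Of String) = para.Elements().
            Where(Function(z) z.Name = w_r OrElse z.Name = w_ins).
            Descendants(w_t).
            Select(Function(e) e.Value)

        Console.WriteLine(String.Join("", pieces))   ' prints: Hello big world

        ' Atomization in action: building the same name again returns the
        ' very same object.
        Console.WriteLine(Object.ReferenceEquals(w_r, w + "r"))   ' prints: True
    End Sub
End Module
```

The query is the same as before except that the names are built before it runs, so the cost of atomizing them is paid three times in total rather than on every element the query visits.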