
In this post I will highlight VTD-XML v2.13’s whitespace handling capability along with some examples in Java.

Quick Review

Native to non-extractive parsing, VTD-XML’s handling of XML tokens and elements frequently revolves around the concept of byte segments. Once an XML document is parsed into VTD tokens, the byte segment enveloping the entire content of any token or element can be visualized as a pair of descriptors (i.e., offset and length) pointing into the original document.
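To make the offset/length idea concrete, here is a minimal sketch of how such a pair can be carried in a single 64-bit value. SegmentDescriptor is a hypothetical helper of my own, not a VTD-XML class, and the layout (offset in the lower 32 bits, length in the upper 32 bits) is an assumption that should be checked against the VTDNav documentation.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only. SegmentDescriptor is a hypothetical helper,
// not part of VTD-XML; the bit layout below is an assumption.
final class SegmentDescriptor {
    // Pack the offset into the lower 32 bits and the length into the upper 32 bits.
    static long pack(int offset, int length) {
        return ((long) length << 32) | (offset & 0xFFFFFFFFL);
    }

    static int offset(long descriptor) {
        return (int) descriptor;
    }

    static int length(long descriptor) {
        return (int) (descriptor >>> 32);
    }

    // Resolve the descriptor against the original document bytes; nothing is copied until here.
    static String text(byte[] doc, long descriptor) {
        return new String(doc, offset(descriptor), length(descriptor), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "<root><item>abc</item></root>".getBytes(StandardCharsets.UTF_8);
        long item = pack(6, 16); // the bytes of the <item> element
        System.out.println(text(doc, item));
    }
}
```

The point of the design is that the descriptor is just a view: the document bytes stay where they are, and only two integers travel around.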

For a large class of XML content extraction and modification operations, non-extractive parsing lets applications skip the tedious, cycle-wasting work of de-serializing and re-serializing the byte content of elements, and thereby helps achieve the maximum possible performance.

Whitespace Handling in 2.12

Version 2.12 of VTD-XML introduces two new methods that either trim or expand the white spaces surrounding a byte segment denoted by a 64-bit integer.

trimWhiteSpaces(long l), of the VTDNav class, accepts a byte segment descriptor, removes both the leading and trailing white spaces, and returns a new descriptor.

expandWhiteSpaces(long l), of the same VTDNav class, takes a segment descriptor and returns a new descriptor that includes all the leading and trailing white spaces around the input segment.

It is worth noting that both methods are greedy: they remove or absorb as many white spaces as they possibly can. As a result, the effect of one call often negates the other.
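The greedy behavior of the two methods can be sketched in plain Java as follows. This is a conceptual model, not VTD-XML's implementation: WhitespaceSegments and its methods are hypothetical names, the descriptor layout (offset in the lower 32 bits, length in the upper 32) is an assumption, and white space is taken to mean space, tab, line feed and carriage return.

```java
import java.nio.charset.StandardCharsets;

// Conceptual sketch, not VTD-XML source code. All names here are hypothetical.
final class WhitespaceSegments {
    static boolean isWs(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    // Assumed layout: offset in the lower 32 bits, length in the upper 32 bits.
    static long pack(int offset, int length) {
        return ((long) length << 32) | (offset & 0xFFFFFFFFL);
    }

    // Greedily drop white space from both ends of the segment.
    static long trim(byte[] doc, long seg) {
        int start = (int) seg;
        int end = start + (int) (seg >>> 32); // exclusive
        while (start < end && isWs(doc[start])) start++;
        while (end > start && isWs(doc[end - 1])) end--;
        return pack(start, end - start);
    }

    // Greedily absorb white space adjacent to both ends of the segment.
    static long expand(byte[] doc, long seg) {
        int start = (int) seg;
        int end = start + (int) (seg >>> 32); // exclusive
        while (start > 0 && isWs(doc[start - 1])) start--;
        while (end < doc.length && isWs(doc[end])) end++;
        return pack(start, end - start);
    }

    public static void main(String[] args) {
        byte[] doc = "<a>\n  <b/>\n</a>".getBytes(StandardCharsets.UTF_8);
        long b = pack(6, 4);           // the bytes of <b/>
        long wide = expand(doc, b);    // now also covers "\n  " before and "\n" after
        System.out.println(trim(doc, wide) == b); // trimming undoes the expansion here
    }
}
```

In this example the element itself has no white space at its edges, so trimming the expanded segment lands exactly on the original one. That round trip does not hold in general, which is why the two calls only often cancel out.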

Additions in 2.13

Three static constants and three more methods are added to the VTDNav class in 2.13.

Those constants are:

VTDNav.WS_LEADING

VTDNav.WS_TRAILING

VTDNav.WS_BOTH

The three new methods are:

trimWhiteSpaces(long l, short actionType) still trims the white spaces around the segment and returns a new segment descriptor. But the trimming operation can now be applied to the leading whitespaces, the trailing ones, or both, depending on the value of actionType.

trimWhiteSpaces(int index, short actionType) brings you the power and convenience of trimming a VTD record without the hassle of manually composing a 64-bit segment descriptor.

expandWhiteSpaces(long l, short actionType) still expands the whitespaces. But the expansion can also be designated to include either the leading whitespaces, the trailing ones, or both.
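How actionType selects a side can be modeled with a short stand-alone sketch. Again, this is not VTD-XML code: DirectionalWhitespace is a hypothetical class, the constant values are placeholders that merely share names with the VTDNav constants, and the descriptor layout (offset low, length high) is assumed.

```java
import java.nio.charset.StandardCharsets;

// Conceptual sketch, not VTD-XML source code. The constant values are
// placeholders; the real VTDNav constants may use different values.
final class DirectionalWhitespace {
    static final short WS_LEADING = 1;
    static final short WS_TRAILING = 2;
    static final short WS_BOTH = 3;

    static boolean isWs(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    static long pack(int offset, int length) {
        return ((long) length << 32) | (offset & 0xFFFFFFFFL);
    }

    // Trim only the side(s) selected by actionType.
    static long trim(byte[] doc, long seg, short actionType) {
        int start = (int) seg;
        int end = start + (int) (seg >>> 32); // exclusive
        if (actionType == WS_LEADING || actionType == WS_BOTH) {
            while (start < end && isWs(doc[start])) start++;
        }
        if (actionType == WS_TRAILING || actionType == WS_BOTH) {
            while (end > start && isWs(doc[end - 1])) end--;
        }
        return pack(start, end - start);
    }

    // Expand only toward the side(s) selected by actionType.
    static long expand(byte[] doc, long seg, short actionType) {
        int start = (int) seg;
        int end = start + (int) (seg >>> 32); // exclusive
        if (actionType == WS_LEADING || actionType == WS_BOTH) {
            while (start > 0 && isWs(doc[start - 1])) start--;
        }
        if (actionType == WS_TRAILING || actionType == WS_BOTH) {
            while (end < doc.length && isWs(doc[end])) end++;
        }
        return pack(start, end - start);
    }

    static String text(byte[] doc, long seg) {
        return new String(doc, (int) seg, (int) (seg >>> 32), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "<p>  hi \n</p>".getBytes(StandardCharsets.UTF_8);
        long content = pack(3, 6); // the text content "  hi \n"
        System.out.println("[" + text(doc, trim(doc, content, WS_LEADING)) + "]");
        System.out.println("[" + text(doc, trim(doc, content, WS_TRAILING)) + "]");
        System.out.println("[" + text(doc, trim(doc, content, WS_BOTH)) + "]");
    }
}
```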

Common Use Cases

Suppose you want to remove some element fragments from the master XML document while the remaining XML text retains its original formatting, or you want to make slight, fine-grained changes to it (e.g., paragraph separation, indentation). You can also extract a segment of XML bytes without losing its surrounding formatting, such as line breaks or tabs.
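The removal case can be illustrated with a small stand-alone sketch. It does not call VTD-XML; FragmentRemoval and its method are hypothetical. With the library, the widened span would presumably come from expanding the element's segment with WS_LEADING before removing it. The idea is that deleting an element together with the white space in front of it leaves no blank, indented line behind.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Conceptual sketch of the use case; FragmentRemoval is a hypothetical helper,
// not a VTD-XML class.
final class FragmentRemoval {
    static boolean isWs(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    // Remove the bytes [offset, offset + length) plus any white space directly before them.
    static byte[] removeWithLeadingWhitespace(byte[] doc, int offset, int length) {
        int start = offset;
        while (start > 0 && isWs(doc[start - 1])) start--; // expand toward the leading side only
        int end = offset + length;
        ByteArrayOutputStream out = new ByteArrayOutputStream(doc.length);
        out.write(doc, 0, start);                // everything before the widened span
        out.write(doc, end, doc.length - end);   // everything after the element
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String xml = "<list>\n  <item>a</item>\n  <item>b</item>\n</list>";
        String victim = "<item>b</item>";
        // The sample is pure ASCII, so character offsets equal byte offsets.
        byte[] result = removeWithLeadingWhitespace(
                xml.getBytes(StandardCharsets.UTF_8), xml.indexOf(victim), victim.length());
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```

Removing only the element's own bytes would leave a line containing two spaces where the second item used to be; widening the span on the leading side keeps the remaining indentation intact.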