PDF Clown 0.1.1 — Text highlighting and lots of good stuff

The next release is going to introduce exciting new features (text highlighting, optional/layered contents, Type1/CFF font support, etc.) along with improvements and consolidation of existing ones (enhanced text extraction, enhanced content rendering, enhanced AcroForm creation and filling, etc.). This post will be kept up to date as development progresses, so please stay tuned! 😉
These are some of the things I have been working on till now:

Bidirectional traversal has been accomplished by the introduction of explicit references to ascendants: composite objects (PdfDictionary, PdfArray, PdfStream) are now aware of their parent container, so walking through the ascending path to the root PdfIndirectObject (and File) is absolutely trivial! This functionality has loads of engaging potential applications, such as fine-grained object cloning based on structure context (as in the case of AcroForm annotations residing on a given page).

Ascendant-aware objects are intelligent enough to automatically detect and notify changes to their parent container, making incremental updates transparent to the user.
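
To make the two ideas above a bit more concrete, here is a minimal, self-contained Java sketch. It is not PDF Clown's code: the class names (Node, Composite, IndirectRoot) and all their methods are hypothetical, and they only illustrate how a parent reference can support both upward traversal and automatic change notification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only, NOT PDF Clown's actual classes.
abstract class Node {
    private Node parent; // explicit reference to the ascendant container

    Node getParent() { return parent; }
    void setParent(Node parent) { this.parent = parent; }

    // Walks the ascending path up to the topmost container.
    Node getRoot() {
        Node current = this;
        while (current.getParent() != null) {
            current = current.getParent();
        }
        return current;
    }

    // Forwards a change notification to the parent container, if any.
    void notifyChanged() {
        if (parent != null) {
            parent.notifyChanged();
        }
    }
}

// Stands in for a composite object such as a dictionary or an array.
class Composite extends Node {
    private final List<Object> items = new ArrayList<Object>();

    void add(Object item) {
        if (item instanceof Node) {
            ((Node) item).setParent(this);
        }
        items.add(item);
        notifyChanged(); // the change bubbles up without user intervention
    }

    int size() { return items.size(); }
}

// Stands in for the root indirect object, which records that it was modified.
class IndirectRoot extends Composite {
    private boolean updated;

    @Override
    void notifyChanged() { updated = true; }

    boolean isUpdated() { return updated; }
    void clearUpdated() { updated = false; }
}

class AscendantDemo {
    public static void main(String[] args) {
        IndirectRoot root = new IndirectRoot();
        Composite page = new Composite();
        Composite annotations = new Composite();
        root.add(page);
        page.add(annotations);
        root.clearUpdated();

        annotations.add("highlight"); // modify a deeply nested object
        System.out.println(annotations.getRoot() == root); // upward traversal
        System.out.println(root.isUpdated());              // change reached the root
    }
}
```

In this sketch both lines print true: the nested object reaches its root through the parent chain, and the root learns about the modification without the caller flagging it explicitly.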

Simple objects have been made immutable to avoid risks of unintended changes and promote their efficient reuse.

As expected (you may have noticed some TODO task comments about this within the project’s code base), object parsing of PostScript-related formats (PDF file, PDF content stream and CMaps) has been organized under the same class hierarchy to improve its consistency and maintainability.

2. Text highlighting

Text highlighting was a much-requested feature. It took me less than one hour of enjoyable coding to write a prototype which could populate a PDF file with highlight annotations matching an arbitrary text pattern, as you can see in the following figure representing a page of Alice in Wonderland resulting from a search for occurrences of “rabbit”:

This text highlighting sample leverages both text extraction [line 55] and annotation [line 106] functionalities of PDF Clown, as you can see in its source code:

This is another example matching words which contain “co” (regular expression “\w*co\w*”):

Here you can appreciate the dehyphenation functionality applied to another search (words beginning with “devel” — regular expression “\bdevel\w*”):
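
Apart from the figures, the two patterns quoted above can be tried outside PDF Clown with plain java.util.regex. The sketch below is only an illustration: the dehyphenate helper is a hypothetical, simplified stand-in (it merely joins words split by a hyphen at a line end) and says nothing about how PDF Clown implements dehyphenation internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternSketch {
    // Hypothetical, simplified dehyphenation: joins "devel-\noped" into "developed".
    // It would also join genuinely hyphenated compounds broken at a line end.
    static String dehyphenate(String text) {
        return text.replaceAll("(\\w)-\\r?\\n(\\w)", "$1$2");
    }

    // Collects every substring of the text matched by the given regular expression.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = dehyphenate("The code was devel-\noped by a company of developers.");
        System.out.println(findAll("\\w*co\\w*", text));   // words containing "co"
        System.out.println(findAll("\\bdevel\\w*", text)); // words beginning with "devel"
    }
}
```

With this sample sentence the first search yields [code, company] and the second [developed, developers]; without the dehyphenation step, "developed" would be split across the line break and the second search would only find its first half.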

3. Metadata streams (XMP)

XMP metadata streams are now available for reading and writing on any dictionary or stream entity within a PDF document (see PdfObjectWrapper.get/setMetadata()).

4. Optional/Layered contents

Smoothing out some PDF spec awkwardness while implementing the content layer (aka optional content) functionality proved to be an interesting challenge. The result was nothing but satisfaction: a clean, intuitive and rich programming interface which automates lots of annoying housekeeping tasks and lets you access even the whole raw structures in case of special needs! 😎

The figure above represents a document generated by the following code sample; for the sake of comparison, I took an iText example and translated it to PDF Clown, adding some niceties like the cooperation between the PrimitiveComposer (whose lower-level role is graphics composition through primitive operations like showing text lines and drawing shapes) and the BlockComposer (whose higher-level role is to arrange text within page areas managing alignments, paragraph spacing and indentation, hyphenation, and so on).

content layering [lines 89, 91]: content is enclosed within a layer section, making its visibility dependent on the layer state. There’s a subtle discrepancy in the PDF spec when it comes to nested layers: one may assume they imply a hierarchical dependency of the sublayer states, but that’s NOT the case — if you hide a layer its descendants are still visible! To work around this counterintuitive behaviour, many software toolkits wrap contents within multiple nested layer blocks; for example, if you want to wrap the text “nested layer 1” into a layer (resource name /Pr2) which is a sublayer of another one (resource name /Pr1), the content stream will contain this cumbersome syntax:
4 0 obj
<< /Length 205 >>
stream
[...] /OC /Pr1 BDC
/OC /Pr2 BDC
q
BT
1 0 0 1 100 800 Tm
/F1 12 Tf
(nested layer 1)Tj
ET
Q
EMC
EMC
[...]
endstream
endobj

This beast is repeated as many times as there are distinct content chunks to include within the same layer, and it gets even worse as the number of nesting levels increases: just awful! 😯 Instead, PDF Clown defines a default hierarchical membership for each layer, which can be used as a single, terse wrapping block (resource name /Pr2):
4 0 obj
<< /Length 185 >>
stream
[...]/OC /Pr2 BDC
q
BT
1 0 0 1 100 800 Tm
/F1 12 Tf
(nested layer 1)Tj
ET
Q
EMC
[...]
endstream
endobj

This way the code is concise and more maintainable: if you want to rearrange the hierarchical structure of the layers, you don’t have to walk through the content stream hunting for layer block occurrences to correct; just go to the membership associated with the layer and update its hierarchical path! 🙂
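
The difference between the two wrapping styles can be sketched with a small generator. This is an illustrative, hypothetical helper (LayerWrapSketch and its methods are not part of PDF Clown); it only builds the marked-content operators shown in the two content streams above around a given content chunk.

```java
import java.util.Arrays;
import java.util.List;

class LayerWrapSketch {
    // Nested style: one BDC/EMC pair per layer on the hierarchical path.
    static String wrapNested(List<String> layerPath, String content) {
        StringBuilder out = new StringBuilder();
        for (String name : layerPath) {
            out.append("/OC /").append(name).append(" BDC\n");
        }
        out.append(content).append('\n');
        for (int i = 0; i < layerPath.size(); i++) {
            out.append("EMC\n");
        }
        return out.toString();
    }

    // Membership style: a single BDC/EMC pair naming one resource, which is
    // assumed to encode the hierarchical dependency outside the content stream.
    static String wrapWithMembership(String membershipName, String content) {
        return "/OC /" + membershipName + " BDC\n" + content + "\nEMC\n";
    }

    public static void main(String[] args) {
        String chunk = "q\nBT\n1 0 0 1 100 800 Tm\n/F1 12 Tf\n(nested layer 1)Tj\nET\nQ";
        System.out.print(wrapNested(Arrays.asList("Pr1", "Pr2"), chunk));
        System.out.println("---");
        System.out.print(wrapWithMembership("Pr2", chunk));
    }
}
```

The nested variant grows by one BDC/EMC pair for every extra nesting level and for every chunk, whereas the membership variant always emits exactly one pair, whatever the depth of the layer.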

simple layer group creation and insertion [lines 104-105]

option group definition [lines 148-152]

5. AcroForm fields filling

Text fields have been enhanced to support automatic appearance update on value change.

Hi Adi,
if you carefully examine the code sample above, it should be quite easy to identify the right spot for your logic: the hasNext() and next() methods of the extraction iterator (TextExtractor.IIntervalFilter) define the matching intervals. As shown in the sample, in order to search through the extracted text you have to consolidate the chunks into one big string (TextExtractor.toString(textStrings)); then you can find your matches by regex, plain indexOf() or whatever else you prefer.

In my opinion, getting rid of the regex doesn’t (apparently) make sense, since in any case (even if you are looking for simple, literal matches without wildcards) you are dealing with unstructured textual data.
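
The approach described above (consolidate the chunks into one string, search it, then map each match back to the chunks it covers) can be sketched in plain Java. The names below (ChunkSearchSketch, findChunkRanges) are hypothetical and the chunks are bare strings; PDF Clown's own text-string objects and TextExtractor.IIntervalFilter are deliberately not used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChunkSearchSketch {
    // Returns, for each match, {index of first chunk, index of last chunk} it touches.
    static List<int[]> findChunkRanges(List<String> chunks, String regex) {
        // Consolidate the chunks, remembering the offset where each one starts.
        StringBuilder text = new StringBuilder();
        int[] starts = new int[chunks.size()];
        for (int i = 0; i < chunks.size(); i++) {
            starts[i] = text.length();
            text.append(chunks.get(i));
        }

        List<int[]> ranges = new ArrayList<int[]>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            if (matcher.end() == matcher.start()) {
                continue; // empty matches cover no chunk
            }
            ranges.add(new int[] {
                chunkAt(starts, matcher.start()),
                chunkAt(starts, matcher.end() - 1)
            });
        }
        return ranges;
    }

    // Finds the last chunk whose start offset is not beyond the given position.
    private static int chunkAt(int[] starts, int position) {
        int index = 0;
        for (int i = 0; i < starts.length; i++) {
            if (starts[i] <= position) {
                index = i;
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("The white rab", "bit ran; ", "a rabbit hole");
        for (int[] range : findChunkRanges(chunks, "rabbit")) {
            System.out.println("match spans chunks " + range[0] + ".." + range[1]);
        }
    }
}
```

Here the first occurrence of "rabbit" straddles chunks 0 and 1 while the second lies entirely within chunk 2, which is why matching has to happen on the consolidated string rather than chunk by chunk. A literal search with indexOf() would plug into the same offset bookkeeping.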

Hi, PDF Clown’s rasterization capabilities are currently at alpha stage: as thoroughly documented, no assumption can be made about their effectiveness at the moment (you can help by contributing code for their implementation ;-)).
