1. Content stream manipulation

From the start, I have been delighted by the concept underlying the ContentScanner class, as it proved to be a versatile processor for handling content stream object trees along with their graphics state: you could use it directly to read existing content streams, modify them and create new ones in a convenient object-oriented fashion, or you could plug it into specialized tools (e.g. PrimitiveComposer, TextExtractor, Renderer) for more advanced applications.

But up to the 0.1.x version series it suffered from a significant drawback: it lacked separation of concerns from its object model, that is, the algorithmic responsibility for carrying out the tasks was delegated to the respective content stream operations. This may work well when there’s just a single task (“read/write the content stream”), but when further tasks are required (e.g. rendering the content stream into a graphics context) it rapidly becomes unmanageable.

Therefore I proceeded with a massive refactoring which was informed by two main concurrent requirements: algorithmic separation between process and structure (accomplished through the classic Visitor pattern) and preservation of the distinctive cursor-based behavior of ContentScanner (solved through dedicated impedance-matching logic).
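To make the separation between process and structure more concrete, here is a minimal sketch of the Visitor pattern applied to a toy content stream model. Every name in it (Operation, ShowText, SetFont, OperationVisitor, TextCollector) is hypothetical and chosen for illustration only; it does not reproduce PDF Clown’s actual classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model: NOT PDF Clown's actual API.
final class VisitorSketch {
  /** Structure: an operation only knows how to dispatch itself to a visitor. */
  interface Operation {
    <R> R accept(OperationVisitor<R> visitor);
  }

  /** Process: one method per operation type. */
  interface OperationVisitor<R> {
    R visit(ShowText operation);
    R visit(SetFont operation);
  }

  static final class ShowText implements Operation {
    final String text;
    ShowText(String text) { this.text = text; }
    @Override public <R> R accept(OperationVisitor<R> visitor) { return visitor.visit(this); }
  }

  static final class SetFont implements Operation {
    final String name;
    final double size;
    SetFont(String name, double size) { this.name = name; this.size = size; }
    @Override public <R> R accept(OperationVisitor<R> visitor) { return visitor.visit(this); }
  }

  /** A task (text collection) added without touching the operation classes. */
  static final class TextCollector implements OperationVisitor<Void> {
    final StringBuilder buffer = new StringBuilder();
    @Override public Void visit(ShowText operation) { buffer.append(operation.text); return null; }
    @Override public Void visit(SetFont operation) { return null; } // irrelevant to this task
  }

  static String extractText(List<Operation> operations) {
    TextCollector collector = new TextCollector();
    for (Operation operation : operations) {
      operation.accept(collector);
    }
    return collector.buffer.toString();
  }

  public static void main(String[] args) {
    List<Operation> stream = new ArrayList<>();
    stream.add(new SetFont("F1", 12));
    stream.add(new ShowText("Hello, "));
    stream.add(new ShowText("world"));
    System.out.println(extractText(stream));
  }
}
```

A renderer or a modeller would be just another OperationVisitor implementation; the operation classes stay untouched.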

All the non-core functionalities which were bloating the original ContentScanner (like rendering and content wrappers) have been extracted into specialized processors (respectively: ContentRenderer and ContentModeller), resulting in the following classes:

ContentVisitor: abstract content stream processor supporting all the common graphics state transformations;

1.1. ContentScanner

ContentScanner’s new implementation focuses exclusively on its core purpose, that is to enable users to manipulate content streams through their low-level, procedural, stacked model (operations and composite objects along with their graphics state).

1.2. ContentModeller

ContentModeller works as a parser which maps the low-level content stream model to its high-level, declarative, flat representation through a dedicated model rooted in the GraphicsElement abstract class (which corresponds to the GraphicsObjectWrapper hierarchy of ContentScanner’s old implementation). This simplified-yet-equivalent representation can be modified and saved back into the content stream.

Let’s see a practical example of the flexibility delivered by the new renderer: suppose that you have a multi-layered (OCG) document and you would like to selectively render only the contents belonging to a specific layer. To accomplish this, you can subclass ContentRenderer and tweak the drawing switch according to your own logic:

The following sample (from an old brochure of the Natural Tunnel State Park, Virginia) demonstrates how the renderer has evolved since its pre-alpha stage: text-showing operations have been temporarily implemented through substitute fonts emulating the styles (italic, bold, regular…) of the actual ones; this trick works nicely for thumbnail generation. The next step will address full-size rendering quality by adding support for glyph outlines.

The substitute fonts seem to work quite well also for non-Latin Unicode characters (as mapped on Ubuntu GNU/Linux):

First page of the Universal Declaration of Human Rights, Arabic translation, as rendered by PDF Clown

First page of the Universal Declaration of Human Rights, Chinese translation, as rendered by PDF Clown

2. Content composition engine

PDF Clown 0.2.0 introduces the much-requested keystone of its content composition stack: DocumentComposer class. This engine features a layout model inspired by a distilled, meaningful subset of the HTML+CSS ruleset.

Its high-level typographic model (columns, sections, paragraphs, tables and so on) is laid out leveraging the existing lower-level functionalities provided by BlockComposer (paragraph typesetting) and GraphicsComposer (primitive graphics instructions — previously named PrimitiveComposer), the latter of which in turn sits upon the above-mentioned ContentScanner for feeding into the content stream (IContentContext).

PDF Clown’s content composition stack

This subject is massively broad, so here I’m going to give you just a few highlights of its features (development is currently underway; I’ll add more details as it advances):

PDF Clown, according to the CSS3 specification, automatically balances the column heights, that is, it sets the maximum column height so that the heights of the content in each column are approximately equal. This is possible because of a powerful simulation algorithm which ensures an accurate arrangement. Should the content exceed the available height on the paged medium, it would automatically flow into the next page.

import org.pdfclown.documents.Document;
import org.pdfclown.documents.contents.composition.*;
import org.pdfclown.documents.contents.fonts.StandardType1Font;
import org.pdfclown.util.math.geom.Dimension;
. . .
DocumentComposer composer = new DocumentComposer(document);
/*
NOTE: Composer's style is at the root of the style model, that is, its definitions
are inherited by the descending elements, analogously to the style of BODY element
in HTML DOM.
*/
composer.getStyle()
.withTextAlign(XAlignmentEnum.Justify)
.withFontSize(new Length(12));
/*
NOTE: Element type styles are analogous to CSS styles defined through element type
selectors.
*/
composer.getStyle(Paragraph.class)
.withTextIndent(new Length(10));
composer.getStyle(Heading.class)
.withMargin(new QuadLength(0, 0, 10, 0));
/*
NOTE: Styles can be defined analogously to CSS class definitions and can be derived
analogously to Less mixins (http://lesscss.org/).
*/
Style strongStyle = new Style("strong")
.withFont(new Font(new StandardType1Font(document, StandardType1Font.FamilyEnum.Times, true, false), null));
Style emStyle = new Style("em")
.withFont(new Font(new StandardType1Font(document, StandardType1Font.FamilyEnum.Times, false, true), null));
Style noteStyle = new Style("note")
.withBorder(new Border(
null,
new QuadBorderStyle(BorderStyleEnum.Solid, BorderStyleEnum.None, BorderStyleEnum.None, BorderStyleEnum.None),
new QuadLength(.1, 0, 0, 0),
null))
.withFont(new Font(null, 6d))
.withMargin(new QuadLength(30, 0, 0, 0))
.withPadding(new QuadLength(5, 0, 0, 0))
.withTextAlign(XAlignmentEnum.Left)
.withTextIndent(new Length(0));
Style superStyle = new Style("super")
.withFont(new Font(null, 6.5d))
.withVerticalAlign(LineAlignmentEnum.Super);
Section section = new Section("Hello World, this is PDF Clown!");
/*
NOTE: Group is a typographic element analogous to DIV element in HTML DOM.
*/
Group group = new Group(
new Image(
new Style()
.withFloat(FloatEnum.Left)
.withMargin(new QuadLength(new Length(5)))
.withSize(new Dimension(100,0)),
document,
"Clown.jpg"
),
new Paragraph(
new Text("PDF Clown's layout engine supports the "),
new Text(strongStyle, "multi-column layout model"),
new Text(" described by the CSS3 specification"),
new Text(superStyle, "[1]"),
new Text(" which extends the block layout mode to allow the easy definition of multiple columns "
+ "of text (and any other kind of content, like tables, images, lists and so on).")
),
new Paragraph(
new Text("PDF Clown, according to the CSS3 specification, "),
new Text(emStyle, "automatically balances the column heights"),
new Text(", i.e., it sets the maximum column height so that the heights of the content in each column "
+ "are approximately equal. This is possible because of a powerful simulation algorithm which ensures "
+ "an accurate arrangement. Should the content exceed the available height on the paged medium, it "
+ "would automatically flow into the next page.")
),
new Paragraph(
new Text("Columns can be defined by count (number of columns desired), width (minimum column width desired)"
+ " or both: in any case, the official CSS3 pseudo-algorithm is applied"),
new Text(superStyle, "[2]"),
new Text(". If you are interested in further info about CSS multi-column layouts, I recommend you to see "
+ "Mozilla's documentation for a great introduction to CSS Multi-column Layout Module"),
new Text(superStyle, "[3]"),
new Text(".")
),
new Paragraph(noteStyle,
new Text("1. http://www.w3.org/TR/css3-multicol/\n"
+ "2. http://www.w3.org/TR/css3-multicol/#pseudo-algorithm\n"
+ "3. https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Using_multi-column_layouts")
)
);
/*
NOTE: This is the declarative CSS3-equivalent style which prescribes the layout engine to treat
this group as a multi-column block (in this case: 2 columns with a 14-point gap between).
*/
group.getStyle().withColumn(new Column(2, new Length(14)));
section.add(group);
composer.show(section);
composer.close();

Honoring the KISS principle, all the magic here is done by a minimal declaration (see line 100 above) which, analogously to the CSS fragment {column-count:2; column-gap:14pt;}, instructs PDF Clown’s layout engine to render the content group as a multi-column block:

group.getStyle().withColumn(new Column(2, new Length(14)));
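As an aside, the height-balancing idea can be sketched independently of PDF Clown. The following is a minimal illustration, not the engine’s actual simulation algorithm: it assumes the content is a sequence of indivisible blocks with known integer heights, and it searches for the smallest column height at which those blocks, kept in order, fit into the requested number of columns. All names are hypothetical.

```java
// Hypothetical illustration: NOT PDF Clown's actual balancing algorithm.
final class ColumnBalancingSketch {
  /** Tells whether the blocks, taken in order, fit into the given number of columns. */
  static boolean fits(int[] blockHeights, int columns, int columnHeight) {
    int usedColumns = 1;
    int currentHeight = 0;
    for (int height : blockHeights) {
      if (height > columnHeight) {
        return false; // a single block is taller than the column
      }
      if (currentHeight + height > columnHeight) {
        usedColumns++; // block flows into the next column
        currentHeight = 0;
      }
      currentHeight += height;
    }
    return usedColumns <= columns;
  }

  /** Finds the smallest column height that accommodates all the blocks. */
  static int balancedHeight(int[] blockHeights, int columns) {
    if (columns < 1) {
      throw new IllegalArgumentException("At least one column is required");
    }
    int low = 0;  // no column can be shorter than the tallest block
    int high = 0; // a single column holding everything always fits
    for (int height : blockHeights) {
      low = Math.max(low, height);
      high += height;
    }
    // Binary search: fitting is monotonic with respect to the column height.
    while (low < high) {
      int mid = (low + high) >>> 1;
      if (fits(blockHeights, columns, mid)) {
        high = mid;
      } else {
        low = mid + 1;
      }
    }
    return low;
  }

  public static void main(String[] args) {
    System.out.println(balancedHeight(new int[] {10, 20, 30}, 2));
  }
}
```

A real engine also has to split paragraphs across columns, honor floats and flow excess content to the next page, which this sketch deliberately ignores.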

Comparing this neat solution to a well-renowned library like iText, some awkward shortcomings emerge in the way iText deals with multi-column layout: com.itextpdf.text.pdf.ColumnText class works as a dedicated processor outside the common declarative pattern (i.e., you cannot directly feed a column-aware element into the document as you do for tables, paragraphs and so on). In my opinion, that’s a really bad thing: a well-designed layout engine should hide those implementation details and carry out its duties transparently — you just feed the contents and it takes care to do the right thing according to their properties and inherent behaviors. Such a crippled layout model forces users to ridiculous bends and twists just to get contents in place!… let’s examine a few of them:

awful treatment of column intrusions: iText requires you to explicitly define the shape of your columns (sic!), distinguishing between “simple” (rectangularly-bound) and “irregular” (arbitrarily-shaped) columns, the latter forcing you to tediously specify each vertex…

adaptive column intrusion detection: PDF Clown’s layout engine keeps track of absolutely-positioned elements and its block composer takes care of automatically flowing content around those already-occupied areas. All you have to do is add content the way shown in the code example above, no convoluted coding here!

smooth, consistent separation between content and layout models: in PDF Clown, layout processing is DocumentComposer’s business, while content definition is user’s. Multi-column layout is just another style property of your contents, not a strange beast to wrestle with!

NOTE: iText is indisputably a powerful library; my criticism is limited to a specific aspect of its composition model and represents nothing but my opinion: it will be up to the users to decide what’s good for them once the library is released.

As I said, multi-column layout is just a little treat in a full-fledged layout engine… PDF Clown is maturing: in the next weeks new technical details, code snippets and announcements will appear here. Stay tuned with its Twitter stream!

2.2. Tables

I know many of you have long wanted PDF Clown to support table composition natively… So, here we go: fully styleable, rowspan/colspan-enabled, arbitrarily nestable… really sweet!

Element construction: every composition element features a uniform set of constructors designed for compact definition. Here is their parameter pattern:
where style is the element’s style (either custom or class) and children are the elements it contains.

Style: the resolved style of each element is a combination of multiple styles:

2.4. Page Breaks

PDF Clown 0.2.0 supports CSS-like page breaks.

Page breaks sample generated by PDF Clown

And this is the corresponding code (lines 45 and 46 apply the page breaks):

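import org.pdfclown.documents.Document;
import org.pdfclown.documents.contents.colors.DeviceRGBColor;
import org.pdfclown.documents.contents.composition.*;
import org.pdfclown.documents.contents.fonts.StandardType1Font;
import org.pdfclown.documents.contents.fonts.StandardType1Font.FamilyEnum;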
. . .
DocumentComposer composer = new DocumentComposer(document);
/*
We decide that table cells sport a solid border by default (analogous
to CSS styles defined through an element type selector).
*/
composer.getStyle(Cell.class)
.withBorder(new Border(
new QuadColor(new DeviceRGBColor(0, 0, 0)),
new QuadBorderStyle(BorderStyleEnum.Solid),
new QuadLength(new Length(1)),
new QuadCornerRadius()))
.withPadding(new QuadLength(new Length(5)));
/*
We decide to highlight inline code references through a dedicated style.
*/
Style codeStyle = new Style("code")
.withBackgroundColor(new DeviceRGBColor(1, 1, 0))
.withFontType(new StandardType1Font(document, FamilyEnum.Courier, false, false));
/*
The contents will be included in a section.
*/
Section section = new Section("Hello World, this is PDF Clown!");
section.add(
new Paragraph(
new Text("This paragraph is the last content on this page as its next sibling is marked with CSS-like "),
new Text(codeStyle, "page-break-before: always"),
new Text(". Clean and simple!")
)
);
section.add(
new Group(
new Paragraph(
/*
Here it is the custom style applied to the isolated paragraph.
*/
new Style()
.withPageBreakAfter(PageBreakEnum.Always)
.withPageBreakBefore(PageBreakEnum.Always),
new Text("This paragraph is isolated on this page as we marked it with both CSS-like "),
new Text(codeStyle, "page-break-before: always"),
new Text(" and "),
new Text(codeStyle, "page-break-after: always")
),
new Table(
new Row(
new Cell("Cell1,1"),
new Cell("Cell1,2"),
new Cell("Cell1,3"),
new Cell("Cell1,4")
),
new Row(
new Cell("Cell2,1"),
new Cell("Cell2,2").withColSpan(2),
new Cell("Cell2,4")
)
)
)
);
composer.show(section);
composer.close();

This is a demonstration of some of the fine typesetting capabilities of the new layout engine of PDF Clown (the code which generated the sample shown above is listed below):

composition event listener: DocumentComposer notifies a dedicated listener (DocumentComposer.DocumentListener) of its relevant events, so you can apply custom logic when the engine requires a new page (onContextInit), begins composing the current page (onContextBegin), finishes composing the current page (onContextEnd) and so on. In this demonstration (see code below, line 28) a margin note is added in reaction to the end of the page layout.

drop caps: stylish initial letters work like a charm; you just need to float your letter to the left and choose its font and size as you like (see code below, lines 96-99).

vertical fill property: have you ever found yourself trying to convince a traditional horizontally-flowing layout engine (like the HTML-based ones) to automatically place, for example, a paragraph aligned to the bottom of a page (like footnotes), or to center a title in the middle of a page? That’s often a tricky deed, which typically results in some inglorious coding gymnastics, stretching here and there, or resorting to the awkward and infamous tables… PDF Clown features a specific style property (VerticalFill) which addresses this kind of situation in the cleanest and simplest way, vertically stretching the element box to cover the whole usable page area. In this demonstration (see code below, line 113) the paragraphs following the title are aligned to the bottom of the page.

import java.awt.Dimension;
import java.awt.geom.*;
import org.pdfclown.documents.Document;
import org.pdfclown.documents.contents.composition.*;
. . .
DocumentComposer composer = new DocumentComposer(
/*
NOTE: The composer works along with a listener whose event
callbacks can be customized.
If you don't need any customization, you pass your document
variable directly to the DocumentComposer constructor (behind
the scenes it instantiates the default listener implementation).
*/
new DocumentComposer.DocumentListener(document)
{
/*
'onContextEnd' notifies that the layout on the current page
has ended.
*/
@Override
public void onContextEnd(
Event event
)
{
showMarginNote(event.getSource());
super.onContextEnd(event);
}
private void showMarginNote(
DocumentComposer composer
)
{
/*
NOTE: In this example, we decided that when the page ends,
a vertically-oriented note is placed on the right margin.
NOTE: This lower-level construct (which works directly with
BlockComposer) will be replaced by high-level elements
(paragraphs) as soon as absolute positioning will be available.
*/
BlockComposer block = composer.getBaseComposer();
GraphicsComposer graphics = block.getBaseComposer();
Dimension2D pageSize = composer.getContext().getSize();
Style pageStyle = composer.getStyle();
graphics.beginLocalState();
graphics.rotate(
90,
new Point2D.Double(
pageSize.getWidth()
- pageStyle.getMargin().getRight().getValue(),
pageSize.getHeight()
- pageStyle.getMargin().getBottom().getValue() / 2
)
);
block.begin(
new Rectangle2D.Double(0, 0,
pageSize.getHeight() / 2,
pageStyle.getMargin().getRight().getValue()),
XAlignmentEnum.Left,
YAlignmentEnum.Middle
);
graphics.setFont(composer.getStyle(null).getFontType(), 8);
block.showText("Generated by PDF Clown on " + new java.util.Date());
block.showBreak();
block.showText("For more info, visit http://www.pdfclown.org");
block.end();
graphics.end();
}
}
);
// Style definition.
composer.getStyle()
.withLineSpace(new Length(0))
.withMargin(new QuadLength(new Length(50)));
composer.getStyle(null)
.withFont(new Font(
org.pdfclown.documents.contents.fonts.Font.get(
document,
"TravelingTypewriter.otf"),
14))
.withTextAlign(XAlignmentEnum.Justify);
composer.getStyle(Paragraph.class)
.withMargin(new QuadLength(8, 0, 0, 0))
.withTextIndent(new Length(24));
org.pdfclown.documents.contents.fonts.Font decorativeFont =
org.pdfclown.documents.contents.fonts.Font.get(
document,
"Ruritania-Outline.ttf");
composer.getStyle(Heading.class)
.withFont(new Font(decorativeFont, 56))
.withLineSpace(new Length(.25, UnitModeEnum.Relative));
Style firstLetterStyle = new Style("firstLetter")
.withFloat(FloatEnum.Left)
.withFont(new Font(decorativeFont, new Length(2, UnitModeEnum.Relative)))
.withMargin(new QuadLength(0, 5, 0, 0));
// Content insertion.
Section section = new Section(
new Heading(
new Text("Chapter 1"),
new Text(
new Style().withFontSize(new Length(32)),
"\nDown the Rabbit-Hole"
)
),
new Group(
new Style()
.withVerticalAlign(LineAlignmentEnum.Bottom)
.withVerticalFill(VerticalFillEnum.FirstPage),
new Paragraph(
new Style().withTextIndent(new Length(0)),
new Text(firstLetterStyle, "A"),
new Text("lice was beginning to get very tired of sitting "
+ "by her sister on the bank, and of having nothing to do: "
+ "once or twice she had peeped into the book her sister "
+ "was reading, but it had no pictures or conversations in "
+ "it, 'and what is the use of a book,' thought Alice "
+ "'without pictures or conversation?'")
),
new Image(
new Style()
.withFloat(FloatEnum.Right)
.withMargin(new QuadLength(new Length(5)))
.withSize(new Dimension(0,250)),
document,
"alice_white_rabbit.jpg"
),
new Paragraph("So she was considering in her own mind (as well "
+ "as she could, for the hot day made her feel very sleepy and "
+ "stupid), whether the pleasure of making a daisy-chain would "
+ "be worth the trouble of getting up and picking the daisies, "
+ "when suddenly a White Rabbit with pink eyes ran close by her."),
new Paragraph("There was nothing so VERY remarkable in that; nor "
+ "did Alice think it so VERY much out of the way to hear the "
+ "Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' "
+ "(when she thought it over afterwards, it occurred to her that "
+ "she ought to have wondered at this, but at the time it all "
+ "seemed quite natural); but when the Rabbit actually TOOK A "
+ "WATCH OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then "
+ "hurried on, Alice started to her feet, for it flashed across "
+ "her mind that she had never before seen a rabbit with either a "
+ "waistcoat-pocket, or a watch to take out of it, and burning with "
+ "curiosity, she ran across the field after it, and fortunately "
+ "was just in time to see it pop down a large rabbit-hole under the "
+ "hedge.")
)
);
composer.show(section);
composer.close();

Layout areas revealed

The layout process works by balancing competing constraints: the picture above reveals how this composition takes place (for each content element, the gray dashed shape represents the potential frame while the green shape represents the area actually occupied).

3. Form flattening

A request from a user on Stack Overflow prompted the implementation of an Acroform flattener, which converts field annotations into static representations for content consolidation. Here is an example of its use:

5. Automated object stream compression

Object streams [PDF:1.7:3.4.6] and cross-reference streams [PDF:1.7:3.4.7] have been switched from manual to automatic compression: till version 0.1.2.0 full PDF compression relied on the client’s choice of which data objects to aggregate into object streams; now all this process is transparent to the client and affects all the legally-compressible data objects.
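To give an idea of what “legally-compressible” means, here is a minimal sketch of the eligibility check. It reflects my reading of the PDF 1.7 restrictions on object stream contents (no stream objects, no objects with a non-zero generation number, not the encryption dictionary, not the object holding an object stream’s Length value); linearized files add further constraints which are ignored here. The IndirectObject class and its fields are hypothetical and do not mirror PDF Clown’s object model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: NOT PDF Clown's actual object model.
final class ObjectStreamEligibilitySketch {
  static final class IndirectObject {
    final int number;
    final int generation;
    final boolean stream;               // is the data object a stream?
    final boolean encryptionDictionary; // is it the document's encryption dictionary?
    final boolean objectStreamLength;   // is it the Length value of an object stream?

    IndirectObject(int number, int generation, boolean stream,
        boolean encryptionDictionary, boolean objectStreamLength) {
      this.number = number;
      this.generation = generation;
      this.stream = stream;
      this.encryptionDictionary = encryptionDictionary;
      this.objectStreamLength = objectStreamLength;
    }
  }

  /** Tells whether the object may be stored inside an object stream. */
  static boolean isCompressible(IndirectObject object) {
    return !object.stream
        && object.generation == 0
        && !object.encryptionDictionary
        && !object.objectStreamLength;
  }

  /** Selects the objects to aggregate, leaving the others as ordinary indirect objects. */
  static List<IndirectObject> selectCompressible(List<IndirectObject> objects) {
    List<IndirectObject> selection = new ArrayList<>();
    for (IndirectObject object : objects) {
      if (isCompressible(object)) {
        selection.add(object);
      }
    }
    return selection;
  }

  public static void main(String[] args) {
    List<IndirectObject> objects = new ArrayList<>();
    objects.add(new IndirectObject(1, 0, false, false, false)); // plain dictionary
    objects.add(new IndirectObject(2, 0, true, false, false));  // stream
    objects.add(new IndirectObject(3, 1, false, false, false)); // reused object number
    System.out.println(selectCompressible(objects).size());
  }
}
```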

1. Document Inspector

Since its earliest versions, PDF Clown has shipped with a simple Swing-based proof of concept for viewing PDF file structures. Now that little fledgling is going to become a comprehensive tool for visually editing the structure of PDF files: PDF Clown Document Inspector. It was initially planned to be part of version 0.1.2 as a dedicated project within the PDF Clown distribution, but it wasn’t ready by the release deadline.

This tool conforms to the PDF model as defined by PDF Clown (see the diagram above), which adheres to the official PDF Reference 1.7/ISO 32000-1. This implies that a PDF file is represented through several concurrent views which work at different abstraction levels: Document view (document layer), File view (file/object layer, hierarchical) and XRef view (file/object layer, flat).

1.1. Document view

Document view (see the left pane in the above screenshot) shows the high-level structure of a PDF file; when you select a node, its data is shown in the right pane through several views. In this case, selecting a page node shows its content stream structure (Contents view, see below) and its rendering (Render view [¹], see above). Note that the page model represented by both Contents view and Render view corresponds to the content (sub)layer described in the diagram above.

Here is just one of the possible functionalities: when you hover the mouse pointer over a show-text-operation node, a tooltip pops up revealing the actual text encoded inside it (in this example, inspecting a Russian-language document):

There’s so much potential for custom features that I’m considering making it pluggable, so that users can extend it with additional modules as they see fit.

1.2. File view

File view shows the low-level representation of the same entities you found in the above-mentioned Document view, expressed as primitive objects like dictionaries (PdfDictionary), arrays (PdfArray), streams (PdfStream) and so on.

1.3. XRef view

XRef view lists the entries of the cross-reference index (either table or stream, but that’s a technical detail you can happily ignore as it’s transparently handled by the library).

Note that all the views (Document, File, XRef) are always kept synchronized: when you select a node in one of these views, its corresponding entities in each of the others are automatically selected, allowing you to switch seamlessly from one view to another.

[¹] Rendering is still partial as it’s under development (pre-alpha stage).

1. Multimedia

For a long time I gave low priority to multimedia features (chapter 9 of PDF Reference 1.7), but recently I received some requests for them on the project’s forum… so yes, video embedding through Screen annotations is now ready!

Screen annotations as implemented by PDF Clown feature a couple of nice JavaScript-based enhancements: video preview at an arbitrary position (the video is automatically loaded on page opening, ready to be played starting from a given time frame) and user control (YouTube-like play/pause behavior by mouse click on the player; this may seem obvious, but anyone who has worked with these annotations knows how painful it is, requiring awkward workarounds like dedicated play/pause buttons…). Furthermore, a fall-back FileAttachment annotation is placed alongside its Screen annotation for graceful degradation in case the PDF viewer has no multimedia capabilities.

Spurred by an engaging user request, file specification management (now modelled in org.pdfclown.documents.files namespace instead of the old org.pdfclown.documents.fileSpec) has been thoroughly revised to smoothly support PDF stream objects import/export from/to external files.

This practically means that, instead of embedding stream data directly into a PDF file, such data can reside in an external (local or remote) file and be linked from within the PDF file through a file specification object (org.pdfclown.documents.files.FileSpecification). Thus common resources such as images can be shared among multiple documents (useful for example in a server scenario where documents may be assembled on-the-fly).

Anyway, there’s a caveat to consider before approaching externalized streams: as they are prone to security issues, their actual support by PDF viewers is very restricted (e.g., see so-called “privileged locations” in Adobe Acrobat’s Enhanced Security preferences) or even non-existent (e.g., see Evince).

Here is a code sample demonstrating how external references are applied to PDF stream objects:

PDF stream data is exported and linked back [lines 62-68];

linked files are imported back into their respective PDF stream objects [lines 95-98].

Working on file specifications also involved support for file identifiers (PDF 1.7, § 10.3, modelled by the org.pdfclown.files.FileIdentifier class), which enforce referential integrity on document interchange. Their generation and update are now part of the document life cycle automatically managed by PDF Clown.
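For readers unfamiliar with file identifiers, here is a minimal sketch of their life cycle. It follows the approach the PDF specification recommends (an MD5 digest over the current time, the file location, the file size and the document information values) and keeps two parts: a base identifier which stays constant and a version identifier which changes on every save. The class and method names are hypothetical; this is not the code of org.pdfclown.files.FileIdentifier.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration: NOT the implementation of PDF Clown's FileIdentifier.
final class FileIdentifierSketch {
  final byte[] baseId;    // assigned on creation, never changes afterwards
  final byte[] versionId; // recomputed each time the file is saved

  private FileIdentifierSketch(byte[] baseId, byte[] versionId) {
    this.baseId = baseId.clone();
    this.versionId = versionId.clone();
  }

  /** Digests the pieces of information suggested by the PDF specification. */
  static byte[] digest(long timestampMillis, String path, long size, String... infoValues) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      md5.update(Long.toString(timestampMillis).getBytes(StandardCharsets.UTF_8));
      md5.update(path.getBytes(StandardCharsets.UTF_8));
      md5.update(Long.toString(size).getBytes(StandardCharsets.UTF_8));
      for (String value : infoValues) {
        md5.update(value.getBytes(StandardCharsets.UTF_8));
      }
      return md5.digest();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 digest not available", e);
    }
  }

  /** New file: both parts start with the same value. */
  static FileIdentifierSketch create(long timestampMillis, String path, long size, String... infoValues) {
    byte[] id = digest(timestampMillis, path, size, infoValues);
    return new FileIdentifierSketch(id, id);
  }

  /** Saved file: the base part is preserved, the version part is regenerated. */
  FileIdentifierSketch update(long timestampMillis, String path, long size, String... infoValues) {
    return new FileIdentifierSketch(baseId, digest(timestampMillis, path, size, infoValues));
  }

  public static void main(String[] args) {
    FileIdentifierSketch id = create(System.currentTimeMillis(), "sample.pdf", 1024, "Sample title");
    FileIdentifierSketch saved = id.update(System.currentTimeMillis() + 1, "sample.pdf", 2048, "Sample title");
    System.out.println(java.util.Arrays.equals(id.baseId, saved.baseId));
  }
}
```

A consumer can then compare the base part to recognize the same document across revisions, and the version part to detect that it has changed.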

4. Advanced cloning

Since its inception, PDF Clown has supported a cloning mechanism capable of elegantly copying any structure/content of a PDF file without specialized code or torture-chamber algorithms (those exotic, lengthy, exhaustingly cumbersome monster methods you may sometimes see when peering through the source of some well-known library…). Its implementation wasn’t complete, though: it couldn’t deal with circular references (which ruled out annotations and some other structures) and there was no way to customize its filters on-the-fly in order to select just a graph subset to clone (which in practice resulted in an identity transformation).

The good news is that the 0.1.2 implementation overcomes these limitations by leveraging the generic object visitor (org.pdfclown.objects.Visitor) through the Cloner class (org.pdfclown.objects.Cloner), which hosts a customizable collection of filters used to apply arbitrary transformations to the structures being cloned.
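The two ingredients mentioned above (tolerance of circular references and filtering of the graph subset to clone) can be sketched on a toy object graph. This is only an illustration under simplifying assumptions: the Node and Cloner classes below are hypothetical and unrelated to the actual org.pdfclown.objects.Cloner, and the filter is reduced to a single predicate on dictionary entries.

```java
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical illustration: NOT PDF Clown's actual Cloner.
final class CloningSketch {
  /** Toy dictionary-like object whose entries may point back to ancestors. */
  static final class Node {
    final String name;
    final Map<String, Node> entries = new LinkedHashMap<>();
    Node(String name) { this.name = name; }
  }

  static final class Cloner {
    // Identity-based registry of the objects already cloned during this run.
    private final Map<Node, Node> clones = new IdentityHashMap<>();
    // Decides, entry by entry, whether a child belongs to the cloned subset.
    private final BiPredicate<String, Node> filter;

    Cloner(BiPredicate<String, Node> filter) { this.filter = filter; }

    Node cloneOf(Node source) {
      Node existing = clones.get(source);
      if (existing != null) {
        return existing; // circular reference: reuse the clone in progress
      }
      Node copy = new Node(source.name);
      clones.put(source, copy); // register BEFORE descending into the entries
      for (Map.Entry<String, Node> entry : source.entries.entrySet()) {
        if (filter.test(entry.getKey(), entry.getValue())) {
          copy.entries.put(entry.getKey(), cloneOf(entry.getValue()));
        }
      }
      return copy;
    }
  }

  public static void main(String[] args) {
    Node page = new Node("Page");
    Node annotation = new Node("Annotation");
    page.entries.put("Annots", annotation);
    annotation.entries.put("P", page); // back-reference to the page

    Node copy = new Cloner((key, child) -> true).cloneOf(page);
    System.out.println(copy.entries.get("Annots").entries.get("P") == copy);
  }
}
```

Registering the clone before visiting its entries is what keeps the back-reference from triggering an endless recursion.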

Let’s see an example. We want to copy a page into another PDF document (by the way: there’s a utility, org.pdfclown.tools.PageManager, which is purposely devoted to this activity, but here we want to dig deeply into its inner workings…):

The next release is going to introduce exciting new features (text highlighting, optional/layered contents, Type1/CFF font support, etc.) along with improvements and consolidations of existing ones (enhanced text extraction, enhanced content rendering, enhanced acroform creation and filling, etc.). This post will be kept updated according to development progress, so please stay tuned!
These are some of the things I have been working on till now:

Bidirectional traversal has been accomplished by the introduction of explicit references to ascendants: composite objects (PdfDictionary, PdfArray, PdfStream) are now aware of their parent container, so walking through the ascending path to the root PdfIndirectObject (and File) is absolutely trivial! This functionality has loads of engaging potential applications, such as fine-grained object cloning based on structure context (as in case of Acroform annotations residing on a given page).

Ascendant-aware objects are intelligent enough to automatically detect and notify changes to their parent container, making incremental updates transparent to the user.

Simple objects have been made immutable to avoid risks of unintended changes and promote their efficient reuse.

As expected (you may have noticed some TODO task comments about this within the project’s code base), object parsing of PostScript-related formats (PDF file, PDF content stream and CMaps) has been organized under the same class hierarchy to improve its consistency and maintainability.
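The first two points of the list above (ascendant awareness and automatic change notification) can be illustrated with a small sketch. The DataObject and Composite classes are hypothetical stand-ins, not PDF Clown’s PdfDictionary, PdfArray or PdfIndirectObject, and the update flag is a simplification of whatever bookkeeping the library really performs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: NOT PDF Clown's actual object hierarchy.
final class AscendantSketch {
  static class DataObject {
    private Composite parent;
    private boolean updated;

    Composite getParent() { return parent; }
    boolean isUpdated() { return updated; }

    /** Clears the flag, e.g. after the object has been serialized. */
    void commit() { updated = false; }

    /** Flags this object and notifies the whole ascending path. */
    void markUpdated() {
      updated = true;
      if (parent != null) {
        parent.markUpdated();
      }
    }

    /** Walks the ascending path up to the outermost container. */
    DataObject getRoot() {
      DataObject current = this;
      while (current.parent != null) {
        current = current.parent;
      }
      return current;
    }
  }

  static final class Composite extends DataObject {
    private final List<DataObject> children = new ArrayList<>();

    void add(DataObject child) {
      child.parent = this; // the child becomes aware of its container
      children.add(child);
      markUpdated();       // adding a child is itself a change
    }
  }

  public static void main(String[] args) {
    Composite root = new Composite();
    Composite dictionary = new Composite();
    DataObject leaf = new DataObject();
    root.add(dictionary);
    dictionary.add(leaf);
    root.commit();
    dictionary.commit();

    leaf.markUpdated();
    System.out.println(root.isUpdated());
  }
}
```

With this arrangement, an incremental update only has to look at the flag of each root object to know what needs to be written.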

2. Text highlighting

Text highlighting was a much-requested feature. It took me less than one hour of enjoyable coding to write a prototype which could populate a PDF file with highlight annotations matching an arbitrary text pattern, as you can see in the following figure representing a page of Alice in Wonderland resulting from the search of “rabbit” occurrences:

This text highlighting sample leverages both text extraction [line 55] and annotation [line 106] functionalities of PDF Clown, as you can see in its source code:

This is another example matching words which contain “co” (regular expression “\w*co\w*”):

Here you can appreciate the dehyphenation functionality applied to another search (words beginning with “devel” — regular expression “\bdevel\w*”):
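The two searches mentioned above can be reproduced on plain text with the standard java.util.regex package. This is only a sketch: it assumes the text has already been extracted as a string, it treats dehyphenation as the naive removal of a hyphen followed by a line break between two letters, and it ignores the mapping from matches back to page coordinates which highlight annotations require. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration working on plain strings, not on PDF content.
final class TextSearchSketch {
  // A letter, a hyphen, a line break and another letter: a word split across lines.
  private static final Pattern LINE_END_HYPHEN = Pattern.compile("(\\p{L})-\\r?\\n(\\p{L})");

  /** Joins words which were hyphenated across a line break. */
  static String dehyphenate(String text) {
    return LINE_END_HYPHEN.matcher(text).replaceAll("$1$2");
  }

  /** Collects all the case-insensitive occurrences of the given pattern. */
  static List<String> findMatches(String text, String regex) {
    List<String> matches = new ArrayList<>();
    Matcher matcher = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
    while (matcher.find()) {
      matches.add(matcher.group());
    }
    return matches;
  }

  public static void main(String[] args) {
    String text = "of course the conversation could continue\nsoft-\nware devel-\nopment";
    System.out.println(findMatches(text, "\\w*co\\w*"));
    System.out.println(findMatches(dehyphenate(text), "\\bdevel\\w*"));
  }
}
```

Note that the naive rule also joins compounds which are legitimately hyphenated when they happen to break at the end of a line.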

3. Metadata streams (XMP)

XMP metadata streams are now available for reading and writing on any dictionary or stream entity within a PDF document (see PdfObjectWrapper.get/setMetadata()).

4. Optional/Layered contents

Smoothing out some PDF spec awkwardness while implementing the content layer (aka optional content) functionality proved to be an interesting challenge. The result was nothing but satisfaction: a clean, intuitive and rich programming interface which automates lots of annoying housekeeping tasks and lets you access even the whole raw structures in case of special needs!

The figure above represents a document generated by the following code sample; for the sake of comparison, I took an iText example and translated it to PDF Clown, adding some niceties like the cooperation between the PrimitiveComposer (whose lower-level role is graphics composition through primitive operations like showing text lines and drawing shapes) and the BlockComposer (whose higher-level role is to arrange text within page areas managing alignments, paragraph spacing and indentation, hyphenation, and so on).

content layering [lines 89, 91]: content is enclosed within a layer section, making its visibility dependent on the layer state. There’s a subtle discrepancy in the PDF spec when it comes to nested layers: one may assume they imply a hierarchical dependency of the sublayer states, but that’s NOT the case — if you hide a layer its descendants are still visible! To work around this counterintuitive behaviour, many software toolkits wrap contents within multiple nested layer blocks; for example, if you want to wrap the text “nested layer 1” into a layer (resource name /Pr2) which is a sublayer of another one (resource name /Pr1), the content stream will contain this cumbersome syntax:
4 0 obj
<< /Length 205 >>
stream
[...] /OC /Pr1 BDC
/OC /Pr2 BDC
q
BT
1 0 0 1 100 800 Tm
/F1 12 Tf
(nested layer 1)Tj
ET
Q
EMC
EMC
[...]
endstream
endobj

This beast is repeated as many times as there are distinct content chunks to include within the same layer; it gets even worse as the number of nesting levels increases: just awful! Instead, PDF Clown defines a default hierarchical membership for each layer which can be used as a single, terse wrapping block (resource name /Pr2):
4 0 obj
<< /Length 185 >>
stream
[...]/OC /Pr2 BDC
q
BT
1 0 0 1 100 800 Tm
/F1 12 Tf
(nested layer 1)Tj
ET
Q
EMC
[...]
endstream
endobj

This way the code is concise and more maintainable: if you want to rearrange the hierarchical structure of the layers, you don’t have to walk through the content stream hunting for layer block occurrences to correct; just go to the membership associated with the layer and update its hierarchical path!
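To make the difference between the two wrapping styles concrete, here is a minimal Java sketch that only builds the content stream text for each of them. LayerWrapping and its methods are hypothetical names invented for this illustration; they are not part of PDF Clown's API, which handles the membership resource for you.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not PDF Clown API): it just assembles content stream text
// to contrast nested layer blocks with a single membership block.
final class LayerWrapping {
    // Style 1: one BDC/EMC pair per layer on the hierarchical path, outermost first.
    static String wrapNested(List<String> layerPath, String content) {
        StringBuilder sb = new StringBuilder();
        for (String name : layerPath) {
            sb.append("/OC /").append(name).append(" BDC\n");
        }
        sb.append(content).append('\n');
        for (int i = 0; i < layerPath.size(); i++) {
            sb.append("EMC\n");
        }
        return sb.toString();
    }

    // Style 2: a single block whose resource is assumed to encode the whole path.
    static String wrapWithMembership(String membershipName, String content) {
        return "/OC /" + membershipName + " BDC\n" + content + "\nEMC\n";
    }

    public static void main(String[] args) {
        String text = "q\nBT\n1 0 0 1 100 800 Tm\n/F1 12 Tf\n(nested layer 1)Tj\nET\nQ";
        System.out.print(wrapNested(Arrays.asList("Pr1", "Pr2"), text));
        System.out.println("---");
        System.out.print(wrapWithMembership("Pr2", text));
    }
}
```

With the nested style, the number of BDC/EMC pairs grows with the nesting depth and every chunk has to repeat them; with the membership style, the block stays the same whatever the depth, because the hierarchy lives in the resource rather than in the stream.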

simple layer group creation and insertion [lines 104-105]

option group definition [lines 148-152]

5. AcroForm fields filling

Text fields have been enhanced to support automatic appearance update on value change.

New features currently under development that will be available in the next (0.1.0) release:

Cross-reference streams and object streams

Version compatibility check

Content rasterization

Functions

Page data size (a.k.a. How to split a PDF document based on maximum file size)

It’s time to reveal that I decided to consolidate the project’s identity (and simplify your typing life) by changing its namespace prefix from it.stefanochizzolini.clown to the more succinct org.pdfclown: I know you were eager to strip that cluttering Italian identifier!

Last week I was informed that USGS adopted PDF Clown for relayering their topographic maps and attaching metadata to them. Although on a technical note it’s stated that its use will be only transitory, as they are converging toward a solution natively integrated with their main application suite (TerraGo), nonetheless its service in such a production environment seems to be an eloquent demonstration of its reliability. 8)

1. Cross-reference streams and object streams

After lots of requests, I’m currently busy developing cross-reference stream and object stream read/write functionalities [PDF:1.6:3.4.6-7]; in particular, stream reading has been partially based upon the code that Joshua Tauberer wrote some months ago while he was experimenting with PDF Clown for PDF file analysis in his US Congress activity tracker, GovTrack.
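For readers unfamiliar with the format: a cross-reference stream stores fixed-width binary entries of three big-endian fields, whose byte widths are given by the stream's /W array. The first field is the entry type (0 = free, 1 = in use at a byte offset, 2 = compressed inside an object stream) and defaults to 1 when its width is zero. The sketch below decodes such entries from already-unfiltered stream data; XRefStreamSketch and its methods are hypothetical names for this illustration, not the PDF Clown (or Joshua's) implementation.

```java
// Hypothetical sketch of cross-reference stream entry decoding (not PDF Clown code).
final class XRefStreamSketch {
    // Reads a big-endian unsigned integer of `width` bytes; a zero width yields the default.
    static long readField(byte[] data, int offset, int width, long defaultValue) {
        if (width == 0) {
            return defaultValue;
        }
        long value = 0;
        for (int i = 0; i < width; i++) {
            value = (value << 8) | (data[offset + i] & 0xFF);
        }
        return value;
    }

    // Decodes unfiltered stream data into rows of {type, field2, field3}.
    // `w` holds the three field widths taken from the /W entry of the stream dictionary.
    static long[][] decode(byte[] data, int[] w) {
        int entrySize = w[0] + w[1] + w[2];
        if (entrySize <= 0) {
            throw new IllegalArgumentException("Invalid /W widths");
        }
        int count = data.length / entrySize;
        long[][] entries = new long[count][3];
        for (int i = 0; i < count; i++) {
            int pos = i * entrySize;
            entries[i][0] = readField(data, pos, w[0], 1);               // type (default 1)
            entries[i][1] = readField(data, pos + w[0], w[1], 0);        // offset / stream number
            entries[i][2] = readField(data, pos + w[0] + w[1], w[2], 0); // generation / index
        }
        return entries;
    }
}
```

For a type 2 entry, the second field is the number of the object stream that contains the object and the third is the object's index within that stream, which is why the two features are naturally developed together.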

2. Version compatibility check

Working on cross-reference streams induced me to start supporting version-compatibility checking via annotations. This feature lets users transparently verify that the PDF files they are creating or modifying conform to a target PDF version (as specified in the PDF file header) according to a configurable compatibility policy, defined through Document.Configuration.CompatibilityModeEnum. These are the applicable policies:

Passthrough: document’s conformance version is ignored; any feature is accepted without checking its compatibility.

Loose: document’s conformance version is automatically updated to support actually used features.

Strict: document’s conformance version is mandatory; any unsupported feature is forbidden and causes an exception to be thrown in case of attempted use.

Automatic compatibility checking is very handy as users can enforce generated PDF files’ conformance without manual intervention; for example, you don’t have to tweak your PDF file version to 1.5 if you plan to use the optional content functionality (OCG [PDF:1.6:4.10]), just sit back and see it be done!
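The three policies can be pictured with a small, self-contained Java sketch. Only the policy names come from the description above; the Since annotation, the Doc class and the use(…) method are assumptions invented for this illustration and do not reproduce PDF Clown's actual annotation or configuration classes.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical sketch: only the three policy names are taken from the text.
final class CompatibilitySketch {
    enum Mode { PASSTHROUGH, LOOSE, STRICT }

    // Assumed annotation declaring the minimum PDF 1.x version a feature requires.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Since { int minor(); }

    // Optional content (layers) was introduced in PDF 1.5.
    @Since(minor = 5)
    static final class Layer {}

    static final class Doc {
        final Mode mode;
        int versionMinor; // e.g. 4 stands for PDF 1.4

        Doc(Mode mode, int versionMinor) {
            this.mode = mode;
            this.versionMinor = versionMinor;
        }

        // Called whenever a feature is about to be used in the document.
        void use(Class<?> feature) {
            Since since = feature.getAnnotation(Since.class);
            if (since == null || since.minor() <= versionMinor) {
                return; // feature already supported by the document's version
            }
            switch (mode) {
                case PASSTHROUGH:
                    return; // no check at all
                case LOOSE:
                    versionMinor = since.minor(); // silently upgrade the document
                    return;
                default:
                    throw new IllegalStateException(feature.getSimpleName()
                        + " requires PDF 1." + since.minor()
                        + " but the document is PDF 1." + versionMinor);
            }
        }
    }
}
```

In this model, using a Layer in a PDF 1.4 document leaves the version untouched under Passthrough, bumps it to 1.5 under Loose, and throws under Strict.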

3. Content rasterization

I’m quite impressed by how naturally the existing model is integrating with PDF printing and image rasterization functionalities. Leveraging the existing model means that there’s a common infrastructure (see ContentScanner and ContentObject hierarchy) that serves disparate purposes (content creation, content analysis, content extraction, content rasterization, content editing, and so on), simplifying its understanding, use, maintenance and extension. I wanna stress that my goal is to come to an elegant viewer, NOT to a cumbersome retrofit component that’s added as an alien to fill the gap!

Yes, I know these goodies had been outside my official plans for a long time, but during the last week of September, while crawling through the PDF Clown sources, I stumbled upon the above-mentioned ContentScanner and ContentObject hierarchy: I realized they were just ready for supporting content rendering, so I thought “What are we waiting for? Let’s do it!”… but don’t expect that 0.1 will deliver a full-fledged PDF viewer and printer — I’ll start prototyping the most basic graphics primitives such as space coordinates transformations, path drawing, color selection and so on. Advanced operations such as glyph outline drawing will necessarily appear afterwards. Anyway, I’m confident that at the end of the development process it will be possible to print and display PDF pages (and even independent parts of them such as external forms) along with their thumbnails.

The figure below compares an example of PDF Clown’s current rasterization capabilities (on the left, via Java 2D graphics) with its equivalent generated by Adobe Reader (on the right). As you can see, path drawing is highly conformant with the reference implementation, while no text rendering has been implemented yet.

Creating this figure was absolutely trivial; here is the code sample used (line 34 executes the actual rendering of the first page of the document):

As you can see in the following code chunk, the Renderer.render(…) method takes care of preparing the target graphics context [line 31] and then delegates the rendering to the chosen content context [line 32] (that is, in this case, a Page object):
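As a rough illustration of that division of labour (prepare the graphics context, then delegate to the content context), here is a minimal Java 2D sketch. It is not the Renderer source and does not match the line numbers cited above; RenderSketch and ContentContext are hypothetical names standing in for the real classes.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical sketch of the "prepare, then delegate" pattern (not PDF Clown code).
final class RenderSketch {
    // Stand-in for a content context such as a page: it knows how to draw itself.
    interface ContentContext {
        void render(Graphics2D graphics, int width, int height);
    }

    // Prepares an off-screen graphics context and delegates drawing to the content context.
    static BufferedImage render(ContentContext context, int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D graphics = image.createGraphics();
        try {
            graphics.setRenderingHint(
                RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_ON);
            // White page background before any content is drawn.
            graphics.setColor(Color.WHITE);
            graphics.fillRect(0, 0, width, height);
            context.render(graphics, width, height);
        } finally {
            graphics.dispose();
        }
        return image;
    }
}
```

Keeping the renderer ignorant of what it draws is what would allow the same entry point to serve pages, external forms and thumbnails alike.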

5. Page data size (a.k.a. How to split a PDF document based on maximum file size)

org.pdfclown.tools.PageManager has been enhanced with the introduction of an elegant algorithm that accurately calculates the data size of PDF pages taking shared resources (like fonts, images and so on) into account: this practically means that you can evaluate the incremental size of each page in a document, splitting the file when the collected pages reach the maximum file size you intended for your target split PDF files, without creating any cumbersome temporary file!
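The idea of incremental page size can be sketched in plain Java, independently of PDF Clown. Here each page is modelled as a map from object identifier to byte size, and an object shared by several pages is counted only once per output file. PageSplitSketch, its methods and this data model are assumptions made for the illustration; they do not reflect PageManager's actual implementation or API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of size-based splitting (not the PageManager implementation).
final class PageSplitSketch {
    // Size a page adds to a file that already contains the objects in `included`.
    static int incrementalSize(Map<String, Integer> pageObjects, Set<String> included) {
        int size = 0;
        for (Map.Entry<String, Integer> entry : pageObjects.entrySet()) {
            if (!included.contains(entry.getKey())) {
                size += entry.getValue();
            }
        }
        return size;
    }

    // Groups page indices into chunks whose accumulated size stays within maxBytes
    // (a single page larger than maxBytes still gets a chunk of its own).
    static List<List<Integer>> split(List<Map<String, Integer>> pages, int maxBytes) {
        List<List<Integer>> chunks = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        Set<String> included = new HashSet<>();
        int currentSize = 0;
        for (int i = 0; i < pages.size(); i++) {
            Map<String, Integer> page = pages.get(i);
            int increment = incrementalSize(page, included);
            if (!current.isEmpty() && currentSize + increment > maxBytes) {
                // Close the current chunk; shared objects must be counted again in the next one.
                chunks.add(current);
                current = new ArrayList<>();
                included = new HashSet<>();
                currentSize = 0;
                increment = incrementalSize(page, included);
            }
            current.add(i);
            included.addAll(page.keySet());
            currentSize += increment;
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }
}
```

For example, three pages that each carry 50 bytes of their own content plus the same 100-byte font, split with a 220-byte limit, end up as two files: pages 0 and 1 together (the font is counted once, 200 bytes in total) and page 2 alone.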