1. Content stream manipulation

Since its very inception, I have been delighted by the concept behind the ContentScanner class, which proved to be a versatile processor for content stream object trees and their graphics state: you could use it directly to read existing content streams, modify them and create new ones in a convenient object-oriented fashion, or plug it into specialized tools (e.g. PrimitiveComposer, TextExtractor, Renderer) for more advanced applications.

But until version series 0.1.x it suffered from a significant drawback: it lacked separation of concerns from its object model, that is, the algorithmic responsibility for carrying out the tasks was delegated to the content stream operations themselves. This may work well when there’s just a single task (“read/write the content stream”), but when further tasks are required (e.g. rendering the content stream into a graphics context) it rapidly becomes unmanageable.

Therefore I proceeded with a massive refactoring which was informed by two main concurrent requirements: algorithmic separation between process and structure (accomplished through the classic Visitor pattern) and preservation of the distinctive cursor-based behavior of ContentScanner (solved through dedicated impedance-matching logic).
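To make the separation between process and structure concrete, here is a minimal sketch of the Visitor idea in plain Java. Every type name in it (Node, Visitor, ShowText, LocalState, TextCollector, DepthMeter) is hypothetical and chosen for illustration only; none of them is PDF Clown's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: none of these types belong to PDF Clown's API. */
final class VisitorSketch {
  /** Process side: one callback per node type. */
  interface Visitor<R> {
    R visitShowText(ShowText operation);
    R visitLocalState(LocalState block);
  }

  /** Structure side: a node only knows how to dispatch itself to a visitor. */
  interface Node {
    <R> R accept(Visitor<R> visitor);
  }

  /** Leaf node standing in for a text-showing operation. */
  static final class ShowText implements Node {
    final String text;
    ShowText(String text) { this.text = text; }
    @Override public <R> R accept(Visitor<R> visitor) { return visitor.visitShowText(this); }
  }

  /** Composite node standing in for a save/restore graphics state block. */
  static final class LocalState implements Node {
    final List<Node> children;
    LocalState(Node... children) { this.children = Arrays.asList(children); }
    @Override public <R> R accept(Visitor<R> visitor) { return visitor.visitLocalState(this); }
  }

  /** First task: collect the shown text in document order. */
  static final class TextCollector implements Visitor<Void> {
    final List<String> texts = new ArrayList<>();
    @Override public Void visitShowText(ShowText operation) {
      texts.add(operation.text);
      return null;
    }
    @Override public Void visitLocalState(LocalState block) {
      for (Node child : block.children) child.accept(this);
      return null;
    }
  }

  /** Second task, added without touching the node classes: nesting depth. */
  static final class DepthMeter implements Visitor<Integer> {
    @Override public Integer visitShowText(ShowText operation) { return 0; }
    @Override public Integer visitLocalState(LocalState block) {
      int deepest = 0;
      for (Node child : block.children) deepest = Math.max(deepest, child.accept(this));
      return 1 + deepest;
    }
  }

  /** Small two-level tree used by the demo below. */
  static Node sampleTree() {
    return new LocalState(new ShowText("Hello"), new LocalState(new ShowText("World")));
  }

  public static void main(String[] args) {
    Node tree = sampleTree();
    TextCollector collector = new TextCollector();
    tree.accept(collector);
    System.out.println(collector.texts + " depth=" + tree.accept(new DepthMeter()));
  }
}
```

The point of the design is that a new task (here DepthMeter) is a new class on the process side, while the node classes stay untouched.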

All the non-core functionalities which were bloating the original ContentScanner (like rendering and content wrappers) have been extracted into specialized processors (respectively: ContentRenderer and ContentModeller), resulting in the following classes:

ContentVisitor: abstract content stream processor supporting all the common graphics state transformations;

1.1. ContentScanner

ContentScanner’s new implementation focuses exclusively on its core purpose, that is, to enable users to manipulate content streams through their low-level, procedural, stacked model (operations and composite objects along with their graphics state).

1.2. ContentModeller

ContentModeller works as a parser which maps the low-level content stream model to its high-level, declarative, flat representation through a dedicated model rooted in the GraphicsElement abstract class (which corresponds to the GraphicsObjectWrapper hierarchy of ContentScanner’s old implementation). This simplified-yet-equivalent representation can be modified and saved back into the content stream.

Let’s see a practical example of the flexibility delivered by the new renderer: suppose that you have a multi-layered (OCG) document and you would like to selectively render only the contents belonging to a specific layer. To accomplish this, you can subclass ContentRenderer and tweak the drawing switch according to your own logic:
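As a rough illustration of that idea, the following self-contained sketch shows a renderer with an overridable drawing switch and a subclass that lets through only one layer. Renderer, Element, isDrawable and SingleLayerRenderer are hypothetical names invented for this sketch; it does not reproduce ContentRenderer's real members.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: these types only mimic the idea, they are not PDF Clown's API. */
final class LayerFilterSketch {
  /** A drawable piece of content, optionally tagged with a layer name. */
  static final class Element {
    final String layer; // null when the content belongs to no layer
    final String content;
    Element(String layer, String content) {
      this.layer = layer;
      this.content = content;
    }
  }

  /** Base renderer exposing an overridable drawing switch. */
  static class Renderer {
    /** Decides whether an element gets drawn; everything is drawn by default. */
    protected boolean isDrawable(Element element) { return true; }

    /** Returns the contents actually drawn (a stand-in for real painting). */
    final List<String> render(List<Element> elements) {
      List<String> drawn = new ArrayList<>();
      for (Element element : elements) {
        if (isDrawable(element)) drawn.add(element.content);
      }
      return drawn;
    }
  }

  /** Renders only the contents belonging to one layer. */
  static final class SingleLayerRenderer extends Renderer {
    private final String layer;
    SingleLayerRenderer(String layer) { this.layer = layer; }
    @Override protected boolean isDrawable(Element element) {
      return layer.equals(element.layer);
    }
  }

  /** Sample page mixing unlayered content with two layers. */
  static List<Element> samplePage() {
    return Arrays.asList(
        new Element(null, "page background"),
        new Element("Map", "roads"),
        new Element("Notes", "annotations"),
        new Element("Map", "rivers"));
  }

  public static void main(String[] args) {
    System.out.println(new SingleLayerRenderer("Map").render(samplePage()));
  }
}
```

In this sketch, content outside any layer is skipped as well; whether unlayered content should still be drawn is a policy choice left to the subclass.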

The following sample (from an old brochure of the Natural Tunnel State Park, Virginia) demonstrates how the renderer has evolved since its pre-alpha stage: text-showing operations have been temporarily implemented through substitute fonts emulating the styles (italic, bold, regular…) of the actual ones; this trick works nicely for thumbnail generation. The next step will address full-size rendering quality by adding support for glyph outlines.

The substitute fonts also seem to work quite well for non-Latin Unicode characters (as mapped on Ubuntu GNU/Linux):

First page of the Universal Declaration of Human Rights, Arabic translation, as rendered by PDF Clown

First page of the Universal Declaration of Human Rights, Chinese translation, as rendered by PDF Clown

2. Content composition engine

PDF Clown 0.2.0 introduces the much-requested keystone of its content composition stack: DocumentComposer class. This engine features a layout model inspired by a distilled, meaningful subset of the HTML+CSS ruleset.

Its high-level typographic model (columns, sections, paragraphs, tables and so on) is laid out leveraging the existing lower-level functionalities provided by BlockComposer (paragraph typesetting) and GraphicsComposer (primitive graphics instructions — previously named PrimitiveComposer), the latter of which in turn sits upon the above-mentioned ContentScanner for feeding into the content stream (IContentContext).

PDF Clown’s content composition stack

This subject is massively broad, so here I’m going to give you just a few highlights of its features (development is currently underway; I’ll add more details as it advances):

PDF Clown, according to the CSS3 specification, automatically balances the column heights, that is, it sets the maximum column height so that the heights of the content in each column are approximately equal. This is possible because of a powerful simulation algorithm which ensures an accurate arrangement. Should the content exceed the available height on the paged medium, it would automatically flow into the next page.
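One way to picture such a balancing simulation is sketched below: lines are poured into columns for a trial height, and a binary search finds the smallest height at which they fit into the requested number of columns. This is only an illustrative sketch under simplifying assumptions (integer line heights, unbreakable lines, at least one column); the class and method names are hypothetical and it is not PDF Clown's actual algorithm.

```java
/** Hypothetical sketch of column balancing; not PDF Clown's actual algorithm. */
final class ColumnBalanceSketch {
  /**
   * Simulates filling columns top to bottom and counts how many are needed.
   * Assumes columnHeight is at least as tall as the tallest line.
   */
  static int columnsNeeded(int[] lineHeights, int columnHeight) {
    int columns = 1;
    int used = 0;
    for (int height : lineHeights) {
      if (used + height > columnHeight) { // line does not fit: open the next column
        columns++;
        used = 0;
      }
      used += height;
    }
    return columns;
  }

  /** Smallest column height at which the lines fit into columnCount (>= 1) columns. */
  static int balancedHeight(int[] lineHeights, int columnCount) {
    int low = 0;  // lower bound: the tallest single line
    int high = 0; // upper bound: one column holding everything
    for (int height : lineHeights) {
      low = Math.max(low, height);
      high += height;
    }
    while (low < high) {
      int middle = low + (high - low) / 2;
      if (columnsNeeded(lineHeights, middle) <= columnCount) {
        high = middle;    // fits: try a shorter column
      } else {
        low = middle + 1; // overflows: the column must be taller
      }
    }
    return low;
  }

  /** Caps the balanced height to the space left on the page; the excess would flow onward. */
  static int columnHeightOnPage(int[] lineHeights, int columnCount, int availableHeight) {
    return Math.min(balancedHeight(lineHeights, columnCount), availableHeight);
  }

  public static void main(String[] args) {
    int[] lines = {12, 12, 12, 12, 12, 12, 12, 12, 12, 12};
    System.out.println("balanced height: " + balancedHeight(lines, 2));
    System.out.println("height on a short page: " + columnHeightOnPage(lines, 2, 48));
  }
}
```

The search is valid because the number of columns needed never increases as the trial height grows.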

import org.pdfclown.documents.Document;
import org.pdfclown.documents.contents.composition.*;
import org.pdfclown.documents.contents.fonts.StandardType1Font;
import org.pdfclown.util.math.geom.Dimension;
. . .
DocumentComposer composer = new DocumentComposer(document);
/*
NOTE: Composer's style is at the root of the style model, that is, its definitions
are inherited by the descending elements, analogously to the style of BODY element
in HTML DOM.
*/
composer.getStyle()
.withTextAlign(XAlignmentEnum.Justify)
.withFontSize(new Length(12));
/*
NOTE: Element type styles are analogous to CSS styles defined through element type
selectors.
*/
composer.getStyle(Paragraph.class)
.withTextIndent(new Length(10));
composer.getStyle(Heading.class)
.withMargin(new QuadLength(0, 0, 10, 0));
/*
NOTE: Styles can be defined analogously to CSS class definitions and can be derived
analogously to Less mixins (http://lesscss.org/).
*/
Style strongStyle = new Style("strong")
.withFont(new Font(new StandardType1Font(document, StandardType1Font.FamilyEnum.Times, true, false), null));
Style emStyle = new Style("em")
.withFont(new Font(new StandardType1Font(document, StandardType1Font.FamilyEnum.Times, false, true), null));
Style noteStyle = new Style("note")
.withBorder(new Border(
null,
new QuadBorderStyle(BorderStyleEnum.Solid, BorderStyleEnum.None, BorderStyleEnum.None, BorderStyleEnum.None),
new QuadLength(.1, 0, 0, 0),
null))
.withFont(new Font(null, 6d))
.withMargin(new QuadLength(30, 0, 0, 0))
.withPadding(new QuadLength(5, 0, 0, 0))
.withTextAlign(XAlignmentEnum.Left)
.withTextIndent(new Length(0));
Style superStyle = new Style("super")
.withFont(new Font(null, 6.5d))
.withVerticalAlign(LineAlignmentEnum.Super);
Section section = new Section("Hello World, this is PDF Clown!");
/*
NOTE: Group is a typographic element analogous to DIV element in HTML DOM.
*/
Group group = new Group(
new Image(
new Style()
.withFloat(FloatEnum.Left)
.withMargin(new QuadLength(new Length(5)))
.withSize(new Dimension(100,0)),
document,
"Clown.jpg"
),
new Paragraph(
new Text("PDF Clown's layout engine supports the "),
new Text(strongStyle, "multi-column layout model"),
new Text(" described by the CSS3 specification"),
new Text(superStyle, "[1]"),
new Text(" which extends the block layout mode to allow the easy definition of multiple columns "
+ "of text (and any other kind of content, like tables, images, lists and so on).")
),
new Paragraph(
new Text("PDF Clown, according to the CSS3 specification, "),
new Text(emStyle, "automatically balances the column heights"),
new Text(", i.e., it sets the maximum column height so that the heights of the content in each column "
+ "are approximately equal. This is possible because of a powerful simulation algorithm which ensures "
+ "an accurate arrangement. Should the content exceed the available height on the paged medium, it "
+ "would automatically flow into the next page.")
),
new Paragraph(
new Text("Columns can be defined by count (number of columns desired), width (minimum column width desired)"
+ " or both: in any case, the official CSS3 pseudo-algorithm is applied"),
new Text(superStyle, "[2]"),
new Text(". If you are interested in further info about CSS multi-column layouts, I recommend you to see "
+ "Mozilla's documentation for a great introduction to CSS Multi-column Layout Module"),
new Text(superStyle, "[3]"),
new Text(".")
),
new Paragraph(noteStyle,
new Text("1. http://www.w3.org/TR/css3-multicol/\n"
+ "2. http://www.w3.org/TR/css3-multicol/#pseudo-algorithm\n"
+ "3. https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Using_multi-column_layouts")
)
);
/*
NOTE: This is the declarative CSS3-equivalent style which prescribes the layout engine to treat
this group as a multi-column block (in this case: 2 columns with a 14-point gap between).
*/
group.getStyle().withColumn(new Column(2, new Length(14)));
section.add(group);
composer.show(section);
composer.close();

Honoring the KISS principle, all the magic here is done by a minimal declaration (the withColumn call near the end of the listing above) which, analogously to the CSS fragment {column-count:2; column-gap:14pt;}, instructs PDF Clown’s layout engine to render the content group as a multi-column block:

group.getStyle().withColumn(new Column(2, new Length(14)));

Comparing this neat solution to a renowned library like iText, some awkward shortcomings emerge in the way iText deals with multi-column layout: the com.itextpdf.text.pdf.ColumnText class works as a dedicated processor outside the common declarative pattern (i.e., you cannot directly feed a column-aware element into the document as you do for tables, paragraphs and so on). In my opinion, that’s a really bad thing: a well-designed layout engine should hide those implementation details and carry out its duties transparently; you just feed the contents and it takes care to do the right thing according to their properties and inherent behaviors. Such a crippled layout model forces users into ridiculous bends and twists just to get contents in place! Let’s examine a few of them:

awful treatment of column intrusions: iText requires you to explicitly define the shape of your columns (sic!), distinguishing between “simple” (rectangularly-bound) and “irregular” (arbitrarily-shaped) columns, the latter forcing you to tediously specify each vertex…

adaptive column intrusion detection: PDF Clown’s layout engine keeps track of absolutely-positioned elements and its block composer takes care to automatically flow content around those already-occupied areas. All you have to do is add content the way shown in the code example above; no convoluted coding here!

smooth, consistent separation between content and layout models: in PDF Clown, layout processing is DocumentComposer’s business, while content definition is the user’s. Multi-column layout is just another style property of your contents, not a strange beast to wrestle with!
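The intrusion detection mentioned above boils down to a geometric question: for a given line band inside a column, which horizontal stretches are still free once the already-occupied areas are subtracted? The sketch below answers it with a simple interval sweep. Box and freeSegments are hypothetical names used for illustration; the engine's real bookkeeping is not shown in this post and may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of flowing a text line around occupied areas. */
final class FlowAroundSketch {
  /** Axis-aligned occupied area (for instance, a floated image). */
  static final class Box {
    final double x, y, width, height;
    Box(double x, double y, double width, double height) {
      this.x = x;
      this.y = y;
      this.width = width;
      this.height = height;
    }
  }

  /**
   * Returns the free [start, end] segments of the band [top, top + height]
   * within the column [left, right], after subtracting the occupied boxes.
   */
  static List<double[]> freeSegments(
      double left, double right, double top, double height, List<Box> occupied) {
    // Keep only the boxes crossing the band, clipped to the column bounds.
    List<double[]> blocked = new ArrayList<>();
    for (Box box : occupied) {
      boolean crossesBand = box.y < top + height && box.y + box.height > top;
      if (!crossesBand) continue;
      double start = Math.max(left, box.x);
      double end = Math.min(right, box.x + box.width);
      if (start < end) blocked.add(new double[] {start, end});
    }
    blocked.sort((double[] a, double[] b) -> Double.compare(a[0], b[0]));

    // Sweep left to right, emitting the gaps between blocked intervals.
    List<double[]> free = new ArrayList<>();
    double cursor = left;
    for (double[] segment : blocked) {
      if (segment[0] > cursor) free.add(new double[] {cursor, segment[0]});
      cursor = Math.max(cursor, segment[1]);
    }
    if (cursor < right) free.add(new double[] {cursor, right});
    return free;
  }

  public static void main(String[] args) {
    List<Box> occupied = Arrays.asList(new Box(0, 0, 50, 100));
    for (double[] segment : freeSegments(0, 200, 0, 12, occupied)) {
      System.out.println(segment[0] + " .. " + segment[1]);
    }
  }
}
```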

NOTE: iText is indisputably a powerful library; my criticism is limited to a specific aspect of its composition model and represents nothing but my opinion: it will be up to the users to decide what’s good for them once the library is released.

As I said, multi-column layout is just a little treat in a full-fledged layout engine… PDF Clown is maturing: in the coming weeks new technical details, code snippets and announcements will appear here. Stay tuned to its Twitter stream!

2.2. Tables

I know many of you have long craved native table composition support in PDF Clown… So, here we go: fully styleable, rowspan/colspan-enabled, arbitrarily nestable… really sweet!

Element construction: every composition element features a uniform set of constructors designed for compact definition. Here is their parameter pattern:
where style is the element’s style (either custom or class) and children are the elements contained by the element.

Style: the resolved style of each element is a combination of multiple styles:

2.4. Page Breaks

PDF Clown 0.2.0 supports CSS-like page breaks.

Page breaks sample generated by PDF Clown

And this is the corresponding code (the withPageBreakAfter and withPageBreakBefore calls apply the page breaks):

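import org.pdfclown.documents.Document;
import org.pdfclown.documents.contents.colors.DeviceRGBColor;
import org.pdfclown.documents.contents.composition.*;
import org.pdfclown.documents.contents.fonts.StandardType1Font;
import org.pdfclown.documents.contents.fonts.StandardType1Font.FamilyEnum;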
. . .
DocumentComposer composer = new DocumentComposer(document);
/*
We decide that table cells sport a solid border by default (analogous
to CSS styles defined through an element type selector).
*/
composer.getStyle(Cell.class)
.withBorder(new Border(
new QuadColor(new DeviceRGBColor(0, 0, 0)),
new QuadBorderStyle(BorderStyleEnum.Solid),
new QuadLength(new Length(1)),
new QuadCornerRadius()))
.withPadding(new QuadLength(new Length(5)));
/*
We decide to highlight inline code references through a dedicated style.
*/
Style codeStyle = new Style("code")
.withBackgroundColor(new DeviceRGBColor(1, 1, 0))
.withFontType(new StandardType1Font(document, FamilyEnum.Courier, false, false));
/*
The contents will be included in a section.
*/
Section section = new Section("Hello World, this is PDF Clown!");
section.add(
new Paragraph(
new Text("This paragraph is the last content on this page as its next sibling is marked with CSS-like "),
new Text(codeStyle, "page-break-before: always"),
new Text(". Clean and simple!")
)
);
section.add(
new Group(
new Paragraph(
/*
Here it is the custom style applied to the isolated paragraph.
*/
new Style()
.withPageBreakAfter(PageBreakEnum.Always)
.withPageBreakBefore(PageBreakEnum.Always),
new Text("This paragraph is isolated on this page as we marked it with both CSS-like "),
new Text(codeStyle, "page-break-before: always"),
new Text(" and "),
new Text(codeStyle, "page-break-after: always")
),
new Table(
new Row(
new Cell("Cell1,1"),
new Cell("Cell1,2"),
new Cell("Cell1,3"),
new Cell("Cell1,4")
),
new Row(
new Cell("Cell2,1"),
new Cell("Cell2,2").withColSpan(2),
new Cell("Cell2,4")
)
)
)
);
composer.show(section);
composer.close();

This is a demonstration of some of the fine typesetting capabilities of the new layout engine of PDF Clown (the code which generated the sample shown above is listed below):

composition event listener: DocumentComposer notifies a dedicated listener (DocumentComposer.DocumentListener) of its relevant events, so you can apply custom logic when the engine requires a new page (onContextInit), begins composing the current page (onContextBegin), finishes composing the current page (onContextEnd) and so on. In this demonstration (see the onContextEnd override in the code below) a margin note is added in reaction to the end of the page layout.

drop caps: stylish initial letters work like a charm; you just need to float your letter to the left and choose its font and size at will (see firstLetterStyle in the code below).

vertical fill property: have you ever found yourself trying to convince your traditional horizontally-flowing layout engine (like the HTML-based ones) to automatically place, for example, a paragraph aligned to the bottom of a page (like footnotes), or to center a title in the middle of a page? That’s often a somewhat tricky and brave deed, which typically results in some inglorious coding gymnastics, stretching here and there, or resorting to the awkward and infamous tables… PDF Clown features a specific style property (VerticalFill) which addresses this kind of situation in the cleanest and simplest way, vertically stretching the element box to cover the whole usable page area. In this demonstration (see the withVerticalFill call in the code below) the paragraphs following the title are aligned to the bottom of the page.

import java.awt.Dimension;
import java.awt.geom.*;
import org.pdfclown.documents.Document;
import org.pdfclown.documents.contents.composition.*;
. . .
DocumentComposer composer = new DocumentComposer(
/*
NOTE: The composer works along with a listener whose event
callbacks can be customized.
If you don't need any customization, you pass your document
variable directly to the DocumentComposer constructor (behind
the scenes it instantiates the default listener implementation).
*/
new DocumentComposer.DocumentListener(document)
{
/*
'onContextEnd' notifies that the layout on the current page
has ended.
*/
@Override
public void onContextEnd(
Event event
)
{
showMarginNote(event.getSource());
super.onContextEnd(event);
}
private void showMarginNote(
DocumentComposer composer
)
{
/*
NOTE: In this example, we decided that when the page ends,
a vertically-oriented note is placed on the right margin.
NOTE: This lower-level construct (which works directly with
BlockComposer) will be replaced by high-level elements
(paragraphs) as soon as absolute positioning will be available.
*/
BlockComposer block = composer.getBaseComposer();
GraphicsComposer graphics = block.getBaseComposer();
Dimension2D pageSize = composer.getContext().getSize();
Style pageStyle = composer.getStyle();
graphics.beginLocalState();
graphics.rotate(
90,
new Point2D.Double(
pageSize.getWidth()
- pageStyle.getMargin().getRight().getValue(),
pageSize.getHeight()
- pageStyle.getMargin().getBottom().getValue() / 2
)
);
block.begin(
new Rectangle2D.Double(0, 0,
pageSize.getHeight() / 2,
pageStyle.getMargin().getRight().getValue()),
XAlignmentEnum.Left,
YAlignmentEnum.Middle
);
graphics.setFont(composer.getStyle(null).getFontType(), 8);
block.showText("Generated by PDF Clown on " + new java.util.Date());
block.showBreak();
block.showText("For more info, visit http://www.pdfclown.org");
block.end();
graphics.end();
}
}
);
// Style definition.
composer.getStyle()
.withLineSpace(new Length(0))
.withMargin(new QuadLength(new Length(50)));
composer.getStyle(null)
.withFont(new Font(
org.pdfclown.documents.contents.fonts.Font.get(
document,
"TravelingTypewriter.otf"),
14))
.withTextAlign(XAlignmentEnum.Justify);
composer.getStyle(Paragraph.class)
.withMargin(new QuadLength(8, 0, 0, 0))
.withTextIndent(new Length(24));
org.pdfclown.documents.contents.fonts.Font decorativeFont =
org.pdfclown.documents.contents.fonts.Font.get(
document,
"Ruritania-Outline.ttf");
composer.getStyle(Heading.class)
.withFont(new Font(decorativeFont, 56))
.withLineSpace(new Length(.25, UnitModeEnum.Relative));
Style firstLetterStyle = new Style("firstLetter")
.withFloat(FloatEnum.Left)
.withFont(new Font(decorativeFont, new Length(2, UnitModeEnum.Relative)))
.withMargin(new QuadLength(0, 5, 0, 0));
// Content insertion.
Section section = new Section(
new Heading(
new Text("Chapter 1"),
new Text(
new Style().withFontSize(new Length(32)),
"\nDown the Rabbit-Hole"
)
),
new Group(
new Style()
.withVerticalAlign(LineAlignmentEnum.Bottom)
.withVerticalFill(VerticalFillEnum.FirstPage),
new Paragraph(
new Style().withTextIndent(new Length(0)),
new Text(firstLetterStyle, "A"),
new Text("lice was beginning to get very tired of sitting "
+ "by her sister on the bank, and of having nothing to do: "
+ "once or twice she had peeped into the book her sister "
+ "was reading, but it had no pictures or conversations in "
+ "it, 'and what is the use of a book,' thought Alice "
+ "'without pictures or conversation?'")
),
new Image(
new Style()
.withFloat(FloatEnum.Right)
.withMargin(new QuadLength(new Length(5)))
.withSize(new Dimension(0,250)),
document,
"alice_white_rabbit.jpg"
),
new Paragraph("So she was considering in her own mind (as well "
+ "as she could, for the hot day made her feel very sleepy and "
+ "stupid), whether the pleasure of making a daisy-chain would "
+ "be worth the trouble of getting up and picking the daisies, "
+ "when suddenly a White Rabbit with pink eyes ran close by her."),
new Paragraph("There was nothing so VERY remarkable in that; nor "
+ "did Alice think it so VERY much out of the way to hear the "
+ "Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' "
+ "(when she thought it over afterwards, it occurred to her that "
+ "she ought to have wondered at this, but at the time it all "
+ "seemed quite natural); but when the Rabbit actually TOOK A "
+ "WATCH OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then "
+ "hurried on, Alice started to her feet, for it flashed across "
+ "her mind that she had never before seen a rabbit with either a "
+ "waistcoat-pocket, or a watch to take out of it, and burning with "
+ "curiosity, she ran across the field after it, and fortunately "
+ "was just in time to see it pop down a large rabbit-hole under the "
+ "hedge.")
)
);
composer.show(section);
composer.close();

Layout areas revealed

The layout process works by balancing competing constraints: the picture above reveals how this composition takes place (for each content element, the gray dashed shape represents the potential frame while the green shape represents the actually-occupied area).

3. Form flattening

A request from a user on Stack Overflow prompted the implementation of an AcroForm flattener, which converts field annotations into static representations for content consolidation. Here is an example of its use:

5. Automated object stream compression

Object streams [PDF:1.7:3.4.6] and cross-reference streams [PDF:1.7:3.4.7] have been switched from manual to automatic compression: until version 0.1.2.0, full PDF compression relied on the client’s choice of which data objects to aggregate into object streams; now the whole process is transparent to the client and affects all the legally-compressible data objects.
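What “legally-compressible” means can be sketched as a simple predicate. The PDF 1.7 rules for object streams exclude stream objects, objects whose generation number is not zero, the document’s encryption dictionary, and the object holding the Length value of an object stream dictionary. The IndirectObject class and its fields below are hypothetical stand-ins rather than PDF Clown types, and the additional restrictions for linearized files are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of object stream eligibility; not PDF Clown's API. */
final class ObjectStreamEligibilitySketch {
  /** Minimal description of an indirect object, just enough for the rules. */
  static final class IndirectObject {
    final int number;
    final int generation;
    final boolean stream;               // the object is itself a stream
    final boolean encryptionDictionary; // the document's Encrypt dictionary
    final boolean objectStreamLength;   // Length value of an object stream dictionary

    IndirectObject(int number, int generation, boolean stream,
        boolean encryptionDictionary, boolean objectStreamLength) {
      this.number = number;
      this.generation = generation;
      this.stream = stream;
      this.encryptionDictionary = encryptionDictionary;
      this.objectStreamLength = objectStreamLength;
    }
  }

  /** Applies the exclusions listed for object streams (linearization rules omitted). */
  static boolean isCompressible(IndirectObject object) {
    return !object.stream
        && object.generation == 0
        && !object.encryptionDictionary
        && !object.objectStreamLength;
  }

  /** Object numbers that may be aggregated into object streams. */
  static List<Integer> compressibleNumbers(List<IndirectObject> objects) {
    List<Integer> numbers = new ArrayList<>();
    for (IndirectObject object : objects) {
      if (isCompressible(object)) numbers.add(object.number);
    }
    return numbers;
  }

  /** Sample objects covering each exclusion once. */
  static List<IndirectObject> sampleObjects() {
    return Arrays.asList(
        new IndirectObject(1, 0, false, false, false), // plain dictionary
        new IndirectObject(2, 0, true, false, false),  // content stream
        new IndirectObject(3, 1, false, false, false), // reused object number
        new IndirectObject(4, 0, false, true, false),  // encryption dictionary
        new IndirectObject(5, 0, false, false, false), // plain array
        new IndirectObject(6, 0, false, false, true)); // object stream Length
  }

  public static void main(String[] args) {
    System.out.println(compressibleNumbers(sampleObjects()));
  }
}
```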

