New features currently under development that will be available in the next (0.1.0) release:

Cross-reference streams and object streams

Version compatibility check

Content rasterization

Functions

Page data size (a.k.a. How to split a PDF document based on maximum file size)

It’s time to reveal that I decided to consolidate the project’s identity (and simplify your typing life) by changing its namespace prefix from it.stefanochizzolini.clown to the more succinct org.pdfclown: I know you were eager to strip that cluttering Italian identifier! 😉

Last week I was informed that USGS adopted PDF Clown for relayering their topographic maps and attaching metadata to them. Although a technical note states that its use will be only transitory, as they are converging toward a solution natively integrated with their main application suite (TerraGo), its service in such a production environment seems to be an eloquent demonstration of its reliability. 8)

1. Cross-reference streams and object streams

After lots of requests, I’m currently busy developing cross-reference stream and object stream read/write functionalities [PDF:1.6:3.4.6-7]; in particular, stream reading has been partially based upon the code that Joshua Tauberer wrote some months ago while he was experimenting with PDF Clown for PDF file analysis in his US Congress activity tracker, GovTrack.

2. Version compatibility check

Working on cross-reference streams prompted me to start supporting version-compatibility checking via annotations. This feature lets users transparently ensure that the PDF files they create or modify conform to a target PDF version (as specified in the PDF file header) according to a configurable compatibility policy, defined through Document.Configuration.CompatibilityModeEnum. These are the alternative policies applicable:

Passthrough: document’s conformance version is ignored; any feature is accepted without checking its compatibility.

Loose: document’s conformance version is automatically updated to support actually used features.

Strict: document’s conformance version is mandatory; any unsupported feature is forbidden and causes an exception to be thrown in case of attempted use.

Automatic compatibility checking is very handy, as users can enforce generated PDF files’ conformance without manual intervention; for example, you don’t have to tweak your PDF file version to 1.5 if you plan to use the optional content functionality (OCG [PDF:1.6:4.10]): just sit back and watch it happen! 🙂
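To make the three policies concrete, here is a minimal sketch of how such a check might behave. It is an assumption-laden illustration, not PDF Clown’s implementation: the class CompatibilitySketch, its Doc stand-in and the require(…) method are hypothetical names, versions are stored as plain integers (1.4 becomes 14), and the annotation-based mechanism mentioned above is not reproduced.

```java
/** Hypothetical sketch: names and structure are illustrative, not PDF Clown's API. */
final class CompatibilitySketch {
    enum Mode { PASSTHROUGH, LOOSE, STRICT }

    /** Stand-in for a document that only tracks its header version (1.4 is stored as 14). */
    static final class Doc {
        int version;
        final Mode mode;

        Doc(int version, Mode mode) {
            this.version = version;
            this.mode = mode;
        }

        /** Call before using a feature that needs at least the given PDF version. */
        void require(String feature, int requiredVersion) {
            if (mode == Mode.PASSTHROUGH || requiredVersion <= version) {
                return; // nothing to check, or the document is already compatible
            }
            if (mode == Mode.LOOSE) {
                version = requiredVersion; // silently raise the document version
                return;
            }
            // STRICT: the target version is mandatory, so the feature is refused.
            throw new IllegalStateException(feature + " requires PDF version "
                + requiredVersion / 10 + "." + requiredVersion % 10);
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc(14, Mode.LOOSE);
        doc.require("optional content", 15);
        System.out.println(doc.version); // prints 15
    }
}
```

With the Strict policy the same call would throw instead of upgrading, and with Passthrough the version would stay at 1.4.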

3. Content rasterization

I’m quite impressed by how naturally the existing model is integrating with PDF printing and image rasterization functionalities. Leveraging the existing model means that there’s a common infrastructure (see ContentScanner and ContentObject hierarchy) that serves disparate purposes (content creation, content analysis, content extraction, content rasterization, content editing, and so on), simplifying its understanding, use, maintenance and extension. I want to stress that my goal is to come to an elegant viewer, NOT to a cumbersome retrofit component that’s added as an alien to fill the gap! 😉

Yes, I know these goodies had been outside my official plans for a long time, but during the last week of September, while crawling through the PDF Clown sources, I stumbled upon the above-mentioned ContentScanner and ContentObject hierarchy: I realized they were just ready for supporting content rendering, so I thought “What are we waiting for? Let’s do it!”… but don’t expect that 0.1 will deliver a full-fledged PDF viewer and printer — I’ll start prototyping the most basic graphics primitives such as space coordinates transformations, path drawing, color selection and so on. Advanced operations such as glyph outline drawing will necessarily appear afterwards. Anyway, I’m confident that at the end of the development process it will be possible to print and display PDF pages (and even independent parts of them such as external forms) along with their thumbnails.

The figure below compares an example of PDF Clown’s current rasterization capabilities (on the left, via Java 2D graphics) with its equivalent generated by Adobe Reader (on the right). As you can see, path drawing is highly conformant with the reference implementation, while no text rendering has been implemented yet.

Creating this figure was absolutely trivial; here is the code sample used (line 34 executes the actual rendering of the first page of the document):

As you can see in the following code chunk, the Renderer.render(…) method takes care of preparing the target graphics context [line 31] and then delegates rendering to the chosen content context [line 32] (that is, in this case, a Page object):
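As a rough illustration of that prepare-then-delegate pattern, the following sketch uses only standard Java 2D classes. It is not PDF Clown’s Renderer: RenderSketch, its ContentContext interface and the demo() helper are hypothetical names introduced here, and the only assumption about PDF is that user space has its origin at the bottom-left with the y axis pointing up.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.geom.Rectangle2D;
import java.awt.image.BufferedImage;

/** Hypothetical sketch of "prepare the graphics context, then delegate rendering". */
final class RenderSketch {
    /** Stand-in for anything that can draw itself, such as a page. */
    interface ContentContext {
        double width();
        double height();
        void render(Graphics2D g);
    }

    static BufferedImage render(ContentContext context, double scale) {
        int w = (int) Math.ceil(context.width() * scale);
        int h = (int) Math.ceil(context.height() * scale);
        BufferedImage image = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            // Prepare the target graphics context: quality hints and a white background.
            g.setRenderingHint(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_ON);
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, w, h);
            // Flip the y axis: Java 2D puts the origin at the top-left with y pointing down.
            g.translate(0, h);
            g.scale(scale, -scale);
            // Delegate the actual drawing to the content context.
            context.render(g);
        } finally {
            g.dispose();
        }
        return image;
    }

    /** Renders a 100x100 "page" with a red square in its bottom-left corner. */
    static BufferedImage demo() {
        return render(new ContentContext() {
            public double width() { return 100; }
            public double height() { return 100; }
            public void render(Graphics2D g) {
                g.setColor(Color.RED);
                g.fill(new Rectangle2D.Double(0, 0, 50, 50));
            }
        }, 1.0);
    }
}
```

Because the flip is applied once in the renderer, the content context can keep drawing in page coordinates: the square drawn at (0, 0) ends up in the lower-left quadrant of the image.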

5. Page data size (a.k.a. How to split a PDF document based on maximum file size)

org.pdfclown.tools.PageManager has been enhanced with an elegant algorithm that accurately calculates the data size of PDF pages, taking shared resources (like fonts, images and so on) into account. In practice, this means that you can evaluate the incremental size of each page in a document and split the file when the collected pages reach the maximum file size you intended for your target split PDF files, without creating any cumbersome temporary file!
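The idea of incremental page size can be sketched as follows. This is a simplified illustration under stated assumptions, not the PageManager algorithm itself: PageSplitSketch, PageInfo and all the sizes are hypothetical, each page is reduced to its own content size plus a map of the shared resources it references, and a shared resource is counted only the first time it appears within a split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of size-based splitting with shared resources counted once per split. */
final class PageSplitSketch {
    static final class PageInfo {
        final long contentSize;            // bytes used by the page's own content
        final Map<String, Long> resources; // shared resource id -> size in bytes

        PageInfo(long contentSize, Map<String, Long> resources) {
            this.contentSize = contentSize;
            this.resources = resources;
        }
    }

    /** Size the page adds to a split that already contains the resources in 'seen'. */
    static long incrementalSize(PageInfo page, Set<String> seen) {
        long size = page.contentSize;
        for (Map.Entry<String, Long> entry : page.resources.entrySet()) {
            if (!seen.contains(entry.getKey())) {
                size += entry.getValue();
            }
        }
        return size;
    }

    /** Groups page indices greedily so that each group stays within maxSize where possible. */
    static List<List<Integer>> split(List<PageInfo> pages, long maxSize) {
        List<List<Integer>> chunks = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        long currentSize = 0;
        for (int i = 0; i < pages.size(); i++) {
            PageInfo page = pages.get(i);
            long increment = incrementalSize(page, seen);
            if (!current.isEmpty() && currentSize + increment > maxSize) {
                // Close the current split; the next one must carry its resources again.
                chunks.add(current);
                current = new ArrayList<>();
                seen = new HashSet<>();
                currentSize = 0;
                increment = incrementalSize(page, seen);
            }
            current.add(i);
            seen.addAll(page.resources.keySet());
            currentSize += increment;
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }

    /** Two pages sharing a 500-byte font, then one page with a 400-byte image; limit 800. */
    static List<List<Integer>> demo() {
        List<PageInfo> pages = Arrays.asList(
            new PageInfo(100, Collections.singletonMap("font", 500L)),
            new PageInfo(100, Collections.singletonMap("font", 500L)),
            new PageInfo(100, Collections.singletonMap("image", 400L)));
        return split(pages, 800);
    }
}
```

In the demo the second page adds only 100 bytes because the font is already part of the split, so the first two pages fit together (700 bytes) while the third, which would add 500 more, starts a new split. A single page larger than the limit still gets a split of its own.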


I have been testing the raster functionality of PDF Clown on several (literally hundreds of) different PDF documents, including ones created by Acrobat 7, 8, 9 and X and some created with the iTextSharp library.
Some of the files render very well and some give errors. I don’t know if the best way to contribute to your work would be to try and figure out why myself and give you the results or maybe the best way would be to send you a rar with two folders, one good, one bad, with a xrf listing what error I received on each.
I have very little knowledge of the PDF object and I have a feeling trying to parse code line by line wouldn’t help as much as sending the docs to you if you are interested. Either way, thanks for the work and let me know if and how I can help make it better.

I have been hunting around for a way to print a PDF for what seems like an age now. I downloaded your sample and hooked it all up; when I start the printing sample it gets to the point where the printer says it is spooling but then fails to render (the PDF I’m trying to print is just a 1-page document containing some text).
I know the printing functionality is in its early stage but any thoughts on the matter would be greatly appreciated.

When you work at content level modifying an operation like ShowSimpleText, you have to consider that content streams do NOT directly deal with “readable” text, as they are simply concerned with graphical entities called “glyphs” referenced through an arbitrary encoding.

So, if you want to assign some text to a ShowSimpleText operation you have to convert your text into its byte representation as defined by the current font:
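As a toy illustration of that conversion (an assumption-based sketch, not PDF Clown code), the snippet below models a font as nothing more than a table from characters to single-byte glyph codes. GlyphEncodingSketch, SimpleFont and the code values are hypothetical; a real font defines its own encoding, and the resulting bytes are what the operation would carry.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a font reduced to a character-to-glyph-code table. */
final class GlyphEncodingSketch {
    static final class SimpleFont {
        private final Map<Character, Integer> codes = new HashMap<>();

        /** Registers the (arbitrary) single-byte code this font uses for a character. */
        SimpleFont map(char ch, int code) {
            codes.put(ch, code);
            return this;
        }

        /** Converts readable text into the byte string a content stream would carry. */
        byte[] encode(String text) {
            byte[] encoded = new byte[text.length()];
            for (int i = 0; i < text.length(); i++) {
                Integer code = codes.get(text.charAt(i));
                if (code == null) {
                    throw new IllegalArgumentException(
                        "Font has no glyph for '" + text.charAt(i) + "'");
                }
                encoded[i] = (byte) code.intValue();
            }
            return encoded;
        }
    }

    public static void main(String[] args) {
        SimpleFont font = new SimpleFont().map('H', 1).map('i', 2);
        byte[] bytes = font.encode("Hi");
        System.out.println(bytes[0] + " " + bytes[1]); // prints 1 2
    }
}
```

The same string encoded with a different font could yield different bytes, which is why the conversion has to go through the font currently selected in the content stream.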

Good idea to write a content rasterizer! After hitting alignment issues using PDF Clown I also wrote a primitive one. It really helped me to understand what the layout was doing, and where it was going wrong.

Please consider, anytime you discover and fix an issue affecting PDF Clown, to report your findings and solutions — that’s a fair way to give back at least some of the benefit you received using it.
Thank you!

Thank you for the excellent work! Really waiting for the 0.1 release of C#.

By the way, I couldn’t get those patches to work on the 0.0.8 release. The patch utility gives errors when trying to upgrade with patch #1 on Windows.
I hope that you’ll release all the future fixes as full source code straight to SourceForge; it would be easier for C# developers 🙂

I assure you that the published patches for PDF Clown have all been successfully tested for merge to the distributed code base.

Please remember to choose the unified diff format when applying those patches, and check that your patch utility is able to handle the Unix end-of-line character sequence (you may use the patch command through Cygwin).


Hi, I am currently looking for a library that can offer me compression that keeps the resolution but still decreases the file size. All I have found out about PDF Clown’s compression is that it has that feature. Could you help me figure out whether it is actually a fit for me? Also, is this project discontinued?

My extraction includes the coordinates and size of every object. It works well, but I have one problem: when the page rotation is rightward there is a bug in text extraction. I get the text and the position, but something is wrong with the size of the box and the font size of the text. This PDF file is attached here; I will be happy if you look at this file and tell m […]