Write Large PDF

Hi,
I'm trying to write a large PDF file without using much memory,
but the test.pdf file stays at 0 bytes until the end of the process.
It seems that everything is kept in RAM before closing the stream and writing to disk.
Any suggestions?

1) Setting the write mode hint is a one-shot call; it is reset after each write, so this belongs in the loop (see the sketch after this list).
2) Using different locator instances on the PDDocument level when writing FORCES a full write - this emulates the Adobe Reader "Save As" (still, this is overridden by the write mode hint if you do this in the loop).
3) As the location does not really change, a simple "save()" is fine. This will default to an incremental write.
4) The STDocument internally uses a 4k buffer. If you stay within this range, nothing gets written until flush.
5) If I debug your code, I see the size of "test.pdf" slowly build up in my explorer.
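
To make this concrete, here is a minimal sketch of the loop I have in mind (method names such as createNew, save, addPageNode and the write mode hint setter are from memory, so check them against your jPod version):

    PDDocument doc = PDDocument.createNew();
    doc.save(new FileLocator("test.pdf"), null);    // first full write creates the file on disk
    for (int i = 0; i < pageCount; i++) {
        PDPage page = (PDPage) PDPage.META.createNew();
        // ... put the content on the page ...
        doc.addPageNode(page);
        // (1) the hint is reset after every write, so set it inside the loop
        doc.cosGetDoc().setWriteModeHint(EnumWriteMode.INCREMENTAL);
        // (3) same location, plain save() -> incremental append instead of a full rewrite
        doc.save();
    }
    doc.close();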

So all in all, you are not using incremental mode. The serialized PDF is definitely not (completely) in memory. But keep in mind, your memory consumption does not depend only on the PDF serialization. The PD abstraction handles many large objects. jPod is designed to swap out these objects when they are no longer used, but this also depends on your usage (references that create paths to GC roots).
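
For example, this pattern defeats the swapping (the list is only there to illustrate the problem, it is not something you need):

    // Anti-pattern: a strong reference to every page creates a path to a GC root,
    // so the already written objects can never really be released.
    List<PDPage> allPages = new ArrayList<PDPage>();
    for (int i = 0; i < pageCount; i++) {
        PDPage page = (PDPage) PDPage.META.createNew();
        doc.addPageNode(page);
        allPages.add(page);   // <- drop this; let "page" go out of scope after each iteration
    }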

Hi,
I've updated the code according to your suggestions, but generation is still quite slow:
2 seconds on my Core i7 with an SSD for 1000 pages.
We currently produce 25,000 pages per second with some proprietary technology that we want to retire;
the sample code runs at 350 pages/sec.
What could we optimize to solve this?

But, to be honest, I don't know what your old library is doing or whether it can be compared at all. We do a lot of things that are of no importance in batch creation because we are interaction focused. This is surely dead freight when generating documents in the background.

If you're really interested in speeding things up you should profile… If you come up with specific questions I can tell you why we did it this way or whether we simply made some mistakes.

In addition, I can pass on the insights of another user who gave us feedback (not yet included in the distribution, so you may take advantage of it):

• We gained a significant performance boost by reusing the same NumberFormat object in COSWriter.java. This is because the localization service is expensive to create for every number written out (a sketch of the pattern follows below the bullets).

• We also overrode PDFPage.toByteArray() and in that method we created a new RandomAccessByteArray object that defaulted the byte array to 100,000 bytes. We thought there may be a way to default this array size based on the number of operations. This is so we don’t keep creating new byte arrays. We observed over four million byte arrays being created.
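
To make the NumberFormat point concrete, this is the reuse pattern in plain java.text (the surrounding names are only illustrative, not jPod's actual code):

    // Calling NumberFormat.getInstance() for every number is expensive because of the
    // locale service lookup; create the format once and reuse it for all numbers.
    NumberFormat fmt = NumberFormat.getInstance(Locale.US);
    fmt.setGroupingUsed(false);
    fmt.setMaximumFractionDigits(5);
    StringBuilder operands = new StringBuilder();
    for (double value : new double[] { 72.0, 595.276, 841.89 }) {
        operands.append(fmt.format(value)).append(' ');   // reuse instead of re-creating
    }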

If these changes sound like something you’d be interested in reviewing I can start working on making a patch.

We’ve also found that we gained a lot of performance improvement related to garbage collection pauses when we set the following JVM arg:

-XX:SoftRefLRUPolicyMSPerMB=25

We thought maybe you or other jPod users may be interested in this optimization.
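
(For context: the default is 1000 ms of soft reference lifetime per MB of free heap, so 25 makes the VM drop the SoftReference caches much sooner. A launch would look roughly like this, with placeholder jar and class names:)

    java -Xmx2g -XX:SoftRefLRUPolicyMSPerMB=25 -cp jpod.jar:app.jar com.example.BulkPdfGenerator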

If you have similar (or more) findings, please come back and let us know.

Another issue is that for 100,000 pages it took more than 10 GB. I think there may be a problem with resource reuse, but I don't know the jPod internals; maybe the library is not designed for this kind of task?
Thanks for the support.

I have never created a document with 100,000 pages (and maybe one should revisit the requirements?) so I did not track down possible problems (btw. 10 GB of main memory, or what??)

In theory jPod should be able to handle large documents as it is random access based and can swap out unused indirect objects.

That said, as this is not in our focus, we do not test this and there may be some defect here. As designed, after an (incremental) write, each object should have cleared its "dirty" state and as such be subject to garbage collection (if indirectly referenced). Keeping a reference to the objects you built up will wreak havoc on this mechanism. The only object structure that is always in memory is the xref. But again, I do not exclude a memory leak, so you have to start your profiler…

Another hint: when creating large page nodes, you should think about "rebalancing" the tree into multiple hierarchy levels.
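
To give you an idea (again, method names such as PDPageTree.META.createNew, addNode and addPageNode are from memory and may differ in your version):

    // Two-level tree: group pages under intermediate page tree nodes instead of
    // hanging every single page directly off the root node.
    int pagesPerBranch = 100;
    PDPageTree branch = null;
    for (int i = 0; i < totalPages; i++) {
        if (i % pagesPerBranch == 0) {
            branch = (PDPageTree) PDPageTree.META.createNew();
            doc.addPageNode(branch);    // attach the intermediate node to the root
        }
        PDPage page = (PDPage) PDPage.META.createNew();
        // ... create the page content ...
        branch.addNode(page);           // the page becomes a kid of the branch, not of the root
    }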

Ah, OK, I understood from this: "If these changes sound like something you’d be interested in reviewing I can start working on making a patch."
that you would create a patch anyway. Yes, the changes are obvious.

For the balanced tree I have understood what the problem is,
but I also have to admit that I have some difficulties understanding what to do.
Something like:
(pseudocode)
document.addPageNode(page);
page.addPageNode(page2);
etc. … ?

Is it possible to engage you or your company to support this task?
Thanks
Marco

The code and the resulting PDF look fine from a "standard" point of view. The increments are more or less the same size and few objects are rendered in a redundant way.

If you have already applied the tips from our other user, you NEED TO PROFILE… Penalties may arise from garbage collection (because of our SoftReference caching), you may detect that collections are resized in a non-performant way for this kind of bulk work, or you may find other deficiencies that show up only with large docs.

I put a debugger inside NumberFormat and it's not used at all.
The PDFPage class doesn't exist, so I didn't find any toByteArray.
COSWriter has a toByteArray but it already uses RandomAccessByteArray.
I've also added -XX:SoftRefLRUPolicyMSPerMB=25 (and tested with other values): no effect.

Profiling shows that the majority of the time is spent in:
CosObjectWalkerShallow.

OK - by PDFPage I think our fellow user was talking about CSContent. The default is to create a small (20 byte) ByteArray that resizes very slowly. You will encounter the problem when writing more page content. The same holds for COSWriter. You will not take any damage when only writing small objects…
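
As a generic illustration of why the small default hurts (plain java.io here, not jPod's RandomAccessByteArray):

    // Growing a 20 byte buffer up to ~1 MB of content means roughly 16 re-allocations,
    // each one copying everything written so far; a generous initial size avoids that.
    ByteArrayOutputStream tiny  = new ByteArrayOutputStream(20);       // resized and copied many times
    ByteArrayOutputStream sized = new ByteArrayOutputStream(100_000);  // typically a single allocation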

But the biggest "issue" is our internal garbage collection. I never thought about this. It is a typical artifact of interactive PDF handling, where, between two saves, some objects may already be out of scope. Therefore we do a cleanup to avoid having dangling objects.

In your scenario (as the old pages still exist in memory, even if swapped out) this leads to an O(n*n) penalty. And, best of all, you don't need it. Just go and delete the call to "incrementalGarbageCollect" in COSWriter (not garbageCollect).