Posted by timothy on Thursday August 07, 2014 @01:16PM
from the when-vi-is-not-the-answer dept.

New submitter Fotis Georgatos (3006465) writes I recently engaged in a conversation about handling PDF texts for a range of needs, such as creation, manipulation, merging, text extraction and searching, digital signing, etc. A couple of potential picks popped up (PDFBox, iText), given some Java experience of the other fellows. And then comes the reality of choosing software as a long-term knowledge investment! Ideally, we would like to combine these features:

open source, with a community following; the kind of stuff Slashdotters would prefer

I'd like to poll the collective Slashdot crowd wisdom about whether any PDF-related libraries they have written software with keep them happy for *all* the above reasons, and if so, which. And if not happy with them all, what do they think is the best bet for learning one piece of software in the area, with great reusability across different circumstances and little need for extra hacks? I'd really like to hear the smoked-out war stories. It is easy to obtain a list of such libraries, yet tricky to understand whether people have obtained success with them!

Indeed. I'm curious why "closed source, with a strong industry support" is not an option?

Because both "open source" and "strong industry support" when put together like that pretty much means that they don't want to get stuck holding the bag if the company goes out of business. With "strong industry support" the odds of a company going out of business are minimized, and with "open source" even if it does go out of business then you can still continue to use the software indefinitely while you look for a replacement.

I'm using a non-free, but source-provided library called Clib-PDF. It's a pretty nice library with a pretty easy API, and even has PHP bindings (so it must've been a viable mainstream choice at one point). But somehow the company (or was it just a single guy) disappeared years ago. Luckily, we paid for and got the source, and I've been able to keep using it (and even fixing things in the source) without any ongoing support. So not quite open source, but not quite the disaster of discontinued closed source.

I suspect that the author of this library sold it to one of the commercial companies who proceeded to shut down a viable competitor. But who knows...

Indeed. I'm curious why "closed source, with a strong industry support" is not an option?

i guess because he knows what kind of crap *that* can be.

of course his full requirement list is ridiculous, all the more so as a request to the community on a public forum. anything else? this dude is just looking for someone else wanting to do his fucking job, but he also wants a medal for it (read: a pat on the back). likely he *is* in a "closed source, with a strong industry support" environment, so screw him.

please note that the ad hominem in my comment was rhetorical. It was actually referring to the content of your article and the attitude it implies. I think it should be obvious but maybe it is worth pointing it out now.

of course many may have learnt something interesting from the ensuing threads. like with any other discussions. that's the cool thing about fora. but I still find that simply dumping the full set of requirements for your assignment on a public forum isn't professional or nice at all. you can

Any number of reasons. A big one is stability. If it's open source, it won't go *POOF* one day. It won't double in price one day. It won't get an ugly redesign with no ability to stick with the version that worked.

Then there's the ability to work out the finer points where the documentation was unclear, greater ability to debug any problems

I can't imagine a good reason that open source with a strong community wouldn't at least be a nice to have for any software.

Yes... and their principles are: I want somebody else to do a lot of hard work and I want the benefit for free. Oh, and I don't even want to research this myself, I want others to do the work... for free.

Oh, and I don't even want to research this myself, I want others to do the work... for free.

That's a bit unfair. I had a need to convert some output into PDF a while back and started looking thru the many proprietary and open source options, and there were a lot to choose from. It was awfully hard to determine which were quality and which weren't. Installing and trying to program a working solution against each API would have taken up a huge amount of time. There is nothing at all wrong with asking if others have been thru some of that process and found a favorite. I certainly would have liked

Is it a true open standard like TCP/IP, or more like a rubberstamped "standard" by a single company like OOXML?

What prevents Adobe from adding incompatible features to Acrobat and Adobe Reader, thus making the PDF unreadable by software that adheres to the so-called standard? For example, last year, I could connect to the Danish tax department from my Linux machine, using standard TCP/IP, only to be told that to view my tax returns I would have to install Adobe Reader.

PDF is a family of standards. There's a core specification and a set of extensions (including, somewhat confusingly, some that remove functionality). In particular, PDF/A removes some of the crazy things (embedded video, audio and JavaScript) and is intended for archiving. Adobe is the driving force behind the standard, and do add non-standard things to their tools (although via a well-defined extension interface), but they're very careful to differentiate PDF/A because a lot of their deep-pocketed custo

Nothing prevents Adobe from adding incompatible features. In fact, nothing stops Adobe from adding compatible features. We have a use case for U3Ds embedded in PDFs (as per the ISO Standard, 13.6), and I haven't found any PDF reader other than Adobe's that displays that.

OP here. Your concern is very valid and you are not alone, I have the same concerns.
However, at the moment we know of no other standard that actually renders alike among zillions of desktops, smartphones, automated processing agents etc.
Can you really replace the .pdf format with something of similar functionality AND not ask the majority of users to install X, Y, Z?

I think that means you haven't seen enough PDFs. Adobe makes heavy use of proprietary add-ons that only work with Adobe products. Then there are all the security vulnerabilities they can contain.

PDFs are great for internal use, if you create them and you consume them. Dealing with those made by random people kind of sucks. Sometimes you get a PDF that's just composed of images for each page, so no text extraction is possible.

I'm not recommending any libraries for the original poster, because they all suck i

Text extraction can be a pain even if the text is not images, due to encodings and text placement. Also, it doesn't take proprietary add-ons to lose all open source viewers I'm aware of: I know of none that support the 3D images in the Standard, section 13.6.

Adobe pushes PDF as a method of data collection. People make fancy PDF forms and e-mail them out. Inside of the form is a button that says "when complete, click here to submit form" which attaches the filled out form to an e-mail and sends it back to the publisher. From there, folks somehow extract the fields from the file and dump it into a database, which seems like a messy and complicated process. Honestly a web form would be easier to implement in many cases.

Because maybe it's not his first project? Fine, let me ask you: how many times did you get burned by totally unmaintainable third-party dependencies, before you vowed "NEVER AGAIN will I get so utterly fucked over?"

Was your fifth project the one where you couldn't ever port to a new architecture or OS, or was it the one where the only company who had the source went into bankruptcy and it took years for the liquidation to happen and you never really figured out where the assets are? No wait, your fifth project was the one where they just withdrew it from the market for "strategic reasons" and you never found out why and there was no replacement. Ah, then there was the race condition that you knew you could find if only you could read through the code, but the sole developer didn't even know what "race condition" means so he ignored your bug report. And the time the DRM server incorrectly said the API key had expired so you didn't get any sales that day. Then there was that time you had the source but weren't allowed to change some parts of it: I loved the comment "by reading this you are violating the License Agreement" followed by the base64 string of dynamically interpreted code. Of course you violated the agreement, and decoded it: finding a bug you weren't allowed to fix. And of course let's not forget the time the developer might have actually hypothetically allowed the code to be maintained or might have even done it himself, but he had lost it, the one and only copy in the entire world, which had been used to compile the code that literally tens of thousands of people were depending on. That one's a classic, almost right up there with the vendor who died, taking all his customers' hopes of maintenance with him to the grave.

Holy crap. I get why the public doesn't know to demand Free Software. Even smart people can be uninformed or lack expertise outside their areas. But developers, really? You have to be LITERALLY STUPID to not see "open source" as at least a major advantage, if not necessarily always the winner. Maybe it's not always a solid requirement, but if you don't always at least start your searches that way and try to get something that at least can be maintained, then yes, you're a moron.

"Oh no, I'm not a moron," you explain, "I just happen to think that some large projects aren't ever going to need maintenance, because surely it's simple enough that a good programmer will get everything right the first time." You're right: you're not a moron; you're an imbecile. Sorry about the mistake.

Because maybe it's not his first project? Fine, let me ask you: how many times did you get burned by totally unmaintainable third-party dependencies, before you vowed "NEVER AGAIN will I get so utterly fucked over?"

This. Wish I hadn't run out of mod points -- and frankly I'm tired of some bottom of the barrel programmer whose attitude is "we can just rewrite everything every 5 years" getting promoted into management and then tying our code to whatever proprietary crap the next cute sales person brings.

Separate. Isolate. Defend. Treat every piece of third-party code that you don't have source for as an enemy whose only goal is to financially rape you. I don't care if that enemy goes by Oracle, Microsoft, or Joe's Dis

Separate. Isolate. Defend. Treat every piece of third-party code that you don't have source for as an enemy whose only goal is to financially rape you. I don't care if that enemy goes by Oracle, Microsoft, or Joe's Discount Software.

Having the source doesn't help you if it's an unmaintainable piece of crap, which is presumably where the OP's requirement for a community came from - if a load of people are hacking on it actively then there's a good chance that, if you end up needing to maintain it in-house, there's a pool of people to hire or send consulting work to.

+1. Yep. You are on track, guys! The whole point is that even when we are involved commercially in a project, it's optimal to promote the usage of open source software anyhow, as a matter of enhancing community effort and investing time in a manner that you can benefit from many times over in the future. Kudos.

PDFlib is cheap compared to licensing Adobe's libraries from DataLogics (speaking as one who switched from the latter to the former).... A full source license for pdflib and tetlib was much less than Adobe/DataLogics' non-source license... less than 1 FTE. Then again, your mileage may vary.

PDFLib happens to be the cleanest and best PDF code solution I've ever worked with.

TCPDF [tcpdf.org]: Open-source PDF-reader built in PHP
FPDF [fpdf.org]: Combine with TCPDF above to create a PDF-writer using PHP
Setasign [setasign.com]: Not open-source but this company offers both free and paid libraries that combine with the libraries above to allow PDF encryption / decryption using PHP. The paid versions support more complex ciphers and I swear by them personally

I second this. PDFlib is good software for making PDFs. Their TET tool for extracting text can return to you where (coordinates) each letter on a page is, if you desire, or just dump the whole page or each word at a time, etc.

Office Automation is problematic -- because it literally opens up a hidden window of your Office app and simulates clicking around the UI to do what you need, if something unexpected happens it can unhide the window to show the user a message. This might be good enough for a desktop app, but if you're running it on a server it'll just freeze up your process with no one there to click it.

For Office->PDF conversion of word docs, Aspose.Words has a fairly easy API and generally very accurate rendering. I hig

No... it does not simulate clicking. It uses the underlying COM representation to perform its functions. That said, it does not work well in a multi-threaded environment, nor where you can't set up a user (e.g. restricted web server credentials). So you either have to impersonate or use COM+ configuration to run the office tool under a different user name.
So if you're just starting out...do not use Office automation in a server environment unless you're willing to deal with these issues. Try Aspose as suggeste

I wouldn't recommend Office Automation on a server if there is any alternative. For beginners, there's too many gotchas and for advanced users, there's plenty of alternatives that will do what you want without too much difficulty. Office with .Net is especially problematic because the COM components run as out-of-process servers and due to .Net's garbage collection and COM interoperability, they are difficult to get to shut down properly.

I've found these tools useful, with an honorable mention to gnupdf. I've never used it personally, but the code looks pretty solid. That said, when I really needed to produce great multilingual PDF I pulled out the PDF spec, gritted my teeth, and generated it directly.
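For anyone wondering what "generated it directly" involves, here is a minimal sketch in Python. It is not the poster's code, and it sidesteps the multilingual part entirely: it assumes Latin-1 text and the built-in Helvetica font, with no font embedding. The function name `build_minimal_pdf` is made up for illustration. The point it shows is the bookkeeping: every object's byte offset has to be recorded so the cross-reference table at the end is correct.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Assemble a one-page PDF by hand, tracking byte offsets for the xref table.

    Assumption: text fits in Latin-1 and a standard (non-embedded) font is enough.
    """
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    # Object bodies, numbered 1..5 in this order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Classic cross-reference table: one 20-byte entry per object, plus entry 0.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)
```

Writing the returned bytes to a file in binary mode should give something ordinary viewers will open. Real multilingual output needs embedded fonts and encoding tables on top of this, which is where the teeth-gritting comes in.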

I have no idea if it supports data: URIs but I've used HTMLDOC to turn html tables into PDF (since every PDF library I've ever used is absolutely shit at tables compared to HTML). It supports inline styles and <style type="text/css"> tags. It's not quite dead, but this year's update was the first since 2006 [msweet.org].

I would have posted to make the same recommendation if someone else hadn't already mentioned it, so I'll just follow up with another recommendation for itext. It's a pretty easy to use library, and it's been around for a while, so it's pretty stable.

The problem with iText is that it used to be MPL, but the maintainer got ticked off at commercial users several years ago and changed the license to AGPL. Apparently now they're relaxing the license for a fee, but they've changed their mind before - no guarantees that they won't change it again.

After investigating and trying at least 9 other open source kits I eventually gave up and went with PrinceXml. You can try the 'trial' version easily and it just works easily. Their support is actually good as well. I wish there was a good pdf toolkit that was open source. But they all seem to just do one odd piece of the puzzle poorly.

PrinceXML is reliable, simple and produces the most beautiful PDFs ever. We've used it to replace InDesign as a tool for high end magazine page generation and have analysed the output of both - PrinceXML is significantly cleaner. However, it does help if you combine it with an image (re)sizing tool otherwise you end up with huge bloat with oversized images embedded in your PDF.
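On the image-sizing point, the arithmetic is simple enough to sketch. This is only an illustration, not part of PrinceXML or any particular resizing tool: the function name `target_pixels` is hypothetical, and the 300 DPI default is an assumption (a common print target), so adjust it to your printer's requirements.

```python
import math


def target_pixels(width_px, height_px, print_width_in, print_height_in, dpi=300):
    """Return the pixel size an image needs for a given printed size.

    Assumption: 300 DPI is enough for print. Images are only ever scaled
    down, never up, and the aspect ratio is preserved.
    """
    max_w = math.ceil(print_width_in * dpi)
    max_h = math.ceil(print_height_in * dpi)
    scale = min(max_w / width_px, max_h / height_px, 1.0)
    return (round(width_px * scale), round(height_px * scale))
```

A 6000x4000 photo placed in a 3x2 inch box would be cut to 900x600 before embedding, which is where most of the size saving comes from.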

At least on the C# side of things, the three libraries I've used (iTextSharp, PdfSharp, and Aspose.Pdf) are all a bit of an unintuitive mess with inconsistencies all over the place and very little documentation. In the case of iText, their revenue stream is putting all their documentation into a book for people to buy, so it's not uncommon to get an intentionally vague response when asking for help.

I cycle between each depending on what I need to do, because they all have their own quirks and supported features. I've even piped from one to another to get certain parts of the process working.

It could be that iText is just what he needs though. iTextSharp is the C# port of the original iText Java library. At times, it is easier to find code examples for iText than iTextSharp. Since the iTextSharp folks did their best to use C# conventions, the Java call names aren't always the same as the C# ones.

This. These three are what you need; you can then script a wrapper around them if you need to, but they'll provide you with everything you need as far as actual manipulation and display goes. Poppler keeps it simple, pdftk can handle most manipulation needs, and ghostscript is there to cover any esoteric issues that still fall under postscript/pdf/EPS.
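A wrapper of the kind described might look like the following Python sketch. The helper names are made up, the commands are the usual command-line forms of `pdftotext` (from poppler-utils), `pdftk` and `gs`, and it assumes all three are installed and on the PATH. Nothing runs until `run` is called.

```python
import subprocess
from typing import List


def extract_text_cmd(pdf_path: str, txt_path: str) -> List[str]:
    """poppler: dump the text of a PDF, trying to keep the page layout."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def merge_cmd(inputs: List[str], output: str) -> List[str]:
    """pdftk: concatenate several PDFs into one."""
    return ["pdftk", *inputs, "cat", "output", output]


def rewrite_cmd(pdf_in: str, pdf_out: str) -> List[str]:
    """ghostscript: re-distill a PDF (or PostScript/EPS) through pdfwrite."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
            f"-sOutputFile={pdf_out}", pdf_in]


def run(cmd: List[str]) -> None:
    """Run one of the commands above; raises CalledProcessError on failure."""
    subprocess.run(cmd, check=True)
```

Passing argument lists rather than a shell string means file names with spaces or odd characters need no extra quoting.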

Might want to also include imagemagick, for import/export/optimization of most image formats you might be bursting/adding in PDF.

PDFLib GmbH [pdflib.com] (German LLC) builds exactly one product: PDFLib. And they've been doing that since 1997. AFAIK the company was run by one guy - the initial developer - alone for most of the time. Now it's probably a shop of 5 or so.

So it's not FOSS - yeah, that's a real shame. But the devs get to eat, you can demand service and response if you run into a bug and you can expect a good product and with PDFLib you're probably going to get it too.

I haven't come across a single project doing non-trivial PDF stuff that doesn't use PDFLib. I've used it myself a little, and the cookbook that comes with the product was very good, so it comes recommended.

I've created approximately 3 billion pages of PDF with it, since 2000. Very, very well done. The library is well thought out, and it can work even with bindings to languages that you would not think are usable. It's fast, really has a nice scope model, has a nice consistency, and rounds off the edges of PDF better than anything else with it.

If you combine it with their import library, and pcos library, it can do almost everything you want. The developers are helpful and don't mess around.

The trouble I can see with PDFLib is the stupid "per machine" licensing. Per machine licensing for a software library is ridiculous - the description of the license on their website pretty much rules out using it in any situation other than some sort of central PDF processing behemoth service.

Clearly you haven't dealt with Oracle's licensing, compared to that, PDFLib is highly liberal stuff. They even give you one machine free (dev box).

I started with Reportlab (the open source parts), found it too low level so I considered using the commercial edition because it has a templating language. As I was not very fond of investing time in learning yet another templating language, I reconsidered, and gave HTML with CSS a try for printing. I used wkhtmltopdf for a while but switched to WeasyPrint in the end: it was created for using HTML with CSS for printing, and seemed to be more actively developed when compared to wkhtmltopdf.

Does anyone have experience with MuPDF? http://www.mupdf.com/ [mupdf.com]
It's open-source, but requires a license for commercial use. It appears to offer the best performance and portability. Its top-level application (MuPDF) is highly rated across most platforms: Google Play, Apple Store and Ubuntu.

Is there anything that can handle the gruesome CT600 forms that the UK Tax authority require us to fill in every year? These have lots of embedded scripting and can only be read with Acrobat Reader. However, this year, Adobe have stopped releasing Acrobat for Linux.

(An added bonus: the internal logic of the CT600 is buggy. For example, if a particular tax option does not apply, it is fussy about the distinction of 0 vs empty, and this leads to subsequent validation errors, naturally with confusing messages.)

I needed to lay out a novel in a PDF. I've previously worked with iText and prefer not to construct a PDF one element at a time. I wanted an HTML to PDF workflow. I then tried wkhtmltopdf, but it doesn't support most of the hardcore design needs: hyphenation, widow/orphan control, alternating page margins, and page headers and footers based on the section of a document.
PrinceXML supports all that. Writing only CSS, and based on html content, you'll be able to replicate anything a designer can do in InDesig

Not sure how current it is, but when I was looking for the same a few years back all that was really available for PHP was HTML->PDF libraries which were not sufficient for anything but the most basic forms. A decent invoice form was hard to get right with these tools. Then I came across FOP. Or more specifically XML-FOP. Combine that with a little XSL and the output was amazing, and could do more than the HTML converters. The only problem is that the FOP tool was a Java based program so PHP would need to execute a shell command to call it. With tight control of what info was passed to that shell command, it seemed an appropriate trade-off for the job at hand. You can still get FOP in the ubuntu repos - apt-get install fop. The learning curve for FOP is a little steep to begin, but no more than any other XML dialect. And being XML, you have a lot of options in building the required FOP file. I opted to put my data into my own XML file, then utilize an XSL file to convert it if/when needed. More details here: http://xmlgraphics.apache.org/... [apache.org]
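The "tight control of what info was passed to that shell command" part is the bit worth sketching. The poster did this from PHP; the version below is a Python illustration of the same idea, not their code. The `-xml`/`-xsl`/`-pdf` options are FOP's documented command-line form, while `job_id`, the working directory and the `invoice.xsl` file name are hypothetical.

```python
import re
import subprocess
from pathlib import PurePosixPath
from typing import List

# Only plain identifiers are allowed to reach the command line.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def fop_command(job_id: str, work_dir: str = "/var/spool/invoices") -> List[str]:
    """Build the FOP argument list for one job.

    Assumptions: <work_dir>/<job_id>.xml holds the data and
    <work_dir>/invoice.xsl is the stylesheet that turns it into XSL-FO.
    """
    if not SAFE_NAME.fullmatch(job_id):
        raise ValueError(f"unsafe job id: {job_id!r}")
    base = PurePosixPath(work_dir)
    return [
        "fop",
        "-xml", str(base / f"{job_id}.xml"),
        "-xsl", str(base / "invoice.xsl"),
        "-pdf", str(base / f"{job_id}.pdf"),
    ]


def run_fop(job_id: str, work_dir: str = "/var/spool/invoices") -> None:
    """Invoke FOP without a shell, so nothing in the arguments is interpreted."""
    subprocess.run(fop_command(job_id, work_dir), check=True)
```

The two safeguards are independent: the whitelist keeps user input out of file paths, and passing a list instead of a shell string means no quoting rules apply at all.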

PDF::API2 is nice, but unfortunately it doesn't handle newer PDFs with compressed xrefs and/or object streams yet. Also, support for writing text in anything different from ASCII and maybe Latin-1 is close to missing.
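If you want to know up front whether a file is one of those newer PDFs, a rough check is possible without any library. This Python sketch (the function name is made up) relies on one fact from the PDF spec: the number after the last `startxref` is the byte offset of either a classic `xref` table or an indirect object holding a cross-reference stream. It is a heuristic only, and it will not notice hybrid files that have a classic table plus an `/XRefStm` entry.

```python
import re


def uses_xref_stream(data: bytes) -> bool:
    """Guess whether a PDF's main cross-reference section is a compressed stream.

    Heuristic: follow the last startxref offset and see whether the keyword
    "xref" is there. If not, assume an "N G obj" cross-reference stream.
    """
    matches = re.findall(rb"startxref\s+(\d+)", data[-1024:])
    if not matches:
        raise ValueError("no startxref found near end of file")
    offset = int(matches[-1])
    # Tolerate a little leading whitespace at the target offset.
    return not data[offset:offset + 16].lstrip().startswith(b"xref")
```

Files that return True are the ones to route through another tool first (for example, rewriting them with ghostscript or a similar tool) before handing them to a library that only reads classic tables.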

I'll be honest that I don't have a broad range of experience with libraries. I've used a couple of html-to-pdf implementations and PDFSharp. The licensing for PDFSharp is very permissive, support can be paid for if required and the library is quite fast. As an aside, it has a cousin, MigraDoc, which produces abstract documents which you can finalise to Office formats, if you need that too.

libpoppler works; it just only meets the requirements that libpoppler was designed for. It correctly displays most PDFs, but fails with esoteric features used only in a small subset.

In that sense, libpoppler is like a swiffer mop: it handles most normal dirt, dust, and general cleaning needs for tile and hardwood; but you will need a mop, or potentially nylon or bristle scrubbers and power tools, to clean some deep-set grime from linoleum or porcelain tile. I've had mops fail to clean traffic grime fr

I've seen commercial programs actually do this to support PDF report generation. They just leverage the existing code they have for printing reports and redirect it to a virtual printer. I think it was the Amyuni libraries, which are clearly closed source. One thing I can say is that a virtual printer that directly generates PDF files from the GDI output (we're talking Windows here) tends to create cleaner output files (smaller size, fewer rendering errors) than the Postscript printer output to PDF route.