Stapler - A python utility for manipulating PDF docs based on pypdf

* Dependencies *

Stapler depends only on the packages python and python-pypdf, both of which can be found in the Arch Linux repositories.

* History *

Stapler is a pure Python replacement for PDFtk, a tool for manipulating PDF documents from the command line. PDFtk was written in Java and natively compiled with gcj. It was discontinued a few years ago and bitrot is setting in (i.e. it no longer compiles on Arch Linux). Since I used it quite a lot, I decided to look for an alternative and found pypdf, a PDF library written in pure Python. I couldn't find a tool that actually uses the library, so I started writing my own. At some point I plan on providing a GUI, but the command line version will always exist.

* License *

Stapler is distributed under a simplified BSD-style license. A copy of the license is found in the file "LICENSE".

* Usage *

I am too lazy at the moment to learn how to create a proper man page, so this has to suffice.

What you _cannot_ do yet is leave out the ranges entirely. I will probably merge select and cat at some point in the future so that you can specify pages and ranges, and if you don't, it just uses the whole file.

The delete command works almost exactly the same as select, but inverse. It cherrypicks the pages and ranges which you _didn't_ specify out of the pdfs.
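The relationship between select and delete can be sketched in a few lines of Python. This is only an illustration, not Stapler's actual code: the function names and the "N" / "N-M" range notation are assumptions made for the example, and page numbers are taken to be 1-based.

```python
def parse_pages(spec):
    """Turn a spec such as '3' or '2-4' into a list of 1-based page numbers.

    The 'N' / 'N-M' notation is an assumption for this sketch.
    """
    if "-" in spec:
        first, last = spec.split("-", 1)
        return list(range(int(first), int(last) + 1))
    return [int(spec)]


def select_pages(total_pages, specs):
    """Return the pages named by specs, in the order given."""
    pages = []
    for spec in specs:
        pages.extend(p for p in parse_pages(spec) if 1 <= p <= total_pages)
    return pages


def delete_pages(total_pages, specs):
    """Return every page that specs does NOT name (the inverse of select)."""
    dropped = set(select_pages(total_pages, specs))
    return [p for p in range(1, total_pages + 1) if p not in dropped]
```

For a six-page document and the specs "1" and "3-4", select_pages picks pages 1, 3 and 4, while delete_pages keeps pages 2, 5 and 6.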

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

Are you planning on adding eventually the full functionality of pdftk -- rotate, watermark, encrypt, etc? I occasionally use pdftk, and I'd like to see a replacement since it's apparently having troubles keeping up to date. Great work!

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

firecat53 wrote:

Are you planning on adding eventually the full functionality of pdftk -- rotate, watermark, encrypt, etc? I occasionally use pdftk, and I'd like to see a replacement since it's apparently having troubles keeping up to date. Great work!

Scott

If you help me with ideas for the command syntax, sure, why not. pypdf supports these things. Can you make a complete list of the things pdftk does that would be important to port over, and how the command line syntax for these features works? I'll then get working.

EDIT: I just had a look at the pdftk man page (didn't think of that when I wrote the above...). There are some things that will not be possible with the current version of pypdf (and frankly I doubt there is going to be a new release):

update_info (because there is no way to write document properties with pypdf; you can read them just fine, though)

fill_form (similar reason: it's just not supported)

The rest will be fine. I'll rename split to burst and then I'll mimic the cat function. The others should be no problem either. Although... I didn't find anything on rotate in the pdftk manual. How should I implement a rotate function? Rotate complete documents or single pages?

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

Is pdftk well and truly dead? Since it does have the ability to write metadata, and some other functionality that pypdf doesn't have, is it worth it to instead put time into bringing pdftk and its dependencies current so we can continue to use it, rather than reinventing the wheel? Not being a programmer, I don't necessarily have a good idea of what that would entail, so please don't take my comments too seriously!! I don't know enough about the differences of the code being python vs whatever pdftk was coded with.

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

pdftk was coded in Java and compiled with gcc-gcj. pdftk hasn't been maintained for 3-4 years and hasn't compiled on Arch for as long as I can remember (more than a year and a half), and gcc-gcj recently fell out of the official repositories, probably because it didn't compile anymore or the maintainer decided to give it up. As far as I can tell, bitrot took its toll on pdftk, and I for one certainly don't have the expertise to rescue it. Additionally, I was always pissed to have to install a huge eclipse package as a dependency for gcc-gcj, which in turn was quite big, just to have a PDF utility. Stapler/pypdf, on the other hand, are pure Python, which cuts back on the deps quite a bit.

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

I really like this idea and think it will be handy for me since I always have tons of PDFs lying around for different papers. Now I can easily merge or split them to send just the important parts to co-workers without having to fuss with pdftk...

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

This is great work. Please implement as many features as pyPDF supports. From my point of view, encryption and decryption are especially important. Thank you for your work. I hope it becomes a good replacement for pdftk.

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

@Philip:

Many thanks for creating this very handy little utility. I had already given up on pdftk and was using jpdftweak from AUR as an alternative (albeit it also uses Java). However your low overhead Stapler is most welcome and does most things I need.

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

Violin wrote:

tinhtruong wrote:

The pdftk feature I most want ported to your new app is the ability to uncompress a PDF file to a text file and then compress it back. That feature is really useful for removing watermarks from PDFs.

I think that's just impossible.

I just thought exactly the same thing. Converting it to pure text is possible, but the other route isn't. I'm pretty sure that pdftk doesn't do that either. (Although I haven't even seen the pdf-to-text feature in the pdftk documentation, I'll implement it.)

It will be a week or two until I can work on it (exams coming up). Be patient, I'll do it.

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

which can be easily suppressed using 2>/dev/null, but is there a cleaner way I wonder?

unfortunately, that is out of my hands. it's a message from pypdf which is no longer maintained (meaning I can't reach the author). I patched it so it doesn't use these deprecated things. I will file a bugreport at some point. for the moment, i have a source package with the (actually very small) patch here and a prepackaged (i686) version of it here.
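For anyone calling the library from their own Python script rather than through the shell, one alternative to redirecting stderr is to silence the warning in Python itself. The sketch below assumes the message is raised as a DeprecationWarning through the standard warnings module, which this thread does not confirm; call_quietly and noisy_stand_in are hypothetical names used only for the example.

```python
import warnings


def call_quietly(func, *args, **kwargs):
    """Run func with DeprecationWarning ignored, then restore the old filters."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return func(*args, **kwargs)


def noisy_stand_in():
    # Hypothetical stand-in for a library call that emits a deprecation warning.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return "loaded"
```

Wrapping the offending import or call as call_quietly(noisy_stand_in) hides only deprecation warnings, so genuine errors still reach stderr, unlike with 2>/dev/null.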

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

Heller_Barde wrote:

I patched it so it doesn't use these deprecated things. I will file a bugreport at some point. for the moment, i have a source package with the (actually very small) patch here and a prepackaged (i686) version of it here.

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

aktinos wrote:

Nice job! Another pretty good alternative is qpdf.

oooooh, shiny *chasing butterflies* I didn't know that existed. Looks really nice, but there seems to be no concatenation of files and that sort of thing. I will surely keep qpdf close by for when I need to (un)set those restrictions on a PDF once in a while. The option to compress and uncompress streams seems nice too. I'll have to experiment with that.

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

Heller_Barde wrote:

Violin wrote:

tinhtruong wrote:

The pdftk feature I most want ported to your new app is the ability to uncompress a PDF file to a text file and then compress it back. That feature is really useful for removing watermarks from PDFs.

I think that's just impossible.

I just thought exactly the same thing. Converting it to pure text is possible, but the other route isn't. I'm pretty sure that pdftk doesn't do that either. (Although I haven't even seen the pdf-to-text feature in the pdftk documentation, I'll implement it.)

It will be a week or two until I can work on it (exams coming up). Be patient, I'll do it.

cheers Phil

pdftk can do that (uncompress and then compress) just fine. On the front page of pdftk there is a line that says:

Uncompress and Re-Compress Page Streams

I have used it many times on Windows to remove watermarks from PDF files (because it's broken on Arch right now). But I'm watching your progress closely.

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

tinhtruong: I think we didn't quite understand you right, I apologize. I thought you meant dumping the text that gets displayed, not the PDF's "source" code. I doubt that pypdf can do that, but I'll look into it. I see now all the advanced features of pdftk and the respective shortcomings of pypdf.

I'll do my best. It'll have to wait another week though; first I have to recover from a backup snafu.

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

I have a quick question. I used to use pdftk on debian to turn a bunch of single page pdf's in a directory into one pdf. Page order wasn't important. I just do it so I am sending one file instead of several and compressed files are bounced by the firewall. The syntax I used was pdftk *.pdf cat output funnies-$(date +%Y%m%d).pdf. Is there a way to read all the pdf files in a directory with stapler and turn them into one PDF?

Re: Stapler - A python utility for manipulating PDF docs based on pypdf

ursa65 wrote:

I have a quick question. I used to use pdftk on debian to turn a bunch of single page pdf's in a directory into one pdf. Page order wasn't important. I just do it so I am sending one file instead of several and compressed files are bounced by the firewall. The syntax I used was pdftk *.pdf cat output funnies-$(date +%Y%m%d).pdf. Is there a way to read all the pdf files in a directory with stapler and turn them into one PDF?

Thanks in advance

Of course, it's the cat option, and it's detailed in the readme (and the first post of this thread). It works like this:

stapler cat *.pdf your_output_file.pdf

the cat option concatenates all but the last file specified on the command line into the last file specified on the command line
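The "all but the last file go into the last file" rule can be sketched as a small argument-splitting helper. This is a minimal illustration under stated assumptions, not Stapler's actual code: split_cat_args is a hypothetical name, and the actual merging of pages is left out.

```python
def split_cat_args(args):
    """Split cat's arguments into (input files, output file).

    Everything except the last argument is an input; the last one is the output.
    """
    if len(args) < 2:
        raise ValueError("cat needs at least one input file and one output file")
    *inputs, output = args
    return inputs, output
```

With the arguments a.pdf, b.pdf and out.pdf, the helper returns a.pdf and b.pdf as inputs and out.pdf as the output. Since the shell expands *.pdf before the program starts, the helper only ever sees the final list of file names.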

EDIT: @tinhtruong: I saw that there is no uncompress action in pypdf, but I think it must do this internally anyway: pypdf can do the watermark thingie, so it needs to uncompress the streams, concatenate them, and compress them again. I'll take a look at the library, but I can't promise anything.