In the past few months I've been bitten more than a few times by the problem of not having the right document around. Sometimes I recycled a document I needed (who keeps water bills for two years?) and other times I just lost it... because paper. I wrote this to make my life easier.

How it Works

1. Buy a document scanner like this one (used by me) or this other one recommended by another user.

2. Set it up to "scan to FTP" or something similar. It should be able to push scanned images to a server without you having to do anything. If your scanner doesn't know how to automatically upload the file somewhere, you can always do that manually. Paperless doesn't care how the documents get into its local consumption directory.

3. Have the target server run the Paperless consumption script to OCR the PDF and index it into a local database.

4. Use the web frontend to sift through the database and find what you want.

5. Download the PDF you need/want via the web interface and do whatever you like with it. You can even print it and send it as if it's the original. In most cases, no one will care or notice.
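The consume-then-search flow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual Paperless consumption script: the ``documents`` table, the ``consume`` and ``search`` functions, and the ``ocr`` callable are all hypothetical names, and a real setup would plug Tesseract in where the stand-in OCR function is.

```python
import sqlite3
from pathlib import Path
from typing import Callable


def consume(inbox: Path, db: sqlite3.Connection, ocr: Callable[[Path], str]) -> int:
    """OCR every PDF in `inbox`, index its text, and return how many were added.

    `ocr` is a hypothetical callable that turns a file into plain text.
    """
    # Hypothetical schema: one row per document, text stored in the clear.
    db.execute(
        "CREATE TABLE IF NOT EXISTS documents (name TEXT PRIMARY KEY, content TEXT)"
    )
    added = 0
    for pdf in sorted(inbox.glob("*.pdf")):
        # Skip anything that was already indexed on an earlier run.
        seen = db.execute(
            "SELECT 1 FROM documents WHERE name = ?", (pdf.name,)
        ).fetchone()
        if seen:
            continue
        db.execute(
            "INSERT INTO documents (name, content) VALUES (?, ?)",
            (pdf.name, ocr(pdf)),
        )
        added += 1
    db.commit()
    return added


def search(db: sqlite3.Connection, term: str) -> list[str]:
    """Return the names of documents whose OCR'd text contains `term`."""
    rows = db.execute(
        "SELECT name FROM documents WHERE content LIKE ? ORDER BY name",
        (f"%{term}%",),
    )
    return [name for (name,) in rows]
```

Calling ``consume`` repeatedly is safe in this sketch because files that are already indexed are skipped, which is roughly what you want from a script that keeps polling a drop directory.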

Here's what you get:

.. image:: docs/_static/screenshot.png
   :alt: The before and after
   :target: docs/_static/screenshot.png

Stability

Paperless is still under active development (just look at the git commit history) so don't expect it to be 100% stable. I'm using it for my own documents, but I'm crazy like that. If you use this and it breaks something, you get to keep all the shiny pieces.

Requirements

All of this is really just a simple, shiny, user-friendly wrapper around some very powerful tools.

* ImageMagick_ converts the images between colour and greyscale.

* Tesseract_ does the character recognition.

* Unpaper_ despeckles and deskews the scanned image.

* `GNU Privacy Guard`_ is used as the encryption backend.

* `Python 3`_ is the language of the project.

* Pillow_ loads the image data as a Python object to be used with PyOCR.

* PyOCR_ is a slick programmatic wrapper around Tesseract.

* Django_ is the framework this project is written against.

* Python-GNUPG_ decrypts the PDFs on-the-fly to allow you to download unencrypted files, leaving the encrypted ones on-disk.
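To give a feel for how the command-line tools above fit together, here is a rough sketch that only builds the three commands in order: greyscale conversion, cleanup, then OCR. It is an assumption about how one might chain them by hand, not how Paperless itself invokes them (Paperless goes through PyOCR for the Tesseract step). The ``build_pipeline`` function and the intermediate file names are made up for illustration.

```python
from pathlib import Path


def build_pipeline(scan: Path, workdir: Path, lang: str = "eng") -> list[list[str]]:
    """Return the three commands, in order, without running them.

    File names below are hypothetical; only the tool names come from the
    requirements list.
    """
    grey = workdir / f"{scan.stem}.grey.pnm"
    clean = workdir / f"{scan.stem}.clean.pnm"
    text_base = workdir / scan.stem  # tesseract appends ".txt" itself
    return [
        # ImageMagick: convert the scan to greyscale.
        ["convert", str(scan), "-colorspace", "Gray", str(grey)],
        # Unpaper: despeckle and deskew the greyscale image.
        ["unpaper", str(grey), str(clean)],
        # Tesseract: recognise the text in the cleaned image.
        ["tesseract", str(clean), str(text_base), "-l", lang],
    ]


if __name__ == "__main__":
    for cmd in build_pipeline(Path("scan.png"), Path("work")):
        print(" ".join(cmd))
```

Each command could then be handed to ``subprocess.run(cmd, check=True)`` so that a failing step stops the chain.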

Documentation

It's all available on ReadTheDocs_.

Similar Projects

There's another project out there called `Mayan EDMS`_ that has a surprising amount of technical overlap with Paperless. Also based on Django and using a consumer model with Tesseract and unpaper, Mayan EDMS is much more featureful and comes with a slick UI as well. It may be that Paperless is better suited for low-resource environments (like a Raspberry Pi), but to be honest, this is just a guess as I haven't tested this myself. One thing's for certain though, Paperless is a much better name.

Important Note

Document scanners are typically used to scan sensitive documents. Things like your social insurance number, tax records, invoices, etc. While Paperless encrypts the original PDFs via the consumption script, the OCR'd text is not encrypted and is therefore stored in the clear (it needs to be searchable, so if someone has ideas on how to do that on encrypted data, I'm all ears). This means that Paperless should never be run on an untrusted host. Instead, I recommend that if you do want to use it, run it locally on a server in your own home.

Donations

As with all Free software, the power is less in the finances and more in the collective efforts. I really appreciate every pull request and bug report offered up by Paperless' users, so please keep that stuff coming. If, however, you're not one for coding/design/documentation, and would like to contribute financially, I won't say no ;-)

The thing is, I'm doing ok for money, so I would instead ask you to donate to the `United Nations High Commissioner for Refugees`_. They're doing important work and they need the money a lot more than I do.