Scan with Raspberry Pi, convert with AWS Lambda to a searchable PDF

2018-02-08

I have long dreamed of a setup that lets me just press the scan button on my scanner and — without any further input — uploads the document as a searchable PDF to some cloud drive. Thanks to SANE's good scanner support and the ease of use of AWS Lambda it's actually quite easy (judging by the length of this post it looks like quite a task, but in the end it is straightforward and — surprisingly — quite free of hacks).

In this solution you:

set up SANE on your Raspberry Pi 3 so it scans your document

set up scanbd to detect the scan button

set up an S3 bucket for uploading

set up a Lambda function which uses tesseract to create a searchable PDF

Personally I’m using Raspbian Stretch Lite as the OS on my Raspberry Pi, and a Fujitsu S1300i as the scanner.

Before you start: you might just want to wipe your Pi and start fresh. It takes about 15 extra minutes, and you can follow my howto to do it headless (without attaching a monitor or keyboard to the Pi).

Set up SANE

First I tried to compile SANE from source, believing that this was the only way to get my scanner to work. After hours of trying and simplifying this howto (and after I wiped the Pi 3 twice to start over!) I figured out that apt install works just fine! So bear in mind that this howto was done with sweat and after hours of painful trial and error :)

Just install:

sudo apt install sane-utils -y

No need to install the whole sane package, which comes with 162 packages needing 430 MB of space (sic!). sane-utils is enough. Now, when you plug your scanner into your Pi and do..
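The command elided here is presumably SANE's device listing. A guarded sketch (so it also runs on a machine without SANE installed; the exact device string you'd see for the S1300i is an assumption):

```shell
# List the scanners SANE can see; for the S1300i you'd expect an
# epjitsu device line. Guarded so the sketch runs anywhere.
RESULT=$( { command -v scanimage >/dev/null 2>&1 && scanimage -L; } || echo "scanimage not available" )
echo "$RESULT"
```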

So all you’d need to do is get this 1300i_0D12.nal file. Get it from the installation files (i.e. that old CD-ROM), or just search the web for your firmware file and hope that there are no security concerns.. In my case I found it on GitHub and installed it with:
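The install command itself is elided above. As a sketch: the epjitsu backend (which drives the S1300i) looks for firmware under /usr/share/sane/epjitsu (see /etc/sane.d/epjitsu.conf), so the step presumably boils down to copying the file there with sudo. This sketch uses a scratch prefix and a stand-in file so it runs without root and without the real firmware:

```shell
# Hypothetical firmware install for the epjitsu backend. On the pi you'd
# copy straight into /usr/share/sane/epjitsu (with sudo); the scratch
# prefix here just keeps the sketch runnable anywhere.
PREFIX=$(mktemp -d)
mkdir -p "$PREFIX/usr/share/sane/epjitsu"
touch "$PREFIX/1300i_0D12.nal"            # stand-in for the real firmware file
cp "$PREFIX/1300i_0D12.nal" "$PREFIX/usr/share/sane/epjitsu/"
```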

Set up scanbd

Scanbd is very badly documented. This is sad, because once you get it working, it’s doing its job very well. Plus: there’s really no alternative to scanbd.

Scanbd is just a daemon which regularly polls the scanner to see if a button was pressed. If one was, it starts a shell script which itself then uses sane to scan. I found this stackoverflow answer a good explanation of how scanbd works.

There are a few howtos on the web which are overly complicated (e.g. copying all of sane's config files over to scanbd); after 2-3 fresh installs I found a quite straightforward way to get it working.

First, install it via

sudo apt install scanbd -y

then edit /etc/scanbd/scanbd.conf and set (if your scanbd.conf is missing — as it was missing for me on the first try — take this conf file as a start):

debug-level = 7: to see errors more easily while setting up

user = pi: to run script and the scanning process as user pi
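For orientation, the relevant part of the global section might then look like this (a sketch based on the stock config's syntax, not a complete file):

```
global {
        # raise while setting up, lower again (e.g. to 2) once everything works
        debug-level = 7

        # run the action scripts and the scanning process as user pi
        user = pi
}
```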

Start scanbd with

sudo scanbd -f

and you'll see that scanbd is polling. When you hit the scan button, you should see output lines of scanbd trying to run /etc/scanbd/scripts/test.script, which doesn't exist yet. So far, so good!
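If you want to watch the chain work before writing the real scan script, a minimal stand-in for /etc/scanbd/scripts/test.script could just record the event (make it executable with chmod +x; the SCANBD_DEVICE variable being set by scanbd for its action scripts is my understanding from its man page):

```shell
#!/bin/bash
# Minimal stand-in for /etc/scanbd/scripts/test.script: just record that the
# button press arrived, so the scanbd -> script chain can be verified.
echo "$(date -Is) button pressed on ${SCANBD_DEVICE:-unknown}" >> /tmp/scanbd-test.log
```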

Hitting the scanner button should scan. Buuut: if you now power off the scanner (close the lid on my model), unplug it or whatever, and then replug it, scanbd crashes spectacularly with a segmentation fault. There is this reported bug, which is fixed in version 1.5.1, but instead of compiling from source it's easier to run it under systemd and tell systemd to restart the service after a crash:

First, edit /lib/systemd/system/scanbd.service and in the [Service] section add the line Restart=on-failure.
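The [Service] section would then read something like this (a sketch; the rest of the unit file stays as shipped, and you'll want to run sudo systemctl daemon-reload afterwards so systemd picks up the change):

```
[Service]
# restart scanbd automatically when it segfaults after a scanner replug
Restart=on-failure
```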

Now, hitting the scanner button should work out of the box. Also try restarting the Pi and replugging your scanner. You may also want to have a look at syslog, where all of scanbd's messages end up: tail -f /var/log/syslog

If, for any reason, your service just won't start, examine /lib/systemd/system/scanbd.service and check that ExecStart references your scanbd binary (use which scanbd) and your scanbd.conf, and also that SANE_CONFIG_DIR is set correctly.

Upload to S3

The idea is to offload as much computing as possible into the cloud. In theory you could also just run tesseract on your Pi and then store the result somewhere, but first I wanted to free up the Pi as fast as possible for the next scan, and second I was just searching for another excuse to try out Lambda..

So in the next step we’ll alter the script so it uploads to s3. But before we can do that we’ll need to create a user on AWS which has just enough rights to do that.

AWS: add bucket and user

S3: Create a temporary upload bucket, e.g. temporary-upload (be sure to choose a region close to you; upload speed is a lot faster for closer regions). Note the ARN of the bucket.

IAM: create a policy ReadWriteOCR, switch into the JSON editor and paste this (replace the ARNs):
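The policy JSON itself isn't shown here; a sketch of what it plausibly contains, i.e. read/write/delete on the temporary upload bucket (the bucket name in the ARNs is a placeholder, replace it with the one you noted):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::temporary-upload/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::temporary-upload"
    }
  ]
}
```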

Write the scanner script

Now – finally – all the things are in place to finish the scanner script.

The below script..

scans in batch mode: creates multiple files until the feeder is empty

does a duplex scan (there's no detection of whether both sides contain content, so for a one-sided paper the second page is just empty)

scans with resolution 300: this is the default. It is a pretty fast scan and the quality is just what OCR (tesseract) recommends

creates a .tar.gz archive. I did some speed tests and in my case it was quicker to gzip the files before uploading. But that greatly depends on your upload speed

does the compression and uploading in the background so the scanner is ready for the next scan
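The script itself isn't reproduced here; the following is a hedged sketch of what a scan.sh with those properties could look like. The --source/--mode option values vary by backend (check scanimage -A for yours), and the bucket name and file naming scheme are assumptions:

```shell
#!/bin/bash
# Hypothetical sketch of /etc/scanbd/scripts/scan.sh; option values and
# the bucket name are placeholders, adapt them to your scanner and setup.
set -u

S3_BUCKET="temporary-upload"                  # your temporary upload bucket
NAME="scan_$(date +%Y-%m-%d_%H%M%S)"
WORKDIR=$(mktemp -d "/tmp/${NAME}.XXXXXX")
cd "$WORKDIR"

# Batch duplex scan at 300 dpi until the feeder is empty. scanimage exits
# non-zero when the ADF runs out of paper, hence the `|| true`.
scanimage --device-name "${SCANBD_DEVICE:-}" \
          --source 'ADF Duplex' \
          --mode Color \
          --resolution 300 \
          --batch="${NAME}_%03d.pnm" || true

# Compress and upload in the background so the scanner is immediately
# free for the next batch.
(
  tar czf "${NAME}.tar.gz" "${NAME}"_*.pnm &&
  aws s3 cp "${NAME}.tar.gz" "s3://${S3_BUCKET}/" &&
  rm -rf "$WORKDIR"    # comment this out until your lambda works reliably
) &
```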

Take the script and save it as /etc/scanbd/scripts/scan.sh; the only thing you need to adapt is the S3 bucket name. You may also want to comment out the rm -rf in the second-to-last line until you're sure your lambda function doesn't eat up your files.

Upload the zip file into an S3 bucket of your choice (the bucket needs to be in the same region you want your Lambda function to run in)

Now, set up a lambda function with:

Name: e.g. scan-ocr

Runtime: Python 3.6

Role: Choose an existing role

Existing role: the role you created earlier with the ReadWriteOCR policy attached

Then, in the lambda function set

Function code: Handler = handler.handler

Environment variables:

S3_DEST_BUCKET=<ocr-document-bucket>: destination bucket name where lambda will upload the OCRed PDF

EMPTY_PAGE_THRESHOLD=200: if tesseract finds fewer than 200 characters on a page it's — from experience — likely to be empty and will be removed (assumes you're using a duplex scanner). If you want to disable empty page removal, just set this to 0

UPLOAD_TYPE=discard: just to get going for now; the OCRed file will be discarded. Later on you'll configure this lambda function to upload to S3 or Google Drive.

Basic settings:

Description: e.g. take tar.gz and turn it into OCRed PDF

Timeout: 5:00 minutes. This is the maximum value which Lambda allows. For 6-page scans my lambda needed about 12s, so with 5 minutes you should be fine handling ~150 pages :)

Memory: I chose 2048MB. The more memory you assign, the faster the execution (see also the official doc). 128MB is not enough; it will lead to out-of-memory errors.

Now, load in the zip file you just uploaded to s3:

Function code:

Code entry type: upload from s3

S3 link URL: the zip file location in the form https://s3.<region>.amazonaws.com/<bucket>/ocr-lambda.zip

Test it

In theory this would all work out of the box, of course. But let's try it out. Upload a tar.gz from a test scan to your temporary S3 bucket. Then hit Configure test event in the dropdown at the top of your lambda function. Now, put this JSON into the editor:
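The test JSON is not included here; it's presumably an S3 put event pointing at your uploaded test file, along the lines of Lambda's standard S3 event template (bucket name and key below are placeholders for your own):

```json
{
  "Records": [
    {
      "eventSource": "aws:s3",
      "eventName": "ObjectCreated:Put",
      "s3": {
        "bucket": {
          "name": "temporary-upload",
          "arn": "arn:aws:s3:::temporary-upload"
        },
        "object": {
          "key": "scan_2018-02-08_120000.tar.gz"
        }
      }
    }
  ]
}
```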

After saving the test you can run it and you’ll see all the text output of the lambda function, and hopefully the line all fine, discarding file, but not deleting source file.

Upload to S3 / Google Drive

Originally I just had the lambda function upload the file to S3 and hoped to find a nice frontend on top of S3 (but failed; apparently there's nothing really decent). Then I realized that I'd need some text search anyway: without it, half the fun of OCR (apart from copy-pasting lines from invoices into my ebanking, which is my main use case) is gone. So I decided to go for Google Drive support.

If you don't need Google Drive and just want uploads to another S3 bucket, you can skip this section: set the env var UPLOAD_TYPE=s3, set S3_BUCKET to your destination bucket name, add this JSON to your policy, and you'll be fine:
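The JSON snippet is missing here; a sketch of the extra policy statement it plausibly is, granting PutObject on the destination bucket (replace the ARN with your destination bucket's):

```json
{
  "Effect": "Allow",
  "Action": "s3:PutObject",
  "Resource": "arn:aws:s3:::<ocr-document-bucket>/*"
}
```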

Your browser should open and ask if you'd like to authorize your lambda function, via your Google API account, to create files in your Google Drive and to access the files it created (which it won't need). See here for more details about the rights you're granting.

Once you grant the rights, you'll see a bunch of environment variables you need to copy-paste over to your lambda function.

Optionally, if you wish your PDFs to be stored in a specific folder, go to that folder in your Google Drive, copy the part of the URL after /folders/, and put it into an additional environment variable named GDRIVE_FOLDER

Add trigger

Now, to the very last thing: your lambda function should auto-trigger once your Raspberry Pi uploads a file into your temporary S3 bucket. First, reload the page of your lambda function; then, from the Add triggers menu of your lambda function (top left) choose S3, and in the Configure trigger dialogue set:

Bucket: the bucket where the lambda function should listen to

Event type: Object Created (All)

Prefix and Suffix you can leave empty

That's it! Now, pressing the button on your scanner should set the whole chain reaction in motion, and you should see your OCRed file in Google Drive (or S3, if you chose so). If it does not, you should be able to go to the Monitoring tab at the top of your lambda function to see if it triggered at all, and head over to its log file.