Introduction

On a recent project we needed to set up a template to gather information from participants. At first I thought I’d just set up a Microsoft Word form and put a small script together to extract the information. A while ago I wrote a lot of scripts for extracting information from Word documents so I felt that this would be easy. However, most of the team are on Macs and the Word forms on Mac are the old variety. I also had a go at a Groovy+docx4j script to extract the form data but I didn’t get very far in my time box so gave it away as too much effort.

I then looked at the FormsCentral app that comes with Adobe Acrobat Pro 11. I’d not used it before but it was quite straightforward to set up a form and export it as a PDF. I then grabbed the Apache PDFBox library and used it to extract the fields. In all it was a pretty straightforward bit of work.

Discussion

The code I’ve included is pretty straightforward. I output the data using the YAML format as an example but I could have also pushed out XML or CSV.
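As a rough illustration of the output step (this is not the script itself), here is a minimal Java sketch that writes a set of already-extracted field values as YAML or as CSV. It assumes the values have been collected into a Map of field name to string value; the class and method names (FieldWriter, toYaml, toCsv) are made up for the example, and everything is quoted so that odd characters in names or values can’t break the output.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: serialises extracted form fields as YAML or CSV.
final class FieldWriter {

    // Wrap a string as a YAML double-quoted scalar, escaping the
    // characters that would otherwise end or corrupt the scalar.
    static String yamlQuote(String s) {
        String v = (s == null) ? "" : s;
        return "\"" + v.replace("\\", "\\\\")
                       .replace("\"", "\\\"")
                       .replace("\n", "\\n")
                       .replace("\r", "\\r") + "\"";
    }

    // One "name": "value" mapping entry per field, in insertion order.
    static String toYaml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(yamlQuote(e.getKey()))
              .append(": ")
              .append(yamlQuote(e.getValue()))
              .append('\n');
        }
        return sb.toString();
    }

    // Wrap a string as a quoted CSV cell; embedded quotes are doubled.
    static String csvQuote(String s) {
        String v = (s == null) ? "" : s;
        return "\"" + v.replace("\"", "\"\"") + "\"";
    }

    // A header row of field names followed by a single row of values.
    static String toCsv(Map<String, String> fields) {
        StringJoiner header = new StringJoiner(",");
        StringJoiner row = new StringJoiner(",");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            header.add(csvQuote(e.getKey()));
            row.add(csvQuote(e.getValue()));
        }
        return header + "\n" + row + "\n";
    }

    public static void main(String[] args) {
        // Sample data only; a real run would fill this map from the PDF.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Name_uVH8IPMbm6VsY*FfF09oJg", "Jane Citizen");
        System.out.print(toYaml(fields));
        System.out.print(toCsv(fields));
    }
}
```

For anything beyond a handful of flat fields, a proper YAML or CSV library would be the safer choice; the hand-rolled quoting here is only meant to show how little is needed for simple name/value data.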

You may notice that the PDF field names are a little odd (e.g. Name_uVH8IPMbm6VsY*FfF09oJg). I’ve kept these as-is in my script but it’d be easy to strip out the identifier at the end (_uVH8IPMbm6VsY*FfF09oJg).
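Stripping the identifier could look something like the sketch below. It assumes the identifier is whatever follows the last underscore in the field name and never contains an underscore itself; I’ve only got the one example name above to go on, so treat that as a guess. The class and method names (FieldNames, stripSuffix) are hypothetical.

```java
// Hypothetical helper: removes the trailing "_<identifier>" from a field name.
final class FieldNames {

    // Cut at the LAST underscore so labels that themselves contain
    // underscores (e.g. "First_Name_<id>") keep their full label.
    // Names with no underscore, or with only a leading one, are returned unchanged.
    static String stripSuffix(String fieldName) {
        int cut = fieldName.lastIndexOf('_');
        return (cut > 0) ? fieldName.substring(0, cut) : fieldName;
    }

    public static void main(String[] args) {
        System.out.println(stripSuffix("Name_uVH8IPMbm6VsY*FfF09oJg")); // prints "Name"
    }
}
```

One thing to watch: if two fields on the form share the same label, they’d collide once the identifier is gone, so it’s worth checking for duplicates before using the shortened names as keys.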

Lastly, I got a few questions as to why I focussed on a file-based format and I thought I’d note my answers below:

Google Forms

This would have been ideal but these weren’t one-shot forms. It’s likely that the interviewer would revisit the form with further information, and Google Forms doesn’t appear to provide this functionality.

Survey Monkey

Same as for Google Forms.

A small web application (e.g. in Grails)

I really didn’t see the need for this level of effort in this project.

There’s a huge amount of form-building software out there with heaps of features. For this project, however, I just needed a file-based format that works across platforms. PDF suits this well and the extraction code is pretty straightforward.