OCR recognizes table as two printed columns one after another

I am a member of a small, local, non-profit genealogy society, http://cfgs.org. One of our functions is to publish old birth and death records online where others can find them. Years ago someone typed many pages of tombstone records from local cemeteries. These tables contain the person’s name along with a birth and/or death date. They are in an unlined table format with a large space between the name and dates.

We want to put these records online so people can find them. Several of us have tried scanning these records using the built-in OCR with our all-in-one printers. We have also scanned them into PDF format and used Acrobat Pro to recognize the text. In both cases the result is two columns, like newspaper columns, instead of a table. This disassociates the dates from the names, and when the text is placed in a word processor or spreadsheet we get a single column with the dates below the names.

We really would like to avoid having volunteers manually enter these records into Excel or a database like Access. Does anyone have a suggestion on how to get these typed pages into a data table format?

Another, totally different, problem is that neither Excel nor Access recognizes a number as a date prior to Jan 1, 1900, and many of our dates are older than this. This makes sorting or searching difficult.

To generate a table format with the data you have, someone will need to add the table gridlines to each page (on photocopies, perhaps) before doing the OCR. As for sorting, once OCR'd into a Word table, Word's table tools may allow you to do that where Excel & Access won't.
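On the sorting side, another option is to keep the old dates as plain text written year-month-day, since text in that form sorts in date order no matter how old the year is. Below is a rough Python sketch of the idea. It assumes the dates are typed like "12 Mar 1821"; that format, the function name to_iso and the sample dates are my own illustration, not taken from your pages, so adjust to match what you actually have.

```python
from datetime import date

# English month abbreviations; assumed to match how the pages were typed.
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

def to_iso(text):
    """Convert a date typed as 'D Mon YYYY' into the text 'YYYY-MM-DD'.

    Python's date type accepts years from 1 to 9999, so dates before
    1900 are not a problem here.
    """
    day, month, year = text.replace(",", " ").split()
    return date(int(year), MONTHS[month[:3].lower()], int(day)).isoformat()

if __name__ == "__main__":
    # Invented sample dates, for illustration only.
    samples = ["4 Jul 1890", "12 Mar 1821", "1 January 1900"]
    for iso in sorted(to_iso(s) for s in samples):
        print(iso)
```

Stored as text in a spreadsheet or database column, values like these can be sorted and searched as ordinary strings. Entries that give only a year, or that carry extra notes, would need separate handling.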

I've taken a look at the PDF in the link and I have to say that, even if it was in a table format (i.e. with gridlines), it would still require considerably more work to make sortable & searchable. Most of the given name entries don't have surnames on the same lines, plus some rows have extra information about age at death.

That said, you can at least get a start by taking the exported names list for each page and turning it into a single-column table. You can then split that table into two columns and drag the corresponding date entries, which sit lower down in the document, into the second column. Whilst this will still be a lot of work, it should be much faster than re-typing the lot.

PS: You could even split the table into three columns and put the birth dates and death dates into separate columns.
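If someone in the society is comfortable with a little scripting, the re-pairing described above could be partly automated. Here is a minimal Python sketch, with several assumptions: each page's OCR text has been saved as lines, every date line contains a digit and no name line does, the names and dates come out in the same order, and birth and death dates are separated by " - ". The function name pair_names_with_dates and the sample entries are made up for illustration. Given what I saw in the PDF (missing surnames, extra age notes), expect some pages to fail the count check and need fixing by hand.

```python
import csv
import re
import sys

def pair_names_with_dates(ocr_lines):
    """Re-join one page of OCR output into (name, born, died) rows."""
    lines = [ln.strip() for ln in ocr_lines if ln.strip()]
    # Assumption: a line containing any digit is a date line.
    names = [ln for ln in lines if not re.search(r"\d", ln)]
    dates = [ln for ln in lines if re.search(r"\d", ln)]
    if len(names) != len(dates):
        raise ValueError(
            f"{len(names)} names but {len(dates)} date lines; check this page by hand"
        )
    rows = []
    for name, date_text in zip(names, dates):
        # Assumption: "born - died". A lone date lands in the first
        # column and still has to be checked by a person.
        born, _, died = date_text.partition(" - ")
        rows.append((name, born.strip(), died.strip()))
    return rows

if __name__ == "__main__":
    # Invented sample page: the names block first, then the dates block.
    page = ["Smith, John", "Mary", "", "12 Mar 1821 - 4 Jul 1890", "1850 - 1852"]
    writer = csv.writer(sys.stdout)
    writer.writerow(["Name", "Born", "Died"])
    writer.writerows(pair_names_with_dates(page))
```

The CSV this writes can be opened directly in Excel or imported into Access, which gives you the three-column layout without dragging entries around.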

I think one of the ABBYY OCR products will allow you to define pages as tables rather than columns. I used one at work a couple of years ago for a project. I don't remember the exact name of the product, and I know they have a few different OCR products. It wasn't too expensive.