Kaira - convert matrikels to datasets

Kaira is a Python program that converts old Finnish matrikel books into CSV and JSON datasets. The currently supported book series are "Suomen Rintamamiehet 1939-43", "Suomen Pienviljelijät", "Siirtokarjalaisten tie" and "Suuret maatilat". The book series were originally published in the 1970s and contain brief descriptions of people, their lives, children, spouses, etc. This data is scientifically interesting but difficult to analyze statistically in written form.

What does this tool do?

Kaira is a tool for extracting interesting data from old matrikel books that have been scanned and OCR'd. The extracted data can then be edited and exported to CSV or JSON for statistical analysis. The tool was originally developed at Lammi Biological Station in collaboration with John Loehr.

#How does it work?

First you need a digital scan of the book, preferably of the best quality available.

Run OCR on the scanned documents to get the raw text in a simple .txt or .html format. Picking good OCR software and settings takes some trial and error. We first used Adobe's product but eventually moved to ABBYY FineReader, which produced really good quality text and could save it to handy HTML files.

Next, a "chunker" is run on the raw text file. It tries to isolate each person's entry into a separate XML tag for easy processing. The implementation depends on the source material: for the soldiers, this is done with a regex that looks for patterns common at the beginning of a soldier's entry. It works most of the time, but it can make mistakes, which have to be fixed in the fixer tool (more about that below). For the other book series, the contents are picked from the HTML document.
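
The chunking step can be illustrated with a minimal sketch. This is not Kaira's actual chunker: the entry-start pattern (a capitalised surname, a comma, and a capitalised first name at the start of a line), the `DATA` and `PERSON` tag names, and the `chunk_text` function are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical entry start: a line that begins with "Surname, Firstname".
# The real patterns depend on the book series and are more involved.
ENTRY_START = re.compile(
    r"^[A-ZÅÄÖ][a-zåäö]+, [A-ZÅÄÖ][a-zåäö]+", re.MULTILINE
)


def chunk_text(raw_text):
    """Wrap each detected person entry in its own <PERSON> element."""
    root = ET.Element("DATA")
    # Character offsets where an entry appears to begin.
    starts = [match.start() for match in ENTRY_START.finditer(raw_text)]
    # Each entry runs until the next start (or the end of the text).
    # Anything before the first detected start is dropped.
    ends = starts[1:] + [len(raw_text)]
    for begin, end in zip(starts, ends):
        person = ET.SubElement(root, "PERSON")
        # Collapse line breaks and repeated spaces left over from OCR.
        person.text = " ".join(raw_text[begin:end].split())
    return root


if __name__ == "__main__":
    sample = (
        "Virtanen, Matti, maanviljelijä, synt. 3. 5. 1912\n"
        "Kuopiossa. Puoliso Aino.\n"
        "Korhonen, Eino, työmies, synt. 1. 2. 1915 Lahdessa.\n"
    )
    print(ET.tostring(chunk_text(sample), encoding="unicode"))
```

A pattern like this will occasionally split an entry in the wrong place, for example when a line inside an entry happens to look like a new name, which is why a manual correction step is still needed.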

Kaira then reads the XML file, runs multiple tailored regexes and other domain-specific logic, and generates a CSV file containing the data. At this point the user can use the GUI to find missing information, edit the XML file to fix extraction errors, rerun the process, and so on.
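
As a rough illustration of the extraction step, the sketch below reads person entries from XML, applies two regexes, and writes one CSV row per person. The `PERSON` tag, the field names, the patterns (including the assumed `synt. d. m. yyyy` birth date format), and the functions `extract_person` and `xml_to_csv` are hypothetical and do not reflect Kaira's real extractors.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

FIELDS = ["surname", "first_names", "birth_year"]

# Hypothetical patterns for this example only.
NAME = re.compile(r"^(?P<surname>[^,]+), (?P<first_names>[^,]+)")
BIRTH = re.compile(r"synt\. \d{1,2}\. ?\d{1,2}\. ?(?P<year>\d{4})")


def extract_person(text):
    """Return one CSV row; fields that cannot be found stay empty."""
    row = dict.fromkeys(FIELDS, "")
    name = NAME.search(text)
    if name:
        row["surname"] = name.group("surname").strip()
        row["first_names"] = name.group("first_names").strip()
    birth = BIRTH.search(text)
    if birth:
        row["birth_year"] = birth.group("year")
    return row


def xml_to_csv(xml_string, out):
    """Write one CSV row per <PERSON> element to the file-like object."""
    root = ET.fromstring(xml_string)
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for person in root.iter("PERSON"):
        writer.writerow(extract_person(person.text or ""))


if __name__ == "__main__":
    xml_data = (
        "<DATA>"
        "<PERSON>Virtanen, Matti, maanviljelijä, synt. 3. 5. 1912 Kuopiossa.</PERSON>"
        "<PERSON>Korhonen, Eino, työmies.</PERSON>"
        "</DATA>"
    )
    buffer = io.StringIO()
    xml_to_csv(xml_data, buffer)
    print(buffer.getvalue())
```

Leaving a field empty when its pattern does not match, rather than guessing, makes the missing information easy to spot and correct afterwards.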

#GUI
Kaira includes a simple GUI for reading, exporting and editing the OCR files and related content. See the wiki for detailed usage instructions.

#Development
See the project wiki for documentation on how to extend the software with new book series, how to set up a development environment, and what else you need to know.

#Future
On my part, development will likely stop at the beginning of June 2015. Some critical bug fixes might still be made afterwards.

#Attribution
Please cite if you use this software or datasets generated by it in your research: