The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874)

Digilib-Pub

ID:

http://urn.fi/urn:nbn:fi:lb-2015051201

The corpus is available from Kielipankki, the Language Bank of Finland (download: http://urn.fi/urn:nbn:fi:lb-201505112), and on the Taito server (https://sui.csc.fi/group/sui/ssh-console) in the directory /appl/kielipankki/Digilib-pub.

License information: http://kielipankki.fi/ClarinEulaLBDigilibPub

This corpus consists of the OCR results for the material published before 1875 in the collection of publications digitized by the National Library of Finland. This part of the collection is so old that any copyrights in it must have expired before 2015.

The full corpus, as FIN-CLARIN has it, is organized in eleven branches named arc01, ..., arc11. Each document is stored as a zip archive containing scanned image files in different resolutions, and the OCR results as XML documents. This distribution has the same structure but contains only the OCR results.

Each of the distribution files arc01.zip, ..., arc09.zip contains the material extracted from one branch of the full corpus; arc10 is currently missing from this distribution for technical reasons; and arc11 did not contain any relevant material.
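The following Python sketch shows one way to see which OCR XML documents a distribution file holds. It is only an illustration: it assumes that a distribution file may contain per-document zip archives nested inside it, as the description of the full corpus suggests, and that the OCR results can be recognized by the file extension ".xml". The function name list_xml_members and the "!/" separator used for nested paths are made up for this example and are not part of the corpus.

```python
import io
import zipfile


def list_xml_members(archive, prefix=""):
    """Return the names of all .xml members, descending into nested zips.

    `archive` is a file path or a binary file-like object.
    Assumption: OCR results are identified by the ".xml" extension.
    """
    found = []
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            lower = name.lower()
            if lower.endswith(".xml"):
                found.append(prefix + name)
            elif lower.endswith(".zip"):
                # Read the nested archive into memory and recurse into it.
                nested = io.BytesIO(zf.read(name))
                found.extend(list_xml_members(nested, prefix + name + "!/"))
    return found
```

For example, list_xml_members("arc01.zip") would list the XML members of a downloaded copy of arc01.zip in the current directory. Nested archives are read into memory one at a time, so very large documents may call for a different approach.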

The distribution file "digilib_pub_1771_1874_every.zip" contains all of the branches included in this distribution in one archive.