Corpora

We believe in improving the quality of open source digital preservation tools through good software development practices. A key practice is public testing of software using continuous integration services. For this to be effective, we need shareable test corpora that represent the real-world issues facing the community.

This page lists corpora that we’ve used in software testing and hack events, and we are looking to improve it. We’re happy to receive suggestions for additions to the list; here are a few pointers to consider.

Size

There are no rules regarding the size or number of files in a corpus, but very large test collections do bring some problems as they’re:

difficult to use on virtual build services, e.g. Travis, as they take too long to download;

awkward for unit testing, as tests take too long to run over a large corpus; and

time consuming to copy and distribute.

We won’t discount a suggestion due to size alone. The Govdocs corpus is certainly large, yet it’s also very useful for testing format identification tools.
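As a rough illustration of checking a corpus against the size concerns above, the following Python sketch counts the files in a directory tree and totals their size. The function names and the idea of a byte budget are assumptions made for this example, not part of any existing tool or policy.

```python
from pathlib import Path


def corpus_stats(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


def fits_budget(root, max_bytes):
    """Return True if the corpus under root is no larger than max_bytes.

    max_bytes is a hypothetical limit chosen by the caller, for example
    what a build service can download in a reasonable time.
    """
    _, total_bytes = corpus_stats(root)
    return total_bytes <= max_bytes
```

A build script could call fits_budget on a checked-out corpus before running tests and fall back to a smaller sample when the limit is exceeded.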

Scope

Corpora that focus on representing a single problem or a small set of related problems are preferred as they’re easier to use. Restricting scope helps keep the overall size of the corpora manageable, side-stepping the issues with large collections described above. Smaller corpora can be combined to build larger test sets.
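One way to combine smaller corpora into a larger test set is sketched below in Python. It assumes each corpus is a local directory and labels every file with the name of the directory it came from. The function name and the directory layout are hypothetical.

```python
from pathlib import Path


def combined_test_set(corpus_roots):
    """Return a list of (corpus_name, file_path) pairs across all roots.

    Corpora are visited in the order given; files within each corpus
    are sorted by path so the result is stable between runs.
    """
    test_set = []
    for root in corpus_roots:
        root = Path(root)
        for path in sorted(root.rglob("*")):
            if path.is_file():
                test_set.append((root.name, path))
    return test_set
```

Keeping the corpus name alongside each path means a failing test can be traced back to the focused corpus, and so to the problem, it came from.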

Corpora listing
