Archives

Author: Gregory Jansen

Here at UMD we have a proving ground for large-scale archives. This week we are loading some small, medium, and large collections into our new repository system, DRASTIC, to see how it performs.

First we are going to load them without triggering any automated workflow in response, i.e. no follow-up processing will happen. This will give us a performance metric for bulk loading collections. Then we can compare this metric with a variety of other measurements. These include running the Elasticsearch and Brown Dog-based workflows (indexing, extraction, full-text conversion) with the collections already residing in the repository, as well as ingest and workflow in tandem. We expect the workflow to introduce delays, since a limited pool of worker processes will have to perform both ingest and workflow tasks.

The loading process is recursive. Our collection files are supplied by an Nginx web server that also produces tidy JSON indexes of folder content. By creating an ingest task for a top-level folder URL, we instruct the workers to ingest the entire collection recursively. This means that the first folder ingest task will create more ingest tasks for all of its sub-folders. Those sub-folder tasks create ingest tasks for *their* sub-folders, and so on.
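The recursive fan-out can be sketched in plain Python. This is a hypothetical model, not DRASTIC's actual code: the `index` dict stands in for Nginx's JSON folder listings, and the `deque` stands in for the Celery broker, where a real folder task would spawn sub-folder tasks with `.delay()`.

```python
from collections import deque

def ingest_collection(index, root):
    """Model of recursive folder ingest.

    index maps a folder URL to its listing, e.g.
    {"files": [...], "folders": [...]} -- in DRASTIC this listing
    comes from Nginx's JSON index of the folder.
    """
    queue = deque([root])        # stands in for the Celery task queue
    ingested_files = []
    while queue:
        folder = queue.popleft()            # a worker picks up one folder task
        listing = index[folder]             # fetch the folder's JSON index
        ingested_files.extend(listing.get("files", []))
        queue.extend(listing.get("folders", []))  # spawn sub-folder tasks
    return ingested_files

# Tiny example collection: a root folder with one file and one sub-folder.
example_index = {
    "/coll/": {"files": ["/coll/a.tif"], "folders": ["/coll/x/"]},
    "/coll/x/": {"files": ["/coll/x/b.tif"], "folders": []},
}
```

Running `ingest_collection(example_index, "/coll/")` ingests every file reachable from the top-level folder, which is exactly what queuing a single root-folder task achieves in the real system.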

*Monitoring an ingest through Celery work queues*

Sidenote: If we allow these recursive folder ingest tasks to execute right away, we may quickly fill our task queue with ingest tasks. I managed to fill up a server's disk space this way. With very large collections, in the millions of objects, this may put too much strain on your message queue system, especially if ingest tasks will also result in further workflow tasks. The solution we came up with was to delay execution of the recursive folder ingest tasks until the queue size fell below a certain threshold. The threshold has to be low enough not to overwhelm your message queue system, but also high enough that you don't create a "deadlock" situation, where the queue is completely filled with folder ingest tasks and none of them can run. (This can happen if you have a folder that happens to have more sub-folders than your chosen maximum number of queued tasks.)
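The throttling idea can be modeled the same way. This is a sketch under our own assumptions, not the production code: `MAX_QUEUED_TASKS` is a made-up threshold, the `deque` stands in for the broker, and re-appending a folder to the queue models a Celery task retrying itself later with a countdown.

```python
from collections import deque

MAX_QUEUED_TASKS = 3  # hypothetical threshold; tune to your broker's capacity

def throttled_ingest(index, root, threshold=MAX_QUEUED_TASKS):
    """Model of throttled recursive ingest.

    When running a folder task would push the queue over the threshold,
    the task backs off and re-queues itself (a Celery task would retry
    with a countdown). Note the deadlock the post warns about: if any
    single folder has more sub-folders than the threshold, its task can
    never run, so the threshold must exceed the widest folder's fan-out.
    """
    queue = deque([root])
    ingested, retries = [], 0
    while queue:
        folder = queue.popleft()
        subfolders = index[folder].get("folders", [])
        if len(queue) + len(subfolders) > threshold:
            queue.append(folder)   # back off: run other tasks first
            retries += 1
            continue
        ingested.append(folder)    # ingest this folder's contents
        queue.extend(subfolders)   # spawn sub-folder tasks
    return ingested, retries

# A root with two sub-folders, each with two of its own.
example_index = {
    "r": {"folders": ["a", "b"]},
    "a": {"folders": ["a1", "a2"]},
    "b": {"folders": ["b1", "b2"]},
    "a1": {"folders": []}, "a2": {"folders": []},
    "b1": {"folders": []}, "b2": {"folders": []},
}
```

With the threshold set to 3, folder `b`'s task gets deferred once while the leaf tasks drain the queue, then runs; every folder is still ingested, just later.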