Thursday, October 18, 2007

Performance anxiety

We started the AANRO evaluation with a look at performance, because 200,000-odd items is quite a lot to manage. The repository manager will need to think about how long it will take to re-index after a disaster, an upgrade or a configuration change that requires indexes to be rebuilt.

VITAL is now no longer being considered, but there is one final test to report on that front and we have some new data about Fez 2.

VITAL with 130,000 or so records

Before we decided to stop evaluating VITAL, Bron Dye ran a test on a virtual machine with 3GB of memory (more than we had for previous tests).

Bron used the Fedora batch tool to ingest 96,000 records into Fedora at a rate of about 40,000 per hour.

On Saturday morning, she told VITAL to index the records.

On Monday, the VITAL portal would not respond to HTTP requests.

The admin page for VITAL did work, so Bron told it to stop indexing and then ingested some more Fedora records to take the total up to 130,592.

The VITAL server now returns a server error, while the underlying Fedora works fine.

(We have to leave the experiment there, but we still have the data, so if VTLS would like to have a look we'd be happy to help with testing. Could it be that stopping the indexer caused the problem?)

Fez 2

We had some issues installing Fez 2 at USQ, and it was taking some time to sort them out, so the Fez team very kindly offered to let us use their virtual demo server.

The bottom line is that we're seeing about 40,000 records a day indexing into Fez. At this rate there will be 150,000ish on Saturday morning (it's Thursday afternoon now).

You can see the Fez demo server, but be aware that the AANRO data may disappear at any time, and performance will be slow until it has finished indexing on Saturday morning. As it is, the site is usable, with most pages taking 4 or 5 seconds to build, and I'm impressed that the records show up pretty well considering all we did was put them in MODS format. Shows the benefits of standardization.

So on a demo machine, indexing AANRO metadata is a days-long proposition with Fez 2, just as it was with VITAL 3.1.1. This would add an enormous overhead to building and testing a new repository, not to mention disaster recovery, and things would only get slower once we start sourcing full-text for items. On the positive side, the Fez team are actively improving the software as we speak, but I'd be looking for an order of magnitude improvement before I wanted to work with all the data in one instance. This might be achievable with some more optimization and some beefier hardware; I'm not sure.
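To put rough numbers on "days-long", here is a back-of-envelope sketch in Python. It assumes the collection is a round 200,000 items and that the 40,000 records a day we are seeing on the demo server stays constant, which it may well not. The function and constant names are just mine for illustration.

```python
def days_to_index(total_items: int, items_per_day: float) -> float:
    """Estimate how many days a full re-index takes at a steady rate."""
    if items_per_day <= 0:
        raise ValueError("items_per_day must be positive")
    return total_items / items_per_day


# Assumption: "200,000-odd items" rounded to a flat 200,000.
TOTAL_ITEMS = 200_000
# Observed on the Fez demo server: about 40,000 records a day.
FEZ_ITEMS_PER_DAY = 40_000

current_days = days_to_index(TOTAL_ITEMS, FEZ_ITEMS_PER_DAY)
# An order of magnitude improvement means ten times the throughput.
improved_days = days_to_index(TOTAL_ITEMS, FEZ_ITEMS_PER_DAY * 10)

print(f"At the current rate: {current_days:.1f} days")
print(f"With a 10x improvement: {improved_days:.1f} days")
```

On those assumptions a full re-index is about five days now and about half a day after a tenfold improvement, which is roughly the point where rebuilding indexes stops dominating a test cycle.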

The current performance does not rule out using Fez but it does point in the direction of a federated architecture, with smaller repositories feeding a central portal, running something like Apache Solr. For comparison, experiments I did on my MacBook laptop had Solr indexing AANRO metadata at something like 3000 items per minute, but that was without the overhead of having to fetch all the items from Fedora and generate preservation metadata like Fez is doing, so it's not sensible to compare it with a repository. Any federated solution would have to look at using caching if there are performance bottlenecks.
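For what it's worth, the two rates can be put on the same scale with a few lines of Python. This is only arithmetic on the figures quoted in this post, with the collection rounded to 200,000 items as an assumption, and it inherits the caveat above: the Solr figure leaves out fetching from Fedora and generating preservation metadata, so the ratio is not a like-for-like benchmark.

```python
MINUTES_PER_DAY = 24 * 60


def per_day(items_per_minute: float) -> float:
    """Convert a per-minute indexing rate to a per-day rate."""
    return items_per_minute * MINUTES_PER_DAY


# Figures quoted in this post; both are rough observations.
SOLR_ITEMS_PER_MINUTE = 3_000   # Solr on a laptop, metadata only
FEZ_ITEMS_PER_DAY = 40_000      # Fez 2 on the demo server
TOTAL_ITEMS = 200_000           # assumption: collection size rounded

solr_per_day = per_day(SOLR_ITEMS_PER_MINUTE)
ratio = solr_per_day / FEZ_ITEMS_PER_DAY
solr_minutes_for_all = TOTAL_ITEMS / SOLR_ITEMS_PER_MINUTE

print(f"Solr alone: about {solr_per_day:,.0f} items a day")
print(f"Raw ratio to Fez: about {ratio:.0f}x")
print(f"Solr alone over the whole collection: about {solr_minutes_for_all:.0f} minutes")
```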

Christiaan Kortekaas points out that UQ's Fez repository has 66487 items, with 5261 currently publicly available.