Understanding records in Koha

Throughout the years, I’ve found several open source ILS and most of them try to water down the way librarians have catalogued resources for years. Yes, we all agree ISO 2709 is obsolete, but MARC has proven to be very complete, and most of the efforts out there (Dublin Core, etc.) try to reduce the expression level a librarian can have. If your beef is with ISO 2709, there’s MARC-XML if you want something that is easier to debug in terms of encoding, etc.

That said, Koha faces a challenge: it needs to balance the expressiveness of MARC with the rigidness of SQL. It also needs to balance the convenience of SQL with the potential shortcomings of their database of choice (MySQL) with large collections (over a couple thousand records) and particularly with searching and indexing.

Koha’s approach to solve this problem is to incorporate Zebra to the mix. Zebra is a very elegant, but very difficult to understand piece of Danish open source software that is very good at indexing and searching resources that can come from, say, MARC. It runs as a separate process (not part of the Web stack) and it can also be enabled as a Z39.50 server (Koha itself is a Z39.50 consumer, courtesy of Perl)

The purpose of this post is to help readers navigate how records are managed in Koha and avoid frustrations when deploying Koha instances and migrating existing records.

Koha has a very simple workflow for cataloguing new resources, either from Z39.50, from a MARC (ISO 2709 or XML) file or from scratch. It has templates for cataloguing, it has the Z39.50 and MARC capabilities, and it has authorities. The use case of starting a library from scratch in Koha is actually a very solid one.

But all of the libraries I’ve worked with in the last 7 years already have a collection. This collection might be ISIS, Documanager, another SQL database or even a spreadsheet. Few of them have MARC files, and even if they had (i.e., vendors provide them), they still want ETLs to be applied (normalization, Z39.50 validations, etc.) that require processing.

So, how do we incorporate records massively into Koha? There are two methods, MARC import or fiddling with SQL directly, but only one answer: MARC import.

See, MARC can potentially have hundreds of fields and subfields, and we don’t necessarily know beforehand which ones are catalogued by the librarians, by other libraries’ librarians or even by the publisher. Trying to water it down by removing the fields we don’t “want” is simply denying a full fidelity experience for patrons.

But, in the other hand, MySQL is not designed to accommodate a random, variable number of columns. So Koha takes the most used attributes (like title or author) and “burns” them into SQL. For multivalued attributes, like subjects or items, it uses additional tables. And then it takes the MARC-XML and shoves it on a entire field.

Whoa. So what happens if a conservatorium is making heavy use of 383b (Opus number) and then want to search massively for this field/subfield combination? Well, you can’t just tell Koha to wait until MySQL loads all the XMLs in memory, blows them up and traverse them – it’s just not gonna happen within timeout.

At this point you must have figured out that the obvious solution is to drop the SQL database and go with a document-oriented database. If someone just wants to catalog 14 field/subfields and eventually a super detailed librarian comes in and starts doing 150, you would be fine.

Because right now, without that, it’s Zebra that kicks in. It behaves more like an object storage and it’s very good at searching and indexing (and it serves as Z39.50 server, which is nice) but it’s a process running separately and management can sometimes be harsh.

Earlier we discussed the use case where Koha excels: creating records from scratch. Does this mean that Koha won’t work for an existing collection? No. It just means the workflows are a tad more complicated.

I write my own Perl code to migrate records (some scripts available here, on the move to GitHub), and the output is always MARC. In the past I’ve done ISO 2709, yes, but I only do MARC-XML now. Although it can potentially use up more disk space, and it could be a bit more slow to load, it has a quick benefit for us non-English speakers: it allows to solve encoding issues faster (with the binary, I had to do hexadecimal sed’s and other weird things and it messed up with headers, etc.)

Sometimes I do one record per file (depending on the I/O reality I have to face) but you can do several at a time: a “collection” in just one file, that tends to use up more RAM but also makes it more difficult to pinpoint and solve problems with specific records. I use the bulkmarcimport tool. I make sure the holdings (field 942 in Koha unless you change it) are there before loading, otherwise I really mess up the DB. And my trial/error process usually involves using mysql’s dump and restore facilities and removing the content of the /var/lib/koha/zebradb directory, effectively starting from scratch.

Koha requires indexing, and it can be very frustrating to learn that after you import all your records, you still can’t find anything on the OPAC. Most distro packages for Koha have a helper script called koha-rebuild-zebra which helps you in the process. Actually, in my experience deploying large Koha installations, most of the management and operational issues have something to do with indexing. APT packages for Koha will install a cron task to rebuild Zebra, pointing at the extreme importance (dependency) on this process.

Since Koha now works with instance names (a combination of Zebra installations, MySQL databases and template files) you can rebuild using something like:

koha-rebuild-zebra -b -v -f mybiblio

Feel free to review how that script works and what other (Perl) scripts it calls. It’s fun and useful to understand how old pieces of Koha fit a generally new paradigm. That said, it’s time to embrace cloud patterns and practices for open source ILS – imagine using a bus topic for selective information dissemination or circulation, and an abstract document-oriented cloud storage for the catalogue, with extensive object caching for searches. And to do it all without VMs, which are usually a management nightmare for understaffed libraries.