Andornot Blog

I learned an interesting lesson about Solr relevancy tuning due to a request from a client to improve their search results. A search for chest tube was ranking a record titled "Heimlich Valve" over a record titled "Understanding Chest Tube Management," and a search for diabetes put "Novolin-Pen Quick Guide" above "My Diabetes Toolkit Booklet," for example.

The high-scoring records without terms in their titles had topic = "chest tube" or topic = "diabetes", yes, but so did the second-place records with the terms in their titles! Looking at the boosts, you would think that the total relevancy score would be a sum of (title score) plus (topic score) plus the others.

Well, you'd be wrong.

In Solr DisMax queries, the total relevancy score is not, by default, the sum of contributing field scores. Instead, the highest individual contributing field score takes precedence. It’s a winner-takes-all situation. Oh.

In the samples above, the boost on the incidence of “chest tube” or “diabetes” in the topic field was enough to overcome the title field's contribution, in the context of Solr’s TF-IDF scoring algorithm. That is, it’s not just a matter of “the term is there” versus “the term is not there”. Instead, the score is proportional to the number of query terms the field contains and inversely proportional to the number of times those query terms appear across the whole collection of documents. Field and document length matter, as does whether the term appears nearer the front of the text.
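To make the moving parts concrete, here is a minimal sketch of a TF-IDF style field score. It is only loosely modelled on Lucene's classic similarity: the function name and the sample numbers are made up for illustration, and real Solr scoring includes more factors and differs between versions.

```python
import math

def classic_tfidf(term_freq, doc_freq, num_docs, field_length):
    """Simplified TF-IDF score for one term in one field (illustrative only)."""
    # More occurrences in the field raise the score, with diminishing returns.
    tf = math.sqrt(term_freq)
    # Terms found in fewer documents across the collection are worth more.
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    # Shorter fields get a higher weight than longer ones.
    length_norm = 1.0 / math.sqrt(field_length)
    return tf * idf * idf * length_norm

# Hypothetical numbers: the same term in a 3-word field and a 30-word field.
short_field = classic_tfidf(term_freq=1, doc_freq=5, num_docs=1000, field_length=3)
long_field = classic_tfidf(term_freq=1, doc_freq=5, num_docs=1000, field_length=30)
print(short_field > long_field)
```

Under these assumptions, a single hit in a short, focused field can outscore the same hit in a longer one, which is how a pithy topic field can end up beating a title.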

So I could just ratchet up the boost on the title field and be done with it, right? Well, maybe.

As someone else* has said: DisMax is great for finding a needle in a haystack. It’s just not that good at searching for hay in a haystack.

The client’s collection has a small number of records, and the records themselves are quite short, consisting of a handful of highly focused metadata. The title and topic fields are pithy and the titles are particularly good at summarizing the “aboutness” of the record, so I focused on those aspects when re-arranging relevancy boosts.

New Solr field type: *_notf, a text field for title and topic that does not retain term frequencies or term positions. This means a term hit will not be correlated to term frequency in the field. It is not necessary to take term frequency into account in a title because the title’s “aboutness” isn’t related to the number of times a term appears in it. The logic of term frequency makes sense in the long text of an article, say, but not in the brief phrase that is a title. Or topic.

Note that phrase matching still uses the original version of the title and topic fields, because they index term positions. Thus they can score higher when the terms chest and tube appear together as the phrase “chest tube”.
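As a rough sketch of how that split might look in a request, the snippet below builds DisMax query parameters in which term matching goes to the no-term-frequency fields (qf) and phrase matching goes to the original fields (pf). The field names title_notf and topic_notf and all of the boost values are assumptions for illustration, not the client's actual configuration.

```python
from urllib.parse import urlencode

def build_dismax_params(user_query):
    """Build a DisMax query string (field names and boosts are hypothetical)."""
    params = {
        "defType": "dismax",
        "q": user_query,
        # Term matches use the fields that ignore term frequency.
        "qf": "title_notf^10 topic_notf^5",
        # Phrase matches use the original fields, which index term positions.
        "pf": "title^20 topic^10",
    }
    return urlencode(params)

print(build_dismax_params("chest tube"))
```

The resulting string would be appended to the select handler URL of whichever Solr core is being queried.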

Also, I added a tie=1.0 parameter to the DisMax scoring, so that the total relevancy score of any given record will be the sum of contributing field scores, like I expected in the first place.

total score = max(field scores) + tie * sum(other field scores)
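The formula is easy to try out in a few lines. The sketch below applies it to two records with made-up field scores (the numbers are purely illustrative, not taken from the client's index) to show how the tie value changes the ranking.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: max(field scores) + tie * sum(other field scores)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others

# Hypothetical [title, topic] scores for two records.
heimlich_valve = [0.0, 3.0]       # no title match, strong topic match
chest_tube_mgmt = [2.0, 2.5]      # title match plus a slightly weaker topic match

for tie in (0.0, 1.0):
    a = dismax_score(heimlich_valve, tie)
    b = dismax_score(chest_tube_mgmt, tie)
    print(tie, a, b)
```

With tie=0.0 the first record wins on its single best field (3.0 versus 2.5); with tie=1.0 the second record's title and topic scores add up (4.5 versus 3.0) and it moves to the top.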

So, lesson learned. Probably. And the lesson has particular importance to me because the vast majority of our clients are libraries, archives or museums who spend time honing their metadata rather than relying on keyword search across masses of undifferentiated text. Must. Respect. Cataloguer.

Library and Archives Canada has announced the launch of the 2018 funding cycle for the Documentary Heritage Communities Program (DHCP). This is the fourth year of a planned five-year program, with $1.5 million available this year, as in previous rounds.

The DHCP provides financial assistance to the Canadian documentary heritage community for activities that:

increase access to, and awareness of, Canada’s local documentary heritage institutions and their holdings; and

increase the capacity of local documentary heritage institutions to better sustain and preserve Canada’s documentary heritage.

The deadline for submitting completed application packages is February 7, 2018.

This program is a great opportunity for archives, museums, historical societies and other cultural institutions to digitize their collections, develop search engines and virtual exhibits, and other activities that preserve and promote their valuable resources.

The program is aimed at non-governmental organizations specifically, including:

Archives;

Privately funded libraries;

Historical societies;

Genealogical organizations/societies;

Professional Associations; and

Museums with an archival component.

Businesses, governments and government institutions (including municipal governments and Crown Corporations), museums without archives, and universities and colleges are not eligible.

Types of projects which would be considered for funding include:

Conversion and digitization for access purposes;

Conservation and preservation treatment;

The development (research, design and production) of virtual and physical exhibitions, including travelling exhibits;

We have extensive experience with digitizing documents, books, and audio and video materials, and with developing systems to manage those collections and make them searchable or present them in virtual exhibits.

Contact us to discuss collections you have and ideas for proposals. We'll do our best to help you obtain funding from the DHCP!

About this time last year we blogged about a new version of Omeka, Omeka S, entering beta release. Now we're happy to see that version 1.0 of Omeka S has just been released.

Omeka is a free, open-source content management system (CMS) for online digital collections. With Omeka, you can quickly build a searchable repository of archival, artifact or other records and assemble them into virtual exhibits to showcase your holdings.

Most content management systems are designed to manage a single website with a hierarchy of pages, in which are placed text and other media. In contrast, Omeka is based around items (e.g. historic documents, photographs, audio or video recordings, etc.) which can be arranged into item sets and pages of items. One item can be used in multiple ways, as part of different exhibits, for example.

An easy-to-use web interface provides site administrators with access to all the important back-end features: configuring the site appearance and navigation, uploading items (individually or in batches, such as from a database export), changing themes, and creating content pages.

Omeka S offers users a brand-new interface and features such as:

Manage multiple separate sites from a single installation of Omeka.

Build and publish pages, exhibits, or digital stories by adding and mixing different content blocks.

Use importers to bring in content from a spreadsheet or an Omeka Classic site.

Geolocate your content and display maps on sites using Mapping.

Connect your installation with Fedora and DSpace repositories, with the ability to update content periodically.

Use mobile-ready themes to customize the look of each site.

Omeka is a great choice for museums, archives, historical societies and others with cultural collections who want to make their collections searchable online. It's as easy for volunteers with little experience to use as it is for professional curators, archivists and historians.

Yesterday I had the pleasure of speaking to students in the Library Technologies and Information Management class at Langara College. These budding library techs will learn to create a database for a class project using DB/TextWorks, hopefully with a bit of inspiration from the ideas I was able to share with them.

The image above shows screens from the Andornot Starter Kit, a ready-to-use DB/TextWorks database suitable for a small library.

Not all software has the longevity of DB/TextWorks, but I think this popular app endures because it remains unique in the market. For clients of ours with a modest budget who need to manage diverse kinds of information and don't have programming skills, it remains an excellent choice, one we highly recommend to many clients.

We see it used in law firms to create and manage databases of experts, memos, precedents, boilerplate documents, corporate archives, and of course a traditional library catalogue. In hospitals, it's used to manage patient education materials, and libraries with a strong circulation component. Elsewhere, we see it used to manage museum artifact collections, archival documents, databases of digitized historic documents and audio-visual recordings. In municipalities, it manages bylaws, real estate development applications, council documents… the list is endless.

There are many highly specific database applications available, tailored to the needs of particular organizations (e.g. Inmagic Genie for specialized libraries, Lucidea's Argus for museums, etc.), but few tools that are as easy to use as DB/TextWorks and can be applied to managing any kind of information. Anyone can learn to create a database and snazzy search and edit screens and have a functional, aesthetically pleasing database in a very short time, with little technical aptitude needed. Managing this information is easy with the many built-in, pre-programmed features, such as validation lists, batch modifications, the URL checker, and so on.

Two other long-standing database programs are of course MS Access, included with almost every copy of the MS Office suite, and Apple's FileMaker. The former is practically free and so ubiquitous that many people use it out of necessity, while the latter is quite visually appealing and has many useful features. However, in our experience, both require a higher level of technical skill to make truly useful. DB/TextWorks simply has more of the programming already done.

For reasons like these, it is still an excellent choice in many cases when budgets and user skills are modest, and it is well worth learning to use in a library technician or similar program. Paired with a search interface like our Andornot Discovery Interface, VuFind, Omeka, or Inmagic Presto, it becomes a perfect back-end to a highly functional front-end, a great combination for managing and searching information.

Contact us to learn more about any of the above, or if you're a school or student and would like a trial version of DB/TextWorks to use.

Andornot is delighted to be once again sponsoring the SLA Western Canada Chapter Year End Event, on November 27, 2017.

This year's guest speaker is CBC Vancouver’s on-air meteorologist Johanna Wagstaffe. As a prominent woman in STEM and a podcaster for CBC's Fault Lines and 2050: Degrees of Change, Johanna will be addressing her experiences communicating specialized information and research to non-specialist audiences.

Catered snacks and a beer or wine are included with your ticket and will be available from 6:15pm. A cash bar will also be available.

Where: The Post @ 750, an event space located at 110-750 Hamilton Street in downtown Vancouver (on Hamilton, between Robson and Georgia).

Tickets: Purchase your tickets online today and your name will be added to the guest list at the door. Students and SLA members receive a discount but all information professionals and interested students are welcome to attend. Should you wish to sponsor a ticket for a student, you may purchase that option at the link above and event organizers will contact you with further details.