Rules-based tagging for metadata

The value of human taxonomists

May 24, 2018

With a large volume of content, it’s tempting to rely solely on an automated system for tagging and annotation. While today’s machine learning and rules-based platforms can classify content reasonably well, they haven’t yet approached the human ability to read more deeply into the material and apply tags that are relevant but not obvious.

Librarians and taxonomists can breathe a sigh of relief—they won’t be replaced by machines any time soon! In fact, human discernment and reasoning are necessary for the accurate application of metadata tags.

The process employed by The New York Times (NYT) for tagging news content is instructive. The Times uses a three-step procedure that employs machines, editors and taxonomists to ensure the highest “quality of metadata upon which to deliver highly relevant, targeted content to readers.”[1]

Several years ago, the NYT began developing its own automated machine learning system for tagging, which works in real time as part of the writing process. Once a journalist has finished writing, they can review the suggested annotations as easily as running a spell check.

A big part of the system’s logic relies on rules. According to Jennifer Parrucci, Senior Taxonomist at the Times, “These rules might take into account things like the frequency of words or phrases in an asset, the position of words or phrases, for example whether a phrase appears in the headline or lead paragraph, a combination of words appearing in the same sentence, or a minimum amount of names or phrases associated with a subject appearing in an asset.”[2]
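To make the kinds of rules Parrucci describes more concrete, here is a minimal sketch in Python. It is not the Times’ actual system: the `Rule` fields, the `suggest_tags` function, and the sample article are all hypothetical, chosen only to illustrate frequency, position, same-sentence, and minimum-phrase checks.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """A hypothetical tagging rule; every field name here is illustrative."""
    tag: str
    phrases: list[str]                   # phrases associated with the subject
    min_total_mentions: int = 1          # frequency across the whole asset
    min_distinct_phrases: int = 1        # how many different phrases must appear
    require_in_lead: bool = False        # must appear in headline or lead paragraph
    same_sentence: tuple[str, ...] = ()  # words that must share one sentence


def count(phrase: str, text: str) -> int:
    """Count case-insensitive whole-word occurrences of phrase in text."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, re.IGNORECASE))


def suggest_tags(headline: str, body: str, rules: list[Rule]) -> list[str]:
    """Return the tags of every rule whose conditions all hold."""
    paragraphs = [p for p in body.split("\n\n") if p.strip()]
    lead = paragraphs[0] if paragraphs else ""
    full_text = headline + "\n\n" + body
    # Naive sentence splitter; good enough for a sketch.
    sentences = re.split(r"(?<=[.!?])\s+", full_text)

    suggestions = []
    for rule in rules:
        counts = {p: count(p, full_text) for p in rule.phrases}
        if sum(counts.values()) < rule.min_total_mentions:
            continue  # not mentioned often enough
        if sum(1 for c in counts.values() if c > 0) < rule.min_distinct_phrases:
            continue  # too few different phrases for this subject
        if rule.require_in_lead and not any(
            count(p, headline + " " + lead) for p in rule.phrases
        ):
            continue  # never appears in the headline or lead paragraph
        if rule.same_sentence and not any(
            all(count(w, s) for w in rule.same_sentence) for s in sentences
        ):
            continue  # required words never co-occur in one sentence
        suggestions.append(rule.tag)
    return suggestions


# Made-up article and rules, for illustration only.
headline = "Senate Passes Budget Bill"
body = (
    "The Senate voted on Tuesday to approve the federal budget.\n\n"
    "Senators debated the budget for weeks. The vote was close."
)
example_rules = [
    Rule("Federal Budget", ["budget", "appropriations"],
         min_total_mentions=2, require_in_lead=True),
    Rule("Baseball", ["pitcher", "inning", "home run"], min_distinct_phrases=2),
    Rule("Senate Votes", ["Senate"], same_sentence=("Senate", "voted")),
]
tags = suggest_tags(headline, body, example_rules)
print(tags)
```

In this sketch a rule fires only when every one of its conditions holds, which keeps each rule easy to reason about and easy for a taxonomist to adjust. A production system would likely need better sentence splitting and name matching than the simple regular expressions used here.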

The software suggests tags based on the rules, which reflect the NYT’s metadata schema for subjects, titles, people, locations, etc. An editor reviews those suggestions and can accept or reject them, or search for additional tags. Editors can also request that new terms be added to the schema’s vocabulary.

Requests for new tags are routed to the organization’s taxonomists. Parrucci says, “Taxonomists review the suggestions and decide whether they should be added to the vocabulary, taking into account factors such as: news value, frequency of occurrence and uniqueness of the term.”[3]

If a term is accepted, the taxonomists write new rules for the system that incorporate it. The taxonomists also play an important role in checking the quality of the day’s assets. During their review, they assess whether the main focus of each article is properly tagged and ensure that it’s neither over- nor under-tagged. These human taxonomists judge whether the software rules are suggesting the right tags, and tweak the rules as they go.
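One way to picture this quality check is as a comparison between what the rules suggested and what the reviewers finally approved. The sketch below is an assumption about how such an audit could be tallied, not a description of the NYT’s tooling; the `audit_tags` function and the sample data are hypothetical.

```python
from collections import Counter


def audit_tags(reviewed_assets):
    """Tally, per tag, how often the rules over- or under-suggested it.

    reviewed_assets: iterable of (suggested, approved) pairs, each a set
    of tag names for one asset.
    """
    over = Counter()   # suggested by the rules, removed by a reviewer
    under = Counter()  # added by a reviewer, missed by the rules
    for suggested, approved in reviewed_assets:
        over.update(suggested - approved)
        under.update(approved - suggested)
    return over, under


# Made-up review results for three assets.
day = [
    ({"Federal Budget", "Senate"}, {"Federal Budget"}),
    ({"Senate"}, {"Senate", "Elections"}),
    ({"Senate", "Baseball"}, {"Baseball"}),
]
over, under = audit_tags(day)
print(over.most_common(), under.most_common())
```

A tag that shows up often in `over` points to a rule that is too loose, while a tag that shows up often in `under` points to a rule that is too strict or missing altogether. Deciding what to do about either case remains the taxonomist’s call.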

This iterative process upholds the quality of the NYT’s tagging schema while also adapting it to today’s topics. It also demonstrates that machines are unlikely to take jobs from librarians and taxonomists any time soon. Rather, automated tagging should be seen as another tool that helps streamline an important process. Even if your organization uses tagging software, remember that the results are never going to be perfect, and that human oversight and quality control will make all the difference.