Massively-Multiplayer Online Bibliography

Massively-Multiplayer Online Bibliography (MMOB) is the name for a series of projects aiming to perform significant feats of online bibliography in a fun, collaborative, and principled way, that would be useful to everyone and acceptable to professionals. It will rely on volunteer labor, free software, and open Web standards.

There are hundreds of millions of essays and articles out there. Many of them are already available online, in one or another of the large free repositories, such as Project Gutenberg, the Internet Archive, the Hathi Trust, etc.

However, while their full text is available, searchable, and indexed by search engines -- hence discoverable when searching by words in the text -- there is no good way to distinguish between the tens of thousands of articles that mention Timbuktu and the significantly fewer that are about Timbuktu.

"Oh, but Google will send you to good relevant articles about Timbuktu", you might say. That is probably true for most topics. However, the way PageRank and similar algorithms work, it would only send you to those resources already identified as relevant or useful by people linking to them, which reinforces and recirculates largely the same group of resources for most queries. How would we ever discover additional resources already available to us in the large and growing open repositories?

Traditional library catalogues offer (human-curated) "aboutness" statements for catalogue items. However, catalogue items are typically book volumes rather than individual essays and articles. Thus, a library catalogue will tell us Francis Bacon's Essays (Dover edition) is about "1. English essays -- Early modern. 1500-1700." Not terribly useful, is it? That, itself, does not tell us some of these essays are about "truth", "envy", "sedition", "revenge", etc.

T. S. Eliot's The Sacred Wood is about "Criticism" and "Literature", according to the Library of Congress, but this does not help us discover the influential essays on "Hamlet and His Problems" or "Tradition and the Individual Talent" inside. Let us also remember that even tables of contents are not enough: another essay in Eliot's book is called "A Romantic Aristocrat"; a fine title, but it gives no clue as to who it is about. Lord Byron, perhaps?

Conversely, if we had an extensive data collection of "aboutness" statements (essay X is about topic Y), much of the currently-invisible cultural wealth already available online would become discoverable, and therefore found, read, used, discussed, and built upon, once again enriching our present and future culture and research. It would tell us, for example, that "A Romantic Aristocrat" is in fact about George Wyndham, and it would contribute to a large collection's ability to answer the question "What works do you have that are about (not just mention) George Wyndham?". Wouldn't that be tremendously helpful?

Wouldn't it be nice to be able to browse a huge list of essays -- by language, period, author, title -- pick one that you'd like to read, read it, and then pick one or more topics it was about, from a standardized tree-like list of topics? Or to go over other people's previous classifications and endorse or question them with a single click, to help create a more robust result?

We can build such a collection, one essay and one "aboutness" statement at a time. And we can do so in a way that builds on and interoperates with other large-scale bibliographic efforts, so nothing is wasted.

Essentially, we would build a crowdsourced curation system that would attach multiple "aboutness" values to each individual work (article, essay):

Volunteers would read a work, pick a language to classify in (remember: different classification schemes break the universe down into different ontologies), pick a classification source to classify by, where more than one is available (e.g. Library of Congress Subject Headings, Wikipedia article titles, Library of Congress item titles), be presented with a convenient, browsable, navigable, searchable tree-like view of classifications, and select one or more classifications to attach to the work.

Volunteers would also be able to "upvote" or "downvote" other volunteers' classifications, to help gain confidence in some classifications over others. (This later allows a user searching for material to constrain the search to, for example, only works that have a particular classification at confidence level 3 or more, if an unconstrained search produced too many false positives.)

Users would be able to search for materials on the open Web according to one or more of these classifications.
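
The voting and confidence-constrained search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the proposal does not define how a confidence level is computed, so it is assumed here to be upvotes minus downvotes. The names Classification, confidence and search, and the "ex:" identifiers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    """One volunteer-supplied aboutness statement with its votes."""
    work: str           # identifier of the essay (hypothetical "ex:" URI)
    topic: str          # identifier of the topic (hypothetical "ex:" URI)
    upvotes: int = 0
    downvotes: int = 0

    @property
    def confidence(self) -> int:
        # Assumption: confidence is simply net votes; the project
        # may well settle on a different formula.
        return self.upvotes - self.downvotes


def search(classifications, topic, min_confidence=None):
    """Return works classified under `topic`.

    With min_confidence=None the search is unconstrained; otherwise only
    classifications at or above that confidence level are returned.
    """
    return sorted(
        c.work
        for c in classifications
        if c.topic == topic
        and (min_confidence is None or c.confidence >= min_confidence)
    )


votes = [
    Classification("ex:essay/a-romantic-aristocrat",
                   "ex:topic/george-wyndham", upvotes=4, downvotes=1),
    Classification("ex:essay/hamlet-and-his-problems",
                   "ex:topic/george-wyndham", upvotes=1, downvotes=2),
]

# Constrain the search to confidence level 3 or more.
print(search(votes, "ex:topic/george-wyndham", min_confidence=3))
```

Keeping raw upvote and downvote counts, rather than a single stored score, leaves room to change the confidence formula later without losing information.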

To create an aboutness statement, we need stable identifiers for both the individual work and the topic. Most databases do not, today, catalog at the individual work level, so there's much work to be done.

We can also begin by working only on essays contained in databases that do catalog at the work level (e.g. Project Ben-Yehuda [disclosure: Ijon is its founding editor]).

Gradually, the Table of Contents for Everything project will provide us with stable URIs for more and more essays and articles we can classify.

As always in linked open data, sameAs relationships can subsequently be established between whatever URIs we end up using and the URIs of major databases (e.g. Library of Congress), when they get around to cataloging at work level.
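
As a rough sketch of how such sameAs links might be used once they exist, the following Python groups URIs that have been declared equivalent, so that a classification attached to one URI can be found via any of the others. The class name SameAsIndex and all "ex:" URIs are hypothetical, and the union-find structure is just one convenient way to do this, not something the proposal prescribes.

```python
class SameAsIndex:
    """Groups URIs that have been declared equivalent (sameAs-style)."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        # Walk up to the representative URI, registering unknown URIs.
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            # Path halving keeps later lookups short.
            self._parent[uri] = self._parent[self._parent[uri]]
            uri = self._parent[uri]
        return uri

    def same_as(self, a, b):
        """Record that URIs `a` and `b` identify the same work or topic."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def equivalent(self, a, b):
        """True if `a` and `b` are linked by any chain of sameAs statements."""
        return self._find(a) == self._find(b)


index = SameAsIndex()
# Hypothetical URIs: our own identifier, then two external databases.
index.same_as("ex:mmob/essay/42", "ex:library-a/work/abc")
index.same_as("ex:library-a/work/abc", "ex:library-b/work/xyz")

print(index.equivalent("ex:mmob/essay/42", "ex:library-b/work/xyz"))
print(index.equivalent("ex:mmob/essay/42", "ex:mmob/essay/43"))
```

Because equivalence is transitive, linking our URI to one major database also connects it to every URI that database has already been linked to.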

There are already several authority files (i.e. sets of data including possible "subject headings" one might assign to a work for an "aboutness" statement) from libraries and related institutions published as (linked) open data on the web. For an overview see the datasets tagged with "authorities" on the Data Hub. Datasets from other institutions (e.g. Wikidata) might be relevant as well.

LCSH subject headings are already publicly available as linked data, and can be used for English-speaking classifiers. Similar "thesauruses" or classification trees need to be made accessible to allow volunteers to add classifications in other languages, for the benefit of searchers in other languages. This is a project for experienced volunteers who can "talk the talk" with bibliographers and national libraries.

The Integrated Authority File (GND) used for subject cataloging in German-speaking libraries and curated in a distributed fashion by many German libraries and library service centers is also available as Linked Open Data.

All of these classifications are stored (either as Linked Data triples or in some conventional RDBMS [exposable as triples]) and can then be reviewed, revised, upvoted/downvoted, and of course searched.
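
To make the "conventional RDBMS, exposable as triples" option concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the function names and the predicate "ex:isAbout" are assumptions made for illustration; the proposal does not fix a schema or a vocabulary.

```python
import sqlite3

ABOUT = "ex:isAbout"  # hypothetical predicate name, not an agreed vocabulary

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE classification (
        work_uri  TEXT NOT NULL,
        topic_uri TEXT NOT NULL,
        score     INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (work_uri, topic_uri)
    )
    """
)


def classify(work_uri, topic_uri):
    """Record a classification; repeating the same one is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO classification (work_uri, topic_uri) VALUES (?, ?)",
        (work_uri, topic_uri),
    )


def vote(work_uri, topic_uri, delta):
    """Apply an upvote (+1) or downvote (-1) to an existing classification."""
    conn.execute(
        "UPDATE classification SET score = score + ? "
        "WHERE work_uri = ? AND topic_uri = ?",
        (delta, work_uri, topic_uri),
    )


def as_triples():
    """Expose every stored row as a (subject, predicate, object) triple."""
    rows = conn.execute(
        "SELECT work_uri, topic_uri FROM classification "
        "ORDER BY work_uri, topic_uri"
    )
    return [(work, ABOUT, topic) for work, topic in rows]


classify("ex:essay/of-revenge", "ex:topic/revenge")
classify("ex:essay/of-envy", "ex:topic/envy")
vote("ex:essay/of-revenge", "ex:topic/revenge", +1)

for triple in as_triples():
    print(triple)
```

The vote score lives in an ordinary column next to each row, while the triple view carries only the aboutness statement itself.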

Important note: The above is a mixture of library metadata and linked data geekery. If it makes no sense to you, please don't worry, you can still be involved in the project!

There is a list of open collections curated by the OpenGLAM initiative that may be of interest in this context. Only a few of the collections are collections of textual material (mostly manuscripts); most are collections of digitized works of art, of digitized photographs (sometimes containing manuscripts), of digital sound, or of digitized comics.

All work will happen on the Web, via a modern browser. (i.e. no required downloads, no Flash, no IE6 :))

The Aboutness Project is humble: it seeks to create value in an underserved area (discoverability of non-academic non-fiction resources), in a non-exclusive and non-authoritative manner, and it makes no claim for being comprehensive (yet).

The Aboutness Project is a good netizen: we build on free software and open resources, and we aim to not duplicate efforts or reinvent wheels. We give back: our code and data will be placed in the public domain (and/or CC0).

The Aboutness Project starts with low-hanging fruit: We start with resources that are readily available with work-level URIs (e.g. some works on English Wikisource), and with authority data that's available and open. We'll learn as we go, and will gradually reach for higher fruit.

Where do we have the conversation? -- on this wiki page? On a mailing list (which?)?

Shall we store the aboutness triples on Wikidata? We can share them or publish them in any number of ways, but what is to be our primary store? (storing them on Wikidata means a Wikidata item for every essay!)

A huge number of books are now available as either scans/PDFs or text thanks to massive digitization projects such as the ones by the Internet Archive, Google, and the Hathi Trust.

Those projects focus on quantity over quality, leaving the meticulous improvement of metadata for later or, quite probably, never.

Among those books, the ones that are least well described by metadata are non-fiction collections -- essay collections, article anthologies, digests. That's because book-level metadata can never do justice to the multiple items inside.

The Aboutness Project (above) can help classify these individual works by their content, but it needs a way to refer to these individual works in the first place, and that's not available for individual essays in the book-level resources exposed by the aforementioned services.

The "proper" solution would, of course, be to change the way the content hosts (Internet Archive etc.) operate, and add work-level cataloging and content-management. Since that is a formidable task and beyond our control, what MMOB can do about it is this:

We can create an extrinsic catalogue for these works, all pointing at the one (book-level) resource, but featuring individual data entities for every work (essay, article) inside. Our data entities (themselves metadata for the actual content at the original host) can then be used in The Aboutness Project. The data would be created by volunteers, typing (or proofreading OCRed) tables of contents and identifying authors (with VIAF etc.).
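
One possible shape for such an extrinsic catalogue is sketched below in Python. Everything here is an assumption for illustration: the WorkEntry fields, the way a work identifier is derived from the book URL and the position in the table of contents, and the example.org URL are all hypothetical, since the actual URI scheme is still undecided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkEntry:
    """Metadata for one essay inside a book-level resource."""
    book_url: str            # the host's existing book-level URL
    position: int            # order within the table of contents
    title: str
    author_ids: tuple = ()   # e.g. VIAF identifiers, once identified

    @property
    def work_id(self) -> str:
        # Assumption: derive the identifier from the book URL plus
        # the position; the project has not fixed a real scheme.
        return f"{self.book_url}#work-{self.position}"


def catalogue_from_toc(book_url, toc_titles):
    """Turn a typed or proofread table of contents into work-level entries."""
    # Drop blank lines first so they do not consume positions.
    titles = [t.strip() for t in toc_titles if t.strip()]
    return [
        WorkEntry(book_url=book_url, position=i, title=title)
        for i, title in enumerate(titles, start=1)
    ]


# Illustrative subset of titles, not the book's full table of contents.
entries = catalogue_from_toc(
    "https://example.org/books/the-sacred-wood",
    [
        "A Romantic Aristocrat",
        "",
        "Tradition and the Individual Talent ",
        "Hamlet and His Problems",
    ],
)

for entry in entries:
    print(entry.work_id, "|", entry.title)
```

Every entry points back at the same book-level resource, so nothing needs to change at the original host for the entries to be usable in The Aboutness Project.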

No less importantly, the data entities we produce can serve as the basis for the original hosts' catalogue, if and when they begin supporting work-level cataloguing.