Hello, I’m Nick Rabinowitz, and as Elton mentioned in his last post I’ve joined the GAP team to help create an online interface for reading and visualizing GAP texts, tentatively titled “GapVis.”

In the next few posts, I’ll set out the work that I have done in three parts: first, I’ll identify and explain some of the problems we encountered when trying to adapt the HESTIA NarrativeMap to GAP; then I’ll outline the architectural and technical choices we’ve made to address those challenges; and finally I’ll discuss some of the features of the new GapVis interface. The first two posts will be a bit more technical, and are mainly for fellow Digital Humanities coders – so if the terms “API”, “component”, and “framework” make your eyes roll back in your head, feel free to come back for post 3, which will be more for the lay reader.

For the last month or two, I’ve been working on adapting and expanding the prototype work I did on the HESTIA project into a fully-fledged Javascript-based web app capable of presenting a rich, multi-layered interface for any text run through the Edinburgh Geoparser. You can see the working version here, and the code is available on Github. Note that only two works are available at the moment, and given the experimental nature of the project, we included some features that older browsers (e.g. IE8 or lower) won’t support.

My original work on the HESTIA project was intended primarily as a proof-of-concept expanding my timemap.js library to represent a narrative sequence, and it was tightly tied to the specific text we were working with, the Histories of Herodotus. When I began to look at expanding this work into a generalized application that could present any text, and especially when we began to discuss a range of new features and visualizations we wanted to add, it quickly became clear that I was going to need to build the new application from the ground up. Well, not entirely from the ground up – there are a range of frameworks now available for building Javascript-based web applications, and after comparing several options I settled on Backbone.js as the basis for the new app.

Backbone offers a nice structure for building Model classes that sync easily with a RESTful API, which the GAP project was already creating, as well as a framework for managing dynamic views, UI events, and browser history (i.e. making the URL in the address bar change with the application state). At the same time, it offers a great deal of flexibility and extensibility, and is minimally prescriptive about how you should use it. This means you can use it in almost any way you want, but there’s a certain learning curve as you try to figure out how to integrate Backbone into your project.

For example, in the interface we came up with, many user actions need to have multiple effects on the interface: clicking the “next page” button shows the next page, but it also updates the page number, advances the timeline, and changes the URL. Backbone offers a basic pattern for handling user actions, but doesn’t in and of itself tell you how to manage the cross-references and function calls required to update the various pieces of the interface, which might all be managed by different parts of your code. To address this, I created a global State model that every piece of the application can both update (when the user does something) and listen to, updating itself when the model changes to reflect the new state. This is a common pattern, and once I started working with it I saw that Backbone offered some great tools to support it, but I had to arrive at it on my own.
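Stripped of Backbone specifics, the pattern looks roughly like this. This is a framework-free sketch, and names like `pageId` are illustrative rather than GapVis’s actual code:

```javascript
// Minimal illustration of the shared-state pattern described above:
// a single model that any UI component can update or subscribe to.
function StateModel() {
  this.attributes = {};
  this.listeners = {}; // event name -> array of callbacks
}

StateModel.prototype.on = function (event, callback) {
  (this.listeners[event] = this.listeners[event] || []).push(callback);
};

StateModel.prototype.set = function (key, value) {
  if (this.attributes[key] === value) return; // unchanged value fires nothing
  this.attributes[key] = value;
  (this.listeners['change:' + key] || []).forEach(function (cb) {
    cb(value);
  });
};

// Usage: several components react to one user action.
var state = new StateModel();
var log = [];
state.on('change:pageId', function (page) { log.push('reader:' + page); });
state.on('change:pageId', function (page) { log.push('timeline:' + page); });

state.set('pageId', 2); // the "next page" button was clicked
// log is now ['reader:2', 'timeline:2']
```

Backbone’s `Model` provides exactly this `set`/`change:attribute` machinery out of the box, which is what made it a good fit once the pattern was in place.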

Despite the learning curve, Backbone has turned out to be a really useful framework, and it’s made it easier to build some of the nifty features in the application. I’ll give a bit more detail on the architecture in the next post, but in the meantime, I encourage you to check it out, click around, and tell me what breaks :).

We’ve now finished the geoparsing work – all the source texts have been put through the pipeline to identify place-names (geotagging) and provide spatial co-ordinates for them (georesolution). Geoparsing is the first step for GAP, providing the material for the various visualisations.

Previous posts have described the story so far on the geoparsing aspect of the project:

processing the Hestia Herodotus data for the Edinburgh Geoparser and working out how to evaluate the geotagging step, using the hand-annotated Herodotus text as a gold standard for toponym identification (described here);

setting up a local Pleiades+ database and experimenting with making it cross-searchable with Geonames (described here);

analysing the georesolution step to work out how to improve it (described here).

Judging by the lack of recent activity on our blog, it rather looks like the GAP team spent our summer after #dh11 surfing the waves in California. I’m happy to report that we’ve been up to far more interesting stuff than that…

1. Upon our return to the UK, Leif and I gave a GAP presentation at the hugely popular Digital Classicist seminar based at the Institute of Classical Studies, London, details of which can be found here: http://www.digitalclassicist.org/wip/wip2011.html. You’ll also be able to download a PDF of our presentation and even an MP3 for your iPod, should you want to hear the pair of us wittering on in the comfort of your own home.

2. We have also been hard at work getting GAP data compatible with another project which Leif and I are running: Pelagios. Pelagios is a growing international alliance of groups doing ancient world research, who have clubbed together in order to find a way of linking their data in an open and transparent way. (Partners include the likes of Pleiades, Perseus and CLAROS, for example.) Our aim is to enable researchers and the general public to discover all kinds of interesting stuff related to ancient places and visualize it in fun and meaningful ways. Eric will shortly be posting an explanation of how we at GAP have done this: but, if you’re interested, check out the latest from the Pelagios blog: http://pelagios-project.blogspot.com/.

3. Lastly, for the time being, I’d like to welcome another new member to the GAP team: Nick Rabinowitz. Nick is a tech wizard whom Leif and I know from HESTIA, for which he developed his timemap.js for reading Herodotus’s Histories (see: http://www.open.ac.uk/Arts/hestia/herodotus/basic.html). The basic concept is that a split reading pane allows the user to read through Herodotus’s narrative and see all the places mentioned in it pop in and out of a map view. Nick is now applying and refining this technology for reading through the GAP texts – I for one can’t wait to see the results. Watch this space!

The GAP team recently returned from the Digital Humanities 2011 conference at Stanford, where we gave a paper. Not only was it a highly enjoyable conference, it also gave us the first opportunity for all four of us to sit in the same room together at the same time! Elton and Leif made it out a week early to visit Eric in beautiful Berkeley and get some last-minute coding done, helped by some earlier text crunching by Kate. This really helped us get to a point where GAP is starting to deliver on its promise: the ability to visualise the spatiality of texts, and discover texts associated with places. On top of that we also had a stimulating discussion with Google’s John Orwant, saw some great papers and ate like kings 🙂

Here’s an early “Alpha” version of displaying identified places on a map. It has bugs, and certain books will probably crash your browser, so we’re showing Tacitus only for now until we work out ways to progressively load identified places:

The GAP project explores workflows needed to identify ancient places from unstructured texts (books) so that researchers can reference these ancient places in Linked Data applications. Most of the important challenges that we note relate to problems concerning identifiers of texts, fragments of texts (including individual “tokens”), and place entities. Below we describe some of these issues.

Why Token Identification Matters

Tokens (usually individual words) are the fundamental units of text analysis and entity identification. Identifying tokens clearly and consistently is a basic requirement for making text analysis and entity identification an integral part of scholarly practice. The adage of “garbage in, garbage out” applies to textual analysis, and tokenization is an important first step in many later analytic approaches to texts. The reliability and quality of the tokenization process shapes all downstream analysis.

Text mining algorithms are far from perfect. Such algorithms often require special “tuning” to suit the book or corpus under study. The results of these processes can and should be questioned. Moreover, researchers may want to apply different sorts of text analysis algorithms to the same texts, perhaps using certain approaches for the identification of historical events or persons, and other algorithms to identify historical places. Researchers will need to combine the results of different algorithmic analyses to compare and evaluate the outcomes of different approaches to text mining.

Because of these needs, individual tokens need clear, consistent, and persistent identifiers. Such identifiers could be used to compare and contrast entity identification results. For example, the token “Paris” may be identified as a person (a character from the Iliad) by one algorithm, while a different algorithm may identify Paris as a geographic place. Persistent identifiers for tokens make it possible to surface these conflicting results.
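To make the idea concrete, here is a sketch of how persistent token identifiers could support that comparison. The `bookId/page/index` identifier scheme and the tagger outputs are purely hypothetical:

```javascript
// Sketch: with a persistent ID per token, outputs from different entity
// identification algorithms can be joined and conflicts surfaced.
function tokenId(bookId, page, index) {
  return bookId + '/' + page + '/' + index;
}

// Hypothetical outputs from two algorithms, keyed by token ID.
var personTagger = {};
personTagger[tokenId('iliad', 12, 7)] = 'PERSON'; // "Paris" the character
var placeTagger = {};
placeTagger[tokenId('iliad', 12, 7)] = 'PLACE';   // "Paris" the city

// Collect tokens where the two algorithms disagree.
function findConflicts(a, b) {
  var conflicts = [];
  for (var id in a) {
    if (b[id] && b[id] !== a[id]) {
      conflicts.push({ id: id, first: a[id], second: b[id] });
    }
  }
  return conflicts;
}

var conflicts = findConflicts(personTagger, placeTagger);
// conflicts -> [{ id: 'iliad/12/7', first: 'PERSON', second: 'PLACE' }]
```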

Identifying Tokens in Books

Google Books offers fairly stable URIs to individual books and pages in individual books. We note that the URIs to Google Books and pages could be made more trustworthy if they did not include query parameters, but they are suitable for referencing entities at the granularity of a single page of a given book. If one looks at the HTML markup of the Google Books data, one finds individual tokens (words) bounded by <span> elements. These <span> elements themselves have title attributes that describe bounding boxes for the tokens. Presumably these bounding boxes note the position of the word or token on the scanned image of a page. Google probably uses these bounding box data to highlight terms relevant to a user’s search request.
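For illustration, extracting tokens and their bounding boxes from markup of that general shape might look like this. The attribute format below is an assumption for the sketch, not Google’s actual markup:

```javascript
// Sketch: pull tokens and bounding boxes out of <span> elements whose
// title attribute holds coordinates. The "x1,y1,x2,y2" format is assumed.
var html =
  '<span title="10,20,60,35">Halicarnassus</span> ' +
  '<span title="65,20,90,35">of</span>';

var tokens = [];
var re = /<span title="([^"]+)">([^<]+)<\/span>/g;
var match;
while ((match = re.exec(html)) !== null) {
  tokens.push({
    bbox: match[1].split(',').map(Number), // assumed order: [x1, y1, x2, y2]
    text: match[2]
  });
}
// tokens[0] -> { bbox: [10, 20, 60, 35], text: 'Halicarnassus' }
```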

The bounding box data represents the only identification for specific tokens in the Google Books HTML markup. Unfortunately (for our purposes), Google uses the title attribute and not the id attribute for expressing bounding boxes. Thus identifying and referencing tokens by their bounding boxes can’t be done with a standard URI plus fragment identifier (the part of a URI beginning with “#”).

We’ve asked the Google Books team for help on this issue, and we’re learning that Google may have some web services that could be used to reference specific tokens using bounding box coordinates. We should learn more about these shortly. However, for the time being we need an alternative approach to reference specific tokens. One possibility is that a successor to the GAP project can create its own set of Web resources where books, pages, and individual tokens can all carry persistent URIs.
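As a rough sketch of that last idea, a successor project could mint URIs along these lines, with tokens addressable by a standard fragment identifier. The base URL and path scheme here are invented for illustration:

```javascript
// Sketch of a hypothetical URI scheme for books, pages and tokens.
var BASE = 'http://example.org/gap';

function bookURI(bookId)           { return BASE + '/books/' + bookId; }
function pageURI(bookId, page)     { return bookURI(bookId) + '/pages/' + page; }
function tokenURI(bookId, page, i) { return pageURI(bookId, page) + '#token-' + i; }

var uri = tokenURI('tacitus-annals', 42, 7);
// -> 'http://example.org/gap/books/tacitus-annals/pages/42#token-7'
```

Because the token part is an ordinary fragment identifier, such URIs would work with standard Web tooling in a way that bounding boxes hidden in title attributes cannot.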

I’ve been aware for a while that there was a mismatch between the resources used by the Geoparser for geotagging (finding toponyms in text) and georesolution (determining their lat/long position) – and I’ve now got around to dealing with that. We’ve been trying to use the Geoparser without too much tweaking and reprogramming, but I clearly needed to make the lexicons it uses for geotagging tie up with the Pleiades+ gazetteer it uses for georesolution.

For place names this is pretty straightforward, as the new lexicon is largely derived directly from Pleiades+. I also needed a lexicon of ancient personal names, as one of the main reasons for poor precision and recall scores on the geotagging seemed to be that there were too many confusions over personal names: there are several places (in the modern world) called Priam, for example.
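The effect of combining the two lexicons can be sketched as a simple filter. The lexicon contents below are made up for illustration, and the real Geoparser is considerably more sophisticated than this:

```javascript
// Sketch: suppress candidate toponyms that are also ancient personal
// names, to avoid confusions like "Priam" the person vs. places so named.
var placeLexicon  = { 'Athens': true, 'Sardis': true, 'Priam': true };
var personLexicon = { 'Priam': true, 'Croesus': true };

function tagToponyms(tokens) {
  return tokens.filter(function (t) {
    return placeLexicon[t] && !personLexicon[t];
  });
}

var found = tagToponyms(['Croesus', 'marched', 'from', 'Sardis', 'Priam']);
// found -> ['Sardis']  ('Priam' matched both lexicons and is dropped)
```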

Dropping the modern place name lexicons altogether improves performance, and adding lists of ancient personal names has helped still further. The overall result is that, although there’s much tinkering we could still do, the geotagging is now producing pretty good results that are fit for our purposes. Compared against the gold standard of our hand-annotated Hestia data, the performance scores (using standard NLP precision, recall and F1 measure) are:
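For reference, these standard measures derive from counts of true positives, false positives and false negatives against the gold standard. A quick sketch, using made-up counts rather than our actual figures:

```javascript
// Standard NLP evaluation measures against a gold standard.
function evaluate(tp, fp, fn) {
  var precision = tp / (tp + fp); // fraction of tagged toponyms that are correct
  var recall    = tp / (tp + fn); // fraction of gold toponyms that were found
  var f1 = 2 * precision * recall / (precision + recall); // harmonic mean
  return { precision: precision, recall: recall, f1: f1 };
}

// Hypothetical counts, for illustration only:
var scores = evaluate(80, 20, 20);
// scores.precision -> 0.8, scores.recall -> 0.8, scores.f1 -> 0.8
```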

There’s a simple display of Herodotus Book 1 text at http://synapse.inf.ed.ac.uk/~kate/gap/normname2.display.html. That display only highlights toponyms in the text, but one of the other things we’re playing around with is identifying personal names and temporal expressions. It may be that we can do interesting things with those in GAP, if we can identify them reliably.

The next thing I’m planning to do is to get back to processing actual Google Book texts. I’d interrupted myself on that in order to fix the problems with the geotagging performance.