In preparation for the upcoming API workshop, organized by Bill Turkel, I thought I’d try to assemble a few thoughts on APIs. This is the fruit of work on several text analysis projects, including TAPoR, HyperPo, Voyeur, BonPatron and MONK (I hesitate to associate ideas with specific people without their consent, but of course this is also the fruit of working with several talented people in digital humanities).

Use REST and keep it simple. The universal KISS principle is certainly valid for APIs: the simpler things are, the more likely they’ll be properly understood and adopted. The TAPoR Portal supports both SOAP tools and REST tools, but REST tools have been far less of a headache (some of the problems related specifically to Ruby’s SOAP library, but even beyond that, for our purposes REST tools provide everything we need with less hassle). Part of keeping the syntax of the API simple is to plan for a wide range of calls; this doesn’t mean that all the calls should be implemented and documented, but listing them at the beginning helps to define the purpose and scope of the tool and helps prevent overly complex syntax that’s usually the product of afterthought.
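
As a rough illustration of how little syntax a simple REST call needs, here is a sketch in Python. The base URL, the tool name and the parameter names are all hypothetical; they are not those of TAPoR or any other real service.

```python
from urllib.parse import urlencode

# Hypothetical base URL, for illustration only.
BASE_URL = "http://example.org/tools"


def build_request_url(tool, **params):
    """Return a plain GET URL: one resource per tool, flat query parameters."""
    # Sorting the parameters keeps the URL stable, which helps with caching.
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/{tool}?{query}" if query else f"{BASE_URL}/{tool}"


url = build_request_url("frequencies", text="to be or not to be", limit=5)
print(url)
```

The whole call fits in one URL that can be pasted into a browser, which is a large part of why this style tends to be understood and adopted.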

Document the APIs (preferably automatically). Documentation goes without saying (sometimes it even goes without doing). When tools get compared and evaluated, one of the main criteria is always the extent and quality of the documentation. Besides, good documentation usually heads off support questions. Of course there may be cases where you want to keep some aspects of the API undocumented if they’re too much in flux: a documented API should be respected by both developer and user, even as the tool evolves. One of the best ways to ensure up-to-date documentation is to find a way of having tools document themselves (as Javadoc does). This is one reason why HyperPo used Cocoon and XForms in order to have self-documenting tools.
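
The self-documenting idea can be sketched in a few lines of Python. This is not how HyperPo does it (that relies on Cocoon and XForms); it is only an assumed, minimal equivalent in which each tool is a function and the documentation is read back from the function’s own signature and docstring. The `tool` decorator and the `word_count` tool are hypothetical.

```python
import inspect

# Registry of tools exposed by the (hypothetical) API.
TOOLS = {}


def tool(func):
    """Register a function as an API tool."""
    TOOLS[func.__name__] = func
    return func


@tool
def word_count(text, min_length=1):
    """Count the words in text that have at least min_length characters."""
    return sum(1 for word in text.split() if len(word) >= min_length)


def describe_tools():
    """Build the API documentation from the registered functions themselves."""
    docs = []
    for name, func in sorted(TOOLS.items()):
        parameters = []
        for param in inspect.signature(func).parameters.values():
            required = param.default is inspect.Parameter.empty
            parameters.append({
                "name": param.name,
                "required": required,
                "default": None if required else param.default,
            })
        docs.append({
            "tool": name,
            "summary": inspect.getdoc(func),
            "parameters": parameters,
        })
    return docs
```

Because the description is generated from the code that actually runs, adding a parameter to a tool updates the documentation at the same time.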

Provide XML and JSON output. Providing two forms of output is a bit contradictory to the KISS principle, but there are good reasons for providing both: 1) XML because it’s still a powerful interchange language and can be infinitely transformed with XSL; 2) JSON because results are usually easier and faster to work with for client-side JavaScript libraries (not to mention less bandwidth because results are more compact). Part of a well-documented API is of course explaining the results format.
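
Offering both formats need not mean writing the tool twice. A minimal sketch, assuming results are a list of flat records and that the element and key names (`results`, `result`, `term`, `count`) are placeholders of my own choosing:

```python
import json
import xml.etree.ElementTree as ET


def render(results, fmt="json"):
    """Serialize one list of flat records as either JSON or XML."""
    if fmt == "json":
        return json.dumps({"results": results})
    if fmt == "xml":
        root = ET.Element("results")
        for row in results:
            item = ET.SubElement(root, "result")
            for key, value in row.items():
                # Each field becomes a child element holding its text value.
                ET.SubElement(item, key).text = str(value)
        return ET.tostring(root, encoding="unicode")
    raise ValueError(f"unsupported format: {fmt}")


data = [{"term": "whale", "count": 3}]
print(render(data, "json"))
print(render(data, "xml"))
```

The tool computes its results once and only the last step differs, so the two outputs cannot drift apart.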

Provide paging functionality. It’s a pain when you really want 5 results but the tool gives you 5,000: it’s an unnecessary performance burden in terms of bandwidth, memory, and computation. There are rare exceptions, but most tools should provide paging functionality to ensure they’re scalable (even if the paging doesn’t seem immediately useful). Things get trickier when you need to combine paging and sorting or grouping, but that’s where clear API documentation helps.
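
Here is a small sketch of paging combined with sorting. The parameter names `start` and `limit` are assumptions, not taken from any particular tool. The point it illustrates is the order of operations: sort the full result set first, then cut out the page, and report the total so the client knows how many pages exist.

```python
def page(items, start=0, limit=10, sort_key=None, reverse=False):
    """Return one page of results, sorted before slicing."""
    if start < 0 or limit < 1:
        raise ValueError("start must be >= 0 and limit must be >= 1")
    # Sorting has to happen on the whole set; sorting a single page
    # would give a different (and wrong) answer.
    ordered = sorted(items, key=sort_key, reverse=reverse) if sort_key else list(items)
    return {
        "total": len(ordered),
        "start": start,
        "limit": limit,
        "results": ordered[start:start + limit],
    }


counts = [("the", 50), ("whale", 12), ("sea", 30), ("ship", 7)]
second_page = page(counts, start=2, limit=2, sort_key=lambda pair: pair[1], reverse=True)
```

With the data above, `second_page["results"]` holds the third and fourth most frequent terms, and `second_page["total"]` is 4.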

Create a proxy to channel traffic. For many client-side web applications, having a proxy channel requests to other tools can help avoid some constraints imposed by cross-domain JavaScript security. But even beyond that, proxies can serve a useful purpose as a centralized broker of communication with other tools – there are good chances that parts of proxy code can be reusable for different types of tool requests, even when direct requests to the tools are possible (for instance, caching results or handling connection errors). One of the main benefits that we’ve found from having a proxy layer goes beyond APIs: it decouples development schedules of the interface (client-side) group and the backend (server-side) group. For instance, it’s possible for the proxy to provide fake data to the interface until the backend is ready to provide real data – but the interface code is oblivious to the difference.
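
A stripped-down sketch of such a broker follows. It is not our actual proxy code; the class name, the `fetch` callable and the `fake_data` table are hypothetical, and a real proxy would sit behind an HTTP endpoint. It shows the two reusable pieces mentioned above (caching and connection-error handling) along with the fake-data fallback.

```python
class ToolProxy:
    """Broker between interface code and backend tools (illustrative only)."""

    def __init__(self, fetch, fake_data=None):
        self.fetch = fetch                # callable(tool, params) -> result
        self.fake_data = fake_data or {}  # canned results keyed by tool name
        self.cache = {}

    def request(self, tool, params):
        # Build a hashable cache key from the tool name and its parameters.
        key = (tool, tuple(sorted(params.items())))
        if key in self.cache:
            return self.cache[key]
        try:
            result = self.fetch(tool, params)
        except ConnectionError:
            # Backend unavailable (or not built yet): serve canned data if
            # there is any, but do not cache it as if it were real.
            if tool not in self.fake_data:
                raise
            return self.fake_data[tool]
        self.cache[key] = result
        return result
```

The interface group only ever calls `request`, so swapping the canned data for the real backend later requires no change on the client side.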

For rich client-side tools, create embeddable objects. We usually think of APIs as providing data-centric content that is transformed and presented to the user in a different format. However, there are some tools where the server-side and client-side components work together and it’s actually the bundled combination that’s desired. These are often called widgets or badges, and they provide stand-alone functionality (like an embedded YouTube video or a Twitter timeline). A text-analysis example of this is Voyeur panels, like on the Day of Digital Humanities. Again, because of cross-domain security constraints, it can be easiest to embed these panels in an IFRAME (though of course they won’t be allowed to interact with the rest of the page).

Coordinated redundancy of services would be nice. I’m talking here primarily about academic projects, not commercial services: our servers and services go down for a variety of reasons and there’s rarely staff available 24/7 to make sure things are restored immediately. Furthermore, we’re more likely in an academic context to deploy an experimental version of something that could inadvertently break functionality required elsewhere. The problem is that if Project 1 depends on services from Project 2 but Project 2 is unavailable for some time, Project 1 may be partly or completely compromised. Projects that want to do the right thing and integrate existing remote services instead of re-inventing every wheel or having local installations of every service (that individually need to be maintained) face a network challenge. One possibility (again that’s fairly specific to the academic context) is to have a mechanism for coordinating fail-over sites for certain services. This isn’t quite as easy as it sounds since you need to maintain and distribute (presumably again through an API) a list of current installations with versioning information included. One benefit, if there really is collaboration between sites, is that you get a form of mirroring that can provide load-balancing as well as improve network latency by calling services that are closer to you. I don’t think we have any good examples of tools that are widely used by several digital humanities projects, but that’s not entirely the fault of the existing tools, it’s that we haven’t focused enough on APIs and distributed services….
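
To make the fail-over idea slightly more concrete, here is a sketch of the client side. No such registry exists as far as I know; the list format (a URL plus a version string per installation) and the function name are assumptions made for the example.

```python
def call_with_failover(installations, call, required_version):
    """Try each compatible installation in order; return (url, result).

    installations: list of {"url": ..., "version": ...} dicts, as might be
                   distributed by a shared registry (hypothetical format).
    call:          callable(url) -> result, raising ConnectionError on failure.
    """
    errors = []
    for site in installations:
        # Skip mirrors running a version the caller was not written for.
        if site["version"] != required_version:
            continue
        try:
            return site["url"], call(site["url"])
        except ConnectionError as exc:
            errors.append((site["url"], str(exc)))
    raise RuntimeError(f"no compatible installation answered: {errors}")
```

Ordering the list by network proximity would give the latency benefit mentioned above, and shuffling it would give a crude form of load-balancing.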

Although HyperPo has many faults (not very scalable, not to mention the fact that its development has been superseded by Voyeur), it does provide a decent API. To see it in action, you can view the list of modular tools in the HyperPoets Gallery, click on one of the tools, scroll down to near the bottom of the page and click the API link, and submit some values (please don’t be a bully – use shorter texts :-). Some tools provide alternate output formats – you’ll find those in the options section if applicable. For instance:

Are all of the stakeholders on board? (Hat tip to @patrickgmj for this gem.)

What about sustainability?

In their right place, each of these is a valid criticism. But they shouldn’t be leveled reflexively. Sometimes X, Y, and Z’s project stinks, or nobody uses it, or their code is lousy. Sometimes stakeholders can’t see through the fog of current practice and imagine the possible fruits of innovation. Sometimes experimental projects can’t be sustained. Sometimes they fail altogether.

If we are going to advance a field as young as digital humanities, if we are going to encourage innovation, if we are going to lift the bar, we sometimes have to be ready to accept “I don’t know, this is an experiment” as a valid answer to the sustainability question in our grant guidelines. We are sometimes going to have to accept duplication of effort (aren’t we glad someone kept experimenting with email and the 1997 version of Hotmail wasn’t the first and last word in webmail?) And true innovation won’t always garner broad support among stakeholders, especially at the outset.

Duplication of effort, stakeholder buy-in, and sustainability are all important issues, but they’re not all important. Innovation requires flexibility, an acceptance of risk, and a measure of trust. As Dorothea Salo said on Twitter, when considering sustainability, for example, we should be asking “‘how do we make this sustainable?’ rather than ‘kill it ‘cos we don’t know that it is.’” As Rachel Frick said in the same thread, in the case of experimental work we must accept that sustainability can “mean many things,” for example “document[ing] the risky action and results in an enduring way so that others may learn.”

Innovation makes some scary demands. Dorothea and Rachel present some thoughts on how to manage those demands with the other, legitimate demands of grant funding. We’re going to need some more creative thinking if we’re going to push the field forward.

Late update (10/16/09): Hugh Cayless at Scriptio Continua makes the very good, very practical point that “if you’re writing a proposal, assume these objections will be thrown at it, and do some prior thinking so you can spike them before they kill your innovative idea.” An ounce of prevention is worth a pound of cure … or something like that.

I am very pleased to be attending the Workshop on Application Programming Interfaces for the Digital Humanities sponsored by SSHRC and hosted by the amazing Bill Turkel in his role as a member of NiCHE.

Here are a few things I’m thinking about going into Day 2:

In talking about APIs, we’re necessarily talking about access and the political and cultural issues that surround access to cultural heritage materials. It’s one thing for a library (say) to make some data collection available and to allow you to browse, search, and display it in various ways. It’s another thing to allow other people to come along and create their own ways of browsing, searching, viewing (which is what API access is really about). I think we need to insist on the necessity of this form of access as essential to the future of digital work in the humanities and social sciences. At the same time, we need to be respectful of those who are understandably nervous about it. How do we articulate the benefits of this kind of access? How do we persuade content providers that this kind of access is good for the institutions that provide it, and not just for the people who take advantage of the new entry point?

There’s a movable wall when it comes to APIs. I heard a lot of people yesterday describing elaborate ideas about data mining with textual resources (or something similarly ambitious), but in every case, I noticed that the idea was predicated not on access to a series of data points, but on access to the entire dataset. This raises a fundamental question (for designers) on where you put the “wall” between the resource and the user. You could imagine an API that had a single function called “get_all()”. Call that, and you can mirror the entire dataset and do what you like. You could also have an API with dozens of highly granular hooks that return nicely formatted data structures, and so forth. The former is undoubtedly the most flexible, but it’s also the hardest to work with (particularly if you’re a novice programmer). But again, it’s a kind of shifting wall. If it’s data mining you’re after, you could do all that mining back on the archive side and make the results available through the (highly granular) API. These aren’t mutually exclusive, of course; Flickr, for example, offers both kinds. Still, I think keeping this in mind helps to highlight some of the design challenges one encounters with APIs in general.
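
The two positions of the wall can be sketched side by side. Everything here is hypothetical: the `Archive` class stands in for a repository, `get_all()` is the coarse call imagined above, and `term_frequencies()` is one made-up example of a granular hook whose mining happens on the archive side.

```python
from collections import Counter


class Archive:
    """Toy repository exposing both a coarse and a granular call."""

    def __init__(self, documents):
        self._documents = dict(documents)  # document id -> full text

    def get_all(self):
        """Wall far from the user: hand over everything, caller does the mining."""
        return dict(self._documents)

    def term_frequencies(self, doc_id, top=5):
        """Wall close to the user: mine on the archive side, return a small result."""
        words = self._documents[doc_id].lower().split()
        return Counter(words).most_common(top)


archive = Archive({"d1": "the whale the sea"})
mirror = archive.get_all()                         # flexible, but all the work is yours
top_term = archive.term_frequencies("d1", top=1)   # convenient, but only what is offered
```

Neither call excludes the other; the design question is which ones a given resource can afford to offer and support.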

I think we need to think more carefully about “impedance mismatches” between data sources. There was a lot of talk yesterday about mashing this humanities resource to that humanities resource, but I think there were also some hand-waving assumptions (I was guilty as much as anyone) about the degree to which that data is tractable from an interoperability standpoint. Some of the most successful web service APIs are successful, I think, because the data is simple and easy to work with (lat/longs, METAR data, stats arranged as key-value pairs, etc.). Humanities resources are often quite a bit more complicated, and there’s far less agreement about how that data should be formatted. It’s true that the TEI (for example) provides a degree of metadata standardization, but it’s mostly silent about how the content itself should be formatted. That is, when you actually look at the content of the “tags” (whether it’s XML or something else entirely), you find that people are defining things at radically different levels of granularity and with different ordering schemes. I don’t want to declare that the sky is falling; I just want to point out that some of this might be quite a bit more difficult than it sounds. And it’s a tough problem, because defining complicated interoperability standards in this space really does, in my opinion, run against the spirit of the thing.

I’ve had a wonderful time at this gathering, which includes so many talented librarians, scholars, and hackers (many of whom manage to combine all three skill sets). I can’t help but think that great things will come of this.

How have online digital technologies changed environmental history research, communication, and teaching? This episode of the podcast explores this question in the context of the recent NiCHE Digital Infrastructure API Workshop held in Mississauga, Ontario. Online-based Application Programming Interfaces or APIs are just one digital technology that holds the potential to change the way environmental historians access resources, analyze historical data, and communicate research findings. Within the past decade alone, the development of online digital technologies has offered the potential to transform historical scholarship.

This episode includes a round-table conversation with some leading figures in the realm of digital history as well as an interview with Jan Oosthoek, the producer and host of the Exploring Environmental History podcast.

I recently attended Bill Turkel’s Workshop on APIs for the Digital Humanities held in Toronto and had the pleasure of coming face to face with many of the people who have created the historical data repositories I have used so enthusiastically.

What I came away with was an even stronger conviction that data in humanities repositories should be completely liberated. By that, I mean given away in their entirety.

The mass digitizations that have occurred recently have provided a great first step for researchers. I no longer need to fly to London and sit in a dark room sifting through boxes to look at much of the material I seek. But, I’m still – generally – unable to do with it what I like.

Many repositories contain scanned documents which have been OCR’d so that they are full-text searchable, but that OCR is not shared with the end user, rendering the entire thing useless for those wanting to do a text analysis or mash up the data with another repository’s content.

Most databases require the user to trust the site’s search mechanisms to query the entire database and return all relevant results. If I’m doing a study, I’d prefer to do so with all the material available. Without access to the entire database on my hard drive, I have no way of verifying that the search has returned what I sought.

Many of those at the workshop who administered repositories were willing and eager to email their data at the drop of a hat, but that is not yet the norm. Most of my past requests for data have been completely ignored. When it comes to scholarly data, possessiveness leads to obscurity.

As humanists become increasingly confident programmers, many will define research projects based on the accessibility of the sources. Those who are giving their data away will end up cited in books and journal articles. Those desperate to maintain control will get passed by. If someone asks you for your data, think of it as a compliment to your work, then say yes.