Wednesday, April 28, 2010

There is no surer way to flush out software bugs and configuration errors than to do a sales demo. The process not only exposes the problem, but also sears into the psyche of the demonstrator an irrational desire to see the problem eradicated from the face of the earth, no matter the cost or consequences.
Here's a configuration problem I once found while demonstrating software to a potential customer:
Many library information services can be configured with the base URL for the institution's OpenURL server. The information service then constructs links by appending "?" and a query string onto the base URL. So for example, if the base URL is http://example.edu/links and the query string is isbn=9780393072235&title=The+Big+Short, the constructed URL is http://example.edu/links?isbn=9780393072235&title=The+Big+Short.
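The construction step is trivial; here's a sketch of it in Python (illustrative only, not any vendor's actual code):

```python
# Sketch: how an information service might append "?" and an encoded
# query string onto a configured OpenURL base URL.
from urllib.parse import urlencode

def build_openurl(base_url, metadata):
    """Append "?" and an encoded query string onto the base URL."""
    return base_url + "?" + urlencode(metadata)

link = build_openurl("http://example.edu/links",
                     {"isbn": "9780393072235", "title": "The Big Short"})
print(link)
# http://example.edu/links?isbn=9780393072235&title=The+Big+Short
```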
For the demo, we had configured the base URL to be very short: http://example.edu, so the constructed URL would have been http://example.edu?isbn=9780393072235&title=The+Big+Short. Everything worked fine when we tested beforehand. For the customer demo, however, we used the customer's computer, which was running some Windows version of Internet Explorer that we hadn't tested, and none of the links worked. Internet Explorer had this wonderful error page that made it seem as if our software had broken the entire web. Luckily, breaking the entire web was not uncommon at the time, and I was able to navigate to a different demo site and make it appear as if I had fixed the entire web, so we managed to make the sale anyway.
It turns out that http URLs with null paths aren't allowed to have query strings. You wouldn't know it if you looked at the W3C documentation for URIs, which is WRONG, but you will see it if you look at the IETF specs, which have jurisdiction (see RFC 1738 and RFC 2616).
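A quick check with Python's standard library will flag the problem case (an illustrative sketch; note that configuring the base URL as http://example.edu/, with the trailing slash giving it a path of "/", would have sidestepped the whole issue):

```python
# Detect the problem case described above: a query string attached
# to an http URL whose path is empty.
from urllib.parse import urlsplit

def pathless_with_query(url):
    parts = urlsplit(url)
    return parts.scheme == "http" and parts.path == "" and parts.query != ""

print(pathless_with_query("http://example.edu?isbn=9780393072235"))        # True
print(pathless_with_query("http://example.edu/links?isbn=9780393072235"))  # False
```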
Internet Explorer was just implementing the spec, ignoring the possibility that someone might ignore or misinterpret it. The fact that Netscape worked where IE failed could be considered a bug or a feature, but most users probably considered Netscape's acceptance of illegal URLs to be a feature.
I still feel a remnant of pain every time I see a pathless URL with a query string. Most recently, I saw a whole bunch of them on the thing-described-by site and sent a nit-picky e-mail to the site's developer, and was extremely pleased when he fixed them. (Expeditious error fixing will be richly rewarded in the hereafter.) I've come to recognize, however, that a vast majority of these errors will never be fixed or even noticed, and maybe that's even a good thing.

Nit picking appears to have been a highlight of the Linked Data on the Web Meeting in Raleigh, NC yesterday, which I've followed via Twitter. If you enjoy tales of nerdy data disasters or wonky metadata mischief, you simply must peruse the slides from Andreas Harth's talk (1.8M, pdf) on "Weaving the Pedantic Web". If you're serious about understanding real-world challenges for the Semantic Web, once you've stopped laughing or crying at the slides you should also read the corresponding paper (pdf, 415K). Harth's co-authors are Aidan Hogan, Alexandre Passant, Stefan Decker, and Axel Polleres from DERI.
The DERI team has studied the incidence of various errors made by publishers of Linked Data "in the wild". Not so surprisingly, they find a lot of problems. For example, they find that 14.3% of triples in the wild use an undeclared property and 8.1% of the triples use an undeclared class. Imagine if a quarter of all sentences published on the web used words that weren't in the dictionary, and you'd have a sense of what that means. 4.7% of typed literals were "ill-typed". If 5% of the numbers in the phone book had the wrong number of digits, you'd probably look for another phone book.
They've even found ways that seemingly innocuous statements can have serious repercussions. It turns out that it's possible to "hijack" a metadata schema, and induce a trillion bad triples with a single Web Ontology Language (OWL) assertion.

To do battle with the enemy of badly published Linked Data, the DERI team urges community involvement in a support group that has been formed to help publishers fix their data. The "Pedantic Web" group has 137 members already. This is a very positive and necessary effort. But they should realize that the correct-data cause is a hopeless one. The vast majority of potential data publishers really don't care about correctness, especially when some of the mistakes can be so subtle. What they care about is accomplishing specific goals. The users of my linking software only cared that the links worked. HTML authors mostly care only that the web page looks right. Users of Facebook or Google RDFa will only care that the Like buttons or Rich Snippets work, and the fact that the schemas for these things either don't exist in machine-readable form or are wildly inconsistent with the documentation is a Big Whoop.
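To see how a single assertion can blow up like that, consider an inverse-functional property gone wrong (my illustration of the mechanism and my numbers, not necessarily the paper's exact example):

```python
# Sketch of "schema hijacking" damage. If someone (wrongly) declares a
# common property to be an owl:InverseFunctionalProperty, a reasoner must
# conclude owl:sameAs between every pair of subjects that share a value
# for it -- say, a million profiles all carrying an empty-string hash.
subjects_sharing_value = 1_000_000
inferred_sameas = subjects_sharing_value ** 2  # every ordered pair
print(inferred_sameas)  # 1000000000000 -- a trillion triples from one assertion
```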
Until of course, somebody does a sales demo, and the entire web crashes. (Nit and head louse photos from Wikimedia Commons.)

Monday, April 26, 2010

Imagine walking into a building. At the door, you pick up an electronic reading device- you've presented some ID and have agreed to pay for anything you take and don't return. Inside the building are displays of books and comfortable reading areas. There's a cafe off in the corner. Your reading device gives you access to hundreds of thousands of books, all free for you to read. You can curl up and read the latest romance novel, or you can study for tomorrow's physics exam. If you find an ebook you really like, you click a few buttons, and you can take the book home with you.

Are you in a library, or are you in a bookstore?

If you're doing that today, the answer is that you're in a Barnes and Noble bookstore, and you've just paid for a Nook ebook reader. The latest version of the Nook software includes a "Read In Store" feature. And the Nook seems to be selling pretty well. Over at the Thingology Blog, Tim Spalding wonders when the "Read In Store" functionality is going to migrate to libraries (he calls it the "Brigadoon Library", because the books vanish when you leave the building). A while back, I wondered whether something like it would work in a Starbucks.

Which raises a deeper question: what is a library, anyway?

A week or so ago, I was asked what I thought digital libraries would look like in five years. I answered with a question- "what will libraries look like in five years?" and started blabbering about "what is a digital library anyway?" What else could I do?

Here's what I wrote six years ago:

A digital library is "Any collection of digital resources managed with the primary goal of maximizing the collection's utility to a defined user community".

Even if you remove the word "digital" from that, I think that still works.

Barnes and Noble is not a library, because the collection of material it makes available is designed to maximize sales, not to benefit a community. When Barnes and Noble goes bankrupt, and sells the building and the Nooks and the contractual arrangements to the local library foundation, it would become, almost by magic, a library. Or at least it would after they get rid of the one-hour-per-day limit for Read In Store; that's how I see it. It could still sell the $125 Kate Spade Nook covers, though.

That's not to say that having a Barnes and Noble Nook Cafe doesn't benefit a community, and it doesn't say that a non-profit won't try to maximize revenue in the interests of self preservation. But it does say that the essence of libraryness is rooted in the community that the library-like resource serves, and not in its collection of stuff.

I've talked to a lot of people in publishing and in libraries about the Brigadoonbucks library concept, and I haven't found much interest. The technology would be pretty easy, but the reconceptualization of libraries and bookselling is going to take a while. In New Jersey, at least, it seems more likely that our new Governor will sell off the libraries to Barnes and Noble.

Thursday, April 22, 2010

Facebook and Twitter each held developer conferences recently, and the conference names speak worlds about the competing worldviews. Twitter's conference was called "Chirp", while Facebook's conference was labeled "f8" (pronounced "FATE"). Interestingly, both companies used their developer conferences to announce new capability to integrate meaning into their networks.

Facebook's announcement centered on something it's calling the "Open Graph protocol". Facebook showed its market power by rolling it out immediately with 30 large partner sites that are allowing users to "Like" them on Facebook. Facebook's vision is that web pages representing "real-world things" such as movies, sports teams, products, etc. should be integrated into Facebook's social graph. If you look at the right-hand column of this blog, you'll see an opportunity to "Like" the Blog on Facebook. That action has the effect of adding a connection between a node that represents you on Facebook and a node that represents the blog on Facebook. The Open Graph API extends that capability by allowing the inclusion of web-page nodes from outside Facebook in the Facebook "graph". A webpage just needs to add a bit of metadata into its HTML to tell Facebook what kind of thing it represents.
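That bit of metadata is just a few meta tags in the page's head. Here's a sketch of how a harvester might read them (the og: property names come from the Open Graph protocol documentation; the parser itself is my own illustration):

```python
# Sketch: pulling Open Graph metadata out of a page's HTML
# using only the standard library.
from html.parser import HTMLParser

class OGParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            prop = d.get("property") or ""
            if prop.startswith("og:"):
                self.og[prop] = d.get("content")

page = '''<head>
  <meta property="og:title" content="The Rocky Horror Picture Show"/>
  <meta property="og:type" content="movie"/>
  <meta property="og:url" content="http://www.imdb.com/title/tt0073629/"/>
</head>'''

p = OGParser()
p.feed(page)
print(p.og["og:type"])  # movie
```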

I've written previously about RDFa, the technology that Facebook chose to use for Open Graph. It's a well designed method for adding machine-readable metadata into HTML code. It's not the answer to all the world's problems, but it can't hurt. When Google announced it was starting to support RDFa last summer, it seemed to be hedging its bets a bit. Not Facebook.

The effect of using RDFa as an interface is to shift the arena of competition. Instead of forcing developers to choose which APIs to support in code, using RDFa asks developers to choose among metadata vocabularies to support their data model. Like Google, Facebook has created its own vocabularies rather than use someone else's. Also, like Google last summer, the documentation for the metadata schemas seems not to have been a priority. Although Facebook has put up a website for Open Graph protocol at http://opengraphprotocol.org/ and a Google group at http://groups.google.com/group/open-graph-protocol, there are as yet no topics approved for discussion in the group. [Update- the group is suddenly active, though tightly moderated.]

Nonetheless, websites that support Facebook's metadata will also be making that metadata available to everyone, including Google, putting increased pressure on websites to make available machine readable metadata as the ticket price for being included in Facebook's (or anyone's) social graph. A look at Facebook's list of object types shows their business model very clearly. Here's their "Product and Entertainment" category:

album

book

drink

food

game

movie

product

song

tv_show

Whether you "Like" it or not, Facebook is creating a new playing field for advertising by accepting product pages into their social graph.

Facebook clearly believes that fate follows its intelligent design. Twitter, by contrast, believes its destiny will emerge by evolution from a primordial ooze.

At Twitter's "Chirp" conference, Twitter announced that it will add "Annotations" to the core Twitter platform. The description of Twitter Annotations is characteristically fuzzy and undetermined. There will be some sort of triple structure, the annotations will be fixed at a tweet's creation, and annotations will be limited to either 512 bytes or maybe 1K. What will it be used for? Who knows?

Last week, I had a chance to talk to Twitter's Chairman and co-Founder Jack Dorsey at another great "Publishing Point" meeting. He boasted about how Twitter users invented hashtags, retweets and "@" references, and Twitter just followed along. Now, Twitter hopes to do the same thing with annotations. Presumably, the Twitter ecosystem will find a use for Tweet annotations and Twitter can then standardize them. Or not. You could conceivably load the Tweet with Open Graph metadata and produce a Facebook "Like" tweet.

Many possibilities for Tweet annotations, underspecified as they are, spring to mind. For example, the Code4Lib list was buzzing yesterday about the possibility that OpenURL references (the kind used in libraries to link to journal articles and books) could be loaded into an annotated tweet. It seems more likely to me that a standard mechanism to point to external metadata, probably expressed as Linked Data, will emerge. A Tweet could use an annotation to point to a web page loaded with RDFa metadata, or perhaps to a repository of item descriptions such as I mentioned in my post on Linked Descriptions. Clearly, it will be possible in some way or other to put real, actionable literature references into a tweet. Whether it will happen, it's hard to say, but I wouldn't hold my breath for Facebook to start adding scientific articles into its social graph.

Although there's a lot of common capability possible between Facebook's Open Graph and Twitter's Annotations, the worldviews are completely different. Twitter clearly sees itself as a communications medium and its Annotations as adjuncts to that communication. In the twitterverse, people are entities that tweet about things. Facebook sees its social graph as its core asset and thinks of the graph as being a world-wide web in and of itself. People and things are nodes on a graph.

While Facebook seems to offer a lot more to developers than Twitter, I'm not so sure that I like its worldview as much. I'm much more than a node on Facebook's graph.

Sunday, April 18, 2010

When I was in grad school, my housemates and I would sit around the dinner table and have endless debates about obscure facts like "there's no such thing as brown light". That doesn't happen so much in my current life. Instead, my family starts making fun of me for "whipping out my iPhone" to retrieve some obscure fact from Wikipedia to end a discussion about a questionable fact. This phenomenon of having access to huge amounts of information has also changed the imperatives of education: students no longer need to learn "just in case", but they need to learn how to get information "just in time".

In thinking about how to bring semantic technologies to bear on OpenURL and reference linking, it occurred to me that "just in time" and "just in case" are useful concepts for thinking about linking technologies. Semantic technologies in general, and Linked Data in particular, seem to have focused on just-in-case, identifier-oriented linking. Library linking systems based on OpenURL, in contrast, have focused on just-in-time, description-oriented linking. Of course, this distinction is an oversimplification, but let me explain a bit what I mean.

Let's first step back and take a look at how links are made. Links are directional; they have a start and an end (a target). The start of a link always has an intention or purpose, the target is the completion of that purpose. For example, look at the link I have put on the word "grad school" above. My intention there was to let you, the reader, know something about my graduate school career, without needing to insert that digressional information in the narrative. (Actually my purpose was to illustrate the previous sentence, but let's call that a meta-purpose.) My choice of URL was "http://ee.stanford.edu/", but I might have chosen some very different URL. When I choose a specific URL, I "bind" that URL to my intention.

In the second paragraph, I have added a link for "OpenURL". In that case, I used the "Zemanta" plug-in to help me. Zemanta scans the text of my article for words and concepts that it has links for, and offers them to me as choices to apply to my article. Zemanta has done the work of finding links for a huge number of words and concepts, just in case a user comes along with a linking intention to match. In this case, the link suggested by Zemanta matches my intention (to provide background for readers unfamiliar with OpenURL). The URL becomes bound to the word during the article posting process.

At the end of this article, there's a list of related articles, along with a link that says "more fresh articles". I don't know what URLs Zemanta will supply when you click on it, but it's an example of a just in time link. A computer scientist would call this "late binding". My intention is abstract- I want you to be able to find articles like this one.

Similar facilities are in operation in scholarly publishing, but the processes have a lot more moving parts.

Consider the citation list of a scientific publication. The links expressed by these lists are expressions of the author's intent- perhaps to support an assertion in the article, to acknowledge previous work, or to provide clarification or background. The cited item is described by metadata formatted so that humans can read and understand the description and go to a library to find the item. Here's an example:

A. Aspect, J. Dalibard, and G. Roger, "Experimental Test of Bell's Inequalities Using Time-Varying Analyzers," Phys. Rev. Lett. 48, 1559 (1982).

With the movement of articles on-line, the citations are typically turned into links in the publication process by parsing the citation into a computer-readable description. If the publisher is a member of CrossRef, the description could then be matched against CrossRef's huge database of article descriptions. If a match is found, the cited item description is bound to an article identifier, the DOI. For my example article, the DOI is 10.1103/PhysRevLett.48.1559. The DOI provides a layer of indirection that's not found in Zemanta linking. While CrossRef binds the citation to an identifier, the identifier link, http://dx.doi.org/10.1103/PhysRevLett.48.1559, is not bound to the target URL, http://prl.aps.org/abstract/PRL/v48/i22/p1559_1, until the user clicks the link. This scheme holds out hope that should the article move to a different URL, the connection to the citation can be maintained and the link will still work.
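The layer of indirection is easy to see if you spell it out (a sketch of the idea, not CrossRef's actual machinery):

```python
# Sketch of the two bindings described above: the citation is bound once
# to a DOI (by CrossRef matching), while the DOI is bound to the article's
# current location only when the user clicks, by the dx.doi.org resolver.
def doi_link(doi):
    """Bind a citation to a stable identifier URL, not to the article's location."""
    return "http://dx.doi.org/" + doi

doi = "10.1103/PhysRevLett.48.1559"
print(doi_link(doi))  # http://dx.doi.org/10.1103/PhysRevLett.48.1559
# The resolver URL is what gets embedded in the citation; the publisher
# can move the article without breaking the link.
```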

If the user is associated with a library using an OpenURL link server, another type of match can be made. OpenURL link servers use knowledgebases which describe the set of electronic resources made available by the library. When the user clicks on an OpenURL link, the description contained in the link is matched against the knowledgebase, and the user is sent to the best-matching library resource. It's only at the very last moment that the intent of the link is bound to a target.
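Schematically, the link server's job looks something like this (a toy sketch; the knowledgebase format and the ISSN-only matching are my simplifications, not any vendor's design):

```python
# Toy link resolver: match the description carried in an OpenURL
# against a knowledgebase of the library's electronic holdings.
knowledgebase = [
    {"issn": "0031-9007", "resource": "http://publisher.example/prl"},
    {"issn": "0028-0836", "resource": "http://aggregator.example/nature"},
]

def resolve(openurl_description):
    """Late binding: the target is chosen only when the user clicks."""
    for holding in knowledgebase:
        if holding["issn"] == openurl_description.get("issn"):
            return holding["resource"]
    return None  # this library has no appropriate copy

print(resolve({"issn": "0031-9007", "volume": "48", "spage": "1559"}))
# http://publisher.example/prl
```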

While the combination of OpenURL and CrossRef has made it possible to link citations to their intended target articles in libraries with good success, there has been little leveraging of this success outside the domain of scholarly articles and books. The NISO standardization process for OpenURL spent a great deal of time in making the framework extensible, but the extension mechanisms have not seen the use that was hoped for.

The level of abstraction of NISO OpenURL is often cited as a reason it has not been adopted outside its original application domain. It should also be clear that many applications that might have used OpenURL have instead turned to Semantic Web and Linked Data technologies (Zemanta is an example of a linking application built with semantic technologies.) If OpenURL and CrossRef could be made friendly to these technologies, the investments made in these systems might also find application in more general circumstances.

I began looking at the possibilities for OpenURL Linked Data last summer, when, at the Semantic Technologies 2009 conference, Google engineers expressed great interest in consuming OpenURL data exposed via RDFa in HTML, which had just been finalized as a W3C Technical Recommendation. I excitedly began to work out what was needed (Tony Hammond, another member of the NISO standardization committee, had taken a crack at the same thing.)

My interest flagged, however, as I began to understand the nagging difficulties of mapping OpenURL into an RDF model. OpenURL mapped into RDF was...ugly. I imagined trying to advocate use of OpenURL-RDF over BIBO, an ontology for bibliographic data developed by Bruce D'Arcus and Frédérick Giasson, and decided it would not be fun. There's nothing terribly wrong with BIBO.

One of the nagging difficulties was that OpenURL-RDF required the use of "blank nodes", because of its philosophy of transporting descriptions of items which might not have URIs to identify them. When I recently described this difficulty to the OpenURL Listserv, Herbert van de Sompel, the "irresistible force" behind OpenURL a decade ago, responded with very interesting notes about "thing-described-by.org", how it resembled "by-reference" OpenURL, and how this could be used in a Linked Data friendly link resolver. Thing-Described-by is a little service that makes it easy to mint a URI, attach an RDF description to it, and make it available for harvest as Linked Data.

In the broadest picture, linking is a process of matching the intent of a link with a target. To accomplish that, we can't get around the fact that we're matching one description with another. A link resolver needs to accomplish this match in less than a second using a description squeezed into a URL, so it must rely on heuristics, pre-matched identifiers, and restricted content domains. If link descriptions were pre-published as Linked Data as in thing-described-by.org, linking providers would have time to increase accuracy by consulting more types of information and provide broader coverage. By avoiding the necessity of converting and squeezing the description into a URL, link publishers could conceivably reduce costs while providing for richer links. Let's call it "Linked Description Data".

Descriptions of targets could also be published as Linked Description Data. Target knowledgebase development and maintenance is a significant expense for link server vendors. However, target publishers have come to understand the importance (see KBART) of providing more timely, accurate and granular target descriptions. If they ever start to view the knowledgebase vendors as bottlenecks, the Linked Description Data approach may prove appealing.

Computers don't learn "just-in-time" or "just-in-case" the way humans do. But the matching at the core of making links can be an expensive process, taking time proportional to the square of the number of items (N²). Identifiers make the process vastly more efficient (N·log N). This expense can be front-loaded (just-in-case) or saved till the last moment (just-in-time), but opening the descriptions being matched for "when-there's-time" processing could result in dramatic advances in linking systems as a whole.
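The difference is easy to demonstrate with a toy matching job (clean strings standing in for the fuzzy description-matching real systems have to do):

```python
# Matching two files of records with and without shared identifiers.
# Pairwise description comparison is O(N^2); an indexed join on an
# identifier is O(N·log N) with sorting, close to O(N) with hashing.
left  = [{"id": str(i), "title": "Article %d" % i} for i in range(1000)]
right = [{"id": str(i), "fulltext": "..."} for i in range(1000)]

def match_pairwise(left, right):
    # Compare every record against every other record: N * N comparisons.
    return sum(1 for a in left for b in right
               if a["title"] == "Article " + b["id"])  # stand-in for fuzzy matching

def match_by_id(left, right):
    # One pass to build an index on the identifier, one pass to join.
    index = {a["id"]: a for a in left}
    return sum(1 for b in right if b["id"] in index)

assert match_pairwise(left, right) == match_by_id(left, right) == 1000
```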

Monday, April 12, 2010

I had this great idea about how to fight comment spam. If you're not familiar with comment spam, you probably don't have your own blog and you think that "Kathryn" and "Patrick" who try to comment on this blog are just brain dead people. You might be right about the brain dead part, but I'm not sure they're really people.

Do you ever wonder why commenting on blogs can be such a hassle, or why so many blogs require moderation, or why many blogs don't accept comments on older posts, or forbid links in comments? It's because of comment spam. Spammers will submit comments such as "Your post is helpful and informative" or "We need to pay attention to the eco friend environment" that don't address the topic of the post in question. I'm not talking about targeted self-promotion here. It's not comment spam to link to an article you wrote on a similar topic, but it's definitely comment spam if you use a robot to do so. Or if you hire people in Asian boiler rooms to get around the CAPTCHAs that stop your robots.

It used to be that comment spam was done to improve the search engine ranking of websites. That motivation has largely gone away with the development of the "nofollow" attribute. Blogs such as "Go To Hellman" attach rel="nofollow" to any links in the comment threads. This tells spidering robots not to follow the specified links and tells search engines to ignore the links for purposes of site ranking.
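The markup change is tiny. Here's a naive sketch of the kind of rewriting blog software does to comment links (real software parses the HTML rather than trusting a regex; the spam URL is made up):

```python
# Sketch: tag anchor links in comment HTML so search engines ignore them.
import re

def add_nofollow(comment_html):
    """Insert rel="nofollow" into every anchor tag (naive regex version)."""
    return re.sub(r'<a ', '<a rel="nofollow" ', comment_html)

print(add_nofollow('Nice post! <a href="http://spam.example/">cheap pills</a>'))
# Nice post! <a rel="nofollow" href="http://spam.example/">cheap pills</a>
```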

I guess the people who have been leaving spam comments on my blog didn't get that memo. It's annoying to have to delete the comments, especially the ones in Chinese where links get hidden around the periods in "...". I went to the Blogger help pages to see if there's any way to report the abusive commenters (this blog restricts anonymous comments, so there's at least a user profile for every comment). There isn't. What's worse, Google tells you that if you don't remove those spam comments, your site's ranking will be hurt.

Then I had my bright idea. I clicked on one of the links left in the spam comments. Then I picked some keywords from the page and plugged them into Google to find the site. There, at the bottom of the search result, was an option: "Dissatisfied? Help us improve." Google is asking for feedback. I pasted in the URL for my comment spammer's site, and checked the radio button labeled "The results included spam." I clicked send, and my spammer's site was bound for Google oblivion!

Beware, comment spammers, I'm going to report you!

Though I felt good about it, I started to have doubts. A lot of these comment spammers seemed to be Asian; could it be that Asian search engines didn't get the nofollow memo either? Some quick googling confirmed my suspicion: China's leading search engine, Baidu, doesn't pay attention to the nofollow attribute! These comment spammers must be using my blog to juice their Baidu ranking!

Well, maybe not. I did a few searches in Baidu. Baidu is probably the worst internet search engine I've ever tried! Baidu gives really stupid results for my vanity search. Baidu doesn't index my blog, my website, or anything I've ever posted. Perhaps China has blacked out the entire Google network, including Blogger, and Baidu doesn't see it any more. Or perhaps "Go To Hellman" has been banned for its post on Qin Shi Huangdi. Baidu has spidered a page from WorldCat that mentions some other Eric Hellman, and has picked up blog mentions of me by John Blyberg and in Dear Author, but not much else. It's safe to assume that Baidu's strength is not English-language indexing.

So if Baidu doesn't index my blog, then spammers shouldn't be able to improve their Baidu rankings with comment spam in my blog. There must be some other motivation for the comments.

Another thing I noticed is that Baidu seems to be big on searching for MP3s and PDFs. It ranks sites like Rapidshare rather highly. Maybe Baidu and similar search engines spider websites like my blog to discover the MP3 files, the PDFs, and the video files that Baidu users are really looking for, and the intended audience of the spam comments is these content spiders. My blog has discussed ebooks, piracy and related topics, so maybe the spammers think it's a good source for links to content. Who knows?

Another possibility is that the spammers are trying to get bloggers themselves to visit their sites. "Patrick" from Madras is trying to sell "web templates". It turns out that his site has copied content from another site marketing web templates, which appear to me to be copies of other websites with much of the content stripped out. It's ironic: Patrick seems to be using a template for a web-template selling website to sell web templates.

After a few days, I checked back to see if the website I had complained about had been removed from Google or not. As it turns out, the site actually improved its Google ranking from #5 to #1 in my test search. So much for my career in comment spam scourgedom!

Wednesday, April 7, 2010

When librarians catalog a book, they do their best to describe a thing they have in their hands. The profession has been cataloging for a long time, and it tends to think that it's reduced the process to a science. When library catalogs became digital in the 1970's, the descriptions moved off of paper cards and into structured database records using a data format called MARC. That stands for MAchine Readable Cataloging, and as one Google engineer recently complained, "the MAchine Readable part of the name is a lie". The problem that Google's machines are having with these records is that the descriptions have always been meant for humans to read, not for computers to parse and understand.

Cataloging librarians are not stupid, and they've been working since the very beginning of digital cataloging to make their descriptions more useful to computers. They've introduced "name authority files" to bring uniformity to things like subject headings and author and publisher names. Unicode has brought uniformity to the encoding of non-roman characters and diacritics. XML has replaced some of the ancient delimiters and message length encoding. And perhaps most importantly, for a long time they've been embedding identifiers in the catalog records. Despite all this, library catalog records are still not as computer-friendly as they should be.

The move towards identifiers is worth special note. The use of identifiers in libraries dates to the first industrialization of libraries that took place in the 19th century. The classification systems of Melvil Dewey, Charles Ammi Cutter and the Library of Congress were all efforts to make library catalogs more friendly to machines. Except the machines weren't digital computers, the machines were the libraries themselves. From the shelves to the circulation slips, libraries were giant, human-powered information storage and retrieval machines. The classification codes are sophisticated identifier systems upon which the entire access system was based. So maybe MARC isn't a lie after all!

The rest of the world took a while to catch up on the use of identifiers. The US began issuing social security numbers in 1936, but it wasn't until the 60's, with the adoption of ISBN in 1966 and ISSN in 1971, that the entire publishing industry began to use identifiers to more efficiently manage its sales, delivery and tracking of products.

The same properties that made identifiers useful in physical libraries make them essential for digital databases. Identifiers serve as keys that allow records in one table to be precisely sorted and matched against records in other tables. Well designed identifier systems provide assurances of uniqueness: there may be many people with the same name as me, but I'm the only one with my social security number.

Nowadays, it sometimes seems that almost any problem in the information industries is being solved by the introduction of a new identifier. Building on the success of ISBN and ISSN, there are efforts to identify works (ISTC), authors (ORCID, ISNI), musical notations (ISMN), organizations (SAN), recordings (ISRC), audio-visual works (ISAN), trade items (UPC) and many other entities of interest. We live in an age of identifiers.

The apotheosis of identifiers has been achieved in the Linked Data movement. The first rule of Linked Data is to give everything- subjects, objects, and properties- its own URI (Uniform Resource Identifier). By putting EVERYTHING in one global space of identifiers, it is expected that myriad types of knowledge and information can be made available in uniform and efficient ways over the internet, to be reused, recombined, and reimagined.

What's often glossed over during the adoption of identifiers is their fundamental pragmatism. The association between any identifier and the real-world object it purports to identify is a thinly veneered but extremely useful social fiction which doesn't approach mathematical perfection. Even very good identifier systems can fail as much as 1% of the time, and automated systems that fail to recognize and accommodate the possibility of identifier failure exhibit brittleness and become subject to failure themselves. Still, 99% of perfect works perfectly fine for a lot of things.

A decade ago, the world of libraries and the publishers that supply them embarked on an effort to link the citations in journal articles and the bibliographic databases essential to libraries to the cited articles in e-journals and full-text databases. Two complementary paths were pursued. One effort, OpenURL, sent bibliographic descriptions inside hyperlinks and relied on intelligent agents in libraries to provide users with institution-specific, relevant links. The other, CrossRef, built identifiers for journal articles into a link redirection system. Together, OpenURL and CrossRef build on the strengths of the description and identification approaches, and they do a reasonably good job of serving a wide range of users, including those in libraries.

Now, however, the slow but sure development of semantic web technologies and deployment of Linked Data has spurred both CrossRef's Geoff Bilder and OCLC's Jeff Young (OCLC runs the OpenURL Maintenance Agency) to examine whether CrossRef and OpenURL need to make changes to take advantage of these wider efforts. In another post, I'll look at this question more closely, but for now, I'd like to comment on what we've learned in the process of building article linking systems for libraries.

1. Successful linking requires both identification and description. The use of CrossRef by itself did not have the flexibility that libraries needed; CrossRef addressed this by making its bibliographic descriptions available to OpenURL systems. Similarly, the OpenURL's ability to embed CrossRef identifiers (DOIs) inside hyperlinks has made OpenURL linking much more accurate and effective.
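As a rough sketch of what "embedding a DOI inside a hyperlink" means in practice: the key names below follow the OpenURL 1.0 KEV (key/encoded-value) format, while the base URL and the article metadata are hypothetical stand-ins for a library's configured link server:

```python
from urllib.parse import urlencode

# Hypothetical link server; real deployments configure their own base URL.
BASE_URL = "http://example.edu/links"

def build_openurl(doi, title, issn):
    """Combine an identifier (the DOI) with descriptive metadata,
    so the link resolver can use whichever route works."""
    params = {
        "url_ver": "Z39.88-2004",         # OpenURL 1.0 version tag
        "rft_id": "info:doi/" + doi,      # identification: the CrossRef DOI
        "rft.atitle": title,              # description: article title
        "rft.issn": issn,                 # description: journal ISSN
    }
    return BASE_URL + "?" + urlencode(params)

url = build_openurl("10.1000/xyz123", "An Example Article", "1234-5678")
```

A resolver receiving this link can try the DOI first and fall back to the description (or vice versa), which is the complementarity the paragraph above describes.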

2. Successful linking is as much about knowing which links to hide as about link discovery. Link discovery and link computation turn out not to be so hard. Keeping track of what is and isn't available to a user is much harder.

3. Bad data is everywhere. If a publisher asks authors for citations, 10% of the submitted citations will be wrong. If a librarian is given a book to catalog, 10% of the records produced will start out with some sort of transcription error. If a publisher or library is asked to submit metadata to a repository, 10% of the submitted data will have errors. It's only by imposing the discipline of checking, validating and correcting data at every stage that the system manages to perform acceptably.
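Check digits are one of the cheap validation disciplines that catch a good share of transcription errors before they propagate. A minimal sketch of the ISBN-13 check (digits are weighted alternately 1 and 3, and the weighted sum must be divisible by 10):

```python
def isbn13_is_valid(isbn):
    """Validate an ISBN-13: the 1,3,1,3,... weighted digit sum
    must be a multiple of 10."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

ok = isbn13_is_valid("978-0-393-07223-5")   # a correctly transcribed ISBN: passes
bad = isbn13_is_valid("978-0-393-07232-5")  # two digits transposed: caught
```

A check like this can't fix bad data, but running it at every hand-off (author to publisher, publisher to repository) is exactly the kind of discipline that keeps the 10% error rate from compounding.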

Linking real world objects together doesn't happen by magic. It's a lot of work, and no amount of RDF, SPARQL, or URI fairy dust can change that. The magic of people and institutions working together, especially when facilitated by appropriate semantic technologies, can make things easier.

Sunday, April 4, 2010

I didn't get an iPad this weekend. I'll probably get one sooner or later (I have a birthday coming up, hint, hint), but for those of you who spent your Easter weekend playing with your iPad, the first picture should give you a taste of what you missed.

I entered some data into a spreadsheet to see if ebook reader pricing shows any trend. The data comes mostly from the MobileRead wiki. I've added two arrows to indicate two possible evolutionary paths for ebook readers.

One path, the tablet, is exemplified by the iPad. The tablet roadmap is characterized by a relatively constant price and ever increasing computer power and display functionality (color, speed, resolution).

The other path, exemplified by the Kindle, is the dedicated ebook reader. The reader roadmap is characterized by relentless price reduction and gradually increasing accommodation for reading ebooks.

If you believe my extrapolations, the dedicated ebook reader will cost as little as $25 in 2014, and the iPad will still cost $400. You heard it here first.

Thursday, April 1, 2010

PLEASE READ THIS LICENSE AGREEMENT ("LICENSE") CAREFULLY BEFORE USING THE GO TO HELLMAN BLOG ("THE BLOG"). BY USING THE BLOG, YOU ARE AGREEING TO BE BOUND BY THE TERMS OF THIS LICENSE. IF YOU DO NOT AGREE TO THE TERMS OF THIS LICENSE, CEASE READING AT ONCE. IF YOU DO NOT AGREE TO THE TERMS OF THE LICENSE, YOU MAY APPLY FOR A REFUND. IF THE BLOG WAS ACCESSED ELECTRONICALLY, CLICK "DISAGREE/DECLINE" BELOW. IF NOT, YOU MUST RETURN THE ENTIRE BLOG PACKAGE IN ORDER TO OBTAIN A REFUND.

IMPORTANT NOTE: This Blog may be used to induce synapse connection patterns in neural networks. It is licensed to you only for induction of non-copyrighted connection patterns, connection patterns in which you own the copyright, or connection patterns you are authorized or legally permitted to induce or have induced. This Blog may also be used for linking to music and video files for listening or viewing via your reading/listening/viewing device. Linked access of copyrighted music or video is only provided for lawful personal use or as otherwise legally permitted. If you are uncertain about your right to access to any linked material you should contact your legal advisor at once.

1. General. The Blog, commentary and any graphics accompanying this License whether transmitted electronically, resident on disk or in memory, on any other media or in any other form (collectively the "Go To Hellman Blog") are licensed, not sold, to you by Gluejar Inc. ("Gluejar") for use only under the terms of this License, and Gluejar reserves all rights not expressly granted to you. The rights granted herein are limited to Gluejar's and its licensors' intellectual property rights in the Go To Hellman Blog and do not include any other patents or intellectual property rights. You may own the chair you are sitting on, subject to other ownership claims, and nothing else remotely connected to this Blog. The terms of this License will govern any correction or additional posts provided by Gluejar that replace and/or supplement the original Gluejar product, unless such changes are accompanied by a separate license in which case the terms of that license will govern.

2. Permitted License. Uses and Restrictions. This License allows you to read and mentally process the Go To Hellman Blog. The Blog may be used to inspire new thoughts so long as such use is limited to thoughts which are lawful in your country of domicile. You may not make the Blog available over a network except as permitted under this License. You may make one copy of the Blog in machine-readable form for backup purposes only; provided that the backup copy must include all copyright or other proprietary notices contained on the original. You may also extract quotes, limited to forty words or less, and include them in your lawfully constructed derivative works, always providing that this License accompanies such derivative works. Except as and only to the extent expressly permitted in this License or by applicable law, you may not copy, decompile, reverse engineer, disassemble, modify, or create derivative works of the Blog or any part thereof. THE GO TO HELLMAN BLOG IS NOT INTENDED FOR USE IN THE OPERATION OF NUCLEAR FACILITIES, AIRCRAFT NAVIGATION OR COMMUNICATION SYSTEMS, AIR TRAFFIC CONTROL SYSTEMS, LIFE SUPPORT MACHINES OR OTHER EQUIPMENT IN WHICH THE SILLY THINGS THE BLOG SAYS COULD LEAD TO DEATH, PERSONAL INJURY, OR SEVERE PHYSICAL OR ENVIRONMENTAL DAMAGE.

3. Hyperlinking. You may not rent, lease, lend, redistribute or sublicense the Go To Hellman Blog. You may, however, make a one-time permanent hyperlink of all of your license rights to the Go To Hellman Blog to another party, provided that: (a) the hyperlink must include all of the Blog, including all its component parts, original media, printed materials and this License; (b) you do not retain any copies of the Blog, full or partial, including copies stored on a computer or other storage device; and (c) the party activating the hyperlinked Blog reads and agrees to accept the terms and conditions of this License.

4. Consent to Use of Data. You agree that Gluejar and its millions of subsidiaries, present and future, real and imagined, may collect and use technical and related information, including but not limited to technical information about your computer, system and application software, and peripherals, that is gathered periodically to facilitate the embarrassment of bashful users. Gluejar may use this information, as long as it is in a form that does not personally identify you or your grotesque bodily deformities, to improve our stories or to provide services or technologies to you.

5. Other Services. You understand that by using this Blog, you may encounter Content that may be deemed offensive, indecent, or objectionable, which content may or may not be identified as having explicit language. Nevertheless, you agree to use the Blog at your sole risk and that Gluejar shall have no liability to you for content that may be found to be offensive, indecent, objectionable, obscene, pornographic or just plain stupid. Content types (including tags, links, videos, comments and the like) and descriptions are provided for convenience, edification and amusement, and you acknowledge and agree that Gluejar does not guarantee their accuracy or grammar.

Certain Content may include materials from third parties or links to certain third party web sites. You acknowledge and agree that Gluejar is not responsible for examining or evaluating the content or accuracy of any such third-party material or web sites, notwithstanding that it's pretty much the whole point. Gluejar does not warrant or endorse and does not assume and will not have any liability or responsibility for any third-party materials or web sites, or for any other materials, products, or services of third parties. Links to other web sites are provided solely as a convenience to you and your chattel. You agree that you will not use any third-party, fourth party, or even nth-party materials in a manner that would infringe or violate the rights or put a damper on any other party and that Gluejar is not in any way responsible for any such use by you, especially insofar as we have not been invited to such parties.

6. Termination. This License is effective until terminated. Your rights under this License will terminate automatically without notice from Gluejar if you fail to comply with any term(s) of this License. Upon the termination of this License, you shall cease all use of the Blog and delete all copies of the Blog in your possession as well as all derivative works you have made using quotes from the Blog.

7. Disclaimer of Warranties. YOU EXPRESSLY ACKNOWLEDGE AND AGREE THAT USE OF THE BLOG (AS DEFINED ABOVE) IS AT YOUR SOLE RISK AND THAT THE ENTIRE RISK AS TO SATISFACTORY QUALITY, PERFORMANCE, ACCURACY AND EFFORT IS WITH YOU. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, THE GO TO HELLMAN BLOG IS PROVIDED "AS IS", WITH ALL FAULTS AND WITHOUT WARRANTY OF ANY KIND, AND GLUEJAR AND GLUEJAR'S LICENSORS (COLLECTIVELY REFERRED TO AS "GLUEJAR" FOR THE PURPOSES OF SECTIONS 7 AND 8) HEREBY DISCLAIM ALL WARRANTIES AND CONDITIONS WITH RESPECT TO THE GO TO HELLMAN BLOG, EITHER EXPRESS, IMPLIED OR STATUTORY, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES AND/OR CONDITIONS OF READABILITY OR INTELLIGIBILITY, OF SATISFACTORY QUALITY, OF FITNESS FOR A PARTICULAR PURPOSE, OF ACCURACY, OF QUIET ENJOYMENT, AND NON-INFRINGEMENT OF THIRD PARTY RIGHTS. GLUEJAR DOES NOT WARRANT AGAINST INTERFERENCE WITH YOUR ENJOYMENT OF THE GO TO HELLMAN BLOG, THAT THE FUNCTIONS CONTAINED IN THE GO TO HELLMAN BLOG WILL MEET YOUR REQUIREMENTS, THAT THE OPERATION OF THE GO TO HELLMAN BLOG WILL BE UNINTERRUPTED OR ERROR-FREE, OR THAT DEFECTS IN THE GO TO HELLMAN BLOG, NUMEROUS AND EGREGIOUS AS THEY MAY BE, WILL BE CORRECTED. NO ORAL OR WRITTEN INFORMATION OR ADVICE GIVEN BY GLUEJAR OR A GLUEJAR AUTHORIZED REPRESENTATIVE SHALL CREATE A WARRANTY. SHOULD THE GO TO HELLMAN BLOG PROVE DEFECTIVE, YOU ASSUME THE ENTIRE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF IMPLIED WARRANTIES OR LIMITATIONS ON APPLICABLE STATUTORY RIGHTS OF A CONSUMER, SO THE ABOVE EXCLUSION AND LIMITATIONS MAY NOT APPLY TO YOU BUT WE REALLY DON'T GIVE A DAMN.

8. Limitation of Liability. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT SHALL GLUEJAR BE LIABLE FOR PERSONAL INJURY, OR ANY INCIDENTAL, SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES WHATSOEVER, INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF PROFITS, LOSS OF DATA, DISTRACTION FROM MORE IMPORTANT THINGS, BUSINESS INTERRUPTION OR ANY OTHER COMMERCIAL DAMAGES OR LOSSES, ARISING OUT OF OR RELATED TO YOUR USE OR INABILITY TO USE THE GO TO HELLMAN BLOG, HOWEVER CAUSED, REGARDLESS OF THE THEORY OF LIABILITY (CONTRACT, TORT OR OTHERWISE) AND EVEN IF GLUEJAR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. SOME JURISDICTIONS DO NOT ALLOW THE LIMITATION OF LIABILITY FOR PERSONAL INJURY, OR OF INCIDENTAL OR CONSEQUENTIAL DAMAGES, SO THIS LIMITATION MAY NOT APPLY TO YOU. In no event shall Gluejar's total liability to you for all damages (other than as may be required by applicable law in cases involving personal injury) exceed the amount of fifty cents ($0.50). The foregoing limitations will apply even if the above stated remedy fails of its essential purpose.

9. Trademarks. “Gluejar”, “Go To Hellman”, “Hellman” and "ebook" are trademarks/service marks of Gluejar, Inc. Gluejar and/or its suppliers own all rights, title and interest, including, without limitation, all intellectual property rights, in and to the Blog. Except as expressly provided for in these Terms, You acquire no rights in or to Gluejar Trademarks. If your name happens to be "Hellman", doesn't that really suck? Lillian, you're dead, so stop twittering. Warren, wouldn't "We Shall Be Friedman" be a much better name for your blog? And did Frances ever tell you I inherited her desk at Ginzton Lab? Speaking of Twitter, don't you hate mayonnaise misspellers? The Best has two Ns, idiots! Marty, you can have my second key any time, keep up the good work. Jakob, do a new album already. Lauren, may you rot in Hel for not Googling your blog name first.

10. Export Control. You may not use or otherwise export or reexport the Go To Hellman Blog except as authorized by United States law and the laws of the jurisdiction in which the Go To Hellman Blog was obtained. In particular, but without limitation, the Go To Hellman Blog may not be exported or re-exported (a) into any U.S. embargoed countries or (b) to anyone on the U.S. Treasury Department's list of Specially Designated Nationals or the U.S. Department of Commerce Denied Person’s List or Entity List. By using the Go To Hellman Blog, you represent and warrant that you are not located in any such country or on any such list. You also agree that you will not use this Blog for any purposes prohibited by United States law, including, without limitation, the development, design, manufacture or production of missiles, or nuclear, chemical or biological weapons.

11. Government End Users. The Go To Hellman Blog and related commentary are "Commercial Items", as that term is defined at 48 C.F.R. §2.101, as such terms are used in 48 C.F.R. §12.212 or 48 C.F.R. §227.7202, as applicable. Consistent with 48 C.F.R. §12.212 or 48 C.F.R. §227.7202-1 through 227.7202-4, as applicable, the Blog is being licensed to U.S. Government end users (a) only as Commercial Items (b) with only those rights as are granted to all other end users pursuant to the terms and conditions herein and (c) can you believe this impenetrable language is contained in the iTunes License agreement that you probably accepted this week?

12. Controlling Law and Severability. This License will be governed by and construed in accordance with the laws of the State of New Jersey. If you gotta problem with that, I know a guy. This License is Born in the USA and shall not be governed by the United Nations Convention on Contracts for the International Sale of Goods, the application of which is expressly excluded by Bruce Himself. If for any reason a court of competent jurisdiction finds any provision, or portion thereof, to be unenforceable, the remainder of this License shall continue in full force and effect, so go stuff it.

13. Complete Agreement; Governing Language. This License constitutes the entire agreement between the parties with respect to the use of the Go To Hellman Blog licensed hereunder and supersedes all prior or contemporaneous understandings regarding such subject matter. No amendment to or modification of this License will be binding unless in writing and signed by Gluejar. Any translation of this License is done for local requirements and in the event of a dispute between the English and any non-English versions, the English version of this License shall govern, because we're Americans and what other language do you expect us to understand? In particular, any declarations made in robots.txt files or meta tag assertions are null and void. Googlebot, this means you.