Luis invented captchas, the random characters you have to type in to convince a web page that you are a human and not a hostile software program. (He shows randomly generated sequences that happened to spell out “wait” and “restart.”) Captchas are useful, he says, when you’re trying to prevent people from gaming a system by writing a program to enter data robotically. They’re also useful to prevent spammers from signing up for free email accounts. To get around this, spammers have started sweatshops where humans type captchas all day long; it costs the spammers about $0.33/account. And some porn companies ask users to type in a captcha to see photos; the captchas are drawn from email account applications. Damn clever!

He shows some variants. A Russian site asks you to solve a mathematical limit. An Indian one asks you to solve a circuit. Luis says these aren’t all that effective because computers can solve both problems, but they’re still better than the “what is 1 + 1?” captchas he’s found on US sites.

He says that about 200M captchas are typed every day. He was proud of that until he realized it takes about 10 seconds to type them, so his invention is wasting 500,000 hours per day. So, he wondered if there was a way to use captchas to solve some humongous problem ten seconds at a time. Result: ReCAPTCHA. For books written before 1900, the type is weak and about 30% of the text cannot be recognized by OCR. So, now many captchas ask you to type in a word unrecognized when OCR’ing a book. (The system knows which words are unrecognized by running multiple OCR programs; ReCAPTCHA uses those words.) To make sure that it’s not a software program typing in random words, ReCAPTCHA shows the user two words, one of which is known to be right. The user has to type in both, but doesn’t know which is which. If the user types in the known word correctly, the system knows it’s not dealing with a robot, and that the user probably got the unknown word right.
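The two-word trick can be sketched in a few lines. This is an illustrative reconstruction of the logic Luis describes, not ReCAPTCHA’s actual implementation; the function name, the vote-counting, and the threshold of three agreeing users are assumptions for the sketch.

```python
def check_submission(typed_known, typed_unknown, known_word, pending_votes,
                     votes_needed=3):
    """Sketch of the two-word check: one control word we can verify,
    one unknown word we're trying to crowdsource.

    pending_votes maps candidate transcriptions of the unknown word
    to how many users have typed them.
    Returns (is_human, accepted_transcription_or_None).
    """
    # If the control word is wrong, assume a bot (or a typo) and discard.
    if typed_known.lower() != known_word.lower():
        return False, None

    # The user passed the human test, so their reading of the unknown
    # word counts as one vote toward its correct transcription.
    guess = typed_unknown.lower()
    pending_votes[guess] = pending_votes.get(guess, 0) + 1
    accepted = guess if pending_votes[guess] >= votes_needed else None
    return True, accepted
```

Once enough independent users agree on the unknown word, the system can treat it as transcribed and retire it from circulation.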

ReCAPTCHA is a free service. Sites that use it have to feed back the entries for the unknown word. About 125,000 sites use it. They’re doing about 70M words per day, the equivalent of 2-4M books per year. If the growth continues, they’ll run out of books in 7 years, but Luis doesn’t think the growth will continue, so it might take twenty years. (There are 100M books.)
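A quick back-of-the-envelope check of the figures in the two paragraphs above, using only numbers from the talk (the 25-50 year range assumes a strictly constant rate; any continued growth shortens it toward Luis’s estimates):

```python
# 200M captchas a day at ~10 seconds each, converted to hours:
hours_per_day = 200_000_000 * 10 / 3600
print(round(hours_per_day))  # 555556 -- roughly Luis's "500,000 hours per day"

# 100M books at a constant 2-4M books per year:
print(100 / 4, "to", 100 / 2, "years")  # 25.0 to 50.0 years
```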

The ReCAPTCHA system filters out nationalities, known insult terms, and the like, to avoid unfortunate juxtapositions. It’s soon going to be released in 40 languages. Google acquired ReCAPTCHA.

Q: When will OCR be good enough to break captchas?
A: I don’t know. We’ll probably run out of books first.

Q: Business model?
A: Google Books gets help digitizing.

ReCAPTCHA “reuses wasted human processing power.” The average American spends 1.9 seconds per day typing captchas. We also spend 1.1 hours a day playing electronic games. We humans spent 9B hours playing solitaire in 2003. It took less than a day of that to build the Panama Canal. So, Luis switches topics a bit to talk about how to solve human problems by playing games.

First is tagging images with words. Image search works by looking at file names and html text, because computers can’t yet recognize objects in images very well.

Does typing two words take twice as long as typing random letters? No, it takes about the same time, he says. Luis says about 10% of the world’s population have typed in a captcha. The ESP game asks two people unknown to each other to label an image until they agree. The game taboos words that other players have already agreed on. The system passes images through until they get no new labels. They’ve gotten over 50M agreements. 5,000 players playing simultaneously could label all Google images in a month. Google has its own version; Google has an exclusive license to the patent.
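The core of one ESP-game round is a simple matching problem. This is an illustrative sketch of the mechanic as described, not the game’s actual code; the function name and the order-of-entry tie-breaking are assumptions.

```python
def esp_round(guesses_a, guesses_b, taboo_words):
    """One round of the ESP game for a single image.

    guesses_a / guesses_b: each player's guesses, in the order typed.
    taboo_words: labels earlier pairs already agreed on for this image;
    they can't score again, which forces fresh labels.
    Returns the first agreed-on label, or None if the players never match.
    """
    taboo = {w.lower() for w in taboo_words}
    b_guesses = {g.lower() for g in guesses_b} - taboo
    for guess in guesses_a:
        g = guess.lower()
        if g not in taboo and g in b_guesses:
            return g  # agreement: this becomes a new label for the image
    return None
```

For example, `esp_round(["dog", "beach"], ["sand", "beach"], ["dog"])` returns `"beach"`: “dog” is tabooed, so the players must converge on something new.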

Q: Demographics?
A: For my version, average age is 29 (with huge variance), evenly split between women and men.

Q: Compared to Flickr tags?
A: Only a small fraction of Flickr images have useful tags. The tags from flickr tend to be significantly more exact, but also significantly noisier (e.g., a person tagging an image in a way that means something idiosyncratic).

Q: Bots?
A: Yes, we don’t want you to wait for a partner, so sometimes we’ll give you a bot that replays the moves a human had made with the same image.

Q: Google Images benefits from its version of your game. Who benefits from your version of the game?
A: No one.

For some images, guesses change over time. E.g., a Britney Spears photo five years ago got labels like britney and hot. About two years ago, the labels changed to crazy, rehab, and shaved head. Now they’re back to britney and hot. By watching a player for 15 mins, you can guess whether the player is male or female with 95-98% accuracy.

Why do people like the ESP game? Sometimes they feel an intimacy with their partners. They have to step outside of themselves to make the match. They can have a sense of achievement.

He ends by saying that about the same number of people — 100,000 — have worked on humanity’s big projects, e.g., pyramids, Panama Canal, putting a person on the moon. That’s in part (he says) because it is so hard to coordinate large numbers of people. Now we can get 100M people to work on something. What can we do?

There’s a terrific article by Carol Kaesuk Yoon in the NY Times about research that shows that humans around the world tend to cluster the natural world in highly similar ways, even using similar-ish names.

The analysis of the data shows that protestors most likely disseminated the use of strategic tagging among their contacts, rather than within any particular special-interest group. A list of contacts is much closer to a hand-picked ensemble of friends than to such a group, and therefore exerts a bigger influence on the list’s owner.

The University of Huddersfield is making publicly available the metadata about the circulation of its books — 3 million transactions — over the past thirteen years. This includes a book’s ISBN, number of times it’s been checked out, by which academic department. (It does not include information about individual borrowers.)

BTW, the library used LibraryThing’s ISBN lookup service to derive some of the ISBNs, and it includes “FRBR-ish” data, i.e., other books that may be closely related.

Tweetag automatically creates tags for tweets and shows you the tag cloud for any term you’re looking for. At the moment, it only looks at the past 24 hours’ tags, a limitation the Belgian folks behind this hope to remove if they get a little money coming in.

[Note from the next day: This is a little embarrassing. I just noticed that this was first published in 2006. It came through my inbox on Saturday, and I carelessly thought it had just come out.]

Elaine Peterson, associate professor at Montana State University, has an article in D-Lib Magazine called “Beneath the Metadata: Some Philosophical Problems with Folksonomy.” It’s good to see the issues taken seriously, and many of her premises strike me as true. But, I disagree with her pragmatic conclusion that “A traditional classification scheme will consistently provide better results to information seekers.” And I think I disagree with her philosophical critique, although I am not confident that I’m understanding it as she intends.

I read the article two different ways. At first I thought it was a critique of folksonomies on the grounds that they contradict traditional philosophical premises. The next time I read it, I thought it was simply pointing out the differences. Now I’m tending toward my first reading, in part because her section on the traditional defends it against some objections while about half of the section on folksonomies is critical of them.

Her philosophical criticism seems to be rooted in what she presents as the Aristotelian approach to classification: Things are lumped with other things like them, and simultaneously distinguished from them. Most important, she says, is the idea that “A is not B,” which means that A cannot be truthfully classified also as a B. But what about digital items that “can reside in more than one place”? That is “irrelevant,” she says, “since one is talking about a classification scheme, not about the items themselves.” I have to admit I don’t understand this. What is the philosophical basis for restricting things to one category if not that that restriction reflects the metaphysical truth that A cannot also be B? So, I think she’s saying we are to reject multiple classifications because such classifications are untrue metaphysically.

This reading is supported by the section on folksonomy, where she identifies philosophical relativism as “the underlying philosophy behind folksonomies,” and pretty clearly intends this as a criticism. (I personally am no fan of philosophical relativism, although there’s a longer story there.) The problem with relativism, she writes, is that it means classification escapes from the demand that A be A and not be B. I take this as indicating that, in her section on traditional classification, she is agreeing with the 1930 textbook she cites that recommends that classifiers give “emphasis to what the author intended to describe.” If you’re arguing that, on metaphysical grounds, things should only be classified in a single category, I guess looking for the author’s intention gives you a way forward…even though categorizing only by the author’s intent is to me like insisting that readers only underline passages that the author considers significant.

And this highlights what I think is my root disagreement with Elaine’s piece (if I’m understanding it correctly). It’s fine to raise pragmatic problems with folksonomies, as she does. But Elaine is pointing at philosophical problems. And those problems require assuming that folksonomists are trying to do what Aristotelian categorizers are trying to do. But they’re not. Aristotelians (I’m using this sloppily as shorthand, so pardon my “tagging”) are trying to find the one true and right category for each thing, creating a well-ordered system free of contradictions. Folksonomies are trying to help us find stuff.

Inconsistencies in tags actually make a folksonomy useful; a folksonomy that consists of 1,000 instances of a single tag isn’t worth the folksonomizing. But these inconsistencies are a problem for Elaine because she is thinking of a folksonomic classification as a philosophical statement rather than as a mere tool. She says that “perhaps … the strongest criticism one could make of folksonomies” is that because tags can be true for one group and false for another,

a folksonomy universe allows both true and false statements to coexist. Because tags are relativized, personal, idiosyncratic views can coexist and thrive in the form of tags, in spite of their inconsistencies. Readers of texts on the Internet become individual interpreters, despite the document author’s intent.

To this many of us will say “Hallelujah!” because we disagree with Elaine’s opening claim that all classification is about answering the philosophical question, “What is it?” Indeed, she’s a hard-liner: An inconsistency to Elaine is any multiple classification, not simply one that contradicts others. Classifying a dissertation about “Moby-Dick” under “ecology” as well as under “novels: 19th Century” would introduce an insupportable inconsistency (in Elaine’s terms). She seems to assume that tags are Aristotelian judgments in which we say that A is a B. But, when I tag a photo of my wife as “ann,” “birthday,” “2008,” and “family events,” I am not saying the essence of Ann (or her photo) is any of those things. Even if I believed in essentialism (I pretty much don’t), we could make use of Aristotle’s idea of “accidental properties” (non-essential but true) to explain what I’m doing. And if I tag Oliver Stone’s “Alexander” as “Angelina Jolie” or “tripe” knowing full well that I am not staying true to the author’s intent, well, tough on Oliver. Tags are not always truth claims, and a folksonomy is not intended to mirror nature. Indeed, a folksonomy can reveal the most appalling areas of ignorance and prejudice in a populace — and, pragmatically, we may well want to address those popular errors, especially since a folksonomy can indeed reinforce them.
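The pragmatic point — that tags are retrieval hooks, not exclusive categories — is easy to see in code. This is an illustrative sketch, not any particular tagging system’s implementation; the item names and helper functions are invented for the example.

```python
from collections import defaultdict

# A tag index: each tag maps to the set of items carrying it.
# An item may sit under many tags at once; nothing breaks.
tag_index = defaultdict(set)

def tag(item, *tags):
    """Attach any number of tags to an item."""
    for t in tags:
        tag_index[t.lower()].add(item)

def find(t):
    """Return every item carrying the given tag."""
    return tag_index.get(t.lower(), set())

# The same photo lives under four "categories" simultaneously --
# an Aristotelian inconsistency, but exactly what makes it findable.
tag("photo-042", "ann", "birthday", "2008", "family events")
tag("moby-dick-thesis", "ecology", "novels: 19th century")
```

Searching `find("ann") & find("2008")` narrows to the photo; no single “true” category ever had to be chosen.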

But, Elaine is right to point to the philosophical implications of folksonomies. An individual folksonomy may make no claim to providing the real truth about how the world is ordered, but the use of folksonomies generally carries some philosophical implications. Elaine sees relativism underneath them while I see a form of pragmatism. But folksonomies didn’t arise out of philosophy. They are a “found” ordering: Hey, we have all these tags, so why don’t we make use of them in a more systematic way? So, I think Elaine is mislocating the philosophical moment in folksonomies. Philosophy isn’t underneath them or behind them. It’s after them, in their effect. Folksonomies reinforce our move away from the essentialist view that every thing has a single category that reflects its single and real essence. We’ve been moving away from that view for a long time as a culture. The success of folksonomies as a tool reveals that we accepted the traditional Aristotelian scheme in part because it was useful. If its utility has been undercut, then we have to ask for the other reasons we should believe in an Aristotelian metaphysics.

The ball is in Aristotle’s court.

* * *

Most of Elaine’s outright criticisms of folksonomies are actually practical, not philosophic. She makes them without empirical evidence. She has not convinced me that she’s right. For example, her final paragraph says:

A traditional classification scheme based on Aristotelian categories yields search results that are more exact. Traditional cataloging can be more time consuming, and is by definition more limiting, but it does result in consistency within its scheme. Folksonomy allows for disparate opinions and the display of multicultural views; however, in the networked world of information retrieval, a display of all views can also lead to a breakdown of the system… Most information seekers want the most relevant hits when keying in a search query.

By “exact” she apparently means the results include fewer false results (where a result is false if the search term doesn’t really apply to the result, as when you search for “fish” and get back posts about dolphins). And that seems correct: A professionally constructed index should have fewer of those sorts of mistakes. But the second criterion in her concluding paragraph is relevancy, and there folksonomies well may beat a professionally constructed index. Not only might a folksonomy retrieve results more relevant to me personally or to my cultural sub-group, but it constructs a semantic system that can retrieve results that the narrow and careful categorizing of experts might miss. So, I disagree with her last sentence: “A traditional classification scheme will consistently provide better results to information seekers.” Traditional classification is best for certain types of searches — ones where you want precision over recall and relevancy, and especially where there is a confined domain of contents that you have to be sure you’ve searched thoroughly — but is not as good as a folksonomy for other types of searches.

In short, neither traditional nor folksonomic classifications are best. Each is best for something.

Vincent Sterken has posted his master’s thesis, which examines LibraryThing.com to understand the dynamics and utility of social tagging. It begins with an exceptionally clear backgrounder on tagging and taxonomies, and then moves to a fascinating exploration of LibraryThing’s folksonomy, including a comparison of how LibraryThing’s community and the Library of Congress classify books.

I’m going to be on the radio news show Here and Now tomorrow to talk about Google.org’s ability to track outbreaks of flu by charting search terms (“flu symptoms”), time, and presumed IP location. I plan on talking about it as an example of the power of having enormous amounts of data, and of putting to use information generated for some other purpose.

Academia.edu lets you add yourself to its gigantic Tree of University Departments. It’s a slick, slidey, Ajaxy UI, and there seem to be only benefits to adding your name to it, even though it will forever be incomplete.

The question is whether it’s easier and more beneficial to count on participants to centralize their contact info at Academia.edu or to hope that universities somehow might agree on a metadata standard — a microformat — for how they list faculty members on their own sites. Since the latter isn’t happening, the former becomes appealing. (Thanks to John Palfrey for the link.)