Wednesday, November 10, 2010

Sure, a picture is worth a thousand words, but what is a thousand words worth? How about a million? If I had a dataset of the most recent trillion words spoken by humanity (anonymized and randomized, of course!), would it be worth any more than the set of words in this blog post?

These are real questions. A Texas company called Infochimps has datasets quite similar to these, ready for you to use. Some of the datasets are free; others you have to pay for. More interesting is that if you have a dataset you think other people might be interested in, or even pay for, Infochimps will host it for you and help you find customers. (Infochimps just announced it had raised $1.2 million in its first round of institutional funding.)

One of the datasets you can get from Infochimps for free is the set of smileys used in tweets sent between March 2006 and November 2009. It tells you that the smiley ":)" was used 13,458,831 times, while ";-}" was used only 1,822 times.
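Producing a tally like this is conceptually simple. Here's a minimal sketch, assuming tweets are available as plain strings; the emoticon list, the function name, and the sample tweets are all illustrative, not part of the Infochimps pipeline.

```python
from collections import Counter

# A small, illustrative list of emoticons to look for; the real
# dataset covers many more variants.
EMOTICONS = [":)", ":-)", ";)", ";-}", ":(", ":D"]

def count_emoticons(tweets):
    """Tally how often each emoticon appears across a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        for emo in EMOTICONS:
            counts[emo] += tweet.count(emo)
    return counts

tweets = ["off to lunch :)", "great talk :) :-)", "hmm ;-}"]
print(count_emoticons(tweets))
```

At Twitter scale the same logic would run as a streaming or map-reduce job, but the per-tweet counting step is the same.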

If you're willing to fork over $300, you can get a 160MB file containing a month-by-month summary of all the hashtags, URLs, and smileys used on Twitter during the same period. That dataset will tell you that during September 2009, the hashtag #kanyeisagayfish was used 11 times while #takekanyeinstead was used 141 times.
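A month-by-month summary like this is just a frequency table grouped by month. The sketch below shows the rollup, assuming hypothetical (month, hashtag) records extracted from tweets; the record values are made up for illustration, not drawn from the actual dataset.

```python
from collections import Counter, defaultdict

# Hypothetical input: (month, hashtag) pairs extracted from tweets.
records = [
    ("2009-09", "#kanyeisagayfish"),
    ("2009-09", "#takekanyeinstead"),
    ("2009-09", "#takekanyeinstead"),
    ("2009-10", "#nowplaying"),
]

def monthly_hashtag_counts(records):
    """Roll up hashtag usage into a per-month frequency table."""
    by_month = defaultdict(Counter)
    for month, tag in records:
        by_month[month][tag] += 1
    return by_month

summary = monthly_hashtag_counts(records)
print(summary["2009-09"]["#takekanyeinstead"])
```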

I had a great talk with Infochimps President and Co-Founder Flip Kromer a few weeks ago before his presentation to the New York Data Visualization Meetup. I fell in love with one of the visualizations he showed in his presentation, and he's given me permission to reproduce it here. (Creative Commons Attribution License) It's derived from the same Twitter data set you can get from Infochimps, and shows networks of characters that are found in the same tweet. So if ♠ and ♣ appear in the same tweet over and over again, the two characters will have a strong connection in the network of characters.
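The underlying structure is a co-occurrence network: nodes are characters, and an edge's weight is the number of tweets in which both endpoints appear. Here's a minimal sketch of building such an edge list; the function name and the choice to ignore spaces are my assumptions, not a description of Kromer's actual pipeline.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(tweets):
    """Count, for each pair of distinct characters, the number of
    tweets in which both characters appear at least once."""
    edges = Counter()
    for tweet in tweets:
        # Unique characters per tweet, ignoring spaces; sorting gives
        # each pair a canonical order so (a, b) and (b, a) merge.
        chars = sorted(set(tweet) - {" "})
        for a, b in combinations(chars, 2):
            edges[(a, b)] += 1
    return edges

tweets = ["♠♣ poker night", "♠ vs ♣ again", "hello world"]
edges = cooccurrence_edges(tweets)
print(edges[("♠", "♣")])  # the spade/club pair co-occurs in two tweets
```

Feeding the resulting weighted edge list into a force-directed layout is what produces clusters like the ones in the visualization.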

As you might expect, the main character networks that show up are associated with languages, but there are some anomalies. For example, the katakana character ツ (tsu) sticks out. Katakana is a set of phonetic characters used in Japanese for non-Japanese words. The reason "tsu" is set apart from all the other katakana is that people use it on Twitter as a smiley.

The other anomalous character subnet is labeled "???" in the graph. A closer look reveals this to be the set of characters that look like upside-down Roman text.

Kromer has noticed that the price (or perhaps cost) of a partial data set follows a non-monotonic curve (see graphic). Small amounts of data are essentially free, but a peak value is reached when portions of the data set are extracted from the full data set. If we were discussing book metadata, for example, peak value might accrue for a set of the 100,000 top selling books.

There's much less value, according to Kromer, in having a large incomplete chunk of a data set. Data for 10,000,000 books, for example, would have less value than the 100,000 book data set, because it's not complete. Complete data sets become extremely expensive because of the logistics involved, and because of the value of having the complete set.

This pattern seems plausible to me, but I'd like to see some clearer examples. I've previously written about having too much data, but that article looked at the effect of error rates on data collection; Kromer's curve is about utility.

For me, the most interesting thing about Infochimps is the idea that the best way to make data flow in large volumes and create new types of knowledge is to provide the right incentives for data producers through the establishment of a market. This makes a lot of sense to me; however, I'm not sure that the Infochimps market has also established the incentives needed for data set maintenance; the world's most valuable and expensive data sets are ones that change rapidly.

Kromer contrasted the Infochimps approach to that of Wolfram, whose Alpha service is produced by "putting 100 PhDs and data in a lab". He also feels that much of the work being put into the semantic web is a "crock" because its technology stack solves problems that we don't have. Humans are pretty good at extracting meaning from data, given a good visualization.

6 comments:

"He also feels that much of the work being put into the semantic web is a "crock" because the its technology stack solves problems that we don't have. Humans are pretty good at extracting meaning from data, given a good visualization."

How do you read a million books? http://www.dlib.org/dlib/march06/crane/03crane.html Good visualizations of them, I would very much like to see. :)

I think the other issue is that not everything is visualized, even when you wish it were. Here are some SPARQL queries that give an example of ending up with visualizations from the SemWeb technology--rather than starting with them: http://blog.ouseful.info/2009/12/14/hackable-queries-parameter-spotting/ Of course you could make such visualizations another way--if the data were expressed the way you wanted. But the point is that the data isn't always expressed in accessible ways (it may be locked up in PDFs, or need expert interpretation).

Jodi, by juxtaposing those sentences, I didn't mean to imply that Kromer was saying visualizations were the answer. The context was more like: improving metadata ontologies would not help much if we wanted to read a million books.

When I chatted w/ Flip recently about Semantic Web, he took a much less silly position than "crock"; my guess is that they think they can successfully market against Semantic Web like the Freebase guys always tried to do.

Kendall- Infochimps is not averse to distributing RDF data and things like that. But Flip's point of view, IIRC, was that people with good data sets on their hands aren't sitting around saying "if only we had better globally interoperable metadata schemas and knowledge models, we'd post our data in a jiffy." They're more concerned with business models, logistics, and partnering. That's a ways from silly.

Regarding the price vs. %complete graph: At the far right is the peak for comprehensiveness; it's all in there, whatever you need to know. The second peak is the peak of insight: you're not paying for data, you're paying for the answer to a question. The insight doesn't have to be deep: the clearing price for 200 GB of blog posts is far less than the price for {frequency of mentions of "I'm buying a ___"} mined from those blog posts, even though the latter is a strict subset.

@jodischneider Absolutely agree that we don't remotely have the ability to visualize arbitrary (or even more than a tiny segment of) datasets -- and also that "the point is that the data isn't always expressed in accessible ways (it may be locked up in PDFs, or need expert interpretation)." Indeed, this is my main problem with the semantic web: too many people seem to be concentrating on turning data into RDF, and in knitting it together at the atomic level. What we need to do right *now* is work on making it discoverable, and verifiable, and downloadable, and descriptive of its contents and methodology. "Connected" is too hard right now -- let's solve "Adjacent" first.