Saturday, January 8, 2011

I don't really know how all the food gets to my table. Sure, I've gathered berries, baled hay, picked peas, baked bread and smoked fish, but I've never slaughtered a pig, (successfully) milked a cow or roasted coffee beans. In my grandparents' generation, I would have seemed rather ignorant and useless. Agriculture has become an industry as specialized as any other modern industry, and it is increasingly inaccessible to the layperson or small business.

I do know a bit about how data gets to my browser. It gets harvested by data farmers and data miners, it gets spun into databases, and then gets woven into your everyday information diet. Although you've probably heard of the "web of data", you're probably not even aware of being surrounded by data cloth.

The dataculture industry is very diverse, reflecting the diversity of human curiosity and knowledge. Common to all corners of the industry is the structural alchemy that transmutes formless bits into precious nuggets of information.

In many cases, this structuring of information is layered on top of conventional publishing. My favorite example of this is that the publishers of "Entertainment Week" extract facts out of their stories and structure them with an extensive ontology. Their ontologists (yes, EW has ontologists!) have defined an attribute "wasInRehabWith" so that they can generate a starlet's biography and report to you that she attended a drug rehabilitation clinic at the same time as the co-star of her current movie. Inquiring minds want to know!

If you look at location-based services such as Facebook's "places", Foursquare, Yelp, Google Maps, etc., they will often present you with information pulled from other services. Often, a description comes from Wikipedia, reviews come from Yelp or Tripadvisor, and photos come from Panoramio or Flickr. These services connect users to data using a common metadata backbone of geotags. Data sets are pulled from source sites in various ways.
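To make the "backbone" idea concrete, here's a minimal sketch of how records from different sources could be stitched together when the only thing they share is a geotag. Everything in it is an assumption for illustration: the field names, the sample records, and the choice to match by rounding coordinates, which is a crude stand-in for whatever matching real services do.

```python
def geokey(lat, lon, places=3):
    """Round coordinates so nearby geotags from different sources collide."""
    return (round(lat, places), round(lon, places))


def merge_by_geotag(*sources):
    """Combine records from several sources into one entry per rounded geotag."""
    merged = {}
    for source in sources:
        for record in source:
            entry = merged.setdefault(geokey(record["lat"], record["lon"]), {})
            for field, value in record.items():
                if field not in ("lat", "lon"):
                    entry[field] = value
    return merged


# Hypothetical records standing in for two different source sites.
descriptions = [{"lat": 40.7484, "lon": -73.9857, "description": "A tall building."}]
reviews = [{"lat": 40.74842, "lon": -73.98568, "review": "Long lines, great view."}]

places = merge_by_geotag(descriptions, reviews)
```

Rounding to three decimal places treats points within roughly a hundred meters as the same place. It also misses pairs that straddle a rounding boundary, so a real matcher would need something smarter.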

Some datasets are produced in data factories. I had a chance to see one of these "factories" on my trip to India last month. Rooms full of data technicians (women do the morning shift, men the evening) sit at internet-connected computers and supervise the structuring of data from the internet. Most of the work is semi-automated; software does most of the data extraction. The technicians act as supervisors who step in when the software is too stupid to know that it's mangling things and human input is really needed.

There's been a lot of discussion lately about how spammers are using data scraped from other websites and ruining the usefulness of Google's search results. There are plenty of companies that offer data scraping services to fuel this trend. Data scraping is the use of software that mimics human web browsing to visit thousands of web pages and capture the data that's on them. This works because large websites are generated dynamically out of databases; when machines assemble web pages, machines can disassemble them.
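Here's a minimal sketch of that "disassembly", using only Python's standard library. The page fragment, the class names and the field names are all made up for illustration. A real scraper would also have to fetch pages and cope with much messier markup.

```python
from html.parser import HTMLParser

# Hypothetical fragment, shaped the way a template might render database rows.
PAGE = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""


class ProductScraper(HTMLParser):
    """Rebuild database-like rows from the repeated markup of a generated page."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one dict per product found so far
        self._field = None   # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})
        elif tag == "span" and css_class in ("name", "price") and self.rows:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape(html):
    """Return the rows recovered from one page of HTML."""
    parser = ProductScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape(PAGE)
```

Because the template repeats the same markup for every row, a few lines of pattern matching are enough to turn the page back into records.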

A look at the variety of data scraping companies reveals a broad spectrum. Scraping is an essential technology for dataculture; as with any technology, it can be used to many ends. One company boasts of their "massive network of stealth scrapers capable of downloading massive amounts of data without ever getting blocked." Some companies, such as Mozenda, offer software to license. Others, such as Xtractly and Addtoit, are strictly service offerings.

I spoke to Addtoit's President, Bill Brown, about his industry. Addtoit got its start doing projects for Reuters and other firms in the financial industry; their client base has since become more "balanced". Customers wanting a leg up on competitors pay premiums to companies such as Bloomberg, Reuters and D&B for environments rich in structured data. Brown's view is that the industry will move away from labor-intensive operations to being completely automated, and Addtoit has developed accordingly.

A small number of companies, notably Best Buy, have realized that making their data easily available can benefit them by promoting commerce and competition. They have begun to use technologies such as RDFa to make it easy for machines to read data on their web sites; scraping becomes superfluous. RDFa is a method of embedding RDF metadata in HTML web pages; RDF is the general data model standardized by the W3C for use on the semantic web, which has been discussed much on this blog.
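For a rough feel of why embedded metadata makes scraping superfluous, here's a sketch that reads property/value pairs straight out of RDFa-style attributes. The markup and the vocabulary URL are invented for illustration and are not taken from any real site. The reader is deliberately simplified too: it ignores prefixes, subjects and nesting, so it is nowhere near a conforming RDFa processor.

```python
from html.parser import HTMLParser

# Hypothetical product markup with RDFa-style attributes.
PAGE = """
<div vocab="http://example.org/vocab#" typeof="Product">
  <span property="name">Example Camera</span>
  <span property="price" content="199.00">$199</span>
</div>
"""


class PropertyReader(HTMLParser):
    """Collect (property, value) pairs from elements carrying a property attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._pending = None  # property waiting for its text value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # A machine-readable value overrides the human-readable text.
            self.pairs.append((prop, attrs["content"]))
        else:
            self._pending = prop

    def handle_data(self, data):
        if self._pending is not None and data.strip():
            self.pairs.append((self._pending, data.strip()))
            self._pending = None


def read_properties(html):
    """Return the property/value pairs declared in one page of HTML."""
    reader = PropertyReader()
    reader.feed(html)
    reader.close()
    return reader.pairs


pairs = read_properties(PAGE)
```

The difference from scraping is that the page itself names each field. Nothing here depends on guessing what a particular CSS class or table column happens to mean.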

This doesn't work for many types of data. Brown sees very slow adoption of RDFa and similar technologies but thinks website data will gradually become easier to get at. Most websites are very simple, and their owners see little need or benefit in investing in newer website technologies. If people who really want the data can hire firms like Addtoit to obtain the data, most of the potential benefits to website owners of making their data available accrue without needing technology shifts.

The library industry is slowly freeing itself from the strictures of "library data" and is broadening its data horizons. For example, many libraries have found that genealogical databases are very popular with patrons. But there is a huge world of data out there waiting to be structured and made useful. One of the most interesting dataculture companies to emerge over the last year is ShipIndex. As you'd expect from the name, ShipIndex is a vast directory of information relating to ships. Just as place information is tied together with geoposition data, ShipIndex ties together the world of information by identifying ships and their occurrence in the world's literature. The URIs in ShipIndex are very suitable for linking from other resources.

ShipIndex is proof that a "family farm" can still deliver value in the dataculture industry. The process used to build ShipIndex. Nonetheless, in coming years you should expect that technologies developed for the financial industry will see broader application and will lead to the creation of data products that you can scarcely imagine.

The business model for ShipIndex includes free access plus a fee-for-premium-access model. One question I have is how effectively libraries will be able to leverage the premium data provided with this model. Imagine, for example, the value you might get from a connection between ShipIndex and a genealogical database bound by passenger manifests. I would be able to discover the famous people who rode the same ship that my parents took to and from the US and Sweden (my mom rode the Stockholm on the crossing before it collided with the Andrea Doria). For now, though, libraries struggle to leverage the data they have; better data licensing models are way down on the list of priorities for most libraries.

Peter McCracken

ShipIndex was started by Peter and Mike McCracken, whom I've known since 2000. Their previous company (Serials Solutions) and my previous company (Openly Informatics) both had exhibit tables in the "Small Press" section of the American Library Association exhibit hall, where you'll often find the next generation of innovative companies serving the library industry. They'll be back in the "Small Press" section at this weekend's ALA Midwinter meeting. Peter has promised to sing a "shanty" (or was that a scupper?) for anyone who signs up for a free trial. You could probably get Mike to do a break dance if you prefer.

I'll be floating around the meeting too. If you find me and say hello, I promise not to sing anything.