In a post last week I argued that the key to making structured data pervasive on the web was tools that make it easy for people to create interesting data visualizations that share their data by default, without added effort. This prompted a pair of responses that I’d like to address here. One, from Glen McDonald in the comments of that post, argued that simply making the data available isn’t enough if people don’t have tools to process the shared data, massaging it into different forms through useful queries. Another response, from Stefano Mazzocchi on his blog, essentially argued that people won’t want to share their data, and will act to block sharing, unless we give them some quid pro quo.

Mobility versus Queryability

Let’s look first at data processing. Glen McDonald has built a nice tool called Needle that offers a query language over a graph-shaped data model. As a query engine, Needle’s focus is not on providing a variety of visualizations (though it has some) but on letting you ask questions of the data and get specific answers. Glen argues that this is more important than data export—that when a publisher is in charge of the visualization, they control the way you look at the data, and thus control the conclusions you draw. Only if you make your own queries, says Glen, can a reader really test the author’s argument the way it ought to be tested.

I consider the difference between a query language and a visualization to be one of degree, not kind. People only experience data through visualizations—they can’t see the platonic data object. And any kind of non-static presentation gives a user the ability to construct “queries”—different ways of looking at the data. Conversely, a query-based interface is just a particularly powerful way to construct visualizations, but that power tends to come with complexity that can make it too difficult for most users.

Most importantly, I agree with Glen about the importance of letting the reader drive, but I think that supports my argument for the greater importance of data mobility over rich query languages. No one site or tool can ever offer all imaginable data-query interactions—there’s always a new visualization or a new query function lurking around the corner. On the other hand, given the richness of the web, the reader’s favorite query/visualization is sure to be available on some site. So the key is for the user to be able to move the data from where it is to where it needs to be. In particular, if the data the user wants to look at is spread over two different sites, then clearly it has to be exported from at least one of them to be useful.

(De)motivations for Sharing

Stefano agrees that sharing data is key, but questions my assumption that it will “just happen” if we build default data export into the tools people use to author. Stefano argues that “normal people get a weird feeling when they think that others can just take their entire work and run with it”; he feels that people need a driver, some direct benefit that will incentivize them to share their data. In support of this argument, Stefano talks like a programmer about the horror of having your entire web site codebase copied somewhere else, and about the development of GPL and CC-share-alike licenses that try to force people who copy and adapt your code to share those adaptations back to the community.

I think this is an argument that will only resonate with programmers. And since most of my hoped-for content creators and sharers aren’t programmers, it’s not really relevant. If we look at the broader population of content creators on the web, we see blogs and photo- and video-sharing sites like Flickr and YouTube. The driver for these authors is clear, and Stefano points it out himself: “the ability for people to publish something to the entire world with dramatically reduced startup costs and virtually zero marginal costs”. Do these authors care about how their content gets used? It seems not. Oshani Seneviratne, a student at MIT, recently did a study of licensing on content-sharing sites. She found that amidst the tremendous amount of reuse going on at those sites, only a tiny fraction of content producers and consumers paid any attention to questions of licensing. You might suspect that this is because most people have never had their content appropriated. But the reason isn’t important: the fact is that a huge population published without feeling a need to defend their content, and this seems likely to carry over to structured content. If it does become an issue, we have the same answer as we have for text: attribution. The web is full of blogs quoting other blogs, and that seems to satisfy just about everyone. There’s no reason we can’t do the same with structured data. Indeed, our DataPress project (a WordPress plugin that lets you put structured data visualizations in your blog) automatically creates “footnotes” in an article linking back to the source of any data it is importing.
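The mechanics of such attribution are cheap enough to automate. As a rough sketch—the function name and HTML structure here are hypothetical illustrations, not DataPress’s actual code—a publishing tool that imports a data set could emit a footnote block like this:

```python
# Hypothetical sketch of auto-generated attribution footnotes for
# imported data sets, in the spirit of DataPress (not its actual code).

def render_attributions(sources):
    """Given (title, url) pairs for imported data sets, build an HTML
    footnote list linking each visualization back to its source."""
    items = "".join(
        f'<li>Data: <a href="{url}">{title}</a></li>'
        for title, url in sources
    )
    return f"<ol class='data-sources'>{items}</ol>"

# The tool records each import and appends the footnotes to the post.
footnotes = render_attributions([
    ("US Census population estimates", "http://example.gov/census.json"),
])
print(footnotes)
```

The point is that attribution, like data export itself, can happen by default: the author never has to think about it, yet every reader can trace the data back to where it came from.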

In fact, if we look back to the earliest days of the web, I think Stefano’s got his history a bit wrong. Stefano argues that the difficulty of copying and pasting the entire web-site codebase protected web authors from having all their hard work appropriated, and that is the reason they were not deterred from putting content on the web. But I’d say this is a relatively recent phenomenon. In the earliest days of the web there was no codebase—all a web site offered was a set of static pages, and it was indeed straightforward to copy the entire content of the web site to another location. Clearly this did not pose much of a limit to the growth of content on the web.

I think Stefano’s concern stems partly from the particular model of structured data sharing that his company, Metaweb, is pursuing. At Metaweb, the goal is to get everyone contributing to one giant shared data repository. By its nature, that means you lose the ownership connection to the data you deposit. So it’s very important for you to feel like you’re getting something back for your “donation” of this data—that, just as in the open source community or on Wikipedia, the payback for your donation is that lots of other people will be donating as well. But I’m focused on a more personal and distributed model of data sharing, where individuals publish their data through their own outlets, where their ownership of it can survive others’ copying (hopefully with attribution) and reuse.

Programmers versus everyone else

I think these two discussions both reflect a difference in mindset. Stefano and Glen are both high-powered programmers, as you can see from the wonderful tools they’ve built. I, on the other hand, am not—just check my thesis if you want proof. If you’re a programmer, you naturally think about designing complex interactions with data. That means you need a powerful query language, which you as a programmer have the skill to learn and use. And once you’ve invested the programming work in creating your powerful visualization, you’re going to be pretty upset if someone takes all that work without giving anything back to you or the community. But if you’re a blogger, or an enthusiast about a particular kind of data, then what really drives you is the opportunity to communicate your enthusiasm to everyone else. As a non-programmer, you’ll be excited by even elementary data interactions like maps and faceted browsing. And, having invested less effort in tossing some structured data into your page, you’ll be less sensitive to others’ reuse of that data, especially if they cite your work, as is typical of blogs today.

So why isn’t structured data being published by non-programmers? Because we are currently missing a collection of tools that let regular people do the easy work with structured data. In particular, we need tools to let regular users create really elementary structured data visualizations on their own web sites, and fill those visualizations using basic copy and paste of structured data they find elsewhere on the web. If we can figure out how to make it easy enough, then I think we’ll see structured data explode.

The rest of the Needle team will be offended by the idea that I “built” it, but compensatingly amused by the idea that I’m a “high-powered programmer”.

Regardless, though, I agree that data-mobility is essential, and structured export is integral and pervasive in Needle.

From my perspective, your final point about the missing tools is the same thing I was trying to get at. If there aren’t any tools for the “non-programmers” to use to work with structured data, then the fact that they can download the data and look at the file-name in their downloads folder isn’t worth much. So it appears that we agree on everything except whether we agree!

To make the YouTube analogy more accurate, we would have to ask YouTube users to also upload raw footage, raw audio recordings, scripts, etc., rather than just final videos. That would allow more reuse, but would you expect people to do that without any additional benefit to themselves?

I don’t think that’s an accurate analogy for YouTube. Raw footage doesn’t reflect the message the author wants to send—it undermines it with distractions, mistakes, repetitions. It’s unprofessional. The data underlying a presentation isn’t like that—it’s part of the message.

But let’s set aside the issue of contradicting your own message. Obviously, any time you post content, there’s a risk it will be appropriated. But do people really care? If they did, there are many steps they could take to prevent it: explicitly forbidding reuse of their copyrighted blogs, putting up scanned images of their handwriting instead of text you can copy and paste, or simply not putting up their content at all. The mass of posted (and easily copyable) content is evidence that many authors aren’t thinking this way. I don’t see why they would start if data were part of their publishing.

Data underlying a presentation doesn’t tell the story. The presentation does, together with the surrounding natural language text. The data isn’t part of the message.

I think people often publish to express themselves, to tell stories. They do so with carefully scripted videos and well-composed essays. Neither raw data nor raw footage adds anything to those narratives, but both require extra effort. There’s just no incentive to go that extra distance, however short it is.

Unlike Stefano, I don’t think it’s about wanting to share the data or not. By the time the blog post or news article gets published, the race is over. Publishing data is just not a sub-goal of publishing.

BTW, I just checked the RDF file from TimBL’s site, and it’s still dated “2007/12/13”. There’s no incentive to keep that up-to-date, either.

You get people to publish data together with stories by
1. Requiring it by law, or passing down directives from higher up (like data.gov) – this lasts as long as the law is upheld
2. Social scolding – this can quickly degenerate into a “precarious value”
3. Engineering it into the tools (like Exhibit) – this lasts only as long as the tools remain popular
4. Business value – the million-dollar question

I’d bet on #1 and #4. For #1, we could imagine a GPL-like license for data: “if you use this data set to write a blog post or news story, you must publish the final raw data set feeding into your visualizations.”

I’ll stand behind #3, which, as I’ve argued, will work fine because users are not actively opposed to sharing the data. Regarding your objection that “this lasts only as long as the tools remain popular”, all that means is that tools that expose data have to stay popular forever. I think that’s achievable. Consider (text) copy and paste. It’s pervasive—nearly every tool that shows you text lets you copy it. Even when the data being presented is not natively in ASCII (such as in a PDF document), the tool writer went the extra mile to provide the feature. There’s no legal requirement for it, but people now take it for granted—such that in those rare cases where you can’t do it, like certain error messages in old versions of Windows, it is surprising and frustrating. People now expect copy/paste on text, so tool authors provide it. The same can happen with data copy/paste.

I’m with David (K.) on David (H.)’s #3. Writers don’t usually object to their text being copyable, I think, because it’s so patently useful to them that they can copy and paste other text (including stuff they themselves wrote before or elsewhere).

This effect is what we’re going for in Needle, and presumably what Stefano and David are going for in Freebase: we’re trying to make systems that are so rewarding for the data creators to use that they don’t even bother thinking about resenting what this makes possible for others.

I’m glad for the support, but I will question one bit—the idea that Freebase is going after the same goal. Freebase is a clear analogue of Wikipedia, and shares the same major characteristic: people don’t think of the information on it as “theirs” at all—rather, they think of it as a community resource to which they are contributing. This requires different drives—people need to feel rewarded by their own altruism in contributing to the community. This has obviously been a successful driver, but it isn’t the whole story. I’m very interested (as you can see from our DataPress tool) in the more selfish/personal blog/website analogue, where publication is driven by the author’s desire to be heard. This motivator has been in play for quite a bit longer than the communitarian wiki one, and I think it can motivate the publication of structured information just as it motivated the publication of unstructured information.

As CS researchers, we are all drawn to #3. But in this situation, I think #1 will get results the quickest and easiest. Say, if all data sets on data.gov carried a GPL-like license, that might get the snowball rolling. It’s not the most romantic solution for us CS folks, but maybe the most effective.

OK, how to seed the structured-data ecology is a different question. For that, I agree with you that #1 will be important. People generally have trouble getting started with a blank slate, and do better with some preexisting examples like data.gov or Freebase that they can start from (via copy/paste) or even learn from (via view source). But like the rolling snowball, the ecology can become self-sustaining once people realize how it works. Returning to my original theme that we should be addressing non-programmers, I think a BSD-type license on the seed material will be more effective here than a GPL.

Sure, Freebase is partly a communal space, but it’s also partly a hosting and application environment for personal data works, and in another part an advertisement for Metaweb, and its ecosystem now includes GridWorks, which facilitates, but doesn’t require, uploading to Freebase. So I think there’s a much stronger component of personal motivation there than in Wikipedia.

