thoughts on geospatial, augmenting capitalism, architectures of participation, and more

The Metadata problem. Or, the problem with metadata

In the geospatial domain, a big problem that many worry about is 'metadata'. Metadata is the information about the data: who collected it, when it was last updated, how accurate it is, how it was made, who to contact to get it, etc. For many years the FGDC, the coordinating organization for sharing geospatial information in the US, has primarily focused on getting people to write metadata for their datasets, and to put the metadata in catalogs of information so that others know what information is out there.

Unfortunately, though millions of dollars have been spent educating people on metadata standards and how to fill them out, there is still shockingly little metadata, let alone actual data, available. Many believe that metadata has to be the basis of the coming 'geospatial web', and that being able to at least search who has what information is the first step towards getting even more data available. If I know who has what information, then at least I know it exists in some form, and I can seek the owners out and offer to pay them for it, or so the argument goes.

The big counter to this argument, however, is the World Wide Web. When you write a web page, how much metadata are you required to fill out? Absolutely none. Yes, there are some meta tags in HTML, but none are required, and your web page will still be found if there are none. Why isn't this metadata needed? Because a whole industry has been built around helping you search web pages. Indeed, to judge by which sites get the most traffic, it's definitely the most important industry on the web. Why did these search engines come about? Because there was data. Lots of it. And people needed help finding it. In the early days it was Yahoo!, which was able to hire a bunch of people to search the web and categorize it. As the web started growing faster than a team of monkeys clicking all over the place could handle, automated techniques began to be used, with Google emerging as the clear winner.

And the web continues to innovate, with blogs that let one person follow another individual's recommendations of information that may be relevant to them, with community-rated sites like Slashdot and Digg, and with community tagging on sites like Flickr and del.icio.us. Many people are looking to apply such things to geospatial, but what needs to happen first is to put data online.

Unfortunately, many of the largest organizations that have data don't put it online. One argument is technical: that it costs too much and is too hard to set up a server to get the data out there. I hope that GeoServer, my main focus in the last few years, can offer a free, easy-to-use alternative that makes that argument less effective. But I believe there's a deeper issue, mostly psychological: individuals are scared to put their data out there. Why? Because the individuals who produce it fear that what they've made isn't good enough, that it has to be perfect, or people will think less of them. And it gets even worse, since there's this whole metadata pressure that says they had better have good metadata if they want to put things out there.

I understand the fear well. When my boss first asked me to release my code to the public repository, where anyone could look at it, it freaked me out. I asked him for an extra week, and spent it adding more comments, replacing my quicker hacks with cleaner code, etc. At the end of the week he asked me again, and I still didn't feel ready. What if someone read it and realized I was a bad coder? It might hurt my chances of a future job. It was putting a piece of myself out there for others to judge, and it was very scary. But I eventually got over it, because I realized that even very bad code that I wrote is generally better than the alternative, which is nothing.

In the geospatial domain, for the most part, we get nothing. People are afraid others might find errors, or they don't have the time to fill out the appropriate metadata. And beyond that, they lack the skills to set up a server, or a good place to just post their data. Though the Freedom of Information Act in the US basically requires most any information produced by the government to be available to all taxpayers, only a tiny percentage of geospatial information is actually available, let alone accessible to an average user.

I think one of the biggest things needed is a shift in thinking. Metadata needs an architecture of participation, and there needs to be a culture of encouragement. Indeed, we need an architecture of participation around geospatial data itself, so that releasing it doesn't open you up to criticism, but instead puts the onus on others to make what you've put out better, or to move on. This is how it works in the Open Source movement: released code is always seen as a good thing, even if it's not what I need. Once the data starts to get out there, I believe it will begin to make economic sense for companies to build search engines and participation-based schemes to organize it. The problem is not a lack of metadata. Instead, it's the focus on metadata that's slowing down getting real data out there for real innovation. I'll write more about what I think can help in a future post.