Tuesday, July 28, 2009

Thoughts on GeoWeb Standards

Andrew Turner recently asked for our thoughts on GeoWeb standards, and I thought I'd put it as a post here instead of cluttering too much of his comment stream.

I've been thinking about the different standards and their place in the world a lot recently. I'm not someone who takes strong stances on anything, and you're not, I hope, going to read this post and think that I'm a KML partisan, and that it's only because I work at Google that I think positive thoughts about it. I prefer instead to explore the problem space.

The problem isn't adoption, clearly. It's findability.

Adoption rate:

There's no question that KML has a phenomenal adoption rate. Michael Jones went over the numbers during his GeoWeb talk, but in case you missed it:

More than 500,000,000 KML/KMZ documents on the Internet

More than 250,000 Internet websites hosting KML/KMZ content

2 billion placemarks accessible on the public Internet

Those are staggering numbers, especially compared to just last year, when Google announced it had indexed tens of millions of KML files on a hundred thousand unique domains. Growth by an order of magnitude in a year is a lot.

GeoRSS has also expanded rapidly. I don't have numbers for it, but I'm sure the count is also very large.

There are other formats, too, like GeoJSON, that are great, and I really look forward to seeing what happens with them.

Findability

Frankly, I think we can do a much better job. Fundamentally, one of the problems is that geographic data doesn't lend itself well to linkability. Sure, you can link within the data, but few people do. A limited number of KML files link to other KML files. GeoRSS can contain a variety of links, but they often point to HTML or binary media rather than to other geographic data files.

KML has been described as the HTML of geographic data. Whether that's true or not is a matter of some discussion, though I happen to think it is (more on that in another post I guess, after lots of people tell me I'm full of it). But one of the principal characteristics of HTML is linking, which is weakly implemented in KML. Linking happens in two places: the Atom link element and the description balloon. Atom links usually refer to HTML media, as in "this is the site credited with authorship." In the description balloon, you're operating in essentially an HTML environment, so people are less likely to author KML with links to other KML files and more likely to link to HTML. When authors do put in links to KML, they mostly point within their own site, not to KML files elsewhere.
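To make the two linking places concrete, here is a minimal sketch in Python that pulls links out of a KML document and checks which of them point at other KML. The sample document and its example.org URLs are made up for illustration, and the regular expression is a simplification that only catches double-quoted href attributes in the balloon HTML.

```python
import re
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical sample document; the example.org URLs are invented.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2"
     xmlns:atom="http://www.w3.org/2005/Atom">
  <Document>
    <atom:link href="http://example.org/about.html"/>
    <Placemark>
      <name>Trailhead</name>
      <description><![CDATA[
        <a href="http://example.org/trail.html">Trail notes</a>
        <a href="http://example.org/nearby.kml">Nearby trails</a>
      ]]></description>
      <Point><coordinates>-122.08,37.42,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""

# Simplification: only matches double-quoted href attributes.
HREF_RE = re.compile(r'href="([^"]+)"')


def extract_links(kml_text):
    """Return (atom_links, balloon_links) found in a KML string."""
    root = ET.fromstring(kml_text)
    # Place 1: atom:link elements anywhere in the document.
    atom_links = [el.get("href") for el in root.iter(f"{{{ATOM_NS}}}link")]
    # Place 2: HTML anchors embedded in description balloons.
    balloon_links = []
    for desc in root.iter(f"{{{KML_NS}}}description"):
        balloon_links.extend(HREF_RE.findall(desc.text or ""))
    return atom_links, balloon_links


def is_kml(url):
    """Crude check: does the URL look like a KML or KMZ file?"""
    return url.lower().endswith((".kml", ".kmz"))


atom_links, balloon_links = extract_links(SAMPLE_KML)
kml_targets = [u for u in atom_links + balloon_links if is_kml(u)]
print(kml_targets)
```

In this made-up sample, only one of the three links leads to more geographic data, which is roughly the pattern I describe above: most links go to HTML.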

My point isn't to encourage people to link to KMLs created by others, but rather that for findability purposes, on the HTML web we rely primarily on a link structure. The early web was made up of pages that delivered content, and linked to other sites. Whole pages developed early as directories of other sites, and they linked to other directories. Google web search was built on using the number and authority of links of others to rank pages. The "GeoWeb" isn't really a web in the same way. It uses the technologies that built the web, that live on the web, but it itself doesn't constitute a web in a meaningful sense. The vast majority of links to geographic data that I've found are HTML links within full HTML pages, with the next set being programmatic.

Is that the nature of geographic data? Or have we just not found the true linkability of it? I tend to think it's the former. Geographic data is hierarchical, it is ontological, it is content rich, it is combinable. It is linkable through common ontologies. But geographic data doesn't lend itself to easy linking in the same way HTML does. That's the nature of structured data: it must relate to a structure. Ontologies are almost the antithesis of linkability outside the domain.

So that suggests that we need to find another mechanism for findability. Deep searches are possible, but generally when you want geographic data, you want one of two things. Either you want points on a map ("This is where that thing is"), which is fairly easy to do and which I think we've done well, or you want a metadata search of some kind ("Give me all the polygons that fall within this bounding box, and where property X is between Y and Z"). No one does the latter well on a global scale, only within limited sets of data. Searching on text is great for web pages, because they are composed primarily of text. But searching for data is a whole other problem, not easily solved by our current mechanisms.
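For a sense of what that second kind of query looks like, here is a rough Python sketch over a small in-memory set of GeoJSON-style polygon features. Everything in it is an assumption for illustration: the feature data, the "population" property, and the helper names are invented, and it ignores polygon holes and boxes that cross the antimeridian.

```python
def bbox_of(ring):
    """Bounding box (min_x, min_y, max_x, max_y) of a ring of [x, y] pairs."""
    xs = [pt[0] for pt in ring]
    ys = [pt[1] for pt in ring]
    return min(xs), min(ys), max(xs), max(ys)


def within(inner, outer):
    """True if box `inner` lies entirely inside box `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def query(features, bbox, prop, low, high):
    """Polygons fully inside `bbox` whose `prop` value is in [low, high]."""
    results = []
    for feature in features:
        value = feature["properties"].get(prop)
        if value is None:
            continue
        # A polygon is inside an axis-aligned box exactly when the
        # bounding box of its outer ring is inside that box.
        outer_ring = feature["geometry"]["coordinates"][0]
        if within(bbox_of(outer_ring), bbox) and low <= value <= high:
            results.append(feature)
    return results


def square(x0, y0, x1, y1, name, population):
    """Build a hypothetical square polygon feature for the example."""
    ring = [[x0, y0], [x1, y0], [x1, y1], [x0, y1], [x0, y0]]
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": {"name": name, "population": population},
    }


# Invented data: A matches, B is out of range, C pokes outside the box.
FEATURES = [
    square(0, 0, 2, 2, "A", 500),
    square(1, 1, 3, 3, "B", 1500),
    square(8, 8, 12, 12, "C", 700),
]

hits = query(FEATURES, (0, 0, 10, 10), "population", 400, 1000)
names = [f["properties"]["name"] for f in hits]
print(names)
```

The linear scan is the point of the sketch: it works fine for a limited data set held in one place, and it is exactly the part that has no obvious equivalent across hundreds of millions of independently published files.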

Some people have written about using Semantic Web technologies to provide the linking, and Andrew in particular notes LinkedGeoData in the comments on his blog post. I've always been of the opinion that the Semantic Web is too complex. One of the joys of HTML is the ease with which you can link pages. The authoring tools aren't really there yet either. I'd be happy to be proved wrong. I used to think that standards like RDF, which have languished for so long, would never take off. However, the explosion of Ajax in the last few years has made me less skeptical. I don't know if the Semantic Web is the technology of the future and always will be, or if it will actually take off. I remain fairly skeptical, however, and as yet there's no widely adopted viewer for it either.

Combinability

Perhaps the true value of XML-based formats comes from their combinability, whether it's Atom (or RSS) and GML to make GeoRSS, or Atom and KML to produce a Google Data API, or Atom and KML to produce, well, KML containing Atom. This greatly increases their usability, and I think I sense another post coming on since this one is getting long. But my point is that the XML standards provide the only really good way of doing this while retaining proper namespaces. The downside is, of course, the verbosity of XML and the pain of XML Schema.
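As a small illustration of namespaces doing the combining, here is a Python sketch that reads an Atom feed carrying a GeoRSS Simple point. The feed content and its example.org URL are invented; the namespace URIs are the standard Atom and GeoRSS ones, and GeoRSS Simple writes a point as latitude followed by longitude.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used for the namespaced lookups below.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "georss": "http://www.georss.org/georss",
}

# Hypothetical feed; the entry and its URL are made up.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:georss="http://www.georss.org/georss">
  <title>Example feed</title>
  <entry>
    <title>Trailhead</title>
    <link href="http://example.org/trail.html"/>
    <georss:point>37.42 -122.08</georss:point>
  </entry>
  <entry>
    <title>No location here</title>
  </entry>
</feed>"""


def entries_with_points(feed_text):
    """Return (title, lat, lon) for each Atom entry with a georss:point."""
    root = ET.fromstring(feed_text)
    located = []
    for entry in root.findall("atom:entry", NS):
        title = entry.findtext("atom:title", default="", namespaces=NS)
        point = entry.findtext("georss:point", namespaces=NS)
        if point is None:
            continue  # a plain Atom entry, still perfectly valid
        lat, lon = (float(v) for v in point.split())
        located.append((title, lat, lon))
    return located


print(entries_with_points(FEED))
```

An Atom reader that knows nothing about GeoRSS can still consume this feed and simply ignore the extra element, which is the combinability I'm talking about.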

Wrap Up

Don't get me wrong, I think that KML and GeoRSS are great, as are a lot of other formats I haven't mentioned, like GeoJSON and others. Andrew asked also about other interesting topics, like expressiveness and durability, which I haven't gotten to. Ultimately, though, if we can't solve the findability problem, other technologies will come in that do.

9 comments:

Anonymous
said...

Spatial data doesn't really need to be linked, its position in the data universe is fixed by the sites that publish and link to it and by its explicit spatial location. You don't need hypertext to know that a feature is next to another feature.

That said, if certain search engine giants started doing a better job of spatial search, I think that the interlinking of spatial features may increase, if only by SEO-savvy folks.

What do I mean by a better job? Well, it doesn't seem like spatial search has had enough attention given to returning results that users might actually want. Here are a few points I've been thinking about:

- Semantic analysis of the KML content (folder name, description, placemark name, description, balloonstyle, extended data) is important but can not be taken in isolation. The meaning of these resources needs to both contribute to and take meaning from the parent site and other linkers. Site structure, page text, anchor text, etc, etc all play a role in understanding the data. All kinds of room for magic here which, as far as I can tell now, isn't happening in either direction.

- Spatial patterns are important. If a site has been spatially placed (by indicators like address in HTML, whois records, etc), and primarily has KML files in that area, then perhaps it should be more authoritative for the terms of its data than a site of similar rank in a different location. Along similar lines, consider the case for a spatial "authority" site based on sheer numbers of unique features.

- If there are located pages (HTML or KML) that are returned for a term, perhaps their position should be more strongly weighted based on the location of the searcher. There is some of this already, but I think it could be utilised more heavily.

- Take into account the characteristics of the parent site's data. Perhaps data which is sparse but clustered around certain locations should be returned more frequently for those locations, but not for the areas where the data is less dense. Similar logic applies when using a bounds-constrained search engine like GMaps. If there are local clustered resources, perhaps these results should return higher.

I remain convinced that the only way for the GeoWeb to take off is to allow webmasters to be lazy. The only reason the web is so effective today is because the only metadata that matters is the metadata derived by the search engines. Just make the search engines smarter and the GeoWeb will flourish.

"Spatial data doesn't really need to be linked, its position in the data universe is fixed by the sites that publish and link to it and by its explicit spatial location. You don't need hypertext to know that a feature is next to another feature."

True. You don't, Jason. Not within your own GIS.

But I'd be surprised if, scaling Nanaimo's system up and out, your individual parcel representations (HTML or KML) didn't link "outward" to representations of the municipalities and provinces and "inward" to other resources. These links tell user agents where they can go and what they can do to the resources without requiring them to do any spatial analysis of their own. Engine of application state and all that. Better yet, I can create new geographic resources that link to yours for context, asserting relationships that benefit users without requiring them to reconcile our different feature sets within their own GIS. I don't see links replacing spatial analysis at all. Instead, they'll represent some of the most routine analyses.

I'm more interested in the fact that the lack of linking that we have in these technologies does limit findability through mechanisms we've come to accept on the web. And the fact that it isn't actually a "web" in the sense of linked files.

This is an old discussion in GML. We used to talk (circa 2000) about "three clicks to my house" in GML. You will note that in GML feature instances have id's which are URI valued. You will also note that in GML properties can have xlink:href attributes which identify the resource which is the value of the property. These features are not there by accident. The original idea in GML was to deploy all feature instances as elements in files on the web. This does however have drawbacks. It might be scalable in this fashion - but maybe not - spatial indexes returning thousands or millions of object references might not be so nice. I believe there is a role for linking and an important one - but I believe there is also a role for session based query and update.

"Ontologies are almost the antithesis of linkablity outside the domain."

no, they're not. they are if you require people to understand and use and process your ontology. but on the other hand, they are an excellent source for providing richly linked representations to your clients, so depending on what your service or application is, you will derive different sets of links to maybe serve in different representations. ontologies are a good starting point to manage your rich and highly structured data, but you should not require others to use them. by following RESTful design principles, you can design and provide representations that are as rich or as simple as you want.

Jason, well put. We do need to get better, but I think what we need to figure out is which signals are good ones for us. So far, we don't have great signals telling us which sites other people consider authoritative for geographic data.

I remember somebody some years ago saying "you are what you can find." Text-based search engines have made it possible to find exponentially more information much faster than anyone ever thought possible.

Someone else has said that much of modern computing, and especially the web, is basically fast text processing. Mano's point, as I understand it, is that linking HTML pages is easy, because it involves linking words and phrases based on HTML. Finding things is the same: you look for words and phrases. But linking geospatial data (for example, overlapping polygons, points inside a polygon, or lines that intersect a line or polygon) and finding information based on these links is more complicated. As Jason points out, geospatial/GIS folks do these kinds of things routinely, but I think Mano's point is that they are seldom done on an internet scale, the scale at which a Google operates.

I think the same argument would apply to imagery where links were based on shared patterns. An internet example of image (specifically digital photos) linking might be Photosynth.

Of all the things that the semantic web could address, imagery (including digital photos, for example, Flickr) and geospatial data should be at the top of the list, not least because of the incredible volume of spatial and image data that is available on the web. And it may be more useful to concentrate on these two areas than try to create a general semantic web.

You can tell my OpenID provider sucks since my name is "id" :) For the record, this is Jason.

Mano, I obviously don't have any insight into the inner workings of Google (bummer) but my guess is that the geo (kml/georss) index is being treated too much in isolation from the web index.

Until you start seeing people interlinking geo data, the authority / quality of a site's geo resources may need to be proxied from a level up, so from the pages on the hosting site that link to the resource.

I understand that even though spatial isn't all that special, spatial search is a tougher problem than standard web search. I don't think it's a huge leap, though the resources expended may be considerably higher. Google is already enhancing the main web index with proximity-based relevance, so the spatial index must already be there to some degree. It's just a matter of moving beyond the "what's close to me" test to a series of signals based on the hosting site's spatial characteristics (nearby density, nearby specificity, total features, HTML results with high rank for similar terms) and of course the other quality signals inherited from the parent site. As more explicitly spatial data is published, you may also need to loosen up what search terms kick in the proximity test.

Google could also promote growth from the HTML side of the geoweb by doing things like giving some location karma to things like meta tags or microformats that indicate an HTML page's location. That way the web could participate more strongly in the geoweb and spatial search could return HTML pages with pushpins too :)

My recent blog post talks about publishing data as individual resources. What needs to be taken away from that is that this would not have happened to the degree that it has if the search engines hadn't become smart/deep enough to return individual pages. Without this driver for open data, I think we'd still be burdened with lots of framed sites and, now, more stateful AJAX applications.

Google obviously has the option to wait and see what happens, but in doing so is slowing down the rise of the geoweb.