Erik Wilde on Services and APIs

Tuesday, August 04, 2009

Data, Models, Metamodels, Cosmologies

when i recently tweeted about the fact that the semantic web or linked data (choose your preferred brand name) has this troubling attitude of one metamodel to rule them all, i got some feedback saying that this was wrong. it's not wrong, and i guess the reason for the confusion in that space is that data, models, metamodels, and cosmologies are somehow hard to grasp. so here is my brief attempt at laying all of them out in one handy overview (the selection of items in that overview is definitely not complete and is simply provided for illustrative reasons).

what does this mean, in particular the weird cosmology part? it starts with the observation that people usually begin with an application domain, and with certain recurrent features/patterns in that domain that they want to see supported in their models and tools. this is the reason why the hierarchical databases of the early days were replaced by the more appropriate tabular structures provided by ER and SQL. for the same reason, document processing has always had its own world: working with documents cannot be done effectively in unordered tabular structures, so this is where the trees come in. they exist in various flavors, but the main idea is that you have a tree and that the tree is ordered, which maps well to document structures.
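to make the ordered-tree point a bit more concrete, here is a minimal sketch (plain python with the standard library's xml.etree; the document and the row representation are made up for illustration):

```python
import xml.etree.ElementTree as ET

# a tiny made-up document: the order of the <p> children is part of the data
doc = ET.fromstring("<doc><p>first</p><p>second</p><p>third</p></doc>")
paragraphs = [p.text for p in doc]  # iterating preserves document order

# the "same" data as a set of rows: a set has no inherent order at all
rows = {("doc", "p", "first"), ("doc", "p", "second"), ("doc", "p", "third")}
```

in the tree, "first comes before second" is simply there; in the unordered rows, any ordering would have to be bolted on as an extra column.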

as long as you stay within the same cosmology, mappings usually work reasonably well, and working with the data can be done in at least comparable ways. there may be differences in features and tools and languages, but the big picture of how the data is structured is the same. sometimes metamodels differ in special features, for example in whether a model is even required: SQL data always needs a database to live in, whereas XML and RDF can happily live without a specific model (i.e., schema) and can still be used.
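that difference can be shown in a few lines (a minimal sketch using python's standard-library sqlite3 and xml.etree; the table and document are made up):

```python
import sqlite3
import xml.etree.ElementTree as ET

# XML: a well-formed snippet can be parsed and queried with no schema at all
snippet = ET.fromstring("<contact><name>erik</name></contact>")
xml_name = snippet.findtext("name")

# SQL: data cannot exist outside a table, and a table needs a declared model first
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE contact (name TEXT)")  # the model has to come first
db.execute("INSERT INTO contact VALUES (?)", ("erik",))
(sql_name,) = db.execute("SELECT name FROM contact").fetchone()
```

the XML side never declares anything; the SQL side cannot even accept the first row before a model exists.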

things become much more complicated if you want to take data from one cosmology and map it to another. this is always possible, but for non-trivial data it often results in a severe mismatch between the application model and the underlying assumptions of the metamodel. it also often means that the tools and languages in the new environment are no longer appropriate. imagine representing a complex XML document in RDF (i am sure there is some RDFS somewhere for representing XML). it can be done, sure, but compared to the tools provided with XSD and XQuery and XSLT, working with that data becomes much more complicated and very ineffective (and probably also very inefficient in terms of processing time for large document collections), because the inherently ordered tree structure of the data is no longer supported by the underlying metamodel.
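to illustrate the mismatch, here is a sketch of the three-paragraph document from above encoded as triples (plain python tuples standing in for triples; the vocabulary — hasChild, index, text — is entirely made up). since triples are unordered, child order has to be reified as explicit index statements:

```python
# a made-up triple encoding of <doc><p>first</p><p>second</p><p>third</p></doc>
triples = {
    ("doc1", "hasChild", "p1"), ("p1", "index", 1), ("p1", "text", "first"),
    ("doc1", "hasChild", "p2"), ("p2", "index", 2), ("p2", "text", "second"),
    ("doc1", "hasChild", "p3"), ("p3", "index", 3), ("p3", "text", "third"),
}

def children_in_order(node):
    # what a tree hands you for free now takes a join plus a sort
    kids = [o for (s, p, o) in triples if s == node and p == "hasChild"]
    index = {s: o for (s, p, o) in triples if p == "index"}
    return sorted(kids, key=lambda k: index[k])

texts = {s: o for (s, p, o) in triples if p == "text"}
ordered = [texts[k] for k in children_in_order("doc1")]
```

every order-sensitive operation on the document now pays this reconstruction cost, which is exactly the "metamodel does not support the structure" problem.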

some people may think that because it is always possible to define mappings between any two metamodels, there must be one overarching metamodel that spans all of the above cosmologies (as a layer on top of them), and since RDF is the metamodel with the fewest built-in constraints, the semantic web community often claims that RDF can be that one metamodel. however, this ignores the fact that there is a reason why there are different cosmologies and metamodels and model languages and tools: they have evolved over time to deal with certain classes of data, and they usually do a good job in their domains. RDF has had success because it, too, deals with a certain class of data: it is typically used for metadata. metadata, even though it is only loosely defined, typically is not data with complex structures, and thus a simple model such as RDF works well for it.
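the contrast with the document example above is worth making explicit: metadata is mostly flat statements about a resource, and that maps directly onto triples with no reification tricks needed (again a sketch with plain tuples; the predicate names are made up):

```python
# flat metadata statements about one resource: a natural fit for triples
metadata = {
    ("doc1", "title", "Data, Models, Metamodels, Cosmologies"),
    ("doc1", "creator", "Erik Wilde"),
    ("doc1", "date", "2009-08-04"),
}

def value(subject, predicate):
    # a single-statement lookup, no joins, no ordering to reconstruct
    return next(o for (s, p, o) in metadata if s == subject and p == predicate)

author = value("doc1", "creator")
```

no ordering, no nesting: the data is already shaped like the metamodel.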

the motivation for this post is simply to show that RDF is not the metamodel to end all metamodels. it is one metamodel, simply a new addition to the existing multitude of cosmologies, metamodels, and models that have been around for a long time, and it has found an application area for which it is well-suited. claiming that its simplicity (i.e., the ability to map other cosmologies to the RDF cosmology) means that all data out there can be appropriately represented and handled as RDF ignores the fact that in the end, it is not important that it is possible to do something, but only how effective you are when you are doing it: if your job is processing large collections of documents, the question is which models, languages, and tools make you most productive at getting that job done.

Comments


Well, not quite. Just as there are piles and piles of XML DTDs and Schemas out there to choose from, each of which tackles a particular domain from a particular perspective, there is also a growing pile of semantic web vocabularies out there for tackling a particular domain from a particular perspective. The point of expressing them all in RDF is to provide easy ways for the data expressed via those different vocabularies to play well together. XHTML, I think, aimed at similar goals, but didn't go over so well. RDF has a similar problem, perhaps, but the goal is the same: provide a way to express data in a format that makes it easy to mix data from different domains and vocabularies together. There are different XML schemas for different things, and there are different semantic web vocabs for different things. RDF just provides a way for them to play well together that has, I think, more potential than mixing things up with XML namespaces.

That said, I see your point about effectiveness. Totally. But, I think that in terms of data-mes(s)hing-together, RDF provides an extraordinarily effective mechanism. The real problem is how to deal with all that mes(s)hiness, especially when it comes to user interfaces. From that angle, it really is still a messh out there in the linked data world.

I'll try out an X(HT)ML:CSS :: RDF:???? analogy. CSS has done really, really well at demonstrating how nice, clean separation of content and display can make life easier for everyone, especially web developers. Semantic web folks, I think, don't yet have the last part of that analogy. Maybe we're waiting for a messiah -- more likely we've been concentrating too much on the effectiveness of the data-messhing, and haven't been able to convince the user experience gurus to explore what they could do with RDF. Tom Heath is leading the way there.

When it all comes (finally!) together, I really do think that it'll make everyone more effective at things we hadn't thought about before.

@patrick: thanks for your comment. and i absolutely agree that RDF is better at simply mixing things together than probably anything else out there, which is great and works like a charm as long as your domain model can be appropriately expressed in some RDF schema. however, if you have data that just does not map too well to the highly generalized model of RDF, let's say deeply nested data and a lot of sequential data, then you might still be able to easily mash it up, but you are no longer able to work effectively with your data. that was my main point, and just to be sure i am not misunderstood: RDF is great for many applications and has very nice properties when it comes to mixing things.

another area where the mixability of RDF has some unintended side-effects is provenance: since it is so easy to mix things, it is also easy to lose track of where they came from. this is fine if you just want to amass RDF data, but in scenarios where you need to later interact with the data, let's say update something and then write it back to where it came from, this actually gets quite hairy. in this case, RDF's "let's mix things easily" and REST's "let's interact based on self-contained representations of resources" collide, and figuring out what to do here is still a research topic.
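to sketch that collision in a few lines (a purely illustrative example with plain python tuples standing in for triples and quads; the sources and statements are made up):

```python
# two made-up triple sets fetched from two different sources
source_a = {("doc1", "title", "Services and APIs")}
source_b = {("doc1", "license", "CC-BY")}

# merging is trivial -- but the origin of each triple is now gone
merged = source_a | source_b

# keeping provenance means carrying the source along, e.g. as a fourth element
quads = ({(s, p, o, "http://a.example/") for (s, p, o) in source_a}
         | {(s, p, o, "http://b.example/") for (s, p, o) in source_b})

def triples_from(source):
    # writing a change back requires knowing which source a triple came from
    return {(s, p, o) for (s, p, o, g) in quads if g == source}
```

the quad step is essentially what named graphs do; the open question is what the read/update/write-back interaction should look like on top of that.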