The blog of @ldodds

Smushing Algorithms

I was pleased to see Leo Sauermann recently publish a draft smushing algorithm as he’s saved me a job! There’s some subsequent discussion on the ESW wiki.
I agree with Sauermann that this is an underspecified but significant area. I also suspect there’s room for a range of algorithms optimised for different purposes.
For example, in a simple application working on relatively small data sets, it may be simpler and sufficient to smush together all resources, irrespective of whether they’re blank nodes or URIs: just do a global merge that reorganizes the properties so that all the data about a thing is collated in a single resource. This could simplify things somewhat at the application level and would remove the need for a triple store that is aware of the semantics of owl:sameAs. This is what my own code does, for example.
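To make the global merge concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the code mentioned above: triples are plain (subject, property, value) tuples of strings, the set of inverse functional properties is a hypothetical one-item default, and the name smush_all is invented for the example. Resources that share an IFP value are unioned, and every triple is rewritten to use one representative per group.

```python
# Hypothetical default set of inverse functional properties (IFPs).
DEFAULT_IFPS = frozenset({"foaf:mbox"})


def smush_all(triples, ifps=DEFAULT_IFPS):
    """Merge all resources that share an IFP value, blank node or URI alike."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Arbitrary but deterministic choice: keep the smaller name.
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep

    # Pass 1: any two subjects with the same (IFP, value) pair are one thing.
    first_seen = {}
    for s, p, o in triples:
        if p in ifps:
            if (p, o) in first_seen:
                union(first_seen[(p, o)], s)
            else:
                first_seen[(p, o)] = s

    # Pass 2: rewrite subjects and objects to their representative.
    return {(find(s), p, find(o)) for s, p, o in triples}
```

Because union-find handles chains, a resource that matches one node on one IFP value and another node on a different value pulls all three together. The cost is that the original identifiers are gone once the merge is done.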
However, if you’re regularly trawling the web for data, maintaining provenance and the original URIs will be important. In that case, simply collapsing bNodes into a suitable “canonical resource”, with owl:sameAs linking the related resources, is more flexible.
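A rough sketch of this URI-preserving variant follows, again in Python with triples as string tuples. The assumptions are mine: blank nodes are recognised by a "_:" prefix, the canonical resource is simply the alphabetically first URI in each group, and the function name is hypothetical. For brevity it groups on each (IFP, value) pair in a single pass and does not chase chains across several IFP values.

```python
from collections import defaultdict

OWL_SAME_AS = "owl:sameAs"
# Hypothetical default set of inverse functional properties (IFPs).
DEFAULT_IFPS = frozenset({"foaf:mbox"})


def is_bnode(node):
    # Assumption: blank nodes are written with the "_:" prefix.
    return node.startswith("_:")


def smush_preserving_uris(triples, ifps=DEFAULT_IFPS):
    """Fold blank nodes into a canonical resource; link URIs with owl:sameAs."""
    # Group subjects by the (IFP, value) pair they share.
    groups = defaultdict(set)
    for s, p, o in triples:
        if p in ifps:
            groups[(p, o)].add(s)

    rewrite = {}      # blank node -> canonical resource
    same_as = set()   # owl:sameAs triples between distinct URIs
    for members in groups.values():
        uris = sorted(m for m in members if not is_bnode(m))
        bnodes = sorted(m for m in members if is_bnode(m))
        canonical = uris[0] if uris else bnodes[0]
        for bnode in bnodes:
            rewrite[bnode] = canonical
        for uri in uris[1:]:
            same_as.add((canonical, OWL_SAME_AS, uri))

    # Only blank nodes are rewritten; triples about URIs stay as published.
    merged = {(rewrite.get(s, s), p, rewrite.get(o, o)) for s, p, o in triples}
    return merged | same_as
```

Statements made about a URI stay attached to that URI, so a consumer can still tell which source said what, while the blank node data ends up on the canonical resource.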
For large data sets, especially where they’re incrementally updated, incremental smushing will be important. This suggests keeping indexes of inverse functional properties (IFPs) and their values to make the merging more efficient. Depending on the store implementation, it may also be more efficient to simply add properties to existing resources rather than merge the graphs and subsequently smush the data.
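As a sketch of what such an index might look like, the hypothetical IncrementalSmusher class below keeps a dictionary from (IFP, value) pairs to the resource already holding that value. A new description is checked against the index first, and its properties are added directly to the existing resource when there is a match, so no separate smushing pass is needed. The class and method names are invented for illustration, and the in-memory set stands in for a real store.

```python
class IncrementalSmusher:
    """Toy store that smushes each description as it arrives."""

    def __init__(self, ifps):
        self.ifps = set(ifps)
        self.index = {}       # (ifp, value) -> resource that owns it
        self.triples = set()  # stand-in for a real triple store

    def add_description(self, node, properties):
        """Add (property, value) pairs for node; return the resource used."""
        properties = list(properties)

        # Look for an existing resource with a matching IFP value.
        target = node
        for p, o in properties:
            if p in self.ifps and (p, o) in self.index:
                target = self.index[(p, o)]
                break

        # Attach the properties to that resource and index any new IFP values.
        for p, o in properties:
            self.triples.add((target, p, o))
            if p in self.ifps:
                self.index.setdefault((p, o), target)
        return target
```

One simplification to be aware of: if an incoming description matches two resources that are already in the store, this sketch uses the first match and does not go back to merge the two existing resources. A fuller implementation would need to handle that case.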
There’s a lot of scope for experimental research here to explore different approaches and their trade-offs. There’s plenty of data out there to play with, and some performance metrics would be a useful supplement to Sauermann’s specification.