Limits of Metadata

When you build a tool that analyses metadata, such as a search engine, there will always be people who want to fool it by faking metadata. This creates a problem of trust:

If you assume the metadata is correct, you can use it for your tool, but others can fool your tool by faking the metadata.

If you assume the metadata is being faked, you must compare it with the real data. But as soon as you start analysing the real data, there’s no need for the untrusted metadata anymore.

Case Stories

Google doesn’t use self-reported HTML metadata because people try to fool it, for example by adding fake keywords.

PeriPeri is a WikiEngine using RdfForWikis: it essentially makes the metadata part of the ordinary page text and implements a syntax rule that allows the engine to parse the IntrinsicMetaData. (See DublinCore.) This works because changes to the metadata are peer-reviewed just like any other change on the wiki.

SemanticWeb

Discussion

It’s a common myth that “Google doesn’t use metadata”. Google, in fact, depends very heavily on metadata – just not the self-reported metadata in HTML <link> and <meta> tags.

The way Google works (the main way, at least; there are some complexities, trade secrets and the like) is by following links from one page to another. When the GoogleBot sees a link like “Go see the <a href='http://www.communitywiki.org/'>CommunityWiki</a>”, it registers the fact that someone has associated the text “CommunityWiki” with the URL “http://www.communitywiki.org/”.
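As a rough illustration of that step, here is a minimal sketch in Python of how a crawler might pull (text, URL) associations out of a page. It uses the standard library's html.parser; the names AnchorCollector and extract_associations are made up for this example, and nothing here is meant to describe how the GoogleBot is actually implemented.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (anchor text, href) pairs from an HTML fragment.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (text, url) associations
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only text inside an open link counts as anchor text.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_associations(html):
    """Return the list of (anchor text, URL) pairs found in html."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


page = "Go see the <a href='http://www.communitywiki.org/'>CommunityWiki</a>"
print(extract_associations(page))
```

The surrounding words (“Go see the”) are ignored; only the text inside the link is recorded against the URL.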

This is, of course, metadata. When I link to you, I’m saying, “There is data here that is about this subject.” Just because it’s not self-reported doesn’t mean it’s not metadata.

The more links there are that associate this text with this URL (that is, the more human beings who say “http://www.communitywiki.org/ has-to-do-with CommunityWiki”), the more “Googlejuice” that relationship has. So when someone searches for “CommunityWiki”, Google.com will return the URL with the strongest relationship to that text.
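The counting idea can be sketched in a few lines of Python. This is a deliberately naive model: it treats every link as one equal vote, whereas a real engine presumably weighs links in ways not described here. The link data, the example.org URLs, and the function names build_index and search are all invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical observations: one (anchor text, target URL) pair per link found.
observed_links = [
    ("CommunityWiki", "http://www.communitywiki.org/"),
    ("CommunityWiki", "http://www.communitywiki.org/"),
    ("CommunityWiki", "http://example.org/fake-community-wiki"),
    ("MeatballWiki", "http://example.org/meatball"),
]


def build_index(links):
    """Map each anchor text (lowercased) to a count of the URLs it points at."""
    index = defaultdict(Counter)
    for text, url in links:
        index[text.lower()][url] += 1
    return index


def search(index, query):
    """Return the URL most often associated with query, or None if unseen."""
    counts = index.get(query.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


index = build_index(observed_links)
print(search(index, "CommunityWiki"))
```

In this toy data two independent links outvote the single fake one, which is the point of the argument: a lone faker has to out-link everyone else rather than just edit their own page.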

This is why Google appears to perform so much better than first-generation Web search engines. It uses collaborative, human-created, distributed metadata, in the form of links on the World Wide Web, to guide its search results. --EvanProdromou