Oracle Blog

Don't panic !

Search, Tagging and Wikis

Problem Statement

Large companies such as Sun Microsystems have a number of knowledge management needs. Plain text search engines can help find documents by searching for keywords in document content. The problem is that:

This can return a huge number of documents

The keywords the user is thinking of may not be the right ones

The things searched for may not be text documents – they could be pictures, people or things

Search and its limitations

The first problem can be solved using PageRank-style algorithms such as the one used by Google on the Internet, which look at the topology of the graph formed by the links between pages to identify the pages that the web community as a whole considers important.
The second problem can be solved by automatic category extraction tools such as those developed by Exalead, which analyze the frequency of occurrence of words in documents to create a graph of the similarity space of concepts. This can be presented to the user, who can then find concepts similar to those he is looking for, and so narrow down his search. (I developed a Topic Graph Java Applet at AltaVista to do exactly this.)
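
To make the first of these two ideas concrete, here is a minimal sketch in Python of ranking pages by link topology alone, using plain power iteration. It is not Google's implementation: the four-page link graph, the damping factor of 0.85, the iteration count and the function name are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by link topology using simple power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small base score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to its targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page intranet in which every other page links to "handbook".
example_links = {
    "handbook": ["faq"],
    "faq": ["handbook"],
    "team-a": ["handbook"],
    "team-b": ["handbook", "faq"],
}
ranks = pagerank(example_links)
best_page = max(ranks, key=ranks.get)
```

With so few pages and links the scores are driven by a handful of votes, which is the small-community situation an intranet tends to be in.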

On the intranet, PageRank algorithms are not going to be as successful, as the user community is much smaller and people have less need to link to each other's documents: those who work on the same topic will tend to know each other well. The value of a link will therefore be less evident.
Concept extraction tools should be much more successful on the intranet, as there will be a lot of high quality content available. But even though they will help narrow down a query to more relevant search terms, they will not help find those documents that are the most authoritative. And so the results returned could just as well be out-of-date documents, only vaguely related documents, or other irrelevant ones. The ones everyone in the know is reading will not pop up to the top automatically.

Finally, neither concept extraction tools nor keyword searches are helpful for finding information about documents or things that are not textual in nature.

To find what a company deems most important, one needs the behavioral input of its members. And so the question is really: how does one generate this information?

From Bookmarking to Tagging to Wikis

One thing one can assume is that useful documents or things will be those people keep wanting to return to. Bookmarking is the way to find one's way back to documents in a stable information space - a space where things have reliable links (a.k.a. permalinks) - especially so when search itself is not reliable. Bookmarking is the only solution available to a successful information gatherer apart from search.
As bookmark lists grow they become unwieldy, and so they need to be categorized or tagged so that the user can find his way around his own private information space. Developing one's own private categorization/tagging scheme is itself complex, time-consuming and unreliable.
Working with a larger community helps this process dramatically, and so provides the first incentive for a lone information gatherer to participate in a larger communal structure. Bookmarking tools such as del.icio.us or slynkr just help the information gatherer do what he needs to do anyway.

Once bookmarking services are available and people are busy tagging the resources they bookmark, it is possible to find resources that other people have found to be similar (with respect to a tag) to the one being looked at. This is the beginning of the development of a conceptual scheme. All it then requires is a space to make these concepts more explicit and to reinforce their use. As tagging comes to be a communal activity, the space for defining the concepts develops communally too. It may start as just a helpful suggestion for other tags to use, but it can easily develop into something more informational.
An enhanced wiki would be such an informational space. It would help define the publicly agreed meaning of each tag as its users work on filling out the wiki pages associated with it. This meaning would itself be agreed to communally, that is, in a distributed fashion. Writing down the meaning more carefully would help disambiguate the tags and also serve as a repository of knowledge about the concept in question. An empty tag-wiki page would just link to all the documents that had been tagged that way. A more developed tag-wiki page would contain information about the meaning of the concept in question, its history, the people leading the changes, the place to find technical documentation, other related concepts, and much more. A fully semantic tag/wiki page would express some important elements of that content in terms of machine readable semantic relations.
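
The two simplest steps described above, finding resources that share a tag with the one at hand and generating an empty tag-wiki page that only lists what was tagged, can be sketched in a few lines of Python. The bookmark records, the example.org URLs and the function names are hypothetical, and a real service would of course keep this data in a store rather than in a list.

```python
from collections import defaultdict

# Hypothetical bookmarks: (user, resource URL, tag label).
bookmarks = [
    ("alice", "http://example.org/doc/solaris-zones", "solaris"),
    ("bob", "http://example.org/doc/solaris-zones", "virtualization"),
    ("bob", "http://example.org/doc/dtrace-guide", "solaris"),
    ("carol", "http://example.org/img/datacenter.png", "virtualization"),
]


def index_by_tag(bookmarks):
    """Map each tag label to the set of resources bookmarked with it."""
    index = defaultdict(set)
    for _user, resource, tag in bookmarks:
        index[tag].add(resource)
    return index


def related_resources(resource, bookmarks):
    """Resources that share at least one tag with the given resource."""
    related = set()
    for resources in index_by_tag(bookmarks).values():
        if resource in resources:
            related |= resources
    related.discard(resource)
    return related


def empty_tag_page(tag, bookmarks):
    """Plain-text stub for a tag-wiki page: just the tagged resources."""
    lines = ["Tag: " + tag, "Tagged resources:"]
    lines += ["  " + r for r in sorted(index_by_tag(bookmarks)[tag])]
    return "\n".join(lines)
```

Note that the picture in the example is found through its tag just like the text documents are, which is something keyword search over content cannot do.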

An Ontology For Tagging

During a week in Zürich we came up with the following elements needed to describe a Tag.

the tag

the event of tagging

the thing tagged

the person or agent doing the tagging

This gives the really simple UML diagram:

It is important to distinguish the tag from the tagging event itself, as otherwise one cannot count the number of times a tag was applied to a resource, which is one element in calculating the value of a resource. The other element in keeping track of the value of the resource is to find out who did the tagging, as the value one gives to the Tagger flows to the tagging event.
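
As a rough sketch of why the separation matters, the four elements can be written down as Python data classes. This is only an illustration under stated assumptions, not the ontology itself: the class and field names, the timestamp on the tagging event, the per-tagger weights and the default weight of 1.0 are all choices made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Tag:
    label: str


@dataclass(frozen=True)
class Tagging:
    """The event of one tagger applying one tag to one resource."""
    tag: Tag
    resource: str    # URL of the thing tagged
    tagger: str      # identifier of the person or agent doing the tagging
    when: datetime   # assumption: an event carries the time it happened


def tagging_count(taggings, tag, resource):
    """How many times this tag was applied to this resource."""
    return sum(1 for t in taggings if t.tag == tag and t.resource == resource)


def resource_value(taggings, resource, tagger_weight, default_weight=1.0):
    """Sum the weight of the tagger behind each tagging of the resource."""
    return sum(
        tagger_weight.get(t.tagger, default_weight)
        for t in taggings
        if t.resource == resource
    )


# Hypothetical data: two people apply the same tag to the same document.
solaris = Tag("solaris")
doc = "http://example.org/doc/solaris-zones"
taggings = [
    Tagging(solaris, doc, "alice", datetime(2007, 1, 10)),
    Tagging(solaris, doc, "bob", datetime(2007, 1, 11)),
    Tagging(Tag("virtualization"), doc, "bob", datetime(2007, 1, 11)),
]
weights = {"alice": 2.0}  # hypothetical value given to each tagger
```

Because each Tagging is its own record, the same Tag on the same resource can be counted twice, and the weight of each tagger can be carried over to the events they produced.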

Finally, it is important in a Tag to keep track of the schema of the tag. Tags evolve within a social context - be it one provided by flickr, del.icio.us, stumbleupon or slynkr - and this social evolution gives them particular meaning. The label “bank” will end up being associated with a very different tag depending on whether it is part of a tag cloud at a large banking institution or part of a tag cloud at an environmental agency. Using a URL scheme (as opposed to the URN schemes Tim Bray proposed recently) is clearly a big advantage, as it can help locate the relevant context.

As it turns out, the ontology proposed above is pretty much isomorphic with what Richard Newman came up with a few years ago in Tag ontology, and we should certainly try to work with what is happening at the Tag Commons[1].

Richard Newman points out that a Tag has many of the properties of a SKOS concept. Things that are tags could therefore also be skos:Concepts, giving us a handy and simple vocabulary for relating tags.

With the above frameworks it should be possible to import tags from any of the tag engines such as del.icio.us and keep the meanings of the labels used in those tag engines separated enough to be able to go back to the context of the tagging, yet close enough together to be able to do searches across tag engines on the labels, as I showed in Folksonomies Ontologies and Atom[2]. This is really important for intranet tagging engines, as people do not want to duplicate the tagging work they do at home when at work.
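
A minimal sketch of that idea, assuming each imported tag is identified by a scheme URL plus a label: the scheme keeps the tagging contexts apart, while the label is what a cross-engine search matches on. The two scheme URLs, the resources and the names used here are made up for the example and do not reflect how any particular tag engine names its tags.

```python
from collections import defaultdict
from typing import NamedTuple


class ImportedTag(NamedTuple):
    scheme: str   # URL identifying the tagging context (the tag engine)
    label: str    # the word the user typed

    @property
    def uri(self):
        """The tag's own identifier: its scheme URL followed by its label."""
        return self.scheme + self.label


# Hypothetical imports from two tag engines; the scheme URLs are illustrative.
HOME = "http://tags.example.org/home/"
WORK = "http://tags.example.com/work/"
imported = [
    (ImportedTag(HOME, "bank"), "http://example.org/rivers/erosion"),
    (ImportedTag(WORK, "bank"), "http://example.com/finance/loans"),
    (ImportedTag(HOME, "river"), "http://example.org/rivers/erosion"),
]


def search_by_label(imported, label):
    """Search across engines on a label, grouping the hits by tag URI."""
    results = defaultdict(list)
    for tag, resource in imported:
        if tag.label == label:
            results[tag.uri].append(resource)
    return dict(results)


hits = search_by_label(imported, "bank")
```

A search for “bank” returns both resources, but each stays attached to the tag URI it came from, so one can always go back to the context in which the tagging happened.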

Conclusion

Bookmarking is a necessary activity for information gatherers to keep track of content of interest to them. Tagging one's bookmarks is an important tool to help find them again. The energy spent on keeping this information can be harnessed to help build enterprise encyclopedias in a distributed fashion: these are also known as wikis. These wikis can provide deep content and pointers to everything of interest to members of a group. Using Semantic Web technologies, it should be possible to build this in an open way.

Further thoughts

How should the tag be related to the wiki page? Should the tag be given the URL of a wiki page? I.e. should we have something like this:

Or should there be a rdfs:seeAlso relation from the tag to the wiki page?

If a tagging is an association of a tag with an object, would it be useful to give the user some more granularity as to what the type of the relation is? Or are we going beyond tagging here? It would also mean that one would have to limit the number of tags in a tagging event to one.

Any other problems?

[1] Thanks Danny for rectifying my initial error. I thought Tom Gruber had been responsible for the RDF ontology. That was in fact Richard Newman. Tom Gruber has done a lot of work in helping people see the relation between ontologies and folksonomies and owns the Tag Commons site.

[2] Richard Newman also notes a very interesting parallel between a Tag and an RSS 1.0 item. This is not surprising. The evolution from bookmarking described earlier has been gone through once before. Blogging stands for Bookmark Logging, and the various formats of RSS and Atom are just formats for describing bookmarks. These evolved over time into a system that not only describes information about a resource, but is itself the new information resource. So we should not be surprised to find some very close commonalities between the syndication data models and what is needed for describing a Tag. I found a similar parallel for Atom. An Atom Entry is very reminiscent of a tagging. An Entry is an event of changing something in a resource at the updated time, which is initiated by an author, and to which one can associate a category, which is isomorphic with a Tag. The structure of an Atom Entry is a little heavier though, as it forces one to give the event a URI (the id) and some content.