What if every subject that we think about can have explicit representations in our computers?

Tag Archives: XTM

InfoQ published an interview with Tom Preston-Werner on Powerset, GitHub, Ruby and Erlang. I really like projects that try to analyze text and resources on the web and implement “smart search”. Powerset is one of these projects. But what I like even more is the approach in which we explicitly represent facts and information items using open knowledge representation standards such as Topic Maps or RDF.

Topic Maps can play the role of “knowledge middleware” that helps to integrate the various components of the “smart search puzzle”. A topic map-based index makes it possible to represent and connect subjects and resources. Explicit representation of a relatively small number of relationships (“facts”, “assertions”) between resources and subjects can dramatically change the world of smart search.

Topic Maps-based knowledge middleware is a disruptive technology because it replaces proprietary knowledge organization schemas and modules, and it allows multiple players to build various solutions that help to create or use a smart index.

The Topic Maps-based Ontopedia PSI server, for example, can represent assertions that are created manually by users or generated by algorithms. We do not have our own text analysis infrastructure, but I hope that in the future we can leverage services on the web (such as OpenCalais) that perform text analysis on an “as needed” basis. The core abilities of the Ontopedia PSI server are maintaining explicit representations of subjects that are important to people and maintaining assertions about these subjects.

The new version of the Ontopedia PSI server can play the role of an aggregator that extracts assertions from existing topic maps/fragments hosted on other websites. Assertions from multiple sources are aggregated into one assertion set/information map/semantic index. The Ontopedia PSI server keeps track of information provenance and supports multiple truth values. The server, for example, can handle a situation in which one source on the web asserts that Person X did a Presentation P and someone else makes the opposite assertion.
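To make the idea of provenance and multiple truth values more concrete, here is a minimal sketch in Python. It is not the actual Ontopedia PSI server code: the `Assertion` class, the `presented` relationship name and all `example.org` URLs are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    subject: str    # PSI of the subject the assertion is about
    predicate: str  # hypothetical relationship name
    obj: str        # PSI (or literal) on the other end of the relationship
    truth: bool     # truth value claimed by the source
    source: str     # provenance: where the assertion was found


def aggregate(assertions):
    """Group assertions by statement, keeping the truth value of each source."""
    index = defaultdict(dict)
    for a in assertions:
        index[(a.subject, a.predicate, a.obj)][a.source] = a.truth
    return dict(index)


def contested(index):
    """Return the statements on which sources disagree."""
    return [stmt for stmt, votes in index.items()
            if len(set(votes.values())) > 1]


# Hypothetical PSIs and sources, for illustration only.
PERSON_X = "http://psi.example.org/person_x"
PRESENTATION_P = "http://psi.example.org/presentation_p"

index = aggregate([
    Assertion(PERSON_X, "presented", PRESENTATION_P, True,
              "http://site-a.example.org/map.xtm"),
    Assertion(PERSON_X, "presented", PRESENTATION_P, False,
              "http://site-b.example.org/map.xtm"),
])
```

In this sketch the aggregator does not decide who is right. It keeps both truth values together with their sources, so `contested(index)` can report the statement as disputed and a user can see “who said what”.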

I think that natural language processing can play a huge role in improving search. An ideal text analysis tool should allow me to provide ‘clues’ about the subjects in a text. I am looking for an equivalent of the kind of ‘binding’ that is used quite often in programming these days. I would love to be able to provide a list of main subjects in the form of PSIs to a text analysis tool (using embedded markup or attached external assertions). If I do so, I expect much more precise results. If I do not have an initial list of subjects, I expect some kind of suggestions from the text analysis tool that I can check against an existing information map.

Ontopedia (like many other Topic Maps-based projects) promotes the use of Published Subject Identifiers (PSIs) for “all thinkable” subjects. For example, there is an identifier for the TMRA 2008 conference: http://psi.ontopedia.net/TMRA_2008 .
There are identifiers for each presenter and presentation. Basic relationships between various subjects are also “mapped”/explicitly represented. Each basic resource, such as a blog post, can have a small assertion set that describes metadata (using the Dublin Core metadata vocabulary, for example) and maybe some main assertions. Traditional websites can provide combined assertion sets in XTM or RDF, which can be consumed by semantic aggregators such as the Ontopedia PSI server. Text analysis is great (when it is good enough). But even simple (semi-)manual “mapping” of subjects, resources and relationships can change the search game.

When we manually try to “map” an existing resource such as a conference website for the first time, it can look like a complicated and time-consuming task. Mapping a website for another conference will take much less time. And, of course, in many cases it is possible to reverse the traditional paradigm of building a website first and extracting assertions from it afterwards.

It is possible to build nice-looking and functional websites based on “assertion sets”. Topicmaps.com is a great example of this approach. It is driven by a topic map. Humans can enjoy the HTML-based representation of this site, and aggregators like the Ontopedia PSI server can consume the raw XTM-based representation and aggregate it with other assertion sets such as the TMRA 2008 conference assertion set.

XTM export has been available on the Subject-centric blog from the first day. But, I think, it was not obvious what readers could do with it. I added a link to the Subject-centric topic map in Omnigator (a Topic Maps browser).

I just finished reading RESTful Web Services. It is an amazing book and I think it will play a very important role in defining main principles of the next generation of the Web. The authors of the book introduce the Resource-Oriented Architecture (ROA) as an architecture for building the resource-centric programmable Web. “Resource” is a fundamental concept in this architecture.

“A resource is anything that’s important enough to be referenced as a thing in itself… What makes a resource a resource? It has to have at least one URI. The URI is the name and address of the resource…”

“… A resource can be anything a client might want to link to: a work of art, a piece of the information, a physical object, a concept, or a grouping of references to other resources… The client cannot access resources directly. A [ROA-based] web service serves representations of a resource: documents in a specific data formats that contain information about the resource…”

ROA defines principles of organizing data sets as resources, approaches to designing representations of these resources and main operations on these representations.

The key concept of Subject-centric computing (SCC) is a “Subject”, which is defined as “anything whatsoever, regardless of whether it exists or has any other specific characteristics, about which anything whatsoever may be asserted by any means whatsoever”. This definition is very close to the definition of a “Resource” in ROA.

But there are important differences between the main goals of ROA and SCC. Subject-centric computing is less concerned with managing resource/subject representations and with using universal HTTP operations such as GET, POST, PUT and DELETE to manipulate resources. SCC assumes that there are (at least potentially) a lot of different data sets and documents which describe or reference the same subject. With SCC, our main concern is identifying subjects reliably and bringing together different pieces of information related to the same subject.

As with ROA, we use (resolvable) URIs to identify Resources/Subjects. But in the case of SCC, we promote the use of Published Subject Identifiers (PSIs). If we have a subject that is not a “digital information item”, its PSI should be resolvable to a special kind of “document”: a Published Subject Descriptor (PSD). Each PSD provides a human-readable description of a subject which is sufficient to distinguish this subject from other subjects. In ROA terminology, a PSD is a special kind of representation that is introduced to convey “identification” information about a subject.

Many other “documents” and data sets which contain various assertions about the same subject can exist on the Web. SCC is concerned with providing the ability to collect these various assertions into a 360° view of the subject. PSIs are one of the main mechanisms for achieving this goal.

With SCC, we do not have the luxury of doing point-to-point data integration each time we have a new data set. That’s why we rely on a universal representation formalism, which is an important part of the ISO Topic Maps standard. Topic Maps also provide a universal merging mechanism that takes care of integrating various data sets published using an interchange syntax such as XTM.
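The core of the merging mechanism can be sketched in a few lines of Python: topics that share at least one subject identifier are treated as representing the same subject and are merged. This is a simplified illustration, not the full set of merging rules from the standard. The dictionary layout, the `merge_topics` name, the topic names and the `example.org` identifier are assumptions made for this example.

```python
def merge_topics(topics):
    """Merge topics that share at least one subject identifier.

    Each topic is a dict with two sets: 'identifiers' (PSIs) and 'names'.
    """
    merged = []  # invariant: identifier sets in this list are pairwise disjoint
    for topic in topics:
        current = {"identifiers": set(topic["identifiers"]),
                   "names": set(topic["names"])}
        remaining = []
        for existing in merged:
            if existing["identifiers"] & current["identifiers"]:
                # Same subject: absorb the existing topic into the current one.
                current["identifiers"] |= existing["identifiers"]
                current["names"] |= existing["names"]
            else:
                remaining.append(existing)
        remaining.append(current)
        merged = remaining
    return merged


# Two data sets that describe the same conference under different names.
from_site_a = {"identifiers": {"http://psi.ontopedia.net/TMRA_2008"},
               "names": {"TMRA 2008"}}
from_site_b = {"identifiers": {"http://psi.ontopedia.net/TMRA_2008",
                               "http://conferences.example.org/tmra-2008"},
               "names": {"TMRA 2008 conference"}}

result = merge_topics([from_site_a, from_site_b])
```

Here `result` contains a single topic that carries both names and both identifiers. No custom, point-to-point mapping between the two data sets was needed, only agreement on the identifier.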

One of the main goals of SCC is to support the “associative nature” of human thinking. ROA is quite often satisfied with “shallow” representations of associations (with the “a” HTML tag, for example). SCC is targeted more at semantically rich representations of relationships between subjects. Topic Maps help to represent and manage such relationships as “instance-type”, “supertype-subtype” and thousands of domain-specific association types. Representations of these relationships are available for processing at the semantic level. This makes it possible to implement integration scenarios which are “unthinkable” with HTML-like representations.

But in general, ROA and SCC are complementary architectures and can be successfully used together to build exciting applications and environments.

More and more applications can produce an XML representation of their internal information and save it to shared storage. This helps users to synchronize information across several computers. An XML representation also helps to create user communities based on sharing information. Think about shared calendars, music and picture mixes, blogs, recipes. It’s nice, but it can be much better… with topic maps.

Topic Maps provide “out of the box” support for information sharing and merging. This support is based on the ability to represent subjects explicitly and to connect any piece of information with subjects.

If we have a blog entry, for example, we have a standard mechanism to express that this entry is related to specific subjects. And we have a standard way to merge information from several blogs. As a result we can easily find all blog entries related to the same subject.
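As a rough illustration, the following Python sketch builds a subject index over entries from several blogs. The entry layout, the `entries_by_subject` function and the `example.org` URLs are hypothetical. A real implementation would read this information from the topic maps exported by each blog.

```python
from collections import defaultdict


def entries_by_subject(*blogs):
    """Index blog entry URLs by the PSIs of the subjects they are related to."""
    index = defaultdict(list)
    for blog in blogs:
        for entry in blog:
            for psi in entry["subjects"]:
                index[psi].append(entry["url"])
    return dict(index)


TMRA_2008 = "http://psi.ontopedia.net/TMRA_2008"

# Hypothetical entries from two independent blogs.
blog_a = [{"url": "http://blog-a.example.org/tmra-report",
           "subjects": [TMRA_2008]}]
blog_b = [{"url": "http://blog-b.example.org/my-talk",
           "subjects": [TMRA_2008, "http://psi.example.org/topic_maps"]},
          {"url": "http://blog-b.example.org/holiday",
           "subjects": []}]

index = entries_by_subject(blog_a, blog_b)
```

Looking up `index[TMRA_2008]` then returns the entries from both blogs, because both authors used the same identifier for the conference.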

“Pure” XML solutions can encode relationships between information pieces and subjects. But these solutions are based on custom schemas. Each time, we need to define custom merging rules, which may also include transformations between various XML schemas.

It is time… it is time to promote the XTM format as a “save as” option for various applications. Applications can use optimized internal data models to implement their specific sets of functions. But applications can also publish Topic Maps-based representations of their internal information to shared storage. Other applications can “subscribe” to external topic maps and merge external and internal information. Of course, applications remember the source of information so users can keep track of “who said what”.
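A “save as XTM” function does not have to be complicated. The sketch below uses Python’s standard `xml.etree.ElementTree` module to write a minimal document. It assumes the XTM 2.0 element names (`topicMap`, `topic`, `subjectIdentifier`, `name`, `value`) and namespace, and covers only topics with one identifier and one name. The `to_xtm` function and the topic id are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for XTM 2.0 documents.
XTM_NS = "http://www.topicmaps.org/xtm/"


def to_xtm(topics):
    """Serialize (topic_id, psi, name) triples as a minimal XTM 2.0 document."""
    ET.register_namespace("", XTM_NS)  # emit XTM as the default namespace
    root = ET.Element(f"{{{XTM_NS}}}topicMap", {"version": "2.0"})
    for topic_id, psi, name in topics:
        topic = ET.SubElement(root, f"{{{XTM_NS}}}topic", {"id": topic_id})
        ET.SubElement(topic, f"{{{XTM_NS}}}subjectIdentifier", {"href": psi})
        name_el = ET.SubElement(topic, f"{{{XTM_NS}}}name")
        ET.SubElement(name_el, f"{{{XTM_NS}}}value").text = name
    return ET.tostring(root, encoding="unicode")


xml_text = to_xtm([
    ("tmra-2008", "http://psi.ontopedia.net/TMRA_2008", "TMRA 2008"),
])
```

An application would write `xml_text` to shared storage. Associations, occurrences and provenance information are left out here to keep the example short.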

With “save as XTM” support, it will be possible to use “universal topic map browsers” to explore information from different applications. Users will also still be able to rely on specific applications with optimized views.