Persisted Knowledge - Information in actionable context, available in transient or persistent forms and expressed using a Graph Data Model. A modern knowledgebase would more than likely use RDF as its Data Language, RDFS as its Schema Language, and OWL as its Domain Definition (Ontology) Language. Actual Domain, Schema, and Instance Data would be serialized using formats such as RDF/XML, N3, and Turtle.
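As a minimal illustration of those three layers in one document (the ex: namespace and its terms are hypothetical, not any real vocabulary), a Turtle serialization might look like:

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix ex:   <http://example.org/> .

ex:Person a owl:Class ;          # domain definition (OWL)
    rdfs:label "Person" .        # schema-level annotation (RDFS)

ex:alice a ex:Person ;           # instance data (RDF)
    rdfs:label "Alice" .
```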

How do Data Spaces and Databases differ? Data Spaces are fundamentally problem-domain-specific database applications. They offer functionality that you would instinctively expect of a database (e.g., ACID data management) with the additional benefit of being data-model and query-language agnostic. Data Spaces are for the most part DBMS Engine and Data Access Middleware hybrids, in the sense that ownership and control of data is inherently loosely coupled.

How do Data Spaces and Content Management Systems differ? Data Spaces are inherently more flexible: they support multiple data models and data representation formats. Content management systems do not possess the same degree of data-model and data-representation dexterity.

How do Data Spaces and Knowledgebases differ? A Data Space cannot dictate the perception of its content. For instance, what I may consider knowledge relative to my Data Space may not be so to a remote client that interacts with it from a distance. Thus, defining my Data Space purely as a Knowledgebase introduces constraints that reduce its broader effectiveness to third-party clients (applications, services, users, etc.). A Knowledgebase is based on a Graph Data Model, resulting in significant impedance for clients that are built around alternative models. To reiterate: Data Spaces support multiple data models.

Where do Data Spaces fit into the Web's rapid evolution? They are an essential part of the burgeoning Data Web / Semantic Web. In short, they will take us from data “Mash-ups” (combining web accessible data that exists without integration and repurposing in mind) to “Mesh-ups” (combining web accessible data that exists with integration and repurposing in mind).

I've just re-read an article penned by Dan Brickley in 1999 titled The WWW Proposal and RDF: Then and Now, which retains its prescience to this very day. Ironically, I stumbled across this timeless piece while revisiting the RSS name imbroglio that gave us a simple syndication format (RSS 2.0) that will ultimately implode (IMHO), since "Simple" is ultimately short-lived when dealing with attention-challenged end-users who are always assumed to be dumb when in fact they are simply ambivalent.

Although I don't believe in complex entry points into complex technology realms, I do subscribe to the approach where developers deal with the complexity associated with a problem domain while hiding said complexity from ambivalent end-users via coherent interfaces -- which does not always imply User Interface.

XBRL is a great piece of work that addresses the complex problem domain of Financial Reporting. The only thing it's missing right now is an Ontology that facilitates an RDF Data Model based rendition of XBRL Schema and Instance Data, which would ultimately make XBRL data available to RDF query languages such as SPARQL. This line of thought implies, for instance, an XML Schema to OWL Ontology mapping for Schema Data (as explained in a white paper by the VSIS Group at the University of Hamburg), leaving the Instance Data to be generated in a myriad of ways that include XML to RDF and/or XML->SQL->RDF.
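As a rough sketch of the instance-data half of that pipeline (the element names, namespaces, and the xml_to_ntriples helper below are all illustrative assumptions, not the real XBRL vocabulary), a naive XML-to-RDF mapping might look like:

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical XBRL-like instance document.
xml_doc = """
<report xmlns="http://example.org/xbrl-like">
  <entity id="acme">
    <revenue>1000000</revenue>
    <netIncome>250000</netIncome>
  </entity>
</report>
"""

NS = "{http://example.org/xbrl-like}"
BASE = "http://example.org/data/"
VOCAB = "http://example.org/vocab#"

def xml_to_ntriples(xml_text):
    """Map each child element of an <entity> to one RDF triple,
    serialized as an N-Triples line: <subject> <predicate> "object" ."""
    root = ET.fromstring(xml_text)
    triples = []
    for entity in root.iter(NS + "entity"):
        subject = BASE + entity.get("id")
        for child in entity:
            predicate = VOCAB + child.tag.replace(NS, "")
            triples.append('<%s> <%s> "%s" .' % (subject, predicate, child.text))
    return triples

for line in xml_to_ntriples(xml_doc):
    print(line)
```

A real mapping would of course be driven by the ontology rather than hard-coded, but the shape of the transformation is the same: XML elements become predicates, and element content becomes literal objects.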

As I stated in an earlier post: we should not mistake ambivalence for lack of intelligence. Assuming "Simple" is right at all times is another way of subscribing to this profound misconception. You know, assuming the world was flat (as opposed to a geoid) was quite palatable at some point in the history of mankind; I wonder what would have happened if we had held on to this point of view to this day because of its "Simplicity".

Following from the post about a new Multithreaded RDF Loader, here are some intermediate results and action plans based on my findings.

The experiments were made on a dual 1.6GHz Sun SPARC with 4G RAM and 2 SCSI disks. The data sets were the 48M triple Wikipedia data set and the 1.9M triple Wordnet data set. 100% CPU means one CPU constantly active. 100% disk means one thread blocked on the read system call at all times.

Starting with an empty database, loading the Wikipedia set took 315 minutes, amounting to about 2500 triples per second. After this, loading the Wordnet data set with cold cache and 48M triples already in the table took 4 minutes 12 seconds, amounting to 6838 triples per second. Loading the Wikipedia data had CPU usage up to 180%, but over the whole run CPU usage was around 50%, with disk I/O around 170%. Loading the larger data set was significantly I/O bound, while loading the smaller set was more CPU bound, yet did not reach the full 200% CPU.

The RDF quad table was indexed on GSPO and PGOS. As one would expect, the bulk of I/O was on the PGOS index. We note that the pages of this index were on average only 60% full. Thus the most relevant optimization seems to be to fill the pages closer to 90%. This will directly cut about a third of all I/O, and will have an additional windfall benefit in the form of better disk cache hit rates resulting from a smaller database.
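The arithmetic behind that estimate can be checked directly (the row and page counts below are arbitrary; only the 60% vs. 90% fill ratio matters):

```python
# Back-of-envelope check: raising average index-page fill from 60% to 90%
# shrinks the page count, and hence the I/O on that index, proportionally.
rows = 1_000_000          # arbitrary row count; only the ratio matters
rows_per_full_page = 100  # arbitrary page capacity

pages_at_60 = rows / (rows_per_full_page * 0.60)
pages_at_90 = rows / (rows_per_full_page * 0.90)

saving = 1 - pages_at_90 / pages_at_60
print(round(saving, 3))   # about a third of the pages, and of the I/O
```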

The most practical way of having full index pages in the case of unpredictable random insert order will be to take sets of adjacent index leaf pages and compact the rows so that the last page of the set goes empty. Since this is basically an I/O optimization, this should be done when preparing to write the pages to disk, hence concerning mostly old dirty pages. Insert and update times will not be affected since these operations will not concern themselves with compaction. Thus the CPU cost of background compaction will be negligible in comparison with writing the pages to disk. Naturally this will benefit any relational application as well as free text indexing. RDF and free text will be the largest beneficiaries due to the large numbers of short rows inserted in random order.
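A minimal sketch of the compaction step described above (in Python for illustration; the compact function and its page layout are assumptions, not Virtuoso's actual implementation):

```python
def compact(pages, capacity):
    """Compact a run of adjacent index leaf pages (lists of rows) so that
    earlier pages are filled to capacity and trailing pages go empty.
    Row order is preserved; empty trailing pages can then be released."""
    rows = [row for page in pages for row in page]
    compacted = [rows[i:i + capacity] for i in range(0, len(rows), capacity)]
    # Keep the same number of page slots so the caller can free the empties.
    compacted += [[] for _ in range(len(pages) - len(compacted))]
    return compacted

# Three pages at 60% of a 10-row capacity...
pages = [list(range(0, 6)), list(range(6, 12)), list(range(12, 18))]
result = compact(pages, 10)
print([len(p) for p in result])   # [10, 8, 0] -- the last page goes empty
```

Because this runs only when old dirty pages are about to be written out, the cost is hidden behind the disk write it is saving.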

Looking at the CPU usage of the tests, locating the place in the index where to insert, which by rights should be the bulk of the time cost, was not very significant, only about 15%. Thus there are many unused possibilities for optimization, for example rewriting in C some parts of the loader currently done as stored procedures. Also, the thread usage of the loader, with one thread parsing and mapping IRI strings to IRI IDs and 6 threads sharing the inserting, could be refined for better balance, as we have noted that the parser thread sometimes forms a bottleneck. Doing the updating of the IRI name to IRI ID mapping on the insert thread pool would produce some benefit.

Anyway, since the most important test was I/O bound, we will first implement some background index compaction and then revisit the experiment. We expect to be able to double the throughput of the Wikipedia data set loading.

Continuing on from the previous post... If Microsoft opens the right interfaces for independent developers, we see many exciting possibilities for using ADO .NET 3 with Virtuoso.

Microsoft quite explicitly states that their thrust is to decouple the client side representation of data as .NET objects from the relational schema on the database. This is a worthy goal.

But we can also see other possible applications of the technology when we move away from strictly relational back ends. This can go in two directions: towards object-oriented databases, and towards making applications for the Semantic Web.

In the OODBMS direction, we could equate Virtuoso table hierarchies with .NET classes and create a tighter coupling between client and database, going, as it were, in the opposite direction from Microsoft's intended decoupling. For example, we could do typical OODBMS tricks such as prefetching objects based on storage clustering. The simplest case of this is like virtual memory, where the request for one byte brings in the whole page or group of pages. The basic idea is that what is created together probably gets used together, and if all objects are modeled as subclasses (subtables) of a common superclass then, regardless of instance type, what is created together (has consecutive IDs) will indeed tend to cluster on the same page. These tricks can deliver good results in very navigational applications like GIS or CAD. But these are rather specialized things and we do not see OODBMS making any great comeback.

But what is more interesting and more topical at present is making clients for the RDF world. There, the OWL ontology language could be used to generate the .NET classes, and the DBMS could, when returning URIs serving as subjects of triples, include specified predicates on these subjects, enough to allow instantiating .NET instances as 'proxies' of these RDF objects. Of course, only predicates for which the client has a representation are relevant, thus some client-server handshake is needed at the start. The data to prefetch would be roughly the intersection of a Concise Bounded Description and what the client has classes for. The rest of the mapping would be very simple, with IRIs becoming pointers, multi-valued predicates becoming lists, and so on. IRIs for which the RDF type was not known or inferable could be left out or represented as a special class with name-value pairs for its attributes; the same goes for blank nodes.
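The triple-to-proxy mapping sketched above can be shown in miniature (in Python standing in for .NET; the RdfProxy class, the proxy_for helper, and the CURIE-style names are all illustrative assumptions, not part of any real binding):

```python
RDF_TYPE = "rdf:type"

class RdfProxy:
    """Client-side 'proxy' of an RDF subject."""
    def __init__(self, iri, attrs):
        self.iri = iri
        for name, values in attrs.items():
            # multi-valued predicates become lists, single values stay scalar
            setattr(self, name, values if len(values) > 1 else values[0])

def proxy_for(subject, triples, known_types):
    """Instantiate a proxy only when the subject's rdf:type is one the
    client has a class for (here approximated by a set of type names)."""
    attrs = {}
    for s, p, o in triples:
        if s == subject:
            attrs.setdefault(p, []).append(o)
    rdf_type = attrs.pop(RDF_TYPE, [None])[0]
    if rdf_type not in known_types:
        return None
    return RdfProxy(subject, attrs)

triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:carol"),
]
p = proxy_for("ex:alice", triples, {"foaf:Person"})
print(p.iri, getattr(p, "foaf:name"), getattr(p, "foaf:knows"))
```

In a real binding the object values "ex:bob" and "ex:carol" would themselves be resolved to proxies (the IRIs-become-pointers step), subject to the same prefetch negotiation.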

In this way, .NET's considerable UI capabilities could be directly exploited for visualizing RDF data, given only that the data complies reasonably well with a known ontology.

If a SPARQL query returned a result set, IRI-typed columns would be returned as .NET instances and the server would prefetch enough data to fill them in. For a SPARQL CONSTRUCT, a collection object could be returned with the objects materialized inside. If the interfaces allow passing an Entity SQL string, they could possibly be specialized to allow a SPARQL string instead. LINQ might have to be extended to allow for SPARQL-style queries, though.

Many of these questions will be better answerable as we get more details on Microsoft's forthcoming ADO .NET release. We hope that sufficient latitude exists for exploring all these interesting avenues of development.

Microsoft's recent unveiling of the next generation of ADO.NET has pretty much crystallized a long-running hunch that the era of standardized client/user-level interfaces for "Object-Relational" technology is nigh. Finally, this application / problem domain is attracting the attention of industry behemoths such as Microsoft.

My hope is that Microsoft's efforts trigger community-wide activity resulting in a collection of interfaces that make possible scenarios such as generating .NET-based Semantic Web Objects (where the S in an S->P->O RDF triple becomes a bona fide .NET class instance generated from OWL).

To be continued since the interface specifics re. ADO.NET 3.0 remain in flux...

Ontology is a key foundation of the Semantic Web. Without ontology, it will be difficult for applications to share knowledge and reason over information that is published on the Web. However, it is a serious mistake to think that the Semantic Web is simply a collection of ontologies.

Last week I was invited to be on a panel discussion at the Humans and the Semantic Web Workshop. I talked a bit about the Geospatial Semantic Web and its associated research issues. Overall the workshop went very well. You can read the notes from the workshop here.

New Thoughts

Some of my new thoughts after the workshop follow.

People, especially those who are new to the Semantic Web, have put too much emphasis on developing ontologies and not enough emphasis on developing application functions.

While ontology languages such as RDF and OWL are an important part of current Semantic Web development, it's a mistake to build Semantic Web applications that assume average users are fluent in those languages.

Many people seem to have forgotten that building Semantic Web applications doesn't have to start with ontology development. It's a good idea to start with ontology reuse — i.e. reuse ontologies that have already been developed, even if they don't meet every single requirement of the application.

There is no excuse to build 'crappy' UIs just because developing Semantic Web applications is challenging.

Hide Low-Level Details from the Semantic Web Users

I was asked the question, 'What are the user-related issues that Semantic Web developers must pay attention to?' I think building Semantic Web applications is similar to building database applications. There are a few things we can learn from our past experience building database applications.

When building database-driven applications, we store information in SQL databases, and we use SQL to access, manipulate, and manage this information. When building Semantic Web applications, we express ontologies and information in RDF, and use RDF query languages (e.g. SPARQL) to access and manipulate this information.

When building database-driven applications, we hide complexity from the end-users. For example, we almost never expose raw SQL statements to the end users, or ask users to process the raw result sets returned from an SQL engine. We always provide intuitive interfaces for accessing and representing information.

When building Semantic Web applications, we should also hide complexity from the end-users. Users shouldn't need to see or edit RDF statements. Users shouldn't need to be fluent in SPARQL or able to parse the graphs that are returned by a SPARQL engine.
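The database-side half of this parallel can be sketched in a few lines (table, data, and function names are illustrative): the raw SQL lives behind a small function, and callers see plain values, never statements or result cursors. The same discipline applies verbatim if the function body held a SPARQL query instead.

```python
import sqlite3

def make_store():
    """Set up a tiny in-memory database; callers never see the SQL here."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE contacts (name TEXT, email TEXT)")
    db.execute("INSERT INTO contacts VALUES ('Alice', 'alice@example.org')")
    return db

def email_for(db, name):
    """The intuitive interface: a name in, an email (or None) out."""
    row = db.execute(
        "SELECT email FROM contacts WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None

db = make_store()
print(email_for(db, "Alice"))   # alice@example.org
```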

Concluding Remarks

Semantic Web developers should spend more time on building functional capabilities that solve real world problems and improve people’s productivity. It’s important to remember that ‘the Semantic Web != ontologies‘.

The return of WinFS to SQL Server has re-ignited interest in the somewhat forgotten "DBMS Engine hosted Unified Storage System" vision. The WinFS project's struggles have more to do with the futility of "Windows Platform Monoculture" than with the actual vision itself. In today's reality you simply cannot seek to deliver a "Unified Storage" solution that is inherently operating-system specific and, even worse, ignores existing complementary industry standards and the loosely coupled nature of the emerging Web Operating System.

A quick FYI:
Virtuoso has offered a DBMS hosted Filesystem via WebDAV for a number of years, but the implications of this functionality have remained unclear for just as long. Thus, we developed (a few years ago) and recently released an application layer above Virtuoso's WebDAV storage realm called "The OpenLink Briefcase" (née oDrive). This application allows you to view items uploaded by content type and/or kind (People, Business Cards, Calendars, Business Reports, Office Documents, Photos, Blog Posts, Feed Channels/Subscriptions, Bookmarks, etc.). It also includes automatic metadata extraction (where feasible) and indexing. Naturally, as an integral part of our "OpenLink Data Spaces" (ODS) product offering, it supports GData, URIQA, SPARQL (note: WebDAV metadata is sync'ed with Virtuoso's RDF Triplestore), SQL, and WebDAV itself.

There is an interesting article at regdeveloper.com titled "Structured data is boring and useless". This article provides insight into a serious point of confusion about what exactly constitutes structured vs. unstructured data. Here is a key excerpt:

"We all know that structured data is boring and useless; while unstructured data is sexy and chock full of value. Well, only up to a point, Lord Copper. Genuinely unstructured data can be a real nuisance - imagine extracting the return address from an unstructured letter, without letterhead and any of the formatting usually applied to letters. A letter may be thought of as unstructured data, but most business letters are, in fact, highly-structured." ....

"The labels "structured data" and "unstructured data" are often used ambiguously by different interest groups; and often used lazily to cover multiple distinct aspects of the issue. In reality, there are at least three orthogonal aspects to structure: * The structure of the data itself.* The structure of the container that hosts the data.* The structure of the access method used to access the data. These three dimensions are largely independent and one does not need to imply another. For example, it is absolutely feasible and reasonable to store unstructured data in a structured database container and access it by unstructured search mechanisms."

Data understanding and appreciation is dwindling at a time when the reverse should be happening. We are supposed to be in the throes of the "Information Age", but for some reason this appears to have no correlation with data and "data access" in the minds of many — as reflected in the broad contradictory positions taken re. unstructured vs. structured data: structured is boring and useless while unstructured is useful and sexy.

The difference between "Structured Containers" and "Structured Data" is clearly misunderstood by most (an unfortunate fact).

For instance, all DBMS products are "Structured Containers" aligned with one or more data models (typically one). These products have been limited by proprietary data-access APIs and underlying data-model specificity when used in the "Open-world" model that is at the core of the World Wide Web. This confusion also carries over to the misconception that Web 2.0 and the Semantic/Data Web are mutually exclusive.

But things are changing fast, and the concept of multi-model DBMS products is beginning to crystallize. For our part, we have finally released the long-promised "OpenLink Data Spaces" application layer, developed using our Virtuoso Universal Server. We have structured unified storage containment exposed to the Data Web cloud via endpoints for querying or accessing data using a variety of mechanisms that include GData, OpenSearch, SPARQL, XQuery/XPath, SQL, etc.