2008/06/03

Semantic Web Meetup June 1, 2008

What would you do on a day of perfect weather in New York City? Attend an all-day code camp on Semantic Web programming in Brooklyn, of course! OK, I guess I would have preferred a rainy day if I had to be inside, but it was still worth it. I learned a lot about Semantic Web programming and, more importantly, realized that the technology is closer to being a reality than I had thought. This is a report on what I learned.

The event was organized by Marco Neumann and hosted by Breck Baldwin of Alias-i. After bagels that Marco had brought along, and introductions, we were given a brief run-down of some of the concepts and technologies. This was followed by quick descriptions of the projects we were to tackle at the meetup. After a rather late lunch we chose our projects and had a few hours to complete them. In theory, we were supposed to use the Extreme Programming paradigm, but that devolved a bit into group programming interspersed with discussion.

I don't really want to go into the projects in detail. I was interested in two of them: the Natural Language Processing project headed by Breck, our host, and a spatial reasoning project headed by Marco. The actual projects were not that important, though; the programming aspects were. I was at a disadvantage, as it turned out that Java is king when it comes to Semantic Web programming and I've been doing my programming in Ruby and Erlang for over a year. Semantic Web support for Ruby is not great, and it is practically nonexistent for Erlang.

In Java, the way to go was to use the Jena library. Jena started at HP, but in the meantime it has become a SourceForge project. It now offers support for RDF, RDFS, OWL and SPARQL. It also supports reading and writing RDF in RDF/XML, N3, N-Triples and, I believe, Turtle. There was some discussion of the strengths and weaknesses of these formats. The rough consensus was that N3 and N-Triples are more human-readable, but RDF/XML is more expressive, at least from a syntactic standpoint. It wasn't clear to me whether there was any semantic difference. In the NLP project, Jena was used to emit RDF, initially in N3 format, though that was quickly changed to RDF/XML. Once that was done for a subset of the data, a SPARQL query against that file was hacked together (again using Jena). All in all, it did not require much real code, though given that it was Java there was all sorts of fluff surrounding it.
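To give a feel for the "emit triples, then query them" workflow, here is a minimal sketch in plain Python. It is not the Jena code from the camp: the `http://example.org/` names, the sample data, and the helpers `term`, `to_ntriples` and `match` are all made up for illustration, and `match` only imitates a single SPARQL triple pattern.

```python
# Stdlib-only sketch; the namespace and data are hypothetical.
EX = "http://example.org/"

triples = [
    (EX + "alice", EX + "worksFor", EX + "acme"),
    (EX + "bob", EX + "worksFor", EX + "acme"),
    (EX + "alice", EX + "name", '"Alice"'),
]

def term(t):
    """Render a term in N-Triples syntax: literals stay quoted, IRIs get <>."""
    return t if t.startswith('"') else f"<{t}>"

def to_ntriples(ts):
    """One statement per line, each terminated by ' .'."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in ts)

def match(ts, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts like a SPARQL variable."""
    return [t for t in ts
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Roughly: SELECT ?who WHERE { ?who ex:worksFor ex:acme }
employees = [s for s, _, _ in match(triples, p=EX + "worksFor", o=EX + "acme")]

print(to_ntriples(triples))
print(employees)
```

A real library such as Jena adds parsing, the other serializations and a full SPARQL engine on top of this basic subject-predicate-object model.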

On a side note, one of the participants showed us some of his Groovy code, and I must say that Groovy might get me back into Java again. It's like a less wordy version of Java, or perhaps a Java that has been put on a diet by the Ruby camp. When Groovy is mentioned, I guess you have to mention Scala as well. Both seem to take Java beyond the confines of the language itself by leveraging the Java virtual machine and all the libraries that are available as JARs.

Apart from the programming, there were a few other things I picked up. In the past I had been using Protégé. However, apparently this is no longer the way to go. A tool called TopBraid Composer, from TopQuadrant, which is based on the Eclipse platform and Jena, has knocked Protégé off its throne. Apparently it is free for non-commercial use, though that is unclear from the website, which says that you need to purchase a license after 30 days.

One of the other projects looked at transforming a relational database into an RDF database using D2RQ. There is a paper at the W3C that describes this idea. From what I gather, this comes close to trying to derive semantics from database schemata, which is not something that can really be mechanized. There are also all sorts of performance issues that will have to be addressed if a production database were to be stored as an RDF database, but perhaps it is too early to discuss those issues, as we first need to understand why we need this at all. If it means that we can elevate the data in a database to the level of information, this might be worth it, though. Since there seem to be all sorts of expressivity issues when comparing traditional databases to RDF stores, perhaps the right thing would be to develop new applications based on RDF first and only then try to transform existing databases.
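The mechanical part of such a transformation is easy to sketch, which also shows where the semantics go missing. The following is a minimal Python illustration using an in-memory SQLite table. It is not D2RQ or its mapping language; the `employee` table, the `http://example.org/` namespace and the `table_to_triples` helper are assumptions made up for this example.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "Alice", "R&D"), (2, "Bob", None)])

def table_to_triples(conn, table, key):
    """Map each row to a subject IRI and each non-NULL column to one triple."""
    cur = conn.execute(f"SELECT * FROM {table} ORDER BY {key}")
    cols = [d[0] for d in cur.description]
    out = []
    for row in cur:
        rec = dict(zip(cols, row))
        subject = f"{BASE}{table}/{rec[key]}"
        for col, val in rec.items():
            if col == key or val is None:
                continue  # the key is already in the subject; NULLs yield no triple
            out.append((subject, f"{BASE}{table}#{col}", val))
    return out

triples = table_to_triples(conn, "employee", "id")
for t in triples:
    print(t)
```

Note that the result only mirrors the table structure: the mapping cannot know whether `dept` names a thing that deserves its own identity or is just a string, which is exactly the kind of meaning the schema does not carry.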

Another subject that came up was the difference between ABox and TBox reasoning. ABox reasoning is based on assertions about individuals (i.e., the rows of data, to use a database table analogy), whereas TBox reasoning is based on concepts (i.e., the schema of a database table).
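A toy Python example may make the distinction clearer. It is only a sketch under strong simplifying assumptions: the class and individual names are invented, each concept has at most one direct superclass, and the hierarchy is assumed to contain no cycles. Real description-logic reasoners do far more than this.

```python
# TBox: statements about concepts (subclass axioms). Names are made up.
tbox = {"Manager": "Employee", "Employee": "Person"}

# ABox: assertions about individuals.
abox = {"alice": "Manager", "bob": "Person"}

def superclasses(cls):
    """TBox reasoning: follow subclass axioms upward from a concept."""
    found = []
    while cls in tbox:
        cls = tbox[cls]
        found.append(cls)
    return found

def types_of(individual):
    """ABox reasoning: the asserted type plus everything the TBox entails."""
    asserted = abox[individual]
    return [asserted] + superclasses(asserted)

print(superclasses("Manager"))
print(types_of("alice"))
```

The first call answers a question about the schema alone; the second answers a question about a particular individual by combining its assertion with the schema.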

So, what does all this have to do with security? There are two aspects of this.

The security of the Semantic Web metadata

Using Semantic Web technology to secure our data

The first aspect is certainly not a trivial one. Metadata has already caused embarrassment to many people, including Tony Blair, because people don't realize that there is more data in a typical document than (literally) meets the eye. In computer forensics, this is what we live for. However, as more webpages get semantic data attached to them, more data may be transmitted than is shown to the user, and now it can be read automatically. Privacy advocates will be all over this problem, but corporations will have to pay attention, too.
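How little effort automated reading takes can be shown with Python's standard HTML parser. This is a minimal sketch: the sample page, the author name and the e-mail address are invented, and the `HiddenMetadata` class only looks at `<meta>` tags and RDFa-style `property`/`content` attributes, not at every place metadata can hide.

```python
from html.parser import HTMLParser

class HiddenMetadata(HTMLParser):
    """Collect <meta> name/content pairs and RDFa-style property attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and "name" in a:
            self.found.append((a["name"], a.get("content")))
        elif "property" in a and "content" in a:
            self.found.append((a["property"], a["content"]))

# A hypothetical page: the browser shows only "Quarterly report".
page = """<html><head><meta name="author" content="J. Doe"></head>
<body><p>Quarterly report</p>
<span property="dc:creator" content="jdoe@example.org"></span></body></html>"""

parser = HiddenMetadata()
parser.feed(page)
print(parser.found)
```

None of the collected values is visible in the rendered page, yet a dozen lines of code recover them all.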

However, what I am more interested in is the use of this metadata and the technologies of the Semantic Web to define and enforce security. At IBM, this is called Data-Centric Security (DCS) and, as far as I can tell, they are working on database security using taxonomies for classification. What the NLP project showed me is that, to some degree, we could also create a content-based security system at some point. Alias-i and OpenCalais might be the key.

What the code camp showed me is that the technology has reached the point where it is usable. While security is almost never a business case in itself, there will be other, more motivating reasons to use semantic metadata in corporations, and those will enable ideas such as DCS.