Archive

I am doing some work on a Top Secret Project to demonstrate the use of email data (in place of location data) on the SkyTwenty[1] platform.

I am making use of Aperture[2] to crawl an IMAP store, and then to allow sharing of contact and message information, so that queries can be run to discover

who knows whom, and in what domain

how many degrees of separation there are between contacts

whether selected contacts have any connection

how “well” they know each other, and so on.

Aperture makes use of the Nepomuk [3] message and desktop ontologies[4], and they’re fairly extensive, so a graphic helps to understand some of the ontological relationships.

The brilliant Protege4 [5] ontology design tool has plugins for GraphViz[6] and OntoGraf[7] that produce some fairly neat images for visualising ontologies, so here they are. It would be nice if there were a way to include object and data properties (by annotation perhaps – I will try later), but for now I have compiled a table of the class properties from a crawl and SPARQL query I ran against the repository I loaded the data into.

Contact class relationships

Note that OntoGraf needs the Sun JDK to work, so on Ubuntu, which has OpenJDK by default, you need to install the Sun JDK and agree to its license terms, then make sure that Protege is using the Sun Java at /usr/lib/jvm/java-6-sun-1.6.0.22 (or whatever version).

Nepomuk message and contact classes

These tables are incomplete, and represent the classes and properties from the crawl of my nearly empty inbox. The full set of classes and properties for the Nepomuk ontologies is available on another page on this blog.

Well – what before how – this is firstly about requirements, and then about treatment

Linked Open Data app

Create a semantic repository for a read-only dataset with a SPARQL endpoint for the linked open data web. Create a web application with Ajax and HTML (no server-side code) that makes use of this data and demonstrates linkage to other datasets. Integrate free-text search and query capability. Generate a data-driven UI from the ontology if possible.

So – a fairly tall order : in summary

define ontology

extract entities from digital text and transform to RDF defined by the ontology

create an RDF dataset and host in a repository.

provide a SPARQL endpoint

create a URI namespace and resolution capability; ensure persistence and decoupling where possible

provide content negotiation for human and machine addressing

create a UI with client side code only

create a text index for keyword search and possibly faceted search, and integrate into the UI alongside query driven interfaces

the webserver (Tomcat in fact) returns HTML and JavaScript. This is the “application”.

interactions on the webpage invoke JavaScript that either makes direct calls to Joseki (6) or makes use of permanent URIs (at purl.org) for subject instances from the ontology

purl.org redirects to dynamic DNS which resolves to the hosted application – on EC2, or during development to some other server. This means we have permanent URIs with flexible hosting locations, at the expense of some network round trips – YMMV.

DynDNS resolves to EC2, where a 303 filter intercepts the request and resolves it to a SPARQL (6) call for HTML, JSON or RDF. Pluggable logic for different URIs and/or Accept headers means this can be a SELECT, DESCRIBE, or CONSTRUCT.
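To make the content negotiation step concrete, here is a minimal sketch in JavaScript; the PURL below is made up for illustration, and `acceptHeaderFor` is my own hypothetical helper, not part of the application:

```javascript
// Illustrative only: this is not a real PURL.
var subjectUri = "http://purl.org/NET/example/location/Dublin";

// A browser wants a page; a linked-data client wants triples.
function acceptHeaderFor(clientKind) {
  return clientKind === "human" ? "text/html" : "application/rdf+xml";
}

// Dereferencing the same URI with a different Accept header lets the
// 303 filter choose the representation (and SELECT/DESCRIBE/CONSTRUCT), e.g.:
// fetch(subjectUri, { headers: { Accept: acceptHeaderFor("machine") } });
```

One URI thus serves both humans and machines, with the filter doing the routing.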

TDB provides a single semantic repository instance (Java, persistent, memory mapped) addressable by Joseki. For failover or horizontal scaling with multiple SPARQL endpoints, SDB should probably be used. For vertical scaling of TDB – get a bigger machine! Consider other repository options where physical partitioning, failover/resilience or concurrent webapp instance access is required (i.e. if you're building a webapp connected to a repository by code rather than a web page that makes use of a SPARQL endpoint).

The next article will provide a similar description of the architecture used for the Java web application with code that is directly connected to a repository, rather than one that talks to a SPARQL endpoint.

Just got my bill from Amazon for the 2 instances I'm running and find I've been charged for 728 hours on one of them – I thought this was supposed to be free for a year! Reading the small print again (ugh), it seems you are entitled to 750 free hours, but it doesn't explicitly say per instance. So – it seems it's per account: you can run as many instances as you like and use a total of 750 hours across them before you get charged. Then again, I suppose that's reasonable enough – Amazon wouldn't want to have every SME in the world running in the cloud for free for a year when it could be getting cash from them, would it? I must have been in a daze🙂

I’ve filled out the tools matrix with the 60 or so tools, libraries and frameworks I looked at for the two projects I created. Not all are used of course, and only a few are used in both. It includes comments and opinion on what I used and why, all referenced. Phew.

Community

This is a crucially important aspect in a new and evolving technology domain like the Semantic/Linked-Open-Data web – whether it's a commercial or FOSS component you are thinking about using.

For commercial tools, many offer free end-user or community licensing, limited by size or frequency of use, but if you plan to take your application to market you may well need to upgrade to a commercial license, and these are often very expensive – a Semantic Web or knowledge-based application built on what might be an essential technology component will surely be seen as a large value-add area by commercial companies. While I believe this is true, and commercial licenses can be justified, some technology offerings have small print that takes you straight to commercial licensing once you go to production. Others have smaller but knobbled versions, while some do have true SME-quality licensing. So, watch out – it can be a barrier to entry, and we do need to see mid-level, SME and Cloud offerings for the success of the pervasive or ubiquitous Semantic Linked Open Data web.

Unfortunately, it seems that many tools and libraries born from academic research or open-source endeavours, while available for use, are often not maintained. The author or team moves on, or the tool or library is published but languishes. You may find a tool that does what you need but that has no or poor documentation; no active maintenance; no visible community support forums or user-base; or compatibility problems with other tools, libraries or runtime environments. While that removes many from “production” usage or deployment, they can still be an important learning resource, and a means of comparing more current tools and libraries. I will itemise what I’ve come across below, but make sure you cast your professional eye over any offering – once you know what you are looking for, and what help in tools, libraries and environments you need: hopefully this article and the previous two have helped you with that.

What does it say it does and does-not do ?

How old is it ? What are its dependencies ?

How often is code being updated ?

Is it written in Java/PHP/Perl/.NET/Prolog/Lisp? Does it suit you – does it matter if it's written in Perl but you're going to write your app in Java? Is what you are going to use it for an independent stage in the production of your application, or are all stages intertwined? How much will you have to learn?

Who is the author ? What else has he/she/they done ? Are they involved in standards process, coding, design, implementation, community ? Blogs, conferences, presentations ?

Is there documentation ? A tutorial ? A reference ? Sample Code ? Production applications ?

Is there a means of contacting the authors, and other users ?

Are there bugs ? Are there many ? Are they being fixed ?

What are the answers to questions like – simple, helpful, understanding, presumptuous, brick-wall!? One-sentence answers, or contextualised for the audience?

What is the user group like – beginner, intermediate, advanced, helpful, broad or narrow base, international, academic, commercial,… ?

How quickly are questions answered ?

Does it seem like the tool/library is successfully used by the community, or is it too early to say, or unfit for purpose😦 ?

Under what licensing is the tool/library made available ?

Results

At the application level, this is how things pan out then.

Linked Open Data webapp

Semantic backed J2EE webapp

Metadata, RDF, OWL

Need to have entries for each location in the gazetteer. Need a list of those locations. Then need to relate one to another from what the text describes about road links, directions and bearing. Need metadata fields for each of those. Will also pull out administrative region type, population information, natural resources, and “House” information – seats of power/peerage/members of parliament. Will need RDF, RDFS and OWL for this, along with metadata from other ontologies. A further dataset was later added for townland names – this allows parish descriptions from Lewis to encompass townland divisions, with potential for crossover to more detailed reporting from the time (e.g. Parliamentary reports).

This application associates a member or person with a list of locations and datetimes. Locations are posted by a device on a platform by a useragent at a datetime, and also associated with an application or group. An application is an anonymous association of people with a webapp page or pages that makes use of locations posted by its members. A group is an association of people who know each other by name/ID/email address and who want to share locations. Application owners cannot see locations or members of other applications unless they own each of the applications. Application owners cannot see the location or datetime information with full accuracy. Group owners can see the location and datetime of their members with more accuracy, but not full accuracy. A further user type (“Partner”) can see all locations for all groups and applications but cannot see the names of groups, applications or people, and has less accuracy on location and datetime. Concept subject tags can be associated with profiles and locations. A query capability is exposed to allow data mining with inference by application owners and partners. Queries can be scheduled and actions performed on “success” or “fail”. Metadata for people, devices, platforms, datetime, location, tags, applications and groups is required. ACL control based on that metadata is performed, but at an application logic level, not at a data level.

SPARQL, Description Logic (DL), Ontologies

A SPARQL endpoint is provided on top of the extracted and loaded data, and is the primary “API” used by the application logic, which is expressed in JavaScript. Inference allows regions, for instance, to be queried at a very high level rather than by listing specific types. An ontology is created around the location, location type, direction, bearing, distance, admin type, population, natural resource and peerage. A separate ontology was created for peerage relationships and vocabulary, and imported into the top-level Lewis ontology. Some fields are used from other ontologies, notably wgs84 and muo. The UI allows navigation by ontology (jOwl plugin).
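As a sketch of what querying “at a very high level” means here, the snippet below builds a SPARQL call for the endpoint. Note the lewis: namespace, the class and property names, and the endpoint URL are placeholders of my own for illustration, not the actual ontology:

```javascript
// Placeholder namespace - not the real Lewis ontology URI.
var PREFIXES = "PREFIX lewis: <http://example.org/lewis#> ";

// With inference on, lewis:Region matches its subtypes (parish,
// barony, county...) without having to list them in the query.
function regionQuery(name) {
  return PREFIXES +
    "SELECT ?place WHERE { ?place a lewis:Region ; lewis:name \"" +
    name + "\" }";
}

// Build the GET URL for a Joseki-style endpoint returning JSON results.
function endpointUrl(endpoint, query) {
  return endpoint + "?output=json&query=" + encodeURIComponent(query);
}

// e.g. jQuery.getJSON(endpointUrl("http://host/sparql", regionQuery("Meath")), render);
```

The application logic then stays entirely in client-side JavaScript, with the endpoint as its only “API”.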

No SPARQL interface is directly exposed, but SPARQL queries form the basis of a data console, with restrictions applied to queries based on ID and application/group membership, as well as role. A custom ontology is created based around FOAF and SIOC, extended with RegisteredUser, Administrator, Partner, Application, Group, Device, Location and so on. The object model in Java mirrors this at the interface level to simulate multiple inheritance. Some cardinality restrictions, but mostly it makes use of domain/range specification from RDFS. The Umbel ontology is used for querying across tag relations. Inference has a huge impact on performance, and data partitioning would be required for query performance, but this also has implications for the library code used (named graph and query support, inference configuration) and for application architecture and scale-out planning.

Artificial intelligence, machine learning, linguistics

Machine learning and linguistic analysis were avoided in favour of syntactic a-priori extraction via gazetteer and word list, after sentences had been delimited within each delimited location entry or report. Aliases and synonyms were added later manually as a fixup for OCR errors. Quality is restricted by the text from PDF and structural artifacts (page headings, numbers), newlines, linefeeds and the lack of section headings within locations, location delimiters, and the linguistic vagaries of the author. Much, much more information is available within each entry, but for now the original text is also stored sentence by sentence, with each entry.

None required here as no extraction is performed. Tag words and terms are restricted to those available in Umbel (OpenCyc) and condensed to Umbel Subject Concept URIs, which SPARQL queries can then make use of for broader, narrower and associative queries. “Find everyone who likes sports who posted a location within 1 mile of here”.
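A query like the one quoted needs some proximity calculation at the SPARQL level. A crude sketch of my own (an illustration, not code from the application) is a bounding-box FILTER over wgs84 ?lat/?long values:

```javascript
// Crude flat-earth bounding box: 1 degree of latitude is roughly 69 miles.
// Good enough for "within 1 mile"; not for long distances or the poles.
function boundingBox(lat, lon, miles) {
  var dLat = miles / 69.0;
  var dLon = miles / (69.0 * Math.cos(lat * Math.PI / 180));
  return { minLat: lat - dLat, maxLat: lat + dLat,
           minLon: lon - dLon, maxLon: lon + dLon };
}

// Turn the box into a SPARQL FILTER clause over ?lat / ?long bindings.
function geoFilter(box) {
  return "FILTER(?lat > " + box.minLat + " && ?lat < " + box.maxLat +
         " && ?long > " + box.minLon + " && ?long < " + box.maxLon + ")";
}
```

The Umbel broader/narrower expansion of “sports” then composes with this filter in a single query.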

Linked Open Data

Location name lookups at extraction time link to the WGS84 grid location and ID in geonames, then to the dbPedia entry. The former is done using a traditional web service API, the latter by SPARQL query. Coverage of about 85% was achieved. dbPedia lookup based on name was attempted, but a higher error rate (no or ambiguous hits) and lower coverage were found (there are many infobox field variations for the same type of information); manual/“eyeball” QA was deemed sufficient for the expected usage and audience. A link to Dictionaries of Biography for houses is possible, using some form of owl:equivalence in the peerage ontology. UI-level links to Sindice and Uberblic were attempted but cross-domain scripting prohibited them. Locations are mapped to Google Maps – this could be migrated to OpenStreetMap (geonames basis). Visualisation is possible with Google visualisation or another web tool. A server-side proxy was created for this, and for further dbPedia integration – this provides an example link to “people born before 1842 at this location”.

Links to Umbel are performed at query time based on Umbel Subject Concepts applied by members to their profile and location. The Umbel vocabulary is currently queried directly at the Structured Dynamics endpoint, but could be loaded into the same data repository or a separate but more local repository. Large memory footprint. Federated query capability depends on the pluggable persistence technology used in the application. Applications built on or off domain are free to make use of owl:sameAs, for instance, to further link proprietary data with data stored in this system, but need to make that association within their own repository. Links can be made to profile identity (local or OpenID) if known or if the user expressly associates it (after OAuth verification), to wgs84 location (assuming some proximity calculation), and to application or group name (if known).

NLP and ML too advanced, too manual, too time consuming for a beginner, or a one-person prototyping “team”.

UI from RDF is a problematic area – it would be good to be able to generate a UI now there's an ontology, but it's no more advanced than any UI or form generation from XML or structured data.

Link generation code is largely manual, and could do with abstraction and ease of use (but this is a complex area!). Lots and lots to learn; active support and experience required. Cross-domain scripting is a problem for Linked Open Data.

Where open linked data isn't a primary requirement, most other requirements are met by traditional RDBMS-based technology and architecture. Open source can meet all component requirements for now (tech demo).

No JDBC-style access wrappers to semantic repositories. SPARQL is young and evolving.

Concurrency and multi-instance access considerations need to be made up front, early in development.

Some library- or repository-specific ORM-type tools; one JPA-based library (that I found) being developed. Lots and lots to learn; active support and experience required.

Tools

This is as comprehensive a list as I can come up with based on what I looked at and ended up using (or not). There are many, many more for sure, some in Java, others in various other languages. As some of the work types in the text->knowledge progression are often independent, being available in Java may not be important or even a consideration for you. So – look here, there and everywhere. See also Dave Beckett's [81] list for a great source of information about available tools and technologies.

Category

Tool

Comment

Linked Open Data webapp

Semantic backed J2EE webapp

Extraction

GATE [56]

IDE for configuration of NLP toolsets and training an ML engine. Active user group, but the tool UI seemed buggy (Q1 2010) and the documentation was obtuse – not geared towards those not “in the know”, IMO. Still, good, but would need a lot of effort and patience.

This is part of the transformation of source content to “knowledge”. Once entities are extracted they need to be used in RDF triples – how you go about this depends on your vocabulary and ontology, and it's up to you to use the RDF-Java object frameworks (below) that allow you to create a Subject and add a Property with an Object value. I haven't found a tool that would allow code to generate RDF from tagged entities, say, and it's likely not reasonable to think in this way – however convenient. How would such a tool know which relationships in an ontology were asserted in the entity set you gave it? The only way to go about this is to code those things yourself from the knowledge you already have about the information, or what you want to assert – or perhaps, if you are dealing with a database, to use its schema as the basis for a set of asserted statements in RDF, using D2R[87] or Triplify[88] say (do you need inference or not?). This approach was not used in either of these projects however. Perhaps owl2java [99] might have helped?

X

X

NLP, ML

GATE [56]

NLP engine from Sheffield University with support for ML – see also Extraction category above. Tried but not used.

X

X

OpenNLP[89, 90]

NLP library for tokenization, chunking, parsing and coreference. Simpler than GATE, less documentation, dormant? Tried but not used.

X

X

MinorThird [91]

Probably more ML than NLP, but with tokenization and extraction capability. Getting long in the tooth, and had some compatibility issues when tested.

X

X

UIMA[92, 94,95]

“Unstructured Information Management Architecture”. A full-blown framework for NLP and ML – “text mining”, a la GATE. Now in Apache (contributed by IBM). Good documentation, active support and development. Came close to using it for the Linked Data app but it came too late, and seemed large and time-consuming to learn (in my timescale). However, for a version 2 of the project I would use it over GATE and the custom code I built – the documentation for end users and developers assumes less than GATE's, there are various plugins available, and as it is modular (so is GATE, btw) you can create and add your own discrete code into the UIMA processing pipeline. Still need something to generate RDF based around your ontology and the extracted entities though…

SenseRelate[93]

NLP-Wordnet disambiguation toolkit. Couldn't see how I would integrate this – what purpose for my application, as I was using a-priori knowledge of the text for the Linked Open Data webapp, and the application business logic for the Semantic backed J2EE webapp. Also getting old…

X

X

LingPipe [106]

Very interesting toolkit for NLP, text and document processing, but ultimately with a commercial license

X

X

Mallet [107]

Like LingPipe but opensource, with sequence tagging and topic modelling.

X

X

Weka [108]

Another text mining tool, opensource, good docs, current and maintained, also works with GATE[56]

X

X

RDF-Java

OpenJena [59, 65]

Maturing framework for RDF with Java. The SPARQL implementation [61] follows standards closely and previews upcoming versions, as Andy Seaborne is on the SPARQL W3C group. Has repository capability as well. Used in both projects, but in the J2EE app it was just one of the possibilities for repository integration and RDF capability. Support forum high traffic – popular choice. Expected to provide working code examples when describing problems – discussion not entertained! HP [64] and now Apache [65] backing. Combined with JenaBean [71] and Empire-JPA[72] in the J2EE app. TTL/N3 config may seem alien to Java webapp developers.

Y

Y

KAON [97]

Another library – didn't seem as popular as Jena or Sesame. Documentation? Old, not actively maintained?

X

X

Sesame [62]

Modular RDF-to-Java library and repository framework. v3 expected soon (Q1 2011?). Good documentation and comment available on and off site, but you still need to experiment. Support forum can be slow and low traffic, but still a popular choice. Also home to Elmo [66] (an object-RDF extension) and Alibaba [67] – “the next generation of the Elmo codebase”. Combined with Empire [72] in the J2EE app. TTL/N3 config may seem alien to Java webapp developers.

X

Y

Object-RDF

JenaBean [71]

Appears now dormant, but a Jena object library with custom annotations to model and map Java classes to RDF classes. Support very slow. Low activity.

X

Y

Empire-JPA[72]

Aka Empire-RDF. From makers of Pellet [75]. JPA implementation for access to semantic repositories, with adapters for Sesame, Jena, Fourstore [74]. Newish, v0.7 about to be released. Support good, interested, helpful.

X

Y

RDF2GO [79]

Abstraction over repository and triplestores, with Jena, Sesame and OWLIM adapters. Decided in favour of Empire.

X

X

Repository and/or database

TDB [60]

Single-instance in-memory repository, with cmdline and Jena integration. No clustering or replication capability – must be local to the webapp. Configuration can be awkward, IMO, but easy enough to get started with. Inferencing and custom ontology support, both at configuration and code levels. Single writer, multiple readers. Used in both projects, but in the J2EE app it was just one of the possible repository technologies. Memory-mapped files in a 64-bit JVM.

Provides proxy HTTP capability in front of in-memory, file-based or database-backed repositories. Inference by configuration, performed on write – inferred statements are asserted and persisted. Allows multiple web app instances to make use of any of the repositories. Web-based “workbench”. Limited reasoning support compared to Jena. Support forum could be described as “slow”. OntoText [73] backing.

X

Y

BigData [68]

Sesame[62] + Zookeeper [77] + MapReduce [78] based clustered semantic repository for very large datasets. Too big for either apps at this stage, but Empire/Sesame usage provides growth path.

X

X

AllegroGraph [69]

Lisp based Semantic Repository with community and commercial licensing options for larger datasets. Http interface – could be used as alternative to Jena/Sesame/Empire. Biggish application and framework to read and learn – too big for now !

X

X

OWLIM [70]

Large-scale repository based around Sesame. Reasoning support better than Sesame's, and takes an alternative approach to implementation compared with Jena, say. Community and commercial licenses. Too big for now!

X

X

Fourstore [80]

Python semantic repository. Could be used behind Empire.[72]

X

X

Content negotiation

Pubby [76]

WAR file with configuration (N3) for URI mapping, 303 redirect and many other aspects of Linked Data access – for sparql endpoints that support DESCRIBE. Wrote filter that could sit on remote front end as alternative, but may get used later.

X

X

SPARQL access & Endpoint

Joseki [58]

SPARQL endpoint for use with Jena. Needs URL rewriting for PURLs and content negotiation code in front (custom code).

Y

X

Link generation

N/A

Use custom code from eg Jena or Sesame to create statements in model – once you’ve designed your URI scheme – and get the code to serialise/materialise the URI for you.

Y

Y

Ontologies

Protégé [82]

IDE to create RDFS and OWL ontologies, with reasoning and visualisation.

Y

Y

NeOn Toolkit [100]

Eclipse-based tool suite for semantic apps. Broad scope; Protege seemed a better fit – easier and quicker to get to grips with at the time. May be used again though.

X

X

KAON – OI-Modeler [98]

Old. Still available? Being maintained?

X

X

m2t4 [101]

Looked promising – a simple Eclipse plugin – but had compatibility and maintenance issues. Switched to Protege in the end however.

X

X

Inference & Reasoning

Jena [59]

Jena has built-in inference capability, but is considered slower than others [86]. In the J2EE app, with an RDBMS-backed repository, it was poor, IMO. With a TDB repo it's better, but still something you really need to test before you would deploy in production. This is probably true of all current repositories, but Jena seems to be at the slow end of the scale. However, it does deliver high standards compliance rather than the “degraded” compliance you may get with others.

Y

Y

Sesame[62]

Sesame has “reduced” reasoning support – it can do RDFS-based reasoning, and if custom ontologies are added to a repository type with inference support it will make use of them. If a “view” of a dataset is required that doesn't contain inferred statements, then a query parameter needs to be used so they are filtered out.

X

Y

Pellet [75]

“Independent” inference and reasoning. Not used except as plugin in Protege. Supposedly faster than some others.

X

X

OWLIM [73]

OWLIM comes with its own flavour of inference and reasoning: “support for the semantics of RDFS, OWL Horst and OWL 2 RL”.

X

X

UI Generation & Rendering

Talis Sparql js lib [57]

JavaScript interface for using datasources hosted on the Talis platform. Decided not to host data “offsite” at this stage.

Display vocabulary for RDF. Integrates at the Java level; it may have been possible to create a Spring [105] view module (I use Spring a lot) but that was another thing to learn, and I wanted to try to use plain old JavaScript and HTML as much as possible. Has promise, but documentation, support and maintenance may be an issue.

I was just about to start writing a CORS[1] servlet filter so that I can move one of my apps onto an independent EC2 host and give it more memory when I came across the CometD project [2] (DOJO event bus in Ajax, interesting in itself), which makes use of Jetty7’s CrossOriginFilter [3].

This seems to do all you need to allow your servlet to interact with cross-domain requests and build JavaScript RIAs that mash up and link data, semantic or not. The filter allows a list of allowable domains to be set, among other things, so that you can add it to any of your servers, map it to any of your servlets, and allow the different clients you trust and want to access your data to get to it.
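For reference, wiring the filter into a webapp looks roughly like this in web.xml. A hedged sketch: the filter class is as shipped in Jetty 7, but check your version's documentation for the exact init-param names (I assume allowedOrigins here), and the origin and url-pattern are placeholders:

```xml
<!-- Sketch only: verify param names against your Jetty version's docs. -->
<filter>
  <filter-name>cross-origin</filter-name>
  <filter-class>org.eclipse.jetty.servlets.CrossOriginFilter</filter-class>
  <init-param>
    <!-- the list of allowable domains mentioned above -->
    <param-name>allowedOrigins</param-name>
    <param-value>http://trusted.example.org</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>cross-origin</filter-name>
  <url-pattern>/sparql/*</url-pattern>
</filter-mapping>
```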

Saves me having to write it, and it looked like it was going to be painful to do fully and correctly, so it's a real relief to see it in Jetty7. All credit to the developers there.

Not sure about the licensing aspects (EPL1 + Apache2), but you can lift the source, remove the Eclipse logging dependency, and alter it as you see fit for your version of servlet engine. I'm trying this now with Tomcat6 and another Jetty6 instance, just as soon as I can get my apps separated and onto different domains (without the filter, a request from localhost to a remote domain using jQuery seems to get through just fine for some reason).

Separating services onto different hosts on EC2

I wanted to move one of my HTTP services (Joseki) onto a new host so as to be able to give the JVM more memory and avoid EC2 unceremoniously killing it when it asked for too much. The Tomcat service with the webapps would stay put. So I

created an AMI from my running instance

created a new instance from it

reinstalled ddclient because it didn't seem to work

created a new DynDNS account thinking it was tied to my account rather than the host, but that didn't make any difference

checked the ddclient cache file – it seemed to have the right IP addresses – i.e. one for the Tomcat services, and another for the Joseki host. However DynDNS showed that all hostnames were backed by the same IP address. I suspected the cache, so I did a ‘sudo ddclient -force’ and this seems to have updated DynDNS correctly

changed my JS files so that all SPARQL would be directed to the new Joseki host, and started testing

Getting JSONP where there is only JSON

Now I expected that things wouldn't work – Joseki is on a different host than the one the JS files have been loaded from – making an Ajax call there shouldn't work, should it – unless I was using JSONP – but Joseki doesn't do JSONP!

So, I checked my code, and it is making JSONP calls – I’m doing a jQuery $ajax call like this
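(The original snippet has been lost from this post; reconstructed here from the options described in the text, with remoteurl and callback as placeholder names.)

```javascript
// Reconstruction, not the exact original code.
var query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10";         // any SPARQL
var remoteurl = "http://joseki.example.org/sparql?output=json&query=" +
                encodeURIComponent(query);                   // note output=json

function callback(data) {
  // handle the SPARQL JSON results here
}

var options = {
  url: remoteurl,
  dataType: "jsonp",    // this is what makes jQuery use a <script> tag
  success: callback
};

// jQuery is only present in the page, not in a standalone runtime.
if (typeof jQuery !== "undefined") {
  jQuery.ajax(options);
}
```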

So, what's going on? Have I not understood this whole cross-domain thing [4], or is jQuery doing something strange?

Well, it turns out that I've forgotten a trick I used to get this to work before: what's actually going on is this

jQuery, rather than using XmlHttpRequest, is making a DOM call to insert a <script> tag (which can make cross-domain calls), because I've specified dataType:”jsonp”.

The URL for the script (specified in url:remoteurl) uses my new hostname – but it happens to include “&output=json” and the necessary SPARQL of course. [5]

The script tag gets processed, making the GET call to the remote URL; the SPARQL runs on the remote/cross-domain server, and the JSON response is processed by the callback specified in the success:callback option.

If you change it to a JSON call – dataType:"json" – the request is made (and visible in remote access logs) but the response is aborted. I'm not sure if it's the browser doing this, or jQuery.

So, hey presto, JSONP where the server does not explicitly support it. However, it won't work with POST of course (script tags), so for SPARQL update or insert it will be an issue. CORS really should be used there for that.

Back to the real topic though, CORS. I installed the code [3], modified with some more debug logging so I could see what was going on. Having changed the client JavaScript to make a jQuery.$ajax call with dataType:json rather than jsonp, I expected this to work straight out – after all, Gecko on Firefox 3.6 does the hard work with the headers [6] for “simple” requests (no credentials, not a POST, no custom [non-HTTP1.1] headers), so jQuery using XmlHttpRequest should be fine – but it was not. This turned out to be a false negative though, as my broadband provider is being rubbish today, and when I switched to my rubbish 3G dongle it timed out so often that it looked like failure.

Now the strange thing is that when I remove the CORSFilter servlet mapping from web.xml, jQuery still sends the dataType:json request, Joseki receives it, but the response is never processed by the callback. A Mozilla Hacks post [7] says this:

In Firefox 3.5 and Safari 4, a cross-site XMLHttpRequest will not successfully obtain the resource if the server doesn’t provide the appropriate CORS headers (notably the Access-Control-Allow-Origin header) back with the resource, although the request will go through. And in older browsers, an attempt to make a cross-site XMLHttpRequest will simply fail (a request won’t be sent at all).

Reverting to having the filter in place seems to fix things, but I'm still not 100% convinced that everything is correct. I suppose “will not successfully obtain the resource” is vague enough to be an acceptable explanation for when I don't have CORSFilter in place, but I would have thought that sending the request and putting load on the network and target server wasn't something that Mozilla really wants to happen.

But when in place, the CORSFilter is getting the Origin header and setting the AC-AO response header, so it's behaving. I expect it will be different in a range of other browsers (i.e. Internet Exploder). So for now – it's not broken, and I'm not going to fix it any more. YMMV🙂

And by the way, a t1.micro on EC2 isn't really up to it, even for a smallish dataset of 340k triples. It does it, just, but you get what you pay for here.

Available tools and technologies

(this section is unfinished, but taking a while to put together, so – more to come)

When you first start trying to find out what the Semantic Web is in technical terms, and then what the Linked Open Data web is, you soon find that you have a lot of reading to do – because you have lots of questions. That is not surprising, since this is a new field (even though Jena, for instance, has been going 10 years) for the average Java web developer who is used to RDBMS, SOA, MVC, HTML, XML and so on. On the face of it, RDF is just XML right ? A semantic repository is some kind of storage do-dah and there’s bound to be an API for it, right ? Should be an easy thing to pick up, right ? But you need answers to these kinds of questions before you can start describing what you want to do as technical requirements, understanding what the various tools and technologies can do and which ones are suitable and appropriate, and then selecting some for your application.

One pathway is to dive in, get dirty and see what comes out the other side. But that to me is just a little unstructured and open-ended, so I wanted to tackle what seemed to be fairly real scenarios (see Part 2 of this series) – a 2-tier web app built around a SPARQL endpoint with links to other datasets and a more corporate style web application that used a semantic repository instead of an RDBMS, delivering a high level API and a semantic “console”.

In general then it seems you need to cover in your reading the following areas

Metadata – this is at the heart of the Semantic Web and Linked Open Data web. What is it !! Is it just stuff about Things ? Can I just have a table of metadata associated with my “subjects” ? Do I need a special kind of database ? Do I need structures of metadata – are there different types of things or “buckets” I need to describe things as ? How does this all relate to how I model my things in my application – is it different than Object Relational Modelling ? Is there a specific way that I should write my metadata ?

RDF, RDFS and OWL – what is it, why is it used, how is it different than just XML or RSS; what is a namespace, what can you model with it, what tools there are and so on

SPARQL – what is it, how to write it, what makes it different from SQL; what can it NOT do for you, are there different flavours, where does it fit in a web architecture compared to where a SQL engine might sit ?

Description Logic – you’ll come across this and wonder, or worse, give up – it can seem very arcane very quickly – but do you need to know all of it, or any of it ? What’s a graph, a node, a blank node dammit, a triple, a statement ?

Ontologies – isn’t this just a taxonomy ? Or a thesaurus ? Why do I need one, how does metadata fit into it ? Should I use RDFS or OWL or something else ? Is it XML ?

Artificial Intelligence, Machine Learning, Linguistics – what !? you mean this is robotics and grammar ? where does it fit in – what’s going on, do I need a degree in cybernetics to make use of the semantic web ? Am I really creating a knowledge base here and not a data repository ? Or is it an information repository ?

Linked Open Data – what does it actually mean – it seems simple enough ? Do I have to have a SPARQL endpoint, or can I just embed some metadata in my documents and wait for them to be crawled ? What do I want my application to be able to do in the context of Linked Open Data ? Do I need my own URIs ? How do I make or “coin” them ? How does that fit in with my ontology ? How do I host my data set so someone else can use it ? Surely there is best practice and examples for all this ?

Support and community – this seems very academic, and very big – does my problem fit into this ? Why can I not just use traditional technologies that I know and love ? Where are all the “users” and applications if this is so cool and useful and groundbreaking ? Who can help me get comfortable, is anyone doing any work in this field ? Am I doing the right thing ? Help !

I’m going to describe these things before listing the tools I came across and ended up selecting for my applications. So – this is going to be a long post, but you can scan and skip the things you know already. Hopefully, you can get started more quickly than I did.

First the End

So you read and you read, and come across tools and libraries and academic reports and W3C documents, and you see it has been going on for some time; that some things are available and current, others available and dormant. Most are open source thankfully, and you can get your hands on them easily – but where to start ? What to try first – what is the core issue or risk to take on first ? Is there enough to decide whether you should continue ?

What is my manager going to say to me when I start yapping on about all these unfamiliar things –

why do I need it ?

what problem is it solving ?

how will it make or save us money ?

our information is our information – why would I want to make it public ?

Those are tough questions when you’re starting from scratch, and no one else seems to be using the technologies you think are cool and useful – who is going to believe you if you talk about sea-change, or “Web3.0”, or paradigm shift, or an internet for “machines” ? I believe you need to demonstrate-by-doing, and to get to the bottom of these questions so you know the answers before someone asks them of you. And you had better end up believing what you’re saying, so that you are convincing and confident. It’s risky….*

So – off I go – here is what I found, in simple, probably technically incorrect terms – but you’ll get the idea and can work out the details later (if you even need to)

*see my answers to these questions at the end of this section

Metadata, RDF/S, OWL, Ontologies

Coarsely, RDF allows you to write linked lists. URIs allow you to create unique identifiers for anything. If you use the same URI twice, you’re saying that exact Thing is the same in both places. You create the URIs yourself, or, when you want to identify a thing (“john smith”) or a property of a thing (eg “loginId”) that already exists, you reuse the URI that you or someone else created. You may well have a URI for a concept or idea, and others for its physical forms – eg a URI for a person in your organisation, another for the webpage that shows his photo and telephone number, another for his HR system details.

Imagine 3 columns in a spreadsheet called Subject, Predicate and Object. Imagine a statement like “John Smith is a user with loginId ‘john123’ and he is in the sales Department“. This ends up like

Subject | Predicate  | Object
S-ID1   | type       | User
S-ID1   | name       | “John”
S-ID1   | familyName | “Smith”
S-ID1   | loginId    | “john123”
S-ID2   | type       | department
S-ID2   | name       | “sales”
S-ID2   | member     | ID1

That is it, simply – RDF allows you to say that a Thing with an ID we call S-ID1 has properties, and that those properties are either other Things (S-ID2/member/ID1) or literal things like the string “john123”.

So you can build a “graph” – a connected list of Things (nodes) where each Thing can be connected to another Thing. And once you look at one of those Things, you might find that it has other properties that link to different Things that you don’t know about, or that aren’t related to what you are looking at – S-ID2 may have another “triple” or “statement” that links it with ID-99, say (another user), or ID-10039 (a car lot space, say). So you can wire up these graphs to represent whatever you want in terms of properties and values (Objects). A Subject, Predicate or Object can be a reference to another Thing.
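The three-column table above really is all there is to the data model, and you can sketch it in a few lines of plain code. This is an illustration only – the names follow the table, match() is a hypothetical helper, and null plays the role of a SPARQL-style variable (a wildcard):

```javascript
// A triple store is conceptually just a list of 3-element statements.
const triples = [
  ["S-ID1", "type",       "User"],
  ["S-ID1", "name",       "John"],
  ["S-ID1", "familyName", "Smith"],
  ["S-ID1", "loginId",    "john123"],
  ["S-ID2", "type",       "department"],
  ["S-ID2", "name",       "sales"],
  ["S-ID2", "member",     "S-ID1"], // the table's ID1: the Object is another Thing
];

// Pattern match over the store; null in any position means "anything".
function match(store, s, p, o) {
  return store.filter(([ts, tp, to]) =>
    (s === null || ts === s) &&
    (p === null || tp === p) &&
    (o === null || to === o));
}
```

So match(triples, "S-ID1", null, null) gives you everything known about John, and match(triples, null, "type", null) gives you every typed Thing – exactly the shape of query SPARQL formalises.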

Metadata are those properties you use to describe Things. And in the case of RDF, each metadatum can be a Thing with its own properties (follow the property to its own definition), or a concrete fact – eg a string, a number. Why is metadata important ? Because it helps you contextualise and find things, and to differentiate one thing from another even if they share the same name. Some say “Content is King” but I say Metadata is !

RDF has some predefined properties like “property” and “type”. It’s pretty simple and you’ll pick it up easily [1]. Now, RDFS extends RDF to add some more predefined properties that allow you to create a “schema” that describes your data or information – “class”, “domain”, “range”, “label”, “comment”. So if you start to formalise the relationships described above – a user has a name, familyName, loginId and so on – before you know it, you’ve got an ontology on your hands. That was easy, right ? No cyborgs, logic bombs, T-Box or A-Box in sight (see the next section). And you can see the difference between an ontology and a taxonomy – the latter is a way of classifying or categorising things, but an ontology does that and also describes and relates them. So keep going, this isn’t hard ! (Hindsight is great too)

Next you might look at OWL because you need more expressiveness and control in your information model, and you find out that it has different flavours – DL, LITE, FULL[2]. What do you do now ? Well, happily, you don’t have to think about it too much, because it turns out that you can mix and match things in your ontology – use RDFS and OWL, and you can even use things from other ontologies. Mash it up – you don’t have to define these properties from scratch yourself. So go ahead and do it, and if you find that you end up in OWL-FULL instead of DL then you can investigate and see why. The point is, start, dig in and do what you need to do. You can revise and evolve at this stage.

A metadata specification called “Dublin Core”[3] comes up a lot – this is a useful vocabulary for describing things like “title”, “creator”, “relation”, “publisher”. Another, the XSD schema, is useful for defining things like number types – integer, long and float – and is used as part of SPARQL for describing literals. You’ll also find that there are properties of things so common that you’d think someone would have an ontology or a property defined for them already. I had a hard time looking for a definition of old English miles, but it turns out luckily that there was one[4,5]. On the other hand, there wasn’t one for a compass bearing of “North” – or at least not one that I could find, so I invented one, because it seemed important to me. Not all things in your dataset will need metadata – and in fact you might find that you and someone working on another project have completely different views on what’s important in a dataset – you might be interested in describing financial matters, and someone else might be more interested in the location information. If you think about it long enough a question might come to mind – should we still maintain our data somewhere in canonical, raw or system-of-record form, and have multiple views of it stored elsewhere ? (I don’t have an answer for that one yet.)

Once you start, you soon see that the point of reusing properties from other ontologies is that you create connections between datasets and information just by using them – you may have a finance department that uses “creator”, so you can now link its records with the same person in the HR system – because the value used for “creator” is in fact a unique URI (simply, an ID that looks like a URL), eg http://myCompany.com/people/john123. If you have another John in the company, he’ll have a different ID, eg http://myCompany.com/people/john911, so you can be sure that the link is correct and precise – no ambiguity – John123 will not get the payslip meant for John911. There are also other ways of connecting information – you could use owl:sameAs for instance – this makes a connection between two Things when a common vocabulary or ID is not available, or when you want to make a connection where one didn’t exist before. But think about these connections before you commit them to statements – the correctness, provenance and trust around each new connection has to be justifiable – you want your information and assertions about it to have integrity, right ?
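The effect of an owl:sameAs connection can be sketched mechanically: once two URIs are declared the same, statements about either apply to both. A minimal sketch, assuming made-up hr:/fin: URIs and a hypothetical smoosh() helper (real stores do this with inference rather than rewriting, but the observable result is similar):

```javascript
// Canonicalise subjects and objects through any owl:sameAs statements,
// so facts recorded against either URI end up attached to one Thing.
function smoosh(triples) {
  const alias = {};
  for (const [s, p, o] of triples) {
    if (p === "owl:sameAs") alias[o] = s; // treat the subject URI as canonical
  }
  const canon = (x) => alias[x] || x;
  return triples
    .filter(([, p]) => p !== "owl:sameAs")       // drop the linking statements
    .map(([s, p, o]) => [canon(s), p, canon(o)]); // rewrite both ends
}
```

So a “dc:creator” fact asserted by finance against fin:jsmith becomes queryable against hr:john123 once the sameAs link is in place – which is exactly why such links need to be justifiable before you assert them.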

I needed RDF and RDFS at least – these would be the means by which I would express the definition and parameters of my concepts, and then also the statements to represent actual embodiments of those concepts – instances. It started that way, but I knew I might need OWL if I wanted more control over the structure and integrity of my information – eg to say that John123 could only be a member of one department and one department only, or that he had a role of “salesman” but couldn’t also have a role of “paymaster”. So, if you need this kind of thing, read more about it [6,7]. If you don’t yet, just keep going, and you can still come back to it later. (It turns out I did, in fact.)

The table above now looks like this when you use URIs – it’s the same information, just written down in a way that ensures things are unique, and connectable.

The Namespaces at the top of the table mean that you can use shorthand in the three columns and don’t have to repeat the longer part of the URI each time. Makes things easier to read and take in too, especially if you’re a simple human. For predicates, I’ve changed name to rdfs:label and familyName to foaf:family_name[8]. In the Object column only the myCo namespace is used – in the first case it points to a Subject with a type defined elsewhere (in the ontology in fact). I say the ontology is defined elsewhere, but that doesn’t have to be physically elsewhere – it’s not uncommon to have a file on disk that contains the RDF to define the ontology but also contains the instances that make up the vocabulary or the information base.

So – why is this better than a database schema ? The simple broad answers are the best ones I think :

You have only 3 columns*

You can put anything you like in each column (almost – literals can’t be predicates), and it’s possible to describe a binary property User->name->John as well as n-ary relationships[9] like User->hasVehicle->car->withTransmission->automatic

You can define what you know about the things in those columns and use it to create a world view of things (a set of “schema rules”, an ontology).

You can (and should) use common properties – define a property called “address1” and use it in Finance and HR so you know you’re talking about the same property. But if you don’t, you can fix it later with some form of equivalence statement.

If there are properties on instances that aren’t in your ontology, they don’t break anything, but they might give you a surprise – this is the “open world assumption” – just because something is not defined does not mean it cannot exist – a key difference from database schema modelling.

You use the same language to define different ontologies, rather than say MySQL DDL for one dataset and Oracle DDL for another

There is one language spec for querying any repository – SPARQL**. You use the same one for yours and any others you can find – and over HTTP – no firewall dodging, no operations team objections; predictable, quick and easy to access

You do not have to keep creating new table designs for new information types

You can easily add information types that were not there before while preserving older data or facts

You can augment existing data with new information that allows you to refine it or expand it – eg provide aliases that allow you to get around OCR errors in extracted text, alternative language expressions

Any others ?

*Implementations may add one or two more, or break things up into partitioned tables for contextual or performance reasons
**there are different extensions in different implementations

SPARQL, Description Logic (DL), Ontologies

SPARQL [10] aims to allow those familiar with querying relational data to query graph data without too much introduction. It’s not too distant, but it needs a little getting used to. “select * from users” looks like “select * where {?s rdf:type myCo:User}”, and you get back bindings for the variables in your pattern rather than every column from a table. Of course this is because you have effectively 3 “columns” in the graph data and they’re populated with a bunch of different things. So you need to dig deeper[11] into tutorials and what others have written.[12,13]

One of the key things about SPARQL is that you can use it to find out what is in the graph data without having any idea beforehand.[14] You can ask for the types of data available, then ask for the properties of those types, then DESCRIBE or select a range of types for identified subjects. So, it’s possible to discover what’s available to suit your needs, or for anyone else to do the same with your data.
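That discovery step – ask what types exist before you know anything else – can be sketched over a plain triple list. discoverTypes() is a hypothetical helper; the SPARQL equivalent would be roughly SELECT DISTINCT ?type WHERE { ?s rdf:type ?type }:

```javascript
// Discover what types of Subject a dataset contains, with no prior
// knowledge of its schema - the first question to ask a strange endpoint.
function discoverTypes(triples) {
  const types = new Set();
  for (const [s, p, o] of triples) {
    if (p === "rdf:type") types.add(o); // collect distinct Objects of rdf:type
  }
  return [...types].sort();
}
```

From the types returned you would then ask for the properties used on each type, and only then start selecting actual subjects – the same drill-down you would do against someone else’s data.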

Another useful thing is the ability (for some SPARQL engines – Jena’s ARQ [15] comes to mind) to federate queries, either by using a “graph” (effectively just a named set of triples) whose URI points to a remote dataset, or, in Jena’s case, by using the SERVICE keyword. So you can have separate and independent datasets and query across them easily. Sesame[16] allows a similar kind of thing with Federated Sail, but you predefine the federation you want rather than specifying it in-situ. Beware of runtime network calls in the Jena case, and consider hosting your independent data in a single store but under different graphs to avoid them. You’ll need more memory in one instance, but you should get better performance. And watch out for JVM memory limits and type size increases if you (probably) move to a 64bit JVM.[17,18]

While learning the syntax of SPARQL isn’t a huge matter, understanding that you’re dealing with a graph of data, and having to navigate or understand that graph beforehand, can be a challenge – especially if it’s not your own data you want to federate or link with. Having ontologies and sample data (from your initial SPARQL queries) helps a lot, but it can be like trying to understand several foreign database schemas at once, visualising a chain rather than a hierarchy, taking on multiple-inheritance and perhaps cardinality rules, domain and range restrictions and maybe other advanced ontology capabilities.

SPARQL engines, or the libraries they use, that allow inferencing provide a unique selling point for the Semantic and Linked web of data. Operations you cannot easily do in SQL become possible. Derived statements – information that is not actually “asserted” in the physical data you loaded into your repository – start to appear. You might for instance ask for all Subjects or things of a certain type. If the ontology of the information set says that one type is a subclass of another – say you ask for “cars” – then you’ll get back statements that say your results are cars, but you’ll also get statements saying they are also “vehicles”. If you did this with an information set that you were not familiar with, say a natural history data set, then when you ask for “kangaroos” you are also told that each is an animal, a kangaroo, and a marsupial. The animal statement might be easy to understand, but perhaps you expected that it was a mammal. And you might not have expressly said that a kangaroo was one or the other.

Once you get back results from a SPARQL query you can start to explore – you start looking for kangaroos, then you follow the marsupial link, and you end up with the opossum, then you see it’s in the USA and not Australia, and you compare the climates of the two continents. Alternatively of course, you may have started at the top end – asked for marsupials, got back all the kangaroos and koalas etc, then drilled down into living environment and so on. Another scenario deals with disambiguation – you ask for statements about eagles and the system might return things named eagles, but you’ll be able to see that one is a band, one is a US football team, and another a bird of prey. Then you might follow links up or down the classifications in the ontology.

Some engines have features or utilities that allow you to “forward-chain”[19] statements before loading – using an ontology, or a reasoning engine based on a language specification, derived statements about things are asserted and materialised for you before you load them into your repository. This is not only to do with class hierarchy: where a hierarchy isn’t explicit, inference might still create a statement – “if a Thing has a title, pages and a hardback cover, then it is …. a book”. This saves the effort at runtime and should mean that you get a faster response to your query. Forward chaining (and backward-chaining[20]) are common reasoning methods used with inference rules in Artificial Intelligence and Logic systems.
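The simplest case of forward-chaining – materialising rdf:type statements implied by a subclass hierarchy – can be sketched in a few lines. The class names here are illustrative (echoing the kangaroo example above), and forwardChain() is a hypothetical helper, not any engine’s API:

```javascript
// A toy rdfs:subClassOf hierarchy: child class -> parent class.
const subClassOf = {
  Kangaroo:  "Marsupial",
  Marsupial: "Animal",
};

// Materialise the derived rdf:type statements before loading, so that a
// query for "Animal" also finds things only asserted to be Kangaroos.
function forwardChain(triples) {
  const derived = [];
  for (const [s, p, o] of triples) {
    if (p !== "rdf:type") continue;
    let cls = subClassOf[o];
    while (cls) {                       // walk up the hierarchy, asserting each step
      derived.push([s, "rdf:type", cls]);
      cls = subClassOf[cls];
    }
  }
  return triples.concat(derived);
}
```

Feeding in a single statement that skippy is a Kangaroo yields two extra materialised statements – skippy is a Marsupial, and an Animal – which is exactly the “statements you never asserted” effect described above, paid for once at load time instead of on every query.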

It turns out Description Logic or “DL” [21] is what we are concerned with here – a formal way of expressing or representing knowledge – things have properties that have certain values. OWL is a DL representation, for instance. And like object-oriented programming languages – Java, say – there are classes (ontology, T-Box statements) and instances (A-Box, instances, vocabularies). There are also notable differences from Java (eg multiple inheritance or typing) and a higher level of formalism, and these can make mapping between your programming language and your ontology or modelling difficult or problematic. For some languages, Prolog or Lisp say, this mapping may not be such a problem, and indeed you’ll find many semantic tools and technologies built using them.

Despite the fact that DL and AI can get quite heady once you start delving into these things, it is easy to start with the understanding that they allow you to describe or model your information expressively and formally, without being bound to an implementation detail like the programming language you’ll use – and that once you do implement and make use of your formal knowledge representation – your ontology – hidden information and relationships may well become clear where they were not before. Doing this with a network of information sets means that the scope of discovery and fact is broadened – for your business, this may well be the difference between a sale or not, or provide a competitive edge in a crowded market.

Artificial intelligence, machine learning, linguistics

When you come across Description Logic and the Semantic Web in the context of identifying “things” or entities in documents – for example the name of a company or person, a pronoun or a verb – you’ll soon be taken back to memories of school – grammar, clauses, definite articles and so on. And you’ll grow to love it I’m sure, just like you used to🙂
It’s a necessary evil, and it’s at the heart of one side of the semantic web – information extraction (“IE”) as a part of information retrieval (“IR”)[22,23]. Here, we’re interested in the content of documents, tables, databases, pages, excel spreadsheets, pdfs, audio and video files, maps, etc etc. And because these “documents” are written largely for human consumption, in order to get at the content using “a stupid machine”, we have to be able to tell the stupid machine what to do and what to look for – it does not “know” about language characteristics – what the difference is between a noun and a verb – let alone how to recognise one in a stream of characters, with variations in position, capitalisation, context and so on. And what if you then want to say that a particular noun, used a particular way, is a word about “politics” or “sport”; that it’s English rather than German; that it refers to another word two words previous, and that it’s qualified by an adjective immediately after it ? This is where Natural Language Processing (NLP) comes in.

You may be familiar with tokenising a string in a high level programming language, then writing a loop to look at each word and do something with it. NLP will do this kind of thing but apply more sophisticated abstractions, actions and processing to the tokens it finds, even having a rule base or dictionary of tokens to look for, or allowing a user to dynamically define what that dictionary or gazetteer is. Automating this is where Machine Learning (ML) comes in. Combined, and making use of mathematical modelling and statistical analysis, they look at sequences of words and make a “best guess” at what each word is, and tell you how good that guess is.

You will probably need to “train” the machine learning algorithm or system with sample documents – manually identify and position the tokens you are interested in, tag them with categories (perhaps these categories are themselves from a structured vocabulary you have created, found, or bought) and then run the “trained” extractor over your corpus of documents. With luck, or actually, with a lot of training (maybe 20%-30% of the corpus size), you’ll get some output that says “rugby” is a “sports” term and “All Blacks” is a “rugby team”. Now you have your robot, your artificial intelligence.

But the game is not over yet – for the Semantic and Linked web, you now have to do something with that output – organise and transform it into RDF – a related set of extracted entities – relate one entity to another into a statement “all blacks”-“type”-“rugby team”, and then collect your statements into a set of facts that mean something to you, or to the user for whom you are creating your application. This may be defined or contextualised by some structure in your source document, but it may not be – you may have to provide an organising structure. At some point you need to define a start – a Subject you are going to describe – and one of the Subjects you come up with will be the very beginning or root Thing of your new information base. You may also consider using an online service like OpenCalais[24], but you’re then limited to the range of entities and concepts that those services know about – in OpenCalais’ case it’s largely business and news topics – wide ranging for sure, but if you want to extract information about rugby teams and matches it may not be too successful. (There are others available and more becoming available.) In my experience, most often and for now, you’ll have to start from scratch, or as near as damn-it. If you’re lucky there may be a set or list of terms for the concept you are interested in, but it’s a bit like writing software applications for business – no two are the same, even if they have the same pattern. Unlike software applications though, this will change over time – assuming that people will publish their ontologies, taxonomies, term sets, gazetteers and thesauri. Let’s hope they do, but get ready to pay for them as well – they’re valuable stuff.

So

Design and Define your concepts

Define what you are interested in

Define what things represent what you are interested in

Define how those things are expressed – the terms, relations, ranges and so on – you may need to build up a gazetteer or thesaurus

Understand how and where those things are used – the context, frequency, position

Extract the concepts and metadata

Now tell the “machine” about it – in fact, teach it what you know and what you are interested in – show it by example, or create a set of rules and relations that it understands

Teach it some more – the more you tell it, the more variety, the more examples and repetition you can throw at it, the better the quality of results you’ll get

Get your output – do you need to organise the output, do you have multiple files and locations where things are stored, do you need to feed the results from the first pass into your next one ?

Fashion some RDF

Create URIs for your output – perhaps the entities extracted are tagged with categories (that you provided to the trained system) or with your vocabulary, or perhaps not – but now you need to get from this output to URIs, Subjects, Properties, Objects – to match your ontology or your concept domain. Relate and collect them into “graphs” of information, into RDF.

Find a repository technology you like – if you don’t know, if it’s your first time, pick one – suck it and see – if you have RDF on disk you might be able to use that directly (maybe slower than an online optimised repository). Initialise it, get familiar with it, consider size and performance implications. Do you need backup ?

Load your RDF into the repository. (Or perhaps you want to modify some existing html docs you have with the metadata you’ve extracted – RDFa probably)

Test that what you’ve loaded matches what you had on disk – you need to be able to query it – how do you do that ? Is there a commandline tool – does it do SPARQL ? What about if you want to use it on the web – that is the whole point, isn’t it ? Is there a sparql endpoint – do you need to set up Tomcat or Jetty, say, to talk to your repository ?

Link it

And what about those URIs – you have URIs for your concept instances (“All Blacks”), URIs for their properties (“rdf:type”), and URIs for the Object of those properties (“myOnt:Team”). What happens now – what do you do with them ? If they’re for the web, if they’re URIs, shouldn’t I be able to click on them ? (Now we’re talking Linked Data – see next section.)

Link your RDF with other datasets (see next section) if you want to be found, to participate, and to add value by association, affiliation, connection – the network effect – the knowledge and the value (make some money, save some money)

Build your application

Now create your application around your information set. You used to have data, now you have information – your application turns that into knowledge and intelligence, and perhaps profit.

There are a few tools to help you in all this (see below) but you’ll find that they don’t do everything you need, and they won’t generate RDF for you without some help – so roll your sleeves up. Or – don’t – I decided against it, having looked at the amount of work involved in learning all about NLP & ML, in the arcane science (it’s new to me), in the amount of time needed to set up training, and at the quality of the output. I decided on the KISS principle – “Keep It Simple, Stupid” – so instead I opted to write something myself, based on grep !

I still had to do 1-5 above, but now I had to write my own code to do the extraction and “RDFication”. It also meant I got my hands dirty and learned hard lessons by doing, rather than reading or trusting someone else’s code that I didn’t understand. And the quality of the output and the meaning of it was still all in my control. It is not real Machine Learning – it’s still in the tokenisation world I suppose – but I got what I wanted and in the process made something I can use again. It also gave me practical and valuable experience so that I can revisit the experts’ tools with a better perspective – not so daunting, more comfortable and confident, something to compare to, patterns to witness and create, less to learn and take on, and, importantly, a much better chance of actual, deliverable success.
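The spirit of that grep-based approach is easy to sketch: scan text for terms in a gazetteer you built by hand, and emit subject-predicate-object statements for each hit. The terms, URIs and the extract() helper here are all made up for illustration – this is the shape of the thing, not my actual code:

```javascript
// A hand-built gazetteer: surface term -> the type it represents.
const gazetteer = {
  "All Blacks": "myOnt:RugbyTeam",
  "rugby":      "myOnt:Sport",
};

// Grep-style extraction: for each known term found in the text, mint a URI
// and emit a pair of statements (its type, and its human-readable label).
function extract(text) {
  const statements = [];
  for (const [term, type] of Object.entries(gazetteer)) {
    if (text.includes(term)) {
      const uri = "myData:" + term.replace(/\s+/g, "_");
      statements.push([uri, "rdf:type", type]);
      statements.push([uri, "rdfs:label", term]);
    }
  }
  return statements;
}
```

It is naive – no context, no disambiguation, no confidence scores – but the output is already RDF-shaped, ready to be serialised and loaded into a repository, which was the whole point.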

It was quite a decision to take – it felt dirty somehow – all that knowledge and science bound up in those tools, it was a shame not to use it – but I wanted to learn and to fail in some ways, I didn’t want to spend weeks training a “machine”, and it seemed better to fail with something I understood (grep) rather than take on a body of science that was alien. In the end I succeeded – I extracted my terms with my custom-automated-grep-based-extractor, created RDF and loaded it into a repository. It’s not pretty, but it worked – I have gained lots of experience, and I know where to go next. I recommend it.

Finally, it’s worth noting here the value-add components

ontologies – domain expertise written down

vocabularies – these embody statements of knowledge

knowledge gathering – collecting a disparate set of facts, or describing and assembling a novel perspective

Linked Open Data

Having googled and read the w3c docs [25-30] on Linked Open Data it should become clear that the advantages of Linked Open Data are many.

Addressable Things – every Thing has an address, namespaces avoid conflicts, there are different addresses for concepts and embodiments

Content for purpose – if you’re a human you get HTML or text (say); if you’re a machine or program you get structured data

Public vocabulary – if there is a definition for something then you can reuse it, or perhaps even specialise it. If not, you can make one up and publish it. If you use a public definition, type, or relationship, and someone else does as well, then you can join or link across your datasets

Open links – a Thing can point to or be pointed at from multiple places, sometimes with different intent.
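The “content for purpose” point above is usually implemented with HTTP content negotiation: the same Thing serves different representations depending on the client’s Accept header. A toy sketch of that dispatch logic, with entirely hypothetical URIs:

```python
# Toy content negotiation: choose a representation of a resource based on
# the Accept header, as a Linked Data server would. URIs are invented.
def negotiate(accept_header):
    """Return (content type, representation URL) for a hypothetical resource."""
    if "text/html" in accept_header:
        # humans get the web page
        return ("text/html", "http://example.org/page/bolt-123")
    if "text/turtle" in accept_header or "application/rdf+xml" in accept_header:
        # machines get structured data
        return ("text/turtle", "http://example.org/data/bolt-123")
    # fallback: plain text
    return ("text/plain", "http://example.org/page/bolt-123")

print(negotiate("text/turtle"))
```

A real server would also issue a 303 redirect from the concept URI to the chosen representation, which is how the concept/embodiment distinction above is kept.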

Once these are part of your toolkit you can go on to create data-level mashups which add value to your own information. As use cases, consider how and why you might link the following to other datasets:

a public dataset that contains the schedules of buses in a city, and their current positions while on that schedule – the essence of this is people, place and time – volumes, locations at a point in time, over periods of time, at start and end times.

You could create an app based around this data and linkages to Places of Interest based on location: a tourist guide based on public transport.

How about an “eco” application comparing the bus company’s statistics over time with those of cars and taxis, cross-referenced with a carbon footprint map to show how public transport compares to private transport in energy consumption per capita?

Or add the movements of buses around the city to data on traffic light sequences, for a journey planner?

Or a mashup of route numbers, planning applications and house values?

Or a mashup that correlates the sale of newspapers, sightings of wild foxes and bus routes – is there a link? 🙂

The point is, it might look like a list of timestamps and locations, but it’s worth many times that when added to another dataset that may be available – and this may be of interest to a very small number of people with a very specific interest, or to a very large number of people in a shared context – the environment, value for money, and so on. And of course, even where the data may seem sensitive, or where gathering it all in one place feels risky, publishing it is a way of reaching out to customers and being open and communicative – building a valuable relationship and a new level of trust, both within the organisation and without.
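The tourist-guide mashup above is, at bottom, one federated query. A sketch of the kind of SPARQL it would need, joining the bus dataset to a places-of-interest dataset at another endpoint – all prefixes, class names and endpoint URLs here are invented for illustration:

```python
# A sketch federated SPARQL query for the bus / places-of-interest mashup.
# In practice you would match locations within a radius rather than on
# exact coordinates; exact matching keeps the sketch short.
QUERY = """
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX ex:  <http://example.org/bus/>
PREFIX poi: <http://example.org/poi/>

SELECT ?stop ?lat ?long ?place WHERE {
  ?stop a ex:BusStop ;
        geo:lat ?lat ;
        geo:long ?long .
  SERVICE <http://example.org/poi/sparql> {
    ?place a poi:PlaceOfInterest ;
           geo:lat ?lat ;
           geo:long ?long .
  }
}
"""
print(QUERY)
```

The SERVICE clause (SPARQL 1.1 federation) is what makes the cross-dataset join possible without copying either dataset.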

a commercial dataset such as a parts inventory used within a large manufacturing company: the company decides to publish it using URIs and a SPARQL endpoint. It tells its internal and external suppliers that its whole catalog is available – the name, ID, description and function of each part. Now a supplier can see what parts it sells to the company, as well as the ones it doesn’t yet, but could. Additionally, if the supplier can also publish its own catalog, and correlate the two with owl:sameAs – or go further and agree to use a common set of IDs (URIs) – then the supply chain between the two can be made more efficient. A “6mm steel bolt” isn’t the same as a “4mm steel rod with a 6mm bolt-on protective cap” (the keywords match) – but shared URIs make the distinction unambiguous.

And if the manufacturing company can now design new products using the shared URI scheme, and reference the inventory publication and the supply chain system built around it, then it can probably be a lot more accurate about costs, delivery times, and ultimately profit. Imagine if the URI scheme used by the manufacturer and the supplier were also an industry-wide scheme – what if its competitors, and other suppliers, also used it? It hasn’t given the crown jewels away, but it has been able to leverage a common information base to save costs, improve productivity and resource efficiency, and drive profit.
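The owl:sameAs correlation in the parts example can be sketched in a few lines. The catalogs, part IDs and namespaces here are hypothetical; only the owl:sameAs property itself is the real, standard predicate. As noted above, matching on description keywords alone is unreliable – a real matcher would need more than string equality – but it shows where the links come from:

```python
# Sketch: correlate two parts catalogs and emit owl:sameAs links in N-Triples.
MANUFACTURER = {"M-001": "6mm steel bolt"}   # hypothetical manufacturer catalog
SUPPLIER     = {"S-77":  "6mm steel bolt"}   # hypothetical supplier catalog

def same_as_links(ours, theirs):
    """Pair parts whose descriptions match exactly; emit owl:sameAs triples."""
    links = []
    for oid, odesc in ours.items():
        for tid, tdesc in theirs.items():
            if odesc == tdesc:
                links.append(
                    f"<http://example.org/parts/{oid}> "
                    f"<http://www.w3.org/2002/07/owl#sameAs> "
                    f"<http://supplier.example.com/catalog/{tid}> ."
                )
    return links

for triple in same_as_links(MANUFACTURER, SUPPLIER):
    print(triple)
```

Agreeing on a single shared URI scheme, as suggested above, removes the need for this matching step entirely.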

So Linked Open Data has value in the public and private sector, for small, medium and large scale interests. Now you just need to build your application or service to use it. How you do that is based around the needs and wishes you have of course. The 5 stars[26] of Linked Open Data mean that you can engage at a small level and move up level by level if you need to or want to.

At the heart of things is a decision to make your information available in a digital form, so that it can be used by your employees, your customers, or the public. (You could even do this at different publishing levels, or add authorization requirements to different datasets.) So you decide what information you want to make available, and why. As already stated, perhaps this is a new means to communicate with your audience, or a new way to engage with them, or it may be a way to drive your business. You may be contributing to the public information space. Either way, if you want it to be linkable, then you need URIs and a vocabulary.

So you reuse what you can from RDF/RDFS/FOAF/DC/SIOC and so on, and then you see that you actually have a lot of valuable information that you want to publish but that has not been codified into an information scheme, an ontology. What do you do? Make one up! You own the information and what it means, so you can describe it in basic terms (“it’s a number”) or from your own perspective (“it’s a degree of tolerance in a bolt thread”). And if you can, you should also try to link this to existing datasets or information that’s already out there – a dbPedia or Freebase information item say, or an IETF engineering terminology (is there one?), or a location expressed in Linked Open Data format (wgs84 [31]), or a book reference on Gutenberg [32,33], or a set of statistics from a government department [34,35], or a medical digest entry from Pubmed or the like [36] – and there are many more [37]. Some of these places are virtual hubs of information – dbPedia, for instance, has many, many links to other datasets. If you create a link from yours to it, then you are also adding all dbPedia’s links to your dataset – all the URIs are addressable after all, and as they are all linked, every connection can be discovered and explored.

This part of the process will likely be the most difficult, but also the most rewarding and valuable part, because the rest of it is mostly “clerical” – once you have the information, a structured way of describing it, and addresses for it (URIs), you “just” publish it. Publishing is a spectrum of things: it might mean creating CSV files of your data that you make available; it might mean you “simply”* embed RDFa into your next web page refresh cycle (your pages are your data, and Google is how people find it – your “API”); or it might mean that you go all the way and create a SPARQL endpoint with a content-negotiating gateway into your information base, and perhaps also a VoID [38] page that describes your dataset in a structured, open-vocabulary, curatable way [45].

* This blog doesn’t contain RDFa because it’s just too hard to do – wordpress.com doesn’t have the plugins available, and the wordpress.org plugins may be limited for what you want to do. Drupal7 [50] does a better job, and Joomla [51] may get there in the end.
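A VoID [38] page is itself just RDF, so it can sit at the lightweight end of the publishing spectrum too. A minimal, hypothetical example in Turtle (the dataset URI, endpoint and triple count are invented; void:Dataset, void:sparqlEndpoint and void:triples are the real vocabulary terms):

```python
# A minimal VoID description of a hypothetical dataset, in Turtle.
VOID = """\
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://example.org/dataset/parts> a void:Dataset ;
    dcterms:title "Parts inventory" ;
    void:sparqlEndpoint <http://example.org/sparql> ;
    void:triples 12345 .
"""
print(VOID)
```

Publishing even this much lets crawlers and aggregators discover your endpoint and describe your dataset without any per-site arrangement.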

Service Oriented Architecture, Semantic Web Services

There are some striking similarities and echoes of Service Oriented Architecture (SOA) in Linked Open Data. It would seem to me that using Linked Open Data technologies and approaches with SOA is something that may well happen over time, organically, if not formally. Some call this Semantic Web Services [49] or Linked Services [46]. It is not possible to discuss this fully here and now (it’s complex, large in scope, and still emerging), but I can attempt to say how a LoD approach might map onto SOA.

Below, for each technology pairing (covering APIs, services and data), a comment on how SOA and Linked Open Data compare.

REST, SOAP

Some love it, some hate it. I fall into the latter camp – SOAP has so much complexity and overhead compared to REST that it’s hard to live with, even before you start coding for it. LOD has plenty too, but it starts from a more familiar base. And this kind of sums up SOA for me and for a lot of people. But then again, because I’ve avoided it, I may be missing something.
REST in the Linked Open Data world allows all that can be accomplished with SOAP to be done at a lower cost of entry, and with quicker results. If REST were available for SOA services, I believe they would be more attractive.

VoID[38,39], WSDL[40]

WSDL and VoID fill the same kind of need – for SOA, WSDL describes a service; in the Linked Open Data world, VoID describes a dataset and its linkages. That dataset may be dynamically generated, though (the 303 Redirect architecture means that anything can happen, including dynamic generation and content negotiation), so it is comparable to a business process behind a web-based API.

UDDI, CPoA

This is where things start to really differ. UDDI is a fairly rigid means of registering web services, while the Linked Open Data approach (e.g. CPoA) is more do-it-yourself and federated. Whether this latter approach is good enough for businesses that have SLA agreements with customers remains to be seen, but there’s certainly nothing to stop orchestration, process engineering and eventing being built on top of it. Combine these basics with semantic metadata about services, datasets, registries and availability, and it’s possible to imagine a robust, commercially viable internet of services and datasets, both open and based on easily adopted standards.

The same issues that dog SOA remain in the Linked Open Data world, and arguably, because of the ease of entry and the openness, they are more of an issue there. But the Linked Open Data world is new in comparison to SOA and can leverage learnings from it and correct its problems. It’s also not a prescriptive base to start from, so there is no need for a simple public dataset to implement heavy or commercially oriented APIs where they are not needed.

Whatever happens, it should be better than what was there before, IMO. SOA is too rigid and formal to be an internet-scale architecture – one in which ad-hoc, organic participation is possible, or perhaps the norm, but in which highly structured services and processes can also be designed, implemented and grown. With Linked Open Data, and some more research and work (from both academia and people like you and me, and big industry if it likes!), the ultimate goals of process automation, personalisation (or individualisation), and composition and customisation get closer and closer, and may even work where SOA seems to have stalled. [46,47,48]
