
The overall architecture for the semantically backed J2EE app is different from the Linked Data app already discussed because we need a business logic layer and a decoupling from the persistence layer. We also want to create a Java app rather than a "semantic application", so that the programming paradigms and patterns are familiar to the enterprise Java developer.

Semantically backed J2EE webApp - System diagram

Here we see a fairly standard 3-tier MVC application. The browser requests URIs from the appserver, or makes an Ajax call, and gets HTML from server-side JSPs or JSON-formatted data in response, respectively. The application server contains Java code that maps URIs and API calls to controllers, which make calls to service classes and DAO code. The DAO code makes calls via a persistence proxy to get data from the server, which is unmarshalled from RDF to Java objects (or makes writes in the other direction). The persistence layer is configured to use an implementation that takes care of the object-to-RDF mapping – two implementations are available (JenaBean and Empire-JPA). These in turn use their own protocols to talk to native or local repositories, or, typically over JDBC, to a standard DBMS. Spring and Spring Security provide infrastructure-level services for dependency injection, component wiring, MVC abstractions, and role-, method- and data-level security for beans and dynamically created object instances. These technologies are shown below in the AppServer layer cake.
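As a rough sketch of the decoupling described above (class and method names are my own invention, not from the actual application), the service layer sees only a DAO interface, and the Object-RDF mapper sits behind whichever implementation is wired in:

```java
import java.util.*;

// Hypothetical domain object; in the real app an Object-RDF mapper
// (JenaBean or Empire-JPA) would annotate a class like this so that
// instances marshal to and from RDF resources.
class Location {
    final String uri;      // subject URI of the backing RDF resource
    final double lat, lon;
    Location(String uri, double lat, double lon) {
        this.uri = uri; this.lat = lat; this.lon = lon;
    }
}

// The DAO interface keeps the service layer ignorant of the
// persistence technology behind it.
interface LocationDao {
    Optional<Location> findByUri(String uri);
    void save(Location loc);
}

// In-memory stand-in; a production implementation would delegate to
// the configured persistence proxy (Jena or Sesame via Empire-JPA).
class InMemoryLocationDao implements LocationDao {
    private final Map<String, Location> store = new HashMap<>();
    public Optional<Location> findByUri(String uri) {
        return Optional.ofNullable(store.get(uri));
    }
    public void save(Location loc) { store.put(loc.uri, loc); }
}
```

Swapping the repository technology then means swapping the wired DAO implementation, which is exactly the kind of change Spring's dependency injection is meant to absorb.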

Technology libraries and tools used in Semantically backed J2EE WebApp

Obviously, there are many things going on here, and they'll need some discussion.

Community

This is a crucially important aspect in a new and evolving technology domain like the Semantic/Linked-Open-Data web – whether it's a commercial or FOSS component you are thinking about using.

For commercial tools, many offer free end-user or community licensing, limited by size or frequency of use, but if you plan to take your application to market you may well need to upgrade to a commercial license, and these are often very expensive – a Semantic Web or knowledge-based application, built on what might be an essential technology component, will surely be seen as a large value-add area for commercial companies. While I believe this is true, and commercial licenses can be justified, some technology offerings have small print that takes you straight to commercial licensing once you go to production. Others have smaller but knobbled versions, while some do have true SME-quality licensing. So watch out – it can be a barrier to entry, and we do need to see mid-level, SME and cloud offerings for the success of the pervasive or ubiquitous Semantic/Linked Open Data web.

Unfortunately, it seems that many tools and libraries born from academic research or open-source endeavours, while available for use, are often not maintained. The author or team moves on, or the tool or library is published but languishes. This ends up in a situation where you may find a tool that does what you need, but that has no or poor documentation; no active maintenance; no visible community support forums or user base; or compatibility problems with other tools, libraries or runtime environments. While that removes many from "production" usage or deployment, they can still be an important learning resource, and a means of comparing more current tools and libraries. I will itemise what I've come across below, but make sure you cast your professional eye over any offering – once you know what you are looking for, and what help you need in tools, libraries and environments; hopefully this article and the previous two have helped you with that.

What does it say it does and does not do?

How old is it? What are its dependencies?

How often is the code being updated?

Is it written in Java/PHP/Perl/.NET/Prolog/Lisp? Does it suit you – does it matter if it's written in Perl but you're going to write your app in Java? Is what you are going to use it for an independent stage in the production of your application, or are all the stages intertwined? How much will you have to learn?

Who is the author? What else has he/she/they done? Are they involved in the standards process, coding, design, implementation, community? Blogs, conferences, presentations?

Is there documentation? A tutorial? A reference? Sample code? Production applications?

Is there a means of contacting the authors, and other users?

Are there bugs? Are there many? Are they being fixed?

What are the answers to questions like – simple, helpful, understanding, presumptuous, brick-wall?! One-sentence answers, or contextualised for the audience?

What is the user group like – beginner, intermediate, advanced, helpful, broad or narrow base, international, academic, commercial, …?

How quickly are questions answered?

Does it seem like the tool/library is successfully used by the community, or is it too early to say, or is it unfit for purpose 😦?

Under what licensing is the tool/library made available?

Results

At the application level, this is how things panned out for the two applications. For each aspect below, the notes for the Linked Open Data webapp come first, followed by the notes for the Semantic backed J2EE webapp.

Metadata, RDF, OWL

Need to have entries for each location in the gazetteer. Need a list of those locations. Then need to relate one to another from what the text describes about road links, directions and bearing. Need metadata fields for each of those. Will also pull out administrative region type, population information, natural resources, and "House" information – seats of power/peerage/members of parliament. Will need RDF, RDFS and OWL for this, along with metadata from other ontologies. A further dataset was later added for townland names – this allows parish descriptions from Lewis to encompass townland divisions, with the potential for crossover to more detailed reporting from the time (e.g. Parliamentary reports).

This application associates a member or person with a list of locations and datetimes. Locations are posted by a device on a platform by a useragent at a datetime, and are also associated with an application or group. An application is an anonymous association of people with a webapp page or pages that makes use of locations posted by its members. A group is an association of people who know each other by name/ID/email address and who want to share locations. Application owners cannot see the locations or members of other applications unless they own each of the applications. Application owners cannot see the location or datetime information with full accuracy. Group owners can see the location and datetime of their members with more accuracy, but not full accuracy. A further user type ("Partner") can see all locations for all groups and applications but cannot see the names of groups, applications or people, and has less accuracy on location and datetime. Concept subject tags can be associated with profiles and locations. A query capability is exposed to application owners and partners to allow data mining with inference. Queries can be scheduled and actions performed on "success" or "fail". Metadata for people, devices, platforms, datetime, location, tags, applications and groups is required. ACL control based on that metadata is performed, but at an application logic level, not at a data level.
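One way the graded location accuracy could be implemented – this is an assumed policy for illustration, not the application's actual algorithm – is to round coordinates to fewer decimal places for less trusted roles:

```java
// Hypothetical precision reduction: coordinates are rounded to fewer
// decimal places the less trusted the viewer role is. The decimal
// counts below are invented examples, not the application's policy.
enum ViewerRole { APPLICATION_OWNER, GROUP_OWNER, PARTNER }

class LocationBlur {
    static double blur(double coordinate, ViewerRole role) {
        int decimals;
        switch (role) {
            case GROUP_OWNER:       decimals = 3; break; // roughly 100 m
            case APPLICATION_OWNER: decimals = 1; break; // roughly 10 km
            default:                decimals = 0; break; // roughly 100 km
        }
        double factor = Math.pow(10, decimals);
        return Math.round(coordinate * factor) / factor;
    }
}
```

The same reduction would be applied to the datetime (e.g. truncating to the hour or day) before results cross the service-layer boundary, so the data level never needs to know about roles.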

SPARQL, Description Logic (DL), Ontologies

A SPARQL endpoint is provided on top of the extracted and loaded data, and is the primary "API" used by the application logic, which is expressed in JavaScript. Inference allows regions, for instance, to be queried at a very high level rather than by listing specific types. An ontology is created around location, location type, direction, bearing, distance, admin type, population, natural resource and peerage. A separate ontology was created for peerage relationships and vocabulary, and imported into the top-level Lewis ontology. Some fields are used from other ontologies, notably wgs84 and muo. The UI allows navigation by ontology (jOwl plugin).
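To illustrate why inference helps here: with RDFS subclass reasoning enabled, a query against a hypothetical top-level lewis:Region class also matches every subtype (parish, barony, county, …) without listing them. A minimal sketch that just assembles such a query string (the prefix, class and property names are invented stand-ins for the Lewis ontology):

```java
// Builds a SPARQL SELECT over a class in the (hypothetical) Lewis
// ontology. Run against an inference-enabled endpoint, asking for the
// superclass "Region" returns instances of all its subtypes too.
class RegionQuery {
    static String forType(String classLocalName) {
        return String.join("\n",
            "PREFIX lewis: <http://example.org/lewis#>",
            "SELECT ?place ?name WHERE {",
            "  ?place a lewis:" + classLocalName + " ;",
            "         lewis:name ?name .",
            "}");
    }
}
```

Without inference, the same result would need a UNION over every concrete region type, and the query would break each time a type was added to the ontology.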

No SPARQL interface is directly exposed, but SPARQL queries form the basis of a data console; restrictions on queries are applied based on ID and application/group membership, as well as role. A custom ontology is created based around FOAF and SIOC, extended for RegisteredUser, Administrator, Partner, Application, Group, Device, Location and so on. The object model in Java mirrors this at the interface level to simulate multiple inheritance. There are some cardinality restrictions, but it mostly makes use of domain/range specification from RDFS. The Umbel ontology is used for querying across tag relations. Inference has a huge impact on performance, and data partitioning would be required for query performance; this also has implications for the library code used (named graph and query support, inference configuration) and for application architecture and scale-out planning.
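The interface trick mentioned above can be shown in a few lines. Since an OWL class may have several superclasses but a Java class may extend only one, each ontology class is mirrored by an interface, and a concrete type implements as many as it needs (the type names here are illustrative, loosely following the ontology described above):

```java
// Each interface mirrors an ontology class; extending interfaces
// mirrors rdfs:subClassOf. A concrete Java class can then sit under
// several ontology classes at once - something single class
// inheritance cannot express.
interface Agent          { String id(); }
interface RegisteredUser extends Agent {}
interface Administrator  extends RegisteredUser {}
interface Partner        extends Agent {}

// A principal that is both an administrator and a partner.
class AdminPartner implements Administrator, Partner {
    private final String id;
    AdminPartner(String id) { this.id = id; }
    public String id() { return id; }
}
```

Object-RDF mappers that work against interfaces (Elmo/Alibaba take this approach) can then type a single RDF resource as several classes without fighting the Java type system.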

Artificial intelligence, machine learning, linguistics

Machine learning and linguistic analysis were avoided in favour of syntactic a-priori extraction via gazetteer and word list, after sentences have been delimited within each delimited location entry or report. Aliases and synonyms were added manually later as a fixup for OCR errors. Quality is restricted by the text extracted from PDF and by structural artifacts (page headings, numbers), newlines, linefeeds, the lack of section headings within locations, location delimiters, and the linguistic vagaries of the author. Much, much more information is available within each entry, but for now the original text is also stored sentence by sentence with each entry.
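A minimal sketch of this kind of a-priori extraction – sentence delimiting followed by gazetteer matching, with no NLP or ML – might look like the following (the gazetteer entries and entry text are invented examples):

```java
import java.util.*;

// Splits a location entry into sentences, then matches tokens against
// a known gazetteer. No linguistic analysis: the gazetteer and word
// list carry all the a-priori knowledge.
class GazetteerExtractor {
    private final Set<String> gazetteer;
    GazetteerExtractor(Set<String> gazetteer) { this.gazetteer = gazetteer; }

    List<String> extract(String entry) {
        List<String> hits = new ArrayList<>();
        // Naive sentence delimiter: split after ., ! or ?
        for (String sentence : entry.split("(?<=[.!?])\\s+")) {
            for (String token : sentence.split("[^A-Za-z]+")) {
                if (gazetteer.contains(token)) hits.add(token);
            }
        }
        return hits;
    }
}
```

In practice this is where the OCR fixups bite: the alias/synonym list effectively becomes part of the gazetteer, so "Dubliu" (an OCR error) can be mapped back to "Dublin" before matching.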

None required here, as no extraction is performed. Tag words and terms are restricted to those available in Umbel (OpenCyc) and condensed to Umbel Subject Concept URIs, which SPARQL queries can then make use of for broader, narrower and associative queries – e.g. "Find everyone who likes sports who posted a location within 1 mile of here".
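The proximity half of a query like that cannot be delegated to the vocabulary; one plausible approach – a sketch, not the application's actual code – is a plain haversine distance check applied to candidate locations after the concept matching:

```java
// Great-circle distance between two WGS84 points, used to filter
// candidate locations to those "within 1 mile of here".
class Geo {
    static final double EARTH_RADIUS_MILES = 3958.8;

    static double miles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }

    static boolean withinOneMile(double lat1, double lon1, double lat2, double lon2) {
        return miles(lat1, lon1, lat2, lon2) <= 1.0;
    }
}
```

The concept part ("likes sports") would come back from the SPARQL query over Umbel's broader/narrower relations; the distance filter then runs over that result set in application code.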

Linked Open Data

Location name lookups at extraction time link to the WGS84 grid location and ID in Geonames, then to the dbPedia entry. The former is done using a traditional web service API, the latter by SPARQL query. Coverage of about 85% was achieved. dbPedia lookup based on name was attempted, but a higher error rate (no hits, or ambiguous hits) and lower coverage were found (there are many infobox field variations for the same type of information). Manual/"eyeball" QA was deemed sufficient for the expected usage and audience. Links to Dictionaries of Biography for houses are possible, using some form of owl:equivalence in the peerage ontology. UI-level links to Sindice and Uberblic were attempted, but cross-domain scripting prohibited them. Locations are mapped to Google Maps – this could be migrated to OpenStreetMap (Geonames basis). Visualisation is possible with Google visualisation or another web tool. A server-side proxy was created for this, and for further dbPedia integration – this provides an example link to "people born before 1842 at this location".

Links to Umbel are performed at query time based on Umbel Subject Concepts applied by members to their profiles and locations. The Umbel vocabulary is currently queried directly against the Structured Dynamics endpoint, but could be loaded into the same data repository, or a separate but more local repository (it has a large memory footprint). Federated query capability depends on the pluggable persistence technology used in the application. Applications built on or off the domain are free to make use of owl:sameAs, for instance, to further link proprietary data with data stored in this system, but need to make that association within their own repository. Links can be made to profile identity (local or OpenID) if known or if the user expressly associates it (after OAuth verification), to a wgs84 location (assuming some proximity calculation), or to an application or group name (if known).

NLP and ML are too advanced, too manual and too time consuming for a beginner, or for a one-person prototyping "team".

UI generation from RDF is a problematic area – it would be good to be able to generate a UI now there's an ontology, but the state of the art is no more advanced than any UI or form generation from XML or other structured data.

Link generation code is largely manual; it could do with abstraction and ease of use (but this is a complex area!). Lots and lots to learn; active support and experience required. Cross-domain scripting is a problem for Linked Open Data.

Where linked open data isn't a primary requirement, most other requirements are met by traditional RDBMS-based technology and architecture. Open source can meet all component requirements for now (tech demo).

There are no JDBC-type access wrappers for semantic repositories. SPARQL is young and evolving.

Concurrency and multi-instance access considerations need to be made up front, early in development.

There are some library- or repository-specific ORM-type tools, and one (that I found) JPA-based library in development. Lots and lots to learn; active support and experience required.

Tools

This is as comprehensive a list as I can come up with, based on what I looked at and ended up using (or not). There are many, many more for sure, some in Java, others in various other languages. As some of the work types in the text-to-knowledge progression are often independent, being available in Java may not be important or even a consideration for you. So – look here, there and everywhere. See also Dave Beckett's list [81] for a great source of information about available tools and technologies.

Each entry below gives the category, the tool, a comment, and two marks showing whether the tool was used in the Linked Open Data webapp and in the Semantic backed J2EE webapp, respectively (Y = used, X = looked at but not used).

Extraction

GATE [56]

IDE for configuration of NLP toolsets and training an ML engine. Active user group, but the tool UI seemed buggy (Q1 2010) and the documentation was obtuse – not geared towards those not "in the know", IMO. Still good, but it would need a lot of effort and patience.

This is part of the transformation of source content to "knowledge". Once entities are extracted they need to be used in RDF triples – how you go about this depends on your vocabulary and ontology, and it's up to you to use the RDF-Java object frameworks (below) that allow you to create a subject and add a property with an object value. I haven't found a tool that would allow code to generate RDF from tagged entities, say, and it's likely not reasonable to think in this way – however convenient. How would such a tool know which relationships in an ontology were asserted in the entity set you gave it? The only way to go about this is to code those things yourself from the knowledge you already have about the information, or what you want to assert – or perhaps, if you are dealing with a database, to use its schema as the basis for a set of asserted statements in RDF, using D2R [87] or Triplify [88] say (do you need inference or not?). This approach was not used in either of these projects, however. Perhaps owl2java [99] might have helped?

X

X
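The triple-building work described above can be sketched without committing to a framework. The URI scheme and property names below are hypothetical stand-ins; a real implementation would use the Jena or Sesame model APIs rather than a home-grown Triple type, but the shape of the work is the same: mint a subject URI for the extracted entity, then assert the statements your ontology calls for.

```java
import java.util.*;

// A minimal stand-in for an RDF statement, to show the pattern of
// turning an extracted entity into subject/predicate/object triples.
record Triple(String subject, String predicate, String object) {}

class EntityToRdf {
    // Hypothetical URI scheme and property names - illustration only.
    static List<Triple> describeLocation(String name) {
        String subject = "http://example.org/lewis/place/" + name;
        return List.of(
            new Triple(subject, "rdf:type", "lewis:Location"),
            new Triple(subject, "lewis:name", name));
    }
}
```

The decisions a generic tool cannot make for you – which class to type the entity as, which properties apply – are exactly the two hard-coded lines here; everything else is mechanical.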

NLP, ML

GATE [56]

NLP engine from Sheffield University with support for ML – see also Extraction category above. Tried but not used.

X

X

OpenNLP[89, 90]

NLP library for tokenization, chunking, parsing and coreference. Simpler than GATE, less documentation, dormant? Tried but not used.

X

X

MinorThird [91]

Probably more ML than NLP, but with tokenization and extraction capability. Getting long in the tooth, and had some compatibility issues when tested.

X

X

UIMA[92, 94,95]

"Unstructured Information Management Architecture". A full-blown framework for NLP and ML – "text mining", à la GATE. Now in Apache (contributed by IBM). Good documentation, active support and development. Came close to using it for the Linked Data app but it came too late, and seemed large and time consuming to learn (on my timescale). However, for a version 2 of the project I would use it over GATE and the custom code I built – the documentation for end users and developers is less assuming than GATE's, various plugins are available, and as it is modular (so is GATE, btw) you can create and add your own discrete code into the UIMA processing pipeline. You still need something to generate RDF based around your ontology and the extracted entities, though…

SenseRelate[93]

NLP-WordNet disambiguation toolkit. Couldn't see how I would integrate this, or what purpose it would serve for my applications, as I was using a-priori knowledge of the text for the Linked Open Data webapp, and the application business logic for the Semantic backed J2EE webapp. Also getting old…

X

X

LingPipe [106]

Very interesting toolkit for NLP, text and document processing, but ultimately with a commercial license.

X

X

Mallet [107]

Like LingPipe but opensource, with sequence tagging and topic modelling.

X

X

Weka [108]

Another text mining tool: open source, good docs, current and maintained; also works with GATE [56].

X

X

RDF-Java

OpenJena [59, 65]

Maturing framework for RDF with Java. The SPARQL implementation [61] follows the standards closely and previews upcoming versions, as Andy Seaborne is on the SPARQL W3C group. Has repository capability as well. Used in both projects, but in the J2EE app it was just one of the possibilities for repository integration and RDF capability. The support forum is high traffic – a popular choice – though you are expected to provide working code examples when describing problems; discussion not entertained! HP [64] and now Apache [65] backing. Combined with JenaBean [71] and Empire-JPA [72] in the J2EE app. TTL/N3 configuration may seem alien to Java webapp developers.

Y

Y

KAON [97]

Another library – it didn't seem as popular as Jena or Sesame. Documentation? Old, not actively maintained?

X

X

Sesame [62]

Modular RDF-to-Java library and repository framework. v3 expected soon (Q1 2011?). Good documentation and commentary available on and off site, but you still need to experiment. The support forum can be slow and low traffic, but it is still a popular choice. Also home to Elmo [66] (an object-RDF extension) and Alibaba [67] – "the next generation of the Elmo codebase". Combined with Empire [72] in the J2EE app. TTL/N3 configuration may seem alien to Java webapp developers.

X

Y

Object-RDF

JenaBean [71]

Appears now dormant, but a Jena object library with custom annotations to model and map Java classes to RDF classes. Support very slow; low activity.

X

Y

Empire-JPA[72]

Aka Empire-RDF. From makers of Pellet [75]. JPA implementation for access to semantic repositories, with adapters for Sesame, Jena, Fourstore [74]. Newish, v0.7 about to be released. Support good, interested, helpful.

X

Y

RDF2GO [79]

Abstraction over repository and triplestores, with Jena, Sesame and OWLIM adapters. Decided in favour of Empire.

X

X

Repository and/or database

TDB [60]

Single-instance repository, with cmdline and Jena integration. No clustering or replication capability – must be local to the webapp. Configuration can be awkward, IMO, but it is easy enough to get started with. Inferencing and custom ontology support, at both configuration and code levels. Single writer, multiple readers. Used in both projects, but in the J2EE app it was just one of the possible repository technologies. Uses memory-mapped files in a 64-bit JVM.

Provides a proxy HTTP capability in front of in-memory, file-based or database-backed repositories. Inference by configuration, performed on write – inferred statements are asserted and persisted. Allows multiple webapp instances to make use of any of the repositories. Web-based "workbench". Limited reasoning support compared to Jena. The support forum could be described as "slow". OntoText [73] backing.

X

Y

BigData [68]

Sesame[62] + Zookeeper [77] + MapReduce [78] based clustered semantic repository for very large datasets. Too big for either apps at this stage, but Empire/Sesame usage provides growth path.

X

X

AllegroGraph [69]

Lisp based Semantic Repository with community and commercial licensing options for larger datasets. Http interface – could be used as alternative to Jena/Sesame/Empire. Biggish application and framework to read and learn – too big for now !

X

X

OWLIM [70]

Large-scale repository based around Sesame. Reasoning support is better than Sesame's, and it takes an alternative approach to implementation compared with, say, Jena. Community and commercial licenses. Too big for now!

X

X

Fourstore [80]

Python semantic repository. Could be used behind Empire.[72]

X

X

Content negotiation

Pubby [76]

A WAR file with configuration (N3) for URI mapping, 303 redirects and many other aspects of Linked Data access – for SPARQL endpoints that support DESCRIBE. I wrote a filter that could sit on a remote front end as an alternative, which may get used later.

X

X

SPARQL access & Endpoint

Joseki [58]

SPARQL endpoint for use with Jena. Needs URL rewriting for PURLs and content negotiation code in front (custom code).

Y

X

Link generation

N/A

Use custom code with, e.g., Jena or Sesame to create statements in the model – once you've designed your URI scheme – and get the code to serialise/materialise the URI for you.

Y

Y
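The URI-minting step mentioned in the Link generation entry above can be sketched in a few lines. The base URI and slug rules here are invented for illustration; the resulting URI would then be handed to Jena or Sesame when creating statements:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Mints a stable, readable URI for a named resource under a
// hypothetical base. Consistency matters more than the exact scheme:
// the same name must always yield the same URI, or links break.
class UriMinter {
    static final String BASE = "http://example.org/lewis/place/";

    static String mint(String name) {
        String slug = name.trim().toLowerCase().replaceAll("\\s+", "-");
        return BASE + URLEncoder.encode(slug, StandardCharsets.UTF_8);
    }
}
```

Deciding the scheme up front (what is a "place", what goes in the path, how duplicates are disambiguated) is the real design work; the serialisation itself is trivial.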

Ontologies

Protégé [82]

IDE to create RDFS and OWL ontologies, with reasoning and visualisation.

Y

Y

NeOn Toolkit [100]

Eclipse-based tool suite for semantic apps. Broad in scope; Protégé seemed a better fit – easier and quicker to get to grips with at the time. May be used again, though.

X

X

KAON – OI-Modeler [98]

Old. Still available? Being maintained?

X

X

m2t4 [101]

Looked promising – a simple Eclipse plugin – but it had compatibility and maintenance issues. Switched to Protégé in the end, however.

X

X

Inference & Reasoning

Jena [59]

Jena has built-in inference capability, but is considered slower than others [86]. In the J2EE app, with an RDBMS-backed repository, it was poor, IMO. With a TDB repo it is better, but performance is still something you really need to examine before you would deploy in production. This is probably true of all current repositories, but Jena seems to be at the slow end of the scale. However, it does deliver high standards compliance rather than the "degraded" compliance you may get with others.

Y

Y

Sesame[62]

Sesame has "reduced" reasoning support – it can do RDFS-based reasoning, and if custom ontologies are added to a repository type with inference support it will make use of them. If a "view" of a dataset is required that doesn't contain inferred statements, a query parameter needs to be used so they are filtered out.

X

Y

Pellet [75]

“Independent” inference and reasoning. Not used except as plugin in Protege. Supposedly faster than some others.

X

X

OWLIM [73]

OWLIM comes with its own flavour of inference and reasoning: "support for the semantics of RDFS, OWL Horst and OWL 2 RL".

X

X

UI Generation & Rendering

Talis Sparql js lib [57]

JavaScript interface for using datasources hosted on the Talis platform. Decided not to host data "offsite" at this stage.

A display vocabulary for RDF. Integration is at the Java level; it may have been possible to create a Spring [105] view module (I use Spring a lot), but that was another thing to learn, and I wanted to use plain old JavaScript and HTML as much as possible. Has promise, but documentation, support and maintenance may be an issue.

I had used Omondo a couple of years back, but trying to install it again proved too painful. I had to download a complete new Eclipse+plugin combo for evaluation purposes, and then it wouldn't start – Eclipse gave a helpful "error=-1".

Tried MaintainJ – I had looked at it a while ago – but couldn't get the plugins to work because my AspectJ code used v7 and it wanted v6. Gave up.

MyEclipseIDE looks like it has it all, including an update URL, but it depends on Mozilla XUL 1.9.07, which doesn't seem to be available. Installing 1.9.17 didn't help. Eventually I downloaded the v9 beta (a full install of several hundred MB), imported the projects I needed, right-clicked to the MyEclipse menu in the UML2 perspective, and followed the prompts. Right-click on a diagram to export to JPG. I have not investigated round-tripping or sequence diagrams yet – to follow, when I need them. But then I restarted it, and it never stops starting. Still, I got one diagram from it!