Category Archives: cetis-standards

From UCAS applications to HESA returns, and from league tables to the academic technology approval scheme, degree programmes and modules are classified by subject. JACS3 does that job now, but HECoS will do it in the future. Here are the main differences.

After many years of use, the Joint Academic Coding System (JACS) that’s pervasive in UK Higher Education data sets ran into some limits: it was running out of codes in some subject areas, and it was being used for many more purposes than it was originally designed to support.

That’s why the Higher Education Data and Information Improvement Programme (HEDIIP) commissioned CETIS, in collaboration with APS and Aspire, to consult with the sector on a replacement of the vocabulary. The result of that work is the Higher Education Coding of Subjects (HECoS) vocabulary. HECoS has now reached the penultimate stage in that a release candidate is out for consultation, as are proposals for the governance and adoption of the scheme.

Here are the main differences between JACS3 and HECoS in a nutshell, though;

One flat list, no hierarchies, and no memorable codes

This is easily the biggest and most noticeable change. HECoS itself is just a list of terms without any implied or given groupings. That doesn’t mean groupings and hierarchies aren’t important, quite the contrary: different organisations have different uses for subject information, and that means they can group subjects differently.

In a way, that follows on from what’s already happening with JACS3 in practice. The definition of what subjects constitutes biological sciences, for example, already differs between JACS3, HEFCE and what a typical university is likely to be able to offer. Different drivers and different contexts lead these organisations to group subjects differently, and HECoS is designed to enable different groupings to exist side by side, whilst still sharing the same subject terms.

A consequence of the approach is that the familiar JACS3 codes (“L3xx” is anything sociological etc.) are no longer valid. From the perspective of HECoS “sociolinguistics” will therefore have no defined link with “sociology”, which is why the code for the former is “101016” –or a URI that encodes that number such as http://hecos.hediip.ac.uk/terms/101016– and the code for the latter is “100505”.

For ease of navigation, however, HECoS will come with some common groupings. There is a “sociology group” that has both “sociolinguistics” and “sociology” in it. This is just to help people find terms, and nodes like “sociology group” cannot be used to classify a degree programme or module.

Terms are based on demonstrated use, need and distinguishability

While JACS was reviewed periodically, it hasn’t always had formal acceptance criteria either for the terms that were already in there, or for newly proposed ones. HECoS does have a proposal for it, which has already been applied in the development of the current draft.

The criteria for the first cut were, in short:

is the term in JACS3?

is there evidence of use of the term in HESA data returns?

is the term’s definition and scope sufficiently clear and comprehensive to allow classification?

is the term reliably distinguishable from other terms?

The first criterion comes out of a recognition that JACS has imposed a structure and created its own reality over the years. That’s a good thing, and worth preserving for time series analysis reasons alone. The second criterion addresses an issue that has bedevilled JACS for a while: many terms were sound in theory, but barely or never used in practice. This creates confusion and often makes coding unreliable: what good is a term if it groups one degree programme in one institution? For that reason, we looked at whether a term has at least two degree programmes in at least two institutions in HESA student data returns.

The third criterion has to do with the way some JACS terms were defined: some were incomplete –e.g. “history by topic” without specifying what that topic was– or where not sufficiently complete to determine what was in or out. The final criterion of distinguishability is related to that: we examined the HESA returns for consistency of coding. If the spread of similar degree programmes over several terms indicated that people were struggling to distinguish between terms, we’ve rearranged terms so that they follow the groupings that were obvious in the data as closely as possible. We’ve also started to test any such changes with sorting exercises to ensure that people can indeed distinguish between four related terms.

A commonly administered change process

Just like JACS evolved over the years, so will HECoS. The difference is that we are proposing to regularise the change and allow it to follow a predictable path. The main mechanism for that would be a registry for new terms. The diagram outlines how a new subject term can be discovered, or entered for consideration for inclusion, or discovery by others.

The proposed criteria for accepting a new term into HECoS proper are similar the ones used for the first draft: a term has to be demonstrably in use, or fill a need, and be distinguishable by non-specialists. In each case, though, the HECoS governance body, which is designed to represent the whole sector, will have the ultimate say on which terms will be accepted or retired, and how often these changes will happen.

Learning about an interoperability specification such as a QTI 2.1 becomes easier when you can see it working in a set of tools. In this post, we’ll create a very simple test using three freely available tools.

Item creation

We’ll get started with making an item, and we’ll use Kingston University’s Uniqurate. As an item and test editor it’s limited in what types of question items it does, which is also its virtue; it’s nice and simple.

Since we’re making just an item, we click “+Question”, and then hit the pencil and paper icon to edit the content of the new item we’ve just started.

The editing window already has boxes for the title of the item and the prompt- the item’s instructions. We’ll make a question about cities in Arizona, and ask people to pick those cities that are in that state from a list.

Since we’re doing a multiple choice question, we’ll pick the multiple choice widget from the list of components on the left, and drag it into the free slot underneath the prompt. Then we’ll start filling it with cities. Click “+” to add a slot, give each correct choice a score of 1, and each incorrect choice with 0. The maxchoices should be set at least to the number of all the correct choices.

To see what the item’s code looks like, hit the “Expert Mode” button. Be aware that changing the code in expert mode may not be retained when going back to easy mode.

Hit the save button, and save either the file or content package to a convenient place on your machine.

While Uniqurate is perfectly able to make tests itself, seeing whether the QTI item can be integrated into a test by another tool is a nice demonstration of interoperability, so we’ll use BPS Onyx for that.

Test creation

We start by hitting the “Create test” link, and giving the new test a name. The default test gives a good idea of the basic structure of a QTI 2.1 test.

To add the Arizona item to the test, it first needs to be added to the question bank, so we navigate to “Test resources”. In there, hit “Import content”, select the item, and hit upload.

Then we click on the edit button of the test we made to add the item. Highlight the section we want to put the item in, and hit the “Question bank” folder icon. Select the item in it, and click “Add an element”. The tabs on the item as well as the section and test give a good idea of all the configuration possibilities.

Once the test is done, hit “Save test”, then “Test resources”. Select the test in the list, and hit “Export”, and save the test package in a convenient place.

Running a test

To run our test, we’ll use QTIWorks. Click “Demos”, then “Quick Upload and Run”. While you click through the test, be sure to hit “Open Author’s feedback” to view all the state information QTIWorks collects about the test. All of this information is available in standard QTI results XML for each test of each candidate.

QTIWorks also has a number of QTI example items and tests readily built in. The full collection gives a very good idea of all the capabilities of QTI 2.1, and hitting the “Open Author’s feedback” link allows you to inspect the XML, as well as see what it does.

Uniqurate, Onyx and QTIWorks are not the only readily available QTI tools. Be sure to have a look at TAO too for a complete open source solution.

During last week’s CETIS conference I ran a session that assessed how ebooks can function as an educational medium beyond the paper textbook.

After reminding ourselves that etextbooks are not yet as widespread as ebook novels, and that paper books generally are still more widely read, we examined what ebook features make a good educational experience.

Though many features could have been mentioned, the majority were still about the experience itself. Top of the bill: formative assessment at the end of a chapter. Either online or offline, it needs to be interactive, and there need to be a lot of items readily available. Other notable features in the area include a desire for contextualised discussion about a text. Global is good, but chats limited to other learners in a course is better. A way of asking for clarification of a teacher by highlighting text was another notable request.

These features were then compared to what is currently the state of the art. Colin Smythe presented the latest EDUPUB work from IMS, IDPF and the W3C which integrates both VLEs and books, as well as analytics and assessment platforms. The solution is slick, and entirely web based. This contrasts with the solution I demoed before the formal EDUPUB work started. Unlike IMS’ example, my experiment does work in most any ebook software, but it doesn’t include IMS’ Caliper analytics capability.

But then Mick Chesterman of Flossmanuals and Manchester Metropolitan University reminded us that reading features aren’t the only ones worth considering. The open source Booktype platform allows communities to quickly and easily write books collaboratively, and then clone, share or merge them in a process called ‘federated publishing’.

The editability of standard, EPUB format ebooks also introduced the core question: what is the difference between an ebook and a web site? The interactivity and media support that is now possible in ebooks is blurring the distinction, but features such as the possibility of editing could prove a key distinction.

Another distinction, but one that may not persist, is a book’s persistence itself. With more functionality living outside of the book, on servers on the wider internet, how will a book endure? While intermittent connectivity means that offline access is still desirable now, will the ever increasing ubiquity of bandwidth spell the end for self-contained media?

During last week’s EDUPUB workshop, I presented a demo of how an IMS QTI 2.1 question item could be embedded in an EPUB3 e-book in a way that is engaging, but also works across many e-book readers. Here’s the why and how.

One of the most immediately obvious differences between a regular book and an e-textbook is the inclusion of little quizzes at the end of a chapter that allow the learner to check their understanding of what they’ve just learned. Formative assessment matters in textbooks.

When moving to electronic textbooks, there is a great opportunity to make that assessment more interactive, and provide richer feedback, and connect the learning to a wider view of how a student is doing (i.e. learning analytics). The question is how to do that in a way that works across many e-reading devices and applications, on a scale that works for publishers.

QTI item in Adobe Editions

Scalability is where interoperability standards like EPUB3, IMS Learning Tool Interoperability (LTI) and IMS Question and Test Interoperability (QTI) 2.1 come in. People use a large number of different software systems in the authoring, management, and playback of e-books. Connecting each of those to all the others with one-off custom integrations just gets too complex, too expensive and too brittle; that’s why an increasing number of publishers and software vendors agreed on the EPUB specification. As long as you implement that spec, solutions can scale across many e-book applications. The same goes for question and test material, where IMS QTI does the same job. LTI does that job for connecting VLEs to any online learning tool.

Which leaves the question of how to square the circle of making the assessment experience as engaging and effective as possible, but also work on devices with very different capabilities.

Fortunately, EPUB3 files can include a number of techniques that allow an author to adapt the content to the capability of the device it is being read on. I used those techniques to present the same QTI item in three different ways; as a static quiz – much like a printed book –, as a simple interactive widget and as a feedback rich test run by an online assessment system inside the book. The latter option makes detailed analytics data available and it should also make it possible to send a grade to a VLE automatically.

The how

QTI item in Apple iBooks

For the static representation and the interactive widget, I relied on Steve Lay’s rather brilliant transform from QTI XML to HTML5 (and back again), and to make the HTLM5 interactive with some javascript. By including this QTI HTML5 in the EPUB, you get all the advantages of standard QTI, in a way that still works in a simple, offline reader such as Adobe Editions as well as more capable software such as Apple’s iBooks.

For the most capable, online ebook readers such as Readium, the demo e-textbook connects to QTIWorks, an online QTI compliant assessment engine. It does that via IMS LTI 1.1, but in a somewhat unusual way: in LTI terms, the e-book behaves as a tool consumer. That is; like a VLE. Using a hash of an Oauth secret and key, it establishes a connection to QTIWorks, identifies the user, and retrieves the right quiz to show inside the ebook. A place to send the results of the quiz to is also provided, but I’ve not tested that yet. QTIWorks makes detailed report available of what the learner did exactly with each item, which can be retrieved in a variety of machine readable formats.

QTI item in Readium

Because the secret and the key have to be included in the book, the LTI connection the book establishes is not as secure as an LTI connection from a proper VLE. For access to some formative assessment, that may be a price worth paying, though.

The demo EPUB3 uses both scripting and some metadata to determine which version of the QTI item to show. The QTI item, the LTI launch and the EPUB textbook are all valid according to their specifications, and rely on stock readers to work.

As the QTI 2.1 specification gets ready for final release, and new communities start picking it up, conforming tools demonstrated their interoperability at the JISC – CETIS 2012 conference.

The latest version of the world’s only open computer aided assessment interoperability specification, IMS’ QTI 2.1, has been in public beta for some time. That was time well spent, because it allowed groups from across at least eight nations across four continents to apply it to their assessment tools and practices, surface shortcomings with the spec, and fix them.

Nine of these groups came together at the JISC – CETIS conference in Nottingham this year to test a range of QTI packages with their tools, ranging from the very simple to the increasingly specialised. In the event, only three interoperability bugs were uncovered in the tools, and those are being vigorously stamped on right now.

Where it gets more complex is who supports what part of the specification. The simplest profile, provisionally called CC QTI, was supported by all players and some editors in the Nottingham bash. Beyond that, it’s a matter of particular communities matching their needs to particular features of the specification.

In the US, the Accessible Portable Item Profile (APIP) group brings together a group of major test and tool vendors, that are building a profile for summative testing in schools. Their major requirement is the ability to finely adjust the presentation of questions to learners with diverse needs, which is why they have accomplished by building an extension to QTI 2.1. The material also works in QTI tools that haven’t been built explicitly for APIP yet.

A similar group has sprung up in the Netherlands, where the goal is to define all computer aided high stakes school testing in the country in QTI 2.1 That means that a fairly large infrastructure of authoring tools and players is being built at the moment. Since the testing material covers so many subjects and levels, there will be a series of profiles to cover them all.

An informal effort has also sprung up to define a numerate profile for higher education, that may yet be formalised. In practice, it already works in the tools made by the French MOCAH project, and the JISC Assessment and Feedback sponsored QTI-DI and Uniqurate projects.

For the rest of us, it’s likely that IMS will publish something very like the already proven CC QTI as the common core profile that comes with the specification.

System A needs to talk to System B. Standards are the ideal to achieve that, but pragmatics often dictate otherwise. Let’s have a look at what approaches there are, and their pros and cons.

When I looked at the general area of interoperability a while ago, I observed that useful technology becomes ubiquitous and predictable enough over time for the interoperability problem to go away. The route to get to such commodification is largely down to which party – vendors, customers, domain representatives – is most powerful and what their interests are. Which describes the process very nicely, but doesn’t help solve the problem of connecting stuff now.

So I thought I’d try to list what the choices are, and what their main pros and cons are:

A priori, global
Also known as de jure standardisation. Experts, user representatives and possibly vendor representatives get together to codify whole or part of a service interface between systems that are emerging or don’t exist yet; it can concern either the syntax, semantics or transport of data. Intended to facilitate the building of innovative systems.
Pros:

Has the potential to save a lot of money and time in systems development

Facilitates easy, cheap integration

Facilitates structured management of network over time

Cons:

Viability depends on the business model of all relevant vendors

Fairly unlikely to fit either actually available data or integration needs very well

A priori, local
i.e. some type of Service Oriented Architecture (SOA). Local experts design an architecture that codifies syntax, semantics and operations into services. Usually built into agents that connect to each other via an ESB.
Pros:

Can be tuned for locally available data and to meet local needs

Facilitates structured management of network over time

Speeds up changes in the network (relative to ad hoc, local)

Cons:

Requires major and continuous governance effort

Requires upfront investment

Integration of a new system still takes time and effort

Ad hoc, local
Custom integration of whatever is on an institution’s network by the institution’s experts in order to solve a pressing problem. Usually built on top of existing systems using whichever technology is to hand.
Pros:

Solves the problem of the problem owner fastest in the here and now.

Results accurately reflect the data that is actually there, and the solutions that are really needed

Cons:

Non-transferable beyond local network

Needs to be redone every time something changes on the local network (considerable friction and cost for new integrations)

Can create hard to manage complexity

Ad hoc, global
Custom integration between two separate systems, done by one or both vendors. Usually built as a separate feature or piece of software on top of an existing system.
Pros:

Fast point-to-point integration

Reasonable to expect upgrades for future changes

Cons:

Depends on business relations between vendors

Increases vendor lock-in

Can create hard to manage complexity locally

May not meet all needs, particularly cross-system BI

Post hoc, global
Also known as standardisation, consortium style. Service provider and consumer vendors get together to codify a whole service interface between existing systems; syntax, semantics, transport. The resulting specs usually get built into systems.
Pros:

Facilitates easy, cheap integration

Facilitates structured management of network over time

Cons:

Takes a long time to start, and is slow to adapt

Depends on business model of all relevant vendors

Liable to fit either available data or integration needs poorly

Clearly, no approach offers instant nirvana, but it does make me wonder whether there are ways of combining approaches such that we can connect short term gain with long term goals. I suspect if we could close-couple what we learn from ad hoc, local integration solutions to the design of post-hoc, global solutions, we could improve both approaches.

What’s more effective than taking two days out and focus on a new practice with peers and experts?

Following the JISC’s FSD programme, an increasing number of UK Universities started to use the ArchiMate Enterprise Architecture modelling language. Some people have had some introductions to the language and its uses, others even formal training in it, others still visited colleagues who were slightly further down the road. But there was a desire to take the practice further for everyone.

For that reason, Nathalie Czechowski of Coventry University took the initiative to invite anyone with an interest in ArchiMate modelling (not just UK HE), to come to Coventry for a concentrated two days together. The aims were:

1) Some agreed modelling principles

2) Some idea whether we’ll continue with an ArchiMate modeller group and have future events, and in what form

3) The models themselves

With regard to 1), work is now underway to codify some principles in a document, a metamodel and an example architecture. These principles are based on the existing Coventry University standards and the Twente University metamodel, and the primary aim of them is to facilitate good practice by enabling sharing of, and comparability between, models from different institutions.

With regard to 2), the feeling of the ‘bash participants was that it was well worth sustaining the initiative and organise another bash in about six months’ time. The means of staying in touch in the mean time have yet to be established, but one will be found.

As to 3), a total of 15 models were made or tweaked and shared over the two days. Varying from some state of the art, generally applicable samples to rapidly developed models of real life processes in universities, they demonstrate the diversity of the participants and their concerns.

All models and the emerging community guidelines are available on the FSD PBS wiki.

While most of Europe was on the beach, a dedicated group of QTI vendors gathered in Koblenz, Germany to demo what a standard should do: enable interoperability between a variety of software tools.

A total of twelve tools were demonstrated for the attendees of the IMS quarterly meeting that was being held at the University of Koblenz-Landau. The vendors and projects ranged from a variety of different communities in Poland, Korea, France, Germany and the UK, and their tools included:

All other things being equal, the combination of such a diversity of purposes with the comprehensive expressiveness of QTI, means that there is every chance that a set of twelve tools will implement different, non-overlapping subsets of the specification. This is why the QTI working group is currently working on the definition of two profiles: CC (Common Cartridge) QTI and what is provisionally called the Main profile.

The CC QTI profile is very simple and follows the functionality of the QTI 1.2 profile that is currently used in the IMS Common Cartridge educational content exchange format. Nine out of the twelve tools had implemented that profile, and they all happily played, edited or validated the CC QTI reference test.

With that milestone, the group is well on the way to the final, public release of the QTI 2.1 specification. Most of the remaining work is around the definition of the Main profile.

Initial discussion in Koblenz suggested an approach that encompasses most of the specification, with the possible exclusion of some parts that are of interest to some, but not all subjects or communities. To make sure the profile is adequate and implementable, more input is sought from publishers, qualification authorities and others with large collections of question and test items. Fortunately, a number of these have already come forward.

Fortunately, almost all the information required is already available as RDF: the ASN makes its machine readable curricula available in that format, and Zotero (my bibliography tool of choice) happily puts out its data in RDF too. What still needed to be done was the ability to turn LEAP2a eportfolios into RDF.

That took some doing, but since LEAP2a is built around the IETF Atom newsfeed format, there were at least some existing XSL transformations to build on. I settled on the one included in the open source OpenLink Virtuoso data management server, since that’s what I used for the subsequent Linked Data meshing too. Also, the OpenLink Virtuoso Atom-to-RDF XSLT came out of their ‘sponger’ middleware layer, which allows you to treat all kinds of structured data as if they were RDF datasources. That means that it ought to be possible to built a wee LEAP2a sponger cartridge around my leap2rdf.xslt, which then allows OpenLink Virtuoso to treat any LEAP2a portfolio as RDF.

The result still has limitations: the leap2rdf.xslt only works on LEAP2a records with the new, proper namespace, and it only works well on those records that use URIs, but not those that use Compact URIs (CURIEs). Fixing these things is perfectly possible, but would take two or three more days that I didn’t have.

So, having spotted my ponds of RDF triples and filled one up, it’s time to go fishing. But for what and why?

Nigel Ward and Nick Nicholas of Link Affiliates have done an excellent job in explaining the why of machine readable curriculum data, so I’ve taken the immediate advantages that they identified, and illustrated them with noddy proof-of-concept hows:

1. Learning resources can be easily and unambiguously tagged with relevant learning outcomes.
For this one, I made a query that looks up a work (Robinson Crusoe) in my Zotero bibliographic database and gets a download link for it, then checks whether the work supports any known learning outcomes (in my own 6-lines-of-RDF repository), and then gets a description of that learning outcome from the ASN. You can see the results in CSV.

It ought to have been possible to use a bookmarking service for the learning resource to learning outcome mapping, but hand writing the equivalent of

‘this book’ ‘aligns to’ ‘that learning outcome’

seemed easier

2. A student’s progress can be easily and unambiguously mapped to the curriculum.
To illustrate this one, I’ve taken Theophilus Thistledown’s LEAP2a example portfolio, and added some semi-appropriate Californian K-12 learning outcomes from the ASN against the activities Theophilus recorded in his portfolio. (Anyone can add such ASN statements very easily and legally within the scope of the LEAP2a specification, by the way) I then RDFised the lot with my leap2rdf XSLT.

I queried the resulting RDF portfolio to see what learning outcomes were supported by one particular learning activity, and I then got descriptions of each of these learning outcomes from the ASN, and also got a list of other learning outcomes that belong to the same curriculum standard. That is, related learning outcomes that Theophilus could still work on. This is what the SPARQL looks like, and the results can be seen here. Beware that a table is not the most helpful way of presenting this information- a line and a list would be better.

3. Lesson plans and learning paths to be easily and unambiguously mapped to the curriculum.
This is what I think of as the classic case: I’ve taken an RDFised, ASN enhanced LEAP2a eportfolio, and looked for the portfolio owner’s name, any relevant activities that had a learning outcome mapped against them, then fished out the identifier of that learning outcome and a description of same from the ASN. Here’s the SPARQL, and there’s the result in CSV.

Together, these give a fairly good of what Robinson Crusoe was up to, according to the Californian K-12 curriculum, and gives a springboard for further exploration of things like comparison of the learning outcomes he aimed for then with later statements of the same outcome or the links between the Californian outcomes with those of other jurisdictions.

4. The curriculum can drive content discovery: teachers and learners want to find online resources matching particular curriculum outcomes they are teaching.
While sitting behind his laptop, Robinson might be wondering whether he can get hold of some good learning resources for the learning activities he’s busy with. This query will look at his portfolio for activities with ASN learning outcomes and check those outcomes against the outcome-to-resource mapping repository I mentioned earlier. It will then look up some more information about the resources from the Zotero bibliographic database, including a download link- like so

The nice thing is that this approach should scale up nicely all the way from my six lines of RDF to a proper repository.

5. Other e-learning applications can be configured to use the curriculum structure to share information.
A nice and simple example could be a tool that lets you discover other learners with the same learning outcome as a goal in their portfolio. This sample query looks through both Theophilus and Robinson’s eportfolios, identifies any ASN learning outcomes they have in common, and then gets some descriptions of that outcome from the ASN, with this result.

Lessons learned

Of all the steps in this and other meshups, deriving decent RDF from XML is easily the hardest and most time consuming. Deriving RDF from spreadsheets or databases seems much easier, and once you have all your source data in RDF, the rest is easy.

Even using the distributed graph pattern I described in a previous post, querying across several datasets can still be a bit slow and cumbersome. As you may have noticed if you follow the sample query links, uriburner.com (the hosted version of OpenLink Virtuoso) will take it’s time in responding to a query if it hasn’t got a copy of all relevant datasets downloaded, parsed and stored. Using a SPARQL endpoint on your own machine clearly makes a lot of sense.

Perhaps more importantly, all the advantages of machine readable curricula that Nigel and Nick outlined are pretty easily achievable. The queries and the basic tables they produce took me one evening. The more long term advantages Nigel and Nick point out – persistence of curricula, mapping different curricula to each other, and dealing with differences in learning outcome scope – are all equally do-able using the linked data stack.

Most importantly, though, are the meshups that no-one has dreamed of yet.

What’s next

For other people to start coming up with those meshups, though, some further development needs to happen. For one, the leap2rdf.xslt needs to deal with a greater variety of LEAP2a eportfolios. A bookmark service that lets you assert simple triples with tags, and expose those triples as RDF with URIs (rather than just strings) would be great. The query results could look a bit nicer too.

The bigger deal is the data: we need more eportfolios to be available in either LEAP2a or LEAP2r formats as a matter of course, and more curricula need be described using the ASN.

Beyond that, the trickier question is who will do the SPARQL querying and how. My sense is that the likeliest solution is for people to interact with the results of pre-fabbed SPARQL queries, which they can manipulate a bit using one or two parameters via some nice menus. Perhaps all that the learners, teachers, employers and others will really notice is more relevant, comprehensive and precisely tailored information in convenient websites or eportfolio systems.

I wanted to demo my meshup of a triplised version of CETIS’ PROD database with the impressive Linked Data Research Funding Explorer on the Linked Data meetup yesterday. I couldn’t find a good slot, and make my train home as well, so here’s a broad outline:

The data

The Department for Business Innovation and Skills (BIS) asked Talis if they could use the Linked Data Principles and practice demonstrated in their work with data.gov.uk to produce an application that would visualise some grant data. What popped out was a nice app with visuals by Iconomical, based on a couple of newly available data sets that sit on Talis’ own store for now.

The data concerns research investment in three disciplines, which are illustrated per project, by grant level and number of patents, as they changed over time and plotted on a map.

CETIS have PROD; a database of JISC projects, with a varying amount of information about the technologies they use, the programmes they were part of, and any cross links between them.

The goal

Simple: it just ought to be possible to plot the JISC projects alongside the advanced tech of the Research Funding Explorer. If not, than at least the data in PROD should be augmentable with the data that drives the Research Funding Explorer.

For one, though PROD pushes out Description Of A Project (DOAP, an RDF vocabulary) files per project, it doesn’t quite make all of its contents available as linked data right now. The D2R toolkit was used to map (part of) the contents to known vocabs, and then make the contents of a copy of PROD available through a SPARQL interface. Bang, we’re on the linked data web. That was easy.

Since I don’t have access to the slick visualisation of the Research Funding Explorer, I’d have to settle for augmenting PROD’s data. This is useful for two reasons: 1) PROD has rather, erm, variable institutional names. Synching these with canonical names from a set that will go into data.gov.uk is very handy. 2) PROD doesn’t know much about geography, but Talis’ data set does.

To make this work, I made a SPARQL query that grabs basic project data from PROD, and institutional names and locations from the Talis data set, and visualises the results.

Zooming in on a project, this time to show the attributes of a single project. Still in OpenLink RDF browser.

A project in D2R’s web interface; not shiny, but very useful.

From blagging a copy of the SQL tables from the live PROD database to the screen shots above took about two days. Opening up the live server straight to the web would have cut that time by more than half. If I’d have waited for the Research Grant Explorer data to be published at data.gov.uk, it’d have been a matter of about 45 minutes.

Lessons learned

Opening up any old database as linked data is incredibly easy.

Cross-searching multiple independent linked data stores can be surprisingly difficult. This is why a single SPARQL endpoint across them all, such as the one presented by uberblic‘s Georgi Kobilarov yesterday, is interesting. There are many other good ways to tackle the problem too, but whichever approach you use, making your linked data available as simple big graphs per major class of thing (entity) in your dataset helps a lot. I was stymied somewhat by the fact that I wanted to make use of data that either wasn’t published properly yet (Talis’ research grant set), or wasn’t published at all (our own PROD triples).

A bit of judicious SPARQLing can alleviate a lot of inconsistent data problems. This is salient to a recent discussion on twitter around Brian Kelly’s Linked Data challenge. One conclusion was that it was difficult, because the data was ‘bad’. IMHO, this is the web, so data isn’t really bad, just permanently inconsistent and incomplete. If you’re willing to put in some effort when querying, a lot can be rectified. We, however, clearly need to clean up PROD’s data to make it easier on everyone.

SPARQL-panning for gold in multiple datastores (or even feeds or webpages) is way too much fun to seem like work. To me, anyway.

What’s next

What needs to happen is to make all the contents of PROD and related JISC project information available as proper linked data. I can see three stages for this:

We clean up the PROD data a little more at source, and load it into the Data Incubator to polish and debate the database to triple mapping. Other meshups would also be much easier at that point.

We properly publish PROD as linked data either on a cloud platform such as Talis’, or else directly from our own server via D2R or OpenLink Virtuoso. Simal would be another great possibility for an outright replacement of PROD, if it’s far enough along at that point.

JISC publishes the public part of its project information as Linked Data, and PROD just augments (rather than replicates) it.