Snapshots of data were taken from BRAD, HR and HESA and dumped into CSV and spreadsheet format. I have read access to SITS data at table level, which made it extremely easy to get all the data required for the CERIF schema (SITS was used to extract data on research students to enable reporting on supervision).

We used a sample report to determine which tables to use in the CERIF schema (time constraints meant that we wouldn't be able to do a full mapping); see below for the diagram of the tables used in our project.

I then modelled this data by creating a spreadsheet model of the CERIF schema. This was the quickest way to get the data into the CERIF schema format given the time restrictions on the project. If this were put into production, automatic data extraction from the data sources would be explored.

There were several issues with Publications when translating into the CERIF schema:

- Different field sizes
- Date formats needed tweaking
- Null values in the sample data needed dummy defaults set

The data was then imported into a MySQL database (installed on the BRUCE server).

CERIF is predominantly relational, but not purely so, and the semantics took a while to understand. The link tables and class tables feel somewhat object-oriented, and for me they didn't quite make sense at first, but they build flexibility into the schema and there is scope to use them in many ways. Because our time was restricted we didn't spend too long analysing how to use them: Richard and I came up with the BRUCE model in a couple of hours and went with it. Once the entities were defined in Solr, we found that we had to populate some of the link/class tables with unnecessary duplication of data just to get the interface to work. Although this goes against my understanding of how relational data works, the indexer was really fast, especially when we used the test data; the Brunel data snapshot is quite small, so it wouldn't determine speed efficiency.

If this prototype were developed into production, we would need to analyse the data structures and mappings in more depth. We have all the data online, and automation from the various systems could be developed so that the data amalgamation would be seamless. Time constraints meant that reporting was only touched upon: I used an open-source tool, DataVision, to connect directly to the database and produce a very simple report.

Reporting would be a separate component using data extracted from the queries in the interface and pushed into a temporary table; this would need more development and thought to achieve.

Part of the challenge of the BRUCE project is to take a highly relational model like CERIF and convert it into something which can be adequately indexed for searching and faceting.

Apache Solr, like many traditional search engines, works on the principle of key-value pairs. A key-value pair is simply an assertion that some value (on the right) is associated with some key (on the left). Examples of key-value pairs are:

name : Richard
project : bruce
organisation : Brunel University

Typically, the keys on the left come from a set of known terms, while the values on the right can vary arbitrarily. Therefore, when you search for documents belonging to “Richard”, you are asking which documents have the value “Richard” associated with the key “name”.

In addition, keys are often repeatable (although depending on the search index schema this might not always be the case), so you could have multiple "name" keys with different values.

Approach

The objective, then, is for us to convert the graph-like structure of CERIF (that is, it has entities and relationships which do not follow a hierarchy) into the flat key-value structure of a search index. It should be clear from the outset, therefore, that data-loss will necessarily result from this conversion; it is not possible to fully and adequately represent a graph as a set of key-value pairs.

The project aimed, instead, to extract the key information from the CERIF schema from the point of view of one of the Base Entities.

There are 3 Base Entities in CERIF: Publications, People and Organisational Units. Since BRUCE is concerned with reporting principally on staff, we selected People as the Base Entity from which we would view the CERIF graph. By doing this we reduce the complexity of the challenge, since a graph viewed from the point of view of one of its nodes behaves like a hierarchy at least in the immediate vicinity (see the real analysis of this, below, for a clear example).

Our challenge is then simplified to representing a tree structure as a set of key-value pairs.
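As an illustrative sketch of what "representing a tree as key-value pairs" means in practice (the nested structure and field names here are hypothetical, not project code), flattening a person-centred tree might look like this in Python:

```python
# Illustrative sketch only: flatten a person-centred tree into the
# repeatable key-value pairs a search engine expects.
# The nested structure and field names are hypothetical.

def flatten(tree):
    """Walk a dict/list tree and emit (key, value) pairs,
    discarding the grouping that related the values."""
    pairs = []
    for key, value in tree.items():
        if isinstance(value, dict):
            pairs.extend(flatten(value))
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    pairs.extend(flatten(item))
                else:
                    pairs.append((key, item))
        else:
            pairs.append((key, value))
    return pairs

person = {
    "name": "Richard",
    "publications": [
        {"title": "My Publication", "publication_date": "01-09-2008"},
        {"title": "Another Publication", "publication_date": "15-03-2010"},
    ],
}

pairs = flatten(person)
# The keys repeat: two "title" pairs and two "publication_date" pairs,
# but the link between each title and its date has been lost.
```

Note that the flattening deliberately throws away the grouping of the nested records, which is exactly the data loss discussed below.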

The second trick is to decide what kind of information we actually want to report on, and narrow our indexing to fields in the CERIF schema which are relevant to those requirements. This allows us to index values which are actually closely related to each other as totally separate key-value pairs: as long as the index provides enough information for searching and faceting, it won't matter that information about their relationship to each other is lost.

For example: suppose we want to index the publications associated with a person, and we want to be able to list those publications as well as providing an integer count of how many publications were published by that person in some time frame. Initially this might look quite difficult, as a “publication” is a collection of related pieces of information, such as the title, the other authors, the date of publication, and other administrative terms such as page counts and so on. To place this in a set of key-value pairs would require us to do something like:

title: My Publication
publication_date: 01-09-2008
pages: 10

This is fine if there is only one publication by the person, but if they have multiple publications it would not be possible to tell which publication_date was associated with which title.

Instead, we have to remember that this is an index and not a data store. If we wish to list publication titles and count publications within date ranges, then it is just necessary for us to index the titles and the dates separately and ensure that they are used separately within the index. So we may have:

title: My Publication
title: Another Publication
publication_date: 01-09-2008
publication_date: 15-03-2010

This configuration loses data by not maintaining the links between publication_date and title, but is completely adequate for the indexing and faceting requirements.

To meet our original requirement stated above we can just count the number of publication_date keys which contain a date which lies within our desired time frame and return this integer count, while simultaneously listing the titles of the publication. The fact that these two pieces of information are not related in the index makes no difference in producing the desired outcome.
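As a sketch of this counting idea (plain Python standing in for a real Solr facet query; the document contents are illustrative):

```python
# Sketch of the counting idea, using plain Python in place of a real
# Solr facet query. The flat "document" below mirrors the repeated
# key-value pairs described above; the values are illustrative.
from datetime import date

document = {
    "title": ["My Publication", "Another Publication", "A Third"],
    "publication_date": [date(2008, 9, 1), date(2010, 3, 15), date(2011, 1, 2)],
}

def count_in_range(doc, start, end):
    """Count publication_date values falling within [start, end]."""
    return sum(1 for d in doc["publication_date"] if start <= d <= end)

# List the titles and count publications in 2008-2010 independently:
titles = document["title"]
count = count_in_range(document, date(2008, 1, 1), date(2010, 12, 31))
# count == 2, even though the index cannot say which two titles those dates belong to
```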

CERIF schema

The CERIF schema that we are working with is a limited sub-set used for the project, and has been presented in a previous post. The tables which describe the graph contain the following fields that we are interested in:

With the exception of the Org Unit data (marked with **), the result is a straightforward enough hierarchy. We can avoid considering the graph that emerges under the organisation unit data by ensuring that the cfPers_OrgUnit table contains all the relevant relationships that we want to consider during indexing, so that we don’t have to attempt to index the org unit graph when preparing an index from the perspective of the person.

Solr index

The Solr index allows us to specify a field name (the key, in the key-value pair), and whether that field is repeatable or not. Each set of key-value pairs is grouped together into a “document”, and that document will represent a single person in the CERIF dataset, along with all the relevant data associated with them. When we have fully built our index, there will be one document per person.
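For illustration, field declarations of this kind might appear in a Solr schema.xml along the following lines (a hedged sketch, not the project's actual schema file; field types are assumptions):

```xml
<!-- Hedged sketch of schema.xml field declarations;
     not the project's actual schema file. -->
<fields>
  <field name="entity" type="string" indexed="true" stored="true" multiValued="false"/>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="name" type="string" indexed="true" stored="true" multiValued="false"/>
  <field name="name_variants" type="text" indexed="true" stored="true" multiValued="true"/>
  <field name="publication_date" type="date" indexed="true" stored="true" multiValued="true"/>
</fields>
<uniqueKey>id</uniqueKey>
```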

The Solr index which then meets our requirements is constructed from the above CERIF data as follows:

| Field | Single/Multi | Value | Notes |
| --- | --- | --- | --- |
| entity | single | "cfPers" | Indicates that this is a person-oriented document. This allows us to extend the index to view other kinds of entity as well, all represented within one schema. |
| id | unique | cfPersId | A unique id representing the entity. When other entities are included in the index, this could also be their ids (e.g. cfResPublId). |
| gender | single | cfGender | |
| name | single | a combination of cfFirstNames, cfOtherNames and cfFamilyNames | This is the first person name encountered in the database; it is used for sorting and presented as the author's actual name. There is another field for name variants. |
| name_variants | multi | a combination of cfFirstNames, cfOtherNames and cfFamilyNames | Allows us to hold multiple names for the author for the purposes of searching, although they will not be used for sorting or presented to the end user. |
| contract_end | single | cfOrgUnit/cfEndDate | Taken from the cfEndDate field in the cfPers_OrgUnit table which is tagged by cfClassId as Employee. |
| funding_code | multi | cfFundId | |
| org_unit_name | multi | cfOrgUnit/cfName | |
| org_unit_id | multi | cfOrgUnit/cfOrgUnitId | |
| primary_department | single | cfOrgUnit/cfName | Differs from org_unit_name in that it is the department the person should be considered most closely affiliated with, for example their department or research group. It is used specifically for display and sorting, which is why it may only be single-valued. |
| primary_department_id | single | cfOrgUnit/cfOrgUnitId | The id for the department contained in primary_department. |
| primary_position | single | cfOrgUnit/cfClassId | The position the person holds in their primary department (e.g. "Lecturer"). |
| fte | single | cfOrgUnit/cfFraction | The fraction of the time the person works for their organisational unit which is tagged with a cfClassId of Employee. |
| supervising | multi | cfPers_Pers/cfPersId2 | The ids of the people the person is supervising, identified by cfPers_Pers relationships with a cfClassId of Supervising. |
| publication_date | multi | cfResPubl/cfResPublDate | The dates on which the person published any result publications. This is a catch-all for all types of publication; individual publication types are broken down in the following index fields. |
| publication_id | multi | cfResPubl/cfResPublId | The ids of all the publications of any kind which the person published. |
| journal_date | multi | cfResPubl/cfResPublDate | The dates of publication of all publications which have a cfClassId of "Journal Article". |
| journal_id | multi | cfResPubl/cfResPublId | The ids of publications which have a cfClassId of "Journal Article". |
| book_date | multi | cfResPubl/cfResPublDate | The dates of publication of all publications which have a cfClassId of "Book". |
| book_id | multi | cfResPubl/cfResPublId | The ids of publications which have a cfClassId of "Book". |
| chapter_date | multi | cfResPubl/cfResPublDate | The dates of publication of all publications which have a cfClassId of "Inbook". |
| chapter_id | multi | cfResPubl/cfResPublId | The ids of publications which have a cfClassId of "Inbook". |
| conference_date | multi | cfResPubl/cfResPublDate | The dates of publication of all publications which have a cfClassId of "Conference Proceedings Article". |
| conference_id | multi | cfResPubl/cfResPublId | The ids of publications which have a cfClassId of "Conference Proceedings Article". |

These terms are encoded in a formal schema for Solr which can be found here.

Data Import

Apache Solr provides what it calls “Data Import Handlers” which allow you to import data from different kinds of sources into the index. Once we have configured the index as per the previous section we can construct a Data Import Handler which will import from the CERIF MySQL database.

This is effectively a set of SQL queries which are used to populate the index fields in the ways described in the previous section. Representative examples of the kinds of query involved are described below.

The query at the root of the Data Import Handler selects our cfPersId, which will be the central identifier used to retrieve all other information, along with any information which we can quickly and easily obtain by performing a JOIN operation across the cfPers* tables.
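A hedged sketch of what such a root query might look like (the cfPersName table and its join are assumptions, not the project's exact SQL):

```sql
-- Hedged sketch only: the root entity query of the Data Import Handler.
-- The cfPersName table and its join are assumptions, not the project's exact SQL.
SELECT p.cfPersId,
       p.cfGender,
       CONCAT_WS(' ', n.cfFirstNames, n.cfOtherNames, n.cfFamilyNames) AS name
FROM   cfPers p
       LEFT JOIN cfPersName n ON n.cfPersId = p.cfPersId;
```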

A second query selects the ids and dates of publications by the selected person which have a class of 'Journal Article'.
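A hedged sketch of that kind of query (the cfPers_ResPubl link table, the class join and the DIH entity variable are assumptions, not the project's exact SQL):

```sql
-- Hedged sketch only: journal publications for one person.
-- The cfPers_ResPubl link table, the class join and the
-- ${cfPers.cfPersId} DIH variable are assumptions.
SELECT rp.cfResPublId,
       rp.cfResPublDate
FROM   cfPers_ResPubl pr
       JOIN cfResPubl rp ON rp.cfResPublId = pr.cfResPublId
       JOIN cfResPubl_Class rc ON rc.cfResPublId = rp.cfResPublId
WHERE  pr.cfPersId = '${cfPers.cfPersId}'
  AND  rc.cfClassId = 'Journal Article';
```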

Here we will not go into this at any further length; instead the code which provides the Data Import functionality can be obtained here.

It is probably worth noting, though, that these queries are quite long and involve JOINing across multiple database tables, which makes reporting on the data hard work if done directly from source. The BRUCE approach means that this is all compressed into one single Data Import Handler, and leaves all the exciting stuff to the much simpler search engine query.

Use of the index

Once we have produced the index, we feed it into SolrEyes (discussed in more detail here) which is configured to produce the following functionality based on the indexed values:

| Field | Usage |
| --- | --- |
| entity | facet |
| id | unused, required for index only |
| gender | facet, result display |
| name | sort, result display |
| name_variants | currently unused |
| contract_end | facet, sort, result display |
| funding_code | result display |
| org_unit_name | currently unused |
| org_unit_id | currently unused |
| primary_department | sort, result display |
| primary_department_id | currently unused |
| primary_position | facet, result display |
| fte | facet, sort, result display |
| supervising | result display (a function presents the number of people being supervised by the person) |
| publication_date | facet, result display (a function counts the number of publications between the date ranges specified by the facet) |
| publication_id | currently unused |
| journal_date | result display (a function counts the number of journal articles between the date ranges specified by the publication_date facet) |
| journal_id | currently unused |
| book_date | result display (a function counts the number of books between the date ranges specified by the publication_date facet) |
| book_id | currently unused |
| chapter_date | result display (a function counts the number of book chapters between the date ranges specified by the publication_date facet) |
| chapter_id | currently unused |
| conference_date | result display (a function counts the number of conference proceedings articles between the date ranges specified by the publication_date facet) |
| conference_id | currently unused |

Key:

- facet: used to create the faceted browse navigation
- result display: used when presenting a "document" to the user; sometimes the value is a function of the actual indexed content
- sort: used for sorting the result set

Note that a more thorough treatment of the Solr index would split the fields up into multiple indexed fields which are customised for their purposes, but that we have not done this in the prototype. For example, fields used for sorting will go through normalising functions to ensure consistent sorting across all values, while displayable values will be stored unprocessed.
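To illustrate what such a normalising function might do (an illustrative Python sketch, not part of the prototype), a sort key could lowercase values, strip diacritics and collapse whitespace so that differently entered names sort consistently:

```python
# Illustrative sketch of a sort-key normaliser; not part of the prototype.
# Lowercases, strips diacritics and collapses whitespace so that values
# sort consistently regardless of how they were entered.
import unicodedata

def sort_key(value):
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(ascii_only.lower().split())

names = ["  Ångström, Anders", "angstrom, anna", "Zed, Z"]
ordered = sorted(names, key=sort_key)
# "Ångström, Anders" now sorts before "angstrom, anna" by plain comparison,
# which a raw byte-wise sort would not achieve
```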

We can now produce a user interface like that shown in the screen shot below.

The approach used here could be extended to more features of the person Base Entity; other Base Entities (and, indeed, any entity in the CERIF model) could also be placed at the centre of the report, their resulting hierarchy of properties mapped into a set of key-value pairs, and all could co-exist comfortably in the same search index.

Objective

At this stage in the project our main objective is to implement a “vertical” slice of the research reporting process, by taking some source data, mapping it into CERIF, storing it in a CERIF compliant database and then indexing that data with Apache Solr for display and interaction via Blacklight, which will ultimately be used to generate reports on the research information. There are a number of challenges involved in this process:

How to map the data sources such as HESA, SITS, HR and Publications data into CERIF. In some cases there will be clear mappings, in others some creativity may be required, and in yet others it may not be possible.

How to turn the complex relational schema that is CERIF into a flat, indexable, set of key/value pairs which can be used by Solr and make sense to the user of the reporting software

How to configure Solr

How to configure Blacklight

Status Update

At the moment we have the following technical outputs from the project:

A theoretical mapping from the datasources to CERIF (not yet implemented)

A set of Solr configuration files and data importers which relate the MySQL CERIF database to a set of flat key/value pairs which meet the requirements of the project’s exemplar report. No general configuration has been produced for CERIF yet, as we are focussed on this specific vertical.

Some installation and configuration experience with Blacklight. We have done a number of demonstrations of Blacklight to investigate what the final interface will look like, but as yet no realistic data has been presented through it.

A high-spec dedicated project server with the capacity for storing and processing the large quantities of data that will be generated throughout has been installed and is ready to start working with the data.

Experiences with CERIF

Overall, mapping data to and from CERIF has not been too troublesome. It is a relational standard, which means that flattening it for Solr has been a bit tricky (more on that later). In addition, it does not always have clear ways of representing the data we want to represent, and it appears that the Semantic Layer is where most of the complexity will ultimately reside.

Experiences with Solr

Solr has been reliable (if complex to configure) throughout the process, and the project team is now comfortable and confident that it meets most if not all of the requirements that will be placed on it.

Experiences with Blacklight

Blacklight has so far been the weak link in the project. It is extremely difficult to install and configure, and no two installations go the same way so a large amount of time has been sunk in trying to make it work at all. It is partly for this reason that the project is not yet displaying the data from Solr in Blacklight.

Flattening CERIF for Solr

As CERIF is a relational format, flattening it for indexing by Solr has been a careful task for the project. We cannot represent all of the data in the CERIF database exactly as it appears in MySQL, since Solr does not strictly have the relational qualities of a database.

Instead we have begun to construct Solr documents (effectively object classes) which are designed to meet the reporting requirements. That is, for our exemplar report (see linked presentation), which is focussed on individuals, we create Solr documents which have the person as the key entity, and we add to the document extensive information about the organisational units that the person is part of, their publications, and so on.

Later we will construct documents which are designed to meet other reporting requirements, and may therefore be organisation or publication oriented. With a well designed Solr schema, all these different documents will co-exist comfortably side-by-side in the index, and we’ll be able to generate a variety of different kinds of report based on that data.

Next Steps

Finalise the datasource mappings to CERIF

Harden the CERIF to Solr indexing process based on the final datasource mappings

Get Blacklight to behave

Generate reports from search results. The project is looking at Prawn, a Ruby library which can generate PDFs of the results.

The BRUCE project aims to develop a prototype tool, based on CERIF, that will facilitate the analysis and reporting of research information from data sources that are already in use at the majority of HEIs.

What does that actually mean? HEIs already collect lots of information about the research that they do but that information tends to be stored in separate silos, e.g. the HR database, the institutional repository, student records, etc. The idea of BRUCE is to pull that data out of those silos, index it using CERIF and then create a new tool that can analyse and report on that data.

The project is funded by JISC under the Research Information Management strand of the Infrastructure for Education and Research Programme (JISC Grant Funding 15/10) and you can read the full project proposal on the JISC website.

We will use this blog to keep you up to date on our progress with the project which will run from February to July 2011.