Monday, 9 August 2010

I recently submitted an article to Sconul Focus which, I hope, will be published in a few months' time. The topic of the article was data harvesting and aggregation: a simple and (hopefully) easy-to-read explanation of a rather complex topic. I used our entity registry as an example and described how it harvests data from University and external sources, how it converts sources into RDF format so data can be aggregated, and finally how it uncovers hidden connections. I thought this last step, uncovering hidden connections, was a very interesting and value-adding process, so I will blog about it here.

When we collect data from different sources, there is a chance that some of these data are interconnected. Different sources may contain replicated data, e.g. a researcher's profile can be found on both his college's and his department's websites. Different sources may also contain data that complement each other, for example a researcher's profile on his college website and a list of his projects and publications on his departmental website. Originally, these webpages do not include links to each other, so unless we know about the other source we will not get a complete profile for that researcher. Add more sources, external ones too (e.g., funders' websites containing information about grants), and we get information about researchers which is spread all over the place but disconnected.

Uncovering hidden connections means making connections between data in different sources more evident, for instance by adding links between the sources so that each one points to the others. But how do we create those links? Here is a non-technical introduction:

For example, if we get Prof Francis Matthew Kellner's profile in source 1 and Prof Francis Matthew Kellner's research interests and publications in source 2, we can establish with some level of certainty that these two sources refer to the same person. Therefore we can connect the data in these two sources and build a more complete record containing Prof Kellner's profile, research interests and publications.
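For the more technically minded, the straightforward case can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names and the function `link_by_exact_name` are made up for this post and are not the registry's actual code.

```python
from collections import defaultdict

# Hypothetical records; field names and values are placeholders.
records = [
    {"source": "source 1", "name": "Prof Francis Matthew Kellner",
     "profile": "profile text from source 1"},
    {"source": "source 2", "name": "Prof Francis Matthew Kellner",
     "research_interests": "interests from source 2",
     "publications": "publication list from source 2"},
]


def link_by_exact_name(records):
    """Merge records whose full names are exactly the same."""
    by_name = defaultdict(list)
    for record in records:
        by_name[record["name"]].append(record)

    profiles = {}
    for name, group in by_name.items():
        merged = {"name": name, "sources": []}
        for record in group:
            # Remember where each piece of data came from.
            merged["sources"].append(record["source"])
            for key, value in record.items():
                if key not in ("name", "source"):
                    merged[key] = value
        profiles[name] = merged
    return profiles


profiles = link_by_exact_name(records)
```

The merged entry keeps a list of the sources it was built from, which is the "link" between the two originally disconnected pages.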

There are, however, other cases which are not so straightforward, where names are similar but we cannot be sure they belong to the same person. For example, we might get Prof Francis Kellner’s biography in source 1, Prof F. M. Kellner listed as Principal Investigator on a project in source 2, and Francis M. Kellner as an author of publications in source 3. How do we know if these data belong to the same person?

For cases like these we have developed a ‘same-as’ process.

‘Same-as’ has a set of rules which use information such as people’s first name and surname, researchers’ affiliation and email. Depending on the availability of information and whether the sets of data match, ‘same-as’ determines if two or more records belong to the same person or not. If the records belong to the same person, ‘same-as’ merges the records. If the information available is not enough to do the matching, or if the data do not match, ‘same-as’ will keep the records separate.

The following is the logic used. This is a technical-ish explanation written by Anusha.

Search for people with the same last name who are part of the same group of sources, e.g. sources belonging to the social sciences (we group sources together based on the likelihood of their information overlapping).

Case 1: Match people only if each person has at least the fields below and they all match.

first name (not just initials)

last name

affiliation

source(s)

If the source is a trusted source:

staff_id

Note: A subset of the first name will be matched, for example John P M, John P, John.

Create a person superset and add all of their info.
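The note about first-name subsets can be read in more than one way. The sketch below assumes it means that one name must be a leading part of the other (so "John P" matches "John P M" but not "John Q"), and that a lone initial does not count as a first name for Case 1. The function name `first_names_compatible` is hypothetical.

```python
def first_names_compatible(a, b):
    """Return True if one first-name string is a leading part of the other."""
    tokens_a = a.lower().replace(".", " ").split()
    tokens_b = b.lower().replace(".", " ").split()
    if not tokens_a or not tokens_b:
        return False
    # Case 1 needs a real first name, not just an initial.
    if len(tokens_a[0]) < 2 or len(tokens_b[0]) < 2:
        return False
    shorter, longer = sorted((tokens_a, tokens_b), key=len)
    # Assumption: "subset" means the shorter name is a prefix of the longer one.
    return longer[:len(shorter)] == shorter
```

With this reading, `first_names_compatible("John P M", "John")` is true, while `first_names_compatible("J", "John")` is false because the first record only has an initial.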

Case 2: For people with information in at least the following fields, with all of them matching:

initials

last name

affiliation

source

Create a new person superset and add them to that. Treat them as a separate person and do not add them to the person above.

If the person matches the information above, suggest a connection.

Case 3: For people with information in at least the following fields, with all of them matching:

initials

last name

source

Create a new person superset and add them to that. Treat them as a separate person and do not add them to the person above.

If the person matches the information above, suggest a connection.

If the source is a trusted source (data harvested from within Oxford), display them in the browse / search results pages, else do not display them.
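To make the three cases a bit more concrete, here is a rough Python sketch of how a pair of records might be sorted into them. It is not the registry's code: the `Person` fields, the `classify` and `visible` functions, and the choice to compare `staff_id` only when both records come from trusted sources are all assumptions made for illustration. The first-name check is also simplified to "same first word, compatible initials".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    first_name: str                    # full name ("Francis M") or initials ("F. M.")
    last_name: str
    source_group: str                  # hypothetical label for a group of sources
    affiliation: Optional[str] = None
    trusted: bool = False              # True if harvested from a trusted source
    staff_id: Optional[str] = None


def _tokens(first_name):
    return first_name.lower().replace(".", " ").split()


def _initials(first_name):
    return "".join(token[0] for token in _tokens(first_name))


def _has_full_first_name(person):
    tokens = _tokens(person.first_name)
    return bool(tokens) and len(tokens[0]) > 1


def classify(a, b):
    """Return (case, action) for two records, or (None, "no match")."""
    if a.last_name.lower() != b.last_name.lower() or a.source_group != b.source_group:
        return None, "no match"
    ia, ib = _initials(a.first_name), _initials(b.first_name)
    if not ia or not ib or not (ia.startswith(ib) or ib.startswith(ia)):
        return None, "no match"

    same_affiliation = a.affiliation is not None and a.affiliation == b.affiliation
    full_names_match = (
        _has_full_first_name(a)
        and _has_full_first_name(b)
        and _tokens(a.first_name)[0] == _tokens(b.first_name)[0]
    )
    # Assumption: staff_id is only compared when both records are trusted.
    staff_ok = not (a.trusted and b.trusted) or (
        a.staff_id is not None and a.staff_id == b.staff_id
    )

    if same_affiliation and full_names_match and staff_ok:
        return 1, "merge into one person superset"
    if same_affiliation:
        return 2, "keep separate, suggest a connection"
    return 3, "keep separate, suggest a connection"


def visible(person, case):
    """Case 3 records appear in browse/search results only if trusted."""
    return case != 3 or person.trusted


# The three Kellner records from the example, with made-up affiliations.
bio = Person("Francis", "Kellner", "group A", affiliation="Department X")
pi = Person("F. M.", "Kellner", "group A", affiliation="Department X")
author = Person("Francis M.", "Kellner", "group A")
```

Under these assumptions, the biography and the Principal Investigator record fall into Case 2 (initials and affiliation agree, but one record has no full first name), and the biography and the author record fall into Case 3 (the author record has no affiliation). Neither pair is merged automatically; both are only suggested as connections.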

The ‘same-as’ process focuses on ‘people’ entities. Projects, publications, funders and academic units usually have fixed (or standard) names which are used consistently across sources. Names of people, however, are frequently written in different ways depending on the context.

So now, as you can imagine, every time we add more data to the registry we pass these data through the same-as process to see if there are any hidden connections to the data we already have. Or, to put it a different way, every time we add a new source we add not only its data but also the connections we find with same-as, which were probably unknown, or at least not evident, in the original sources.
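As a toy illustration of that workflow, the sketch below checks every record from a newly added source against what is already stored and reports the connections it finds. The `Registry` class and its `add_source` method are hypothetical, and the matching rule passed in here (same name, ignoring case) is only a stand-in for the real same-as rules.

```python
class Registry:
    """Toy registry that looks for connections whenever a source is added."""

    def __init__(self, same_person):
        self.same_person = same_person   # stand-in for the same-as rules
        self.records = []

    def add_source(self, source_name, new_records):
        connections = []
        for new in new_records:
            for existing in self.records:
                if self.same_person(new, existing):
                    # Record which two sources turned out to be linked.
                    connections.append((existing["source"], source_name, new["name"]))
        # Store the new records only after comparing them with the old ones.
        self.records.extend({**record, "source": source_name} for record in new_records)
        return connections


registry = Registry(lambda a, b: a["name"].lower() == b["name"].lower())
first = registry.add_source("college site", [{"name": "Francis Kellner"}])
second = registry.add_source("department site", [{"name": "francis kellner"}])
```

The first source finds nothing to connect to, because the registry is empty. The second source brings its own data plus one connection back to the college site.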

If you want to know more, you can wait for the Sconul Focus paper or, if you are impatient, e-mail us.

About this Blog

Cecilia Loureiro-Koechlin

I am the BRII Project Analyst and responsible for this blog. I work at the Systems and e-Research Service at the Bodleian Libraries - Oxford University. E Cecilia.Loureiro-Koechlin@bodleian.ox.ac.uk, T +44 (0) 1865 280028, Contact address: Osney One, Osney Mead, Oxford, OX2 0EW


Our Goal

Building the Research Information Infrastructure (BRII) aims to support the efficient sharing of Research Activity Data (RAD) captured from a wide range of sources. BRII develops an infrastructure that harvests and archives RAD, and Web services which disseminate and reuse this kind of data by using a lightweight solution based on semantic web technologies. Phases of the project include: a stakeholder analysis to collect views from interested parties (e.g., academics and administrators); an iterative development process which uses information collected in the analysis phase; and an embedding and sustainability phase where user acceptance is assessed and strategies to support the expansion of the information research infrastructure are designed. Additional outputs of the BRII include: an application programming interface (API) for harvesting and querying data; a collection of ontologies and taxonomies used to organise and classify data; a themed Web site; and the Oxford Blue Pages displaying RAD in creative ways. By facilitating access to RAD, BRII expects to improve the research visibility of the institution and its research impact, as well as boost collaboration.
