Author: Jane Stevenson

This is a summary of a break-out group discussion at the JISC Expo Programme meeting, July 2011, looking at ‘Skills required for Linked Data’.

We started off by thinking about the first steps when deciding to create Linked Data. We took a step back from the skills required and thought more about the understanding and the basic need and the importance of putting the case for Linked Data (or otherwise).

Do you have suitable data

Firstly, do you have data that is suitable to output as Linked Data. This comes down to the question: what is suitable data? It would be useful to provide more advise in this area.

Is Linked Data worth doing?

Why do you want Linked Data? Maybe you are producing data that others will find interesting and link into? If you give your data identifiers, others can link into it. But is Linked Data the right approach? Is what you really want open data more than Linked Data? Or just APIs into the data? Sometimes a simpler solution may give you the benefits that you are after.

Maybe for some organisations approaches other than Linked Data are appropriate, or are a way to start off – maybe just something as simple as outputting CSV. You need to think about what is appropriate for you. By putting your data out in a more low-barrier way, you maybe able to find out more about who might use your data. However, there is an argument that it is very early days for Linked Data, and low levels of use right now may not reflect the potential for the data and how it is used in the future.

Are you the authority on the data? Is someone else the authority? Do you want to link into their stuff? These are the sorts of questions you need to be thinking about.

The group agreed that use cases would be useful here. They could act as a hook to bring people in. Maybe there should be somewhere to go to look up use cases – people can get a better idea of how they (their users) would benefit from creating Linked Data, they can see what others have done and compare their situation.

We talked around the issues involved in making a case for Linked Data. It would be useful if there was more information for people on how it can bring things to the fore. For example, we talked about set of photographs – a single photograph might be seen in a new context, with new connections made that can help to explain it and what it signifies.

What next?

There does appear to be a move of Linked Data a ‘clique’ into the mainstream – this should make it easier to understand and engage with. There are more tutorials, more support, more understanding. New tools will be developed that will make the process easier.

You need to think about different skills – data modelling and data transformation are very different things. We agreed that development is not always top down. Developers can be very self-motivated, in an environment where a continual learning process is often required. It may be that organisations will start to ask for skills around Linked Data when hiring people.

We felt that there is still a need for more support and more tutorials. We should move towards a critical mass, where questions raised are being answered and developers have more of a sense that there is help out there and they will get those answers. It can really help talking to other developers, so providing opportunities for this is important. The JISC Expo projects were tasked with providing documentation – explaining what they have done clearly to help others. We felt that these projects have helped to progress the Linked Data agenda and that it is an important encouraging people to acquire these skills to require processes and results to be written up.

Realistically, for many people, expertise needs to be brought in. Most organisations do not have resources to call upon. Often this is going to be cheaper than up-skilling – a steep learning curve can take weeks or months to negotiate whereas someone expert in this domain could do the work in just a few days. We talked about a role for (JISC) data centres in contributing to this kind of thing. However, we did acknowledge the important contribution that conferences, workshops and other events play in getting people familiar with Linked Data from a range of perspectives (as users of the data as well as providers). It can be useful to have tutorials that address your particular domain – data that you are familiar with. Maybe we need a combination of approaches – it depends where you are starting from and what you want to know. But for many people, the need to understand why Linked Data is useful and worth doing is an essential starting point.

We saw the value in having someone involved who is outward facing – otherwise there is a danger of a gap between the requirements of people using your data and what you are doing. There is a danger of going off in the wrong direction.

We concluded that for many, Linked Data is still a big hill to climb. People do still need hand-ups. We also agreed that Linked Data will get good press if there are products that people can understand – they need to see the benefits.

As maybe there is still an element of self-doubt about Linked Data, it is essential not just to output the data but to raise its profile, to advocate what you have done and why. Enthusiasm can start small but it can quickly spread out.

Finally, we agreed that people don’t always know where products are built around Linked Data. So, you may not realise how it is benefitting you. We need to explain what we have done as well as providing the attractive interface/product and we need to relate it to what people are familiar with.

For me, the journey from an understanding of modelling data, and creating our own models for the Hub and Copac, to being able to understand the processes and decisions involved in creating XML RDF has been challenging. It has raised one question that often applies when dealing with something quite technical: how much should a manager (in my case an archivist managing an online archive service) be expected to understand the ‘technical’ aspects of something? This is a question I have spoken about and written about before; mainly in terms of what archivists (or other information professionals) in the digital age need to know in order to understand the implications of choices around things like data structure and software systems. In the case of Linked Data, I am still not sure how much I need to know about the detail of Linked Data, the RDF model, the use of RDF XML, the benefits of other output fomats, the application of stylesheets, etc. I have been thinking about how hard it is to create Linked Data – I have had a few enquiries from colleagues who are interested in doing the same sort of thing already and I want to be able to offer useful advice.

One thing that occurs to me is that it is reasonable to acknowledge that Linked Data does involves programming skills, and therefore it is not so dissimilar from structuring and outputting your data through a traditional relational database, for example, where you would expect that specialist skills are needed. But in either case there is the the same need for a manager to understand what the system offers and be able to offer the best service to researchers. I think what is important from the point of view of the manager is to be involved with the decision-making process and understand the implications of Linked Data; you need to know what you are saying about your own data. I am not sure that this requires a thorough knowledge of RDF and certainly I only have a rudimentary knowledge of stylesheets (and no knowledge of programming).

Service managers or administrators are not expected to understand systems from an in-depth technical point of view. But in fact, I think that one of the advantages of RDF model is that it is easier to get a sense of what is going on in terms of data processing than typically occurs with a database management system. After about six months of learning about Linked Data and RDF (I estimate that this translates into about one month of fairly intense learning), I can look at the stylesheet that we have for the Archives Hub, to transform the EAD data into RDF XML, and I can look at an RDF document representing the entities that we are describing, and I have a reasonably good overall sense of what it all means, which helps me with my main role: to understand the outputs and the potential benefits of Linked Data. For Locah, we’ve used XSLT, but there is no requirement for this, and maybe one of the challenges of outputting Linked Data is that there are a number of options in terms of translating your RDF model into an output.

There is no doubt that choices made now about how we model the data will have implications for what users can do with it, and some choices may limit future potential more than others. For example, which ‘things’ do we choose to represent? Should we have a conceptualisation of a person as an entity represented in the description, and to link this to a conceptualisation of the person? Which information should we provide as URIs and which as literals? I’m only gradually coming to understand the implications of these decisions, as we start to explore the potential of the data. Of course, this is always true, whatever data, structures and systems we are working with. This brings me to another point that I think is probably particularly relevant for Locah: we are doing this work at a time when we are very much early adopters. Whilst the classic Linked Data diagram may give the impression that the world has embraced Linked Data, the reality is that it is still very much at a hand-crafted level: we have not had tools available to us to aid us in this work, and in the case of EAD, there has been very little activity up till now. It is therefore difficult to judge how feasible it might be to output RDF in the future, as it is likely that more tools will be developed, and there will be greater awareness and skills built up around the whole Semantic Web. However, I wonder if we are currently still at that difficult point where we need to build the momentum of the Linked Data movement, but it is still very unfamiliar and poorly understood by many data providers?

Many Linked Data evangelists claim that Linked Data is ‘easy’. I’m not sure that it is necessarily easy, and I don’t think that it’s very helpful to say that it is easy. Easy compared to what? Easy for whom? It’s easy if you know how, if you have the requisite skills and experience, but we need to persuade people who don’t yet know how that it is worth doing, and provide a realistic assessment of the skills that are required. I suppose the question of how easy it is does rest in large part on the data you are working with as well. Archival finding aids are quite challenging. As Mark Matienzo, archivist at Yale University, states in his presentation on Linked Data and Archival Description: “Archival description is inherently multi-level and relational” and “EAD is both too flexible and too unforgiving” to be Linked Data friendly…and database-friendly for that matter. Also, ISAD(G) recommends the non-repetition of information and archival description generally contains implicit information. I suppose Linked Data might help provide the opportunity and impetus to move towards a more Web-friendly way of describing archives, if it does become more widely used.

At present, I can’t help thinking that if archive repositories and libraries would like to output their data as Linked Data, many of them will struggle, and I would have thought it might be similar for other types of data providers. I do think that expertise is required, and time needs to be invested in understanding some key aspects of Linked Data. On the other hand, this is the case whenever you are looking at creating effective means to output structured (but often inconsistent) data. However, I think that it makes good sense for the Archives Hub and Copac to do this work, as it is on behalf of our contributors, so it effectively will allow these repositories and libraries to output Linked Data. In other words, it may be that for Linked Data to really take hold, it will benefit from this kind of aggregated set-up, where skills and resources can be pooled. At present, I’m inclined to think that it is worth the investment of time and resources by our Locah team because it is benefitting a large number of data providers. I think it will be important for us to convey to our contributors, and indeed to other archivists and librarians, what we are doing and why, what the implications are and what the benefits may be. I have already had contact with two people, one representing another aggregation of content, interested in benefitting from our work. This is really important, because it potentially makes the investment more worthwhile.

We are in a fortunate position with the Locah project because we are part of a JISC-funded innovations project, with a team of people with a variety of skills, and we have support from Talis, who have significant experience of Linked Data. If we can work on behalf of our community, then I feel that the time invested may be worthwhile. For the second half of our year-long project we will want to explore the benefits more thoroughly – we will be looking at the crucial issues of creating links to other data, which is really Linked Data’s key selling point, and we will be developing a prototype to show some potential benefits for researchers.

(With thanks to Pete and Ade for their contributions to this blog post).

This post is to highlight some of the barriers and challenges to the creation of Linked Data. This is a personal reflection, trying to be honest about the challenges as I have found them and the learning experience, which is inevitably a personal thing depending upon your own background, experience and ways of thinking and working. However, I think it also reflects some of the general challenges as we have come across them.

Vocabulary

It comes as no surprise that I have found the terminology somewhat confusing, and it has sometimes led me astray. Only this week Bethan and I were getting tangled up in a conversation about ‘things’ within the data model. We spent a while talking about how having a ‘Hub conceptualisation’ and a ‘thing-in-its-own-right conceptualisation’ of an entity would allow for more clarity. With ‘thing’, ‘concept’, ‘label’, ‘property’, ‘value’, ‘predicate’, ‘information resources’, ‘non-information resources’ etc. – there is quite a bit of room for misinterpretation in communication. I have looked at definitions, but these can actually sometimes hinder rather than help. I think that an attempt at a definitive glossary for Linked Data would help enormously.

Landscape

For me, it has taken a while to really get into the Linked Data way of thinking. I have actually kept a kind of diary of my thoughts over the last 2-3 months, and when I look back now at my earliest attempts at understanding how to model the data, they certainly show a pretty steep learning curve. I started, for example, by being unsure about whether we were wanting to provide information on the ‘creator’ of the archive or the archive itself and what sort of relationships between ‘things’ to include. I don’t think this is surprising, as the power of RDF is that it can be used to model anything – it doesn’t help you by giving you a limited scope or particular rules to start with (which is, of course, generally a good thing).

Archival descriptions

I listened to a number of audio tutorials, read a number of reports, blogs, etc., and learnt a great deal from these, but I still found the lack of examples within my own particular domain to be a barrier. Talis provide a very excellent tutorial that you can sit and listen to, but the real-world example is for a whiskey distillery. It somehow seems a long way away from an archival description! So, I would definitely say this lack of information for my domain was a barrier. But, of course, for others who want to output their finding aids as Linked Data in the future, we should start to see models developing that they can use, with examples and information to help (Locah, we hope, being one source of help).

Expertise and experience

The Locah team has a variety of expertise and experience, but it is undoubtedly true to say that I would be struggling a great deal more than I have done if we had not had the input of Pete Johnston from Eduserv, who has been very much involved in the EAD modelling. Whilst it is important (and pleasant) to give credit where it’s due, the real point here is actually that I think a certain level of expertise is important, to model data and output RDF. I have experience as an archivist and understand EAD and metadata, Pete also has experience of working with archival descriptions, and also substantial experience of metadata standards and issues around the Semantic Web and technical interoperability. We also have Bethan Ruddock working with us, who now has 18 months experience of working with EAD descriptions, and is a trained librarian. That is just the core team looking at the archival data modelling. In addition, the expertise of UKOLN will come into play with other aspects of the project.

I find it hard to see how this sort of work could currently be done by a team with substantially less experience in these sorts of areas. However, it is important to state that we will also be working with Talis, who have a great deal of expertise in Linked Data. They are providing access to their own Triple Store and other benefits that we can take advantage of. Others thinking of outputting Linked Data could look to involve companies like Talis more heavily, thus taking advantage of their expertise and requiring less in-house expertise.

The benefits of data modelling

One of the areas that I spent most time trying to find good tuturials about was data modelling. I may have missed some things that would have been very useful, but as it is I found that there simply wasn’t enough helpful information about how to create a data model. This would have saved me quite a bit of time because I think the data model is so central to what we are doing and provides such an effective way to visualise the entities and relationships between them. I think this was partly a case of examples being too simplistic, and partly a lack of data models that used catalogue data – not necessarily archival finding aids, but at least something similar.

The data

I think that we are going to find challenges around the actual content. There are numerous examples of inconsistencies, such as where the ‘creator’ is ‘Joe Bloggs and others’ rather than just a name, or where the access points do not have rules or a source associated with them. I’ve just found some descriptions where the content for the ‘extent’ should acutally really be in the ‘scope’. Some descriptions have rather unsatisfactory references, some do not include the language field, a few do not even include the creator field. For some fields we will just be outputting literal values, but for others consistency would help a great deal with the creation of RDF, particularly when thinking about the vocabulary (or predicate) that we use to define the relationship between a subject and object. This is the challenge of creating Linked Data for descriptions that have been created by 200 different institutions over several decades and by 100s of different people. We’ll have to see how it goes!

The issue of access points

Within EAD there are access points, or index terms, associated with the description. These are most commonly subject, name and place. We’ve found that establishing the nature of the relationship between the unit of description and the access point is not easy. It looks like the relationship is going to be something very unspecific, such as ‘associatedWith’. I’m not sure yet whether this has any implications…

Conclusions

For me, after a few weeks away from thinking about Locah and Linked Data, getting back into the whole mindset actually takes about an hour and a nice cup of tea. In other words, the mindset I require to think about Linked Data currently feels separate from my normal working mindset. I think this is because LD requires something different. This in itself makes it quite challenging. It doesn’t fall naturally into what we do in the Hub and how we think about metadata.

However, the very big plus with this different kind of thinking is that really by definition it puts what the user is interested in at the forefront of your thinking. Well, maybe I should qualify that: I believe it puts what the user is interested in at the forefront. This is because we understand that users of archives are usually primarily interested in individuals, families, organisations, subjects and places. What they want is information on Sir Ernest Shackleton, Barbara Castle, Victorian theatre, town planning, a local business, a scientific organisation, the history of Manchester the industry of Sheffield, or anything else. They don’t tend to know that they want to access a particular archive. Or if they do, it is often due to an assumption that there is ‘an archive’ on the person or organisation that they are researching. Even if there is an archive, there may may be a misplaced assumption that this archive is pretty much all the stuff about that entity. Furthermore, there are going to be many many researchers out there who will not be aware of archives and how to access them. Linked Data provides a way to link archives into…well, into just about anything else.

Last week Pete Johnston, Bethan Ruddock and I got together and shut ourselves in a room for 5 hours with a whiteboard, flipchart and with our thinking caps on. Pete has already posted some thoughts about architecture and workflows following this meeting. I thought I would share some more informal thoughts of my own – from the perspective of an archivist and someone gradually getting to grips with Linked Data and RDF modelling.

Now that I understand a bit more about RDF, I can see where some of my misunderstandings were leading me astray. Firstly, it took me quite a while to get away from the idea of modelling the EAD record, rather than the actual data. This might seem obvious to those conversant in Linked Data, but I’ve been dealing with records as the unit of information for the last 20 years. With Linked Data you have to get away from this and think about the actual concepts within the data. The record (the EAD description in this case) exists as an entity along with everything else, but it can be misleading to take it as the starting point for data modelling.

I found actually getting a ‘starting point’ a bit difficult. I think this is because everything can be a starting point, and also because I kept going back to thinking of something like <http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton> as the starting point (the record itself). I then moved away from this and started thinking about the archival creator as a central concept. I knew that in RDF this person or organisation would be a subject. I also knew that this subject would need a URI and that we might want to tell people about stuff related to this subject, but I struggled with how we would provide URIs for subjects like this, and also how we would link the creator as subject to things like the index term subjects.

After a quick chat with Pete Johnston I started to understand the real role of URIs within Linked Data. We are probably going to create URIs ourselves for things (concepts) within the Hub. So, we might create a URI for every archival creator, and a URI for every repository, etc. We agreed that we needed to model the data within our world before looking too much at linking to data outside of it. Whilst I had listened to, and read a good deal of literature on Linked Data, I somehow hadn’t quite got the idea that you might create URIs yourself for your own concepts and that these would be documents in their own right, so then you can link to these URIs within your statements and you can include whatever information you think will be useful within these documents.

For example, we were using Sir Ernest Henry Shackleton as a sample record (the famous Antarctic Explorer). He would have a URI – something like archiveshub.ac.uk/id/person/sirernesthenryshackleton. By providing him with a URI we can then create triples (statements) that include this URI. For example:

We can then decide what information we will put in this document that identifies Sir Ernest, so that when researchers look up the URI, they get useful information. We can include links to external locations and we can look at using the ‘sameAs’ relationship to link to other representations of the same person.

Some URIs are fairly straightforward. We will create URIs for archival levels, and then these can in theory be used by others who want to identify levels within the data. For something like language, we will probably use URIs that are already available.

It is useful within data modelling to distinguish the real from the conceptual. So, going back to Sir Ernest, he is a flesh and blood person, and he can also be represented as a concept. If we are thinking about subjects used as index terms within the data, you might have ‘Exploration’ as a subject. We want Sir Ernest, the man described within our description, to be associated with this subject, so we can do this by making him into a concept, and giving that concept a URI. We can then link that to a literal value – his name. In our meeting we discussed one of the advantages of conceptual agents as being that we can distinguish between the person or organisation in its entirety and the person or organisation within this particular context. Archives often only represent a small part of someone’s life or an organisation’s activities, so it is helpful to talk about ‘Sir Ernest Shackleton’ as the explorer and leader of the British National Antarctic Expedition of 1907-1909.

So, we are now starting to move towards a model where we have URIs for a number of key concepts within the Hub. Our intention is to limit the number of concepts that we create URIs for, at least at this stage. We will also simplify some areas with the EAD modelling that we can then open up for investigation later on. For example, it would be good to look at version control and how we might filter changes to Hub descriptions through to the RDF XML, but we think that initially it is a good idea to create Linked Data from our basic model so that we can get feedback and also benefit from the learning process.

The main text heavy field that we are planning to create URIs for at this stage is the Biographical and Administrative History. We haven’t yet explored this thoroughly, but with URIs for archival creators and URIs for administrative and biographical histories, one’s thoughts start to turn to name authorities and EAC-CPF (Encoded Archival Context – Corporate Bodies, Persons and Families – a means to markup information about archival creators in XML). We are not looking at creating EAC descriptions, but it would be good to keep in line with this in whatever ways we can in order to facilitate the subsequent creation of EAC records, or incorporation of our data into EAC records.

We will soon be able to share our current data model, so keep an eye on our blog. We welcome any feedback that the community might have.

I recently posted on the Archives Hub blog about the tricky issue of the ‘creator’ in an archival context. This is something we need to think about when modelling EAD data to create RDF triples. It has generated some discussion about the definition of a creator of archives, and some of the issues surrounding this. The post can be found at http://archiveshub.ac.uk/blog/?p=2401.