Here is a summary of the main ideas and themes from the presentations and discussions at the Discovery Conference at the Wellcome Institute, London, on 26 May 2011. It’s based on notes taken at the time, and is therefore necessarily somewhat selective, but I’ve tried to be comprehensive and true to the spirit of the day. I’ve included references to some of the key Twitter themes as these help to highlight issues of interest to the community.

Jane Plenderleith of Glenaffric Ltd (and member of the Discovery Communications Team)

Opening Address from David Baker, JISC

Our starting point was the RDTF Vision Statement of 2009. Since then there’s been some discussion about scope, suggesting that the vision should not be limited to UK HE. Following some heated discussion at the 2010 JISC conference, the vision is about opening access for all. But we have to start somewhere, hence the focus on UK HE. In our definition of the future state of the art, it’s important not to try to project too far forward, so the focus is on what we aim to achieve by 2012.

We are aiming for integrated seamless access, focusing on UK HE in the first instance, with a thorough and open aggregated layer, designed to work with all search engines, through a diverse range of personalised and innovative discovery services. Increasing efficiencies is clearly important for sector leaders and managers – the potential of open data to address this priority needs to be emphasised.

At the moment we are in Phase 1 of this process, focusing on open data. More detail about the call from JISC for moving into Phase 2 will follow. Phase 1 achievements include:

There was a successful event in April, with good engagement, proposals for further work, and suggestions for a ‘Call to Arms’. This has resulted in the Statement with eight Open Metadata Principles in the pack for today’s event.

Eight projects have already been funded, focusing on a broad and appropriate range of issues, providing a test-bed for the Phase 1 work, and giving us a good platform on which to build Phases 2 and 3.

We are working on making metadata open and easier to use, distilling advice and guidance from the eight projects. Key stakeholder engagement is vital to this process. This new phase of work is under the brand ‘Discovery’. RDTF was a clunky old name, but when boiled down to the essence, it’s about developing a metadata discovery ecology for UK education and research.

Engaging stakeholders and developing critical mass is key. With the community we want to explore what open data makes possible. Since the first event in Manchester on 18 April many people have signed up to the Statement of Principles. Today’s event is about using this momentum to move the open data agenda forward.

Part 1: The Demand Side: User Expectations in Teaching, Learning and Research

Keynote 1: Stuart Lee

Stuart was addressing the conference (by filmed videolink) wearing two hats, one as a researcher in the humanities and one as an IT service manager at the University of Oxford.

He started with a historical overview of his data usage techniques: ‘When I was writing my PhD thesis, I had to produce a glossary. The normal method at the time was using a card system, which took a long time (a friend took one and a half years). But I was trained to use a text analysis tool, so it took me three hours. Later I was asked to produce a monograph of my thesis, but instead I made a PDF and put my thesis on the web. I didn’t know at the time that this was in fact open publishing, but this had far more impact than if I had published in book form. It’s been downloaded thousands of times, and made my reputation in this field.’

Stuart went on to make reflections on how researchers in the humanities work, and what open data might mean. Researchers in the humanities never really finish. Projects have a long life span and are often revisited. We work in an iterative cycle, our research is unbounded and incomplete. We don’t just publish and move on to the next thing. We tend to work alone, in our own way, not in teams in labs. Print is a very important medium for us. We use primary and secondary resources, we find stuff through browsing catalogues.

In a nutshell, we just want to find ‘useful stuff’. Modern researchers are less worried about provenance, they are more concerned with usefulness. Many collections that we use are built by other academics working in our field.

We use tools to edit, analyse and compare data. We need to organise material so we can quickly find it. We have to present our work in a particular way – present an argument, combine primary and secondary resources. Citation is important, particularly of recognised names in the field. The material we produce has to be safeguarded, archived, so we can come back to it and others can use it. We want it to be available for a long long time. It does not go out of date like science stuff does.

So what opportunities does open data present for researchers in the humanities? We are very interested in open data and the Discovery agenda. We can now achieve the previously impossible – find relevant resources quickly, deal with mass quantities of data (example: corpus linguistics), achieve low cost distribution (example: iTunesU). Storage is no longer a problem, we can search across data silos from our desk and take advantage of cross-searching possibilities.

Perhaps we undervalue serendipity when we are looking for resources. If you are scanning books in the library, you find useful stuff on either side of the one you are looking for. If you are browsing data on a keyword search this throws up lots of possibilities.

There are a lot of chances for collaboration using online tools. We work very much in our own sub domain, with international connections in our field. We need better bibliographic tools like Mendeley.

Inevitably, open data poses some problems and challenges. Who is a researcher? Increasingly libraries have to incorporate meeting the needs of people beyond the usual HE sphere including the public and corporate bodies. There’s a lack of awareness about what is available. There’s a need for better standardisation. Text analysis tools haven’t advanced much in 20 years, and training undergraduates in their use is still necessary. There is still a problem with accessing data when we don’t know its provenance.

We need to break free of the stranglehold of academic publishers – we in the humanities are every bit as fed up about this as people working in the sciences. The system we have at present is unsustainable. We need to make metadata open to make it easier to find things. There are more challenges relating to the analysis of data, and preserving knowledge. We need support for adopting open content, both top down and bottom up.

Stuart ended with some comments on the changing nature of the library itself, the concept being no longer of a physical building, but a whole plethora of bodies holding information and making it available in what Stuart called the ‘cl**d’ (he doesn’t care for the word).

Keynote 2: Peter Murray-Rust

PMR’s focus was researchers in the STEM field, and he was provocative from the outset. How many practising scientists are in the room? None. That’s the problem – scientists have no use for university libraries and repositories.

There are global and domain solutions to resource discovery. We have the technical solutions – what we need to make this happen is political will. For example, only those universities which have mandated publishing work in repositories (such as Ghent, Queensland, and to some extent Southampton) actually use them.

By comparison, look at the OpenStreetMap project (an open information resource for global maps). People have really contributed to this. They even held mapping parties. Example: after the earthquake they created a digital map of Haiti in two days for the rescue services. That’s the power of crowdsourcing. But there is no sense of the power of this in JISC – their strategy is to rely on publishers getting the stuff for us. But publishers, says PMR, produce garbage (this remark aroused amused assent from most of the people in the room).

PMR continued in this provocative vein. ‘It is quite simple for us to produce our own discovery data. Example: I have an interest in UK theses, so I went to EThOS. I went with a simple and fundamental question – trying to access all Chemistry theses published in the UK in 2010. But they are scattered over different repositories, not searchable, and not available in any integrated way. In France they have the SUDOC Catalogue, with 9 million bibliographic data references. If there is one clear message from today, it’s “do what the French do”.’

It is technically trivial to turn documents into PDF, but this is an appalling way of managing data. PDF is like turning a cow into a hamburger. You can’t turn the hamburger back into a cow. (The Twittersphere took up this comment and retweeted it many times.)

Another example of where it doesn’t work: I put 2,000 objects into the Cambridge DSpace, but then I couldn’t get them out. I had to write some bash code to get my own objects back out again.

More provocation: We are paralysed by the absurdity of copyright. I know people who delight in not doing anything because of copyright. Any small interaction that is not automatic kills open data. Google just goes and does it.

PMR’s solution to these problems was to build his own repository – a graduate student did this in a year, which now costs about 0.25 FTE to maintain each year. Some funding was secured under JISC Expo to make open bibliographic data available. We have ‘liberated’ 25 million bibliographic references. It’s important to aim outwards not inwards. Example: PubMed is funded in the UK by Wellcome Trust. This organisation has done more than the whole of UK HE to push the open data agenda forward.

For PMR, what would really make this work is support from the major research funders: Wellcome, RCUK, Cancer Research UK. But they are not here today. If the funders were to mandate that all the work they fund is published openly, and state that if you don’t publish your data you won’t get another grant, this would have a serious impact. All that would then be required would be to manage the bibliography, and that’s easy. Open data just requires political will and management to make it happen.

Research Conversation

The opening keynotes gave rise to a lively debate about open data for research, with comments and questions from lots of people. The tweet wall was also animated, echoing key points and making further suggestions and generating ideas. Here’s a summary of the main questions and comments from this session:

What is the value of open data to researchers? What’s the value of a map to geographers? It’s a vital resource – we need to know who is doing what, with links to everything. With that we have the complete spine of scholarship. Bibliography is the map of scholarship. There are also management uses for data about published papers.

PMR said that data are complicated, diverse, and domain dependent. Every discipline is different and has its own views on what data is. It will take 25 years to sort out what scientific data actually is on a technical level.

How important is provenance? Researchers care about provenance and how something came about. We need to exercise critical appraisal when assisting the construction of information sources. But while provenance is important, it is also incredibly difficult. In the first instance what’s important is that the data is available.

What do we have to say to the funders to make them listen? Funders want the work they fund to be widely used, discovered, read, computed, built on.

The tweet wall at this point was alive with comments about IPR and copyright risks.

Is there the same ethos of collaboration and openness for museum data? Museums are protective of what is effectively their life’s work. There are copyright worries about the protection of intellectual capital. Providing an open record to a world where it might be challenged or used in a context for which it was not intended is quite challenging for museums. But it was also noted that there are people in museums who do want to share.

Should publicly funded research institutes make their data openly available? PMR praised organisations like BAS and NERC which are dedicated to maintaining data and making it publicly available. He noted that in academic communities this practice is variable. Some researchers would die rather than make their data available, while others are doing this quite freely. In some places there is an embargo on publication for five years, in case people might find out what they are doing. Issues relating to university ethics and data storage policies were mentioned.

It was suggested that what is needed in the sector is strong leadership promoting open data. There’s a particular problem with senior academic managers, working in a factional REF-dominated culture of competition. In industry, competitors manage to work together on issues of common interest, while still maintaining competition.

David Baker summarised the key issue: it is becoming apparent that the political and legal challenges to open data are more difficult than the technical.

Keynote 3: Drew Whitworth

Drew’s focus was the role of open data for teaching and learning in a variety of formal and informal contexts. A key theme was information as a resource in the environment – it does not diffuse itself evenly, it can be controlled, polluted, degraded. Drawing on Rose Luckin’s 2010 work ‘An ecology of resources’, Drew noted that an ecology evolves in a dynamic way. When you use resources you transform them into something else. This can be a problem – if we transfer resources into pollution, we are not using them in a sustainable way. Sustainable development means you meet current needs without damaging the process of meeting needs in the future. How are you using information now? Are you developing resources that will lead to enhanced resources in the future? We need to use resources now to build resources for the future.

In his book Information Obesity (2009) Drew presented the argument that while logically, information is a good thing and we need it, a lot of information can be a bad thing (why do we talk about information ‘overload’ not ‘abundance’?). It is the same with food – it is possible to have too much food or the wrong kind of food. Fitness means eating smaller amounts of the right kind of food. We are under pressure to consume, and this works for information as well as food. Obesity is not just about over-consumption. It depends on individuals, and purpose. Athletes process lots of calories. Some of us can process lots of information. But we don’t want to turn learners into information processing machines.

Drew described the JISC-funded MOSI-ALONG project which was trying to connect museum artefacts of local relevance with real people and stories from the community.

In summary: we have to remember that learning and information processing happens all the time through communities. If we don’t look after our information environment it will become polluted. Environments are healthiest when they are diverse. We need to look after these environments, protect against storms and natural disaster. It falls upon people in workplaces, and business leaders, to make sure the information environment on which we depend is sustainable. Our task is to look after these environments, and it’s everyone’s responsibility.

Teaching and Learning Conversation

Does the UK discovery ecotecture need to concern itself with usability or are we simply aiming to get the stuff out there? We definitely need a usability strategy. Otherwise people can just shove data in and it’s unusable.

We also need to be aware of our filtering strategies. We are programmed to filter sensory information all the time. We have known tendencies to filter out information that challenges our primary beliefs. You want to give help and guidance, you have a mental model of the data, you have some organising principles, but you need to build in some flexibility in case your mental models do not match those of your data users. This is key to effective use of these resources for learning and teaching. We need to guide, help, but not fix and control.

There is a danger in the paradigm of respected provenance. We need to be wary of gate-keeping, and think about filtering throughout the chain of use. But from a metadata standpoint – if we try to predict how users will use data, we tie ourselves in knots. For the Discovery initiative there is a sense we just have to get the content out there so that communities of practice can start to repurpose it for their needs.

Usability and discovery are different but related. The challenge is – we are used to thinking of usability in terms of HCI, making it easy for people to navigate and use.

But how do we channel that thinking about flexible usability while still making it possible for people to uncover the complexity of the data?

Whatever usability criteria there are need to be continuously reviewed in the light of how people are using the data. Any organising principle can become too restrictive. Scaffolding learning is a good principle – but when the job is done the scaffolding comes down. The challenge is finding a way to use scaffolding for information retrieval then take it down so people can find for themselves.

We need a discussion about the nature of infrastructure, so the scaffolding notion is useful. If we immediately apply this – we have processes that generate metadata, and much of it is context bound. We are moving towards just-in-time metadata, that is generated from processes. You might need the scaffold for 10 seconds or 10 months.

The elephant in the room is VLEs. People who are populating VLEs are not putting together temporary scaffolding, it’s a bit more permanent. There are competing approaches to describing resources and we need to take this into account. The problem with VLEs for learners is that the second they leave the institution they no longer have access.

Information is a resource in a context. What else is necessary in order to turn information into learning? It’s not really possible to say what turns information into an educational resource. The quality of the teacher, the motivation of the student, the relevance in context. You cannot reduce education to a science, it is unpredictable, conversational, context specific.

There has to be redundancy to make our ecology healthy and diverse. Funders make decisions. JISC has a pretty good approach: they consult and collaborate before they set priorities. For others funding is based more on political expediency, and this is worrying. There is a need to prioritise developments, but let’s do this in the right way and leave some room for flexibility.

At this point in the proceedings there was a welcome break for lunch.

Part 2: The Supply Side: Opportunities to Expand Access and Visibility

Summing up the morning, David Baker said:

Seamless access, flexible delivery, personalisation – if we can put these three together, there is a very exciting future.

The afternoon session was chaired by Nick Poole. HE/FE says ‘we need this’. Just do it. Politicians say just do it. We say do it, but do it well. The afternoon session was to present examples of people who have just done it.

Veronica Adamson: The Art of the Possible, Special Collections

Key points from the discussion:

Special Collections may be the key that unlocks the potential of open data for many people – it resonates, there is an understanding, and there are examples of where LAM (libraries, archives and museums) can really work together.

Having a business case is essential – LAM managers need to be able to make the case for open data on the basis of efficiency savings, improving the quality of learning and teaching, enhancing research output, widening participation, raising the profile of the institution.

Collections experts may not be the best people to make the business case for managers. It’s not just about listing benefits, it’s about costs and benefits.

Aggregation means combining different sources of data, seeing the machine as user.

There is a purpose beyond discovery.

Do we need to decouple the metadata layer from the presentation layer? This is a techie question but it’s important.

Supply and demand – maybe we are all middle folk. We are adding value by bringing different streams of data together, making them more amenable for access. Aggregation is an intervention with some purpose.