Crowdsourcing Our Cultural Heritage

About this blog

Posts from a cultural heritage technologist on digital humanities, heritage and history, and user experience research and design. A bit of wishful thinking about organisational change thrown in with a few questions and challenges to the cultural heritage sector on audience research, museum interpretation, interactives and collections online.

Category: Linked open cultural data

These notes were prepared for a panel discussion at the ‘Always Already Computational: Collections as Data’ (#AACdata) workshop, held in Santa Barbara in March 2017. While my latest thinking on the gap between the scale of collections and the quality of data about them is informed by my role in the Digital Scholarship team at the British Library, I’ve also drawn on work with catalogues and open cultural data at Melbourne Museum, the Museum of London, the Science Museum and various fellowships. My thanks to the organisers and the Institute of Museum and Library Services for the opportunity to attend. My position paper was called ‘From libraries as patchwork to datasets as assemblages?’ but in hindsight, piles and patchwork of material seemed a better analogy.

The invitation to this panel asked us to share our experience and perspective on various themes. I’m focusing on the challenges in making collections available as data, based on years of working towards open cultural data from within various museums and libraries. I’ve condensed my thoughts about the challenges down into the question on the slide: How do we embed the production of usable collections data into library work?

It has to be usable, because if it’s not then why are we doing it? It has to be embedded because data in one-off projects gets isolated and stale. ‘Production’ is there because infrastructure and workflow are unsexy but necessary for access to the material that makes digital scholarship possible.

One of the biggest issues the British Library (BL) faces is scale. The BL’s collections are vast – maybe 200 million items – and extremely varied. My experience shows that publishing datasets (or sharing them with aggregators) exposes the shortcomings of past cataloguing practices, making the size of the backlog all too apparent.

Good collections data (or metadata, depending on how you look at it) is necessary to avoid the overwhelmed, jumble sale feeling of using a huge aggregator like Europeana, Trove, or the DPLA, where you feel there’s treasure within reach, if only you could find it. Publishing collections online often increases the number of enquiries about them – how can institutions deal with enquiries at scale when they already have a cataloguing backlog? Computational methods like entity identification and extraction could complement the ‘gold standard’ cataloguing already in progress. If they’re made widely available, these other methods might help bridge the resourcing gaps that mean it’s easier to find items from richer institutions and countries than from poorer ones.

You probably already all know this, but it’s worth remembering: our collections aren’t even (yet) a patchwork of materials. The collections we hold, and the subset we can digitise and make available for re-use are only a tiny proportion of what once existed. Each piece was once part of something bigger, and what we have now has been shaped by cumulative practical and intellectual decisions made over decades or centuries. Digitisation projects range from tiny specialist databases to huge commercial genealogy deals, while some areas of the collections don’t yet have digital catalogue records. Some items can’t be digitised because they’re too big, small or fragile for scanning or photography; others can’t be shared because of copyright, data protection or cultural sensitivities. We need to be careful in how we label datasets so that the absences are evident.

(Here, ‘data’ may include various types of metadata, automatically generated OCR or handwritten text recognition transcripts, digital images, audio or video files, crowdsourced enhancements or any combination of these and more.)

In addition to the incompleteness or fuzziness of catalogue data, when collections appear as data, it’s often as great big lumps of things. It’s hard for normal scholars to process (or just unzip) 4GB of data.

Currently, datasets are often created outside normal processes, and over time they become ‘stale’ as they’re not updated when source collections records change. And when they manage to unzip them, the records rely on internal references – name authorities for people, places, etc – that can only be seen as strings rather than things until extra work is undertaken.

The BL’s metadata team have experimented with ‘researcher format’ CSV exports around specific themes (eg an exhibition), and CSV is undoubtedly the most accessible format – but what we really need is the ability for people to create their own queries across catalogues, and create their own datasets from the results. (And by queries I don’t mean SPARQL but rather faceted browsing or structured search forms).
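As a rough sketch of what ‘create your own dataset’ could look like behind a faceted search form, the Python below filters a handful of made-up catalogue records by facet values and writes the matches out as CSV. The records, the field names and the `filter_by_facets` helper are all hypothetical; this isn’t a description of anything the BL actually runs.

```python
import csv
import io

# Hypothetical catalogue records; field names are invented for illustration.
RECORDS = [
    {"id": "rec1", "title": "Map of London", "type": "map", "century": "18th"},
    {"id": "rec2", "title": "Letter to a printer", "type": "manuscript", "century": "18th"},
    {"id": "rec3", "title": "Map of Kent", "type": "map", "century": "19th"},
]


def filter_by_facets(records, **facets):
    """Keep records whose fields match every selected facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]


def to_csv(records):
    """Serialise the selected records as CSV text, header row first."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Someone ticks 'map' and '18th century' in a search form...
selection = filter_by_facets(RECORDS, type="map", century="18th")
# ...and downloads the result as their own small dataset.
csv_text = to_csv(selection)
```

The point of the sketch is that the person asking never writes a query language; they choose facet values and get a spreadsheet-friendly file back.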

Collections are huge (and resources relatively small) so we need to supplement manual cataloguing with other methods. Sometimes the work of crafting links from catalogues to external authorities and identifiers will be a machine job, with pieces sewn together at industrial speed via entity recognition tools that can pull categories out of text and images. Sometimes it’s operated by a technologist who runs records through OpenRefine to find links to name authorities or Wikidata records. Sometimes it’s a labour of scholarly love, with links painstakingly researched, hand-tacked together to make sure they fit before they’re finally recorded in a bespoke database.

This linking work often happens outside the institution, so how can we ingest and re-use it appropriately? And if we’re to take advantage of computational methods and external enhancements, then we need ways to signal which categories were applied by cataloguers, which by software, by external groups, etc.

The workflow and interface adjustments required would be significant, but even more challenging would be the internal conversations and changes required before a consensus on the best way to combine the work of cataloguers and computers could emerge.

The trick is to move from a collection of pieces to pieces of a collection. Every collection item was created in and about places, and produced by and about people. They have creative, cultural, scientific and intellectual properties. There’s a web of connections from each item that should be represented when they appear in datasets. These connections help make datasets more usable, turning strings of text into references to things and concepts to aid discoverability and the application of computational methods by scholars. This enables structured search across datasets – potentially linking an oral history interview with a scientist in the BL sound archive, their scientific publications in journals, annotated transcriptions of their field notebooks from a crowdsourcing project, and published biography in the legal deposit library.

A lot of this work has been done as authority files like AAT, ULAN etc are applied in cataloguing, so our attention should turn to turning local references into URIs and making the most of that investment.
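To make ‘turning local references into URIs’ a little more concrete, here’s a minimal Python sketch. The base URIs follow the pattern I understand the Getty vocabularies to use for linked data, but treat them as an assumption to check against current documentation. The identifiers are placeholders that haven’t been looked up, and the `to_uri` helper is hypothetical.

```python
# Base URIs assumed to follow the Getty vocabularies' linked data pattern;
# check the current documentation before relying on them.
BASE_URIS = {
    "AAT": "http://vocab.getty.edu/aat/",
    "ULAN": "http://vocab.getty.edu/ulan/",
}


def to_uri(scheme, local_id):
    """Turn a (scheme, local id) pair from a catalogue record into a URI."""
    base = BASE_URIS.get(scheme.upper())
    if base is None:
        return None  # unknown authority: leave it as a string for now
    return base + str(local_id).strip()


# Placeholder references as they might be stored in a local catalogue field.
local_refs = [("aat", "300000000"), ("ulan", " 500000000 "), ("local", "42")]
uris = [to_uri(scheme, ref) for scheme, ref in local_refs]
```

Anything that can’t be mapped stays as it was, so the process can run repeatedly as more authorities are added to the lookup.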

Applying identifiers is hard – it takes expert care to disambiguate personal names, places, concepts, even with all the hinting that context-aware systems might be able to provide as machine learning etc techniques get better. Catalogues can’t easily record possible attributions, and there’s understandable reluctance to publish an imperfect record, so progress on the backlog is slow. If we’re not to be held back by the need for records to be perfectly complete before they’re published, then we need to design systems capable of capturing the ambiguity, fuzziness and inherent messiness of historical collections and allowing qualified descriptors for possible links to people, places etc. Then we need to explain the difference to users, so that they don’t overly rely on our descriptions, making assumptions about the presence or absence of information when it’s not appropriate.
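One way to picture a system that captures both who applied a link and how sure they were is sketched below in Python. It’s a thought experiment rather than a design: the `Assertion` structure, its field names, the source labels and the sample values are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One statement about a catalogue record, with who or what made it."""
    record_id: str
    property_name: str
    value: str
    source: str                   # e.g. "cataloguer", "software", "external"
    certainty: str = "confirmed"  # or "possible" for tentative links


# Invented examples: one confirmed subject, two tentative links.
ASSERTIONS = [
    Assertion("rec1", "subject", "astronomy", source="cataloguer"),
    Assertion("rec1", "place", "Greenwich", source="software", certainty="possible"),
    Assertion("rec1", "person", "John Flamsteed", source="external", certainty="possible"),
]


def by_source(assertions, source):
    """Everything asserted by one kind of contributor."""
    return [a for a in assertions if a.source == source]


def confirmed_only(assertions):
    """A cautious view of the record that hides tentative links."""
    return [a for a in assertions if a.certainty == "confirmed"]
```

Keeping the qualifier alongside the value means a published dataset could offer both a cautious view and a ‘show me the maybes’ view, and users can see which they’re looking at.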

A lot of what we need relies on more responsive infrastructure for workflows and cataloguing systems. For example, the BL’s systems are designed around the ‘deliverable unit’ – the printed or bound volume, the archive box – because for centuries the reading room was where you accessed items. We now need infrastructure that makes items addressable at the manuscript, page and image level in order to make the most of the annotations and links created to shared identifiers.

(I’d love to see absorbent workflows, soaking up any related data or digital surrogates that pass through an organisation, no matter which system they reside in or originate from. We aren’t yet making the most of OCRd text, let alone enhanced data from other processes, to aid discoverability or produce datasets from collections.)

Image credit: https://www.flickr.com/photos/snorski/34543357

My final thought – we can start small and iterate, which is just as well, because we need to work on understanding what users of collections data need and how they want to use them. We’re making a start and there’s a lot of thoughtful work behind the scenes, but maybe a bit more investment is needed from research libraries to become as comfortable with data users as they are with the readers who pass through their physical doors.

I was in London this week for the Linked Pasts event, where I presented on trends and practices for open data in cultural heritage. Linked Pasts was a colloquium on linked open data in cultural heritage organised by the Pelagios project (Leif Isaksen, Elton Barker and Rainer Simon with Pau de Soto). I really enjoyed the other papers, which included thoughtful, grounded approaches to structured data for historical periods, places and people, recognition of the importance of designing projects around audience needs (including user research), the relationship between digital tools and scholarly inquiry, visualisations as research tools, and the importance of good infrastructure for digital history.

Warning: generalisations ahead.

My discussion points are based on years of conversations with other cultural heritage technologists in museums, libraries, and archives, but inevitably I’ll have blind spots. For example, I’m focusing on the English-speaking world, which means I’m not discussing the great work that Dutch and Japanese organisations are doing. I’ve undoubtedly left out brilliant specific examples in the interests of focusing on broader trends. The point is to start conversations, to bring issues out into the open so we can collectively decide how to move forward.

The good

The good news is that more and more open cultural data is being published. Organisations have figured out that a) nothing bad is likely to happen and that b) they might get some kudos for releasing open data.

Generally, organisations are publishing the data that they have to hand – this means it’s mostly collections data. This data is often as messy, incomplete and fuzzy as you’d expect from records created by many different people using many different systems over a hundred or more years.

…the bad…

Copyright restrictions mean that images mightn’t be included. Furthermore, because it’s often collections data, it’s not necessarily rich in interpretative information. It’s metadata rather than data. It doesn’t capture the scholarly debates, the uncertain attributions, the biases in collecting… It certainly doesn’t capture the experience of viewing the original object.

Licensing issues are still a concern. Until cultural organisations are rewarded by their funders for releasing open data, and funders free organisations from expectations for monetising data, there will be damaging uncertainty about the opportunity cost of open data.

Non-commercial licenses are also an issue – organisations and scholars might feel exploited if others who have not contributed to the process of creating it can commercially publish their work. Finally, attribution is an important currency for organisations and scholars but most open licences aren’t designed with that in mind.

…and the unstructured

The data that’s released is often pretty unstructured. CSV files are very easy to use, so they help more people get access to information (assuming they can figure out GitHub), but a giant dump like this doesn’t provide stable URIs for each object. Records in data dumps rarely link to external identifiers like the Getty’s Thesaurus of Geographic Names, Art & Architecture Thesaurus (AAT) or Union List of Artist Names, or vernacular sources for place and people names such as Geonames or DBPedia. And that’s fair enough, because people using a CSV file probably don’t want all the hassle of dereferencing each URI to grab the place name so they can visualise data on a map (or whatever they’re doing with the data). But it also means that it’s hard for someone to reliably look for matching artists in their database, and link these records with data from other organisations.
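Here’s a small Python sketch of why shared identifiers matter when you try to link records across organisations. Both datasets are invented, and the `ulan` values are placeholders standing in for a real external identifier.

```python
# Two invented datasets from different organisations. The "ulan" values are
# placeholders standing in for a shared external identifier.
museum_a = [
    {"artist": "Turner, J. M. W.", "ulan": "id-001", "object": "Watercolour"},
    {"artist": "Smith, John", "ulan": "id-002", "object": "Print"},
]
museum_b = [
    {"artist": "Joseph Mallord William Turner", "ulan": "id-001", "object": "Oil painting"},
    {"artist": "John Smith", "ulan": "id-003", "object": "Drawing"},
]


def match_on(key, left, right):
    """Pair up rows from two datasets that share the same value for `key`."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [(a_row, b_row) for a_row in left for b_row in index.get(a_row[key], [])]


# Matching on name strings finds nothing, because each organisation
# formats names differently.
by_name = match_on("artist", museum_a, museum_b)

# Matching on the shared identifier links the two Turner records and keeps
# the two different John Smiths apart.
by_identifier = match_on("ulan", museum_a, museum_b)
```

Cleverer string matching would catch some of these, but it would also risk merging the two John Smiths, which is exactly the disambiguation work an identifier records once so nobody has to redo it.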

So it’s open, but it’s often not very linked. If we’re after a ‘digital ecosystem of online open materials’, this open data is only a baby step. But it’s often where cultural organisations finish their work.

Classics > Cultural Heritage?

But many others, particularly in the classical and ancient world, have managed to overcome these issues to publish and use linked open data. So why do museums, libraries and archives seem to struggle? I’ll suggest some possible reasons as conversation starters…

Not enough time

Organisations are often busy enough keeping their internal systems up and running, dealing with the needs of visitors in their physical venues, working on ecommerce and picture library systems…

Not enough skills

Cultural heritage technologists are often generalists, and apart from being too time-stretched to learn new technologies for the fun of it, they might not have the computational or information science skills necessary to implement the full linked data stack.

Some cultural heritage technologists argue that they don’t know of any developers who can negotiate the complexities of SPARQL endpoints, so why publish it? The complexity is multiplied when complex data models are used with complex (or at least, unfamiliar) technologies. For some, SPARQL puts the ‘end’ in ‘endpoint’, and ‘RDF triples’ can seem like an abstraction too far. In these circumstances, the instruction to provide linked open data as RDF is a barrier they won’t cross.
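If triples feel abstract, it may help to see that a triple is just a three-part statement: subject, predicate, object. The Python below is a toy illustration with plain tuples rather than a real RDF library. The `ex:` names are invented placeholders, and the `match` helper is a stand-in for the kind of pattern matching a query language does.

```python
# Each statement is a (subject, predicate, object) triple. The names here
# are invented placeholders, not a real vocabulary.
TRIPLES = [
    ("ex:object/1", "ex:title", "Astrolabe"),
    ("ex:object/1", "ex:madeIn", "ex:place/london"),
    ("ex:object/2", "ex:title", "Telescope"),
    ("ex:object/2", "ex:madeIn", "ex:place/london"),
    ("ex:place/london", "ex:label", "London"),
]


def match(pattern, triples=TRIPLES):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        triple for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


# 'Which things were made in London?'
made_in_london = [s for s, _, _ in match((None, "ex:madeIn", "ex:place/london"))]

# 'And what are they called?' Follow each subject to its title.
titles = [o for s in made_in_london for _, _, o in match((s, "ex:title", None))]
```

Because London is a thing with its own identifier rather than a string, any other dataset that points at the same identifier joins up without extra work.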

But sometimes it feels as if some heritage technologists are unnecessarily allergic to complexity. Avoiding unnecessary complexity is useful, but progress can stall if they demand that everything remains simple enough for them to feel comfortable. Some technologists might benefit from working with people more used to thinking about structured data, such as cataloguers, registrars etc. Unfortunately, linked open data falls in the gap between the technical and the informatics silos that often exist in cultural organisations.

And organisations are also not yet using triples or structured data provided by other organisations [with the exception of identifiers for e.g. people, places and specific vocabularies]. They’re publishing data in broadcast mode; it’s not yet a dialogue with other collections.

Not enough data

In a way, this is the collections documentation version of the technical barriers. If the data doesn’t already exist, it’s hard to publish. If it needs work to pull it out of different departments, or different individuals, who’s going to resource that work? Similarly, collections staff are unlikely to have time to map their data to CIDOC-CRM unless there’s a compelling reason to do so. (And some of the examples given might use cultural heritage collections but are a better fit with the work of researchers outside the institution than the institution’s own work).

It may be easier for some types of collections than others – art collections tend to be smaller and better described; natural history collections can link into international projects for structured data, and libraries can share cataloguing data. Classicists have also been able to get a critical mass of data together. Your local records office or small museum may have more heterogeneous collections, and there are fewer widely used ontologies or vocabularies for historical collections. The nature of historical collections means that ‘small ontologies, loosely joined’, may be more effective, but creating these, or mapping collections to them, is still a large piece of work. While there are tools for mapping to data structures like Europeana’s data model, it seems the reasons for doing so haven’t been convincing enough, so far. Which brings me to…

Not enough benefits

This is an important point, and an area the community hasn’t paid enough attention to in the past. Too many conversations have jumped straight to discussion about the specific standards to use, and not enough have been about the benefits for heritage audiences, scholars and organisations.

Many technologists – who are the ones making decisions about digital standards, alongside the collections people working on digitisation – are too far removed from the consumers of linked open data to see the benefits of it unless we show them real world needs.

There’s a cost in producing data for others, so it needs to be linked to the mission and goals of an organisation. Organisations are not generally able to prioritise the potential, future audiences who might benefit from tools someone else creates with linked open data when they have so many immediate problems to solve first.

While some cultural and historical organisations have done good work with linked open data, the purpose can sometimes seem rather academic. Linked data is not always explained so that the average, over-worked collections or digital team will be convinced that the benefits outweigh the financial and intellectual investment.

No-one’s drinking their own champagne

You don’t often hear of people beating on the door of a museum, library or archive asking for linked open data, and most organisations are yet to map their data to specific, widely-used vocabularies because they need to use them in their own work. If technologists in the cultural sector are isolated from people working with collections data and/or research questions, then it’s hard for them to appreciate the value of linked data for research projects.

The classical world has benefited from small communities of scholar-technologists – so they’re not only drinking their own champagne, they’re throwing parties. Smaller, more contained collections of sources and research questions help create stronger connections and give people a reason to link their sources. And as we’re learning throughout the day, community really helps motivate action.

Linked open data isn’t built into collections management systems

Getting linked open data into collections management systems should mean that publishing linked data is an automatic part of sharing data online.

Chicken or the egg?

So it’s all a bit ‘chicken or the egg’ – will it stay that way? Until there’s a critical mass, probably. These conversations about linked open data in cultural heritage have been going around for years, but it also shows how far we’ve come.

I’ve been playing with Tate’s collections data while preparing for a workshop on data visualisation. On the day I’ll probably use Google Fusion Tables as an example, but I always like to be prepared so I’ve prepared a short exercise for creating simple graphs in Excel as an alternative.

The advantage of Excel is that you don’t need to be online, your data isn’t shared, and for many people, gaining additional skills in Excel might be more useful than learning the latest shiny web tool. PivotTables are incredibly useful for summarising data, so it’s worth trying them even if you’re not interested in visualisations. Pivot tables let you run basic functions – summing, averaging, grouping, etc – on spreadsheet data. If you’ve ever wanted spreadsheets to be as powerful as databases, pivot tables can help. I could create a pivot table then create a chart from it, but Excel has an option to create a pivot chart directly that’ll also create a pivot table for you to see how it works.

Work out what data you’re interested in

In this example, I’m interested in when the artists in Tate’s collection were born, and the overall gender mix of the artists represented. To make it easier to see what’s going on, I’ve copied those two columns of data from the original ‘artists’ file and copied them over to a new spreadsheet. As a row by row list of births, these columns aren’t ideal for charting as they are, so I want a count of artists per year, broken down by gender.

Insert PivotChart

On the ‘Insert’ menu, click on PivotTable to open the menu and display the option for PivotCharts.

Excel will select our columns as being the most likely thing we want to chart. That all looks fine to me so click ‘OK’.

Configure the PivotChart

This screen asking you to ‘choose fields from the PivotTable Field List’ might look scary, but we’ve only got two columns of data so you can’t really go wrong.

The columns have already been added to the PivotTable Field List on the right, so go ahead and tick the box next to ‘gender’ and ‘yearofBirth’. Excel will probably put them straight into the ‘Axis Fields’ box.

Leave yearofBirth under Axis Fields and drag ‘gender’ over to the ‘Values’ box next to it. Excel automatically turns it into ‘count of gender’, assuming that we want to count the number of births per year.

The final task is to drag ‘gender’ down from the PivotTable Field List to ‘Legend Fields’ to create a key for which colours represent which gender. You should now see the pivot table representing the calculated values on the left and a graph in the middle.

Close-up of the pivot fields

When you click off the graph, the PivotTable options disappear – just click on the graph or the data again to bring them up.

You’ve made your first pivot chart!

You might want to drag it out a bit so the values aren’t so squished. Tate’s data covers about 500 years so there’s a lot to fit in.

Now you’ve made a pivot chart, have a play – if you get into a mess you can always start again!
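If you’re curious about what the pivot table is doing under the hood, the same ‘count of artists per year, broken down by gender’ can be sketched in a few lines of Python. The rows below are invented stand-ins for the spreadsheet, not Tate’s actual data.

```python
from collections import Counter

# A few invented rows in the same shape as the simplified spreadsheet:
# one row per artist, with a gender and a year of birth.
rows = [
    {"gender": "Female", "yearofBirth": 1930},
    {"gender": "Male", "yearofBirth": 1930},
    {"gender": "Male", "yearofBirth": 1930},
    {"gender": "Female", "yearofBirth": 1945},
]

# Count artists for each (year, gender) pair, like 'count of gender'.
counts = Counter((row["yearofBirth"], row["gender"]) for row in rows)

# Reshape into year -> {gender: count}, roughly what the pivot table shows.
pivot = {}
for (year, gender), n in sorted(counts.items()):
    pivot.setdefault(year, {})[gender] = n

for year, by_gender in pivot.items():
    print(year, by_gender)
```

Each year becomes a row and each gender a column of counts, which is all the chart needs.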

Colophon: the screenshots are from Excel 2010 for Windows because that’s what I have.

About the data: this data was originally supplied by Tate. The full version on Tate’s website includes name, date of birth, place of birth, year of death, place of death and URL on Tate’s website. The latest versions of their data can be downloaded from http://www.tate.org.uk/about/our-work/digital/collection-data The source data for this file can be downloaded from https://github.com/tategallery/collection/blob/master/artist_data.csv This version was simplified so it only contains a list of years of birth and the gender of the artist. Some blank values for gender were filled in based on the artist’s name or a quick web search; groups of artists or artists of unknown gender were removed as were rows without a birth year. This data was prepared in March 2015 for a British Library course on ‘Data Visualisation for Analysis in Scholarly Research’ by Mia Ridge.
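For anyone who’d rather script the simplification than do it by hand, here’s a rough Python sketch of the automatic part: keeping only gender and year of birth, and dropping rows that lack either. It doesn’t attempt the manual step of filling in blank genders. The sample rows are invented, and the column names are assumptions that should be checked against the header row of the real file.

```python
import csv
import io

# Invented sample rows; the column names are assumed to resemble the
# source file and should be checked against the real header row.
SAMPLE = """name,gender,yearOfBirth
"Example, Anna",Female,1901
Example Collective,,1975
"Example, Ben",Male,
"Example, Cy",Male,1850
"""


def simplify(csv_text):
    """Keep only gender and birth year, dropping incomplete rows."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        gender = (row.get("gender") or "").strip()
        year = (row.get("yearOfBirth") or "").strip()
        if gender and year:
            kept.append({"gender": gender, "yearOfBirth": int(year)})
    return kept


simplified = simplify(SAMPLE)
```

With the real file you’d read from disk instead of the `SAMPLE` string, and decide for yourself how to handle groups and unknowns.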

I’d love to hear if you found this useful or have any suggestions for tweaks.

This is a lazy post, a straight copy and paste of my presentation notes (my excuse is that I’m eight days behind on everything at work and uni after being grounded in the US by volcanic ash). Anyway, I hope you enjoy it or that it’s useful in some way.

The Cosmic Collections project was based on a simple idea – what if we gave people the ability to make their own collection website? The Science Museum was planning an exhibition on astronomy and culture, to be called ‘Cosmos & Culture’. We had limited time and resources to produce a site to support the exhibition and we risked creating ‘just another exhibition microsite’. So what if we provided access to the machine-readable exhibition content that was already being gathered internally, and threw it open to the public to make websites with it? And what if we motivated them to enter by offering competition prizes? Competition participants could win a prize and kudos, and museum audiences might get a much more interesting, innovative site.

The idea was a good match for museum mission, exhibition content, technical context, hopefully audience – but was that enough?

Slide 2 (satellite dish):

Questions…

If we built an API, would anyone use it?

Can you really crowdsource the creation of collections interfaces?

The project gave me a chance to investigate some specific questions. At the time, there were lots of calls from some quarters for museums to produce APIs for each project, but would anyone actually use a museum API? The competition might help us understand whether or how we should invest in APIs and machine-readable data.

We can never build interfaces to meet the needs of every type of audience. One of the promises of machine-readable data is that anyone can make something with your data, allowing people with particular needs to create something that supports their own requirements or combines their data with ours – but would anyone actually do it?

Slide 3 (map mashup):

Mashups combine data from one or more sources and/or data and visualisation tools such as maps or timelines.

I’m going to get the geek stuff out of the way and quickly define mashups and APIs…

Mashups are computer applications that take existing information from known sources and present it to the viewer in a new way. Here’s a mashup of content edits from Wikipedia with a map showing the location of the edit.
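Stripped right back, a mashup like that is a join between two sources. The Python sketch below combines an invented list of edits with an invented lookup of coordinates to produce markers a mapping tool could plot. None of the data or names come from the actual Wikipedia mashup.

```python
# Two invented sources: a list of edits and a lookup of place coordinates.
edits = [
    {"page": "Astrolabe", "place": "London"},
    {"page": "Telescope", "place": "Sydney"},
    {"page": "Sextant", "place": "Atlantis"},  # no known coordinates
]
coordinates = {
    "London": (51.5, -0.1),
    "Sydney": (-33.9, 151.2),
}


def mash_up(edits, coordinates):
    """Join each edit to its coordinates, ready for plotting on a map."""
    markers = []
    for edit in edits:
        location = coordinates.get(edit["place"])
        if location is None:
            continue  # skip anything we cannot place on the map
        lat, lon = location
        markers.append({"label": edit["page"], "lat": lat, "lon": lon})
    return markers


markers = mash_up(edits, coordinates)
```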

Slide 4 (APIs)

APIs (Application Programming Interfaces) are a way for one machine to talk to another: ‘Hi Bob, I’d like a list of objects from you, and hey, Alice, could you draw me a timeline to put the objects on?’

APIs tell a computer, ‘if you go here, you will get that information, presented like this, and you can do that with it’.
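For the curious, here’s roughly what that conversation looks like from the developer’s side, sketched in Python. The endpoint, the parameter names and the response are all made up for illustration; this isn’t the Science Museum’s API, and the sketch doesn’t make a real network request.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, invented for illustration only.
BASE_URL = "https://api.example.org/objects"


def build_request(keyword, limit=10):
    """'If you go here...': the address a client would ask for."""
    return BASE_URL + "?" + urlencode({"q": keyword, "limit": limit})


# '...you will get that information, presented like this': a made-up response.
SAMPLE_RESPONSE = '{"objects": [{"id": 1, "title": "Astrolabe", "date": "1650"}]}'


def parse_objects(response_text):
    """Pull out the bits a timeline would need from the response."""
    payload = json.loads(response_text)
    return [(obj["date"], obj["title"]) for obj in payload["objects"]]


url = build_request("astrolabe", limit=5)
timeline_items = parse_objects(SAMPLE_RESPONSE)
```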

A way of providing re-usable content to the public, other museums and other departments within our museum – we created a shared backend for web and gallery interactives.

I think of APIs as user interfaces for developers and wanted to design a good experience for developers with the same care you would for end users*. I hoped that feedback from the competition could be used to improve the beta API.

* we didn’t succeed in the first go but it’s something to aim for post-beta

Slide 5: (what if nobody came?)

AKA ‘the fears and how to deal with them’

Acknowledge those fears

Plan for the worst case scenario

Take a deep breath and do it anyway

And on the next slides, the results. If I was replicating the real experience, you’d have several nerve-biting months while you waited for the museum to lumber into gear, planned the launch event, publicised the project in the participant communities… Then waited for results to come in. But let’s skip that bit…

Slide 6: (Ryan Ludwig’s http://www.serostar.com/cosmic/)

The results – our judges declared a winner and a runner-up, these are screenshots – this is the second prize winning entry.

People came to the party. Yay! I’d like to thank all the participants, whether they submitted a final entry or not. It wouldn’t have worked without them.

Slide 7: (Natalie and Simon’s http://cosmos.natimon.com/)

This is a screenshot from the winning site – it made the best use of the API and was designed to lure the visitor in and keep drawing them through the site.

(We didn’t get subject specialists scratching their own itch – maybe they don’t need to share their work, maybe we didn’t reach them. Would like to reach researchers, let them know we have resources to be used, also that they can help us/our audiences by sharing their work)

Slide 8: (astrolabe – what did we learn?)

People need (more) help to participate in a geektastic project like this

The dynamics of a competition are tricky

Mashups are shaped by the data provided – you get out what you put in

Can we help people bring their own content to a future mashup?

Slide 9: (evaluation)

I did a small survey to evaluate the project… Turns out the project was excellent outreach into the developer community. People were really excited about being invited to play with our data. My favourite quote: “The very idea of the competition was awesome”

Slide 10: (paper sheet)

Also positive coverage in technical press. So in conclusion?

Slide 11: (Tim Berners-Lee):

“The thing people are amazed about with the web is that, when you put something online, you don’t know who is going to use it—but it does get used.”

There are a lot of opportunities and excitement around putting machine-readable data online…

Slide 12: Tim Berners-Lee 2:

But: It doesn’t happen automatically; It’s not a magic bullet

But people won’t find and use your APIs without some encouragement. You need to support your API users. People outside the museum bring new ideas but there’s still a big role for people who really understand the data and audiences to help make it a quality experience…

Slide 13 (space):

What next?

Using the feedback to focus and improve collection-wide API

Adding other forms of machine-readable data

Connecting with data from your collections?

I’ve been thinking about how to improve APIs – offer subject authorities with links to collections, embed markup in the collections pages to help search engines understand our data…
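One possible form for that embedded markup is a block of schema.org JSON-LD in each collection page. The Python below is only a sketch of the idea: the record, the URL and the choice of `CreativeWork` as the type are my assumptions for illustration, not something we had implemented.

```python
import json


def to_json_ld(record):
    """Describe a collection record using schema.org terms as JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "CreativeWork",
        "name": record["title"],
        "url": record["url"],
        "creator": {"@type": "Person", "name": record["maker"]},
    }


# An invented record and URL, purely for illustration.
record = {
    "title": "Example astrolabe",
    "maker": "Unknown maker",
    "url": "https://collections.example.org/objects/1",
}

# The markup would sit in the HTML of the object's page.
script_tag = (
    '<script type="application/ld+json">'
    + json.dumps(to_json_ld(record))
    + "</script>"
)
```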

I want more! The more of us with machine-readable data available for re-use, the better the cross-collections searches, the region or specialism-wide mashups… I’d love to be able to put together a mashup showing all the cultural heritage content about my suburb; all the Boucher self-portraits; all the inventions that helped make the Space Shuttle work…

Slide 14: (thank you)

If you’re interested in possibilities of machine-readable data and access to your collections, join in the conversation on the museum API wiki or follow along on twitter or on blogs. Join in at http://museum-api.pbworks.com/

Tom Morris gave a lightning talk on ‘How to use Semantic Web data in your hack’ (aka SPARQL and semantic web stuff).

He’s since posted his links and queries – excellent links to endpoints you can test queries in.

Semantic web often thought of as long-promised magical elixir, he’s here to say it can be used now by showing examples of queries that can be run against semantic web services. He’ll demonstrate two different online datasets and one database that can be installed on your own machine.

First – dbpedia – scraped lots of wikipedia, put it into a database. dbpedia isn’t like your average database, you can’t draw a UML diagram of wikipedia. It’s done in RDF and Linked Data. Can be queried in a language that looks like SQL but isn’t. SPARQL – is a w3c standard, they’re currently working on SPARQL 2.

Go to dbpedia.org/sparql – submit query as post. [Really nice – I have a thing about APIs and platforms needing a really easy way to get you to ‘hello world’ and this does it pretty well.]

[Line by line comments on the syntax of the queries might be useful, though they’re pretty readable as it is.]

‘select thingy, wotsit where [the slightly more complicated stuff]’

Can get back results in xml, also HTML, ‘spreadsheet’, JSON. Ugly but readable. Typed.
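[Not from the talk: a rough Python sketch of the shape of this, for my own reference. The query is a generic one I’ve made up, not one of Tom’s. The `format` parameter is an assumption about how the dbpedia endpoint lets you ask for JSON. The sample response is invented, though it follows the standard SPARQL JSON results layout. The sketch builds a GET URL rather than posting, and doesn’t actually call the endpoint.]

```python
import json
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"

# 'select thingy, wotsit where [the slightly more complicated stuff]'
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thing ?label WHERE {
  ?thing rdfs:label ?label .
  FILTER (lang(?label) = "en")
} LIMIT 5
"""


def build_url(query, fmt="application/sparql-results+json"):
    """Encode the query as URL parameters ('format' is an assumption)."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": fmt})


# An invented response in the W3C SPARQL JSON results layout.
SAMPLE = """{"head": {"vars": ["thing", "label"]},
 "results": {"bindings": [
   {"thing": {"type": "uri", "value": "http://example.org/thing/1"},
    "label": {"type": "literal", "xml:lang": "en", "value": "Example"}}]}}"""


def rows(response_text):
    """Flatten bindings into plain dicts of variable name -> value."""
    data = json.loads(response_text)
    names = data["head"]["vars"]
    return [
        {name: binding[name]["value"] for name in names if name in binding}
        for binding in data["results"]["bindings"]
    ]


url = build_url(QUERY)
results = rows(SAMPLE)
```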

[Trying a query challenge set by others could be fun way to get started learning it.]

One problem – fictional places are in Wikipedia e.g. Liberty City in Grand Theft Auto.

Libris – how library websites should be
[I never used to appreciate how much most library websites suck until I started back at uni and had to use one for more than one query every few years]

Has a query interface through SPARQL

Comment from the audience BBC – now have SPARQL endpoint [as of the day before? Go BBC guy!].

Playing with mulgara, open source Java triple store. [mulgara looks like a kinda faceted search/browse thing] Has own query language called TQL which can do more interesting things than SPARQL. Why use it? Schemaless data storage. Is to SQL what dynamic typing is to static typing. [did he mean ‘is to sparql’?]
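A rough idea of what ‘schemaless’ means for a triple store, as a toy Python sketch. This is not mulgara’s API or TQL, just an in-memory set of (subject, predicate, object) statements with wildcard matching; the `ex:` names are made up for illustration.

```python
class TripleStore:
    """A toy in-memory triple store: every fact is (subject, predicate, object)."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple
            for triple in self.triples
            if all(want is None or want == got for want, got in zip(pattern, triple))
        )


store = TripleStore()
store.add("ex:liberty_city", "rdf:type", "ex:Place")
store.add("ex:liberty_city", "ex:appearsIn", "ex:grand_theft_auto")
# A new kind of statement needs no schema change, just another triple.
store.add("ex:liberty_city", "ex:isFictional", "true")
store.add("ex:london", "rdf:type", "ex:Place")

if __name__ == "__main__":
    print(store.match(predicate="rdf:type", obj="ex:Place"))
    print(store.match(subject="ex:liberty_city"))
```

In a relational database the ‘isFictional’ fact would mean adding a column or table first; here it’s just one more statement, which is roughly the dynamic-versus-static typing comparison.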

Question from audience: how do you discover what you can query against?
Answer: dbpedia website should list the concepts they have in there. Also some documentation of categories you can look at. [Examples and documentation are so damn important for the uptake of your API/web service.]

Coming soon [?] SPARUL – update language, SPARQL2: new features

The end!

[These are more (very) rough notes from the weekend’s Open Hack London event – please let me know of clarifications, questions, links or comments. My other notes from the event are tagged openhacklondon.

Quick plug: if you’re a developer interested in using cultural heritage (museums, libraries, archives, galleries, archaeology, history, science, whatever) data – a bunch of cultural heritage geeks would like to know what’s useful for you (more background here). You can comment on the #chAPI wiki, or tweet @miaridge (or @mia_out). Or if you work for a company that works with cultural heritage organisations, you can help us work better with you for better results for our users.]

There were other lightning talks on Pachube (pronounced ‘patchbay’, about trying to build the internet of things, making an API for gadgets because e.g. connecting hardware to the web is hard for small makers) and Homera (an open source 3d game engine).

The Future of the Web with Sir Tim Berners-Lee at Nesta, London, July 8.

My notes from the Nesta event, The Future of the Web with Sir Tim Berners-Lee, held in London on July 8, 2008.

Panel at ‘The Future of the Web’ with Sir Tim Berners-Lee, Nesta

As usual, let me know of any errors or corrections; comments are welcome, and comments in [square brackets] are mine. I wanted to get these notes up quickly so they’re pretty much ‘as is’: they cover the random points that interested me and aren’t necessarily representative. My write-up of a previous talk by Tim Berners-Lee in March 2007 goes into more detail about web science.

The event was introduced by NESTA’s CEO, Jonathan Kestenbaum. Explained that online contributions from the pre-event survey, and from the (twitter) backchannel would be fed into the event. Other panel members were Andy Duncan from Channel 4 and the author Charlie Leadbeater though they weren’t introduced until later.

Web as blockage in sink – starts with a bone, stuff builds up around it, hair collects, slime – perfect for bugs, easy for them to get around – we are the bugs (that woke people up!). The web is a rich environment in which to exist.

Semantic web – what’s interesting isn’t the computers, or the documents on the computers, it’s the data in the documents on the computers. Go up layers of abstraction.

Paraphrase, about the web: ‘we built it, we have a duty to study it, to fix it; if it’s not going to lead to the kind of society we want, then tweak it, fix it’.

‘Someone out there will imagine things we can’t imagine; prepare for that innovation, let that innovation happen’. Prepare for a future we can’t imagine.

End of talk! Other panelists and questions followed.

Charles Leadbeater – talked about the English Civil War, recommends a book called ‘The World Turned Upside Down’. The bottom of society suddenly had the opportunity to be in charge. New ‘levellers‘ movement via the web. Participate, collaborate, (etc) without the trappings of hierarchy. ‘Is this just a moment’ before the corporate/government Restoration? Iterative, distributed, engaged with practice.

Need new kinds of language – dichotomies like producer/consumer are disabling. Is the web – a mix of academic, geek, rebel, hippie and peasant village cultures – a fundamentally different way of organising, will it last? Are open, collaborative working models that deliver the goals possible? Can we prevent creeping re-regulation that imposes old economics on the new web? e.g. ISPs and filesharing. Media literacy will become increasingly important. His question to TBL – what would you have done differently to prevent spam while keeping the openness of the web? [Though isn’t spam more of a problem for email at the moment?]

Andy Duncan, CEO of Channel 4 – web as ‘tool of humanity’, ability for humans to interact. Practical challenges to be solved. £50million 4IP fund. How do we get, grow ideas and bring them to the wider public, and realise the positive potential of ideas. Battle between positive public benefit vs economic or political aspects.

The internet brings more/different perspectives, but people are less open to new ideas – they get cosy, only talk to like-minded people in communities who agree with each other. How do you get people engaged in radical and positive thinking? [This is a really good observation/question. Does it have to do with the discoverability of other views around a topic? Have we lost the serendipity of stumbling across random content?]

Open to questions. ‘Terms and conditions’ – all comments must have a question mark at the end of them. [I wish all lectures had this rule!]

Questions from the floor: 1. why is the semantic web taking so long; 2. 3D web; 3. kids.
TBL on semantic web – lots of exponential growth. SW is more complicated to build than HTML system. Now has standard query language (SPARQL). Didn’t realise at first that needed a generic browser and linked open data. (Moving towards real world).

[This is where I started to think about the question I asked, below – cultural heritage institutions have loads of data that could be open and linked, but it’s not as if institutions will just let geeks like me release it without knowing where and why and how it will be used – and fair enough, but then we need good demonstrators. The idea that the semantic web needs lots of acronyms (OWL, GRDDL, RDF, SPARQL) in place to actually happen is a perception I encounter a lot, and I wanted an answer I could pass on. If it’s ‘straight from the horse’s mouth’, then even better…]

Questions from twitter (though the guy’s laptop crashed): 4. will Google own the world? What would Channel 4 do about it?; 5. is there a contradiction between [collaborative?] open platform and spam?; 6. re: education, in era of mass collaboration, what’s the role of expertise in a new world order? [Ooh, excellent question for museums! But more from the point of view of them wondering what happens to their authority, especially if their collections/knowledge start to appear outside their walls.]

AD: Google ‘ferociously ambitious in terms of profit’, fiercely competitive. They should give more back to the UK considering how much they take out. Qu to TBL re Google, TBL did not bite but said, ‘tremendous success; Google used science, clustering algorithms, looked at the web as a system’.
CL re qu 5 – the web works best through norms and social interactions, not rules. Have to be careful with assumption that can regulate behaviour -> ‘norm based behaviour’. [But how does that work with anti-social individuals?]
TBL re qu 6: e.g. MIT Courseware – experts put their teaching materials on the web. Different people have different levels of expertise [but how are those experts recognised in their expert context? Technology, norms/links, a mixture?]. More choice in how you connect – doesn’t have to be local. Being an expert [sounds exhausting!] – connect, learn, disseminate – huge task.

Questions from the floor: 7. ISPs as villains, what can they do about it?; 9. why can’t the web be designed to use existing social groups? [I think, I was still recovering from asking a question] TBL re qu 7 and ISPs ‘give me non-discriminatory access and don’t sell my clickstream’. [Hoorah!]

So the middle question (Question 8) was me. It should have been something like ‘if there’s a tension between the top-down projects that don’t work, and simple protocols like HTML that do, and if the requirements of the ‘Semantic Web’ are top-down (and hard), how do we get away from the idea that the semantic web is difficult to just have the semantic web?’* but it came out much more messily than that as ‘the semantic web as proposed is a top-down system, but the reason the web worked was that it was simple, easy to participate, so how does that work, how do we get the semantic web?’ and his response started “Who told you SW is top down?”. It was a leading question so it’s my fault, but the answer was worth asking a possibly stupid/leading question.

His full answer [about two minutes, starting 20’20” into the Q&A video] was: ‘Who on earth told you the semantic web was a top-down designed system? It’s not. It is totally bottom-up. In fact the really magic thing about it is that it’s middle-out as well. If you imagine lots of different data systems which talk different languages, it’s a bit like imagine them as a quilt of those things sewn together at the edges. At the bottom level, you can design one afternoon a little data system which uses terms and particular concepts which only you use, and connect to nobody else. And then, in a very bottom-up way, start meeting more and more people who’ll start to use those terms, and start negotiating with people, going to, heaven forbid, standards bodies and committees to push, to try to get other people to use those terms. You can take an existing set of terms, like the concepts when you download a bank statement, you’ll find things like the financial institution and transaction and amount have pretty much been defined by the banks, you can take those and use those as semantic web terms on the net.
And if you want to, you can do that at the very top level because you might decide that it’s worth everybody having exactly the same URI for the concept of latitude, for the number you get out of the GPS, and you can join the W3C interest group which has gotten together people who believe in that, and you’ve got the URI, [people] went to a lot of trouble to make something which is global. The world works like that plug of stuff in the sink, it’s a way of putting together lots and lots of different communities at different levels, only some of them, a few of them are global. The global communities are hard work to make. Lots and lots and lots of them are local, those are very easy to make. Lots of important benefits are in the middle. The semantic web is the first technology that’s designed with an understanding of that’s how the world is, the world is a scale-free, fractal if you like, system. And that’s why it’s all going to work.’

[So I was asking ‘how do we get to the semantic web’ in the museum sector – we can do this. Put a dataset out there, make connections to the organisation next to you (or get your users to by gathering enough anonymised data on how they link items through searching and browsing). Then make another connection, and another. We could work at the sector (national or international) level too (stable permanent global identifiers would be a good start) but start with the connections. “Small pieces loosely joined” -> “small ontologies, loosely joined”. Can we make a manifesto from this?

“He urged attendees to look over their data, take inventory of it, and decide on which of the things you’d most likely get some use out of re-using it on the Web. Decide priorities, and benefits of that data reuse, and look for existing ontologies on the Web on how to use it, he continued, referring to the term that describes a common lexicon for describing and tagging data.”

Anyway, on with the show.]

[*Comment from 2015: in hindsight, my question speaks to the difficulties of getting involved in what appeared to be distant and top-down processes of ontology development, though it might not seem that distant to someone already working with W3C. And because museums are tricky, it turns out the first place to start is getting internal museum systems to talk to each other – if you can match people, places, objects and concepts across your archive, library and museum collections management systems, digital asset management system and web content management system, you’re in a much better position to match terms with other systems. That said, the Linking Museums meetups I organised in London and various other museum technology forums were really helpful.]

Questions from the floor: 10. do we have enough “bosses who don’t say no”?; 11. web to solve problems, social engineering [?]; 12. something on Rio meeting [didn’t get it all].

TBL re 10 – he can’t emulate other bosses but he tries to have very diverse teams, not clones of him/each other, committed, excited people and ‘give them spare time to do things they’re interested in’. So – give people spare time, and nurture the champions. They might be the people who seem a bit wacky [?] but nurture the ones who get it.

Qu 11 – conflicting demands and expectations of web. TBL – ‘try not to think of it as a thing’. It’s an infrastructure, connections between people, between us. So, are we asking too much of us, of humanity? Web is reflection of humanity, “don’t expect too little”.

TBL re qu 12 – internet governance is the Achilles heel of the web. No permission required except for domain name. A ‘good way to make things happen slowly is to get a bureaucracy to govern it’. Slowness, stability. Domain names should last for centuries – persistence is a really important part of the web.

CL re qu 11 – possibilities of self-governance, we ask too little of the web. Vision of open, collaborative web capable of being used by people to solve shared problems.

JK – (NESTA) don’t prescribe the outcome at the beginning, commitment to process of innovation.

Then Nesta hosted drinks, then we went to the pub and my lovely mate said “I can’t believe you trolled Tim Berners-Lee”. [I hope I didn’t really!]

The HyperRecord system, used by the Capitoline Museums (Rome) and the Bibliotheca Hertziana (Max-Planck Institute, Rome) and developed as a Culture2000 project, is a framework for the inter-connectivity of information resources from museums, archives and cultural institutes.
…
The repositories offer both the usual human interface for research (fulltext, title, etc.) and a smart REST API with a powerful behind-the-scenes direct machine-to-machine facility for querying and retrieving data.
…
The different information resources use digital object identifiers in the form of URNs (up to now, mostly for museum objects) for identification and direct-access. These allow easy aggregation of contents (data, records, documents) not only inside a repository but also across boundaries using the REST API for serving XML over a plain HTTP connection, in fact creating a loosely coupled network of repositories.
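As a sketch of what ‘URNs plus XML over plain HTTP’ might look like from a client’s side, here’s some Python that aggregates records from two repositories. Everything specific in it is an assumption for illustration: the `/records/<urn>` URL pattern, the `urn:example:…` identifiers, the example.org hosts and the XML element names are hypothetical, and none of it is HyperRecord’s actual API. The fetch function is passed in, so the example runs against canned XML rather than the network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def record_url(base_url, urn):
    """Build a REST URL for one object; the /records/<urn> pattern is hypothetical."""
    return f"{base_url.rstrip('/')}/records/{quote(urn, safe='')}"


def parse_record(xml_text):
    """Pull a few fields out of a record; the element names are hypothetical."""
    root = ET.fromstring(xml_text)
    return {
        "urn": root.get("urn"),
        "title": root.findtext("title"),
        "holder": root.findtext("holder"),
    }


def aggregate(urns_by_repository, fetch):
    """Collect records across repositories; fetch(url) must return XML text."""
    records = []
    for base_url, urns in urns_by_repository.items():
        for urn in urns:
            records.append(parse_record(fetch(record_url(base_url, urn))))
    return records


# Two made-up repositories, each holding one object.
REPOSITORIES = {
    "https://museum-a.example.org": ["urn:example:museum-a:obj-1"],
    "https://library-b.example.org": ["urn:example:library-b:doc-7"],
}

# Canned responses standing in for the HTTP layer.
CANNED = {
    record_url("https://museum-a.example.org", "urn:example:museum-a:obj-1"):
        '<record urn="urn:example:museum-a:obj-1">'
        "<title>Bronze statue</title><holder>Museum A</holder></record>",
    record_url("https://library-b.example.org", "urn:example:library-b:doc-7"):
        '<record urn="urn:example:library-b:doc-7">'
        "<title>Excavation report</title><holder>Library B</holder></record>",
}


def fake_fetch(url):
    return CANNED[url]


if __name__ == "__main__":
    for record in aggregate(REPOSITORIES, fake_fetch):
        print(record)
```

Swapping `fake_fetch` for a function that reads the URL with `urllib.request.urlopen` would turn this into a live client; because the URN is the only shared key, the repositories stay loosely coupled.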

Thanks to Leif Isaksen, who put Dr Werner in contact with me after seeing his paper at CAA07.

This experiment is part of Google’s broader effort to increase its coverage of the web. In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web, or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms (and abiding by robots.txt), we are able to lead search engine users to documents that would otherwise not be easily found in search engines, and provide webmasters and users alike with a better and more comprehensive search experience.

You’re probably already well indexed if you have a browsable interface that leads to every single one of your collection records and images and whatever; but if you’ve got any content that is hidden behind a search form (and I know we have some in older sites), this could give it much greater visibility.

To take an example: at the individual tag level, the flaws of misspellings and inaccuracies are annoying and troublesome, but at a meta level these inaccuracies are ironed out; flattened by sheer mass: a kind of bell-curve peak of correctness. At the same time, inferences can be drawn from the connections and proximity of tags. If the word “cat” appears consistently – in millions and millions of data items – next to the word “kitten” then the system can start to make some assumptions about the related meaning of those words. Out of the apparent chaos of the folksonomy – the lack of formal vocabulary, the anti-taxonomy – comes a higher-level order. Seb put it the other way round by talking about the “shanty towns” of museum data: “examine order and you see chaos”.

The total “value” of the data, in other words, really is way, way greater than the sum of the parts.
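The ‘cat’ and ‘kitten’ point can be sketched in a few lines of Python: count how often pairs of tags appear on the same item, then keep only the pairs seen often enough. This is a toy illustration with made-up tags and an arbitrary threshold, not how any particular tagging system does it.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(tagged_items):
    """Count how often each pair of (normalised) tags appears on the same item."""
    pairs = Counter()
    for tags in tagged_items:
        cleaned = sorted({tag.strip().lower() for tag in tags})
        pairs.update(combinations(cleaned, 2))
    return pairs


def related_to(tag, pairs, minimum=2):
    """List tags seen alongside `tag` at least `minimum` times, most frequent first."""
    related = Counter()
    for (first, second), count in pairs.items():
        if count >= minimum and tag in (first, second):
            related[second if first == tag else first] = count
    return related.most_common()


# Made-up tag sets, one list per tagged item.
ITEMS = [
    ["cat", "kitten", "pet"],
    ["Cat", "kitten"],
    ["cat", "kitten", "basket"],
    ["cat", "kiten"],  # a misspelling, seen only once
    ["dog", "pet"],
]

pairs = cooccurrence(ITEMS)

if __name__ == "__main__":
    print(related_to("cat", pairs))
```

With the threshold at two, the one-off misspelling ‘kiten’ drops out while ‘kitten’ stays, which is the ‘flattened by sheer mass’ effect in miniature.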

So far, so ace. We’ve been excited before about using the implicit links created between data, as people consciously record information with tags or unconsciously with their paths between data, to create those ‘small ontologies, loosely joined’; and about the possibilities of multilingual tagging, etc. Tags are cool.

But the applications of this could go further:

I got thinking about how this can all be applied to the Semantic Web. It increasingly strikes me that the distributed nature of the machine processable, API-accessible web carries many similar hallmarks. Each of those distributed systems – the Yahoo! Content Analysis API, the Google postcode lookup, Open Calais – are essentially dumb systems. But hook them together; start to patch the entire thing into a distributed framework, and things take on an entirely different complexion.

…

Here’s what I’m starting to gnaw at: maybe it’s here. Maybe if it quacks like a duck, walks like a duck (as per the recent Becta report by Emma Tonkin at UKOLN) then it really is a duck. Maybe the machine-processable web that we see in mashups, API’s, RSS, microformats – the so-called “lightweight” stuff that I’m forever writing about – maybe that’s all we need. Like the widely accepted notion of scale and we-ness in the social and tagged web, perhaps these dumb synapses when put together are enough to give us the collective intelligence – the Semantic Web – that we have talked and written about for so long.

I’d say those capital letters in ‘Semantic Web’ might scare some of the hardcore SW crowd, but that’s ok, isn’t it? Semantics (sorry) aside, we’re all working towards the same goal – the machine-processable web.

And in the meantime, if we can put our data out there so others can tag it, and so that we’re exposing our internal ‘tags’ (even if they have fancier names in our collections management systems), we’re moving in the right direction.