Governance & Sustainability for a Web Observatory

I’m in a workshop at MIT today about plans to create a ‘Web Observatory’, collecting and curating vast quantities of data from across the web for research – in part to ensure that researchers’ capacity to study the web keeps pace with that of the companies and entrepreneurs who are already gathering terabytes of ‘traces’ of online behaviours in proprietary platforms. A lot of the discussion so far has looked at datasets for research gathered from platforms such as Twitter, curated data from platforms like Open Street Map, or data collected in research projects focussed on sensor networks and ‘humans as sensors’. However, the vision of the Web Observatory is not just about providing a catalogue of data for secondary research, but also about providing methods and tools that enable researchers to “locate, analyse, compare and interpret useful information in a consistent and reliable way … rather than drowning in a sea of data”.

As Wendy Hall noted in opening remarks, whilst the Web Observatory work begins with emphasis on academic researchers as the users of data, in the long run, Observatories could (or should) be accessible to individuals also. The growing imbalance of power created between citizens and companies through the privileged access that corporations have to information on our collective social lives is set to become an increasingly pressing social and political issue.

Now, there are clearly big technical challenges ahead in building the Web Observatory project and the many federated Web Observatories that will result, but in this post I want to briefly explore one of the organisational ones: getting the governance and sustainability of Web Observatories right.

Lessons from linked data: sustainability

If you’ve ever spent time exploring Linked Data projects you will likely have stumbled across a lot of abandoned datasets: one-off conversions of open data, or data generated through now-defunct research projects. The Web of Linked Data is far too often a web of broken links – as the funding for research projects runs out, links go dark.

The Linked Data Around the Clock programme (on a website that’s now offline) had a slide that captures the coordination dilemma at the heart of creating and sustaining good Linked Data: the value of (linked and/or open) data accrues to a range of parties, and involves input from a range of parties. When projects are sustained through short-term grant funding, which covers all the work to create, curate and make accessible a dataset, then that data is sustainable only so long as the funding continues. It could be argued that when data is open, this is not so big a problem – as someone can simply take a copy of the data and, if the original source goes dark, bring up an alternative host for it. But in practice, with Web Observatory datasets we’re talking about big data, where simply storing the datasets can require hundreds of terabytes of storage, and about datasets which cannot be entirely open due to privacy concerns or the Terms of Service of the source data. The data also tends to be shaped primarily by the needs of the funded project that creates it, not by the needs of the projects that want to re-use the data. Although linked data promises distributed annotation and enhancement of data, in practice data needs to be aggregated together in one place to be queried – and it’s more efficient to pool resources to enhance and maintain one data store than to try to copy, convert and enhance multiple copies of big datasets.

So: if learning from Linked Data is anything to go by, the Web Observatory needs to be thinking critically from the start about how key datasets will be sustained, and how collaboration on enhancing data will be facilitated – recognising that there is a non-zero net cost (lots of near-zero marginal costs add up quickly in big data…) to enhancing and adding data to someone else’s data store.

Ethics issues: empowering access

Many of the datasets that might be contained in Web Observatories will raise significant privacy concerns. It might be tempting to manage these by simply deferring responsibility for judging what use can be made of the data to Institutional Review Boards and ethics committees at the different participating academic institutions. But if the Web Observatory programme is to be open to partners beyond academia, then ethics processes need to be placed at the heart of the Observatory governance structures, rather than managed around the edges.

A proposal: exploring co-operative ownership and governance

There are, I think, three broad governance models open:

Observatories hosted and held in trust by institutions: institutions, primarily academic, use fixed-term project funds to set up Web Observatories. They let other people use these so long as their funding allows, and prioritise those requests to enhance, extract or work with the data that fit with their own research goals. At the end of the project funding, Observatories either die, or end up maintained through residual or other funds.

Independent foundations: the model used by large web public goods like Archive.org and Wikipedia – establishing independent legal entities that maintain an Observatory. This has the value of helping Observatories outlive the projects that start them – but makes Observatories dependent upon finding their own funding, and creates an extra organisational entity over and above the partners with an interest in the data. That entity either ends up with its own agenda and organisational imperatives, or leaves a collective action problem, with each of the partners waiting for the others to provide the funding to keep the lights on.

Data co-operatives: building on discussions convened last year in Manchester, there may be a new organisational structure that Web Observatories can build upon – that of the data co-operative. In a data co-operative, a light-weight separate entity is established, but one which is constituted and jointly owned by the researchers and research institutions with a stake in the data. Co-operatives can establish rules about the resources that members should bring to the co-op, and what control they can expect over the design and maintenance of the Observatory, and can provide procedures for easy entry to and exit from the co-op. In Manchester we discussed the potential for hybrid ‘workers/suppliers’ and ‘users/consumers’ co-operatives, that could give both the creators of data, and the researchers using the data, an appropriate stake in it. Making co-operative membership a condition of access to data with privacy or ethical issues could also provide a home for ethics procedures.

Whilst it is the least developed, this third option, I think, holds the most promise.

I don’t know yet if the Web Observatory programme will have an organisational research component – but I hope so…