Building Speculative Future Capital into Digital Curation and Embracing the Changing ‘Signs’ in Metadata

The following essay discusses the case study ‘Using the DCC Lifecycle Model to Curate a Gene Expression Database’. There is little doubting the benefit of such a project. Studying gene expression in early human foetal development can help scientists and medical professionals to better understand human growth in relation to the contracting of disease both early and later in life, because it forms the foundation of how all human life develops. Recent research indicates that most genetic diseases begin in this early stage of human development, so an archive that allows scientists to draw upon previous gene samples and subsequent experiment results is invaluable in understanding where disease comes from and, by extension, how to prevent and cure it. However, digital curation, no matter how much planning and policy outlining is involved, and no matter how valuable the collection is, is ultimately at the mercy of financial sponsors. Therefore, while it is important to plan through the full life cycle of a project, it is just as important to build ideas for ongoing funding into the project and, if possible, either to make the project profitable or to suggest ways in which it could generate revenue based on possible discoveries. In this case, clear guidelines have to be made about the future rights of the project, even if funding is taken over by private enterprises that are interested in the discoveries that could be made using those datasets. In this sense, the Gene Expression Database runs into some problems that could hinder its long-term sustainability.
The case study focuses quite heavily on the technical challenges of the project, as well as on the cycle from creator to end-user, or designated community. However, some of the technical plans are upset by a lack of clear financial planning, while clearer policies are needed on the designated community's future human infrastructure and on how the representation information and metadata may evolve beyond the life of current creators and users.

First, the case study places a strong focus on the technical processes and applications needed to provide long-term security and accessibility for the gene expression database. The authors stress this point (O’Donaghue & van Hemert 2009, Pg.58): “One of the main concerns on the informatics side of the design study is how to curate this resource over the long term. DGEMap will not be a simple archive of images, but rather a constantly changing project with several types of research output and digital assets that will require both coordination and preservation.” The results of the experiments will be processed in local databases, where raw digital images will be created, cleaned using Photoshop, and given representation information and metadata before being transferred to searchable databases online. The datasets will then also need to be archived and stored in a long-term digital repository. DGEMap, in this sense (O’Donaghue & van Hemert 2009, Pg.59), “comprises two constantly changing databases and a large quantity of images that need to be transformed and mapped before being submitted to one of those databases”. The aim of this essay is not to describe the technical processes in detail, other than to say that the curators have decided to use all open source programmes (i.e. MySQL, DRAMBORA, AONS II and SIARD) in order to facilitate later open access to the databases and to ensure consistency across all platforms. They will also apply the OAIS model for the same reason, in conjunction with MISFISHIE, METS and PREMIS.

It is surprising to find, given such detailed technical planning, one glaring shortcoming in the case study. It concerns the project’s funding and budget, over which the curators of DGEMap do not appear to have full control. The project has funding for 10 years from the European Union; however, the curators envision the project having to be sustained for an indefinite period of time. The actual budget that has been allocated is also never discussed within the case study, which prevents readers from fully understanding the policies of the project. For example, DGEMap (O’Donaghue & van Hemert 2009, Pg.64) propose the use of “the Dark Archive In The Sunshine State because it is intended for back-end use to other systems, so while having no public interface, it can be used in conjunction with other access systems which adds to the protection of the data”. However, they (O’Donaghue & van Hemert 2009, Pg.64) go on to add that “one problem with this usage is the exorbitant costs which may not be sustainable over the long-term. DGEMap propose investigating other storage options further”. It is clear why they want to use DAITSS, but it is worrying that they do not have the budget to use the most suitable storage facility even for a ten-year period, let alone for their indefinite storage needs. The questions that remain over funding call into question the longevity of such a project once the EU comes to the end of its funding obligations. This uncertainty is exacerbated by the fact that the EU can undergo considerable economic fluctuations, which can further disrupt long-term funding commitments.

One might argue that a curator cannot plan for an indefinite period and that, if one were to attempt to do so, the project would never begin in the first place. However, because data is going to be constantly added to the database, the case study could do more to emphasise the ongoing process of conceptualisation of the project. The first cycle of data has already been conceptualised and funding is in place for the initial stages. However, the DCC Lifecycle Model is not a one-cycle model, something that the curators of DGEMap identify but never build upon. They (O’Donaghue & van Hemert 2009, Pg.68) accept that as future users access the data and use it to perform new analyses, the data is transformed and re-entered at the start of the cycle. It is the argument of this essay that this ongoing reconceptualisation gives the curators space to continually update potential funding partners on the achievements of the project. One experiment builds upon previous experiments, and each new body of knowledge brings the project closer to discoveries that could lead to the development of new treatments for disease. Each time a reconceptualisation happens, these discoveries, or steps towards discoveries, could be capitalised on to continue adding scientific and monetary value to the project with a view to acquiring future funding. The DCC Lifecycle Model allows for this reconceptualisation, and this needs to be written into the policies of DGEMap.

The second area of interest in this case study lies in its explication of policies based around human infrastructure. There is going to be an ever-evolving body of participants in this project as new creators join it. These creators are initially responsible for adding representation information to the digital files. This information is later checked by curators with the aim of maintaining consistent linguistic labelling. In this sense, the policies of DGEMap aim to control human infrastructure to ensure consistency. This is why they use PREMIS: the information must remain readable by the community, which means the metadata will need to correspond to knowledge in the designated community. However, this writer believes that any attempt to enforce a static nature onto language is destructive to the necessary evolution of such projects. Contemporary linguistic theory has been demonstrating since the 1960s that language is far from static. On the contrary, language is ever-changing and malleable in any discipline. Within the realm of science, the labels that we use to signify meaning can play a role in promoting creativity in research and experimentation methods. Because the datasets are stored in two locations, there is always going to be a static account of the results and the images. Allowing creators, who enter the project with new, evolving perspectives and, by extension, new and more relevant linguistic codes, to develop the language used in the metadata in a natural way can only speed up the discovery process. In this sense, words do not come into existence from a vacuum, but are an ever-evolving chain of signifiers that can add meaningfully to the growing body of knowledge that is being stored. Again, there is plenty of scope within the policies of DGEMap to develop in this more open and organic way.
For example, when referring to the monitoring of the designated community, the case study (O’Donaghue & van Hemert 2009, Pg.68) admits: “Effectively, DGEMap would be harnessing the knowledge of its designated community to increase the use and importance of the public database, allowing an even better resource to develop over time.” There is a sense that ‘monitoring’ is too distant a term and that more control of the project needs to be given over to the creators and end-users who are developing it.

In conclusion, this essay has examined two important features of the case study on Gene Expression Data Storage (DGEMap) in ways that might allow the curators of that project to develop their policies more stringently. The first part of the essay examines ways in which budget constraints and long-term funding uncertainty can undermine the most careful technical and conceptual planning. It suggests, however, that by more fully utilising and emphasising the ongoing conceptualisation of new datasets entering the lifecycle of the project, the curators can build in a framework that allows the project to constantly speculate for new funding partners beyond the ten-year EU funding base. The second part of the essay casts some doubt on the project’s attempt to control the human infrastructure in a way that may hinder the development of new discoveries. It suggests that digital curators need a fuller understanding of contemporary linguistic theory, which can help them to embrace the unavoidable evolution of language, which in turn can add even greater momentum to the discovery process. This is especially true given that curators aim to store information for future generations. If there is a linguistic disconnect between end users and original creators, then the data may be misinterpreted or even become linguistically obsolete.