Raising awareness of preservation issues within software development

… on old software? Check out this BBC article which must surely show that software preservation is becoming more known and talked about.

… on a blog? Well, the project is over, the outputs hosted, and so it’s time to say goodbye!

The final words should go to our brilliant programme manager, Neil Grindley, who had some kind words to say about us: “The Software Preservation work undertaken by Curtis+Cartwright in partnership with the Software Sustainability Institute was an exemplary project that delivered a generous intellectual and practical return on investment for JISC. The materials produced were of uniformly high quality and the events and partnership work were highly effective in meeting the project objectives.” Thanks Neil!

Interesting and readable article on preservation, from outside academia, from the March 2011 edition of the IET E&T magazine. Intriguing examples include Macromedia Director whose “later versions won’t run Antirom [an Arts Council-funded project]”, and Boo.com where “nobody has the code”. Well worth reading!

In preparing for the event, there were several requests from attendees for information on software licensing. My experience has been that whilst there are some general principles to understand, many of the thorny issues only emerge when considering a specific software licence (especially for commercial software). I fired off an email to JISC Legal to see what advice and help they could offer. They kindly replied, and the rest of this post comprises their guidance (with my editing!).

Open source software

The JISC-funded OSS Watch should have specific information on open source licences. They have expertise in both licensing in and licensing out software, and can advise on the appropriateness of open source. Open source licensing can be an effective way of making software written by public bodies (with public money) available to the wider community, where it may be taken up and adapted for greater public benefit. It should also be possible to reverse engineer it so that it can be used to maintain access to resources even when changes and developments take place.

Your own software

The JISC Policy on open source software for JISC projects and services may also be useful for projects and services that generate software as a core output. This is available here and advises projects to maintain an IPR register listing all contributors to their software and who owns the copyright on contributions.
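As a thought experiment, such an IPR register could be kept as a small structured file alongside the source code. A minimal sketch in Python follows; the field names and entries are my own illustrative assumptions, not a schema mandated by the JISC policy:

```python
# A minimal, illustrative IPR register: one entry per contribution,
# recording who wrote it and who owns the copyright on it.
# Field names and example entries are assumptions, not a JISC schema.
from dataclasses import dataclass, asdict
import json

@dataclass
class IPREntry:
    contributor: str       # person who wrote the contribution
    component: str         # file, module, or subsystem contributed
    copyright_holder: str  # who owns the copyright on the contribution
    licence: str           # licence under which it is contributed

register = [
    IPREntry("A. Researcher", "parser/", "Example University", "BSD-3-Clause"),
    IPREntry("B. Developer", "gui/", "Example Ltd", "GPL-2.0-only"),
]

# Serialise so the register can live in version control next to the code.
print(json.dumps([asdict(e) for e in register], indent=2))
```

Keeping the register in the repository itself means it is versioned along with the contributions it describes.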

Commercial software

Commercial software presents more of a problem as it usually brings with it intellectual property of a company that will be very protective of it and who may on-licence it but probably on commercial terms. Some may be interested in maintaining their software so that data is preserved and accessible but there is no guarantee.

Certainly, maintaining data in a media-neutral format is an aim of many mainstream commercial publishers, who value the flexibility that this brings.

Need to know more?

JISC Legal are able to advise on particular licensing issues: just click here

I asked all attendees what their top learning point was, and the facilitators what their top lessons/observations/factoids were. The results are in, and make for a good record of the learning and consensus at the event:

Untended software is like an empty house in the jungle: it soon falls apart

Preserving software is hard even when you know what you’re doing

Software preservation is truly complex: there are so many things to take into account in addition to just the technical aspects

Because there are so many things to consider, there are inevitably multiple people and perspectives involved, which makes collaboration essential

We must beware of preserving it for the sake of it – remind ourselves of why we need to do it (or not do it) so as to do it right

Preservation of software and preservation of data are two sides of the same coin

Rather than just formulating more theories, greater efforts are needed in terms of case studies and building test software preservation archives, tools, etc

Developing significant properties of software is valuable

Preserving software as part of the representation information of data objects is more justifiable in practice than preserving it as a digital object in its own right

Keep the code (and documentation): it’s where the semantics lie

There should be a practical example of how to apply the OAIS model to software preservation. You could think of it as a “software preservation profile” of the OAIS model

You can merge preservation approaches in order to work towards an ideal approach you may not have enough lead-in time for initially (eg emulation/migration initially, and in parallel working towards cultivation of community-based support)

Funding bodies should be involved in supporting preservation, eg in mandating this aspect in funding proposals

Modular design – where we can separate display from processing, for instance – is good engineering now, and good for future reuse, as not all components necessarily warrant preservation.

Good software engineering leads to good software preservation

People whose primary purpose is to develop software are more likely to follow good software engineering practices than people who develop as a means to an end. For example, researchers are motivated to do good research rather than produce good software; this may explain why a lot of research code is seen as poor quality.

The best way to curate and preserve software is to make sure that it has a good user community ensuring it is maintained and kept “fit for purpose”. Users keep software alive as much as companies do; their needs will determine what significant properties must be saved, and building a community with a stake in what happens means you don’t have to do everything yourself.

Preservation in a web service / cloud / distributed architecture world is hard! There is research to be done here…
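The modular-design point above (separating display from processing) is easy to illustrate. In the hypothetical sketch below, the processing function has no display dependency, so it could be preserved and re-verified on its own while the display layer is rewritten for each new platform; all names are illustrative:

```python
# Processing layer: pure computation, no display dependency.
# This is the component most worth preserving and re-verifying.
def summarise(readings):
    """Return (minimum, maximum, mean) of a list of numbers."""
    return min(readings), max(readings), sum(readings) / len(readings)

# Display layer: depends on the processing layer, never the reverse.
# It can be replaced for a new platform without touching summarise().
def render(readings):
    lo, hi, mean = summarise(readings)
    return f"min={lo} max={hi} mean={mean:.2f}"

print(render([3.0, 1.0, 2.0]))  # → min=1.0 max=3.0 mean=2.00
```

Because the dependency points only one way, the display layer can be treated as disposable while the processing layer keeps its meaning.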

After the formal elements of the day were over, we kicked back, ate brownies and had a final full group discussion. This one had no set agenda, but the team took questions, answered them as best possible and let the discussion flow.

The key questions and points arising were:

Is it better to preserve the source code or the byte code? Preserving source code is good, especially if the developers’ documentation and commenting is helpful. Such documentation and commenting must be independent of the format/language of the source code – one attendee had the fantastic example of being given well-written code to maintain, but all the commenting was in Portuguese! Source code is one component of a bigger picture, and would require consideration of preserving compilers, IDEs, etc. The best practical answer is to try and keep both source code and byte code!

It is possible that user power can help sustain software. For example, the Windows 98 End of Life came much later than Microsoft initially wanted, because there were sufficient users and usage to push the date back.

Are there any good examples of software reuse? Yes! NASA has done a lot of work on reuse, and has developed guidance including a set of Reuse Readiness Levels. The computer games industry often reuses libraries (in part or whole). The NAG libraries are a classic case of software reuse: reusing processes, reusing chunks of code, encouraging better documentation and offering best practice support of code.

How does software development differ between academia and the commercial world? There is a distinct difference between software development and research. For example, if you are a researcher rather than a developer, software development is often about making programs for quick and easy use, rather than something built for maintainability, portability, reuse, etc. In addition, within libraries and universities there are often project constraints, which can lead to software being functional but unstable.

Data preservation is easier than software preservation. Whilst sometimes the software is needed (eg to see the data), it is possible to make software preservation into more of a data preservation problem. Reducing the unknowns (eg documenting the significant properties) and adding structure (eg modularisation, use of common platforms, etc) can make software preservation more like data preservation.

In some circumstances, software that needs to be preserved will be linked to specialist equipment (mass spectroscopy, NMR, robotics). In this instance, the software and its data only make sense in the presence of this external hardware. Both specialist knowledge of the data and practical experience of the hardware are necessary when carrying out software preservation.

An extreme approach to software preservation is to give the entire responsibility for preservation to the research group developing and using the code. The opposite extreme is to give the problem and responsibility to a specialist preserver. An in-between approach is for the specialist preserver and research group (or other developer if in the research domain) to work together.

Because of the complexity of software preservation, and in particular the difficulty of knowing in advance what key aspects are relevant to a particular case, it is inadvisable to think of software preservation as a one-off decision with easy answers and a predictable outcome. Instead, software preservation should be seen as a learning experience with an associated learning curve. Very often the realisation of what doesn’t work will come too late. The more a team or organisation learns about what works and what doesn’t, the better placed it will be to make future decisions.

Are some particular platforms and/or languages better than others for software preservation? One answer is that platform and language choice is dependent on the community, as use drives sustainability, therefore ask ‘what is everyone else using?’. Another answer is to ask more abstractly whether the language has a future and whether it is easy to sustain. Many would argue that Perl is difficult to sustain and that there are very few COBOL developers left, which might make Perl and COBOL poor choices. However, language use is volatile; for example, many thought that C was in decline, but the rise of the iPhone and iPad has given the C language a new lease of life (via Objective-C, the primary language for Apple’s Mac OS X and iOS).

In the second group discussion we talked about specific examples of software preservation. Groups were asked to identify a real or hypothetical example, and then to answer the following questions:

Consider the why, what, who, where, when, how…

What are the preservation requirements?

What is your preferred approach?

Where are the skills, resources and funding?

What are your next steps?

We had four great examples, and a lot of interesting issues:

The research software involved with brain imaging can be from a variety of sources, including commercially bought packages and open-source based programs written by research students. This variability complicates decisions on how to preserve software in terms of overall strategy but the consensus is that globally there should be a push towards better software engineering. This includes better documentation, the sharing of expertise and a long term outlook. The group proposed an advocacy strategy where a small group of consumers could put pressure on the software designers in order for this to come about. It is also important to identify the benefits to the funders who would be putting investment into better software engineering.

This hypothetical example involved a piece of data entry and search software that uses a number of different formats, across three institutions, and where the original developers have moved on. The group explained that continual availability of the software was dependent on a preserved interface (due to a lack of training funding if the interface were to be changed).

Security and confidentiality were noted as important aspects, as was consistency in authentication. It was decided that a positive effect of software preservation (apart from the preservation itself) for current users would be to improve the speed of the software while preservation is taking place – ie the performance of the underlying systems would increase with each new generation of hardware, so searching may become faster over time.

The group also highlighted the importance of cultivating a community for future users and that a testing group across the involved institutions would be valuable. If attempts to engage the original developers failed, the group suggested engaging the support of top level professors and research councils to raise funds, awareness and possibly corporate sponsorship.

A library keeps archives of the works of famous scholars and writers, which sometimes include entire PCs with all their files. The aim is to have a complete environment to present to future scholars so they can explore the scholars’ file systems and have access to previous drafts and other documents. To do this the library has successfully used an emulation approach (for both Mac and Windows), which is greatly appreciated by the academics and students who use it. However, the emulator itself is now becoming obsolescent and they are thinking about how to migrate it to a new platform (eg away from XP). A major migration issue is whether repeated emulation is possible and sustainable, or whether it leads to information degradation over time. Also, though current one-machine jobs are relatively straightforward, the required software can sometimes be spread across more than one machine. Increasingly, the ‘standalone desktop’ model for personal computing is outdated, replaced by a mix of laptops, mobile devices, and online services. This identified another huge problem: that of preserving data and software for research groups that use the “cloud”.

A museum holds a physical video collection where both content and context are important. The collection is publicly accessible and so has to be easily available and understandable to non-experts. Preservation is relatively simple for a single particular platform or console (eg the Nintendo SNES), but for PC games it can become complex as the hardware and software are often very varied. The group explained that the number of users who remember the preserved systems is starting to decrease, and there is now a race to preserve people’s knowledge. Generally there is a throw-away attitude to “old” software in the commercial sector, and the group suggested raising the profile of organisations that preserve old, defunct hardware/software as an example of good practice. Due to the large workload, software preservation has a low priority at the museum, and the group assessed that their first step should be to raise awareness in this area. Secondly, they proposed a change to the museum’s collections policy so that it covers software, which would then increase the importance of software preservation to the museum.

Our first group discussion was around who should be responsible for software preservation. Should it be the developer of the software, or should there be a dedicated software curator? Or should it be someone else, or a team effort?!

Some of the points relayed back from these group discussions were:

You can’t do software preservation alone

People have different responsibilities for software preservation and this can depend on the institution.

Ultimately everyone has some responsibility for software preservation. The main problem is a lack of understanding, ie some people may know the technical information but not the reasons why preservation is important, and vice versa.

Both IT technicians and researchers need to know the “whats and whys” for preservation to work.

Engaging current users is essential. They can tell you what’s important now and in the future. Users need to put a preservation strategy in place before things fall apart. Users must question the sustainability of their preservation techniques: will the technology and file formats become obsolete too quickly?

What data should be preserved? Raw or processed? Or both? Are there time and cost constraints on this?

What are the different reasons behind preserving your data? So you can interact with the data? For prior art in patents – legal liability? It is recommended to have a strong business case to justify the preservation process.

It is recommended that an advocacy strategy is developed for software preservation.

It is recommended that certification approaches be developed for software (like with data) so that users can quickly understand whether a preservation technique is trustworthy.

Recommendations for software preservation are not as well known as those for data preservation. There is however some advice on using file types for which it is easy to build readers (tiff/txt), and on ensuring the representation information of data is not lost, so we can make sense of even the simplest data standards (eg dictionaries and character sets for text files).
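Even for plain text, that advice can be followed by recording the character set alongside the file rather than assuming it. A minimal sketch, assuming a hypothetical JSON “sidecar” file; the sidecar naming and layout are my own illustration, not a standard:

```python
# Keep representation information (here, just the character set) in a
# sidecar file next to the preserved text, and consult it when reading.
# The sidecar naming and layout are illustrative, not a standard.
import json
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

data = workdir / "notes.txt"
data.write_bytes("café".encode("latin-1"))  # a legacy-encoded file

sidecar = workdir / "notes.txt.repinfo.json"
sidecar.write_text(json.dumps({"format": "text/plain", "encoding": "latin-1"}))

# Years later: read the sidecar first, then decode the bytes correctly.
rep = json.loads(sidecar.read_text())
text = data.read_bytes().decode(rep["encoding"])
print(text)  # → café
```

Without the sidecar, a future reader guessing UTF-8 would mis-decode the file; with it, even a legacy encoding remains legible.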

Curators must be careful when preserving software to minimise degradation of functionality or accuracy. This has been a problem even where a systematic approach to data migration was taken – problems with how graphics were rendered often arose with migration approaches (though only with particular aspects of the graphics, rather than the entire thing).

Software preservation should help us understand the limitations of the user – rather than the mistakes they have made

Software is not easy to define let alone preserve. Workflows for example will have individual elements of software and processes which are brought together as a batch that is greater than the sum of its parts.

For example, within Geographic Information Systems (GIS) a mapping service is actually produced from underlying raw data and is not really an entity in its own right. Is it more important to sustain the software used to create the map, or the data that makes it available? The map is unique and only useful for one point, but it is difficult to separate data from processes.

For example with video and sound, what parts are signal and signal processing and what parts are data? Software has a whole technology stack and never stands alone – you can’t really preserve an operating system or create an emulator when software is a service and many components are disposable.

There are different approaches to software preservation listed on project website – technical preservation, emulation, migration, cultivation, hibernation, depreciation, and procrastination. Procrastination is never a really good option!

The issues in software preservation are not just technical (formidable as they may be) and include tricky activities such as managing digital rights, and justifying cost-benefit trade-offs.

Useful elements to help users approach this systematically are often the significant properties of the software, such as key functionality.

The Software Sustainability Institute is creating a national facility for research software, encouraging the improvement of software design and architecture, and embedding maturity models into software training as a case for good practice in software engineering.

There are multiple ways to preserve data. The STFC uses a cultivation strategy to ensure that certain key software tools like ICAT are maintained. They have an active developer community that shares the source code and keeps it alive and documented.

The basic preservation steps for software are: preserve, retrieve, reconstruct and replay. These may sound self-explanatory but are actually quite complex. For retrieval, in addition to knowledge of the general software architecture and licensing data, there is a need for explicit information on the software’s functionality. For reconstruction there is a need to understand the dependencies and components, with details of the programming language and the libraries required to ensure the correct output. Replay will also need sufficient documentation, and might be used as a benchmark to assess the success of the preservation method; to do this, there will need to be enough test cases to ensure accuracy.
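The idea of replay as a benchmark can be sketched as follows: reference inputs and outputs are preserved alongside the software, and after reconstruction the rebuilt software is replayed against them. The function and test cases below are illustrative stand-ins:

```python
# After reconstructing archived software, replay preserved test cases
# and compare against the preserved reference outputs.
# reconstructed_run and reference_cases are illustrative stand-ins.

def reconstructed_run(x):
    """Stand-in for the rebuilt software under test."""
    return x * x + 1

# Input/output pairs recorded alongside the software at archiving time.
reference_cases = [(0, 1), (2, 5), (10, 101)]

def replay_ok(run, cases, tolerance=0.0):
    """True if every preserved case is reproduced within tolerance."""
    return all(abs(run(x) - expected) <= tolerance for x, expected in cases)

print(replay_ok(reconstructed_run, reference_cases))  # → True
```

The tolerance parameter matters in practice: numerical software rebuilt with a different compiler or library version may reproduce results only approximately, and the archived test cases should say how close is close enough.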

We also had some curator-developer role play, with a couple of key points arising:

If you build a house in a jungle and then leave it, it will be overrun with the flora and fauna of the environment and eventually destroyed. However if you live in that house, you are committed to its continual up-keep and the house is maintained.

If software is required and there is a market for it, then it will always be maintained. If there is no market and no users that rely on it, then it is more likely that the software will quickly become unusable. There are some exceptions to this, namely open source software with a close community of users. Ironically, receiving many bug reports is a good thing: it means your code is being used, and the more it is used the more robust it will be in the long term.

Oops! In wrapping up the project, I’ve found a half-written blog post dating from the start of the project. So, in the spirit of preservation, here goes… just accept my apologies for the slightly abrupt ending.

We’d been looking for good examples where software preservation could be beneficial. One that cropped up is climatic research. Paragraph 7 of Lord Oxburgh’s report on UEA’s CRU research says:

CRU accepts with hindsight that they should have devoted more attention in the past to archiving data and algorithms and recording exactly what they did. At the time the work was done, they had no idea that these data would assume the importance they have today and that the Unit would have to answer detailed inquiries on earlier work. CRU and, we are told, the tree ring community generally, are now adopting a much more rigorous approach to the archiving of chronologies and computer code. The difficulty in releasing program code is that to be understood by anyone else it needs time-consuming work on documentation, and this has not been a top priority.

I think that this excerpt highlights several key issues nicely:

the importance of both data and algorithms

the uncertainty in knowing what the benefits of archival and preservation might be

the critical nature of these benefits (easy to say in hindsight)

the costs involved (financial, but perhaps more importantly researcher time)

the natural order of priorities (a researcher’s day job is, well, research!)

The thrust of our project was to raise awareness of the issues in advance, so that decisions can be made on the basis of the potential need, benefits, costs, etc. Please see our benefits framework for a quick overview, or the longer version with lots of examples!

I was intrigued to see in my alumni newsletter a headline saying that a working replica of EDSAC (the first fully-operational stored-program computer!) is to be built.

On following the link I was heartened to see a clear (and noble) purpose (“in recognition of the pioneering computer scientists at the University of Cambridge who developed it”). This would probably fit into the category called “create cultural heritage” in our framework of purposes and benefits.

By the end of the article, though, I was a little disappointed. There is no mention of the software that it’ll run. Has any of the original software been preserved? And in what form? Paper tape? Specifications? A handwritten algorithm in a logbook somewhere? I think we should be told!