Issues for research software preservation

[Reposted from http://libraryblogs.is.ed.ac.uk/blog/2013/09/18/research-software-preservation/]


Twelve years ago I was working as a research assistant on an EPSRC-funded project.  My primary role was to write software that allowed vehicle wiring to be analysed, and faults identified early in the design process, typically during the drafting stage within Computer Aided Design (CAD) tools.  As with all product design, the earlier that potential faults can be identified, the cheaper it is to eliminate them.

Life moves on, and in the intervening years I’ve moved between six jobs, and have worked in three different universities.  Part of my role now includes overseeing areas of the University’s Research Data Management service.  In this work, one area that gets raised from time to time is the issue of preserving software.  Preserving data is talked about more often, but the software that created it can be important too, particularly if the data ever needs to be recreated, or if the software is needed to interrogate or visualise the data.  The rest of this blog post takes a look at some of the important areas that should be thought about when writing software for research purposes.

In their paper ‘A Framework for Software Preservation’ (doi:10.2218/ijdc.v5i1.145), Matthews et al. describe four aspects of software preservation:

1. Storage: is the software stored somewhere?

2. Retrieval: can the software be retrieved from wherever it is stored?

3. Reconstruction: can the software be reconstructed (executed)?

4. Replay: when executed, does the software produce the same results as it did originally?

Storage:

Storage of source code is perhaps one of the easier aspects to tackle; however, there are a multitude of issues and options.  The first step, and this is just good software development, is documentation about the software.  In some ways this is no different to lab notebooks or experiment records that help explain what was created, why it was created, and how it was created.  This includes everything from basic practices such as comments in the code and using meaningful variable names, through to design documentation and user manuals.  The second step, which again is just good software development practice, is to store code in a source code management system such as Git, Mercurial, SVN, CVS, or going back a few years, RCS or SCCS.  A third step is to store the code on a supported and maintained platform, perhaps a departmental or institutional file store.

However, depending on the language used, it may be prudent to store more than just the source code and documentation.  If the code is written in a well-known language such as Java, C, or Perl, then the chances are that you’ll be OK.  However, there can be complexities related to code libraries.  Take the example of a bit of software written in Java and using the Maven build system.  Maven helps by allowing dependencies to be downloaded at build time, rather than storing them locally.  This gives benefits such as making it easy to move to new versions, but what if the particular Maven repository is no longer available in five years’ time?  I may be in the situation where I can’t rebuild my code as I don’t have access to the dependencies.
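One way to guard against a vanished repository is to keep local copies of the dependencies next to the source, together with a record of exactly which files were stored.  The following is a minimal sketch in Python, not a prescribed method: it assumes the library files have already been copied into a directory of your choosing, and the function names (`sha256_of`, `build_manifest`, `write_manifest`) are purely illustrative.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(deps_dir: Path) -> dict:
    """Map each file under deps_dir (by relative path) to its size and checksum."""
    manifest = {}
    for path in sorted(deps_dir.rglob("*")):
        if path.is_file():
            manifest[path.relative_to(deps_dir).as_posix()] = {
                "bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            }
    return manifest


def write_manifest(deps_dir: Path, manifest_path: Path) -> None:
    """Write the manifest as JSON so it can be stored alongside the code."""
    text = json.dumps(build_manifest(deps_dir), indent=2, sort_keys=True)
    manifest_path.write_text(text, encoding="utf-8")
```

Calling `write_manifest(Path("lib"), Path("MANIFEST.json"))`, where `lib` and `MANIFEST.json` are hypothetical names, would produce a plain-text file that a future reader can use to check that the stored libraries are complete and unaltered.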

Retrieval:

If good and appropriate storage is used, then retrieval should also be straightforward.  However, if nothing else, time and change can be an enemy.  Firstly, is there sufficient information easily available to describe to someone else, or to act as a reminder to yourself, what to access and where it is?  Very often filestore permissions are used to limit who can access the storage.  If access is granted (if it wasn’t held already) then it is important to know where to look.  Using extra systems such as source code control systems can be a blessing and a curse.  You may end up having to ask a friendly sysadmin to install an SCCS client to access your old code repository!
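The reminder of what to access and where it is can be as simple as a small, plain-text record kept with the project documentation.  As a rough sketch in Python, assuming nothing about any particular institution’s systems, the record might look like the following; the class name and every field are hypothetical and should be adapted to local practice.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PreservationRecord:
    """Where a piece of research software lives and how to get at it."""
    title: str
    location: str          # path or URL of the stored copy
    version_control: str   # for example "git", "SVN", or "none"
    access_contact: str    # who can grant access to the storage
    languages: list = field(default_factory=list)
    notes: str = ""


def to_json(record: PreservationRecord) -> str:
    """Serialise the record as readable JSON text."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

A record built with placeholder values, such as `PreservationRecord("Example project", "/path/to/filestore", "none", "departmental IT")`, can be written out with `to_json` and stored anywhere a future reader is likely to look first.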

Reconstruction:

You’ve stored your code, you’ve retrieved it, but can it be reconstructed?  Again this will often come down to how well you stored the software and its dependencies in the first place.  In some instances, perhaps where specialist programming languages or environments had to be used, these may have been stored too.  However, can a programming tool written for Windows 95 still be used today?  Maybe: it might be possible to build such a machine if you can’t find one, or to download a virtual machine running Windows 95.  This raises another consideration of what to store: you may wish to store virtual machine images of either the development environment, or the execution environment, to make it easier to fire up and run the code at a later date.  However, there are no doubt issues here with choosing a virtual machine format that will still be accessible in twenty years’ time.  In line with normal preservation practice, storing a virtual machine in no way removes the need to store raw textual source code that can be easily read by any text editor in the future.

Replay:

Assuming you now have your original code in an executable format, you can now look forward to being able to replay it, and get data in and out of it.  That is, of course, as long as you have also preserved the data!
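Whether a replay produces the same results as the original run can be checked mechanically if a checksum of the original output was recorded at the time.  The sketch below, in Python, is one possible approach rather than an established procedure: it assumes the software writes its results to standard output, and the function names `output_digest` and `replay_matches` are illustrative.

```python
import hashlib
import subprocess
from typing import Sequence


def output_digest(command: Sequence[str]) -> str:
    """Run the command and return the SHA-256 of its standard output."""
    result = subprocess.run(list(command), capture_output=True, check=True)
    return hashlib.sha256(result.stdout).hexdigest()


def replay_matches(command: Sequence[str], recorded_digest: str) -> bool:
    """True if re-running the command reproduces the recorded output exactly."""
    return output_digest(command) == recorded_digest
```

The digest from `output_digest` would be stored with the data when the software is first run, and `replay_matches` called with the same command years later.  A byte-for-byte comparison is deliberately strict: output that includes timestamps, or floating-point results computed on different hardware, may differ without the software being wrong, so a mismatch is a prompt to investigate rather than proof of failure.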

To recap, here are a few things to think about:

– As with many areas of Research Data Management, planning is essential.  Subsequent retrieval, reconstruction, and replay are only possible if the right information is stored in the right way originally, so you need a plan reminding you what to store.

– Consider carefully what to store, and what else might be needed to recompile or execute the code in the future.

– Think about where to store the code, and where it will most likely be accessible in the future.

– Remember to store dependencies which might be quite normal today, but that might not be so easily found in the future.

– Popular programming languages may be easier to execute in the future than niche languages.

– Even if you are storing complete environments as virtual machines, remember that these may be impenetrable in the future, whereas plain text source code will always be accessible.

So, back to the project I was working on twelve years ago.  How did I do?

– Storage: The code was stored on departmental filestore.  Shamefully I have to admit that no source code control system was used; the three programmers on the project just merged their code periodically.

– Retrieval: I don’t know!  It was stored on departmental filestore, so after I moved from that department to another, it became inaccessible to me.  I presume the filestore has been maintained by the department, but was my area kept after I left, or deleted automatically?

– Reconstruction: The software was written in Java and Perl, so should be relatively easy to rebuild.

– Replay: I can’t remember how much documentation we wrote to explain how to run the code, and how to read / write data, or what format the data files had to be in.  Twelve years on, I’m not sure I could remember!

Final grading: Room for improvement!

Stuart Lewis, Head of Research and Learning Services, Library & University Collections.
