SCAPE workshop (notes, part 2)

Drawing on my notes from the SCAPE (Scalable Preservation Environments) workshop I attended last year, here is a summary of the presentation delivered by Peter May (BL) and the activities that were introduced during the hands-on training session.

Peter May (British Library)

To contextualise the activities and the tools we used during the workshop, Peter May presented a case study from the British Library (BL). The BL is a legal deposit archive that, among many other resources, archives newspapers. Newspapers are among the most brittle items in the collection: the paper is very sensitive and prone to disintegrate even in optimal preservation conditions (a humidity- and light-controlled environment). With support from Jisc (2007, 2009) and through their current partnership with brightsolid, the BL has been digitising this part of the collection, at a current rate of about 8000 scans per day.

BL’s main concern is how to ensure long-term preservation of, and access to, the newspaper collection, and how to make digitisation processes cost-effective (larger files require more storage space, so reducing the storage needed per file means more objects can be digitised for the same cost). As part of the digitisation projects, the BL had to reflect on:

  • How would the end-user want to engage with the digitised objects?
  • What file format would suit all those potential uses?
  • How will the collection be displayed online?
  • How can smooth network access to the collection be ensured?

As an end-user, you might want to browse thumbnails in the newspaper collection, or you might want to zoom in and read through the text. In order to have the flexibility to display images at different resolutions when required, the BL has to scan the newspapers at high resolution. JPEG2000 has proved to be the most flexible format for displaying images at different resolutions (thumbnails, whole images, image tiles). The BL investigated how to migrate from TIFF to JPEG2000 format to enable this flexibility in access, as well as to reduce the size of files, and thereby the cost of storage and preservation management. A JPEG2000 file is normally half the size of a TIFF file.

At this stage, the SCAPE workflow comes into play. In order to ensure that it was safe to delete the original TIFF files after the migration to JPEG2000, the BL team needed to run quality checks across all of the millions of files they were migrating.

For the SCAPE work at the BL, Peter May and the team tested a number of tools and created a workflow to migrate files and perform quality checks. For the migration process they tested codecs such as Kakadu and OpenJPEG, and to check the integrity of the JPEG2000 files and their compliance with institutional policies and preservation needs they used Jpylyzer. Other tools, such as Matchbox (for image feature analysis and duplicate identification) and ExifTool (an image metadata extractor that can be used to record the provenance of a file and, later, to compare metadata before and after migration), were also tested within the SCAPE project at the BL. To ensure the success of the migration process, the BL developed in-house code to compare, at scale, the outputs of the above-mentioned tools.
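To give a flavour of what one step of such a workflow can look like outside Taverna, here is a minimal Python sketch that chains an OpenJPEG migration, a Jpylyzer validity check and an ExifTool metadata comparison for a single file. It is not the BL's actual code: the file names are hypothetical, it assumes opj_compress, jpylyzer and exiftool are installed and on the PATH, and the exact metadata tags worth comparing will depend on your own policy.

    # Sketch of one migrate-and-check step (hypothetical paths; not the BL's code).
    import json
    import subprocess

    def migrate_and_check(tiff_path, jp2_path):
        # 1. Migrate TIFF -> JPEG2000 with OpenJPEG (lossless by default).
        subprocess.run(["opj_compress", "-i", tiff_path, "-o", jp2_path], check=True)

        # 2. Validate the JPEG2000 with jpylyzer, which writes an XML report to stdout.
        report = subprocess.run(["jpylyzer", jp2_path],
                                capture_output=True, text=True, check=True).stdout
        # Crude check for the purposes of the sketch; real code should parse the XML properly.
        is_valid = "isValid" in report and "True" in report

        # 3. Dump metadata from both files with ExifTool (JSON output) for comparison.
        def exif(path):
            out = subprocess.run(["exiftool", "-j", path],
                                 capture_output=True, text=True, check=True).stdout
            return json.loads(out)[0]

        before, after = exif(tiff_path), exif(jp2_path)
        dims_match = (before.get("ImageWidth"), before.get("ImageHeight")) == \
                     (after.get("ImageWidth"), after.get("ImageHeight"))

        return is_valid and dims_match

    if __name__ == "__main__":
        print(migrate_and_check("page_0001.tif", "page_0001.jp2"))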

Peter May’s presentation slides can be found on the Open Planets Foundation wiki.

Hands-on training session

After Peter May’s presentation, the SCAPE workshop team guided us through an activity in which we checked whether original TIFFs had been migrated to JPEG2000 successfully. For this we used the TIFF compare command (tiffcmp). We first migrated the TIFF to JPEG2000 and then converted the JPEG2000 back into TIFF; tiffcmp was then used to compare the original and round-tripped TIFFs bit by bit (a bitstream comparison to check fixity), confirming that the file had not been corrupted and that the compression and decompression processes were reliable.
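As a rough illustration of the exercise (not the exact workshop script), the round trip can be scripted as below. It assumes the OpenJPEG command-line tools and libtiff's tiffcmp are installed, uses hypothetical file names, and relies on OpenJPEG's default lossless compression so that a bit-identical round trip is possible.

    # Round-trip check: TIFF -> JPEG2000 -> TIFF, then comparison with tiffcmp.
    # Hypothetical file names; assumes opj_compress, opj_decompress and tiffcmp are on the PATH.
    import subprocess

    original = "page_0001.tif"
    jp2 = "page_0001.jp2"
    roundtrip = "page_0001_roundtrip.tif"

    # Migrate to JPEG2000 (OpenJPEG compresses losslessly by default) and back again.
    subprocess.run(["opj_compress", "-i", original, "-o", jp2], check=True)
    subprocess.run(["opj_decompress", "-i", jp2, "-o", roundtrip], check=True)

    # tiffcmp reports differences in tags and image data, and exits non-zero if the files differ.
    result = subprocess.run(["tiffcmp", original, roundtrip])
    if result.returncode == 0:
        print("No differences reported - round trip looks safe")
    else:
        print("Differences found - do not delete the original!")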

The intention of the exercise was to show the migration process at a small scale. However, when digital preservation tasks (migration, compression, validation, metadata extraction, comparison) have to be applied to thousands of files, a single processor would take a very long time to run them, and for that reason parallelisation is a good idea. SCAPE has been working on parallelisation and on how to divide tasks across computational nodes to deal with large volumes of data all at once.
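SCAPE does this across a cluster, but the same divide-and-conquer idea can be sketched on a single multi-core machine with Python's multiprocessing module. In this toy example (the directory name is hypothetical) a fixity checksum is computed for many files in parallel; the checksum function is simply a stand-in for any per-file preservation task.

    # Toy illustration of parallelising a per-file preservation task (here, fixity checksums).
    import hashlib
    from multiprocessing import Pool
    from pathlib import Path

    def checksum(path):
        # Stand-in for any per-file task: migration, validation, metadata extraction...
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
                md5.update(chunk)
        return path.name, md5.hexdigest()

    if __name__ == "__main__":
        files = sorted(Path("jp2_output").glob("*.jp2"))  # hypothetical directory of migrated files
        with Pool(processes=4) as pool:                    # four worker processes
            for name, digest in pool.imap_unordered(checksum, files):
                print(name, digest)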

SCAPE uses the Taverna workbench to create tailored workflows. You do not strictly need Taverna to run a preservation workflow, because many of the tools that can be incorporated into Taverna can also be used standalone, for example FITS, the File Information Tool Set, which “identifies, validates, and extracts technical metadata for various file formats”. However, Taverna offers a good solution for digital preservation workflows, since you can create a workflow that includes all the tools you need. The ideal use of Taverna in digital preservation is to choose different tools at different stages of the workflow, depending on the digital preservation requirements of your data.
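For instance, FITS can be run on its own from the command line or from a small script, with no workflow engine involved. A minimal sketch (it assumes FITS is installed with its fits.sh launcher on the PATH, and the file name is hypothetical):

    # Running FITS standalone to extract technical metadata for one file.
    import subprocess

    result = subprocess.run(["fits.sh", "-i", "page_0001.jp2"],
                            capture_output=True, text=True, check=True)
    print(result.stdout)  # FITS prints an XML report combining the output of its bundled tools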

Related links:

http://openplanetsfoundation.org/
http://wiki.opf-labs.org/display/SP/Home
http://www.myExperiment.org/

SCAPE workshop (notes, part 1)

The aim of the SCAPE workshop was to show participants how to cope with large data volumes by using automation and preservation tools developed and combined as part of the SCAPE project. During the workshop, we were introduced to the Taverna workbench, a workflow engine we installed via a Linux virtual machine on our laptops.

It has taken me a while to sort out my notes from the workshop, which was facilitated by Rainer Schmidt (Austrian Institute of Technology, AIT), Dave Tarrant (Open Planets Foundation, OPF), Roman Graf (AIT), Matthias Rella (AIT), Sven Schlarb (Austrian National Library, ONB), Carl Wilson (OPF), Peter May (British Library, BL), and Donal Fellows (University of Manchester, UNIMAN), but here they are. The workshop (September 2013) started with a demonstration of scalable data migration processes, for which the facilitators used a number of Raspberry Pis as a computer cluster (only as a proof of concept).

Rainer Schmidt (AIT)

Here is a summary of the presentation delivered by Rainer Schmidt (AIT), who explained the SCAPE project framework (an FP7 project involving 16 organisations from 8 countries). The SCAPE project focuses on planning, managing and preserving digital resources using the concept of scalability. Computer clusters can manage data loads and distribute preservation tasks that cannot be handled in desktop environments. Some of the automated, distributed tasks they have been investigating are metadata extraction, file format migration, bit checking, quality assurance, etc.

During the workshop, the facilitators showed scenarios created and developed as part of the SCAPE project, which had served as a test bed to identify the best use of different technologies in the preservation workflow. The hands-on activities started with a quick demonstration of the SCAPE preservation platform and of how to execute a SCAPE workflow when running it in the virtual machine.

SCAPE uses clusters of commodity hardware to build bigger environments that make preservation tasks scalable, distribute the required computing power efficiently, and minimise errors. The system’s architecture is based on partitions: if a failure occurs, it only affects one machine and the tasks it performs, instead of affecting a bigger group of tasks. The cluster can also be shared by a number of people, so an error affects only a specific part of the cluster and thereby only one user.

A disadvantage of distributing tasks in a cluster is that you have to manage the load balancing. If you put large amounts of data into the system, the system distributes the data among the nodes. Once the distributed data sets have been processed, the results are sent to nodes where they are aggregated. You have to use a special framework to deal with this distributed environment, and SCAPE uses algorithms to find and query the data. The performance of a single CPU is far too limited for these data volumes, so they use parallel computing and then bring all the results back together.

The Hadoop framework (an open source Apache project) allows them to deal with the details of scalable preservation environments. Hadoop is a distributed file system and execution platform that allows them to map, reduce and distribute data and applications. The biggest advantage of Hadoop is that you can build applications on top of it, so it is easier to build a robust application (the computation doesn’t break because a node goes down or fails). Hadoop relies on the MapReduce programming model, which is widely used for data-intensive computations: Google, for example, runs MapReduce clusters with thousands of nodes for indexing, ranking, mining web content and statistical analysis, and the model is typically accessed through Java APIs and scripts.
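To make the MapReduce idea concrete, here is a hedged sketch of a Hadoop Streaming job written in Python. It assumes a hypothetical input listing with one "path<TAB>format" line per file and simply counts how many files of each format the collection contains; the field layout is invented, but the mapper/reducer split is the standard pattern.

    # mapper.py -- emits "format<TAB>1" for each input line of the form "path<TAB>format".
    import sys

    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            path, file_format = parts
            print(f"{file_format}\t1")

And the matching reducer, which sums the counts per format (Hadoop Streaming sorts the mapper output by key, so all lines for one format arrive together):

    # reducer.py -- aggregates the per-format counts emitted by the mapper.
    import sys

    current_format, count = None, 0
    for line in sys.stdin:
        file_format, value = line.rstrip("\n").split("\t")
        if file_format != current_format:
            if current_format is not None:
                print(f"{current_format}\t{count}")
            current_format, count = file_format, 0
        count += int(value)
    if current_format is not None:
        print(f"{current_format}\t{count}")

These two scripts would be handed to the hadoop-streaming jar as the -mapper and -reducer commands, together with HDFS input and output paths; locally you can test the same pipeline with: cat listing.tsv | python mapper.py | sort | python reducer.py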

The SCAPE platform brings the open source technologies Hadoop, Taverna and Fedora together; SCAPE is currently using Hadoop version 1.2.0. Taverna allows you to visualise and model tasks and make them repeatable, while Fedora (http://www.fedora-commons.org/) provides the repository technology. The SCAPE platform incorporates the Taverna workflow management system as a workbench (http://www.taverna.org.uk) and Fedora as the core repository environment.

You can write your own applications, but a cost-effective solution is to incorporate existing preservation tools and standards, such as JHOVE or METS, into the Taverna workflow. Taverna allows you to integrate these tools in the workflow, and supports repository integration (reading from, and distributing data back into, preservation environments such as Fedora). The Taverna workflow (sandpit environment) can be run on the desktop; however, for long-running workflows you might want to run Taverna remotely over the internet. More information about the SCAPE platform and how Hadoop, Taverna and Fedora are integrated is available at http://www.scape-project.eu/publication/1334/.
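As a small illustration of driving one of these tools from a script rather than from Taverna, here is a sketch that calls JHOVE to validate a JPEG2000 file and capture its XML report (it assumes the jhove launcher is on the PATH; the module name and file name are illustrative):

    # Calling JHOVE to characterise/validate a JPEG2000 file and capture the XML report.
    import subprocess

    report = subprocess.run(
        ["jhove", "-m", "JPEG2000-hul", "-h", "XML", "page_0001.jp2"],
        capture_output=True, text=True, check=True).stdout
    print(report)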

Setting up a preservation platform like this also entails a series of challenges. Some of the obstacles you might encounter are mismatches in the parallelisation (differences between the desktop and the cluster environment). Workflows that work for repositories might not work for web archiving, because they use different distributed environments; to avoid mismatches, use a cluster that is centred on the needs of a specific workflow. Institutions may be wary of keeping a cluster in-house, while on the other hand they may also be reluctant to transfer big datasets over the internet.

Related links:

http://openplanetsfoundation.org/
http://wiki.opf-labs.org/display/SP/Home
http://www.myExperiment.org/