The Edinburgh DataShare Awards!

The Research Data Service team applauds those researchers at the University of Edinburgh who share their data. We therefore decided to show our appreciation by presenting awards to our most successful depositors, as part of the Dealing With Data conference. Unfortunately, the prizes did not come with a cash research grant attached. However, the winners did receive a certificate bearing an image of our mascot for the day, Databot. We think you’ll agree the winning depositors and their data demonstrate the diversity of our collections, in terms of subject matter, formats and sheer size. We were particularly pleased with the reactions from both the recipients and the attendees, whether in person, by email or on Twitter (#UoEData was the Dealing with Data hashtag). Who doesn’t love the drama of an awards ceremony? A video is available.

Photograph of Pauline Ward announcing the award winners

Photo: CC-BY Lorna M. Campbell

The winners in full…

MOST DATASHARING SCHOOL: Edinburgh Medical School

– the School which boasts the greatest number of Edinburgh DataShare Collections currently. Thirty-three eligible Collections (already containing at least one dataset) such as “Connectomic analysis of motor units in the mouse fourth deep lumbrical muscle”, the Edinburgh Imaging “Image Library” and “Generation Scotland”.

MOST PROLIFIC DATASHARER: Professor Richard Baldock
– the most prolific depositor into Edinburgh DataShare for the academic year 2016-17, and over the lifetime of the repository, having shared a grand total of 1,105 data items with full metadata. These are grouped together into numerous Collections under the heading of “e-Mouse Atlas”. The majority of these detailed images show microscope slides of stained tissue; others are 3D models. They accompany a book and website published by Professor Baldock, building on the seminal work of Professor Matt Kaufman in developmental biology. The metadata for each of the slides links to a lower-definition version within the e-Mouse Atlas website, where the data may be viewed and navigated in context. The original slides themselves are held by the University’s Centre for Research Collections.

detail of histological slide showing stained cells

Detail from Elizabeth Graham; Julie Moss; Nick Burton; Yogmatee Roochun; Chris Armit; Lorna Richardson; Richard Baldock. (2015). eHistology Kaufman Atlas Plate 21a image d, [image]. University of Edinburgh. College of Medicine and Veterinary Medicine. http://dx.doi.org/10.7488/ds/735.

MOST PROLIFIC DATASHARER (CSE): Professor Euan Brechin
– the depositor of the greatest number of Edinburgh DataShare items from the College of Science and Engineering in academic year 2016-2017. Euan deposits his coordination chemistry research data so frequently that we set up a Collection template on the Brechin Research Group, which automatically pre-populates some of the metadata fields for him, saving Euan time. If only we could find a way to mention metallosupramolecular cubes here.

The certificate awarded to Professor Euan Brechin

MOST PROLIFIC DATASHARER (CAHSS): Dr Andrea Martin
– the depositor of the greatest number of Edinburgh DataShare items from the College of Arts, Humanities and Social Sciences in academic year 2016-2017. Some of these “Language Cognition and Communication” data items are still under temporary embargo. Users may nonetheless see all the metadata.

MOST POPULAR SHARED DATA: Professor Peter Sandercock
– the depositor of the Edinburgh DataShare item which has attracted the greatest number of page views over the lifetime of the repository: “International Stroke Trial database (version 2)” (aka IST-1). These data from the International Stroke Trial provide a great example of how clinical trial data may be anonymised to allow them to be shared. For more information, you may want to watch Prof Sandercock’s very accessible and detailed public lecture. Admittedly, one other item is higher up DataShare’s table of page views than IST. However, we believe the traffic drawn by “RCrO3-xNx ChemComm 2016” to be artifactual, arising from the appearance of the word ‘doping’ in its abstract, and the fact that the deposit was made at a time when doping in sport was very prominent in the news media. Additionally, the earlier, superseded version of the IST-1 dataset also appears in the all-time top ten, and if we combine the views of the two versions, IST-1 takes the No.1 spot outright :-)

MOST POPULAR DATA 2016-17: Dr. Junichi Yamagishi
– the depositor of the Edinburgh DataShare item which has attracted the greatest number of page views (1,720 to be precise, as counted by Google Analytics) over the academic year 2016-17: “Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) Database”. Here’s the suggested citation, which DataShare compiles automatically, and displays prominently, to encourage users to cite the data:

Wu, Zhizheng; Kinnunen, Tomi; Evans, Nicholas; Yamagishi, Junichi. (2015). Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2015) Database, [dataset]. University of Edinburgh. The Centre for Speech Technology Research (CSTR). http://dx.doi.org/10.7488/ds/298.

MOST POPULAR DATA 2016-17 (CAHSS): Professor Miles Glendinning

– the depositor of the Edinburgh DataShare item from the College of Arts, Humanities and Social Sciences which has attracted the greatest number of page views (1,374 to be precise, as counted by Google Analytics), over the academic year 2016-17: “Hong Kong Public Housing Archive”. The Research Data Service is working closely with Miles, Personal Chair of Architectural Conservation, on a series of batch imports to put his fabulous array of photographs of public housing tower blocks from all around the world on DataShare over the coming months – keep an eye on DOCOMOMO International Mass Housing Archive.

Sunny image of the façade of several tower blocks; a tree is visible in the foreground.

Image cropped from “HKI_H_Yue_Fai_Ct.jpg” from Glendinning, Miles; Forsyth, Louise; Maxwell, Gavin; Wood, Michael. (2015). Hong Kong Public Housing Database, 2006-2015 [image]. University of Edinburgh. Edinburgh College of Art. http://dx.doi.org/10.7488/ds/322.

MOST POPULAR DATA 2016-17 (MVM): Dr. Tom Pennycott
– the depositor of the Edinburgh DataShare Collection page from the College of Medicine and Veterinary Medicine which has attracted the greatest number of page views over the academic year 2016-17: “Diseases of Wild Birds”. Hundreds of grotesquely beautiful photographs of dead wild birds, bodies ravaged with viruses, bacteria and protists, found at locations all around the United Kingdom; these images support the PhD thesis of Dr Tom Pennycott from our Veterinary School.

You can see usage statistics for any DataShare Item or Collection simply by clicking on the “View usage statistics” button on the right-hand side of the page.

Pauline Ward, Research Data Service Assistant
EDINA and Data Library

DataShare 3.0: The ‘Download Release’ means deposits up to 100 GB

With the DataShare 3.0 release, completed on 6 October 2017, the data repository can manage data items of up to 100 GB. This means a single dataset of up to 100 GB can be cited with a single DOI, viewed at a single URL, and downloaded through the browser with a single click of our big red “Download all files” button. We’re not saying the system cannot handle datasets larger than this, but 100 GB is what we’ve tested for, and can offer with confidence. This release joins up the DSpace asset store to our managed filestore space (DataStore), which is what makes this milestone possible.

How to deposit up to 100 GB

In practice, what this means for users is:

– You can still upload up to 20 GB of data files as part of a single deposit via our web submission form.

– For sets of files over 20 GB, depositors may contact the Research Data Service team on data-support@ed.ac.uk to arrange a batch import. The key improvement in this step is that all the files can be in a single deposit, displayed together on one page with their descriptive metadata, rather than split up into five separate deposits.

Users of DataShare can now also benefit from MD5 integrity checking

The MD5 checksum of every file in DataShare is displayed (on the Full Item view), including files in historic deposits. This allows users downloading files to check their integrity.

For example, suppose I download Professor Richard Ribchester’s fluorescence microscopy of the neuromuscular junction from http://datashare.is.ed.ac.uk/handle/10283/2749. N.B. the “Download all files” button in this release works differently from before. One difference users will see is that the zip file it downloads is now named with the two numbers from the deposit’s handle identifier, separated by an underscore instead of a forward slash. So I’ve downloaded the file “DS_10283_2749.zip”.

I want to ensure there was no glitch in the download – I want to know the file I’ve downloaded is identical to the one in the repository. So, I do the following:

  • Click on “Show full item record”.
  • Scroll down to the red button labelled “Download all files”, where I see “zip file MD5 Checksum: a77048c58a46347499827ce6fe855127” (see screenshot). I copy the checksum (highlighted in yellow).

    screenshot from DataShare showing where the MD5 checksum hash of the zip file is displayed

    DataShare displays MD5 checksum hash

  • On my PC, I generate the MD5 checksum hash of the downloaded copy, and then I check that the hash on DataShare matches. There are a number of free tools available for this task: I could use the Windows command line, or I could use an MD5 utility such as the free “MD5 and SHA Checksum Utility”. In the case of the Checksum Utility, I do this as follows:
    • I paste the hash I copied from DataShare into the desktop utility (ignoring the fact the program confusingly displays the checksum hashes all in upper case).
    • I click the “Verify” button.

In this case they are identical – I have a match. I’ve confirmed the integrity of the file I downloaded.

Screenshot showing result of MD5 match

The MD5 checksum hashes match each other.
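The same check can also be scripted rather than done in a desktop utility. The following is a minimal sketch in Python using only the standard library; the function names are illustrative and are not part of DataShare.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a multi-gigabyte zip file never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, published):
    """Compare case-insensitively, since some tools print hashes in upper case."""
    return md5_of_file(path) == published.strip().lower()
```

For the download described above, the call would be `matches_published_checksum("DS_10283_2749.zip", "a77048c58a46347499827ce6fe855127")`, which returns `True` only if the local copy hashes to the value shown on DataShare.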

More confidence in request-a-copy for embargoed files

Another improvement we’ve made is to give depositors confidence in the request-a-copy feature. If the files in your deposit are under temporary embargo, they will not be available for users to download directly. However, users can send you a request for the files through DataShare, which you’ll receive via email, as described in an earlier blogpost. If you then agree to the request using the form and the “Send” button in DataShare, the system will attempt to email the files to the user. However, as we all know, some files are too large for email servers.

If the email server refuses to send the email message because the attachment is too large, DataShare 3.0 will immediately display an error message for you in the browser saying “File too large”, allowing you to make alternative arrangements to get those files to the user. Otherwise, the system moves on to offer you a chance to change the permissions on the file to open access. So, if you see no error after clicking “Send”, you’ll have peace of mind that the files have been sent successfully.

Pauline Ward, Research Data Service Assistant
EDINA and Data Library

EDINA’s ShareGeo Open content into DataShare

Many fascinating datasets can be found in our new ShareGeo Open Collection: http://datashare.is.ed.ac.uk/handle/10283/2345.

This data represents the entire contents of EDINA’s geospatial repository, ShareGeo Open, successfully imported into DataShare. We took this step to preserve the ShareGeo Open data, after the decision was taken to end the service. Not only have we maintained the accessibility of the data, but we have also redirected all the handle persistent identifiers, so that any existing links to the data, including those in academic journal articles, have been preserved, such as the one in this paper: http://dx.doi.org/10.1007/s10393-016-1131-y.

Similarly, should the day ever arrive when DataShare is to be closed, we would endeavour to find a suitable repository to which we could migrate our data to ensure its preservation, as per item 13 of our Preservation policy.

We were able to copy the content of almost all metadata fields from ShareGeo to DataShare. The fact that both repositories use the Dublin Core metadata standard, and both were running on DSpace, made the task a little easier. The University of Edinburgh supports the Dublin Core Metadata Initiative. DataShare’s metadata schema can be found at https://www.wiki.ed.ac.uk/display/datashare/Current+metadata+schema, which sets out what our metadata fields are and which values are permitted in them.

Our EDINA sysadmin (and developer) George was very helpful in answering our questions and in the discussions that took place while the team settled on the most appropriate correspondence between the two schemas. The existing documentation was a great help too. George then produced a Python script to harvest the data, using OAI-PMH to get a list of ShareGeo items, then METS for the metadata and bitstreams. He then used SWORD to deposit them all in DataShare.
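George’s script itself is not reproduced here. As a rough illustration of the first step only (listing items over OAI-PMH), here is a minimal Python sketch. The base URL and the record identifiers in the sample are made up for the example, and the METS and SWORD steps are left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical response body, shaped like a ListIdentifiers reply.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:repository.example.org:item-1</identifier></header>
    <header><identifier>oai:repository.example.org:item-2</identifier></header>
  </ListIdentifiers>
</OAI-PMH>"""


def list_identifiers_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListIdentifiers request URL for a repository endpoint."""
    query = urlencode({"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_identifiers(response_xml):
    """Return the identifier of every record header in a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    return [
        header.findtext("oai:identifier", namespaces=OAI_NS)
        for header in root.iterfind(".//oai:header", OAI_NS)
    ]


if __name__ == "__main__":
    # The endpoint below is a placeholder, not the real ShareGeo address.
    print(list_identifiers_url("https://repository.example.org/oai/request"))
    print(parse_identifiers(SAMPLE_RESPONSE))
```

A real harvest would also need to follow the protocol’s `resumptionToken` paging when a repository returns its list in several parts; that is omitted to keep the sketch short.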

The team took the opportunity to use DSpace’s batch metadata editing utility and web interface to clean up some of the metadata: adding dates to the temporal coverage field and adding placenames and country abbreviations to the spatial coverage field, to enhance the discoverability of the data.

For example, “GB Postcode Areas” can be found using the original handle persistent identifier: http://hdl.handle.net/10672/51 as well as the new DOI which DataShare has given it – DOI: 10.7488/ds/1755. Each of the 255 items migrated to our ShareGeo Open Collection contains a file called metadata.xml, which holds all the metadata exactly as it was when exported from ShareGeo itself. I have manually added placenames in the spatial coverage field (which was used differently in ShareGeo, with a bounding box, e.g. “northlimit=60.7837;eastlimit=2.7043;southlimit=49.8176;westlimit=-7.4856;”). Many of these datasets cover Great Britain, so they don’t include Northern Ireland but do include Scotland, England and Wales. In this case I’ve added the words “Scotland”, “England” and “Wales” in Spatial Coverage (‘dc.coverage.spatial’), even though these are already implicit in the “Great Britain” value in the same field, because I believe doing so:

  • enhanced the accessibility of the data (by making the geographical extent clearer for users unfamiliar with Great Britain) and…
  • enhanced the discoverability of the data (users searching Google for “Wales” now have a chance of seeing this dataset among the hits).
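To show what the ShareGeo-style bounding box value holds, here is a small Python sketch that parses the string quoted above into numbers. The function names are illustrative, and the point-in-box test assumes a box that does not cross the 180° meridian.

```python
def parse_bounding_box(value):
    """Parse 'northlimit=..;eastlimit=..;southlimit=..;westlimit=..;' into floats."""
    limits = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece left by the trailing semicolon
        key, _, number = part.partition("=")
        limits[key.strip()] = float(number)
    return limits


def contains_point(limits, latitude, longitude):
    """Return True if the point lies inside the rectangular box."""
    return (
        limits["southlimit"] <= latitude <= limits["northlimit"]
        and limits["westlimit"] <= longitude <= limits["eastlimit"]
    )


GB_BOX = parse_bounding_box(
    "northlimit=60.7837;eastlimit=2.7043;southlimit=49.8176;westlimit=-7.4856;"
)
```

A rectangle is only an approximation of the area covered. For instance, `contains_point(GB_BOX, 54.6, -5.93)`, a point roughly at Belfast, is true for this box even though the dataset covers Great Britain only, which is one more reason explicit placenames help users.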

James Crone, who compiled this “GB Postcode Areas” data, is part of EDINA’s highly renowned geospatial services team.

Part of James’ work for EDINA involves producing census geography data for the UK Data Service. He has recently added updated boundary data for use with the latest anonymised census microdata (that’s from the 2011 census): see the Boundary Data Selector at https://census.ukdataservice.ac.uk/get-data/boundary-data.

Pauline Ward is a Research Data Service Assistant for the University of Edinburgh, based at EDINA.

Detail from GB Postcode Areas data, viewed using QGIS.

DataShare upgraded to v2.3 – The embargo enhancement release

The latest upgrade of Edinburgh DataShare, from version 2.2 to 2.3, brings in several usability improvements.

  • Embargo expiry reminder
    If you want to deposit your data in DataShare, but you want to impose a delay before your files become freely downloadable, you can apply an embargo to your submission – see our “Checklist for deposit” for a fuller explanation of the embargo feature. As of DataShare v2.3, if you apply an embargo to your deposit, DataShare will now send you an email reminder one week before the embargo is due to expire. This gives you time to make us aware if you need the embargo to be extended, or to send us the details of your paper if it has been published, so that we can add those to the metadata, to help users understand your data.
  • DOI added to the citation field immediately
    When your DataShare deposit is approved by the curator, the system mints a new DOI for you. As of version 2.3, DataShare now immediately appends the URL containing that DOI into the “Citation” field, which is visible at the top of the summary view page of your item. The “Citation” field makes it easy for others to cite your data, because it provides them with text which they can copy and paste into any manuscript (or any other document where they want to cite the data). Previously you would have had to click on “Show full item record” to look for the DOI in the “Persistent identifier” field, or wait for an overnight script to paste the DOI onto the end of the “Citation” field.
  • Tombstone records
    We now have the ability to leave a ‘tombstone’ record in place for any DataShare item that is withdrawn. We only withdraw items in exceptional circumstances – for example where there is a substantive error or omission in the data, such that we feel merely labelling the item as “Superseded” is not sufficient. Now, when we tombstone an item, the files become unavailable indefinitely, but the metadata remain visible at the DOI and handle URLs. Until now, every withdrawn item became completely invisible, so that the original DOI and handle URLs produced a ‘not found’ error.

Screenshot of a DataShare item's citation field with the DOI

Cortical parcellation citation – now with DOI!
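The timing of the embargo expiry reminder described above can be pictured with a few lines of Python. This is purely an illustration of the “one week before” rule; it is not the DataShare code, and the function names are made up.

```python
from datetime import date, timedelta

# The reminder goes out one week before the embargo expires.
REMINDER_LEAD = timedelta(weeks=1)


def reminder_date(embargo_end):
    """Return the day on which the depositor should be reminded."""
    return embargo_end - REMINDER_LEAD


def reminder_due(embargo_end, today):
    """Return True only on the day the reminder should be sent."""
    return today == reminder_date(embargo_end)
```

For an embargo ending on 8 January 2018, `reminder_date(date(2018, 1, 8))` gives 1 January 2018.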

Enjoy!

Pauline Ward

Research Data Service

P.S. Many thanks to our software developer at EDINA, George Hamilton, who actually coded all these enhancements to DataShare, which uses the open-source DSpace system. EDINA’s DataShare code is available at https://github.com/edina/dspace .
