Twenty’s Plenty: DataShare v2.1 Upload Upgrade

We have upgraded DataShare (to v2.1) to enable HTML5 resumable upload. This means depositors can now use the user-friendly web deposit interface to upload numerous files at once via drag’n’drop, and to upload files up to 15 GB in size, regardless of network ‘blips’.

In fact we have reason to believe it may be possible to upload a 20 GB file this way: in testing, I gave it 2 hours till the progress bar said 100%, and even though the browser then produced an error message instead of the green tick I was hoping for, I found when I retrieved the submission from the Submissions page that I was able to resume, and the file had been added.

*** So our new advice to depositors is: our current Item size limit and file size limit are both 20 GB. Files larger than 15 GB may not upload through your browser. If you have files over 15 GB or data totalling over 20 GB which you’d like to share online, please contact the Data Library team to discuss your options. ***
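
For anyone who wants to check a set of files before starting a deposit, the advice above can be written as a simple rule. This is only an illustrative sketch: the function name and constants are made up for this post and are not part of DataShare, and the only facts used are the 15 GB and 20 GB thresholds stated above.

```python
# Hypothetical helper, not part of DataShare: encodes the depositor advice above.
BROWSER_RELIABLE_GB = 15  # files larger than this may not upload through a browser
ITEM_LIMIT_GB = 20        # current Item size limit and file size limit


def deposit_advice(file_sizes_gb):
    """Suggest a deposit route for a list of file sizes given in GB."""
    total = sum(file_sizes_gb)
    largest = max(file_sizes_gb, default=0)
    # Any single file over 15 GB, or more than 20 GB in total, needs a conversation.
    if largest > BROWSER_RELIABLE_GB or total > ITEM_LIMIT_GB:
        return "contact the Data Library team"
    return "use the web deposit interface"


if __name__ == "__main__":
    print(deposit_advice([10, 5]))
    print(deposit_advice([16]))
```

A deposit of two 12 GB files, for example, falls under the "contact" route because the total exceeds 20 GB, even though each file is under 15 GB.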

See screenshots below. Once the files have been selected and the upload commenced, the ‘Status’ column shows the percentage uploaded. A 10 GB file may take in the region of 1 hour to upload in this way. 15 GB files have been uploaded with Chrome, Firefox and Internet Explorer using this interface.
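
To get a rough feel for how long a given upload might take, the "about 1 hour for 10 GB" figure above can be scaled linearly. The sketch below assumes that rate holds for other file sizes, which is an assumption on my part: real speeds depend on your network and browser. The function and constant names are illustrative only.

```python
# Assumed rate, taken from the single "10 GB in about 1 hour" figure above.
ASSUMED_RATE_GB_PER_HOUR = 10.0


def estimate_upload_minutes(size_gb, rate_gb_per_hour=ASSUMED_RATE_GB_PER_HOUR):
    """Estimate upload time in minutes, assuming a constant transfer rate."""
    if size_gb < 0:
        raise ValueError("size_gb must not be negative")
    if rate_gb_per_hour <= 0:
        raise ValueError("rate_gb_per_hour must be positive")
    return size_gb / rate_gb_per_hour * 60


if __name__ == "__main__":
    for size in (1, 10, 15, 20):
        print(f"{size} GB: about {estimate_upload_minutes(size):.0f} minutes")
```

Under this assumption a 15 GB file would take around 90 minutes, and a 20 GB file around 2 hours.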

Until now, any file over 1 GB caused browsers difficulties, so many prospective depositors could not use the web deposit interface. Instead they had to email the curation team and arrange to transfer their files to us via Dropbox, USB or the Windows network. The curator then had to transfer the same files to our server, collate the metadata into an XML file, log into the Linux system and run a batch import script, often with many hiccups concerning permissions, virus checkers and memory along the way. All very time-consuming.

Soon we will begin working on a download upgrade, to integrate a means for users to download much bigger files from DataShare outside of the limitations of HTTP (perhaps using FTP). The aim is to allow some of the datasets we have in the university which are in the region of 100 GB to be shared online in a way that makes it reasonably quick and easy for users to download them. We have depositors queueing up to use this feature. Watch this space.

Further technical detail about both the HTML5 upload feature and plans for an optimised large download release are available on the slides for the presentation I made at Open Repositories 2016 in Dublin this week: http://www.slideshare.net/paulineward/growing-open-data-making-the-sharing-of-xxlsized-research-data-files-online-a-reality-using-edinburgh-datashare .

A simple interface invites the depositor to select files to upload.

 

 

A 15 GB file uploaded via Firefox on Windows and included in a submitted Item.

 

 

A 20 GB file uploaded and included in an incomplete submission.

Pauline Ward, Data Library Assistant, University of Edinburgh

DATA-X Workshop 2

We are holding our second DATA-X workshop on Wednesday 15 June at the
James Clerk Maxwell Building, Room 3217 and are inviting PhD students
and technologists to come along and participate in what we hope will
be lively discussion and activities.

We aim to engender Art and Science collaborations by offering
micro-funds towards each ‘installation’ as well as the opportunity to
publish in the exhibition catalogue and present at the Pioneering
Research Data Symposium later in the year.

We have over 20 registered participants with a range of research
interests including:
Crystal structure, Raman spectroscopy, Structural biology,
Measurement-Based Power Systems Control, Astrophysics, Polymer-sensors,
Biological data mining, Computational Mechanics,
Internationalization of Higher Education, Bioinformatics, Evolution,
Genomics, Visual sociology, Advertising, National identity, Environment,
Agriculture, Nutrient Management, Soil, Pollen, Farmers, Social Network
Analysis, Food security, Systems biology, Cell level
modelling, Cell physiology, Mobile User Experience, Environmental
Sustainability, Political science, Human rights, Data materialisation,
Digital fabrication, Practice based research, Synthetic
biology!

To register for the workshop (and get a free lunch) see:
http://data-x.blogs.edina.ac.uk/

To find out more about DATA-X see:
http://data-x.blogs.edina.ac.uk/about/

or watch the short YouTube
trailer: https://youtu.be/NMPPZZc-sZ4

Please get in contact should you require further information.

All best,
Stuart Macdonald
DATA-X Project Manager / Associate Data Librarian
EDINA

Highlights from the RDM Programme Progress Report: November 2015 – January 2016

Data Seal of Approval have awarded DataShare Trusted Repository status; their assessment of our service can be read at https://assessment.datasealofapproval.org/assessment_175/seal/html/. In addition, a major new release of DataShare was completed in November; this makes the code open on GitHub and brings general improvements to the look and feel of the website.

The ‘interim’ DataVault is now in final testing and will be rolled out on a request basis to those researchers who can demonstrate an urgent need to use the service now rather than waiting until the final version is ready later this year. The phase three funding for development of the DataVault has been received from Jisc; this runs from March to August, so the final version should be ready for launch sometime after that. The project was presented at the International Digital Curation Conference in February 2016.

Over the three-month period a total of 328 staff and postgraduate researchers attended a Research Data Management (RDM) course or workshop.

Work on the MANTRA MOOC (Massive Open Online Course) was expected to be finalised in February and launched on 1st March, at the following URL: https://www.coursera.org/learn/data-management.

The University of Edinburgh wrote the Working with Data section (one out of 5 weeks of the course) and, with the help of the Learning, Teaching and Web division of Information Services, completed two video interviews with researchers and a ‘vox pop’ video clip of clinical researchers at the EQUATOR conference in Edinburgh in autumn 2015. The content is open source and videos can be added to our YouTube channel to help with promotion. There will be some income from this, based on certificates of completion priced at $49 or £33, but our portion will be smaller than that of our partner, the University of North Carolina.

The need to create a dataset record in PURE for each dataset published, or referenced in a publication, is now being emphasised in all Research Data Service communications, formal and informal, and to staff at all levels. Uptake is understandably low at this point but we hope to see a steady increase as researchers and support staff begin to see the benefits of adding datasets to their research profile. In the case of DataShare records, a draft mapping of fields between DataShare and PURE has been produced as a start of a plan for migrating records from DataShare to PURE.

By the end of January 2016, 69 records had been created and published on Edinburgh Research Explorer.

Four interns have been employed using funding from Jisc as part of the UK Research Data Discovery Service (UKRDDS) project, which aims to create a national aggregate register of data sets. A trial site is available at: http://ckan.data.alpha.jisc.ac.uk/. The UKRDDS interns will help to create PURE records and upload open data into DataShare, and raise awareness of RDM generally within their schools. There are currently three PhD interns in place, in LLC, SOS and Roslin; two more, in LLC and DIPM, will start in February. The approach each intern takes will depend on the nature and structure of their school and will, in some cases, be mediated by research administrators.

An innovation fund grant has been received to fund the delivery of an exhibition, “Pioneering Research Data”. Each college will be represented by a PhD intern; recruitment has already begun and they should be in post by the end of March. The exhibition is due to be delivered in November of this year.

National and International Engagement Activities

Robin Rice led a panel at the IPRES conference, Chapel Hill, North Carolina, on 3rd November called ‘“Good, better, best”? Examining the range and rationales of institutional data curation practices’.

Robin Rice had a proposal accepted for the forthcoming Force11 (2016) conference, on Overcoming Obstacles to Sharing Data about Human Subjects, building on the training course we are delivering, Working with Personal and Sensitive Data.

Kerry Miller
RDM Service Coordinator

Publishing Data Workflows

[Guest post from Angus Whyte, Digital Curation Centre]

In the first week of March the 7th Plenary session of the Research Data Alliance got underway in Tokyo. Plenary sessions are the fulcrum of RDA activity, when its many Working Groups and Interest Groups try to get as much leverage as they can out of the previous 6 months of voluntary activity, which is usually coordinated through crackly conference calls.

The Digital Curation Centre (DCC) and others in Edinburgh contribute to a few of these groups, one being the Working Group (WG) on Publishing Data Workflows. Like all such groups it has a fixed time span and agreed deliverables. This WG completes its run at the Tokyo plenary, so there’s no better time to reflect on why DCC has been involved in it, how we’ve worked with others in Edinburgh and what outcomes it’s had.

DCC takes an active part in groups where we see a direct mutual benefit, for example by finding content for our guidance publications. In this case we have a How-to guide planned on ‘workflows for data preservation and publication’. The Publishing Data Workflows WG has taken some initial steps towards a reference model for data publishing, so it has been a great opportunity to track the emerging consensus on best practice, not to mention examples we can use.

One of those examples was close to hand, and DataShare’s workflow and checklist for deposit is identified in the report alongside workflows from other participating repositories and data centres. That report is now available on Zenodo. [1]

In our mini-case studies, the WG found no hard and fast boundaries between ‘data publishing’ and what any repository does when making data publicly accessible. It’s rather a question of how much additional linking and contextualisation is in place to increase data visibility, assure the data quality, and facilitate its reuse. Here’s the working definition we settled on in that report:

Research data publishing is the release of research data, associated metadata, accompanying documentation, and software code (in cases where the raw data have been processed or manipulated) for re-use and analysis in such a manner that they can be discovered on the Web and referred to in a unique and persistent way.

The ‘key components’ of data publishing are illustrated in this diagram produced by Claire C. Austin.

Data publishing components. Source: Claire C. Austin et al [1]

As the Figure implies, a variety of workflows are needed to build and join up the components. They include those ‘upstream’ around the data collection and analysis, ‘midstream’ workflows around data deposit, packaging and ingest to a repository, and ‘downstream’ to link to other systems. These downstream links could be to third-party preservation systems, publisher platforms, metadata harvesting and citation tracking systems.

The WG recently began some follow-up work to our report that looks ‘upstream’ to consider how the intent to publish data is changing research workflows. Links to third-party systems can also be relevant in these upstream workflows. It has long been an ambition of RDM to capture as much as possible of the metadata and context, as early and as easily as possible. That has been referred to variously as ‘sheer curation’ [2] and ‘publication at source’ [3]. So we gathered further examples, aiming to illustrate some of the ways that repositories are connecting with these upstream workflows.

Electronic lab notebooks (ELN) can offer one route towards fly-on-the-wall recording of the research process, so the collaboration between Research Space and University of Edinburgh is very relevant to the WG. As noted previously on these pages [4], [5], the RSpace ELN has been integrated with DataShare so researchers can deposit directly into it. So we appreciated the contribution Rory Macneil (Research Space) and Pauline Ward (UoE Data Library) made to describe that workflow, one of around half a dozen gathered at the end of the year.

The examples the WG collected each show how one or more of the recommendations in our report can be implemented. There are 5 of these short, to-the-point recommendations:

  1. Start small, building modular, open source and shareable components
  2. Implement core components of the reference model according to the needs of the stakeholder
  3. Follow standards that facilitate interoperability and permit extensions
  4. Facilitate data citation, e.g. through use of digital object PIDs, data/article linkages, researcher PIDs
  5. Document roles, workflows and services

The RSpace-DataShare integration example illustrates how institutions can follow these recommendations by collaborating with partners. RSpace is not open source, but the collaboration does use open standards that facilitate interoperability, namely METS and SWORD, to package up lab books and deposit them for open data sharing. DataShare facilitates data citation, and the workflows for depositing from RSpace are documented, based on DataShare’s existing checklist for depositors. The workflow integrating RSpace with DataShare is shown below:

RSpace-DataShare Workflows

For me one of the most interesting things about this example was learning about the delegation of trust to research groups that can result. If the DataShare curation team can identify an expert user who is planning a large number of data deposits over a period of time, and train them to apply DataShare’s curation standards themselves, they would be given administrative rights over the relevant Collection in the database, and the curation step would be entrusted to them.

As more researchers take up the challenges of data sharing and reuse, institutional data repositories will need to make depositing as straightforward as they can. Delegating responsibilities and the tools to fulfil them has to be the way to go.

 

[1] Austin, C. et al. (2015). Key components of data publishing: Using current best practices to develop a reference model for data publishing. Available at: http://dx.doi.org/10.5281/zenodo.34542

[2] ‘Sheer Curation’ Wikipedia entry. Available at: https://en.wikipedia.org/wiki/Digital_curation#.22Sheer_curation.22

[3] Frey, J. et al. (2015) Collection, Curation, Citation at Source: Publication@Source 10 Years On. International Journal of Digital Curation, Vol. 10, No. 2, pp. 1-11. doi:10.2218/ijdc.v10i2.377

[4] Macneil, R. (2014) Using an Electronic Lab Notebook to Deposit Data http://datablog.is.ed.ac.uk/2014/04/15/using-an-electronic-lab-notebook-to-deposit-data/

[5] Macdonald, S. and Macneil, R. (2015) Service Integration to Enhance Research Data Management: RSpace Electronic Laboratory Notebook Case Study. International Journal of Digital Curation, Vol. 10, No. 1, pp. 163-172. doi:10.2218/ijdc.v10i1.354

Angus Whyte is a Senior Institutional Support Officer at the Digital Curation Centre.

 
