Data Carpentry & Software Carpentry workshops

The Research Data Service hosted back-to-back 2-day workshops in the Main Library this week, run by the Software Sustainability Institute (SSI) to train University of Edinburgh researchers in basic data science and research computing skills.

Learners at Data Carpentry workshop

Software Carpentry (SC) is a popular global initiative, originating in the US, that trains researchers in good practice for writing, storing and sharing code. Both SC and its newer offshoot, Data Carpentry, teach methods and tools that help researchers make their science reproducible. The SSI, based at the Edinburgh Parallel Computing Centre (EPCC), organises workshops for both throughout the UK.

Martin Callaghan, University of Leeds, introduces goals of Data Carpentry workshop.

Each workshop is taught by instructors trained by the SC organisation, using proven methods of delivery, to learners working on their own laptops, with plenty of support from knowledgeable helpers. Instructors at our workshops came from Leeds and EPCC. Comments from the learners – staff and postgraduate students from a range of schools – included ‘Variety of needs and academic activities/disciplines catered for. Useful exercises and explanations,’ and ‘Very powerful tools.’

Lessons can vary between different workshops, depending on the level of the learners and their requirements, as determined by a pre-workshop survey. The Data Carpentry workshop on Monday and Tuesday included:

  • Using spreadsheets effectively
  • OpenRefine
  • Introduction to R
  • R and visualisation
  • Databases and SQL
  • Using R with SQLite (see the sketch after this list)
  • Managing Research & Data Management Plans
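
As a flavour of the ‘Using R with SQLite’ lesson, here is a minimal sketch of the same idea: querying a local SQLite file from a script. The workshop taught this in R (via packages such as RSQLite); the sketch below uses Python’s built-in sqlite3 module instead, and the file and table names are illustrative placeholders rather than the workshop’s actual materials.

    # Querying a local SQLite database from a script, the kind of task
    # covered in the "Using R with SQLite" lesson, sketched here in Python.
    import sqlite3

    # Hypothetical survey database file (the lesson uses a small teaching dataset)
    conn = sqlite3.connect("survey_data.sqlite")
    cur = conn.cursor()

    # An aggregate query of the sort taught in the Databases and SQL lesson:
    # count the records per species, most numerous first
    cur.execute(
        "SELECT species_id, COUNT(*) AS n "
        "FROM surveys GROUP BY species_id ORDER BY n DESC"
    )
    for species_id, n in cur.fetchmany(5):
        print(species_id, n)

    conn.close()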

The Software Carpentry workshop was aimed at researchers who write their own code, and covered the following topics:

  • Introduction to the Shell
  • Version Control
  • Introduction to Python (see the sketch after this list)
  • Using the Shell (scripts)
  • Version Control (with Github)
  • Open Science and Open Research
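
As a taste of the ‘Introduction to Python’ lesson, here is a minimal sketch of the kind of first analysis novice workshops typically build up to: reading a small CSV of numeric data and summarising it. The file name is a placeholder, not necessarily the dataset used at this workshop.

    # Read a small CSV of numeric observations and print a per-row summary,
    # in the spirit of a first Software Carpentry Python exercise.
    import csv
    from statistics import mean

    # Placeholder file name: one row per patient, one column per day
    with open("inflammation-01.csv", newline="") as f:
        rows = [[float(value) for value in row] for row in csv.reader(f)]

    for i, row in enumerate(rows[:5]):
        print(f"patient {i}: mean daily value = {mean(row):.2f}")
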
Software Carpentry learners

The workshops were clearly valued by learners and well worth running. The team will consider how it can offer similar workshops in future at a similarly low cost; your ideas are welcome!

Robin Rice
EDINA and Data Library


DATA-X Workshop 2

We are holding our second DATA-X workshop on Wednesday 15 June at the
James Clerk Maxwell Building, Room 3217 and are inviting PhD students
and technologists to come along and participate in what we hope will
be lively discussion and activities.

We aim to engender Art and Science collaborations by offering
micro-funds towards each ‘installation’ as well as the opportunity to
publish in the exhibition catalogue and present at the Pioneering
Research Data Symposium later in the year.

We have over 20 registered participants with a range of research
interests including:
Crystal structure, Raman spectroscopy, Structural biology,
Measurement-Based Power Systems Control, Astrophysics, Polymer-sensors,
Biological data mining, Computational Mechanics,
Internationalization of Higher Education, Bioinformatics, Evolution,
Genomics, Visual sociology, Advertising, National identity, Environment,
Agriculture, Nutrient Management, Soil, Pollen, Farmers, Social Network
Analysis, Food security, Systems biology, Cell level
modelling, Cell physiology, Mobile User Experience, Environmental
Sustainability, Political science, Human rights, Data materialisation,
Digital fabrication, Practice based research, Synthetic
biology!

To register for the workshop (and get a free lunch) see:
http://data-x.blogs.edina.ac.uk/

To find out more about DATA-X see:
http://data-x.blogs.edina.ac.uk/about/

or watch the short YouTube trailer: https://youtu.be/NMPPZZc-sZ4

Please get in contact should you require further information.

All best,
Stuart Macdonald
DATA-X Project Manager / Associate Data Librarian
EDINA


Publishing Data Workflows

[Guest post from Angus Whyte, Digital Curation Centre]

In the first week of March the 7th Plenary session of the Research Data Alliance got underway in Tokyo. Plenary sessions are the fulcrum of RDA activity, when its many Working Groups and Interest Groups try to get as much leverage as they can out of the previous 6 months of voluntary activity, which is usually coordinated through crackly conference calls.

The Digital Curation Centre (DCC) and others in Edinburgh contribute to a few of these groups, one being the Working Group (WG) on Publishing Data Workflows. Like all such groups it has a fixed time span and agreed deliverables. This WG completes its run at the Tokyo plenary, so there’s no better time to reflect on why DCC has been involved in it, how we’ve worked with others in Edinburgh and what outcomes it’s had.

DCC takes an active part in groups where we see a direct mutual benefit, for example by finding content for our guidance publications. In this case we have a How-to guide planned on ‘workflows for data preservation and publication’. The Publishing Data Workflows WG has taken some initial steps towards a reference model for data publishing, so it has been a great opportunity to track the emerging consensus on best practice, not to mention examples we can use.

One of those examples was close to hand: DataShare’s workflow and checklist for deposit is identified in the report alongside workflows from other participating repositories and data centres. That report is now available on Zenodo. [1]

In our mini-case studies, the WG found no hard and fast boundaries between ‘data publishing’ and what any repository does when making data publicly accessible. It’s rather a question of how much additional linking and contextualisation is in place to increase data visibility, assure the data quality, and facilitate its reuse. Here’s the working definition we settled on in that report:

Research data publishing is the release of research data, associated metadata, accompanying documentation, and software code (in cases where the raw data have been processed or manipulated) for re-use and analysis in such a manner that they can be discovered on the Web and referred to in a unique and persistent way.

The ‘key components’ of data publishing are illustrated in this diagram produced by Claire C. Austin.

Data publishing components. Source: Claire C. Austin et al [1]

As the Figure implies, a variety of workflows are needed to build and join up the components. They include those ‘upstream’ around the data collection and analysis, ‘midstream’ workflows around data deposit, packaging and ingest to a repository, and ‘downstream’ to link to other systems. These downstream links could be to third-party preservation systems, publisher platforms, metadata harvesting and citation tracking systems.

The WG recently began some follow-up work to our report, looking ‘upstream’ to consider how the intent to publish data is changing research workflows. Links to third-party systems can also be relevant in these upstream workflows. It has long been an ambition of RDM to capture as much as possible of the metadata and context, as early and as easily as possible; that goal has been referred to variously as ‘sheer curation’ [2] and ‘publication at source’ [3]. So we gathered further examples, aiming to illustrate some of the ways that repositories are connecting with these upstream workflows.

Electronic lab notebooks (ELNs) can offer one route towards fly-on-the-wall recording of the research process, so the collaboration between Research Space and the University of Edinburgh is very relevant to the WG. As noted previously on these pages [4], [5], the RSpace ELN has been integrated with DataShare so researchers can deposit datasets directly into the repository. So we appreciated the contribution Rory Macneil (Research Space) and Pauline Ward (UoE Data Library) made to describe that workflow, one of around half a dozen gathered at the end of the year.

The examples the WG collected each show how one or more of the recommendations in our report can be implemented. The report makes five short, to-the-point recommendations:

  1. Start small, building modular, open source and shareable components
  2. Implement core components of the reference model according to the needs of the stakeholder
  3. Follow standards that facilitate interoperability and permit extensions
  4. Facilitate data citation, e.g. through use of digital object PIDs, data/article linkages, researcher PIDs (see the sketch after this list)
  5. Document roles, workflows and services
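
To make recommendation 4 concrete, here is a small sketch (not part of the WG report) of one payoff of citable, PID-identified data: the doi.org resolvers support content negotiation, so a dataset DOI, such as that of the report itself [1], can be turned into a ready-made citation programmatically.

    # Fetch a BibTeX citation for a dataset DOI via doi.org content negotiation.
    import urllib.request

    doi = "10.5281/zenodo.34542"  # the WG report cited as [1]
    req = urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))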

The RSpace-DataShare integration example illustrates how institutions can follow these recommendations by collaborating with partners. RSpace is not open source, but the collaboration does use open standards that facilitate interoperability, namely METS and SWORD, to package up lab books and deposit them for open data sharing. DataShare facilitates data citation, and the workflows for depositing from RSpace are documented, based on DataShare’s existing checklist for depositors. The workflow integrating RSpace with DataShare is shown below:

RSpace-DataShare Workflows
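
For readers curious what the deposit step can look like on the wire, here is a rough sketch, assuming a DSpace-style SWORD v2 endpoint of the kind DataShare exposes. The collection URL, credentials, and file name are placeholders, and the real RSpace integration’s configuration may differ; the Packaging header shown is the standard identifier for DSpace’s METS-based submission package.

    # A SWORD v2 "binary deposit" of a METS-packaged zip to a repository
    # collection. All endpoint details below are placeholders.
    import requests

    collection_uri = "https://datashare.example.ac.uk/swordv2/collection/123"

    with open("labbook-export.zip", "rb") as f:  # hypothetical METS SIP exported from the ELN
        resp = requests.post(
            collection_uri,
            data=f,
            headers={
                "Content-Type": "application/zip",
                "Content-Disposition": "filename=labbook-export.zip",
                # Standard identifier for DSpace's METS-based SIP packaging
                "Packaging": "http://purl.org/net/sword/package/METSDSpaceSIP",
            },
            auth=("depositor", "secret"),  # placeholder credentials
        )

    resp.raise_for_status()
    print("Deposit receipt at:", resp.headers.get("Location"))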

For me, one of the most interesting things about this example was learning about the delegation of trust to research groups that can result. If the DataShare curation team identify an expert user who is planning a large number of data deposits over a period of time, they can train that user to apply DataShare’s curation standards, grant them administrative rights over the relevant Collection in the database, and entrust the curation step for that Collection to them.

As more researchers take up the challenges of data sharing and reuse, institutional data repositories will need to make depositing as straightforward as they can. Delegating responsibilities and the tools to fulfil them has to be the way to go.

 

[1] Austin, C. et al. (2015). Key components of data publishing: Using current best practices to develop a reference model for data publishing. Available at: http://dx.doi.org/10.5281/zenodo.34542

[2] ‘Sheer curation’, Wikipedia entry. Available at: https://en.wikipedia.org/wiki/Digital_curation#.22Sheer_curation.22

[3] Frey, J. et al. (2015). Collection, Curation, Citation at Source: Publication@Source 10 Years On. International Journal of Digital Curation, Vol. 10, No. 2, pp. 1-11. https://doi.org/10.2218/ijdc.v10i2.377

[4] Macneil, R. (2014). Using an Electronic Lab Notebook to Deposit Data. Available at: http://datablog.is.ed.ac.uk/2014/04/15/using-an-electronic-lab-notebook-to-deposit-data/

[5] Macdonald, S. and Macneil, R. (2015). Service Integration to Enhance Research Data Management: RSpace Electronic Laboratory Notebook Case Study. International Journal of Digital Curation, Vol. 10, No. 1, pp. 163-172. https://doi.org/10.2218/ijdc.v10i1.354

Angus Whyte is a Senior Institutional Support Officer at the Digital Curation Centre.

 


Jisc Data Vault update

Posted on behalf of Claire Knowles

Research data are being generated at an ever-increasing rate. This brings challenges in how to store, analyse, and care for the data. Part of the problem is the long-term stewardship of researchers’ private data and associated files, which need a safe and secure home for the medium to long term.

The Data Vault project, funded by the Jisc #DataSpring programme, seeks to define and develop a Data Vault software platform that will allow data creators to describe and store their data safely in one of the growing number of options for archival storage. These may include cloud solutions, shared storage systems, or local infrastructure.

Future users of the Data Vault are invited to Edinburgh on 5th November to help shape the development work through discussions with the project team on use cases, example data, retention policies, and metadata.

Book your place at: https://www.eventbrite.co.uk/e/data-vault-community-event-edinburgh-tickets-18900011443

The aims of the second phase of the project are to deliver a first complete version of the platform by the end of November, including:

  • Authentication and authorisation
  • Integration with more storage options
  • Management / monitoring interface
  • Example interface to CRIS (PURE)
  • Development of retention and review policy
  • Scalability testing

Working towards these goals, the project team have held monthly face-to-face meetings, with regular Skype calls in between. The development work is progressing steadily, as you can see in the GitHub repository: https://github.com/DataVault, which now has over 300 commits. Progress is also tracked on the open Project Plan, where anyone can add comments.

So remember, remember the 5th November and book your ticket.

Claire Knowles, Library & University Collections, on behalf of the Jisc Data Vault Project Team
