By Peter Sefton
This presentation was given by Peter Sefton at the eResearch Australasia 2019 Conference in Brisbane, on the 24th of October 2019.
This presentation is part of a series of talks delivered here at eResearch Australasia - so it won’t go back over all of the detail already covered - see the introduction of DataCrate in 2017 and the 2018 update. The standard formerly known as DataCrate has been subsumed into a new standard called Research Object Crate - RO-Crate for short.
This is a recent snapshot of the makeup of the current RO-Crate team - compiled by Stian.
The website says: RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata. It is based on schema.org annotations in JSON-LD, and aims to make best-practice in formal metadata description accessible and practical for use in a wider variety of situations, from an individual researcher working with a folder of data, to large data-intensive computational research environments.
This is a timeline for the merging of the Research Object packaging work with DataCrate - again compiled by Stian. Our DataCrate work was driven by practical concerns and a desire to describe research data with high-quality metadata; Research Object shared those concerns, but with more of a focus on reproducibility and detailed provenance for research data.
This is what an RO-Crate looks like if you open the HTML file that’s in the root directory (or you see one on the web).
This is the home page for RO-Crate.
Where did RO-Crate come from? RO-Crate is the marriage of Research Objects with DataCrate. It aims to build on their respective strengths, but also to draw on lessons learned from those projects and similar research data packaging efforts. For more details, see background.
Who is it for?
The RO-Crate effort brings together practitioners from very different backgrounds, and with different motivations and use-cases. Among our core target users are: a) researchers engaged with computation and data-intensive, workflow-driven analysis; b) digital repository managers and infrastructure providers; c) individual researchers looking for a straight-forward tool or how-to guide to “FAIRify” their data; d) data stewards supporting research projects in creating and curating datasets.
RO-Crate is a collaboration between people all over the world, but the Editors are from Cork, Manchester and Katoomba. Version one of the standard will be out by Summer. But which summer? Standard reference points are important. Standards are important.
Which brings us to the benefits of Standards. Without this standardised date format, chaos would reign. What if that date had been written 05/08 or 08/05 - someone might end up eating food from May in August, or worse, eating last August’s food in May.
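To make the ambiguity concrete, here is a minimal Python sketch. The year 2019 is an assumption added purely so the strings can be parsed; the point is that "05/08" parses happily under both conventions, while the ISO 8601 form has only one reading.

```python
from datetime import date, datetime

ambiguous = "05/08"
assumed_year = "2019"  # assumption: a year is needed to parse the label at all

# Day-first reading: 5 August
day_first = datetime.strptime(f"{ambiguous}/{assumed_year}", "%d/%m/%Y").date()

# Month-first reading: 8 May
month_first = datetime.strptime(f"{ambiguous}/{assumed_year}", "%m/%d/%Y").date()

# ISO 8601 is always year-month-day, so there is nothing to guess
iso = date.fromisoformat("2019-08-05")

print(day_first.isoformat())
print(month_first.isoformat())
print(iso == day_first, iso == month_first)
```

Both parses succeed without complaint, which is exactly the problem: nothing in the string tells you which one the author meant.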
Anyway, if you find a partner who’ll adopt the ISO 8601 date standard then ...
… you should marry them.
Like how we married the Research Object and DataCrate - we bonded over standardisation.
Let’s explore standards a bit more. If you see this in metadata - what does it mean?
Is it a name given to the resource? URI: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/terms/title/
An honorific like Ms, or Dr? As it would be in the FOAF ontology.
Or a very specific meaning relating to job titles? As in Schema.org.
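One way to see why the bare word "title" is ambiguous is to expand it against each vocabulary's namespace. The sketch below is illustrative only: the `expand` helper is hypothetical (a real JSON-LD processor does far more), and the namespace IRIs are the commonly used ones for Dublin Core Terms, FOAF and Schema.org.

```python
# Commonly used namespace IRIs for the three vocabularies mentioned above
NAMESPACES = {
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "schema": "http://schema.org/",
}


def expand(term, namespaces):
    """Expand a compact 'prefix:name' term into a full IRI (hypothetical helper)."""
    prefix, _, name = term.partition(":")
    return namespaces[prefix] + name


# The same short name resolves to three different identifiers
iris = [expand(f"{prefix}:title", NAMESPACES) for prefix in NAMESPACES]
for iri in iris:
    print(iri)
```

Three different IRIs come out, and each carries its own definition, which is why linking every term to its definition matters.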
In RO-Crate there’s an HTML page which ships with each dataset that allows you to browse the object in as much detail as the author described it. We are careful to avoid ambiguity by adding help links to each metadata term, so you can see the definition.
Just wanted to shout out to ResearchGraph - led by Amir Aryani at Swinburne Uni - they are also using schema.org.
RO-Crates ship with two files, a human readable one and a machine readable JSON file. The two views (human and machine) of the data are equivalent - in fact the HTML version is generated from the JSON-LD version, via the DataCrate nodejs library.
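As a rough illustration of "one source, two views", here is a Python sketch that builds a tiny JSON-LD description and derives an HTML table from it. This is not the DataCrate nodejs library; the `render_html` function, the dataset name and description are all hypothetical, and the `@context` URL is an assumption about what a version 1.0 crate would reference.

```python
import json
from html import escape

# Machine-readable view: a minimal, hypothetical crate description
crate = {
    "@context": "https://w3id.org/ro/crate/1.0/context",  # assumed context URL
    "@graph": [
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Sample crate",
            "description": "A hypothetical dataset used for illustration.",
        }
    ],
}


def render_html(crate):
    """Derive a simple HTML table from the JSON-LD graph (illustrative only)."""
    rows = []
    for entity in crate["@graph"]:
        for key, value in entity.items():
            rows.append(
                f"<tr><th>{escape(key)}</th><td>{escape(str(value))}</td></tr>"
            )
    return "<table>\n" + "\n".join(rows) + "\n</table>"


machine_view = json.dumps(crate, indent=2)
human_view = render_html(crate)
print(human_view)
```

Because the HTML is computed from the JSON-LD rather than written separately, the two views cannot drift apart.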
And here’s an automatically generated diagram extracted from the sample DataCrate showing how two images were created. The first result was an image file taken by me (as an agent) using two instruments (my camera and lens), of a place (the object: Catalina park in Katoomba). A sepia toned version was the result of a CreateAction, with the instrument this time being the ImageMagick software. The DataCrate also contains information about that CreateAction such as the command used to do the conversion and the version of the software-as-instrument.
convert -sepia-tone 80% test_data/sample/pics/2017-06-11\ 12.56.14.jpg test_data/sample/pics/sepia_fence.jpg
This way of representing file provenance is Action-centred - the focus is on the action that creates a file, rather than the more usual metadata approach of having the file at the centre with properties for “Author” and the like. The action-based approach is MUCH more flexible as it can model the contribution of multiple agents and instruments separately at the expense of being somewhat counter-intuitive to those of us who are used to a library-card approach to metadata where the work is at the centre and has simple properties.
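A small Python sketch of the action-centred idea, using the two images described above. The file paths come from the convert command; every `@id` starting with `#` is a hypothetical identifier, and treating the original photo as the `object` of the sepia action is my assumption. The helper functions are illustrative, not part of any RO-Crate library.

```python
# Two CreateActions: taking the photo, then converting it to sepia.
graph = [
    {
        "@id": "#photo-action",                      # hypothetical id
        "@type": "CreateAction",
        "agent": {"@id": "#peter"},                  # hypothetical id
        "instrument": [{"@id": "#camera"}, {"@id": "#lens"}],
        "object": {"@id": "#catalina-park"},
        "result": {"@id": "pics/2017-06-11 12.56.14.jpg"},
    },
    {
        "@id": "#sepia-action",                      # hypothetical id
        "@type": "CreateAction",
        "instrument": {"@id": "#imagemagick"},
        "object": {"@id": "pics/2017-06-11 12.56.14.jpg"},  # assumption
        "result": {"@id": "pics/sepia_fence.jpg"},
    },
]


def as_list(value):
    """JSON-LD values may be single items or lists; normalise to a list."""
    return value if isinstance(value, list) else [value]


def creating_action(file_id, graph):
    """Find the CreateAction whose result is the given file, if any."""
    for entity in graph:
        is_action = entity.get("@type") == "CreateAction"
        if is_action and entity.get("result", {}).get("@id") == file_id:
            return entity
    return None


def instruments(file_id, graph):
    """List the instruments used by the action that created the file."""
    action = creating_action(file_id, graph)
    if action is None:
        return []
    return [ref["@id"] for ref in as_list(action.get("instrument", []))]


print(instruments("pics/2017-06-11 12.56.14.jpg", graph))
print(instruments("pics/sepia_fence.jpg", graph))
```

Note that the file entities themselves carry no "author" property here; to find out who or what made a file you look up the action that has it as a result.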
There was a question after this presentation about whether I had the arrows in this diagram pointing in the right direction. Yes, I do! The convention here is the standard way of representing a subject-predicate-object semantic triple with the subject as the source of the arrow, the predicate (in this case a Schema.org property) as a label, and the pointy end pointing at the object.
What’s new / developing at the moment in the RO-Crate world? I will illustrate by looking at recent activity on our GitHub project.
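The arrow convention can be sketched in a few lines of Python: each property of a JSON-LD entity becomes one triple, with the entity as subject and the referenced thing as object. The `triples` function and the `#`-prefixed ids are hypothetical, for illustration only.

```python
def triples(entity):
    """Yield (subject, predicate, object) for each property of an entity."""
    subject = entity["@id"]
    for predicate, value in entity.items():
        if predicate.startswith("@"):
            continue  # skip JSON-LD keywords such as @id and @type
        values = value if isinstance(value, list) else [value]
        for item in values:
            obj = item["@id"] if isinstance(item, dict) else item
            yield (subject, predicate, obj)


action = {
    "@id": "#sepia-action",                 # hypothetical id
    "@type": "CreateAction",
    "instrument": {"@id": "#imagemagick"},  # hypothetical id
    "result": {"@id": "pics/sepia_fence.jpg"},
}

for subject, predicate, obj in triples(action):
    # The arrow starts at the subject and its pointy end is at the object
    print(f"{subject} --{predicate}--> {obj}")
```

So the arrow labelled `result` leaves the action and points at the file, not the other way around.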
We’re working on ways to describe not just files, but the CONTENTS of files - using properties like variableMeasured.
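As a hedged sketch of what describing the contents of a file might look like: a file entity could list its columns using `variableMeasured` with Schema.org `PropertyValue` items. The file name, the variable names and units, and the `variable_names` helper are all hypothetical, and the exact shape was still under discussion at the time.

```python
import json

# Hypothetical CSV file whose columns are described in the metadata
file_entity = {
    "@id": "data/observations.csv",
    "@type": "File",
    "variableMeasured": [
        {
            "@type": "PropertyValue",
            "name": "temperature",
            "unitText": "degrees Celsius",
        },
        {
            "@type": "PropertyValue",
            "name": "site",
            "description": "Name of the sampling site",
        },
    ],
}


def variable_names(entity):
    """Return the names of the variables a file is said to contain."""
    return [variable["name"] for variable in entity.get("variableMeasured", [])]


print(json.dumps(file_entity, indent=2))
print(variable_names(file_entity))
```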
We have a way to describe a workflow
and actions that can be performed on data such as firing up a computational environment to re-run the workflow.
You too can add Use Cases like this one about software containers.
Breaking news: In the last couple of months Marco La Rosa, an independent developer working for PARADISEC, has ported 10,000 data and collection items into RO-Crate format, AND built a portal which can display them. This means that ANY repository with a similar structure (Items in Collections) could easily re-use the code and the viewers for various file types.
This shows an interlinear transcription where you can play various segments of a recording and see the transcription.
The .eaf files in the previous example are produced using ELAN software. Marco has done the groundwork for a system that could work across multiple repositories and for stand-alone RO-Crates - the crate metadata describes the files, and what format they’re in, and the viewer, which is an HTML page either served by a repository or possibly just off your hard disk, can use that information to load an appropriate viewer.
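The dispatch idea can be sketched like this in Python. It is not Marco's code: the viewer names, the lookup table and the `pick_viewer` function are hypothetical, and I am assuming the format is recorded in a Schema.org `encodingFormat` property with the file extension as a fallback.

```python
from pathlib import PurePosixPath

# Hypothetical table mapping formats (or extensions) to viewer components
VIEWERS = {
    "audio/mpeg": "audio-player",
    "image/jpeg": "image-viewer",
    ".eaf": "transcription-viewer",
}


def pick_viewer(file_entity):
    """Choose a viewer from the file's metadata, falling back to a download link."""
    declared_format = file_entity.get("encodingFormat")
    if declared_format in VIEWERS:
        return VIEWERS[declared_format]
    suffix = PurePosixPath(file_entity["@id"]).suffix.lower()
    return VIEWERS.get(suffix, "download-link")


files = [
    {"@id": "recording.mp3", "@type": "File", "encodingFormat": "audio/mpeg"},
    {"@id": "session.eaf", "@type": "File"},
    {"@id": "notes.xyz", "@type": "File"},
]

for entity in files:
    print(entity["@id"], "->", pick_viewer(entity))
```

Because the choice depends only on the crate metadata, the same page could work whether the crate is served by a repository or opened from a local disk.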
RO-Crate will be released in version 1 in November 2019 - we were aiming for October, but missed that.
We will publish the parts that are well-tested and stable, and immediately start on a new version with bleeding-edge cases.
We want input from potential users and from current and prospective implementers, and help drafting new parts of the spec is welcome.
You can join the team
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/au/"><img alt="Creative Commons Licence" style="border-width:0" src="https://i.creativecommons.org/l/by/3.0/au/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/3.0/au/">Creative Commons Attribution 3.0 Australia License</a>.