Packaging data with detailed metadata using RO-Crate in FAIR open repositories

2023-06-13

This presentation was delivered by Peter Sefton at Open Repositories 2023: it includes slides adapted from other RO-Crate presentations by Stian Soiland-Reyes and others - but here “I” means Sefton.

Abstract

Research Object Crate (RO-Crate) is a community effort and specification to practically achieve FAIR packaging of research objects (digital objects like data, methods, software) with structured metadata and context. RO-Crate uses well-established Web standards and FAIR principles. For common metadata representations, RO-Crate builds on schema.org, a mature and general mark-up vocabulary used by search engines, including Google Dataset Search. RO-Crate is adopted by many research projects as a pragmatic implementation of the FAIR principles that can be both general for interoperable exchange and extensible for domain-specific archiving. RO-Crate development began in early 2019, when a workshop at Open Repositories 2019 in Hamburg generated a significant number of use-cases and expressions of interest from the OR community. This presentation will introduce RO-Crate, its continuing development and rapid adoption since 2019, report on how it is now being used in repository software, and the potential for further use in repository platforms that will be familiar to OR attendees.

Outline

In this presentation we’ll cover:

A quick run through what RO-Crate is and what it is for
New developments:
- Version 1.2 is coming
- Profiles are seeing a lot of activity
- Tooling continues to improve

Is it FAIR to use all these repositories? https://fairsharing.org/ https://faircookbook.elixir-europe.org/ https://www.re3data.org/

Researchers are asked to make their research outputs, including publications, FAIR. But where to publish?

They have to choose between thousands of public, institutional and domain-specific repositories, with help from guidance and catalogues (FAIRsharing, re3data, FAIR Cookbook).

..but how to gather and reference outputs across multiple repositories? What about contextual information?

Aims of FAIR Research Objects

These are our aims:

Describe and package data collections, datasets, software etc. with their metadata (And remember in the context of Open Repositories: publications are data too)
Platform-independent object exchange between repositories and services
Support reproducibility and analysis: link data with codes and workflows
Transfer of sensitive/large distributed datasets with persistent identifiers
Aggregate citations and persistent identifiers
Propagate provenance and existing metadata
Publish and archive mixed objects and references
Reuse existing standards, but hide their complexity

We're trying to be fairly platform-independent, and we're not too tied into a particular way of storing or identifying these components. We do want to have enough information for reproducibility, and to support data that are coming in from different sources and that may not even be accessible directly because they require authorization.

The idea of the Research Object (RO) is to gather data in a kind of virtual package. This may include some actual files, and it may include outgoing references; these are related together and given brief descriptions. That way we know what the data are, and what role they play in this collection.

(I presented RO-Crate to a senior research technology leader recently who had not yet heard of RO-Crate – and they stopped me and asked “why is there Research in the name” – pointing out that RO-Crate is obviously applicable to more than research use cases. The answer lies in the genealogy; RO-Crate is a merger between the Research Object line of work at the University of Manchester and DataCrate from the University of Technology Sydney - the technology is not inherently specific to research – but the motivations, particularly the FAIR principles, do come from the research world.)

This slide shows a screenshot of the RO-Crate specification. The spec is designed to be an implementation guide that builds on other standards – we will continue to work on making this as simple as possible for tool developers (we admit parts of it have started to get a bit complex as we take on more use cases).

Using common formats and vocabularies .. extending only when needed

We use the common vocabularies, but only extend where we need to.

RO-Crate is now a very healthy community - the spec is developed by an open process with fortnightly calls, and a GitHub repository.

We have regular calls – a “main” monthly call and a Euro-focussed call. People call in from all over Europe, the US and Australia.

Post-conference note: This is obviously not what you'd call global coverage. Claire Knowles pointed out to me the number of people at OR who were standing in Africa talking about their 'global' projects, which often have low-to-zero representation outside of North America and Europe (we'll count Australasia as part of that, as Australia's in Eurovision).

There is a growing body of work on RO-Crate. This Zenodo repository captures part of it - but it’s starting to show up in repositories and presentations in a lot of research contexts.

WorkflowHub is an example of a repository (though it calls itself a registry) – it contains scientific workflows.

Here's an example of a workflow in the WorkflowHub registry/repository – there’s a download button to get a workflow in RO-Crate format. Note the ‘sketch’ which illustrates the workflow.

There’s an HTML page included in the RO-Crate download that makes the crate human-readable.

If you download this workflow crate then you get a preview file like the one shown, including the precis "sketch" of the workflow and links to the files – e.g. the "Main Workflow" link. This shows the benefits of RO-Crate – every download has a machine-readable metadata file, and there's a human-readable web page to go with it. If you find this on your computer in 10 years' time, there is information there about what it is and where it came from, in a standardised format.

RO-Crate built in here as well, at ROHub

This is an item from ROHub.

The EOSC project RELIANCE uses RO-Crate to package data cubes of earth observation data, along with documentation, images and workflows.

Connects to related infrastructures for interactive execution/analysis.

Metadata includes temporal coverage, spatial coverage and vertical coverage.

ROHub publishes the archived RO-Crates to general-purpose repositories (Zenodo, B2Share) for longevity and PIDs.

(The RO-Crate preview file in this service could use work; it’s a raw representation of the JSON metadata but is still better than the old days)

In the above examples we showed how resources can be downloaded from repositories in RO-Crate format – but there is still no widely accepted standard in place to join the dots between, say, a DOI for a dataset and an actual download of that data. DOIs resolve to web pages, not data streams – the RO-Crate community is actively engaged in joining these dots with work on FAIR Signposting – establishing protocols for automated signalling of where data can be downloaded.
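To make the "joining the dots" idea concrete, here is a minimal sketch of how a client might read Signposting-style typed links from an HTTP `Link` header to find the download and the metadata behind a landing page. It assumes the repository exposes relations such as `cite-as`, `item` and `describedby`; the header value, the `example.org` URLs and the function names are hypothetical and used only for illustration, and the parser is deliberately simplified rather than a complete implementation of the Link header grammar.

```python
import re
from typing import Dict, List

# Hypothetical Link header, as a landing page might return it.
# Every URL here is an illustrative placeholder.
EXAMPLE_LINK_HEADER = (
    '<https://example.org/pid/1234>; rel="cite-as", '
    '<https://example.org/data/file.zip>; rel="item"; type="application/zip", '
    '<https://example.org/meta/ro-crate-metadata.json>; rel="describedby"; '
    'type="application/ld+json"'
)


def parse_link_header(header: str) -> List[Dict[str, str]]:
    """Split a Link header into dicts holding the target and its parameters."""
    links = []
    # Each link is "<target>" followed by its ";key=value" parameters.
    for match in re.finditer(r'<([^>]+)>([^<]*)', header):
        link = {"target": match.group(1)}
        for key, value in re.findall(r';\s*([\w-]+)\s*=\s*"?([^";,]+)"?', match.group(2)):
            link[key.lower()] = value
        links.append(link)
    return links


def targets_for(header: str, rel: str) -> List[str]:
    """Return the targets whose rel attribute includes the given relation."""
    return [
        link["target"]
        for link in parse_link_header(header)
        if rel in link.get("rel", "").split()
    ]


if __name__ == "__main__":
    print("Download from:", targets_for(EXAMPLE_LINK_HEADER, "item"))
    print("Metadata at:", targets_for(EXAMPLE_LINK_HEADER, "describedby"))
```

With links like these, a script that starts from a persistent identifier could discover the data stream without scraping the landing page.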

Please join the Slack conversation if you’d like to talk to us about this.

We have just seen an example of an ATTACHED crate – you might call it RO-Crate “classic”. This is the starting point for RO-Crate – its first use case as a packaging format. In an Attached Crate, data resources are included alongside the RO-Crate-metadata.json file. When we introduced RO-Crate at OR2019 in Hamburg this was the ONLY kind of crate.
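As a rough illustration of an Attached Crate, the sketch below writes a data file and a metadata file side by side in one directory, following the general shape used by RO-Crate 1.1 (a metadata descriptor, a root dataset with the id `./`, and a data entity listed under `hasPart`). The file name `data.csv`, the entity names and the helper functions are hypothetical, and the result has not been checked against any validator.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def write_attached_crate(crate_dir: Path) -> Path:
    """Write a payload file plus an ro-crate-metadata.json that describes it."""
    crate_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical payload, stored alongside the metadata file.
    (crate_dir / "data.csv").write_text("id,value\n1,42\n", encoding="utf-8")

    metadata = {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # Metadata descriptor: says which entity is the root dataset.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # Root dataset: the crate directory itself.
                "@id": "./",
                "@type": "Dataset",
                "name": "Example attached crate (hypothetical)",
                "datePublished": "2023-06-13",
                "hasPart": [{"@id": "data.csv"}],
            },
            {   # Data entity: a file that lives inside the crate directory.
                "@id": "data.csv",
                "@type": "File",
                "name": "Example table (hypothetical)",
                "encodingFormat": "text/csv",
            },
        ],
    }
    path = crate_dir / "ro-crate-metadata.json"
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path


def local_parts(crate_dir: Path) -> List[str]:
    """Return the root dataset's hasPart ids that exist on disk."""
    text = (crate_dir / "ro-crate-metadata.json").read_text(encoding="utf-8")
    graph = json.loads(text)["@graph"]
    root = next(entity for entity in graph if entity["@id"] == "./")
    return [
        part["@id"]
        for part in root.get("hasPart", [])
        if (crate_dir / part["@id"]).exists()
    ]


if __name__ == "__main__":
    crate = Path(tempfile.mkdtemp())
    write_attached_crate(crate)
    print(local_parts(crate))
```

The defining feature is that every `hasPart` id resolves to a file in the same directory, which is what makes the package self-contained.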

Detached Crates, on the other hand, have resources that are NOT local. For example, an RO-Crate metadata document downloaded from an API might reference resources available from the API.

This is what a “Detached RO-Crate” looks like over an API – in this case one that is showing a collection of plays in English from the 1500s (this data features in another presentation given at Open Repositories, a demonstration illustrating the technical details of an RO-Crate-based repository architecture).

This diagram sketches the architecture of the [Australian Text Analytics Platform](https://atap.edu.au), which is part of the Language Data Commons of Australia, and shows the integration between data repositories (in green, on the right) and code execution environments (in red, on the left). The integration between these things is via documentation – and standards-based metadata (including, of course, RO-Crate).

https://www.researchobject.org/ro-crate/tools/

We have been talking about RO-Crate tools – here’s the list from the website. Like any list of tools, it can be hard to keep this up to date (for example, I am talking about the Crate-O tool here but it is not yet on the list). Here’s the RO-Crate tools page: https://www.researchobject.org/ro-crate/tools/

Here’s another repository that uses RO-Crate metadata (from the Language Data Commons of Australia / Australian Text Analytics Platform) – users can launch a Jupyter notebook in a BinderHub execution environment. The Notebook fetches a Detached RO-Crate metadata document, processes it to filter further resources to fetch, and then fetches them from the API.
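The filtering step in that flow can be sketched with nothing more than the standard library. This is not the notebook's actual code: the sample metadata, the `example.org` API URLs and the function names below are hypothetical, and the network fetch itself is left out so that the example stays self-contained.

```python
from typing import Any, Dict, List

# Hypothetical Detached RO-Crate metadata, as a notebook might receive it
# from a repository API. Entity ids are URLs rather than local paths.
SAMPLE_METADATA: Dict[str, Any] = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "about": {"@id": "https://example.org/api/collection/plays"},
        },
        {
            "@id": "https://example.org/api/collection/plays",
            "@type": "Dataset",
            "name": "Example collection (hypothetical)",
            "hasPart": [
                {"@id": "https://example.org/api/object/play1.txt"},
                {"@id": "https://example.org/api/object/play1.pdf"},
            ],
        },
        {
            "@id": "https://example.org/api/object/play1.txt",
            "@type": "File",
            "encodingFormat": "text/plain",
        },
        {
            "@id": "https://example.org/api/object/play1.pdf",
            "@type": ["File", "CreativeWork"],
            "encodingFormat": "application/pdf",
        },
    ],
}


def as_list(value: Any) -> List[Any]:
    """JSON-LD values may be single items or lists; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def files_to_fetch(metadata: Dict[str, Any], encoding_format: str) -> List[str]:
    """Return the ids (URLs) of File entities with the wanted encodingFormat."""
    urls = []
    for entity in metadata.get("@graph", []):
        is_file = "File" in as_list(entity.get("@type"))
        wanted = encoding_format in as_list(entity.get("encodingFormat"))
        if is_file and wanted:
            urls.append(entity["@id"])
    return urls


if __name__ == "__main__":
    # A real notebook would now request each URL from the API.
    for url in files_to_fetch(SAMPLE_METADATA, "text/plain"):
        print("would fetch:", url)
```

Because the metadata document carries the type and format of each resource, the notebook can decide what to download before transferring any data.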

This is a screenshot of the Notebook, using the Python RO-Crate library to consume data from the API.

The RO-Crate Python library has lots of functionality for doing actual data packaging – it has a file-system interface (we mention this as it is different from the approach taken in the JavaScript library).

And ro-crate-py has a command-line interface for making RO-Crates step by step.

RO-Crate-js (JavaScript) takes a different approach – it is much more abstract, and has no direct connection to the file system.

RO-Crate excel creates a crate from a directory of files, and can allow existing ad hoc tabular metadata to be added to RO-Crates.


Here we see the Crate-O metadata tool (a zero-install web application that runs in Chrome and other browsers that support the new FilesystemAPI) being used to add an Organization as the Affiliation for a Person entity. Once this "Contextual Entity" (that's the RO-Crate term) has been imported, it can be re-used within the crate: here the schema.org publisher property is linked to the same organization, identified by its ROR (Research Organization Registry) identifier https://ror.org/00eae9z71
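In the underlying JSON-LD, this kind of re-use amounts to describing the organization once and pointing at it by `@id` from every property that needs it. The sketch below is not Crate-O output: it assumes a simplified graph in which the person's id, both names and the `referrers` helper are hypothetical, and only the ROR identifier is taken from the slide.

```python
from typing import Any, Dict, List, Tuple

ROR_ID = "https://ror.org/00eae9z71"  # identifier shown on the slide

# Simplified @graph: the Organization is described once and referenced twice.
GRAPH: List[Dict[str, Any]] = [
    {
        "@id": "./",
        "@type": "Dataset",
        "name": "Example dataset (hypothetical)",
        "author": {"@id": "#example-person"},
        "publisher": {"@id": ROR_ID},
    },
    {
        "@id": "#example-person",
        "@type": "Person",
        "name": "Example Person (hypothetical)",
        "affiliation": {"@id": ROR_ID},
    },
    {
        "@id": ROR_ID,
        "@type": "Organization",
        "name": "Example Organization (hypothetical name)",
    },
]


def referrers(graph: List[Dict[str, Any]], target_id: str) -> List[Tuple[str, str]]:
    """Return (entity id, property) pairs whose value references target_id."""
    hits = []
    for entity in graph:
        for prop, value in entity.items():
            values = value if isinstance(value, list) else [value]
            if any(isinstance(v, dict) and v.get("@id") == target_id for v in values):
                hits.append((entity["@id"], prop))
    return hits


if __name__ == "__main__":
    for entity_id, prop in referrers(GRAPH, ROR_ID):
        print(entity_id, "->", prop)
```

Keeping a single description per organization means a correction to its name or identifier is made in one place and is picked up by every property that references it.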

If you’d like to join in or contact us, choose one of the options on the RO-Crate Community page – e.g. join the Slack.

And to cite RO-Crate:

Stian Soiland-Reyes, Peter Sefton, Mercè Crosas, Leyla Jael Castro, Frederik Coppens, José M. Fernández, Daniel Garijo, Björn Grüning, Marco La Rosa, Simone Leo, Eoghan Ó Carragáin, Marc Portier, Ana Trisovic, RO-Crate Community, Paul Groth, Carole Goble (2022): Packaging research artefacts with RO-Crate. Data Science 5(2). https://doi.org/10.3233/DS-210053
