Arkisto: a repository based platform for managing all kinds of research data

2021-06-11

This presentation by Peter Sefton, Marco La Rosa and Michael Lynch was delivered at the Open Repositories 2021 conference on 2021-06-10 (Australian time). Marco La Rosa did most of the talking, with help from Michael Lynch.

We want to emphasise that this presentation is based on the FAIR principles that data should be Findable, Accessible, Interoperable and Reusable.

<p>Repositories: institutional, domain or both</p>
<p>Find / Access services
Research Data Management Plan
Workspaces:</p>
<p>working storage
domain specific tools
domain specific services
collect
describe
analyse
Reusable, Interoperable
data objects
deposit early
deposit often
Findable, Accessible, Reusable data objects
reuse data objects
V1.1 © Marco La Rosa, Peter Sefton 2021 https://creativecommons.org/licenses/by-sa/4.0/</p>
<p>🗑️
Active cleanup processes workspaces considered ephemeral
🗑️
Policy based data management</p>

This schematic is a high level view of research data management showing workspaces enabling research activities (collect, analyse, describe) linked to repositories in a continuous cycle. This is similar to the software development model of commit early and commit often but in this case, deposit well described objects often and re-use as required. Workspaces can include systems like Redcap, OwnCloud and other active work systems and they should be treated as ephemeral and dirty. Repositories can include systems like Zenodo and Figshare where FAIR objects are managed for long term preservation and re-use.

Q. How can we “FIRST LOOK AFTER THE DATA”?
🐩

Data must be well described in open standards. Data is not locked up. We know it’s portable between applications. The data storage layer is COMPLETELY separate from the services layer(s). So how? Use standards... Weren’t you wondering about the picture of a STANDARD Poodle?

UTS has been an early adopter of the OCFL (Oxford Common File Layout) specification, a way of storing files sustainably on a file system (coming soon: S3 cloud storage) so that they do not need to be migrated. I [presented on this at the Open Repositories conference](https://eresearch.uts.edu.au/2019/07/01/OCLF.htm). PARADISEC has built a scalable and performant demonstrator using OCFL.

The benefits of OCFL include:

Completeness, so that a repository can be rebuilt from the files it stores
Parsability, both by humans and machines, to ensure content can be understood in the absence of original software
Robustness against errors, corruption, and migration between storage technologies
Versioning, so repositories can make changes to objects allowing their history to persist
Storage diversity, to ensure content can be stored on diverse storage infrastructures including conventional filesystems and cloud object stores
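To make the completeness and versioning ideas concrete, here is a minimal Python sketch of the kind of inventory an OCFL object carries: a manifest that addresses stored content by digest, and a version state that maps those digests to the logical paths a user sees. It is an illustration only, not a validated OCFL implementation (it writes nothing to disk and omits the conformance declaration files and digest sidecars), and the object identifier, file names and the `build_inventory` helper are all assumptions made for the example.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha512_hex(data: bytes) -> str:
    """Return the hex SHA-512 digest used to address content."""
    return hashlib.sha512(data).hexdigest()


def build_inventory(object_id: str, files: dict[str, bytes], message: str) -> dict:
    """Build a single-version (v1) inventory for an in-memory set of files.

    `files` maps a logical path (what the user sees) to the file's bytes.
    """
    manifest: dict[str, list[str]] = {}
    state: dict[str, list[str]] = {}
    for logical_path, data in sorted(files.items()):
        digest = sha512_hex(data)
        # Identical content is stored once, however many logical paths use it.
        manifest.setdefault(digest, [f"v1/content/{logical_path}"])
        state.setdefault(digest, []).append(logical_path)
    return {
        "id": object_id,
        "type": "https://ocfl.io/1.0/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": "v1",
        "manifest": manifest,
        "versions": {
            "v1": {
                "created": datetime.now(timezone.utc).isoformat(),
                "message": message,
                "state": state,
            }
        },
    }


inventory = build_inventory(
    "https://example.org/objects/demo-1",  # hypothetical identifier
    {
        "data/a.csv": b"x,y\n1,2\n",
        "data/copy-of-a.csv": b"x,y\n1,2\n",  # same bytes as data/a.csv
        "README.txt": b"demo",
    },
    "Initial deposit",
)
print(json.dumps(inventory, indent=2))
```

Because the inventory sits alongside the content as plain JSON, a repository index could in principle be rebuilt by walking the storage and reading these files, which is the point of keeping the storage layer separate from the services.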

Repository (Find / Access) services
CS3MESH4EOSC
FAIR Description Service
Publish stable FAIR digital objects
(Research Object Crates)
Describe data sets
future connectors
V1.0 © Marco La Rosa, Peter Sefton 2021 https://creativecommons.org/licenses/by-sa/4.0/

This is an example of service connection in the CS3MESH4EOSC. In this schematic we can see a FAIR Description Service forming the bridge between the CS3MESH4EOSC services (workspaces) and various repositories. A FAIR description service uses linked-data to describe data and its context. The next slide shows some of what you might want to add to a Digital Object (data package) to make it Findable, Interoperable and Reusable.

☁️
📂
<p>📄
ID? Title? Description?</p>
<p>👩‍🔬👨🏿‍🔬Who created this data?
📄What parts does it have?
📅 When?
🗒️ What is it about?
♻️ How can it be reused?
🏗️ As part of which project?
💰 Who funded it?
⚒️ How was it made?
Addressable resources
Local Data
👩🏿‍🔬 https://orcid.org/0000-0001-2345-6789
🔬 https://en.wikipedia.org/wiki/Scanning_electron_microscope</p>

RO-Crate is a method for describing a dataset as a digital object using a single linked-data metadata document.

Lightweight approach to packaging research data with metadata. The examples here support the F, A, I and R in FAIR: “How can it be reused” means a data licence that specifies who can access, use and distribute the data, and “How was it made” matters for reuse and interoperability (think file formats, resolution and so on).
Community effort - 40+ contributors from AU, EU and US.
Can aggregate files and/or any URI-addressable content, with contextual information to aid decisions about re-use. (Who What When Where Why How).
Uses Schema.org as the main ontology, with domain-specific extensions
Has human readable summaries of datasets

Sefton is an Editor of the specification: http://www.researchobject.org/ro-crate/
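As a rough illustration of the single metadata document described above, the following Python sketch builds a small `ro-crate-metadata.json` structure that answers a few of the slide's questions (who, what parts, how can it be reused, how was it made). The ORCID and Wikipedia URLs are the placeholders from the slide; the dataset name, the file path, the researcher name and the `#sem-imaging` action are hypothetical and exist only for this example.

```python
import json

CREATOR_ID = "https://orcid.org/0000-0001-2345-6789"  # placeholder ORCID from the slide
INSTRUMENT_ID = "https://en.wikipedia.org/wiki/Scanning_electron_microscope"
LICENCE_ID = "https://creativecommons.org/licenses/by-sa/4.0/"
IMAGE_ID = "images/sample-01.tif"  # hypothetical file inside the crate

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # The metadata file describes itself and points at the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root dataset: ID, title, description, when, who, licence, parts.
            "@id": "./",
            "@type": "Dataset",
            "name": "Example microscopy dataset",
            "description": "Illustrative crate with one image file.",
            "datePublished": "2021-06-10",
            "author": {"@id": CREATOR_ID},
            "license": {"@id": LICENCE_ID},
            "hasPart": [{"@id": IMAGE_ID}],
        },
        {"@id": IMAGE_ID, "@type": "File", "encodingFormat": "image/tiff"},
        {"@id": CREATOR_ID, "@type": "Person", "name": "Example Researcher"},
        {"@id": LICENCE_ID, "@type": "CreativeWork", "name": "CC BY-SA 4.0"},
        {"@id": INSTRUMENT_ID, "@type": "Thing", "name": "Scanning electron microscope"},
        {   # "How was it made?" expressed as a provenance action.
            "@id": "#sem-imaging",
            "@type": "CreateAction",
            "agent": {"@id": CREATOR_ID},
            "instrument": {"@id": INSTRUMENT_ID},
            "result": {"@id": IMAGE_ID},
        },
    ],
}

# Index entities by @id so references can be followed like links.
by_id = {entity["@id"]: entity for entity in crate["@graph"]}
root = by_id["./"]

print("Who created it:", by_id[root["author"]["@id"]]["name"])
print("Parts:", [part["@id"] for part in root["hasPart"]])
print("Reuse under:", by_id[root["license"]["@id"]]["name"])
print("Made with:", by_id[by_id["#sem-imaging"]["instrument"]["@id"]]["name"])
print(json.dumps(crate, indent=2)[:200], "...")
```

Every answer is reached by following an `@id` reference inside the same document, which is what lets a crate mix local files with URI-addressable resources such as ORCID records.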

FAIR Digital Object Export Service
FAIR Description Service
Repositories
CS3MESH4EOSC
Specific export coded for each service
Export for curation
Describe FAIR
Digital Object packages (RO-Crate)
Publish stable FAIR Digital Objects
(RO-Crate)
Check out for reuse
(Research Object Crates)
V1.1 © Marco La Rosa, Peter Sefton 2021 https://creativecommons.org/licenses/by-sa/4.0/

When your data is well described you can start thinking about higher level processes and workflows connecting workspaces to repositories. And your services can evolve over time as requirements change and systems improve without needing to transform your data first. Too often research data management infrastructures get caught up in the specific technologies / systems to be implemented without considering how an ecosystem of services can work as a whole. If the previous architecture slide was a low level view of the CS3MESH4EOSC implementation then this is a higher level view of a possible architecture that connects workspaces to repositories.
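One way to picture the "specific export coded for each service" box is a small registry of exporters that all return the same service-neutral package, so that description and deposit never need to know which workspace the data came from. The Python sketch below is purely hypothetical: the service names, the `DataPackage` type and the exporter functions are invented for illustration, and the exporters return canned data where a real one would call the workspace's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class DataPackage:
    """Service-neutral package handed on for description and deposit."""
    source: str
    files: Dict[str, bytes]
    metadata: Dict[str, str] = field(default_factory=dict)


Exporter = Callable[[str], DataPackage]
EXPORTERS: Dict[str, Exporter] = {}


def register(service: str) -> Callable[[Exporter], Exporter]:
    """Decorator that records the exporter for one workspace service."""
    def wrap(fn: Exporter) -> Exporter:
        EXPORTERS[service] = fn
        return fn
    return wrap


@register("survey-tool")
def export_survey(project_id: str) -> DataPackage:
    # Canned data standing in for a call to the survey service.
    return DataPackage(
        source=f"survey-tool:{project_id}",
        files={"responses.csv": b"id,answer\n1,yes\n"},
        metadata={"name": f"Survey export {project_id}"},
    )


@register("file-share")
def export_share(folder: str) -> DataPackage:
    # Canned data standing in for reading a shared folder.
    return DataPackage(
        source=f"file-share:{folder}",
        files={"notes.txt": b"field notes"},
        metadata={"name": f"Folder export {folder}"},
    )


def export_for_curation(service: str, ref: str) -> DataPackage:
    """Pick the service-specific exporter; everything downstream is generic."""
    try:
        exporter = EXPORTERS[service]
    except KeyError:
        raise ValueError(f"no exporter registered for {service!r}") from None
    return exporter(ref)


package = export_for_curation("survey-tool", "p1")
print(package.source, sorted(package.files))
```

Adding a new workspace then means writing one more exporter, while the description and repository steps stay unchanged.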

Identity: authentication, authorisation and group services
workspaces
repositories
V0.1 DRAFT © Marco La Rosa, Peter Sefton 2021 https://creativecommons.org/licenses/by-sa/4.0/
<p>Provisioning</p>

Going up another few levels we can see that the picture is incomplete. The environment of repositories and workspaces needs more services to actually form a functional system for end users. Going forward we need to think about cross-service authentication of parties, authorisation of access to resources, group membership, licensing, environment provisioning and so on. In this way we can tie together the active workspaces and repository services into a cohesive application for end users.

(See a blog post from Peter Sefton floating ideas about how we might close this gap specifically for data-access licenses).

Here’s a schematic of just such an environment at UTS. It shows how the Stash research data management system, which is an instance of ReDBox, orchestrates workspaces and connects them to a research data catalogue (which is actually now a repository).

So who is doing this? The PARADISEC (https://paradisec.org.au) project has built a demonstrator (https://mod.paradisec.org.au) using these technologies that is scalable and performant with approximately 70TB of data!

TOOLS 🧰 ⚒️ and SPECS
OCFL Spec: https://ocfl.io/
Research Object Crate (RO-Crate) Spec: http://www.researchobject.org/ro-crate
UTS: https://github.com/UTS-eResearch/ro-crate-js
OCFL JS
UTS OCFL JS Implementation: https://github.com/uts-eresearch/ocfl-js
CoEDL OCFL JS implementation: https://github.com/CoEDL/ocfl-js
UTS RO Crate / SOLR portal: https://github.com/uts-eresearch/oni-express
Describo:
https://github.com/Arkisto-Platform/describo
https://github.com/Arkisto-Platform/describo-online
https://github.com/Arkisto-Platform/describo-data-packs
CoEDL Modern PARADISEC: https://github.com/CoEDL/modpdsc
CoEDL OCFL tools: https://github.com/CoEDL/ocfl-tools

And there’s an ever growing ecosystem of tools and libraries.

Describo is an application to build RO-Crates. Installable as a desktop application it simplifies the process of packaging up data as RO-Crates.

The Arkisto website https://arkisto-platform.github.io/ covers all the things we talked about here and more; it has links to all the Standards used, a growing number of case studies, abstract use cases and links to tools; repository.
