Arkisto: a repository based platform for managing all kinds of research data
Peter Sefton1, Marco La Rosa2, Michael Lynch1
<p>University of Technology Sydney
The University of Melbourne

This presentation by Peter Sefton, Marco La Rosa and Michael Lynch was delivered at the Open Repositories 2021 conference on 2021-06-10 (Australian time). Marco La Rosa did most of the talking, with help from Michael Lynch.

This presentation is 

We want to emphasise that this presentation is based on the FAIR principles that data should be Findable, Accessible, Interoperable and Reusable.

<p>Repositories: institutional, domain or both</p>
<p>Find / Access services
Research Data Management Plan
<p>working storage
domain specific tools
domain specific services
Reusable, Interoperable
data objects
deposit early
deposit often
Findable, Accessible, Reusable data objects
reuse data objects
V1.1  © Marco La Rosa, Peter Sefton 2021</p>
Active cleanup processes  workspaces considered ephemeral
Policy based data management

This schematic is a high-level view of research data management, showing workspaces that enable research activities (collect, analyse, describe) linked to repositories in a continuous cycle. This is similar to the software development model of commit early and commit often, but in this case you deposit well-described objects often and re-use them as required. Workspaces can include systems like REDCap, ownCloud and other active work systems, and they should be treated as ephemeral and dirty. Repositories can include systems like Zenodo and Figshare, where FAIR objects are managed for long-term preservation and re-use.


  • Data must be well described in open standards
  • Data not locked up
  • We know it’s portable between applications
  • Data storage layer is COMPLETELY separate from the services layer(s)

So how? Use standards.... Weren’t you wondering about the picture of a STANDARD Poodle?

<p>ANSWER:  OCFL<br />

UTS has been an early adopter of the OCFL (Oxford Common File Layout) specification, a way of storing files sustainably on a file system (coming soon: S3 cloud storage) so that they do not need to be migrated. I presented on this at the Open Repositories conference. PARADISEC has built a scalable and performant demonstrator using OCFL. The benefits of OCFL include:

  • Completeness, so that a repository can be rebuilt from the files it stores
  • Parsability, both by humans and machines, to ensure content can be understood in the absence of original software
  • Robustness against errors, corruption, and migration between storage technologies
  • Versioning, so repositories can make changes to objects allowing their history to persist
  • Storage diversity, to ensure content can be stored on diverse storage infrastructures including conventional filesystems and cloud object stores
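
To make the completeness and versioning goals more concrete, here is a minimal Python sketch of the kind of inventory an OCFL object carries. It is an illustration only, not the UTS or PARADISEC implementation: the function name `build_inventory`, the object identifier and the file contents are hypothetical, and a real OCFL object also needs things this sketch leaves out (for example the object declaration file and the inventory digest sidecar).

```python
import hashlib
import json


def build_inventory(object_id, files, created, message="Initial deposit"):
    """Return a minimal OCFL 1.0-style inventory for a single-version object.

    `files` maps logical paths (what a user sees) to the file bytes.
    """
    manifest = {}  # digest -> content paths stored under the object root
    state = {}     # digest -> logical paths visible in version v1
    for logical_path in sorted(files):
        digest = hashlib.sha512(files[logical_path]).hexdigest()
        # Identical bytes are stored once; later copies only add to the state.
        if digest not in manifest:
            manifest[digest] = [f"v1/content/{logical_path}"]
        state.setdefault(digest, []).append(logical_path)
    return {
        "id": object_id,
        "type": "https://ocfl.io/1.0/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": "v1",
        "manifest": manifest,
        "versions": {
            "v1": {"created": created, "message": message, "state": state}
        },
    }


# Hypothetical object: two of the three files share the same bytes.
example_files = {
    "README.txt": b"field recordings, 2021",
    "backup/README.txt": b"field recordings, 2021",
    "audio/track1.wav": b"\x00\x01\x02",
}
inventory = build_inventory(
    "https://example.org/objects/demo-1", example_files, "2021-06-10T00:00:00Z"
)
print(json.dumps(inventory, indent=2))
```

Because the manifest is keyed by content digest and each version records its own state, the files on disk plus this one JSON document are enough to work out what the object contained at each version, without the software that wrote it.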

Repository (Find / Access) services
FAIR Description Service
Publish stable FAIR digital objects
(Research Object Crates)
Describe data sets
future connectors
V1.0 © Marco La Rosa, Peter Sefton 2021

This is an example of service connection in the CS3MESH4EOSC. In this schematic we can see a FAIR Description Service forming the bridge between the CS3MESH4EOSC services (workspaces) and various repositories. A FAIR description service uses linked-data to describe data and its context. The next slide shows some of what you might want to add to a Digital Object (data package) to make it Findable, Interoperable and Reusable.

ID? Title? Description?</p>
<p>👩‍🔬👨🏿‍🔬Who created this data?
📄What parts does it have?
📅 When?
🗒️ What is it about?
♻️ How can it be reused?
🏗️ As part of which project?
💰 Who funded it?
⚒️ How was it made?
Addressable resources
Local Data

RO-Crate is a method for describing a dataset as a digital object using a single linked-data metadata document.

  • Lightweight approach to packaging research data with metadata - the examples here aid in the F, A, I and R in FAIR. For example “how can it be reused” means a data licence that specifies who can access, use and distribute data. And “How was it made” is important for reuse and interoperability - think file formats, resolution etc.
  • Community effort - 40+ contributors from AU, EU and US.
  • Can aggregate files and/or any URI-addressable content, with contextual information to aid decisions about re-use. (Who What When Where Why How).
  • Uses schema.org as the main ontology, with domain-specific extensions
  • Has human readable summaries of datasets
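
As a rough illustration of the points above, the following Python sketch assembles a small RO-Crate metadata document, answering the "who", "what parts", "when" and "how can it be reused" questions for a dataset. The helper name `build_ro_crate`, the author, the licence choice and the file names are all hypothetical examples; the overall shape (a metadata descriptor plus a root dataset in a flat `@graph`) follows our reading of RO-Crate 1.1 and should be checked against the specification before use.

```python
import json


def build_ro_crate(name, description, date_published, license_url, author, parts):
    """Return an RO-Crate 1.1-style metadata document as a Python dict.

    `author` is a dict with at least "@id" and "name"; `parts` is a list of
    file paths relative to the crate root.
    """
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,                              # Title
        "description": description,               # What is it about?
        "datePublished": date_published,          # When?
        "license": {"@id": license_url},          # How can it be reused?
        "author": {"@id": author["@id"]},         # Who created this data?
        "hasPart": [{"@id": p} for p in parts],   # What parts does it have?
    }
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    person = {"@type": "Person", **author}
    files = [{"@id": p, "@type": "File"} for p in parts]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, person] + files,
    }


# Hypothetical dataset description.
crate = build_ro_crate(
    name="Example survey data",
    description="Responses collected for an example project.",
    date_published="2021-06-10",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    author={"@id": "https://orcid.org/0000-0000-0000-0000", "name": "Example Researcher"},
    parts=["data/responses.csv", "docs/codebook.pdf"],
)
print(json.dumps(crate, indent=2))
```

Saved as `ro-crate-metadata.json` alongside the listed files, a document like this is the single linked-data description that travels with the data.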

Sefton is an Editor of the specification:

FAIR Digital Object Export Service
FAIR Description Service
Specific export coded for each service
Export for curation
Describe FAIR
Digital Object packages (RO-Crate)
Publish stable FAIR Digital Objects
Check out for reuse
(Research Object Crates)
V1.1 © Marco La Rosa, Peter Sefton 2021

When your data is well described you can start thinking about higher-level processes and workflows connecting workspaces to repositories. And your services can evolve over time as requirements change and systems improve, without needing to transform your data first. Too often research data management infrastructures get caught up in the specific technologies / systems to be implemented without considering how an ecosystem of services can work as a whole. If the previous architecture slide was a low-level view of the CS3MESH4EOSC implementation, then this is a higher-level view of a possible architecture that connects workspaces to repositories.

Identity: authentication, authorisation and group services 
V0.1 DRAFT  © Marco La Rosa, Peter Sefton 2021 

Going up another few levels we can see that the picture is incomplete. The environment of repositories and workspaces needs more services to actually form a functional system for end users. Going forward we need to think about how to do cross-service authentication of parties and authorization of access to resources, and group membership; licensing, environment provisioning etc. In this way we can tie together the active workspaces and repository services into a cohesive application for end users.

(See a blog post from Peter Sefton floating ideas about how we might close this gap specifically for data-access licenses).

Here’s a schematic of just such an environment at UTS. It shows how the Stash research data management system, which is an instance of ReDBox, orchestrates workspaces and connects them to a research data catalogue (which is actually now a repository).

Who is doing this?

So who is doing this? The PARADISEC project has built a demonstrator using these technologies that is scalable and performant with approximately 70TB of data!

TOOLS 🧰 ⚒️ and SPECS
OCFL Spec:
Research Object Crate (RO-Crate) Spec:
UTS OCFL JS Implementation:
CoEDL OCFL JS implementation:
UTS RO Crate / SOLR portal:
CoEDL OCFL tools:

And there’s an ever growing ecosystem of tools and libraries.

Describo is an application to build RO-Crates. Installable as a desktop application, it simplifies the process of packaging up data as RO-Crates.

The Arkisto website covers all the things we talked about here and more; it has links to all the standards used, a growing number of case studies, abstract use cases and links to tools.

