Arkisto: a repository based platform for managing all kinds of research data
2021-06-11
This presentation by Peter Sefton, Marco La Rosa and Michael Lynch was delivered at the Open Repositories 2021 conference on 2021-06-10 (Australian time). Marco La Rosa did most of the talking, with help from Michael Lynch.
We want to emphasise that this presentation is based on the FAIR principles that data should be Findable, Accessible, Interoperable and Reusable.
This schematic is a high-level view of research data management, showing workspaces that enable research activities (collect, analyse, describe) linked to repositories in a continuous cycle. This is similar to the software development practice of “commit early, commit often”, but in this case the advice is to deposit well-described objects often and re-use them as required. Workspaces can include systems like REDCap, ownCloud and other active work systems, and they should be treated as ephemeral and dirty. Repositories can include systems like Zenodo and Figshare, where FAIR objects are managed for long-term preservation and re-use.
- Data must be well described in open standards
- Data not locked up
- We know it’s portable between applications
- Data storage layer is COMPLETELY separate from the services layer(s)
So how? Use standards... Weren’t you wondering about the picture of a STANDARD Poodle?
UTS has been an early adopter of the OCFL (Oxford Common File Layout) specification, a way of storing files sustainably on a file system (coming soon: S3 cloud storage) so that the data does not need to be migrated. I [presented on this at the Open Repositories conference](https://eresearch.uts.edu.au/2019/07/01/OCLF.htm). PARADISEC has built a scalable and performant demonstrator using OCFL. OCFL aims for:
- Completeness, so that a repository can be rebuilt from the files it stores
- Parsability, both by humans and machines, to ensure content can be understood in the absence of original software
- Robustness against errors, corruption, and migration between storage technologies
- Versioning, so repositories can make changes to objects allowing their history to persist
- Storage diversity, to ensure content can be stored on diverse storage infrastructures including conventional filesystems and cloud object stores
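To make the completeness and robustness goals more concrete, here is a minimal Python sketch of the kind of inventory an OCFL object carries: a manifest mapping content digests to stored paths, plus a version state mapping the same digests to logical file names. This is an illustration only, not a validated OCFL implementation; the function names (`build_inventory`, `find_corrupt_paths`), the example object identifier and the single-version layout are assumptions made for this example.

```python
import hashlib
from typing import Dict, List


def sha512_hex(data: bytes) -> str:
    """Hex-encoded SHA-512 digest of some file content."""
    return hashlib.sha512(data).hexdigest()


def build_inventory(object_id: str, files: Dict[str, bytes],
                    created: str, message: str = "Initial deposit") -> dict:
    """Build a simplified, single-version inventory in the style of OCFL.

    `files` maps logical paths (hypothetical names) to their content.
    """
    manifest: Dict[str, List[str]] = {}
    state: Dict[str, List[str]] = {}
    for logical_path, data in sorted(files.items()):
        digest = sha512_hex(data)
        # Manifest: digest -> where the bytes are stored inside the object.
        manifest.setdefault(digest, []).append(f"v1/content/{logical_path}")
        # State: digest -> the logical file names in this version.
        state.setdefault(digest, []).append(logical_path)
    return {
        "id": object_id,
        "type": "https://ocfl.io/1.0/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": "v1",
        "manifest": manifest,
        "versions": {
            "v1": {"created": created, "message": message, "state": state}
        },
    }


def find_corrupt_paths(inventory: dict, stored: Dict[str, bytes]) -> List[str]:
    """Return content paths that are missing or whose digest no longer matches."""
    bad = []
    for digest, paths in inventory["manifest"].items():
        for path in paths:
            if path not in stored or sha512_hex(stored[path]) != digest:
                bad.append(path)
    return sorted(bad)
```

Because the inventory travels with the files, a service can re-check every stored file against its digest, or rebuild an index from storage alone, without depending on any particular repository software.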
This is an example of service connection in the CS3MESH4EOSC. In this schematic we can see a FAIR Description Service forming the bridge between the CS3MESH4EOSC services (workspaces) and various repositories. A FAIR description service uses linked-data to describe data and its context. The next slide shows some of what you might want to add to a Digital Object (data package) to make it Findable, Interoperable and Reusable.
RO-Crate is a method for describing a dataset as a digital object using a single linked-data metadata document
- Lightweight approach to packaging research data with metadata - the examples here aid in the F, A, I and R in FAIR. For example “how can it be reused” means a data licence that specifies who can access, use and distribute data. And “How was it made” is important for reuse and interoperability - think file formats, resolution etc.
- Community effort - 40+ contributors from AU, EU and US.
- Can aggregate files and/or any URI-addressable content, with contextual information to aid decisions about re-use. (Who What When Where Why How).
- Uses Schema.org as the main ontology, with domain-specific extensions
- Has human readable summaries of datasets
Sefton is an Editor of the specification: http://www.researchobject.org/ro-crate/
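As a rough illustration of what that single metadata document looks like, the following Python sketch builds a minimal `ro-crate-metadata.json` structure with a root dataset, a licence and a list of files. It assumes the RO-Crate 1.1 context and conventions; the function name `build_ro_crate_metadata` and all example values are hypothetical, and a real crate would usually carry much more context (who, what, when, where, why, how).

```python
import json
from typing import Dict


def build_ro_crate_metadata(name: str, description: str, license_url: str,
                            date_published: str, files: Dict[str, str]) -> dict:
    """Return a minimal RO-Crate style metadata document as a dict.

    `files` maps a relative file path to its media type (example values only).
    """
    file_entities = [
        {"@id": path, "@type": "File", "encodingFormat": media_type}
        for path, media_type in sorted(files.items())
    ]
    # The root dataset describes the package as a whole, using Schema.org terms.
    root_dataset = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "datePublished": date_published,
        "license": {"@id": license_url},
        "hasPart": [{"@id": entity["@id"]} for entity in file_entities],
    }
    # The descriptor entity says which entity in the graph is the root dataset.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root_dataset] + file_entities,
    }


if __name__ == "__main__":
    crate = build_ro_crate_metadata(
        name="Example survey data",
        description="A hypothetical dataset used to illustrate RO-Crate.",
        license_url="https://creativecommons.org/licenses/by/4.0/",
        date_published="2021-06-10",
        files={"data/responses.csv": "text/csv", "README.md": "text/markdown"},
    )
    print(json.dumps(crate, indent=2))
```

Saved as `ro-crate-metadata.json` alongside the files it lists, a document like this is what makes a plain directory of data self-describing.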
When your data is well described you can start thinking about higher-level processes and workflows connecting workspaces to repositories. And your services can evolve over time as requirements change and systems improve, without needing to transform your data first. Too often research data management infrastructures get caught up in the specific technologies and systems to be implemented without considering how an ecosystem of services can work as a whole. If the previous architecture slide was a low-level view of the CS3MESH4EOSC implementation, then this is a higher-level view of a possible architecture that connects workspaces to repositories.
Going up another few levels we can see that the picture is incomplete. The environment of repositories and workspaces needs more services to actually form a functional system for end users. Going forward, we need to think about cross-service authentication of parties, authorization of access to resources, group membership, licensing, environment provisioning and so on. In this way we can tie together the active workspaces and repository services into a cohesive application for end users.
Here’s a schematic of just such an environment at UTS. It shows how the Stash research data management system, which is an instance of ReDBox, orchestrates workspaces and connects them to a research data catalogue (which is actually now a repository).
So who is doing this? The PARADISEC (https://paradisec.org.au) project has built a demonstrator (https://mod.paradisec.org.au) using these technologies that is scalable and performant with approximately 70 TB of data!
And there’s an ever growing ecosystem of tools and libraries.
Describo is an application for building RO-Crates. Installable as a desktop application, it simplifies the process of packaging up data as RO-Crates.
The Arkisto website https://arkisto-platform.github.io/ covers all the things we talked about here and more; it has links to all the Standards used, a growing number of case studies, abstract use cases and links to tools; repository.