This is a talk delivered in recorded format by Peter Sefton, Nick Thieberger, Marco La Rosa and Mike Lynch at eResearch Australasia 2020. Also posted on the UTS eResearch website.
Research data from all disciplines has interest and value that extends beyond funding cycles and must continue to be managed and preserved for the long term. However, much of the effort in eResearch goes into building systems which provide functionality and services that operate on data but which actually put data at risk: data is loaded into a particular tool in such a way that it is not easily retrievable if the service cannot be sustained or, at worst, the data is lost.
The Arkisto (https://arkisto-platform.github.io/why/) approach is to work with a set of standards which make data available for long term access. Using the Oxford Common File Layout (OCFL) to organize data in a repository, and Research Object Crate (RO-Crate) to describe data down to the file or even variable level, Arkisto supports the safeguarding of data for the long term. A growing set of Arkisto-compatible software tools allows data ingest into repositories, and the creation of data discovery portals that connect data to analytical, visualisation and computing tools.
In this presentation we will introduce the standards-based platform and show a number of examples from multiple disciplines of current Arkisto deployments, including an institutional Research Data Portal, a snapshot of the Expert Nation history project, crowd-sourced data from historical criminology, and the Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC).
Across the sector we build services that operate on data but which actually put data at risk: data is loaded into a particular tool in such a way that it is not easily retrievable if the service cannot be sustained or, at worst, the data is lost.
The Arkisto (https://arkisto-platform.github.io/why/) approach is to work with a set of standards which make data available for long term access. The closest emoji I could find to represent standards was this “standard poodle”. Previously I used a toothbrush, on the basis that “standards are like toothbrushes, everyone wants to use their own”.
The first of the two core standards is the Oxford Common File Layout (OCFL), which is used to organize data in a repository as a set of files. This approach is scalable indefinitely, and reduces the risk that data will be locked up in monolithic systems.
This diagram by Mike Lynch shows a series of different sized collections of data, each with a label. The labels (manifests) in this case are purely about data integrity - and contain checksums. The bundles of data are the next level up as we move on to look at Standard number 2.
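As a rough illustration of what such a label contains, the sketch below builds a checksum manifest for a directory of files and later checks the files against it. It assumes SHA-512 (the digest OCFL prefers) and maps each digest to the list of paths with that content, which is the general shape of the manifest block in an OCFL inventory. The function names build_manifest and verify are hypothetical, and a real OCFL inventory also records an object id, versions and per-version state.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, list[str]]:
    """Map each SHA-512 digest to the relative paths holding that content."""
    manifest: dict[str, list[str]] = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha512(path.read_bytes()).hexdigest()
        manifest.setdefault(digest, []).append(path.relative_to(root).as_posix())
    return manifest


def verify(root: Path, manifest: dict[str, list[str]]) -> list[str]:
    """Return the paths that are missing or whose content no longer matches."""
    damaged = []
    for digest, paths in manifest.items():
        for rel in paths:
            target = root / rel
            if (not target.is_file()
                    or hashlib.sha512(target.read_bytes()).hexdigest() != digest):
                damaged.append(rel)
    return sorted(damaged)
```

Because the manifest is keyed by digest, two files with identical content share one entry, and a fixity check needs nothing more than the files and the manifest itself.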
RO-Crate is the standard Arkisto uses for packaging and describing data sets. It is based on other standards:
Schema.org is used as the main ontology, for classes and properties - it has coverage for all the basic Who What Where style metadata and is used by Google’s dataset search and a number of other projects. There are a few terms from other ontologies where Schema.org does not have coverage.
RO-Crates may also have an HTML human readable summary of data. If you find a stray crate in your downloads folder it is easy to click on the HTML file and get a summary of what’s inside - they can also be hosted on the web using a plain-old webserver.
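To make the shape of such a description more concrete, here is a minimal sketch of an RO-Crate metadata document built as a plain Python dictionary. The context URL, the metadata descriptor and the root dataset entity follow the RO-Crate 1.1 specification as we understand it; the dataset name, person, licence choice and file are invented purely for illustration, and the helper root_dataset is hypothetical.

```python
import json

# Illustrative RO-Crate 1.1 metadata; all names and values below are made up.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # Metadata descriptor: points at the root dataset of the crate.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root dataset, described with Schema.org properties.
            "@id": "./",
            "@type": "Dataset",
            "name": "Example sensor readings",
            "description": "A made-up dataset used to illustrate the format.",
            "datePublished": "2020-10-01",
            "license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
            "author": {"@id": "#jane"},
            "hasPart": [{"@id": "readings.csv"}],
        },
        {"@id": "#jane", "@type": "Person", "name": "Jane Example"},
        {   # File-level description.
            "@id": "readings.csv",
            "@type": "File",
            "name": "Sensor readings",
            "encodingFormat": "text/csv",
        },
    ],
}


def root_dataset(crate: dict) -> dict:
    """Follow the descriptor's 'about' link to find the root dataset entity."""
    by_id = {entity["@id"]: entity for entity in crate["@graph"]}
    return by_id[by_id["ro-crate-metadata.json"]["about"]["@id"]]


if __name__ == "__main__":
    print(json.dumps(crate, indent=2))
```

The flat list of linked entities is what lets one description cover the dataset, the people involved and individual files without nesting.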
This is a screenshot of an RO-Crate in the UTS data portal. We are looking at its HTML summary.
With extensive metadata.
A growing set of Arkisto-compatible software tools allows data ingest into repositories, and the creation of data discovery portals that connect data to analytical, visualisation and computing tools.
One important tool is Describo, a desktop (and soon to be online) tool for describing data using the RO-Crate standard. It creates linked-data descriptions that can describe a dataset at the top level, and also individual files or variables inside files.
There are two projects working on online versions of Describo - one at UTS and one led by CERN working with the European National Research Networks.
Describo can be configured for use in specific domains, for example in cultural archives like PARADISEC. This slide shows how users can create entities and link them, and select from pre-defined data loaded in as part of the profile.
Arkisto currently has two data discovery tools that index the contents of an OCFL repository so humans and machines can discover data and connect to analytical, visualisation and computing tools. This is Michael Lynch’s diagram showing, from the left, how data can be “delivered” to a repository via standard tools (such as rsync) over SSH.
An indexing process reads the RO-Crate metadata for objects and uses Solr (or another index like Elasticsearch) to build an index that can then be used for search and faceted browsing over data. There is a user on the right, requesting access to a dataset, and a security guard checking her credentials - the user has the rights to see datasets marked with a *.
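The two ideas in this diagram, faceted browsing and a credentials check, can be sketched in a few lines of plain Python. This is only a stand-in for what Solr or Elasticsearch does at scale; the record fields (sector, licence) and the function names are hypothetical and not taken from any Arkisto tool.

```python
from collections import Counter

# Hypothetical index records, as an indexer might derive them from crate metadata.
records = [
    {"id": "crate-1", "name": "Dataset A", "sector": ["Health"], "licence": "public"},
    {"id": "crate-2", "name": "Dataset B", "sector": ["Health", "Education"], "licence": "restricted"},
    {"id": "crate-3", "name": "Dataset C", "sector": ["Law"], "licence": "public"},
]


def visible_to(records: list[dict], user_licences: set[str]) -> list[dict]:
    """Keep only records whose licence the user has been granted."""
    return [r for r in records if r["licence"] in user_licences]


def facet_counts(records: list[dict], field: str) -> Counter:
    """Count how many records carry each value of a (possibly multi-valued) field."""
    counts: Counter = Counter()
    for record in records:
        values = record.get(field, [])
        counts.update([values] if isinstance(values, str) else values)
    return counts


# A user holding only the "public" licence sees two of the three datasets,
# and the facet counts are computed over what she is allowed to see.
public_view = visible_to(records, {"public"})
public_facets = facet_counts(public_view, "sector")
```

Filtering before counting matters: facet counts computed over the whole index would otherwise reveal the existence of datasets the user is not entitled to see.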
Here is an example from the PARADISEC indexer. As per the Arkisto approach, the PARADISEC site is data driven - objects are stored on disk in OCFL, using RO-Crate to describe each research object. Indexing tools walk the OCFL filesystem looking for RO-Crates and then, using the crate metadata in addition to the OCFL inventory metadata, construct appropriate indexes into the content. In this example we can see version 1 of this item and the metadata we get from the OCFL inventory.
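The walk described above might look something like the following sketch. It is not the PARADISEC indexer: the function names are hypothetical, and it assumes the crate metadata file is called ro-crate-metadata.json (the RO-Crate 1.1 name) and that each object root carries the OCFL object marker file and an inventory.json with head, manifest and versions blocks.

```python
import json
from pathlib import Path

CRATE_FILE = "ro-crate-metadata.json"  # assumed RO-Crate 1.1 file name


def find_ocfl_objects(repo_root: Path):
    """Yield object roots: directories with an inventory and an OCFL object marker."""
    for inventory in sorted(repo_root.rglob("inventory.json")):
        obj_root = inventory.parent
        # Version directories also hold an inventory.json but have no marker file.
        if any(obj_root.glob("0=ocfl_object_*")):
            yield obj_root


def load_head_crate(obj_root: Path):
    """Return inventory details plus the crate metadata for the head version."""
    inventory = json.loads((obj_root / "inventory.json").read_text(encoding="utf-8"))
    head = inventory["head"]
    version = inventory["versions"][head]
    # The version state maps digests to logical paths; the manifest maps the
    # same digests to where the content actually lives on disk.
    for digest, logical_paths in version["state"].items():
        if CRATE_FILE in logical_paths:
            content_path = inventory["manifest"][digest][0]
            crate = json.loads((obj_root / content_path).read_text(encoding="utf-8"))
            return {
                "id": inventory["id"],
                "version": head,
                "created": version.get("created"),
                "crate": crate,
            }
    return None  # object has no RO-Crate in its head version
```

An indexer could loop over find_ocfl_objects, call load_head_crate on each result, and pass the combined inventory and crate metadata to whichever search index is in use.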
This is an example of a faceted search interface, constructed using the Oni portal tool developed at UTS. It shows a data export from the Expert Nation project (https://expertnation.org/), which has the tagline “Universities, War and 1920s & 30s Australia” and is led by Associate Professor Tamson Pietsch, who asked us to create an archive of the state of the dataset to support a book. We are working with Pietsch’s team to configure the portal to be useful in exporting the data. The SectorName facet is particularly important; it shows that the health sector was by far the biggest employer of returned service people.
Here is another dataset - this time we are looking at an RO-Crate sitting on a plain old web site (not a search portal). This is a screenshot of a map with a time-window function showing where one Laura Adams was convicted of 42 offences between 1918 and 1942. The power of the Arkisto platform, based on standards, is that adding this kind of functionality to other collections with geographical features is a matter of writing a few simple bits of code - the component can be re-used because the data and metadata use the RO-Crate standard (which in turn is built on other standards). The data in this demonstrator came from Alana Piper’s Criminal Characters project.
This is a screenshot of geographical data about a single offender’s sentences that has been exported into the Time Layered Cultural Map.
We are working on making this an automated service so that any Arkisto portal can be configured both to display relevant geo-data and to export it, including at large scale, to other tools for analysis via APIs.
The researcher, Dr Alana Piper says:
Analytical possibilities here would be uploading all offenders in bulk and comparing the 'range' results to determine what types of offences or other factors are associated with higher/lower levels of mobility.
A modern catalog driven from OCFL and RO-Crate. This is the landing page built from the content indexed into Elasticsearch. We can see the number of collections, items, contributors and universities at a glance. There are controls for jumping to a specific item or collection and a simple autocomplete search for quickly finding known content. The bottom half is a dynamic list of the most recently updated items.
PARADISEC has viewers for various content types: video and audio with time-aligned transcriptions, image set viewers and document viewers (XML, PDF and Microsoft formats). We are working on making these viewers available across Arkisto sites by having a standard set of hooks for adding viewer plugins to a site as needed.
PARADISEC has advanced search and deep indexing into transcriptions, with the ability to play segments directly from the search interface.
This is another Arkisto-based website - a confidential, access-controlled database of successful grant applications, using an OCFL repository with RO-Crate objects presented by the Oni portal.
The Arkisto website has a growing list of use cases for different data pipelines - here’s a sketch of the architecture we’re working on for Associate Professor Shauna Murray’s group at UTS - managing data from a sensor network in estuaries along the NSW coast.
See the Use Cases page for more.
Arkisto is a flexible research platform which can be used to assemble a variety of data pipelines, for a variety of disciplines.
The emphasis is on FIRST keeping data safe and re-usable by storing and describing it using standards, so that in the absence of budget and resources to maintain complex virtual labs the data are still available for re-use, and THEN using our growing set of interoperable tools to build data hubs, with re-usable data viewer plugins and standards-based interoperable analytical services.
There are active projects under way at the University of Melbourne and the University of Technology Sydney across a wide range of disciplines, and we are seeking funding to enhance the platform.