Towards a (technical architecture for a) HASS Research Data Commons for language and text analysis

2021-10-12

This is a presentation by Peter (Petie) Sefton and Moises Sacal, delivered at the online eResearch Australasia Conference on October 12th 2021.

The presentation was delivered as a recorded video; this is a written version. Moises and I are both employed by the University of Queensland School of Languages and Culture.

Peter Sefton & Moises Sacal

Here is the abstract as submitted:

The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC).

The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities.

The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations.

In this presentation we will present the proposed architecture of the system, the principles that informed it and demonstrate the first version. Features of the solution include the use of the Arkisto Platform (presented at eResearch 2020), which leverages the Oxford Common File Layout. This enables storing complete version-controlled digital objects described using linked data with rich context via the Research Object Crate (RO-Crate) format. The solution features a distributed authorization model where the agency archiving data may be separate from that authorising access.

Project Team (alphabetical order)
Michael D’Silva
Marco Fahmi
Leah Gustafson
Michael Haugh
Cale Johnstone
Kathrin Kaiser
Sara King
Marco La Rosa
Mel Mistica
Simon Musgrave
Joel Nothman
Moises Sacal
Martin Schweinberger
PT Sefton
With thanks for their contribution:
Partner Institutions:

This cluster of projects is led by Professor Michael Haugh of the School of Languages and Culture at the University of Queensland with several partner institutions.

I work on Gundungurra and Darug land in the Blue Mountains; Moises is on the land of the Gadigal peoples of the Eora Nation. We would like to acknowledge the traditional custodians of the lands on which we live and work and the importance of Indigenous knowledge, culture and language to these projects.

The Language Data Commons of Australia (LDaCA) and Australian Text Analytics Platform (ATAP) projects received investment (https://doi.org/10.47486/DP768 and https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS).

This work is supported by the Australian Research Data Commons.

We are going to talk about the emerging architecture and focus in on one very important part of it: Access control. 🔏

But first, some background.⛰️

The platform will:
be sustainable, with a focus on data preservation as an overriding concern - data will not be ‘trapped’ in a particular platform and all data and code developed on the platform will be in a “migration free” layout ready for reuse
preserve interoperable and re-usable data via the use of common standards for describing and structuring data with useful detailed context and provenance
make ATAP and LDaCA data and collections discoverable - with the caveat that harvesting harmonised metadata from existing corpora may be difficult
provide workbench services for computational research - starting with code-notebooks but with the aim of building towards no-code environments and automatically re-runnable workflows
include clear licensing on all data and code on how data may be reused, informed by a legally sound policy framework, with an access-control framework to allow automated data access where possible (there are some external dependencies here)
be distributed - with data held by a number of different organizations under a variety of governance models and technologies (potentially including copies for redundancy or to put data close to compute and analytical services)
enable best-practice in research, with research products such as code and derived data available as “fully documented research objects” that are as re-runnable and rigorously described as possible
provide and be able to show value in enabling and measuring the impact of research

The architecture for the Data Commons project is informed by a set of goals and principles, starting with ensuring that important data assets have the best chance of persisting into the future.

[Diagram: repositories and workspaces. Repositories (institutional, domain or both) sit under policy-based data management and offer find / access services, informed by a Research Data Management Plan. Workspaces (working storage, domain-specific tools, domain-specific services) are considered ephemeral and subject to active cleanup processes. A central cycle links the two: collect, describe, analyse; deposit early, deposit often; reuse findable, accessible, interoperable, reusable data objects. V1.1 © Marco La Rosa, Peter Sefton 2021 https://creativecommons.org/licenses/by-sa/4.0/]

The diagram, which we developed with Marco La Rosa, makes a distinction between managed repository storage and the places where work is done - “workspaces”. Workspaces are where researchers collect, analyse and describe data. Examples include the most basic of research IT services, file storage, as well as analytical tools such as Jupyter notebooks (the backbone of ATAP - the text analytics platform). Other examples of workspaces include code repositories such as GitHub or GitLab (a slightly different sense of the word repository), survey tools, electronic (lab) notebooks and bespoke code written for particular research programmes. These workspaces are essential research systems but are usually not set up for long-term management of data. The cycle in the centre of this diagram shows an idealised research practice where data are collected, described and deposited into a repository frequently. Data are made findable and accessible as soon as possible and can be “re-collected” for use and re-use.

For data to be re-usable by humans and machines (such as ATAP notebook code that consumes datasets in a predictable way) it must be well described. The ATAP and LDaCA approach to this is to use the Research Object Crate (RO-Crate) specification. RO-Crate is essentially a guide to using a number of standards and standard approaches to describe both data and re-runnable software such as workflows or notebooks.
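To make the idea of a machine-readable description concrete, here is a minimal sketch of the metadata file at the heart of an RO-Crate, built in Python. It follows the general shape of RO-Crate 1.1 (a JSON-LD `@graph` containing a metadata descriptor and a root dataset); the collection name, file paths, date and licence URL are invented for illustration and are not taken from any LDaCA or ATAP collection.

```python
import json


def make_ro_crate_metadata(name, description, date_published, licence_url, files):
    """Build a minimal RO-Crate 1.1 metadata document as a plain dict.

    `files` is a list of (relative_path, media_type) pairs.
    """
    # The metadata descriptor says which entity describes the crate itself.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    # The root dataset carries the human-facing description and the licence.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "datePublished": date_published,
        "license": {"@id": licence_url},
        "hasPart": [{"@id": path} for path, _ in files],
    }
    # One entity per file, so that code can find data by type and format.
    file_entities = [
        {"@id": path, "@type": "File", "encodingFormat": media_type}
        for path, media_type in files
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root] + file_entities,
    }


# All values below are hypothetical.
crate = make_ro_crate_metadata(
    name="Example interview collection",
    description="A made-up recording with a time-aligned transcript.",
    date_published="2021-10-12",
    licence_url="https://example.org/licences/A",
    files=[
        ("audio/interview1.wav", "audio/wav"),
        ("transcripts/interview1.csv", "text/csv"),
    ],
)
print(json.dumps(crate, indent=2))
```

A notebook that consumes such a crate can locate the transcript by walking `hasPart` and checking `encodingFormat`, rather than relying on a file-naming convention.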

In the context of the previous high-level map distinguishing workspaces and repository services, we are using the Arkisto Platform (introduced at eResearch 2020).

Arkisto is an approach to eResearch service that places the emphasis on ensuring the long term preservation of data independently of code and services - recognizing the ephemeral nature of software.

An example of a corpus is the PARADISEC collection (Pacific and Regional Archive for Digital Sources in Endangered Cultures).

PARADISEC has viewers for various content types: video and audio with time-aligned transcriptions, image set viewers and document viewers (XML, PDF and Microsoft formats). We are working on making these viewers available across Arkisto sites by having a standard set of hooks for adding viewer plugins to a site as needed.
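One way to picture "hooks for viewer plugins" is a small registry keyed by media type. The sketch below is only an assumption about how such a hook could look; the function names (`register_viewer`, `render`) and the HTML snippets are hypothetical and do not describe the PARADISEC or Arkisto code.

```python
from typing import Callable, Dict

# A viewer takes the URL of a file and returns an HTML fragment to display it.
Viewer = Callable[[str], str]

_VIEWERS: Dict[str, Viewer] = {}


def register_viewer(media_type: str) -> Callable[[Viewer], Viewer]:
    """Decorator that registers a viewer plugin for one media type."""
    def decorator(viewer: Viewer) -> Viewer:
        _VIEWERS[media_type] = viewer
        return viewer
    return decorator


@register_viewer("application/pdf")
def pdf_viewer(url: str) -> str:
    return f'<embed src="{url}" type="application/pdf">'


@register_viewer("audio/wav")
def audio_viewer(url: str) -> str:
    return f'<audio controls src="{url}"></audio>'


def render(media_type: str, url: str) -> str:
    """Use the registered viewer, or fall back to a plain download link."""
    viewer = _VIEWERS.get(media_type)
    if viewer is None:
        return f'<a href="{url}">Download</a>'
    return viewer(url)


print(render("audio/wav", "https://example.org/item/1/interview.wav"))
print(render("text/csv", "https://example.org/item/1/transcript.csv"))
```

A site would then only load the plugins it needs, and unknown formats would still be reachable as downloads.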

[Diagram: high-level architecture of the Data Commons. Labelled components: Compute (HPC, cloud, desktop); ephemeral workspaces where data are collected, described and analysed (ATAP notebooks; apps, code, workflows; working storage; domain-specific tools and services; active cleanup processes); Analytics Portal (code discovery, launch / rerun, data discovery, authenticated API); Workbench (notebooks, data import by URL, export of fully described packages; stretch goals: code generation / simple interfaces, e.g. Discursis); BYOData; storage (including Cloudstor); data cleaning (OCR / transcription, format migration); data curation and description; deposit / publish; reuse; Licence Server; Identity Management (AAF / social media accounts); Archive & Preservation Repositories, institutional, domain or both (PARADISEC, AU Nat. Corpus, AusLan (sign), Sydney Speaks, ATAP Corpus: reference, training and BYO; harvested external collections); Language portal(s) (corpus discovery, item discovery, authenticated API, create virtual corpora).]

This slide captures the overall high-level architecture. There will be an analytical workbench (left of the diagram), which is the basis of the Australian Text Analytics Platform (ATAP) project; this will focus on notebook-style programming using one of the emerging Jupyter notebook platforms in that space. The exact platform has not yet been decided, but that has not stopped the team from starting to collect and develop notebooks that open up text analytics to new coders from the linguistics community. Our engagement lead, Dr Simon Musgrave, sees the ATAP work as primarily an educational enterprise, which will be underpinned by services built on the Arkisto standards that allow for rigorous, re-runnable research.

Today we will look in detail at one important part of this architecture - access control. How can we make sure that in a distributed system, with multiple data repositories and registries residing with different data custodians, the right people have access to the right data?

I didn’t spell this out in the recorded conference presentation, but for data that resides in the repositories at the right of the diagram we want to encourage research processes that clearly separate data from code. Notebooks and other code workflows that use data will fetch a version-controlled reference copy from a repository (using an access key if needed), process the data and produce results that are then deposited into an appropriate repository alongside the code itself. Given that a lot of the data in the language world is NOT available under open licenses such as Creative Commons, it is important to establish this practice: each user of the data must negotiate or be granted access individually. Research can still be reproducible using this model, but without a culture of sharing datasets with no regard for the rights of those who were involved in the creation of the data.
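The fetch-process-deposit pattern can be sketched in a few lines of notebook-style Python. Everything named here is an assumption made for illustration: the repository URL, the URL layout, the `REPOSITORY_ACCESS_KEY` environment variable and the shape of the provenance record are hypothetical, not the LDaCA API. The request is only constructed, not sent.

```python
import os
from typing import Dict, List, Optional
from urllib.parse import quote
from urllib.request import Request

REPOSITORY_API = "https://repository.example.org/api"  # hypothetical endpoint


def build_fetch_request(object_id: str, version: str,
                        access_key: Optional[str] = None) -> Request:
    """Build a request for one pinned version of a data object.

    The access key belongs to the individual user; it is never stored
    in the notebook, so sharing the code does not share the data.
    """
    url = (f"{REPOSITORY_API}/object/{quote(object_id, safe='')}"
           f"/versions/{quote(version, safe='')}")
    headers = {"Accept": "application/json"}
    if access_key:
        headers["Authorization"] = f"Bearer {access_key}"
    return Request(url, headers=headers)


def provenance_record(object_id: str, version: str, code_ref: str,
                      outputs: List[str]) -> Dict[str, object]:
    """Describe a result by reference to its input data, not by copying it."""
    return {
        "input": {"id": object_id, "version": version},
        "code": code_ref,
        "outputs": outputs,
    }


# Each user supplies their own key, e.g. via an environment variable.
key = os.environ.get("REPOSITORY_ACCESS_KEY")
request = build_fetch_request("doi:10.0000/example", "v2", access_key=key)
print(request.full_url)
print(provenance_record("doi:10.0000/example", "v2",
                        "https://example.org/code/notebook.ipynb",
                        ["word-frequencies.csv"]))
```

Because the deposited result points at an identifier and version rather than bundling the source data, someone re-running the work must obtain their own access to that data first.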

Regarding rights, our project is informed by the CARE Principles for Indigenous Data Governance.

The current movement toward open data and open science does not fully engage with Indigenous Peoples rights and interests. Existing principles within the open data movement (e.g. FAIR: findable, accessible, interoperable, reusable) primarily focus on characteristics of data that will facilitate increased data sharing among entities while ignoring power differentials and historical contexts. The emphasis on greater data sharing alone creates a tension for Indigenous Peoples who are also asserting greater control over the application and use of Indigenous data and Indigenous Knowledge for collective benefit

https://localcontexts.org/labels/traditional-knowledge-labels/

We are designing the system so that it can work with diverse ways of expressing access rights, for example licensing like the Traditional Knowledge (TK) Labels. The idea is to separate safe storage of data, with a licence on each item that may reference the TK Labels, from a system administered by the data custodians, who can make decisions about who is allowed to access data.

We are working on a case-study with the Sydney Speaks project via steering committee member Professor Catherine Travis.

This project seeks to document and explore Australian English, as spoken in Australia’s largest and most ethnically and linguistically diverse city – Sydney. The title “Sydney Speaks” captures a key defining feature of the project: the data come from recorded conversations between Sydneysiders, as they tell stories about their lives and experiences, their opinions and attitudes. This allows us to measure how their lived experiences impact their speech patterns. Working within the framework of variationist sociolinguistics, we examine variation in phonetics, grammar and discourse, in an effort to answer questions of fundamental interest both to Australian English, and language variation and change more broadly, including:

How has Australian English as spoken in Sydney changed over the past 100 years?

Has the change in the ethnic diversity over that time period (and in particular, over the past 40 years) had any impact on the way Australian English is spoken?

What affects the way variation and change spread through society?

Who are the initiators and who are the leaders in change?

How do social networks function in a modern metropolis?

What social factors are relevant to Sydney speech today, and over time (gender? class? region? ethnic identity?)

A better understanding of what kind of variation exists in Australian English, and of how and why Australian English has changed over time can help society be more accepting of speech variation and even help address prejudices based on ways of speaking. Source: http://www.dynamicsoflanguage.edu.au/sydney-speaks/

The collection contains recordings of people speaking, both contemporary and historic.

Because this involved human participants there are restrictions on the distribution of data - a situation we see with lots of studies involving people in a huge range of disciplines.

There are four tiers of data access we need to enforce and observe for this data based on the participant agreements and ethics arrangements under which the data were collected.

Concerns about rights and interests are important for any data involving people, and a large amount of the data we are using, both Indigenous and non-Indigenous, will require access control that ensures that data sharing is appropriate.

In this example demo we uploaded various collections and are authorising with GitHub organisations.

In our production release we will use AAF to authorise different groups.

Let's find a dataset: The Sydney Speaks Corpus

As you can see, we cannot see any data.

Let's log in… We authorise GitHub…

Now you can see we have access to sub-corpus data, and I am just opening a couple of items.

—

Now in GitHub we can see the group management example.

I have given access to all the licences to myself, as you can see here, and given access to licence A to others.

This diagram is a sketch of the interaction that took place in the demo. It shows how a repository can delegate authorization to an external system, in this case GitHub rather than CILogon. We are working with the ARDC to set up a trial with the Australian Access Federation to allow CILogon access for the HASS Research Data Commons so we can pilot group-based access control.

NOTE: This diagram has been updated slightly from the version presented at the conference to make it clear that the lookup to find the licence for the data set is internal to the repository - the id is a DOI but it is not being resolved over the web.
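The delegation described here can be sketched as two lookups: the repository resolves an item to its licence internally, then asks an external authority whether the user belongs to the group for that licence. This is a simplified illustration, not the demo code. The item ids, the user names and the licence names beyond "licence A" are invented, and the in-memory `GROUP_MEMBERS` table stands in for a call to GitHub or CILogon.

```python
from typing import Callable, Dict, Optional, Set

# Held inside the repository: item id -> licence. The id looks like a DOI
# but is used only as a local key; nothing is resolved over the web.
ITEM_LICENCES: Dict[str, str] = {
    "doi:10.0000/example-item-1": "licence-A",
    "doi:10.0000/example-item-2": "licence-D",
}

# Held by the external authority and managed by the data custodians.
# Mirrors the demo: one user holds every licence, others hold licence A only.
GROUP_MEMBERS: Dict[str, Set[str]] = {
    "licence-A": {"presenter", "colleague"},
    "licence-B": {"presenter"},
    "licence-C": {"presenter"},
    "licence-D": {"presenter"},
}


def groups_for(user: str) -> Set[str]:
    """Stand-in for asking the external authority about group membership."""
    return {group for group, members in GROUP_MEMBERS.items() if user in members}


def can_access(user: Optional[str], item_id: str,
               lookup_groups: Callable[[str], Set[str]] = groups_for) -> bool:
    """Allow access only if the user is in the group for the item's licence."""
    licence = ITEM_LICENCES.get(item_id)  # internal lookup in the repository
    if licence is None or user is None:   # unknown item, or not logged in
        return False
    return licence in lookup_groups(user)


for user in (None, "colleague", "presenter"):
    for item in ITEM_LICENCES:
        print(user, item, can_access(user, item))
```

The repository never needs to know why a user holds a licence; swapping the authority (for example from GitHub organisations to an AAF-backed service) would only change the `lookup_groups` function passed in.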

In this presentation, about work which is still very much under construction, we have:

Shown an overview of a complete Data Commons Architecture
Previewed a distributed access-control mechanism which separates the job of storing and delivering data from that of authorising access
We'll be back next year with more about how analytics and data repositories connect using structure and linked data.
