ptsefton.comhttps://ptsefton.com/2023-06-29T00:00:00+02:00Open Repositories 2023: Trip report2023-06-29T00:00:00+02:002023-06-29T00:00:00+02:00Peter Seftontag:ptsefton.com,2023-06-29:/2023/06/29/report-or-2023/index.html<p>This is a summary of my trip to the 18th Open Repositories conference, hosted by Stellenbosch University in South Africa. My travel was paid for by my main employer, the University of Queensland. I've attended 17 of the 18 OR conferences -- including the first one, which was held in Sydney and organized by the <a href="https://openresearch-repository.anu.edu.au/handle/1885/6614">Australian Partnership for Sustainable Repositories</a>. I think <a href="https://orcid.org/0000-0003-4346-1416">Jon Dunn</a> from Indiana and I are now tied at 17 attendances each -- not sure if there are any other contenders?</p>
<img src='https://ptsefton.com/2023/06/29/report-or-2023/image-1.png' alt='A view of (I think) The Strand, from the conference hotel' />
<p><em>An early morning view from the conference hotel of what I think is the seaside suburb "The Strand"</em></p>
<p>This conference went by really fast -- I was presenting in three sessions, on <a href="/2023/06/13/ro-crate-or-2023/">RO-Crate</a>, <a href="/2023/06/14/arkisto-stack-or-2023/">a description of the Arkisto repository stack</a> and a <a href="/2023/06/13/oni-dev-track-or-2023/">tech-stream talk on building an ad-hoc repository</a> (not a live demo, but featuring screen recordings I made on the 14-hour plane ride to Johannesburg that took me 8 hours into the past). I also chaired a session (more on that below), fell into a jet-lag coma in my room for one session, and by the time I'd done all that there was not a huge amount of choice left in what to attend.</p>
<h1>What's new in repositories</h1>
<p>I chaired a session, <a href="https://www.conftool.net/or2023/index.php?page=browseSessions&form_session=511&presentations=show">Updates on technology platforms</a>, which had coverage of only two major platforms (DSpace and Islandora) -- or three if you count DSpace CRIS (which adds features for tracking research metrics) as a separate thing from DSpace. I heard about Dataverse and Invenio in other sessions, but didn't manage to see anything on EPrints. My general impression from this was: wow, these things have gotten big and complicated, and there are a lot of <em>features</em> these days. Overall, architectures have matured so the software stacks now tend to have APIs, but they tend to remain fairly monolithic (just an impression, maybe I'm wrong).</p>
<p>An example of what I mean by lots of features: DSpace CRIS is a layering of a "Current Research Information System" onto the repository platform. We did the same thing with <a href="https://ptsefton.com/2018/07/06/RedBoX-Provisioner-OR2018/index.html">RedBox</a> -- it started as a repository platform and then, after the funding organization in its wisdom had killed off the idea of it actually being a repository, it became a metadata registry + data management plans + provisioning of services. What we ended up with was a very complicated system built on a repository-focussed platform that in hindsight would have been better built on a standard web application platform. It's still valuable, but last time I checked it could do with a rewrite. Maybe in Salesforce? No, I'm kidding, not that. I'm not sure, but I suspect DSpace CRIS may be like that -- building lots of extra functionality into a core application certainly complicates keeping it up to date with the main-line application, at any rate.</p>
<p>Chairing has never been my favourite part of a conference, but I kept everybody to time -- I was grinding my teeth at presenters (often with something to sell) in this and other sessions (no names) who had super-slick slides with a pretty high marketing content, including stuff like the history of their thing, laced with cute cultural references. Fine, but don't be asking the chair for more time if your presentation is half filler that seemed funny back in the office. Speaking of cultural references, did I mention that some of the 'featuritis' that makes these things expensive to deploy and manage reminded me of that time Homer put all those gadgets in his car:</p>
<img src="https://ptsefton.com/2023/06/29/report-or-2023/image-4.png">
<p><em>Image of Homer Simpson DJing in his car for which I have not secured rights, which I found <a href="https://simpsons.fandom.com/wiki/Brake_My_Wife,_Please/Gallery?file=Screenshot_%289277%29.png">here</a>.</em></p>
<h1>Highlight</h1>
<p>The highlight of the conference for me was the closing keynote:</p>
<blockquote>
<p>Our closing keynote speaker is Prof Hussein Suleman whose research is situated within the Digital Libraries Laboratory in the Department of Computer Science at the University of Cape Town. This session promises to inspire and challenge our participants - to encourage them to think broadly about the ways repositories enable discoverability and interoperability of information and data within the structured web of data.</p>
</blockquote>
<p>Prof Suleman's talk (<a href="http://www.husseinsspace.com/">home page</a>, <a href="https://orcid.org/0000-0002-4196-1444">ORCiD</a>) was indeed challenging to the OR status quo: he questioned the complexity of current repositories (see above for my comments on the same) and bemoaned the incursion of proprietary software into institutions, pushing aside a movement that has its roots firmly in open source. Like I mentioned above, some stuff in the repositories world has drifted pretty far from the original mission of keeping stuff safe and making sure people can find and use it. Regarding resourcing, Suleman said he doesn't like the label "The Global South" and prefers to talk about low-resource environments. He noted that much of what he'd heard at the conference -- which he actually attended -- was not relevant to those working in these environments.</p>
<p>Suleman pointed out that while resourcing is of course about money, there are several dimensions; he mentioned that in rural areas all over the world various resources are in short supply (money, people, skills, network etc), as they are in poorly resourced organizations such as some NGOs. All of this resonates with the experiences of colleagues working with language data, or anything really, in parts of Australia and the Pacific.</p>
<p>He also identified a huge resource problem, which is that archive projects are often funded for the build but not for maintenance (and not just archive projects -- even in rich countries like Australia we see funding for building-but-not-maintaining), which has led Suleman and his colleagues to investigate technologies that are resilient to funding-failure.</p>
<img src='https://ptsefton.com/2023/06/29/report-or-2023/image-2.png' alt='The edge of what I think is called an "informal settlement" near the road to Cape Town' />
<p><em>Speaking of "low resource", not far from the conference hotel are very different accommodations from the shining towers at The Strand. Someone told me there are about 500,000 people living in this part of Cape Town</em></p>
<p>Suleman works on systems such as the "Simple Archives Project" that demonstrate techniques for preserving archives in low-resource environments:</p>
<blockquote>
<p>Digital library systems are not always successfully implemented and sustainable in low resource environments, such as in poor countries and in organisations without resources. As a result, some archives with important collections are short-lived while others never materialise. This paper presents a new toolkit for the creation of simple digital libraries, based on a long trajectory of research into architectural styles. It is hoped that this system and approach will lower the barrier for the creation of digital libraries and provide an alternative architecture for experiments and the exploration of new design ideas. <a href="https://pubs.cs.uct.ac.za/id/eprint/1512/">https://pubs.cs.uct.ac.za/id/eprint/1512/</a></p>
</blockquote>
<p>This uses CSV (the killer app, says Prof Suleman) for metadata, which is transformed into XML and then into a static website -- taking this approach should make things a little more sustainable than repeatedly having to migrate to the latest and greatest DSpace. (I <a href="/2023/06/13/oni-dev-track-or-2023/">showed our approach to this</a> in the tech stream).</p>
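<p>To make the CSV-to-static-site idea concrete, here is a minimal sketch of the general pattern (this is my illustration, not Suleman's actual toolkit, and the column names and file layout are made up for the example). One metadata row in, one standalone HTML page out -- nothing to keep running once the pages are written:</p>

```python
import csv
import html
import io
import pathlib

# Hypothetical two-row metadata file in the spirit of a CSV-first
# archive -- the identifiers and columns are illustrative only.
CSV_TEXT = """identifier,title,description
item-001,Field recording 1,Audio recorded in 1998
item-002,Field recording 2,Transcript of an interview
"""

def render_item(row):
    """Render one metadata row as a standalone static HTML page."""
    return (
        "<!DOCTYPE html><html><head><title>{t}</title></head>"
        "<body><h1>{t}</h1><p>{d}</p></body></html>"
    ).format(t=html.escape(row["title"]), d=html.escape(row["description"]))

def build_site(csv_text, out_dir):
    """Write one HTML page per CSV row into out_dir."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for row in csv.DictReader(io.StringIO(csv_text)):
        (out / f"{row['identifier']}.html").write_text(render_item(row))

build_site(CSV_TEXT, "site")
print(sorted(p.name for p in pathlib.Path("site").glob("*.html")))
```

<p>The point of the exercise is the failure mode: if funding stops, the generated pages keep working on any web server (or a USB stick) with no database or application runtime to maintain.</p>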
<p>I had noted a couple of things I heard in a <a href="https://www.conftool.net/or2023/index.php?page=browseSessions&form_session=509#paperID200">session on accessing research data</a> that are relevant to Suleman's message:</p>
<ol>
<li>
<p>There was a really great presentation, "Rethinking the A in FAIR Data: issues of data access and accessibility in research", that looked at just how accessible supposedly universally accessible resources really are, by testing network access to repositories from different regions -- from some countries you get to see nothing at all. Here's a <a href="https://www.frontiersin.org/articles/10.3389/frma.2022.912456/full">link to a full-text paper by the same authors</a>.</p>
</li>
<li>
<p>In reply to a question about impact after the presentation "Repositioning Repositories: Designing and Assessing the Life Cycle of Research Infrastructures", the presenter (I think it was Ron Dekker speaking) noted that when the UK introduced Hybrid Open Access to publishing, publishers used this as an opportunity to increase their profits -- it didn't result in better OA or lower costs as the policy makers had planned.</p>
</li>
</ol>
<h2>Reflections on the keynote in the context of our work</h2>
<p>To me, Suleman's presentation was spot on and I agreed with his main themes, but I did have a couple of issues to discuss, relating what he said back to our work.</p>
<p><strong>I don't think XML is currently the best way to represent metadata for preservation</strong>, or even as part of an ephemeral toolchain; I think it's better to use JSON-LD (but I would say that, as I'm the co-editor of the RO-Crate spec, which does just that). RO-Crate is much more friendly to 2020s programmers, is more easily extensible, and is absolutely aligned with the idea of having static repositories in that it encourages the use of HTML previews for data objects so they can be understood <em>without</em> additional software. RO-Crate has standard tools for this and more are in the pipeline.</p>
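<p>For readers who haven't seen one, this is roughly what a minimal RO-Crate metadata document looks like, built here as a plain Python dict: the metadata descriptor entity, the root dataset, and one file entry. The collection name and file are invented for the example; only the two required entities and the overall shape follow the RO-Crate 1.1 spec.</p>

```python
import json

# A minimal RO-Crate 1.1 metadata document as a plain dict.
# The collection and file named here are illustrative only.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            # The metadata descriptor: points at the spec and the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # The root dataset: describes the crate as a whole.
            "@id": "./",
            "@type": "Dataset",
            "name": "Example language collection",
            "description": "Illustrative only -- not a real collection",
            "hasPart": [{"@id": "recording.wav"}],
        },
        {
            # One data entity contained in the crate.
            "@id": "recording.wav",
            "@type": "File",
            "name": "Field recording",
        },
    ],
}

print(json.dumps(crate, indent=2))
```

<p>Serialized as <code>ro-crate-metadata.json</code> and dropped in a directory next to the data, this is readable by any JSON parser, and with an HTML preview alongside it the object is self-describing without any repository software.</p>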
<p>And if RO-Crate/JSON-LD is too complicated (and I have to admit, the latest spec is getting that way) then <a href="https://frictionlessdata.io/">Frictionless Data</a> is a JSON-only approach that is simpler to implement.</p>
<p>By the way, thinking about XML for metadata I was reminded of a comment made by Ron Ward at USQ many years ago when he encountered the use of XML for metadata. He remarked that in this role (and other data interchange scenarios) XML is not acting as a markup language at all. A markup language, like HTML -- THE Hypertext Markup Language -- adds semantic and/or formatting information to textual data, which is a very different task from representing hierarchical data structures the way JSON does. XML's markup-focussed heritage means that it is MUCH more complicated to use, even for simple metadata schemas.</p>
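<p>A tiny example of what I mean (the record and its fields are invented for illustration): in XML the modeller has to choose between attributes, child elements and text content -- three ways to say the same thing -- and every consumer has to know which choice was made, whereas the JSON rendering has one obvious shape.</p>

```python
import json
import xml.etree.ElementTree as ET

# The same tiny (made-up) record both ways. Note the XML mixes
# attributes (id, code) and element text (title) -- a consumer must
# know which convention each field uses.
xml_text = """<record id="item-001">
  <title>Field recording</title>
  <language code="en"/>
</record>"""

record = ET.fromstring(xml_text)
as_json = {
    "id": record.get("id"),                       # attribute on the root
    "title": record.findtext("title"),            # text of a child element
    "language": record.find("language").get("code"),  # attribute on a child
}
print(json.dumps(as_json))
```
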
<p>One XML-based protocol that Suleman mentioned was the venerable <a href="https://www.openarchives.org/pmh/">OAI-PMH</a>, which he was involved in creating around the turn of this century. OAI-PMH is a way for repositories to be harvested so their resources can be centrally indexed for discovery. He commented that it had stood the test of time, and considered it to be one of those things that could be implemented by a competent developer in a day. I'd have to say I think that's overly optimistic; just using existing OAI-PMH software, with all the complexities of sending differently flavoured XML over the wire, used to consume a <em>lot</em> of developer and support time when I was involved in running a repository support service at the University of Southern Queensland. (Maybe the reason we had trouble was that the implementations we used were built in a day? 🤣)</p>
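<p>To give a flavour of what a harvester deals with, here is a sketch that parses a canned, minimal <code>ListRecords</code> response (the record and resumption token are invented, not from a real repository). Even this happy-path case needs careful namespace handling; the real pain starts when repositories send differently flavoured XML:</p>

```python
import xml.etree.ElementTree as ET

# ElementTree uses "{namespace}tag" notation for qualified names.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

# A canned, minimal ListRecords response -- illustrative only.
RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

def parse_list_records(xml_text):
    """Return ([(identifier, title), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(f"{OAI_NS}record"):
        ident = rec.find(f"{OAI_NS}header/{OAI_NS}identifier").text
        title = rec.find(f".//{DC_NS}title")
        records.append((ident, title.text if title is not None else None))
    token = root.find(f".//{OAI_NS}resumptionToken")
    return records, token.text if token is not None else None

records, token = parse_list_records(RESPONSE)
print(records, token)
```

<p>A full harvester loops on the resumption token, fetching pages until the token comes back empty -- and that loop is where real-world servers' quirks (truncated XML, wrong encodings, flaky tokens) eat the developer time I'm complaining about.</p>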
<p>Suleman referenced current work on static repositories, with a few totally justified "what took you so long?" jibes. In particular he name-checked the Oxford Common File Layout (OCFL) (which we use in our current projects), saying that <em>maybe</em> it's too complicated. Maybe it is, and we've had these discussions in the LDaCA project and with people in our partner project PARADISEC, but our current thinking is that while the OCFL file structure is not exactly human-friendly, its other preservation features make it an OK trade-off. And it's not that hard to implement. For example, with Mike Lynch and Moises Sacal Bonequi at UTS we got a minimal OCFL library going in not much more than a day's work, and I believe John Ferlito at PARADISEC has had a similar experience adding S3 support this year -- it's days, not months, of work to write a library that can be used for a project. (Writing a more mature, more fully featured library is a lot more work, but will benefit a lot more projects.)</p>
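<p>To back up the "days, not months" claim, here is a minimal sketch (my own, under stated assumptions) of writing a one-version OCFL object: the NAMASTE conformance file, a <code>v1/content</code> directory, and an <code>inventory.json</code> with a manifest and version state. It omits the inventory sidecar digest, fixity block and update logic that a real library needs, and the object id is invented:</p>

```python
import hashlib
import json
import pathlib
from datetime import datetime, timezone

def write_ocfl_object(root, obj_id, files):
    """Write a minimal one-version OCFL 1.1 object.

    A sketch of the layout only: NAMASTE conformance file, v1 content
    directory, and an inventory.json. No sidecar digest, no fixity
    block, no support for adding later versions.
    """
    root = pathlib.Path(root)
    (root / "v1" / "content").mkdir(parents=True, exist_ok=True)
    # NAMASTE file marking this directory as an OCFL object.
    (root / "0=ocfl_object_1.1").write_text("ocfl_object_1.1\n")

    manifest, state = {}, {}
    for name, data in files.items():
        (root / "v1" / "content" / name).write_bytes(data)
        digest = hashlib.sha512(data).hexdigest()
        manifest[digest] = [f"v1/content/{name}"]  # content-addressed storage paths
        state[digest] = [name]                     # logical paths in this version

    inventory = {
        "id": obj_id,
        "type": "https://ocfl.io/1.1/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": "v1",
        "manifest": manifest,
        "versions": {
            "v1": {
                "created": datetime.now(timezone.utc).isoformat(),
                "state": state,
            }
        },
    }
    (root / "inventory.json").write_text(json.dumps(inventory, indent=2))

# Illustrative object id and content only.
write_ocfl_object("obj1", "https://example.org/obj1", {"hello.txt": b"hi\n"})
```

<p>The content-addressed manifest is what buys the preservation features: every file is verifiable against its digest, and later versions can share unchanged content rather than duplicating it.</p>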
<p>But as with the comment I made about RO-Crate above, if you're working in a resource-constrained environment where OCFL is considered too complicated, then data can of course simply be placed in a directory/folder in a storage system, a la Suleman's Simple Archives -- this is the approach the PARADISEC team take when they carry portions of the PARADISEC repository on Raspberry Pi based servers to various Pacific Islands, where people can use the repositories from their phones (or whatever devices) on hyper-local networks.</p>
<h2>Summary & follow-up</h2>
<p>In summary, I think Suleman's talk was a really great articulation of many of the issues in repository practice; I think he made his points very powerfully, including a couple of comments that may have made a conference sponsor or two grimace.</p>
<p>In response to Suleman's challenges I'd like to propose a stream of work at next year's Open Repositories conference in Sweden.</p>
<p>How about we hold a pre-conference hands-on workshop that challenges repository developers to embrace some of the approaches Suleman is talking about -- storing files on disk, zero-install indices of content etc? How simple could you go to radically re-imagine a repo stack?</p>
<p>I'd like to see a mixture of institutional and commercial developers get involved, and to step out of their big fully-featured repository palaces and see what we can get done in a few days over the conference. We'd then have a session at the end of the conference that builds on work by Suleman and others on low-resource-ready repository and archive solutions. There might be token prizes as there are for poster presentations.</p>
<h2>That thing where you go to a conference to meet someone from up the road</h2>
<p>There were two people from Australian institutions at the conference. Me, obviously, and the other was Janet McDougall from the Australian Data Archive and the Australian National University. She was presenting in a session on Indigenous Knowledge Preservation:</p>
<blockquote>
<h3>Australian Data Archive (ADA) and Australian National University (ANU) and The Dataverse Project, TKLabels use case</h3>
</blockquote>
<blockquote>
<p>Janet McDougall & Steven McEachern (Australian Data Archive, Australia), Sonia Maria Barbosa (Harvard University, United States)</p>
</blockquote>
<blockquote>
<p>The TK and BC Labels are an initiative for Indigenous communities and local organizations. Developed through sustained partnership and testing within Indigenous communities across multiple countries, the Labels allow communities to express local and specific conditions for sharing and engaging in future research and relationships in ways that are consistent with already existing community rules, governance, and protocols for using, sharing and circulating knowledge and data.
ADA has an interest in establishing the means for providing suitable representation of indigenous knowledge within the Dataverse software. This includes functionality in Dataverse to:</p>
<ul>
<li>link to and incorporate identified sources for indigenous knowledge representation, such as TKLabels and Notices</li>
<li>curation processes for managing the creation, reading, updating, and deleting of metadata</li>
<li>present curated metadata (e.g., TKLabels and TKNotices) in catalogue records</li>
<li>allow external aggregators to harvest this metadata (specifically the IDN Data Catalogue, but a preferably standardized model that allows for multiple external parties to harvest)</li>
</ul>
</blockquote>
<p>There's a workshop next week in Brisbane for the Australian HASS and Indigenous Research Data Commons project(s), at which Janet will be presenting this work and I will be presenting the LDaCA data access and authorization model; we're looking to share implementation experience between projects -- more on that soon. We're interested in how labels that "express conditions" can be incorporated into environments that authorize access.</p>
<p>On a global scale Janet and I are near neighbours, but we only seem to get to talk in Montana, or the Kirstenbosch botanic gardens, where outgoing conference chair Claire Knowles and I accompanied Janet for a post-conference volunteer-led botanical tour.</p>
<img src="https://ptsefton.com/2023/06/29/report-or-2023/image-6.png" alt="A woman holding onto a live flower">
<p><em>Our super-informative volunteer guide (didn't catch her name) explaining something about a strelitzia (she said you have to understand the empire builder Cecil Rhodes as a man of his time, and after all he did give Cape Town this botanic garden)</em></p>
<h2>South Africa</h2>
<p>Here are a few pics and fragments from the trip. I had a few days of personal time (in accordance with the strict UQ policy on these matters).</p>
<img src='https://ptsefton.com/2023/06/29/report-or-2023/image-3.png' alt='A cheetah on a leash with his handler' />
<p><i><s>Tigers</s>Cheetahs on a <s>gold</s> leash - good thing I packed the 75-300mm MFT lens for this suburban safari</i></p>
<p>Not far from the conference, ten minutes by the ubiquitous Toyota Corolla Uber, in a gated community, is a cheetah 'sanctuary' where you can pat cheetah 'ambassadors' and gawk at a few other animals through the wire. This apparently has something to do with conservation.</p>
<img src='https://ptsefton.com/2023/06/29/report-or-2023/image.png' alt='A bushwalk at Kirstenbosch with clouds on Table Mountain' />
<p><em>A little bushwalk (boschwalk?) around Kirstenbosch - clouds on Table Mountain again</em></p>
<p>Also this happened:</p>
<img src="https://ptsefton.com/2023/06/29/report-or-2023/image-5.png" alt="Tweet about a singalong" />
<p><em><a href="https://orcid.org/0000-0001-5754-9940">Kim Shepherd</a> and I were quietly jamming on our ukes in the hotel lobby after the conference dinner when <a href="https://twitter.com/pedroprincipe">Pedro Principe</a> turned up -- Pedro got the party started on Kim's uke and I tried to keep up. Kim did percussion on a wine bottle he'd rescued from the dinner. We were mentioned in dispatches, incoming conference chair Torsten Reimer put out a call in the closing remarks for small instruments to be brought to <a href="https://or2024.openrepositories.org/">Gothenburg next year</a></em></p>
<p>Speaking of conference chairs, here's me and my conference besties, Kim and outgoing chair <a href="https://orcid.org/0000-0002-6969-7382">Claire Knowles</a>, about to ascend via cable car into the Table Mountain cloud. The view was OK at the bottom station, and the top reminded me of home in Katoomba, at 1000m above sea level, on a misty day -- some nice wet rocks and shrubs were on view. Also a youth group running about purposefully with a couple of cheerful leaders, possibly channeling the spirit of Baden-Powell. Anyway, they were totally lost; we were thinking of following them to safety when they ran back from the direction opposite to the one in which they'd departed and asked us the way to somewhere or other.</p>
<img src="https://ptsefton.com/2023/06/29/report-or-2023/image-7.png" alt="Me, Kim and Claire above Cape Town">
<p>We talked about the complexity of the current crop of repository systems, and how to use more distributed, less monolithic designs; Claire's planning something big but confidential for now (possibly with a <a href="https://github.com/wellcomecollection">similar design to this</a>), and Kim's been working with a <a href="https://www.knowledge-basket.co.nz/">regional archive</a> that uses quite simple underlying tech somewhat reminiscent of the approaches advocated by Suleman in his keynote (though we were yet to hear that).</p>
<p>Here are some penguins.</p>
<img src="https://ptsefton.com/2023/06/29/report-or-2023/image-8.png" alt="African Penguins on a rock">
<p>And I did a <a href="http://www.bokaapcookingtour.co.za/">Cape Malay cooking class/tour</a> of the Bo-Kaap neighbourhood with Zayed and family. Highly recommended.</p>
<img src="https://ptsefton.com/2023/06/29/report-or-2023/image-9.png" alt="Me flipping a roti while Zayed cheers me on">
<p><em>Here I am flipping a roti</em></p>
Towards a Generic Research Data Commons: A highly scalable standard-based repository framework for Language and other Humanities data2023-06-14T00:00:00+02:002023-06-14T00:00:00+02:00Peter Seftontag:ptsefton.com,2023-06-14:/2023/06/14/arkisto-stack-or-2023/index.html<p><a href="arkisto-stack-or-2023.pdf">Download as PDF</a></p>
<p>This presentation was delivered by Peter Sefton at the <a href="https://or2023.openrepositories.org/">Open Repositories 2023</a> conference in South Africa on 2023-06-14 in the <a href="https://www.conftool.net/or2023/index.php?page=browseSessions&form_session=460&presentations=show">Presentations: Discipline specific systems with FAIR principles
</a> session. It is also available on the <a href="https://www.ldaca.edu.au/posts/arkisto-stack-or-2023/">LDaCA website</a>.</p>
<p>This contains the slides and complete speaker notes, which have been edited after the conference.</p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<p>We will present a standards-based generalized architecture for large-scale data* repositories for research and preservation illustrated with real world examples drawn from a number of languages and cultural archive projects. This work is taking place in the context of the Australian Humanities and Social Sciences Research Data Commons, particularly the Language Data component thereof and the long-established PARADISEC cultural archive. The standards used include the Oxford Common File Layout for storage, Research Object CRATE (RO-Crate) for consistent linked-data description of FAIR digital objects, and a language data metadata profile to ensure long-term interoperability between systems and re-usability over time. We also discuss data licensing and authorization for access to non-open resources. We suggest that the approach shown here may be used in other disciplines or for other kinds of digital library, repository or archival systems.</p>
<p>*The submitted abstract did not have the word data here - added for clarity</p>
<p>By: Peter Sefton (University of Queensland), Simon Musgrave (University of Queensland & Monash University) & Nick Thieberger (University of Melbourne)</p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide01.png' alt='The Language Data Commons of Australia (LDaCA) and Australian Text Analytics Platform (ATAP) projects received investment (https://doi.org/10.47486/DP768 and https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS). ARC LIEF LE210100013 (2021-2024) Nyingarn: a platform for primary sources in Australian Indigenous languages ' title='Slide: 1' border='1' width='85%%'/>
<p>This work is supported by the Australian Research Data Commons.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide02.png' alt=' With thanks for their contribution: Partner Institutions: ' title='Slide: 2' border='1' width='85%%'/>
<p>The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the HASS (Humanities and Social Sciences) and Indigenous Research Data Commons (HASS+I RDC).</p>
<p>The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities.</p>
<p>The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artifacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations. These projects are led by Professor Michael Haugh of the School of Languages and Cultures at the University of Queensland with several partner institutions.</p>
<p>We would like to acknowledge the traditional custodians of the lands on which we live and work and the importance of Indigenous knowledge, culture and language to these projects. Peter Sefton lives and works on Wiradjuri land, and for Nick Thieberger and Simon Musgrave it's the land of the Kulin nation.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide03.png' alt='Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) Established 2003 Researchers concerned to digitise, preserve, and make accessible recordings in the many languages of the region around Australia No other agency taking responsibility for these recordings so they were at risk of loss Catalog exposes the existence of these recordings, 38,000 items in 690 collections Currently represent 1,350 languages, in 205 terabytes, with over 16,000 hours of audio recordings, 3,000 hours of video ' title='Slide: 3' border='1' width='85%%'/>
<p>PARADISEC is an online archive of cultural data which has been maintained for twenty years; in this presentation we will look at some of the lessons learned from PARADISEC. In summary -- the PARADISEC approach to simple data and metadata storage is something we want to continue in LDaCA, while the high cost to PARADISEC of commissioning and maintaining its own software stack is something we want to address by taking a more standards-based approach to managing language and other data over the coming decades.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide04.png' alt=' ' title='Slide: 4' border='1' width='85%%'/>
<p>The Arkisto platform started in 2019 as a way to capture the lessons of PARADISEC and other projects such as Alveo (another language data project similar in scope to LDaCA) which was presented at OR 2014: Sefton PM, Estival D, Cassidy S, Burnham D, Berghold J. The Human Communication Science Virtual Lab (HCS vLab): A repository microclimate in a rapidly evolving research-ecosystem. In: Open Repositories 2014. Helsinki; 2014 [cited 2016 Jul 19]. Available from: http://www.doria.fi/handle/10024/97740</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide05.png' alt=' ' title='Slide: 5' border='1' width='85%%'/>
<p>This diagram was used in the bid documents that established LDaCA - it shows the progression of data from end-of life projects and active repositories into a standards-based data-commons.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide06.png' alt=' ' title='Slide: 6' border='1' width='85%%'/>
<p>This is the data triage process we have been going through in LDaCA -- and it should be noted that of all the data we are presented with, most of it needs to be reworked into the Arkisto Standards Stack. Even PARADISEC, which in 2019 received the international <a href="https://www.coretrustseal.org/why-certification/certified-repositories/">Core Trust Seal</a> based on the <a href="http://www.coretrustseal.org/requirements/">DSA-WDS Core Trustworthy Data Repositories Requirements</a>, is still in the process of migrating data to more sustainable formats.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide07.png' alt=' ' title='Slide: 7' border='1' width='85%%'/>
<p>This is a taster of what data looks like in the kinds of repositories we are talking about. This site contains harvested metadata about holdings on Australian Indigenous Languages in University of Queensland Libraries.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide08.png' alt='' title='Slide: 8' border='1' width='85%%'/>
<p>The LDaCA services we are building use an API to drive the data portals. The API can be used for direct access with appropriate access control – see <a href="posts/fair-care-eresearch-2022">another eResearch presentation</a> which explains this in detail. These screenshots show code notebooks running in BinderHub on the Nectar cloud accessing language resources.</p>
<p>This work has also been <a href="https://digital.library.unt.edu/ark:/67531/metadc2114304/">written up</a> for the <em>2nd International Workshop on Digital Language Archives (LangArc 2023) virtual workshop on digital language archives</em> 2023-06-30.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide09.png' alt=' ' title='Slide: 9' border='1' width='85%%'/>
<p>This is the overall architecture for data storage and delivery -- missing is how data gets into the repository, but we’ll come to that later.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide10.png' alt=' ' title='Slide: 10' border='1' width='85%%'/>
<p>At this point I will introduce one of the themes of this talk. In March this year, <a href="https://bibwild.wordpress.com/2023/03/21/ocfl-and-source-of-truth-two-options/">this blog post was published</a> - looking at the pros and cons of using OCFL (the Oxford Common File Layout) as the “source of truth” for a system (say a repository).</p>
<p>We are very much taking the OCFL (that is, file-in-storage-as-the-source-of-truth) approach in LDaCA. Which raises the question: “But doesn’t that mean that it’s very specific to language data?” No, because we’re using a very flexible, extensible, discipline-neutral format for data description – yes, we have ways to specialise metadata and interfaces for language and other cultural metadata, but NO, the systems are not locked in to that mode of operation. This means we should be able to share development and maintenance more broadly than with a single archive.</p>
<p>Two main points we want to get across in this presentation:</p>
<ul>
<li>We are taking seriously the idea that data-in-storage should be “batteries included” – everything needed to preserve and use the data is stored together and systems can be reconstituted from this storage.</li>
<li>This approach IS generic – different vocabularies / schemas can be plugged-in by design.</li>
</ul>
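To make the “batteries included” idea concrete, here is a sketch of what a minimal <code>ro-crate-metadata.json</code> might contain. The <code>make_crate</code> helper is hypothetical; only the context/profile URLs and the schema.org property names come from the RO-Crate specification, and a real crate would carry much richer description.

```python
import json

# Sketch: a minimal RO-Crate metadata document. The description travels
# with the data, and domain profiles can add extra properties to any
# entity without breaking generic tools.
def make_crate(name, description, files):
    graph = [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "hasPart": [{"@id": f} for f in files],
        },
    ]
    # One File entity per payload file.
    graph += [{"@id": f, "@type": "File"} for f in files]
    return {"@context": "https://w3id.org/ro/crate/1.1/context",
            "@graph": graph}

crate = make_crate("Example corpus", "A demo collection",
                   ["texts/play1.xml"])
metadata_document = json.dumps(crate, indent=2)
```

The same generic structure serves language data, grant documents or history projects; only the vocabulary plugged into the entities changes.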
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide11.png' alt='📂 🔬 🔭 📹 💽 🖥️ ⚙️🎼🌡️🔮🎙️🔍🌏📡💉🏥💊🌪️ ' title='Slide: 11' border='1' width='85%%'/>
<p>So let’s now start looking at the standards involved in the Arkisto approach. This is a slide from “What is RO-Crate” – the dataset may contain any kind of data resource about anything, in any format, as a file or URL.</p>
<p>Stian Soiland-Reyes, Peter Sefton, Mercè Crosas, Leyla Jael Castro, Frederik Coppens, José M. Fernández, Daniel Garijo, Björn Grüning, Marco La Rosa, Simone Leo, Eoghan Ó Carragáin, Marc Portier, Ana Trisovic, RO-Crate Community, Paul Groth, Carole Goble (2022): Packaging research artefacts with RO-Crate. Data Science 5(2). https://doi.org/10.3233/DS-210053</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide12.png' alt=' ' title='Slide: 12' border='1' width='85%%'/>
<p>The core standard for this work is RO-Crate (Research Object Crate), in which all data is input, stored and output. This is a big step for eResearch systems – no longer is there a transformation step on data onboarding (we used the term ingest, but some project members and partners found the metaphor distasteful).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide13.png' alt=' Here the mechanism is to use the ‘magic’ name METS.xml to store some extra metadata – with a fully linked-data system this kind of thing is not needed ' title='Slide: 13' border='1' width='85%%'/>
<p>This screenshot is a bit of (undated) DSpace documentation found following a tip from Kim Sheppard – we have included it here to illustrate that storing additional metadata (in this case METS) for an object was done by convention. Using a linked-data system means that we no longer have to do this kind of thing – RO-Crate still has magic file names, but only one for the metadata and one for the HTML preview – everything else is labelled and extensible.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide14.png' alt='Using this core layer gives you interoperability with generic tools and general purpose “Who What Where” metadata ' title='Slide: 14' border='1' width='85%%'/>
<p>In the early days of the “Open Repositories” movement, repositories had Dublin Core metadata (a standard with a few different flavours).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide15.png' alt=' Using a domain specific profile extends the core RO-Crate for a specific type of data – eg language data, computational workflows or “cultural collections” (You can use more than one profile) ' title='Slide: 15' border='1' width='85%%'/>
<p>These days, using linked data, it is no longer necessary to have a bevy of XML schemas with incompatible encodings to store data from different schemas; different vocabularies and ontologies can co-exist and be expressed in a common way.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide16.png' alt='' title='Slide: 16' border='1' width='85%%'/>
<p>In the PARADISEC system this is achieved by storing files on disk in a simple hierarchy - with metadata and other resources stored together in a directory - this scheme allows for hands-on management of data resources, independently of the software used to serve them.</p>
<p>This approach means that if the PARADISEC software-stack becomes un-maintainable for financial or technical reasons the important resources, the data, are stored safely on disk with their metadata and a new access portal could be constructed relatively easily.</p>
<p>Despite the valuable features of this solution, it is not generalisable. The metadata.xml is custom to PARADISEC, as is the software stack.</p>
<p>In 2019 PARADISEC and the eResearch team at UTS received small grants from the Australian National Data Service and began collaborating on an approach to managing archival repositories which built on this PARADISEC approach of storing metadata with data.</p>
<p>The UTS team presented on this at <a href="https://ptsefton.com/2019/11/05/FAIR%20Repo%20-%20eResearch%20Presentation/index.html">eResearch Australasia 2019</a></p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide17.png' alt=' ' title='Slide: 17' border='1' width='85%%'/>
<p>The <a href="https://www.researchobject.org/ro-crate/1.1/structure.html">structure of an RO-Crate</a> is very similar to the PARADISEC example above, but with a json file instead of XML, and an optional preview in HTML.</p>
<p>RO-Crate has a growing number of <a href="https://www.researchobject.org/ro-crate/tools/">tools and software libraries</a> which means that a team such as PARADISEC does not have to maintain their own bespoke software.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide18.png' alt=' ' title='Slide: 18' border='1' width='85%%'/>
<p>Here, for comparison, is <a href="https://wiki.lyrasis.org/display/FEDORA6x/Fedora+OCFL+Object+Structure#FedoraOCFLObjectStructure-FedoraAtomicResource-Container">how Fedora 6 would store an object (an Atomic Resource in Fedora-speak) like this with multiple files</a>. Like RO-Crate, this uses linked data, but in this case split up into multiple files containing RDF triples. (This is similar to the pre-RO-Crate approach taken by the Research Object spec.)</p>
<p>This also shows some of what an OCFL repository looks like – this is an OCFL object with a single version.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide19.png' alt=' This is an RO-Crate Object which is stored as an OCFL Object ' title='Slide: 19' border='1' width='85%%'/>
<p>This screenshot shows an example of an Arkisto-style use of OCFL (all of the metadata is stored in the ro-crate-metadata.json rather than spread out as in Fedora).</p>
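As a rough illustration of what a single-version OCFL object records, the sketch below builds a minimal <code>inventory.json</code> in the spirit of the OCFL 1.0 spec (the manifest maps content digests to stored paths, and each version’s state maps digests back to logical file names). The <code>make_inventory</code> helper and the example identifier are invented for illustration.

```python
import hashlib

# Sketch: a minimal single-version OCFL inventory. A real OCFL object also
# has namaste files, an inventory sidecar digest, and content on disk.
def make_inventory(object_id, files):
    """files: dict mapping logical path -> bytes content."""
    manifest, state = {}, {}
    for path, content in files.items():
        digest = hashlib.sha512(content).hexdigest()
        manifest.setdefault(digest, []).append(f"v1/content/{path}")
        state.setdefault(digest, []).append(path)
    return {
        "id": object_id,
        "type": "https://ocfl.io/1.0/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": "v1",
        "manifest": manifest,
        "versions": {
            "v1": {"created": "2023-06-14T00:00:00Z", "state": state}
        },
    }

inv = make_inventory("arcp://name,example/object-1",
                     {"ro-crate-metadata.json": b"{}"})
```

In the Arkisto style shown on the slide, the whole description lives in that one <code>ro-crate-metadata.json</code> payload file, rather than being spread over multiple RDF files as in the Fedora example.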
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide20.png' alt=' ' title='Slide: 20' border='1' width='85%%'/>
<p>Now we come to the second core standard in our stack, the <a href="https://ocfl.io">Oxford Common File Layout</a> – which is something we found out about via Open Repositories – I couldn’t make the presentation, but I got a corridor briefing on this from Neil Jeffries in Bozeman at OR 2018.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide21.png' alt=' ' title='Slide: 21' border='1' width='85%%'/>
<p>This slide shows the interface between our core standards – a compliant OCFL repository has Objects within it that conform to the RO-Crate specification.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide22.png' alt=' ' title='Slide: 22' border='1' width='85%%'/>
<p>This slide illustrates the flexibility of the approach we’re taking. As LDaCA is a national project, our archival repositories and those of our partners such as PARADISEC will be distributed with differences of governance, varying by organisation, language type and discipline, though there is still a desire to be able to aggregate data into services that make it findable.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide23.png' alt=' S3-Style Object store Plain Old File Store ' title='Slide: 23' border='1' width='85%%'/>
<p>The storage services may not all be the same in this model, some may be file systems, some may be object stores, and they may be hosted by and governed by a variety of organizations.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide24.png' alt=' ' title='Slide: 24' border='1' width='85%%'/>
<p>This slide shows how we have abstracted the “A” for Access in FAIR out of the repository and into a separate centralised or at least <em>concentrated</em> system. We have a <a href="https://www.ldaca.edu.au/posts/fair-care-eresearch-2022/">full write-up of this approach from the 2022 eResearch Australasia conference</a> and we don’t have time to go through it in detail here, but in summary:</p>
<ul>
<li>Every object in the repository has a Data Reuse License with some management metadata.</li>
<li>Each repository only needs an authoritative list of licenses and trusted license management systems to be able to serve the data.</li>
<li>License management is handled by a dedicated system that can deal with application and invitation workflows to grant licenses (including simple self-serve click-through license agreements)</li>
</ul>
<p>Note that our work is also informed by the <a href="https://www.gida-global.org/care">CARE Principles for Indigenous Data Governance (Collective benefit, Authority to control, Responsibility, Ethics)</a>, which frame the way FAIR protocols are implemented. Again, see the <a href="https://digital.library.unt.edu/ark:/67531/metadc2114304/">LangArc workshop write-up</a>.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide25.png' alt=' Ok but how does the data get there? ' title='Slide: 25' border='1' width='85%%'/>
<p>Let’s revisit this diagram. What’s missing?</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide26.png' alt=' ' title='Slide: 26' border='1' width='85%%'/>
<p>In the first phase of the LDaCA project, work focused on batch import of data using tools to convert collections – this approach was used on contemporary collections as well as for “rescuing” collections from older repository systems.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide27.png' alt=' { "@id": "https://github.com/Language-Research-Technology/corpus-tools-ro-crate", "@type": "SoftwareSourceCode", "name": "https://github.com/Language-Research-Technology/corpus-tools-ro-crate", "description": "Converts an RO-Crate to an LDaCA OCFL collection as long as the crate has repository Objects and Collections that are members of a RepositoryCollection in the root dataset", "programmingLanguage": { "@id": "https://en.wikipedia.org/wiki/Node.js" } }, { "@id": "#provenance", "name": "Created RO-Crate using corpus-tools-ro-crate", "@type": "CreateAction", "instrument": { "@id": "https://github.com/Language-Research-Technology/corpus-tools-ro-crate" }, "result": { "@id": "ro-crate-metadata.json" } } The act of creation of this metadata is documented ' title='Slide: 27' border='1' width='85%%'/>
<p>This slide shows some JSON-LD metadata that describes the way this RO-Crate metadata was created – illustrating how RO-Crate can be used to record provenance.</p>
<p>(UPDATE: I didn't explain <a href="https://json-ld.org/">JSON-LD</a> properly during the presentation. JSON-LD is a method of encoding linked data (which can be quite esoteric and unapproachable) in JSON, a simple, widely used text format that programmers understand.)</p>
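To illustrate the point about JSON-LD: the <code>@context</code> is what maps short, programmer-friendly JSON keys to full linked-data IRIs. The toy expansion below is only the idea; real tools (such as the pyld library) implement the actual JSON-LD expansion algorithm.

```python
# A toy @context: short keys map to full schema.org IRIs.
context = {"name": "http://schema.org/name",
           "author": "http://schema.org/author"}

# An ordinary-looking JSON object describing an entity.
doc = {"@id": "#play1",
       "name": "Arden of Faversham",
       "author": {"@id": "#anon"}}

def expand_keys(node, ctx):
    """Replace each key with its IRI from the context (keywords pass through)."""
    return {ctx.get(k, k): v for k, v in node.items()}

expanded = expand_keys(doc, context)
```

So a programmer just sees friendly JSON, while linked-data tools see unambiguous global identifiers.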
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide28.png' alt=' ' title='Slide: 28' border='1' width='85%%'/>
<p>This part of the architecture we are working on now…</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide29.png' alt='' title='Slide: 29' border='1' width='85%%'/>
<p>Here we see the Crate-O metadata tool (which is a zero-install web application that runs in Chrome and other browsers that support the new FilesystemAPI) being used to add an Organization as the Affiliation for a Person entity. Having imported this "Context Entity" (that's the RO-Crate term), it can then be re-used within the crate, which we see here as the schema.org <code>publisher</code> property is linked to the same organization.</p>
<p>(At this stage Crate-O is still to be connected to the repository stack - that will happen in the second half of 2023)</p>
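The pattern described above – declaring a context entity once in the <code>@graph</code> and re-using it by <code>@id</code> reference – looks roughly like the sketch below. The identifiers and names here are invented for illustration, not taken from a real crate.

```python
# Sketch: one Organization entity, referenced twice by @id -- as the root
# dataset's publisher and as a Person's affiliation.
graph = [
    {"@id": "./", "@type": "Dataset",
     "publisher": {"@id": "https://example.org/org/1"}},
    {"@id": "#researcher", "@type": "Person", "name": "P. Researcher",
     "affiliation": {"@id": "https://example.org/org/1"}},
    {"@id": "https://example.org/org/1", "@type": "Organization",
     "name": "Example University"},
]

def resolve(graph, ref):
    """Follow an {"@id": ...} reference to its entity in the graph."""
    return next(e for e in graph if e["@id"] == ref["@id"])

publisher = resolve(graph, graph[0]["publisher"])
```

Because both properties point at the same entity, an editor like Crate-O only has to maintain the organization’s details in one place.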
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide30.png' alt=' ' title='Slide: 30' border='1' width='85%%'/>
<p>We hope to work with other editor projects (eg <a href="https://describo.github.io/#/">Describo</a>) to make editor profiles as compatible as possible.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide31.png' alt=' ' title='Slide: 31' border='1' width='85%%'/>
<p>The next series of slides show some examples of our approach implemented in a variety of contexts.</p>
<p>Here’s another repository that uses RO-Crate metadata (from the Language Data Commons of Australia / Australian Text Analytics Platform) – here users can launch a Jupyter notebook containing Python code (and explanatory text) that processes a dataset.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide32.png' alt=' SCREENSHOT OF NOTEBOOK ' title='Slide: 32' border='1' width='85%%'/>
<p>This is a screenshot of a Jupyter notebook that can process data from a repository via its API.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide33.png' alt=' ' title='Slide: 33' border='1' width='85%%'/>
<p>This slide shows the Arkisto stack powering the University of Technology Sydney’s Research Data Repository.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide34.png' alt=' ' title='Slide: 34' border='1' width='85%%'/>
<p>This page shows some screenshots of an internal-only application at UTS which gives academic staff access to successful research grant proposals – the data are stored in the same kind of Arkisto standards-based storage stack as we have presented here – with an interface that is tuned for this use case, with some custom access control to make sure that staff are <em>very</em> aware that these are sensitive and confidential documents.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide35.png' alt=' ' title='Slide: 35' border='1' width='85%%'/>
<p>This is a screenshot of data from a history project <a href="https://expertnation.org/">Expert Nation</a> exported to RO-Crate format and <a href="https://expertnation.research.uts.edu.au/">put online to support a book</a>.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide36.png' alt=' Want to join us? Your project here? Get in touch… ' title='Slide: 36' border='1' width='85%%'/>
<p>In conclusion, we have given a quick tour of a standards-based repository stack (loosely called Arkisto) and illustrated it with current work at the Language Data Commons of Australia and PARADISEC projects, but along the way we have tried to emphasise that this is a generic, re-usable architecture based on standards. By using an extensible metadata standard with a growing community, and a storage-layer standard forged from an acquired aversion to systems migration, we aim to reduce the risk to very important cultural data by working with as many communities as possible on software tools, so that we reduce cost and risk for all of us.</p>
</section>
Introducing the Oni Repository Stack2023-06-13T00:00:00+02:002023-06-13T00:00:00+02:00Peter Seftontag:ptsefton.com,2023-06-13:/2023/06/13/oni-dev-track-or-2023/index.html<p>By:</p>
<ul>
<li>Peter Sefton</li>
<li>Moises Sacal Bonequi</li>
<li>Alvin Sebastian</li>
<li>Mark Raadgever</li>
</ul>
<p>This presentation was delivered by Peter Sefton at Open Repositories 2023.</p>
<h3>Abstract</h3>
<p>In this presentation we will show some of the general purpose repository tooling used to manage repository data for the Language Data Commons of Australia and the Australian Text Analytics Platform. We have a standards-based repository stack which is used to make research data available for human and machine use. The main part of the stack is “Oni” https://github.com/Arkisto-Platform/oni which builds an access-controlled REST API from an Oxford Common File Layout (OCFL) data store (which consists of data objects saved as files on disk or in object storage), with data objects described using the RO-Crate metadata standard. Data is indexed into a Postgres-driven API for low-level access, and a full discovery index implemented in Elasticsearch, with the ability to create access portals in your web framework of choice. We will demonstrate rapid creation of large scale repositories using batch tooling, as well as using a metadata entry tool known as Describo to produce RO-Crate linked-data descriptions.</p>
<p>This slide shows the "small pieces, loosely joined" style of the Oni repository -- it is based on an OCFL data store for digital objects on disk or object storage, with RO-Crate metadata for each object.</p>
<h1>The architecture</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/oni_diagrams_oni-architecture-2.svg"/>
<p>This diagram shows the architecture of the Oni system. The name Oni started as an acronym -- OCFL + NGINX (a web server) + Index (eg Solr or Elasticsearch) -- but we no longer use NGINX, and Oni happens to be a kind of ogre which has its own emoji 👹.</p>
<p>This demonstration shows an example of how to stand up a repository for 300 documents, in this case plays in TEI XML format which we got from <a href="https://orcid.org/0000-0002-9336-1678">Professor Hugh Craig</a>. The first steps involve getting the data into an Oxford Common File Layout <a href="https://ocfl.io">(OCFL)</a> repository with each "object" (a play) in the repository described using Research Object Crate metadata: RO-Crate.</p>
<h1>Some data – ~300 plays from the 1500s</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/list-metadata.gif"/>
<p>This screen recording shows a command line session; listing the contents of a data directory full of XML and peeking into the CSV metadata supplied with the files by Professor Craig.</p>
<h1>Using RO-Crate-excel, execute a few maneuvers</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/rocxl.mov.gif"/>
<p>In this recording, we use the RO-Crate Excel tool to generate an Excel workbook listing all the files.</p>
<h1>Paste in the researcher's data</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/sheet-detail.png"/>
<p>Using Excel, we can manipulate data in a transparent way to get it ready for conversion into RO-Crate format -- the RO-Crate Excel tool uses some conventions that mean we can "show our working" in this process, and mark some of the more esoteric metadata as hidden (for now), though it is still available in the researcher's original ad-hoc CSV format.</p>
<h1>Fine tune using Crate-O ...</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/crate-o-org.mov.gif"/>
<p>Here we see the Crate-O metadata tool (which is a zero-install web application that runs in Chrome and other browsers that support the new FilesystemAPI) being used to add an Organization as the Affiliation for a Person entity. Having imported this "Context Entity" (that's the RO-Crate term for this type of contextual metadata), it can then be re-used within the crate, which we see here as the schema.org <code>publisher</code> property is linked to the same organization.</p>
<h1>Here's where you get Crate-O</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/crate-o-site.png"/>
<p>You can get the Crate-O source or try it out <a href="https://github.com/Language-Research-Technology/crate-o">at this github repo</a>.</p>
<h1>… and you get an RO-Crate for the data</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/rochtml.mov.gif"/>
<p>This slide shows generating an HTML preview file that summarizes the data -- the RO-Crate is a JSON-LD file that was created from the spreadsheet shown above and tweaked using Crate-O. JSON-LD is linked data in JSON format; this is what RO-Crate uses to make linked data approachable for a general programming audience.</p>
<h1>Then using corpus-tools-ro-crate, make an OCFL repo</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/make-plays.mov.gif"/>
<p>This slide shows another script (via a make file that supplies a set of command-line parameters) which takes the RO-Crate and "explodes" it into a set of OCFL (Oxford Common File Layout) directories in a "Storage Root".</p>
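In spirit, the "explode" step looks something like the following sketch: walk the root crate's graph, and write each repository object out as its own crate directory. This is a much-simplified, hypothetical stand-in for corpus-tools-ro-crate, which also handles OCFL versioning, checksums and payload files.

```python
import json
import pathlib
import tempfile

# Sketch: split a root RO-Crate into one crate directory per
# RepositoryObject under a storage root.
def explode(crate, storage_root):
    root = pathlib.Path(storage_root)
    for entity in crate["@graph"]:
        if "RepositoryObject" in entity.get("@type", []):
            obj_dir = root / entity["@id"].strip("#")
            obj_dir.mkdir(parents=True, exist_ok=True)
            # Each object becomes a tiny crate of its own.
            obj_crate = {"@context": crate["@context"], "@graph": [entity]}
            (obj_dir / "ro-crate-metadata.json").write_text(
                json.dumps(obj_crate, indent=2))

crate = {"@context": "https://w3id.org/ro/crate/1.1/context",
         "@graph": [{"@id": "#play-1", "@type": ["RepositoryObject"],
                     "name": "A play"}]}
out = tempfile.mkdtemp()
explode(crate, out)
```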
<h1>This is the OCFL file layout</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/ocfl-screenshot.png"/>
<p>Here's what an OCFL repository might look like during development -- I built this on the plane to South Africa, somewhere over the Southern Ocean, and you can see that my tweaks to the code resulted in several versions of the OCFL/RO-Crate objects being created. In this recording I navigate to a file, open the RO-Crate Metadata Document, and inspect the metadata profile that it links to from the <code>conformsTo</code> property.</p>
<h1>Start up 👹 and index stuff</h1>
<p>Type, like:</p>
<pre><code>> docker compose up
... Screenfuls of stuff
> node structural-index.js
{ message: 'Started: database indexer' }
</code></pre>
<h1>Et Voila!</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/portal.mov.gif"/>
<p>This is a search portal for the plays, with Elasticsearch providing full-text search and ~~facets~~ aggregations.</p>
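For readers who haven't followed the facets-to-aggregations terminology shift, an Elasticsearch query body combining full-text search with a terms aggregation looks roughly like this. The field names (<code>text</code>, <code>author.keyword</code>) are assumptions for illustration, not the portal's actual schema.

```python
# Sketch: an Elasticsearch search body -- a full-text match plus a
# "terms" aggregation that plays the role facets did in older stacks.
query = {
    "query": {"match": {"text": "murder"}},
    "aggs": {
        "by_author": {"terms": {"field": "author.keyword"}}
    },
}
```

A portal UI renders the aggregation buckets as clickable filters alongside the hit list.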
<p>In conclusion, this repository stack is quite different from DSpace, ePrints and other repository systems where everything is built in to one application -- the approach is more like the Unix one: small pieces, loosely joined.</p>
<h1>Tools used here</h1>
<h2>The excel-to-crate tooling:</h2>
<p>https://github.com/Language-Research-Technology/ro-crate-excel</p>
<h2>The plays example</h2>
<p>https://github.com/Language-Research-Technology/corpus-tools-example-plays</p>
<h1>More tools</h1>
<h2>The thing that turns RO-Crate into an OCFL repo:</h2>
<p>https://github.com/Language-Research-Technology/corpus-tools-ro-crate</p>
<h2>The Oni stack, OCFL library, API and Elasticsearch:</h2>
<p>https://github.com/Language-Research-Technology/oni-ui</p>
Packaging data with detailed metadata using RO-Crate in FAIR open repositories2023-06-13T00:00:00+02:002023-06-13T00:00:00+02:00Peter Seftontag:ptsefton.com,2023-06-13:/2023/06/13/ro-crate-or-2023/index.html<p><a href="https://ptsefton.com/2023/06/13/ro-crate-or-2023/ro-crate-or-2023.pdf">PDF version</a></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide00.png' alt='Packaging data with detailed metadata using RO-Crate in FAIR open repositories Peter Sefton1, Stian Soiland-Reyes2 1: The University of Queensland, Australia; 2: The University of Manchester, UK ' title='Slide: 0' border='1' width='85%%'/>
<p>This presentation was delivered by Peter Sefton at Open Repositories 2023: it includes slides adapted from other RO-Crate presentations by Stian Soiland-Reyes and others - but here “I” means Sefton.</p>
<h2>Abstract</h2>
<p>Research Object Crate (RO-Crate) is a community effort and specification to practically achieve FAIR packaging of research objects (digital objects like data, methods, software) with structured metadata and context. RO-Crate uses well-established Web standards and FAIR principles. For common metadata representations, RO-Crate builds on schema.org, a mature and general mark-up vocabulary used by search engines, including Google Dataset Search. RO-Crate is adopted by many research projects as a pragmatic implementation of the FAIR principles that can be both general for interoperable exchange and extensible for domain-specific archiving.
RO-Crate development began in early 2019, when a workshop at Open Repositories 2019 in Hamburg generated a significant number of use-cases and expressions of interest from the OR community. This presentation will introduce RO-Crate, its continuing development and rapid adoption since 2019, report on how it is now being used in repository software, and the potential for further use in repository platforms that will be familiar to OR attendees.</p>
<h2>Outline</h2>
<p>In this presentation we’ll cover:</p>
<ul>
<li>A quick run thru what RO-Crate is and what it is for</li>
<li>New developments:
<ul>
<li>Version 1.2 is coming</li>
<li>Profiles are seeing a lot of activity</li>
<li>Tooling continues to improve</li>
</ul>
</li>
</ul>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide01.png' alt='Is it FAIR to use all these repositories? https://fairsharing.org/ https://faircookbook.elixir-europe.org/ https://www.re3data.org/ ' title='Slide: 1' border='1' width='85%%'/>
<p>Researchers are asked to make their research outputs – including publications – FAIR. But where to publish?</p>
<p>They have to choose between thousands of public, institutional and domain-specific repositories, with help from guidance and catalogues.</p>
<p>(FAIRsharing, re3data, FAIR Cookbook)</p>
<p>…but how to gather and reference outputs across multiple repositories? And what about contextual information?</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide02.png' alt='Describe and package data collections, datasets, software etc. with their metadata Platform-independent object exchange between repositories and services Support reproducibility and analysis: link data with codes and workflows Transfer of sensitive/large distributed datasets with persistent identifiers Aggregate citations and persistent identifiers Propagate provenance and existing metadata Publish and archive mixed objects and references Reuse existing standards, but hide their complexity Aims of FAIR Research Objects ' title='Slide: 2' border='1' width='85%%'/>
<p>These are our aims:</p>
<ul>
<li><em>Describe</em> and <em>package</em> data collections, datasets, software etc. with their <em>metadata</em> (And remember in the context of Open Repositories: publications are data too)</li>
<li><em>Platform-independent</em> object exchange between repositories and services</li>
<li>Support <em>reproducibility</em> and <em>analysis</em>: link data with codes and workflows</li>
<li>Transfer of <em>sensitive/large</em> distributed datasets with persistent identifiers</li>
<li>Aggregate <em>citations</em> and <em>persistent identifiers</em></li>
<li>Propagate <em>provenance</em> and <em>existing metadata</em></li>
<li>Publish and archive <em>mixed objects</em> and references</li>
<li>Reuse existing <em>standards</em>, but hide their complexity</li>
</ul>
<p>We're trying to be fairly platform-independent, and we're not too tied into a particular way of storing or identifying these components. We do want to have enough information for reproducibility ,and to support data that are coming in from different sources, that may not even be accessible directly because they require authorization.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide03.png' alt='' title='Slide: 3' border='1' width='85%%'/>
<p>The idea of the Research Object (RO) is to gather data in a kind of virtual package. This may include some actual files, and it may include outgoing references; these are related together and given brief descriptions. That way we know what the data are, and what role they play in <em>this</em> collection.</p>
<p>(I presented RO-Crate to a senior research technology leader recently who had not yet heard of RO-Crate – and they stopped me and asked “why is there <em>Research</em> in the name?” – pointing out that RO-Crate is obviously applicable to more than research use cases. The answer lies in the genealogy; RO-Crate is a merger between the Research Object line of work at the University of Manchester and DataCrate from the University of Technology Sydney – the technology is not inherently specific to research, but the motivations, particularly the FAIR principles, do come from the research world.)</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide04.png' alt='What's new ' title='Slide: 4' border='1' width='85%%'/>
<p>This slide shows a screenshot of the RO-Crate specification. The spec is designed to be an implementation guide that builds on other standards – we will continue to work on making this as simple as possible for tool developers (we admit parts of it have started to get a bit complex as we take on more use cases).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide05.png' alt='Using common formats and vocabularies .. extending only when needed ' title='Slide: 5' border='1' width='85%%'/>
<p>We use the common vocabularies, but only extend where we need to.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide06.png' alt='TOOLS ' title='Slide: 6' border='1' width='85%%'/>
<p>RO-Crate now has a very healthy community – the spec is developed through an open process with fortnightly calls and a GitHub repository.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide07.png' alt='Python-lib ' title='Slide: 7' border='1' width='85%%'/>
<p>We have regular calls – a “main” monthly call and a Euro-focussed call. People call in from all over Europe, the US and Australia.</p>
<p>Post-conference note: this is obviously not what you'd call global coverage. Claire Knowles pointed out to me the number of people at OR who were standing in Africa talking about their 'global' projects, which often have low-to-zero representation outside of North America and Europe (we'll count Australasia as part of that, as Australia's in Eurovision).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide08.png' alt=' ' title='Slide: 8' border='1' width='85%%'/>
<p>There is a growing body of work on RO-Crate – this Zenodo repository captures part of it – and it’s starting to show up in repositories and presentations in a lot of research contexts.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide09.png' alt=' ' title='Slide: 9' border='1' width='85%%'/>
<p>WorkflowHub is an example of a repository (though it calls itself a registry) – it contains scientific workflows.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide10.png' alt=' RO-Crate is built-in ' title='Slide: 10' border='1' width='85%%'/>
<p>Here's an example of a workflow in the WorkflowHub registry/repository – there’s a download button to get a workflow in RO-Crate format. Note the ‘sketch’ which illustrates the workflow.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide11.png' alt=' There’s an HTML page included in the RO-Crate Download that makes the crate human readable ' title='Slide: 11' border='1' width='85%%'/>
<p>If you download this workflow crate you get a preview file like the one shown, including the precis "sketch" of the workflow and links to the files – e.g. the "Main Workflow" link. This shows the benefits of RO-Crate – every download has a machine-readable metadata file, and there's a human-readable web page to go with it. If you find this on your computer in 10 years' time there is information there, in a standardised format, about what it is and where it came from.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide12.png' alt=' RO-Crate Built-in here as well at RO-Hub ' title='Slide: 12' border='1' width='85%%'/>
<p>This is an <a href="https://reliance.rohub.org/fb3a8b1f-7132-4c0e-80c8-33ff294808da?activetab=overview">item from RO-Hub</a></p>
<p>The EOSC project RELIANCE uses RO-Crate to package data cubes of earth observation data, along with documentation, images and workflows.</p>
<p>It connects to related infrastructures for interactive execution/analysis.</p>
<p>Metadata includes temporal coverage, spatial coverage and vertical coverage.</p>
<p>ROHub publishes the archived RO-Crates to general-purpose repositories (Zenodo, B2Share) for longevity and PIDs.</p>
<p>(The RO-Crate preview file in this service could use work; it’s a raw representation of the JSON metadata but is still better than the old days)</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide13.png' alt=' ' title='Slide: 13' border='1' width='85%%'/>
<p>In the above examples we showed how resources can be downloaded from repositories in RO-Crate format – but there is still no widely accepted standard in place to join the dots between, say, a DOI for a dataset and an actual download of that data. DOIs resolve to web pages, not data streams – the RO-Crate community is actively engaged in joining these dots with work on FAIR Signposting, establishing protocols for automated signalling of where data can be downloaded.</p>
<p>Please join the <a href="https://join.slack.com/t/fair-impact-support/shared_invite/zt-1s86x15a8-pJdpSns3tdZXgAoruHtuD">Slack conversation</a> if you’d like to talk to us about this.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide14.png' alt=' ' title='Slide: 14' border='1' width='85%%'/>
<p>We have just seen an example of an ATTACHED crate – you might call it RO-Crate “classic”; this was the starting point for RO-Crate, its first use case as a packaging format. In an Attached Crate, data resources are included alongside the RO-Crate-metadata.json file. When we introduced RO-Crate at OR2019 in Hamburg this was the ONLY kind of crate.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide15.png' alt=' ' title='Slide: 15' border='1' width='85%%'/>
<p>Detached Crates, on the other hand, have resources that are NOT local. For example, an RO-Crate metadata document downloaded from an API might reference resources available from the API.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide16.png' alt=' References file streams from the API ' title='Slide: 16' border='1' width='85%%'/>
<p>This is what a “Detached RO-Crate” looks like over an API – in this case one that is showing a collection of plays in English from the 1500s (this data features in another presentation given at Open Repositories, <a href="/2023/06/13/oni-dev-track-or-2023/">a demonstration</a> illustrating the technical details of an RO-Crate-based repository architecture).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide17.png' alt=' ' title='Slide: 17' border='1' width='85%%'/>
<p>This diagram sketches the architecture of the <a href="https://atap.edu.au">Australian Text Analytics Platform</a>, which is part of the <a href="https://ldaca.edu.au">Language Data Commons of Australia</a>, and shows the integration between data repositories (in green, on the right) and code execution environments (in red, on the left). The integration between these things is via documentation – and standards-based metadata (including, of course, RO-Crate).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide18.png' alt=' https://www.researchobject.org/ro-crate/tools/ ' title='Slide: 18' border='1' width='85%%'/>
<p>We have been talking about RO-Crate tools – here’s the list from the website. Like any list of tools it can be hard to keep up to date (for example, I am talking about the Crate-O tool here but it is not yet on the list). Here’s the RO-Crate tools page: <a href="https://www.researchobject.org/ro-crate/tools/">https://www.researchobject.org/ro-crate/tools/</a></p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide19.png' alt=' ' title='Slide: 19' border='1' width='85%%'/>
<p>Here’s another repository that uses RO-Crate metadata (from the Language Data Commons of Australia / Australian Text Analytics Platform) – users can launch a Jupyter notebook in a BinderHub execution environment. The notebook fetches a Detached RO-Crate metadata document, processes it to select further resources to fetch, and then fetches them from the API.</p>
</section>
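<p>The filtering step can be sketched roughly like this (not the actual notebook code): index the flat <code>@graph</code> by <code>@id</code>, follow the metadata descriptor's "about" link to the root dataset, and collect the parts worth downloading. The sample document, URLs and the use of schema.org <code>encodingFormat</code> as the filter are assumptions for illustration:</p>

```python
# Sketch only: select resources from a Detached RO-Crate metadata document.
# The sample crate and its URLs are invented for the example.

def select_parts(crate, encoding_format):
    """Return the @ids of the root dataset's parts with a given format."""
    entities = {e["@id"]: e for e in crate["@graph"]}
    # follow the metadata descriptor's "about" link to the root dataset
    root = entities[entities["ro-crate-metadata.json"]["about"]["@id"]]
    return [
        p["@id"]
        for p in root.get("hasPart", [])
        if entities[p["@id"]].get("encodingFormat") == encoding_format
    ]

sample = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
         "about": {"@id": "./"}},
        {"@id": "./", "@type": "Dataset",
         "hasPart": [{"@id": "https://example.org/play.txt"},
                     {"@id": "https://example.org/play.xml"}]},
        {"@id": "https://example.org/play.txt", "@type": "File",
         "encodingFormat": "text/plain"},
        {"@id": "https://example.org/play.xml", "@type": "File",
         "encodingFormat": "application/xml"},
    ],
}

# the plain-text resources the notebook would go on to fetch from the API
print(select_parts(sample, "text/plain"))  # prints ['https://example.org/play.txt']
```

<p>Because the crate is Detached, each selected <code>@id</code> is itself a URL the notebook can fetch.</p>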
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide20.png' alt=' SCREENSHOT OF NOTEBOOK ' title='Slide: 20' border='1' width='85%%'/>
<p>This is a screenshot of the notebook, using the Python RO-Crate library to consume data from the API.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide21.png' alt=' ' title='Slide: 21' border='1' width='85%%'/>
<p>The RO-Crate Python library has lots of functionality for doing actual data packaging – it has a file-system interface (we mention this as it is different from the approach taken in the Javascript library).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide22.png' alt=' ' title='Slide: 22' border='1' width='85%%'/>
<p>And ro-crate-py has a command-line interface for making RO-Crates step by step.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide23.png' alt=' ' title='Slide: 23' border='1' width='85%%'/>
<p>RO-Crate-js (Javascript) takes a different approach – it is much more abstract, and has no direct connection to the file system.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide24.png' alt=' ' title='Slide: 24' border='1' width='85%%'/>
<p>RO-Crate Excel creates a crate from a directory of files, and allows existing ad hoc tabular metadata to be added to RO-Crates.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide25.png' alt='Crate-O – TODO when this is downloaded as a PPT ' title='Slide: 25' border='1' width='85%%'/>
<p>Here we see the Crate-O metadata tool (a zero-install web application that runs in Chrome and other browsers that support the new File System API) being used to add an Organization as the affiliation for a Person entity. Having imported this "Contextual Entity" (that's the RO-Crate term), it can be re-used within the crate: here the schema.org <code>publisher</code> property is linked to the same organization, with the ROR (Research Organization Registry) identifier https://ror.org/00eae9z71.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide26.png' alt=' Join us! ' title='Slide: 26' border='1' width='85%%'/>
<p>If you’d like to join in or contact us, choose one of the options on the <a href="https://www.researchobject.org/ro-crate/community.html">RO-Crate Community page</a> – e.g. <a href="https://join.slack.com/t/seek4science/shared_invite/zt-csqh94qb-kf~kFbZxuHl1Hpxhbc8avw">join the Slack</a>.</p>
<p>And to cite RO-Crate:</p>
<p>Stian Soiland-Reyes, Peter Sefton, Mercè Crosas, Leyla Jael Castro, Frederik Coppens, José M. Fernández, Daniel Garijo, Björn Grüning, Marco La Rosa, Simone Leo, Eoghan Ó Carragáin, Marc Portier, Ana Trisovic, RO-Crate Community, Paul Groth, Carole Goble (2022): Packaging research artefacts with RO-Crate. Data Science 5(2). https://doi.org/10.3233/DS-210053</p>
</section>
Designing a metadata ecosystem for language research based on Research Object Crate (RO-Crate) (2022-11-25, Peter Sefton)<p><a href="ldaca-metadata-ecosystem-eresearch-2022.pdf">PDF version</a></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide00.png' alt='Designing a metadata ecosystem for language research based on Research Object Crate (RO-Crate) Peter Sefton, Nick Thieberger, Marco La Rosa, Simon Musgrave, River Tae Smith, Moises Sacal Bonequi ' title='Slide: 0' border='1' width='85%%'/>
<p>By Peter Sefton, Nick Thieberger, Marco La Rosa, Simon Musgrave, River Tae Smith, Moises Sacal Bonequi – delivered by Peter Sefton at eResearch 2022 in Brisbane</p>
<p>This work is licensed under CC BY 4.0. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/</p>
<p>This presentation will look at how a metadata standard – RO-Crate – with a metadata profile (the Language Data Commons profile) is being developed and implemented. Two major collections, PARADISEC and the Language Data Commons of Australia (LDaCA), are collaborating on the standard. This ongoing standardisation effort for language data is designed to improve interoperability, reduce costs for data migration and allow storage on disk, in object storage or in archival repositories.</p>
<p><a href="https://www.researchobject.org/ro-crate/">RO-Crate</a> is a linked-data metadata system which allows discovery metadata (who, what, where) based on the widely adopted Schema.org vocabulary to be seamlessly integrated with more discipline-specific metadata. RO-Crate uses metadata profiles to provide guidance for packaging resources for particular disciplines and purposes.</p>
<p>In this presentation we will introduce a RO-Crate metadata profile for language data which extends the core RO-Crate standard with new vocabulary terms adapted from pre-linked-data discipline specific metadata efforts, particularly the Open Language Archives Community (OLAC) standards. The profile has English-language guidance on how to structure collections of resources in a repository with links between them, such that they can be indexed and displayed via APIs and search/browse portals. The profile is also implemented as a series of machine-readable profiles for the Describo Online metadata description system.</p>
<p>We will demonstrate current ways of describing items in a variety of languages and modes (spoken, written and signed), from a large set of heterogeneous language resources held by PARADISEC and LDaCA. We will also show how to access them via API calls and a search portal, and how resources may be stored in simple storage systems using the Arkisto platform (a set of standards and principles).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide01.png' alt='The Language Data Commons of Australia (LDaCA) and Australian Text Analytics Platform (ATAP) projects received investment (https://doi.org/10.47486/DP768 and https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS). ARC LIEF LE210100013 (2021-2024) Nyingarn: a platform for primary sources in Australian Indigenous languages ' title='Slide: 1' border='1' width='85%%'/>
<p>This work is supported by the Australian Research Data Commons.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide02.png' alt=' With thanks for their contribution: Partner Institutions: ' title='Slide: 2' border='1' width='85%%'/>
<p>The Language Data Commons of Australia Data Partnerships (<a href="https://doi.org/10.47486/HIR001">LDaCA</a>) and the Australian Text Analytics Platform (<a href="https://doi.org/10.47486/PL074">ATAP</a>) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC).</p>
<p>The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities.</p>
<p>The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations. These projects are led by Professor Michael Haugh of the School of Languages and Culture at the University of Queensland with several partner institutions.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide03.png' alt='Pacific and Regional Archive for Digital Sources in Endangered Cultures Running for 20 years 1,337 languages represented 675 collections 37,510 items 405,289 files 15,540 hours (audio) 2,465 hours (video) 193 TB October 2022 ' title='Slide: 3' border='1' width='85%%'/>
<p>This page shows a <a href="https://www.youtube.com/watch?v=CX-CODBwOVU&t=7s">YouTube demo of the PARADISEC web site</a>.</p>
<p>PARADISEC (the Pacific And Regional Archive for Digital Sources in Endangered Cultures) is a digital archive of records of some of the many small cultures and languages of the world and it has developed models to ensure that the archive can provide access to interested communities while also conforming with emerging international standards for digital archiving. Australian researchers have been making unique and irreplaceable audiovisual recordings in the region since portable field recorders became available in the mid-twentieth century, yet until the establishment of PARADISEC there was no Australian repository for these invaluable research recordings.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide04.png' alt=' ' title='Slide: 4' border='1' width='85%%'/>
<p>Goal: Be able to store data with an eye on preservation</p>
<p>In an archive like PARADISEC it is important to be able to maintain resources over the long term. For example, much material which falls within the scope of PARADISEC is stored on legacy media. PARADISEC archives tapes from a range of sources, such as the agencies in the Pacific shown in the images above. Such material needs to be digitised and returned to the source with meaningful metadata.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide05.png' alt='PARADISEC ACCESS https://language-archives.services/about/data-loader ' title='Slide: 5' border='1' width='85%%'/>
<p>PARADISEC has learned the importance of making the collection self-describing so it is not dependent on a database as the sole metadata source. It does use a database for administrative services, from which a text file with metadata for any item can be exported. This allows us to select an arbitrary set of items, put them on a hard disk, and use the data-loader application to generate an HTML catalog of just those items, drawing on the internal metadata file describing each item. This can be delivered on a hard disk to a local community or cultural organisation, or on a Raspberry Pi wifi local network to allow access on phones, as seen here in Erakor village in central Vanuatu.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide06.png' alt='ELAR: limited search capability, non-standard metadata schema, no ability to index annotation files, no bulk download LDaCA: rich metadata-first search, portable RO-Crate metadata, indexed annotations, bulk downloading of search results AUSLAN CORPUS ACCESS ' title='Slide: 6' border='1' width='85%%'/>
<p>Another example of how good metadata practice can improve community access is the Auslan (Australian Sign Language) corpus, for which community access is very important.</p>
<p>The Auslan Corpus has been stored with the Endangered Languages Archive (<a href="https://www.elararchive.org/">ELAR</a>) since 2008. However, ELAR does not currently suit the access needs of the Auslan corpus; it has low discoverability, and files must be downloaded individually. The corpus, along with the Auslan SignBank dictionary, is being included in LDaCA.</p>
<p>The Auslan Corpus holds great value as an educational tool for Auslan users and learners, both Deaf and hearing, and the move to LDaCA will allow further development of educational tools. One such tool is the ability, still under development, for Auslan Signbank dictionary to pull real-world examples of signs out of the corpus to show alongside dictionary entries.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide07.png' alt='' title='Slide: 7' border='1' width='85%%'/>
<p>For all of the collections we are working with, data is discoverable via some kind of web portal which indexes and displays the archive (repository) of data. These screenshots are of work in progress at LDaCA.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide08.png' alt='' title='Slide: 8' border='1' width='85%%'/>
<p>The LDaCA services we are building use an API to drive the data portals. The API can be used for direct access with appropriate access control – see <a href="posts/fair-care-eresearch-2022">another eResearch presentation</a> which explains this in detail. These screenshots show code notebooks (running in BinderHub on the Nectar cloud) accessing language resources.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide09.png' alt='' title='Slide: 9' border='1' width='85%%'/>
<p>Having looked at the user-facing products, websites and APIs, we turn our attention to how data is managed on disk.</p>
<p>In the PARADISEC system this is achieved by storing files on disk in a simple hierarchy - with metadata and other resources stored together in a directory - this scheme allows for hands-on management of data resources, independently of the software used to serve them.</p>
<p>This approach means that if the PARADISEC software-stack becomes un-maintainable for financial or technical reasons the important resources, the data, are stored safely on disk with their metadata and a new access portal could be constructed relatively easily.</p>
<p>Despite the valuable features of this solution, it is not generalisable. The metadata.xml is custom to PARADISEC, as is the software stack.</p>
<p>In 2019 PARADISEC and the eResearch team at UTS received small grants from the Australian National Data Service and began collaborating on an approach to managing archival repositories which built on this PARADISEC approach of storing metadata with data.</p>
<p>The UTS team presented on this at <a href="https://ptsefton.com/2019/11/05/FAIR%20Repo%20-%20eResearch%20Presentation/index.html">eResearch Australasia 2019</a></p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide10.png' alt=' ' title='Slide: 10' border='1' width='85%%'/>
<p>For this Research Data Commons work we are using the Arkisto Platform (introduced <a href="http://ptsefton.com/2020/11/23/Arkisto/index.html">at eResearch 2020</a>).</p>
<p>Arkisto aims to ensure the long term preservation of data independently of code and services, recognizing the ephemeral nature of software and platforms. We know that sustaining software platforms can be hard and aim to make sure that important data assets are not locked up in databases or hard-coded logic of some hard-to-maintain application.</p>
<p>Inspired by PARADISEC’s approach the Arkisto platform is based on the idea of storing data in simple easy to manage file or object storage systems with metadata in an easily readable standard format.</p>
<p>The LDaCA repositories use the Oxford Common File Layout (<a href="https://ocfl.io/">OCFL</a>) standard, which is backed and used by a number of universities and has multiple implementations. PARADISEC data will be migrated to a simpler data storage approach, <a href="https://github.com/CoEDL/nocfl-js">NOCFL</a> – a single-library implementation inspired by some of the same aims, but with different implementation choices to avoid data being obfuscated by OCFL’s layout, which is a product of its commitment to immutable, write-once file management.</p>
</section>
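<p>For readers unfamiliar with OCFL, an object on disk looks roughly like this (a sketch following the OCFL v1.0 object layout; the payload file names are invented). The versioned, content-addressed structure is what guarantees immutability, and is also the indirection that NOCFL trades away for legibility:</p>

```text
object-root/
├── 0=ocfl_object_1.0          # "namaste" conformance declaration
├── inventory.json             # manifest: logical paths -> content digests
├── inventory.json.sha512
└── v1/                        # each change adds a new version directory
    ├── inventory.json
    ├── inventory.json.sha512
    └── content/
        ├── ro-crate-metadata.json
        └── recording.wav
```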
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide11.png' alt='{ "conformsTo": "http://purl.archive.org/language-data-commons/profile" } ' title='Slide: 11' border='1' width='85%%'/>
<p>Now to the main focus of this presentation - the metadata “Profile” we are jointly developing to ensure that language resources can be described in a way that is interoperable between software, and re-usable over time.</p>
<p>The Profile is an “RO-Crate Profile”, a kind of Cook Book for how to describe and package language data.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide12.png' alt='☁️ 📂 📄 ID? Title? Description? 👩🔬👨🏿🔬Who created this data? 📄What parts does it have? 📅 When? 🗒️ What is it about? ♻️ How can it be reused? 🏗️ As part of which project? 💰 Who funded it? ⚒️ How was it made? Addressable resources Local Data 👩🏿🔬 https://orcid.org/0000-0001-2345-6789 🔬 https://en.wikipedia.org/wiki/Scanning_electron_microscope ' title='Slide: 12' border='1' width='85%%'/>
<p>RO-Crate is a method for describing a dataset as a digital object using a <strong>single linked-data metadata document</strong></p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide13.png' alt='📂 🔬 🔭 📹 💽 🖥️ ⚙️🎼🌡️🔮🎙️🔍🌏📡💉🏥💊🌪️ ' title='Slide: 13' border='1' width='85%%'/>
<p>The dataset may contain any kind of data resource about anything, in any format as a file or URL</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide14.png' alt=' ' title='Slide: 14' border='1' width='85%%'/>
<p>The RO-Crate standard also strongly recommends that JSON metadata is supplemented with an HTML preview - above we show what that looks like for a PARADISEC item. This is a screenshot of an HTML view of a PARADISEC Item generated using <a href="https://github.com/UTS-eResearch/ro-crate-html-js">an HTML rendering tool for RO-Crate</a>. The important point here is that this is a <em>generic</em> viewer that can understand any RO-Crate. It may not be glamorous but it could be included in an archive as a way to provide human-readable access in the absence of portals that are data specific (but cost money to build and maintain).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide15.png' alt=' https://mod.paradisec.org.au ' title='Slide: 15' border='1' width='85%%'/>
<p>Here is the same page from the previous slide, seen in a working model of an RO-Crate set exported from the current PARADISEC catalog, with a single-page viewer using an Elasticsearch index. The two pages shown here are generated directly from metadata that was stored as an RO-Crate in a storage system, using PARADISEC-specific rather than generic code.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide16.png' alt=' ' title='Slide: 16' border='1' width='85%%'/>
<p>The <a href="https://www.researchobject.org/ro-crate/1.1/structure.html">structure of an RO-Crate</a> is very similar to the PARADISEC example above, but with a json file instead of XML, and an optional preview in HTML.</p>
<p>RO-Crate has a growing number of <a href="https://www.researchobject.org/ro-crate/tools/">tools and software libraries</a> which means that a team such as PARADISEC do not have to maintain their own bespoke software.</p>
</section>
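<p>Sketched as a directory listing, the structure looks like this (the two reserved file names are from the RO-Crate spec; the payload files are invented for the example):</p>

```text
my-item/
├── ro-crate-metadata.json   # the single JSON-LD metadata document
├── ro-crate-preview.html    # optional human-readable preview
├── recording.wav            # payload files, described in the metadata
└── transcript.xml
```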
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide17.png' alt=' ' title='Slide: 17' border='1' width='85%%'/>
<p>The base vocabulary for the JSON-LD used in RO-Crate is schema.org - a widely used linked data standard. RO-Crate uses a handful of terms from other ontologies but importantly it allows for seamless extensibility with domain specific vocabularies, which is what we will talk about next.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide18.png' alt=' ' title='Slide: 18' border='1' width='85%%'/>
<p>The PARADISEC metadata model is based on the Open Language Archives Community (OLAC) metadata standard. This is an XML-based standard with good online documentation, which made it a good basis for migrating to a linked-data approach.</p>
<p>We used the OLAC terms, including <a href="http://www.language-archives.org/REC/type-20020628.html">some that were proposed but withdrawn</a> as the basis for a new vocabulary.</p>
<p>As part of a LIEF project (2022-23, led by author Thieberger), revisions to the OLAC scheme are planned, together with rebuilding the OLAC metadata harvester and aggregator.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide19.png' alt=' ' title='Slide: 19' border='1' width='85%%'/>
<p>The new Language Data Terms have been published at <a href="https://purl.archive.org/language-data-commons/terms">https://purl.archive.org/language-data-commons/terms</a></p>
<p>These terms have been modernised and mainstreamed from previous ways of describing resources, for example instead of describing the main item of interest as a PrimaryText (where text is any kind of communicative resource – not a bitstream of characters) we use the term PrimaryResource. And in the example in the image, the type of genre <em>Informational</em> has been added to the set proposed in the OLAC vocabulary.</p>
</section>
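<p>In an RO-Crate metadata document the new terms sit alongside schema.org via a context extension, along these lines – note that the property names on the File entity below are assumptions for illustration; only the terms URL, <em>PrimaryResource</em> and <em>Informational</em> come from the slides:</p>

```python
# Illustrative only: extending the RO-Crate context with the Language Data
# Commons terms. Property names (materialType, linguisticGenre) are assumed.
metadata = {
    "@context": [
        "https://w3id.org/ro/crate/1.1/context",
        {"ldac": "https://purl.archive.org/language-data-commons/terms#"},
    ],
    "@graph": [
        {"@id": "./", "@type": "Dataset",
         "hasPart": [{"@id": "story-001.txt"}]},
        {"@id": "story-001.txt",
         "@type": "File",
         # the modernised term replacing OLAC's PrimaryText:
         "ldac:materialType": {"@id": "ldac:PrimaryResource"},
         # the genre added to the OLAC-derived set:
         "ldac:linguisticGenre": {"@id": "ldac:Informational"}},
    ],
}

print(metadata["@graph"][1]["ldac:materialType"]["@id"])
```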
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide20.png' alt=' ' title='Slide: 20' border='1' width='85%%'/>
<p>(Image prompt DALL-E a hierarchical whale skeleton digital art)</p>
<p>Before we come back in detail to how RO-Crate works, we will discuss the structure, or skeleton, of our language collections as stored in a repository.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide21.png' alt=' ' title='Slide: 21' border='1' width='85%%'/>
<p>Broadly speaking there are two ways that an Arkisto-style repository can be structured and the profile sets out criteria for choosing one of the options.</p>
<p>For small, stable collections of data an entire collection (often referred to as a ‘corpus’ by linguists) can be stored in a single directory or directory-like structure in an object store.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide22.png' alt=' ' title='Slide: 22' border='1' width='85%%'/>
<p>For larger collections the approach used by PARADISEC and most LDaCA collections is to store each Object or Item (typically a related set of recordings, or a single document) in a directory (or directory-like thing).</p>
<p>In this mode, each Object MUST link back to the Collection Object.</p>
<p>A Collection Object MAY have an explicit listing of hasMember properties, which makes it possible to construct repository navigation (such as websites) more cheaply. This is the approach used in PARADISEC, while in LDaCA these links are constructed by an indexer service or summarizer application.</p>
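<p>The two linking conventions above can be sketched as follows; the identifiers are hypothetical examples, and the consistency check is just an illustration of what an indexer might verify.</p>

```python
# Sketch: wiring collection/object links in RO-Crate-style metadata.
# The arcp identifiers below are hypothetical examples.

# Each object's metadata MUST point back to its collection...
object_entity = {
    "@id": "arcp://name,example-corpus/item/001",
    "@type": "RepositoryObject",
    "name": "Recording session 001",
    "memberOf": {"@id": "arcp://name,example-corpus/collection"},
}

# ...while a collection MAY list its members explicitly, which lets
# navigation be built without crawling every object (the PARADISEC approach):
collection_entity = {
    "@id": "arcp://name,example-corpus/collection",
    "@type": "RepositoryCollection",
    "name": "Example corpus",
    "hasMember": [{"@id": "arcp://name,example-corpus/item/001"}],
}

def members_consistent(collection, obj):
    """True if the object's memberOf link matches the collection's listing."""
    listed = {m["@id"] for m in collection.get("hasMember", [])}
    return obj["memberOf"]["@id"] == collection["@id"] and obj["@id"] in listed

print(members_consistent(collection_entity, object_entity))  # → True
```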
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide23.png' alt='Describo Screenshot editing a collection record (PT) ' title='Slide: 23' border='1' width='85%%'/>
<p>This screenshot shows the Language Data Commons RO-Crate Profile in action. This is the <a href="https://github.com/Arkisto-Platform/describo-online">Describo Online</a> metadata editor, with configuration that reflects the profile being used to describe a language data collection using linked-data metadata.</p>
<p>In this case the description is of the collection object.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide24.png' alt='LDaCA ' title='Slide: 24' border='1' width='85%%'/>
<p>Once the data is described, we ingest it into a repository, as a set of files on disk or object storage and index it in a portal, as you can see in these screenshots.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide25.png' alt='Demo ' title='Slide: 25' border='1' width='85%%'/>
<p><a href="https://www.youtube.com/watch?v=p-GZbe-Kzww&t=5s">Video of browsing a collection in an LDaCA repo</a> showing:</p>
<ul>
<li>Going to the portal</li>
<li>Selecting a collection</li>
<li>Searching for content</li>
<li>Selecting a notebook</li>
<li>Launching Binder</li>
</ul>
<p>This example notebook explores the collection via the REST API.</p>
<h1>Conclusion</h1>
<p>In this presentation we have shown the major components of an ecosystem for storing, discovering and analysing language data using common standards for describing objects in a repository. The <a href="https://www.researchobject.org/ro-crate/">RO-Crate</a> standard is used as the key metadata container, with a common vocabulary of language-specific terms for describing data. This approach should reduce development costs and increase data reuse. The approach can also be adapted to other disciplines and domains with the development only of new profiles.</p>
</section>
<h1>A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond</h1>
<p>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p>
<p><a href="fair-care-eresearch-2022.pdf">Download as PDF</a></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide00.png' title='Slide: 0' border='1' width='85%%'/>
<p>This is a write-up of a talk given at eResearch Australasia 2022, delivered by Peter Sefton, with some additional detail.</p>
<p>By: Peter Sefton, Jenny Fewster, Moises Sacal Bonequi, Cale Johnstone, Catherine Travis, River Tae Smith, Patrick Carnuccio</p>
<p>Edited by: Simon Musgrave</p>
</section>
<p><img src="Slide01.png" alt="alt text" title="Project Team(alphabetical order) Michael D’Silva Marco Fahmi Leah Gustafson Michael Haugh Cale Johnstone Kathrin Kaiser Sara King Marco La Rosa Mel Mistica Simon Musgrave Joel Nothman Moises Sacal Martin Schweinberger PT Sefton With thanks for their contribution: Partner Institutions:" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide01.png' alt='Project Team(alphabetical order) Michael D’Silva Marco Fahmi Leah Gustafson Michael Haugh Cale Johnstone Kathrin Kaiser Sara King Marco La Rosa Mel Mistica Simon Musgrave Joel Nothman Moises Sacal Martin Schweinberger PT Sefton With thanks for their contribution: Partner Institutions: ' title='Slide: 1' border='1' width='85%%'/>
<p>The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC).</p>
<p>The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities.</p>
<p>The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations. These projects are led by Professor Michael Haugh of the School of Languages and Culture at the University of Queensland with several partner institutions.</p>
</section>
<p><img src="Slide02.png" alt="alt text" title="The Language Data Commons of Australia (LDaCA) and Australian Text Analytics Platform (ATAP) projects received investment (https://doi.org/10.47486/DP768 and https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS)." /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide02.png' alt='The Language Data Commons of Australia (LDaCA) and Australian Text Analytics Platform (ATAP) projects received investment (https://doi.org/10.47486/DP768 and https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS). ' title='Slide: 2' border='1' width='85%%'/>
<p>This work is supported by the Australian Research Data Commons.</p>
</section>
<p><img src="Slide03.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide03.png' alt=' ' title='Slide: 3' border='1' width='85%%'/>
<p>Last year at eResearch Australasia, the Language Data Commons of Australia (LDaCA) team presented a design for a distributed access control system which could look after the A-is-for-accessible in FAIR data; in this presentation we describe and demonstrate a pilot system based on that design, showing how data licenses that allow access by identified groups of people to language data collections can be used with an AAF pilot system (CILogon) to give the right people access to data resources.</p>
<p>The ARDC have invested in a pilot of this work as part of the HASS Research Data Commons and Indigenous Research Capability Program integration activities.</p>
<p>The system has to be able to implement data access policies with real-world complexity, and one of our challenges has been developing a data access policy that works across a range of different collections of language data. Here we present a pilot data access policy that we have developed, describing how this policy captures the decisions that must be made by a range of data providers to ensure data accessibility that complies with diverse legal, moral and ethical considerations.</p>
<p>We will discuss how the <a href="https://www.gida-global.org/care">CARE</a> and <a href="https://www.nature.com/articles/sdata201618">FAIR</a> principles underpin this work, and compare this work to other projects such as <a href="https://ardc.edu.au/project/cadre/">CADRE</a>, which promise to deliver more complex solutions in the future. Initial work is with collections curated in a research context, but we will also address community access to these resources.</p>
<p>The idea is to separate safe storage of data from its delivery. Each item in a repository is stored with licensing information in natural language (English at the moment, but it could be other languages) and the repository defers access decisions to an authorization system, where data custodians can design whatever process they like for granting license access. This can range from simple click-through licenses where anyone can agree to license terms, to detailed multi-step workflows where applicants are vetted on whatever criteria the rights holder wishes: qualifications, membership of a cultural group, whether they have paid a subscription fee, and so on.</p>
</section>
<p><img src="Slide04.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide04.png' alt=' ' title='Slide: 4' border='1' width='85%%'/>
<p>Regarding rights, our project is informed by the <a href="https://www.gida-global.org/care">CARE</a> principles for Indigenous data which also describe the level of respect which should be given to any data collected from individuals or communities.</p>
<blockquote>
<p>The current movement toward open data and open science does not fully engage with Indigenous Peoples rights and interests. Existing principles within the open data movement (e.g. FAIR: findable, accessible, interoperable, reusable) primarily focus on characteristics of data that will facilitate increased data sharing among entities while ignoring power differentials and historical contexts. The emphasis on greater data sharing alone creates a tension for Indigenous Peoples who are also asserting greater control over the application and use of Indigenous data and Indigenous Knowledge for collective benefit</p>
</blockquote>
</section>
<p><img src="Slide05.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide05.png' alt=' ' title='Slide: 5' border='1' width='85%%'/>
<p>We are designing the system so that it can work with diverse ways of expressing access rights; for example, we are considering how the approach described here could be extended based on the likes of the <a href="https://localcontexts.org/labels/traditional-knowledge-labels/">Traditional Knowledge labels</a>, incorporating them into the data licensing framework we discuss below.</p>
</section>
<p><img src="Slide06.png" alt="alt text" title="Case Study - Sydney Speaks" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide06.png' alt='Case Study - Sydney Speaks ' title='Slide: 6' border='1' width='85%%'/>
<p>In this talk we look at a case-study with the <a href="https://slll.cass.anu.edu.au/sydney-speaks">Sydney Speaks project</a> via LDaCA steering committee member Professor <a href="https://orcid.org/0000-0002-1410-3268">Catherine Travis</a>.</p>
<blockquote>
<p>This project seeks to document and explore Australian English, as spoken in Australia’s largest and most ethnically and linguistically diverse city – Sydney.</p>
<p>The title “Sydney Speaks” captures a key defining feature of the project: the data come from recorded conversations between Sydney siders, as they tell stories about their lives and experiences, their opinions and attitudes. This allows us to measure how their lived experiences impact their speech patterns.</p>
<p>Working within the framework of variationist sociolinguistics, we examine variation in phonetics, grammar and discourse, in an effort to answer questions of fundamental interest both to Australian English, and language variation and change more broadly, including:</p>
<ul>
<li>How has Australian English as spoken in Sydney changed over the past 100 years?</li>
<li>Has the change in the ethnic diversity over that time period (and in particular, over the past 40 years) had any impact on the way Australian English is spoken?</li>
<li>What affects the way variation and change spread through society?
<ul>
<li>Who are the initiators and who are the leaders in change?</li>
<li>How do social networks function in a modern metropolis?</li>
<li>What social factors are relevant to Sydney speech today, and over time (gender? class? region? ethnic identity?)</li>
</ul>
</li>
</ul>
<p>A better understanding of what kind of variation exists in Australian English, and of how and why Australian English has changed over time, can help society be more accepting of speech variation and even help address prejudices based on ways of speaking.</p>
<p>Source: <a href="http://www.dynamicsoflanguage.edu.au/sydney-speaks/">http://www.dynamicsoflanguage.edu.au/sydney-speaks/</a></p>
</blockquote>
<p>The collection contains recordings of people speaking, both contemporary and historic.</p>
<p>Because this research involved human participants, there are restrictions on the distribution of data - a situation we see in studies involving people across a huge range of disciplines.</p>
</section>
<p><img src="Slide07.png" alt="alt text" title="Sydney Speaks Licenses" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide07.png' alt='Sydney Speaks Licenses ' title='Slide: 7' border='1' width='85%%'/>
<p>There are four tiers of data access we need to enforce and observe for this data based on the participant agreements and ethics arrangements under which the data were collected.</p>
<p>Concerns about rights and interests are important for any data involving people, and a large amount of the data we are using, both Indigenous and non-Indigenous, will require access control that ensures data is shared with the right users under the right conditions.</p>
</section>
<p><img src="Slide08.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide08.png' alt='' title='Slide: 8' border='1' width='85%%'/>
<p>(Image generated by DALLE - prompt: A NSW Driver license for a wolfhound pup named Floki)</p>
<p>Let’s go over some basics, starting with <em>licences</em>.</p>
<p>A licence in this context is <em>a natural language document</em> in which a copyright holder sets out the terms and conditions of use for data. Licences <em>may</em> have metadata that describes them, e.g. a property to say that a licence is an open one (and so does not require a check when serving data).</p>
<p>A license is not a computer program, or configuration, or an AI entity that can make decisions; it’s a legal document. You may also know this as a “data sharing agreement” or “terms of use”. Examples of licenses we see all the time are the GNU GPL or the various Creative Commons licenses, which grant rights to others to redistribute a creative work and specify conditions on what changes are permitted.</p>
<p>That said, metadata <em>about</em> a license can be used to automate decision making: if a license is labelled as open, then a repository can serve the data without further checks; if it is labelled as “closed” or, more aptly, “authorization-required”, then repository software can perform an authorization step, which we cover in detail later.</p>
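<p>A minimal sketch of that decision, assuming a hypothetical <code>accessType</code> property on the license metadata (the property name and values are our invention, not a standard):</p>

```python
# Sketch of how metadata *about* a license might drive repository behaviour.
# "accessType" and its values are illustrative assumptions, not a spec.

def access_decision(license_metadata, user_is_authorized):
    """Decide whether to serve data based on license metadata.

    Open licenses are served without checks; anything else defers to an
    external authorization step ("has this user been granted the license?").
    """
    if license_metadata.get("accessType") == "open":
        return "serve"
    return "serve" if user_is_authorized else "deny"

open_cc = {"@id": "https://creativecommons.org/licenses/by/4.0/",
           "accessType": "open"}
restricted = {"@id": "https://example.org/licenses/study-participants-only",
              "accessType": "authorization-required"}

print(access_decision(open_cc, user_is_authorized=False))     # serve
print(access_decision(restricted, user_is_authorized=False))  # deny
print(access_decision(restricted, user_is_authorized=True))   # serve
```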
<p>In the world of research data generated by or about human participants, licenses can’t always allow unauthenticated access and data redistribution; they may permit distribution only to certain people, or classes of person. Some data (particularly data that has not been or cannot be de-identified) can only be made available to the original research team.</p>
</section>
<p><img src="Slide09.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide09.png' alt=' ' title='Slide: 9' border='1' width='85%%'/>
<p>(Dall-e prompt : A sad dog sitting on an iceberg, XKCD)</p>
<p>So, a license is a document that expresses conditions such as “Data can be used by other researchers”, but unfortunately we don’t have systems in the research-data ecosystem that can automatically identify a user as “a researcher” (this may be surprising to some, but the Australian Access Federation can, at this stage, only say that someone has an account with an institution - it can’t tell a professor from a student administration officer, and there are certainly no lists of “certified linguists”).</p>
<p>Here are some cold hard facts. We don’t have an authority that can identify someone as:</p>
<ul>
<li>a “researcher”,</li>
<li>a “linguist”,</li>
<li>an “anthropologist”,</li>
<li>or a member of an ARC (Australian Research Council) research project.</li>
</ul>
<p>The <a href="https://ardc.edu.au/project/cadre/">CADRE</a> project is working on systems that will eventually support all these things, but they are not available as services yet, and their initial focus is on government data, so we have to work out ways for our data custodians to make decisions on who is considered an “other researcher” in the absence of attribute-based authentication.</p>
</section>
<p><img src="Slide10.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide10.png' alt='' title='Slide: 10' border='1' width='85%%'/>
<p>The access control system we have been prototyping is based on licenses.</p>
<p>For any data object - which could be an entire collection, one set of recordings of a speaker in a speech study, a set of hand written linguistic field notes from the 1950s, a novel, and so on - we store a license with it. This means that future archivists, librarians and researchers can work out how to manage the data if the systems we build today for automated access are no longer operational, and we give the license an ID, a URL we can use to identify it uniquely.</p>
<p>This diagram shows how a license is explicitly linked to the data using a metadata description standard known as “Research Object Crate” (<a href="http://ptsefton.com/2019/11/05/RO-Crate%20eResearch%20Australasia%202019/index.html">RO-Crate</a>). Each object in the repository is a crate, with a metadata file that describes the object and (optionally) its component files, including the data license.</p>
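<p>In RO-Crate terms, the linkage might look like the following sketch, where the license is a contextual entity in the same metadata graph and its <code>@id</code> is the URL used to identify it; the URL and wording below are invented examples.</p>

```python
# Sketch: linking a natural-language license to an object in its RO-Crate
# metadata graph. The license URL and wording are invented examples.
graph = [
    {
        "@id": "./",
        "@type": "Dataset",
        "name": "Speech study recordings, speaker 042",
        "license": {"@id": "https://example.org/licenses/speech-study-v1"},
    },
    {
        "@id": "https://example.org/licenses/speech-study-v1",
        "@type": "CreativeWork",
        "name": "Speech Study Data Access License v1",
        "description": "Data may be used by approved researchers only ...",
    },
]

# Resolve the dataset's license entity from the same graph:
by_id = {e["@id"]: e for e in graph}
license_entity = by_id[graph[0]["license"]["@id"]]
print(license_entity["name"])
```

<p>Because the license travels with the object as plain metadata, a future curator can read it even if today’s access-control services are gone.</p>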
</section>
<p><img src="Slide11.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide11.png' alt='' title='Slide: 11' border='1' width='85%%'/>
<p>(This diagram has been updated from the one presented at eResearch to show two portals instead of one)</p>
<p>Every item in a repository has a license, which may be an open one like CC Share Alike or a custom license derived from the ethics and participants agreements for a study in the context of local laws and institutional policy.</p>
<p>Using this license, distributed access portals in our architecture can check against an authorization system for each request for data. The portals may both host data with the same licensing but do not need to maintain access control lists.</p>
</section>
<p><img src="Slide12.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide12.png' alt='' title='Slide: 12' border='1' width='85%%'/>
<p>(Images: Various baskets of puppies by DALL-E)</p>
<p>When we first developed access controls for LDaCA in 2021 it was a requirement that data licensing and access control decisions be decoupled from each other, and from particular repository software. The usual approach in repositories is to build in a local access-control system, but this is tied to a particular implementation and will not work in a distributed environment where there are multiple different repositories, and services such as computational resources that researchers need to access to process data.</p>
<p>We could not find an available open source system for managing license-based access to data, so our starting approach used groups as a proxy for granting licences, on the basis that all common user-directory services such as LDAP include the concept of user groups.</p>
<p>Scope:</p>
<ul>
<li>
<p>simplest possible license based approach to access control</p>
</li>
<li>
<p>NOT attempting to be attribute based as that is not currently feasible within our project scope (see <a href="https://ardc.edu.au/project/cadre/">CADRE</a> for progress in that direction)</p>
</li>
</ul>
</section>
<p><img src="Slide13.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide13.png' alt='' title='Slide: 13' border='1' width='85%%'/>
<p>The first prototype, which we presented at eResearch Australasia last year, was a proof-of-concept GitHub-based system. This demonstrated that authorization can be delegated from a repository to an external service. For each of the Sydney Speaks licenses there was a GitHub group (organization). The repository, when asked to serve data, would get the user to log in using the GitHub authentication service, then check whether the user was in the correct license group.</p>
<p>This worked, but there were issues with this approach:</p>
<ul>
<li>
<p>There are no workflow options (unless we build a workflow system), just adding people to a Github organisation to pre-authorize them</p>
</li>
<li>
<p>The system only supported a single logon service, which is not widely used in academia or by community groups</p>
</li>
</ul>
<p>So, we talked to our colleagues at the Australian Access Federation (AAF) about a supported, research-sector-wide service.</p>
</section>
<p><img src="Slide14.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide14.png' alt='' title='Slide: 14' border='1' width='85%%'/>
<p>The AAF, as it happened, was already working with other research groups on a service called <a href="https://www.cilogon.org/">CILogon</a> (hosted in the USA initially, but soon to be hosted in Australia). Like GitHub, this service has groups (which were our way of associating users with licenses in the absence of a specific license-granting service), but it also allows users to log in with a variety of authentication providers, including research institutions via the Australian Access Federation, as well as social logins such as Google and Microsoft (and our old friend GitHub).</p>
<p>Again this worked, but the current version of CILogon does not have particularly easy-to-use ways for a license-holder to create groups - there are a number of abstract constructs to deal with and there is currently no way to build an approval workflow using the web interface, so as with the Github trial we would have needed to build this part (all of this may change, as the software is under constant development).</p>
<p>There is a <a href="https://youtu.be/xEWXiM-jUfY">nine minute silent video</a> of what this looked like on YouTube for those who are really interested.</p>
<p>AAF is engaging with our project on the following:</p>
<ul>
<li>a cloud-based authentication and authorisation infrastructure (AAI) to support the needs of the project</li>
<li>understand and develop business process documentation for authorising access to data and services</li>
<li>configure the AAI to support these business processes and to develop extensions to facilitate new functionality that may be required</li>
<li>create a set of policies, standards and guidelines for managing researchers’ identity and access management</li>
<li>develop support documentation, train community representatives to operate the platform, and provide support to the community managers.</li>
</ul>
<p>The AAF has recommended CILogon and REMS as potential solutions to investigate and prototype.</p>
<p>CILogon is a federated identity management platform that provides the following features:</p>
<ul>
<li>support for institutional and community logins</li>
<li>cross-institutional and community collaboration</li>
<li>federated identity and group management</li>
<li>a community management dashboard</li>
<li>OIDC connectors for downstream services that support authorisation claims for services like:
<ul>
<li>REMS</li>
<li>BinderHub</li>
<li>JupyterHub</li>
<li>LDaCA Dashboard</li>
</ul>
</li>
</ul>
<p>REMS (Resource Entitlement Management System) is a tool to help researchers browse resources such as datasets relevant to their research and to manage the application process for access to the resources.</p>
</section>
<p><img src="Slide15.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide15.png' alt='REMS ' title='Slide: 15' border='1' width='85%%'/>
<p>Recently (after the abstract for this presentation was submitted) the AAF team made us aware of the Resource Entitlement Management System, <a href="https://github.com/CSCfi/rems">REMS</a>, which is an open source application out of Finland. This software is the missing link for LDaCA in that it allows a data custodian to grant licenses to users. And it works with CILogon as an Authentication layer so we can let users log in using a variety of services.</p>
<p>At the core of REMS is a set of Licenses which can then be associated with Resources - in our design this is (almost always) a one-to-one correspondence; for example, we would have a licence “Sydney Speaks Data Researcher Access License” corresponding to a Resource that represents ALL data with that licence. These Resources can then be made available through a catalog, and workflows can be set up for pre-authorization processes ranging from single-click authorizations, where a user just accepts a licence and a bot approves it, to complex forms where users upload credentials and one or more data custodians approve their request and grant them the licence.</p>
<p>It also has features for revoking permissions, and has a full API so admin tasks can be automated (for us that’s in the future).</p>
<p>Once a user has been granted a license in a pre-authorization process, a repository can authorize access to a resource by checking with REMS to see whether that user is pre-authorized - that is, has been granted the license. Note that users do not have to find REMS on their own - they will be directed to it from data and computing services when they need to apply for pre-authorization.</p>
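<p>A sketch of that pre-authorization check, using a mocked entitlement list rather than a live call; the field names below are simplified assumptions, and the real contract is defined by the REMS API documentation:</p>

```python
# Sketch of the authorization check a repository might make against REMS.
# The response shape is a simplified assumption; consult the REMS API
# documentation for the actual entitlements contract.

def is_preauthorized(entitlements, user_id, resource_id):
    """True if REMS reports an active entitlement (a granted license)
    linking this user to this resource."""
    return any(
        e["user"] == user_id and e["resource"] == resource_id
        and e.get("end") is None  # no end date => entitlement still active
        for e in entitlements
    )

# A mocked REMS response for a user who holds one active license
# and one revoked/expired one:
mock_response = [
    {"user": "user-123", "resource": "sydney-speaks-researcher", "end": None},
    {"user": "user-123", "resource": "old-corpus", "end": "2021-01-01"},
]

print(is_preauthorized(mock_response, "user-123", "sydney-speaks-researcher"))  # True
print(is_preauthorized(mock_response, "user-123", "old-corpus"))                # False
```

<p>Because the check is a simple lookup, any portal in the network can make the same call without holding its own access control list.</p>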
</section>
<p><img src="Slide16.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide16.png' alt=' ' title='Slide: 16' border='1' width='85%%'/>
<p>This interaction diagram shows the flow involved in a user applying for a data license via REMS.</p>
<p>Not shown here are some design and preparation steps:</p>
<ul>
<li>
<p>The research team read their ethics approval and participant agreements and craft one or more access agreements (AKA licenses) for a data set. (NOTE: if the data can be made available automatically with just a license attached, such as when all parties have agreed that data can be Creative Commons licensed, or the data is in the public domain, then the following steps are not required.)</p>
</li>
<li>
<p>The research team and support staff add the license to REMS, creating a “resource”, a virtual offering that corresponds to any dataset that has the above license.</p>
</li>
<li>
<p>The research team add a workflow to REMS - this could range from an auto-approved click through where users can agree to license terms, through to detailed (manual) checking of their credentials.</p>
</li>
</ul>
<p>The next slide shows the interactions involved in accessing data once a user has been granted the license.</p>
</section>
<p><img src="Slide17.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide17.png' alt=' ' title='Slide: 17' border='1' width='85%%'/>
<p>This diagram shows the “access-control dance” for a user who has been granted a license in REMS obtaining access to a dataset at a data portal which gives access to data in a repository or archive.</p>
</section>
<p><img src="Slide18.png" alt="alt text" title="Demo" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide18.png' alt='Demo ' title='Slide: 18' border='1' width='85%%'/>
<p>In this video we demonstrate how to use REMS and how a user requests access to an LDaCA resource.</p>
</section>
<p><img src="Slide19.png" alt="alt text" title="FAQ" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide19.png' alt='FAQ ' title='Slide: 19' border='1' width='85%%'/>
<p>(This section was added after the conference, to try to summarize the discussion and clarify requirements by starting an FAQ on this approach)</p>
<h2>Q: Why not "just" implement an access control list (ACL) in the repository?</h2>
<p>There are a few reasons for the distributed approach we have taken in LDaCA:</p>
<ol>
<li>
<p>ACLs need maintenance over time - people's identities change, they retire and die - so storing a list of identifiers such as email addresses alongside content is not a viable long-term preservation strategy. Rather, we will encourage data custodians to describe in words, in a license, what uses of the data are permitted and by whom, then allow whoever is the current data custodian to manage that access in a separate administrative system. We expect these administrative systems to be ephemeral and to change over time, but also to generate less friction over time as standards are developed. Expected future benefits of concentrating these processes include that people will not have to prove the same claims about themselves multiple times, and that it will be easier for data custodians to authorize access.</p>
</li>
<li>
<p>LDaCA data will be stored in a variety of places with separate portal applications serving data for specific purposes; if these systems all have in-built authorization schemes, even if they are the same, then we have the problem of synchronizing access control lists around a network of services.</p>
</li>
<li>
<p>Accessing data that requires an authorization process is not specific to language or humanities research, so working with an existing application that can handle pre-authorization workflows and access-control decisions is an attractive choice. It should allow LDaCA to take advantage of centrally managed services whose functionality improves over time, rather than having to develop and maintain our own systems.</p>
</li>
<li>
<p>If complex access controls are implemented inside a system then there is a risk that data becomes stranded inside that system and cannot be reused without completely re-implementing the access control. For example, imagine an archive of cultural material with complex access controls encoded into the business logic, such as “this item is accessible only to male initiates”. Applications like this need to store user accounts, with attributes on both data and user records that can be used to authorize access. There is a high risk of data being stranded in such a system if it is no longer supported. This will be mitigated somewhat if the rules are also expressed as licenses, perhaps as a composition of Traditional Knowledge (TK) Labels - but the access system is baked into the application and not portable.</p>
</li>
</ol>
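To make the contrast concrete, here is a minimal Python sketch of the decoupling argued for above: the stored data records only a license identifier, while the user-to-license grants live in a separate, replaceable service. All names and identifiers are invented for illustration; the grants table stands in for an external system such as REMS, not an LDaCA API.

```python
# Hypothetical: each dataset carries only a license identifier.
DATASETS = {
    "dataset-001": {"license": "https://example.org/licenses/participants-only"},
}

# The mapping from users to licenses lives elsewhere (e.g. REMS) and can
# be rebuilt or swapped without touching the stored data.
GRANTS = {
    ("alice@example.org", "https://example.org/licenses/participants-only"),
}

def may_access(user, dataset_id):
    """Authorize by license, not by a per-item access control list."""
    licence = DATASETS[dataset_id]["license"]
    return (user, licence) in GRANTS
```

Because the dataset only names its license, the grants table can be exported, audited or re-implemented in another system without rewriting anything stored with the data.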
<h2>Q: Yes but why does data need to have a license if we already have access controls?</h2>
<p>The point of Research Data Commons projects like LDaCA is to create an ecosystem where data can be re-used. For language data, this means that users, including researchers and community members, will be able to download data for certain authorised purposes and activities. The license is the way that data custodians communicate to data users (and future administrators) what those purposes and activities are.</p>
<p>A license, which is always packaged with the data, will allow:</p>
<ul>
<li>
<p>A user to inspect a five-year-old dataset in their downloads folder and work out what they are allowed to do with it.</p>
</li>
<li>
<p>An IT professional to clean up a laptop that has been handed in by (or seized from – it happens) a departing faculty member.</p>
</li>
<li>
<p>A developer to re-create access controls when replacing a decommissioned system.</p>
</li>
</ul>
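As an illustration of the first scenario above, the packaged license can be recovered from a dataset folder with no live service at all. This sketch assumes the data was packaged as an RO-Crate with the license recorded on the root dataset entity; the file name follows the RO-Crate convention, but the point is simply that the license travels with the data.

```python
import json
import os

def packaged_license(crate_dir):
    """Read the license identifier out of a downloaded crate's metadata file."""
    with open(os.path.join(crate_dir, "ro-crate-metadata.json")) as f:
        metadata = json.load(f)
    # The root dataset entity is the one whose @id is "./"
    root = next(e for e in metadata["@graph"] if e.get("@id") == "./")
    return root["license"]["@id"]
```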
<h2>Q: So many licenses! Sounds like a lot of work!</h2>
<p>We expect that the overhead of writing licenses will diminish greatly over time as standard clauses and complete licenses are established. A data depositor will be able to choose from a set of standard license terms (such as a standard “restricted to CIs and participants” license for a given repository), using that as a template to mint their own license for a given dataset with its own name and ID. The depositor can choose a standard way of adding pre-authorized licensees (such as email invitations). This ID can then be used by an authorization system.</p>
<h2>Q: So you have centralized authorization into a system that grants licenses; doesn't that mean you are locked in to that system?</h2>
<p>No, and yes.</p>
<p><strong>No</strong>, there is no lock-in regarding the list of licenses and pre-authorized users; licenses and access control lists can be exported via an API, so it is possible to import them into another system or save them for audit purposes.</p>
<p><strong>Yes</strong>, there is lock-in, in that at this stage the workflow used to give access to users is specific to the system (such as REMS).</p>
<p><strong>But</strong>, because our process requires a governance step <em>first</em>, writing a license, there is a statement of intent for re-building those processes later if needed; this step is very likely to be missing in a system with built-in access control.</p>
<p>Also, over time, we expect the administrative burden of constructing workflows to lessen as standards are developed in a couple of areas:</p>
<ol>
<li>
<p>Licenses can be made less complex (particularly in the context of academic studies) if they specify re-use by particular known cohorts in advance - this comes down to improving the design of studies to encourage data reuse. This may also help to simplify academic ethics processes in the medium to long term.</p>
</li>
<li>
<p>The CADRE project is looking to improve pre-authorization workflows that automatically source relevant information about potential users - fetching their publication record, and potentially remembering what certifications they have, so these attributes can be used and reused for decision making. It is conceivable that this approach might be useful in cultural contexts as well to allow data custodians to manage data sharing - this is a discussion we have yet to have in the broader HASS RDC.</p>
</li>
</ol>
<h2>Q: What if I have a really simple requirement like giving access to just a couple of people - doesn’t this license approach just add complexity?</h2>
<p>If a data item needs to be locked down to a small group of people, say the chief investigator and the participants in a recorded dialogue, then an obvious implementation is to maintain a small access control list (ACL) for the item. But all of the issues identified above with application-specific ACLs apply regardless of the size of the cohort: the dataset can’t be access-controlled outside of its home system. If the system is no longer running then the data may be completely inaccessible, and if there is no license document stored with the data setting out terms of re-use in general terms, there is no indication to future administrators about who, if anyone, should have access to the data.</p>
<h2>Q: We don’t need a license, we have a “terms of use”</h2>
<p>Same thing: “terms of use” for data are what a license sets out. We are designing our systems so that all the relevant terms and conditions go in one place, to minimize confusion.</p>
</section>
<p>The final three slides have been contributed by co-author Patrick.</p>
<p>These slides briefly outline the AAF process for the next phase, which will provide the foundations for the development of the service and the creation of the policies that will support the community and the service.</p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide20.png' alt='Next Steps in the AAF Engagement ….. Revisit and consolidate the project’s vision through interviews and engagement with stakeholders, collaborators and community participants This next phase will provide the foundations for the development of policies to support the community. These will assist in creating a trusted community for access to sensitive data that supports the good practice and protocols. These activities are outlined in the following slides … ' title='Slide: 20' border='1' width='85%%'/>
<p>This process will support the project to deliver a viable service that meets researchers’ needs and is trusted by the community and the participants to safely distribute data to authorised persons.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide21.png' alt='The AAF's business analyst conducts interviews with stakeholders and community members to discover and formalise the community's processes and requirements ' title='Slide: 21' border='1' width='85%%'/>
<p>The AAF’s business analyst is conducting interviews with the key stakeholders.
This discovery process will collect information on the current and the “to-be” state of the service.</p>
<p>Together these will establish goals and expectations and provide the basis for further prototyping a service that meets stakeholder needs.</p>
<p>The process will facilitate the building of a service that empowers the data custodians, the communities and participants to manage access.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide22.png' alt='These feeds into the prototyping & implementing phase ' title='Slide: 22' border='1' width='85%%'/>
<p>The basis for prototyping is iterative:</p>
<ul>
<li>Identify</li>
<li>Prioritise</li>
<li>Pilot</li>
<li>Review</li>
<li>Update requirements</li>
</ul>
<p>This leads to a production service that meets participant, community and researcher requirements and unifies the services, policies and trust framework for the community.</p>
</section>
<h1>HASS RDC Technical Advisory Group Meeting: LDaCA & ATAP Intro</h1>
<p>Peter Sefton, 2022-02-18</p>
<p>This is a presentation I gave to the <a href="https://ardc.edu.au/collaborations/strategic-activities/hass-and-indigenous-research-data-commons/">Humanities, Arts and Social Sciences Research Data Commons and Indigenous Research Capability Program</a> Technical Advisory Group on Friday 11th February 2022. Thanks to Simon Musgrave for reviewing this and adding a little detail here and there.</p>
<p>We will post this to the Language Data Commons of Australia (LDaCA) website soon and I'll link it here.</p>
<p><a href="https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/HASS RDC Technical Advisory Group Meeting LDaCA & ATAP Intro.pdf">PDF version</a></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide00.png' alt='HASS RDC Technical Advisory Group Meeting
<p>LDaCA & ATAP
Intro
Peter Sefton - p.sefton@uq.edu.au
' title='Slide: 0' border='1' width='85%%'/></p>
<p>The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are establishing a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC).
The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide01.png' alt='
<p>' title='Slide: 1' border='1' width='85%%'/></p>
<p>For this Research Data Commons work we are using the Arkisto Platform (introduced <a href="http://ptsefton.com/2020/11/23/Arkisto/index.html">at eResearch 2020</a>).</p>
<p>Arkisto aims to secure the long term preservation of data independently of code and services - recognizing the ephemeral nature of software and platforms. We know that sustaining software platforms can be hard and aim to make sure that important data assets are not locked up in database or hard-coded logic of some hard-to-maintain application.</p>
<p>We are using three key standards on this project …</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide02.png' alt='
<p>' title='Slide: 2' border='1' width='85%%'/></p>
<p>The first standard is the <a href="https://ocfl.io/1.0/spec/">Oxford Common File Layout</a> - this is a way of keeping version controlled digital objects on a plain old filesystem or object store.</p>
<p>Here’s the introduction to the spec:</p>
<blockquote>
<h2>Introduction</h2>
<p>This section is non-normative.</p>
<p>This Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital objects in a structured, transparent, and predictable manner. It is designed to promote long-term access and management of digital objects within digital repositories.</p>
<h2>Need</h2>
<p>The OCFL initiative began as a discussion amongst digital repository practitioners to identify well-defined, common, and application-independent file management for a digital repository's persisted objects and represents a specification of the community’s collective recommendations addressing five primary requirements: completeness, parsability, versioning, robustness, and storage diversity.</p>
<h2>Completeness</h2>
<p>The OCFL recommends storing metadata and the content it describes together so the OCFL object can be fully understood in the absence of original software. The OCFL does not make recommendations about what constitutes an object, nor does it assume what type of metadata is needed to fully understand the object, recognizing those decisions may differ from one repository to another. However, it is recommended that when making this decision, implementers consider what is necessary to rebuild the objects from the files stored.</p>
<h2>Parsability</h2>
<p>One goal of the OCFL is to ensure objects remain fixed over time. This can be difficult as software and infrastructure change, and content is migrated. To combat this challenge, the OCFL ensures that both humans and machines can understand the layout and corresponding inventory regardless of the software or infrastructure used. This allows for humans to read the layout and corresponding inventory, and understand it without the use of machines. Additionally, if existing software were to become obsolete, the OCFL could easily be understood by a light weight application, even without the full feature repository that might have been used in the past.</p>
<h2>Versioning</h2>
<p>Another need expressed by the community was the need to update and change objects, either the content itself or the metadata associated with the object. The OCFL relies heavily on the prior art in the [Moab] Design for Digital Object Versioning which utilizes forward deltas to track the history of the object. Utilizing this schema allows implementers of the OCFL to easily recreate past versions of an OCFL object. Like with objects, the OCFL remains silent on when versioning should occur recognizing this may differ from implementation to implementation.</p>
<h2>Robustness</h2>
<p>The OCFL also fills the need for robustness against errors, corruption, and migration. The versioning schema ensures an OCFL object is robust enough to allow for the discovery of human errors. The fixity checking built into the OCFL via content addressable storage allows implementers to identify file corruption that might happen outside of normal human interactions. The OCFL eases content migrations by providing a technology agnostic method for verifying OCFL objects have remained fixed.</p>
<p>Storage diversity
Finally, the community expressed a need to store content on a wide variety of storage technologies. With that in mind, the OCFL was written with an eye toward various storage infrastructures including cloud object stores.</p>
</blockquote>
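The versioning idea in the spec extract above can be sketched in a few lines. This is illustrative only: a real OCFL object uses a fixed inventory schema with digest-keyed manifests and forward deltas, whereas this toy keeps a much-simplified inventory. It just shows the principle of immutable version directories plus a machine- and human-readable inventory on a plain filesystem.

```python
import hashlib
import json
import os

def add_version(object_root, files):
    """Write files into a new vN/content directory and record their
    digests in a (simplified, non-spec) inventory.json at the root."""
    inv_path = os.path.join(object_root, "inventory.json")
    if os.path.exists(inv_path):
        with open(inv_path) as f:
            inventory = json.load(f)
    else:
        inventory = {"versions": {}}
    version = "v%d" % (len(inventory["versions"]) + 1)
    state = {}
    for name, data in files.items():
        digest = hashlib.sha512(data).hexdigest()
        path = os.path.join(object_root, version, "content", name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
        # Content-addressing: the digest lets us verify fixity later.
        state[digest] = [name]
    inventory["versions"][version] = {"state": state}
    with open(inv_path, "w") as f:
        json.dump(inventory, f, indent=2)
    return version
```

Earlier versions are never rewritten; each update only adds a new version directory and a new inventory entry, which is what makes the layout parsable and robust without any repository software.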
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide03.png' alt='☁️
📂
<p>📄
ID? Title? Description?</p>
<p>👩🔬👨🏿🔬Who created this data?
📄What parts does it have?
📅 When?
🗒️ What is it about?
♻️ How can it be reused?
🏗️ As part of which project?
💰 Who funded it?
⚒️ How was it made?
Addressable resources
Local Data
👩🏿🔬 https://orcid.org/0000-0001-2345-6789
🔬 https://en.wikipedia.org/wiki/Scanning_electron_microscope
' title='Slide: 3' border='1' width='85%%'/></p>
<p>The second standard is Research Object Crate. (RO-Crate) a method for describing any dataset of local or remote resources as a digital object using a <strong>single linked-data metadata document</strong>.</p>
<p>RO-Crate is used in our platform both for describing data objects in the OCFL repository, and for delivering metadata over the API (which we’ll show in architecture diagrams and screenshots below).</p>
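For example, a minimal RO-Crate metadata document for an object like the ones described above might look like the following; the dataset name, ORCID and license URL are placeholders, and the point is that the whole description lives in one JSON-LD file.

```python
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # The root dataset: who made it, what it contains, how it
            # may be reused (values here are illustrative).
            "@id": "./",
            "@type": "Dataset",
            "name": "Example speech study session",
            "author": {"@id": "https://orcid.org/0000-0001-2345-6789"},
            "license": {"@id": "https://example.org/licenses/participants-only"},
            "hasPart": [{"@id": "interview.wav"}],
        },
        {"@id": "interview.wav", "@type": "File", "name": "Recorded interview"},
        {
            "@id": "https://orcid.org/0000-0001-2345-6789",
            "@type": "Person",
            "name": "Example Researcher",
        },
    ],
}

# The crate is plain JSON, so it can be written alongside the data files.
serialized = json.dumps(crate, indent=2)
```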
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide04.png' alt='📂
<p>🔬 🔭 📹 💽 🖥️ ⚙️🎼🌡️🔮🎙️🔍🌏📡💉🏥💊🌪️
' title='Slide: 4' border='1' width='85%%'/></p>
<p>RO-Crates may contain any kind of data resource about anything, in any format as a file or URL - it’s not just for language data; there are also many projects in the sciences starting to <a href="https://www.researchobject.org/ro-crate/in-use/">use RO-Crate</a>.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide05.png' alt='
<p>' title='Slide: 5' border='1' width='85%%'/></p>
<p>This image is taken from a <a href="https://slideplayer.com/slide/3919920/">presentation on digital preservation</a> .</p>
<p>See also the <a href="https://pcdm.org/2016/04/18/models">PCDM models</a>.</p>
<p>The third key standard for Arkisto is the Portland Common Data Model (PCDM). Like OCFL, this was developed by members of the digital library/repository community. It was devised as a way to interchange data between repository systems, most of which, it turned out, had evolved very similar models: nested collections and digital objects that aggregate related files. Using this very simple ontology allows us to store data in the OCFL layer in a very flexible way. Depending on factors like data size, licensing, and whether data is likely to change or need to be withdrawn, we can store an entire collection as a single OCFL object or spread it across many OCFL objects, with PCDM used to show the structure of the data collections regardless of how they happen to be stored.</p>
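The shapes PCDM gives us are simple enough to sketch. Here the RO-Crate terms corresponding to the PCDM classes are used (RepositoryCollection for pcdm:Collection, RepositoryObject for pcdm:Object); identifiers are illustrative.

```python
collection = {
    "@id": "#corpus",
    "@type": "RepositoryCollection",
    "name": "Example corpus",
    "hasMember": [{"@id": "#session-1"}],
}
session = {
    "@id": "#session-1",
    "@type": "RepositoryObject",
    "name": "Recording session 1",
    "hasPart": [{"@id": "session-1/audio.wav"}],
}

def members(entity, graph):
    """Resolve membership links to full entities, regardless of how the
    member objects happen to be stored on disk."""
    index = {e["@id"]: e for e in graph}
    return [index[ref["@id"]] for ref in entity.get("hasMember", [])]
```

Because membership is expressed by reference, the collection structure survives whether the members sit in one OCFL object or are spread across many.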
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide06.png' alt='
<p>RO-Crates MUST have licence information that sets out conditions for use/reuse of the data</p>
<p>This RO-Crate contains an entire PCDM collection
' title='Slide: 6' border='1' width='85%%'/></p>
<p>Back to RO-Crates.</p>
<p>RO-Crates are self-documenting and can ship with an HTML file that allows a consumer of the crated data to see whatever documentation the crate authors have added.</p>
<p>This crate contains an entire collection (RepositoryCollection is the RO-Crate term that corresponds to pcdm:Collection).</p>
<p>Crates must have license information that sets out how data may be used and whether it may be redistributed. As we are dealing with language data, which is (almost) always created by people, it is important that their intellectual property rights and privacy are respected. More on this later.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide07.png' alt='
<p>' title='Slide: 7' border='1' width='85%%'/></p>
<p>This shows a page for what we’re calling an Object (RepositoryObject). A RepositoryObject is a single “thing” such as a document, a conversation, or a session in a speech study. (This was called an “item” in Alveo, but given that both the Portland Common Data Model and the Oxford Common File Layout use “Object”, we are using that term, at least for now.)</p>
<p>This shows that the system is capable of dealing with Unicode characters, which is good and what you would expect in 2022 from a Language Data Commons. But there are still challenges, such as dealing with mixtures of left-to-right and right-to-left text, and we need to find or define metadata terms to keep track of “language”, “writing system”, and the difference between material that started as orthographic (written) text versus spoken or signed language. There’s a group of us working on that, currently led by Nick Thieberger and Peter Sefton.</p>
<p>Simon Musgrave and Peter Sefton <a href="https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/">presented our progress with multilingual text</a> at a virtual workshop run by ANU in January.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide08.png' alt='
<p>Link back to the container which has type RepositoryObject
' title='Slide: 8' border='1' width='85%%'/></p>
<p>Here’s another screenshot showing one of the government documents in PDF format - with a link back to the abstract RepositoryObject that houses all of the manifestations of the document in various languages.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide09.png' alt='
<p>Repositories: institutional, domain or both</p>
<p>Find / Access services
Research Data Management Plan
Workspaces:</p>
<p>working storage
domain specific tools
domain specific services
collect
describe
analyse
Reusable, Interoperable
data objects
deposit early
deposit often
Findable, Accessible, Reusable data objects
reuse data objects
V1.1 © Marco La Rosa, Peter Sefton 2021 https://creativecommons.org/licenses/by-sa/4.0/</p>
<p>🗑️
Active cleanup processes workspaces considered ephemeral
🗑️
Policy based data management
' title='Slide: 9' border='1' width='85%%'/></p>
<p>The above diagram takes a big-picture view of research data management in the context of <em>doing</em> research. It makes a distinction between managed repository storage and the places where work is done: “workspaces”. Workspaces are where researchers collect, analyse and describe data. Examples include the most basic of research IT services, file storage, as well as analytical tools such as Jupyter notebooks (the backbone of ATAP, the text analytics platform). Other examples of workspaces include code repositories such as GitHub or GitLab (a slightly different sense of the word “repository”), survey tools, electronic (lab) notebooks and bespoke code written for particular research programmes. These workspaces are essential research systems, but they are usually not set up for long-term management of data.
The cycle in the centre of this diagram shows an idealised research practice where data are collected, described and deposited into a repository frequently. Data are made findable and accessible as soon as possible and can be “re-collected” for use and re-use.</p>
<p>For data to be re-usable by humans and machines (such as ATAP notebook code that consumes datasets in a predictable way) it must be well described. The ATAP and LDaCA approach to this is to use the Research Object Crate (RO-Crate) specification. RO-Crate is essentially a guide to using a number of standards and standard approaches to describe both data and re-runnable software such as workflows or notebooks.</p>
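As a hypothetical example of that last point, a notebook can itself be the payload of a crate, described as software as well as a file so that it can be archived and re-run. The property choices below are illustrative, not a statement of the ATAP/LDaCA profile.

```python
notebook_crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Keyword frequency analysis",  # illustrative name
            "hasPart": [{"@id": "analysis.ipynb"}],
            "license": {"@id": "https://spdx.org/licenses/MIT"},
        },
        {
            # Typed as both a File in the crate and software, so tools
            # can discover it as something runnable.
            "@id": "analysis.ipynb",
            "@type": ["File", "SoftwareSourceCode"],
            "name": "Analysis notebook",
            "programmingLanguage": "Python",
        },
    ],
}
```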
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide10.png' alt='Compute
<p>HPC
Cloud
Desktop</p>
<p>collect
describe
analyse
🗑️
Active cleanup processes workspaces considered ephemeral
… etc
ATAP Notebooks
Apps, Code, Workflows</p>
<p>Deposit /Publish
PARADISEC
Analytics Portal
Code discovery
Launch / Rerun
Data Discovery
Authenticated API</p>
<p>Workbench
Notebooks
Data import by URL
Export fully described pkg
Stretch goals:
Code gen / simple interfaces eg Discursis</p>
<p>BYOData 🥂
⚙️
STORAGE (including Cloudstor)
.
Data Curation
& description
Reuse
Licence Server
Identity Management
AAF / social media accounts</p>
<p>Data Cleaning
OCR / transcription format migration
Archive & Preservation Repositoriesinstitutional, domain or both
AU Nat. Corpus
AusLan (sign)
Sydney Speaks
ATAP Corpus
Reference,Training & BYO
Workspaces:
working storage
domain specific tools
domain specific services
Harvested
external
Lang. portal(s)
Corpus discovery
Item discovery
Authenticated API
Create virtual corpora</p>
<p>' title='Slide: 10' border='1' width='85%%'/></p>
<p>This rather messy slide captures the overall high-level architecture for the LDaCA Research Data Commons. There will be an analytical workbench (left of the diagram) which is the basis of the Australian Text Analytics Platform (ATAP) project; this will focus on notebook-style programming using one of the emerging Jupyter notebook platforms in that space. (This is not 100% decided yet, but that has not stopped the team from starting to collect and develop notebooks that open up text analytics to new coders from the linguistics community.) Our engagement lead, Dr Simon Musgrave, sees the ATAP work as primarily an educational enterprise encouraging researchers to adopt new research practices, which will be underpinned by services built on the Arkisto standards that allow for rigorous, re-runnable research.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide11.png' alt='Compute
<p>HPC
Cloud
Desktop</p>
<p>collect
describe
analyse
🗑️
Active cleanup processes workspaces considered ephemeral
… etc
ATAP Notebooks
Apps, Code, Workflows
Deposit /Publish
PARADISEC
Analytics Portal
Code discovery
Launch / Rerun
Data Discovery
Authenticated API</p>
<p>Workbench
Notebooks
Data import by URL
Export fully described pkg
Stretch goals:
Code gen / simple interfaces eg Discursis</p>
<p>BYOData 🥂
⚙️
STORAGE (including Cloudstor)
.
Data Curation
& description
Reuse
Licence Server
Identity Management
AAF / social media accounts</p>
<p>Data Cleaning
OCR / transcription format migration
Archive & Preservation Repositoriesinstitutional, domain or both
AU Nat. Corpus
AusLan (sign)
Sydney Speaks
ATAP Corpus
Reference,Training & BYO
Workspaces:
working storage
domain specific tools
domain specific services
Harvested
external
Lang. portal(s)
Corpus discovery
Item discovery
Authenticated API
Create virtual corpora</p>
<p>Talking mainly about this bit today
' title='Slide: 11' border='1' width='85%%'/></p>
<p>In this presentation we are going to focus on the portal/repository architecture more than on the ATAP notebook side of things. We know that we will be using (at least) the SWAN Jupyter notebook service provided by AARNet, but we are still scoping how notebooks will be made portable between systems and where they will be stored at various stages of their development. We will be supporting and encouraging researchers to archive notebooks wrapped in RO-Crates with re-use information OUTSIDE of the SWAN platform, though - it’s a workspace, not a repository; it does not have governance in place for long-term preservation.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide12.png' alt='
<p>' title='Slide: 12' border='1' width='85%%'/></p>
<p>This is a much simpler view zooming in on the core infrastructure components that we have built so far. We are starting with bulk ingest of existing collections and will add one-by-one deposit of individual items after that.</p>
<p>This shows the OCFL repository at the bottom, with a Data & Access API that mediates access. This API understands the RO-Crate format and in particular its use of the Portland Common Data Model to structure data. The API also enforces access control on objects; every repository object has a license setting out the terms of use and re-use for its data, which will reflect the way the data were collected: participant agreements, ethics approvals and privacy law are all relevant here. Each license will correspond to a group of people who have agreed to it and/or been selected by a data custodian. We are in negotiations with the <a href="https://aaf.edu.au/">Australian Access Federation (AAF)</a> to use their <a href="https://www.cilogon.org/">CILogon</a> service for this authorization step and for authentication of users across a wide variety of services including the AAF itself and Google, Microsoft, GitHub etc.</p>
<p>There’s also an access portal, based on a full-text index (at this stage we’re using ElasticSearch), which is designed to help people find data they might be interested in using. This follows the conventions for browse/search interfaces that we’re familiar with from shopping sites: you can search for text and/or drill down using <em>facets</em> (which are called aggregations in Elastic-land), e.g. which language am I interested in, or do I want [ ] Spoken or [ ] Written material?</p>
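A faceted interface like this is typically backed by a query combining full-text search with terms aggregations. The following is an illustrative sketch of such an Elasticsearch request body; the field names are invented, not the actual LDaCA index schema.

```python
# A free-text query plus terms aggregations that populate the facets.
query = {
    "query": {"match": {"full_text": "letters"}},
    "aggs": {
        "mode": {"terms": {"field": "communicationMode.keyword"}},
        "language": {"terms": {"field": "language.name.keyword"}},
    },
}

# Clicking a facet value adds a post_filter, narrowing the hits shown
# while leaving the aggregation counts computed over the full result set.
filtered = dict(query, post_filter={"term": {"communicationMode.keyword": "Written"}})
```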
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide13.png' alt='There may be several distributed file-based repositories feeding the same index
There may be several portals using the same index - eg to give collection specific advanced search
There may be other kinds of index such as triple stores or relational databases that index tabular data
' title='Slide: 13' border='1' width='85%%'/>
<p>This architecture is very modular and designed to operate in a distributed fashion, potentially with distributed file and/or object based repositories all being indexed by a centralised service. There may also be other ‘flavours’ of index such as triple or graph stores, relational databases that ingest tabular data or domain specific discovery tools such as corpus analysis software. And, there may be collection specific portals that show a slice of a bigger repository with features or branding specific to a subset of data.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide14.png' alt='
<p>👹
' title='Slide: 14' border='1' width='85%%'/></p>
<p>This implementation of the Arkisto standards-stack is known as Oni. That’s not really an acronym any more, though it once stood for OCFL, Nginx (a web server) or Node (a JavaScript runtime), and an Index. An Oni is a kind of Japanese demon. 👹</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide15.png' alt='
<p>' title='Slide: 15' border='1' width='85%%'/></p>
<p>But how will data get into the OCFL repository? At the moment we’re loading data using a series of scripts which are being developed in our GitHub organization.</p>
<p>This diagram and the next come from the <a href="https://arkisto-platform.github.io/use-cases/">Arkisto Use cases page</a>; they show how we will convert data from existing collections into a form where they can be preserved in an OCFL repository and be part of a bigger collection, ALWAYS with access control based on licenses.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide16.png' alt='
<p>' title='Slide: 16' border='1' width='85%%'/></p>
<p>This is a screenshot of our GitHub organization showing the corpus migration tools we’ve started developing (there are six, plus one general-purpose text-cleaning tool). These repositories have not all been made public yet, but they will be; they contain tools to build Arkisto-ready file repositories that can be made available in one or more portals.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide17.png' alt='PIC OF ALVEO
<p>' title='Slide: 17' border='1' width='85%%'/></p>
<p>Here’s our portal, which gives a browse interface to allow drill-down data discovery.</p>
<p>But wait! That’s not the LDaCA portal - that’s Alveo!</p>
<p>Oh yes, so it is.</p>
<p>Alveo was built ten years ago and has not seen much uptake.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide18.png' alt='
<p>' title='Slide: 18' border='1' width='85%%'/></p>
<p>This screenshot shows some of the browse facets for the COOEE corpus, which contains early Australian <strong>written</strong> English materials. But facets like <code>Written Mode</code> and <code>Communication Medium</code>, both of which are known for COOEE, are not populated.</p>
<p>There were quite a few things wrong with Alveo. We obviously didn’t get the metadata populated to the level that would make these browse facets actually useful for filtering. But more importantly, not enough work was done to check which browse facets <em>are</em> useful, and too little of the budget could be spent on user engagement and training rather than software development.</p>
<p>One of my current LDaCA senior colleagues told me a couple of years ago that Alveo was useless: “I just wanted to get all the data,” they said. Me, I was thinking “but it has an API so you CAN get all the data - what’s the problem?”. We have tried not to repeat this mistake by making sure that the API delivers entire collections, and we have demonstrations of doing this for real work.</p>
<p>Another colleague who was actually on the Alveo team said that this interface was "equally useless for everyone", and they later built a custom interface for one of the collections.</p>
<p>We’re taking these lessons to heart in designing the LDaCA infrastructure, making sure that as we go we have people using the software. It helps that we have an in-house (though distributed) development team rather than an external contractor, so feedback is very fast - we can jump onto a call and demo stuff at any time.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/oni-api-edit-2.gif' alt='API Demo' title='Slide: 19' border='1' width='85%%'/>
<p>We decided to build from the data API first.</p>
<p>In this demo, developer Moises Sacal Bonequi explores the API via the Postman tool, showing how the API can be used to find collections that conform to our metadata profile:</p>
<ol>
<li>First he lists the collections, then chooses one</li>
<li>He then gets a collection with the <code>&resolve</code> parameter, meaning that the API will internally traverse the PCDM collection hierarchy and return ALL the metadata for the collection - down to the file level</li>
<li>He then downloads a file (for which he has a license that most of you reading this don’t have - hence the obfuscation of the dialogue)</li>
</ol>
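<p>The steps above can be sketched in a few lines of Python. Note that the base URL and endpoint paths here are illustrative assumptions based on the demo, not documented Oni routes; only the <code>resolve</code> parameter is taken from the demonstration itself.</p>

```python
from urllib.parse import urlencode

# Hypothetical base URL -- substitute a real Oni instance.
BASE = "https://oni.example.org/api"

def collections_url(base: str = BASE) -> str:
    """Step 1: list the collections held in the repository (assumed path)."""
    return f"{base}/data"

def collection_url(collection_id: str, resolve: bool = True, base: str = BASE) -> str:
    """Step 2: fetch one collection. `resolve` asks the API to walk the
    PCDM collection hierarchy and return ALL the collection's metadata,
    down to the file level."""
    params = {"id": collection_id}
    if resolve:
        params["resolve"] = "true"
    return f"{base}/object?{urlencode(params)}"

print(collection_url("example-corpus"))
```

<p>Step 3 (downloading a licensed file) would then be an ordinary HTTP GET against a file URL from the resolved metadata, sending the user’s access credential, e.g. an <code>Authorization</code> header.</p>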
<p>This API has been used and road-tested to develop techniques for topic modelling on the Sydney Speaks corpus (more on that corpus below) by a student, Marcel Reverter-Rambaldi, under the supervision of Prof Catherine Travis at ANU. We are hoping to publish this work as a re-usable notebook that can be adapted for other projects, and to allow the techniques the ANU team have been developing to be applied to other similar data in LDaCA.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide20.png' alt='
<p>' title='Slide: 20' border='1' width='85%%'/></p>
<p>And Mel Mistica, one of the data scientists who was working with us at UQ, developed a <a href="https://github.com/Australian-Text-Analytics-Platform/ro-crate-metadata/blob/main/ro-crate-metadata.ipynb">demonstration notebook</a> with our tech team that uses the API to access another full collection (one also suitable for the ANU topic-modelling approach). The notebook gets all the metadata for a small social-history collection of transcribed interviews with women in Western Sydney, and shows how a data scientist might explore what’s in it and start asking questions about the data - like the age distribution of the participants - before digging in to what they were talking about.</p>
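<p>As a toy illustration of that kind of exploration, here is a sketch in Python. The participant entries and property names are simplified stand-ins, not the collection’s real metadata:</p>

```python
from collections import Counter

# Simplified stand-ins for participant entities pulled from the
# collection's metadata via the API -- real entities are much richer.
participants = [
    {"name": "Participant 1", "age": "34"},
    {"name": "Participant 2", "age": "52"},
    {"name": "Participant 3", "age": "38"},
]

def age_distribution(people, bin_size=10):
    """Bucket participant ages into bins such as '30-39'."""
    bins = Counter()
    for p in people:
        low = (int(p["age"]) // bin_size) * bin_size
        bins[f"{low}-{low + bin_size - 1}"] += 1
    return dict(bins)

print(age_distribution(participants))  # {'30-39': 2, '50-59': 1}
```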
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/oni-v3.1.gif' alt='Demo' title='Slide: 21' border='1' width='85%%'/>
<p>This screencast shows a work-in-progress snapshot of the Oni portal we talked about above in action, showing how search and browse might be used to find repository objects from the index - in this case searching for Arabic words in a small set of Australian Government documents.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide22.png' alt='
🎄🎁
' title='Slide: 22' border='1' width='85%%'/>
<p>Hang on!</p>
<p>You keep talking about “repositories” - don’t you always say stuff like <a href="http://ptsefton.com/2012/02/14/an-australian-research-data-repository/">A repository is not just a software application. It’s a lifestyle. It’s not just for Christmas</a>?</p>
<p>That’s right - we’ve been talking about repository software architectures here but it is important to remember that a repository needs to be considered an institution rather than a software stack or collection of files, more “University Library” than “My Database”.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide23.png' alt='Compute
<p>HPC
Cloud
Desktop</p>
<p>collect
describe
analyse
🗑️
Active cleanup processes workspaces considered ephemeral
… etc
ATAP Notebooks
Apps, Code, Workflows</p>
<p>Deposit /Publish
PARADISEC
Lang. portal(s)
Corpus discovery
Item discovery
Authenticated API
Create virtual corpora</p>
<p>Analytics Portal
Code discovery
Launch / Rerun
Data Discovery
Authenticated API</p>
<p>Workbench
Notebooks
Data import by URL
Export fully described pkg
Stretch goals:
Code gen / simple interfaces eg Discursis</p>
<p>BYOData 🥂
⚙️
STORAGE (including Cloudstor)
.
Data Curation
& description
Reuse
Licence Server
Identity Management
AAF / social media accounts</p>
<p>Data Cleaning
OCR / transcription format migration
Archive & Preservation Repositories: institutional, domain or both
AU Nat. Corpus
AusLan (sign)
Sydney Speaks
ATAP Corpus
Reference,Training & BYO
Workspaces:
working storage
domain specific tools
domain specific services
Harvested
external
Our demo today looks at this part …
' title='Slide: 23' border='1' width='85%%'/></p>
<p>The next half a dozen slides are based on <a href="https://ptsefton.com/2021/10/12/ldaca2021/index.html">a presentation I gave at eResearch Australasia 2021 with Moises Sacal Bonequi</a></p>
<p>Today we will look in detail at one important part of this architecture - access control. How can we make sure that in a distributed system, with multiple data repositories and registries residing with different data custodians, the right people have access to the right data?</p>
<p>I didn’t spell this out in the recorded conference presentation, but for data that resides in the repositories at the right of the diagram we want to encourage research processes that clearly separate data from code. Notebooks and other code workflows that use data will fetch a version-controlled reference copy from a repository (using an access key if needed), process the data, and produce results that are then deposited into an appropriate repository alongside the code itself. Given that a lot of the data in the language world is NOT available under open licenses such as Creative Commons, it is important to establish this practice: each user of the data must negotiate or be granted access individually. Research can still be reproducible under this model, without fostering a culture of sharing datasets with no regard for the rights of those involved in creating the data.</p>
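<p>The pattern can be sketched as a function with the fetch, analyse and deposit steps injected, so the same analysis code can run against any repository the researcher holds a key for. All names here are illustrative, not part of any LDaCA API:</p>

```python
def run_analysis(dataset_id, fetch, analyse, deposit, access_key=None):
    """Fetch a version-controlled reference copy of the data, analyse it,
    and deposit the results back into a repository -- the analysis code
    never carries around a privately mutated copy of the data."""
    data = fetch(dataset_id, access_key)  # reference copy from a repository
    result = analyse(data)                # code kept separate from data
    return deposit(dataset_id, result)    # results (and code) deposited

# Stub run showing the shape of the workflow:
outcome = run_analysis(
    "sydney-speaks-subcorpus",
    fetch=lambda did, key: ["token", "token", "word"],
    analyse=lambda tokens: len(set(tokens)),
    deposit=lambda did, result: {"dataset": did, "result": result},
)
print(outcome)  # {'dataset': 'sydney-speaks-subcorpus', 'result': 2}
```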
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide24.png' alt='
<p>' title='Slide: 24' border='1' width='85%%'/></p>
<p>Regarding rights, our project is informed by the <a href="https://www.gida-global.org/care">CARE principles</a> for Indigenous data.</p>
<blockquote>
<p>The current movement toward open data and open science does not fully engage with Indigenous Peoples rights and interests. Existing principles within the open data movement (e.g. FAIR: findable, accessible, interoperable, reusable) primarily focus on characteristics of data that will facilitate increased data sharing among entities while ignoring power differentials and historical contexts. The emphasis on greater data sharing alone creates a tension for Indigenous Peoples who are also asserting greater control over the application and use of Indigenous data and Indigenous Knowledge for collective benefit</p>
</blockquote>
<p>But we do not see the CARE principles as applying only to Indigenous data and knowledge. Most language data is a record of the behaviour of people who have moral rights in the material (even if they do not have legal rights), and treating the CARE principles as relevant in such cases ensures serious thinking about the protection of those moral rights.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide25.png' alt='
<p>' title='Slide: 25' border='1' width='85%%'/></p>
<p><a href="https://localcontexts.org/labels/traditional-knowledge-labels/">https://localcontexts.org/labels/traditional-knowledge-labels/</a></p>
<p>We are designing the system so that it can work with diverse ways of expressing access rights, for example licensing like the Traditional Knowledge (TK) Labels. The idea is to separate safe storage of data, with a license on each item which may reference the TK Labels, from a system administered by the data custodians, who can make decisions about who is allowed to access data.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide26.png' alt='Case Study - Sydney Speaks
<p>' title='Slide: 26' border='1' width='85%%'/></p>
<p>We are working on a case-study with the <a href="http://www.dynamicsoflanguage.edu.au/sydney-speaks/">Sydney Speaks project</a> via steering committee member Catherine Travis.</p>
<blockquote>
<p>This project seeks to document and explore Australian English, as spoken in Australia’s largest and most ethnically and linguistically diverse city – Sydney.
The title “Sydney Speaks” captures a key defining feature of the project: the data come from recorded conversations between Sydney siders, as they tell stories about their lives and experiences, their opinions and attitudes. This allows us to measure how their lived experiences impact their speech patterns.
Working within the framework of variationist sociolinguistics, we examine variation in phonetics, grammar and discourse, in an effort to answer questions of fundamental interest both to Australian English, and language variation and change more broadly, including:</p>
<ul>
<li>How has Australian English as spoken in Sydney changed over the past 100 years?</li>
<li>Has the change in the ethnic diversity over that time period (and in particular, over the past 40 years) had any impact on the way Australian English is spoken?</li>
<li>What affects the way variation and change spread through society
<ul>
<li>Who are the initiators and who are the leaders in change?</li>
<li>How do social networks function in a modern metropolis?</li>
<li>What social factors are relevant to Sydney speech today, and over time (gender? class? region? ethnic identity?)
A better understanding of what kind of variation exists in Australian English, and of how and why Australian English has changed over time can help society be more accepting of speech variation and even help address prejudices based on ways of speaking.
Source: <a href="http://www.dynamicsoflanguage.edu.au/sydney-speaks/">http://www.dynamicsoflanguage.edu.au/sydney-speaks/</a></li>
</ul>
</li>
</ul>
</blockquote>
<p>The collection contains both contemporary and historical recordings of people speaking.</p>
<p>Because this involves human participants there are restrictions on the distribution of the data - a situation we see in studies involving people across a huge range of disciplines.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide27.png' alt='Sydney Speaks Licenses
' title='Slide: 27' border='1' width='85%%'/>
<p>There are four tiers of data access we need to enforce and observe for this data based on the participant agreements and ethics arrangements under which the data were collected.</p>
<p>Concerns about rights and interests are important for any data involving people - and a large amount of the data we are using, both Indigenous and non-Indigenous, will require access control that ensures data sharing is appropriate.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/security.gif' alt='Demonstration screencast
' title='Slide: 28' border='1' width='85%%'/>
<p>In this example demo we uploaded various collections and are authorising with GitHub organisations.</p>
<p>In our production release we will use AAF to authorise different groups.</p>
<p>Let's find a dataset: The Sydney Speaks Corpus</p>
<p>As you can see, we cannot see any data.</p>
<p>Let’s log in… We authorise GitHub…</p>
<p>Now you can see we have access to the sub-corpus data, and I am just opening a couple of items.</p>
<p>—</p>
<p>Now in Github we can see the group management example.</p>
<p>I have given myself access to all the licences, as you can see here, and given others access to licence A only.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide29.png' alt='
<p>' title='Slide: 29' border='1' width='85%%'/></p>
<p>This diagram is a sketch of the interaction that took place in the demo - it shows how a repository can delegate authorization to an external system - in this case Github rather than CILogon. But we are working with the ARDC to set up a trial with the Australian Access Federation to allow CILogon access for the HASS Research Data Commons so we can pilot group-based access control.</p>
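<p>Whichever identity provider sits behind it, the shape of the authorisation check is a mapping from licences to the groups whose members may see that data. A minimal sketch, with made-up licence identifiers and group names (in the demo such groups live in a GitHub organisation; under AAF/CILogon they would be federation groups):</p>

```python
# Hypothetical licence-to-group mapping, maintained by the data
# custodians in the external authorisation system.
LICENCE_GROUPS = {
    "licence-a": {"sydney-speaks-licence-a"},
    "licence-b": {"sydney-speaks-licence-b"},
}

def user_can_access(item_licence, user_groups):
    """True if any of the user's groups is authorised for the item's licence."""
    return bool(LICENCE_GROUPS.get(item_licence, set()) & set(user_groups))

print(user_can_access("licence-a", {"sydney-speaks-licence-a"}))  # True
```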
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide30.png' alt='TODO
Scope the infrastructure we need to support this (need more clarity on what data we will have and where it will be housed)
Improve our testing for scale and implement Continuous Integration so we don’t break things with every new Corpus that comes on board
Pick our metadata terms we will probably build on the OLAC (Open Language Archives) vocabularies - but there are other options such as the CLARIN (Eu) vocabs
Integrate better with the Australian Text Analytics Platform ATAP - eg fire up a notebook from the search portal to operate on a collection of interest
' title='Slide: 30' border='1' width='85%%'/>
<p>There’s a lot still to do.</p>
</section>
Infrastructure for Multilingual Text Analysis2022-01-27T00:00:00+01:002022-01-27T00:00:00+01:00Simon Musgrave, Peter Seftontag:ptsefton.com,2022-01-27:/2022/01/27/DAMTA_Slides_v1/index.html<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide0.png' alt='Infrastructure for Multilingual Text Analysis
Simon Musgrave
Peter Sefton
Language Data Commons of Australia (LDaCA)
University of Queensland
' title='0' border='1' width='85%'/>
<p>This presentation was delivered by Simon Musgrave and Peter Sefton at an online event, Digital Approaches to Multilingual Text Analysis, on January 27th 2022.</p>
<blockquote>
<h3>About this event</h3>
<p>The use of DH tools and methods have been applied across a variety of corpora but text-analysis of English language sources has dominated …</p></blockquote></section><section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide0.png' alt='Infrastructure for Multilingual Text Analysis
Simon Musgrave
Peter Sefton
Language Data Commons of Australia (LDaCA)
University of Queensland
' title='0' border='1' width='85%'/>
<p>This presentation was delivered by Simon Musgrave and Peter Sefton at an online event, Digital Approaches to Multilingual Text Analysis, on January 27th 2022.</p>
<blockquote>
<h3>About this event</h3>
<p>The use of DH tools and methods have been applied across a variety of corpora but text-analysis of English language sources has dominated this field. These approaches are increasingly being used in languages and linguistics research for non-English corpora. At the same time, the integration of these tools has seen new research questions and possibilities emerge, including questions such as “Is there a non-Anglo digital humanities (DH), and if so, what are its characteristics” (Fiormonte 2016: 438). Recent studies have begun to examine aspects such as OCR for historical text analysis and data mining (Hill & Hengchen 2019; Goodman et al. 2018), multilingual computation analysis (Dombrowski 2020), semantic and sentiment analysis (Daems et al. 2019) and historical linguistics (Evans 2016), among others. The papers in this conference present a diverse range of projects and critiques of digital methods across different languages.</p>
<p>January 27th 1:45pm – 7:30pm AEDT</p>
<p>Convener: Joshua Brown Senior Lecturer and Convenor, Italian Studies, Australian National University and Katrina Grant Senior Lecturer, Centre for Digital Humanities Research, Australian National University</p>
</blockquote>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide1.png' alt='Why we need research infrastructure
Collecting data is time-consuming (expensive)
Making data reusable while respecting rights is very desirable
FAIR and CARE principles should guide us
Managing this at the level of individual projects is onerous
Even for small datasets
Separate infrastructure encourages best practices
FAIRer data
Wider availability of data management expertise
Better alignment with technology change
' title='1' border='1' width='85%'/>
<p>If we accept that sharing and reuse of data (consistent with ethical considerations) should be the default, managing even small amounts of data can be onerous. Having infrastructure which can take on this task relieves researchers of some of the burden and brings advantages: more reliable <a href="https://www.nature.com/articles/sdata201618">FAIR</a> compliance, access to data management experts, and responsiveness to changing technology (at least for the life of the infrastructure).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide2.png' alt='Introducing the Language Data Commons of Australia (LDaCA)
LDaCA will make nationally significant language data available for academic and non-academic use and provides a model for ensuring continued access with appropriate community control
LDaCA aims to provide access to materials which record language use in Australia
In some cases, LDaCA will provide federated access to existing collections
In other cases, LDaCA will be a repository
' title='2' border='1' width='85%'/>
<p>Regardless of where data is housed, access will be through one portal (external data may also be accessible by other routes). Access control will follow the <a href="https://www.gida-global.org/care">CARE Principles for Indigenous Data Governance</a>: original providers of data have moral rights which must be considered, and data owners/custodians will control lists of authorised users.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide3.png' alt='Multilingual material
Multilingual Australia (ABS data):
In 2016, there were over 300 separately identified languages spoken in Australian homes
More than one-fifth (21 per cent) of Australians spoke a language other than English at home
An infrastructure with the stated aims of LDaCA has to be able to handle data:
From multiple languages
With multiple writing systems
With multiple annotations (translations, phonetics, syntax etc)
In principle, Unicode encoding and suitable fonts should be capable of doing this
How does it work in practice?
' title='3' border='1' width='85%'/>
<p>'record language use in Australia' covers a huge range of possibilities. Current figures on slide, plus at least 250 Australian languages pre-European arrival, at least half no longer spoken but records remain (not for all). Unicode is not always without problems – how are we doing in meeting these goals so far? First, the architecture....</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide4.png' alt='LDaCA Architecture
' title='4' border='1' width='85%'/>
<p>The LDaCA technical architecture is based on the <a href="https://arkisto-platform.github.io/">Arkisto platform</a>, storing data in the <a href="https://arkisto-platform.github.io/standards/ocfl/">Oxford Common File Layout</a> (OCFL), with data objects such as linguistic items and collections described in detail using <a href="https://arkisto-platform.github.io/standards/ro-crate/">Research Object Crate</a> (RO-Crate). RO-Crate is a linked-data approach to describing data which is based on widely used standards for structural and descriptive properties such as dates and contributors, with extensions for language data being built on work in the <a href="http://www.language-archives.org/">Open Language Archives</a> (OLAC). RO-Crate is an international collaboration with diverse contributors; the specification is in English and most RO-Crates at this point have English metadata and contents, but there is demand for content in other languages and future versions of the spec will cover multilingual use cases.</p>
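<p>For readers unfamiliar with RO-Crate, the skeleton is small: a single <code>ro-crate-metadata.json</code> file holding a JSON-LD graph with a metadata descriptor and a root dataset. A minimal sketch built in Python, with invented names and files:</p>

```python
import json

# Minimal, illustrative RO-Crate 1.1 skeleton. The dataset name and
# file entries are invented for the example.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # the metadata descriptor, pointing at the root dataset
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # the root dataset: the collection itself
            "@id": "./",
            "@type": "Dataset",
            "name": "Example language collection",
            "datePublished": "2022-01-27",
            "hasPart": [{"@id": "interview-01.txt"}],
        },
        {"@id": "interview-01.txt", "@type": "File", "name": "Interview transcript"},
    ],
}

metadata_json = json.dumps(crate, indent=2)  # contents of ro-crate-metadata.json
```

<p>Language-specific description (OLAC-style properties, annotations, writing systems) is layered onto these entities as further linked-data properties.</p>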
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide5.png' alt='The demonstration material
All levels of government in Australia make documents available in multiple languages
Demonstration corpus uses documents from:
Services Australia
Department of Health (Victoria)
Languages:
Arabic
Farsi (Persian)
Turkish
Vietnamese
Chinese (simplified characters)
' title='5' border='1' width='85%'/>
<p>Simon pointed out here that the languages use three completely distinct writing systems (Turkish and Vietnamese use extended Roman scripts, and Farsi uses an Arabic-based script).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/demo.gif' alt='' title='6' border='1' width='85%'/>
<p>This quick demonstration screencast shows a work-in-progress prototype of the LDaCA portal, which will give controlled access to language resources to those who are licensed to see them – in this demonstrator we have openly available multilingual Australian Government documents in PDF and text format and a small history dataset containing interviews with women from Western Sydney, <a href="http://omeka.uws.edu.au/farmstofreeways/">Farms to Freeways</a>. Eventually the LDaCA repository will contain a wide variety of data including speech, video, sign, images and digitized text, with a browse and search interface to allow researchers to find data they are interested in – provided, of course, that they have been granted an appropriate licence to view and use the data. In this demonstration our colleague Moises Sacal Bonequi performs searches in different languages to find repository objects of interest. Each object has multiple translations stored in separate files, in both PDF and text format.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide7.png' alt='Why bother?
What are the relevant data sources?
Lots more government documents
Serial publications:
Tim Sherratt lists 52 non-English sources in Trove
31 commenced publication before 1945
German prominent in C19
Substantial resources >1945 in Italian and Greek
Chinese publications always present
Use of LOTEs in Australia is under-researched, huge opportunities to collect data
' title='7' border='1' width='85%'/>
<p>Is there data in Australia which makes it worth worrying about this? Yes – at least two important sources of written material, plus this is an under-researched field with lots of questions to be answered and therefore lots of data to be collected. For example, there is research on differing usage in Vietnamese depending on speakers' time of arrival in Australia (1970s v. later), yet to be replicated with other similarly time-layered communities.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide8.png' alt='Acknowledgments
The Language Data Commons of Australia project received investment from the NCRIS-enabled Australian Research Data Commons (ARDC) through two of its programs:
Data Partnerships Program: Developing policy and technology foundations of a nationally integrated research infrastructure for language data collections of high strategic importance for the Australian research community.
HASS Research Data Commons and Indigenous Research Capability Program: Capitalising on existing infrastructure, securing vulnerable and dispersed collections and linking with improved analysis environments for new research outcomes.
Software developer: Moises Sacal Bonequi
' title='8' border='1' width='85%'/>
</section>
Towards a (technical architecture for a) HASS Research Data Commons for language and text analysis2021-10-12T00:00:00+02:002021-10-12T00:00:00+02:00Peter Seftontag:ptsefton.com,2021-10-12:/2021/10/12/ldaca2021/index.html<p>This is a presentation by Peter (Petie) Sefton and Moises Sacal, delivered at the online <a href="https://conference.eresearch.edu.au/2021-program/">eResearch Australasia Conference</a> on October 12th 2021.</p>
<p>The presentation was delivered by recorded video - this is a written version. Moises and I are both employed by the University of Queensland School of Languages and Culture.</p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide0.png' alt='Towards a (technical architecture for a) HASS Research Data Commons for language and text analysis
Peter Sefton & Moises Sacal
technical architecture for a
' title='0' border='1' width='85%'/>
<p>Here …</p></section><p>This is a presentation by Peter (Petie) Sefton and Moises Sacal, delivered at the online <a href="https://conference.eresearch.edu.au/2021-program/">eResearch Australasia Conference</a> on October 12th 2021.</p>
<p>The presentation was delivered by recorded video - this is a written version. Moises and I are both employed by the University of Queensland School of Languages and Culture.</p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide0.png' alt='Towards a (technical architecture for a) HASS Research Data Commons for language and text analysis
Peter Sefton & Moises Sacal
technical architecture for a
' title='0' border='1' width='85%'/>
<p>Here is the abstract as submitted:</p>
<p>The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC).</p>
<p>The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities.</p>
<p>The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations.</p>
<p>In this presentation we will present the proposed architecture of the system, the principles that informed it and demonstrate the first version. Features of the solution include the use of the Arkisto Platform (presented at eResearch 2020), which leverages the Oxford Common File Layout. This enables storing complete version-controlled digital objects described using linked data with rich context via the Research Object Crate (RO-Crate) format. The solution features a distributed authorization model where the agency archiving data may be separate from that authorising access.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide1.png' alt='Project Team(alphabetical order)
Michael D’Silva
Marco Fahmi
Leah Gustafson
Michael Haugh
Cale Johnstone
Kathrin Kaiser
Sara King
Marco La Rosa
Mel Mistica
Simon Musgrave
Joel Nothman
Moises Sacal
Martin Schweinberger
PT Sefton
<p>With thanks for their contribution:
Partner Institutions:
' title='1' border='1' width='85%'/></p>
<p>This cluster of projects is led by Professor Michael Haugh of the School of Languages and Culture at the University of Queensland with several partner institutions.</p>
<p>I work on Gundungurra and Darug land in the Blue Mountains; Moises is on the land of the Gadigal people of the Eora Nation. We would like to acknowledge the traditional custodians of the lands on which we live and work, and the importance of Indigenous knowledge, culture and language to these projects.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide2.png' alt='The Language Data Commons of Australia (LDaCA) and Australian Text Analytics Platform (ATAP) projects received investment (https://doi.org/10.47486/DP768 and https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS).
' title='2' border='1' width='85%'/>
<p>This work is supported by the Australian Research Data Commons.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide3.png' alt='⛰️ 🔏
' title='3' border='1' width='85%'/>
<p>We are going to talk about the emerging architecture and focus in on one very important part of it: Access control. 🔏</p>
<p>But first, some background.⛰️</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide4.png' alt='The platform will:
Be sustainable, with a focus on data preservation as an overriding concern - data will not be ‘trapped’ in a particular platform and all data and code developed on the platform will be in a “migration free” layout ready for reuse
preserve interoperable and re-usable data via the use of common standards for describing and structuring data with useful detailed context and provenance
make data from ATAP and LDaCA and collections discoverable - with the caveat that harvesting harmonised metadata from existing corpora may be difficult
Provide workbench services for computational research - starting with code-notebooks but with the aim of building towards no-code environments and automatically re-runnable workflows
include clear licensing on all data and code on how data may be reused, informed by a legally sound policy framework, with an access-control framework to allow automated data access where possible (there are some external dependencies here)
be distributed - with data held by a number of different organizations under a variety of governance models and technologies (potentially including copies for redundancy or to put data close to compute and analytical services)
enable best-practice in research, with research products such as code and derived data available as “fully documented research objects” that are as re-runnable and rigorously described as possible
provide and be able to show value in enabling and measuring the impact of research
<p>' title='4' border='1' width='85%'/></p>
<p>The architecture for the Data Commons project is informed by a set of goals and principles, starting with ensuring that important data assets have the best chance of persisting into the future.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide5.png' alt='
<p>Repositories: institutional, domain or both</p>
<p>Find / Access services
Research Data Management Plan
Workspaces:</p>
<p>working storage
domain specific tools
domain specific services
collect
describe
analyse
Reusable, Interoperable
data objects
deposit early
deposit often
Findable, Accessible, Reusable data objects
reuse data objects
V1.1 © Marco La Rosa, Peter Sefton 2021 https://creativecommons.org/licenses/by-sa/4.0/</p>
<p>🗑️
Active cleanup processes workspaces considered ephemeral
🗑️
Policy based data management
' title='5' border='1' width='85%'/></p>
<p>The diagram which we developed with Marco La Rosa makes a distinction between managed repository storage and the places where work is done - “workspaces”. Workspaces are where researchers collect, analyse and describe data. Examples include the most basic of research IT services - file storage - as well as analytical tools such as Jupyter notebooks (the backbone of ATAP, the text analytics platform). Other examples of workspaces include code repositories such as GitHub or GitLab (a slightly different sense of the word repository), survey tools, electronic (lab) notebooks and bespoke code written for particular research programmes - these workspaces are essential research systems but usually are not set up for long-term management of data.
The cycle in the centre of this diagram shows an idealised research practice where data are collected and described and deposited into a repository frequently. Data are made findable and accessible as soon as possible and can be “re-collected” for use and re-use.</p>
<p>For data to be re-usable by humans and machines (such as ATAP notebook code that consumes datasets in a predictable way) it must be well described. The ATAP and LDaCA approach to this is to use the Research Object Crate (RO-Crate) specification. RO-Crate is essentially a guide to using a number of standards and standard approaches to describe both data and re-runnable software such as workflows or notebooks.</p>
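<p>As a sketch of what that looks like in practice, here is a minimal RO-Crate metadata file (<code>ro-crate-metadata.json</code>) for a hypothetical corpus item - the dataset name, date and licence URL are invented for illustration:</p>

```python
import json

# A minimal ro-crate-metadata.json for a hypothetical corpus item.
# The dataset name, date and licence URL are invented for illustration.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            # The metadata file descriptor, pointing at the root dataset
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # The root dataset itself, carrying an item-level licence
            "@id": "./",
            "@type": "Dataset",
            "name": "Example interview recordings",
            "datePublished": "2021-10-12",
            "license": {"@id": "https://example.org/licences/licence-a"},
        },
    ],
}

print(json.dumps(crate, indent=2))
```

<p>Because the crate is plain JSON-LD sitting alongside the data, nothing about it is tied to a particular repository product - which is what makes the "migration free" layout possible.</p>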
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide6.png' alt='
<p>' title='6' border='1' width='85%'/></p>
<p>In the context of the previous high-level map distinguishing workspaces and repository services, we are using the Arkisto Platform (introduced <a href="http://ptsefton.com/2020/11/23/Arkisto/index.html">at eResearch 2020</a>).</p>
<p>Arkisto is an approach to eResearch services that places the emphasis on ensuring the long-term preservation of data independently of code and services - recognizing the ephemeral nature of software.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide7.png' alt='' title='7' border='1' width='85%'/>
<p>An example of a corpus is the PARADISEC collection - Pacific and Regional Archive for Digital Sources in Endangered Cultures</p>
<p>PARADISEC has viewers for various content types: video and audio with time aligned transcriptions, image set viewers and document viewers (xml, pdf and microsoft formats). We are working on making these viewers available across Arkisto sites by having a standard set of hooks for adding viewer plugins to a site as needed.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide8.png' alt='Compute
<p>HPC
Cloud
Desktop</p>
<p>collect
describe
analyse
🗑️
Active cleanup processes workspaces considered ephemeral
… etc
ATAP Notebooks
Apps, Code, Workflows</p>
<p>Deposit /Publish
PARADISEC
Analytics Portal
Code discovery
Launch / Rerun
Data Discovery
Authenticated API</p>
<p>Workbench
Notebooks
Data import by URL
Export fully described pkg
Stretch goals:
Code gen / simple interfaces eg Discursis</p>
<p>BYOData 🥂
⚙️
STORAGE (including Cloudstor)
.
Data Curation
& description
Reuse
Licence Server
Identity Management
AAF / social media accounts</p>
<p>Data Cleaning
OCR / transcription format migration
Archive & Preservation Repositories: institutional, domain or both
AU Nat. Corpus
AusLan (sign)
Sydney Speaks
ATAP Corpus
Reference,Training & BYO
Workspaces:
working storage
domain specific tools
domain specific services
Harvested
external
Lang. portal(s)
Corpus discovery
Item discovery
Authenticated API
Create virtual corpora</p>
<p>' title='8' border='1' width='85%'/></p>
<p>This slide captures the overall high-level architecture - there will be an analytical workbench (left of the diagram) which is the basis of the Australian Text Analytics (ATAP) project - this will focus on notebook-style programming using one of the emerging Jupyter notebook platforms in that space. The exact platform is not 100% decided yet, but that has not stopped the team from starting to collect and develop notebooks that open up text analytics to new coders from the linguistics community. Our engagement lead, Dr Simon Musgrave sees the ATAP work as primarily an educational enterprise - which will be underpinned by services built on the Arkisto standards that allow for rigorous, re-runnable research.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide9.png' alt='Compute
<p>HPC
Cloud
Desktop</p>
<p>collect
describe
analyse
🗑️
Active cleanup processes workspaces considered ephemeral
… etc
ATAP Notebooks
Apps, Code, Workflows</p>
<p>Deposit /Publish
PARADISEC
Lang. portal(s)
Corpus discovery
Item discovery
Authenticated API
Create virtual corpora</p>
<p>Analytics Portal
Code discovery
Launch / Rerun
Data Discovery
Authenticated API</p>
<p>Workbench
Notebooks
Data import by URL
Export fully described pkg
Stretch goals:
Code gen / simple interfaces eg Discursis</p>
<p>BYOData 🥂
⚙️
STORAGE (including Cloudstor)
.
Data Curation
& description
Reuse
Licence Server
Identity Management
AAF / social media accounts</p>
<p>Data Cleaning
OCR / transcription format migration
Archive & Preservation Repositories: institutional, domain or both
AU Nat. Corpus
AusLan (sign)
Sydney Speaks
ATAP Corpus
Reference,Training & BYO
Workspaces:
working storage
domain specific tools
domain specific services
Harvested
external
Our demo today looks at this part …
' title='9' border='1' width='85%'/></p>
<p>Today we will look in detail at one important part of this architecture - access control. How can we make sure that in a distributed system, with multiple data repositories and registries residing with different data custodians, the right people have access to the right data?</p>
<p>I didn’t spell this out in the recorded conference presentation, but for data that resides in the repositories at the right of the diagram we want to encourage research processes that clearly separate data from code. Notebooks and other code workflows that use data will fetch a version-controlled reference copy from a repository (using an access key if needed), process the data, and produce results that are then deposited into an appropriate repository alongside the code itself. Given that a lot of the data in the language world is NOT available under open licences such as Creative Commons, it is important to establish this practice: each user of the data must negotiate or be granted access individually. Research can still be reproducible under this model, without a culture of sharing datasets with no regard for the rights of those involved in the creation of the data.</p>
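<p>A minimal sketch of what that fetch step might look like from a notebook - the repository URL, endpoint and token are invented, not the actual LDaCA API:</p>

```python
from urllib.request import Request

def dataset_request(dataset_url, access_token=None):
    """Build an HTTP request for a version-controlled reference copy of a
    dataset. The access token, if supplied, is the key granted to this
    user for this licence; open datasets need no token at all."""
    headers = {}
    if access_token:
        headers["Authorization"] = "Bearer " + access_token
    return Request(dataset_url, headers=headers)

# Hypothetical repository URL and token - not a real endpoint.
req = dataset_request(
    "https://repo.example.org/api/object/example-corpus-item",
    access_token="token-123",
)
```

<p>The point of the pattern is that the notebook records <em>which</em> reference copy it used, while the key it used to get there is negotiated per-user and never shared with the code.</p>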
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide10.png' alt='
<p>' title='10' border='1' width='85%'/></p>
<p>Regarding rights, our project is informed by the <a href="https://www.gida-global.org/care">CARE principles</a> for Indigenous data.</p>
<blockquote>
<p>The current movement toward open data and open science does not fully engage with Indigenous Peoples rights and interests. Existing principles within the open data movement (e.g. FAIR: findable, accessible, interoperable, reusable) primarily focus on characteristics of data that will facilitate increased data sharing among entities while ignoring power differentials and historical contexts. The emphasis on greater data sharing alone creates a tension for Indigenous Peoples who are also asserting greater control over the application and use of Indigenous data and Indigenous Knowledge for collective benefit</p>
</blockquote>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide11.png' alt='
<p>' title='11' border='1' width='85%'/></p>
<p><a href="https://localcontexts.org/labels/traditional-knowledge-labels/">https://localcontexts.org/labels/traditional-knowledge-labels/</a></p>
<p>We are designing the system so that it can work with diverse ways of expressing access rights - for example, licensing such as the Traditional Knowledge (TK) labels. The idea is to separate the safe storage of data, with a licence on each item (which may reference the TK labels), from a system administered by the data custodians, who can make decisions about who is allowed to access the data.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide12.png' alt='Case Study - Sydney Speaks
<p>' title='12' border='1' width='85%'/></p>
<p>We are working on a case-study with the <a href="http://www.dynamicsoflanguage.edu.au/sydney-speaks/">Sydney Speaks project</a> via steering committee member Professor Catherine Travis.</p>
<blockquote>
<p>This project seeks to document and explore Australian English, as spoken in Australia’s largest and most ethnically and linguistically diverse city – Sydney.
The title “Sydney Speaks” captures a key defining feature of the project: the data come from recorded conversations between Sydney siders, as they tell stories about their lives and experiences, their opinions and attitudes. This allows us to measure how their lived experiences impact their speech patterns.
Working within the framework of variationist sociolinguistics, we examine variation in phonetics, grammar and discourse, in an effort to answer questions of fundamental interest both to Australian English, and language variation and change more broadly, including:</p>
<ul>
<li>How has Australian English as spoken in Sydney changed over the past 100 years?</li>
<li>Has the change in the ethnic diversity over that time period (and in particular, over the past 40 years) had any impact on the way Australian English is spoken?</li>
<li>What affects the way variation and change spread through society
<ul>
<li>Who are the initiators and who are the leaders in change?</li>
<li>How do social networks function in a modern metropolis?</li>
<li>What social factors are relevant to Sydney speech today, and over time (gender? class? region? ethnic identity?)
A better understanding of what kind of variation exists in Australian English, and of how and why Australian English has changed over time can help society be more accepting of speech variation and even help address prejudices based on ways of speaking.
Source: <a href="http://www.dynamicsoflanguage.edu.au/sydney-speaks/">http://www.dynamicsoflanguage.edu.au/sydney-speaks/</a></li>
</ul>
</li>
</ul>
</blockquote>
<p>The collection contains both contemporary and historic recordings of people speaking.</p>
<p>Because this involves human participants there are restrictions on the distribution of the data - a situation we see in studies involving people across a huge range of disciplines.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide13.png' alt='Sydney Speaks Licenses
' title='13' border='1' width='85%'/>
<p>There are four tiers of data access we need to enforce and observe for this data based on the participant agreements and ethics arrangements under which the data were collected.</p>
<p>Concerns about rights and interests are important for any data involving people - and a large amount of the data we are using, both Indigenous and non-Indigenous, will require access control that ensures that data sharing is appropriate.</p>
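<p>To make the idea of tiered access concrete, here is a sketch of how licence tiers might map to groups of authorised users - the tier names and group names are invented, not the actual Sydney Speaks arrangements (in the demo below the groups are Github organisation teams):</p>

```python
# Hypothetical mapping of licence tiers to the groups that hold them.
# All tier and group names are invented for illustration.
LICENCE_GROUPS = {
    "licence-a": {"sydney-speaks-public"},
    "licence-b": {"sydney-speaks-researchers"},
    "licence-c": {"sydney-speaks-project-team"},
    "licence-d": {"sydney-speaks-data-custodians"},
}

def has_access(user_groups, item_licence):
    """True if any of the user's groups holds the licence on the item."""
    return bool(set(user_groups) & LICENCE_GROUPS.get(item_licence, set()))

print(has_access(["sydney-speaks-researchers"], "licence-b"))  # True
print(has_access(["sydney-speaks-public"], "licence-d"))       # False
```

<p>The repository only ever asks "does this user's group hold this item's licence?" - deciding who is <em>in</em> a group stays with the data custodians.</p>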
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide14.gif' alt='Demo
' title='14' border='1' width='85%'/>
<p>In this example demo we have uploaded various collections and are authorising against Github organisations.</p>
<p>In our production release we will use AAF to authorise different groups.</p>
<p>Let's find a dataset: the Sydney Speaks Corpus.</p>
<p>As you can see, we cannot see any data.</p>
<p>Let's log in… We authorise Github…</p>
<p>Now you can see we have access to sub-corpus data, and I am just opening a couple of items.</p>
<p>—</p>
<p>Now in Github we can see the group management example.</p>
<p>I have given myself access to all the licences, as you can see here, and given others access to licence A.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide15.png' alt='
<p>' title='15' border='1' width='85%'/></p>
<p>This diagram is a sketch of the interaction that took place in the demo - it shows how a repository can delegate authorization to an external system - in this case Github rather than CILogon. But we are working with the ARDC to set up a trial with the Australian Access Federation to allow CILogon access for the HASS Research Data Commons so we can pilot group-based access control.</p>
<p>NOTE: This diagram has been updated slightly from the version presented at the conference to make it clear that the lookup to find the licence for the data set is <em>internal</em> to the repository - the id is a DOI but it is not being resolved over the web.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide16.png' alt='
🚧
' title='16' border='1' width='85%'/>
<p>In this presentation, about work which is still very much under construction, we have:</p>
<ul>
<li>Shown an overview of a complete Data Commons Architecture</li>
<li>Previewed a distributed access-control mechanism which separates out the job of storing and delivering data from that of authorising access</li>
<li>We'll be back next year with more about how analytics and data repositories connect using structure and linked data.</li>
</ul>
</section>
FIIR Data Management; Findable Inaccessible Interoperable and Reusable?2021-06-11T00:00:00+02:002021-06-11T00:00:00+02:00ptseftontag:ptsefton.com,2021-06-11:/2021/06/11/faira/index.html<p>This is a work in progress post I'm looking for feedback on the substance - there's a comment box below, email me, or see me on twitter: @ptsefton.</p>
<p>[Update 2021-06-16: had some comments from Michael D'Silva at AARNet - have added a couple of things below.]</p>
<p>I am posting this now because I have joined a pair of related projects as a senior technical advisor, and we will have to look at access-authorization to data on both - licences will vary from open, to click-through agreements, to complex cultural restrictions such as <a href="https://publish.illinois.edu/commonsknowledge/2017/09/07/an-introduction-to-traditional-knowledge-labels-and-licenses/#:~:text=Traditional%20knowledge%2C%20or%20TK%2C%20labels,data%20management%20and%20presentation%20strategies.">TK Licenses</a>:</p>
<ol>
<li><a href="https://doi.org/10.47486/PL074">Australian Text Analytics Platform (ATAP)</a></li>
<li><a href="https://ardc.edu.au/news/a-national-language-data-commons-for-australia/">Language Data Commons for Australia (LDaCA)</a></li>
</ol>
<p><strong>Summary:</strong> Not all research data can be made openly available (for ethical, safety, privacy, commercial or other reasons) but a lot <em>can reasonably be sent over the web to trusted parties</em>. If we want to make it accessible (as per the "A" in the <a href="https://en.wikipedia.org/wiki/FAIR_data">FAIR data</a> principles) then at present each data repository/service has to handle its own access controls. In this post I argue that if we had a <em>Group Service</em> or <em>Licence Service</em> that allowed research teams to build their own groups and/or licences, then the service could issue Group Access Licence URLs. Other services - such as repositories in a trusted relationship with the <em>Group/Licence Service</em>, holding content whose digital licences carry such URLs - could then do a redirect dance (as with OAuth and other authentication protocols), sending users who request access to digital objects to the <em>Group/Licence Service</em>, which would authenticate them, check whether they have access rights, and let the repository know whether or not to give them access.</p>
<hr />
<p>In this post I will look at some missing infrastructure for doing <a href="https://en.wikipedia.org/wiki/FAIR_data">FAIR data</a> (Reminder: FAIR is Findable, Accessible, Interoperable, Reusable data) - and will cite the FAIR principles.</p>
<p>If a dataset can be released under an open licence then that's no problem, but if data is only available for reuse under special circumstances, to certain users, for certain purposes, then the research sector lacks general-purpose infrastructure to support this. Tech infrastructure aside, we do have a way of handling this legally: you specify these special conditions using a <em>licence</em>, as per the FAIR principles.</p>
<blockquote>
<p>R1.1. (Meta)data are released with a clear and accessible data usage licence</p>
</blockquote>
<p>The licence might say (in your local natural language) "Members of international research project XYX can access this dataset". Or "contact us for a specific licence (and we'll add you to a license-holder group if approved)". [Update 2021-06-16: or "this content is licensed to an individual ID".]</p>
<p>Now the dataset can be deposited in a repository, which will take care of <em>some of</em> the FAIR principles for you including the F-word stuff.</p>
<blockquote>
<h2>Findable</h2>
<p>The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services, so this is an essential component of the FAIRification process.</p>
<p>F1. (Meta)data are assigned a globally unique and persistent identifier</p>
<p>F2. Data are described with rich metadata (defined by R1 below)</p>
<p>F3. Metadata clearly and explicitly include the identifier of the data they describe</p>
<p>F4. (Meta)data are registered or indexed in a searchable resource</p>
</blockquote>
<p>Yes, you could somehow deal with all that with some bespoke software service, but the simplest solution is to use a repository - or, if there isn't one, work with infrastructure people to set one up; there are a number of software solutions that can help provide all the needed services. The repository will typically issue persistent identifiers for digital objects and serve up data using a standardised communication protocol (usually HTTP(S)).</p>
<blockquote>
<h2>Accessible</h2>
<p>Once the user finds the required data, they need to know how they can be accessed, possibly including authentication and authorisation.</p>
<p>A1. (Meta)data are retrievable by their identifier using a standardised communications protocol</p>
<p>A1.1 The protocol is open, free, and universally implementable</p>
<p>A1.2 The protocol allows for an authentication and authorization procedure, where necessary</p>
<p>A2. Metadata are accessible, even when the data are no longer available</p>
</blockquote>
<p>But repository software cannot be trusted to understand licence text, and thus cannot work out who to make non-open data available <em>to</em> - so what will (usually) happen is that it makes the non-open data available only to the depositor and administrators. The default is to make it <em>Inaccessible</em> via what repository people call "mediated access" - i.e. you have to contact someone to ask for access and then they have to figure out how to get the data to you.</p>
<p>At the Australian Data Archive they have the "request access" part automated:</p>
<blockquote>
<h2>4 - DOWNLOADING DATA</h2>
<p>To download open access data and documentation, click on the “Download” button next to the file you are interested in. Much of the data in the ADA collection has controlled access, denoted by a red lock icon next to the file. Files with controlled access require you to request access to the data, by clicking on the “Request Access” button.
<a href="https://ada.edu.au/accessing-data/">https://ada.edu.au/accessing-data/</a></p>
</blockquote>
<p>In some cases the repository itself will have some kind of built in access control using groups, or licences or some-such. For example, the <a href="https://alveo.edu.au/">Alveo</a> virtual lab funded by NeCTAR in Australia, on which I worked, has a local licence checker, as each collection has a licence. Some licences just require a click-through agreement, others are associated with lists of users who have paid money, or are blessed by a group-owner.</p>
<p>I'm not citing Alveo as a much-used or successful service - it was not, overall, a great success in terms of uptake - but I think it has a good data-licence architecture: there is a licence component that is separate from the rest of the system. The licence checking sits in front of the "Findability" part of the data and the API - not much of that data is available without at least some kind of licence that users have to agree to.</p>
<img src="https://ptsefton.com/2021/06/11/faira/plantuml_Alveo.png" alt="Diagram">
<p>This pattern makes a clear separation between the licence as an abstract, identifiable thing, and a service to keep track of who holds the licence.</p>
<p>Question is, could we do something like this at national or global scale?</p>
<p>We are part of the way there - we can authenticate users in a number of ways, eg by the Australian Access Federation (<a href="https://aaf.edu.au/">AAF</a>) and equivalents around the world, and there are protocols that allow a service to authenticate using Google, Facebook, Github et al. These all rely on variants of a pattern where a user of service <code>A</code> is redirected to an authentication service <code>B</code> where they put in their password or a one-time key, and whatever other mechanism the IT department deem necessary, and then are redirected back to service <code>A</code> with an assurance from <code>B</code> that this person is who they say they are.</p>
<p>What we don't have (as far as I'm aware) is a general-purpose protocol for checking whether someone holds a licence. A repository could redirect a web user to a Group Licence Server; the user could transact with the licence service, authenticating themselves (in whatever way that licence service supports), and the licence service could check its internal lists of who holds which licence and return the result. If the licence is just a click-through then the user could do the clicking - or request access, or pay money, or whatever is required.</p>
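<p>The first step of that redirect dance might be sketched like this - remembering that this protocol does not exist yet, so the endpoint and parameter names are entirely invented:</p>

```python
from urllib.parse import urlencode

def licence_check_redirect(licence_server, licence_url, return_url):
    """Build the URL a repository would redirect a user to, so that a
    Group/Licence Server can check whether they hold the licence.
    A sketch of a protocol that does not exist yet - the "/check"
    endpoint and both parameter names are invented."""
    query = urlencode({"licence": licence_url, "return_to": return_url})
    return f"{licence_server}/check?{query}"

redirect = licence_check_redirect(
    "https://licences.example.org",
    "https://licences.example.org/licence/xyz-project",
    "https://repo.example.org/object/123/continue",
)
```

<p>After authenticating the user and checking its lists, the licence server would redirect back to <code>return_to</code> with a signed yes/no assertion, much as OAuth providers do.</p>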
<p>[Update: 2021-06-16 This class of service would also be useful for provisioning access to things other than data - such as compute or other workspace resources. Making this a standard protocol means that these services could be offered by different organizations, yes we want national ones but for some kinds of sensitive data a community might want to run and control their own.]</p>
<p>(We are aware of the work on FAIR Digital Objects and the <a href="https://fairdo.org/">FDO Forum</a> - it does say there that:</p>
<blockquote>
<p>FAIR Digital Objects (FDO) provide a conceptual and implementation framework to develop scalable cross-disciplinary capabilities, deal with the increasing data volumes and their inherent complexity, build tools that help to increase trust in data, create mechanisms to efficiently operate in the domain of scientific assertions, and promote data interoperability.</p>
</blockquote>
<p>Colleagues and I have started discussions with the folks there.)</p>
<p>Those of us who were around higher-ed-tech in the '00s in Australia will remember <a href="https://slideplayer.com/slide/1514611/">MAMS</a> - the Meta Access Management System. The leader, James Dalziel, was at all the eResearch-ish conferences talking about a shared federation that would allow you to log in to other people's systems (we got that - it's the aforementioned AAF), with fantastic user stories about being able to log into a data repository and then, by virtue of the fact that you're a female anthropologist, gain access to some cultural resources (we didn't get that bit). I remember <a href="https://theconversation.com/profiles/kent-fitch-137926">Kent Fitch</a>, then from the National Library, one of the team that built the national treasure <a href="https://trove.nla.gov.au/">Trove 😀</a>, bursting that particular bubble over beers after one such talk. He asked: how do you identify an anthropologist? Answer: a university authentication system certainly can't.</p>
<p>I realised a long, long time later that while you can't identify the anthropologists, or tell 'em apart from the ethnographers or ethnomusicologists etc, they <em>can</em> and do make their own groups, via research projects, collaborations and scholarly societies. You <em>could</em> have a group that lists the members of a scholarly society and use that for certain kinds of access control, and you could, of course, let the researchers self-select the people they want to share with - let <em>them</em> set up <em>their</em> groups.</p>
<p>What if we had a class of stand-alone service where anyone could set up a group and add users to it? A project lead could decide on what is an acceptable way to authenticate - via academic federations like AAF, via ORCID, or via public services like Github or Facebook - and then add a list of users via email addresses or other IDs. And what if there was a way to auto-populate that group by linking through to <a href="https://osf.io/">OSF</a> groups, Github organisations, Slack etc (all of which use different APIs, and none of which, as far as I know, know about licences in this sense)? This would be useful for groups of researchers who need access to storage, compute, and yes, datasets with particular licence provisions. There could be free-to-use group access for individuals and paid services for orgs like learned societies, who could use the list to make deals with infrastructure providers, for example. And there need not be only one of these services - they'd work well at a national level, I think, but could be more granular or discipline-based.</p>
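<p>The data model for such a group service could be very small - a sketch, with all names and the set of authentication methods invented:</p>

```python
from dataclasses import dataclass, field

# Sketch of what a stand-alone group service might keep: anyone can
# create a group, choose which authentication methods they will accept,
# and list members by whatever IDs those methods provide.
@dataclass
class Group:
    name: str
    owner: str
    # Invented method identifiers - e.g. academic federation, ORCID, Github
    auth_methods: set = field(default_factory=lambda: {"aaf", "orcid", "github"})
    members: set = field(default_factory=set)

    def add_member(self, member_id):
        self.members.add(member_id)

society = Group(name="Hypothetical Scholarly Society",
                owner="mailto:lead@example.org")
society.add_member("https://orcid.org/0000-0000-0000-0001")
```

<p>The interesting work is not this data structure but the governance around it: who may create groups, how membership is vouched for, and which licences a group is allowed to satisfy.</p>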
<p>(Does such a thing already exist? Did I miss it? Let me know in the comments below or on twitter - I'm @ptsefton)</p>
<p>We could do this something like the way modern authentication services work, with a simple hand-off of a user to an authentication service - but with the addition of a licence URL - to a service that says: yep, I vouch for this person, they have a licence to see the data.</p>
<img src="https://ptsefton.com/2021/06/11/faira/plantuml_FAIRA.png" alt="Diagram">
<p>The above interaction diagram is purely a fantasy of mine. I'm not an Internet Engineer - so I have probably made some horrible errors, please let me know.</p>
<p>Obviously this requires a trust-framework; repositories would have to trust the licence servers and vice-versa and these relationships would have to be time-limited and renewable. You wouldn't want to trust a service for longer than their domain registration for example in case someone else you don't trust buys the domain, that kind of thing. And you'd want some public key stuff happening so that transactions are signed (a further mitigation against domain squatters - they would presumably not have your private key).</p>
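<p>The signing part of that trust relationship might be sketched as follows - using a shared-secret HMAC for brevity where, as noted above, a real deployment would want public-key signatures; the field names and TTL are invented:</p>

```python
import hashlib
import hmac
import json
import time

# Sketch of the signed assertion a licence server might hand back to a
# repository: "this user holds this licence, valid until <expires>".
# The shared-secret scheme and field names are invented; real systems
# would more likely use public-key signatures (e.g. JWTs).
SHARED_SECRET = b"repo-and-licence-server-shared-secret"

def sign_grant(user_id, licence_url, ttl_seconds=300):
    payload = json.dumps(
        {"user": user_id, "licence": licence_url,
         "expires": int(time.time()) + ttl_seconds},
        sort_keys=True,
    )
    sig = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload, sig

def verify_grant(payload, sig):
    expected = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    return json.loads(payload)["expires"] > time.time()

payload, sig = sign_grant("https://orcid.org/0000-0000-0000-0000",
                          "https://licences.example.org/licence/xyz-project")
```

<p>Short expiry times are what make the time-limited, renewable trust described above enforceable in practice - a stale or forged assertion simply fails verification.</p>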
<p>And this is not an overly complicated application we're talking about - all access-controlled APIs already have to do some of this locally. It's the governance - the trust federations - that will take time and significant resources (so let's start now :-).</p>
<p>And while we're on the subject of trust - this scheme would work in the same way most research does: with trust in the people working on the projects, who typically have access to data and are trusted to keep it safe. Being a member of a project team in health research, for example, often involves joining a host organization as an honorary staff member, and being subject to all its policies and procedures. Some research groups have high levels of governance - people are identified using things like nursing registrations and other certifications; some are ad-hoc collections of people identified by a PI using any old email address.</p>
<p>NOTE: for data that needs to be kept really, really secure - data that can never even be put on a hard drive and left on a bus - this proposal is <em>not</em> the scheme. That's where you'd be looking at a Secure eResearch Platform (SeRP), where the data lives in a walled garden and can be inspected only via a locked-down terminal application - or, even stricter, you might only have secure on-site access to data that's air-gapped from any network.</p>
<p>Here's a sketch of some infrastructure. Essentially this is what happened inside Alveo - the question is can it be distributed so repositories can be de-coupled from authorization services?</p>
<img src="https://ptsefton.com/2021/06/11/faira/plantuml_ARCH.png" alt="Diagram">