ptsefton.comhttps://ptsefton.com/2023-06-29T00:00:00+02:00Open Repositories 2023: Trip report2023-06-29T00:00:00+02:002023-06-29T00:00:00+02:00Peter Seftontag:ptsefton.com,2023-06-29:/2023/06/29/report-or-2023/index.html<p>This is a summary of my trip to the 18th Open Repositories conference, hosted by Stellenbosch University in South Africa. My travel was paid for by my main employer, the University of Queensland. I've attended 17 of the 18 OR conferences -- including the first one, which was held in Sydney and organized by the <a href="https://openresearch-repository.anu.edu.au/handle/1885/6614">Australian Partnership for Sustainable Repositories</a>. I think <a href="https://orcid.org/0000-0003-4346-1416">Jon Dunn</a> from Indiana and I are now tied at 17 attendances each -- not sure if there are any other contenders?</p>
<img src='https://ptsefton.com/2023/06/29/report-or-2023/image-1.png' alt='A view of (I think) The Strand, from the conference hotel' />
<p><em>An early morning view from the conference hotel of what I think is the seaside suburb "The Strand"</em></p>
<p>This conference went by really fast -- I was presenting in three sessions, on <a href="/2023/06/13/ro-crate-or-2023/">RO-Crate</a>, <a href="/2023/06/14/arkisto-stack-or-2023/">a description of the Arkisto repository stack</a> and a <a href="/2023/06/13/oni-dev-track-or-2023/">tech-stream talk on building an ad-hoc repository</a> (not a live demo, but featuring screen recordings I made on the 14-hour plane ride to Johannesburg that took me 8 hours into the past). I also chaired a session (more on that below), fell into a jet-lag coma in my room for one session, and by the time I'd done all that there was not a huge amount of choice left in what to attend.</p>
<h1>What's new in repositories</h1>
<p>I chaired a session, <a href="https://www.conftool.net/or2023/index.php?page=browseSessions&form_session=511&presentations=show">Updates on technology platforms</a>, which had coverage of only two major platforms (DSpace and Islandora) -- or three if you count DSpace CRIS (which adds features for tracking research metrics) as a separate thing from DSpace. I heard about Dataverse and Invenio in other sessions, but didn't manage to see anything on EPrints. My general impression from this was: wow, these things have gotten big and complicated, and there are a lot of <em>features</em> these days. Overall, architectures have matured so the software stacks now tend to have APIs, but they tend to remain fairly monolithic (just an impression, maybe I'm wrong).</p>
<p>An example of what I mean by lots of features: DSpace CRIS is a layering of a "Current Research Information System" onto the repository platform. We did the same thing with <a href="https://ptsefton.com/2018/07/06/RedBoX-Provisioner-OR2018/index.html">RedBox</a> -- it started as a repository platform and then, after the funding organization in its wisdom had killed off the idea of it actually being a repository, it became a metadata registry + data management plans + provisioning of services. What we ended up with was a very complicated system built on a repository-focussed platform that in hindsight would have been better built on a standard web application platform. It's still valuable, but last time I checked it could do with a rewrite. Maybe in Salesforce? No, I'm kidding, not that. I'm not sure, but I suspect DSpace CRIS may be like that -- building lots of extra functionality into a core application certainly complicates keeping it up to date with the main-line application, at any rate.</p>
<p>Chairing has never been my favourite part of a conference, but I kept everybody to time -- I was grinding my teeth at presenters (often with something to sell) in this and other sessions (no names) who had super-slick slides with a pretty high marketing content, including stuff like the history of their thing, laced with cute cultural references. Fine, but don't be asking the chair for more time if your presentation is half filler that seemed funny back in the office. Speaking of cultural references, did I mention that some of the 'featuritis' that makes these things expensive to deploy and manage reminded me of that time Homer put all those gadgets in his car:</p>
<img src="https://ptsefton.com/2023/06/29/report-or-2023/image-4.png">
<p><em>Image of Homer Simpson DJing in his car for which I have not secured rights, which I found <a href="https://simpsons.fandom.com/wiki/Brake_My_Wife,_Please/Gallery?file=Screenshot_%289277%29.png">here</a>.</em></p>
<h1>Highlight</h1>
<p>The highlight of the conference for me was the closing keynote:</p>
<blockquote>
<p>Our closing keynote speaker is Prof Hussein Suleman whose research is situated within the Digital Libraries Laboratory in the Department of Computer Science at the University of Cape Town. This session promises to inspire and challenge our participants - to encourage them to think broadly about the ways repositories enable discoverability and interoperability of information and data within the structured web of data.</p>
</blockquote>
<p>Prof Suleman's talk (<a href="http://www.husseinsspace.com/">home page</a>, <a href="https://orcid.org/0000-0002-4196-1444">ORCiD</a>) was indeed challenging to the OR status quo: he questioned the complexity of current repositories (see above for my comments on the same) and bemoaned the incursion of proprietary software into institutions, pushing aside a movement that has its roots firmly in open source. Like I mentioned above, some stuff in the repositories world has drifted pretty far from the original mission of keeping stuff safe and making sure people can find and use it. Regarding resourcing, Suleman said he doesn't like the label "The Global South" and prefers to talk about low-resource environments. He noted that much of what he'd heard at the conference -- which he actually attended -- was not relevant to those working in these environments.</p>
<p>Suleman pointed out that while resourcing is of course about money, there are several dimensions; he mentioned that in rural areas all over the world various resources are in short supply (money, people, skills, network etc), as they are in poorly resourced organizations such as some NGOs. All of this resonates with the experiences of colleagues working with language data, or anything really, in parts of Australia and the Pacific.</p>
<p>He also identified a huge resource problem, which is that archive projects are often funded for the build but not for maintenance (and not just archive projects -- even in rich countries like Australia we see funding for building-but-not-maintaining), which has led Suleman and his colleagues to investigate technologies that are resilient to funding-failure.</p>
<img src='https://ptsefton.com/2023/06/29/report-or-2023/image-2.png' alt='The edge of what I think is called an "informal settlement" near the road to Cape Town' />
<p><em>Speaking of "low resource", not far from the conference hotel are very different accommodations from the shining towers at The Strand. Someone told me there are about 500,000 people living in this part of Cape Town</em></p>
<p>Suleman works on systems such as the "Simple Archives Project" that demonstrate techniques for preserving archives in low-resource environments:</p>
<blockquote>
<p>Digital library systems are not always successfully implemented and sustainable in low resource environments, such as in poor countries and in organisations without resources. As a result, some archives with important collections are short-lived while others never materialise. This paper presents a new toolkit for the creation of simple digital libraries, based on a long trajectory of research into architectural styles. It is hoped that this system and approach will lower the barrier for the creation of digital libraries and provide an alternative architecture for experiments and the exploration of new design ideas. <a href="https://pubs.cs.uct.ac.za/id/eprint/1512/">https://pubs.cs.uct.ac.za/id/eprint/1512/</a></p>
</blockquote>
<p>This uses CSV (the killer app, says Prof Suleman) for metadata, which is transformed into XML and then into a static website -- taking this approach should make things a little more sustainable than repeatedly having to migrate to the latest and greatest DSpace. (I <a href="/2023/06/13/oni-dev-track-or-2023/">showed our approach to this</a> in the tech stream).</p>
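<p>To make the CSV-to-static-site idea concrete, here is a minimal sketch of the general pattern (this is my illustration, not Suleman's actual toolkit, and the column names and file layout are made up for the example). One metadata row in, one standalone HTML page out -- nothing to keep running once the pages are written:</p>

```python
import csv
import html
import io
import pathlib

# Hypothetical two-row metadata file in the spirit of a CSV-first
# archive -- the identifiers and columns are illustrative only.
CSV_TEXT = """identifier,title,description
item-001,Field recording 1,Audio recorded in 1998
item-002,Field recording 2,Transcript of an interview
"""

def render_item(row):
    """Render one metadata row as a standalone static HTML page."""
    return (
        "<!DOCTYPE html><html><head><title>{t}</title></head>"
        "<body><h1>{t}</h1><p>{d}</p></body></html>"
    ).format(t=html.escape(row["title"]), d=html.escape(row["description"]))

def build_site(csv_text, out_dir):
    """Write one HTML page per CSV row into out_dir."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for row in csv.DictReader(io.StringIO(csv_text)):
        (out / f"{row['identifier']}.html").write_text(render_item(row))

build_site(CSV_TEXT, "site")
print(sorted(p.name for p in pathlib.Path("site").glob("*.html")))
```

<p>The point of the exercise is the failure mode: if funding stops, the generated pages keep working on any web server (or a USB stick) with no database or application runtime to maintain.</p>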
<p>I had noted a couple of things I heard in a <a href="https://www.conftool.net/or2023/index.php?page=browseSessions&form_session=509#paperID200">session on accessing research data</a> that are relevant to Suleman's message:</p>
<ol>
<li>
<p>There was a really great presentation, "Rethinking the A in FAIR Data: issues of data access and accessibility in research", that looked at just how accessible supposedly universally accessible resources really are, by testing network access to repositories from different regions -- from some countries you get to see nothing at all. Here's a <a href="https://www.frontiersin.org/articles/10.3389/frma.2022.912456/full">link to a full-text paper by the same authors</a>.</p>
</li>
<li>
<p>In reply to a question about impact after the presentation "Repositioning Repositories: Designing and Assessing the Life Cycle of Research Infrastructures", the presenter (I think it was Ron Dekker speaking) noted that when the UK introduced Hybrid Open Access to publishing, publishers used this as an opportunity to increase their profits -- it didn't result in better OA or lower costs as the policy makers had planned.</p>
</li>
</ol>
<h2>Reflections on the keynote in the context of our work</h2>
<p>To me, Suleman's presentation was spot on and I agreed with his main themes, but I did have a couple of issues to discuss, relating what he said back to our work.</p>
<p><strong>I don't think XML is currently the best way to represent metadata for preservation</strong>, or even as part of an ephemeral toolchain; I think it's better to use JSON-LD (but I would say that, as I'm the co-editor of the RO-Crate spec, which does just that). RO-Crate is much more friendly to 2020s programmers, is more easily extensible, and is absolutely aligned with the idea of having static repositories in that it encourages the use of HTML previews for data objects so they can be understood <em>without</em> additional software. RO-Crate has standard tools for this and more are in the pipeline.</p>
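<p>For readers who haven't seen one, this is roughly what a minimal RO-Crate metadata document looks like, built here as a plain Python dict: the metadata descriptor entity, the root dataset, and one file entry. The collection name and file are invented for the example; only the two required entities and the overall shape follow the RO-Crate 1.1 spec.</p>

```python
import json

# A minimal RO-Crate 1.1 metadata document as a plain dict.
# The collection and file named here are illustrative only.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            # The metadata descriptor: points at the spec and the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # The root dataset: describes the crate as a whole.
            "@id": "./",
            "@type": "Dataset",
            "name": "Example language collection",
            "description": "Illustrative only -- not a real collection",
            "hasPart": [{"@id": "recording.wav"}],
        },
        {
            # One data entity contained in the crate.
            "@id": "recording.wav",
            "@type": "File",
            "name": "Field recording",
        },
    ],
}

print(json.dumps(crate, indent=2))
```

<p>Serialized as <code>ro-crate-metadata.json</code> and dropped in a directory next to the data, this is readable by any JSON parser, and with an HTML preview alongside it the object is self-describing without any repository software.</p>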
<p>And if RO-Crate/JSON-LD is too complicated (and I have to admit, the latest spec is getting that way) then <a href="https://frictionlessdata.io/">Frictionless Data</a> is a JSON-only approach that is simpler to implement.</p>
<p>By the way, thinking about XML for metadata I was reminded of a comment made by Ron Ward at USQ many years ago when he encountered the use of XML for metadata. He remarked that in this role (and other data interchange scenarios) XML is not acting as a markup language at all. A markup language, like HTML -- THE Hypertext Markup Language -- adds semantic and/or formatting information to textual data, which is a very different task from representing hierarchical data structures the way JSON does. XML's markup-focussed heritage means that it is MUCH more complicated to use, even for simple metadata schemas.</p>
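<p>A tiny example of what I mean (the record and its fields are invented for illustration): in XML the modeller has to choose between attributes, child elements and text content -- three ways to say the same thing -- and every consumer has to know which choice was made, whereas the JSON rendering has one obvious shape.</p>

```python
import json
import xml.etree.ElementTree as ET

# The same tiny (made-up) record both ways. Note the XML mixes
# attributes (id, code) and element text (title) -- a consumer must
# know which convention each field uses.
xml_text = """<record id="item-001">
  <title>Field recording</title>
  <language code="en"/>
</record>"""

record = ET.fromstring(xml_text)
as_json = {
    "id": record.get("id"),                       # attribute on the root
    "title": record.findtext("title"),            # text of a child element
    "language": record.find("language").get("code"),  # attribute on a child
}
print(json.dumps(as_json))
```
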
<p>One XML-based protocol that Suleman mentioned was the venerable <a href="https://www.openarchives.org/pmh/">OAI-PMH</a>, which he was involved in creating around the turn of this century. OAI-PMH is a way for repositories to be harvested so their resources can be centrally indexed for discovery. He commented that it had stood the test of time, and considered it to be one of those things that could be implemented by a competent developer in a day. I'd have to say I think that's overly optimistic; just using existing OAI-PMH software, with all the complexities of sending differently flavoured XML over the wire, used to consume a <em>lot</em> of developer and support time when I was involved in running a repository support service at the University of Southern Queensland. (Maybe the reason we had trouble was that the implementations we used were built in a day? 🤣)</p>
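<p>To give a flavour of what a harvester deals with, here is a sketch that parses a canned, minimal <code>ListRecords</code> response (the record and resumption token are invented, not from a real repository). Even this happy-path case needs careful namespace handling; the real pain starts when repositories send differently flavoured XML:</p>

```python
import xml.etree.ElementTree as ET

# ElementTree uses "{namespace}tag" notation for qualified names.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

# A canned, minimal ListRecords response -- illustrative only.
RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

def parse_list_records(xml_text):
    """Return ([(identifier, title), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(f"{OAI_NS}record"):
        ident = rec.find(f"{OAI_NS}header/{OAI_NS}identifier").text
        title = rec.find(f".//{DC_NS}title")
        records.append((ident, title.text if title is not None else None))
    token = root.find(f".//{OAI_NS}resumptionToken")
    return records, token.text if token is not None else None

records, token = parse_list_records(RESPONSE)
print(records, token)
```

<p>A full harvester loops on the resumption token, fetching pages until the token comes back empty -- and that loop is where real-world servers' quirks (truncated XML, wrong encodings, flaky tokens) eat the developer time I'm complaining about.</p>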
<p>Suleman referenced current work on static repositories, with a few totally justified "what took you so long?" jibes. In particular he name-checked the Oxford Common File Layout (OCFL) (which we use in our current projects), saying that <em>maybe</em> it's too complicated. Maybe it is, and we've had these discussions in the LDaCA project and with people in our partner project PARADISEC, but our current thinking is that while the OCFL file structure is not exactly human-friendly, its other preservation features make it an OK trade-off. And it's not that hard to implement. For example, with Mike Lynch and Moises Sacal Bonequi at UTS we got a minimal OCFL library going in not much more than a day's work, and I believe John Ferlito at PARADISEC has had a similar experience adding S3 support this year -- it's days, not months, of work to write a library that can be used for a project. (Writing a more mature, more fully featured library is a lot more work, but will benefit a lot more projects.)</p>
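<p>To back up the "days, not months" claim, here is a minimal sketch (my own, under stated assumptions) of writing a one-version OCFL object: the NAMASTE conformance file, a <code>v1/content</code> directory, and an <code>inventory.json</code> with a manifest and version state. It omits the inventory sidecar digest, fixity block and update logic that a real library needs, and the object id is invented:</p>

```python
import hashlib
import json
import pathlib
from datetime import datetime, timezone

def write_ocfl_object(root, obj_id, files):
    """Write a minimal one-version OCFL 1.1 object.

    A sketch of the layout only: NAMASTE conformance file, v1 content
    directory, and an inventory.json. No sidecar digest, no fixity
    block, no support for adding later versions.
    """
    root = pathlib.Path(root)
    (root / "v1" / "content").mkdir(parents=True, exist_ok=True)
    # NAMASTE file marking this directory as an OCFL object.
    (root / "0=ocfl_object_1.1").write_text("ocfl_object_1.1\n")

    manifest, state = {}, {}
    for name, data in files.items():
        (root / "v1" / "content" / name).write_bytes(data)
        digest = hashlib.sha512(data).hexdigest()
        manifest[digest] = [f"v1/content/{name}"]  # content-addressed storage paths
        state[digest] = [name]                     # logical paths in this version

    inventory = {
        "id": obj_id,
        "type": "https://ocfl.io/1.1/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": "v1",
        "manifest": manifest,
        "versions": {
            "v1": {
                "created": datetime.now(timezone.utc).isoformat(),
                "state": state,
            }
        },
    }
    (root / "inventory.json").write_text(json.dumps(inventory, indent=2))

# Illustrative object id and content only.
write_ocfl_object("obj1", "https://example.org/obj1", {"hello.txt": b"hi\n"})
```

<p>The content-addressed manifest is what buys the preservation features: every file is verifiable against its digest, and later versions can share unchanged content rather than duplicating it.</p>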
<p>But as with the comment I made about RO-Crate above, if you're working in a resource-constrained environment where OCFL is considered too complicated, then data can of course simply be placed in a directory/folder in a storage system, a la Suleman's Simple Archives -- this is the approach the PARADISEC team take when they carry portions of the PARADISEC repository on Raspberry Pi based servers to various Pacific Islands, where people can use the repositories from their phones (or whatever devices) on hyper-local networks.</p>
<h2>Summary & follow-up</h2>
<p>In summary, I think Suleman's talk was a really great articulation of many of the issues in repository practice; I think he made his points very powerfully, including a couple of comments that may have made a conference sponsor or two grimace.</p>
<p>In response to Suleman's challenges I'd like to propose a stream of work at next year's Open Repositories conference in Sweden.</p>
<p>How about we hold a pre-conference hands-on workshop that challenges repository developers to embrace some of the approaches Suleman is talking about -- storing files on disk, zero-install indices of content etc? How simple could you go to radically re-imagine a repo stack?</p>
<p>I'd like to see a mixture of institutional and commercial developers get involved, and to step out of their big fully-featured repository palaces and see what we can get done in a few days over the conference. We'd then have a session at the end of the conference that builds on work by Suleman and others on low-resource-ready repository and archive solutions. There might be token prizes as there are for poster presentations.</p>
<h2>That thing where you go to a conference to meet someone from up the road</h2>
<p>There were two people from Australian institutions at the conference. Me, obviously, and the other was Janet McDougall from the Australian Data Archive and the Australian National University. She was presenting in a session on Indigenous Knowledge Preservation:</p>
<blockquote>
<h3>Australian Data Archive (ADA) and Australian National University (ANU) and The Dataverse Project, TKLabels use case</h3>
</blockquote>
<blockquote>
<p>Janet McDougall & Steven McEachern (Australian Data Archive, Australia), Sonia Maria Barbosa (Harvard University, United States)</p>
</blockquote>
<blockquote>
<p>The TK and BC Labels are an initiative for Indigenous communities and local organizations. Developed through sustained partnership and testing within Indigenous communities across multiple countries, the Labels allow communities to express local and specific conditions for sharing and engaging in future research and relationships in ways that are consistent with already existing community rules, governance, and protocols for using, sharing and circulating knowledge and data.
ADA has an interest in establishing the means for providing suitable representation of indigenous knowledge within the Dataverse software. This includes functionality in Dataverse to:</p>
<ul>
<li>link to and incorporate identified sources for indigenous knowledge representation, such as TKLabels and Notices</li>
<li>curation processes for managing the creation, reading, updating, and deleting of metadata</li>
<li>present curated metadata (e.g., TKLabels and TKNotices) in catalogue records</li>
<li>allow external aggregators to harvest this metadata (specifically the IDN Data Catalogue, but a preferably standardized model that allows for multiple external parties to harvest)</li>
</ul>
</blockquote>
<p>There's a workshop next week in Brisbane for the Australian HASS and Indigenous Research Data Commons project(s), at which Janet will be presenting this work and I will be presenting the LDaCA data access and authorization model; we're looking to share implementation experience between projects -- more on that soon. We're interested in how labels that "express conditions" can be incorporated into environments that authorize access.</p>
<p>On a global scale Janet and I are near neighbours, but we only seem to get to talk in Montana, or the Kirstenbosch botanic gardens, where outgoing conference chair Claire Knowles and I accompanied Janet for a post-conference volunteer-led botanical tour.</p>
<img src="https://ptsefton.com/2023/06/29/report-or-2023/image-6.png" alt="A woman holding onto a live flower">
<p><em>Our super-informative volunteer guide (didn't catch her name) explaining something about a strelitzia (she said you have to understand the empire builder Cecil Rhodes as a man of his time, and after all he did give Cape Town this botanic garden)</em></p>
<h2>South Africa</h2>
<p>Here are a few pics and fragments from the trip. I had a few days of personal time (in accordance with the strict UQ policy on these matters).</p>
<img src='https://ptsefton.com/2023/06/29/report-or-2023/image-3.png' alt='A cheetah on a leash with his handler' />
<p><i><s>Tigers</s>Cheetahs on a <s>gold</s> leash - good thing I packed the 75-300mm MFT lens for this suburban safari</i></p>
<p>Not far from the conference, ten minutes by the ubiquitous Toyota Corolla Uber, in a gated community, is a cheetah 'sanctuary' where you can pat cheetah 'ambassadors' and gawk at a few other animals through the wire. This apparently has something to do with conservation.</p>
<img src='https://ptsefton.com/2023/06/29/report-or-2023/image.png' alt='A bushwalk at Kirstenbosch with clouds on Table Mountain' />
<p><em>A little bushwalk (boschwalk?) around Kirstenbosch - clouds on Table Mountain again</em></p>
<p>Also this happened:</p>
<img src="https://ptsefton.com/2023/06/29/report-or-2023/image-5.png" alt="Tweet about a singalong" />
<p><em><a href="https://orcid.org/0000-0001-5754-9940">Kim Shepherd</a> and I were quietly jamming on our ukes in the hotel lobby after the conference dinner when <a href="https://twitter.com/pedroprincipe">Pedro Principe</a> turned up -- Pedro got the party started on Kim's uke and I tried to keep up. Kim did percussion on a wine bottle he'd rescued from the dinner. We were mentioned in dispatches, incoming conference chair Torsten Reimer put out a call in the closing remarks for small instruments to be brought to <a href="https://or2024.openrepositories.org/">Gothenburg next year</a></em></p>
<p>Speaking of conference chairs, here's me and my conference besties, Kim and outgoing chair <a href="https://orcid.org/0000-0002-6969-7382">Claire Knowles</a>, about to ascend via cable car into the Table Mountain cloud. The view was OK at the bottom station, and the top reminded me of home in Katoomba, at 1000m above sea level, on a misty day -- some nice wet rocks and shrubs were on view. Also a youth group running about purposefully with a couple of cheerful leaders, possibly channeling the spirit of Baden-Powell. Anyway, they were totally lost; we were thinking of following them to safety when they ran back from the direction opposite to the one in which they'd departed and asked us the way to somewhere or other.</p>
<img src="https://ptsefton.com/2023/06/29/report-or-2023/image-7.png" alt="Me, Kim and Claire above Cape Town">
<p>We talked about the complexity of the current crop of repository systems, and how to use more distributed, less monolithic designs; Claire's planning something big but confidential for now (possibly with a <a href="https://github.com/wellcomecollection">similar design to this</a>), and Kim's been working with a <a href="https://www.knowledge-basket.co.nz/">regional archive</a> that uses quite simple underlying tech somewhat reminiscent of the approaches advocated by Suleman in his keynote (though we were yet to hear that).</p>
<p>Here are some penguins.</p>
<img src="https://ptsefton.com/2023/06/29/report-or-2023/image-8.png" alt="African Penguins on a rock">
<p>And I did a <a href="http://www.bokaapcookingtour.co.za/">Cape Malay cooking class/tour</a> of the Bo-Kaap neighbourhood with Zayed and family. Highly recommended.</p>
<img src="https://ptsefton.com/2023/06/29/report-or-2023/image-9.png" alt="Me flipping a roti while Zayed cheers me on">
<p><em>Here I am flipping a roti</em></p>
Towards a Generic Research Data Commons: A highly scalable standard-based repository framework for Language and other Humanities data2023-06-14T00:00:00+02:002023-06-14T00:00:00+02:00Peter Seftontag:ptsefton.com,2023-06-14:/2023/06/14/arkisto-stack-or-2023/index.html<p><a href="arkisto-stack-or-2023.pdf">Download as PDF</a></p>
<p>This presentation was delivered by Peter Sefton at the <a href="https://or2023.openrepositories.org/">Open Repositories 2023</a> conference in South Africa on 2023-06-14 in the <a href="https://www.conftool.net/or2023/index.php?page=browseSessions&form_session=460&presentations=show">Presentations: Discipline specific systems with FAIR principles
</a> session. It is also available on the <a href="https://www.ldaca.edu.au/posts/arkisto-stack-or-2023/">LDaCA website</a>.</p>
<p>This contains the slides and complete speaker notes, which have been edited after the conference.</p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<p>We will present a standards-based generalized architecture for large-scale data* repositories for research and preservation illustrated with real world examples drawn from a number of languages and cultural archive projects. This work is taking place in the context of the Australian Humanities and Social Sciences Research Data Commons, particularly the Language Data component thereof and the long-established PARADISEC cultural archive. The standards used include the Oxford Common File Layout for storage, Research Object CRATE (RO-Crate) for consistent linked-data description of FAIR digital objects, and a language data metadata profile to ensure long-term interoperability between systems and re-usability over time. We also discuss data licensing and authorization for access to non-open resources. We suggest that the approach shown here may be used in other disciplines or for other kinds of digital library, repository or archival systems.</p>
<p>*The submitted abstract did not have the word data here - added for clarity</p>
<p>By: Peter Sefton (University of Queensland), Simon Musgrave (University of Queensland & Monash University) & Nick Thieberger (University of Melbourne)</p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide01.png' alt='The Language Data Commons of Australia (LDaCA) and Australian Text Analytics Platform (ATAP) projects received investment (https://doi.org/10.47486/DP768 and https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS). ARC LIEF LE210100013 (2021-2024) Nyingarn: a platform for primary sources in Australian Indigenous languages ' title='Slide: 1' border='1' width='85%%'/>
<p>This work is supported by the Australian Research Data Commons.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide02.png' alt=' With thanks for their contribution: Partner Institutions: ' title='Slide: 2' border='1' width='85%%'/>
<p>The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the HASS (Humanities and Social Sciences) and Indigenous Research Data Commons (HASS+I RDC).</p>
<p>The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities.</p>
<p>The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artifacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations. These projects are led by Professor Michael Haugh of the School of Languages and Cultures at the University of Queensland with several partner institutions.</p>
<p>We would like to acknowledge the traditional custodians of the lands on which we live and work and the importance of Indigenous knowledge, culture and language to these projects. Peter Sefton lives and works on Wiradjuri land, and for Nick Thieberger and Simon Musgrave it's the land of the Kulin nation.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide03.png' alt='Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) Established 2003 Researchers concerned to digitise, preserve, and make accessible recordings in the many languages of the region around Australia No other agency taking responsibility for these recordings so they were at risk of loss Catalog exposes the existence of these recordings, 38,000 items in 690 collections Currently represent 1,350 languages, in 205 terabytes, with over 16,000 hours of audio recordings, 3,000 hours of video ' title='Slide: 3' border='1' width='85%%'/>
<p>PARADISEC is an online archive of cultural data which has been maintained for twenty years; in this presentation we will look at some of the lessons learned from PARADISEC. In summary -- the PARADISEC approach to simple data and metadata storage is something we want to continue in LDaCA, while the high cost to PARADISEC of commissioning and maintaining its own software stack is something we want to address by taking a more standards-based approach to managing language and other data over the coming decades.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide04.png' alt=' ' title='Slide: 4' border='1' width='85%%'/>
<p>The Arkisto platform started in 2019 as a way to capture the lessons of PARADISEC and other projects such as Alveo (another language data project similar in scope to LDaCA) which was presented at OR 2014: Sefton PM, Estival D, Cassidy S, Burnham D, Berghold J. The Human Communication Science Virtual Lab (HCS vLab): A repository microclimate in a rapidly evolving research-ecosystem. In: Open Repositories 2014. Helsinki; 2014 [cited 2016 Jul 19]. Available from: http://www.doria.fi/handle/10024/97740</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide05.png' alt=' ' title='Slide: 5' border='1' width='85%%'/>
<p>This diagram was used in the bid documents that established LDaCA - it shows the progression of data from end-of life projects and active repositories into a standards-based data-commons.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide06.png' alt=' ' title='Slide: 6' border='1' width='85%%'/>
<p>This is the data triage process we have been going through in LDaCA -- and it should be noted that of all the data we are presented with, most of it needs to be reworked into the Arkisto Standards Stack. Even PARADISEC, which in 2019 received the international <a href="https://www.coretrustseal.org/why-certification/certified-repositories/">Core Trust Seal</a> based on the <a href="http://www.coretrustseal.org/requirements/">DSA-WDS Core Trustworthy Data Repositories Requirements</a>, is still in the process of migrating data to more sustainable formats.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide07.png' alt=' ' title='Slide: 7' border='1' width='85%%'/>
<p>This is a taster of what data looks like in the kinds of repositories we are talking about. This site contains harvested metadata about holdings on Australian Indigenous Languages in University of Queensland Libraries.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide08.png' alt='' title='Slide: 8' border='1' width='85%%'/>
<p>The LDaCA services we are building use an API to drive the data portals. The API can be used for direct access with appropriate access control – see <a href="posts/fair-care-eresearch-2022">another eResearch presentation</a> which explains this in detail. These screenshots show code notebooks running in BinderHub on the Nectar cloud accessing language resources.</p>
<p>This work has also been <a href="https://digital.library.unt.edu/ark:/67531/metadc2114304/">written up</a> for the <em>2nd International Workshop on Digital Language Archives (LangArc 2023) virtual workshop on digital language archives</em> 2023-06-30.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide09.png' alt=' ' title='Slide: 9' border='1' width='85%%'/>
<p>This is the overall architecture for data storage and delivery -- missing is how data gets into the repository, but we’ll come to that later.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide10.png' alt=' ' title='Slide: 10' border='1' width='85%%'/>
<p>At this point I will introduce one of the themes of this talk. In March this year, <a href="https://bibwild.wordpress.com/2023/03/21/ocfl-and-source-of-truth-two-options/">this blog post was published</a> - looking at the pros and cons of using OCFL (the Oxford Common File Layout) as the “source of truth” for a system (say a repository).</p>
<p>We are very much taking the OCFL (that is, file-in-storage-as-the-source-of-truth) approach in LDaCA. Which raises the question: “But doesn’t that mean that it’s very specific to language data?” No, because we’re using a very flexible, extensible, discipline-neutral format for data description – yes, we have ways to specialise metadata and interfaces for language and other cultural metadata, but NO, the systems are not locked in to that mode of operation. This means we should be able to share development and maintenance more broadly than with a single archive.</p>
<p>Two main points we want to get across in this presentation:</p>
<ul>
<li>We are taking seriously the idea that data-in-storage should be “batteries included” – everything needed to preserve and use the data is stored together and systems can be reconstituted from this storage.</li>
<li>This approach IS generic – different vocabularies / schemas can be plugged-in by design.</li>
</ul>
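To make the “batteries included” idea concrete, here is a sketch of what a minimal <code>ro-crate-metadata.json</code> might contain. The <code>make_crate</code> helper is hypothetical; only the context/profile URLs and the schema.org property names come from the RO-Crate specification, and a real crate would carry much richer description.

```python
import json

# Sketch: a minimal RO-Crate metadata document. The description travels
# with the data, and domain profiles can add extra properties to any
# entity without breaking generic tools.
def make_crate(name, description, files):
    graph = [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "hasPart": [{"@id": f} for f in files],
        },
    ]
    # One File entity per payload file.
    graph += [{"@id": f, "@type": "File"} for f in files]
    return {"@context": "https://w3id.org/ro/crate/1.1/context",
            "@graph": graph}

crate = make_crate("Example corpus", "A demo collection",
                   ["texts/play1.xml"])
metadata_document = json.dumps(crate, indent=2)
```

The same generic structure serves language data, grant documents or history projects; only the vocabulary plugged into the entities changes.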
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide11.png' alt='📂 🔬 🔭 📹 💽 🖥️ ⚙️🎼🌡️🔮🎙️🔍🌏📡💉🏥💊🌪️ ' title='Slide: 11' border='1' width='85%%'/>
<p>So let’s now start looking at the standards involved in the Arkisto approach. This is a slide from “What is RO-Crate” – the dataset may contain any kind of data resource about anything, in any format, as a file or URL.</p>
<p>Stian Soiland-Reyes, Peter Sefton, Mercè Crosas, Leyla Jael Castro, Frederik Coppens, José M. Fernández, Daniel Garijo, Björn Grüning, Marco La Rosa, Simone Leo, Eoghan Ó Carragáin, Marc Portier, Ana Trisovic, RO-Crate Community, Paul Groth, Carole Goble (2022): Packaging research artefacts with RO-Crate. Data Science 5(2). https://doi.org/10.3233/DS-210053</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide12.png' alt=' ' title='Slide: 12' border='1' width='85%%'/>
<p>The core standard for this work is RO-Crate (Research Object Crate), in which all data is input, stored and output. This is a big step for eResearch systems – no longer is there a transformation step on data onboarding (we used the term ingest, but some project members and partners found the metaphor distasteful).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide13.png' alt=' Here the mechanism is to use the ‘magic’ name METS.xml to store some extra metadata – with a fully linked-data system this kind of thing is not needed ' title='Slide: 13' border='1' width='85%%'/>
<p>This screenshot is a bit of (undated) DSpace documentation found following a tip from Kim Sheppard – we have included it here to illustrate that storing additional metadata (in this case METS) for an object was done by convention. Using a linked-data system means that we no longer have to do this kind of thing – RO-Crate still has magic file names, but only one for the metadata and one for the HTML preview – everything else is labelled and extensible.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide14.png' alt='Using this core layer gives you interoperability with generic tools and general purpose “Who What Where” metadata ' title='Slide: 14' border='1' width='85%%'/>
<p>In the early days of the “Open Repositories” movement, repositories had Dublin Core metadata (a standard with a few different flavours).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide15.png' alt=' Using a domain specific profile extends the core RO-Crate for a specific type of data – eg language data, computational workflows or “cultural collections” (You can use more than one profile) ' title='Slide: 15' border='1' width='85%%'/>
<p>These days, using linked data, it is no longer necessary to have a bevy of XML schemas with incompatible encodings to store data from different schemas; different vocabularies and ontologies can co-exist and be expressed in a common way.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide16.png' alt='' title='Slide: 16' border='1' width='85%%'/>
<p>In the PARADISEC system this is achieved by storing files on disk in a simple hierarchy - with metadata and other resources stored together in a directory - this scheme allows for hands-on management of data resources, independently of the software used to serve them.</p>
<p>This approach means that if the PARADISEC software-stack becomes un-maintainable for financial or technical reasons the important resources, the data, are stored safely on disk with their metadata and a new access portal could be constructed relatively easily.</p>
<p>Despite the valuable features of this solution, it is not generalisable. The metadata.xml is custom to PARADISEC, as is the software stack.</p>
<p>In 2019 PARADISEC and the eResearch team at UTS received small grants from the Australian National Data Service and began collaborating on an approach to managing archival repositories which built on this PARADISEC approach of storing metadata with data.</p>
<p>The UTS team presented on this at <a href="https://ptsefton.com/2019/11/05/FAIR%20Repo%20-%20eResearch%20Presentation/index.html">eResearch Australasia 2019</a></p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide17.png' alt=' ' title='Slide: 17' border='1' width='85%%'/>
<p>The <a href="https://www.researchobject.org/ro-crate/1.1/structure.html">structure of an RO-Crate</a> is very similar to the PARADISEC example above, but with a json file instead of XML, and an optional preview in HTML.</p>
<p>RO-Crate has a growing number of <a href="https://www.researchobject.org/ro-crate/tools/">tools and software libraries</a> which means that a team such as PARADISEC does not have to maintain their own bespoke software.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide18.png' alt=' ' title='Slide: 18' border='1' width='85%%'/>
<p>Here, for comparison, is <a href="https://wiki.lyrasis.org/display/FEDORA6x/Fedora+OCFL+Object+Structure#FedoraOCFLObjectStructure-FedoraAtomicResource-Container">how Fedora 6 would store an object (an Atomic Resource in Fedora-speak) like this with multiple files</a>. Like RO-Crate, this uses linked data, but in this case split up into multiple files containing RDF triples. (This is similar to the pre-RO-Crate approach taken by the Research Object spec.)</p>
<p>This also shows some of what an OCFL repository looks like – this is an OCFL object with a single version.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide19.png' alt=' This is an RO-Crate Object which is stored as an OCFL Object ' title='Slide: 19' border='1' width='85%%'/>
<p>This screenshot shows an example of an Arkisto-style use of OCFL (all of the metadata is stored in the ro-crate-metadata.json rather than spread out as in Fedora).</p>
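As a rough illustration of what a single-version OCFL object records, the sketch below builds a minimal <code>inventory.json</code> in the spirit of the OCFL 1.0 spec (the manifest maps content digests to stored paths, and each version’s state maps digests back to logical file names). The <code>make_inventory</code> helper and the example identifier are invented for illustration.

```python
import hashlib

# Sketch: a minimal single-version OCFL inventory. A real OCFL object also
# has namaste files, an inventory sidecar digest, and content on disk.
def make_inventory(object_id, files):
    """files: dict mapping logical path -> bytes content."""
    manifest, state = {}, {}
    for path, content in files.items():
        digest = hashlib.sha512(content).hexdigest()
        manifest.setdefault(digest, []).append(f"v1/content/{path}")
        state.setdefault(digest, []).append(path)
    return {
        "id": object_id,
        "type": "https://ocfl.io/1.0/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": "v1",
        "manifest": manifest,
        "versions": {
            "v1": {"created": "2023-06-14T00:00:00Z", "state": state}
        },
    }

inv = make_inventory("arcp://name,example/object-1",
                     {"ro-crate-metadata.json": b"{}"})
```

In the Arkisto style shown on the slide, the whole description lives in that one <code>ro-crate-metadata.json</code> payload file, rather than being spread over multiple RDF files as in the Fedora example.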
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide20.png' alt=' ' title='Slide: 20' border='1' width='85%%'/>
<p>Now we come to the second core standard in our stack, the <a href="https://ocfl.io">Oxford Common File Layout</a> – which is something we found out about via Open Repositories – I couldn’t make the presentation, but I got a corridor briefing on this from Neil Jeffries in Bozeman at OR 2018.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide21.png' alt=' ' title='Slide: 21' border='1' width='85%%'/>
<p>This slide shows the interface between our core standards – a compliant OCFL repository has Objects within it that conform to the RO-Crate specification.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide22.png' alt=' ' title='Slide: 22' border='1' width='85%%'/>
<p>This slide illustrates the flexibility of the approach we’re taking. As LDaCA is a national project, our archival repositories and those of our partners such as PARADISEC will be distributed with differences of governance, varying by organisation, language type and discipline, though there is still a desire to be able to aggregate data into services that make it findable.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide23.png' alt=' S3-Style Object store Plain Old File Store ' title='Slide: 23' border='1' width='85%%'/>
<p>The storage services may not all be the same in this model, some may be file systems, some may be object stores, and they may be hosted by and governed by a variety of organizations.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide24.png' alt=' ' title='Slide: 24' border='1' width='85%%'/>
<p>This slide shows how we have abstracted the “A” for Access in FAIR out of the repository and into a separate centralised or at least <em>concentrated</em> system. We have a <a href="https://www.ldaca.edu.au/posts/fair-care-eresearch-2022/">full write-up of this approach from the 2022 eResearch Australasia conference</a> and we don’t have time to go through it in detail here, but in summary:</p>
<ul>
<li>Every object in the repository has a Data Reuse License with some management metadata.</li>
<li>Each repository only needs an authoritative list of licenses and trusted license management systems to be able to serve the data.</li>
<li>License management is handled by a dedicated system that can deal with application and invitation workflows to grant licenses (including simple self-serve click-through license agreements)</li>
</ul>
<p>Note that our work is also informed by the <a href="https://www.gida-global.org/care">CARE Principles for Indigenous Data Governance (Collective benefit, Authority to control, Responsibility, Ethics)</a>, which frame the way FAIR protocols are implemented. Again, see the <a href="https://digital.library.unt.edu/ark:/67531/metadc2114304/">LangArc workshop write-up</a>.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide25.png' alt=' Ok but how does the data get there? ' title='Slide: 25' border='1' width='85%%'/>
<p>Let’s revisit this diagram. What’s missing?</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide26.png' alt=' ' title='Slide: 26' border='1' width='85%%'/>
<p>In the first phase of the LDaCA project, work focused on batch import of data using tools to convert collections – this approach was used on contemporary collections as well as for “rescuing” collections from older repository systems.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide27.png' alt=' { "@id": "https://github.com/Language-Research-Technology/corpus-tools-ro-crate", "@type": "SoftwareSourceCode", "name": "https://github.com/Language-Research-Technology/corpus-tools-ro-crate", "description": "Converts an RO-Crate to an LDaCA OCFL collection as long as the crate has repository Objects and Collections that are members of a RepositoryCollection in the root dataset", "programmingLanguage": { "@id": "https://en.wikipedia.org/wiki/Node.js" } }, { "@id": "#provenance", "name": "Created RO-Crate using corpus-tools-ro-crate", "@type": "CreateAction", "instrument": { "@id": "https://github.com/Language-Research-Technology/corpus-tools-ro-crate" }, "result": { "@id": "ro-crate-metadata.json" } } The act of creation of this metadata is documented ' title='Slide: 27' border='1' width='85%%'/>
<p>This slide shows some JSON-LD metadata that describes the way this RO-Crate metadata was created – illustrating how RO-Crate can be used to record provenance.</p>
<p>(UPDATE: I didn't explain <a href="https://json-ld.org/">JSON-LD</a> properly during the presentation. JSON-LD is a method of encoding linked data (which can be quite esoteric and unapproachable) in JSON, a simple, widely used text format that programmers understand.)</p>
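To illustrate the point about JSON-LD: the <code>@context</code> is what maps short, programmer-friendly JSON keys to full linked-data IRIs. The toy expansion below is only the idea; real tools (such as the pyld library) implement the actual JSON-LD expansion algorithm.

```python
# A toy @context: short keys map to full schema.org IRIs.
context = {"name": "http://schema.org/name",
           "author": "http://schema.org/author"}

# An ordinary-looking JSON object describing an entity.
doc = {"@id": "#play1",
       "name": "Arden of Faversham",
       "author": {"@id": "#anon"}}

def expand_keys(node, ctx):
    """Replace each key with its IRI from the context (keywords pass through)."""
    return {ctx.get(k, k): v for k, v in node.items()}

expanded = expand_keys(doc, context)
```

So a programmer just sees friendly JSON, while linked-data tools see unambiguous global identifiers.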
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide28.png' alt=' ' title='Slide: 28' border='1' width='85%%'/>
<p>This part of the architecture we are working on now…</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide29.png' alt='' title='Slide: 29' border='1' width='85%%'/>
<p>Here we see the Crate-O metadata tool (which is a zero-install web application that runs in Chrome and other browsers that support the new FilesystemAPI) being used to add an Organization as the Affiliation for a Person entity. Having imported this "Context Entity" (that's the RO-Crate term), it can then be re-used within the crate, which we see here as the schema.org <code>publisher</code> property is linked to the same organization.</p>
<p>(At this stage Crate-O is still to be connected to the repository stack - that will happen in the second half of 2023)</p>
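The pattern described above – declaring a context entity once in the <code>@graph</code> and re-using it by <code>@id</code> reference – looks roughly like the sketch below. The identifiers and names here are invented for illustration, not taken from a real crate.

```python
# Sketch: one Organization entity, referenced twice by @id -- as the root
# dataset's publisher and as a Person's affiliation.
graph = [
    {"@id": "./", "@type": "Dataset",
     "publisher": {"@id": "https://example.org/org/1"}},
    {"@id": "#researcher", "@type": "Person", "name": "P. Researcher",
     "affiliation": {"@id": "https://example.org/org/1"}},
    {"@id": "https://example.org/org/1", "@type": "Organization",
     "name": "Example University"},
]

def resolve(graph, ref):
    """Follow an {"@id": ...} reference to its entity in the graph."""
    return next(e for e in graph if e["@id"] == ref["@id"])

publisher = resolve(graph, graph[0]["publisher"])
```

Because both properties point at the same entity, an editor like Crate-O only has to maintain the organization’s details in one place.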
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide30.png' alt=' ' title='Slide: 30' border='1' width='85%%'/>
<p>We hope to work with other editor projects (eg <a href="https://describo.github.io/#/">Describo</a>) to make editor profiles as compatible as possible.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide31.png' alt=' ' title='Slide: 31' border='1' width='85%%'/>
<p>The next series of slides show some examples of our approach implemented in a variety of contexts.</p>
<p>Here’s another repository that uses RO-Crate metadata (from the Language Data Commons of Australia / Australian Text Analytics Platform) – here users can launch a Jupyter notebook containing Python code (and explanatory text) that processes a dataset.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide32.png' alt=' SCREENSHOT OF NOTEBOOK ' title='Slide: 32' border='1' width='85%%'/>
<p>This is a screenshot of a Jupyter notebook that can process data from a repository via its API.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide33.png' alt=' ' title='Slide: 33' border='1' width='85%%'/>
<p>This slide shows the Arkisto stack powering the University of Technology Sydney’s Research Data Repository.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide34.png' alt=' ' title='Slide: 34' border='1' width='85%%'/>
<p>This page shows some screenshots of an internal-only application at UTS which gives academic staff access to successful research grant proposals – the data are stored in the same kind of Arkisto standards-based storage stack as we have presented here – with an interface that is tuned for this use case, with some custom access control to make sure that staff are <em>very</em> aware that these are sensitive and confidential documents.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide35.png' alt=' ' title='Slide: 35' border='1' width='85%%'/>
<p>This is a screenshot of data from a history project <a href="https://expertnation.org/">Expert Nation</a> exported to RO-Crate format and <a href="https://expertnation.research.uts.edu.au/">put online to support a book</a>.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/14/arkisto-stack-or-2023/Slide36.png' alt=' Want to join us? Your project here? Get in touch… ' title='Slide: 36' border='1' width='85%%'/>
<p>In conclusion, we have given a quick tour of a standards-based repository stack (loosely called Arkisto) and illustrated it with current work at the Language Data Commons of Australia and PARADISEC projects, but along the way we have tried to emphasise that this is a generic, re-usable architecture based on standards. By using an extensible metadata standard with a growing community, and a storage-layer standard forged from an acquired aversion to systems migration, we aim to reduce the risk to very important cultural data by working with as many communities as possible on software tools, so that we reduce cost and risk for all of us.</p>
</section>
Introducing the Oni Repository Stack2023-06-13T00:00:00+02:002023-06-13T00:00:00+02:00Peter Seftontag:ptsefton.com,2023-06-13:/2023/06/13/oni-dev-track-or-2023/index.html<p>By:</p>
<ul>
<li>Peter Sefton</li>
<li>Moises Sacal Bonequi</li>
<li>Alvin Sebastian</li>
<li>Mark Raadgever</li>
</ul>
<p>This presentation was delivered by Peter Sefton at Open Repositories 2023.</p>
<h3>Abstract</h3>
<p>In this presentation we will show some of the general purpose repository tooling used to manage repository data for the Language Data Commons of Australia and the Australian Text Analytics Platform. We have a standards-based repository stack which is used to make research data available for human and machine use. The main part of the stack is “Oni” https://github.com/Arkisto-Platform/oni which builds an access-controlled REST API from an Oxford Common File Layout (OCFL) data store (which consists of data objects saved as files on disk or in object storage), with data objects described using the RO-Crate metadata standard. Data is indexed into a Postgres-driven API for low-level access, and a full discovery index implemented in Elasticsearch, with the ability to create access portals in your web framework of choice. We will demonstrate rapid creation of large scale repositories using batch tooling, as well as using a metadata entry tool known as Describo to produce RO-Crate linked-data descriptions.</p>
<p>This slide shows the "small pieces, loosely joined" style of the Oni repository -- it is based on an OCFL data store for digital objects on disk or object storage, with RO-Crate metadata for each object.</p>
<h1>The architecture</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/oni_diagrams_oni-architecture-2.svg"/>
<p>This diagram shows the architecture of the Oni system. The name Oni started as an acronym -- OCFL + NGINX (a web server) + Index (eg Solr or Elasticsearch) -- but we no longer use NGINX, and Oni happens to be a kind of ogre which has its own emoji 👹.</p>
<p>This demonstration shows an example of how to stand up a repository for 300 documents, in this case plays in TEI XML format which we got from <a href="https://orcid.org/0000-0002-9336-1678">Professor Hugh Craig</a>. The first steps involve getting the data into an Oxford Common File Layout <a href="https://ocfl.io">(OCFL)</a> repository with each "object" (a play) in the repository described using Research Object Crate metadata: RO-Crate.</p>
<h1>Some data – ~300 plays from the 1500s</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/list-metadata.gif"/>
<p>This screen recording shows a command line session; listing the contents of a data directory full of XML and peeking into the CSV metadata supplied with the files by Professor Craig.</p>
<h1>Using RO-Crate-excel, execute a few maneuvers</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/rocxl.mov.gif"/>
<p>In this recording, we use the RO-Crate Excel tool to generate an Excel workbook listing all the files.</p>
<h1>Paste in the researcher's data</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/sheet-detail.png"/>
<p>Using Excel, we can manipulate data in a transparent way to get it ready for conversion into RO-Crate format -- the RO-Crate Excel tool uses some conventions that mean we can "show our working" in this process, and mark some of the more esoteric metadata as hidden (for now), though it is still available in the researcher's original ad-hoc CSV format.</p>
<h1>Fine tune using Crate-O ...</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/crate-o-org.mov.gif"/>
<p>Here we see the Crate-O metadata tool (which is a zero-install web application that runs in Chrome and other browsers that support the new FilesystemAPI) being used to add an Organization as the Affiliation for a Person entity. Having imported this "Context Entity" (that's the RO-Crate term for this type of contextual metadata), it can then be re-used within the crate, which we see here as the schema.org <code>publisher</code> property is linked to the same organization.</p>
<h1>Here's where you get Crate-O</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/crate-o-site.png"/>
<p>You can get the Crate-O source or try it out <a href="https://github.com/Language-Research-Technology/crate-o">at this github repo</a>.</p>
<h1>… and you get an RO-Crate for the data</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/rochtml.mov.gif"/>
<p>This slide shows generating an HTML preview file that summarizes the data -- the RO-Crate is a JSON-LD file that was created from the spreadsheet shown above and tweaked using Crate-O. JSON-LD is linked data in JSON format; this is what RO-Crate uses to make linked data approachable for a general programming audience.</p>
<h1>Then using corpus-tools-ro-crate, make an OCFL repo</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/make-plays.mov.gif"/>
<p>This slide shows another script (via a make file that supplies a set of command-line parameters) which takes the RO-Crate and "explodes" it into a set of OCFL (Oxford Common File Layout) directories in a "Storage Root".</p>
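In spirit, the "explode" step looks something like the following sketch: walk the root crate's graph, and write each repository object out as its own crate directory. This is a much-simplified, hypothetical stand-in for corpus-tools-ro-crate, which also handles OCFL versioning, checksums and payload files.

```python
import json
import pathlib
import tempfile

# Sketch: split a root RO-Crate into one crate directory per
# RepositoryObject under a storage root.
def explode(crate, storage_root):
    root = pathlib.Path(storage_root)
    for entity in crate["@graph"]:
        if "RepositoryObject" in entity.get("@type", []):
            obj_dir = root / entity["@id"].strip("#")
            obj_dir.mkdir(parents=True, exist_ok=True)
            # Each object becomes a tiny crate of its own.
            obj_crate = {"@context": crate["@context"], "@graph": [entity]}
            (obj_dir / "ro-crate-metadata.json").write_text(
                json.dumps(obj_crate, indent=2))

crate = {"@context": "https://w3id.org/ro/crate/1.1/context",
         "@graph": [{"@id": "#play-1", "@type": ["RepositoryObject"],
                     "name": "A play"}]}
out = tempfile.mkdtemp()
explode(crate, out)
```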
<h1>This is the OCFL file layout</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/ocfl-screenshot.png"/>
<p>Here's what an OCFL repository might look like during development -- I built this on the plane to South Africa, somewhere over the Southern Ocean, and you can see that my tweaks to the code resulted in several versions of the OCFL/RO-Crate objects being created. In this recording I navigate to a file, open the RO-Crate Metadata Document, and inspect the metadata profile that it links to from the <code>conformsTo</code> property.</p>
<h1>Start up 👹 and index stuff</h1>
<p>Type, like:</p>
<pre><code>> docker compose up
... Screenfuls of stuff
> node structural-index.js
{ message: 'Started: database indexer' }
</code></pre>
<h1>Et Voila!</h1>
<img src="https://ptsefton.com/2023/06/13/oni-dev-track-or-2023/portal.mov.gif"/>
<p>This is a search portal for the plays, with Elasticsearch providing full-text search and ~~facets~~ aggregations.</p>
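For readers who haven't followed the facets-to-aggregations terminology shift, an Elasticsearch query body combining full-text search with a terms aggregation looks roughly like this. The field names (<code>text</code>, <code>author.keyword</code>) are assumptions for illustration, not the portal's actual schema.

```python
# Sketch: an Elasticsearch search body -- a full-text match plus a
# "terms" aggregation that plays the role facets did in older stacks.
query = {
    "query": {"match": {"text": "murder"}},
    "aggs": {
        "by_author": {"terms": {"field": "author.keyword"}}
    },
}
```

A portal UI renders the aggregation buckets as clickable filters alongside the hit list.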
<p>In conclusion, this repository stack is quite different from DSpace, ePrints and other repository systems where everything is built in to one application -- the approach is more like the Unix one: small pieces, loosely joined.</p>
<h1>Tools used here</h1>
<h2>The excel-to-crate tooling:</h2>
<p>https://github.com/Language-Research-Technology/ro-crate-excel</p>
<h2>The plays example</h2>
<p>https://github.com/Language-Research-Technology/corpus-tools-example-plays</p>
<h1>More tools</h1>
<h2>The thing that turns RO-Crate into an OCFL repo:</h2>
<p>https://github.com/Language-Research-Technology/corpus-tools-ro-crate</p>
<h2>The Oni stack, OCFL library, API and Elasticsearch:</h2>
<p>https://github.com/Language-Research-Technology/oni-ui</p>
Packaging data with detailed metadata using RO-Crate in FAIR open repositories2023-06-13T00:00:00+02:002023-06-13T00:00:00+02:00Peter Seftontag:ptsefton.com,2023-06-13:/2023/06/13/ro-crate-or-2023/index.html<p><a href="https://ptsefton.com/2023/06/13/ro-crate-or-2023/ro-crate-or-2023.pdf">PDF version</a></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide00.png' alt='Packaging data with detailed metadata using RO-Crate in FAIR open repositories Peter Sefton1, Stian Soiland-Reyes2 1: The University of Queensland, Australia; 2: The University of Manchester, UK ' title='Slide: 0' border='1' width='85%%'/>
<p>This presentation was delivered by Peter Sefton at Open Repositories 2023: it includes slides adapted from other RO-Crate presentations by Stian Soiland-Reyes and others - but here “I” means Sefton.</p>
<h2>Abstract</h2>
<p>Research Object Crate (RO-Crate) is a community effort and specification to practically achieve FAIR packaging of research objects (digital objects like data, methods, software) with structured metadata and context. RO-Crate uses well-established Web standards and FAIR principles. For common metadata representations, RO-Crate builds on schema.org, a mature and general mark-up vocabulary used by search engines, including Google Dataset Search. RO-Crate is adopted by many research projects as a pragmatic implementation of the FAIR principles that can be both general for interoperable exchange and extensible for domain-specific archiving.
RO-Crate development began in early 2019, when a workshop at Open Repositories 2019 in Hamburg generated a significant number of use-cases and expressions of interest from the OR community. This presentation will introduce RO-Crate, its continuing development and rapid adoption since 2019, report on how it is now being used in repository software, and the potential for further use in repository platforms that will be familiar to OR attendees.</p>
<h2>Outline</h2>
<p>In this presentation we’ll cover:</p>
<ul>
<li>A quick run thru what RO-Crate is and what it is for</li>
<li>New developments:
<ul>
<li>Version 1.2 is coming</li>
<li>Profiles are seeing a lot of activity</li>
<li>Tooling continues to improve</li>
</ul>
</li>
</ul>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide01.png' alt='Is it FAIR to use all these repositories? https://fairsharing.org/ https://faircookbook.elixir-europe.org/ https://www.re3data.org/ ' title='Slide: 1' border='1' width='85%%'/>
<p>Researchers are asked to make their research outputs – including publications – FAIR. But where to publish?</p>
<p>They have to choose between thousands of public, institutional and domain-specific repositories, with help from guidance and catalogues.</p>
<p>(FAIRsharing, re3data, FAIR Cookbook)</p>
<p>…but how to gather and reference outputs across multiple repositories? And what about contextual information?</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide02.png' alt='Describe and package data collections, datasets, software etc. with their metadata Platform-independent object exchange between repositories and services Support reproducibility and analysis: link data with codes and workflows Transfer of sensitive/large distributed datasets with persistent identifiers Aggregate citations and persistent identifiers Propagate provenance and existing metadata Publish and archive mixed objects and references Reuse existing standards, but hide their complexity Aims of FAIR Research Objects ' title='Slide: 2' border='1' width='85%%'/>
<p>These are our aims:</p>
<ul>
<li><em>Describe</em> and <em>package</em> data collections, datasets, software etc. with their <em>metadata</em> (And remember in the context of Open Repositories: publications are data too)</li>
<li><em>Platform-independent</em> object exchange between repositories and services</li>
<li>Support <em>reproducibility</em> and <em>analysis</em>: link data with codes and workflows</li>
<li>Transfer of <em>sensitive/large</em> distributed datasets with persistent identifiers</li>
<li>Aggregate <em>citations</em> and <em>persistent identifiers</em></li>
<li>Propagate <em>provenance</em> and <em>existing metadata</em></li>
<li>Publish and archive <em>mixed objects</em> and references</li>
<li>Reuse existing <em>standards</em>, but hide their complexity</li>
</ul>
<p>We're trying to be fairly platform-independent, and we're not too tied into a particular way of storing or identifying these components. We do want to have enough information for reproducibility ,and to support data that are coming in from different sources, that may not even be accessible directly because they require authorization.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide03.png' alt='' title='Slide: 3' border='1' width='85%%'/>
<p>The idea of the Research Object (RO) is to gather data in a kind of virtual package. This may include some actual files, and it may include outgoing references; these are related together and given brief descriptions. That way we know what the data are, and what role they play in <em>this</em> collection.</p>
<p>(I presented RO-Crate to a senior research technology leader recently who had not yet heard of RO-Crate – and they stopped me and asked “why is there <em>Research</em> in the name?” – pointing out that RO-Crate is obviously applicable to more than research use cases. The answer lies in the genealogy; RO-Crate is a merger between the Research Object line of work at the University of Manchester and DataCrate from the University of Technology Sydney – the technology is not inherently specific to research, but the motivations, particularly the FAIR principles, do come from the research world.)</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide04.png' alt='What's new ' title='Slide: 4' border='1' width='85%%'/>
<p>This slide shows a screenshot of the RO-Crate specification. The spec is designed to be an implementation guide that builds on other standards – we will continue to work on making this as simple as possible for tool developers (we admit parts of it have started to get a bit complex as we take on more use cases).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide05.png' alt='Using common formats and vocabularies .. extending only when needed ' title='Slide: 5' border='1' width='85%%'/>
<p>We use the common vocabularies, but only extend where we need to.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide06.png' alt='TOOLS ' title='Slide: 6' border='1' width='85%%'/>
<p>RO-Crate now has a very healthy community – the spec is developed through an open process with fortnightly calls and a GitHub repository.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide07.png' alt='Python-lib ' title='Slide: 7' border='1' width='85%%'/>
<p>We have regular calls – a “main” monthly call and a Euro-focussed call. People call in from all over Europe, the US and Australia.</p>
<p>Post-conference note: this is obviously not what you'd call global coverage. Claire Knowles pointed out to me the number of people at OR who were standing in Africa talking about their 'global' projects, which often have low-to-zero representation outside of North America and Europe (we'll count Australasia as part of that, as Australia's in Eurovision).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide08.png' alt=' ' title='Slide: 8' border='1' width='85%%'/>
<p>There is a growing body of work on RO-Crate – this Zenodo repository captures part of it – and it’s starting to show up in repositories and presentations in a lot of research contexts.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide09.png' alt=' ' title='Slide: 9' border='1' width='85%%'/>
<p>WorkflowHub is an example of a repository (though it calls itself a registry) – it contains scientific workflows.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide10.png' alt=' RO-Crate is built-in ' title='Slide: 10' border='1' width='85%%'/>
<p>Here's an example of a workflow in the WorkflowHub registry/repository – there’s a download button to get a workflow in RO-Crate format. Note the ‘sketch’ which illustrates the workflow.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide11.png' alt=' There’s an HTML page included in the RO-Crate Download that makes the crate human readable ' title='Slide: 11' border='1' width='85%%'/>
<p>If you download this workflow crate you get a preview file like the one shown, including the precis "sketch" of the workflow and links to the files – e.g. the "Main Workflow" link. This shows the benefits of RO-Crate – every download has a machine-readable metadata file, and there's a human-readable web page to go with it. If you find this on your computer in 10 years' time there is information there, in a standardised format, about what it is and where it came from.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide12.png' alt=' RO-Crate Built-in here as well at RO-Hub ' title='Slide: 12' border='1' width='85%%'/>
<p>This is an <a href="https://reliance.rohub.org/fb3a8b1f-7132-4c0e-80c8-33ff294808da?activetab=overview">item from RO-Hub</a></p>
<p>The EOSC project RELIANCE uses RO-Crate to package data cubes of earth observation data, along with documentation, images and workflows.</p>
<p>It connects to related infrastructures for interactive execution/analysis.</p>
<p>Metadata includes temporal coverage, spatial coverage and vertical coverage.</p>
<p>ROHub publishes the archived RO-Crates to general-purpose repositories (Zenodo, B2Share) for longevity and PIDs.</p>
<p>(The RO-Crate preview file in this service could use work; it’s a raw representation of the JSON metadata but is still better than the old days)</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide13.png' alt=' ' title='Slide: 13' border='1' width='85%%'/>
<p>In the above examples we showed how resources can be downloaded from repositories in RO-Crate format – but there is still no widely accepted standard in place to join the dots between, say, a DOI for a dataset and an actual download of that data. DOIs resolve to web pages, not data streams – the RO-Crate community is actively engaged in joining these dots with work on FAIR Signposting, establishing protocols for automated signalling of where data can be downloaded.</p>
<p>Please join the <a href="https://join.slack.com/t/fair-impact-support/shared_invite/zt-1s86x15a8-pJdpSns3tdZXgAoruHtuD">Slack conversation</a> if you’d like to talk to us about this.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide14.png' alt=' ' title='Slide: 14' border='1' width='85%%'/>
<p>We have just seen an example of an ATTACHED crate – you might call it RO-Crate “classic”; this was the starting point for RO-Crate, its first use case as a packaging format. In an Attached Crate, data resources are included alongside the RO-Crate-metadata.json file. When we introduced RO-Crate at OR2019 in Hamburg this was the ONLY kind of crate.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide15.png' alt=' ' title='Slide: 15' border='1' width='85%%'/>
<p>Detached Crates, on the other hand, have resources that are NOT local. For example, an RO-Crate metadata document downloaded from an API might reference resources available from the API.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide16.png' alt=' References file streams from the API ' title='Slide: 16' border='1' width='85%%'/>
<p>This is what a “Detached RO-Crate” looks like over an API – in this case one that is showing a collection of plays in English from the 1500s (this data features in another presentation given at Open Repositories, <a href="/2023/06/13/oni-dev-track-or-2023/">a demonstration</a> illustrating the technical details of an RO-Crate-based repository architecture).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide17.png' alt=' ' title='Slide: 17' border='1' width='85%%'/>
<p>This diagram sketches the architecture of the <a href="https://atap.edu.au">Australian Text Analytics Platform</a>, which is part of the <a href="https://ldaca.edu.au">Language Data Commons of Australia</a>, and shows the integration between data repositories (in green, on the right) and code execution environments (in red, on the left). The integration between these things is via documentation – and standards-based metadata (including, of course, RO-Crate).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide18.png' alt=' https://www.researchobject.org/ro-crate/tools/ ' title='Slide: 18' border='1' width='85%%'/>
<p>We have been talking about RO-Crate tools – here’s the list from the website. Like any list of tools it can be hard to keep up to date (for example, I am talking about the Crate-O tool here but it is not yet on the list). Here’s the RO-Crate tools page: <a href="https://www.researchobject.org/ro-crate/tools/">https://www.researchobject.org/ro-crate/tools/</a></p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide19.png' alt=' ' title='Slide: 19' border='1' width='85%%'/>
<p>Here’s another repository that uses RO-Crate metadata (from the Language Data Commons of Australia / Australian Text Analytics Platform) – users can launch a Jupyter notebook in a BinderHub execution environment. The notebook fetches a Detached RO-Crate metadata document, processes it to select further resources to fetch, and then fetches them from the API.</p>
</section>
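<p>The filtering step can be sketched roughly like this (not the actual notebook code): index the flat <code>@graph</code> by <code>@id</code>, follow the metadata descriptor's "about" link to the root dataset, and collect the parts worth downloading. The sample document, URLs and the use of schema.org <code>encodingFormat</code> as the filter are assumptions for illustration:</p>

```python
# Sketch only: select resources from a Detached RO-Crate metadata document.
# The sample crate and its URLs are invented for the example.

def select_parts(crate, encoding_format):
    """Return the @ids of the root dataset's parts with a given format."""
    entities = {e["@id"]: e for e in crate["@graph"]}
    # follow the metadata descriptor's "about" link to the root dataset
    root = entities[entities["ro-crate-metadata.json"]["about"]["@id"]]
    return [
        p["@id"]
        for p in root.get("hasPart", [])
        if entities[p["@id"]].get("encodingFormat") == encoding_format
    ]

sample = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
         "about": {"@id": "./"}},
        {"@id": "./", "@type": "Dataset",
         "hasPart": [{"@id": "https://example.org/play.txt"},
                     {"@id": "https://example.org/play.xml"}]},
        {"@id": "https://example.org/play.txt", "@type": "File",
         "encodingFormat": "text/plain"},
        {"@id": "https://example.org/play.xml", "@type": "File",
         "encodingFormat": "application/xml"},
    ],
}

# the plain-text resources the notebook would go on to fetch from the API
print(select_parts(sample, "text/plain"))  # prints ['https://example.org/play.txt']
```

<p>Because the crate is Detached, each selected <code>@id</code> is itself a URL the notebook can fetch.</p>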
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide20.png' alt=' SCREENSHOT OF NOTEBOOK ' title='Slide: 20' border='1' width='85%%'/>
<p>This is a screenshot of the notebook, using the Python RO-Crate library to consume data from the API.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide21.png' alt=' ' title='Slide: 21' border='1' width='85%%'/>
<p>The RO-Crate Python library has lots of functionality for doing actual data packaging – it has a file-system interface (we mention this as it is different from the approach taken in the Javascript library).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide22.png' alt=' ' title='Slide: 22' border='1' width='85%%'/>
<p>And ro-crate-py has a command-line interface for making RO-Crates step by step.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide23.png' alt=' ' title='Slide: 23' border='1' width='85%%'/>
<p>RO-Crate-js (Javascript) takes a different approach – it is much more abstract, and has no direct connection to the file system.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide24.png' alt=' ' title='Slide: 24' border='1' width='85%%'/>
<p>RO-Crate Excel creates a crate from a directory of files, and allows existing ad hoc tabular metadata to be added to RO-Crates.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide25.png' alt='Crate-O – TODO when this is downloaded as a PPT ' title='Slide: 25' border='1' width='85%%'/>
<p>Here we see the Crate-O metadata tool (a zero-install web application that runs in Chrome and other browsers that support the new File System API) being used to add an Organization as the affiliation for a Person entity. Having imported this "Contextual Entity" (that's the RO-Crate term), it can be re-used within the crate: here the schema.org <code>publisher</code> property is linked to the same organization, with the ROR (Research Organization Registry) identifier https://ror.org/00eae9z71.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2023/06/13/ro-crate-or-2023/Slide26.png' alt=' Join us! ' title='Slide: 26' border='1' width='85%%'/>
<p>If you’d like to join in or contact us, choose one of the options on the <a href="https://www.researchobject.org/ro-crate/community.html">RO-Crate Community page</a> – e.g. <a href="https://join.slack.com/t/seek4science/shared_invite/zt-csqh94qb-kf~kFbZxuHl1Hpxhbc8avw">join the Slack</a>.</p>
<p>And to cite RO-Crate:</p>
<p>Stian Soiland-Reyes, Peter Sefton, Mercè Crosas, Leyla Jael Castro, Frederik Coppens, José M. Fernández, Daniel Garijo, Björn Grüning, Marco La Rosa, Simone Leo, Eoghan Ó Carragáin, Marc Portier, Ana Trisovic, RO-Crate Community, Paul Groth, Carole Goble (2022): Packaging research artefacts with RO-Crate. Data Science 5(2). https://doi.org/10.3233/DS-210053</p>
</section>
Designing a metadata ecosystem for language research based on Research Object Crate (RO-Crate) (2022-11-25, Peter Sefton)<p><a href="ldaca-metadata-ecosystem-eresearch-2022.pdf">PDF version</a></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide00.png' alt='Designing a metadata ecosystem for language research based on Research Object Crate (RO-Crate) Peter Sefton, Nick Thieberger, Marco La Rosa, Simon Musgrave, River Tae Smith, Moises Sacal Bonequi ' title='Slide: 0' border='1' width='85%%'/>
<p>By Peter Sefton, Nick Thieberger, Marco La Rosa, Simon Musgrave, River Tae Smith, Moises Sacal Bonequi – delivered by Peter Sefton at eResearch 2022 in Brisbane</p>
<p>This work is licensed under CC BY 4.0. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/</p>
<p>This presentation will look at how a metadata standard – RO-Crate – with a metadata profile (the Language Data Commons profile) is being developed and implemented. Two major collections, PARADISEC and the Language Data Commons of Australia (LDaCA), are collaborating on the standard. This ongoing standardisation effort for language data is designed to improve interoperability, reduce costs for data migration and allow storage on disk, in object storage or in archival repositories.</p>
<p><a href="https://www.researchobject.org/ro-crate/">RO-Crate</a> is a linked-data metadata system which allows discovery metadata (who, what, where) based on the widely adopted Schema.org vocabulary to be seamlessly integrated with more discipline-specific metadata. RO-Crate uses metadata profiles to provide guidance for packaging resources for particular disciplines and purposes.</p>
<p>In this presentation we will introduce a RO-Crate metadata profile for language data which extends the core RO-Crate standard with new vocabulary terms adapted from pre-linked-data discipline specific metadata efforts, particularly the Open Language Archives Community (OLAC) standards. The profile has English-language guidance on how to structure collections of resources in a repository with links between them, such that they can be indexed and displayed via APIs and search/browse portals. The profile is also implemented as a series of machine-readable profiles for the Describo Online metadata description system.</p>
<p>We will demonstrate current ways of describing items in a variety of languages and modes (spoken, written and signed), from a large set of heterogeneous language resources held by PARADISEC and LDaCA. We will also show how to access them via API calls and a search portal, and how resources may be stored in simple storage systems using the Arkisto platform (a set of standards and principles).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide01.png' alt='The Language Data Commons of Australia (LDaCA) and Australian Text Analytics Platform (ATAP) projects received investment (https://doi.org/10.47486/DP768 and https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS). ARC LIEF LE210100013 (2021-2024) Nyingarn: a platform for primary sources in Australian Indigenous languages ' title='Slide: 1' border='1' width='85%%'/>
<p>This work is supported by the Australian Research Data Commons.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide02.png' alt=' With thanks for their contribution: Partner Institutions: ' title='Slide: 2' border='1' width='85%%'/>
<p>The Language Data Commons of Australia Data Partnerships (<a href="https://doi.org/10.47486/HIR001">LDaCA</a>) and the Australian Text Analytics Platform (<a href="https://doi.org/10.47486/PL074">ATAP</a>) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC).</p>
<p>The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities.</p>
<p>The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations. These projects are led by Professor Michael Haugh of the School of Languages and Culture at the University of Queensland with several partner institutions.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide03.png' alt='Pacific and Regional Archive for Digital Sources in Endangered Cultures Running for 20 years 1,337 languages represented 675 collections 37,510 items 405,289 files 15,540 hours (audio) 2,465 hours (video) 193 TB October 2022 ' title='Slide: 3' border='1' width='85%%'/>
<p>This page shows a <a href="https://www.youtube.com/watch?v=CX-CODBwOVU&t=7s">YouTube demo of the PARADISEC web site</a>.</p>
<p>PARADISEC (the Pacific And Regional Archive for Digital Sources in Endangered Cultures) is a digital archive of records of some of the many small cultures and languages of the world and it has developed models to ensure that the archive can provide access to interested communities while also conforming with emerging international standards for digital archiving. Australian researchers have been making unique and irreplaceable audiovisual recordings in the region since portable field recorders became available in the mid-twentieth century, yet until the establishment of PARADISEC there was no Australian repository for these invaluable research recordings.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide04.png' alt=' ' title='Slide: 4' border='1' width='85%%'/>
<p>Goal: Be able to store data with an eye on preservation</p>
<p>In an archive like PARADISEC it is important to be able to maintain resources over the long term. For example, much material which falls within the scope of PARADISEC is stored on legacy media. PARADISEC archives tapes from a range of sources, such as the agencies in the Pacific shown in the images above. Such material needs to be digitised and returned to the source with meaningful metadata.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide05.png' alt='PARADISEC ACCESS https://language-archives.services/about/data-loader ' title='Slide: 5' border='1' width='85%%'/>
<p>PARADISEC has learned the importance of making the collection self-describing so it is not dependent on a database as the sole metadata source. It does use a database for administrative services, from which a text file with metadata for any item can be exported. This allows us to select an arbitrary set of items, put them on a hard disk, and use the data-loader application to generate an HTML catalog of just those items, drawing on the internal metadata file describing each item. This can be delivered on a hard disk to a local community or cultural organisation, or on a Raspberry Pi wifi local network to allow access on phones, as seen here in Erakor village in central Vanuatu.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide06.png' alt='ELAR: limited search capability, non-standard metadata schema, no ability to index annotation files, no bulk download LDaCA: rich metadata-first search, portable RO-Crate metadata, indexed annotations, bulk downloading of search results AUSLAN CORPUS ACCESS ' title='Slide: 6' border='1' width='85%%'/>
<p>Another example of how good metadata practice can improve community access is the Auslan (Australian Sign Language) corpus, for which community access is very important.</p>
<p>The Auslan Corpus has been stored with the Endangered Languages Archive (<a href="https://www.elararchive.org/">ELAR</a>) since 2008. However, ELAR does not currently suit the access needs of the Auslan corpus; it has low discoverability, and files must be downloaded individually. The corpus, along with the Auslan SignBank dictionary, is being included in LDaCA.</p>
<p>The Auslan Corpus holds great value as an educational tool for Auslan users and learners, both Deaf and hearing, and the move to LDaCA will allow further development of educational tools. One such tool is the ability, still under development, for Auslan Signbank dictionary to pull real-world examples of signs out of the corpus to show alongside dictionary entries.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide07.png' alt='' title='Slide: 7' border='1' width='85%%'/>
<p>For all of the collections we are working with, data is discoverable via some kind of web portal which indexes and displays the archive (repository) of data. These screenshots are of work in progress at LDaCA.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide08.png' alt='' title='Slide: 8' border='1' width='85%%'/>
<p>The LDaCA services we are building use an API to drive the data portals. The API can be used for direct access with appropriate access control – see <a href="posts/fair-care-eresearch-2022">another eResearch presentation</a> which explains this in detail. These screenshots show code notebooks (running in BinderHub on the Nectar cloud) accessing language resources.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide09.png' alt='' title='Slide: 9' border='1' width='85%%'/>
<p>Having looked at the user-facing products, websites and APIs, we turn our attention to how data is managed on disk.</p>
<p>In the PARADISEC system this is achieved by storing files on disk in a simple hierarchy - with metadata and other resources stored together in a directory - this scheme allows for hands-on management of data resources, independently of the software used to serve them.</p>
<p>This approach means that if the PARADISEC software-stack becomes un-maintainable for financial or technical reasons the important resources, the data, are stored safely on disk with their metadata and a new access portal could be constructed relatively easily.</p>
<p>Despite the valuable features of this solution, it is not generalisable. The metadata.xml is custom to PARADISEC, as is the software stack.</p>
<p>In 2019 PARADISEC and the eResearch team at UTS received small grants from the Australian National Data Service and began collaborating on an approach to managing archival repositories which built on this PARADISEC approach of storing metadata with data.</p>
<p>The UTS team presented on this at <a href="https://ptsefton.com/2019/11/05/FAIR%20Repo%20-%20eResearch%20Presentation/index.html">eResearch Australasia 2019</a></p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide10.png' alt=' ' title='Slide: 10' border='1' width='85%%'/>
<p>For this Research Data Commons work we are using the Arkisto Platform (introduced <a href="http://ptsefton.com/2020/11/23/Arkisto/index.html">at eResearch 2020</a>).</p>
<p>Arkisto aims to ensure the long term preservation of data independently of code and services, recognizing the ephemeral nature of software and platforms. We know that sustaining software platforms can be hard and aim to make sure that important data assets are not locked up in databases or hard-coded logic of some hard-to-maintain application.</p>
<p>Inspired by PARADISEC’s approach the Arkisto platform is based on the idea of storing data in simple easy to manage file or object storage systems with metadata in an easily readable standard format.</p>
<p>The LDaCA repositories use the Oxford Common File Layout (<a href="https://ocfl.io/">OCFL</a>) standard, which is backed and used by a number of universities and has multiple implementations. PARADISEC data will be migrated to a simpler data storage approach, <a href="https://github.com/CoEDL/nocfl-js">NOCFL</a> – a single-library implementation inspired by some of the same aims, but with different implementation choices to avoid data being obfuscated by OCFL’s layout, which is a product of its commitment to immutable, write-once file management.</p>
</section>
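<p>For readers unfamiliar with OCFL, an object on disk looks roughly like this (a sketch following the OCFL v1.0 object layout; the payload file names are invented). The versioned, content-addressed structure is what guarantees immutability, and is also the indirection that NOCFL trades away for legibility:</p>

```text
object-root/
├── 0=ocfl_object_1.0          # "namaste" conformance declaration
├── inventory.json             # manifest: logical paths -> content digests
├── inventory.json.sha512
└── v1/                        # each change adds a new version directory
    ├── inventory.json
    ├── inventory.json.sha512
    └── content/
        ├── ro-crate-metadata.json
        └── recording.wav
```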
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide11.png' alt='{ "conformsTo": "http://purl.archive.org/language-data-commons/profile" } ' title='Slide: 11' border='1' width='85%%'/>
<p>Now to the main focus of this presentation - the metadata “Profile” we are jointly developing to ensure that language resources can be described in a way that is interoperable between software, and re-usable over time.</p>
<p>The Profile is an “RO-Crate Profile”, a kind of Cook Book for how to describe and package language data.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide12.png' alt='☁️ 📂 📄 ID? Title? Description? 👩🔬👨🏿🔬Who created this data? 📄What parts does it have? 📅 When? 🗒️ What is it about? ♻️ How can it be reused? 🏗️ As part of which project? 💰 Who funded it? ⚒️ How was it made? Addressable resources Local Data 👩🏿🔬 https://orcid.org/0000-0001-2345-6789 🔬 https://en.wikipedia.org/wiki/Scanning_electron_microscope ' title='Slide: 12' border='1' width='85%%'/>
<p>RO-Crate is a method for describing a dataset as a digital object using a <strong>single linked-data metadata document</strong></p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide13.png' alt='📂 🔬 🔭 📹 💽 🖥️ ⚙️🎼🌡️🔮🎙️🔍🌏📡💉🏥💊🌪️ ' title='Slide: 13' border='1' width='85%%'/>
<p>The dataset may contain any kind of data resource about anything, in any format as a file or URL</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide14.png' alt=' ' title='Slide: 14' border='1' width='85%%'/>
<p>The RO-Crate standard also strongly recommends that JSON metadata is supplemented with an HTML preview - above we show what that looks like for a PARADISEC item. This is a screenshot of an HTML view of a PARADISEC Item generated using <a href="https://github.com/UTS-eResearch/ro-crate-html-js">an HTML rendering tool for RO-Crate</a>. The important point here is that this is a <em>generic</em> viewer that can understand any RO-Crate. It may not be glamorous but it could be included in an archive as a way to provide human-readable access in the absence of portals that are data specific (but cost money to build and maintain).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide15.png' alt=' https://mod.paradisec.org.au ' title='Slide: 15' border='1' width='85%%'/>
<p>Here is the same page from the previous slide, seen in a working model of an RO-Crate set exported from the current PARADISEC catalog, with a single-page viewer using an Elasticsearch index. The two pages shown here are generated directly from metadata that was stored as an RO-Crate in a storage system, using PARADISEC-specific rather than generic code.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide16.png' alt=' ' title='Slide: 16' border='1' width='85%%'/>
<p>The <a href="https://www.researchobject.org/ro-crate/1.1/structure.html">structure of an RO-Crate</a> is very similar to the PARADISEC example above, but with a json file instead of XML, and an optional preview in HTML.</p>
<p>RO-Crate has a growing number of <a href="https://www.researchobject.org/ro-crate/tools/">tools and software libraries</a> which means that a team such as PARADISEC do not have to maintain their own bespoke software.</p>
</section>
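<p>Sketched as a directory listing, the structure looks like this (the two reserved file names are from the RO-Crate spec; the payload files are invented for the example):</p>

```text
my-item/
├── ro-crate-metadata.json   # the single JSON-LD metadata document
├── ro-crate-preview.html    # optional human-readable preview
├── recording.wav            # payload files, described in the metadata
└── transcript.xml
```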
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide17.png' alt=' ' title='Slide: 17' border='1' width='85%%'/>
<p>The base vocabulary for the JSON-LD used in RO-Crate is schema.org - a widely used linked data standard. RO-Crate uses a handful of terms from other ontologies but importantly it allows for seamless extensibility with domain specific vocabularies, which is what we will talk about next.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide18.png' alt=' ' title='Slide: 18' border='1' width='85%%'/>
<p>The PARADISEC metadata model is based on the Open Language Archives Community (OLAC) metadata standard. This is an XML-based standard with good online documentation, which made it a good basis for migrating to a linked-data approach.</p>
<p>We used the OLAC terms, including <a href="http://www.language-archives.org/REC/type-20020628.html">some that were proposed but withdrawn</a> as the basis for a new vocabulary.</p>
<p>As part of a LIEF project (2022-23, led by author Thieberger), revisions to the OLAC scheme are planned, together with rebuilding the OLAC metadata harvester and aggregator.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide19.png' alt=' ' title='Slide: 19' border='1' width='85%%'/>
<p>The new Language Data Terms have been published at <a href="https://purl.archive.org/language-data-commons/terms">https://purl.archive.org/language-data-commons/terms</a></p>
<p>These terms have been modernised and mainstreamed from previous ways of describing resources, for example instead of describing the main item of interest as a PrimaryText (where text is any kind of communicative resource – not a bitstream of characters) we use the term PrimaryResource. And in the example in the image, the type of genre <em>Informational</em> has been added to the set proposed in the OLAC vocabulary.</p>
</section>
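<p>In an RO-Crate metadata document the new terms sit alongside schema.org via a context extension, along these lines – note that the property names on the File entity below are assumptions for illustration; only the terms URL, <em>PrimaryResource</em> and <em>Informational</em> come from the slides:</p>

```python
# Illustrative only: extending the RO-Crate context with the Language Data
# Commons terms. Property names (materialType, linguisticGenre) are assumed.
metadata = {
    "@context": [
        "https://w3id.org/ro/crate/1.1/context",
        {"ldac": "https://purl.archive.org/language-data-commons/terms#"},
    ],
    "@graph": [
        {"@id": "./", "@type": "Dataset",
         "hasPart": [{"@id": "story-001.txt"}]},
        {"@id": "story-001.txt",
         "@type": "File",
         # the modernised term replacing OLAC's PrimaryText:
         "ldac:materialType": {"@id": "ldac:PrimaryResource"},
         # the genre added to the OLAC-derived set:
         "ldac:linguisticGenre": {"@id": "ldac:Informational"}},
    ],
}

print(metadata["@graph"][1]["ldac:materialType"]["@id"])
```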
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide20.png' alt=' ' title='Slide: 20' border='1' width='85%%'/>
<p>(Image prompt DALL-E a hierarchical whale skeleton digital art)</p>
<p>Before we come back in detail to how RO-Crate works, we will discuss the structure, or skeleton, of our language collections as stored in a repository.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide21.png' alt=' ' title='Slide: 21' border='1' width='85%%'/>
<p>Broadly speaking there are two ways that an Arkisto-style repository can be structured and the profile sets out criteria for choosing one of the options.</p>
<p>For small, stable collections of data an entire collection (often referred to as a ‘corpus’ by linguists) can be stored in a single directory or directory-like structure in an object store.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide22.png' alt=' ' title='Slide: 22' border='1' width='85%%'/>
<p>For larger collections the approach used by PARADISEC and most LDaCA collections is to store each Object or Item (typically a related set of recordings, or a single document) in a directory (or directory-like thing).</p>
<p>In this mode, each Object MUST link back to the Collection Object.</p>
<p>A Collection Object MAY have an explicit listing of hasMember properties, which makes it possible to construct repository navigation (such as websites) more cheaply. This is the approach used in PARADISEC, while in LDaCA these links are constructed by an indexer service or summarizer application.</p>
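<p>The two linking conventions above can be sketched as follows; the identifiers are hypothetical examples, and the consistency check is just an illustration of what an indexer might verify.</p>

```python
# Sketch: wiring collection/object links in RO-Crate-style metadata.
# The arcp identifiers below are hypothetical examples.

# Each object's metadata MUST point back to its collection...
object_entity = {
    "@id": "arcp://name,example-corpus/item/001",
    "@type": "RepositoryObject",
    "name": "Recording session 001",
    "memberOf": {"@id": "arcp://name,example-corpus/collection"},
}

# ...while a collection MAY list its members explicitly, which lets
# navigation be built without crawling every object (the PARADISEC approach):
collection_entity = {
    "@id": "arcp://name,example-corpus/collection",
    "@type": "RepositoryCollection",
    "name": "Example corpus",
    "hasMember": [{"@id": "arcp://name,example-corpus/item/001"}],
}

def members_consistent(collection, obj):
    """True if the object's memberOf link matches the collection's listing."""
    listed = {m["@id"] for m in collection.get("hasMember", [])}
    return obj["memberOf"]["@id"] == collection["@id"] and obj["@id"] in listed

print(members_consistent(collection_entity, object_entity))  # → True
```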
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide23.png' alt='Describo Screenshot editing a collection record (PT) ' title='Slide: 23' border='1' width='85%%'/>
<p>This screenshot shows the Language Data Commons RO-Crate Profile in action. This is the <a href="https://github.com/Arkisto-Platform/describo-online">Describo Online</a> metadata editor, with configuration that reflects the profile being used to describe a language data collection using linked-data metadata.</p>
<p>In this case the description is of the collection object.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide24.png' alt='LDaCA ' title='Slide: 24' border='1' width='85%%'/>
<p>Once the data is described, we ingest it into a repository, as a set of files on disk or object storage and index it in a portal, as you can see in these screenshots.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/25/ldaca-metadata-ecosystem-eresearch-2022/Slide25.png' alt='Demo ' title='Slide: 25' border='1' width='85%%'/>
<p><a href="https://www.youtube.com/watch?v=p-GZbe-Kzww&t=5s">Video of browsing a collection in an LDaCA repo</a> showing:</p>
<ul>
<li>Going to the portal</li>
<li>Selecting a collection</li>
<li>Searching for content</li>
<li>Selecting a notebook</li>
<li>Launching Binder</li>
</ul>
<p>This example notebook explores the collection via the REST API.</p>
<h1>Conclusion</h1>
<p>In this presentation we have shown the major components of an ecosystem for storing, discovering and analysing language data using common standards for describing objects in a repository. The <a href="https://www.researchobject.org/ro-crate/">RO-Crate</a> standard is used as the key metadata container, with a common vocabulary of language-specific terms for describing data. This approach should reduce development costs and increase data reuse. The approach can also be adapted to other disciplines and domains with the development only of new profiles.</p>
</section>
<h1>A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond</h1>
<p>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p>
<p><a href="fair-care-eresearch-2022.pdf">Download as PDF</a></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide00.png' title='Slide: 0' border='1' width='85%%'/>
<p>This is a write-up of a talk given at eResearch Australasia 2022, delivered by Peter Sefton, with some additional detail.</p>
<p>By: Peter Sefton, Jenny Fewster, Moises Sacal Bonequi, Cale Johnstone, Catherine Travis, River Tae Smith, Patrick Carnuccio</p>
<p>Edited by: Simon Musgrave</p>
</section>
<p><img src="Slide01.png" alt="alt text" title="Project Team(alphabetical order) Michael D’Silva Marco Fahmi Leah Gustafson Michael Haugh Cale Johnstone Kathrin Kaiser Sara King Marco La Rosa Mel Mistica Simon Musgrave Joel Nothman Moises Sacal Martin Schweinberger PT Sefton With thanks for their contribution: Partner Institutions:" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide01.png' alt='Project Team(alphabetical order) Michael D’Silva Marco Fahmi Leah Gustafson Michael Haugh Cale Johnstone Kathrin Kaiser Sara King Marco La Rosa Mel Mistica Simon Musgrave Joel Nothman Moises Sacal Martin Schweinberger PT Sefton With thanks for their contribution: Partner Institutions: ' title='Slide: 1' border='1' width='85%%'/>
<p>The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC).</p>
<p>The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities.</p>
<p>The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations. These projects are led by Professor Michael Haugh of the School of Languages and Culture at the University of Queensland with several partner institutions.</p>
</section>
<p><img src="Slide02.png" alt="alt text" title="The Language Data Commons of Australia (LDaCA) and Australian Text Analytics Platform (ATAP) projects received investment (https://doi.org/10.47486/DP768 and https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS)." /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide02.png' alt='The Language Data Commons of Australia (LDaCA) and Australian Text Analytics Platform (ATAP) projects received investment (https://doi.org/10.47486/DP768 and https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS). ' title='Slide: 2' border='1' width='85%%'/>
<p>This work is supported by the Australian Research Data Commons.</p>
</section>
<p><img src="Slide03.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide03.png' alt=' ' title='Slide: 3' border='1' width='85%%'/>
<p>Last year at eResearch Australasia, the Language Data Commons of Australia (LDaCA) team presented a design for a distributed access control system which could look after the A-is-for-accessible in FAIR data; in this presentation we describe and demonstrate a pilot system based on that design, showing how data licenses that allow access by identified groups of people to language data collections can be used with an AAF pilot system (CILogon) to give the right people access to data resources.</p>
<p>The ARDC have invested in a pilot of this work as part of the HASS Research Data Commons and Indigenous Research Capability Program integration activities.</p>
<p>The system has to be able to implement data access policies with real-world complexity, and one of our challenges has been developing a data access policy that works across a range of different collections of language data. Here we present a pilot data access policy that we have developed, describing how this policy captures the decisions that must be made by a range of data providers to ensure data accessibility that complies with diverse legal, moral and ethical considerations.</p>
<p>We will discuss how the <a href="https://www.gida-global.org/care">CARE</a> and <a href="https://www.nature.com/articles/sdata201618">FAIR</a> principles underpin this work, and compare this work to other projects such as <a href="https://ardc.edu.au/project/cadre/">CADRE</a>, which promise to deliver more complex solutions in the future. Initial work is with collections curated in a research context, but we will also address community access to these resources.</p>
<p>The idea is to separate safe storage of data from its delivery. Each item in a repository is stored with licensing information in natural language (English at the moment, but it could be other languages) and the repository defers access decisions to an authorization system, where data custodians can design whatever process they like for granting license access. This can range from simple click-through licenses where anyone can agree to license terms, to detailed multi-step workflows where applicants are vetted on whatever criteria the rights holder wishes: qualifications, membership of a cultural group, whether they have paid a subscription fee, and so on.</p>
</section>
<p><img src="Slide04.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide04.png' alt=' ' title='Slide: 4' border='1' width='85%%'/>
<p>Regarding rights, our project is informed by the <a href="https://www.gida-global.org/care">CARE</a> principles for Indigenous data which also describe the level of respect which should be given to any data collected from individuals or communities.</p>
<blockquote>
<p>The current movement toward open data and open science does not fully engage with Indigenous Peoples rights and interests. Existing principles within the open data movement (e.g. FAIR: findable, accessible, interoperable, reusable) primarily focus on characteristics of data that will facilitate increased data sharing among entities while ignoring power differentials and historical contexts. The emphasis on greater data sharing alone creates a tension for Indigenous Peoples who are also asserting greater control over the application and use of Indigenous data and Indigenous Knowledge for collective benefit</p>
</blockquote>
</section>
<p><img src="Slide05.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide05.png' alt=' ' title='Slide: 5' border='1' width='85%%'/>
<p>We are designing the system so that it can work with diverse ways of expressing access rights; for example, we are considering how the approach described here could be extended based on the likes of the <a href="https://localcontexts.org/labels/traditional-knowledge-labels/">Traditional Knowledge labels</a>, incorporating them into the data licensing framework we discuss below.</p>
</section>
<p><img src="Slide06.png" alt="alt text" title="Case Study - Sydney Speaks" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide06.png' alt='Case Study - Sydney Speaks ' title='Slide: 6' border='1' width='85%%'/>
<p>In this talk we look at a case-study with the <a href="https://slll.cass.anu.edu.au/sydney-speaks">Sydney Speaks project</a> via LDaCA steering committee member Professor <a href="https://orcid.org/0000-0002-1410-3268">Catherine Travis</a>.</p>
<blockquote>
<p>This project seeks to document and explore Australian English, as spoken in Australia’s largest and most ethnically and linguistically diverse city – Sydney.</p>
<p>The title “Sydney Speaks” captures a key defining feature of the project: the data come from recorded conversations between Sydney siders, as they tell stories about their lives and experiences, their opinions and attitudes. This allows us to measure how their lived experiences impact their speech patterns.</p>
<p>Working within the framework of variationist sociolinguistics, we examine variation in phonetics, grammar and discourse, in an effort to answer questions of fundamental interest both to Australian English, and language variation and change more broadly, including:</p>
<ul>
<li>How has Australian English as spoken in Sydney changed over the past 100 years?</li>
<li>Has the change in the ethnic diversity over that time period (and in particular, over the past 40 years) had any impact on the way Australian English is spoken?</li>
<li>What affects the way variation and change spread through society?
<ul>
<li>Who are the initiators and who are the leaders in change?</li>
<li>How do social networks function in a modern metropolis?</li>
<li>What social factors are relevant to Sydney speech today, and over time (gender? class? region? ethnic identity?)</li>
</ul>
</li>
</ul>
<p>A better understanding of what kind of variation exists in Australian English, and of how and why Australian English has changed over time, can help society be more accepting of speech variation and even help address prejudices based on ways of speaking.</p>
<p>Source: <a href="http://www.dynamicsoflanguage.edu.au/sydney-speaks/">http://www.dynamicsoflanguage.edu.au/sydney-speaks/</a></p>
</blockquote>
<p>The collection contains recordings of people speaking, both contemporary and historic.</p>
<p>Because this research involved human participants, there are restrictions on the distribution of data - a situation we see in studies involving people across a huge range of disciplines.</p>
</section>
<p><img src="Slide07.png" alt="alt text" title="Sydney Speaks Licenses" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide07.png' alt='Sydney Speaks Licenses ' title='Slide: 7' border='1' width='85%%'/>
<p>There are four tiers of data access we need to enforce and observe for this data based on the participant agreements and ethics arrangements under which the data were collected.</p>
<p>Concerns about rights and interests are important for any data involving people, and a large amount of the data we are using, both Indigenous and non-Indigenous, will require access control that ensures data is shared with the right users under the right conditions.</p>
</section>
<p><img src="Slide08.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide08.png' alt='' title='Slide: 8' border='1' width='85%%'/>
<p>(Image generated by DALLE - prompt: A NSW Driver license for a wolfhound pup named Floki)</p>
<p>Let’s go over some basics, starting with <em>licences</em>.</p>
<p>A licence in this context is <em>a natural language document</em> in which a copyright holder sets out the terms and conditions of use for data. Licences <em>may</em> have metadata that describes them, e.g. a property to say that a licence is an open one (and so does not require a check when serving data).</p>
<p>A license is not a computer program, or configuration, or an AI entity that can make decisions; it’s a legal document. You may also know this as a “data sharing agreement” or “terms of use”. Examples of licenses we see all the time are the GNU GPL or the various Creative Commons licenses, which grant rights to others to redistribute a creative work and specify conditions on what changes are permitted.</p>
<p>That said, metadata <em>about</em> a license can be used to automate decision making: if a license is labelled as open, then a repository can serve the data without further checks; if it is labelled as “closed” or, more aptly, “authorization-required”, then repository software can perform an authorization step, which we cover in detail later.</p>
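<p>A minimal sketch of that decision, assuming a hypothetical <code>accessType</code> property on the license metadata (the property name and values are our invention, not a standard):</p>

```python
# Sketch of how metadata *about* a license might drive repository behaviour.
# "accessType" and its values are illustrative assumptions, not a spec.

def access_decision(license_metadata, user_is_authorized):
    """Decide whether to serve data based on license metadata.

    Open licenses are served without checks; anything else defers to an
    external authorization step ("has this user been granted the license?").
    """
    if license_metadata.get("accessType") == "open":
        return "serve"
    return "serve" if user_is_authorized else "deny"

open_cc = {"@id": "https://creativecommons.org/licenses/by/4.0/",
           "accessType": "open"}
restricted = {"@id": "https://example.org/licenses/study-participants-only",
              "accessType": "authorization-required"}

print(access_decision(open_cc, user_is_authorized=False))     # serve
print(access_decision(restricted, user_is_authorized=False))  # deny
print(access_decision(restricted, user_is_authorized=True))   # serve
```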
<p>In the world of research data generated by or about human participants, licenses can’t always allow unauthenticated access and data redistribution; they may permit distribution only to certain people, or classes of person. Some data (particularly data that has not been or cannot be de-identified) can only be made available to the original research team.</p>
</section>
<p><img src="Slide09.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide09.png' alt=' ' title='Slide: 9' border='1' width='85%%'/>
<p>(Dall-e prompt : A sad dog sitting on an iceberg, XKCD)</p>
<p>So, a license is a document that expresses conditions such as “Data can be used by other researchers”, but unfortunately we don’t have systems in the research-data ecosystem that can automatically identify a user as “a researcher” (this may be surprising to some, but the Australian Access Federation can, at this stage, only say that someone has an account with an institution - it can’t tell a professor from a student administration officer, and there are certainly no lists of “certified linguists”).</p>
<p>Here are some cold hard facts. We don’t have an authority that can identify someone as:</p>
<ul>
<li>a “researcher”,</li>
<li>a “linguist”,</li>
<li>an “anthropologist”,</li>
<li>or a member of an ARC (Australian Research Council) research project.</li>
</ul>
<p>The <a href="https://ardc.edu.au/project/cadre/">CADRE</a> project is working on systems that will eventually support all these things, but they are not available as services yet, and their initial focus is on government data, so we have to work out ways for our data custodians to make decisions on who is considered an “other researcher” in the absence of attribute-based authentication.</p>
</section>
<p><img src="Slide10.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide10.png' alt='' title='Slide: 10' border='1' width='85%%'/>
<p>The access control system we have been prototyping is based on licenses.</p>
<p>For any data object - which could be an entire collection, one set of recordings of a speaker in a speech study, a set of hand written linguistic field notes from the 1950s, a novel, and so on - we store a license with it. This means that future archivists, librarians and researchers can work out how to manage the data if the systems we build today for automated access are no longer operational, and we give the license an ID, a URL we can use to identify it uniquely.</p>
<p>This diagram shows how a license is explicitly linked to the data using a metadata description standard known as “Research Object Crate” (<a href="http://ptsefton.com/2019/11/05/RO-Crate%20eResearch%20Australasia%202019/index.html">RO-Crate</a>). Each object in the repository is a crate, with a metadata file that describes the object and (optionally) its component files, including the data license.</p>
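<p>In RO-Crate terms, the linkage might look like the following sketch, where the license is a contextual entity in the same metadata graph and its <code>@id</code> is the URL used to identify it; the URL and wording below are invented examples.</p>

```python
# Sketch: linking a natural-language license to an object in its RO-Crate
# metadata graph. The license URL and wording are invented examples.
graph = [
    {
        "@id": "./",
        "@type": "Dataset",
        "name": "Speech study recordings, speaker 042",
        "license": {"@id": "https://example.org/licenses/speech-study-v1"},
    },
    {
        "@id": "https://example.org/licenses/speech-study-v1",
        "@type": "CreativeWork",
        "name": "Speech Study Data Access License v1",
        "description": "Data may be used by approved researchers only ...",
    },
]

# Resolve the dataset's license entity from the same graph:
by_id = {e["@id"]: e for e in graph}
license_entity = by_id[graph[0]["license"]["@id"]]
print(license_entity["name"])
```

<p>Because the license travels with the object as plain metadata, a future curator can read it even if today’s access-control services are gone.</p>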
</section>
<p><img src="Slide11.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide11.png' alt='' title='Slide: 11' border='1' width='85%%'/>
<p>(This diagram has been updated from the one presented at eResearch to show two portals instead of one)</p>
<p>Every item in a repository has a license, which may be an open one like CC Share Alike or a custom license derived from the ethics and participants agreements for a study in the context of local laws and institutional policy.</p>
<p>Using this license, distributed access portals in our architecture can check against an authorization system for each request for data. The portals may both host data with the same licensing but do not need to maintain access control lists.</p>
</section>
<p><img src="Slide12.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide12.png' alt='' title='Slide: 12' border='1' width='85%%'/>
<p>(Images: Various baskets of puppies by DALL-E)</p>
<p>When we first developed access controls for LDaCA in 2021 it was a requirement that data licensing and access control decisions be decoupled from each other, and from particular repository software. The usual approach in repositories is to build in a local access-control system, but this is tied to a particular implementation and will not work in a distributed environment where there are multiple different repositories, and services such as computational resources that researchers need to access to process data.</p>
<p>We could not find an available open source system for managing license-based access to data, so our starting approach used groups as a proxy for granting licences, on the basis that all common user-directory services such as LDAP include the concept of user groups.</p>
<p>Scope:</p>
<ul>
<li>
<p>simplest possible license based approach to access control</p>
</li>
<li>
<p>NOT attempting to be attribute based as that is not currently feasible within our project scope (see <a href="https://ardc.edu.au/project/cadre/">CADRE</a> for progress in that direction)</p>
</li>
</ul>
</section>
<p><img src="Slide13.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide13.png' alt='' title='Slide: 13' border='1' width='85%%'/>
<p>The first prototype, which we presented at eResearch Australasia last year, was a proof-of-concept GitHub-based system. This demonstrated that authorization can be delegated from a repository to an external service. For each of the Sydney Speaks licenses there was a GitHub group (organization). The repository, when asked to serve data, would get the user to log in using the GitHub authentication service, then check whether the user was in the correct license group.</p>
<p>This worked, but there were issues with this approach:</p>
<ul>
<li>
<p>There are no workflow options (unless we build a workflow system), just adding people to a Github organisation to pre-authorize them</p>
</li>
<li>
<p>The system only supported a single logon service, which is not widely used in academia or by community groups</p>
</li>
</ul>
<p>So, we talked to our colleagues at the Australian Access Federation (AAF) about a supported, research-sector-wide service.</p>
</section>
<p><img src="Slide14.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide14.png' alt='' title='Slide: 14' border='1' width='85%%'/>
<p>The AAF, as it happened, was already working with other research groups on a service called <a href="https://www.cilogon.org/">CILogon</a> (hosted in the USA initially, but soon to be hosted in Australia). Like GitHub, this service has groups (which were our way of associating users with licenses in the absence of a specific license-granting service), but it also allows users to log in with a variety of authentication providers, including research institutions via the Australian Access Federation, as well as social logins such as Google and Microsoft (and our old friend GitHub).</p>
<p>Again this worked, but the current version of CILogon does not have particularly easy-to-use ways for a license-holder to create groups - there are a number of abstract constructs to deal with and there is currently no way to build an approval workflow using the web interface, so as with the Github trial we would have needed to build this part (all of this may change, as the software is under constant development).</p>
<p>There is a <a href="https://youtu.be/xEWXiM-jUfY">nine minute silent video</a> of what this looked like on YouTube for those who are really interested.</p>
<p>AAF is engaging with our project on the following:</p>
<ul>
<li>a cloud-based authentication and authorisation infrastructure (AAI) to support the needs of the project</li>
<li>understand and develop business process documentation for authorising access to data and services</li>
<li>configure the AAI to support these business processes and to develop extensions to facilitate new functionality that may be required</li>
<li>create a set of policies, standards and guidelines for managing researchers’ identity and access management</li>
<li>develop support documentation, train community representatives to operate the platform, and provide support to the community managers.</li>
</ul>
<p>The AAF has recommended CILogon and REMS as potential solutions to investigate and prototype.</p>
<p>CILogon is a federated identity management platform that provides the following features:</p>
<ul>
<li>support for institutional and community logins</li>
<li>cross-institutional and community collaboration</li>
<li>federated identity and group management</li>
<li>a community management dashboard</li>
<li>OIDC connectors for downstream services that support authorisation claims for services like:
<ul>
<li>REMS</li>
<li>BinderHub</li>
<li>JupyterHub</li>
<li>LDaCA Dashboard</li>
</ul>
</li>
</ul>
<p>REMS (Resource Entitlement Management System) is a tool to help researchers browse resources such as datasets relevant to their research and to manage the application process for access to the resources.</p>
</section>
<p><img src="Slide15.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide15.png' alt='REMS ' title='Slide: 15' border='1' width='85%%'/>
<p>Recently (after the abstract for this presentation was submitted) the AAF team made us aware of the Resource Entitlement Management System, <a href="https://github.com/CSCfi/rems">REMS</a>, which is an open source application out of Finland. This software is the missing link for LDaCA in that it allows a data custodian to grant licenses to users. And it works with CILogon as an Authentication layer so we can let users log in using a variety of services.</p>
<p>At the core of REMS is a set of Licenses which can then be associated with Resources - in our design this is (almost always) a one-to-one correspondence; for example, we would have a licence “Sydney Speaks Data Researcher Access License” corresponding to a Resource that represents ALL data with that licence. These Resources can then be made available through a catalog, and workflows can be set up for pre-authorization processes ranging from single-click authorizations, where a user just accepts a licence and a bot approves it, to complex forms where users upload credentials and one or more data custodians approve their request and grant them the licence.</p>
<p>It also has features for revoking permissions, and has a full API so admin tasks can be automated (for us that’s in the future).</p>
<p>Once a user has been granted a license in a pre-authorization process, a repository can authorize access to a resource by checking with REMS to see whether that user is pre-authorized - that is, has been granted the license. Note that users do not have to find REMS on their own - they will be directed to it from data and computing services when they need to apply for pre-authorization.</p>
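<p>A sketch of that pre-authorization check, using a mocked entitlement list rather than a live call; the field names below are simplified assumptions, and the real contract is defined by the REMS API documentation:</p>

```python
# Sketch of the authorization check a repository might make against REMS.
# The response shape is a simplified assumption; consult the REMS API
# documentation for the actual entitlements contract.

def is_preauthorized(entitlements, user_id, resource_id):
    """True if REMS reports an active entitlement (a granted license)
    linking this user to this resource."""
    return any(
        e["user"] == user_id and e["resource"] == resource_id
        and e.get("end") is None  # no end date => entitlement still active
        for e in entitlements
    )

# A mocked REMS response for a user who holds one active license
# and one revoked/expired one:
mock_response = [
    {"user": "user-123", "resource": "sydney-speaks-researcher", "end": None},
    {"user": "user-123", "resource": "old-corpus", "end": "2021-01-01"},
]

print(is_preauthorized(mock_response, "user-123", "sydney-speaks-researcher"))  # True
print(is_preauthorized(mock_response, "user-123", "old-corpus"))                # False
```

<p>Because the check is a simple lookup, any portal in the network can make the same call without holding its own access control list.</p>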
</section>
<p><img src="Slide16.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide16.png' alt=' ' title='Slide: 16' border='1' width='85%%'/>
<p>This interaction diagram shows the flow involved in a user applying for a data license via REMS.</p>
<p>Not shown here are some design and preparation steps:</p>
<ul>
<li>
<p>The research team read their ethics approval and participant agreements and craft one or more access agreements (AKA licenses) for a data set. (NOTE: if the data can be made available automatically with just a license attached, such as when all parties have agreed that data can be Creative Commons licensed, or the data is in the public domain, then the following steps are not required.)</p>
</li>
<li>
<p>The research team and support staff add the license to REMS, creating a “resource”, a virtual offering that corresponds to any dataset that has the above license.</p>
</li>
<li>
<p>The research team add a workflow to REMS - this could range from an auto-approved click through where users can agree to license terms, through to detailed (manual) checking of their credentials.</p>
</li>
</ul>
<p>The next slide shows the interactions involved in accessing data once a user has been granted the license.</p>
</section>
<p><img src="Slide17.png" alt="alt text" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide17.png' alt=' ' title='Slide: 17' border='1' width='85%%'/>
<p>This diagram shows the “access-control dance” for a user who has been granted a license in REMS obtaining access to a dataset at a data portal which gives access to data in a repository or archive.</p>
</section>
<p><img src="Slide18.png" alt="alt text" title="Demo" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide18.png' alt='Demo ' title='Slide: 18' border='1' width='85%%'/>
<p>In this video we demonstrate how to use REMS and how a user requests access to an LDaCA resource.</p>
</section>
<p><img src="Slide19.png" alt="alt text" title="FAQ" /></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide19.png' alt='FAQ ' title='Slide: 19' border='1' width='85%%'/>
<p>(This section was added after the conference, to try to summarize the discussion and clarify requirements by starting an FAQ on this approach)</p>
<h2>Q: Why not "just" implement an access control list (ACL) in the repository?</h2>
<p>There are a few reasons for the distributed approach we have taken in LDaCA:</p>
<ol>
<li>
<p>ACLs need maintenance over time - people's identities change, they retire and die - so storing a list of identifiers such as email addresses alongside content is not a viable long-term preservation strategy. Rather, we will encourage data custodians to describe in words, in a license, what uses of the data are permitted and by whom, then allow whoever is the current data custodian to manage that access in a separate administrative system. We expect these administrative systems to be ephemeral and to change over time, but also to generate less friction over time as standards are developed. Expected future benefits of concentrating these processes include that people will not have to prove the same claims about themselves multiple times, and that it will be easier for data custodians to authorize access.</p>
</li>
<li>
<p>LDaCA data will be stored in a variety of places with separate portal applications serving data for specific purposes; if these systems all have in-built authorization schemes, even if they are the same, then we have the problem of synchronizing access control lists around a network of services.</p>
</li>
<li>
<p>Accessing data that requires an authorization process is not specific to language or humanities research, so working with an existing application that can handle pre-authorization workflows and access-control decisions is an attractive choice. It should allow LDaCA to take advantage of centrally managed services whose functionality improves over time, rather than having to develop and maintain our own systems.</p>
</li>
<li>
<p>If complex access controls are implemented inside a system then there is a risk that data becomes stranded inside that system and cannot be reused without completely re-implementing the access control. For example, imagine an archive of cultural material with complex access controls encoded into the business logic, such as “this item is accessible only to male initiates”. Applications like this need to store user accounts, with attributes on both data and user records that can be used to authorize access. There is a high risk of data being stranded in such a system if it is no longer supported. This will be mitigated somewhat if the rules are also expressed as licenses, perhaps as a composition of Traditional Knowledge (TK) Labels - but the access system is baked into the application and not portable.</p>
</li>
</ol>
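To make the contrast concrete, here is a minimal Python sketch of the decoupling argued for above: the stored data records only a license identifier, while the user-to-license grants live in a separate, replaceable service. All names and identifiers are invented for illustration; the grants table stands in for an external system such as REMS, not an LDaCA API.

```python
# Hypothetical: each dataset carries only a license identifier.
DATASETS = {
    "dataset-001": {"license": "https://example.org/licenses/participants-only"},
}

# The mapping from users to licenses lives elsewhere (e.g. REMS) and can
# be rebuilt or swapped without touching the stored data.
GRANTS = {
    ("alice@example.org", "https://example.org/licenses/participants-only"),
}

def may_access(user, dataset_id):
    """Authorize by license, not by a per-item access control list."""
    licence = DATASETS[dataset_id]["license"]
    return (user, licence) in GRANTS
```

Because the dataset only names its license, the grants table can be exported, audited or re-implemented in another system without rewriting anything stored with the data.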
<h2>Q: Yes but why does data need to have a license if we already have access controls?</h2>
<p>The point of Research Data Commons projects like LDaCA is to create an ecosystem where data can be re-used. For language data, this means that users, including researchers and community members, will be able to download data for certain authorised purposes and activities. The license is the way that data custodians communicate to data users (and future administrators) what those purposes and activities are.</p>
<p>A license, which is always packaged with the data, will allow:</p>
<ul>
<li>
<p>A user to inspect a five-year-old dataset in their downloads folder and work out what they are allowed to do with it.</p>
</li>
<li>
<p>An IT professional to clean up a laptop that has been handed in by (or seized from – it happens) a departing faculty member.</p>
</li>
<li>
<p>A developer to re-create access controls when replacing a decommissioned system.</p>
</li>
</ul>
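As an illustration of the first scenario above, the packaged license can be recovered from a dataset folder with no live service at all. This sketch assumes the data was packaged as an RO-Crate with the license recorded on the root dataset entity; the file name follows the RO-Crate convention, but the point is simply that the license travels with the data.

```python
import json
import os

def packaged_license(crate_dir):
    """Read the license identifier out of a downloaded crate's metadata file."""
    with open(os.path.join(crate_dir, "ro-crate-metadata.json")) as f:
        metadata = json.load(f)
    # The root dataset entity is the one whose @id is "./"
    root = next(e for e in metadata["@graph"] if e.get("@id") == "./")
    return root["license"]["@id"]
```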
<h2>Q: So many licenses! Sounds like a lot of work!</h2>
<p>We expect that the overhead of writing licenses will diminish greatly over time as standard clauses and complete licenses are established. A data depositor will be able to choose from a set of standard license terms (such as a standard “restricted to CIs and participants” license for a given repository), using that as a template to mint their own license for a given dataset with its own name and ID. The depositor can choose a standard way of adding pre-authorized licensees (such as email invitations). This ID can then be used by an authorization system.</p>
<h2>Q: So you have centralized authorization into a system that grants licenses; doesn't that mean you are locked in to that system?</h2>
<p>No, and yes.</p>
<p><strong>No</strong>, there is no lock-in regarding the list of licenses and pre-authorized users; licenses and access control lists can be exported via an API, so it is possible to import them into another system or save them for audit purposes.</p>
<p><strong>Yes</strong>, there is lock-in, in that at this stage the workflow used to give access to users is specific to the system (such as REMS).</p>
<p><strong>But</strong>, because our process requires a governance step <em>first</em>, writing a license, there is a statement of intent for re-building those processes later if needed; this step is very likely to be missing in a system with built-in access control.</p>
<p>Also, over time, we expect the administrative burden of constructing workflows to lessen as standards are developed in a couple of areas:</p>
<ol>
<li>
<p>Licenses can be made less complex (particularly in the context of academic studies) if they specify re-use by particular known cohorts in advance - this comes down to improving the design of studies to encourage data reuse. This may also help to simplify academic ethics processes in the medium to long term.</p>
</li>
<li>
<p>The CADRE project is looking to improve pre-authorization workflows that automatically source relevant information about potential users - fetching their publication record, and potentially remembering what certifications they have, so these attributes can be used and reused for decision making. It is conceivable that this approach might be useful in cultural contexts as well to allow data custodians to manage data sharing - this is a discussion we have yet to have in the broader HASS RDC.</p>
</li>
</ol>
<h2>Q: What if I have a really simple requirement like giving access to just a couple of people - doesn’t this license approach just add complexity?</h2>
<p>If a data item needs to be locked down to a small group of people, say the chief investigator and the participants in a recorded dialogue, then an obvious implementation is to maintain a small access control list (ACL) for the item. But all of the issues identified above with application-specific ACLs apply regardless of the size of the cohort: the dataset can’t be access-controlled outside of its home system. If the system is no longer running then the data may be completely inaccessible, and if there is no license document stored with the data setting out terms of re-use in general terms, there is no indication to future administrators about who, if anyone, should have access to the data.</p>
<h2>Q: We don’t need a license, we have a “terms of use”</h2>
<p>Same thing: “terms of use” for data are what a license sets out. We are designing our systems so that all the relevant terms and conditions go in one place, to minimize confusion.</p>
</section>
<p>The final three slides have been contributed by co-author Patrick.</p>
<p>These slides briefly outline the AAF process for the next phase, which will provide the foundations for the development of the service and the creation of the policies that will support the community and the service.</p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide20.png' alt='Next Steps in the AAF Engagement ….. Revisit and consolidate the project’s vision through interviews and engagement with stakeholders, collaborators and community participants This next phase will provide the foundations for the development of policies to support the community. These will assist in creating a trusted community for access to sensitive data that supports the good practice and protocols. These activities are outlined in the following slides … ' title='Slide: 20' border='1' width='85%%'/>
<p>This process will support the project to deliver a viable service that meets researchers’ needs and is trusted by the community and the participants to safely distribute data to authorised persons.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide21.png' alt='The AAF's business analyst conducts interviews with stakeholders and community members to discover and formalise the community's processes and requirements ' title='Slide: 21' border='1' width='85%%'/>
<p>The AAF’s business analyst is conducting interviews with the key stakeholders.
This discovery process will collect information on the current and the “to-be” state of the service.</p>
<p>Together these will establish goals and expectations and provide the basis for further prototyping a service that meets stakeholder needs.</p>
<p>The process will facilitate the building of a service that empowers the data custodians, the communities and participants to manage access.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/11/16/fair-care-eresearch-2022/Slide22.png' alt='These feeds into the prototyping & implementing phase ' title='Slide: 22' border='1' width='85%%'/>
<p>The basis for prototyping is iterative:</p>
<ul>
<li>Identify</li>
<li>Prioritise</li>
<li>Pilot</li>
<li>Review</li>
<li>Update requirements</li>
</ul>
<p>This leads to a production service that meets participant, community and researcher requirements and unifies the services, policies and trust framework for the community.</p>
</section>
<h1>HASS RDC Technical Advisory Group Meeting: LDaCA & ATAP Intro</h1>
<p>Peter Sefton, 2022-02-18</p>
<p>This is a presentation I gave to the <a href="https://ardc.edu.au/collaborations/strategic-activities/hass-and-indigenous-research-data-commons/">Humanities, Arts and Social Sciences Research Data Commons and Indigenous Research Capability Program</a> Technical Advisory Group on Friday 11th February 2022. Thanks to Simon Musgrave for reviewing this and adding a little detail here and there.</p>
<p>We will post this to the Language Data Commons of Australia (LDaCA) website soon and I'll link it here.</p>
<p><a href="https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/HASS RDC Technical Advisory Group Meeting LDaCA & ATAP Intro.pdf">PDF version</a></p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide00.png' alt='HASS RDC Technical Advisory Group Meeting
<p>LDaCA & ATAP
Intro
Peter Sefton - p.sefton@uq.edu.au
' title='Slide: 0' border='1' width='85%%'/></p>
<p>The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are establishing a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC).
The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide01.png' alt='
<p>' title='Slide: 1' border='1' width='85%%'/></p>
<p>For this Research Data Commons work we are using the Arkisto Platform (introduced <a href="http://ptsefton.com/2020/11/23/Arkisto/index.html">at eResearch 2020</a>).</p>
<p>Arkisto aims to secure the long term preservation of data independently of code and services - recognizing the ephemeral nature of software and platforms. We know that sustaining software platforms can be hard and aim to make sure that important data assets are not locked up in database or hard-coded logic of some hard-to-maintain application.</p>
<p>We are using three key standards on this project …</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide02.png' alt='
<p>' title='Slide: 2' border='1' width='85%%'/></p>
<p>The first standard is the <a href="https://ocfl.io/1.0/spec/">Oxford Common File Layout</a> - this is a way of keeping version controlled digital objects on a plain old filesystem or object store.</p>
<p>Here’s the introduction to the spec:</p>
<blockquote>
<h2>Introduction</h2>
<p>This section is non-normative.</p>
<p>This Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital objects in a structured, transparent, and predictable manner. It is designed to promote long-term access and management of digital objects within digital repositories.</p>
<h2>Need</h2>
<p>The OCFL initiative began as a discussion amongst digital repository practitioners to identify well-defined, common, and application-independent file management for a digital repository's persisted objects and represents a specification of the community’s collective recommendations addressing five primary requirements: completeness, parsability, versioning, robustness, and storage diversity.</p>
<h2>Completeness</h2>
<p>The OCFL recommends storing metadata and the content it describes together so the OCFL object can be fully understood in the absence of original software. The OCFL does not make recommendations about what constitutes an object, nor does it assume what type of metadata is needed to fully understand the object, recognizing those decisions may differ from one repository to another. However, it is recommended that when making this decision, implementers consider what is necessary to rebuild the objects from the files stored.</p>
<h2>Parsability</h2>
<p>One goal of the OCFL is to ensure objects remain fixed over time. This can be difficult as software and infrastructure change, and content is migrated. To combat this challenge, the OCFL ensures that both humans and machines can understand the layout and corresponding inventory regardless of the software or infrastructure used. This allows for humans to read the layout and corresponding inventory, and understand it without the use of machines. Additionally, if existing software were to become obsolete, the OCFL could easily be understood by a light weight application, even without the full feature repository that might have been used in the past.</p>
<h2>Versioning</h2>
<p>Another need expressed by the community was the need to update and change objects, either the content itself or the metadata associated with the object. The OCFL relies heavily on the prior art in the [Moab] Design for Digital Object Versioning which utilizes forward deltas to track the history of the object. Utilizing this schema allows implementers of the OCFL to easily recreate past versions of an OCFL object. Like with objects, the OCFL remains silent on when versioning should occur recognizing this may differ from implementation to implementation.</p>
<h2>Robustness</h2>
<p>The OCFL also fills the need for robustness against errors, corruption, and migration. The versioning schema ensures an OCFL object is robust enough to allow for the discovery of human errors. The fixity checking built into the OCFL via content addressable storage allows implementers to identify file corruption that might happen outside of normal human interactions. The OCFL eases content migrations by providing a technology agnostic method for verifying OCFL objects have remained fixed.</p>
<p>Storage diversity
Finally, the community expressed a need to store content on a wide variety of storage technologies. With that in mind, the OCFL was written with an eye toward various storage infrastructures including cloud object stores.</p>
</blockquote>
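The versioning idea in the spec extract above can be sketched in a few lines. This is illustrative only: a real OCFL object uses a fixed inventory schema with digest-keyed manifests and forward deltas, whereas this toy keeps a much-simplified inventory. It just shows the principle of immutable version directories plus a machine- and human-readable inventory on a plain filesystem.

```python
import hashlib
import json
import os

def add_version(object_root, files):
    """Write files into a new vN/content directory and record their
    digests in a (simplified, non-spec) inventory.json at the root."""
    inv_path = os.path.join(object_root, "inventory.json")
    if os.path.exists(inv_path):
        with open(inv_path) as f:
            inventory = json.load(f)
    else:
        inventory = {"versions": {}}
    version = "v%d" % (len(inventory["versions"]) + 1)
    state = {}
    for name, data in files.items():
        digest = hashlib.sha512(data).hexdigest()
        path = os.path.join(object_root, version, "content", name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
        # Content-addressing: the digest lets us verify fixity later.
        state[digest] = [name]
    inventory["versions"][version] = {"state": state}
    with open(inv_path, "w") as f:
        json.dump(inventory, f, indent=2)
    return version
```

Earlier versions are never rewritten; each update only adds a new version directory and a new inventory entry, which is what makes the layout parsable and robust without any repository software.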
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide03.png' alt='☁️
📂
<p>📄
ID? Title? Description?</p>
<p>👩🔬👨🏿🔬Who created this data?
📄What parts does it have?
📅 When?
🗒️ What is it about?
♻️ How can it be reused?
🏗️ As part of which project?
💰 Who funded it?
⚒️ How was it made?
Addressable resources
Local Data
👩🏿🔬 https://orcid.org/0000-0001-2345-6789
🔬 https://en.wikipedia.org/wiki/Scanning_electron_microscope
' title='Slide: 3' border='1' width='85%%'/></p>
<p>The second standard is Research Object Crate. (RO-Crate) a method for describing any dataset of local or remote resources as a digital object using a <strong>single linked-data metadata document</strong>.</p>
<p>RO-Crate is used in our platform both for describing data objects in the OCFL repository, and for delivering metadata over the API (which we’ll show in architecture diagrams and screenshots below).</p>
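For example, a minimal RO-Crate metadata document for an object like the ones described above might look like the following; the dataset name, ORCID and license URL are placeholders, and the point is that the whole description lives in one JSON-LD file.

```python
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # The root dataset: who made it, what it contains, how it
            # may be reused (values here are illustrative).
            "@id": "./",
            "@type": "Dataset",
            "name": "Example speech study session",
            "author": {"@id": "https://orcid.org/0000-0001-2345-6789"},
            "license": {"@id": "https://example.org/licenses/participants-only"},
            "hasPart": [{"@id": "interview.wav"}],
        },
        {"@id": "interview.wav", "@type": "File", "name": "Recorded interview"},
        {
            "@id": "https://orcid.org/0000-0001-2345-6789",
            "@type": "Person",
            "name": "Example Researcher",
        },
    ],
}

# The crate is plain JSON, so it can be written alongside the data files.
serialized = json.dumps(crate, indent=2)
```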
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide04.png' alt='📂
<p>🔬 🔭 📹 💽 🖥️ ⚙️🎼🌡️🔮🎙️🔍🌏📡💉🏥💊🌪️
' title='Slide: 4' border='1' width='85%%'/></p>
<p>RO-Crates may contain any kind of data resource about anything, in any format as a file or URL - it’s not just for language data; there are also many projects in the sciences starting to <a href="https://www.researchobject.org/ro-crate/in-use/">use RO-Crate</a>.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide05.png' alt='
<p>' title='Slide: 5' border='1' width='85%%'/></p>
<p>This image is taken from a <a href="https://slideplayer.com/slide/3919920/">presentation on digital preservation</a> .</p>
<p>See also the <a href="https://pcdm.org/2016/04/18/models">PCDM models</a>.</p>
<p>The third key standard for Arkisto is the Portland Common Data Model (PCDM). Like OCFL, this was developed by members of the digital library/repository community. It was devised as a way to interchange data between repository systems, most of which, it turned out, had evolved very similar models: nested collections and digital objects that aggregate related files. Using this very simple ontology allows us to store data in the OCFL layer in a very flexible way. Depending on factors like data size, licensing, and whether data is likely to change or need to be withdrawn, we can store an entire collection as a single OCFL object or spread it across many OCFL objects, with PCDM used to show the structure of the data collections regardless of how they happen to be stored.</p>
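The shapes PCDM gives us are simple enough to sketch. Here the RO-Crate terms corresponding to the PCDM classes are used (RepositoryCollection for pcdm:Collection, RepositoryObject for pcdm:Object); identifiers are illustrative.

```python
collection = {
    "@id": "#corpus",
    "@type": "RepositoryCollection",
    "name": "Example corpus",
    "hasMember": [{"@id": "#session-1"}],
}
session = {
    "@id": "#session-1",
    "@type": "RepositoryObject",
    "name": "Recording session 1",
    "hasPart": [{"@id": "session-1/audio.wav"}],
}

def members(entity, graph):
    """Resolve membership links to full entities, regardless of how the
    member objects happen to be stored on disk."""
    index = {e["@id"]: e for e in graph}
    return [index[ref["@id"]] for ref in entity.get("hasMember", [])]
```

Because membership is expressed by reference, the collection structure survives whether the members sit in one OCFL object or are spread across many.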
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide06.png' alt='
<p>RO-Crates MUST have licence information that sets out conditions for use/reuse of the data</p>
<p>This RO-Crate contains an entire PCDM collection
' title='Slide: 6' border='1' width='85%%'/></p>
<p>Back to RO-Crates.</p>
<p>RO-Crates are self-documenting and can ship with an HTML file that allows a consumer of the crated data to see whatever documentation the crate authors have added.</p>
<p>This crate contains an entire collection (RepositoryCollection is the RO-Crate term that corresponds to pcdm:Collection).</p>
<p>Crates must have license information that sets out how data may be used and whether it may be redistributed. As we are dealing with language data, which is (almost) always created by people, it is important that their intellectual property rights and privacy are respected. More on this later.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide07.png' alt='
<p>' title='Slide: 7' border='1' width='85%%'/></p>
<p>This shows a page for what we’re calling an Object (RepositoryObject). A RepositoryObject is a single “thing” such as a document, a conversation, or a session in a speech study. (This was called an “item” in Alveo, but given that both the Portland Common Data Model and the Oxford Common File Layout use “Object”, we are using that term, at least for now.)</p>
<p>This shows that the system is capable of dealing with Unicode characters, which is good and what you would expect in 2022 from a Language Data Commons. But there are still challenges, such as dealing with mixtures of left-to-right and right-to-left text, and we need to find or define metadata terms to keep track of “language”, “writing system”, and the difference between material that started as orthographic (written) text versus spoken or signed language. There’s a group of us working on that, currently led by Nick Thieberger and Peter Sefton.</p>
<p>Simon Musgrave and Peter Sefton <a href="https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/">presented our progress with multilingual text</a> at a virtual workshop run by ANU in January.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide08.png' alt='
<p>Link back to the container which has type RepositoryObject
' title='Slide: 8' border='1' width='85%%'/></p>
<p>Here’s another screenshot showing one of the government documents in PDF format - with a link back to the abstract RepositoryObject that houses all of the manifestations of the document in various languages.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide09.png' alt='
<p>Repositories: institutional, domain or both</p>
<p>Find / Access services
Research Data Management Plan
Workspaces:</p>
<p>working storage
domain specific tools
domain specific services
collect
describe
analyse
Reusable, Interoperable
data objects
deposit early
deposit often
Findable, Accessible, Reusable data objects
reuse data objects
V1.1 © Marco La Rosa, Peter Sefton 2021 https://creativecommons.org/licenses/by-sa/4.0/</p>
<p>🗑️
Active cleanup processes workspaces considered ephemeral
🗑️
Policy based data management
' title='Slide: 9' border='1' width='85%%'/></p>
<p>The above diagram takes a big-picture view of research data management in the context of <em>doing</em> research. It makes a distinction between managed repository storage and the places where work is done: “workspaces”. Workspaces are where researchers collect, analyse and describe data. Examples include the most basic of research IT services, file storage, as well as analytical tools such as Jupyter notebooks (the backbone of ATAP, the text analytics platform). Other examples of workspaces include code repositories such as GitHub or GitLab (a slightly different sense of the word “repository”), survey tools, electronic (lab) notebooks and bespoke code written for particular research programmes. These workspaces are essential research systems, but they are usually not set up for long-term management of data.
The cycle in the centre of this diagram shows an idealised research practice where data are collected, described and deposited into a repository frequently. Data are made findable and accessible as soon as possible and can be “re-collected” for use and re-use.</p>
<p>For data to be re-usable by humans and machines (such as ATAP notebook code that consumes datasets in a predictable way) it must be well described. The ATAP and LDaCA approach to this is to use the Research Object Crate (RO-Crate) specification. RO-Crate is essentially a guide to using a number of standards and standard approaches to describe both data and re-runnable software such as workflows or notebooks.</p>
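As a hypothetical example of that last point, a notebook can itself be the payload of a crate, described as software as well as a file so that it can be archived and re-run. The property choices below are illustrative, not a statement of the ATAP/LDaCA profile.

```python
notebook_crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Keyword frequency analysis",  # illustrative name
            "hasPart": [{"@id": "analysis.ipynb"}],
            "license": {"@id": "https://spdx.org/licenses/MIT"},
        },
        {
            # Typed as both a File in the crate and software, so tools
            # can discover it as something runnable.
            "@id": "analysis.ipynb",
            "@type": ["File", "SoftwareSourceCode"],
            "name": "Analysis notebook",
            "programmingLanguage": "Python",
        },
    ],
}
```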
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide10.png' alt='Compute
<p>HPC
Cloud
Desktop</p>
<p>collect
describe
analyse
🗑️
Active cleanup processes workspaces considered ephemeral
… etc
ATAP Notebooks
Apps, Code, Workflows</p>
<p>Deposit /Publish
PARADISEC
Analytics Portal
Code discovery
Launch / Rerun
Data Discovery
Authenticated API</p>
<p>Workbench
Notebooks
Data import by URL
Export fully described pkg
Stretch goals:
Code gen / simple interfaces eg Discursis</p>
<p>BYOData 🥂
⚙️
STORAGE (including Cloudstor)
.
Data Curation
& description
Reuse
Licence Server
Identity Management
AAF / social media accounts</p>
<p>Data Cleaning
OCR / transcription format migration
Archive & Preservation Repositoriesinstitutional, domain or both
AU Nat. Corpus
AusLan (sign)
Sydney Speaks
ATAP Corpus
Reference,Training & BYO
Workspaces:
working storage
domain specific tools
domain specific services
Harvested
external
Lang. portal(s)
Corpus discovery
Item discovery
Authenticated API
Create virtual corpora</p>
<p>' title='Slide: 10' border='1' width='85%%'/></p>
<p>This rather messy slide captures the overall high-level architecture for the LDaCA Research Data Commons. There will be an analytical workbench (left of the diagram) which is the basis of the Australian Text Analytics Platform (ATAP) project; this will focus on notebook-style programming using one of the emerging Jupyter notebook platforms in that space. (This is not 100% decided yet, but that has not stopped the team from starting to collect and develop notebooks that open up text analytics to new coders from the linguistics community.) Our engagement lead, Dr Simon Musgrave, sees the ATAP work as primarily an educational enterprise encouraging researchers to adopt new research practices, which will be underpinned by services built on the Arkisto standards that allow for rigorous, re-runnable research.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide11.png' alt='Compute
<p>HPC
Cloud
Desktop</p>
<p>collect
describe
analyse
🗑️
Active cleanup processes workspaces considered ephemeral
… etc
ATAP Notebooks
Apps, Code, Workflows
Deposit /Publish
PARADISEC
Analytics Portal
Code discovery
Launch / Rerun
Data Discovery
Authenticated API</p>
<p>Workbench
Notebooks
Data import by URL
Export fully described pkg
Stretch goals:
Code gen / simple interfaces eg Discursis</p>
<p>BYOData 🥂
⚙️
STORAGE (including Cloudstor)
.
Data Curation
& description
Reuse
Licence Server
Identity Management
AAF / social media accounts</p>
<p>Data Cleaning
OCR / transcription format migration
Archive & Preservation Repositoriesinstitutional, domain or both
AU Nat. Corpus
AusLan (sign)
Sydney Speaks
ATAP Corpus
Reference,Training & BYO
Workspaces:
working storage
domain specific tools
domain specific services
Harvested
external
Lang. portal(s)
Corpus discovery
Item discovery
Authenticated API
Create virtual corpora</p>
<p>Talking mainly about this bit today
' title='Slide: 11' border='1' width='85%%'/></p>
<p>In this presentation we are going to focus on the portal/repository architecture more than on the ATAP notebook side of things. We know that we will be using (at least) the SWAN Jupyter notebook service provided by AARNet, but we are still scoping how notebooks will be made portable between systems and where they will be stored at various stages of their development. We will be supporting and encouraging researchers to archive notebooks wrapped in RO-Crates with re-use information OUTSIDE of the SWAN platform, though - it’s a workspace, not a repository; it does not have governance in place for long-term preservation.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide12.png' alt='
<p>' title='Slide: 12' border='1' width='85%%'/></p>
<p>This is a much simpler view zooming in on the core infrastructure components that we have built so far. We are starting with bulk ingest of existing collections and will add one-by-one deposit of individual items after that.</p>
<p>This shows the OCFL repository at the bottom, with a Data & Access API that mediates access. This API understands the RO-Crate format and in particular its use of the Portland Common Data Model to structure data. The API also enforces access control on objects; every repository object has a license setting out the terms of use and re-use for its data, which will reflect the way the data were collected: participant agreements, ethics approvals and privacy law are all relevant here. Each license will correspond to a group of people who have agreed to it and/or been selected by a data custodian. We are in negotiations with the <a href="https://aaf.edu.au/">Australian Access Federation (AAF)</a> to use their <a href="https://www.cilogon.org/">CILogon</a> service for this authorization step and for authentication of users across a wide variety of services including the AAF itself and Google, Microsoft, GitHub etc.</p>
<p>There’s also an access portal, based on a full-text index (at this stage we’re using ElasticSearch), which is designed to help people find data they might be interested in using. This follows the conventions for browse/search interfaces that we’re familiar with from shopping sites: you can search for text and/or drill down using <em>facets</em> (which are called aggregations in Elastic-land), e.g. which language am I interested in, or do I want [ ] Spoken or [ ] Written material?</p>
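A faceted interface like this is typically backed by a query combining full-text search with terms aggregations. The following is an illustrative sketch of such an Elasticsearch request body; the field names are invented, not the actual LDaCA index schema.

```python
# A free-text query plus terms aggregations that populate the facets.
query = {
    "query": {"match": {"full_text": "letters"}},
    "aggs": {
        "mode": {"terms": {"field": "communicationMode.keyword"}},
        "language": {"terms": {"field": "language.name.keyword"}},
    },
}

# Clicking a facet value adds a post_filter, narrowing the hits shown
# while leaving the aggregation counts computed over the full result set.
filtered = dict(query, post_filter={"term": {"communicationMode.keyword": "Written"}})
```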
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide13.png' alt='There may be several distributed file-based repositories feeding the same index
There may be several portals using the same index - eg to give collection specific advanced search
There may be other kinds of index such as triple stores or relational databases that index tabular data
' title='Slide: 13' border='1' width='85%%'/>
<p>This architecture is very modular and designed to operate in a distributed fashion, potentially with distributed file and/or object based repositories all being indexed by a centralised service. There may also be other ‘flavours’ of index such as triple or graph stores, relational databases that ingest tabular data or domain specific discovery tools such as corpus analysis software. And, there may be collection specific portals that show a slice of a bigger repository with features or branding specific to a subset of data.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide14.png' alt='
<p>👹
' title='Slide: 14' border='1' width='85%%'/></p>
<p>This implementation of the Arkisto standards-stack is known as Oni. That’s not really an acronym any more, though it once stood for OCFL, Nginx (a web server) or Node (a JavaScript runtime), and an Index. An Oni is a kind of Japanese demon. 👹</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide15.png' alt='
<p>' title='Slide: 15' border='1' width='85%%'/></p>
<p>But how will data get into the OCFL repository? At the moment we’re loading data using a series of scripts which are being developed in our GitHub organization.</p>
<p>This diagram and the next come from the <a href="https://arkisto-platform.github.io/use-cases/">Arkisto Use cases page</a>; they show how we will convert data from existing collections into a form where they can be preserved in an OCFL repository and be part of a bigger collection, ALWAYS with access control based on licenses.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide16.png' alt='
<p>' title='Slide: 16' border='1' width='85%%'/></p>
<p>This is a screenshot of our GitHub organization showing the corpus migration tools we’ve started developing (there are six, plus one general-purpose text-cleaning tool). These repositories have not all been made public yet, but they will be; they contain tools to build Arkisto-ready file repositories that can be made available in one or more portals.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide17.png' alt='PIC OF ALVEO
<p>' title='Slide: 17' border='1' width='85%%'/></p>
<p>Here’s our portal, which gives a browse interface to allow drill-down data discovery.</p>
<p>But wait! That’s not the LDaCA portal - that’s Alveo!</p>
<p>Oh yes, so it is.</p>
<p>Alveo was built ten years ago and has not seen much uptake.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide18.png' alt='
<p>' title='Slide: 18' border='1' width='85%%'/></p>
<p>This screenshot shows some of the browse facets for the COOEE corpus, which contains early Australian <strong>written</strong> English materials. But facets like <code>Written Mode</code> and <code>Communication Medium</code>, both of which are known for COOEE, are not populated.</p>
<p>There were quite a few things wrong with Alveo. We obviously didn’t get the metadata populated to the level that would make these browse facets actually useful for filtering. But more importantly, not enough work was done to check which browse facets <em>are</em> useful, and too little of the budget could be spent on user engagement and training rather than software development.</p>
<p>One of my current LDaCA senior colleagues told me a couple of years ago that Alveo was useless: “I just wanted to get all the data,” they said. Me, I was thinking “but it has an API so you CAN get all the data - what’s the problem?”. We have tried not to repeat this mistake by making sure that the API delivers entire collections, and we have demonstrations of doing this for real work.</p>
<p>Another colleague who was actually on the Alveo team said that this interface was "equally useless for everyone", and they later built a custom interface for one of the collections.</p>
<p>We’re taking these lessons to heart in designing the LDaCA infrastructure, making sure that as we go we have people using the software. It helps that we have an in-house (though distributed) development team rather than an external contractor, so feedback is very fast - we can jump onto a call and demo stuff at any time.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/oni-api-edit-2.gif' alt='API Demo' title='Slide: 19' border='1' width='85%%'/>
<p>We decided to build from the data API first.</p>
<p>In this demo, developer Moises Sacal Bonequi explores the API via the Postman tool, showing how the API can be used to find collections that conform to our metadata profile:</p>
<ol>
<li>First he lists the collections, then chooses one</li>
<li>He then gets a collection with the <code>&resolve</code> parameter, meaning that the API will internally traverse the PCDM collection hierarchy and return ALL the metadata for the collection - down to the file level</li>
<li>He then downloads a file (for which he has a license that most of you reading this don’t have - hence the obfuscation of the dialogue)</li>
</ol>
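<p>The steps above can be sketched in a few lines of Python. Note that the base URL and endpoint paths here are illustrative assumptions based on the demo, not documented Oni routes; only the <code>resolve</code> parameter is taken from the demonstration itself.</p>

```python
from urllib.parse import urlencode

# Hypothetical base URL -- substitute a real Oni instance.
BASE = "https://oni.example.org/api"

def collections_url(base: str = BASE) -> str:
    """Step 1: list the collections held in the repository (assumed path)."""
    return f"{base}/data"

def collection_url(collection_id: str, resolve: bool = True, base: str = BASE) -> str:
    """Step 2: fetch one collection. `resolve` asks the API to walk the
    PCDM collection hierarchy and return ALL the collection's metadata,
    down to the file level."""
    params = {"id": collection_id}
    if resolve:
        params["resolve"] = "true"
    return f"{base}/object?{urlencode(params)}"

print(collection_url("example-corpus"))
```

<p>Step 3 (downloading a licensed file) would then be an ordinary HTTP GET against a file URL from the resolved metadata, sending the user’s access credential, e.g. an <code>Authorization</code> header.</p>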
<p>This API has been used and road-tested to develop techniques for topic modelling on the Sydney Speaks corpus (more on that corpus below) by a student, Marcel Reverter-Rambaldi, under the supervision of Prof Catherine Travis at ANU. We are hoping to publish this work as a re-usable notebook that can be adapted for other projects, and to allow the techniques the ANU team have been developing to be applied to other similar data in LDaCA.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide20.png' alt='
<p>' title='Slide: 20' border='1' width='85%%'/></p>
<p>And Mel Mistica, one of the data scientists who was working with us at UQ, developed a <a href="https://github.com/Australian-Text-Analytics-Platform/ro-crate-metadata/blob/main/ro-crate-metadata.ipynb">demonstration notebook</a> with our tech team that uses the API to access another full collection (one also suitable for the ANU topic-modelling approach). The notebook gets all the metadata for a small social-history collection of transcribed interviews with women in Western Sydney, and shows how a data scientist might explore what’s in it and start asking questions about the data - like the age distribution of the participants - before digging in to what they were talking about.</p>
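<p>As a toy illustration of that kind of exploration, here is a sketch in Python. The participant entries and property names are simplified stand-ins, not the collection’s real metadata:</p>

```python
from collections import Counter

# Simplified stand-ins for participant entities pulled from the
# collection's metadata via the API -- real entities are much richer.
participants = [
    {"name": "Participant 1", "age": "34"},
    {"name": "Participant 2", "age": "52"},
    {"name": "Participant 3", "age": "38"},
]

def age_distribution(people, bin_size=10):
    """Bucket participant ages into bins such as '30-39'."""
    bins = Counter()
    for p in people:
        low = (int(p["age"]) // bin_size) * bin_size
        bins[f"{low}-{low + bin_size - 1}"] += 1
    return dict(bins)

print(age_distribution(participants))  # {'30-39': 2, '50-59': 1}
```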
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/oni-v3.1.gif' alt='Demo' title='Slide: 21' border='1' width='85%%'/>
<p>This screencast shows a work-in-progress snapshot of the Oni portal we talked about above in action, showing how search and browse might be used to find repository objects from the index - in this case searching for Arabic words in a small set of Australian Government documents.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide22.png' alt='
🎄🎁
' title='Slide: 22' border='1' width='85%%'/>
<p>Hang on!</p>
<p>You keep talking about “repositories” - don’t you always say stuff like <a href="http://ptsefton.com/2012/02/14/an-australian-research-data-repository/">A repository is not just a software application. It’s a lifestyle. It’s not just for Christmas</a>?</p>
<p>That’s right - we’ve been talking about repository software architectures here but it is important to remember that a repository needs to be considered an institution rather than a software stack or collection of files, more “University Library” than “My Database”.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide23.png' alt='Compute
<p>HPC
Cloud
Desktop</p>
<p>collect
describe
analyse
🗑️
Active cleanup processes workspaces considered ephemeral
… etc
ATAP Notebooks
Apps, Code, Workflows</p>
<p>Deposit /Publish
PARADISEC
Lang. portal(s)
Corpus discovery
Item discovery
Authenticated API
Create virtual corpora</p>
<p>Analytics Portal
Code discovery
Launch / Rerun
Data Discovery
Authenticated API</p>
<p>Workbench
Notebooks
Data import by URL
Export fully described pkg
Stretch goals:
Code gen / simple interfaces eg Discursis</p>
<p>BYOData 🥂
⚙️
STORAGE (including Cloudstor)
.
Data Curation
& description
Reuse
Licence Server
Identity Management
AAF / social media accounts</p>
<p>Data Cleaning
OCR / transcription format migration
Archive & Preservation Repositories: institutional, domain or both
AU Nat. Corpus
AusLan (sign)
Sydney Speaks
ATAP Corpus
Reference,Training & BYO
Workspaces:
working storage
domain specific tools
domain specific services
Harvested
external
Our demo today looks at this part …
' title='Slide: 23' border='1' width='85%%'/></p>
<p>The next half a dozen slides are based on <a href="https://ptsefton.com/2021/10/12/ldaca2021/index.html">a presentation I gave at eResearch Australasia 2021 with Moises Sacal Bonequi</a></p>
<p>Today we will look in detail at one important part of this architecture - access control. How can we make sure that in a distributed system, with multiple data repositories and registries residing with different data custodians, the right people have access to the right data?</p>
<p>I didn’t spell this out in the recorded conference presentation, but for data that resides in the repositories at the right of the diagram we want to encourage research processes that clearly separate data from code. Notebooks and other code workflows that use data will fetch a version-controlled reference copy from a repository (using an access key if needed), process the data, and produce results that are then deposited into an appropriate repository alongside the code itself. Given that a lot of the data in the language world is NOT available under open licenses such as Creative Commons, it is important to establish this practice: each user of the data must negotiate or be granted access individually. Research can still be reproducible under this model, without fostering a culture of sharing datasets with no regard for the rights of those involved in creating the data.</p>
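<p>The pattern can be sketched as a function with the fetch, analyse and deposit steps injected, so the same analysis code can run against any repository the researcher holds a key for. All names here are illustrative, not part of any LDaCA API:</p>

```python
def run_analysis(dataset_id, fetch, analyse, deposit, access_key=None):
    """Fetch a version-controlled reference copy of the data, analyse it,
    and deposit the results back into a repository -- the analysis code
    never carries around a privately mutated copy of the data."""
    data = fetch(dataset_id, access_key)  # reference copy from a repository
    result = analyse(data)                # code kept separate from data
    return deposit(dataset_id, result)    # results (and code) deposited

# Stub run showing the shape of the workflow:
outcome = run_analysis(
    "sydney-speaks-subcorpus",
    fetch=lambda did, key: ["token", "token", "word"],
    analyse=lambda tokens: len(set(tokens)),
    deposit=lambda did, result: {"dataset": did, "result": result},
)
print(outcome)  # {'dataset': 'sydney-speaks-subcorpus', 'result': 2}
```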
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide24.png' alt='
<p>' title='Slide: 24' border='1' width='85%%'/></p>
<p>Regarding rights, our project is informed by the <a href="https://www.gida-global.org/care">CARE principles</a> for Indigenous data.</p>
<blockquote>
<p>The current movement toward open data and open science does not fully engage with Indigenous Peoples rights and interests. Existing principles within the open data movement (e.g. FAIR: findable, accessible, interoperable, reusable) primarily focus on characteristics of data that will facilitate increased data sharing among entities while ignoring power differentials and historical contexts. The emphasis on greater data sharing alone creates a tension for Indigenous Peoples who are also asserting greater control over the application and use of Indigenous data and Indigenous Knowledge for collective benefit</p>
</blockquote>
<p>But we do not see the CARE principles as applying only to Indigenous data and knowledge. Most language data is a record of the behaviour of people who have moral rights in the material (even if they do not have legal rights), and treating the CARE principles as relevant in such cases ensures serious thinking about the protection of those moral rights.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide25.png' alt='
<p>' title='Slide: 25' border='1' width='85%%'/></p>
<p><a href="https://localcontexts.org/labels/traditional-knowledge-labels/">https://localcontexts.org/labels/traditional-knowledge-labels/</a></p>
<p>We are designing the system so that it can work with diverse ways of expressing access rights, for example licensing like the Traditional Knowledge (TK) Labels. The idea is to separate safe storage of data, with a license on each item which may reference the TK Labels, from a system administered by the data custodians, who can make decisions about who is allowed to access data.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide26.png' alt='Case Study - Sydney Speaks
<p>' title='Slide: 26' border='1' width='85%%'/></p>
<p>We are working on a case-study with the <a href="http://www.dynamicsoflanguage.edu.au/sydney-speaks/">Sydney Speaks project</a> via steering committee member Catherine Travis.</p>
<blockquote>
<p>This project seeks to document and explore Australian English, as spoken in Australia’s largest and most ethnically and linguistically diverse city – Sydney.
The title “Sydney Speaks” captures a key defining feature of the project: the data come from recorded conversations between Sydney siders, as they tell stories about their lives and experiences, their opinions and attitudes. This allows us to measure how their lived experiences impact their speech patterns.
Working within the framework of variationist sociolinguistics, we examine variation in phonetics, grammar and discourse, in an effort to answer questions of fundamental interest both to Australian English, and language variation and change more broadly, including:</p>
<ul>
<li>How has Australian English as spoken in Sydney changed over the past 100 years?</li>
<li>Has the change in the ethnic diversity over that time period (and in particular, over the past 40 years) had any impact on the way Australian English is spoken?</li>
<li>What affects the way variation and change spread through society
<ul>
<li>Who are the initiators and who are the leaders in change?</li>
<li>How do social networks function in a modern metropolis?</li>
<li>What social factors are relevant to Sydney speech today, and over time (gender? class? region? ethnic identity?)
A better understanding of what kind of variation exists in Australian English, and of how and why Australian English has changed over time can help society be more accepting of speech variation and even help address prejudices based on ways of speaking.
Source: <a href="http://www.dynamicsoflanguage.edu.au/sydney-speaks/">http://www.dynamicsoflanguage.edu.au/sydney-speaks/</a></li>
</ul>
</li>
</ul>
</blockquote>
<p>The collection contains both contemporary and historical recordings of people speaking.</p>
<p>Because this involves human participants there are restrictions on the distribution of the data - a situation we see in studies involving people across a huge range of disciplines.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide27.png' alt='Sydney Speaks Licenses
' title='Slide: 27' border='1' width='85%%'/>
<p>There are four tiers of data access we need to enforce and observe for this data based on the participant agreements and ethics arrangements under which the data were collected.</p>
<p>Concerns about rights and interests are important for any data involving people - and a large amount of the data we are using, both Indigenous and non-Indigenous, will require access control that ensures data sharing is appropriate.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/security.gif' alt='Demonstration screencast
' title='Slide: 28' border='1' width='85%%'/>
<p>In this example demo we uploaded various collections and are authorising with GitHub organisations.</p>
<p>In our production release we will use AAF to authorise different groups.</p>
<p>Let's find a dataset: The Sydney Speaks Corpus</p>
<p>As you can see, we cannot see any data.</p>
<p>Let’s log in… We authorise GitHub…</p>
<p>Now you can see we have access to the sub-corpus data, and I am just opening a couple of items.</p>
<p>—</p>
<p>Now in Github we can see the group management example.</p>
<p>I have given myself access to all the licences, as you can see here, and given others access to licence A only.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide29.png' alt='
<p>' title='Slide: 29' border='1' width='85%%'/></p>
<p>This diagram is a sketch of the interaction that took place in the demo - it shows how a repository can delegate authorization to an external system - in this case Github rather than CILogon. But we are working with the ARDC to set up a trial with the Australian Access Federation to allow CILogon access for the HASS Research Data Commons so we can pilot group-based access control.</p>
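<p>Whichever identity provider sits behind it, the shape of the authorisation check is a mapping from licences to the groups whose members may see that data. A minimal sketch, with made-up licence identifiers and group names (in the demo such groups live in a GitHub organisation; under AAF/CILogon they would be federation groups):</p>

```python
# Hypothetical licence-to-group mapping, maintained by the data
# custodians in the external authorisation system.
LICENCE_GROUPS = {
    "licence-a": {"sydney-speaks-licence-a"},
    "licence-b": {"sydney-speaks-licence-b"},
}

def user_can_access(item_licence, user_groups):
    """True if any of the user's groups is authorised for the item's licence."""
    return bool(LICENCE_GROUPS.get(item_licence, set()) & set(user_groups))

print(user_can_access("licence-a", {"sydney-speaks-licence-a"}))  # True
```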
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/02/18/hass_rdc_tech_advisory/Slide30.png' alt='TODO
Scope the infrastructure we need to support this (need more clarity on what data we will have and where it will be housed)
Improve our testing for scale and implement Continuous Integration so we don’t break things with every new Corpus that comes on board
Pick our metadata terms we will probably build on the OLAC (Open Language Archives) vocabularies - but there are other options such as the CLARIN (Eu) vocabs
Integrate better with the Australian Text Analytics Platform ATAP - eg fire up a notebook from the search portal to operate on a collection of interest
' title='Slide: 30' border='1' width='85%%'/>
<p>There’s a lot still to do.</p>
</section>
Infrastructure for Multilingual Text Analysis2022-01-27T00:00:00+01:002022-01-27T00:00:00+01:00Simon Musgrave, Peter Seftontag:ptsefton.com,2022-01-27:/2022/01/27/DAMTA_Slides_v1/index.html<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide0.png' alt='Infrastructure for Multilingual Text Analysis
Simon Musgrave
Peter Sefton
Language Data Commons of Australia (LDaCA)
University of Queensland
' title='0' border='1' width='85%'/>
<p>This presentation was delivered by Simon Musgrave and Peter Sefton at an online event, Digital Approaches to Multilingual Text Analysis, on January 27th 2022.</p>
<blockquote>
<h3>About this event</h3>
<p>The use of DH tools and methods have been applied across a variety of corpora but text-analysis of English language sources has dominated …</p></blockquote></section><section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide0.png' alt='Infrastructure for Multilingual Text Analysis
Simon Musgrave
Peter Sefton
Language Data Commons of Australia (LDaCA)
University of Queensland
' title='0' border='1' width='85%'/>
<p>This presentation was delivered by Simon Musgrave and Peter Sefton at an online event, Digital Approaches to Multilingual Text Analysis, on January 27th 2022.</p>
<blockquote>
<h3>About this event</h3>
<p>The use of DH tools and methods have been applied across a variety of corpora but text-analysis of English language sources has dominated this field. These approaches are increasingly being used in languages and linguistics research for non-English corpora. At the same time, the integration of these tools has seen new research questions and possibilities emerge, including questions such as “Is there a non-Anglo digital humanities (DH), and if so, what are its characteristics” (Fiormonte 2016: 438). Recent studies have begun to examine aspects such as OCR for historical text analysis and data mining (Hill & Hengchen 2019; Goodman et al. 2018), multilingual computation analysis (Dombrowski 2020), semantic and sentiment analysis (Daems et al. 2019) and historical linguistics (Evans 2016), among others. The papers in this conference present a diverse range of projects and critiques of digital methods across different languages.</p>
<p>January 27th 1:45pm – 7:30pm AEDT</p>
<p>Convener: Joshua Brown Senior Lecturer and Convenor, Italian Studies, Australian National University and Katrina Grant Senior Lecturer, Centre for Digital Humanities Research, Australian National University</p>
</blockquote>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide1.png' alt='Why we need research infrastructure
Collecting data is time-consuming (expensive)
Making data reusable while respecting rights is very desirable
FAIR and CARE principles should guide us
Managing this at the level of individual projects is onerous
Even for small datasets
Separate infrastructure encourages best practices
FAIRer data
Wider availability of data management expertise
Better alignment with technology change
' title='1' border='1' width='85%'/>
<p>If we accept that sharing and reuse of data (consistent with ethical considerations) should be the default, managing even small amounts of data can be onerous. Having infrastructure which can take on this task relieves researchers of some of the burden and brings advantages: more reliable <a href="https://www.nature.com/articles/sdata201618">FAIR</a> compliance, access to data management experts, and responsiveness to changing technology (at least for the life of the infrastructure).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide2.png' alt='Introducing the Language Data Commons of Australia (LDaCA)
LDaCA will make nationally significant language data available for academic and non-academic use and provides a model for ensuring continued access with appropriate community control
LDaCA aims to provide access to materials which record language use in Australia
In some cases, LDaCA will provide federated access to existing collections
In other cases, LDaCA will be a repository
' title='2' border='1' width='85%'/>
<p>Regardless of where data is housed, access will be through one portal (external data may also be accessible by other routes). Access control will follow the <a href="https://www.gida-global.org/care">CARE Principles for Indigenous Data Governance</a>: original providers of data have moral rights which must be considered, and data owners/custodians will control lists of authorised users.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide3.png' alt='Multilingual material
Multilingual Australia (ABS data):
In 2016, there were over 300 separately identified languages spoken in Australian homes
More than one-fifth (21 per cent) of Australians spoke a language other than English at home
An infrastructure with the stated aims of LDaCA has to be able to handle data:
From multiple languages
With multiple writing systems
With multiple annotations (translations, phonetics, syntax etc)
In principle, Unicode encoding and suitable fonts should be capable of doing this
How does it work in practice?
' title='3' border='1' width='85%'/>
<p>'record language use in Australia' covers a huge range of possibilities. Current figures on slide, plus at least 250 Australian languages pre-European arrival, at least half no longer spoken but records remain (not for all). Unicode is not always without problems – how are we doing in meeting these goals so far? First, the architecture....</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide4.png' alt='LDaCA Architecture
' title='4' border='1' width='85%'/>
<p>The LDaCA technical architecture is based on the <a href="https://arkisto-platform.github.io/">Arkisto platform</a>, storing data in the <a href="https://arkisto-platform.github.io/standards/ocfl/">Oxford Common File Layout</a> (OCFL), with data objects such as linguistic items and collections described in detail using <a href="https://arkisto-platform.github.io/standards/ro-crate/">Research Object Crate</a> (RO-Crate). RO-Crate is a linked-data approach to describing data which is based on widely used standards for structural and descriptive properties such as dates and contributors, with extensions for language data being built on work in the <a href="http://www.language-archives.org/">Open Language Archives</a> (OLAC). RO-Crate is an international collaboration with diverse contributors; the specification is in English and most RO-Crates at this point have English metadata and contents, but there is demand for content in other languages and future versions of the spec will cover multilingual use cases.</p>
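<p>For readers unfamiliar with RO-Crate, the skeleton is small: a single <code>ro-crate-metadata.json</code> file holding a JSON-LD graph with a metadata descriptor and a root dataset. A minimal sketch built in Python, with invented names and files:</p>

```python
import json

# Minimal, illustrative RO-Crate 1.1 skeleton. The dataset name and
# file entries are invented for the example.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # the metadata descriptor, pointing at the root dataset
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # the root dataset: the collection itself
            "@id": "./",
            "@type": "Dataset",
            "name": "Example language collection",
            "datePublished": "2022-01-27",
            "hasPart": [{"@id": "interview-01.txt"}],
        },
        {"@id": "interview-01.txt", "@type": "File", "name": "Interview transcript"},
    ],
}

metadata_json = json.dumps(crate, indent=2)  # contents of ro-crate-metadata.json
```

<p>Language-specific description (OLAC-style properties, annotations, writing systems) is layered onto these entities as further linked-data properties.</p>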
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide5.png' alt='The demonstration material
All levels of government in Australia make documents available in multiple languages
Demonstration corpus uses documents from:
Services Australia
Department of Health (Victoria)
Languages:
Arabic
Farsi (Persian)
Turkish
Vietnamese
Chinese (simplified characters)
' title='5' border='1' width='85%'/>
<p>Simon pointed out here that the languages use three completely distinct writing systems (Turkish and Vietnamese use extended Roman scripts, and Farsi uses an Arabic-based script).</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/demo.gif' alt='' title='6' border='1' width='85%'/>
<p>This quick demonstration screencast shows a work-in-progress prototype of the LDaCA portal, which will give controlled access to language resources to those who are licensed to see them – in this demonstrator we have openly available multilingual Australian Government documents in PDF and text format and a small history dataset containing interviews with women from Western Sydney, <a href="http://omeka.uws.edu.au/farmstofreeways/">Farms to Freeways</a>. Eventually the LDaCA repository will contain a wide variety of data including speech, video, sign, images and digitized text, with a browse and search interface to allow researchers to find data they are interested in – provided, of course, that they have been granted an appropriate licence to view and use the data. In this demonstration our colleague Moises Sacal Bonequi performs searches in different languages to find repository objects of interest. Each object has multiple translations stored in separate files, in both PDF and text format.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide7.png' alt='Why bother?
What are the relevant data sources?
Lots more government documents
Serial publications:
Tim Sherratt lists 52 non-English sources in Trove
31 commenced publication before 1945
German prominent in C19
Substantial resources >1945 in Italian and Greek
Chinese publications always present
Use of LOTEs in Australia is under-researched, huge opportunities to collect data
' title='7' border='1' width='85%'/>
<p>Is there data in Australia which makes it worth worrying about this? Yes – at least two important sources of written material, plus this is an under-researched field with lots of questions to be answered and therefore lots of data to be collected. For example, there is research on differing usage in Vietnamese depending on speakers' time of arrival in Australia (1970s v. later), yet to be replicated with other similarly time-layered communities.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2022/01/27/DAMTA_Slides_v1/Slide8.png' alt='Acknowledgments
The Language Data Commons of Australia project received investment from the NCRIS-enabled Australian Research Data Commons (ARDC) through two of its programs:
Data Partnerships Program: Developing policy and technology foundations of a nationally integrated research infrastructure for language data collections of high strategic importance for the Australian research community.
HASS Research Data Commons and Indigenous Research Capability Program: Capitalising on existing infrastructure, securing vulnerable and dispersed collections and linking with improved analysis environments for new research outcomes.
Software developer: Moises Sacal Bonequi
' title='8' border='1' width='85%'/>
</section>
Towards a (technical architecture for a) HASS Research Data Commons for language and text analysis2021-10-12T00:00:00+02:002021-10-12T00:00:00+02:00Peter Seftontag:ptsefton.com,2021-10-12:/2021/10/12/ldaca2021/index.html<p>This is a presentation by Peter (Petie) Sefton and Moises Sacal, delivered at the online <a href="https://conference.eresearch.edu.au/2021-program/">eResearch Australasia Conference</a> on October 12th 2021.</p>
<p>The presentation was delivered by recorded video - this is a written version. Moises and I are both employed by the University of Queensland School of Languages and Culture.</p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide0.png' alt='Towards a (technical architecture for a) HASS Research Data Commons for language and text analysis
Peter Sefton & Moises Sacal
technical architecture for a
' title='0' border='1' width='85%'/>
<p>Here …</p></section><p>This is a presentation by Peter (Petie) Sefton and Moises Sacal, delivered at the online <a href="https://conference.eresearch.edu.au/2021-program/">eResearch Australasia Conference</a> on October 12th 2021.</p>
<p>The presentation was delivered by recorded video - this is a written version. Moises and I are both employed by the University of Queensland School of Languages and Culture.</p>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide0.png' alt='Towards a (technical architecture for a) HASS Research Data Commons for language and text analysis
Peter Sefton & Moises Sacal
technical architecture for a
' title='0' border='1' width='85%'/>
<p>Here is the abstract as submitted:</p>
<p>The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC).</p>
<p>The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities.</p>
<p>The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations.</p>
<p>In this presentation we will present the proposed architecture of the system, the principles that informed it and demonstrate the first version. Features of the solution include the use of the Arkisto Platform (presented at eResearch 2020), which leverages the Oxford Common File Layout. This enables storing complete version-controlled digital objects described using linked data with rich context via the Research Object Crate (RO-Crate) format. The solution features a distributed authorization model where the agency archiving data may be separate from that authorising access.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide1.png' alt='Project Team(alphabetical order)
Michael D’Silva
Marco Fahmi
Leah Gustafson
Michael Haugh
Cale Johnstone
Kathrin Kaiser
Sara King
Marco La Rosa
Mel Mistica
Simon Musgrave
Joel Nothman
Moises Sacal
Martin Schweinberger
PT Sefton
<p>With thanks for their contribution:
Partner Institutions:
' title='1' border='1' width='85%'/></p>
<p>This cluster of projects is led by Professor Michael Haugh of the School of Languages and Culture at the University of Queensland with several partner institutions.</p>
<p>I work on Gundungurra and Darug land in the Blue Mountains; Moises is on the land of the Gadigal people of the Eora Nation. We would like to acknowledge the traditional custodians of the lands on which we live and work, and the importance of Indigenous knowledge, culture and language to these projects.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide2.png' alt='The Language Data Commons of Australia (LDaCA) and Australian Text Analytics Platform (ATAP) projects received investment (https://doi.org/10.47486/DP768 and https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS).
' title='2' border='1' width='85%'/>
<p>This work is supported by the Australian Research Data Commons.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide3.png' alt='⛰️ 🔏
' title='3' border='1' width='85%'/>
<p>We are going to talk about the emerging architecture and focus in on one very important part of it: Access control. 🔏</p>
<p>But first, some background.⛰️</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide4.png' alt='The platform will:
Be sustainable, with a focus on data preservation as an overriding concern - data will not be ‘trapped’ in a particular platform and all data and code developed on the platform will be in a “migration free” layout ready for reuse
preserve interoperable and re-usable data via the use of common standards for describing and structuring data with useful detailed context and provenance
make data from ATAP and LDaCA and collections discoverable - with the caveat that harvesting harmonised metadata from existing corpora may be difficult
Provide workbench services for computational research - starting with code-notebooks but with the aim of building towards no-code environments and automatically re-runnable workflows
include clear licensing on all data and code on how data may be reused, informed by a legally sound policy framework, with an access-control framework to allow automated data access where possible (there are some external dependencies here)
be distributed - with data held by a number of different organizations under a variety of governance models and technologies (potentially including copies for redundancy or to put data close to compute and analytical services)
enable best-practice in research, with research products such as code and derived data available as “fully documented research objects” that are as re-runnable and rigorously described as possible
provide and be able to show value in enabling and measuring the impact of research
<p>' title='4' border='1' width='85%'/></p>
<p>The architecture for the Data Commons project is informed by a set of goals and principles, starting with ensuring that important data assets have the best chance of persisting into the future.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide5.png' alt='
<p>Repositories: institutional, domain or both</p>
<p>Find / Access services
Research Data Management Plan
Workspaces:</p>
<p>working storage
domain specific tools
domain specific services
collect
describe
analyse
Reusable, Interoperable
data objects
deposit early
deposit often
Findable, Accessible, Reusable data objects
reuse data objects
V1.1 © Marco La Rosa, Peter Sefton 2021 https://creativecommons.org/licenses/by-sa/4.0/</p>
<p>🗑️
Active cleanup processes workspaces considered ephemeral
🗑️
Policy based data management
' title='5' border='1' width='85%'/></p>
<p>The diagram which we developed with Marco La Rosa makes a distinction between managed repository storage and the places where work is done - “workspaces”. Workspaces are where researchers collect, analyse and describe data. Examples include the most basic of research IT services - file storage - as well as analytical tools such as Jupyter notebooks (the backbone of ATAP, the text analytics platform). Other examples of workspaces include code repositories such as GitHub or GitLab (a slightly different sense of the word repository), survey tools, electronic (lab) notebooks and bespoke code written for particular research programmes - these workspaces are essential research systems but usually are not set up for long-term management of data.
The cycle in the centre of this diagram shows an idealised research practice where data are collected and described and deposited into a repository frequently. Data are made findable and accessible as soon as possible and can be “re-collected” for use and re-use.</p>
<p>For data to be re-usable by humans and machines (such as ATAP notebook code that consumes datasets in a predictable way) it must be well described. The ATAP and LDaCA approach to this is to use the Research Object Crate (RO-Crate) specification. RO-Crate is essentially a guide to using a number of standards and standard approaches to describe both data and re-runnable software such as workflows or notebooks.</p>
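<p>As a sketch of what that looks like in practice, here is a minimal RO-Crate metadata file (<code>ro-crate-metadata.json</code>) for a hypothetical corpus item - the dataset name, date and licence URL are invented for illustration:</p>

```python
import json

# A minimal ro-crate-metadata.json for a hypothetical corpus item.
# The dataset name, date and licence URL are invented for illustration.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            # The metadata file descriptor, pointing at the root dataset
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # The root dataset itself, carrying an item-level licence
            "@id": "./",
            "@type": "Dataset",
            "name": "Example interview recordings",
            "datePublished": "2021-10-12",
            "license": {"@id": "https://example.org/licences/licence-a"},
        },
    ],
}

print(json.dumps(crate, indent=2))
```

<p>Because the crate is plain JSON-LD sitting alongside the data, nothing about it is tied to a particular repository product - which is what makes the "migration free" layout possible.</p>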
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide6.png' alt='
<p>' title='6' border='1' width='85%'/></p>
<p>In the context of the previous high-level map distinguishing workspaces and repository services, we are using the Arkisto Platform (introduced <a href="http://ptsefton.com/2020/11/23/Arkisto/index.html">at eResearch 2020</a>).</p>
<p>Arkisto is an approach to eResearch services that places the emphasis on ensuring the long-term preservation of data independently of code and services - recognizing the ephemeral nature of software.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide7.png' alt='' title='7' border='1' width='85%'/>
<p>An example of a corpus is the PARADISEC collection - Pacific and Regional Archive for Digital Sources in Endangered Cultures</p>
<p>PARADISEC has viewers for various content types: video and audio with time aligned transcriptions, image set viewers and document viewers (xml, pdf and microsoft formats). We are working on making these viewers available across Arkisto sites by having a standard set of hooks for adding viewer plugins to a site as needed.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide8.png' alt='Compute
<p>HPC
Cloud
Desktop</p>
<p>collect
describe
analyse
🗑️
Active cleanup processes workspaces considered ephemeral
… etc
ATAP Notebooks
Apps, Code, Workflows</p>
<p>Deposit /Publish
PARADISEC
Analytics Portal
Code discovery
Launch / Rerun
Data Discovery
Authenticated API</p>
<p>Workbench
Notebooks
Data import by URL
Export fully described pkg
Stretch goals:
Code gen / simple interfaces eg Discursis</p>
<p>BYOData 🥂
⚙️
STORAGE (including Cloudstor)
.
Data Curation
& description
Reuse
Licence Server
Identity Management
AAF / social media accounts</p>
<p>Data Cleaning
OCR / transcription format migration
Archive & Preservation Repositories: institutional, domain or both
AU Nat. Corpus
AusLan (sign)
Sydney Speaks
ATAP Corpus
Reference,Training & BYO
Workspaces:
working storage
domain specific tools
domain specific services
Harvested
external
Lang. portal(s)
Corpus discovery
Item discovery
Authenticated API
Create virtual corpora</p>
<p>' title='8' border='1' width='85%'/></p>
<p>This slide captures the overall high-level architecture - there will be an analytical workbench (left of the diagram) which is the basis of the Australian Text Analytics (ATAP) project - this will focus on notebook-style programming using one of the emerging Jupyter notebook platforms in that space. The exact platform is not 100% decided yet, but that has not stopped the team from starting to collect and develop notebooks that open up text analytics to new coders from the linguistics community. Our engagement lead, Dr Simon Musgrave sees the ATAP work as primarily an educational enterprise - which will be underpinned by services built on the Arkisto standards that allow for rigorous, re-runnable research.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide9.png' alt='Compute
<p>HPC
Cloud
Desktop</p>
<p>collect
describe
analyse
🗑️
Active cleanup processes workspaces considered ephemeral
… etc
ATAP Notebooks
Apps, Code, Workflows</p>
<p>Deposit /Publish
PARADISEC
Lang. portal(s)
Corpus discovery
Item discovery
Authenticated API
Create virtual corpora</p>
<p>Analytics Portal
Code discovery
Launch / Rerun
Data Discovery
Authenticated API</p>
<p>Workbench
Notebooks
Data import by URL
Export fully described pkg
Stretch goals:
Code gen / simple interfaces eg Discursis</p>
<p>BYOData 🥂
⚙️
STORAGE (including Cloudstor)
.
Data Curation
& description
Reuse
Licence Server
Identity Management
AAF / social media accounts</p>
<p>Data Cleaning
OCR / transcription format migration
Archive & Preservation Repositories: institutional, domain or both
AU Nat. Corpus
AusLan (sign)
Sydney Speaks
ATAP Corpus
Reference,Training & BYO
Workspaces:
working storage
domain specific tools
domain specific services
Harvested
external
Our demo today looks at this part …
' title='9' border='1' width='85%'/></p>
<p>Today we will look in detail at one important part of this architecture - access control. How can we make sure that in a distributed system, with multiple data repositories and registries residing with different data custodians, the right people have access to the right data?</p>
<p>I didn’t spell this out in the recorded conference presentation, but for data that resides in the repositories at the right of the diagram we want to encourage research processes that clearly separate data from code. Notebooks and other code workflows that use data will fetch a version-controlled reference copy from a repository (using an access key if needed), process the data, and produce results that are then deposited into an appropriate repository alongside the code itself. Given that a lot of the data in the language world is NOT available under open licences such as Creative Commons, it is important to establish this practice: each user of the data must negotiate or be granted access individually. Research can still be reproducible under this model, without a culture of sharing datasets with no regard for the rights of those involved in the creation of the data.</p>
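<p>A minimal sketch of what that fetch step might look like from a notebook - the repository URL, endpoint and token are invented, not the actual LDaCA API:</p>

```python
from urllib.request import Request

def dataset_request(dataset_url, access_token=None):
    """Build an HTTP request for a version-controlled reference copy of a
    dataset. The access token, if supplied, is the key granted to this
    user for this licence; open datasets need no token at all."""
    headers = {}
    if access_token:
        headers["Authorization"] = "Bearer " + access_token
    return Request(dataset_url, headers=headers)

# Hypothetical repository URL and token - not a real endpoint.
req = dataset_request(
    "https://repo.example.org/api/object/example-corpus-item",
    access_token="token-123",
)
```

<p>The point of the pattern is that the notebook records <em>which</em> reference copy it used, while the key it used to get there is negotiated per-user and never shared with the code.</p>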
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide10.png' alt='
<p>' title='10' border='1' width='85%'/></p>
<p>Regarding rights, our project is informed by the <a href="https://www.gida-global.org/care">CARE principles</a> for Indigenous data.</p>
<blockquote>
<p>The current movement toward open data and open science does not fully engage with Indigenous Peoples rights and interests. Existing principles within the open data movement (e.g. FAIR: findable, accessible, interoperable, reusable) primarily focus on characteristics of data that will facilitate increased data sharing among entities while ignoring power differentials and historical contexts. The emphasis on greater data sharing alone creates a tension for Indigenous Peoples who are also asserting greater control over the application and use of Indigenous data and Indigenous Knowledge for collective benefit</p>
</blockquote>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide11.png' alt='
<p>' title='11' border='1' width='85%'/></p>
<p><a href="https://localcontexts.org/labels/traditional-knowledge-labels/">https://localcontexts.org/labels/traditional-knowledge-labels/</a></p>
<p>We are designing the system so that it can work with diverse ways of expressing access rights - for example, licensing such as the Traditional Knowledge (TK) labels. The idea is to separate the safe storage of data, with a licence on each item (which may reference the TK labels), from a system administered by the data custodians, who can make decisions about who is allowed to access the data.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide12.png' alt='Case Study - Sydney Speaks
<p>' title='12' border='1' width='85%'/></p>
<p>We are working on a case-study with the <a href="http://www.dynamicsoflanguage.edu.au/sydney-speaks/">Sydney Speaks project</a> via steering committee member Professor Catherine Travis.</p>
<blockquote>
<p>This project seeks to document and explore Australian English, as spoken in Australia’s largest and most ethnically and linguistically diverse city – Sydney.
The title “Sydney Speaks” captures a key defining feature of the project: the data come from recorded conversations between Sydney siders, as they tell stories about their lives and experiences, their opinions and attitudes. This allows us to measure how their lived experiences impact their speech patterns.
Working within the framework of variationist sociolinguistics, we examine variation in phonetics, grammar and discourse, in an effort to answer questions of fundamental interest both to Australian English, and language variation and change more broadly, including:</p>
<ul>
<li>How has Australian English as spoken in Sydney changed over the past 100 years?</li>
<li>Has the change in the ethnic diversity over that time period (and in particular, over the past 40 years) had any impact on the way Australian English is spoken?</li>
<li>What affects the way variation and change spread through society
<ul>
<li>Who are the initiators and who are the leaders in change?</li>
<li>How do social networks function in a modern metropolis?</li>
<li>What social factors are relevant to Sydney speech today, and over time (gender? class? region? ethnic identity?)
A better understanding of what kind of variation exists in Australian English, and of how and why Australian English has changed over time can help society be more accepting of speech variation and even help address prejudices based on ways of speaking.
Source: <a href="http://www.dynamicsoflanguage.edu.au/sydney-speaks/">http://www.dynamicsoflanguage.edu.au/sydney-speaks/</a></li>
</ul>
</li>
</ul>
</blockquote>
<p>The collection contains both contemporary and historic recordings of people speaking.</p>
<p>Because this involves human participants there are restrictions on the distribution of the data - a situation we see in studies involving people across a huge range of disciplines.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide13.png' alt='Sydney Speaks Licenses
' title='13' border='1' width='85%'/>
<p>There are four tiers of data access we need to enforce and observe for this data based on the participant agreements and ethics arrangements under which the data were collected.</p>
<p>Concerns about rights and interests are important for any data involving people - and a large amount of the data we are using, both Indigenous and non-Indigenous, will require access control that ensures that data sharing is appropriate.</p>
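<p>To make the idea of tiered access concrete, here is a sketch of how licence tiers might map to groups of authorised users - the tier names and group names are invented, not the actual Sydney Speaks arrangements (in the demo below the groups are Github organisation teams):</p>

```python
# Hypothetical mapping of licence tiers to the groups that hold them.
# All tier and group names are invented for illustration.
LICENCE_GROUPS = {
    "licence-a": {"sydney-speaks-public"},
    "licence-b": {"sydney-speaks-researchers"},
    "licence-c": {"sydney-speaks-project-team"},
    "licence-d": {"sydney-speaks-data-custodians"},
}

def has_access(user_groups, item_licence):
    """True if any of the user's groups holds the licence on the item."""
    return bool(set(user_groups) & LICENCE_GROUPS.get(item_licence, set()))

print(has_access(["sydney-speaks-researchers"], "licence-b"))  # True
print(has_access(["sydney-speaks-public"], "licence-d"))       # False
```

<p>The repository only ever asks "does this user's group hold this item's licence?" - deciding who is <em>in</em> a group stays with the data custodians.</p>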
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide14.gif' alt='Demo
' title='14' border='1' width='85%'/>
<p>In this example demo we have uploaded various collections and are authorising against Github organisations.</p>
<p>In our production release we will use AAF to authorise different groups.</p>
<p>Let's find a dataset: the Sydney Speaks Corpus.</p>
<p>As you can see, we cannot see any data.</p>
<p>Let's log in… We authorise Github…</p>
<p>Now you can see we have access to sub-corpus data, and I am just opening a couple of items.</p>
<p>—</p>
<p>Now in Github we can see the group management example.</p>
<p>I have given myself access to all the licences, as you can see here, and given others access to licence A.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide15.png' alt='
<p>' title='15' border='1' width='85%'/></p>
<p>This diagram is a sketch of the interaction that took place in the demo - it shows how a repository can delegate authorization to an external system - in this case Github rather than CILogon. But we are working with the ARDC to set up a trial with the Australian Access Federation to allow CILogon access for the HASS Research Data Commons so we can pilot group-based access control.</p>
<p>NOTE: This diagram has been updated slightly from the version presented at the conference to make it clear that the lookup to find the licence for the data set is <em>internal</em> to the repository - the id is a DOI but it is not being resolved over the web.</p>
</section>
<section typeof='http://purl.org/ontology/bibo/Slide'>
<img src='https://ptsefton.com/2021/10/12/ldaca2021/Slide16.png' alt='
🚧
' title='16' border='1' width='85%'/>
<p>In this presentation, about work which is still very much under construction, we have:</p>
<ul>
<li>Shown an overview of a complete Data Commons Architecture</li>
<li>Previewed a distributed access-control mechanism which separates out the job of storing and delivering data from that of authorising access</li>
<li>We'll be back next year with more about how analytics and data repositories connect using structure and linked data.</li>
</ul>
</section>
FIIR Data Management; Findable Inaccessible Interoperable and Reusable?2021-06-11T00:00:00+02:002021-06-11T00:00:00+02:00ptseftontag:ptsefton.com,2021-06-11:/2021/06/11/faira/index.html<p>This is a work in progress post I'm looking for feedback on the substance - there's a comment box below, email me, or see me on twitter: @ptsefton.</p>
<p>[Update 2021-06-16: had some comments from Michael D'Silva at AARNet - have added a couple of things below.]</p>
<p>I am posting this now because I have joined a pair of related projects as a senior technical advisor, and we will have to look at access-authorization to data on both - licences will vary from open, to click-through agreements, to complex cultural restrictions such as <a href="https://publish.illinois.edu/commonsknowledge/2017/09/07/an-introduction-to-traditional-knowledge-labels-and-licenses/#:~:text=Traditional%20knowledge%2C%20or%20TK%2C%20labels,data%20management%20and%20presentation%20strategies.">TK Licenses</a>:</p>
<ol>
<li><a href="https://doi.org/10.47486/PL074">Australian Text Analytics Platform (ATAP)</a></li>
<li><a href="https://ardc.edu.au/news/a-national-language-data-commons-for-australia/">Language Data Commons for Australia (LDaCA)</a></li>
</ol>
<p><strong>Summary:</strong> Not all research data can be made openly available (for ethical, safety, privacy, commercial or other reasons) but a lot <em>can reasonably be sent over the web to trusted parties</em>. If we want to make it accessible (as per the "A" in the <a href="https://en.wikipedia.org/wiki/FAIR_data">FAIR data</a> principles) then at present each data repository/service has to handle its own access controls. In this post I argue that if we had a <em>Group Service</em> or <em>Licence Service</em> that allowed research teams to build their own groups and/or licences, then the service could issue Group Access Licence URLs. Other services - such as repositories in a trusted relationship with the <em>Group/Licence Service</em>, holding content whose digital licences carry such URLs - could then do a redirect dance (as with OAuth and other authentication protocols), sending users who request access to digital objects to the <em>Group/Licence Service</em>, which would authenticate them, check whether they have access rights, and let the repository know whether or not to give them access.</p>
<hr />
<p>In this post I will look at some missing infrastructure for doing <a href="https://en.wikipedia.org/wiki/FAIR_data">FAIR data</a> (Reminder: FAIR is Findable, Accessible, Interoperable, Reusable data) - and will cite the FAIR principles.</p>
<p>If a dataset can be released under an open licence then that's no problem, but if data is only available for reuse under special circumstances, to certain users, for certain purposes, then the research sector lacks general-purpose infrastructure to support this. Tech infrastructure aside, we do have a way of handling this legally: you specify these special conditions using a <em>licence</em>, as per the FAIR principles.</p>
<blockquote>
<p>R1.1. (Meta)data are released with a clear and accessible data usage licence</p>
</blockquote>
<p>The licence might say (in your local natural language) "Members of international research project XYX can access this dataset". Or "contact us for a specific licence (and we'll add you to a license-holder group if approved)". [Update 2021-06-16: or "this content is licensed to an individual ID".]</p>
<p>Now the dataset can be deposited in a repository, which will take care of <em>some of</em> the FAIR principles for you including the F-word stuff.</p>
<blockquote>
<h2>Findable</h2>
<p>The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services, so this is an essential component of the FAIRification process.</p>
<p>F1. (Meta)data are assigned a globally unique and persistent identifier</p>
<p>F2. Data are described with rich metadata (defined by R1 below)</p>
<p>F3. Metadata clearly and explicitly include the identifier of the data they describe</p>
<p>F4. (Meta)data are registered or indexed in a searchable resource</p>
</blockquote>
<p>Yes, you could somehow deal with all that with some bespoke software service, but the simplest solution is to use a repository - or, if there isn't one, work with infrastructure people to set one up; there are a number of software solutions that can help provide all the needed services. The repository will typically issue persistent identifiers for digital objects and serve up data using a standardised communication protocol (usually HTTP(S)).</p>
<blockquote>
<h2>Accessible</h2>
<p>Once the user finds the required data, they need to know how they can be accessed, possibly including authentication and authorisation.</p>
<p>A1. (Meta)data are retrievable by their identifier using a standardised communications protocol</p>
<p>A1.1 The protocol is open, free, and universally implementable</p>
<p>A1.2 The protocol allows for an authentication and authorization procedure, where necessary</p>
<p>A2. Metadata are accessible, even when the data are no longer available</p>
</blockquote>
<p>But repository software cannot be trusted to understand licence text, and thus cannot work out who to make non-open data available <em>to</em> - so what will (usually) happen is that it makes the non-open data available only to the depositor and administrators. The default is to make it <em>Inaccessible</em> via what repository people call "mediated access" - i.e. you have to contact someone to ask for access and then they have to figure out how to get the data to you.</p>
<p>At the Australian Data Archive they have the "request access" part automated:</p>
<blockquote>
<h2>4 - DOWNLOADING DATA</h2>
<p>To download open access data and documentation, click on the “Download” button next to the file you are interested in. Much of the data in the ADA collection has controlled access, denoted by a red lock icon next to the file. Files with controlled access require you to request access to the data, by clicking on the “Request Access” button.
<a href="https://ada.edu.au/accessing-data/">https://ada.edu.au/accessing-data/</a></p>
</blockquote>
<p>In some cases the repository itself will have some kind of built in access control using groups, or licences or some-such. For example, the <a href="https://alveo.edu.au/">Alveo</a> virtual lab funded by NeCTAR in Australia, on which I worked, has a local licence checker, as each collection has a licence. Some licences just require a click-through agreement, others are associated with lists of users who have paid money, or are blessed by a group-owner.</p>
<p>I'm not citing Alveo as a much-used or successful service - it was not, overall, a great success in terms of uptake - but I think it has a good data-licence architecture: there is a licence component that is separate from the rest of the system. The licence checking sits in front of the "Findability" part of the data and the API - not much of that data is available without at least some kind of licence that users have to agree to.</p>
<img src="https://ptsefton.com/2021/06/11/faira/plantuml_Alveo.png" alt="Diagram">
<p>This pattern makes a clear separation between the licence as an abstract, identifiable thing, and a service to keep track of who holds the licence.</p>
<p>Question is, could we do something like this at national or global scale?</p>
<p>We are part of the way there - we can authenticate users in a number of ways, eg by the Australian Access Federation (<a href="https://aaf.edu.au/">AAF</a>) and equivalents around the world, and there are protocols that allow a service to authenticate using Google, Facebook, Github et al. These all rely on variants of a pattern where a user of service <code>A</code> is redirected to an authentication service <code>B</code> where they put in their password or a one-time key, and whatever other mechanism the IT department deem necessary, and then are redirected back to service <code>A</code> with an assurance from <code>B</code> that this person is who they say they are.</p>
<p>What we don't have (as far as I'm aware) is a general-purpose protocol for checking whether someone holds a licence. A repository could redirect a web user to a Group Licence Server; the user could transact with the licence service, authenticating themselves (in whatever way that licence service supports), and the licence service could check its internal lists of who holds which licence and return the result. If the licence is just a click-through then the user could do the clicking - or request access, or pay money, or whatever is required.</p>
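<p>The first step of that redirect dance might be sketched like this - remembering that this protocol does not exist yet, so the endpoint and parameter names are entirely invented:</p>

```python
from urllib.parse import urlencode

def licence_check_redirect(licence_server, licence_url, return_url):
    """Build the URL a repository would redirect a user to, so that a
    Group/Licence Server can check whether they hold the licence.
    A sketch of a protocol that does not exist yet - the "/check"
    endpoint and both parameter names are invented."""
    query = urlencode({"licence": licence_url, "return_to": return_url})
    return f"{licence_server}/check?{query}"

redirect = licence_check_redirect(
    "https://licences.example.org",
    "https://licences.example.org/licence/xyz-project",
    "https://repo.example.org/object/123/continue",
)
```

<p>After authenticating the user and checking its lists, the licence server would redirect back to <code>return_to</code> with a signed yes/no assertion, much as OAuth providers do.</p>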
<p>[Update: 2021-06-16 This class of service would also be useful for provisioning access to things other than data - such as compute or other workspace resources. Making this a standard protocol means that these services could be offered by different organizations, yes we want national ones but for some kinds of sensitive data a community might want to run and control their own.]</p>
<p>(We are aware of the work on FAIR Digital Objects and the <a href="https://fairdo.org/">FDO Forum</a> - it does say there that:</p>
<blockquote>
<p>FAIR Digital Objects (FDO) provide a conceptual and implementation framework to develop scalable cross-disciplinary capabilities, deal with the increasing data volumes and their inherent complexity, build tools that help to increase trust in data, create mechanisms to efficiently operate in the domain of scientific assertions, and promote data interoperability.</p>
</blockquote>
<p>Colleagues and I have started discussions with the folks there.)</p>
<p>Those of us who were around higher-ed-tech in the '00s in Australia will remember <a href="https://slideplayer.com/slide/1514611/">MAMS</a> - the Meta Access Management System. The leader, James Dalziel, was at all the eResearch-ish conferences talking about a shared federation that would allow you to log in to other people's systems (we got that - it's the aforementioned AAF), with fantastic user stories about being able to log into a data repository and then, by virtue of the fact that you're a female anthropologist, gain access to some cultural resources (we didn't get that bit). I remember <a href="https://theconversation.com/profiles/kent-fitch-137926">Kent Fitch</a>, then from the National Library, one of the team that built the national treasure <a href="https://trove.nla.gov.au/">Trove 😀</a>, bursting that particular bubble over beers after one such talk. He asked: how do you identify an anthropologist? Answer: a university authentication system certainly can't.</p>
<p>I realised a long, long time later that while you can't identify the anthropologists, or tell 'em apart from the ethnographers or ethnomusicologists etc, they <em>can</em> and do make their own groups, via research projects, collaborations and scholarly societies. You <em>could</em> have a group that lists the members of a scholarly society and use that for certain kinds of access control, and you could, of course, let the researchers self-select the people they want to share with - let <em>them</em> set up <em>their</em> groups.</p>
<p>What if we had a class of stand-alone service where anyone could set up a group and add users to it? A project lead could decide on what is an acceptable way to authenticate - via academic federations like AAF, via ORCID, or via public services like Github or Facebook - and then add a list of users via email addresses or other IDs. And what if there was a way to auto-populate that group by linking through to <a href="https://osf.io/">OSF</a> groups, Github organisations, Slack etc (all of which use different APIs, and none of which, as far as I know, know about licences in this sense)? This would be useful for groups of researchers who need access to storage, compute, and yes, datasets with particular licence provisions. There could be free-to-use group access for individuals and paid services for orgs like learned societies, who could use the list to make deals with infrastructure providers, for example. And there need not be only one of these services - they'd work well at a national level, I think, but could be more granular or discipline-based.</p>
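<p>The data model for such a group service could be very small - a sketch, with all names and the set of authentication methods invented:</p>

```python
from dataclasses import dataclass, field

# Sketch of what a stand-alone group service might keep: anyone can
# create a group, choose which authentication methods they will accept,
# and list members by whatever IDs those methods provide.
@dataclass
class Group:
    name: str
    owner: str
    # Invented method identifiers - e.g. academic federation, ORCID, Github
    auth_methods: set = field(default_factory=lambda: {"aaf", "orcid", "github"})
    members: set = field(default_factory=set)

    def add_member(self, member_id):
        self.members.add(member_id)

society = Group(name="Hypothetical Scholarly Society",
                owner="mailto:lead@example.org")
society.add_member("https://orcid.org/0000-0000-0000-0001")
```

<p>The interesting work is not this data structure but the governance around it: who may create groups, how membership is vouched for, and which licences a group is allowed to satisfy.</p>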
<p>(Does such a thing already exist? Did I miss it? Let me know in the comments below or on twitter - I'm @ptsefton)</p>
<p>We could do this something like the way modern authentication services work, with a simple hand-off of a user to an authentication service - but with the addition of a licence URL - to a service that says: yep, I vouch for this person, they have a licence to see the data.</p>
<img src="https://ptsefton.com/2021/06/11/faira/plantuml_FAIRA.png" alt="Diagram">
<p>The above interaction diagram is purely a fantasy of mine. I'm not an Internet Engineer - so I have probably made some horrible errors, please let me know.</p>
<p>Obviously this requires a trust-framework; repositories would have to trust the licence servers and vice-versa and these relationships would have to be time-limited and renewable. You wouldn't want to trust a service for longer than their domain registration for example in case someone else you don't trust buys the domain, that kind of thing. And you'd want some public key stuff happening so that transactions are signed (a further mitigation against domain squatters - they would presumably not have your private key).</p>
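<p>The signing part of that trust relationship might be sketched as follows - using a shared-secret HMAC for brevity where, as noted above, a real deployment would want public-key signatures; the field names and TTL are invented:</p>

```python
import hashlib
import hmac
import json
import time

# Sketch of the signed assertion a licence server might hand back to a
# repository: "this user holds this licence, valid until <expires>".
# The shared-secret scheme and field names are invented; real systems
# would more likely use public-key signatures (e.g. JWTs).
SHARED_SECRET = b"repo-and-licence-server-shared-secret"

def sign_grant(user_id, licence_url, ttl_seconds=300):
    payload = json.dumps(
        {"user": user_id, "licence": licence_url,
         "expires": int(time.time()) + ttl_seconds},
        sort_keys=True,
    )
    sig = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload, sig

def verify_grant(payload, sig):
    expected = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    return json.loads(payload)["expires"] > time.time()

payload, sig = sign_grant("https://orcid.org/0000-0000-0000-0000",
                          "https://licences.example.org/licence/xyz-project")
```

<p>Short expiry times are what make the time-limited, renewable trust described above enforceable in practice - a stale or forged assertion simply fails verification.</p>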
<p>And this is not an overly complicated application we're talking about - all access-controlled APIs already have to do some of this locally. It's the governance - the trust federations - that will take time and significant resources (so let's start now :-).</p>
<p>And while we're on the subject of trust - this scheme would work in the same way most research does: with trust in the people working on the projects, who typically have access to data and are trusted to keep it safe. Being a member of a project team in health research, for example, often involves joining a host organization as an honorary staff member, and being subject to all its policies and procedures. Some research groups have high levels of governance - people are identified using things like nursing registrations and other certifications; some are ad-hoc collections of people identified by a PI using any old email address.</p>
<p>NOTE: for data that needs to be kept really, really secure - data that can never even be put on a hard drive and left on a bus - this proposal is <em>not</em> the scheme. That's where you'd be looking at a Secure eResearch Platform (SeRP), where the data lives in a walled garden and can be inspected only via a locked-down terminal application - or, even stricter, you might only have secure on-site access to data that's air-gapped from any network.</p>
<p>Here's a sketch of some infrastructure. Essentially this is what happened inside Alveo - the question is can it be distributed so repositories can be de-coupled from authorization services?</p>
<img src="https://ptsefton.com/2021/06/11/faira/plantuml_ARCH.png" alt="Diagram">