Infrastructure and what do we really want for DMPs

2021-05-20

[Updated 2021-05-20 after Gail McGlinn and I got home from the pub and she read this through; fixed several terrible typos and the odd incomplete sentence - I knew I should not have let the dog proofread the version we put out this afternoon (he just wanted to go for a W A L K I am beginning to think he didn't read it at all)]

[Updated 2021-05-23 further minor punctuation changes]

I was invited to speak on a panel at an RDA (Research Data Alliance) webinar, 'Infrastructure and Data Management Plans (DMPs)', on May 18th.

Here are my slides for my five minutes of RDA-fame - with some notes about what I said, some things I forgot to say and some things I learned.

Liz Stokes from ARDC asked me:

"Could I suggest that you focus on what people should expect from DMP infrastructure and talk about how infrastructure can encourage good data management practices?"

Me being me, I took the "what people should expect" as an opportunity to reflect not just on current infrastructure but also on what's needed for the future; where the gaps are.

Preamble

I found this session very useful. Recently I have been working on the A-is-for-accessible in FAIR, looking at how we provide access to non-open research data that requires a specific licence agreement for access and use, usually for privacy, ethical or commercial reasons. The simplest approach to this that I can think of is to create data licences, with human-readable licence terms, that apply to a certain group of people, and to let the data custodian of a particular non-open dataset decide how to manage these groups of people.

Does the group consist of:

A named list of project collaborators?
A click-through licence where we identify a person and then keep track of the fact they clicked it?
A commercial arrangement like a subscription where downloads are available for a certain time?

A repository would not need to worry about all of that if we had a mechanism to pass a licence URL to a licence service and let it sort it out - authenticate the user, then authorize data access or not based on their membership of a group.
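To make the hand-off concrete, here is a minimal sketch in Python of what a repository might ask a licence service. Everything in it is an assumption for illustration: the `LicenceService` class, the `repository_download` function and the `https://example.org/...` licence URL are hypothetical names, not part of any existing system, and a real service would sit behind proper authentication rather than trusting a user ID passed in as a string.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class LicenceService:
    """Hypothetical service: maps a licence URL to the group of people who hold it."""

    groups: Dict[str, Set[str]] = field(default_factory=dict)

    def grant(self, licence_url: str, user_id: str) -> None:
        # How someone gets into the group is up to the data custodian:
        # a named collaborator list, a click-through, a subscription, etc.
        self.groups.setdefault(licence_url, set()).add(user_id)

    def is_authorized(self, licence_url: str, user_id: Optional[str]) -> bool:
        # A user who has not been authenticated (None) is never authorized.
        if user_id is None:
            return False
        return user_id in self.groups.get(licence_url, set())


def repository_download(
    service: LicenceService, licence_url: str, user_id: Optional[str]
) -> str:
    """The repository only knows the licence URL; the service decides access."""
    if service.is_authorized(licence_url, user_id):
        return "200 OK"
    return "403 Forbidden"


if __name__ == "__main__":
    service = LicenceService()
    licence = "https://example.org/licences/project-x"  # hypothetical licence URL
    service.grant(licence, "https://orcid.org/0000-0001-2345-6789")
    print(repository_download(service, licence, "https://orcid.org/0000-0001-2345-6789"))
    print(repository_download(service, licence, None))
```

The point of the design is that the repository never needs to know which kind of group it is dealing with; it passes along the licence URL and gets back a yes or a no.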

Before I spoke, John Chodacki from the California Digital Library, standing in for Maria Praetzellis, talked about some work they're doing with field research stations and data management plans, using a Data Management Planning tool called DMPTool. One of Maria's slides showed a graph of how a data management plan links to publications, organisations, funders, datasets and, you guessed it, people, identified with standard researcher IDs (ORCIDs). Given that applications like DMPTool and our home-grown ReDBox Data Management Platform are keeping track of both researcher (and potentially research participant) cohorts and data deposits, these applications might ALSO be able to do the authentication and authorization steps for other services which provision things: repositories that provision data and workspaces that provision services.

The final slide I had prepared was about the need for cohort or group management services so we can give groups of people a licence to access certain data. I have been sketching out potential software architectures for a couple of client projects that deal with this at internet-scale via standard ways of treating licences. It is worth exploring how that group management could be developed in Data Management (Planning) applications.

The talk

I would like to acknowledge and pay my respects to the Gundungurra and Darug people who are the traditional custodians of the land on which I live and work.

RDMPs (Research Data Management Plans) need to be useful. There are a couple of dimensions to this. Without using the dreaded "carrot and stick" metaphor, they can be useful in that they help you comply with organizational and funder requirements, but they could also be useful as a focal point for getting access to services.

[Slide: "Data Lifecycle Project: All components", a diagram of possible project components. Labels: provisioning, researchers, projects, grants db, PAP portal, metadata db, ingest, share, dropbox-like storage, National Storage Resources, process (inc NCI, Pawsey, local HPC, etc), portals 1 to 3, repos 1 to 3, Research Data Australia. The numbered steps on the diagram are:]

1. Researchers access grants database which indicates which grants they ‘own’ or have access to. This “surrounding” metadata is registered with the metadata db.
2. Space marked with this metadata is provisioned on dropbox-like storage which is visible to the NeCTAR cloud. This space should belong to a project, not a person.
3. Automated and manual ingest processes feed data to this store, harvesting additional metadata where possible and relevant.
4. Provisioned space should be as dropbox-like as possible.
5. Storage is immediately visible to NeCTAR cloud and processes developed to ship data to local HPC or peak facilities using existing high-speed networks and tools.
6. Once project is complete, the data is packaged and shipped to and indexed by the relevant domain repository as well as registered with the RDA index.

This is a slide sent to me by Ian Duncan from the ARDC “Data Lifecycle” project, circa 2016. Note the box top left labelled “Provisioning” - this was an exciting vision which we picked up and implemented at UTS. We added provisioning services to the open source ReDBox research data management platform that we heard about from Andrew Brazzati, who talked just before me; these are now available to the whole community.

The Data Lifecycle project “morphed into” (Ian’s words) the Research Activity Identifier Service offered by ARDC. This addresses part, but not all, of the Data Lifecycle vision: being able to identify activities is OK, but what we really need is a way for cohorts of researchers, their support staff and other participants to be able to access resources. I’ll come back to this point.

I didn’t include this in my presentation - but two of the other presenters did! It was drawn by Gerrad Barthelot at UTS following an architecture workshop with the eResearch team - building on the idea of Provisioning as the core useful service - which we learned about partly from the Data Lifecycle project.

[Slide: a diagram of research data management infrastructure. Labels: Research Data Management Plan; Workspaces: working storage, domain specific tools, domain specific services (collect, describe, analyse); Repositories: institutional, domain or both; Find / Access services; "deposit early, deposit often" (Reusable, Interoperable data objects); "reuse data objects" (Findable, Accessible, Reusable data objects); 🗑️ Active cleanup processes, workspaces considered ephemeral; 🗑️ Policy based data management. V1.1 © Marco La Rosa, Peter Sefton 2021 https://creativecommons.org/licenses/by-sa/4.0/]

Speaking of Research Data Lifecycles - recently I blogged a bit of a rant about the misuse of the term "lifecycle" and called out some issues with some of the ways research data cycles are often depicted as project-based cycles.

This is a diagram that Marco La Rosa and I have been developing that tries to capture the infrastructure needed for Research Data Management.

One of the key points of this diagram is that the cyclical part of doing FAIR data management should be as rapid as possible, not something to be thought about only at the end of the project. We need to start thinking of "workspaces" as ephemeral places to collect/create, analyse and describe data and push that data into managed repository storage ASAP.

If you accept this premise, then I think the thing you have to do as a data manager or custodian is work out where all these pieces are - do you have a DMP system yet? Is there a data repository at your institution? There's no point in having data policies or running training unless these bits of infrastructure are in place. And they often aren't.

I am an independent agent now, since the position of eResearch Support Manager was considered to be redundant by UTS. At least for now I am in a position to work on what I want to work on; for me that means looking for gaps - things that need doing in the research sector. The next couple of slides are not so much about what you can expect to get from infrastructure right now, as about what infrastructure we need to demand, and to ask institutions and projects like the Australian Research Data Commons to support.

[Slide: RO-Crate. A folder of local data (📂) and addressable resources (☁️), described by metadata that answers these questions:]

📄 ID? Title? Description?
👩‍🔬👨🏿‍🔬 Who created this data?
📄 What parts does it have?
📅 When?
🗒️ What is it about?
♻️ How can it be reused?
🏗️ As part of which project?
💰 Who funded it?
⚒️ How was it made?

[Example addressable resources on the slide: 👩🏿‍🔬 https://orcid.org/0000-0001-2345-6789 and 🔬 https://en.wikipedia.org/wiki/Scanning_electron_microscope]

One area I'm putting time into is standardization for data interchange, particularly standardizing ways of turning piles of amorphous data into well-described, packaged Digital Objects that can be Reused and moved between Interoperable systems (the I and R in FAIR) via the RO-Crate spec.

RO-Crate is a method for describing a dataset as a (FAIR) Digital Object using a single linked-data metadata document that can be put in a zip file or download. This is a key enabler for infrastructure that allows for Interop; the I in FAIR.
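For readers who have not seen one, here is a rough sketch in Python of that single metadata document, following the general shape of an RO-Crate 1.1 `ro-crate-metadata.json` file: a metadata descriptor, a root Dataset, and the entities it points to. The `build_ro_crate` function, the dataset name, the person's name and the file path are made up for illustration; the ORCID is the placeholder one from the slide. This is a hand-rolled sketch using only the standard library, not a replacement for the RO-Crate tooling.

```python
import json
from typing import Dict, List


def build_ro_crate(
    name: str,
    description: str,
    date_published: str,
    licence_url: str,
    author_orcid: str,
    author_name: str,
    files: List[str],
) -> Dict:
    """Build a minimal RO-Crate 1.1 style metadata document as a dict."""
    # The metadata descriptor says which entity in the graph is the root dataset.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    # The root dataset answers: what is it, who made it, how can it be reused?
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "datePublished": date_published,
        "license": {"@id": licence_url},
        "author": {"@id": author_orcid},
        "hasPart": [{"@id": path} for path in files],
    }
    person = {"@id": author_orcid, "@type": "Person", "name": author_name}
    file_entities = [{"@id": path, "@type": "File"} for path in files]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, person] + file_entities,
    }


if __name__ == "__main__":
    crate = build_ro_crate(
        name="Example microscope images",  # hypothetical dataset
        description="An illustrative dataset, not a real one.",
        date_published="2021-05-20",
        licence_url="https://creativecommons.org/licenses/by-sa/4.0/",
        author_orcid="https://orcid.org/0000-0001-2345-6789",
        author_name="Example Researcher",  # hypothetical person
        files=["images/sample-01.tif"],  # hypothetical file path
    )
    # In a real crate this JSON would be saved as ro-crate-metadata.json
    # in the top-level folder alongside the data files.
    print(json.dumps(crate, indent=2))
```

Because people, licences and instruments are referenced by URL, the same document can travel with the data in a zip file and still make sense to a repository on the other side.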

RO-Crate is seeing strong uptake in some communities and is being built in to systems such as Workflow Hub, PARADISEC and a new project I'm going to be involved in for Humanities and Social Science Data in Australia - more on that soon. It is also being taken up via the European Open Science Cloud (see the next slide).

As an example, something Marco and I have been working on (via connections with AARNet), the CS3MESH4EOSC project, is “Democratising FAIR” by helping to link the pink “workspaces” box with the green “repositories” box in our RDM diagram.

[Slide: a number of smiley faces representing people grouped into three overlapping cohorts.]

We don’t have a good way to define groups of people across institutional boundaries, which is an essential part of provisioning services to research projects and provisioning non-open data.

As noted above, Research Data Management (Planning) tools could be one part of the infrastructure needed to identify data licences and who holds each licence, meaning that repositories that know the licence for data can hand off authorization and authentication to the DMP system. John Chodacki encouraged me to feed these ideas and requirements into the DMPTool process; let's consider this post the first step in that.

Steven McEachern from the Australian Data Archive and I also promised to follow up on the issue of data access/licensing.

