Trip Report (with bonus opinions) - Open Repositories 2018, Bozeman Montana, USA

2018-07-10

I (Peter Sefton) recently attended OR2018, the Open Repositories conference from June 4-7, 2018 in Bozeman Montana.

This post is being posted on the UTS eResearch site and on my site.

My trip was funded by the University of Technology Sydney (UTS).

Mission

Gavin Kennedy from QCIF was also in attendance, and we were on something of a mission - to promote and get feedback on the recent work we've been doing on the ReDBox research data management platform. We ran a ReDBOX intro workshop on the Monday of the conference. Gavin and I also presented a general introduction to ReDBOX and the provisioner, and I went into more detail about the DataCrate standard for shipping and showing-off research data that I have been leading, with help from a growing community of supporters. I also did a presentation in the technical session which included a live demo of using ReDBox to ingest DataCrates - showing how it could 'sniff out' metadata from DataCrate packages.

Research Data Management

Open Repositories now has enough going on about Research Data Management that I was able to spend most in those sessions when I wasn't meeting with people directly.

I heard people talk about a few things that helped confirm some of our design choices at UTS, and a few things to challenge my world view as well.

Vladimir Bubalo from Macquarie, Gavin and I chased up some more details about the Dataverse repository software.There is some interest from Macquarie in what's available in the way of open source research data repositories.

As attested by the session Reaching out with Data: Dataverse Creating a Global Community Dataverse looks like it's a thriving product and would be a good integration target for ReDBOX, it powers the Australian Data Archive for one thing and can be used as an institutional data repository. One problem they're still grappling with is large file support, which is the same issue with any repository software when you try to put large volumes of, or large numbers of data streams through the API or web interface.

There was a really interesting talk from the UK Jisc Research Data Shared Services (RDSS) project (on which I have done some very part time consulting via Artefactual in Canada) about how they failed to get a Samvera repository working as part of their offering.

The have published their report.

It seems like the project to adapt Samvera failed not because of large files or large volumes of files, although that problem may have come up later, as we reported at OR2014, Hydra, which was rebranded Samvera had severe performance problems on the Alveo virtual lab related to processing number of files in a single transaction. Two issues Jisc called out in their report are one, a failure to implement their complex domain model in Samvera. They say:

In order to capture and store the range of metadata required by the RDSS CDM the internal storage model of the work type within RDSS Samvera needed to map closely to the CDM’s conceptual data model. However, it did not prove to be straightforward to translate the conceptual data model of the RDSS CDM, which leverages entity-relationships, to a programming/storage model that is largely intended to be flat.

And two, they had problems implementing a message passing model.

In order to avoid the kind of problems reported by Jisc and to deal with the reality that many researchers just want to gulp-down files, or object-store blobs, or database tables and most emphatically do not want to sip data through a tiny API-straw we are building research data management systems that are distributed:

Avoiding complex customisation of applications wherever possible and instead linking Research Data Management Plans (RDMPs) to workspaces in research software such as lab notebooks or git repositories, and archival data stores using standard linked data techniques.
Not using message-queues that might block progress on workloads (lesson learned from ReDBOX 1 development and its ill-conceived "curation" workflows). We plan to run "bots", stand alone agents instead which generate reports or make links or auto-provision spaces.
Keeping archival data, and data that needs rapid access in large scalable file storage systems, described and linked to RDMPs, data descriptions, and data publication records. For us, 'the repository' is a collection of services which includes the central index and a variety of stores, rather than a single application: It's a lifestyle; it's not just for Christmas; and most importantly it's a governance thing.

Given the above one of the things I was keen to do at OR2018 was to find out more about the Oxford Common Filesystem Layout (OCLF) which is being driven by people from Oxford, Emory, Stanford, Cornell and Duraspace - it's about how you organise digital assets in the kind of deconstructed architecture I described above for UTS. The files are kept on a file system (yes cloud-first people, that might actually be backed by an object-store) so that you can run services against them: index them for discovery, check their integrity, generate dissemination versions for distribution, report on items due for disposal and so on.

I couldn't get to the OCFL presentation but lead author Neil Jeffries from Oxford talked me through the emerging standard and the process being used to develop it. Neil assured me we can go ahead and implement against the draft spec (and challenged me with the statement that at Oxford they don't like to talk about data vs metadata, it's all data. I'm still thinking that over Neil, but I think I still believe in metadata).

This is from the (still rather sparse) draft dated 2018-06-22, a couple of weeks after the conference:

A general observation is that the contents of a digital repository -- that is, the digital files and metadata that an institution might wish to manage -- are largely stable. Once content has been accessioned, it is unlikely to change significantly over its lifetime. This is in contrast to the software applications that manage these contents, which are ephemeral, requiring constant updating and replacement. Thus, transitions between application-specific methods of file management to support software upgrades and replacement cycles can be seen as unnecessary and risky change, changing the long-term stable objects to support the short-term, ephemeral software.

By providing a specification for the file-and-folder layout on disk, the OCFL is an attempt at reducing, or even eliminating, the need for these transitions. As an application-independent specification, conforming applications will natively 'understand' the underlying file structure without needing to first transition these contents to their own format.

https://ocfl.io/

From my notes of the conversation with Neil:

Multiple things accessing the file system writing changes to LOGS. Eg fixity check, or create new.

There is no state. File system is the state. Digital pres services are workers / crawler or message queue driven.

The incomplete spec is online - it's not enough to do a complete implementation yet, but we will track it.

The host university, Montana State had a nice take on this too. Sara Mannheimer, Jason A. Clark, James Espeland presented A Prototype for the Institutional Research Data Index

Most out-of-the-box institutional repository systems don’t provide the workflows and metadata features required for research data. Consequently, many libraries now support two institutional repository systems—one for publications, and one for research data—even when there are nearly a thousand data repositories in the United States, many of which provide services and policies that ensure their trustworthiness and suitability for institutional research data. Libraries are either increasing spending by purchasing data repository solutions from vendors, or replicating work by building, customizing, and managing individual instances of data repository software. This presentation suggests a potential solution to this issue: a prototype for an open source Institutional Research Data Index (IRDI) that promotes discovery and reuse of institutional datasets through automatic metadata harvesting and search engine optimization. IRDI could lead to a single, unified index for academic institutional research data. A unified data index would lead to three key impacts: increasing discovery, reuse, and citation of open research data; reinforcing the idea that research data is a legitimate scholarly product; and promoting community-wide systems that require less resource expenditure.

They noted that getting a research data repository up and running is hard and expensive.

Their solution:

Deposit data elsewhere in discipline repositories
Have a local IRDI - Institutional Research Data Index

This is similar to the ReDBox approach in that it is highly distributed and it contains an IRDI. The Montana team have soon-to-be-released code that helps find data that's residing out there on the web, eg in Figshare, which we need to look in to.

Another thing we should explore ReDBOX is more about how data is stored and secured. We are working with a product, the DELL/EMC Isilon which has a lot of features in this area, but I plan to look at the Edinburgh & Manchester DataVault project as well Sustaining the momentum, moving the DataVault project to a service Claire Knowles presented with with Mary McDerby, Robin Rice, Thomas Higgins

DataVault does encrypted multi-site storage. 3 copies. One on site, one outer Edinburgh on UK cloud. They use chunking to reduce risk around encryption - lose less of a file if there's a problem.

Some highlights

An insight from Esme Cowles in Valkyrie: Reimagining the Samvera Community which is looking at adding swappable back ends to the Samvera platform (so you can use something other than Fedora to store your stuff).

Lesson from Islandora - don't fight your host platform.

Speaking of Islandora, in Relational Databases as Repository Objects Alexander Garnett showed off a plugin that gives live access to a SQL database in your Islandora repository; it spins up a Docker container on demand. I asked Alex on The Twitter if he'd seen Datasette but he said that SQLite does not scale well in his experience.

Thomas Morrell from Caltech mentioned a few interesting things in Positioning a repository as campus research infrastructure:

Live visualization from a data reposioty
Demoed fetching data from a live repository using a Jupyter notebook.
Harvesting metadata out of git repositories using the codemeta standard. Codemeta metadata is very close in design to our DataCrate. this could be straight out of DataCrate:

    {"author":
      {
      "@id":"http://orcid.org/0000-0003-0077-4738",
      "@type":"Person",
      "email":"slaughter@nceas.ucsb.edu",
      "givenName":"Peter",
          "familyName: "Slaughter"
      }
    }

Joshua A. Westgard from the University of Maryland Libraries presented a Python Library for the Fedora API, which would have been of interest to us at UTS if we'd gone ahead with our planned Fedora 4 design for data management instead of joining forces with QCIF in the new MongoDB based ReDBox.

Chris Diaz from Northwestern University talked about making Static websites over the top of a repo in Jekyll and Institutional Repositories, looks like a great, sustainable alternative to Omeka exhibitions worth considering in some situations.

And the winners are...

I thought I'd pick out a few of the best quotes and one-liners from my notes.

Best quote: Robin Dean

Fractional agile: how to do iterative development when it’s only part of your job

Abstract

Agile development methods such as scrum strongly recommend a dedicated team for software development, meaning that all team members work solely on one project at a time. This is not feasible in many of the academic and open-source community environments where open repositories are developed. This presentation will encourage developers, managers, and open-source contributors to find sustainable ways to participate in agile development even if they can’t dedicate themselves full time to a project. Drawing from my experience as a scrum master on a digital repository team, I will share practical advice on setting expectations, communicating about availability, and maintaining a sustainable pace on a scrum team composed entirely of part-time members.

And the quote:

THERE IS NO SUCH THING AS AN EXPECTATION THAT IS TOO CLEAR

Ta Robin, I've been using this at work.
Best catchphrase: Robert S. Doiel, Thomas Morrell

Building software at the edges of heterogeneous repositories Caltech Library, United States of America

Abstract

Caltech Library has a heterogeneous mix of repository systems (e.g. EPrints hosts CaltechAUTHORS and CaltechTHESIS, while CaltechDATA is based on Invenio). Caltech Library has changed its focus from developing in the specific repository system to one of development at the edges leveraging web APIs. This has allowed us to not only repurpose content but start working at collection level curation by integrating external data sources like ORCID, CrossRef, FundRef and DataCite. The philosophy we have evolved is to work from copies of the data in JSON form using an Open Source tool Caltech Library created called dataset as well as additional Open Source tools in a project called datatools. These command line tools are written in Go but can be easily used from more popular languages like Python. This talk will introducing these tools and demonstrate their usage via Python.

And the phrase:

CONTINUOUS MIGRATION

I liked this phrase because we're migrating data at the moment between RedBox 1 and ReDBox 2 and it's causing my colleague Michael Lynch visible physical and psychic pain. Robert's articulation of the idea that data migration should be normal, and done constantly so it's easy and considered business as usual resonated. But now that I think about it - what we're really aiming for is to get the core data to stay put and not to move - as per the aims of the Oxford Common Filesystem Layout I quoted above. The systems that index, access and analyse the data, and trade metadata are the ones involved in continuous migration.

I am please to report that Robert has recently been in contact with me to talk about aligning their dataset project with DataCrate, this is a step towards DataCrate being able to look inside data files and do stuff like describe column-headers in tabular data, rather than just describing the files.
Best "Lesson Learned": Lars Holm Nielsen, Alexandros Ioannidis, Krzysztof Nowak, Jose Benito Gonzalez Lopez from CERN

File loss: hits and near misses

Abstract

Repositories increasingly depend on external cloud storage or other complex distributed systems in order to satisfy ever-growing needs for storing larger data volumes. The cloud system helps repository manages store terabytes and petabytes of data, and often simplifies the file management in the underlying repository software. We trust these systems to store our files, yet, often we lack understanding of the operation and internals of these systems and how they can fail. This talk will present two file loss incidents on Zenodo, uncovering some ways these distributed systems can fail. One incident was caused by a coincidence of two software bugs in independent systems (the hit), and a second incident was caused by a human operational mistake in the cloud storage system (the near miss).

Great talk about when things go wrong. We don't hear enough of these stories.

The lesson?

LOG EVERYTHING

The venue etc

Often the OR conference is at a conference centre when in North America but this one was at the uni, in a lovely part of Bozeman, walkable from Downtown, through a leafy suburb of early twentieth century houses. The venue was comfy, the food was fine, and there was a cash bar at the dinner, though with a distinct lack of options for people who didn't want alcohol, and a decent band, Little Jane and the pistol whips. Gotta love that gun culture. Downtown Bozeman is great, plenty of good food options including Bison from Ted Turner's ranch. Ted not only revolutionised TV news by inventing CNN, he is a largest private owner of Bison in the world according to our server. He's even involved in research...

Q. What did Ted Turner discover when he let 200 Bison loose in the top paddock for 1 year?

A. The bisontennial.

[ptsefton.com] | [CV & Bio]

Trip Report (with bonus opinions) - Open Repositories 2018, Bozeman Montana, USA

2018-07-10

Mission

Research Data Management

Some highlights

And the winners are...

Abstract

THERE IS NO SUCH THING AS AN EXPECTATION THAT IS TOO CLEAR

Abstract

CONTINUOUS MIGRATION

Abstract

LOG EVERYTHING

The venue etc