
Trip Report (with bonus opinions) - Open Repositories 2018, Bozeman, Montana, USA

2018-07-10

I (Peter Sefton) recently attended OR2018, the Open Repositories conference, held from June 4-7, 2018 in Bozeman, Montana.

This post is being posted on the UTS eResearch site and on my site.

My trip was funded by the University of Technology Sydney (UTS).

The sign at the Lewis and Clark Motel

Mission

Gavin Kennedy from QCIF was also in attendance, and we were on something of a mission - to promote and get feedback on the recent work we've been doing on the ReDBox research data management platform. We ran a ReDBox intro workshop on the Monday of the conference. Gavin and I also presented a general introduction to ReDBox and the provisioner, and I went into more detail about the DataCrate standard for shipping and showing off research data, an effort I have been leading with help from a growing community of supporters. I also did a presentation in the technical session which included a live demo of using ReDBox to ingest DataCrates - showing how it could 'sniff out' metadata from DataCrate packages.

Research Data Management

Open Repositories now has enough going on about Research Data Management that I was able to spend most of my time in those sessions when I wasn't meeting with people directly.

I heard people talk about a few things that helped confirm some of our design choices at UTS, and a few things to challenge my world view as well.

Vladimir Bubalo from Macquarie, Gavin and I chased up some more details about the Dataverse repository software. There is some interest from Macquarie in what's available in the way of open source research data repositories.

As attested by the session Reaching out with Data: Dataverse Creating a Global Community, Dataverse looks like a thriving product and would be a good integration target for ReDBox; it powers the Australian Data Archive, for one thing, and can be used as an institutional data repository. One problem they're still grappling with is large file support, which is an issue for any repository software when you try to push large volumes of data, or large numbers of data streams, through the API or web interface.

There was a really interesting talk from the UK Jisc Research Data Shared Services (RDSS) project (on which I have done some very part-time consulting via Artefactual in Canada) about how they failed to get a Samvera repository working as part of their offering.

They have published their report.

It seems the project to adapt Samvera did not fail because of large files or large volumes of files, although that problem may well have come up later: as we reported at OR2014, Hydra (since rebranded as Samvera) had severe performance problems on the Alveo virtual lab related to the number of files processed in a single transaction. Jisc called out two issues in their report. One, a failure to implement their complex domain model in Samvera. They say:

In order to capture and store the range of metadata required by the RDSS CDM the internal storage model of the work type within RDSS Samvera needed to map closely to the CDM’s conceptual data model. However, it did not prove to be straightforward to translate the conceptual data model of the RDSS CDM, which leverages entity-relationships, to a programming/storage model that is largely intended to be flat.

And two, they had problems implementing a message passing model.

In order to avoid the kind of problems reported by Jisc, and to deal with the reality that many researchers just want to gulp down files, or object-store blobs, or database tables, and most emphatically do not want to sip data through a tiny API-straw, we are building research data management systems that are distributed:

Given the above, one of the things I was keen to do at OR2018 was to find out more about the Oxford Common Filesystem Layout (OCFL), which is being driven by people from Oxford, Emory, Stanford, Cornell and Duraspace - it's about how you organise digital assets in the kind of deconstructed architecture I described above for UTS. The files are kept on a file system (yes, cloud-first people, that might actually be backed by an object-store) so that you can run services against them: index them for discovery, check their integrity, generate dissemination versions for distribution, report on items due for disposal and so on.

I couldn't get to the OCFL presentation, but lead author Neil Jeffries from Oxford talked me through the emerging standard and the process being used to develop it. Neil assured me we can go ahead and implement against the draft spec (and challenged me with the statement that at Oxford they don't like to talk about data vs metadata, it's all data. I'm still thinking that over, Neil, but I think I still believe in metadata).

This is from the (still rather sparse) draft dated 2018-06-22, a couple of weeks after the conference:

A general observation is that the contents of a digital repository -- that is, the digital files and metadata that an institution might wish to manage -- are largely stable. Once content has been accessioned, it is unlikely to change significantly over its lifetime. This is in contrast to the software applications that manage these contents, which are ephemeral, requiring constant updating and replacement. Thus, transitions between application-specific methods of file management to support software upgrades and replacement cycles can be seen as unnecessary and risky change, changing the long-term stable objects to support the short-term, ephemeral software.

By providing a specification for the file-and-folder layout on disk, the OCFL is an attempt at reducing, or even eliminating, the need for these transitions. As an application-independent specification, conforming applications will natively 'understand' the underlying file structure without needing to first transition these contents to their own format.

https://ocfl.io/

From my notes of the conversation with Neil:

Multiple things accessing the file system writing changes to LOGS. Eg fixity check, or create new.

There is no state. File system is the state. Digital pres services are workers / crawler or message queue driven.

The incomplete spec is online - it's not enough to do a complete implementation yet, but we will track it.
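To make the "file system is the state" idea from the notes above a bit more concrete, here is a minimal sketch of a stateless fixity worker in Python. It is not an OCFL implementation and does not follow the draft spec: the `manifest.json` file (a flat map of relative path to SHA-256 digest), the `fixity.log` file name and the function names are all assumptions made up for illustration. The worker keeps no state of its own; it reads the files, compares them to the manifest and appends what it found to a log.

```python
import hashlib
import json
import time
from pathlib import Path


def sha256_of(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def fixity_pass(root, manifest_name="manifest.json", log_name="fixity.log"):
    """Check each file named in a (hypothetical) manifest and log the outcome.

    The manifest format is an assumption for this sketch:
    {"relative/path": "<sha256 hex digest>", ...}
    """
    root = Path(root)
    expected = json.loads((root / manifest_name).read_text())
    results = []
    for relative_path, expected_digest in sorted(expected.items()):
        target = root / relative_path
        if not target.is_file():
            status = "missing"
        elif sha256_of(target) == expected_digest:
            status = "ok"
        else:
            status = "changed"
        results.append((relative_path, status))
    # Append-only log: the worker never rewrites the content it checks.
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    with open(root / log_name, "a") as log:
        for relative_path, status in results:
            log.write(f"{stamp} {status} {relative_path}\n")
    return results
```

A crawler or a message-queue consumer could call `fixity_pass` on one object directory at a time; because everything it needs is on disk, any number of such workers could be swapped in or out without migrating the content.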

The host university, Montana State, had a nice take on this too. Sara Mannheimer, Jason A. Clark and James Espeland presented A Prototype for the Institutional Research Data Index:

Most out-of-the-box institutional repository systems don’t provide the workflows and metadata features required for research data. Consequently, many libraries now support two institutional repository systems—one for publications, and one for research data—even when there are nearly a thousand data repositories in the United States, many of which provide services and policies that ensure their trustworthiness and suitability for institutional research data. Libraries are either increasing spending by purchasing data repository solutions from vendors, or replicating work by building, customizing, and managing individual instances of data repository software. This presentation suggests a potential solution to this issue: a prototype for an open source Institutional Research Data Index (IRDI) that promotes discovery and reuse of institutional datasets through automatic metadata harvesting and search engine optimization. IRDI could lead to a single, unified index for academic institutional research data. A unified data index would lead to three key impacts: increasing discovery, reuse, and citation of open research data; reinforcing the idea that research data is a legitimate scholarly product; and promoting community-wide systems that require less resource expenditure.

They noted that getting a research data repository up and running is hard and expensive.

Their solution:

This is similar to the ReDBox approach in that it is highly distributed and it contains an IRDI. The Montana team have soon-to-be-released code that helps find data that's residing out there on the web, e.g. in Figshare, which we need to look into.

Another thing we should explore for ReDBox is how data is stored and secured. We are working with a product, the DELL/EMC Isilon, which has a lot of features in this area, but I plan to look at the Edinburgh & Manchester DataVault project as well. Claire Knowles presented Sustaining the momentum, moving the DataVault project to a service with Mary McDerby, Robin Rice and Thomas Higgins.

DataVault does encrypted multi-site storage. 3 copies. One on site, one outer Edinburgh on UK cloud. They use chunking to reduce risk around encryption - lose less of a file if there's a problem.
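As a rough illustration of why chunking limits the damage, here is a small Python sketch. It is not DataVault's code, and the encryption step is deliberately left out: the function names and the chunk record layout are assumptions invented for this example. Each chunk carries its own checksum, so a problem in one chunk can be detected and isolated while the others remain usable; in a real system each chunk payload would also be encrypted independently.

```python
import hashlib


def split_into_chunks(data, chunk_size):
    """Split bytes into fixed-size chunks, each with its own SHA-256 digest.

    In a real system each payload would be encrypted separately here;
    that step is omitted from this sketch.
    """
    chunks = []
    for offset in range(0, len(data), chunk_size):
        payload = data[offset:offset + chunk_size]
        chunks.append({
            "index": offset // chunk_size,
            "sha256": hashlib.sha256(payload).hexdigest(),
            "payload": payload,
        })
    return chunks


def damaged_chunks(chunks):
    """Return the indexes of chunks whose payload no longer matches its digest."""
    return [
        chunk["index"]
        for chunk in chunks
        if hashlib.sha256(chunk["payload"]).hexdigest() != chunk["sha256"]
    ]
```

If a single chunk of a large file is corrupted, `damaged_chunks` reports only that chunk, whereas a file encrypted as one blob could be unreadable in its entirety.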

Some highlights

An insight from Esme Cowles in Valkyrie: Reimagining the Samvera Community, which is looking at adding swappable back ends to the Samvera platform (so you can use something other than Fedora to store your stuff):

Lesson from Islandora - don't fight your host platform.

Speaking of Islandora, in Relational Databases as Repository Objects Alexander Garnett showed off a plugin that gives live access to a SQL database in your Islandora repository; it spins up a Docker container on demand. I asked Alex on The Twitter if he'd seen Datasette but he said that SQLite does not scale well in his experience.

Thomas Morrell from Caltech mentioned a few interesting things in Positioning a repository as campus research infrastructure:

    {"author":
      {
      "@id": "http://orcid.org/0000-0003-0077-4738",
      "@type": "Person",
      "email": "slaughter@nceas.ucsb.edu",
      "givenName": "Peter",
      "familyName": "Slaughter"
      }
    }

Joshua A. Westgard from the University of Maryland Libraries presented a Python Library for the Fedora API, which would have been of interest to us at UTS if we'd gone ahead with our planned Fedora 4 design for data management instead of joining forces with QCIF on the new MongoDB-based ReDBox.

Chris Diaz from Northwestern University talked about building static websites over the top of a repo in Jekyll and Institutional Repositories; it looks like a great, sustainable alternative to Omeka exhibitions, worth considering in some situations.

And the winners are...

I thought I'd pick out a few of the best quotes and one-liners from my notes.

The venue etc

Often the OR conference is at a conference centre when in North America, but this one was at the uni, in a lovely part of Bozeman, walkable from Downtown through a leafy suburb of early twentieth century houses. The venue was comfy, the food was fine, and there was a cash bar at the dinner (though with a distinct lack of options for people who didn't want alcohol) and a decent band, Little Jane and the Pistol Whips. Gotta love that gun culture. Downtown Bozeman is great, with plenty of good food options including bison from Ted Turner's ranch. Ted not only revolutionised TV news by inventing CNN, he is the largest private owner of bison in the world, according to our server. He's even involved in research...

Q. What did Ted Turner discover when he let 200 Bison loose in the top paddock for 1 year?

The neon sign at Ted's

A. The bisontennial.