DataCrate: Formalising ways of packaging research data for re-use and dissemination

2017-10-19

[Update: 2017-10-20 Fixed a few typos and some formatting.]

This is a presentation I gave at eResearch Australasia on 2017-10-18 about the new draft (v0.1) DataCrate specification for data packaging, which I've just completed with lots of help from others (credits at the end).

BACKGROUND

In 2013 Peter Sefton and Peter Bugeia presented at eResearch Australasia on a format for packaging research data (1) using standards-based metadata. It had one innovative feature: instead of including metadata in a machine-readable format only, each data package came with an HTML file that contained both human- and machine-readable metadata, via RDFa, which allows semantic assertions to be embedded in a web page.

Variations of this technique have been included in various software products over the last few years, but there was no agreed standard on which vocabularies to use for metadata, or specification of how the files fitted together.

THE PRESENTATION

This presentation will describe work in progress on the DataCrate specification (2), illustrated with examples, including a tool to create DataCrates. We will also discuss other work in this area, including Research Object Bundles (3) and Data Conservancy (4) packaging.

We will be seeking feedback from the community on this work: should it continue? Is it useful? Who can help out?

The DataCrate spec:

- Has both human- and machine-readable metadata at a package (data set/collection) level as well as at a file level.
- Allows for and encourages inclusion of contextual metadata such as descriptions of organisations, facilities, experiments and people, linked to files with meaningful relationships (e.g. to say a file was created by a particular machine, as part of a particular experiment, at an organisation).
- Is a BagIt profile (5). BagIt (6) is a simple packaging standard for file-based data.
- Has a README.html tag file at the root with BagIt-style metadata about the distribution (contact details etc.), with a link to a CATALOG.html file inside the payload (data) dir. CATALOG.html uses RDFa with schema.org metadata to give detailed information about the files in the package, and there is a redundant CATALOG.json in JSON-LD format.
- Is easily extensible, as it is based on RDF.
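To make the layout in the list above concrete, here is a minimal Python sketch that writes an empty skeleton with those files in those places. It is an illustration only, not the output of any DataCrate tool: the function name `make_skeleton`, the placeholder file contents and the choice of SHA-256 for the manifest are all assumptions, and the draft spec itself remains the authority on what a valid crate contains.

```python
import hashlib
import tempfile
from pathlib import Path


def make_skeleton(root: Path) -> list[str]:
    """Create an empty DataCrate-style bag; return its relative file paths."""
    payload = root / "data"
    payload.mkdir(parents=True)

    # BagIt declaration: the tag file that marks the directory as a bag.
    (root / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # Human-readable tag file at the root, linking to the catalog.
    (root / "README.html").write_text(
        '<a href="data/CATALOG.html">Catalog</a>\n', encoding="utf-8"
    )
    # Placeholder catalogs inside the payload directory (contents are dummies).
    (payload / "CATALOG.html").write_text("<html></html>\n", encoding="utf-8")
    (payload / "CATALOG.json").write_text('{"@graph": []}\n', encoding="utf-8")

    # Payload manifest: one "<checksum>  <path>" line per payload file.
    # SHA-256 is an assumption here; BagIt allows other algorithms.
    lines = []
    for f in sorted(payload.rglob("*")):
        if f.is_file():
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            lines.append(f"{digest}  {f.relative_to(root).as_posix()}")
    (root / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )

    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    for name in make_skeleton(Path(tmp) / "example_crate"):
        print(name)
```

The research data files themselves would sit alongside the two catalog files under `data/`.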

REFERENCES

1. Sefton P, Bugeia P. Introducing next year’s model, the data-crate; applied standards for data-set packaging. In: eResearch Australasia 2013 [Internet]. Brisbane, Australia; 2013. Available from: http://eresearchau.files.wordpress.com/2013/08/eresau2013_submission_57.pdf

2. datacrate: Bagit-based data packaging specification for dissemination of research data with useful human and machine readable metadata: “Make Data Crate Again!” [Internet]. UTS-eResearch; 2017 [cited 2017 Jun 29]. Available from: https://github.com/UTS-eResearch/datacrate

3. Research Object Bundle [Internet]. [cited 2017 Jun 16]. Available from: https://researchobject.github.io/specifications/bundle/

4. Data Conservancy Packaging Specification Home [Internet]. [cited 2017 Jun 29]. Available from: http://dataconservancy.github.io/dc-packaging-spec/dc-packaging-spec-1.0.html

5. Ruest N. BagIt Profiles Specification [Internet]. 2017 Jun. Available from: https://github.com/ruebot/bagit-profiles

6. Kunze J, Boyko A, Vargas B, Madden L, Littman J. The BagIt File Packaging Format (V0.97) [Internet]. [cited 2013 Mar 1]. Available from: http://tools.ietf.org/html/draft-kunze-bagit-06

DataCrate: Formalising || ways of packaging research || data for re-use and || dissemination || Peter Sefton, University of Technology Sydney

Slide notes

This is a presentation I gave at eResearch Australasia 2017-10-18.

Slide notes

Peter Bugeia and I talked about this four years ago. This year I got around to leading the effort to standardise what we did back then.

Slide notes

This presentation is structured as a story.

Back in June Cameron Neylon was [annoyed](http://cameronneylon.net/blog/as-a-researcher-im-a-bit-bloody-fed-up-with-data-management/):

"More concretely I specifically have data from a set of interviews. I have audio and || I have notes/transcripts. I have the interview prompt. I have decided this set of || around 40 files is a good package to combine into one dataset on Zenodo. So my || next step is to search for some guidance on how to organise and document || that data. Interviews, notes, must be a common form of data package right? || So a quick search for a tutorial, or guidance or best practice? || || Nope. Give it a go. You either get a deep dive into metadata schema (and || remember I'm one of the 2% who even know what those words mean) or you get || very high level generic advice about data management in general. Maybe you get || a few pages giving (inconsistent) advice on what audio file formats to use."

Slide notes

When I saw this cry for help I contacted Cameron and offered to work with him.

"As a researcher trying to do a good job || of data deposition, I want an example of || my kind of data being done well, so I can || copy it and get on with my research"

Slide notes

More from Cameron.

Slide notes

But actually, there are no simple examples of how to organise "long-tail" data sets for publication. Research data management books will tell you about various metadata standards, but how do you enter the metadata and associate it with your data?

Cameron Professor Neylon has || published his dataset

Slide notes

The dataset is [available from Zenodo](https://doi.org/10.13039/501100000193), an open data repository hosted by CERN.

Slide notes

This is a human-readable catalog that lists all the files in the data set.

With information about people, places, || licenses and their relationships to the || files || in the DataCrate

Slide notes

It also has information about their context and the relationships between them.

Slide notes

For example, it shows that Cameron is the creator of the dataset. Note that Cameron is identified by his ORCID ID: [http://orcid.org/0000-0002-0068-716X](http://orcid.org/0000-0002-0068-716X). Using URLs to identify things such as people is one of the key principles of [Linked Data](https://en.wikipedia.org/wiki/Linked_data).

With lots of useful info about || relationships between the files

Like this one is || a translation of || this other one

Slide notes

Here's an example of a relationship between two of the files - one is a translation of another.

<div || resource="./data/.../WorkshopBookletParticipants.docx" || property="http://schema.org/translationOf"> || ... || </div>

Slide notes

The HTML contains RDFa embedded metadata. [RDFa](https://en.wikipedia.org/wiki/RDFa) is a standard way of embedding semantics in a web page.
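As a rough illustration of why embedded attributes like these are machine readable, the sketch below uses Python's standard `html.parser` to pull `property` and `resource` attribute pairs out of markup shaped like the slide. It is not a real RDFa processor: it does not work out the subject of each statement from the enclosing elements, which a conforming parser would do. The class name `RelationCollector` and the file name in the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class RelationCollector(HTMLParser):
    """Collect (property, resource) pairs from elements carrying both attributes."""

    def __init__(self):
        super().__init__()
        self.relations = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "resource" in attributes and "property" in attributes:
            self.relations.append(
                (attributes["property"], attributes["resource"])
            )


# Hypothetical markup modelled on the slide; the file name is a placeholder.
sample_html = """
<div resource="./data/original.docx"
     property="http://schema.org/translationOf">
  ...
</div>
"""

collector = RelationCollector()
collector.feed(sample_html)
for prop, resource in collector.relations:
    print(prop, "->", resource)
```

For real work an RDFa-aware library would be the better choice, since it turns the whole page into subject, predicate and object statements.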

That's standard semantic web metadata || as used by search engines

Slide notes

RDFa, using the [schema.org](http://schema.org) metadata vocabulary, is widely used by search engines.

Slide notes

Movie times, opening times, recipes - these are some of the things that search engines understand.

There's programmer-friendly JSON || metadata: easy to look up Contact

Slide notes

This package also has JSON metadata.

"@graph": [ || { || "@id": "data", || "@type": "Dataset", || "Contact": { || "@id": "http://orcid.org/0000-0002-0068-716X", || "@type": "Person", || "Email": "cn@cameronneylon.net", || "ID": "http://orcid.org/0000-0002-0068-716X", || "Name": "Cameron Neylon" || },

Slide notes

The JSON is easily usable by programmers: getting the contact for this dataset, for example, is a simple operation.
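A minimal sketch of that lookup in Python, using only the standard `json` module. The JSON here is a trimmed, self-contained fragment based on the slide rather than the full CATALOG.json, and the helper name `dataset_contact` is made up for this example.

```python
import json

# Trimmed fragment modelled on the slide; a real CATALOG.json has more entries.
catalog_text = """
{
  "@graph": [
    {
      "@id": "data",
      "@type": "Dataset",
      "Contact": {
        "@id": "http://orcid.org/0000-0002-0068-716X",
        "@type": "Person",
        "Email": "cn@cameronneylon.net",
        "Name": "Cameron Neylon"
      }
    }
  ]
}
"""


def dataset_contact(catalog: dict) -> dict:
    """Return the Contact object of the first Dataset entry in @graph."""
    for item in catalog["@graph"]:
        if item.get("@type") == "Dataset":
            return item["Contact"]
    raise KeyError("no Dataset entry in @graph")


contact = dataset_contact(json.loads(catalog_text))
print(contact["Name"], contact["Email"], contact["@id"])
```

No RDF tooling is needed for this kind of everyday use; the keys are plain strings.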

And use the context to expand that to a || full unambiguous URI

Slide notes

But if needed, the simple "Contact" can be turned into a URI, as per Linked Data principles.

"@context": { || ... || "Description": "schema:description", || "License": "schema:license", || "Title": "schema:name", || "Name": "schema:name", || "Creator": "schema:creator", || ... || "TranslationOf": "schema:translationOf", || "Funder": "schema:Funder", || "Person": "schema:Person", || "Contact": "schema:accountablePerson", || ... || "schema": "http://schema.org/",

Slide notes

You can look up Contact in the DataCrate JSON-LD context and see that it maps to schema:accountablePerson.

Contact -> schema:accountablePerson || schema:accountablePerson -> || http://schema.org/accountablePerson

Slide notes

Then you can expand schema:accountablePerson to http://schema.org/accountablePerson, using the schema prefix defined in the context.
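The two-step mapping can be sketched by hand in a few lines of Python. This is a deliberately simplified illustration that handles only the simple `prefix:name` case seen on the slides, not the full JSON-LD expansion algorithm that a JSON-LD library implements. The `context` dictionary is a small excerpt of the one shown earlier, and `expand_term` is a hypothetical helper name.

```python
# Excerpt of the context from the slide (most entries omitted).
context = {
    "Contact": "schema:accountablePerson",
    "Name": "schema:name",
    "schema": "http://schema.org/",
}


def expand_term(term: str, context: dict) -> str:
    """Expand a short key to a full URI via a simple prefix lookup."""
    # Step 1: short key -> compact form, e.g. "schema:accountablePerson".
    compact = context[term]
    prefix, separator, name = compact.partition(":")
    # Step 2: swap a known prefix for its base URI. Values that are already
    # absolute URIs (such as "http://...") are returned unchanged.
    if separator and prefix in context and not name.startswith("//"):
        return context[prefix] + name
    return compact


print(expand_term("Contact", context))
print(expand_term("Name", context))
```

Note that the context on the slide maps both Title and Name to schema:name, so different short keys can expand to the same URI.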

And machine-readable BagIt checksums || to check integrity

Slide notes

There are also checksums for all the data files.

Slide notes

There's a BagIt manifest file.

Slide notes

Which lists all the files and their checksums, so the validity of the bag can be checked.
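The check itself is simple enough to sketch in Python. The version below assumes a manifest named `manifest-sha256.txt` with one `<checksum> <path>` entry per line; the algorithm, the function name `verify_manifest` and the throwaway demo bag are assumptions for illustration. Existing BagIt libraries do this, and much more, properly.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_manifest(bag_root: Path, algorithm: str = "sha256") -> list[str]:
    """Return the payload paths whose checksum differs from the manifest."""
    failures = []
    manifest = bag_root / f"manifest-{algorithm}.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Each entry is "<checksum> <path>", separated by whitespace.
        expected, relative_path = line.split(maxsplit=1)
        data = (bag_root / relative_path).read_bytes()  # raises if file is missing
        if hashlib.new(algorithm, data).hexdigest() != expected.lower():
            failures.append(relative_path)
    return failures


def demo() -> tuple[list[str], list[str]]:
    """Build a one-file bag, verify it, alter the file, then verify again."""
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp)
        (bag / "data").mkdir()
        notes = bag / "data" / "notes.txt"  # hypothetical payload file
        notes.write_bytes(b"interview notes\n")
        digest = hashlib.sha256(notes.read_bytes()).hexdigest()
        (bag / "manifest-sha256.txt").write_text(
            f"{digest}  data/notes.txt\n", encoding="utf-8"
        )
        before = verify_manifest(bag)
        notes.write_bytes(b"changed after packaging\n")
        after = verify_manifest(bag)
        return before, after


before, after = demo()
print("failures before change:", before)
print("failures after change:", after)
```

An empty list means every listed file still matches its recorded checksum.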

Slide notes

This package is like a gift from Cameron, to his collaborators, to other researchers and to his future self.

Slide notes

.. to do this work ...

Slide notes

We used an experimental tool called Calcyte

Slide notes

... I ran Calcyte on Cameron's Google Drive share to create CATALOG.xlsx files ...

Slide notes

[Calcyte](https://codeine.research.uts.edu.au/eresearch/calcyte) is experimental early-stage open source software written by my group (mainly me) at UTS.

Slide notes

Calcyte created spreadsheets which functioned as metadata forms that Cameron could fill out.

Slide notes

The spreadsheets are multi-sheet workbooks, giving us scope to describe not only data entities like files, but metadata entities such as people, licenses and organisations.

I ran Calcyte to create the human and || machine readable metadata

Slide notes

We spent a couple of months working on this intermittently. It will be quicker next time, but this level of data description will always involve a fair bit of care and work: at least a few hours for this scale of project. It's also important to proofread the result, just as with publishing articles.

So what's special about this packaging || approach?

Human AND machine readable web- || native || linked-data || metadata, || not just string-values in XML

Slide notes

The advantages of this approach are that the package has: Human AND machine readable web-native linked-data metadata, not just string-values in XML

Slide notes

This slide is a reminder of what the CATALOG.html file looks like, complete with its DataCite citation, which, when people start citing this, will add to Cameron's academic capital.

Slide notes

This work is based on previous efforts:

- Cr8it, now being looked after by Newcastle.edu.au (via Western Sydney and Intersect): https://github.com/digitalbridge/crateit/tree/develop
- HIEv: https://github.com/IntersectAustralia/dc21
- Mike Lake's CAVE repository: https://suss.caves.org.au/cave/

Cr8it and HIEv are covered in our 2013 presentation at eResearch Australasia

It builds on other standards:

- BagIt: https://tools.ietf.org/html/draft-kunze-bagit-14
- Schema.org: http://schema.org

Slide notes

The format used in this demo is described in a [draft specification](https://github.com/UTS-eResearch/datacrate/tree/master/spec/0.1).

Slide notes

- Use at UTS for our data repository, and for export from various services.
- Lobby to get support integrated into Zenodo, Figshare et al.
- Improve capture/packaging tools (Cr8it, Cloudstor Collections).
- Work with others on aligning this work with other standards; [here's a list someone else put together](https://docs.google.com/document/d/155lA2BcixTl-zwJHGfLkxsmg7WmQbBK00QWyP8QggkE/edit).
- Work with RDA on their repository interchange format: https://www.rd-alliance.org/groups/research-data-repository-interoperability-wg.html

"Make data crate again" || Liz Stokes 2017

Slide notes

I'll leave it with this slogan from our UTS data librarian and friend of eResearch, Liz Stokes.

Thanks to:

- Cameron Neylon for being customer zero
- Liz Stokes for working on metadata crosswalking/mapping
- Mike Lake for coding and ideas
- Conal Tuohy and Duncan Loxton for commenting on the draft spec
- Amir Aryani for discussions about metadata

And the mainly Sydney-based metadata group who met in the lead-up to this work: Piyachat Ratana, Sharyn Wise, Michael Lynch, Craig Hamilton, Vicki Picasso, Gerry Devine, Katrin Trewin, Ingrid Mason, Peter Bugeia.

