
DataCrate: Formalising ways of packaging research data for re-use and dissemination

2017-10-19

[Update: 2017-10-20 Fixed a few typos and some formatting.]

This is a presentation I gave at eResearch Australasia 2017-10-18 about the new Draft (v0.1) Data Crate Specification for data packaging I've just completed, with lots of help from others (credits at the end).

BACKGROUND

In 2013 Peter Sefton and Peter Bugeia presented at eResearch Australasia on a format for packaging research data (1), using standards-based metadata, with one innovative feature: instead of including metadata in a machine readable format only, each data package came with an HTML file that contained both human and machine readable metadata, via RDFa, which allows semantic assertions to be embedded in a web page.

Variations of this technique have been included in various software products over the last few years, but there was no agreed standard on which vocabularies to use for metadata, or specification of how the files fitted together.

THE PRESENTATION

This presentation will describe work in progress on the DataCrate specification (2), illustrated with examples, including a tool to create DataCrates. We will also discuss other work in this area, including Research Object Bundles (3) and Data Conservancy (4) packaging.

We will be seeking feedback from the community on this work: should it continue? Is it useful? Who can help out? The DataCrate spec:

REFERENCES

1. Sefton P, Bugeia P. Introducing next year’s model, the data-crate; applied standards for data-set packaging. In: eResearch Australasia 2013 [Internet]. Brisbane, Australia; 2013. Available from: http://eresearchau.files.wordpress.com/2013/08/eresau2013_submission_57.pdf

2. datacrate: Bagit-based data packaging specification for dissemination of research data with useful human and machine readable metadata: “Make Data Crate Again!” [Internet]. UTS-eResearch; 2017 [cited 2017 Jun 29]. Available from: https://github.com/UTS-eResearch/datacrate

3. Research Object Bundle [Internet]. [cited 2017 Jun 16]. Available from: https://researchobject.github.io/specifications/bundle/

4. Data Conservancy Packaging Specification Home [Internet]. [cited 2017 Jun 29]. Available from: http://dataconservancy.github.io/dc-packaging-spec/dc-packaging-spec-1.0.html

5. Ruest N. BagIt Profiles Specification [Internet]. 2017 Jun. Available from: https://github.com/ruebot/bagit-profiles

6. Kunze J, Boyko A, Vargas B, Madden L, Littman J. The BagIt File Packaging Format (V0.97) [Internet]. [cited 2013 Mar 1]. Available from: http://tools.ietf.org/html/draft-kunze-bagit-06



DataCrate: Formalising ways of packaging research data for re-use and dissemination
Peter Sefton, University of Technology Sydney
Slide notes
This is a presentation I gave at eResearch Australasia 2017-10-18.


Slide notes
Peter Bugeia and I talked about this 4 years ago. This year I got around to leading the effort to standardise what we did back then.


Slide notes
This presentation is structured as a story.

Back in June Cameron Neylon was [annoyed](http://cameronneylon.net/blog/as-a-researcher-im-a-bit-bloody-fed-up-with-data-management/).



"More concretely I specifically have data from a set of interviews. I have audio and I have notes/transcripts. I have the interview prompt. I have decided this set of around 40 files is a good package to combine into one dataset on Zenodo. So my next step is to search for some guidance on how to organise and document that data. Interviews, notes, must be a common form of data package right? So a quick search for a tutorial, or guidance or best practice?

Nope. Give it a go. You either get a deep dive into metadata schema (and remember I'm one of the 2% who even know what those words mean) or you get very high level generic advice about data management in general. Maybe you get a few pages giving (inconsistent) advice on what audio file formats to use."
Slide notes
When I saw this cry for help I contacted Cameron and offered to work with him.


"As a researcher trying to do a good job of data deposition, I want an example of my kind of data being done well, so I can copy it and get on with my research"
Slide notes
More from Cameron.


There were no examples
Slide notes
But actually, there are no simple examples of how to organise "long-tail" data sets for publication. Research data management books will tell you about various metadata standards, but how do you enter the metadata and associate it with your data?


So we made one


Fast forward to this week ...


Cameron Professor Neylon has published his dataset


https://doi.org/10.13039/501100000193


Slide notes
The dataset is [available from Zenodo](https://doi.org/10.13039/501100000193), an open data repository hosted by CERN.


It's a zipped-up BagIt bag




There's a catalog inside


Slide notes
This is a human-readable catalog that lists all the files in the data set.


With information about people, places, licenses and their relationships to the files in the DataCrate
Slide notes
And has information about their context and the relationships between them.


Slide notes
For example, it shows that Cameron is the creator of the dataset. Note that Cameron is identified by his ORCID ID: [http://orcid.org/0000-0002-0068-716X](http://orcid.org/0000-0002-0068-716X). Using URLs to identify things such as people is one of the key principles of [Linked Data](https://en.wikipedia.org/wiki/Linked_data).


With lots of useful info about relationships between the files


Like this one is a translation of this other one


Slide notes
Here's an example of a relationship between two of the files - one is a translation of another.


And it's not just nice tables either


    <div
      resource="./data/.../WorkshopBookletParticipants.docx"
      property="http://schema.org/translationOf">
      ...
    </div>
Slide notes
The HTML contains RDFa embedded metadata. [RDFa](https://en.wikipedia.org/wiki/RDFa) is a standard way of embedding semantics in a web page.
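
As a rough illustration of what "embedded" means here, the sketch below uses Python's standard-library HTML parser to pull the `resource` and `property` attributes out of a fragment shaped like the one on the slide. The file names in the sample are hypothetical, and this is not a real RDFa processor: it only collects raw attribute pairs and makes no attempt to work out subjects and objects the way a conforming RDFa tool would.

```python
from html.parser import HTMLParser


class RDFaAttributeCollector(HTMLParser):
    """Collect (resource, property) attribute pairs from start tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Only elements that carry a "property" attribute are of interest.
        if "property" in attributes:
            self.pairs.append((attributes.get("resource"), attributes["property"]))


# Hypothetical fragment, modelled on the slide; the paths are made up.
SAMPLE_HTML = """
<div resource="./data/example/Booklet_translated.docx"
     property="http://schema.org/translationOf">
  <a href="./data/example/Booklet.docx">Booklet.docx</a>
</div>
"""


def extract_pairs(html_text):
    """Return every (resource, property) pair found in html_text."""
    collector = RDFaAttributeCollector()
    collector.feed(html_text)
    return collector.pairs


pairs = extract_pairs(SAMPLE_HTML)
for resource, prop in pairs:
    print(resource, prop)
```

For anything beyond a quick look, a proper RDFa parser would be the safer choice, since RDFa has rules about nesting and inheritance that this sketch ignores.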


That's standard semantic web metadata as used by search engines
Slide notes
RDFa, using the [schema.org](http://schema.org) metadata vocabulary, is widely used by search engines.


Slide notes
Movie times, opening times, recipes: these are some of the things that search engines understand.


But that's not all.


There's programmer-friendly JSON metadata: easy to look up Contact
Slide notes
This package also has JSON metadata.




    "@graph": [
      {
        "@id": "data",
        "@type": "Dataset",
        "Contact": {
          "@id": "http://orcid.org/0000-0002-0068-716X",
          "@type": "Person",
          "Email": "cn@cameronneylon.net",
          "ID": "http://orcid.org/0000-0002-0068-716X",
          "Name": "Cameron Neylon"
        },
Slide notes
The JSON is easily usable by programmers - getting the contact for this dataset for example is a simple operation.
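
To give a sense of how simple, here is a minimal sketch in Python. It assumes the fragment from the slide is wrapped in an enclosing object and closed off so that it parses as JSON; the helper name `find_contact` is made up for this example and is not part of any DataCrate tooling.

```python
import json

# The slide's fragment, closed off so that it is valid JSON (an assumption).
SAMPLE_JSON = """
{
  "@graph": [
    {
      "@id": "data",
      "@type": "Dataset",
      "Contact": {
        "@id": "http://orcid.org/0000-0002-0068-716X",
        "@type": "Person",
        "Email": "cn@cameronneylon.net",
        "ID": "http://orcid.org/0000-0002-0068-716X",
        "Name": "Cameron Neylon"
      }
    }
  ]
}
"""


def find_contact(metadata, dataset_id="data"):
    """Return the Contact entry of the Dataset with the given @id, or None."""
    for item in metadata.get("@graph", []):
        if item.get("@id") == dataset_id and item.get("@type") == "Dataset":
            return item.get("Contact")
    return None


metadata = json.loads(SAMPLE_JSON)
contact = find_contact(metadata)
if contact is not None:
    print(contact["Name"], contact["Email"])
```

No RDF library is needed for this kind of lookup; plain dictionary access is enough.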


And use the context to expand that to a full unambiguous URI
Slide notes
But if needed, the simple "Contact" can be turned into a URI, as per Linked Data principles.


    "@context": {
      ...
      "Description": "schema:description",
      "License": "schema:license",
      "Title": "schema:name",
      "Name": "schema:name",
      "Creator": "schema:creator",
      ...
      "TranslationOf": "schema:translationOf",
      "Funder": "schema:Funder",
      "Person": "schema:Person",
      "Contact": "schema:accountablePerson",
      ...
      "schema": "http://schema.org/",
Slide notes
You can look up Contact in the DataCrate JSON-LD context and see that it maps to schema:accountablePerson.


Contact -> schema:accountablePerson
schema:accountablePerson -> http://schema.org/accountablePerson
Slide notes
Then you can map schema:accountablePerson to http://schema.org/accountablePerson.
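
The two-step mapping can be sketched in a few lines of Python. This is a deliberately simplified stand-in for JSON-LD term expansion, assuming a flat context that holds only plain string mappings like the ones on the slide; a real JSON-LD processor handles many more cases. The function name `expand_term` is hypothetical.

```python
# A small excerpt of the context shown on the slide.
CONTEXT = {
    "Contact": "schema:accountablePerson",
    "Name": "schema:name",
    "schema": "http://schema.org/",
}


def expand_term(term, context):
    """Expand a short term to a full URI using a flat, string-only context."""
    # Step 1: short name -> prefixed name, e.g. Contact -> schema:accountablePerson
    compact = context.get(term, term)
    # Step 2: prefix -> namespace, e.g. schema: -> http://schema.org/
    prefix, separator, suffix = compact.partition(":")
    if separator and prefix in context and not suffix.startswith("//"):
        return context[prefix] + suffix
    # Unknown terms and values that are already full URIs come back unchanged.
    return compact


print(expand_term("Contact", CONTEXT))
```

Terms that are not in the context are returned as they are, which keeps the sketch safe to call on arbitrary keys.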


And machine-readable BagIt checksums to check integrity
Slide notes
There are also checksums for all the data files.


Slide notes
There's a BagIt manifest file.


Slide notes
Which lists all the files and their checksums, so the validity of the bag can be checked.
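
A minimal sketch of that check in Python follows. It assumes a manifest named `manifest-sha256.txt` in which each line holds a checksum, some whitespace, and a file path relative to the bag; the algorithm used in Cameron's bag may differ, and the demo bag built here is entirely made up. In practice an existing BagIt library would be a better choice than hand-rolled code.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, algorithm):
    """Hash a file in chunks so that large data files do not fill memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bag_dir, algorithm="sha256"):
    """Return the listed paths that are missing or whose checksum differs."""
    bag_dir = Path(bag_dir)
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, relative_path = line.split(None, 1)
        target = bag_dir / relative_path
        if not target.is_file() or file_digest(target, algorithm) != expected.lower():
            failures.append(relative_path)
    return failures


def build_demo_bag(root):
    """Create a tiny, hypothetical bag with one payload file and a manifest."""
    root = Path(root)
    (root / "data").mkdir()
    payload = root / "data" / "notes.txt"
    payload.write_text("interview notes\n", encoding="utf-8")
    checksum = file_digest(payload, "sha256")
    (root / "manifest-sha256.txt").write_text(
        f"{checksum}  data/notes.txt\n", encoding="utf-8"
    )
    return root


with tempfile.TemporaryDirectory() as tmp:
    bag = build_demo_bag(tmp)
    intact = verify_manifest(bag)      # nothing has changed yet
    (bag / "data" / "notes.txt").write_text("tampered\n", encoding="utf-8")
    damaged = verify_manifest(bag)     # the edited file should now be reported
```

An empty result means every listed file is present and matches its recorded checksum.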


It's not so much a package as a


Slide notes
This package is like a gift from Cameron, to his collaborators, to other researchers and to his future self.


How did you do it?
Slide notes
... to do this work ...


We used an experimental tool called Calcyte
Slide notes
We used an experimental tool called Calcyte


I ran Calcyte on Cameron's Google Drive share to create CATALOG.xlsx files
Slide notes
... I ran Calcyte on Cameron's Google Drive share to create CATALOG.xlsx files ...


Slide notes
[Calcyte](https://codeine.research.uts.edu.au/eresearch/calcyte) is experimental early-stage open source software written by my group (mainly me) at UTS.


Slide notes
Calcyte created spreadsheets which functioned as metadata forms that Cameron could fill out.


Slide notes
The spreadsheets are multi-sheet workbooks, giving us scope to describe not only data entities like files, but metadata entities such as people, licenses and organisations.


Cameron filled out the metadata


I ran Calcyte to create the human and machine readable metadata


Rinse, repeat (took a few goes)
Slide notes
We spent a couple of months working on this intermittently. It will be quicker next time, but this level of data description will always involve a fair bit of care and work: at least a few hours for this scale of project. It's also important to proofread the result, just as with publishing articles.


So what's special about this packaging approach?


Human AND machine readable web-native linked-data metadata, not just string-values in XML
Slide notes
The advantages of this approach are that the package has: Human AND machine readable web-native linked-data metadata, not just string-values in XML


Slide notes
This slide is a reminder of what the CATALOG.html file looks like, complete with its DataCite citation, which, when people start citing this, will add to Cameron's academic capital.


This work is based on previous efforts:
- Cr8it, now being looked after by Newcastle.edu.au (via Western Sydney and Intersect): https://github.com/digitalbridge/crateit/tree/develop
- HIEv: https://github.com/IntersectAustralia/dc21
- Mike Lake's CAVE repository: https://suss.caves.org.au/cave/

Cr8it and HIEv are covered in our 2013 presentation at eResearch Australasia.

It builds on other standards:
- BagIt: https://tools.ietf.org/html/draft-kunze-bagit-14
- Schema.org: http://schema.org
Slide notes
This work is based on previous efforts

Cr8it and HIEv are covered in our 2013 presentation at eResearch Australasia

It builds on other standards: BagIt and Schema.org.



Slide notes
The format used in this demo is described in a [draft specification](https://github.com/UTS-eResearch/datacrate/tree/master/spec/0.1).


TODO (assuming people see the value in DataCrate)
1. Use at UTS for our data repository, and for export from various services
2. Lobby to get support integrated into Zenodo, Figshare et al
3. Improve capture/packaging tools (Cr8it, Cloudstor Collections, <your-system-here>)
4. Work with others on aligning this work with other standards; here's a list someone else put together: https://docs.google.com/document/d/155lA2BcixTl-zwJHGfLkxsmg7WmQbBK00QWyP8QggkE
5. Work with RDA on their repository interchange format: https://www.rd-alliance.org/groups/research-data-repository-interoperability-wg.html
Slide notes
- Use at UTS for our data repository, and for export from various services


"Make data crate again"
Liz Stokes 2017
Slide notes
I'll leave it with this slogan from our UTS data librarian and friend of eResearch, Liz Stokes.

Thanks to:

And the mainly Sydney-based metadata group who met in the lead-up to this work: Piyachat Ratana, Sharyn Wise, Michael Lynch, Craig Hamilton, Vicki Picasso, Gerry Devine, Katrin Trewin, Ingrid Mason, Peter Bugeia