Introducing next year's model, the data-crate; applied standards for data-set packaging

2013-11-01

This is also up at the UWS eResearch blog

[Update 2013-11-04:

If you're reading this in Feedly and possibly other feed readers the images in this post won't show - click through to the site to see the presentation>

Added some more stuff from the proposal, including the reference list - clarified some quoted text]

Introducing next year’s model, the data-crate; applied standards for data-set packaging by Peter Seftton and Peter Bugeia is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License .

This presentation was delivered by Peter Sefton at eResearch Australasia 2013 in Brisbane, based on this proposal .

ABSTRACT

In this paper we look at current options available for storing research data to maximize potential reuse and discoverability, both at the level of data files, and sets of data files, and describe some original work bringing together existing standards and metadata schemas to make well-described, reusable data sets that can be distributed as single files, dubbed “crates” with as much context and provenance as possible. We look at some of the issues in choosing file formats in which to archive and disseminate data, and discuss techniques for adding contextual information which is both human-readable and machine-readable in the context of both institutional and discipline data management practice.

When the eResearch team at UWS and Intersect were working on the ANDS DC21 “HIEv” (5) application to allow researchers to create data-sets from collections of files, we looked in vain for a simple-to-implement solution for making CSV-type data available with as much provenance and re-use metadata as possible. In this presentation we will discuss some of the many file-packaging options which were considered and rejected including METS (6), and plain-old zip files with no metadata.

The Eucalyptus woodland free-air CO2 enrichment (EucFACE) facility is the only one of its kind in the southern hemisphere.

It is unique in that it provides full-height access to the mature trees within remnant Cumberland Plain Forest, the only FACE system in native forest anywhere in the world. It is sited on naturally low-nutrient soils in what is close to original bushland, and offers researchers an amazing site at which to study the effects of elevated CO2 on water use, plant growth, soil processes and native biodiversity in a mature, established woodland within the Sydney Basin.

http://www.uws.edu.au/hie/research/research_projects/eucface

This is in the context of the Hawkesbury Institute For the Environment,(HIE) experimental facility, pictured is the Free-Air-Co2 exchange experiment ( EucFACE) under construction.

This is the context in which we did this data-packaging work, but it is designed to be more broadly applicable.

What if provide a zip download of a whole lot of environment-data files and someone writes and important article, but then they can’t work out which zip file and which data files they actually used?

What if there’s some really important data that I know I have on my hard-disk but I can’t tell which file it’s in ‘cos they’re all called stuff like 34534534-er2.csv?

We have reached the time when there is a genuine need to be able to match-up data from different sources; infrastructure projects funded by the Australian National Data Service (ANDS) (4) are now feeding human-readable metadata descriptions to the Research Data Australia (RDA) website. But which standards to use? As Tanenbaum said, “The nice thing about standards is that you have so many to choose from. Furthermore, if you do not like any of them, you can just wait for next year's model” (1). However, when it comes to choosing file format standards for research data, we have found that while there might be many standards there is no single standard for general-purpose research data packaging. It is, however possible to stitch-together a number of different standards to do a reasonable job of packaging and describing research data for archiving and reuse.

There are several issues with standards at the file level. For example consider one of the most commonly supported formats: CSV – or Comma Separated Values. CSV file is actually a non-standard, ie there is no agreed CSV specification, only a set of unreliable conventions used by different software, RFC 4180 (2) notwithstanding. While a CSV file has column headers, there is no way to standardise their meaning. Moving up the complexity chain, the Microsoft Excel based .xslx format is a standard, as is the Open Document Format for spreadsheets but again, even though you can point to a header-row in a spreadsheet and say “that’s the header” there is no standard way to label variables in a way that will match with the labels used by other researchers, or to allow discovery of the same kind of data points in hetrogenous data sets. There is a well established standard which does allow for “self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data”, NetCDF (3) – we will consider how this might be more broadly adopted in eResearch contexts.

Can you guess which two standards are the basis for the crate?

When the eResearch team at UWS and Intersect NSW were working on the ANDS DC21 “HIEv” (5) application to allow researchers to create data-sets from collections of files, we looked in vain for a simple-to-implement solution for making CSV-type data available with as much provenance and re-use metadata as possible, as per the principles outlined above. In this presentation we will discuss some of the many file-packaging options which were considered and rejected including METS (6), and plain-old zip files with no metadata. The project devised a new proof-of-concept specification, known as a ‘crate’, based on a number of standards,. This format:

Uses the California Digital Libraries Bagit specification(7) for bundling files together into a bag.

Creates a single-file for the bag using zip (other contenders would include TAR or disk image formats but zip is widely supported across operating systems and software libraries).

Uses a human-readable HTML README file to make apparent as much metadata as is available from (a) within files and (b) about the context of the research data.

Uses RDF with the W3C’s DCAT ontology (8) and others to add machine readable metadata about the package including relationships between files, technical metadata such as types and sizes and research context

The following few slides from the DC21/HIEv ssystem show how a user can select some files...

... look at file metadata ...

... add files to a cart ...

... download the files in a zip package ...

... inside the zip the files are structured using the bagit format ...

... with a standalone README.html file containing all the metadata we know about the files and associated research context (experiments, facilites)

This is something you can unzip on your laptop, put on a web server, or a repository could show to users as a ‘peek’ inside the data set

... with detail about every file as per the HIEv application itself

... and embedded machine readable metadata using RDFa

... the RDFa metadata describes the data-set as a graph.

Completed packages flow-through to the Research Data Catalogue via an OAI-PMH feed, and there they are given a DOI so they can be cited. The hand-off between systems is important, once a DOI is issued the data set has to be kept indefinitely and must not be changed.

The README file not only contains human readable descriptions of the files and their context there is embedded machine readable metadata. The relationships such as “CreatedBy” use URIs from mainstream ontologies where possible.

We have not done this yet, but using platorms like R-Studio + Knitr it would be possiblet to include runnable-code in data packages, which would provide a ‘literate programming’ readme. This is an example of some data we got from Craig Barton and Remko Duursma.

So the README could include plots, etc, and a copy of the article

Cr8it is designed to plug in to the ownCloud share-sync service so users can compile data sets from working data file for deposit in a repository.

The HIE project is (in part) a simple semantic CMS system that will describe the research context at HIE.

Try this in more places

Integrate research context

Continue quest for decent ontologies and vocabs

Get feedback

**REFERENCES**

1. Tanenbaum AS. Computer networks. Prentice H all PTR (ECS Professional). 1988;1(99):6.

2. <ietf@shaftek.org> YS. Common Format and MIME Type for Comma-Separated Values (CSV) Files [Internet]. [cited 2013 Jun 8]. Available from: http://tools.ietf.org/html/rfc4180

3. Rew R, Davis G. NetCDF: an interface for scientific data access. Computer Graphics and Applications, IEEE. 1990;10(4):76–82.

4. Sandland R. Introduction to ANDS [Internet]. ANDS; 2009. Available from: http://ands.org.au/newsletters/newsletter-2009-07.pdf

5. Intersect. Data Capture for Climate Change and Energy Research: HIEv (AKA DC21) [Internet]. Sydney, Australia; 2013. Available from: http://eresearch.uws.edu.au/blog/projects/data-capture-for-climate-change-and-energy-research/

6. Pearce J, Pearson D, Williams M, Yeadon S. The Australian METS Profile–A Journey about Metadata. D-Lib Magazine. 2008;14(3/4):1082–9873.

7. Kunze J, Boyko A, Vargas B, Madden L, Littman J. The BagIt File Packaging Format (V0.97) [Internet]. [cited 2013 Mar 1]. Available from: http://tools.ietf.org/html/draft-kunze-bagit-06

8. Maali F, Erickson J, Archer P. Data Catalog Vocabulary (DCAT) [Internet]. World Wide Web Consortium; Available from: http://www.w3.org/TR/vocab-dcat/

[ptsefton.com] | [CV & Bio]

Introducing next year's model, the data-crate; applied standards for data-set packaging

2013-11-01

Peter Sefton* p.sefton@uws.edu.au

Peter Bugeia** peter.bugeia@intersect.org.au

*University of Western Sydney

**Intersect Australia Ltd

\

\

\

\