Pick, Pack, Publish: Cr8it and Of The Web

2016-10-26

Here's a presentation that was given at eResearch Australasia in October 2014, but which was never put online. I've rescued this from Google Docs, and cleaned it up slightly.

Pick, Pack, Publish
cr8it and Of The Web
Peter Sefton, Peter Bugeia, Vicki Picasso

This presentation is about two complementary open source software products that have been produced by a consortium of partners including the University of Western Sydney, the University of Newcastle and Intersect Australia. These products are designed to bridge the gap between easily accessible dropbox.com-style working-data file sharing and synchronization, and the publishing and archiving of mature research data sets.

Cr8it (crate-it) is a file packaging and publishing application that lets users in just about any research discipline package data together with metadata, using Data Crates, in order to maximise its potential for future reuse. The crate format is designed to include as much metadata as possible to maximise data usefulness, plus the actual data payload. Cr8it is a plugin for the ownCloud open source file synchronization and sharing platform. ownCloud is becoming familiar to Australian researchers through the AARNet Cloudstor+ service.

Of The Web (OTW) is a toolkit for extracting metadata from generic and domain-specific file formats, and creating web previews. For example: the toolkit can be used for extracting time series data from proprietary formats. It can also be used to automatically generate summary web pages for data.

Cr8it and OTW are built on a wide range of open source componentry.

Researchers now need to have end-to-end data management plans in place for all data, and all data must be archived appropriately to comply with the various codes and funding arrangements under which researchers are operating. National funding initiatives and projects such as the Australian National Data Service’s Metadata Stores program have now provided both the infrastructure and the opportunity for development of capability within institutions. As eResearch and library professionals we have noted that just about every research group we deal with uses dropbox.com or a similar file share and synchronisation service; this class of service is clearly the “killer app” for distributed teams working with file-based data. We have also observed that there is a gap in eResearch infrastructure between working data on file shares and desktops, and the “proper” Research Data Repositories, eResearch tools, data capture tools and virtual laboratories now being established at universities.

Issues:

  • The gap, or gulf, between files and repositories
  • The increasing need for researchers to cite data, and the lack of easy routes to do this
  • The lack of decent low-end data packaging standards that are usable and accessible
  • The lack of plug-compatible, single-function solutions to build easy routes
  • The extreme range of data management use cases
  • The inability of big initiatives such as ANDS, NeCTAR and Cloudstor+ to solve the micro use cases researchers face every day
  • The lack of definitive and effective institutional name authority solutions for people, projects and data

This gap is a huge barrier for researchers, as most tools either require laborious uploading of data or force researchers to reorganise their data. Cr8it is designed to bridge that gap and eliminate the need to reorganise or move data, allowing it to be harvested, or ‘picked’, in situ. Our aim is to enable a seamless process, without disrupting the researcher’s workflow, to package research datasets, add metadata, and connect with data curation processes to publish a data description to the Research Data Commons and to appropriate discipline and other repositories. There are two main triggers for this: (a) publishing a research article with supporting data, and (b) archiving data at the end of a project.

Why are we here?

One of the main drivers for this work is that researchers need persistent identifiers such as Digital Object Identifiers (DOIs - see www.doi.org) for publicly available datasets.

We need automated software to orchestrate all the IDs and URLs for data sets and keep them up to date. Cr8it is the precursor to all of this; it is designed to be a way for a researcher to pick their data, package it, and then publish it.

OTW

First up we’ll talk about Of The Web (OTW). This is not to be confused with Off The Web.

Of The Web (OTW) is a plug-in framework for developing web-based previews for all kinds of files. In essence, it converts files that are stored on-the-web but which can’t be easily processed by-the-web (i.e. they need to be downloaded before they can be opened) into HTML file previews which are native to-the-web. At the moment OTW can generate previews for office formats (word processing documents and presentations) and create web previews of spreadsheets, as well as CSV export[2]. OTW can also extract embedded technical metadata from common image formats, and convert Markdown format to HTML. OTW has a plug-in architecture, which means the sky is the limit on the kinds of files which can have previews created.
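To make the plug-in idea concrete, here is a minimal sketch in Python of how a preview framework of this kind might be organised. It is not OTW's actual code or API: the registry `PREVIEWERS`, the `previewer` decorator and the `make_preview` function are all hypothetical names invented for this illustration, and only two trivial formats are handled.

```python
import csv
import html
import io
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry (not OTW's real API): maps a file extension to a
# function that turns the file's text into an HTML preview fragment.
PREVIEWERS: Dict[str, Callable[[str], str]] = {}


def previewer(*extensions: str):
    """Register a preview function for one or more file extensions."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        for ext in extensions:
            PREVIEWERS[ext.lower()] = func
        return func
    return register


@previewer(".csv")
def csv_preview(text: str) -> str:
    """Render CSV text as a plain HTML table, escaping every cell."""
    rows = csv.reader(io.StringIO(text))
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


@previewer(".txt")
def text_preview(text: str) -> str:
    """Render plain text inside a <pre> element."""
    return f"<pre>{html.escape(text)}</pre>"


def make_preview(filename: str, text: str) -> str:
    """Pick a plug-in by file extension, or fall back to a short notice."""
    handler = PREVIEWERS.get(Path(filename).suffix.lower())
    if handler is None:
        return f"<p>No preview available for {html.escape(filename)}</p>"
    return handler(text)


print(make_preview("readings.csv", "time,value\n0,1.5\n"))
```

Supporting a new file type in a design like this only means registering one more function, which is the property the plug-in architecture is meant to provide.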

Cr8it is OTW-aware. If OTW previews and metadata are present when Cr8it is building a data package, Cr8it will integrate the extracted metadata and previews into the README file of the data package.

This slide shows an example. This is a preview for a “Patch Clamp” abf file used in neurophysiology: http://www.moleculardevices.com/systems/conventional-patch-clamp/pclamp-10-software

Other people have the same idea...

The goals of the Brown Dog project, run by the US National Center for Supercomputing Applications (NCSA), are similar to those of Cr8it and OTW with regard to the accessibility of long-tail research data. Brown Dog focuses on data file discovery, the extraction of metadata from files, and data file conversion into compatible local formats. Commencing in October 2013, Brown Dog has a budget of $US10.5M over 5 years. Refer to http://browndog.ncsa.illinois.edu/.

Another similar project is Fondz by Ed Summers at the US Library of Congress. Fondz has a preservation focus. See https://github.com/edsu/fondz.

The Islandora project is also looking at this kind of thing, using the Taverna workflow system to orchestrate file format conversions as well as scientific workflows.

OTW is available on GitHub at https://github.com/uws-eresearch/otw

This slide shows an overview of the Pick/Pack/Publish process which we’ll go through over the next few slides.

Cr8it is for long tail research

Out on the long tail, researchers often work with feral files.

These files get used across multiple projects, and often the names of the files are not particularly informative.

Cr8it runs on ownCloud

ownCloud is a share/sync/see service which is similar to Dropbox. It can synchronise files from a user’s desktop to the cloud and show them in a web environment. In this screenshot, you can see a view of the same file as on the previous slide.

ownCloud is open source. Universities can run their own instance of the software and effectively create their own cloud-based file share.

We have modified the ownCloud Files application to allow users to add their shared data files and folders to a named crate.

Here’s the main cr8it editing screen for the crate named “Long-tail Dog Study”. This screenshot shows the files and folders that have been added to the crate. It is a shopping-cart-like table of contents. The actual ownCloud files and folders are not copied to the crate; only item references are added.

There’s a lot going on here.

Within a crate, the user can drag, drop and rename folder and file items without affecting the original ownCloud folder or file names. “Backlit Dogs” is much better than “05” and anything is better than “PA050109.JPG”. The new name is used in the HTML README file as metadata, but it does not change the file name of the exported file, as this may affect analysis code.
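The separation between a friendly title and an untouched file name can be sketched as a small data model. This is an illustration only, not Cr8it's implementation: the `Crate` and `CrateItem` classes, their methods and the example path are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CrateItem:
    source_path: str  # reference to the original file; never altered
    title: str        # human-friendly label, used only in the README


@dataclass
class Crate:
    name: str
    items: List[CrateItem] = field(default_factory=list)

    def add(self, source_path: str) -> CrateItem:
        """Add a reference to a file; the title defaults to the file name."""
        item = CrateItem(source_path, source_path.rsplit("/", 1)[-1])
        self.items.append(item)
        return item

    def rename(self, source_path: str, new_title: str) -> None:
        """Change the display title only; the stored path is left alone."""
        for item in self.items:
            if item.source_path == source_path:
                item.title = new_title
                return
        raise KeyError(source_path)

    def readme_entries(self) -> List[Tuple[str, str]]:
        """Pairs of (title, original path) for listing in a README."""
        return [(item.title, item.source_path) for item in self.items]


# Hypothetical example path, made up for this sketch.
crate = Crate("Long-tail Dog Study")
crate.add("Photos/05/PA050109.JPG")
crate.rename("Photos/05/PA050109.JPG", "Backlit dog")
print(crate.readme_entries())
```

Because the crate holds references rather than copies, renaming an item here cannot break analysis code that expects the original file name.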

Cr8it collects high-quality linked metadata. For things like names of people and grants, the system looks up a name authority. This means that we maximise opportunities to create links between datasets, publications and people.

We will be adding more linked lookups, including linking data to research context such as facilities and instruments.

This site allows an institutional user to set up public definitions of existing equipment and facilities, which can then be looked up. These definitions only need to be set up and described once.

Downloading & Publishing

Researchers can download their data in zip format, which we’ll look at in a minute, or ‘push’/publish the data to a repository.

The obvious question is: what if there’s a lot of data? Answer: we’re working on ways of publishing the data in situ so it does not have to be moved.

A brief note about standards: last year a subset of the authors of this paper (the ones called Peter) presented at eResearch 2013 on how we use the BagIt and Zip standards to make ‘data crates’ in the HIEv application. See http://ptsefton.com/2013/11/01/1944.htm/. [Update 2016-10-27 - pointed this at my own blog. Western Sydney site seems to have died.]

When the zip file is downloaded from cr8it, the crate data comes in the BagIt format. The data is laid out on disk in the same way as the original ownCloud files and folders, but with an HTML README file which travels with the data. For presentation purposes, the README is organised according to the structure of the crate, with better titles for files and directories if the researcher has taken the time to rename and re-organise them. The README also contains embedded linked (not necessarily open) data in both human and machine readable formats.
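As a rough illustration of what a zipped bag looks like, the Python sketch below builds one in memory using only the standard library. It follows the general BagIt layout (a `bagit.txt` declaration, a `data/` payload directory and a checksum manifest), but it is not Cr8it's code: the function name `build_bag_zip`, the choice of SHA-256 for the manifest and the placement of `README.html` at the top of the bag are all assumptions made for this example.

```python
import hashlib
import io
import zipfile
from typing import Dict


def build_bag_zip(bag_name: str, payload: Dict[str, bytes], readme_html: str) -> bytes:
    """Return the bytes of a zip file laid out as a simple BagIt bag.

    payload maps relative file paths (the original folder layout is kept)
    to file contents. This is a sketch, not a complete BagIt implementation.
    """
    buffer = io.BytesIO()
    manifest_lines = []
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Bag declaration: the tag file that marks this directory as a bag.
        archive.writestr(
            f"{bag_name}/bagit.txt",
            "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        )
        # Payload files go under data/, keeping their original relative paths.
        for rel_path, content in sorted(payload.items()):
            archive.writestr(f"{bag_name}/data/{rel_path}", content)
            digest = hashlib.sha256(content).hexdigest()
            manifest_lines.append(f"{digest}  data/{rel_path}")
        # Manifest: one "checksum  path" line per payload file.
        archive.writestr(
            f"{bag_name}/manifest-sha256.txt", "\n".join(manifest_lines) + "\n"
        )
        # Assumption: the README travels as a file at the top of the bag.
        archive.writestr(f"{bag_name}/README.html", readme_html)
    return buffer.getvalue()


# Hypothetical example content, made up for this sketch.
zip_bytes = build_bag_zip(
    "long-tail-dog-study",
    {"Photos/05/PA050109.JPG": b"not really a JPEG"},
    "<html><body><h1>Long-tail Dog Study</h1></body></html>",
)
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    print(archive.namelist())
```

The manifest is what lets a recipient check that every payload file arrived intact, which is the main reason to use a bag rather than a bare zip.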

Together with Of The Web, cr8it can also produce EPUB ebooks, which is something we’ll talk about in a separate presentation.

Collaborate with us
User group
groups.google.com/forum/#!forum/cr8it-users
Dev group
groups.google.com/forum/#!forum/cr8it-dev
</presentation>
Thanks to Veronica Luke, Ilya Anisimoff, Ibrahim Taoube, Stanley Hon, Rod Harrison, Karen El-Azzi, Lloyd Harischandra, Kim Heckenberg, Carmi Cronje, Andrew Leahy, David Clarke, Ton Dijkgraaf, Kai Chen