All posts by ptsefton

Notes on ownCloud robustness

I’m on my way to a meeting at Intersect about the next phase of the Cr8it data packaging and publishing project. Cr8it is an ownCloud plugin, and ownCloud is widely regarded as THE open source dropbox-like service, but it is not without its problems.

Dropbox has been a huge hit, a killer app with what I call powers to "Share, Sync & See". Syncing between devices, including mobile (where it’s not really syncing) is what made Dropbox so pervasive, giving us a distributed file-system with almost frictionless sharing via emailed requests, with easy signup for new users. The see part refers to the fact that you can look at your stuff via the web too. And there is a growing ecosystem of apps that can use Dropbox as an underlying distributed filesystem.

ownCloud is (amongst other things) an open source alternative to Dropbox.com’s file-sync service. A number of institutions and service providers in the academic world are now looking at it because it promises some of the killer-app qualities of Dropbox in an open source form, meaning that, if all goes well, it can be used to manage research data, on local or cloud infrastructure, at scale, with the ease of use and virality of Dropbox. If all goes well.

There are a few reasons Dropbox and other commercial services are not great for a university:

  • We need to be able to control where data are stored and have the flexibility to bring data close to large facilities. This is why CERN have the largest ownCloud test lab in the world, so I’ve heard.

  • It is important to be able to write applications such as Cr8it without being beholden to a company like Dropbox.com, Apple, Google or Microsoft, which can approve or deny access to their APIs at their pleasure and can change or drop the underlying product. (Google seem to pose a particular risk in this department: they play fast and loose with products like Google Docs, dumping features when it suits them.)

But ownCloud has some problems. The ownCloud forum is full of people saying, "tried this out for my company/workgroup/school. Showed promise but there’s too many bugs. Bye." At UWS eResearch we have been using it more or less successfully for several months, and have experienced some fairly major issues to do with case-sensitivity and other incompatibilities between the various file systems on Windows, OS X and Linux.

From my point of view as an eResearch manager, I’d like to see the emphasis at ownCloud be on getting the core share-sync-see stuff working, and then on getting a framework in place to support plugins in a robust way.

What I don’t want to see is more of this:

Last week, the first version of OwnCloud Documents was released as a part of OwnCloud 6. This incorporates a subset of editing features from the upstream WebODF project that is considered stable and well-tested enough for collaborative editing.

We tried this editor at eResearch UWS as a shared scratchpad in a strategy session and it was a complete disaster: our browsers kept losing contact with the document, and when we tried to copy-paste the text to safety it turned out that copying text is not supported. In the end we had to rescue our content by copying HTML out of the browser and stripping out the tags.

In my opinion, ownCloud is not going to reach its potential while the focus remains on getting shiny new stuff out all the time. Far from making ownCloud shine, every broken app like this editor tarnishes its reputation substantially. By all means release these things for people to play with, but the ownCloud team needs to have a very hard think about what they mean by "stable and well tested".

Along with others I’ve talked to in eResearch, I’d like to see work at owncloud.com focus on:

  • Define sync behaviour in detail, complete with automated tests, and have a community-wide push to get the ongoing sync problems sorted. For example, fix this bug reported by a former member of my team, along with several others to do with differences between file systems.

  • Create a standard way to generate and store file derivatives such as image thumbnails or HTML document previews, as well as additional file metadata. At the moment plugins are left to their own devices, so there is no way for apps to reliably access each other’s data. I have put together a simple alpha-quality framework for generating web-views of things via the file system, Of the Web, but I’d really like to be able to hook it in to ownCloud properly (see the sketch after this list).

  • Get the search onto a single index rather than the current approach of having an index per user. Something like Elasticsearch, Solr or Lucene could easily handle a single metadata-and-text index with information about sharing, with changes to files on the server fed to the indexer as they happen.

  • [Update 2014-04-11] Get the sync client to handle connecting to multiple ownCloud servers; in academia we will definitely have researchers wanting to use more than one service, e.g. AARNet’s Cloudstor+ and an institutional ownCloud. (Not to mention proper Dropbox integration.)
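To make the derivatives point above concrete, here is a minimal sketch of the kind of shared convention I have in mind: one agreed directory per file for derivatives, plus a small JSON sidecar that any plugin can read. This is not ownCloud’s plugin API – the layout, the names and the Pillow-based thumbnail are all my own assumptions.

import json
import os

from PIL import Image  # Pillow, used here only to illustrate thumbnail generation

DERIVATIVE_ROOT = ".derivatives"  # hypothetical agreed location inside a synced folder

def derivative_dir(data_root, relpath):
    """Return (and create) the agreed place for one file's derivatives."""
    d = os.path.join(data_root, DERIVATIVE_ROOT, relpath)
    os.makedirs(d, exist_ok=True)
    return d

def add_thumbnail(data_root, relpath, size=(200, 200)):
    """Generate a thumbnail and record it in a JSON sidecar other apps can read."""
    outdir = derivative_dir(data_root, relpath)
    thumb_path = os.path.join(outdir, "thumbnail.png")
    image = Image.open(os.path.join(data_root, relpath))
    image.thumbnail(size)
    image.save(thumb_path)
    with open(os.path.join(outdir, "derivatives.json"), "w") as sidecar:
        json.dump({"source": relpath, "thumbnail": "thumbnail.png"}, sidecar, indent=2)
    return thumb_path

The point is not this particular code; it is that every plugin would write to, and read from, the same predictable place instead of inventing its own.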

Creative Commons License
Notes on ownCloud robustness by Peter Sefton is licensed under a Creative Commons Attribution 4.0 International License

Introducing next year’s model, the data-crate; applied standards for data-set packaging

This is also up at the UWS eResearch blog

[Update 2013-11-04:

If you're reading this in Feedly and possibly other feed readers the images in this post won't show - click through to the site to see the presentation

Added some more stuff from the proposal, including the reference list - clarified some quoted text]

Creative Commons Licence
Introducing next year’s model, the data-crate; applied standards for data-set packaging by Peter Sefton and Peter Bugeia is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

This presentation was delivered by Peter Sefton at eResearch Australasia 2013 in Brisbane, based on this proposal.

Slide 1

Peter Sefton* p.sefton@uws.edu.au

Peter Bugeia** peter.bugeia@intersect.org.au

*University of Western Sydney

**Intersect Australia Ltd

ABSTRACT

In this paper we look at current options available for storing research data to maximize potential reuse and discoverability, both at the level of data files, and sets of data files, and describe some original work bringing together existing standards and metadata schemas to make well-described, reusable data sets that can be distributed as single files, dubbed “crates” with as much context and provenance as possible. We look at some of the issues in choosing file formats in which to archive and disseminate data, and discuss techniques for adding contextual information which is both human-readable and machine-readable in the context of both institutional and discipline data management practice.


Slide 2

When the eResearch team at UWS and Intersect were working on the ANDS DC21 “HIEv” (5) application to allow researchers to create data-sets from collections of files, we looked in vain for a simple-to-implement solution for making CSV-type data available with as much provenance and re-use metadata as possible. In this presentation we will discuss some of the many file-packaging options which were considered and rejected including METS (6), and plain-old zip files with no metadata.

The Eucalyptus woodland free-air CO2 enrichment (EucFACE) facility is the only one of its kind in the southern hemisphere.

It is unique in that it provides full-height access to the mature trees within remnant Cumberland Plain Forest, the only FACE system in native forest anywhere in the world. It is sited on naturally low-nutrient soils in what is close to original bushland, and offers researchers an amazing site at which to study the effects of elevated CO2 on water use, plant growth, soil processes and native biodiversity in a mature, established woodland within the Sydney Basin.

http://www.uws.edu.au/hie/research/research_projects/eucface

This is in the context of the Hawkesbury Institute for the Environment (HIE) experimental facility; pictured is the Free-Air CO2 Enrichment experiment (EucFACE) under construction.


Slide 3

This is the context in which we did this data-packaging work, but it is designed to be more broadly applicable.


What keeps us awake at night?

What if we provide a zip download of a whole lot of environment-data files and someone writes an important article, but then they can’t work out which zip file and which data files they actually used?

What if there’s some really important data that I know I have on my hard-disk but I can’t tell which file it’s in ‘cos they’re all called stuff like 34534534-er2.csv?


Some standards are not actually standards…

We have reached the time when there is a genuine need to be able to match up data from different sources; infrastructure projects funded by the Australian National Data Service (ANDS) (4) are now feeding human-readable metadata descriptions to the Research Data Australia (RDA) website. But which standards to use? As Tanenbaum said, “The nice thing about standards is that you have so many to choose from. Furthermore, if you do not like any of them, you can just wait for next year’s model” (1). However, when it comes to choosing file format standards for research data, we have found that while there might be many standards there is no single standard for general-purpose research data packaging. It is, however, possible to stitch together a number of different standards to do a reasonable job of packaging and describing research data for archiving and reuse.

There are several issues with standards at the file level. For example, consider one of the most commonly supported formats: CSV, or Comma Separated Values. CSV is actually a non-standard, i.e. there is no agreed CSV specification, only a set of unreliable conventions used by different software, RFC 4180 (2) notwithstanding. While a CSV file may have column headers, there is no way to standardise their meaning. Moving up the complexity chain, the Microsoft Excel based .xlsx format is a standard, as is the Open Document Format for spreadsheets, but again, even though you can point to a header-row in a spreadsheet and say “that’s the header”, there is no standard way to label variables in a way that will match the labels used by other researchers, or to allow discovery of the same kind of data points in heterogeneous data sets. There is a well-established standard which does allow for “self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data”, NetCDF (3) – we will consider how this might be more broadly adopted in eResearch contexts.
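To make “self-describing” concrete, here is a minimal sketch using the netCDF4 Python library; the variable names, units and values are invented for illustration. The point is that the units and descriptions travel inside the file with the data, rather than living in a separate document or an ambiguous header row.

from netCDF4 import Dataset

nc = Dataset("co2_example.nc", "w", format="NETCDF4")
nc.createDimension("time", None)  # unlimited dimension

time = nc.createVariable("time", "f8", ("time",))
time.units = "hours since 2013-01-01 00:00:00"

co2 = nc.createVariable("co2_concentration", "f4", ("time",))
co2.units = "ppm"
co2.long_name = "Atmospheric CO2 concentration (example values only)"

time[:] = [0.0, 1.0, 2.0]
co2[:] = [390.1, 391.4, 390.8]
nc.close()

Any NetCDF-aware tool can now discover what the variables mean without guessing.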


Slide 6

Data Packaging Principles for this environment…

Slide 8

2. The packaging format should deal with any kind of data file

3. The packaging format should work for any domain

4. The packaging format should be platform neutral

Slide 12

6. Metadata should be both human and machine-readable

Slide 14

8. 	The package format should cater for datasets of any size*

The Crate Specification…

Slide 17

Can you guess which two standards are the basis for the crate?


Slide 18

When the eResearch team at UWS and Intersect NSW were working on the ANDS DC21 “HIEv” (5) application to allow researchers to create data-sets from collections of files, we looked in vain for a simple-to-implement solution for making CSV-type data available with as much provenance and re-use metadata as possible, as per the principles outlined above. In this presentation we will discuss some of the many file-packaging options which were considered and rejected, including METS (6), and plain-old zip files with no metadata. The project devised a new proof-of-concept specification, known as a ‘crate’, based on a number of standards (a rough packaging sketch follows the list below). This format:

Uses the California Digital Library BagIt specification (7) for bundling files together into a bag.

Creates a single file for the bag using zip (other contenders would include TAR or disk image formats, but zip is widely supported across operating systems and software libraries).

Uses a human-readable HTML README file to make apparent as much metadata as is available from (a) within files and (b) about the context of the research data.

Uses RDF with the W3C’s DCAT ontology (8) and others to add machine-readable metadata about the package, including relationships between files, technical metadata such as types and sizes, and research context.
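As a rough illustration of the first two points, the sketch below uses the bagit-python library and the Python standard library to bag a directory and zip the result. It is emphatically not the HIEv implementation: the metadata fields are placeholders and the README.html and RDF steps are left as comments.

import shutil

import bagit  # Library of Congress bagit-python

def make_crate(data_dir, crate_name):
    """Turn a directory of data files into a zipped BagIt bag - a very rough 'crate'."""
    # Rearranges data_dir in place into BagIt layout (data/, manifests, bag-info.txt).
    bagit.make_bag(data_dir, {
        "Source-Organization": "University of Western Sydney (placeholder)",
        "External-Description": "Example data crate (placeholder)",
    })
    # A README.html and RDF/DCAT metadata would be generated and added alongside the payload here.
    return shutil.make_archive(crate_name, "zip", data_dir)

# make_crate("my-dataset", "my-dataset-crate")  ->  "my-dataset-crate.zip"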


Slide 19

The following few slides from the DC21/HIEv system show how a user can select some files…


Slide 20


… look at file metadata …


Slide 21

… add files to a cart …


Slide 22


… download the files in a zip package …


Slide 23


… inside the zip the files are structured using the bagit format …


Slide 24

… with a standalone README.html file containing all the metadata we know about the files and associated research context (experiments, facilities).

This is something you can unzip on your laptop, put on a web server, or a repository could show to users as a ‘peek’ inside the data set


Slide 25


… with detail about every file as per the HIEv application itself


Slide 26

… and embedded machine readable metadata using RDFa


Slide 27

… the RDFa metadata describes the data-set as a graph.

Completed packages flow through to the Research Data Catalogue via an OAI-PMH feed, and there they are given a DOI so they can be cited. The hand-off between systems is important: once a DOI is issued the data set has to be kept indefinitely and must not be changed.


Slide 28

The README file not only contains human-readable descriptions of the files and their context, it also embeds machine-readable metadata. Relationships such as “CreatedBy” use URIs from mainstream ontologies where possible.


Slide 29

We have not done this yet, but using platforms like RStudio + knitr it would be possible to include runnable code in data packages, which would provide a ‘literate programming’ README. This is an example of some data we got from Craig Barton and Remko Duursma.


Slide 30

So the README could include plots, etc, and a copy of the article


Slide 31

Cr8it is designed to plug in to the ownCloud share-sync service so users can compile data sets from working data files for deposit in a repository.

The HIE project is (in part) a simple semantic CMS that will describe the research context at HIE.


What’s next?

Try this in more places

Integrate research context

Continue quest for decent ontologies and vocabs

Get feedback

REFERENCES

1. Tanenbaum AS. Computer networks. Prentice Hall PTR (ECS Professional). 1988;1(99):6.

2. Shafranovich Y. Common Format and MIME Type for Comma-Separated Values (CSV) Files [Internet]. [cited 2013 Jun 8]. Available from: http://tools.ietf.org/html/rfc4180

3. Rew R, Davis G. NetCDF: an interface for scientific data access. Computer Graphics and Applications, IEEE. 1990;10(4):76–82.

4. Sandland R. Introduction to ANDS [Internet]. ANDS; 2009. Available from: http://ands.org.au/newsletters/newsletter-2009-07.pdf

5. Intersect. Data Capture for Climate Change and Energy Research: HIEv (AKA DC21) [Internet]. Sydney, Australia; 2013. Available from: http://eresearch.uws.edu.au/blog/projects/data-capture-for-climate-change-and-energy-research/

6. Pearce J, Pearson D, Williams M, Yeadon S. The Australian METS Profile–A Journey about Metadata. D-Lib Magazine. 2008;14(3/4):1082–9873.

7. Kunze J, Boyko A, Vargas B, Madden L, Littman J. The BagIt File Packaging Format (V0.97) [Internet]. [cited 2013 Mar 1]. Available from: http://tools.ietf.org/html/draft-kunze-bagit-06

8. Maali F, Erickson J, Archer P. Data Catalog Vocabulary (DCAT) [Internet]. World Wide Web Consortium; Available from: http://www.w3.org/TR/vocab-dcat/

Questions for the Australian (library) Repository Community

I am at the CAUL Repository Community Days meeting, and along with a few other Antipodeans will be giving an update on the Open Repositories 2013 conference. I gave a report in a previous post, but I thought I’d do another summary with the benefit of a couple more months of hindsight, and ask a couple of questions of the Australian repo community. Comments are open!

Reproducibility, reuse and friends

As Rick Jelliffe commented on my post, we need to get stuff happening, rather than making excuses.

For controversial research, I think we have passed the point where “Use, Reuse, Reproduce is worthy, but I think maybe we’re not there yet” is politically viable, however true it may be.

So what are we (the library community) going to do to help?

What’s our ‘Developer Challenge’?

The developer challenge event needs looking after. In the years that we could rely on JISC to show up to Open Repositories with facilitators like David ‘Flanders’ Flanders or Mahendra Mahey, and journalists to document stuff, things went pretty well. But this year JISC were not in a position to put in so many resources, many developers apparently didn’t make the trip, and things were a little more low-key. So, the conference committee has decided to get serious about it: there is now officially a “Developer Track” with a “Developer Track Chair”. Guess who that is.

And where were the Aussie and Kiwi developers? In 2008 in Southampton there were enough of us to make a four-person developer challenge entry with some left over; this year I had to beg the only other developer from our part of the world, Andrea Schweer (NZ), to enter the dev competition, which she did, and managed to produce something useful for the community.

But never mind Canada, where are the developers back home? In many cases they are no longer in the library, helping with the repository. And there is no more CAIRSS repository support service and no more CAIRSS technical officer. So how is the library, as custodian of the Publication Repository and (maybe) the Data Repository for the institute, going to do what needs to be done as we reinvent the scholarly process?

Is our “Developer Challenge” getting access to developers in the first place?

Creative Commons Licence
Questions for the Australian (library) Repository Community by Peter Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

Round table on vocabularies for describing research data: where’s my semantic web?

[UPDATE: Fixed some formatting]

Creative Commons Licence
Round table on vocabularies for describing research data: where’s my semantic web? by Peter Sefton is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Summary: in this post I talk about an experimental semantic website for describing what I’m calling ‘research context’, wondering if such a site can be used as a ‘source of truth’ for metadata entry, for example when someone is uploading a file into a research data repository. The post assumes some knowledge of linked data and RDF and/or an interest in eResearch software architecture.

Thanks to twitter correspondents Jodi Schneider, Kristi Holmes and Christopher Gutteridge.

On Friday 7th September I attended a meeting at Intersect about metadata vocabularies for managing research data, in the context of projects sponsored by the Australian National Data Service (ANDS). Ingrid Mason asked me to talk about my experiences describing research data. I approached this by starting with a run-through of the talk Peter Bugeia and I put together for Open Repositories with an emphasis on our attempts to use Linked Data principles for metadata. In this work we encountered two big problems, which I brought to the round-table session as questions.

  1. It’s really hard to work out which ontology, or set of vocabulary terms, to use to describe research context. Take ‘experiment’: what is a good linked data term for that?

    Q. What to use as a URI for an experiment?

  2. In trying to build linked-data systems I have not found any easy-to-use tools. (I got lots of useful leads from Kristi Holmes and Jodi Schneider on Twitter, more on that below).

    Q. Where’s my semantic web!

Answers at the end of the post, but you have to read the whole thing anyway.

The problem I’m working on at the moment with colleagues at the University of Western Sydney is how we can provide a framework for metadata about research data. We’re after efficient interfaces for researchers to contextualise research data sets, across lots of different research domains where the research context looks quite different.

For example, take the HIEv system at the Hawkesbury Institute for the Environment (HIE). HIEv is basically a file-repository for research data files. It has information about each file (size, type, date range etc) and contextual metadata about the research context, in this case using a two-part hierarchy: Facility / Experiment, where facilities are associated with multiple experiments and files are associated with experiments. Associating a data file with research context is easy in HIEv because it’s built in to the system. A human or machine uploading a data file associates it with an experiment using a form or a JSON data structure respectively. The framework for describing research context is built in to the application, and the data lives in its internal database.
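For illustration only, an upload from a script might look something like the sketch below; the endpoint path, field names and token header are my guesses at the general shape of such an API, not the actual HIEv interface.

import requests

HIEV_URL = "https://hiev.example.edu.au/data_files/api_create"  # hypothetical endpoint
API_TOKEN = "secret-token"  # hypothetical credential

metadata = {
    "experiment_id": "42",  # ties the file to a Facility/Experiment in the hierarchy
    "start_time": "2013-06-01",
    "end_time": "2013-06-30",
    "description": "Example sensor data upload",
}

with open("GHS30_EXP1_PROJ2_RAW_20130601-20130630.csv", "rb") as f:
    response = requests.post(
        HIEV_URL,
        data=metadata,
        files={"file": f},
        headers={"X-Api-Token": API_TOKEN},
    )
response.raise_for_status()
print(response.json())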

This approach works well, until:

  1. We try to re-use the software behind HIEv in another context, maybe one where the research domain does not centre on facilities, or experiment is not quite the right concept, or the model needs to be further elaborated.

    Example: In the MyTardis project, a development team added an extra element to that package’s research hierarchy – porting the application to new domains means substantial rework. See this message on their mailing list.

  2. We want to re-use the same contextual descriptions to describe research data in another system where we are faced with either programming a whole new framework for the same context, or adding a new interface for our new system to talk to the research context framework in the old one.

    Example: At HIE, with the help of some computing students, Gerry Devine and I are exploring the use of ownCloud (the dropbox-like Share/Sync/See application) to manage working files, with a simple forms interface to add them to HIEv. As it stands the students have to replicate the Facility/Experiment data in their system, meaning they are hard-coding Facility/Experiment hierarchies into HTML forms.


Gerry Devine and I have been sketching an architecture designed to help out in both of these situations. The idea is to break out the description of the research context into a well-structured application. This temporary site of Gerry’s shows what it might look like in one aspect: a web site which describes stuff at HIE; facilities and their locations, experiments taking place at those facilities, and projects. The question we’re exploring is: can we maintain a description of the HIE research context in one place, such as an institute research site or wiki, and have our various data-management applications use that context, rather than having to build the same research-context framework into each app and populate it with lists of values? Using a human-readable website as the core home for research context information is appealing because it solves another problem: getting some much-needed documentation on the research happening at our organisation online.

Here’s an interaction diagram showing what might transpire when a researcher wants to use a file management application, such as ownCloud (app) to upload some data to HIEv, the working data repository at the institute:


We don’t have much of this implemented, but last week I had a play with the research context website part of the picture (the system labelled ‘web’ in the above diagram). I wanted to see if I could create a web site like the one Gerry made, but with added semantics, so that when an application, like an ownCloud plugin, asked ‘gimme research context’, it could return a list of facilities, experiments and projects in machine-readable form.

For a real institute- or organisation-wide research context management app, you’d want to have an easy to use point-and-click interface, but for the purposes of this experiment I decided to go with one of the many markdown-to-HTML tools. See this page which summarises why you’d want to use one and lists an A-Z of alternatives. This is the way many of the cool kids make their sites these days – they maintain pages as markdown text files, kept under version control, and run a script to spit out a static website. Probably the best-known of these is Jekyll, which is built in to GitHub. I chose Poole because it’s Python, a language in which I can get by, and it is super-simple, and this is after all just an experiment.

So, here’s what a page looks like in Markdown. The top part of the file, up to ‘---’, is metadata which can be used to lay out the page in a consistent way. Below the line is structured markup. # means “Heading level 1” (h1), ## is ‘h2’ and so on.

title: Glasshouse S30
long: 150.7465
lat:  -33.6112
typeOf: @facility
full_name: Glasshouse facility at UWS Hawkesbury building S30
code: GHS30
description: Glasshouse in the S-precinct of the University of Western Sydney, Hawkesbury Campus, containing eight naturally lit and temperature-controlled compartments (3 x 5 x 3.5m, width x length x height). This glasshouse is widely used for short-term projects, often with a duration of 2-3 months. Air temperature is measured and controlled by an automated system at user-defined targets (+/- 4 degrees C) within each compartment. The concentration of atmospheric carbon dioxide is controlled within each compartment using a system of infrared gas analyzers and carbon dioxide injectors. Supplementary lighting will be installed in 2013.
---

Contact: Renee Smith (technician, R.Smith@uws.edu.au), John Drake (Post-doc, je.drake@uws.edu.au), Mike Aspinwall (Post-doc, m.aspinwall@uws.edu.au).

# References: 

Smith, R. A., J. D. Lewis, O. Ghannoum, and D. T. Tissue. 2012. Leaf structural responses to pre-industrial, current and elevated atmospheric CO2 and temperature affect leaf function in Eucalyptus sideroxylon. Functional Plant Biology 39:285-296.

Ghannoum, O., N. G. Phillips, J. P. Conroy, R. A. Smith, R. D. Attard, R. Woodfield, B. A. Logan, J. D. Lewis, and D. T. Tissue. 2010. Exposure to preindustrial, current and future atmospheric CO2 and temperature differentially affects growth and photosynthesis in Eucalyptus. Global Change Biology 16:303-319.



# Data organisation overview

There have been a large number of relatively short-duration experiments in the Glasshouse S30 facility, often with multiple nested projects within each experiment.  The file naming convention captures this hierarchy.



# File Naming Convention

Convention: GHS30_<EXPERIMENT>_<PROJECT>_<VARIABLE COLLECTION CODE>_<DATA PROCESSING>_<DATE or DATERANGE>[_<VERSION>].<filetype>

The resulting HTML looks like this:

But wait, there’s more! Inside the human-readable HTML page is some machine-readable code to say what this page is about, using linked-data principles. The best way I have found to describe a facility is using the Eagle-I ontology, where I think the appropriate term for what HIE calls a facility is ‘core-laboratory’. You can browse the ontology and tell me if I’m right. This says that the glasshouse facility is a type of core-laboratory.

<section
resource="http://uws.edu.au/facilities/glasshouse-s30.html"
typeof="http://vivoweb.org/ontology/core#CoreLaboratory">

<h1 property="dc:title">Glasshouse facility at UWS Hawkesbury building S30</h1>

(I’m not an RDF expert so if I have this wrong somebody please tell me! And yes, I know there are issues to consider here: what URIs should we use for naming facilities and other contextual things? Should we use Handles? PURLs? Plain old URLs like the one above?)


The code that produced this snippet is really simple, but I did have to code it:


def hook_postconvert_semantics():
    # Poole post-convert hook: `pages` and `types` are provided by the Poole build script.
    for p in pages:
        if p.typeOf is not None:
            # Wrap each typed page in an RDFa section pointing at its research-context URI.
            p.html = ("\n\n<section resource='http://hie.uws.edu.au/research-context/%s' "
                      "typeof='%s'>\n\n%s\n\n</section>\n\n"
                      % (p.url, types[p.typeOf], p.html))

Now, the part that I’m quite excited about is that if you point an RDFa distiller at this you get the following. This is JSON-LD format, which is (sort of) RDF wrapped up in JSON. Part-time programmers like me often find RDF difficult to deal with, but everyone loves JSON: you can slurp it up into a variable in your language of choice and access the data using native idioms.

{
    "@context": {
        "dcterms": "http://purl.org/dc/terms/"
    }, 
    "@graph": [
      
        {
            "@id": "facilities/glasshouse-s30.html", 
            "@type": "http://vivoweb.org/ontology/core#CoreLaboratory", 
            "http://www.w3.org/2003/01/geo/wgs84_pos#long": {
                "@value": "150.7465", 
                "@language": "en"
            }, 
            "dcterms:title": {
                "@value": "Glasshouse facility at UWS Hawkesbury building S30", 
                "@language": "en"
            }, 
            "http://www.w3.org/2003/01/geo/wgs84_pos#lat": {
                "@value": "-33.6112", 
                "@language": "en"
            }
        }
    ]
}

That might look horrible to some, but it should be easy for our third-year comp-sci students to deal with. Iterate over the items in the @graph array, find those where @type is equal to "http://vivoweb.org/ontology/core#CoreLaboratory", get the title, and build a drop-down list for the user, to associate their data file with this facility (using the ID). This potentially lets us de-couple our file management app from our HIEv repository and from our Research Data Repository, and let them all share the same ‘source of truth’ about research context. In library terms, my hacked-up version of Gerry’s website is acting as a name-authority for entities in the HIE space.
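That iteration is short enough to show. Below is a minimal sketch which assumes the JSON-LD above has already been fetched and parsed into a Python dict; the function name is mine.

CORE_LAB = "http://vivoweb.org/ontology/core#CoreLaboratory"

def facility_options(jsonld):
    """Return (uri, label) pairs for every core laboratory in a JSON-LD graph."""
    options = []
    for item in jsonld.get("@graph", []):
        if item.get("@type") == CORE_LAB:
            title = item.get("dcterms:title", {})
            options.append((item["@id"], title.get("@value", item["@id"])))
    return options

# facility_options(parsed) ->
#   [("facilities/glasshouse-s30.html",
#     "Glasshouse facility at UWS Hawkesbury building S30")]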

There is a lot more to cover here, including how experiments are associated with facilities, and how, when a user publishes a data set from HIEv, a file can be linked to a facility/experiment combination using the relation “wasGeneratedBy” from the World Wide Web Consortium’s PROV (provenance) ontology.
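As a sketch of what that link might look like as RDF, here are a few lines of rdflib; the file and experiment URIs are placeholders, and prov:wasGeneratedBy is the PROV term mentioned above.

from rdflib import Graph, Namespace, URIRef

PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("prov", PROV)
data_file = URIRef("http://hie.uws.edu.au/data/example-file.csv")  # placeholder URI
experiment = URIRef("http://hie.uws.edu.au/research-context/experiments/example-1")  # placeholder URI
g.add((data_file, PROV.wasGeneratedBy, experiment))

print(g.serialize(format="turtle"))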

As I noted above, the markdown-based approach is not going to work for some user communities. What is needed to support this general design pattern, assuming that one would want to, is some kind of combination of a research-context database application and a web content management system (CMS). A few people, including Jodi Schneider, suggested I look at Drupal, the open source CMS. Drupal does ‘do’ RDF, but not without some serious configuration.

Jodi also pointed me to VIVO, which is used for describing research networks, usually focussing on people more than on infrastructure or context. I remember from a few years ago a presentation from one of the VIVO people that said very explicitly that VIVO was not designed to be a source of primary data, so I wondered if it was appropriate to even consider it as a place to enter, rather than index and display, data. The VIVO wiki says it is possible, but building a site with the same kind of content as Gerry’s would be a lot of work, just as it would be in Drupal.

Oh, and those answers? Well, thanks to Arif Shaonn from the University of New South Wales, I know that http://www.w3.org/ns/prov#Activity is probably a good general type for experiments (no, I’m not going to define an ontology of my own, I already have enough pets).

And where’s my semantic web? Well, I think we may need to build a little more proof-of-concept infrastructure to see if the idea of a research-context CMS acting as a source of truth for metadata makes sense, and if so, make the case for building it as part of future eResearch data-management apps.

My dodgy code, including the input and output files for a small part of Gerry’s website, is on GitHub; to run it you’ll need to install Poole first.

Trip report: Open Repositories 2013

Creative Commons License
Trip report: Open Repositories 2013 by Peter Sefton is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

From July 8th to July 12th I was on Prince Edward Island in Canada for the Open Repositories conference. I try to participate as a member of the OR committee, particularly in matters relating to the developer challenge, which is a key part of the conference, when I can manage the early-morning conference calls with Europe and North America. My trip was funded by my employer, the University of Western Sydney. In this report I’ll talk about the conference overall, the developer challenge and the paper I presented, written with my colleague Peter Bugeia.

This was my first trip to Canada. I liked it.

Prince Edward Island’s largest land mammal is the lighthouse.

Summary

The main-track conference started and ended with talks which, to me at least, were above all about research integrity. We started with Victoria Stodden’s Re-use and Reproducibility: Opportunities and Challenges, which I’ll cover in more detail below. One of Stodden’s main themes was the gap in our scholarly practice and infrastructure where code and data repositories should be. It is no longer enough to publish research articles that make claims if those claims are impossible to evaluate or substantiate in the absence of the data and the code that support them. The closing talk touched on some of the same issues, looking at the current flawed and corruptible publishing system, claiming for example that the journal-based rewards system encourages cheating. Both of these relate to repositories, in overlapping ways.

But OR is not just about the main track, which was well put together by Sarah Shreeves and Jon Dunn; it remains a practical, somewhat technical conference where software user and developer groups are also important strands, and the Developer’s Challenge is a key part of the event.

The conference: the “Two Rs and a U”

First up, the main conference. The theme this year was “Use, Reuse, Reproduce”. The call for proposals said:

Some specific areas of interest for OR2013 are:

  • Effective re-use of content–particularly research data–enabled by embedded repository tools and services

  • Effective re-use of software, services, and infrastructure to support repository development

  • Facilitation of reproducible research through access to data, workflows, and code

  • Services making use of repository metadata

  • Focused, disciplinary or community-based software, services, and infrastructure for use and reuse of content

  • Integration of data, including linked data, and external services with repositories to provide solutions to specific domains

  • Added-value services for repositories

  • Long-term preservation of repositories and their contents

  • Role and impact of repositories in the research ecosystem

These are all great things to talk about, and show how repositories, at least in universities, are expanding from publications to data. The catch-phrase “Use, Reuse, Reproduce” is worthy, but I think maybe we’re not there yet. What I saw and heard, which was of course just a sample, was more along the lines of “Here’s what we’re doing with research data” rather than stories about re-use of repository content or reproducible research. I hope that some of the work that’s happening in the Australian eResearch scene on Virtual Labs and eResearch tools finds its way to OR2014, as I think that these projects are starting to really join up some of the basic data management infrastructure we’ve been building courtesy of the Australian National Data Service (ANDS) with research practices and workflows. It’s the labs that will start to show the Use and Reuse and maybe some Reproduction.

Keynote:

Victoria Stodden’s opening keynote was a coherent statement of some of the challenges facing scholarship, which is currently evaluated on the basis of publications, citations and journals. But publications are most often not supported by data and/or code that can be used to check them; Stodden talked mainly about computationally-based research, but the problem affects many disciplines. For a keynote I found it a little dry – there was only one picture, and I would have preferred a few stories or metaphors to make it more engaging. I was also hoping she’d talk about the difference between repeatability and reproducibility, which she did in another talk. Our community needs to get on top of this, so here’s an ‘aside-slide’ from another of her talks:

Aside: Terminology

  • Replicability (King 1995)*: Now: regenerate results from existing code, data.

  • Reproducibility (Claerbout 1992)*: Now: independent recreation of results without existing code or data.

  • Repeatability: re-run experiments to determine the sensitivity of results when underlying measurements are retaken.

  • Verification: the accuracy with which a computational model delivers the results of the underlying mathematical model.

  • Validation: the accuracy of a computational model with respect to the underlying data (model error, measurement error).

See: V. Stodden, “Trust Your Science? Open Your Data and Code!” Amstat News, 1 July 2011. http://magazine.amstat.org/blog/2011/07/01/trust-your-science/

*These citations are not in the reference list in the slide-deck.

Stodden made some references to repositories, summarized thus on The Twitter:

Simon Hodson @simonhodson99, 10 Jul

@sparrowbarley: #OR2013 keynote, V Stodden called for sharing of data & code to “perfect the scholarly record” & “root out error”” #jiscmrd

Peter Ruijgrok @pruijgrok, 9 Jul

#or2013 Victoria Stodden: A publication is actually an advertisement. Data and software code is what it is about as proof / reproducing

This was a useful contribution to Open Repositories – Reuse, Replicability, Reproducibility et al. have to be amongst our raisons d’être. Just as the Open Access movement drove the initial wave of institutional publications repositories, the R words will drive the development of data and code repositories, both institutional and disciplinary. OR is a very grounded conference, for practitioners more than theorists, so I would expect that over the next decade we’ll be talking about how to build the infrastructure to support researchers like Victoria, and the researchers we’ve been working with at UWS, which brings us to our talk.

Our paper: 4a Data Management

The presentation I gave, written with Peter Bugeia, talks about how UWS collaborated with Intersect, using money from the Australian National Data Service to work on a suite of projects that cover the four As. It’s up on the UWS eResearch blog and, with slightly cleaned-up formatting, on my own site.

At the University of Western Sydney we’ve been working on end-to-end data management. The goal of being able to support the R words for researchers is certainly behind what we’re doing, but before we get results in that area we have to build some basic infrastructure. For the purposes of this paper, then, we settled on the idea of talking about a set of ‘A’ words that we have tried to address with useful services:

  1. Acquiring data

  2. Acting on data

  3. Archiving data

(we could maybe have made more of the importance of including as much context about research as possible, including code, but we certainly did mention it).

  4. Advertising data

(note the accidental alignment with Victoria Stodden’s comment that an article is an ad.)

Note that the A words have been retrofitted to the project as a rhetorical device; this is not the framework used to develop the services.

Everyone in the repository world knows that “if we build it they will come” is not going to work, which is why this is not just about Archiving and Advertising, two core traditional repository roles; it’s about providing services for Acquiring and Acting on data for researchers. Reproducibility et al. are going to be more and more important to the conduct of research, and as awareness of this spreads the most successful researchers will be the ones who are already making sure their data and code are well looked after and well described.

The closing plenary

Jean-Claude Guédon’s wrap-up complemented the opening well, drawing together a lot of familiar threads around open access, and looking at the way the scholarly rewards system has created a crisis in research integrity. He rehearsed the familiar argument that journal-level metrics are not useful, and can be counter-productive, calling for an article-level measuring system which can operate independently of the publishing companies who control so much of our current environment. He also warned against letting corporations take ‘our’ data and sell it back to us. There was nothing really ground-breaking here, but it was a timely reminder to think about why we’re even at a conference called Open Repositories.

Like Stodden, Guédon didn’t offer much of a roadmap for the repository movement, which after all is our job, although he did try to talk in context maybe a little more than Stodden’s opening, which, while it did reference repositories, had the air of a well-practised stump speech.


The developer challenge

This year the developer challenge judging panel was decisively chaired by Sarah Shreeves, who was also on the program committee. We struggled to get entrants this time – this still needs some analysis, but at this stage it looks like the relatively remote location meant that many developers didn’t get funding to attend, and we had a little confusion around a new experiment for this year, a non-compulsory hackfest a few kilometres from the main venue, which left a couple of people thinking they’d missed out on a chance to join in. And the big one was that there was no dedicated on-the-ground dev wrangler on hand; for the last several years JISC have been able to send staff, notably Mahendra Mahey. I did try to encourage teams to enter, with modest success, but Mahendra was definitely missed.

So who won?

William J Nixon @williamjnixon, 11 Jul

#OR2013 Developers challenge winners – Team Raven’s PDF/A and Team ORCID. Congratulations. More details on these at http://or2013.net/content/or-2013-dev-challenge-event …

This year we based the judging on the criteria I put together last year in the form of a manifesto about the values of the conference. I think that helped focus the judging process, and feedback from the panel was generally good, but we’ll see how people feel after some reflection. I talked about this in the wrap-up. Torsten Reimer got to the heart of it:

Torsten Reimer @torstenreimer, 11 Jul

#OR2013 Developer Challenge co-judge @ptsefton summarises the dev manifesto http://bit.ly/1dmQ8UQ  creating new networks is key

Robin Rice (front left) had organized a photo of the smuggler’s den in which we convened:

Robin Rice @sparrowbarley, 10 Jul

The Dev Challenge judges are convening in an appropriate venue. #OR2013 pic.twitter.com/ToIkzQueeb

The committee is putting together a manual for future organizers, and I will be suggesting something along these lines:

  • Dev facilities should be as close as possible to the main conference rooms, even remote rooms in the same facility cause problems as people need to be able to be in and out of presentations.

  • There needs to be a dedicated mentor for the dev challenge to help teams coalesce and do stuff like make sure that winners are announced formally.

The future of Fedora

I was really interested in the new Fedora Futures / Fedora 4 project (FF). Fedora is the back-end repository component behind projects like Islandora, Hydra (parts of which power the HCS virtual laboratory) and, optionally, ReDBOX.

The new Fedora project is not ready for real use yet, but it shows lots of promise as a simple-to-use linked-data-ready data storage layer for eResearch projects, where you want to keep data and metadata about that data. New built-in self-repairing clustering and a simple REST interface make it appealing.

I was particularly excited about the (promised) ability for FF to run on top of an existing file tree and provide repository-like services over the files. This is exactly what we have been looking for in the Cr8it project, where the idea is to bridge the yawning chasm from work-in-progress research files to well-described data sets. Small detail: it doesn’t actually do what I fondly hoped it would yet. I wanted to be able to point it at a set of files, have it extract metadata and generate derived views and previews, allow extra metadata to be linked to files, and watch for changes. Working on that.

The venue and all that

Prince Edward Island took some getting to for some of us, but it was worth it. Mark Leggott’s team at UPEI and the related spin-off company Discovery Garden were consummate hosts, and Charlottetown was a great size for a few hundred conference attendees; it was impossible not to network unless you stayed in your hotel room. I didn’t. I particularly appreciated the local CD in the conference bag. Mine is Clocks and Hearts Keep Going by Tanya Davis, laid-back pop-folk, which kind of reminded me of The Be Good Tanyas, in a good way, and no, not because of the Tanya thing, but it may be a Canadian thing. And there were decent local bands at the conference dinner, which was held in three adjacent restaurants and involved oysters and lobsters, like just about every other meal on PEI.


Just about everyone I went to dinner with was obsessed with these things, and I do think they might be better than the ones we have here

There was pretty good fish and chips too.

Another student project – crossing the curation boundary

Creative Commons Licence
Another student project – crossing the curation boundary by Peter Sefton is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.


I wrote last week about a student project on HTML slide-viewing for which I’m the client. This week I met with another group to talk about a project which has more direct application to my job as eResearch manager at the University of Western Sydney.

The next cohort are going to be looking at a system for getting working data into an eResearch application. Specifically they are going to have a go at building an uploader add-in to ownCloud, the open source Dropbox.com-like system, so it can feed data to the HIEv data management application used by the Hawkesbury Institute for the Environment. This project was inspired by two things:

  • The fact that we’re working with ownCloud in a trial at UWS, and our Cr8it data packaging and archiving application is based on ownCloud, so getting some students working in this area will help us better understand ownCloud and build expertise at UWS.

  • A meeting with Gerry Devine, the data manager at HIE, where he was explaining how the institute is trying to improve the quality of data in HIEv; at this stage at least they don’t want everything uploaded, and files need to conform to naming conventions1.

These two things go very nicely together. ownCloud has a sync service like Dropbox that can replicate folders full of files across machines via a central server, plus a web view of the files; it has a plugin architecture so it is easy to add actions to files; and HIEv has an API that can accept uploads. The application is simple:

  • For certain file types, those that might have data like .csv files, show an ‘Upload to HIEv’ button in the web interface.

  • Present the user with a form to collect metadata about the file: what date range does it represent, which experimental facility is it from (via a drop-down list), etc. (And yes, automated metadata extraction would be nice to have, if the students have time.)

  • Use the metadata to generate file names that follow the institute’s naming convention (see the sketch after this list).

  • Upload to HIEv.
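A minimal sketch of the file-naming step, as flagged in the list above: the FACILITY_EXPERIMENT_PROJECT_… shape is modelled on the HIE conventions described elsewhere in these posts, but the exact fields and defaults here are my assumptions, not the institute’s definitive rule.

from datetime import date

def hiev_filename(facility, experiment, project, variables,
                  processing, start, end, version=None, filetype="csv"):
    """Build FACILITY_EXPERIMENT_PROJECT_VARS_PROCESSING_DATERANGE[_VERSION].ext."""
    daterange = "%s-%s" % (start.strftime("%Y%m%d"), end.strftime("%Y%m%d"))
    parts = [facility, experiment, project, variables, processing, daterange]
    if version:
        parts.append(version)
    return "_".join(parts) + "." + filetype

# hiev_filename("GHS30", "EXP1", "PROJ2", "LEAF", "RAW",
#               date(2013, 6, 1), date(2013, 6, 30))
# -> 'GHS30_EXP1_PROJ2_LEAF_RAW_20130601-20130630.csv'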

I think that should be a reasonable scope for a third year assignment, with plenty of room to add nice add-on features if there’s time. A couple of obvious ones:

  • Extracting metadata from files (e.g. working out the date range).

  • Making the metadata form configurable eg with a JSON file.

Beyond that, there is a potentially much more ground-breaking extension possible. Instead of having to set up the metadata form for every context of research, what if information about the research context could be harvested from the web and the user could pick their context from that?

I have been talking this idea through with various eResearch and repository people. I submitted it as an idea to the Open Repositories Dev challenge (late, as usual). Nobody bit, but I think it’s important:

If you are building a repository for research data, then you need to be able to record a lot of contextual metadata about the data being collected. For example, you might have some way to attach data to instruments. We typically see designs with hierarchies something like Facility / Experiment / Dataset / File. Problem is, if you design this into the application, for example via database tables, then that makes it much harder to adapt to a new domain or changing circumstances, where you might have more or fewer levels, or hierarchies of experiment or instrument might become important, etc.

So, what I’d like to see would be a semantic wiki or CMS for describing research context with some built-in concepts such as “Institute”, “Instrument”, “Experiment”, “Study”, “Clinical Trial” (but extensible) which could be used by researchers, data librarians and repository managers to describe research context as a series of pages or nodes, and thus create a series of URIs to which data in any repository anywhere can point: the research data repository could then concentrate on managing the data, and link the units of data (files, sets, databases, collections) to the context via RDF assertions such as ‘<file> generatedBy <instrument>’. Describing new data sets would involve look-up and auto-completes to the research-context-semantic-wiki – a really interesting user interface challenge.

It would be great to see someone demonstrate this architecture, building on a wiki or CMS framework such as Drupal or maybe one of the NoSQL databases, or maybe as a Fedora 4 app, showing how describing research context in a flexible way can be de-coupled from one or more data-repositories. In fact the same principle would apply to lots of repository metadata – instead of configuring input forms with things like institutional hierarchies, why not set up semantic web sites that document research infrastructure and processes and link the forms to them?

Back to UWS and my work with Gerry Devine. Turns out Gerry has been working on describing the research context for his domain, the Hawkesbury Institute for the Environment. Gerry has a draft web site which describes the research context in some detail – all the background you’d like to have to make sense of a data file full of sensor data about life in whole tree chamber number four. It would be great if we could get the metadata in systems like HIEv pointing to this kind of online resource with statements like this:

<this-file> generatedBy https://sites.google.com/site/hievuws/facilities/eucface

To support this we’d need to add some machine-readable metadata to supplement Gerry’s draft human-readable web site. Ideally such a site would be able to support versioned descriptions of context, so you could link data to particular configurations of the research context, in the interests of maximising research integrity as per the Singapore Statement:

4. Research Records: Researchers should keep clear, accurate records of all research in ways that will allow verification and replication of their work by others.

5. Research Findings: Researchers should share data and findings openly and promptly, as soon as they have had an opportunity to establish priority and ownership claims.

1 I know there are some strong arguments that IDs should be semantically empty – i.e. that they should not contain metadata – but there are good practical reasons why data files with good names are necessary, and anyway the ID for a data set is not the same as its filename when it happens to be on your laptop.

HTML Slide presentations, students to the rescue

Creative Commons Licence
HTML Slide presentations, students to the rescue by Peter Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

Thanks to Andrew Leahy’s organising skills I am now the client for a group of third year computing students from the School of Computing, Engineering and Mathematics at the University of Western Sydney who have chosen to work on an HTML slide viewer project for their major project. I’m not going to name them here or point to their work without their permission, but who knows, they might start up an open source project as part of this assignment.

You might have noticed that on this blog I have been experimenting with embedding slide presentations in posts, like this conference presentation on research data management which embeds slide images originally created in the Google Drive presentation app along with speaker notes, or this one on the same topic where the slides are HTML sections rather than images. These posts mix slides with text, so you can choose to read the story or watch the show using an in browser HTML viewer. I think this idea has potential to be a much better way of preserving and presenting slides than throwing slide-decks online, but at the moment the whole experience on this blog is more than a bit clunky and leaves lots to be desired, which is where the students come in.

Hang on, there are dozens of HTML slide-viewer applications out there – so why do I think we need a new one?

There are a few main requirements I had which are not met by any of the existing frameworks, that I know of. These are:

  • It should be possible to mix slide and discursive content.

    That is, slides should be able to be sprinkled through an otherwise ‘normal’ HTML document, which should display as plain-old HTML without any tricks.

  • Slide markup should be declarative and use modern semantic-web conventions.

    That is, the slides and their parts should be identified by markup using URIs instead of the framework assuming, for example, that <section> or <div> means ‘this is a slide’. Potentially, different viewing engines could consume the same content. You could have a dedicated viewer for use in-venue, with speaker notes on one screen and presentation on another, and another viewer to show a slide presentation embedded in a WordPress post. (A sketch of what I mean by declarative markup appears a little further down.)

  • Following from (2), the slide show behaviour should be completely independent of the format for the slides.

    That is, adding the behaviour should be a one- or two-liner added to the top, or, even better, just dropping the HTML into a ‘slide-ready’ content management system like, um, my blog.

There are plenty of frameworks with some kind of open license that students should be able to adapt for this project. That’s what I did with my attempt: I wrote a few lines of code to take slides embedded in blog posts, get rid of other HTML and marshal the result into the venerable W3C Slidy format. The format is declarative, and the documents don’t ‘do’ anything at all until a WordPress plugin sniffs out slide markup hiding in them.
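To make requirement (2) a little more concrete, here is a minimal sketch of the sort of declarative, URI-based markup I have in mind, plus a few lines of Python (using BeautifulSoup) that ‘sniff’ slides out of an otherwise ordinary HTML page. The vocabulary URI is invented; the point is only that a viewer keys off URIs rather than assuming every <section> is a slide.

from bs4 import BeautifulSoup

SLIDE_TYPE = "http://example.org/slide-vocab#Slide"  # hypothetical vocabulary URI

html = """
<article>
  <p>Ordinary narrative text that renders as plain HTML.</p>
  <section typeof="http://example.org/slide-vocab#Slide"
           resource="http://example.org/decks/demo#slide-1">
    <h2 property="http://example.org/slide-vocab#title">A declarative slide</h2>
    <div property="http://example.org/slide-vocab#notes">Speaker notes live here.</div>
  </section>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
for slide in soup.find_all(attrs={"typeof": SLIDE_TYPE}):
    print(slide.get("resource"), "->", slide.h2.get_text(strip=True))

A document marked up this way still displays as plain HTML in any browser; the slide behaviour is entirely the viewer’s business.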

I’m going to be working with the team to negotiate what seems like a reasonable set of goals for this project, but my current thinking is something like the following:

  • In consultation with me, define a declarative format for embedding slides in HTML that can cover (at least):

    • Identifying slides using a URI.

    • Identify parts of slides (the slide vs notes etc).

  • Allow slides to consist of one or both of an image of the slide or a text/HTML version of the same thing. Eg a nicely rendered image of some bullet points from PowerPoint with equivalent HTML formatting also available to search engines and screen-readers.

  • Improve on the current slide-viewing experience in WordPress with:

    • Some kind of player that works in-post (ie without going fullscreen). A simple solution that came up in our meeting would be to automatically add navigation that just skips between slides, with some kind of care taken to show the slide at the top of the screen with context below it.

    • An improved full-screen player that can (at least) recognise when a full-screen image version of the slide is available and display that scaled to fit, rather than the sub-optimal thing I have going on now with Slidy putting a heading at the top and the image below.

There are lots more things that could be done with this, given time, which might make good material for future projects:

  1. Adding support for this format to Pandoc or similar.

  2. Creating a converter or exporter for slide presentations in common formats (.pptx, odp) targeting the new format.

  3. Extending the support I have already built into WordDown and the ICE content converter to allow authors to embed slides in word processing documents.

  4. Adding support for synchronised audio and video.

  5. Allowing more hyper-style presentations like Prezi.

  6. Dealing with progressive slide builds.

  7. Slide transitions.

  8. Different language versions of the same content.

  9. Synchronising display on multiple machines, e.g. students’ iPads or a second computer.

  10. Master slides and branding – point to a slide template somewhere? Include a suggested slide layout somehow?

  11. Adding a presenter mode with slides on one screen and notes on another.

  12. For use with multi-screen rigs like Andrew Leahy’s Wonderama, maybe the extra screens could be used to show more context: slides on one screen, video of the presenter on another, and photos or maps on other screens. For example, a Wonderama presentation rig could look for geo-coded stuff in the presentation and throw up maps or Google Earth visualisations on spare screens, or other contextual material.

Of course, depending on which framework, if any, the students decide to adopt and/or adapt, some of the above may come for free.

4A Data Management: Acquiring, Acting-on, Archiving and Advertising data at the University of Western Sydney

This is a repost of a presentation I wrote with Peter Bugeia and delivered at Open Repositories in Canada, originally published on the UWS eResearch team blog, and presented here with minor updates to the notes, mainly formatting but with one extra quip.

Creative Commons Licence
4A Data Management: Acquiring, Acting-on, Archiving and Advertising data at the University of Western Sydney by Peter Sefton and Peter Bugeia is licensed under a Creative Commons Attribution 3.0 Unported License.

Slide 1

Notes

Abstract

There has been significant Government investment in Australia in repository and eResearch infrastructure over the last several years, to provide all universities with an institutional repository for publications, and via the Australian National Data Service to encourage the creation of institution-wide Research Data Catalogues, and research Data Capture applications. Further rounds of funding have added physical data storage and cloud computing services. This presentation looks at an example of how these streams of money have been channeled together at the University of Western Sydney to create a joined-up vision for research data management across the institution and beyond, creating an environment where data may be used by research teams within and outside of the institution. Alongside of the technical services, we report on early work with researchers to create a culture of replicable use of data, towards the vision of truly reproducible research.

This presentation will show a proven end-to-end design for research data flows, starting from a research group, The Hawkesbury Institute for the Environment, where a large sensor network gathers data for use by institute researchers, in-situ, with data flowing-through to an institutional data repository and catalogue, and thence to Research Data Australia – a national data search engine. We also discuss a parallel workflow with a more generic focus – available to any researcher. We also report on work we have done to improve metadata capture at source, and to create infrastructure that will support the entire research data lifecycle. We include demonstrations of two innovations which have emerged from the associated project work: the first is of a new tool for researchers to find, organize, package and publish datasets; the second is of a new packaging format which has both human-readable and machine-readable components.

Slide 2

Notes

Some of the work we discuss here was funded by the Australian National Data Service. See:

Seeding the commons project to describe data sets at UWS and the Data catalogue project.

HIEv Data Capture at the Hawkesbury Institute for the Environment

The talk

Notes

We’ll use the four A’s to talk about some issues in data management.

  • We need a simple framework which covers it all, to capture how we work with research data from cradle to grave:

  • We need to Acquire the raw data and make it secure and available to be worked on.

  • We need to Act on the data to cleanse it while keeping track of how it was cleansed, and analyse it using tools to support our research, while maintaining the data’s provenance.

  • We need to Archive the data from working storage to an archival store, making it citable.

  • We need to Advertise that the data exists so that others can discover it and use it confidently with simple access mechanisms and simple tools.

  • The 4A approach must work for:

  • high-intensity research data such as that from gene sequences, sensor networks, astronomy, medical diagnostic equipment, etc.

  • the long tail of unstructured research data.

For example

Notes

In the presentation, I used a short video on how to catch a kangaroo. (Late the night before, I was searching for this video, forgot how to spell kangaroo and tried starting it like this: “How to catch a C … A … N …” – at which point the Google suggestion popped up with this, which I decided not to show at the conference. I’d blame the jet-lag, but you wouldn’t believe me.)

If only data capture were as simple as catching a kangaroo in a shopping bag!

Australian Government Initiatives in Research Data Management

Notes

There have been several rounds of investment in (e)research infrastructure in Australia over the last decade, including substantial investments to get institutional publications repositories established.

  • Australian National Data Service (ANDS) $50M (link)

  • National eResearch Collaboration Tools and Resources (NeCTAR) project (link) $50M

  • Research Data Storage Infrastructure (RDSI) $50M (link)

  • Implemented to date:

  • National Research Data Catalogue – Research Data Australia

  • Standard approach to updating the Catalogue (OAI-PMH and rif-cs)

  • 10+ Institutional Metadata Repositories implemented

  • 120+ data capture applications implemented across 30+ research organisations

  • Upgrade of High Performance Computing infrastructure

  • Colocation of data storage and computing

Slide 6

Notes

UWS is a young (~20 years) university performing well above most of its contemporaries in research.

Slide 7

Notes

This slide by Prof Andrew Cheetham – the Deputy Vice Chancellor for Research shows that UWS performs very well at attracting competitive grant income from the Australian Research Council.

Slide 8

Notes

UWS is concentrating its research into flagship institutes – we will be talking in more detail about HIE, our environmental institute, which does research cutting across different disciplines, spanning from the leaf level to the ecosystem level.

Slide 9

Notes

Slide 11

Notes

These are Intersect’s members. Intersect also collaborates with other eResearch organisations throughout Australia.

The slide is a photo from the recent Hackfest event. This is an annual fun competition for software developers to use open government data in innovative ways. Intersect hosted the NSW chapter of the event.

eResearch @ UWS

Notes

The eResearch unit at UWS is a small team, currently reporting to the Deputy Vice Chancellor, Research. See our FAQ.

Slide 13

Notes

At UWS, we haven’t tried to drive change with top-down policy. Instead, we’ve taken a practical, project-based approach which has allowed a data architecture to evolve. The eResearch Roadmap calls for a series of data capture applications to be developed for data-intensive research, along with a generic application to cover the long tail of research data.

The 4A Vision

For the purposes of this presentation we will talk about the ‘4A’ approach to research data management – Acquire, Act, Archive and Advertise. The choice of different terms from the 2Rs Reuse and Reproduce of the conference theme is intended to throw a slightly different light on the same set of issues. The presentation will examine each of these ‘A’s in turn and explain how they have helped us to organize our thinking in developing a target technical data architecture and integrated data-related end-to-end business processes and services involving research technicians and support staff, researchers and their collaborators, library staff, information technology staff, office of research services, and external service providers such as the Australian National Data Service and the National Library of Australia. The presentation will also discuss how all of this relates to the research project life cycle and grant funding approval.

Acquiring the data

We are attacking data acquisition (known as Data Capture by the Australian National Data Service, ANDS 1) in two ways:

With discipline-specific applications for key research groups. A number of these have been developed in Australia recently (for example MyTARDIS 2); we will talk about one developed at UWS. With ANDS funding, UWS is building an open source automated research data capture system (the HIEv) for the Hawkesbury Institute for the Environment to automatically gather time-series sensor data and other data from a number of field facilities and experiments, providing researchers and their authorised collaborators with easy self-service discovery and access to that data.

With generic services: data storage via simple file shares; integration with cloud storage, including Dropbox.com and other distributed file systems; and source-code repositories, such as public and private GitHub and Bitbucket stores, for working code and textual data.

Acting on data

The data Acquisition services described above are there in the first instance to allow researchers to use data. With our environmental researchers, we are developing techniques for building reusable data sets which include raw data and commented scripts that clean the data (e.g. a comment like “filter out known bad days when the facility was not operating”) and then re-organise it, via resampling or other operations, into useful ‘clean’ data that can be fed to models, plotted and used as the basis of publications. Demo: the presentation will include a live demonstration of using HIEv to work on data and create a data archive.
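
A commented cleaning script of the sort described might look something like the following sketch (Python with pandas; the file names, column names and ‘bad days’ are invented for illustration, not taken from HIEv):

    import pandas as pd

    # Days when the facility was not operating (illustrative values only).
    BAD_DAYS = ["2012-11-02", "2012-11-03"]

    raw = pd.read_csv("weather_raw.csv", parse_dates=["timestamp"],
                      index_col="timestamp")

    # Filter out known bad days when the facility was not operating.
    clean = raw[~raw.index.normalize().isin(pd.to_datetime(BAD_DAYS))]

    # Drop physically implausible sensor readings.
    clean = clean[(clean["air_temp_c"] > -10) & (clean["air_temp_c"] < 55)]

    # Resample to daily means: the 'clean' data that feeds models and plots.
    daily = clean.resample("D").mean()
    daily.to_csv("weather_daily_clean.csv")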

From action to archive

Having created both re-usable base data sets and publication-specific operations on data to create plots etc., there are several workflows where various parties trigger deposit of finished, fixed, citable data into a repository. Our project team mapped out several scenarios where data are deposited, with different actors and drivers, including motivations that are both carrot (my data set will be cited) and stick (the funder/journal says I have to deposit). Services are being crafted to fit in with these identified workflows rather than building new things and assuming “they will come”.

Archiving the data

The University of Western Sydney has established a Research Data Repository [i] (RDR), the central component of which is a Research Data Catalogue, running on the ReDBOX open source repository platform. While individual data acquisition applications such as HIEv are considered to have a finite lifespan, the RDR will provide on-going curation of important research datasets. This service is set up to harvest data sets from the working-data applications, including the HIEv data-acquisition application and the CrateIt data packaging service, using the Open Archives Initiative – Protocol for Metadata Harvesting (OAI-PMH).
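
For the technically curious, the harvesting side is plain OAI-PMH. The sketch below (Python; the endpoint URL is a placeholder) shows roughly what a harvester does: page through ListRecords responses, following resumption tokens, and read the Dublin Core records.

    import requests
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    def harvest(endpoint):
        """Yield (identifier, title) for every record the endpoint exposes."""
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        while True:
            root = ET.fromstring(requests.get(endpoint, params=params).content)
            for record in root.iter(OAI + "record"):
                ident = record.find(".//" + OAI + "identifier")
                title = record.find(".//" + DC + "title")
                yield ident.text, (title.text if title is not None else None)
            token = root.find(".//" + OAI + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break
            params = {"verb": "ListRecords", "resumptionToken": token.text}

    # Placeholder endpoint, for illustration only:
    # for ident, title in harvest("https://hiev.example.edu.au/oai"):
    #     print(ident, title)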

Advertising the data

As with Institutional Publications Repositories, one of the key functions of the Research Data Repository is to disseminate metadata about holdings to aggregation services and give data a web presence. Many Australian institutions are connected to the Research Data Australia discovery service 6, which harvests metadata via an ANDS-defined standard over the OAI-PMH harvesting protocol. There is so far no Google-Scholar-like service harvesting data about data sets via direct web crawling (that we know about), so there are no firm standards for how to embed data in a page, but we are tracking the developments of the Schema.org vocabulary, which is driven largely by Google and a group of its peer companies, and the work described above on data packaging with RDFa metadata is intended to be consumed by direct crawlers. It is possible to unzip a CrateIt package and expose it to the web, thus creating a machine-readable entry-point to the data within the Zip/BagIt archive.
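
As a rough illustration of what giving data ‘a web presence’ can mean, here is a sketch that renders an RDFa-lite fragment for a data set using the Schema.org Dataset type (the URIs and values are invented, and a real CrateIt README carries much richer metadata than this):

    from string import Template

    # Schema.org's Dataset type is real; everything else here is illustrative.
    FRAGMENT = Template("""\
    <div vocab="https://schema.org/" typeof="Dataset" resource="$uri">
      <h1 property="name">$name</h1>
      <p property="description">$description</p>
      <span property="creator" typeof="Person"><span property="name">$creator</span></span>
    </div>
    """)

    html = FRAGMENT.substitute(
        uri="https://data.example.edu.au/dataset/weather-2013",
        name="Weather station data, Richmond NSW",
        description="Cleaned half-hourly weather observations.",
        creator="A. Researcher",
    )
    print(html)  # in real use the values would need HTML-escaping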

Looking to the future, the University is also considering plans for an over-arching discovery hub, which would bring together all metadata about research, including information on publications, people and organisations.

Technical architecture

The following diagram shows the first end-to-end data capture to archiving pathways to be turned on at the University of Western Sydney, covering Acquisition and Action on data (use) and Archiving and Advertising of data for reuse. Note the inclusion of a name-authority service which is used to ensure that all metadata flowing through the system is unambiguous and linked-data-ready 7. The Name Authority is populated with data about people, grants and subject codes from databases within the research services section of the university and from community-maintained ontologies. A notable omission from the architecture is integration with the Institutional Publications Repository – we hope to be able to report on progress joining up that piece of the infrastructure via a Research Hub at Open Repositories 2014.

[i] Project materials refer to the repository as a project which includes both working and archival storage as well as some computing resources, drawing a line around ‘the repository’ that is larger than would be usual for a presentation at Open Repositories.

Slide 14

Notes

There are a number of major research facilities at HIE, here are two whole-tree chambers which allow control over temperature, moisture and atmospheric CO2.

Slide 15

Notes

This diagram shows the end-to-end data and application architecture which Intersect and UWS eResearch built to capture data from HIE sensors and other sources. Each of the columns roughly equates to one of the four As. Once data is packaged in the HIEv, it is stored in the Research Data Store and there is a corresponding record for it in the Research Data Catalogue. The data packaging format produced by the HIEv, along with the delivery protocol, are key to the architecture: the data packaging format (based on BagIt) is stand-alone from the HIEv and self-describing, and the delivery protocol (OAI-PMH) is well-defined and standards-based. These are discussed in more detail in later slides. When other data capture applications are developed at UWS, to integrate into and extend the architecture they will simply need to package data in the same format and produce and deliver the same metadata via the same delivery protocol as the HIEv.

Slide 16

Notes

This diagram shows how the four ‘A’s fit together for HIE. Acquisition and action are closely related – it is important to provide services which researchers actually want to use, and to build in data publishing and packaging services, rather than setting up an archive and hoping researchers come to it with data.

Slide 17

Notes

The HIEv/DC21 application is available as open source:

  • Funded by ANDS

  • Developed by Intersect

  • Automated data capture

  • Ruby on Rails application

  • Agile development methodology

  • Went live in Jan 2013.

  • 1200 files, 15 GB of RAW data, 25 users.

  • 120 files auto-uploaded nightly, +1GB per week

  • Expected to reach 50,000 files in next couple of years

  • Now extended to include EucFACE data

  • Possibly to be extended to include Genomic data (20TB per year)

  • Integrated with UWS data architecture

  • Supports the full 4 As – links Acquire to Act to Archive

Slide 18

Notes

Acting on data: our researchers are now starting to do work with the HIEv system: here’s an API developed by Dr Remko Duursma for consuming HIEv data from R.

Slide 19

Notes

Acting on data: researchers can pull data either manually or via API calls and do work, such as this R plot.

From acting to archiving…

Notes

The following few slides show how a user can select some files…

Slide 21

Notes

… look at file metadata …

Slide 22

Notes

… add files to a cart …

Slide 23

Notes

… download the files in a zip package …

Slide 24

Notes

… inside the zip the files are structured using the BagIt format …
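
(For anyone who wants to build such a package by hand: the BagIt layout can be produced with the Python bagit library. A minimal sketch, with invented directory and metadata values, follows.)

    import bagit

    # Turn an existing directory of data files into a bag, in place.
    bag = bagit.make_bag(
        "weather_package",
        {"Contact-Name": "A. Researcher",
         "External-Description": "Cleaned weather station data, Richmond NSW"},
        checksums=["sha256"],
    )

    # A bag can later be checked against its manifests.
    print(bag.is_valid())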

Slide 25

Notes

… with a standalone README.html file containing all the metadata we know about the files and associated research context (experiments, facilities) …

Slide 26

Notes

… with detail about every file as per the HIEv application itself

Slide 27

Notes

… and embedded machine readable metadata using RDFa lite attributes

Slide 28

Notes

… the RDFa metadata describes the data-set as a graph.
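
To illustrate what ‘a graph’ means here, the sketch below builds the same sort of statements directly with rdflib (Python) and prints them as Turtle. The URIs and the choice of properties are illustrative rather than the exact terms used in the README.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, RDF

    SCHEMA = Namespace("https://schema.org/")

    g = Graph()
    dataset = URIRef("https://data.example.edu.au/dataset/weather-2013")
    datafile = URIRef("https://data.example.edu.au/dataset/weather-2013/daily.csv")

    # The data set, its title, and one of the files it contains.
    g.add((dataset, RDF.type, SCHEMA.Dataset))
    g.add((dataset, DCTERMS.title, Literal("Weather station data, Richmond NSW")))
    g.add((dataset, DCTERMS.hasPart, datafile))
    g.add((datafile, DCTERMS.format, Literal("text/csv")))

    print(g.serialize(format="turtle"))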

Completed packages flow through to the Research Data Catalogue via an OAI-PMH feed, and there they are given a DOI so they can be cited. The hand-off between systems is important: once a DOI is issued, the data set has to be kept indefinitely and must not be changed.

Slide 29

Notes

Advertising data: this is a record about an experiment on Research Data Australia.

Slide 30

Notes

I said I’d talk about the long tail. Here are two.

We looked in some detail at how the HIEv data capture application works for environmental data – but what about researchers who are on the long tail, and who don’t have specific software applications for their group?

We are working on a similar Acquire and Act service that will operate with files, and we are trying to make it as useful and attractive as possible. Most research teams we talk to at UWS are using Dropbox or one of the other ‘Share, Sync, See’ services. Dropbox has limitations on what we can do with its APIs and does not play nicely with authentication schemes other than its own, so we are looking at building ‘Acquire and Act’ services using an open source alternative: ownCloud.

Our application is known as Cr8it (Crate-it).

Slide 31

Notes

A number of techniques are employed at UWS:

  • the “R” drive

  • research-project-oriented data shares

  • synchronisation with dropbox and owncloud

  • synchronisation with github and svn

References

1. Burton, A. & Treloar, A. Designing for Discovery and Re-Use: the ‘ANDS Data Sharing Verbs’ Approach to Service Decomposition. International Journal of Digital Curation 4, 44–56 (2009).

2. Androulakis, S. MyTARDIS and TARDIS: Managing the Lifecycle of Data from Generation to Publication. In eResearch Australasia 2010 (2010). At <http://ccaeducause1.caudit.edu.au/index.php/eraust/2010/paper/view/62>

3. Sefton, P. M. The Fascinator – Desktop eResearch and Flexible Portals. (2009). At <https://smartech.gatech.edu/handle/1853/28483>

4. Kunze, J., Boyko, A., Vargas, B., Madden, L. & Littman, J. The BagIt File Packaging Format (V0.97). At <http://tools.ietf.org/html/draft-kunze-bagit-06>

5. W3C RDFa Working Group & others. RDFa Core 1.1 Recommendation. (2012). At <http://www.w3.org/TR/rdfa-syntax/>

6. Wolski, M., Richardson, J. & Rebollo, R. Shared benefits from exposing research data. In 32nd Annual IATUL Conference (2011). At <http://iatul2011.bg.pw.edu.pl/proceedings/ft/Wolski_M.pdf>

7. Berners-Lee, T. Linked data, 2006. At <http://www.w3.org/DesignIssues/LinkedData.html>



Research Data @ the University of Western Sydney (Introducing a data deposit management plan to the research community at UWS)

I was invited to speak at the National Higher Education Faculty Research Summit in Sydney on May 22 about our Research Data Repository project. The conference promises to provide a forum for exploration.

Explore

  • Sourcing extra grant funding and increasing revenue streams

  • Fostering collaboration and building successful relationships

  • Emerging tools and efficient practices for maintaining research efficacy and integrity

  • Improving your University’s research performance, skills and culture to enable academic excellence

My topic is “Introducing a data deposit management plan to the research community at UWS”. This relates directly to the conference theme I have highlighted, on emerging tools and practice. My strategy for this presentation, given that we’re at a summit, is to stay above 8000m, use a few metaphors, and discuss the strategy we’re taking at UWS rather than dive too deeply into the sordid details of projects. As usual, these are my notes; I hope these few paragraphs will be more useful than just a slide deck, but this is not a fully developed essay.

There are two kinds of data: Working and Archival/Published

In very general terms, we have divided our data storage into two parts: the working Research Data Storage service where people get things done, collect data and work with it and the archival Research Data Repository part where stable, citable published data sets are looked after (by the library) for the long term.

This talk is not going to be all about architecture diagrams, but here’s one more, from a recent project update, showing two examples of applications that will assist researchers in working with data. One very important application is HIEv, the central data capture/management platform for the Hawkesbury Institute for the Environment. This is where research teams capture sensor data, research support staff work to clean and package the data, and researchers develop models and produce derived data and visualisations. We’re still working out exactly how this will work as publications using the data start to flow, but right now data moves from the working space to the archival space, and thence to the national data discovery service; see this example of weather data (unfortunately the data set is not yet openly available for this one – I think it should be, and I’ll be doing what I can to make it so).

Data wrangling services

The other service shown on this diagram is Dropbox.com. We’d be hard pressed to stop researchers from using this service – it comes up in just about every consultation meeting. Researchers themselves must take responsibility for making sure that services like this are appropriate given their data management obligations under funder agreements and codes of practice. For those projects where Dropbox.com is appropriate we plan to let researchers invite the Research Data Store to share their stuff, thus creating a managed, backed-up copy at the university, and opening the way for us to provide useful services over the data (coming soon).

Data management

Yes, we have a web page about research data management, with some basic advice and links to more resources, but putting up web pages does not effect the kind of culture change needed to establish research data management, data re-use and data citation. As our Research Office head, Gar Jones, says, this will be a change similar to the introduction of Human and Animal ethics management, which will take several years to roll out.

Some key points for this presentation

I want to talk about:

  • Governance, open access, metadata, identifiers

  • The importance of the (administrative) research lifecycle

  • Policy supported by services rather than aspirations

eResearch = goat tracks

This is a concrete path on the Werrington South (Penrith) campus of the University of Western Sydney. The path is there because people kept walking through the garden bed, which was in between where the shuttle bus stops and where they wanted to be, at the library. As I said at a similar conference for IT-types last year:

Groups like mine work in the gap between the concrete and the goat track, my job is to encourage the goats.

And once we’ve encouraged the goats to make new paths, we need to get the university infrastructure people to come and pave the paths.

What’s over the horizon?

What do research administrators and IT directors need to be thinking about?

  • Changes in the research landscape – more emphasis on data reuse and citation, and an increasing emphasis on defensible research, mean that data will become as important as citations

  • Providing access to publications and data so it can be reused.

  • (e)Research infrastructure in general, where collaboration must not be constrained by the boundaries of individual institutional networks and firewalls.

Any others?

Research data, Next Big Thing?

The Australian National Data Service runs a data-discovery service designed to advertise data for reuse.

Governments are joining in

As research organisations, we want to have infrastructure for data management, and a culture of data management that involves forward planning, and data re-use. So the next section of the talk is about how we need to:

  • Stop the fat multinational-publisher tail from wagging the starving research dog. Ensure research funded by us is accessible and usable by us.

  • Understand our researchers and their habits, so we can help them take on this new data management responsibility (actually it’s not a new responsibility, but many have simply been paying no attention to it, in the absence of any obvious reason to do so).

  • Sort out the metadata mess most universities are swimming in.

Now for the big picture stuff.

Open Free scholarship is coming? (Just beyond that ridge)

OA is a Good Thing,

Which will:

  • Reduce extortionate journal pricing.

  • Provide equitable access to research outputs to the whole world.

  • Open Access to publications and Coming Soon: Open Access to data.

  • Promote Open Science and Open Research.

  • Drive huge demand for data management, cataloguing, archiving, publishing services

http://aoasg.org.au/

There are competing models for open access. Bizarrely, the discussion is often framed as a contest between ‘Green’ and ‘Gold’. It’s a lot like the State of Origin Rugby League, a contrived but popular-in-obscure-corners-of-the-world contest where the ‘Blues’ and ‘Maroons’ run repeatedly into each other. In both State of Origin and Open Access, the current winners are large media companies. At least being an Open Access advocate doesn’t give you head injuries.

Green OA refers to author-deposited pre-publication versions of research articles. Gold means that the published version itself is ‘Open’, for some ill-defined definition of open, often at a cost of thousands of dollars out of the researcher’s budget. Green or Gold, a lot of so-called Open Access publishing operates with no formal legal underpinnings, that is, without copyright-based licenses. For example, when I deposited a Green version of a paper I had written here and wrote to the publisher asking them to clarify copyright and licensing issues, I got no reply.


We have a brief window now to try to build services for research data management that do have a solid legal basis and avoid following some of the OA movement’s missteps, but this is not trivial (1).

Identity management is crucial

I have used a variant of the above dog picture before to talk about identity management. This dog has a name but it’s a terrible way to find out about him as he has a much more famous namesake.

Like the rest of us, this dog has all sorts of identifying names and numbers – a microchip number linked to a database, an ID assigned by the RSPCA, patient numbers at veterinary practices, which may be linked to more than one human, phone numbers on his tag etc. Point is, it’s much worse for researchers than for dogs – identities are maintained all over the place. Foley and Kochalko put it like this:

While much has changed since the days of David Livingstone, we continue to struggle with associating individuals with their works accurately and unambiguously. Author name ambiguity plagues science and scholarship: when researchers are not properly identified and credited for their work, dead-ends and information gaps emerge. The impact ripples throughout the ecosystem, compromising collaboration networks, impact metrics, “smarter” research allocations, and the overall discovery process. Name ambiguity also weighs on the system by creating significant hidden costs for all stakeholders. (2)


To do metadata management well we need to make sure that we sort out all sorts of naming and identifying issues, dealing correctly with potential causes of confusion: multiple people with the same name, people with multiple names over time (or simultaneously), and name variants. Even where there are agreed subject codes, like the Field of Research codes that are heavily used in research measurement exercises, they can get mixed up as different databases use different variants.

We try to work out how to fit new processes into existing workflows

At the University of Rochester, when they installed an institutional repository, the team conducted ethnographic research on their research community (3). We have not gone that far, but our Research Data Repository project does try to pay attention to what researchers do as part of their current work, and to fit new processes into existing ones.


For example, the above scenario tries to capture the interactions that would happen when a researcher is required by a journal to deposit data before publication. We spend a lot of time talking to the Office of Research Services (ORS) and research librarian team about how we can fit in with their existing processes, and how to minimise negative impacts on research groups. Research Offices are used to responding to changing regulatory environments so adding new fields to forms etc is straightforward. Changing IT services is much harder; the ITS is much bigger than ORS, new services need to be acquired, provisioned and documented, and the service desk team has to be taught new processes.

Challenge: how to stop the corporate publishing tail from wagging the scholarly dog

This is a rather a substantial issue to try to talk about in a discussion about research data management and repositories, but it’s essential to keep an eye on the big picture. We know that scholarship has to change, publishing has to change, but we don’t know how. We need to develop strategies for how we want it to change. Some examples of where this is important:

  • Policy on ‘ownership’ of intellectual property rights over data needs to be established. This is not as simple as it is for publications, as data are not always subject to copyright (1).

  • Data citation is going to be an important metric.

New models are needed. People like Alex Holcombe from Sydney uni are developing them:

Science is broken; let’s fix it. This has been my mantra for some years now, and today we are launching an initiative aimed squarely at one of science’s biggest problems. The problem is called publication bias or the file-drawer problem and it’s resulted in what some have called a replicability crisis.

When researchers do a study and get negative or inconclusive results, those results usually end up in file drawers rather than published. When this is true for studies attempting to replicate already-published findings, we end up with a replicability crisis where people don’t know which published findings can be trusted.

To address the problem, Dan Simons and I are introducing a new article format at the journal Perspectives on Psychological Science (PoPS). The new article format is called Registered Replication Reports (RRR).  The process will begin with a psychological scientist interested in replicating an already-published finding. They will explain to we editors why they think replicating the study would be worthwhile (perhaps it has been widely influential but had few or no published replications). If we agree with them, they will be invited to submit a methods section and analysis plan and submit it to we editors. The submission will be sent to reviewers, preferably the authors of the original article that was proposed to be replicated. These reviewers will be asked to help the replicating authors ensure their method is nearly identical to the original study.  The submission will at that point be accepted or rejected, and the authors will be told to report back when the data comes in.  The methods will also be made public and other laboratories will be invited to join the replication attempt.  All the results will be posted in the end, with a meta-analytic estimate of the effect size combining all the data sets (including the original study’s data if it is available). The Open Science Framework website will be used to post some of this. The press release is here, and the details can be found at the PoPS website.

http://alexholcombe.wordpress.com/2013/03/03/registered-replication-reports-are-open-for-submissions/

This seems like a positive note on which to end. Hundreds of researchers are trying to fix scholarship, they’re the ones we need to talk to about what a data repository or a data management plan should be.

Science is broken; let’s fix it

1. Stodden V. The Legal Framework for Reproducible Scientific Research: Licensing and Copyright. Computing in Science Engineering. 2009;11(1):35–40.

2. Foley MJ, Kochalko DL. Open Researcher and Contributor Identification (ORCID). 2012 [cited 2013 May 21]; Available from: http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1133&context=charleston

3. Lindahl D, Bell S, Gibbons S, Foster NF. Institutional Repositories, Policies, and Disruption. 2007 [cited 2013 May 21]; Available from: http://open.bu.edu/xmlui/handle/2144/919

Creative Commons License
Research Data @ the University of Western Sydney (Introducing a data deposit management plan to the research community at UWS) by Peter Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

Running an Open Source project from a university dev team

Steven Hayes from Arts eResearch at the University of Sydney invited me to visit their group and talk about running open source software projects, as they are making their Heurist (semantic database-of-everything) software open source. This was more of a conversation than a presentation, but I prepared a few ‘slides’ to remind me of which points to hit. Here are my notes. The focus here was not on why to go open source, or on open source in general; it was about doing it in a small university-based team. Comments about how various uni open source projects run would be appreciated.

I have been involved in creating two sizeable code-bases both released by the University of Southern Queensland as open source. They had very different histories. I’ll talk about both and how they run, although actually one of them doesn’t run any more in any meaningful way.

Two projects I started…

… on which other people* did most of the work

  • ICE – the Integrated Content Environment. Used at USQ for creating course materials for delivery online and in print. Almost no activity on this outside of USQ these days. Inside USQ? I don’t know for certain, but I think it is still in use, and finding a replacement has proven difficult (which doesn’t surprise me, as that was the reason we built it in the first place).

  • ReDBOX – the Research Data Box (and The Fascinator, the underlying toolkit).

*Thanks to Ron Ward, Oliver Lucido, Linda Octalina, Duncan Dickinson, Greg Pendlebury, Daniel de Byl, Bron Chandler, Tim McCallum, Cynthia Wong, Jason Zejfert, Sally MacFarlane, Caroline Drury, Pamela Glossop, Warwick Milne, Sue Craig, Vicki Picasso, Dave Huthnance, Shirley Reushle and the late Alan Smith who made, tested, championed and supported these projects. Thanks also to funding from the Australian government via ANDS, ARROW and other streams. Sorry if I forgot anyone.

(At this point I wanted to check that everyone knows what Open Source means, making sure that we all understand how Richard Stallman made software free using copyright law. Whoever holds the copyright in a bit of software (which is likely to be whoever wrote it, or their employer) can control distribution by using a licence, a legal instrument. Stallman’s insight was that a licence could be used to enforce sharing, openness and freedom: you can use this stuff I created provided you promise to share it with other people (that’s not a quote). Oh, and people working in this space should also understand the difference between Free and Open Source [1].

But I forgot.)

RTFM

Above, I linked to a free book on producing Open Source software [1] by Karl Fogel, which seems to cover most of what you’d need to know. I haven’t read it all, but it looks useful.

But I don’t like this

The book begins:

Most free software projects fail.

I think that’s silly, talking about failure without first defining success.

Me, I’m not sure that all the scenarios Fogel lists are failures at all; there are lots of reasons to release code and they are not all necessarily about building a substantial community:

We tend not to hear very much about the failures. Only successful projects attract attention, and there are so many free software projects in total[2] that even though only a small percentage succeed, the result is still a lot of visible projects. We also don’t hear about the failures because failure is not an event. There is no single moment when a project ceases to be viable; people just sort of drift away and stop working on it. There may be a moment when a final change is made to the project, but those who made it usually didn’t know at the time that it was the last one. There is not even a clear definition of when a project is expired. Is it when it hasn’t been actively worked on for six months? When its user base stops growing, without having exceeded the developer base? What if the developers of one project abandon it because they realized they were duplicating the work of another—and what if they join that other project, then expand it to include much of their earlier effort? Did the first project end, or just change homes?

What’s the first thing that comes to mind when you think of Open Source?

Linux? Apache? WordPress?  Firefox?

The hits. The stadium-filling rock-star projects?

Your band has 99.9% probability of staying in the garage

Figure 1 Me (the good looking one) and cousin Tim at the Springwood Sports Club, about to perform with a community uke-group. No plans for world domination, playing for family, who are obliged to attend, and even some people who, for some reason, choose to come. #Notfailure.

It’s important to work out why you are going to release software as Open Source – think about the audience. One very important audience is you, yourself. If you work on code as part of your job, then your employment contract may well mean that your employer owns the copyright. Do you want to be able to continue using it in your next job? Show potential employers? Making it open source helps your future self.

I know this first hand.

Universities are not as stable as they seem, or as you may hope. At the Australian Digital Futures Institute at USQ we began by hosting code repositories and websites internally. I reasoned that the university would be a good bet for maintaining persistence of these resources.

But then one Gilly Salmon came to our institute to be the new professor and decided, along with the rest of the senior leadership team, that there was altogether too much making the digital future going on in the Australian Digital Futures Institute, too much technology. They let just about all the technical staff go, no matter how useful they were to the organisation, or how pregnant they happened to be (we’re a relationship brand, the director of marketing told me, so we shouldn’t be continuing to develop software to deliver award-winning distance-ed services).

Web sites that would still have value are just gone from public view, including, ironically, the PILIN project site, which was about persistent identifiers. Even the ICE website, which is full of useful stuff for USQ itself, now appears to be only accessible via the Wayback Machine. They’re still using it but they turned off the website anyway; the code, however, is sitting on Google Code, so we all still have access to it.

This sort of thing happens all the time. For a couple of us, the NextEd refugees, this was the second redundancy associated with USQ. Kids, it is prudent to make sure that any code you might want to re-use later in your career is released under an open licence, and documentation, web sites etc. likewise under Creative Commons. Think of it as a professional escape pod.

The ReDBOX project survived this ADFI shut-down because it had been open source from the beginning, but further funding had to be redirected to another university which was willing to host the building of a digital future.

Lessons

  • Open Source can be worth doing even if the audience is your future self

  • Don’t trust someone else to keep your website up

  • If you want a community you’ll (likely) have to build it

  • Every project is different, so you need to structure yours around your users

Oh, and the answer to most questions is on Stack Exchange. I decided that this list was worth using as a starting point for discussion.

http://programmers.stackexchange.com/questions/51553/checklist-for-starting-an-open-source-project

Havoc P said: [with additions by me after the discussion at USYD]

Things I’d put in the early priorities are:

  • have a simple “what is it?” web site with links to some discussion forum (whether email or chat) and to the source code repository

    [Mailing lists are usually best IMO – forums can be empty, echoing, and make your project look unloved. A tech list is a must, always, but other communications should be built around the reality of your project. No user community yet? Build one. Others over at Stack Exchange added that once you have a tech list it is best to hold or log all your discussions there, so architectural decisions are transparent and the community can engage.

    On the ReDBOX project there are two main mailing lists, one for the techies and one for the users (mostly library staff), and lots of virtual and face-to-face get togethers. There is a committers group who are in charge of what gets into the trunk and various ad-hoc arrangements to sponsor sub-projects at the dozen or so sites using the software. The groups and how they interact were all created to serve that community, not from some manual of best practice, although it is all informed by collective experience of open source projects.]

  • be sure the code compiles and usually works, don’t commit work-in-progress or half-ass patches on the main branch that break things, because then other people’s work would be disrupted

    [Well, OK, but if you’re releasing an existing code base then don’t get too hung up on making things perfect: (a) it will be a huge waste if there is no demand for your code, and (b) don’t be unnecessarily shy; most open source projects are like busking, not stadium rock, and nobody is watching you waiting to pounce on your errors.]

  • put a license file in the code repository with a well-known license, and mark the copyright owner (probably you, or your company). don’t omit the license, make up a license, or use an obscure license.

  • have instructions for how to contribute, say in a HACKING file or include in your README. This should include where to send patches, how to format patches, code indentation rules, any other important conventions of the project

  • have instructions on how to report a bug

  • be helpful on the mailing list or whatever your forums are

More from Havoc P

After those priorities I’d say:

  • documentation (this saves you work on the mailing list… make a FAQ from your list posts is a simple start)

  • try to do things in a “normal” way (don’t invent your own build system or use some weird one, don’t use 1-space indentation, don’t be annoyingly quirky in general because it adds learning curve)

  • promote your project. marketing marketing marketing. You need some blogs and news sites and stuff like that to cover you, and then when people show up interested, you need to talk to them and be sure they get it working and look at their patches. Maybe mention your project in the forums for related projects.

    [Yes, this is a huge one. One of the big differences between ReDBOX, which is no hit but has a solid user base, and ICE, which never made it out of USQ, is that Vicki Picasso from Newcastle Uni and I marketed the hell out of ReDBOX early, to a very specific community of user-organisations. We needed a community so the software would have a sustainable base, so we designed the software for the community and sought input on the design as broadly as we could.

    With ICE, I talked about it to lots of the wrong people and didn’t sell it to the right ones (other distance-ed unis), but that was partly because it conferred a competitive advantage on USQ. This comes back to the point above about success vs failure – there’s more than one way to succeed.]

  • always review and accept patches as quickly as humanly possible. Immediately is perfect. More than a couple days and you are losing lots of people.

  • always reply to email about the project as quickly as humanly possible.

  • create a welcoming/positive/fun atmosphere. don’t be a jerk. say please and thank you and hand out praise. chase off any jackasses that turn up and start to poison the community. try to meet people in person when you can and form bonds.

[1] K. Fogel, Producing open source software: How to run a successful free software project. O’Reilly Media, Inc., 2005.

Creative Commons License
Running an Open Source project from a university dev team by Peter (pt) Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.