Category Archives: Uncategorized

Introducing next year’s model, the data-crate; applied standards for data-set packaging

This is also up at the UWS eResearch blog

[Update 2013-11-04:

If you're reading this in Feedly (and possibly other feed readers) the images in this post won't show – click through to the site to see the presentation

Added some more stuff from the proposal, including the reference list - clarified some quoted text]

Creative Commons Licence
Introducing next year’s model, the data-crate; applied standards for data-set packaging by Peter Seftton and Peter Bugeia is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License .

This presentation was delivered by Peter Sefton at eResearch Australasia 2013 in Brisbane, based on this proposal.

Slide 1

Peter Sefton* p.sefton@uws.edu.au

Peter Bugeia** peter.bugeia@intersect.org.au

*University of Western Sydney

**Intersect Australia Ltd

ABSTRACT

In this paper we look at current options for storing research data to maximize potential reuse and discoverability, both at the level of individual data files and of sets of data files, and describe some original work bringing together existing standards and metadata schemas to make well-described, reusable data sets that can be distributed as single files, dubbed “crates”, with as much context and provenance as possible. We look at some of the issues in choosing file formats in which to archive and disseminate data, and discuss techniques for adding contextual information that is both human-readable and machine-readable, in the context of both institutional and discipline data management practice.


Slide 2

When the eResearch team at UWS and Intersect were working on the ANDS DC21 “HIEv” (5) application to allow researchers to create data-sets from collections of files, we looked in vain for a simple-to-implement solution for making CSV-type data available with as much provenance and re-use metadata as possible. In this presentation we will discuss some of the many file-packaging options which were considered and rejected including METS (6), and plain-old zip files with no metadata.

The Eucalyptus woodland free-air CO2 enrichment (EucFACE) facility is the only one of its kind in the southern hemisphere.

It is unique in that it provides full-height access to the mature trees within remnant Cumberland Plain Forest, the only FACE system in native forest anywhere in the world. It is sited on naturally low-nutrient soils in what is close to original bushland, and offers researchers an amazing site at which to study the effects of elevated CO2 on water use, plant growth, soil processes and native biodiversity in a mature, established woodland within the Sydney Basin.

http://www.uws.edu.au/hie/research/research_projects/eucface

This is in the context of the Hawkesbury Institute for the Environment (HIE) experimental facilities; pictured is the Free-Air CO2 Enrichment experiment (EucFACE) under construction.


Slide 3

This is the context in which we did this data-packaging work, but it is designed to be more broadly applicable.


What keeps us awake at night?

What if we provide a zip download of a whole lot of environment-data files and someone writes an important article, but then they can’t work out which zip file and which data files they actually used?

What if there’s some really important data that I know I have on my hard-disk, but I can’t tell which file it’s in because they’re all called things like 34534534-er2.csv?


Some standards are not actually standards…

We have reached the time when there is a genuine need to be able to match up data from different sources; infrastructure projects funded by the Australian National Data Service (ANDS) (4) are now feeding human-readable metadata descriptions to the Research Data Australia (RDA) website. But which standards to use? As Tanenbaum said, “The nice thing about standards is that you have so many to choose from. Furthermore, if you do not like any of them, you can just wait for next year’s model” (1). However, when it comes to choosing file format standards for research data, we have found that while there might be many standards, there is no single standard for general-purpose research data packaging. It is, however, possible to stitch together a number of different standards to do a reasonable job of packaging and describing research data for archiving and reuse.

There are several issues with standards at the file level. For example, consider one of the most commonly supported formats: CSV, or Comma Separated Values. CSV is actually a non-standard, i.e. there is no agreed CSV specification, only a set of unreliable conventions used by different software, RFC 4180 (2) notwithstanding. And while a CSV file may have column headers, there is no way to standardise their meaning. Moving up the complexity chain, the Microsoft Excel-based .xlsx format is a standard, as is the Open Document Format for spreadsheets, but again, even though you can point to a header row in a spreadsheet and say “that’s the header”, there is no standard way to label variables so that they will match the labels used by other researchers, or to allow discovery of the same kind of data points in heterogeneous data sets. There is a well-established standard which does allow for “self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data”, NetCDF (3) – we will consider how this might be more broadly adopted in eResearch contexts.
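
To make the contrast concrete, here is a minimal sketch (Python with the netCDF4 package; the variable names, units and values are ours, purely for illustration) of how a NetCDF file carries its own semantics where a CSV carries only a bare header row:

    # Minimal sketch: writing a self-describing NetCDF file.
    # Assumes the netCDF4 package; names, units and values are illustrative.
    from netCDF4 import Dataset

    nc = Dataset("airtemp.nc", "w")
    nc.createDimension("time", None)             # unlimited time dimension

    t = nc.createVariable("time", "f8", ("time",))
    t.units = "hours since 2013-01-01 00:00:00"  # machine-readable time semantics

    temp = nc.createVariable("air_temperature", "f4", ("time",))
    temp.units = "degC"                          # the units travel with the data
    temp.long_name = "Air temperature at 2 m"    # human-readable description

    t[:] = [0.0, 1.0, 2.0]
    temp[:] = [17.1, 16.8, 16.5]
    nc.close()

A CSV of the same readings would carry only a header like "time,temp", with no agreed way to say what those labels mean.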


Slide 6

Data Packaging Principles for this environment…

Slide 8

2. The packaging format should deal with any kind of data file

3. The packaging format should work for any domain

4. The packaging format should be platform neutral

Slide 12

6. Metadata should be both human and machine-readable

Slide 14

8. The package format should cater for datasets of any size*

The Crate Specification…

Slide 17

Can you guess which two standards are the basis for the crate?


Slide 18

When the eResearch team at UWS and Intersect NSW were working on the ANDS DC21 “HIEv” (5) application to allow researchers to create data-sets from collections of files, we looked in vain for a simple-to-implement solution for making CSV-type data available with as much provenance and re-use metadata as possible, as per the principles outlined above. In this presentation we will discuss some of the many file-packaging options which were considered and rejected, including METS (6) and plain-old zip files with no metadata. The project devised a new proof-of-concept specification, known as a ‘crate’, based on a number of standards. This format:

Uses the California Digital Library’s BagIt specification (7) for bundling files together into a bag.

Creates a single file for the bag using zip (other contenders would include TAR or disk-image formats, but zip is widely supported across operating systems and software libraries).

Uses a human-readable HTML README file to make apparent as much metadata as is available from (a) within files and (b) about the context of the research data.

Uses RDF with the W3C’s DCAT ontology (8) and others to add machine-readable metadata about the package, including relationships between files, technical metadata such as types and sizes, and research context.
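
To make the recipe concrete, here is a minimal sketch of how such a crate might be assembled, in Python with the bagit library; the directory name, metadata values and README content are illustrative only, not part of any specification:

    # Minimal sketch: bag the files (BagIt), add a human-readable README,
    # then zip the lot into a single distributable "crate".
    # Assumes the bagit package; all names and values are illustrative.
    import bagit, shutil

    # 1. Bundle the data directory into a bag (adds manifests and checksums).
    bag = bagit.make_bag("my-dataset",
                         {"Contact-Name": "A. Researcher",
                          "External-Description": "EucFACE sensor data"})

    # 2. Add a human-readable README at the top of the bag (in practice it
    #    would carry the RDFa-embedded metadata described below).
    with open("my-dataset/README.html", "w") as f:
        f.write("<html><body><h1>EucFACE sensor data</h1></body></html>")
    bag.save(manifests=True)  # re-run checksums so the README is covered

    # 3. Create the single file for distribution.
    shutil.make_archive("my-dataset-crate", "zip", "my-dataset")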


Slide 19

The following few slides from the DC21/HIEv system show how a user can select some files…


Slide 20


… look at file metadata …


Slide 21

… add files to a cart …


Slide 22


… download the files in a zip package …


Slide 23


… inside the zip the files are structured using the bagit format …


Slide 24

… with a standalone README.html file containing all the metadata we know about the files and associated research context (experiments, facilities) …

This is something you can unzip on your laptop or put on a web server, or that a repository could show to users as a ‘peek’ inside the data set.


Slide 25


… with detail about every file as per the HIEv application itself


Slide 26

… and embedded machine-readable metadata using RDFa


Slide 27

… the RDFa metadata describes the data-set as a graph.

Completed packages flow through to the Research Data Catalogue via an OAI-PMH feed, and there they are given a DOI so they can be cited. The hand-off between systems is important: once a DOI is issued, the data set has to be kept indefinitely and must not be changed.
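
Because OAI-PMH is plain HTTP, the catalogue’s side of that hand-off can be as simple as polling the feed. A minimal sketch follows; the endpoint URL is hypothetical, while the verb and parameters are standard OAI-PMH:

    # Minimal sketch: list records from an OAI-PMH feed.
    # The endpoint URL is hypothetical; "ListRecords" and "metadataPrefix"
    # are standard OAI-PMH protocol elements.
    import requests
    import xml.etree.ElementTree as ET

    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
    resp = requests.get("http://hiev.example.edu/oai",       # hypothetical
                        params={"verb": "ListRecords",
                                "metadataPrefix": "oai_dc"})
    root = ET.fromstring(resp.content)
    for record in root.iter(OAI_NS + "record"):
        print(record.find(OAI_NS + "header/" + OAI_NS + "identifier").text)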


Slide 28

The README file not only contains human-readable descriptions of the files and their context; it also carries embedded machine-readable metadata. Relationships such as “CreatedBy” use URIs from mainstream ontologies where possible.
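
As an illustration, the machine-readable side boils down to a graph like the following rdflib sketch; the URIs and property choices here are ours, for illustration, and in the real README equivalent triples are carried as RDFa attributes in the HTML:

    # Minimal sketch: the kind of graph the README's RDFa encodes.
    # All URIs below are illustrative placeholders.
    from rdflib import Graph, Literal, Namespace, URIRef

    DCAT = Namespace("http://www.w3.org/ns/dcat#")
    DCT = Namespace("http://purl.org/dc/terms/")

    g = Graph()
    ds = URIRef("http://example.edu/crates/eucface-2013-10")   # hypothetical
    g.add((ds, DCT.title, Literal("EucFACE sensor data, October 2013")))
    g.add((ds, DCT.creator, URIRef("http://example.edu/people/a-researcher")))
    g.add((ds, DCAT.distribution,
           URIRef("http://example.edu/crates/eucface-2013-10.zip")))
    print(g.serialize(format="turtle"))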


Slide 29

We have not done this yet, but using platforms like R-Studio + knitr it would be possible to include runnable code in data packages, which would provide a ‘literate programming’ README. This is an example of some data we got from Craig Barton and Remko Duursma.


Slide 30

So the README could include plots, etc, and a copy of the article


Slide 31

Cr8it is designed to plug in to the ownCloud share-sync service so users can compile data sets from working data files for deposit in a repository.

The HIE project is (in part) a simple semantic CMS that will describe the research context at HIE.


What’s next?

Try this in more places

Integrate research context

Continue quest for decent ontologies and vocabs

Get feedback

REFERENCES

1. Tanenbaum AS. Computer networks. Prentice Hall PTR (ECS Professional). 1988;1(99):6.

2. Shafranovich Y. Common Format and MIME Type for Comma-Separated Values (CSV) Files [Internet]. [cited 2013 Jun 8]. Available from: http://tools.ietf.org/html/rfc4180

3. Rew R, Davis G. NetCDF: an interface for scientific data access. Computer Graphics and Applications, IEEE. 1990;10(4):76–82.

4. Sandland R. Introduction to ANDS [Internet]. ANDS; 2009. Available from: http://ands.org.au/newsletters/newsletter-2009-07.pdf

5. Intersect. Data Capture for Climate Change and Energy Research: HIEv (AKA DC21) [Internet]. Sydney, Australia; 2013. Available from: http://eresearch.uws.edu.au/blog/projects/data-capture-for-climate-change-and-energy-research/

6. Pearce J, Pearson D, Williams M, Yeadon S. The Australian METS Profile–A Journey about Metadata. D-Lib Magazine. 2008;14(3/4):1082–9873.

7. Kunze J, Boyko A, Vargas B, Madden L, Littman J. The BagIt File Packaging Format (V0.97) [Internet]. [cited 2013 Mar 1]. Available from: http://tools.ietf.org/html/draft-kunze-bagit-06

8. Maali F, Erickson J, Archer P. Data Catalog Vocabulary (DCAT) [Internet]. World Wide Web Consortium; Available from: http://www.w3.org/TR/vocab-dcat/

Another student project – crossing the curation boundary

Creative Commons Licence
Another student project – crossing the curation boundary by Peter Sefton is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.


I wrote last week about a student project on HTML slide-viewing for which I’m the client. This week I met with another group to talk about a project which has more direct application to my job as eResearch manager at the University of Western Sydney.

The next cohort are going to be looking at a system for getting working data into an eResearch application. Specifically, they are going to have a go at building an uploader add-in for ownCloud, the open source Dropbox.com-like system, so it can feed data to the HIEv data management application used by the Hawkesbury Institute for the Environment. This project was inspired by two things:

  • The fact that we’re working with ownCloud in a trial at UWS, and our Cr8it data packaging and archiving application is based on ownCloud, so getting some students working in this area will help us better understand ownCloud and build expertise at UWS.

  • A meeting with Gerry Devine, the data manager at HIE, where he was explaining how the institute is trying to improve the quality of data in HIEv; at this stage at least they don’t want everything uploaded, and files need to conform to naming conventions1.

These two things go very nicely together. ownCloud has a Dropbox-like sync service that can replicate folders full of files across machines via a central server, plus a web view of the files; it has a plugin architecture, so it is easy to add actions to files; and HIEv has an API that can accept uploads. The application is simple:

  • For certain file types, those that might have data like .csv files, show an ‘Upload to HIEv’ button in the web interface.

  • Present the user with a form to collect metadata about the file: what date range does it represent, which experimental facility is it from (via a drop-down list), etc. (And yes, automated metadata extraction would be a nice-to-have, if the students have time.)

  • Use the collected metadata to generate file names that follow the institute’s conventions.

  • Upload to HIEv (a minimal sketch of these last two steps follows this list).
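
Here is that sketch, in Python; the HIEv endpoint, its parameters and the naming convention shown are all hypothetical stand-ins:

    # Minimal sketch: build a conventional file name from the collected
    # metadata, then push the file to HIEv. Endpoint, parameters and the
    # naming convention are hypothetical.
    import requests

    meta = {"facility": "EUC", "variable": "airtemp",
            "date_from": "20130101", "date_to": "20130131"}

    # Generate a file name that follows the institute's naming convention.
    fname = "{facility}_{variable}_{date_from}-{date_to}.csv".format(**meta)

    with open("raw-export.csv", "rb") as f:
        resp = requests.post("https://hiev.example.edu/api/files",  # hypothetical
                             data=meta, files={"file": (fname, f)})
    resp.raise_for_status()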

I think that should be a reasonable scope for a third year assignment, with plenty of room to add nice add-on features if there’s time. A couple of obvious ones:

  • Extracting metadata from files (e.g. working out the date range).

  • Making the metadata form configurable, e.g. with a JSON file.

Beyond that, there is a potentially much more ground-breaking extension possible. Instead of having to set up the metadata form for every context of research, what if information about the research context could be harvested from the web and the user could pick their context from that?

I have been talking this idea through with various eResearch and repository people. I submitted it as an idea to the Open Repositories Dev challenge (late, as usual). Nobody bit, but I think it’s important:

If you are building a repository for research data, then you need to be able to record a lot of contextual metadata about the data being collected. For example, you might have some way to attach data to instruments. We typically see designs with hierarchies something like Facility / Experiment / Dataset / File. The problem is, if you design this into the application, for example via database tables, it becomes much harder to adapt to a new domain or to changing circumstances, where you might have more or fewer levels, or where hierarchies of experiment or instrument might become important, etc.

So, what I’d like to see would be a semantic wiki or CMS for describing research context, with some built-in concepts such as “Institute”, “Instrument”, “Experiment”, “Study”, “Clinical Trial” (but extensible), which could be used by researchers, data librarians and repository managers to describe research context as a series of pages or nodes, and thus create a series of URIs to which data in any repository anywhere can point. The research data repository could then concentrate on managing the data, and link the units of data (files, sets, databases, collections) to the context via RDF assertions such as ‘<file> generatedBy <instrument>’. Describing new data sets would involve look-ups and auto-completes against the research-context semantic wiki – a really interesting user interface challenge.

It would be great to see someone demonstrate this architecture, building on a wiki or CMS framework such as Drupal or maybe one of the NoSQL databases, or maybe as a Fedora 4 app, showing how describing research context in a flexible way can be de-coupled from one or more data-repositories. In fact the same principle would apply to lots of repository metadata – instead of configuring input forms with things like institutional hierarchies, why not set up semantic web sites that document research infrastructure and processes and link the forms to them?

Back to UWS and my work with Gerry Devine. It turns out Gerry has been working on describing the research context for his domain, the Hawkesbury Institute for the Environment. Gerry has a draft web site which describes the research context in some detail – all the background you’d like to have to make sense of a data file full of sensor data about life in whole-tree chamber number four. It would be great if we could get the metadata in systems like HIEv pointing to this kind of online resource with statements like this:

<this-file> generatedBy https://sites.google.com/site/hievuws/facilities/eucface
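
In machine-readable form that assertion is just one triple; here is a minimal rdflib sketch (the predicate URI is a placeholder – in practice it would come from an agreed provenance ontology – and the file URI is a hypothetical HIEv identifier):

    # Minimal sketch: assert that a data file was generated by a facility.
    # The vocabulary and file URIs are placeholders.
    from rdflib import Graph, Namespace, URIRef

    VOCAB = Namespace("http://example.org/vocab#")   # placeholder ontology
    g = Graph()
    g.add((URIRef("http://example.edu/files/ros-ws-tower-201304.csv"),
           VOCAB.generatedBy,
           URIRef("https://sites.google.com/site/hievuws/facilities/eucface")))
    print(g.serialize(format="turtle"))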

To support this we’d need to add some machine-readable metadata to supplement Gerry’s draft human-readable web site. Ideally such a site would be able to support versioned descriptions of context, so you could link data to a particular configuration of the research context, in the interests of maximising research integrity as per the Singapore Statement:

4. Research Records: Researchers should keep clear, accurate records of all research in ways that will allow verification and replication of their work by others.

5. Research Findings: Researchers should share data and findings openly and promptly, as soon as they have had an opportunity to establish priority and ownership claims.

1. I know there are some strong arguments that IDs should be semantically empty – i.e. that they should not contain metadata – but there are good practical reasons why data files with good names are necessary, and anyway the ID for a data set is not the same as its filename when it happens to be on your laptop.

4A Data Management: Acquiring, Acting-on, Archiving and Advertising data at the University of Western Sydney

This is a repost of a presentation I wrote with Peter Bugeia and delivered at Open Repositories in Canada, originally published on the UWS eResearch team blog, and presented here with minor updates to the notes – mainly formatting, but with one extra quip.

Creative Commons Licence
4A Data Management: Acquiring, Acting-on, Archiving and Advertising data at the University of Western Sydney by Peter Sefton and Peter Bugeia is licensed under a Creative Commons Attribution 3.0 Unported License.

Slide 1

Notes

Abstract

There has been significant Government investment in Australia in repository and eResearch infrastructure over the last several years, to provide all universities with an institutional repository for publications, and, via the Australian National Data Service, to encourage the creation of institution-wide Research Data Catalogues and research Data Capture applications. Further rounds of funding have added physical data storage and cloud computing services. This presentation looks at an example of how these streams of money have been channeled together at the University of Western Sydney to create a joined-up vision for research data management across the institution and beyond, creating an environment where data may be used by research teams within and outside of the institution. Alongside the technical services, we report on early work with researchers to create a culture of replicable use of data, towards the vision of truly reproducible research.

This presentation will show a proven end-to-end design for research data flows, starting from a research group, The Hawkesbury Institute for the Environment, where a large sensor network gathers data for use by institute researchers, in-situ, with data flowing-through to an institutional data repository and catalogue, and thence to Research Data Australia – a national data search engine. We also discuss a parallel workflow with a more generic focus – available to any researcher. We also report on work we have done to improve metadata capture at source, and to create infrastructure that will support the entire research data lifecycle. We include demonstrations of two innovations which have emerged from the associated project work: the first is of a new tool for researchers to find, organize, package and publish datasets; the second is of a new packaging format which has both human-readable and machine-readable components.

Slide 2

Notes

Some of the work we discuss here was funded by the Australian National Data Service. See:

Seeding the commons project to describe data sets at UWS and the Data catalogue project.

HIEv Data Capture at the Hawkesbury Institute for the Environment

The talk

Notes

We’ll use the four A’s to talk about some issues in data management.

  • We need a simple framework which covers it all, to capture how we work with research data from cradle to grave:

  • We need to Acquire the raw data and make it secure and available to be worked on.

  • We need to Act on the data to cleanse it while keeping track of how it was cleansed, and to analyse it using tools that support our research, all while maintaining the data’s provenance.

  • We need to Archive the data from working storage to an archival store, making it citable

  • We need to Advertise that the data exists so that others can discover it and use it confidently with simple access mechanisms and simple tools.

  • 4A must work for

  • high-intensity research data such as that from gene sequences, sensor networks, astronomy, medical diagnostic equipment, etc.

  • the long tail of unstructured research data.

For example

Notes

In the presentation, I used a short video on how to catch a kangaroo. (Late the night before, I was searching for this video, forgot how to spell kangaroo, and started typing “How to catch a C … A … N …” – at which point the Google suggestion popped up with this, which I decided not to show at the conference. I’d blame the jet-lag, but you wouldn’t believe me.)

If only data capture were as simple as catching a kangaroo in a shopping bag!

Australian Government Initiatives in Research Data Management

Notes

There have been several rounds of investment in (e)research infrastructure in Australia over the last decade, including substantial investments to get institutional publications repositories established.

  • Australian National Data Service (ANDS) $50M (link)

  • National eResearch Collaboration Tools and Resources (NeCTAR) project (link) $50M

  • Research Data Storage Infrastructure (RDSI) $50M (link)

  • Implemented to date:

  • National Research Data Catalogue – Research Data Australia

  • Standard approach to updating the Catalogue (OAI-PMH and rif-cs)

  • 10+ Institutional Metadata Repositories implemented

  • 120+ data capture applications implemented across 30+ research organisations

  • Upgrade of High Performance Computing infrastructure

  • Colocation of data storage and computing

Slide 6

Notes

UWS is a young (~20 years) university performing well above most of its contemporaries in research.

Slide 7

Notes

This slide, by Prof Andrew Cheetham, the Deputy Vice-Chancellor for Research, shows that UWS performs very well at attracting competitive grant income from the Australian Research Council.

Slide 8

Notes

UWS is concentrating its research into flagship institutes – we will be talking in more detail about HIE, our environmental institute, which does research cutting across disciplines, spanning from the leaf level to the ecosystem level.

Slide 9

Notes

Slide 11

Notes

These are Intersect’s members. Intersect also collaborates with other eResearch organisations throughout Australia.

The slide is a photo from the recent Hackfest event. This is an annual fun competition for software developers to use open government data in innovative ways; Intersect hosted the NSW chapter of the event.

eResearch @ UWS

Notes

The eResearch unit at UWS is a small team, currently reporting to the Deputy Vice Chancellor, Research. See our FAQ.

Slide 13

Notes

At UWS, we haven’t tried to drive change with top-down policy. Instead, we’ve taken a practical, project-based approach which has allowed a data architecture to evolve. The eResearch Roadmap calls for a series of data capture applications to be developed for data-intensive research, along with a generic application to cover the long tail of research data.

The 4A Vision

For the purposes of this presentation we will talk about the ‘4A’ approach to research data management – Acquire, Act, Archive and Advertise. The choice of different terms from the 2Rs Reuse and Reproduce of the conference theme is intended to throw a slightly different light on the same set of issues. The presentation will examine each of these ‘A’s in turn and explain how they have helped us to organize our thinking in developing a target technical data architecture and integrated data-related end-to-end business processes and services involving research technicians and support staff, researchers and their collaborators, library staff, information technology staff, office of research services, and external service providers such as the Australian National Data Service and the National Library of Australia. The presentation will also discuss how all of this relates to the research project life cycle and grant funding approval.

Acquiring the data

We are attacking data acquisition (known as Data Capture by the Australian National Data Service, ANDS 1) in two ways:

With discipline-specific applications for key research groups. A number of these have been developed in Australia recently (for example MyTARDIS 2); we will talk about one developed at UWS. With ANDS funding, UWS is building an open source automated research data capture system (the HIEv) for the Hawkesbury Institute for the Environment, to automatically gather time-series sensor data and other data from a number of field facilities and experiments, providing researchers and their authorised collaborators with easy self-service discovery of, and access to, that data.

Generic services for data storage via simple file shares; integration with cloud storage, including Dropbox.com and other distributed file systems; and source-code repositories, such as public and private GitHub and Bitbucket stores, for working code and textual data.

Acting on data

The data Acquisition services described above are there in the first instance to allow researchers to use data. With our environmental researchers, we are developing techniques for building reusable data sets which include raw data and commented scripts that clean the data (e.g. a comment “filter out known bad days when the facility was not operating”) and then re-organize it, via resampling or other operations, into useful ‘clean’ data that can be fed to models, plotted and used as the basis of publications. Demo: the presentation will include a live demonstration of using HIEv to work on data and create a data archive.
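
A cleaning script of the kind described might look like the following minimal pandas sketch; the file names, column names and bad days are illustrative only:

    # Minimal sketch: a commented cleaning-and-resampling script.
    # File names, columns and dates are illustrative only.
    import pandas as pd

    raw = pd.read_csv("facility-sensors.csv",
                      parse_dates=["timestamp"], index_col="timestamp")

    # Filter out known bad days when the facility was not operating.
    bad_days = pd.to_datetime(["2013-02-11", "2013-02-12"])
    clean = raw[~raw.index.normalize().isin(bad_days)]

    # Resample the cleaned readings to daily means for models and plots.
    daily = clean.resample("D").mean()
    daily.to_csv("facility-sensors-daily.csv")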

From action to archive

Having created both re-usable base data sets and publication-specific operations on data to create plots etc., there are several workflows by which various parties trigger deposit of finished, fixed, citable data into a repository. Our project team mapped out several scenarios in which data are deposited, with different actors and drivers, including motivations that are both carrot (my data set will be cited) and stick (the funder/journal says I have to deposit). Services are being crafted to fit in with these identified workflows, rather than building new things and assuming “they will come”.

Archiving the data

The University of Western Sydney has established a Research Data Repositoryi (RDR), the central component of which is a Research Data Catalogue running on the ReDBOX open source repository platform. While individual data acquisition applications such as HIEv are considered to have a finite lifespan, the RDR will provide on-going curation of important research datasets. This service is set up to harvest data sets from the working-data applications, including the HIEv data-acquisition application and the CrateIt data packaging service, using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).

Advertising the data

As with institutional publications repositories, one of the key functions of the Research Data Repository is to disseminate metadata about holdings to aggregation services and to give data a web presence. Many Australian institutions are connected to the Research Data Australia discovery service 6, which harvests metadata via an ANDS-defined standard over the OAI-PMH harvesting protocol. There is so far no Google-Scholar-like service harvesting metadata about data sets via direct web crawling (that we know about), so there are no firm standards for how to embed data descriptions in a page, but we are tracking the development of the Schema.org vocabulary, driven largely by Google and its search-engine peers, and the work described above on data packaging with RDFa metadata is intended to be consumed by direct crawlers. It is possible to unzip a CrateIt package and expose it to the web, thus creating a machine-readable entry point to the data within the Zip/BagIt archive.

Looking to the future, the University is also considering plans for an over-arching discovery hub, which would bring together all metadata about research, including information on publications, people and organisations.

Technical architecture

The following diagram shows the first end-to-end data-capture-to-archiving pathways to be turned on at the University of Western Sydney, covering Acquisition and Action on data (use) and Archiving and Advertising of data for reuse. Note the inclusion of a name-authority service, which is used to ensure that all metadata flowing through the system is unambiguous and linked-data-ready 7. The name authority is populated with data about people, grants and subject codes from databases within the research services section of the university and from community-maintained ontologies. A notable omission from the architecture is integration with the Institutional Publications Repository – we hope to be able to report on progress joining up that piece of the infrastructure via a Research Hub at Open Repositories 2014.

i Project materials refer to the repository as a project which includes both working and archival storage as well as some computing resources, drawing a line around ‘the repository’ that is larger than would be usual for a presentation at Open Repositories.

Slide 14

Notes

There are a number of major research facilities at HIE, here are two whole-tree chambers which allow control over temperature, moisture and atmospheric CO2.

Slide 15

Notes

This diagram shows the end-to-end data and application architecture which Intersect and UWS eResearch built to capture data from HIE sensors and other sources. Each of the columns roughly equates to one of the four As. Once data is packaged in the HIEv, it is stored in the Research Data Store and a corresponding record is created in the Research Data Catalogue. The data packaging format produced by the HIEv, along with the delivery protocol, are key to the architecture: the packaging format (based on BagIt) is stand-alone and self-describing, and the delivery protocol (OAI-PMH) is well-defined and standards-based. These are discussed in more detail in later slides. When other data capture applications are developed at UWS, they will integrate into and extend the architecture simply by packaging data in the same format and delivering the same metadata via the same protocol as the HIEv.

Slide 16

Notes

This diagram shows how the four ‘A’s fit together for HIE. Acquisition and action are closely related – it is important to provide services which researchers actually want to use and to build in data publishing and packaging services rather than setting up an archive, and hoping they come to it with data.

Slide 17

Notes

The HIEv/DC21 application is available as open source:

  • Funded by ANDS

  • Developed by Intersect

  • Automated data capture

  • Ruby on Rails application

  • Agile development methodology

  • Went live in Jan 2013.

  • 1200 files, 15 GB of RAW data, 25 users.

  • 120 files auto-uploaded nightly, +1GB per week

  • Expected to reach 50,000 files in next couple of years

  • Now extended to include EucFACE data

  • Possibly to be extended to include Genomic data (20TB per year)

  • Integrated with UWS data architecture

  • Supports the full 4 As – links Acquire to Act to Archive

Slide 18

Notes

Acting on data: our researchers are now starting to do work with the HIEv system. Here’s an API developed by Dr Remko Duursma for consuming HIEv data in R.

Slide 19

Notes

Acting on data: researchers can pull data either manually or via API calls and do work, such as this R plot.
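
The equivalent pull-and-plot in Python would look something like the sketch below (the R package shown on the slide does the same job; the endpoint, token and column names here are hypothetical):

    # Minimal sketch: fetch a file from the HIEv API and plot it.
    # Endpoint, token and column names are hypothetical.
    from io import StringIO

    import matplotlib.pyplot as plt
    import pandas as pd
    import requests

    resp = requests.get("https://hiev.example.edu/api/files/1234/download",
                        params={"auth_token": "SECRET"})   # hypothetical
    df = pd.read_csv(StringIO(resp.text), parse_dates=["timestamp"])
    df.plot(x="timestamp", y="air_temperature")
    plt.savefig("airtemp.png")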

From acting to archiving…

Notes

The following few slides show how a user can select some files…

Slide 21

Notes

… look at file metadata …

Slide 22

Notes

… add files to a cart …

Slide 23

Notes

… download the files in a zip package …

Slide 24

Notes

… inside the zip the files are structured using the bagit format …

Slide 25

Notes

… with a standalone README.html file containing all the metadata we know about the files and associated research context (experiments, facilities) …

Slide 26

Notes

… with detail about every file as per the HIEv application itself

Slide 27

Notes

… and embedded machine-readable metadata using RDFa Lite attributes

Slide 28

Notes

… the RDFa metadata describes the data-set as a graph.

Completed packages flow through to the Research Data Catalogue via an OAI-PMH feed, and there they are given a DOI so they can be cited. The hand-off between systems is important: once a DOI is issued, the data set has to be kept indefinitely and must not be changed.

Slide 29

Notes

Advertising data: this is a record about an experiment on Research Data Australia.

Slide 30

Notes

I said I’d talk about the long tail. Here are two.

We looked in some detail at how the HIEv data capture application works for environmental data – but what about researchers in the long tail, who don’t have specific software applications for their group?

We are working on a similar Acquire-and-Act service that operates on files, and we are trying to make it as useful and attractive as possible. Most research teams we talk to at UWS are using Dropbox or one of the other ‘Share, Sync, See’ services. Dropbox has limitations on what we can do with its APIs and does not play nicely with authentication schemes other than its own, so we are looking at building ‘Acquire and Act’ services using an open source alternative: ownCloud.

Our application is known as Cr8it (Crate-it).

Slide 31

Notes

A number of techniques are employed at UWS:

  • the “R” drive

  • research-project-oriented data shares

  • synchronisation with Dropbox and ownCloud

  • synchronisation with GitHub and SVN

References

1. Burton, A. & Treloar, A. Designing for Discovery and Re-Use: the ‘ANDS Data Sharing Verbs’ Approach to Service Decomposition. International Journal of Digital Curation 4, 44–56 (2009).

2. Androulakis, S. MyTARDIS and TARDIS: Managing the Lifecycle of Data from Generation to Publication. In eResearch Australasia 2010 (2010). At <http://ccaeducause1.caudit.edu.au/index.php/eraust/2010/paper/view/62>

3. Sefton, P. M. The Fascinator – Desktop eResearch and Flexible Portals. (2009). At <https://smartech.gatech.edu/handle/1853/28483>

4. Kunze, J., Boyko, A., Vargas, B., Madden, L. & Littman, J. The BagIt File Packaging Format (V0.97). At <http://tools.ietf.org/html/draft-kunze-bagit-06>

5. W3C RDFa Working Group. RDFa Core 1.1 Recommendation. (2012). At <http://www.w3.org/TR/rdfa-syntax/>

6. Wolski, M., Richardson, J. & Rebollo, R. Shared benefits from exposing research data. In 32nd Annual IATUL Conference (2011). At <http://iatul2011.bg.pw.edu.pl/proceedings/ft/Wolski_M.pdf>

7. Berners-Lee, T. Linked Data. (2006). At <http://www.w3.org/DesignIssues/LinkedData.html>



Research Data @ the University of Western Sydney (Introducing a data deposit management plan to the research community at UWS)

I was invited to speak at the National Higher Education Faculty Research Summit in Sydney on May 22 about our Research Data Repository project. The conference promises to provide a forum for exploration.

Explore

  • Sourcing extra grant funding and increasing revenue streams

  • Fostering collaboration and building successful relationships

  • Emerging tools and efficient practices for maintaining research efficacy and integrity

  • Improving your University’s research performance, skills and culture to enable academic excellence

My topic is “Introducing a data deposit management plan to the research community at UWS”. This relates directly to the conference theme I have highlighted, on emerging tools and practice. My strategy for this presentation, given that we’re at a summit, is to stay above 8000m, use a few metaphors, and discuss the strategy we’re taking at UWS rather than dive too deeply into the sordid details of projects. As usual, these are my notes; I hope these few paragraphs will be more useful than just a slide deck, but this is not a fully developed essay.

There are two kinds of data: Working and Archival/Published

In very general terms, we have divided our data storage into two parts: the working Research Data Storage service, where people get things done, collect data and work with it; and the archival Research Data Repository, where stable, citable, published data sets are looked after (by the library) for the long term.

This talk is not going to be all about architecture diagrams, but here’s one more, from a recent project update, showing two examples of applications that will assist researchers in working with data. One very important application is HIEv, the central data capture/management platform for the Hawkesbury Institute for the Environment. This is where research teams capture sensor data, research support staff work to clean and package the data, and researchers develop models and produce derived data and visualisations. We’re still working out exactly how this will work as publications using the data start to flow, but right now data moves from the working space to the archival space, and thence to the national data discovery service – see this example of weather data. (Unfortunately the data set is not yet openly available for this one; I think it should be, and I’ll be doing what I can to make it so.)

Data wrangling services

The other service shown on this diagram is Dropbox.com. We’d be hard-pressed to stop researchers from using this service – it comes up in just about every consultation meeting. Researchers themselves must take responsibility for making sure that services like this are appropriate given their data management obligations under funder agreements and codes of practice. For those projects where Dropbox.com is appropriate, we plan to let researchers invite the Research Data Store to share their stuff, thus creating a managed, backed-up copy at the university and opening the way for us to provide useful services over the data (coming soon).

Data management

Yes, we have a web page about research data management, with some basic advice and links to more resources, but putting up web pages does not effect the kind of culture change needed to establish research data management, data re-use and data citation. As our Research Office head, Gar Jones, says, this will be a change similar to the introduction of human and animal ethics management, which will take several years to roll out.

Some key points for this presentation

I want to talk about:

  • Governance, open access, metadata, identifiers

  • The importance of the (administrative) research lifecycle

  • Policy supported by services rather than aspirations

eResearch = goat tracks

This is a concrete path on the Werrington South (Penrith) campus of the University of Western Sydney. The path is there because people kept walking through the garden bed, which was in between where the shuttle bus stops and where they wanted to be: the library. As I said at a similar conference for IT types last year:

Groups like mine work in the gap between the concrete and the goat track, my job is to encourage the goats.

And once we’ve encouraged the goats to make new paths, we need to get the university infrastructure people to come and pave the paths.

What’s over the horizon?

What do research administrators and IT directors need to be thinking about?

  • Changes in the research landscape – more emphasis on data reuse and citation, and an increasing emphasis on defensible research, mean data will become as important as citations

  • Providing access to publications and data so it can be reused.

  • (e)Research infrastructure in general, where collaboration must not be constrained by the boundaries of individual institutional networks and firewalls.

Any others?

Research data, Next Big Thing?

The Australian National Data Service runs a data-discovery service designed to advertise data for reuse.

Governments are joining in

As research organisations, we want to have infrastructure for data management, and a culture of data management that involves forward planning and data re-use. So the next section of the talk is about how we need to:

  • Stop the fat multinational-publisher tail from wagging the starving research dog. Ensure research funded by us is accessible and usable by us.

  • Understand our researchers and their habits, so we can help them take on this new data management responsibility (actually it’s not a new responsibility, but many have simply been paying no attention to it, in the absence of any obvious reason to do so).

  • Sort out the metadata mess most universities are swimming in.

Now for the big picture stuff.

Open Free scholarship is coming? (Just beyond that ridge)

OA is a Good Thing,

Which will:

  • Reduce extortionate journal pricing.

  • Provide equitable access to research outputs to the whole world.

  • Open Access to publications and Coming Soon: Open Access to data.

  • Promote Open Science and Open Research.

  • Drive huge demand for data management, cataloguing, archiving, publishing services

http://aoasg.org.au/

There are competing models for open access. Bizarrely, the discussion is often framed as a contest between ‘Green’ and ‘Gold’. It’s a lot like the State of Origin Rugby League – a contrived but popular-in-obscure-corners-of-the-world contest where the ‘Blues’ and the ‘Maroons’ run repeatedly into each other. In both State of Origin and Open Access, the current winners are large media companies. At least being an Open Access advocate doesn’t give you head injuries.

Green OA refers to author-deposited pre-publication versions of research articles. Gold means that the published version itself is ‘Open’, for some ill-defined definition of open, often at a cost of thousands of dollars out of the researcher’s budget. Green or Gold, a lot of so-called Open Access publishing operates with no formal legal underpinnings, that is, without copyright-based licences. For example, when I deposited a Green version of a paper I had written here and wrote to the publisher asking them to clarify copyright and licensing issues, I got no reply.


We have a brief window now to try to build services for research data management that do have a solid legal basis, and to avoid repeating some of the OA movement’s missteps, but this is not trivial (1).

Identity management is crucial

I have used a variant of the above dog picture before to talk about identity management. This dog has a name, but it’s a terrible way to find out about him, as he has a much more famous namesake.

Like the rest of us, this dog has all sorts of identifying names and numbers – a microchip number linked to a database, an ID assigned by the RSPCA, patient numbers at veterinary practices (which may be linked to more than one human), phone numbers on his tag, etc. The point is, it’s much worse for researchers than for dogs – identities are maintained all over the place. Foley and Kochalko put it like this:

While much has changed since the days of David Livingstone, we continue to struggle with associating individuals with their works accurately and unambiguously. Author name ambiguity plagues science and scholarship: when researchers are not properly identified and credited for their work, dead-ends and information gaps emerge. The impact ripples throughout the ecosystem, compromising collaboration networks, impact metrics, “smarter” research allocations, and the overall discovery process. Name ambiguity also weighs on the system by creating significant hidden costs for all stakeholders. (2)


To do metadata management well we need to sort out all sorts of naming and identifying issues, dealing correctly with the potential causes of confusion: multiple people with the same name, people with multiple names over time (and simultaneously), and name variants. Even where there are agreed subject codes, like the Field of Research codes that are heavily used in research measurement exercises, they can get mixed up as different databases use different variants.
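
This is exactly the problem identifier schemes like ORCID (2) exist to solve: resolve an ID once rather than matching name strings. A minimal sketch against ORCID’s public API follows; the identifier is ORCID’s published example, and the exact paths and JSON shape should be treated as assumptions:

    # Minimal sketch: resolve an ORCID iD to a canonical name rather than
    # matching on name strings. Paths and JSON shape are assumptions.
    import requests

    orcid = "0000-0002-1825-0097"   # ORCID's documented example iD
    resp = requests.get("https://pub.orcid.org/v3.0/%s/person" % orcid,
                        headers={"Accept": "application/json"})
    name = resp.json()["name"]
    print(name["given-names"]["value"], name["family-name"]["value"])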

We try to work out how to fit new processes into existing workflows

At the University of Rochester, when they installed an institutional repository, the team conducted ethnographic research on their research community (3). We have not gone that far, but our Research Data Repository project does try to pay attention to what researchers do as part of their current work, and to fit new processes into existing ones.


For example, the above scenario tries to capture the interactions that would happen when a researcher is required by a journal to deposit data before publication. We spend a lot of time talking to the Office of Research Services (ORS) and the research librarian team about how we can fit in with their existing processes, and how to minimise negative impacts on research groups. Research Offices are used to responding to changing regulatory environments, so adding new fields to forms etc. is straightforward. Changing IT services is much harder: ITS is much bigger than ORS, new services need to be acquired, provisioned and documented, and the service desk team has to be taught new processes.

Challenge: how to stop the corporate publishing tail from wagging the scholarly dog

This is rather a substantial issue to try to talk about in a discussion of research data management and repositories, but it’s essential to keep an eye on the big picture. We know that scholarship has to change, and publishing has to change, but we don’t know how; we need to develop strategies for how we want it to change. Some examples of where this is important:

  • Policy on ‘ownership’ of intellectual property rights over data needs to be established. This is not as simple as it is for publications, as data are not always subject to copyright (1).

  • Data citation is going to be an important metric.

New models are needed, and people like Alex Holcombe from Sydney Uni are developing them:

Science is broken; let’s fix it. This has been my mantra for some years now, and today we are launching an initiative aimed squarely at one of science’s biggest problems. The problem is called publication bias or the file-drawer problem and it’s resulted in what some have called a replicability crisis.

When researchers do a study and get negative or inconclusive results, those results usually end up in file drawers rather than published. When this is true for studies attempting to replicate already-published findings, we end up with a replicability crisis where people don’t know which published findings can be trusted.

To address the problem, Dan Simons and I are introducing a new article format at the journal Perspectives on Psychological Science (PoPS). The new article format is called Registered Replication Reports (RRR).  The process will begin with a psychological scientist interested in replicating an already-published finding. They will explain to we editors why they think replicating the study would be worthwhile (perhaps it has been widely influential but had few or no published replications). If we agree with them, they will be invited to submit a methods section and analysis plan and submit it to we editors. The submission will be sent to reviewers, preferably the authors of the original article that was proposed to be replicated. These reviewers will be asked to help the replicating authors ensure their method is nearly identical to the original study.  The submission will at that point be accepted or rejected, and the authors will be told to report back when the data comes in.  The methods will also be made public and other laboratories will be invited to join the replication attempt.  All the results will be posted in the end, with a meta-analytic estimate of the effect size combining all the data sets (including the original study’s data if it is available). The Open Science Framework website will be used to post some of this. The press release is here, and the details can be found at the PoPS website.

http://alexholcombe.wordpress.com/2013/03/03/registered-replication-reports-are-open-for-submissions/

This seems like a positive note on which to end. Hundreds of researchers are trying to fix scholarship, they’re the ones we need to talk to about what a data repository or a data management plan should be.

Science is broken; let’s fix it

1. Stodden V. The Legal Framework for Reproducible Scientific Research: Licensing and Copyright. Computing in Science Engineering. 2009;11(1):35–40.

2. Foley MJ, Kochalko DL. Open Researcher and Contributor Identification (ORCID). 2012 [cited 2013 May 21]; Available from: http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1133&context=charleston

3. Lindahl D, Bell S, Gibbons S, Foster NF. Institutional Repositories, Policies, and Disruption. 2007 [cited 2013 May 21]; Available from: http://open.bu.edu/xmlui/handle/2144/919

Creative Commons License
Research Data @ the University of Western Sydney (Introducing a data deposit management plan to the research community at UWS) by Peter Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

Running an Open Source project from a university dev team

Steven Hayes from Arts eResearch at the University of Sydney invited me to visit their group and talk about running open source software projects, as they are making their Heurist (semantic database-of-everything) software open source. This was more of a conversation than a presentation, but I prepared a few ‘slides’ to remind me which points to hit. Here are my notes. The focus was not on why to go open source, or on open source in general; it was about doing it in a small university-based team. Comments about how various uni open source projects run would be appreciated.

I have been involved in creating two sizeable code-bases, both released by the University of Southern Queensland as open source. They had very different histories. I’ll talk about both and how they run – although one of them doesn’t really run any more in any meaningful way.

Two projects I started…

… on which other people* did most of the work

  • ICE – the Integrated Content Environment. Used at USQ for creating course materials for delivery online and in print. Almost no activity on this outside of USQ these days. Inside USQ? I don’t know for certain, but I think it is still in use, and finding a replacement has proven difficult (which doesn’t surprise me, as that was the reason we built it in the first place).

  • ReDBOX – the Research Data Box (and The Fascinator, the underlying toolkit).

*Thanks to Ron Ward, Oliver Lucido, Linda Octalina, Duncan Dickinson, Greg Pendlebury, Daniel de Byl, Bron Chandler, Tim McCallum, Cynthia Wong, Jason Zejfert, Sally MacFarlane, Caroline Drury, Pamela Glossop, Warwick Milne, Sue Craig, Vicki Picasso, Dave Huthnance, Shirley Reushle and the late Alan Smith who made, tested, championed and supported these projects. Thanks also to funding from the Australian government via ANDS, ARROW and other streams. Sorry if I forgot anyone.

(At this point I wanted to check that everyone knew what Open Source means, making sure that we all understood how Richard Stallman made software free using copyright law. Whoever holds the copyright in a piece of software – which is likely to be whoever wrote it, or their employer – can control distribution by using a licence, a legal instrument. Stallman’s insight was that a licence could be used to enforce sharing, openness and freedom: you can use this stuff I created provided you promise to share it with other people (that’s not a quote). Oh, and people working in this space should also understand the difference between Free and Open Source [1].

But I forgot.)

RTFM

Above, I linked to a free book on producing Open Source software [1] by Karl Fogel, which seems to cover most of what you’d need to know. I haven’t read it all, but it looks useful.

But I don’t like this

The book begins:

Most free software projects fail.

I think that’s silly, talking about failure without first defining success.

Me, I’m not sure that all the scenarios Fogel lists are failures at all; there are lots of reasons to release code, and they are not all necessarily about building a substantial community:

We tend not to hear very much about the failures. Only successful projects attract attention, and there are so many free software projects in total[2] that even though only a small percentage succeed, the result is still a lot of visible projects. We also don’t hear about the failures because failure is not an event. There is no single moment when a project ceases to be viable; people just sort of drift away and stop working on it. There may be a moment when a final change is made to the project, but those who made it usually didn’t know at the time that it was the last one. There is not even a clear definition of when a project is expired. Is it when it hasn’t been actively worked on for six months? When its user base stops growing, without having exceeded the developer base? What if the developers of one project abandon it because they realized they were duplicating the work of another—and what if they join that other project, then expand it to include much of their earlier effort? Did the first project end, or just change homes?

What’s the first thing that comes to mind when you think of Open Source?

Linux? Apache? WordPress?  Firefox?

The hits. The stadium-filling rock-star projects?

Your band has a 99.9% probability of staying in the garage

Figure 1: Me (the good-looking one) and cousin Tim at the Springwood Sports Club, about to perform with a community uke group. No plans for world domination – playing for family, who are obliged to attend, and even some people who, for some reason, choose to come. #Notfailure

It’s important to work out why you are releasing software as Open Source – think about the audience. One very important audience is you, yourself. If you work on code as part of your job, then your employment contract may well mean that your employer owns the copyright. Do you want to be able to continue using it in your next job? To show potential employers? Making it open source helps your future self.

I know this first hand.

Universities are not as stable as they seem, or as you may hope. At the Australian Digital Futures Institute at USQ we began by hosting code repositories and websites internally; I reasoned that the university would be a good bet for maintaining the persistence of these resources.

But then one Gilly Salmon came to our institute as the new professor and decided, along with the rest of the senior leadership team, that there was altogether too much making of the digital future going on in the Australian Digital Futures Institute – too much technology. They let just about all the technical staff go, no matter how useful they were to the organisation, or how pregnant they happened to be. (We’re a ‘relationship brand’, the director of marketing told me, so we shouldn’t be continuing to develop software to deliver award-winning distance-ed services.)

Web sites that would still have value are simply gone from public view, including, ironically, the PILIN project site, which was about persistent identifiers. Even the ICE website, which is full of useful stuff for USQ itself, now appears to be accessible only via the Wayback Machine. They’re still using ICE, but they turned off the website anyway; the code, however, is sitting on Google Code, so we all still have access to it.

This sort of thing happens all the time. For a couple of us, the NextEd refugees, this was the second redundancy associated with USQ. Kids, it is prudent to make sure that any code you might want to re-use later in your career is released under an open licence, and that documentation, web sites and the like are under Creative Commons. Think of it as a professional escape pod.

The ReDBOX project survived this ADFI shutdown because it had been open source from the beginning, but further funding had to be redirected to another university which was willing to host the building of a digital future.

Lessons

  • Open Source can be worth doing even if the audience is your future self

  • Don’t trust someone else to keep your website up

  • If you want a community you’ll (likely) have to build it

  • Every project is different, so you need to structure yours around your users

Oh, and the answer to most questions is on Stack Exchange. I decided that this list was worth using as a starting point for discussion.

http://programmers.stackexchange.com/questions/51553/checklist-for-starting-an-open-source-project

Havoc P said: [with additions by me, made after the discussion at USYD]

Things I’d put in the early priorities are:

  • have a simple “what is it?” web site with links to some discussion forum (whether email or chat) and to the source code repository

    [Mailing lists are usually best IMO – forums can be empty, echoing, and make your project look unloved. A tech list is a must, always, but other communications should be built around the reality of your project. No user community yet? Build one. Others over at Stack Exchange added that once you have a tech list it is best to hold or log all your discussions there, so architectural decisions are transparent and the community can engage.

    On the ReDBOX project there are two main mailing lists, one for the techies and one for the users (mostly library staff), and lots of virtual and face-to-face get togethers. There is a committers group who are in charge of what gets into the trunk and various ad-hoc arrangements to sponsor sub-projects at the dozen or so sites using the software. The groups and how they interact were all created to serve that community, not from some manual of best practice, although it is all informed by collective experience of open source projects.]

  • be sure the code compiles and usually works, don’t commit work-in-progress or half-ass patches on the main branch that break things, because then other people’s work would be disrupted

    [Well, OK, but if you’re releasing an existing code base then don’t get too hung up on making things perfect: (a) it will be a huge waste of effort if there is no demand for your code, and (b) don’t be unnecessarily shy – most open source projects are like busking, not stadium rock; nobody is watching you, waiting to pounce on your errors.]

  • put a license file in the code repository with a well-known license, and mark the copyright owner (probably you, or your company). don’t omit the license, make up a license, or use an obscure license.

  • have instructions for how to contribute, say in a HACKING file or include in your README. This should include where to send patches, how to format patches, code indentation rules, any other important conventions of the project

  • have instructions on how to report a bug

  • be helpful on the mailing list or whatever your forums are

More from Havoc P

After those priorities I’d say:

  • documentation (this saves you work on the mailing list… making a FAQ from your list posts is a simple start)

  • try to do things in a “normal” way (don’t invent your own build system or use some weird one, don’t use 1-space indentation, don’t be annoyingly quirky in general because it adds learning curve)

  • promote your project. marketing marketing marketing. You need some blogs and news sites and stuff like that to cover you, and then when people show up interested, you need to talk to them and be sure they get it working and look at their patches. Maybe mention your project in the forums for related projects.

    [Yes, this is a huge one. One of the big differences between ReDBOX, which is no hit but has a solid user base, and ICE, which never made it out of USQ, is that Vicki Picasso from Newcastle Uni and I marketed the hell out of ReDBOX early, to a very specific community of user-organisations. We needed a community so the software would have a sustainable base, so we designed the software for the community and sought input on the design as broadly as we could.

    With ICE, I talked about it to lots of the wrong people and didn’t sell it to the right ones – other distance-ed unis – but that was partly because it conferred a competitive advantage on USQ. This comes back to the point above about success vs failure – there’s more than one way to succeed.]

  • always review and accept patches as quickly as humanly possible. Immediately is perfect. More than a couple days and you are losing lots of people.

  • always reply to email about the project as quickly as humanly possible.

  • create a welcoming/positive/fun atmosphere. don’t be a jerk. say please and thank you and hand out praise. chase off any jackasses that turn up and start to poison the community. try to meet people in person when you can and form bonds.

[1] K. Fogel, Producing open source software: How to run a successful free software project. O’Reilly Media, Inc., 2005.

Creative Commons License
Running an Open Source project from a university dev team by Peter (pt) Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

Repositories! (What are they good for?)

Creative Commons License
Repositories! (What are they good for?) by Peter Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

Georgina Edwards invited me to Intersect NSW to give a talk to the software engineering team about repositories in eResearch. There were also quite a few eResearch analysts in attendance, not to mention a couple of members of the senior management team. (Just in case you’re wondering, the answer to the question in the title is not “absolutely nothing”).


Here are my notes, with embedded slides, which I put together on the train to and from the CBD (ie quick and dirty).

The summary: repository means a lot of different things, but the main sense I talked about with the Intersect team was ‘data-store component’. I tried to cover why using a repository in an eResearch project might be important: repositories can provide a lot of ready-made functionality, particularly in the area of digital preservation, but also access to indexing services and content-transcoding to generate new formats from things ingested. I talked about one aid for thinking about repository services which I think is useful – the Repository Micro-services framework from the California Digital Library – and ran through some of the repository frameworks that people in the eResearch.au world might encounter.

The liveliest discussion was around RDF, the Resource Description Framework, and what it’s good for. I made the assertion that RDF is the best-practice approach to storing metadata, allowing for built-in extensibility. RDF uses URIs as names for both things and relations, which reduces ambiguity and aids interoperability. But I think it’s important to draw the line between RDF as a good way to do metadata and annotation, and the assumption that an RDF query language (via an RDF triple-store) is always going to be needed, or even work. I’m sceptical about the promise of RDF as some kind of super semantic world-wide web of knowledge you can query for the answer to anything, but it’s clearly a good way to do metadata – there’s no excuse for inventing a new metadata schema that is not RDF-based these days. (Use the comments if you want to discuss.)
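To make the URIs-as-names point concrete, here’s a minimal sketch using Python’s rdflib. The dataset and person URIs are made up for illustration; DCTERMS is rdflib’s built-in binding for the Dublin Core terms vocabulary.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

g = Graph()

# Both the thing being described and the relations describing it are
# named with URIs, so any consumer can resolve them unambiguously.
dataset = URIRef("https://example.edu/datasets/0001")  # hypothetical URI

g.add((dataset, DCTERMS.title, Literal("Example sensor data set")))
g.add((dataset, DCTERMS.creator, URIRef("https://example.edu/people/42")))
g.add((dataset, DCTERMS.license,
       URIRef("http://creativecommons.org/licenses/by/3.0/")))

# Turtle serialisation is both human-readable and machine-readable.
print(g.serialize(format="turtle"))
```

Because the properties are just URIs, anyone can extend this description with terms from another vocabulary without breaking existing consumers – that’s the built-in extensibility, and none of it requires a triple-store.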

The talk

I thought I’d start from something that the developers would be familiar with. Source-code repositories.

To a bunch of software developers…

… a repository is a place to put code

But it’s not just a place to put things. On a development project, the repository offers a number of services, like integration with task management systems, versioning, search and collaboration. I’m sure everyone in the professional eResearch world would be horrified to find a development project that wasn’t using source-code management via a code-repository: Git, Mercurial, or at the very least something ancient like Subversion.

What’s a repository to me?

The first time many of us heard the term repository in Higher Education was in connection with the Open Access movement, when a few forward-thinking universities in Australia – QUT, UQ, USQ and even some others outside of Queensland – began to set up Institutional Repositories, using software like EPrints or DSpace. These were essentially online databases of PDF files for academic works, with bibliographic metadata. They were also seen as sites for preservation of materials, and had services to advertise their contents to the world, via the OAI-PMH metadata harvesting standard and via metadata embedded in the web pages that described the academic works.

A group of us put together a presentation for Open Repositories last year on the growth of Institutional Research Data Repositories, alongside the ‘traditional’ Institutional Publications repository.

There are a few senses of the word:

  • Repository-as-database

  • Repository-as-application

    Institutional Repository or Data Repository

  • Repository-as-lifestyle (ie analogous to a ‘library’)

People tend not to be very careful about these senses of the word repository, and indeed the boundaries are actually quite blurry. If you have chosen to call your application a repository, then that term brings a certain gravitas: you’d expect the repository-as-application to be something that’s not just for Christmas, but something you’ve made a commitment to feeding and walking, at least for some time.

With that in mind, the point of this discussion is: what might a repository-as-data-store be good for in an eResearch project?

Services in a typical repository-as-datastore underneath an application:

  • If the app goes away the data is/are safe independently of the application services,

    • with all digital objects stored in standard formats

    • with standardised metadata

    • so they can be preserved*.

  • You get OAI-PMH (pull/out) and SWORD (push/in) built in (see the harvesting sketch after this list)

  • Built in security/access control

    (but beware of actual real-world performance)

  • Content transcoding

    (thumbnails / image viewers / video versions)
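As a concrete illustration of the OAI-PMH ‘pull’ interface mentioned in the list above, here’s a minimal harvesting sketch. The repository base URL is hypothetical; the verb, parameters and namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
import requests

# Hypothetical endpoint: any OAI-PMH-capable repository (EPrints,
# DSpace, Fedora front ends, ...) exposes a base URL like this.
BASE_URL = "https://repository.example.edu/oai"

# ListRecords with oai_dc, the mandatory Dublin Core metadata format.
resp = requests.get(BASE_URL, params={"verb": "ListRecords",
                                      "metadataPrefix": "oai_dc"})
resp.raise_for_status()

# Namespaces fixed by the OAI-PMH and Dublin Core specifications.
NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}

root = ET.fromstring(resp.content)
for record in root.findall(".//oai:record", NS):
    ident = record.find(".//oai:identifier", NS)
    title = record.find(".//dc:title", NS)
    print(ident.text if ident is not None else "?",
          "-", title.text if title is not None else "(no title)")
```

The point is that a harvester needs to know nothing about the repository beyond its base URL – the protocol does the rest, which is exactly the kind of ready-made service you get for free by using repository software.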

Nobody put up their hand and said “Hey that’s just a CMS” (Content Management System), but the answer would have been, yes, of course. A repository-as-application is just a serious CMS, one designed for maintaining important stuff in a well-managed way. Indeed, the University of Queensland is moving its Institutional Repository to a Drupal-based system, and leaving behind the repository-as-data-store that used to sit underneath it.

The Repository Micro-services framework from the University of California captures all these services really nicely.

Repository Micro-services

http://journals.tdl.org/jodi/index.php/jodi/article/view/1605/1766

This is implemented in http://merritt.cdlib.org/, which does not seem to have an obvious application to download.


Some repository software you may hear about

  • Eprints (Perl)

    Good for publications repositories; has been used for cultural collections and learning materials – has every imaginable interface to repository content

  • DSpace good for a range of digital object collections

    e.g. Andrea Schweer’s talk on a data-capture app, Building a repository for freshwater quality data

  • Fedora Commons (back end)

  • CKAN – a Research data Hub app (Python)

  • Micro-service components like BagIt for packaging and PairTree for efficient file storage (see the sketch below).

NOTE: All of the above apart from EPrints include built-in search using Apache Solr.
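For a feel of how simple the PairTree idea mentioned above is, here’s a sketch of the identifier-to-path mapping – just the core fan-out trick; the full spec also defines character escaping and a pairtree_root prefix directory.

```python
def pairtree_path(identifier: str) -> str:
    """Map an identifier to a PairTree directory path by splitting it
    into two-character segments, so objects spread evenly across the
    filesystem and the path maps back to the identifier losslessly."""
    segments = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return "/".join(segments)

print(pairtree_path("ark87slppq"))  # -> ar/k8/7s/lp/pq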

In conclusion, I asked: why use one of the above, particularly when on first acquaintance, something like Fedora can look like an anchor, impeding forward progress?

The basic answer is that if in the long run your project is going to require some large percentage of the repository micro-services discussed above, then you’re going to end up writing your own Fedora-like thing anyway. Also, I think it’s better to be part of a community looking at these things together. For example, Fedora is not a magic solution to being able to re-use repository content between applications, but it is reassuring to know that the Hydra and Islandora communities are talking about interop via their Hylandora project, and there is a significant amount of preservation work happening in the Fedora world.

To some of us, the idea of doing certain kinds of eResearch project without a back-end repository (as in something that has managed services around preservation, under some kind of serious governance) would be like doing software development without a code repository. The question, of course, is: which kinds of project? And if you do need one, where do you put the repository part in the architecture?

CAIRSS – CAUL Australasian Institutional Repository Support Service

By Dr Peter Sefton (University of Western Sydney) with Ms Caroline Drury (University of Southern Queensland).

On Wednesday 5th Dec I (Peter) visited the Japanese Digital Repository Federation at their invitation and expense, to talk about how our respective repository communities are organised. I’d like to thank the DRF for this opportunity to make the brief trip to Tokyo. Caroline was invited but was unable to make it. The DRF folks have put up a summary of the meeting, in Japanese. Note that while my comments on that page are listed as “CAIRSS”, I was not representing CAIRSS (the CAUL (Council of Australian University Librarians) Institutional Repository Support Service); I attended as a member of the Australian/Australasian repository community. I also attended the DRF international conference in 2009 on a similar basis, when I did happen to be associated with CAIRSS, so the organisers knew me. I did talk a fair bit about CAIRSS, in the context of other projects in Australia.

Before I went I polled the CAIRSS-list to find out if there were any questions people would like answered – more on that below.

First, a bit about me and repositories:

  • I was the technical lead for the Regional Universities Building Research Infrastructure Collaboratively (RUBRIC) project which was hosted by the University of Southern Queensland (and the de-facto project manager for several months during the project establishment phase).

  • I led a small team at USQ subcontracting to the ARROW project during 2008, providing technical support to ARROW, and repository services to small Higher Education institutions in Australia.

  • I worked on USQ’s successful bid to host the first CAIRSS repository support service in Australia and acted as a senior strategist for the service, for example working on guides such as the one on how to get into Google Scholar et al, and negotiating major changes to repository infrastructure such as the closure of the Australian Digital Theses search service and its subsumption into the National Library of Australia’s Trove service.

  • I was not involved in running the second version of CAIRSS from 2011-2012 but I have remained part of the repository community in Australia and attended the 2012 community day where I spoke about trends in repository software in the context of organisational governance.

  • I am on the conference committee for the Open Repositories series of conferences (from 2011) – the call for papers for the 2013 conference is just out.

The DRF

Shigeki SUGITA started off proceedings with a presentation about the activities of the Digital Repository Federation.

Perhaps the most striking thing from an Australasian point of view is a staffing issue: talented repository managers are required by management to rotate through a variety of library jobs, meaning that there is constant turnover and a lack of opportunity to specialise. There are similar pressures at play in our libraries, I guess, with a need to train new repository staff and significant turnover, but not to the same extent.

Japanese repositories are very much driven by an Open Access agenda, which is quite different from the situation in Australia, where two different government measurement schemes collecting information about publications push repositories in another direction – more on that below.

Another interesting dimension to the Japanese scene is that they have a number of consortial repositories, where several institutions share a repository. This is an idea that came up in Australia several times in the mid-to-late 00s, but never got off the ground. It might be worth revisiting some time, both for institutional publications repositories and for the newer data repositories.

The presentation

I presented from an earlier version of the ‘slides’ below – I have added some notes from the discussion and clarifying material.

CAIRSS background

Parent projects

The Australian government made significant investments in institutional repositories via programs such as:

  • APSR Australian Partnership for Sustainable Repositories (ended 2008)

  • ARROW Australian Research Repositories Online to the World

  • RUBRIC, Regional Universities Building Repository Infrastructure Collaboratively.

These projects and other investments in the repository world were via these funding streams (for which the websites have disappeared):

  • ASHER – Australian Scheme for Higher Education Repositories [sponsored the development of repositories in all Higher Education Institutions in Australia]

  • SII – Systemic Infrastructure Initiative [APSR, ARROW and RUBRIC]

  • BAA – Backing Australia’s Ability

We talked in some detail about how these funding schemes have influenced the establishment of repositories; while the initial driver for Australian repositories was open access, the Excellence in Research for Australia (ERA) measurement exercise and its failed predecessor stalled the Open Access movement to some extent, by requiring universities to collect non-open access materials in complicated ways.

CAIRSS v1 2009-2011

Coming out of the investments outlined above, CAIRSS was established on March 16, 2009:

The first CAUL service was funded for two years, with the approval of the Department of Innovation (DIISR), with monies remaining from the successful ARROW project, supplemented by CAUL member subscriptions.

CAIRSS Structure

CAIRSS v1 staffing

This version of the CAIRSS service covered Australian universities, and was staffed with:

  • A full time repository manager. (USQ)

  • A full time technical staff member. (USQ)

  • A full time copyright officer. (Swinburne)

  • A part-time strategic advisor and other senior support.

CAIRSS v1 approach

The initial CAIRSS service included:

  • Annual meetings with both a general and technical strand.

  • Copyright workshops for private discussions of copyright issues.

  • Maintained ‘sandbox’ instances of repository software.

  • Creation and maintenance of web pages and guides on repository issues such as statistics, indexing and an extensive copyright guide.

  • Provided direct support for government reporting processes – chiefly the establishment of the Excellence in Research Australia (ERA) exercise.

CAIRSS v2 2011-2012

With added New Zealanders

The second version of CAIRSS was funded from member subscriptions and expanded to include New Zealand:

The second CAUL service is also funded for almost two years and incorporates many of New Zealand’s higher education institutions. With this expansion, CAIRSS now stands for the CAUL Australasian Institutional Repository Support Service.

CAIRSS v2 Staffing

This version of CAIRSS had a reduced team in the central office at USQ.

  • One full time repository manager.

  • One half-time technical officer.

  • Part time senior manager.

  • Part time copyright person at Swinburne.

CAIRSS v2 approach

The second CAIRSS service included:

  • Annual meetings with a general strand.

  • Discussion list for members only.

  • Copyright workshops for private discussions of copyright issues.

  • Maintenance of web pages on repository issues such as statistics etc.

  • Provided support for government reporting processes (ERA)

Post CAIRSS: CRAC 2013-?

From 2013 CAIRSS will no longer exist – it is being replaced with a new service known as CRAC. I gather that the feeling of CAUL was that the community is now mature enough to be self-sustaining.

CRAC (CAUL Research Advisory Committee) NEW! from 2013

CAUL Research Advisory Committee

(will undertake some of the work carried out by CAIRSSAC and COSIAC, from 2013)

Program: Research
Chair: Heather Gordon (2013–2014)
Members: TBC
CONZUL: Janet Copsey (2013– )
Practitioners: TBC

http://www.caul.edu.au/about-caul/caul-committees

CRAC anticipated activities

  • Running the annual event

  • Annual copyright workshop

  • Maintaining the CAIRSS discussion list

New Open Access group: AOASG

There is a new Open Access group in Australia which is not part of the CAUL/CAIRSS family.

From Danny Kingsley:

The Australian Open Access Support Group (AOASG) was launched during Open Access Week in 2012. It is a consortium of six universities with open access policies  – QUT, ANU, Macquarie University, Newcastle University, Charles Sturt University and Victoria University. The group aims to provide support, lobbying and advocacy for open access in Australia. Membership will be extended to other research institutions and affiliates during 2013.

http://www.aoasg.org.au [NOTE the website is currently being built – may not be live yet]

General comments about the CAIRSS/CRAC community

Small task-force groups now self-organize

The repository community is well established and members of the community run their own investigations into repository matters. These range from asking questions on the list about repository practices, to running formal surveys. An example from the broader CAUL community of which CAIRSS is a part is the IR / Open Access Funding Survey by Danny Kingsley and Vicki Picasso.

Opportunities

DRF collaboration?

From Caroline Drury:

It would be interesting for CRAC to consider something similar to the DRF model  - eg at the beginning of each two year period, to meet and consider what projects could be done in the space, within Australia / NZ. Then perhaps a call could be put to institutions who could then (according to their strengths) be assigned to do that project in a collaborative model, using their own funds. I’m not sure if it would work here, given the big physical distances, but I think it’s a good model in a scenario where there’s limited funding. 

I’m sure CRAC will consider this.

April event in Tasmania

An event is being organised in Tasmania in April around the following themes. Regional participation would be most welcome.

From David Flanders at ANDS:

  1. linking research data and research publications

  2. re-architecting the repository (if we started now based on what we know).

  3. business metrics/analytics from scholarly systems

  4. research profiles and author identifiers

  5. emerging scholarly vocabularies, linked data &

  6. scholarly search engines (beyond Google Scholar)

  7. APIs and bringing all these systems together via shared resources.

Questions (with my notes)

Natasha Simons at Griffith University had three questions that provided a great structure for the discussion part of the meeting. I tried to take notes (included below) as well as talk.


1. How’s the Memorandum of Understanding between Digital Repository Federation (Japan), UKCoRR and the UK RSP going? What sorts of things are of the most importance to all parties to share and experience in this space? What sort of involvement has there been between the signers to the MoU? How do they envisage this MoU benefiting all parties (particularly long-term)?

In January 2012 DRF heard there were repo managers in the UK, so they invited a rep from the Repositories Support Project (RSP) in the UK. Jackie Whickam came to snowy Hokkaido, where they found out that DRF and RSP carried out similar activities – eg the re-enactment of online discussions wearing masks. After the meeting they found out that there were many more things to share – eg in the UK they carry out residential workshops. The meeting with JW was about operational things between the repo manager communities in the UK and JP. They wanted to do more on activities to do with individual repositories.

The most important objective of the MoU is to send representatives to each other’s meetings to share more specifics. Since signing the MoU they have not done so much. The first thing was to invite a rep from the UK to the national workshop. The UK rep was asked to give a presentation about how they promote activities inside universities.

The MoU says they will sponsor trips for each other (but the Brits have not done their bit … yet :). RSP has just come to the end of its funding cycle. DRF hopes that even though there is uncertainty about funding the collaboration can continue. Funding is restricted, so long-term planning is difficult.


2. Find out what you can about the NII Repositories Program - http://www.nii.ac.jp/irp/en/rfp/ 
There are some interesting projects listed. How do they decide on the project areas? Where does the funding come from? How do they decide on the actual projects? Are they all 12 month projects? Are they all collaborative? How do they share the results?

Cyber Science Infrastructure (CSI), hosted by NII – has a selection board of about 10 members: informatics scholars and heads / top management of university libraries. 200–300M Yen.

  • Launch – circa 50 universities, circa 1M Yen

  • R&D (5–6 page proposal docs by multiple unis) – examples

    • DRF (proposal by several unis)

    • Sherpa Romeo – Japan

    • Statistics

About 30 submitted – 20 accepted.

Proposals ask for money; projects are never given more, usually slightly less than proposed. Money goes to the unis as project operators. Budgets are split between the participants (training, workshops, system development, etc). Budget is allocated on a fiscal-year basis, and CSI checks. Proposals are made around March, decisions around June, and activities take place from June to March. The following June there is a results meeting in Tokyo – a 2-day meeting (decreasing in size because the number of people launching has dropped: initially 50, now 10).

Sharing of results at June meeting, not much more than that.  Some projects with strong outreach will be well known. Out of 20 – some projects faded out without much impact.

Some projects that have done well:

  • DRF – The Digital Repositories Federation, my hosts.

  • Sherpa-Romeo Japan.

  • SCPJ – Society Copyright Policy in Japan – 600 societies almost all grey lit – a few are ‘green’ OA.

  • ROAT – project to standardise repo stats same as PIRUS/IRUS

  • Author-ID (ORCID) – participants on this are involved in ORCID. NII is working on a trial basis to have a database of Japanese researchers.

  • ShareRe – consortial repositories; came out of Hiroshima. DSpace is commonly used in Japan, as is EARMAS (original Japanese repository software).

    There are 14 of these, on a regional basis. Each one has a lead university that acts as host, providing the system and support.

    • Hiroshima. Collects funds from the 14 members for the future – 14 × 30K = 420K Yen kept in reserve and used for security maintenance. Operated by a regional council of libraries; Hiroshima uni serves as secretariat and host; additional funding comes from the region at 30K Yen per year. Initial launch funds came from CSI – Hiroshima was the first.

    • 7 member universities with Kagoshima as lead. Initial investment about 2M Yen; they do not collect funds for future renovation; collectively 250K per year, contributed pro rata by members according to FTE.

  • UsrCom – Trial repository system sandbox  - 2008-2009


3. Are there ways we can communicate better with them? Do they hold any webinars on topics that would be of interest to us? If so, how do they tell people about them? Could they tell us? I see they post to the JISC list every now and then (usually well-deserved achievement boasts). Should we have a ‘guest’ from DRF join our CAIRSS e-list or could they email you and then you post to our e-list?

They don’t do webinars, and they do everything in Japanese, so that’s a challenge. Does anyone on the CAIRSS list have good Japanese? The DRF would like to have a member on the CAIRSS/CRAC list.

Future of DRF not clear.

(In Japan there are moves afoot to subsidise societies as a way of driving OA.)

No OA mandates from the JP govt – rules are being revised now so that theses can (or must?) go through an IR. The policy reads like ‘must’, but we don’t know. This may open the door for theses to be publicised through the network. If this is realised there will be more possibilities – they need to think about metadata standards and talk to the national library.

Next steps

Once again, thanks to the DRF for having me – I am following up with CAUL on how we might be able to collaborate further. Now that we are in the CRAC era, Caroline’s suggestion of having a ‘call for projects’ that then get implemented at the member institutions sounds like it might be a way forward, and an ongoing relationship with the DRF (and the UK RSP) would be helpful, as they’ve been down this road before.

Creative Commons License
CAIRSS – CAUL Australasian Institutional Repository Support Service by Peter Sefton & Caroline Drury is licensed under a Creative Commons Attribution 3.0 Unported License.

Receding Repository Software?

Creative Commons License
“Receding Repository Software?” by Peter (pt) Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

I’m leading a brief session at the CAIRSS community days today (CAIRSS is the national repository support service for Australasia). The title is “Emerging Repository Software”, but I thought I’d turn that around and propose that the future of institutional repositories is to fade into the background.

Here are my notes for the session.

Take this screenshot of the Griffith University repository. See, here’s the default browse screen for publications.

The Griffith Publications repository

Hold on! That’s not the repository

This is the Research Hub, which ties together data from a number of different sources to provide a joined-up view of publications in the context of other research information.

http://research-hub.griffith.edu.au/publications

The repository has faded into the background

Much like this invisible dog.

Just in case these Griffith people get big heads, I do have to point out that while I think this service points to the future of the Institutional Repository as an embedded part of the research information systems of the university, the work’s not all done yet.

But, this is not perfect

The ‘find it yourself’ button is sub-optimal:

And check out this URL:

http://research-hub.griffith.edu.au/collections#fq={!tag=classgroup}classgroup%3A%22http%3A%2F%2Fresearch-hub.griffith.edu.au%2Findividual%2FvitroClassGroupcollections%22

That really should be something like: http://research-hub.griffith.edu.au/collections

This stuff is hard to get right. The hardest bit is getting good quality metadata so things do join up.

There are two new kinds of repository we’re seeing in Australia, thanks to investment from the Australian National Data Service.

  1. There are many “Data Capture” systems for researchers to manage data early in its lifecycle.

  2. These feed into Data Catalogues or Data Repositories – there’s a lot of terminological confusion here because of the way the funding streams have been structured.

    Data capture Systems

    See the ANDS list of DC projects. Here’s one I selected at random:

    It’s difficult to get useful information about many of these projects.

There are many data capture projects, and all of them will presumably need to be hooked up to systems like the Griffith Research Hub at some stage.

It’s a jungle out there, mount an expedition!

Data is/are the new black

A few notable projects:

See JISC’s Managing Research Data programme for more.

(I wanted to mention the Hydra Fedora-commons toolkit as well, plus lots of work on archives and digital libraries.)

The current opportunity for libraries

Use your metadata skills to help with “the great joining up”.

  • Get the governance right (see the ANDS view of this).

    Research systems are for the researchers.

     (So projects should report to the Deputy Vice Chancellor Research).

  • Start working with Research Data – not just on repositories, but on useful applications.

  • Get involved in tag-and-release programs for the feral data capture projects roaming Australia’s universities.

  • Do more work on ‘Digital library’ projects beyond the institutional publications repository.

Culture and climate

I was invited to attend the planning day for the Institute for Culture and Society (ICS) at the University of Western Sydney, to talk about the eResearch team at UWS, discuss collaboration tools, and show a few useful, relevant examples of eResearch in the humanities.

Here are some rough notes for the discussion.

For eResearch I will talk about our small eResearch website, and on the subject of collaboration tools I’ll be evasive.

The problem with surveys of collaboration tools

While lots of people are interested in finding out how to collaborate using modern techniques, we really need to talk this through on a project-by-project basis. I tried to write about collaboration tools at the Australian Digital Futures Institute after complaints from an education researcher in the institute about the bewildering array of stuff we used to get things done. I gather it was like turning up for work as a carpenter’s apprentice and being introduced to all the tools in the ute at once.

(That piece is still online, but it is of historical interest only, as the tools have all changed. Not to mention it is very long-winded and mentions some USQ tools that aren’t relevant to you; still, if you’d like to see how I explained Twitter and hashtags, and predicted the demise of Google Docs ‘cos Google Wave had arrived, then you might enjoy it. Otherwise, file as too long, don’t read.)

Dr Sefton’s quick cure for a lack of online collaboration

If in doubt, start a Google Group. If symptoms persist, see me in the morning; I may put you into one of my group therapy sessions.

Ok, so maybe that advice about collaboration tools is a bit too short. But rather than list tools, I’ll put up this list of collaborative tasks (not tools) as potential discussion topics to come back to, either in this session or in a future dedicated workshop or consultation.

Some collaboration modes/tasks

  • Talking to each other: email, video/audio conferencing, discussions

  • Writing together: word processing, wikis, Content Management Systems

  • Publishing: blogs, wikis, Content Management Systems, pod/video-casting, CVs, microblogging

  • Remembering and sharing: links, reference materials, bibliographic references

  • Storing: stuff

Which tools do you favour and why?

eResearch for Culture and Society

Back to the more interesting topic – eResearch as it relates to culture and society.

On the way to work on Monday I rode through local instances of some lovely spring weather (cold enough for me to want a jacket descending the mountains, warm by the time I got to the river), which got me thinking about the climate and in turn the Hawkesbury Institute for the Environment (HIE), which is just downstream from Penrith.

The eResearch team does a lot of work with HIE and the connection is easy to see. We obviously need large amounts of data to document, let alone model, climate, and we need to run climate simulations at atmospheric and oceanic scale as well as at smaller scale, like models of leaves or trees – all of which involves data management, computational tools and global collaboration.

Weather, climate, and the ICS planning day reminded me of an analogy of Michael Halliday’s:

We can perhaps use an analogy from the physical world: the difference between “culture” and “situation” is rather like that between the “climate” and the “weather”. 

I used to think about this analogy a lot, particularly when some lecturer was getting us undergrads to formulate grammar rules from half an A4 page of dodgy examples. Those ‘models’ (including Halliday’s) were severely limited by the number of data points that supported them.

Then I was introduced to corpus linguistics in the early 1990s in a workshop by John Sinclair. In the workshop, multiple instances of words in context were used as data to help decide what they actually mean. The Collins COBUILD dictionaries that Sinclair was involved in producing gave quite a different picture of the ‘climate’ of English than the traditional dictionary approach of forward-copying definitions, by using, you know, evidence to decide what words mean.

Fast forward to 2012, and the Macquarie dictionary decided to re-look at its definition of misogyny after the word got a bit of an airing in the Australian Parliament, as noted in this letter from the Macquarie’s editor. I knew that they would have been able to get plenty of data on the term’s use, and I thought of John Sinclair again. But the letter didn’t talk about data; curiously, it talked about housework.

As Editor of the Macquarie Dictionary, I picture myself as the woman with the broom and mop and bucket cleaning the language off the floor after the party is over.

The dictionary is one sort of ‘cultural climate’ record, so of course we have to have sceptics, like this example from the Herald Sun’s Patrick Carlyon, who, like a good climate-change sceptic, brings his own data to the table.

Given the ever-changing flow of words and their meaning, Macquarie has announced a raft of further definition shadings to reflect recent political events and current affairs:

Dog: To be known also as “cat”, after a two-year-old boy at an East Brighton childcare centre pointed at a chihuahua and meowed.

These days, dictionary editors don’t need no fancy ‘corpus’ like they used on those revolutionary Collins dictionaries; as we find out from another letter, they have the Internets, and not only that, they can still copy from others, just like they used to.

When it is brought to our attention, we are lucky these days to be able to draw on the immense resources of the internet such as newsfeeds, blogs, videos, etc., to research the use of the word over time, in different areas of the world, and in different kinds of texts. Of course, we can also check other dictionaries, to see if the same conclusions have been reached by our fellow lexicographers.

http://www.macquariedictionary.com.au/pdf/editors_response.pdf

I’m telling you this because I wanted to show a simple eResearch example from the cultural sphere. Halliday’s climate analogy seemed apt. Just as climate science is done with lots of data points – records of the weather at the finest possible scale that add up to a climate record – we can study cultural phenomena such as language by looking at data points of various kinds. Text is an easy example, because it’s easy to search and there’s now a lot of it to search.

Anyway, with all that in mind, I wanted to ask the researchers from the institute:

What infrastructure do you need to do ‘culture science’?

Or is that a stupid/naïve/offensive question?

While people think about that I thought I’d continue with a few examples, and come back to the discussion of collaboration and eResearch tools at the end.

The Feds don’t seem to think this idea of ‘culture science’ is entirely stupid, as they have funded a couple of million-dollar-plus projects to build virtual laboratories, not just in the sciences but in the humanities.

NeCTAR (Aus govt funding) Round 1 Virtual lab projects

A question for ICS researchers – what kind of cultural data is important to you?

And again in round 2 there is a UWS-led bid to build a virtual laboratory which is partly in the cultural domain; contract negotiations are proceeding on that one. This lab, to be built by the UWS MARCS Institute, is very much like ‘climate science’ for linguistics, musicology et al, bringing together data sets and letting researchers run tools on them, generating new analyses and annotations and feeding them back.

Round 2 Virtual Laboratories recommended for funding

  • The Industrial Ecology Lab 

  • Marine Virtual Laboratory (MARVL)        

  • Biodiversity and Climate Change   

  • Endocrine Genomics (EndoVL)      

  • Human Communication Science – this one is also in the humanities (and it’s UWS-led)


One of the data sources that HuNI (the Humanities Networked Infrastructure, one of the round 1 labs) is linking into its Virtual Lab is the National Library’s Trove. To demonstrate this I’ll try out Tim Sherratt’s QueryPic tool, searching across the Trove newspaper database for occurrences of terms to do with the workshop topic, which broadly speaking means stuff about Asian studies and the Asian century. Tim’s tool is an example of an eResearch tool that’s completely data driven.

QueryPic showing searches for Asia and Asian in Trove Newspapers 1803–1954

http://dhistory.org/querypic/4t/

You can click on a data point to see a list of articles.
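To make ‘data driven’ concrete, here’s a rough sketch of the kind of counting QueryPic automates, against the Trove API. This is not QueryPic’s actual code: the v2 endpoint, the l-decade facet parameter and the JSON response shape are my assumptions based on Trove’s public API, and you’d need your own API key from the National Library.

```python
import requests

API = "https://api.trove.nla.gov.au/v2/result"
KEY = "YOUR-TROVE-API-KEY"  # hypothetical placeholder

def newspaper_hits(term: str, decade: str) -> int:
    """Count Trove newspaper articles matching `term` in one decade.
    Decades are encoded as the first three digits of the year,
    e.g. '182' for 1820-1829 (an assumption about the API here)."""
    params = {
        "q": term,
        "zone": "newspaper",
        "encoding": "json",
        "n": 0,               # no records, just the hit count
        "l-decade": decade,   # facet limit by decade
        "key": KEY,
    }
    resp = requests.get(API, params=params)
    resp.raise_for_status()
    zone = resp.json()["response"]["zone"][0]
    return int(zone["records"]["total"])

# A crude QueryPic-style timeline: hits per decade for one term.
for decade in ("180", "181", "182"):
    print(decade + "0s:", newspaper_hits("Asia", decade))
```

Counting hits over a big digitised corpus like this is exactly the ‘lots of data points adding up to a climate record’ move described above.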

But be careful with these results!

Q. Why were the Aussie papers talking about Asia so much in 1820?

A. They were talking about a ship.

If only the Macquarie had something like this.

(There are some issues with this tool, not least of which is that this is not a stable, fixed data set: people are actively improving it via crowdsourced editing, and the data set is expanding, so it would be impossible to reproduce results. I’ve suggested that a solution would be to place snapshots of the data into the Research Data Storage Infrastructure starting to roll out now via the lead agency, The University of Queensland, so that researchers could work on known-stable corpora, and perform tricks like reindexing to improve performance on this class of query.)

Contrast this approach of re-using existing data in a fairly generic database to ask new questions with a very different kind of eResearch application, the Dictionary of Sydney, a project of the Arts Computing Lab at the University of Sydney; we can search for the Art Gallery of NSW, where we’ll be meeting, and from there browse a rich curated web of relationships between entries about buildings, people, institutions etc.

Another way of recording culture: The Dictionary of Sydney

http://dictionaryofsydney.org/organisation/art_gallery_of_new_south_wales

So, back to the question.

What infrastructure do you need to do ‘culture science’?

Copyright Peter Sefton 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia.

[Updated 2012-11-13, removed Andrew Leahy as co-author]

Tip: Arrange dock icons by shape, colour to reduce seek-time

Like the guy in this video I used to think it would be a good idea to arrange icons in the OS X dock by how often I used them, or maybe by type. But I found that whatever ordering I used I would have trouble finding things. I know that iTunes is a blue circle, but so are Skype and Safari (I rarely use it but sometimes I want to test something) – so the task of finding the app I wanted meant scanning for blue circles and often zooming my attention in on the wrong one. I hate to say it, but they all look the same to me, those pictures.

So, I now arrange them by shape, then colour. To seek, zoom in to the right shape-group. It’s easy to see the differences because they’re side by side rather than spread out.

Circles and circle-like things:

Squarish things:

Love those Apple apps that are so easy to tell apart:

Oh, and might as well arrange things with letter-icons in alphabetical order.

Et voilà!