Research Data @ the University of Western Sydney (Introducing a data deposit management plan to the research community at UWS)

I was invited to speak at the National Higher Education Faculty Research Summit in Sydney on May 22 about our Research Data Repository project. The conference promises to provide a forum for exploration.

Explore

  • Sourcing extra grant funding and increasing revenue streams

  • Fostering collaboration and building successful relationships

  • Emerging tools and efficient practices for maintaining research efficacy and integrity

  • Improving your University’s research performance, skills and culture to enable academic excellence

My topic is “Introducing a data deposit management plan to the research community at UWS”. This relates directly to the conference theme I have highlighted, on emerging tools and practice. My strategy for this presentation, given that we’re at a summit, is to stay above 8000m, use a few metaphors, and discuss the strategy we’re taking at UWS rather than dive too deeply into the sordid details of projects. As usual, these are my notes; I hope these few paragraphs will be more useful than just a slide deck, but this is not a fully developed essay.

There are two kinds of data: Working and Archival/Published

In very general terms, we have divided our data storage into two parts: the working Research Data Storage service where people get things done, collect data and work with it and the archival Research Data Repository part where stable, citable published data sets are looked after (by the library) for the long term.

This talk is not going to be all about architecture diagrams but here’s one more, from a recent project update, showing two examples of applications that will assist researchers in working with data. One very important application is HIEv, the central data capture/management platform for the Hawkesbury Institute for the Environment. This is where research teams capture sensor data, research support staff work to clean and package the data, and researchers develop models and produce derived data and visualisations. We’re still working out exactly how this will work as publications using the data start to flow, but right now data moves from the working space to the archival space, and thence to the national data discovery service. See this example of weather data (unfortunately the data set is not yet openly available for this one; I think it should be, and I’ll be doing what I can to make it so).

Data wrangling services

The other service shown on this diagram is Dropbox.com. We’d be hard pressed to stop researchers from using this service – it comes up in just about every consultation meeting. Researchers themselves must take responsibility for making sure that services like this are appropriate given their data management obligations under funder agreements and codes of practice. For those projects where Dropbox.com is appropriate we plan to let researchers invite the Research Data Store to share their stuff, thus creating a managed, backed-up copy at the university, and opening the way for us to provide useful services over the data (coming soon).

Data management

Yes, we have a web page about research data management, with some basic advice and links to more resources, but putting up web pages does not effect the kind of culture change needed to establish research data management, data re-use and data citation. As our Research Office head, Gar Jones, says, this will be a change similar to the introduction of Human and Animal ethics management, which will take several years to roll out.

Some key points for this presentation

I want to talk about:

  • Governance, open access, metadata, identifiers

  • The importance of the (administrative) research lifecycle

  • Policy supported by services rather than aspirations

eResearch = goat tracks

This is a concrete path on the Werrington South (Penrith) campus of the University of Western Sydney. The path is there because people kept walking through the garden bed, which was in between where the shuttle bus stops and where they wanted to be, at the library. As I said at a similar conference for IT-types last year:

Groups like mine work in the gap between the concrete and the goat track, my job is to encourage the goats.

And once we’ve encouraged the goats to make new paths, we need to get the university infrastructure people to come and pave the paths.

What’s over the horizon?

What do research administrators and IT directors need to be thinking about?

  • Changes in the research landscape – more emphasis on data reuse and citation, increasing emphasis on defensible research mean data will become as important as citations

  • Providing access to publications and data so it can be reused.

  • (e)Research infrastructure in general, where collaboration must not be constrained by the boundaries of individual institutional networks and firewalls.

Any others?

Research data, Next Big Thing?

The Australian National Data Service runs a data-discovery service designed to advertise data for reuse.

Governments are joining in

As research organisations, we want to have infrastructure for data management, and a culture of data management that involves forward planning, and data re-use. So the next section of the talk is about how we need to:

  • Stop the fat multinational-publisher tail from wagging the starving research dog. Ensure research funded by us is accessible and usable by us.

  • Understand our researchers and their habits, so we can help them take on this new data management responsibility (actually it’s not a new responsibility, but many have simply been paying no attention to it, in the absence of any obvious reason to do so).

  • Sort out the metadata mess most universities are swimming in.

Now for the big picture stuff.

Open Free scholarship is coming? (Just beyond that ridge)

OA is a Good Thing,

Which will:

  • Reduce extortionate journal pricing.

  • Provide equitable access to research outputs to the whole world.

  • Open Access to publications and Coming Soon: Open Access to data.

  • Promote Open Science and Open Research.

  • Drive huge demand for data management, cataloguing, archiving, publishing services

http://aoasg.org.au/

There are competing models for open access. Bizarrely the discussion is often framed as a contest between ‘Green’ and ‘Gold’. It’s a lot like the State of Origin Rugby League, a contrived but popular-in-obscure-corners of the world contest where the ‘Blues’ and ‘Maroons’ run repeatedly into each other. In both State of Origin and Open Access, the current winners are large media companies. At least being an Open Access advocate doesn’t give you head injuries.

Green OA refers to author-deposited pre-publication versions of research articles. Gold means that the published version itself is ‘Open’ for some ill-defined definition of open, often at a cost of thousands of dollars, out of the researcher’s budget. Green or Gold, a lot of so-called Open Access publishing operates with no formal legal underpinnings, that is, without copyright-based licenses. For example when I deposited a Green version of a paper I had written here, and wrote to the publisher asking them to clarify copyright and licensing issues I got no reply.


We have a brief window now to try to build services for research data management that do have a solid legal basis and avoid following some of the OA movement’s missteps, but this is not trivial (1).

Identity management is crucial

I have used a variant of the above dog picture before to talk about identity management. This dog has a name but it’s a terrible way to find out about him as he has a much more famous namesake.

Like the rest of us, this dog has all sorts of identifying names and numbers – a microchip number linked to a database, an ID assigned by the RSPCA, patient numbers at veterinary practices, which may be linked to more than one human, phone numbers on his tag etc. Point is, it’s much worse for researchers than for dogs – identities are maintained all over the place. Foley and Kochalko put it like this:

While much has changed since the days of David Livingstone, we continue to struggle with associating individuals with their works accurately and unambiguously. Author name ambiguity plagues science and scholarship: when researchers are not properly identified and credited for their work, dead-ends and information gaps emerge. The impact ripples throughout the ecosystem, compromising collaboration networks, impact metrics, “smarter” research allocations, and the overall discovery process. Name ambiguity also weighs on the system by creating significant hidden costs for all stakeholders. (2)


To do metadata management well we need to sort out all sorts of naming and identifying issues, dealing correctly with potential causes of confusion: multiple people with the same name, people with multiple names, over time and simultaneously, and name variants. Even where there are agreed subject codes, like the Field of Research codes that are heavily used in research measurement exercises, they can get mixed up as different databases use different variants.
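To make the naming problem concrete, here is a minimal sketch in Python of an index that keys people on a persistent identifier rather than on a name. Everything in it is an assumption for illustration: the `NameIndex` class, the `normalise` helper and the `id:1` style identifiers are hypothetical, and a real system would use something like ORCID identifiers and much more careful name matching.

```python
from collections import defaultdict


def normalise(name):
    """Crude normalisation (illustrative only): lower-case, drop dots and commas."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


class NameIndex:
    """Maps persistent identifiers to name variants, and names back to identifiers."""

    def __init__(self):
        self._names_by_id = defaultdict(set)   # identifier -> names as written
        self._ids_by_name = defaultdict(set)   # normalised name -> identifiers

    def add(self, identifier, name):
        """Record that this identifier has been seen under this name."""
        self._names_by_id[identifier].add(name)
        self._ids_by_name[normalise(name)].add(identifier)

    def candidates(self, name):
        """Everyone who has used this name; more than one means it is ambiguous."""
        return sorted(self._ids_by_name.get(normalise(name), set()))

    def names_for(self, identifier):
        """All the name variants recorded for one person."""
        return sorted(self._names_by_id.get(identifier, set()))


index = NameIndex()
index.add("id:1", "J. Smith")      # hypothetical identifiers, not real ORCIDs
index.add("id:1", "Jane Smith")    # same person, different name variant
index.add("id:2", "J. Smith")      # different person, same name
```

With this data, `index.candidates("J. Smith")` returns both identifiers, which is exactly the ambiguity a name on its own cannot resolve, while `index.names_for("id:1")` lists the variants that belong to a single person.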

We try to work out how to fit new processes into existing workflows

At the University of Rochester, when they installed an institutional repository, the team conducted ethnographic research on their research community (3). We have not gone that far, but our Research Data Repository project does try to pay attention to what researchers do as part of their current work, and to fit new processes into existing ones.


For example, the above scenario tries to capture the interactions that would happen when a researcher is required by a journal to deposit data before publication. We spend a lot of time talking to the Office of Research Services (ORS) and research librarian team about how we can fit in with their existing processes, and how to minimise negative impacts on research groups. Research Offices are used to responding to changing regulatory environments so adding new fields to forms etc is straightforward. Changing IT services is much harder; the ITS is much bigger than ORS, new services need to be acquired, provisioned and documented, and the service desk team has to be taught new processes.

Challenge: how to stop the corporate publishing tail from wagging the scholarly dog

This is rather a substantial issue to try to talk about in a discussion about research data management and repositories, but it’s essential to keep an eye on the big picture. We know that scholarship has to change, publishing has to change, but we don’t know how. We need to develop strategies for how we want it to change. Some examples of where this is important:

  • Policy on ‘ownership’ of intellectual property rights over data needs to be established. This is not as simple as it is for publications, as data are not always subject to copyright (1).

  • Data citation is going to be an important metric.

New models are needed. People like Alex Holcombe from Sydney uni are developing them:

Science is broken; let’s fix it. This has been my mantra for some years now, and today we are launching an initiative aimed squarely at one of science’s biggest problems. The problem is called publication bias or the file-drawer problem and it’s resulted in what some have called a replicability crisis.

When researchers do a study and get negative or inconclusive results, those results usually end up in file drawers rather than published. When this is true for studies attempting to replicate already-published findings, we end up with a replicability crisis where people don’t know which published findings can be trusted.

To address the problem, Dan Simons and I are introducing a new article format at the journal Perspectives on Psychological Science (PoPS). The new article format is called Registered Replication Reports (RRR).  The process will begin with a psychological scientist interested in replicating an already-published finding. They will explain to we editors why they think replicating the study would be worthwhile (perhaps it has been widely influential but had few or no published replications). If we agree with them, they will be invited to submit a methods section and analysis plan and submit it to we editors. The submission will be sent to reviewers, preferably the authors of the original article that was proposed to be replicated. These reviewers will be asked to help the replicating authors ensure their method is nearly identical to the original study.  The submission will at that point be accepted or rejected, and the authors will be told to report back when the data comes in.  The methods will also be made public and other laboratories will be invited to join the replication attempt.  All the results will be posted in the end, with a meta-analytic estimate of the effect size combining all the data sets (including the original study’s data if it is available). The Open Science Framework website will be used to post some of this. The press release is here, and the details can be found at the PoPS website.

http://alexholcombe.wordpress.com/2013/03/03/registered-replication-reports-are-open-for-submissions/

This seems like a positive note on which to end. Hundreds of researchers are trying to fix scholarship, they’re the ones we need to talk to about what a data repository or a data management plan should be.

Science is broken let’s fix it

1. Stodden V. The Legal Framework for Reproducible Scientific Research: Licensing and Copyright. Computing in Science & Engineering. 2009;11(1):35–40.

2. Foley MJ, Kochalko DL. Open Researcher and Contributor Identification (ORCID). 2012 [cited 2013 May 21]; Available from: http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1133&context=charleston

3. Lindahl D, Bell S, Gibbons S, Foster NF. Institutional Repositories, Policies, and Disruption. 2007 [cited 2013 May 21]; Available from: http://open.bu.edu/xmlui/handle/2144/919

Creative Commons License
Research Data @ the University of Western Sydney (Introducing a data deposit management plan to the research community at UWS) by Peter Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

Running an Open Source project from a university dev team

Steven Hayes from Arts eResearch at the University of Sydney invited me to visit their group and talk about running open source software projects, as they are making their Heurist (semantic database-of-everything) software open source. This was more of a conversation than a presentation, but I prepared a few ‘slides’ to remind me of which points to hit. Here are my notes. The focus here was not on why go open source, or open source in general, it was about doing it in a small university-based team. Comments about how various uni open source projects run would be appreciated.

I have been involved in creating two sizeable code-bases both released by the University of Southern Queensland as open source. They had very different histories. I’ll talk about both and how they run, although actually one of them doesn’t run any more in any meaningful way.

Two projects I started…

… on which other people* did most of the work

  • ICE – the Integrated Content Environment. Used at USQ for creating course materials for delivery online and in print. Almost no activity on this outside of USQ these days. Inside USQ? I don’t know for certain, but I think it is still in use, and finding a replacement has proven difficult, which doesn’t surprise me as that was the reason we built it in the first place.

  • ReDBOX – the Research Data Box (and The Fascinator, the underlying toolkit).

*Thanks to Ron Ward, Oliver Lucido, Linda Octalina, Duncan Dickinson, Greg Pendlebury, Daniel de Byl, Bron Chandler, Tim McCallum, Cynthia Wong, Jason Zejfert, Sally MacFarlane, Caroline Drury, Pamela Glossop, Warwick Milne, Sue Craig, Vicki Picasso, Dave Huthnance, Shirley Reushle and the late Alan Smith who made, tested, championed and supported these projects. Thanks also to funding from the Australian government via ANDS, ARROW and other streams. Sorry if I forgot anyone.

(At this point I wanted to check that everyone knows what Open Source means, making sure that we all understand how Richard Stallman made software free using copyright law. Whoever holds the copyright in a bit of software, which is likely to be whoever wrote it, or their employer, can control distribution by using a licence, a legal instrument. Stallman’s insight was that a licence could be used to enforce sharing, openness and freedom: you can use this stuff I created provided you promise to share it with other people (that’s not a quote). Oh, and people working in this space should also understand the difference between Free and Open Source [1].

But I forgot.)

RTFM

Above, I linked to a free book on producing Open Source software [1] by Karl Fogel which seems to cover most of what you’d need to know. I haven’t read it all, but it looks useful.

But I don’t like this

The book begins:

Most free software projects fail.

I think that’s silly, talking about failure without first defining success.

Me, I’m not sure that all these scenarios Fogel lists are failures at all, there are lots of reasons to release code and they are not all necessarily about building a substantial community:

We tend not to hear very much about the failures. Only successful projects attract attention, and there are so many free software projects in total[2] that even though only a small percentage succeed, the result is still a lot of visible projects. We also don’t hear about the failures because failure is not an event. There is no single moment when a project ceases to be viable; people just sort of drift away and stop working on it. There may be a moment when a final change is made to the project, but those who made it usually didn’t know at the time that it was the last one. There is not even a clear definition of when a project is expired. Is it when it hasn’t been actively worked on for six months? When its user base stops growing, without having exceeded the developer base? What if the developers of one project abandon it because they realized they were duplicating the work of another—and what if they join that other project, then expand it to include much of their earlier effort? Did the first project end, or just change homes?

What’s the first thing that comes to mind when you think of Open Source?

Linux? Apache? WordPress?  Firefox?

The hits. The stadium-filling rock-star projects?

Your band has 99.9% probability of staying in the garage

Figure 1 Me (the good looking one) and cousin Tim at the Springwood Sports club, about to perform with a community uke-group. No plans for world-domination, playing for family, who are obliged to attend, and even some people who, for some reason, choose to come. #Notfailure.

It’s important to work out why you are going to release software as Open Source – think about the audience. One very important audience is you, yourself. If you work on code as part of your job, then your employment contract may well mean that your employer owns the copyright. Do you want to be able to continue using it in your next job? Show potential employers? Making it open source helps your future self.

I know this first hand.

Universities are not as stable as they seem, or you may hope. At the Australian Digital Futures Institute at USQ we began by hosting code repositories and websites internally. I reasoned that the university would be a good bet for maintaining persistence of these resources.

But then one Gilly Salmon came to our institute to be the new professor, decided, along with the rest of the senior leadership team that there was altogether too much making the digital future going on in the Australian Digital Futures Institute, too much technology. They let just about all the technical staff go, no matter how useful they were to the organisation, or how pregnant they happened to be (we’re a relationship brand, the director of marketing told me, so we shouldn’t be continuing to develop software to deliver award-winning distance-ed services).

Web sites that would still have value are just gone from public view, including, ironically, the PILIN project site, which was about persistent identifiers. Even the ICE website, which is full of useful stuff for USQ itself, now appears to be only accessible via the Wayback machine. They’re still using it but they turned off the website anyway. The code, however, is sitting on Google Code, so we all still have access to it.

This sort of thing happens all the time. For a couple of us, the NextEd refugees, this was the second redundancy associated with USQ. Kids, it is prudent to make sure that any code you might want to re-use later in your career is released under an open licence, and documentation, web sites etc likewise under creative commons. Think of it as a professional escape pod.

The ReDBOX project survived this ADFI shut down, because it had been open source from the beginning but further funding had to be redirected to another university which was willing to host the building of a digital future.

Lessons

  • Open Source can be worth doing even if the audience is your future self

  • Don’t trust someone else to keep your website up

  • If you want a community you’ll (likely) have to build it

  • Every project is different, so you need to structure yours around your users

Oh, and the answer to most questions is on Stack Exchange. I decided that this list was worth using as a starting point for discussion.

http://programmers.stackexchange.com/questions/51553/checklist-for-starting-an-open-source-project

Havoc P said: [with additions by me post the discussion at USYD]

Things I’d put in the early priorities are:

  • have a simple “what is it?” web site with links to some discussion forum (whether email or chat) and to the source code repository

    [Mailing lists are usually best IMO – forums can be empty, echoing and make your project look unloved. A tech list is a must, always, but other communications should be built around the reality of your project. No user community yet? Build one. Others over at Stack Exchange added that once you have a tech-list it is best to hold or log all your discussions there so architectural decisions are transparent and the community can engage.

    On the ReDBOX project there are two main mailing lists, one for the techies and one for the users (mostly library staff), and lots of virtual and face-to-face get togethers. There is a committers group who are in charge of what gets into the trunk and various ad-hoc arrangements to sponsor sub-projects at the dozen or so sites using the software. The groups and how they interact were all created to serve that community, not from some manual of best practice, although it is all informed by collective experience of open source projects.]

  • be sure the code compiles and usually works, don’t commit work-in-progress or half-ass patches on the main branch that break things, because then other people’s work would be disrupted

    [Well, OK, but if you’re releasing an existing code base then don’t get too hung up on making things perfect (a) it will be a huge waste if there is no demand for your code and (b) don’t be unnecessarily shy, most open source projects are like busking, not stadium rock, nobody is watching you waiting to pounce on your errors.]

  • put a license file in the code repository with a well-known license, and mark the copyright owner (probably you, or your company). don’t omit the license, make up a license, or use an obscure license.

  • have instructions for how to contribute, say in a HACKING file or include in your README. This should include where to send patches, how to format patches, code indentation rules, any other important conventions of the project

  • have instructions on how to report a bug

  • be helpful on the mailing list or whatever your forums are
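Several of the items above (a licence file, a README, contribution instructions) are things you can check mechanically before you announce a project. Here is a rough sketch in Python, under stated assumptions: the `missing_basics` function and the list of accepted file names are my own invention for illustration, not part of Havoc P’s checklist, and it only looks at the top level of the repository. It says nothing about whether the licence is a well-known one or whether the bug-reporting instructions are any good.

```python
from pathlib import Path

# Hypothetical mapping from checklist item to file names that would satisfy it.
# Names are compared without their extension, so README.md counts as README.
EXPECTED_FILES = {
    "license": {"LICENSE", "LICENCE", "COPYING"},
    "readme": {"README"},
    "contributing": {"HACKING", "CONTRIBUTING"},
}


def missing_basics(repo_dir):
    """Return the checklist items with no matching file at the top of repo_dir."""
    present = {
        path.name.split(".")[0].upper()
        for path in Path(repo_dir).iterdir()
        if path.is_file()
    }
    return sorted(
        item for item, names in EXPECTED_FILES.items() if not present & names
    )


if __name__ == "__main__":
    # Check the current directory and report anything that looks absent.
    for item in missing_basics("."):
        print(f"No {item} file found")
```

Something like this could run as a pre-release step, but it is only a reminder; a human still has to read the files.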

More from Havoc P

After those priorities I’d say:

  • documentation (this saves you work on the mailing list… make a FAQ from your list posts is a simple start)

  • try to do things in a “normal” way (don’t invent your own build system or use some weird one, don’t use 1-space indentation, don’t be annoyingly quirky in general because it adds learning curve)

  • promote your project. marketing marketing marketing. You need some blogs and news sites and stuff like that to cover you, and then when people show up interested, you need to talk to them and be sure they get it working and look at their patches. Maybe mention your project in the forums for related projects.

    [Yes, this is a huge one. One of the big differences between ReDBOX, which is no hit, but has a solid user base and ICE which never made it out of USQ is that Vicki Picasso from Newcastle Uni and I marketed the hell out of ReDBOX early to a very specific community of user-organisations. We needed a community so the software would have a sustainable base, so we designed the software for the community and sought input on the design as broadly as we could.

    With ICE, I talked about it to lots of the wrong people and didn’t sell it to the right ones, other distance ed unis, but that was partly because it conferred a competitive advantage on USQ. This comes back to the point above about success vs failure – there’s more than one way to succeed.]

  • always review and accept patches as quickly as humanly possible. Immediately is perfect. More than a couple days and you are losing lots of people.

  • always reply to email about the project as quickly as humanly possible.

  • create a welcoming/positive/fun atmosphere. don’t be a jerk. say please and thank you and hand out praise. chase off any jackasses that turn up and start to poison the community. try to meet people in person when you can and form bonds.

[1] K. Fogel, Producing open source software: How to run a successful free software project. O’Reilly Media, Inc., 2005.

Creative Commons License
Running an Open Source project from a university dev team by Peter (pt) Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

Repositories! (What are they good for?)

Creative Commons License
Repositories! (What are they good for?) by Peter Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

Georgina Edwards invited me to Intersect NSW to give a talk to the software engineering team about repositories in eResearch. There were also quite a few eResearch analysts in attendance, not to mention a couple of members of the senior management team. (Just in case you’re wondering, the answer to the question in the title is not “absolutely nothing”).


Here are my notes, with embedded slides, which I put together on the train to and from the CBD (ie quick and dirty).

The summary: repository means a lot of different things, but the main sense I talked about with the Intersect team was ‘data-store component’. I tried to cover why using a repository in an eResearch project might be important: repositories can provide a lot of ready-made functionality, particularly in the area of digital preservation, but also access to indexing services and content-transcoding to generate new formats from things ingested. I talked about one aid for thinking about repository services which I think is useful – the Repository Micro-services framework from the California Digital Library – and ran through some of the repository frameworks that people in the eResearch.au world might encounter.

The liveliest discussion was around RDF, the Resource Description Framework, and what it’s good for. I made the assertion that RDF was the best practice approach to storing metadata, allowing for built-in extensibility. RDF uses URIs as names for both things and relations, which reduces ambiguity and aids interoperability. But I think it’s important to draw the line between RDF as a good way to do metadata and annotation, and the assumption that an RDF query language (via an RDF triple-store) is always going to be needed or even work. I’m sceptical about the promise of RDF as some kind of super semantic world-wide web of knowledge you can query for the answer to anything, but it’s clearly a good way to do metadata – there’s no excuse for inventing a new metadata schema that is not RDF based these days. (Use the comments if you want to discuss).
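To illustrate the point about URIs naming both things and relations, here is a minimal sketch in plain Python, with no RDF library and no triple-store. The Dublin Core terms namespace is real; the `http://example.org/` URIs, the `sensorType` property and the helper functions are placeholders made up for this example.

```python
DCT = "http://purl.org/dc/terms/"   # Dublin Core terms vocabulary
EX = "http://example.org/"          # placeholder namespace, an assumption


def uri(value):
    """Write a URI in N-Triples syntax."""
    return f"<{value}>"


def literal(text):
    """Write a plain string literal in N-Triples syntax."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


dataset = uri(EX + "dataset/1")

# Each statement is (subject, predicate, object); subject and predicate are URIs.
triples = [
    (dataset, uri(DCT + "title"), literal("Weather station data")),
    (dataset, uri(DCT + "creator"), uri(EX + "person/42")),
]

# Extensibility: a property from another vocabulary is just one more triple.
# No schema change is needed, and the URI keeps it distinct from other terms.
triples.append((dataset, uri(EX + "vocab/sensorType"), literal("anemometer")))


def to_ntriples(statements):
    """Serialise the statements one per line, N-Triples style."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in statements)


def objects(statements, subject, predicate):
    """Look up values with a plain list scan; no query language involved."""
    return [o for s, p, o in statements if s == subject and p == predicate]


print(to_ntriples(triples))
```

The `objects` lookup is deliberately a simple scan over a list, in keeping with the argument above: the metadata is RDF, but nothing here assumes a triple-store or a query language.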

The talk

I thought I’d start from something that the developers would be familiar with. Source-code repositories.

To a bunch of software developers…

… a repository is a place to put code

But it’s not just a place to put things. On a development project, the repository offers a number of services, like integration with task management systems, versioning, search and collaboration. I’m sure everyone in the professional eResearch world would be horrified to find a development project that wasn’t using source-code management via a code-repository: Git, Mercurial, or at the very least something ancient like Subversion.

What’s a repository to me?

The first time many of us heard the term repository in Higher Education was in connection with the Open Access movement, when a few forward thinking universities in Australia (QUT, UQ, USQ and even some others outside of Queensland) began to set up Institutional Repositories, using software like Eprints or Dspace. These were essentially online databases of PDF files for academic works, with bibliographic metadata. They were also seen as sites for preservation of materials, and had services to advertise their contents to the world, via the OAI-PMH metadata harvesting standard, and via metadata embedded in the web pages that described the academic works.
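For developers who have not met OAI-PMH, here is a rough sketch of what harvesting looks like, using only the Python standard library. The `ListRecords` verb, the `oai_dc` metadata prefix and the XML namespaces come from the OAI-PMH and Dublin Core specifications; the base URL, the sample response and the function names are made up for illustration, and the sketch makes no network request and ignores resumption tokens and error responses.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build the URL a harvester would fetch to pull records from a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# A cut-down, invented response of the shape a repository might send back.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example thesis</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def titles_by_identifier(xml_text):
    """Map each record's OAI identifier to its first Dublin Core title."""
    root = ET.fromstring(xml_text)
    titles = {}
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        titles[identifier] = record.findtext(f".//{{{DC_NS}}}title")
    return titles


print(list_records_url("https://repository.example.org/oai"))
print(titles_by_identifier(SAMPLE_RESPONSE))
```

The point is that the protocol is little more than HTTP GET requests with a verb parameter and XML responses, which is why so many repository packages ship with it built in.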

A group of us put together a presentation for Open Repositories last year on the growth of Institutional Research Data Repositories, alongside the ‘traditional’ Institutional Publications repository.

There are a few senses of the word:

  • Repository-as-database

  • Repository-as-application

    Institutional Repository or Data Repository

  • Repository-as-lifestyle (ie analogous to a ‘library’)

People tend not to be very careful about these senses of the word repository and indeed the boundaries are actually quite blurry. If you have chosen to call your application a repository, then that term brings a certain gravitas, you’d expect the repository-as-application to be something that’s not just for Christmas, but something you’ve made a commitment to feeding and walking at least for some time.

With that in mind, the point of this discussion is: what might a repository-as-data-store be good for in an eResearch project?

Services in a typical repository-as-datastore underneath an application:

  • If the app goes away the data is/are safe independently of the application services,

    • with all digital objects stored in standard formats

    • with standardised metadata

    • so they can be preserved*.

  • You get OAI-PMH (pull/out) and SWORD (push/in) built in

  • Built in security/access control

    (but beware of actual real-world performance)

  • Content transcoding

    (thumbnails / image viewers / video versions)

Nobody put up their hand and said “Hey that’s just a CMS” (Content Management System), but the answer would have been, yes, of course. A repository-as-application is just a serious CMS, one designed for maintaining important stuff in a well-managed way. Indeed, the University of Queensland is moving its Institutional Repository to a Drupal-based system, and leaving behind the repository-as-data-store that used to sit underneath it.

The Repository Micro-services framework from the University of California captures all these services really nicely.

Repository Micro-services

http://journals.tdl.org/jodi/index.php/jodi/article/view/1605/1766

This is implemented in http://merritt.cdlib.org/, which does not seem to have an obvious application to download.

Some repository software you may hear about

  • Eprints (Perl)

    Good for publications repositories, has been used for cultural collections, learning – has every imaginable interface to repository content

  • DSpace good for a range of digital object collections

    eg Andrea Schweer’s talk on a data capture app Building a repository for freshwater quality data

  • Fedora Commons (back end)

  • CKAN – a Research data Hub app (Python)

  • Micro-service components like BagIt for packaging and PairTree for efficient file-storage.

NOTE: All of the above apart from Eprints include built-in search using Apache Solr.
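As an illustration of how small a micro-service component can be, here is a rough Python sketch of the PairTree idea: turning an identifier into a directory path made of two-character segments. The character-cleaning rules follow my reading of the Pairtree specification; the function names are my own, and the sketch only computes the path rather than creating directories or storing an object.

```python
# Characters that the first cleaning step hex-encodes as ^hh, in addition to
# anything outside the visible ASCII range (an assumption based on the spec).
HEX_ESCAPE = set('"*+,<=>?\\^|')


def clean_identifier(identifier):
    """Make an identifier safe to use in file-system paths."""
    cleaned = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in HEX_ESCAPE:
            cleaned.append("^%02x" % byte)
        else:
            cleaned.append(ch)
    # Second step: single-character substitutions for common path-unsafe
    # characters. This must come after hex-encoding, because it introduces
    # '=', '+' and ',' characters of its own.
    return "".join(cleaned).translate(str.maketrans("/:.", "=+,"))


def pairtree_path(identifier):
    """Split the cleaned identifier into two-character path segments."""
    cleaned = clean_identifier(identifier)
    return "/".join(cleaned[i:i + 2] for i in range(0, len(cleaned), 2))


if __name__ == "__main__":
    print(pairtree_path("ark:/13030/xt12t3"))
```

Because the path is derived from the identifier alone, any tool can find an object on disk without consulting a database, which is part of what makes the data safe if the application goes away.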

In conclusion, I asked: why use one of the above, particularly when on first acquaintance, something like Fedora can look like an anchor, impeding forward progress?

The basic answer is that if in the long run your project is going to require some large percentage of the repository micro-services discussed above, then you’re going to end up writing your own Fedora-like thing. Also, I think it’s better to be part of a community looking at these things together. For example Fedora is not a magic solution to being able to re-use repository content between applications, but it is reassuring to know that the Hydra and Islandora communities are talking about interop via their Hylandora project and there is a significant amount of preservation-work happening in the Fedora world.

To some of us, the idea of doing certain kinds of eResearch project without a back-end repository (as in something that has managed services around preservation under some kind of serious governance) would be like doing software development without a code repository. The question, of course, is which kinds of project? And of course, if you do need one, where do you put the repository part in the architecture?

CAIRSS – CAUL Australasian Institutional Repository Support Service

By Dr Peter Sefton (University of Western Sydney) with Ms Caroline Drury (University of Southern Queensland).

On Wednesday 5th Dec I (Peter) visited the Japanese Digital Repository Federation at their invitation and expense, to talk about how our respective repository communities are organised.  I’d like to thank the DRF for this opportunity to make the brief trip to Tokyo. Caroline was invited but was unable to make it. The DRF folks have put up a summary of the meeting, in Japanese. Note that while my comments on that page are listed as “CAIRSS” I was not representing CAIRSS (the CAUL (Council of Australian University Librarians) Institutional Repository Support Service), I attended as a member of the Australian/Australasian repository community. I also attended the DRF international conference in 2009 on a similar basis, when I did happen to be associated with CAIRSS, so the organisers knew me. I did talk a fair bit about CAIRSS, in the context of other projects in Australia.

Before I went I polled the CAIRSS-list to find out if there were any questions people would like answered – more on that below.

First, a bit about me and repositories:

  • I was the technical lead for the Regional Universities Building Research Infrastructure Collaboratively (RUBRIC) project which was hosted by the University of Southern Queensland (and the de-facto project manager for several months during the project establishment phase).

  • I led a small team at USQ subcontracting to the ARROW project during 2008, providing technical support to ARROW, and repository services to small Higher Education institutions in Australia.

  • I worked on USQ’s successful bid to host the first CAIRSS repository support service in Australia and acted as a senior strategist for the service, for example working on guides such as the one on how to get into Google Scholar et al, and negotiating major changes to repository infrastructure such as the closure of the Australian Digital Theses search service and its subsumption into the National Library of Australia’s Trove service.

  • I was not involved in running the second version of CAIRSS from 2011-2012 but I have remained part of the repository community in Australia and attended the 2012 community day where I spoke about trends in repository software in the context of organisational governance.

  • I am on the conference committee for the Open Repositories series of conferences (from 2011) – the call for papers for the 2013 conference is just out.

The DRF

Shigeki SUGITA started off proceedings with a presentation about the activities of the Digital Repository Federation.

Perhaps the most striking thing from an Australasian point of view is a staffing issue; talented repository managers are required by management to rotate through a variety of library jobs meaning that there is constant turnover and a lack of opportunity to specialise. There are similar pressures at play in our libraries I guess, with a need to train new repository staff, and significant turnover but not to the same extent.

Japanese repositories are very much driven by an Open Access agenda, which is quite different from the situation in Australia, where two different government measurement schemes collect information about publications and push repositories in another direction; more on that below.

Another interesting dimension to the Japanese scene is that they have a number of consortial-repositories where a number of institutions share a repository. This is an idea that came up in Australia in the mid-to-late 00’s several times, but never got off the ground. It might be worth revisiting some time both for institutional publications repositories

The presentation

I presented from an earlier version of the ‘slides’ below – I have added some notes from the discussion and clarifying material.

CAIRSS background

Parent projects

The Australian government made significant investments in institutional repositories via programs such as:

  • APSR Australian Partnership for Sustainable Repositories (ended 2008)

  • ARROW Australian Research Repositories Online to the World

  • RUBRIC, Regional Universities Building Research Infrastructure Collaboratively.

These projects and other investments in the repository world were via these funding streams (for which the websites have disappeared):

  • ASHER – Australian Scheme for Higher Education Repositories [Sponsored the development of repositories in all Higher Education Institutions in Australia]

  • SII – Systemic Infrastructure Initiative [APSR, ARROW and RUBRIC]

  • BAA – Backing Australia’s Ability

We talked in some detail about how these funding schemes have influenced the establishment of repositories; while the initial driver for Australian repositories was open access, the Excellence in Research for Australia (ERA) measurement exercise and its failed predecessor stalled the Open Access movement to some extent, by requiring universities to collect non-open access materials in complicated ways.

CAIRSS v1 2009-2011

Coming out of the investments outlined above, CAIRSS was established on March 16, 2009:

The first CAUL service was funded for two years, with the approval of Department of Innovation (DIISR), with monies remaining from the successful ARROW project, supplemented by CAUL member subscriptions.

CAIRSS Structure

CAIRSS v1 staffing

This version of the CAIRSS service covered Australian universities, and was staffed with:

  • A full time repository manager. (USQ)

  • A full time technical staff member. (USQ)

  • A full time copyright officer. (Swinburne)

  • A part-time strategic advisor and other senior support.

CAIRSS v1 approach

The initial CAIRSS service included:

  • Annual meetings with both a general and technical strand.

  • Copyright workshops for private discussions of copyright issues.

  • Maintained ‘sandbox’ instances of repository software.

  • Creation and maintenance of web pages and guides on repository issues such as statistics, indexing and an extensive copyright guide.

  • Direct support for government reporting processes – chiefly the establishment of the Excellence in Research for Australia (ERA) exercise.

CAIRSS v2 2011-2012

With added New Zealanders

The second version of CAIRSS was funded from member subscriptions and expanded to include New Zealand:

The second CAUL service is also funded for almost two years and incorporates many of New Zealand’s higher education institutions. With this expansion, CAIRSS now stands for the CAUL Australasian Institutional Repository Support Service.

CAIRSS v2 Staffing

This version of CAIRSS had a reduced team in the central office at USQ.

  • One full time repository manager.

  • One half-time technical officer.

  • Part time senior manager.

  • Part time copyright person at Swinburne.

CAIRSS v2 approach

The second CAIRSS service included:

  • Annual meetings with a general strand.

  • Discussion list for members only.

  • Copyright workshops for private discussions of copyright issues.

  • Maintenance of web pages on repository issues such as statistics etc.

  • Provided support for government reporting processes (ERA)

Post CAIRSS: CRAC 2013-?

From 2013 CAIRSS will no longer exist – it is being replaced with a new service known as CRAC. I gather that the feeling of CAUL was that the community is now mature enough to be self-sustaining.

CRAC (CAUL Research Advisory Committee) NEW! from 2013

CAUL Research Advisory Committee

(will undertake some of the work carried out by CAIRSSAC and COSIAC, from 2013)

Program: Research
Chair: Heather Gordon (2013-2014)
Members: TBC
CONZUL: Janet Copsey (2013- )
Practitioners: TBC

            http://www.caul.edu.au/about-caul/caul-committees

CRAC anticipated activities

  • Running the annual event

  • Annual copyright workshop

  • Maintaining the CAIRSS discussion list

New Open Access group: AOASG

There is a new Open Access group in Australia which is not part of the CAUL/CAIRSS family.

From Danny Kingsley:

The Australian Open Access Support Group (AOASG) was launched during Open Access Week in 2012. It is a consortium of six universities with open access policies  – QUT, ANU, Macquarie University, Newcastle University, Charles Sturt University and Victoria University. The group aims to provide support, lobbying and advocacy for open access in Australia. Membership will be extended to other research institutions and affiliates during 2013.

http://www.aoasg.org.au [NOTE the website is currently being built – may not be live yet]

General comments about the CAIRSS/CRAC community

Small task-force groups now self-organize

The repository community is well established and members of the community run their own investigations into repository matters. These range from asking questions on the list about repository practices, to running formal surveys. An example from the broader CAUL community of which CAIRSS is a part is the IR / Open Access Funding Survey by Danny Kingsley and Vicki Picasso.

Opportunities

DRF collaboration?

From Caroline Drury:

It would be interesting for CRAC to consider something similar to the DRF model  - eg at the beginning of each two year period, to meet and consider what projects could be done in the space, within Australia / NZ. Then perhaps a call could be put to institutions who could then (according to their strengths) be assigned to do that project in a collaborative model, using their own funds. I’m not sure if it would work here, given the big physical distances, but I think it’s a good model in a scenario where there’s limited funding. 

I’m sure CRAC will consider this.

April event in Tasmania

An event is being organised in Tasmania in April around the following themes. Regional participation would be most welcome.

From David Flanders at ANDS:

  1. linking research data and research publications

  2. re-architecting the repository (if we started now based on what we know).

  3. business metrics/analytics from scholarly systems

  4. research profiles and author identifiers

  5. emerging scholarly vocabularies, linkeddata & 

  6. scholarly search engines (beyond Google Scholar)

  7. APIs and bringing all these systems together via shared resources.

Questions (with my notes)

Natasha Simons at Griffith University had three questions that provided a great structure for the discussion part of the meeting. I tried to take notes (included below) as well as talk.


1. How’s the Memorandum of Understanding between Digital Repository Federation (Japan), UKCoRR and the UK RSP going? What sorts of things are of the most importance to all parties to share and experience in this space? What sort of involvement has there been between the signers to the MoU? How do they envisage this MoU benefiting all parties (particularly long-term)?

In January 2012 – DRF heard there were repo managers in the UK, they invited a rep from the Repositories Support Project (RSP) in the UK. Jackie Wickham came to snowy Hokkaido, where they found out that DRF and RSP carried out similar activities – eg the re-enactment of online discussions wearing masks. After the meeting they found out that there were many more things to share. Eg in the UK they carry out residential workshops. Meeting with JW was about operational things between repo manager communities in UK and JP. Wanted to do more on activities to do with individual repositories.

Most important objective of MoU is to send representatives to counterpart meeting to share more specifics. Since signed the MoU they have not done so much. First thing was to invite a rep from UK to national workshop. UK rep was asked to give a presentation about how they promote activities inside universities.

The MoU says they will sponsor trips for each other (but the Brits have not done their bit … yet :). RSP has just come to the end of its funding cycle. DRF hopes that even though there is uncertainty about funding the collaboration can continue. Funding is restricted so long-term planning is difficult.


2. Find out what you can about the NII Repositories Program - http://www.nii.ac.jp/irp/en/rfp/ 
There are some interesting projects listed. How do they decide on the project areas? Where does the funding come from? How do they decide on the actual projects? Are they all 12 month projects? Are they all collaborative? How do they share the results?

Cyber Science Infrastructure (CSI) hosted by NII – has a selection board, informatics scholars & heads / top management of uni libraries, about 10 members. 200 – 300 M Yen

  • Launch – circa 50 universities circa 1M Yen

  • R&D (5-6 page proposal docs by multiple unis) – examples

    • DRF (proposal by several unis)

    • Sherpa Romeo – Japan

    • Statistics

About 30 submitted – 20 accepted.

Proposals ask for money never given more, usually slightly less than proposed. Money goes to the unis as project operators. Budgets split between the participants (training, workshops, system development, etc). Budget allocated on fiscal year basis, CSI checks. Proposal made around March – decision around June – activities take place from June to March. Following June there is a results project meeting in Tokyo – 2 day meeting (decreasing because the number of people launching has dropped. Initial 50 now 10).

Sharing of results at June meeting, not much more than that.  Some projects with strong outreach will be well known. Out of 20 – some projects faded out without much impact.

Some projects that have done well:

  • DRF – The Digital Repositories Federation, my hosts.

  • Sherpa-Romeo Japan.

  • SCPJ – Society Copyright Policy in Japan – 600 societies almost all grey lit – a few are ‘green’ OA.

  • ROAT – project to standardise repo stats same as PIRUS/IRUS

  • Author – ID – (ORCID) participants on this are involved in ORCID. NII working on a trial basis to have a database of Japanese researchers.

  • ShareRe – consortial repositories came out of Hiroshima. DSpace commonly used in Japan and EARMAS (original Japanese repository software)

    There are 14 on a regional basis. Each one has a lead university that acts as a host to provide system and support.

    • Hiroshima. Collect funds for future 14 members. 14 * 13K – 420K per kept in reserve, and used for security maintenance. Operated by regional council of libraries, Hiroshima uni serves as secretariat and hosts, additional funding comes from regional 30K Yen per year. Initial launch funds came from CSI – Hiroshima was the first.

    • 7 Member universities Kagoshima as lead. Initial investment about 2M yen do not collect funds for future renovation collectively 250 K pro rata per year contributed by member according to FTE

  • UsrCom – Trial repository system sandbox  - 2008-2009


3. Are there ways we can communicate better with them? Do they hold any webinars on topics that would be of interest to us? If so, how do they tell people about them? Could they tell us? I see they post to the JISC list every now and then (usually well-deserved achievement boasts). Should we have a ‘guest’ from DRF join our CAIRSS e-list or could they email you and then you post to our e-list?

Don’t do webinars and they do everything in Japanese so that’s a challenge. Does anyone on the CAIRSS list have good Japanese? The DRF would like to have a member on the CAIRSS/CRAC list.

Future of DRF not clear.

(In Japan moves afoot to subsidise societies as a way of driving OA)

NO OA mandates from JP govt – rules being revised now so that theses can go through IR or must? Policy reads like must but we don’t know. May open the door for theses to be publicised thru network. If this is realised there will be more possibilities – need to think about metadata standards and talking to national library.

Next steps

Once again, thanks to the DRF for having me – I am following up with CAUL on how we might be able to collaborate further. Now that we are in the CRAC era, Caroline’s suggestion of having a ‘call for projects’ that then get implemented at the member institutions sounds like it might be a way forward, and an ongoing relationship with the DRF (and the UK RSP) would be helpful, as they’ve been down this road before.

Creative Commons License
CAIRSS – CAUL Australasian Institutional Repository Support Service by Peter Sefton & Caroline Drury is licensed under a Creative Commons Attribution 3.0 Unported License.

Receding Repository Software?

Creative Commons License
“Receding Repository Software?” by Peter (pt) Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

I’m leading a brief session at the CAIRSS community days today (CAIRSS is the national repository support service for Australasia). The title is “Emerging Repository Software”, but I thought I’d turn that around and propose that the future of institutional repositories is to fade into the background.

Here are my notes for the session.

Take this screenshot of the Griffith University repository. See, here’s the default browse-screen for publications.

The Griffith Publications repository

Hold on! That’s not the repository

This is the Research Hub which ties together data from a number of different sources to provide a joined-up view of publications in the context of other research information.

http://research-hub.griffith.edu.au/publications

The repository has faded into the background

Much like this invisible dog.

Just in case these Griffith people get big heads, I do have to point out that while I think this service points to the future of the Institutional Repository as an embedded part of the research information systems of the university, the work’s not all done yet.

But, this is not perfect

The ‘find it yourself’ button is sub-optimal:

And check out this URL:

http://research-hub.griffith.edu.au/collections#fq={!tag=classgroup}classgroup%3A%22http%3A%2F%2Fresearch-hub.griffith.edu.au%2Findividual%2FvitroClassGroupcollections%22

That really should be something like: http://research-hub.griffith.edu.au/collections

This stuff is hard to get right. The hardest bit is getting good quality metadata so things do join up.

There are two new kinds of repository we’re seeing in Australia, thanks to investment from the Australian National Data Service.

  1. There are many “Data Capture” systems for researchers to manage data early in its lifecycle.

  2. These feed into Data Catalogues or Data Repositories – there’s a lot of terminological confusion here because of the way the funding streams have been structured.

    Data capture Systems

    See the ANDS list of DC projects. Here’s one I selected at random:

    It’s difficult to get useful information about many of these projects.

There are many data capture projects, and all of them will presumably need to be hooked up to systems like the Griffith Research Hub at some stage.

It’s a jungle out there, mount an expedition!

Data is/are the new black

A few notable projects:

See JISC’s Managing Research Data programme for more.

(I wanted to mention the Hydra Fedora-commons toolkit as well; there is lots of work on archives and digital libraries.)

The current opportunity for libraries

Use your metadata skills to help with “the great joining up”.

  • Get the governance right (see the ANDS view of this).

    Research systems are for the researchers.

     (So projects should report to the Deputy Vice Chancellor Research).

  • Start working with Research Data – not just on repositories, but on useful applications.

  • Get involved in tag-and-release programs for the feral data capture projects roaming Australia’s universities.

  • Do more work on ‘Digital library’ projects beyond the institutional publications repository.

Culture and climate

I was invited to attend the planning day for the Institute for Culture and Society (ICS) at the University of Western Sydney, to talk about the eResearch team at UWS, discuss collaboration tools, and show a few useful, relevant examples of eResearch in the humanities.

Here are some rough notes for the discussion.

For eResearch I will talk about our small eResearch website, and on the subject of collaboration tools I’ll be evasive.

The problem with surveys of collaboration tools

While lots of people are interested in finding out how to collaborate using modern techniques we really need to talk this through on a project by project basis.  I tried to write about collaboration tools at the Australian Digital Futures Institute after complaints from an education researcher in the institute about the bewildering array of stuff we used to get things done. I gather it was like turning up for work as a carpenter’s apprentice and being introduced to all the tools in the ute at once.

(That piece is still online, but it is of historical interest only, as the tools have all changed. Not to mention it is very long winded and mentions some USQ tools that aren’t relevant to you, still if you’d like to see how I explained Twitter and hashtags, and predicted the demise of Google Docs ‘cos Google Wave had arrived then you might enjoy it. Otherwise, file as too long, don’t read.)

Dr Sefton’s quick cure for a lack of online collaboration

If in doubt, start a Google Group. If symptoms persist, see me in the morning; I may put you into one of my group therapy sessions.

Ok, so maybe that advice about collaboration tools is a bit too short. But rather than list tools, I’ll put up this list of collaborative tasks (not tools) as potential discussion topics to come back to, either in this session or in a future dedicated workshop or consultation.

Some collaboration modes/tasks

  • Talking to each other: email, video/audio conferencing, discussions

  • Writing together: word processing, wikis, Content Management Systems

  • Publishing: blogs, wikis, Content Management Systems, pod/video-casting, CVs, microblogging

  • Remembering and sharing: links, reference materials, bibliographic references

  • Storing: stuff

Which tools do you favour and why?

eResearch for Culture and Society

Back to the more interesting topic – eResearch as it relates to culture and society.

On the way to work on Monday I rode through local instances of some lovely spring weather (cold enough for me to want a jacket descending the mountains, warm by the time I got to the river), which got me thinking about the climate and in turn the Hawkesbury Institute for the Environment (HIE), which is just downstream from Penrith.

The eResearch team does a lot of work with HIE and the connection is easy to see. We obviously need large amounts of data to document, let alone model, climate, and we need to run climate simulations at atmospheric and oceanic scale as well as at smaller scale, like models of leaves or trees – all of which involves data management, computational tools and global collaboration.

Weather, climate, and the ICS planning day reminded me of an analogy of Michael Halliday’s:

We can perhaps use an analogy from the physical world: the difference between “culture” and “situation” is rather like that between the “climate” and the “weather”. 

I used to think about this analogy a lot, particularly when some lecturer was getting us undergrads to formulate grammar rules from half an A4 page of dodgy examples. Those ‘models’ (including Halliday’s) were severely limited by the number of data points that supported them.

Then I was introduced to corpus linguistics in the early 1990s in a workshop by John Sinclair. In the workshop multiple instances of words in context were used as data to help decide what they actually mean. The Collins COBUILD dictionaries that Sinclair was involved in producing gave quite a different picture of the ‘climate’ of English than the traditional dictionary approach of forward-copying definitions, by using, you know, evidence to decide what words mean.
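For anyone who has not seen words-in-context used as data, here is a toy keyword-in-context (KWIC) sketch in Python. The sample text is invented and the function name is my own; real corpus tools work over millions of words and do a great deal more, but the basic move is the same: line up every occurrence of a word with its neighbours and look at the evidence.

```python
import re

# Invented sample text; a real corpus would be far larger.
SAMPLE_TEXT = (
    "The bank raised rates. We sat on the river bank. "
    "A bank of clouds rolled in."
)


def concordance(text, keyword, width=25):
    """Return one keyword-in-context line per occurrence of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines


if __name__ == "__main__":
    for line in concordance(SAMPLE_TEXT, "bank"):
        print(line)
```

Even in three invented sentences the word turns up in three different senses, which is the kind of thing a handful of hand-picked examples tends to miss.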

Fast forward to 2012 and the Macquarie dictionary decided to re-look at its definition of misogyny, after the word got a bit of an airing in the Australian Parliament, as noted in this letter from the Macquarie’s editor. I knew that they would have been able to get plenty of data on the term’s use, and I thought of John Sinclair again. But the letter didn’t talk about data, curiously, it talked about house-work.

As Editor of the Macquarie Dictionary, I picture myself as the woman with the broom and mop and bucket cleaning the language off the floor after the party is over.

The dictionary is one sort of ‘cultural climate’ record, so of course we have to have sceptics, like this example from the Herald Sun’s Patrick Carlyon, who like a good climate change sceptic brings his own data to the table.

Given the ever-changing flow of words and their meaning, Macquarie has announced a raft of further definition shadings to reflect recent political events and current affairs:

Dog: To be known also as “cat”, after a two-year-old boy at an East Brighton childcare centre pointed at a chihuahua and meowed.

These days, dictionary editors don’t need no fancy ‘corpus’ like they used on those revolutionary Collins dictionaries. As we find out from another letter, they have the Internets, and not only that, they can still copy from others, just like they used to.

When it is brought to our attention, we are lucky these days to be able to draw on the immense resources of the internet such as newsfeeds, blogs, videos, etc., to research the use of the word over time, in different areas of the world, and in different kinds of texts. Of course, we can also check other dictionaries, to see if the same conclusions have been reached by our fellow lexicographers.

http://www.macquariedictionary.com.au/pdf/editors_response.pdf

I’m telling you this because I wanted to show a simple eResearch example from the cultural sphere. Halliday’s climate analogy seemed apt. Just as we know climate science is done with lots of data points, recording the weather at the highest possible scale that add up to a climate record, we can study cultural phenomena such as language by looking at data-points of various kinds. Text is an easy example, because it’s easy to search and there’s now a lot of it to search.

Anyway, with all that in mind, I wanted to ask the researchers from the institute:

What infrastructure do you need to do ‘culture science’?

Or is that a stupid/naïve/offensive question?

While people think about that I thought I’d continue with a few examples, and come back to the discussion of collaboration and eResearch tools at the end.

The Feds don’t seem to think this idea of ‘culture science’ is entirely stupid, as they have funded a couple of million-dollar plus projects to build virtual laboratories, not just in sciences but in the humanities.

NeCTAR (Aus govt funding) Round 1 Virtual lab projects

A question for ICS researchers – what kind of cultural data is important to you?

And again in round 2, there is a UWS-led bid to build a virtual laboratory which is partly in the cultural domain; contract negotiations proceeding on that one. This lab is very much like ‘climate science’ for linguistics, musicology et al, bringing together data sets and letting researchers run tools on them, generate new analyses and annotations and feed back, to be built by the UWS MARCS institute.

Round 2 Virtual Laboratories recommended for funding

  • The Industrial Ecology Lab 

  • Marine Virtual Laboratory (MARVL)        

  • Biodiversity and Climate Change   

  • Endocrine Genomics (EndoVL)      

  • >>> Human Communication Science<<< This one is also in the humanities (and it’s UWS-led)

    One of the data sources that HuNI is linking into their Virtual Lab is the National Library’s Trove. To demonstrate this I’ll want to try out Tim Sherratt’s QueryPic tool, searching across the Trove Newspaper database for occurrences of terms to do with the workshop topic, which broadly speaking means stuff about Asian studies, and the Asian century. Tim’s tool is an example of an eResearch tool that’s completely data driven.

    QueryPic showing searches for Asia and Asian in Trove Newspapers 1803–1954

    http://dhistory.org/querypic/4t/

    You can click on a data point to see a list of articles.

    But be careful with these results!

    Q. Why were the Aussie papers talking about Asia so much in 1820?

    A. They were talking about  a ship.

    If only the Macquarie had something like this.

    (There are some issues with this tool, not least of which is that this is not a stable, fixed data set, people are actively improving it via crowdsourced editing, and the data set is expanding so it would be impossible to reproduce results. I’ve suggested that a solution would be to place snapshots of the data into the Research Data Storage Infrastructure starting to roll out now via lead agency, The University of Queensland so that researchers could work on known-stable corpora, and perform tricks like reindexing to improve performance on this class of query.)

    Contrast this approach of re-using existing data in a fairly generic database to ask new questions with a very different kind of eResearch application, the Dictionary of Sydney, a project of the Arts Computing Lab at the University of Sydney; we can search for the Art Gallery of NSW where we’ll be meeting, and from there browse a rich curated web of relationships between entries about buildings, people, institutions etc.

    Another way of recording culture: The Dictionary of Sydney

    http://dictionaryofsydney.org/organisation/art_gallery_of_new_south_wales

    So, back to the question.

    What infrastructure do you need to do ‘culture science’?

    Copyright Peter Sefton 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia.

    [Updated 2012-11-13, removed Andrew Leahy as co author]

    Tip: Arrange dock icons by shape, colour to reduce seek-time

    Like the guy in this video I used to think it would be a good idea to arrange icons in the OS X dock by how often I used them, or maybe by type. But I found that whatever ordering I used I would have trouble finding things. I know that iTunes is a blue circle, but so is Skype, and Safari (I rarely use it but sometimes I want to test something) – so the task of finding the app I want meant scanning for blue circles and often zooming my attention to the wrong one – I hate to say it, but they all look the same to me those pictures.

    So, I now arrange them by shape, then colour. To seek, zoom in to the right shape-group. It’s easy to see the differences because they’re side by side rather than spread out.

    Circles and circle-like things:

    Squarish things:

    Love those Apple apps that are so easy to tell apart:

    Oh, and might as well arrange things with letter-icons in alphabetical order.

    Et voilà!

    NeCTAR Über Dojo, Reproducible Research (UWS eResearch team in Cloud Land II)

    Alf (Andrew) Leahy and I were recently in Melbourne for the NeCTAR Über Dojo event.

    By coming to this two day event you will be able to go back to your institution with a signed certificate showing that you’ve been ‘black belted’ as a Cloud expert, specifically we’ll train and qualify you in using the following tools and data in the Cloud:

    • How to use image management tools for production level VM management on the Cloud, i.e. Puppet or Chef

    • How to use libraries to access storage data APIs, e.g. SWIFT/S3, NFS, Object Store, etc.

    • We’ll also use this event to get you the experts to tell us what new features we need for all these new toys (I mean infrastructure ;-)

    The climax of this thing was an Iron Chef-style challenge. This is an initial rough post about our two hour hack (which we cleaned up over a few more hours post-event). Andrew will write up the event for the UWS eResearch blog. Thanks to Remko Duursma and Craig Barton for letting us use their code and data for the demo.

    Background: HIE* researchers use R

    *Hawkesbury Institute for the Environment

    Meet Remko

    http://www.uws.edu.au/hie/people/researchers/doctor_remko_duursma

    And Craig

    http://www.uws.edu.au/hie/people/admin_and_technical_staff/dr_craig_barton

    They run R to clean data & model stuff…

    This R Notebook shows code and output together in HTML

    The night before the challenge I asked Remko and Craig if we could use some sample data and R code:

    Andrew (Alf) and I are at a workshop in Melbourne – I was wondering if I can use this as a demo tomorrow – ie it will appear on screen but not be made available over the net. The short notice is because the idea has only emerged today.

    The thing we’d be demoing would be to _simulate_ the following:

    • Data set + scripts like the attached is sitting in a web based repository.

    • Repository offers user the opportunity to download the package OR – Get an interactive R-Studio shell where they can re-run the data

      • User clicks a “see with interactive shell” link.

      • Our server:

        • Fires up a virtual server in the cloud

        • Installs the server version of R-Studio

        • Creates a user account 

        • Pushes the data package onto the new server

        • Unpacks the package

        • Sets up R-studio to run Knitr on the main script to create HTML

        • sends a link to the R-Studio to the user (maybe by email cos this might take a few minutes)

    • The user can see the plots etc, and will have access to the R environment to tweak things

    Our idea…

    Remko’s response

    That sounds fantastic. Please go ahead! Wait actually those are Craig’s data :)**

    >* Get an interactive R-Studio shell where they can re-run the data

    YES that is perfect!

    greetings

    remko

    **Craig gave us the go-ahead as well

    Initial quick and dirty demo approach

    We have written a simple CGI web script in Python which simulates a repository, with the capacity to orchestrate the below:

    • Create a new NeCTAR machine using Python with the Boto library

    • Connect via SSH and install R-Studio and dependencies, and make a user account

      TODO: Use Chef for the initial server build – we learned about Chef on Monday but not enough to be able to create a new ‘recipe’ in an hour or so.

      TODO: Use a snapshot image so users don’t have to wait ten minutes for a machine to start up and install all the prerequisites

    • Copy the sample data set onto the VM

      TODO:  Use the cloud object storage so the data set is near to the VM (We know how, just didn’t have time)

      TODO: See if we can get R-Studio to launch with an R Notebook by default and log-in automatically
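    The steps above can be sketched as a dry-run orchestrator. This is not the hack-day code: the function names are hypothetical, the three callables (`launch_vm`, `run_remote`, `copy_file`) stand in for the Boto and SSH calls, the shell commands assume a Debian or Ubuntu image, and installing R-Studio Server itself is left out. The only detail taken from R-Studio Server is its default port, 8787.

```python
import posixpath
import shlex


def setup_commands(username):
    """Shell commands to prepare a fresh VM (assumes a Debian/Ubuntu image)."""
    if not username.isalnum():
        raise ValueError("username must be alphanumeric")
    return [
        "sudo apt-get update",
        "sudo apt-get install -y r-base",
        # Installing R-Studio Server itself is left out of this sketch.
        "sudo useradd --create-home " + username,
    ]


def unpack_commands(username, remote_archive):
    """Shell commands to unpack the data package into the user's home."""
    home = "/home/" + username
    return [
        "sudo tar -xzf {} -C {}".format(shlex.quote(remote_archive), home),
        "sudo chown -R {0}:{0} {1}".format(username, home),
    ]


def run_demo(launch_vm, run_remote, copy_file, username, archive):
    """Launch a VM, set it up, push the data package and return the login URL."""
    host = launch_vm()
    for command in setup_commands(username):
        run_remote(host, command)
    remote_archive = posixpath.join("/tmp", posixpath.basename(archive))
    copy_file(host, archive, remote_archive)
    for command in unpack_commands(username, remote_archive):
        run_remote(host, command)
    return "http://{}:8787/".format(host)


# Dry run with fake callables that only record what would have happened.
calls = []
url = run_demo(
    launch_vm=lambda: "vm.example.org",
    run_remote=lambda host, command: calls.append(command),
    copy_file=lambda host, src, dest: calls.append("copy {} -> {}".format(src, dest)),
    username="demo",
    archive="sample-data.tar.gz",
)
print(url)  # http://vm.example.org:8787/
```

    Keeping the cloud and SSH calls behind plain callables means the ordering of steps can be checked without starting a machine, and the slow install steps could later be swapped for a launch from a snapshot image.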

    Why is this cool?

    • Our researcher-colleagues like the idea

    • It uses the cloud to solve a set of real problems*:

      1. Makes it easy for ‘data shoppers’ to evaluate data sets

      2. Potentially enables research outputs to be re-run in an exact environment for real reproducible workflows (Remko notes that you really need the exact R library versions sometimes, this could be an option.)

    *Assuming a real implementation
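    Remko’s point about exact library versions could be handled by recording a version manifest with the data package and checking it when the environment is rebuilt. A minimal sketch, assuming the manifest is a simple name-to-version mapping; the function name and the version numbers are made up for illustration.

```python
def version_mismatches(recorded, installed):
    """Map package name -> (recorded, installed) for every package that differs."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in sorted(recorded.items())
        if installed.get(name) != wanted
    }


# Hypothetical manifests: what the original analysis used, and what the new VM has.
recorded = {"knitr": "0.8", "ggplot2": "0.9.2"}
installed = {"knitr": "0.8", "ggplot2": "0.9.3"}

print(version_mismatches(recorded, installed))  # {'ggplot2': ('0.9.2', '0.9.3')}
```

    A package missing from the new environment shows up with `None` as its installed version, so the same check covers both wrong and absent libraries.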

    Screenshot : Script creating a virtual server

    Screenshot: our script Installs R-Studio*

    *This process is really too slow to do on-demand, we should launch from a pre-built snapshot. And the automation here is really hacky and crude – it would be better to do it using something like Chef but that’s not a one-day project if you have never used it before.

    Result – an RStudio Server…

    TODO: Automate the login by passing some kind of token?

    With a data-set pre-loaded*

    *We have not automated the installation of Knitr – so no R Notebook yet, but the code is runnable, and will spit out plot-images

    Show me the code

    The code we produced is demo quality hack-day code. This is not something you’d use in real life but we’re releasing it on Google Code as an example and so we can remember where to start when we come back to this and try to do it properly, maybe as part of the data capture project at the Hawkesbury Institute for the Environment.

    Copyright  Peter Sefton and Andrew Leahy, 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

    Virtual Infrastructure and Research Support: Fostering collaboration across institutions

    I’m speaking at one of those commercial conferences in Melbourne this week, Virtual Infrastructure & Research Support, an honour gifted to me by my boss, Professor Andrew Cheetham, who couldn’t make it.

    There are three points they wanted me to speak about in a session about Fostering Collaboration. When I decided to do this, I thought it would be relatively easy, this is right up my alley. But as I started preparing, I started telling myself a slightly different story than I thought I would. So, here are my notes for tomorrow, complete with dodgy embedded slides.

    Fostering Collaboration across Institutions

    • Improving the collaborative capacity of research data

    • Making data discoverable and connected

    • Security and IP

    When I started working on this talk, I thought I was going to talk about data. I was going to talk about how to improve metadata, how to use a linked data approach to make sure that all the names of things used in metadata are unambiguous and Semantic Web Ready. But when I started thinking about the work we’ve been doing at University of Western Sydney on making data discoverable and connected – by managing it effectively – it became clear that most of the work was on people-processes.

    Of course the technicalities are important and in some other contexts it would be hard to stop me from going on about them but for this topic, fostering collaboration I’m going to focus on the people and think about how the systems and data serve the people.

    Remember, data don’t collaborate…

    … people collaborate.

    These people are working together in a project to get the ball off the dog.

    (Tip: Never mind cats, don’t ever try to herd someone who’s half sheep dog)

    There are two kinds of people

    • Researchers

    • And people who are here on earth to help them

    At UWS, we have started mapping out these relationships and there will be a lot more of this.

    I’d like to talk about collaboration as it relates to both of those groups. They each have their own modes of collaboration. In this venue, the expensive commercial conference, I was assuming that we are mostly the latter kind of person. So, I will talk about “improving the collaborative capacity of research data” but I also want to talk about improving the collaborative capacity of us, and our capacity to collaborate with our researchers.

    I have a lot of experience with collaborative projects both in the Higher Education world, and in open source software. I’ll start with some of the techniques that I have found to work. One of the basic things is to pay attention to what’s working for other people, and not just your immediate peers. The best people to watch are the ones who get stuff done.

    Q: Who knows best how to run huge collaborative knowledge management and generation projects?

    A: The people who built a substantial part of the Internet, the Linux operating system and invented various kinds of ‘open’.

    Watching the Alpha Geeks

    Tim O’Reilly said in 2002:

    So often, signs of the future are all around us, but it isn’t until much later that most of the world realizes their significance. Meanwhile, the innovators who are busy inventing that future live in a world of their own. They see and act on premises not yet apparent to others. In the computer industry, these are the folks I affectionately call “the alpha geeks,” the hackers who have such mastery of their tools that they “roll their own” when existing products don’t give them what they need.

    The alpha geeks are often a few years ahead of their time. They see the potential in existing technology, and push the envelope to get a little (or a lot) more out of it than its original creators intended. They are comfortable with new tools, and good at combining them to get unexpected results.

    More from Tim:

    What we do at O’Reilly is watch these folks, learn from them, and try to spread the word by writing down (or helping them write down) what they’ve learned and then publishing it in books or online. We also organize conferences and hackathons at which they can meet face to face, and do advocacy to get wider notice for the most important and most overlooked ideas.

    I started applying this back in 2006. The RUBRIC project was about Regional Universities Building Research Infrastructure Collaboratively – a circa $6M project from 2006-2009 to establish institutional repositories in a variety of Australian and New Zealand universities. I was the technical lead on the project, but the project manager went on maternity leave very soon after the project started and I ended up as de facto manager for a good while.

    To help establish the collaboration between a dozen or so universities and associated bodies such as the National Library and ARROW and APSR, I tried setting up a project collaboration area using Trac, which is a Free tool developed for and by software developers. Trac has a wiki, and a ticket management system, and was used to house the code that was generated on the project. We also used online bookmark manager Delicious to compile collaborative lists of resources, Zotero for compiling a shared bibliography, held phone based teleconferences and even had SharePoint available for those who could bear it, not to mention a static website.

    We gave our cohort of between ten and twenty collaborators (it varied) access to all these systems and let them use the bits that suited them. Different sub-groups tended to hang out in different parts of the system; most were exposed to new tools they could take away with them to future endeavours, I know that members of the RUBRIC central team at least took away knowledge of ‘alpha geek’ tools and attitudes that have helped them since.

    Please, walk on the grass

    We created a rich environment in which collaboration could take place, then let people take the desire paths. Groups like mine work in the gap between the concrete and the goat track, my job is to encourage the goats.

    I learned a couple of things about building useful working relationships in distributed teams.

    1. Face to face day-long meetings with dinner the night before and a decent hotel work better than tense teleconferences.

    2. Subsequent teleconferences work well, but repeat (1) as needed.

    I asked Amanda Nixon, who is now my counterpart eResearch Manager at Flinders in South Australia, what she learned from the RUBRIC project about collaboration. She said that building interpersonal relationships was important, not relationships based on roles.

    Personal attributes of an Alpha Collaborator:

    • Shows up

    • Speaks up

    • Keeps up

    • Blogs / publishes

    • Can operate as an individual, not a role

    So much for collaboration in the people who help researchers, now let’s talk about researchers themselves. Last week I attended an Open Science Meetup in Sydney, in a pub, where I finally met Alpha-collaborative-chemist Mat Todd. Mat does Open Science, which means everything is in the open. As I said before, collaboration is about people, not data, or things. Mat and his colleagues recognize that the current process of doing long-cycle science is broken, or at least is a horse-drawn coach in the age of the rail-network.

    After talking with Mat, I put together a jocular ‘model’ of research collaboration. This shows two researchers creating “science” by taking turns to make parts of the whole. This is similar to something I put together for my PhD that showed two people collaboratively creating T E X T one letter at a time. While these simple diagrams might seem a bit silly, in Systemic Linguistics in the early 90s there was an important point to be made – all the models of collaboratively produced text were expressed as flow-charts, that is they were synoptic rather than dynamic models.

    Point is, we need to be careful in modelling and thinking about collaboration to make sure that we think in dynamic models that do take into account the various participants and stakeholders.

    Collaborative research model v0.1

    Peter (pt) Sefton @ptsefton

    After extensive ethnographic research on @MatToddChem at pub last night: my model of collaboratively doing science pic.twitter.com/2E4hkK47

    Contribution from Mat

    Matthew Todd @MatToddChem

    @ptsefton most interesting. If it’s open, one does not need to define, in advance, who the researchers are. #openscience

    Peter (pt) Sefton @ptsefton

    @MatToddChem Too hard given my skills with diagramming tool

     Peter (pt) Sefton @ptsefton

    .@MatToddChem But seriously, I gathered openness of data & process and advertising via social networks were key success factors

    Matthew Todd @MatToddChem

    @ptsefton Correct, meaning that anyone can join, including people you don’t know at the start, outside regular circle.

    An _open_ collaboration model v0.1

    I asked Mat about how they deal with their data, is it well described, do they have good metadata? The answer was, essentially: no, not really, but we’re working on that. So then how does the collaboration work? Mat told me that because everything is in the open, and deposited to their online lab notebook, it is possible for anyone to join in at any time, and they recruit people to their projects via social networks (even LinkedIn! Who knew?), so if they need a particular kind of analysis done they ask. Others, people who were not known to the original researchers, can see in to the project, work out what needs to be done, and step in. This short-cycle get-things-done science does result in publications, too, but it is not held up by the need to publish an article over the need to do work.

    An example of open science

    In April 2010 a request was posted to a (closed, 2,500-member) process chemistry networking forum on LinkedIn for suggestions, but also for people who might be willing to contribute more materially. This stimulated 20 comments (from 11 different people) and four private e-mails (via the website). None of these contributors were previously known to us. From the advice and offers, we chose to send one gram of racemic PZQamine to a Dutch contract research organization, Syncom, which arrived in mid-May. On 25 May, the company posted the identification of several chiral columns and conditions that enabled the baseline separation of the PZQamine enantiomers, permitting an assay for the effectiveness of any resolution attempts. On 25 August the company posted a lead chiral acid that had been identified (actually two months earlier) that effected the resolution of PZQamine. The company was not paid for this work.

    Woelfle, Michael, Piero Olliaro, and Matthew H. Todd. “Open Science Is a Research Accelerator.” Nature Chemistry 3, no. 10 (2011): 745–748. http://www.nature.com/nchem/journal/v3/n10/full/nchem.1149.html

    The Open Science crowd are showing us the future of research but many others find themselves locked-in to a collaboration mechanism which was optimal in the 17th Century and has since evolved into an elaborate mechanism for lining the pockets of a few multinational corporations, at the expense of the research community and the broader community in which they live.

    (Speaking of outmoded forms of communication, Mat also talked about the value of traditional conferences – we don’t need them to communicate any more, so why not use face to face time to focus on getting work done, via workshop meetings that attack particular problems.)

    In Australia, we’re now operating in a policy framework which is trying to encourage re-use of data, and open publication, but we’re stuck on a hamster-wheel which is apparently hooked up to electrical generators which keep the lights on at Elsevier.

    Where can we look for hints about a way off the treadmill? Well, the astronomy and physics communities have made trails. They have been working in teams for years, they publish everything openly at arxiv.org, and being technical, tend to be able to roll-their own tools.

    Another community of alpha-collaborators  

    Fraction of astronomical papers published with one, two, three, four or more authors. CREDIT: Robert Simpson

     http://radar.oreilly.com/2012/08/data-mining-the-literature.html

    At this point I’d better get back to the data.

    Making data collaboration-ready

    All we have to do is get data sets into RDA, right?

    ANDS-wins

    The latest ANDS newsletter has a number of success stories about data-driven collaboration. Read it.

    It is possible that if we carefully label data, store it somewhere and then advertise its existence via Research Data Australia, then people will discover it. But, I suspect, most of the action will be within communities and databases that are built for and by those communities, like the LinkedIn community that Mat mentioned.

    Example from the ANDS newsletter

    Repository of Antibiotic Resistance Cassettes

    The repository (www2.chi.unsw.edu.au:8080/rac) allows researchers to submit data on potential cassettes online. This information is reviewed by staff at the Centre for Infectious Diseases and Microbiology, and any new cassettes are provided with accurate names, archived and the knowledge made available. The database can then be searched online and the cassettes annotated from anywhere in the world. At present, it has regular users in 15 countries, though primarily in Australia.

    http://www.ands.org.au/newsletters/newsletter-2012-13.pdf

    I had a look at the Repository of Antibiotic Resistance Cassettes site, and it seems to offer what it says on the label, but I was not able to find any of the data sets in RDA (I didn’t look that hard, they might be there). But given that Researchers already have their own collaborative networks, as Glenn Maloney reminded us on the first day of the conference, this is probably not a problem.

    To understand what kinds of systems we need to support collaboration around data reuse, we need to talk in detail to large numbers of research cohorts in a wide variety of disciplines. I have started one such project with researchers at the Hawkesbury Institute for the Environment. At this stage I am working with climate modeller Remko Duursma on getting some examples of how environmental data can be packaged for re-use, with complete provenance to show how raw data is cleaned, filtered, fed to models, and used to generate figures for research articles. Remko and I will broaden this out to a wider range of participants once we get some initial demonstrations sorted out.

    We have to work out how to carve-up the continuous process that these researchers go through of taking raw data, cleaning it up, using stuff like “known good days” when the facility was known to be operating correctly, using a reproducible well documented process, filtering that data down to something useful and then running various processes over it and publishing the results. What should be packaged together? What’s useful for others? What’s dangerous to release on its own (like raw data with known bad parts)?

    Nascent UWS project: Reproducible Research publications

    (Sorry, but it’s complicated)

    I have talked elsewhere about what we’re doing in the Institutional Data Repository projects in Australia for example at Open Repositories in July. But here’s one slide that sums it up.

    Making data linkable – give everything a URI

    Moving on to the last point. There are people here much better qualified to talk about security than I am, so I’ll talk a little about “IP”.

    Actually, I do have this to say about security. Given the state of infrastructure we have, data can be in one of two security states: it’s either open to all, or locked down so that only the repository admins and the researcher can see it. More complex collaboration requires a mature community with their own resources.

    Please, don’t say “IP”

    An image of Richard Matthew Stallman taken from the cover of the O’Reilly book Free as in Freedom: Richard Stallman’s Crusade for Free Software by Sam Williams, published on March 1, 2002 under the GFDL.

    Richard Stallman was probably more concerned with capital-F Freedom than with collaboration, but his insight that copyright could be turned 180° created the conditions for a huge collaborative engineering effort to create a vast free software stack created by O’Reilly’s alpha geeks. Stallman’s insight was possible because he looked at the mechanism of copyright, the legal framework, rather than the illusion of “intellectual property”, the vague association of the properties of physical property with a notional abstract virtual property.

    Don’t talk about “owning” data

    Use language appropriate to the various kinds of intellectual property that researchers are dealing with.

    See: Did you say you “own” this data? You keep using that word. I do not think it means what you think it means.

    To wrap this up, I’ll reiterate that we need to think about scholarly process and scholarly communications as a dynamic system, with multiple participants. We can’t pull people out of the (front loading) scholarly washing machine mid cycle, we need to consider where we can make small changes, and choose a different wash-cycle next time.

    Making data collaboration-ready?

    • The alpha-collaborators are telling us: remove all “IP” related issues by using open licenses or go public-domain.

    • If that’s not possible, seek tools that allow the right people to work together.
      (Thanks NeCTAR with your cloud and labs and tools)

    • Do data management well. E.g. use linked-data approaches to metadata.

      (Sorry if you were expecting that talk, maybe next time)

    • And finally:

      If in doubt and you need a collaboration tool: use a mailing list.

    Open Repositories Developer Challenge. DRAFT manifesto v0.1

    I am on the committee for the Open Repositories conference. This year in Edinburgh I chaired the judging panel for the “Developer Challenge”, the development (or ideas) competition that runs inside the conference. Over the years I’ve been in a number of these things as a contestant, a judge, and last year one of the facilitators.

    NOTE: In this post I speak for myself, and not for the committee or for JISC/DevCSI who run this challenge.

    I think that the challenge is widely believed to be an important part of the conference, and the live pitches (with beer for all) are valuable and entertaining. But there’s always discussion about whether we’ve got the mix right. Should people be writing code? Is it healthy for developers to hang out in the developer lounge and miss papers? How can repository managers get more involved? Is this a way to get cheap development done on our repository platforms?

    And there are issues about the makeup of the developer community. For example this year’s winner Patrick McSweeney asked, how can we get more women involved?

    I think sometimes the answers are elusive because we don’t step back and think about what we’re trying to achieve.

    So, in the spirit of the Agile Manifesto, how does this draft sound?

    We want to provide opportunities for growth for our software developers and show our support and appreciation by providing them with an event, the developer challenge, that is fulfilling for them and also valuable for the Open Repositories community.
    Through this event we have come to value:

    Transparent, fun, open collaboration in diversely constituted teams over individual brilliance and/or groups of like individuals in cut-throat competition.

    The creation of new professional networks over the ossification of old ones.

    Effective engagement of non-developers (researchers, repository managers) in development over purely developer driven projects.

    Work done at the conference over presentation of something prepared earlier.

    Innovative ideas expressed in running code over wireframes, hand waving and elevator pitches.

    The development of the Open Repositories movement as a whole over siloed development on particular repository platforms.

    Entertaining live presentation of challenge projects in a relaxed setting over formal submissions.

    That is, while there is value in the items on the right, we value the items on the left more in the context of the developer challenge at the Open Repositories conference.

    Paradoxically, while there is a competitive element to the challenge, many of the values which I think really underpin  the challenge are about collaboration over competition. In 2011, for example the winning entry was a team formed from two separate teams who met in the lounge and workshopped their ideas together. This was in line with the values I suggest above, and was definitely taken into account in the judging. I was not a judge but I was helping them[1].

    Like I said, this is a proposal only, I’m publishing this draft for comment, and will be putting this to the Open Repositories committee, to see if they like this approach. If we can agree on some values, then the next step would be to structure the judging in order to maximise the value.

    These kinds of challenge/competition happen in other conferences and contexts; what do you all think?


    [1] Me, I’m looking forward to the year everyone collaborates on one mega-entry and splits the money