Trip report: open repositories 2015, Indy

2015-07-10

[ UPDATE 2015-07-10 In my excitement at launching a new site I managed to let a not-quite-finished version of this post out on the web. This version has minor edits and fixes a missing image ]

This is my trip report from a recent trip to the USA, mainly to attend the Open Repositories conference in Indianapolis.

Keynote

Kaitlin Thaney from the Mozilla Science Lab gave the keynote. I was looking forward to catching up with her about open this and open that but she had to deliver her presentation remotely.

Kaitlin gave an overview (Leveraging the power of the web - Open Repositories 2015) of what's happening in Open Science / Open Research from her point of view as the director of the Mozilla Science Lab. There's a lot! It was nice to see Research Bazaar (ResBaz), the Aussie-grown eResearch training movement, get a mention alongside the likes of Software Carpentry. Mozilla Open Science are doing great work in this space and we're trying to figure out how we can bring ResBaz and related training stuff to UTS.

Thaney's slide on Web-enabled Research. Access to content, data, code, materials. Emergence of "web-native" tools. Rewards for openness, interoperability, collaboration, sharing. Push for ROI, reuse, recomputability, transparency.

There was also a featured plenary talk from Anurag Acharya at Google Scholar, Indexing Repositories: Pitfalls and Best Practices. I have to say that while the content was useful, and boiled down to "if you want to be in Google Scholar, don't build a sucky website", I found it vaguely unsettling, like having a prophet drop by and give us a how-to guide for getting into heaven. I dunno what their business model is, but it's been clear for years that it doesn't involve asking the owners of the repositories what they want or responding to emerging trends like research data repositories just because we think they're important. Still, it was a good addition to the program and I guess if we want to get in Google Scholar we have to listen to Google Scholar.

(Listening to Acharya's talk I was reminded of back in the day when my team at USQ ran the national repository support service. At one stage one of the major vendors was having a few performance problems with their repository product which was quite widely deployed in Australia. Essentially clicking on the 'show all' button for the repo at some sites constituted a denial of service attack, so the vendor advised people to use the robots.txt file to keep indexers from killing the site. Months later we were still fielding calls from repo managers wondering why their stuff was not in Google, let alone Scholar.)
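To make that robots.txt story concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules and the repository URLs are made up for illustration; they are not the vendor's actual advice. It shows how a blanket rule hides every item page from a well-behaved crawler, while a narrower rule would only have fenced off the expensive browse pages.

```python
from urllib.robotparser import RobotFileParser


def build_parser(rules_text):
    """Parse robots.txt content supplied as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


# Hypothetical blanket rule: keeps every well-behaved crawler off the whole site.
blanket = build_parser("User-agent: *\nDisallow: /\n")

# Hypothetical narrower rule: only the expensive browse pages are off limits.
narrow = build_parser("User-agent: *\nDisallow: /browse\n")

# Made-up URLs standing in for an item page and a 'show all' style page.
item_url = "https://repo.example.edu/handle/1234/5678"
browse_url = "https://repo.example.edu/browse/all"

for label, parser in (("blanket", blanket), ("narrow", narrow)):
    print(label, "item page allowed:", parser.can_fetch("Googlebot", item_url))
    print(label, "browse page allowed:", parser.can_fetch("Googlebot", browse_url))
```

With the blanket rule in place the item pages are invisible to the indexer, which is why the repository contents never turned up in search results.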

My favourite: The Portland Common Data Model

This year the thing that struck me the hardest was the hot-off-the-press Portland Common Data Model (PCDM), which captures a common design pattern for repositories, similar to the Digital Object Pattern I was going on about last year. This model grew out of discussions in the Hydra repository community, and has spread to the general Fedora Commons community and beyond. The model distills the experience of building hundreds of digital repositories into a design pattern.

The PCDM model. Note the recursively nested collections and objects. The difference between objects and collections is that collections can't have files. And files have only technical metadata; descriptive metadata needs to be attached to an object.
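The rules in that diagram can be sketched in a few lines of Python. This is only an illustration of the pattern as described above, not the PCDM ontology itself: the class names, field names and the example book are all my own assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class File:
    """A file carries technical metadata only (hypothetical field names)."""
    name: str
    technical_metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Object:
    """An object can nest other objects and is the only thing that holds files."""
    title: str
    descriptive_metadata: Dict[str, str] = field(default_factory=dict)
    members: List["Object"] = field(default_factory=list)
    files: List[File] = field(default_factory=list)


@dataclass
class Collection:
    """A collection can nest collections and objects, but has no files."""
    title: str
    descriptive_metadata: Dict[str, str] = field(default_factory=dict)
    members: List[Union["Collection", Object]] = field(default_factory=list)


def count_files(node):
    """Walk the nested structure and count every file beneath a node."""
    own = len(node.files) if isinstance(node, Object) else 0
    return own + sum(count_files(member) for member in node.members)


# A made-up example: a collection holding a sub-collection holding one book
# object, whose pages are member objects with one image file each.
pages = [
    Object(title=f"Page {n}", files=[File(f"page{n}.tiff", {"format": "image/tiff"})])
    for n in (1, 2, 3)
]
book = Object(title="A book", descriptive_metadata={"creator": "Anon"}, members=pages)
library = Collection("Library", members=[Collection("Rare books", members=[book])])

print(count_files(library))
```

The point of the sketch is that generic code such as `count_files` never needs to know what the collections and objects mean in a particular domain.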

The GitHub site says:

The Portland Common Data Model (PCDM) is a flexible, extensible domain model that is intended to underlie a wide array of repository and DAMS [They mean Digital Asset Management Systems, PS] applications. The primary objective of this model is to establish a framework that developers of tools (e.g., Hydra-based engines, such as Sufia, Curate, Worthwhile, Avalon; Islandora; custom Fedora sites) can use for working with models in a general way, allowing adopters to easily use custom models with any tool. Given this interoperability goal, the initial work has been focused on structural metadata and access control, since these are the key actionable metadata. To encourage adoption, this model must support the most complex use cases, which include rich hierarchies of inter-related collections and works, but also elegantly support the simplest use cases, such as a single user-contributed file with a few fields of metadata. It must provide a compact interface that tool developers can easily implement, but also be extensible enough for adopters to customize to their local needs. As the community migrates to Fedora 4, much of our metadata is migrating to RDF. This model encourages linked data best practices, such as using URIs to identify all resources, using widely-used vocabularies where possible, and subclassing existing classes and properties when creating new terms.

So why am I excited about the PCDM?

At UTS we use the Fedora Commons repository for the institutional data repository, and we'll be upgrading from v3 to v4; the PCDM gives us a good framework for organising data:

Like it says on the tin, it should improve repository interoperability, making it a bit easier to port content from one PCDM-compliant repository environment to another.
It will make it easier to design re-usable APIs for generic repository solutions around domain-agnostic abstractions. So in some domains a hierarchy of two levels of 'collection' with an object might consist of collections "Facility / Experiment" with Dataset as a PCDM object type, while in another the collection levels might be "Corpus / Sub corpus" with a "Communicative Event" object, as in the Alveo virtual lab for human communications science, which I presented on last year at OR 2014.
It will make it easier to map working repository and presentation-repository systems onto archival repositories, as we are planning at UTS. E.g., we want to 'shadow' all the content in DHARMAE collection-by-collection, item-by-item and file-by-file in a preservation-ready data store. The PCDM gives us a formal way to define that mapping.
It will provide a useful tool for specifying new systems, e.g. many Data Capture applications, such as those funded by ANDS, could benefit from exposing their organisation in this way, making it clear how content is organised using simple, generic shared concepts.

There are two things that are less than perfect, IMHO.

PCDM Problem 1

The biggest issue is that in RDF ordering lists of things is a pain. In PCDM there's no whacking a set of square brackets around a bunch of things and calling it a list. So implementing something like a set of PCDM objects representing pages of a book means dealing with a linked list.

Manu Sporny had a good rant about this back in 2012.

In summary - RDF Lists are difficult to implement, even for people that know quite a bit about RDF. They are fantastically difficult to grasp for Web developers. They are really hard to author in many of the RDF syntaxes.

Or this other rant which is worth a read:

RDF is a shitty data model. It doesn’t have native support for lists. LISTS for fuck’s sake! The key data structure that’s used by almost every programmer on this planet and RDF starts out by giving developers a big fat middle finger in that area.

What do we get in PCDM instead of web-developer-friendly lists? We get the ORE ordering extension. ORE's full name is the Open Archives Initiative Object Reuse and Exchange (OAI-ORE) standard for describing aggregations of web resources.

PCDM ordering

I'm not too worried about this, as I assume we can put JSON-LD APIs on this stuff and let all the linked-list-horror happen behind the scenes. JSON-LD, of course, handles list-ordering natively; that's why Manu and team invented it.
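Here is a rough sketch of the difference, in plain Python with no RDF library. The page URIs and the `pages` vocabulary term are invented for the example. The first half shows how a JSON-LD document can declare an ordered list with `"@container": "@list"` and then use ordinary square brackets; the second half spells out the `rdf:first` / `rdf:rest` linked list that the same ordering turns into as triples.

```python
import json

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Made-up page URIs for a three-page book.
pages = [f"http://example.org/book/page{n}" for n in (1, 2, 3)]

# JSON-LD: the context says 'pages' is an ordered list of URIs,
# so the document body can use a plain JSON array.
jsonld_doc = {
    "@context": {
        "pages": {
            "@id": "http://example.org/vocab#pages",  # hypothetical property
            "@container": "@list",
            "@type": "@id",
        }
    },
    "@id": "http://example.org/book",
    "pages": pages,
}


def rdf_list_triples(items):
    """Expand an ordered sequence into rdf:first / rdf:rest triples."""
    nodes = [f"_:b{i}" for i in range(len(items))]
    triples = []
    for i, item in enumerate(items):
        # Each node points at its item and at the next node, ending in rdf:nil.
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF + "nil"
        triples.append((nodes[i], RDF + "first", item))
        triples.append((nodes[i], RDF + "rest", rest))
    head = nodes[0] if nodes else RDF + "nil"
    return head, triples


head, triples = rdf_list_triples(pages)

print(json.dumps(jsonld_doc, indent=2))
for triple in triples:
    print(triple)
```

Three pages need six triples plus a pointer to the head node, and reordering them means rewiring the chain, which is the pain the rants above are complaining about.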

PCDM Problem 2

The other problem isn't really a problem as such, just that the ability to do object-hierarchies will tempt some developers and business analysts into dangerous territory.

To a computer scientist a book might look like a book object with a bunch of page objects. But wait! Why stop there? Why don't we make the pages out of paragraph objects! Actually, I know, let's do it at the sentence level and assemble those into paragraph objects and then pages, and we can have a parallel hierarchy for the logical structure of the book in chapters as well, using proxies just like in ORE! Maybe we really should do it at the word level. Or an object-per-glyph model with ORE serialisation per-word?

Kids, hierarchical object models are fine in moderation, but do try to stop before you grow hairs on your palms.

Tell me I'm wrong!

Comments welcome, if PCDM fans want to straighten me out on any of the above.

Our presentation, Omeka, Ozmeka etc.

This year I was presenting work done at UTS and UWS on adapting the super-simple Omeka repository to use with linked-data research data. I blogged this already. Didn't get many questions but quite a few people told me they enjoyed the talk.

Post OR I popped over to visit the Omeka team at George Mason University, which is in Fairfax, Virginia, just out of Washington DC. I got a look at the new Omeka S, which has a new linked-data ready data model and shows a lot of promise, and talked about the possibility of having some of the work we did on extending Omeka in the Ozmeka project accepted into the core of the Omeka 2.x line.

Thanks Patrick Murray-John, who says stuff like:

Patrick Murray-John @patrick_mj, Jul 2: My code is jiggery-pokery applesauce.....WITH PEAS IN IT!!!!!

and Omeka Lead dev John Flatness for being so generous with your time.

I had a Saturday in Washington waiting for my plane home, rented a bike and did a tour of the monuments. Here I am dressed as an innocuous American tourist from Hawaii so as not to arouse suspicion.

The developer challenge

For the past few years I've helped organise the developer challenge event and sat on the OR committee, but I stepped back this year and did neither. Claire Knowles and Adam Field (who were part of one of the winning teams in the dev challenge last year with the Fill My List software for linked-data-lookup) stepped forward and ran a new tech-stream, the Developer Track, and a new Ideas Challenge.

Claire says:

The Developer Track was new for OR15, it was designed, along with the Ideas Challenge, to stop the developers who attend the conference being torn between writing code for a competition and attending conference sessions. There were lots of demos and it was great to see no one having to apologise for screens of code, XML and terminal windows. I found the sessions really practical and have a list of things to try now I am back in the office, starting with Hardy Pottinger’s Vagrant-DSpace. ... The Ideas Challenge, was designed to be less time intensive that the previously run Developer Challenge but still encourage people to discuss issues they would like to resolve, meet new people, and have a fun session where audience participation was encouraged. Adam and I created an example challenge and solution based on the Sound of Music (Idea’s Challenge Slides). We had 9 entries to the challenge this year and thanks to Adam’s scoresheet there were no long deliberations and a clear winner was identified after our judges handed in their scores. Congratulations to the winners (Blog post about the challenge winners).

Loved the Sound of Music thing, Claire, it was one of my favourites. And I liked dropping in to the Dev Track and looking at comforting screens full of code and edgy live demos.

In other news

DSpace is still going strong, incrementally pushing towards becoming the perfect repository. I believe it has an API now so it can be used sans user interface. It does have a few quirks, like the fact that there are actually two species of user interface that are both being maintained but it's a solid choice for simply-structured repositories (and some not so simple ones too).

I caught up with Bram Luyten from DSpace service provider Atmire, which has grown from a tiny consultancy to having more than a dozen employees over the last few years, a sign both of the strength and solidity of DSpace, and of the repository movement in general.

ePrints is also still alive and kicking, though the user group sessions are much smaller in North America than they are in Europe, close to its home in Southampton. The ePrints community is gearing up for a huge de-tangling session, unpicking the hair-ball of Perl code at the centre of the thing to make it more modular. I didn't stay long enough to ask if they're going to stick with Perl as a language. Anyway, if I were setting up a publications repository on a tight budget ePrints would still be on my list of candidate systems, and there's the commercial services arm too, for hosting and support.

The venue, etc

The OR conference series alternates between North America and Europe year to year, not counting the first episode which took place in Australia. There's a pretty clear difference between how these things are hosted on either side of the Atlantic. In Europe we've had venues that are a mixture of convention centre and university, including some scruffy lecture halls and the odd gem like developer rooms at the Finnish National Library, whereas the North American model is to use a hotel convention centre which, I believe, is so the accommodation offsets the cost of the meeting rooms.

The conference was well organised, the food was plentiful and tasty and the venue was fine if you like that sort of thing. I.e. a big climate controlled bubble.

The hotel had a big lightwell down the middle

This one was in the hotel district in Indianapolis, Indiana, where the main industry seems to be convention centres. There is a river you can walk along, and one of those bike-share schemes with large yellow bike shaped things made of depleted uranium. I tried the service one wet day; it even came with decorative attachments resembling brake levers but which turned out to be a semaphore-like mechanism for politely suggesting to the bike that it might be nice to consider slowing down, if it was OK with the bike, and it had time, etc. Was a fun exercise planning ahead to see if we could agree to be stopped at the intersections.

I liked this place

And this was comforting, knowing the Sheriff's office was in the top 1% in America!

The conference dinner was at an art museum on the other side of town.

The dinner venue, which was in an art gallery. By the way, I know all about art but I don’t know what I like.

And

I took one of the city's yellow hire bikes off to Arthur's music and bought a tiny banjo.

'Plucky' comes set up in an open C tuning cGCEG (the little c is the 5th string, an octave above the 3rd) which is only one tone away from being a standard uke tuning. So naturally, I wound the 1st string G up to an A. Now it's an honorary ukulele and I have not been kicked out of the Blue Mountains Ukulele Group for bringing it, yet. Plucky gets marked down in reviews for having crappy tuners and being a bit hard to set up with good intonation, but it's a solid piece of gear and lots of fun and it's the only option for a mass market 5-string in that super-short scale length. I'll have to dig out my copy of Pete's How to play the 5-string Banjo.

Here's Kim Shepherd (also from the Fill My List team with Adam and Claire), playing the tiny banjo, Plucky, and a tiny game, in the hotel lobby, both at the same time!
