
Digital Object Pattern (DOP) vs chucking files in a database, approaches to repository design

At work, in the eResearch team at the University of Western Sydney, we’ve been discussing various software options for working-data repositories for research data, and holding a series of ‘tool days’: informal hack days where we try out various software packages. For the last few months we’ve been looking at “working-data repository” software for researchers in a principled way, searching for one or more perfect Digital Object Repositories for Academe (DORAs).

One of the things I’ve been ranting on about is the flexibility of the “Digital Object Pattern” (we always need more acronyms, so let’s call it DOP) for repository design, as implemented by the likes of ePrints, DSpace, Omeka, CKAN and many of the Fedora Commons based repository solutions. At its most basic, this means a repository that is built around a core set of objects (which might be called something like an Object, an ePrint, an Item, or a Data Set depending on which repository you’re talking to). These Digital Objects have:

  • Object level Metadata
  • One or more ‘files’ or ‘datastreams’ or ‘bitstreams’, which may themselves be metadata. Depending on the repository these may or may not have their own metadata.
Basic DOP Pattern

There are infinite ways to model a domain, but this is a tried-and-tested pattern that is worth exploring for any repository, if only because it’s such a common abstraction that lots of protocols and user-interface conventions have grown up around it.

I found this discussion of the Digital Object used in CNRI’s Digital Object Repository Server (DORS), obviously a cousin of DORA.

This data structure allows an object to have the following:

  • a set of key-value attributes that describe the object, one of which is the object’s identifier

  • a set of named ‘data elements’ that hold potentially large byte sequences (analogous to one or more data files)

  • a set of key-value attributes for each of the data elements

This relatively simple data structure allows for the simple case, but is sufficiently flexible and extensible to incorporate a wide variety of possible structures, such as an object with extensive metadata, or a single object which is available in a number of different formats. This object structure is general enough that existing services can easily map their information-access paradigm onto the structure, thus enhancing interoperability by providing a common interface across multiple and diverse information and storage systems. An example application of the DO data model is illustrated in Figure 1.
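
To make that data structure concrete, here’s a minimal sketch of it in Python. The class names, field names and example values are mine, not CNRI’s or any particular repository’s:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DataElement:
    """A named byte sequence plus its own key-value attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class DigitalObject:
    """A Digital Object: an identifier, object-level key-value
    attributes, and a set of named data elements."""
    identifier: str
    attributes: Dict[str, str] = field(default_factory=dict)
    elements: Dict[str, DataElement] = field(default_factory=dict)


# A data set with an original file and a derived preview image
obj = DigitalObject(
    identifier="uws:dataset-42",
    attributes={"title": "Rainfall time series, site A"},
)
obj.elements["rainfall.toa5"] = DataElement(
    content=b"...raw logger output...",
    attributes={"mime-type": "text/plain", "role": "original"},
)
obj.elements["plot.png"] = DataElement(
    content=b"...png bytes...",
    attributes={"mime-type": "image/png", "role": "preview"},
)
```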

To the above list of features and advantages I’d add a couple of points on how to implement the ideal Digital Object repository:

  • Every modern repository should make it easy for people to do linked data. Instead of merely key-value attributes that describe the object, it would be better to allow for and encourage RDF-style predicate/object metadata, where both the predicate and the object are HTTP URIs with human-friendly labels. This is implemented natively in Fedora Commons v4, but when you are using the DOP it’s not essential, as you can always add an RDF metadata data-element/file.
  • It’s nice if the files also get metadata, as in the CNRI Digital Object, but using the DOP you can always add a ‘file’ that describes the file relationships, rather than relying on conventions like file extensions or suffixes to say things like “this is a thumbnail preview of img01.jpg”.
  • There really should be a way to express relationships with other objects, but again, the DOP means you can DIY this feature with a ‘relationships’ data element (a minimal sketch follows this list).
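
As a sketch of that DIY approach, here’s one way you might build such a ‘relationships’ data element using Python and rdflib. The predicate vocabulary and the identifiers are invented for illustration; a real repository would want an established vocabulary:

```python
from rdflib import Graph, Namespace, URIRef

# Hypothetical relationship vocabulary and object base URI
REL = Namespace("http://example.org/vocab/rel#")
BASE = "http://repo.example.org/object/dataset-42/"

g = Graph()
g.bind("rel", REL)

# State "thumb01.png is a thumbnail preview of img01.jpg" explicitly,
# instead of relying on file-naming conventions
g.add((URIRef(BASE + "thumb01.png"), REL.thumbnailOf, URIRef(BASE + "img01.jpg")))

# A relationship pointing at another object in the repository
g.add((URIRef("http://repo.example.org/object/dataset-42"),
       REL.derivedFrom,
       URIRef("http://repo.example.org/object/dataset-41")))

# Serialize the graph; the result gets stored on the object as just
# another data element, e.g. "relationships.ttl"
print(g.serialize(format="turtle"))
```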

(I’m trying to keep this post reasonably short, but just quickly: another really good repository pattern, which complements the DOP, is to keep the concern of Storing Stuff separate from Indexing Stuff for Search and Browse. That is, the Digital Objects should be stashed somewhere with all their metadata and data, and no matter what metadata type you’re using, you build one or more discovery indexes from that. This is worth mentioning because as soon as some people see RDF they immediately think Triple Store. That’s OK, but for repository design I think it’s more helpful to think Triple Index; that is, treat the RDF reasoner, SPARQL query endpoint etc. as a separate concern from repositing.)
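
Here’s a sketch of that separation of concerns, reusing the DigitalObject class from the sketch above: the repository stores the full object, and a separate, re-runnable step flattens its metadata into a document for a search engine such as Solr or Elasticsearch. The field names here are illustrative:

```python
def to_index_document(obj: DigitalObject) -> dict:
    """Flatten a stored Digital Object into a discovery-index document.

    The stored object stays the authoritative record; this document can
    be rebuilt from storage at any time, and different discovery
    interfaces can build different indexes over the same objects.
    """
    doc = {"id": obj.identifier}
    doc.update(obj.attributes)                 # object-level metadata
    doc["element_names"] = list(obj.elements)  # support file-level search
    doc["mime_types"] = sorted(
        {e.attributes.get("mime-type", "unknown")
         for e in obj.elements.values()}
    )
    return doc

# The resulting dict would be posted to the indexer as JSON; a triple
# store / SPARQL endpoint, if you want one, gets fed the same way.
```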

The DOP contrasts with a file-centric pattern, where every file is modelled separately with its own metadata; this is the approach taken by HIEv, the environmental-science Working Data Repository we looked at last week. Theoretically this gives you infinite flexibility, but in practice it makes it harder to build a usable data repository.

Files as primary repository objects

Once your repository starts having a lot of stuff in it, like image thumbnails, derived files such as OCRed text, and transcoded versions of files (say from the proprietary TOA5 format into NetCDF), then you’re faced with the challenge of indexing them all for search and browse in a way that makes them appear to clump together. I think that as HIEv matures, and more and more relationships between files become important, we’ll probably want to add container objects that automatically bundle together all the related bits and pieces to do with a single ‘thing’ in the repository. For example, a time-series data set may have the original proprietary file format, some rich metadata, a derived file in a standard format, a simple plot to preview the file contents, and a re-sampled data set at lower resolution, all of which have more or less the same metadata about where they came from and when, and serve related purposes. So, we’ll probably end up with something like this:

Adding an abstraction to group files into Objects (once the UI gets unmanageable)

Draw a box around that and what have you got?

The Digital Object Pattern, that’s what, albeit probably implemented in a somewhat fragile way.
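A sketch of that retrofit, again reusing the classes from the first sketch: given flat file records, each carrying some identifier for the ‘thing’ they relate to (here a hypothetical ‘series’ field), bundle them into container objects:

```python
def group_into_objects(files):
    """Bundle flat file records into container Digital Objects,
    keyed on a shared series identifier (a hypothetical field)."""
    objects = {}
    for f in files:
        obj = objects.setdefault(
            f["series"],
            DigitalObject(identifier=f["series"]),
        )
        obj.elements[f["name"]] = DataElement(
            content=f["content"],
            attributes={"role": f["role"]},
        )
    return objects


files = [
    {"series": "rain-siteA", "name": "rain.toa5", "content": b"...", "role": "original"},
    {"series": "rain-siteA", "name": "rain.nc", "content": b"...", "role": "derived"},
    {"series": "rain-siteA", "name": "plot.png", "content": b"...", "role": "preview"},
]
grouped = group_into_objects(files)
print(list(grouped["rain-siteA"].elements))  # ['rain.toa5', 'rain.nc', 'plot.png']
```

The fragility is in the grouping key: it lives in per-file metadata conventions rather than in the repository’s core model.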

With the DOP, as with any repository implementation pattern, you have to make some design decisions. Gerry Devine asked at our tools day this week: what do you do about data items that are referenced by multiple objects?

First of all, it is possible for one object to reference another, or data elements in another, but if there’s a lot of this going on then maybe the commonly re-used data elements could be put in their own object. A good example of this is the way WordPress (which is probably where you’re reading this) works: all images are uploaded into a media collection and then referenced by posts and pages, so an image doesn’t ‘belong’ to a document except by association, if the document calls it in. This is a common approach for content management systems, allowing for re-use of assets across objects. But if you were building a museum collection project with a Digital Object for each physical artefact, it might be better for practical reasons to store images of the artefact as data elements on its object, and to store other images, which might be used for context etc., separately as image objects.
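
In DOP terms, the WordPress-style answer looks something like this sketch (the attribute names are, again, made up): the shared asset becomes its own object, and other objects hold lightweight reference elements rather than copies of the bytes:

```python
# The shared image lives in its own Digital Object, media-library style
image = DigitalObject(
    identifier="uws:img-007",
    attributes={"title": "Site A from the air"},
)

# Other objects reference it rather than embedding the bytes
report = DigitalObject(identifier="uws:report-2014")
report.elements["cover-image"] = DataElement(
    content=b"",  # no bytes here; the target object holds them
    attributes={"role": "reference", "target": image.identifier},
)
```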

Of course, if you’re a really hardcore developer you’ll probably want to implement the most flexible possible pattern and put one file per object, with a ‘master object’ to tie them together. This makes development of a usable repository significantly harder. BTW, you can do it using the DOP with one file per Digital Object and lots of relationships; just be prepared for orders of magnitude more work to build a robust, usable system.

Creative Commons License
Digital Object Pattern (DOP) vs chucking files in a database, approaches to repository design is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Trip report: Peter Sefton @ Open Repositories 2014, Helsinki, Finland

[Update: 2014-07-08, fixed a couple of typos since this is getting a bit of traffic]

Just self-archiving this post from the UWS eResearch blog here

Creative Commons License
Trip report: Peter Sefton @ Open Repositories 2014, Helsinki, Finland by Peter Sefton is licensed under a Creative Commons Attribution 4.0 International License.

From June 9th to 13th I attended the Open Repositories conference, way up north in Helsinki. This year I was not only on the main committee for the conference but also part of a new extension to the Program Committee, overseeing the Developer Challenge event, which has been part of the conference since OR2008 in Southampton. I think the dev challenge went reasonably well, but it probably requires a re-think for future conferences; more on that below.

In this too-long-you-probably-won’t-read post I’ll run through a few highlights around the conference theme, the keynote and the dev event.

Summary: For me the take-away was that now that we have a repository ecosystem developing, and the OR catchment extends further and further beyond the library, sustainability is the big issue, and conversations around the sustainability of research data repositories in particular are going to be key to the next few iterations of this conference. Sustainability might make a good theme or sub-theme. Related to sustainability is risk: how do we reduce the risk of a data equivalent of the serials crisis? If there is such a crisis it won’t look the same, so how will we stop it?

View from the conference dinner

Keynote

The keynote this time was excellent. Neuroscientist Erin McKiernan from Mexico gave an impassioned and informed view of the importance of Open Access: Culture change in academia: Making sharing the new norm (McKiernan, 2014). Working in Latin America, McKiernan could talk first-hand about how the scholarly communications system we have now disadvantages all but the wealthiest countries.

There was a brief flurry of controversy on Twitter over a question I asked about the risks associated with commercially owned parts of the scholarly infrastructure and how we can manage those risks. I did state that I thought that Figshare was owned by Macmillan’s Digital Science, but was corrected by Mark Hahnel; Digital Science is an investor, so I guess “it is one of the owners” rather than “owns”. Anyway, my question was misheard as something along the lines of “How can you love Figshare so much when you hate Nature and they’re owned by the same company?” That’s not what I meant to say, but before I try to make my point again in a more considered way, some context.

McKiernan had shown a slide like this:

My pledge to be open

  • I will not edit, review, or work for closed access journals.

  • I will blog my work and post preprints, when possible.

  • I will publish only in open access journals.

  • I will not publish in Cell, Nature, or Science.

  • I will pull my name off a paper if coauthors refuse to be open.

If I am going to ‘make it’ in science, it has to be on terms I can live with.

Good stuff! If everyone did this, the Scholarly Communications process would be forced to rationalize itself much more quickly than is currently happening, and we could skip the endless debates about the “Green Road” and the “Gold Road” and the “Fools Gold Road”. It’s tragic we’re still debating using this weird colour-coded-speak twenty years into the OA movement.

Anyway, note the mention of Nature.

What I was trying to ask was: how can we make sure that McKiernan doesn’t find herself, in twenty years’ time, with a slide that says:

“I will not put my data in Figshare”.

That is, how do we make sure we don’t make the same mistake we made with scholarly publishing? You know, where academics write and review articles, often give up copyright in the publishing process, and collectively we end up paying way over the odds for a toxic mixture of rental subscriptions and author-pays open-access, with some risk the publisher will ‘forget’ to make stuff open.

I don’t have any particular problem with Figshare as it is now; in fact I’m promoting its use at my University, and working with the team here on being able to post data to it from our Cr8it data publishing app. All I’m saying is that we must remain vigilant. The publishing industry has managed to transform itself under our noses from a much-needed distribution service for tangible goods; to a rental service where we get access to The Literature pretty much only if we keep paying; to its new position as The Custodian of The Literature for All Time, usurping libraries as the place we keep our stuff.

We need to make sure that the appealing free puppy offered by the friendly people at Figshare doesn’t grow into a vicious dog that mauls our children or eats up the research budget.

So, remember, Figshare is not just for Christmas.

Disclosure: After the keynote, I was invited to an excellent Thai dinner by the Figshare team, along with Erin and a couple of other conference-goers. Thanks for the salmon and the wine, Mark and the Figshare investors. I also snaffled a few T-shirts from a later event (Disruption In The Publishing Industry: Digital, Analytics & The Future) to give to people back home.

Figshare founder and CEO Mark Hahnel (right) and product manager Chris George hanging out at the conference dinner

Conference Theme, leading to discussions about sustainability

The conference theme was Towards Repository Ecosystems.

Repository systems are but one part of the ecosystem in 21st century research, and it is increasingly clear that no single repository will serve as the sole resource for its community. How can repositories best be positioned to offer complementary services in a network that includes research data management systems, institutional and discipline repositories, publishers, and the open Web? When should service providers build to fill identified niches, and where should they connect with related services?  How might these networks offer services to support organizations that lack the resources to build their own, or researchers seeking to optimize their domain workflows?

Even if I say so myself, the presentation I delivered for the Alveo project (co-authored with others on the team) was highly theme-appropriate; it was all about researcher needs driving the creation of a repository service as the hub of a Virtual Research Environment, where the repository part is important but not the whole point.

I had trouble getting to see many papers, given the dev-wrangling, but there was definitely a lot of eco-system-ish work going on, as reported by Jon Dunn:

Many sessions addressed how digital repositories can fit into a larger ecosystem of research and digital information. A panel on ORCID implementation experiences showed how this technology could be used to tie publications and data in repositories to institutional identity and access management systems, researcher profiles, current research information systems, and dissertation submission workflows; similar discussions took place around DOIs and other identifiers. Other sessions addressed the role of institutional repositories beyond traditional research outputs to address needs in teaching and learning and administrative settings and issues of interoperability and aggregation among content in multiple repositories and other systems.

One session I did catch (and not just ’cos I was chairing it) had a presentation by Adam Field and Patrick McSweeney on Micro data repositories: increasing the value of research on the web (Field and McSweeney, 2014). This has direct application to what we need to do in eResearch: Adam reported on their experience setting up bespoke repository systems for individual research projects, with a key ingredient that is missing from a lot of such systems: maintenance and support from central IT. We’re trying to do something similar at the University of Western Sydney, replicating the success of a working-data repository at one of our institutes (reported at OR2013) across the rest of the university, so I’ll talk more to Adam and Patrick about this.

For me the most important conversation at the conference was around sustainability. We are seeing more research-oriented repositories and Virtual Research Environments like Alveo, and it’s not always clear how these are to be maintained and sustained.

Way back, when OR was mainly about Institutional Publications Repositories (simply called Institutional Repositories, or IRs), we didn’t worry so much about this: the IR typically lived in The Library, the IR was full of documents, and The Library already had a mission to keep documents. Therefore the Library could look after the IR. Simple.

But as we move into a world of data repository services there are new challenges:

  • Data collections are usually bigger than PDF files, many orders of magnitude bigger in fact, making it much more of an issue to say “we’ll commit to maintaining this ever-growing pile of data”.

  • “There’s no I in data repostory (sic)” – i.e. many data repositories are cross-institutional, which means that there is no single institution to sustain a repository, and collaboration agreements are needed. This is much, much more complicated than a single library saying “We’ll look after that”.

And as noted above, there are commercial entities like Figshare and Digital Science realizing that they can place themselves right in the centre of this new data-economy. I assume they’re thinking about how to make their paid services an indispensable part of doing research, in the way that journal subscriptions and citation metrics services are, never mind the conflict of interest inherent in the same organization running both.

Some libraries are stepping up and offering data services; see, for example, collaborative work between large US libraries.

The dinner venue

The developer challenge

This year we had a decent range of entries for the dev challenge, after a fair bit of tweeting and some friendly matchmaking by yours truly. This is the third time we’ve run the thing with a clearly articulated set of values about what we’re trying to achieve.

All the entrants are listed here, with the winners noted in-line. I won’t repeat them all here, but wanted to comment on a couple.

The people’s choice winner was a collaboration between a person with an idea, Kara Van Malssen from AV Preserve in NY, and a developer from the University of Queensland, Cameron Green, who built a tool to check up on the (surprisingly) varied results given by video characterization software. This team personified the goals of the challenge: creating a new network, scratching an itch, and impressing the conference-goers who gathered with beer and cider to watch the spectacle of ten five-minute pitches.

My personal favourite, which came from an idea that I pitched (see the ideas page), was the Fill My List framework, a start on the idea of a ‘Universal Linked Data metadata lookup/autocomplete’. We’re actually picking up this code and using it at UWS. So while the goal of the challenge is not to get free software development for the organizers, that happened in this case (yes, this conflict of interest was declared at the judging table). Again this was a cross-institutional team (some of whom had worked together and some of whom had not). It was nice that two of the participants, Claire Knowles of Edinburgh and Kim Shepherd of Auckland Uni, were able to attend a later event on my trip, a hackfest in Edinburgh. There’s a github page with links to demos.

But, there’s a problem: the challenge seems to be increasingly hard work to run, with fewer entries arising spontaneously at recent events. I talked this over with members of the committee and others. There seems to be a range of factors:

  • The conference may just be more interesting to a developer audience than it used to be. Earlier iterations had a lot more content in the main sessions about ‘what is a(n) (institutional) repository’ and ‘how do I promote my repository and recruit content’ whereas now we see quite detailed technical stuff more often.

  • Developers are often heavily involved in the pre-conference workshops, leaving no time to attend a hack day to kick off the conference.

  • Travel budgets are tighter, so if developers do end up being the ones sent, they’re expected to pay attention and take notes.

I’m going to be a lot less involved in the OR committee etc. next year, as I will be focusing on helping out with Digital Humanities 2015 at UWS. I’m looking forward to seeing what happens next in the evolution of the developer stream at the OR conference. At least it’s not a clash.

The Open Repositories Conference (OR2015) will take place in Indianapolis, Indiana, USA at the Hyatt Regency from June 8-11, 2015. The conference is being jointly hosted by Indiana University Libraries , University of Illinois Urbana-Champaign Library , and Virginia Tech University Libraries .

This pic got a few retweets

References

Field, A., and McSweeney, P. (2014). Micro data repositories: increasing the value of research on the web. http://eprints.soton.ac.uk/364266/.

McKiernan, E. (2014). Culture change in academia: Making sharing the new norm. http://figshare.com/articles/Culture_change_in_academia_Making_sharing_the_new_norm_/1053008.