A few discoveries

2009-08-19

I have been thinking lately about repository architecture, and wondering whether the term 'repository' is actually skewing our idea of what we should be doing to preserve research output, disseminate it via Open Access, report on it, and do the other things we want our repositories to do.

One of the ideas I'm keen on exploring is what librarians call a “discovery service” or a “discovery layer”: basically a web view of a number of different services brought together in a smart index, with faceted browsing, like the National Library of Australia's Single Business Discovery Service prototype, which uses Apache Solr indexes to search across eight different types of resource all at once. In this post I want to show another place where this might be useful, using a case I came across while looking for examples for the conversation with Les Carr about PowerPoint dismantling.

I'm not picking on Southampton here (they're leaders in Open Access) or on the EPrints software they have given our community, which is wonderful (in parts), but I would like to use their two (are there more than two?) EPrints repositories as an example of the advantages of a Solr-powered discovery layer.

How this came about was, I was poking around the Southampton EPrints site and I grabbed a PowerPoint presentation to use as an example; I knew that it was by Liz Lyon and Les Carr because that metadata was in the file, and The Fascinator managed to extract it. (Turns out there was a bunch of other authors listed on the EPrint, but maybe they were not authors of the PowerPoint.)

Now, let's see if I can find it again. What you're about to read might reflect badly on my search skills, but this is roughly what I did; I'm sure there are other bumblies out there like me.

A search on e-Prints at Southampton has 8 matches for the string “Les Carr”. But the thing I'm after is not in that list. It took me a while to remember (and I did have to remember, the site doesn't tell you this) but there's another EPrints site for the School of Electronics and Computer Science (branded EPrints rather than e-Prints). So, having found the ECS EPrints site, I had a search for Les Carr. No results. Searching for plain-old Carr in either site gave me too many results to sort through. After that I went back to the blog post where I talked about the thing and found a direct link – and yes, it's there in the Southampton IR (not ECS), but the name I was after is Carr, Leslie. I can find it using that search (and I note that it was deposited by Carr, Dr Leslie). Right. Got it.

With my team, I did a little experiment, harvesting both the Soton repositories into our repository software, The Fascinator. It's running on the Amazon EC2 cloud, so it might go away at some time in the future, but here you can see a portal Bron Chandler set up for an aggregated view of the two repositories, or drill down by facet to look at the two of them separately. The advantage of a faceted interface like this is that even if the data remain un-normalized, you can pick out what you're looking for a bit more easily than in the raw EPrints software. At least you can see the range of aliases this character operates under. Here's an edited view of the list of authors when you do a simple search for Carr.

Author

Carr, Les (56)

Carr, Leslie (55)

Carr, Les A. (38)

Carr, LA (15)

Carr, L (10)

Carr, Leslie A. (10)

On one level it's actually the function of a repository to preserve this seeming chaos: the different contexts in which Carr, Dr Leslie operates have resulted in different forms of the name, and the repository needs to remember them for bibliographic integrity. I'll talk below about name authorities, but the discovery-service approach meant I was able to discover all these name forms across both repositories, which is a huge win over the more limited native EPrints search.
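To make the idea of an author facet concrete, here is a minimal sketch in Python of what a faceted index does with un-normalized records: run a simple search, then count each distinct author string among the hits. It is not how The Fascinator or Solr is implemented. The records, titles, repository labels and function names are all made up for illustration; only the name forms come from the list above.

```python
from collections import Counter

# Hypothetical harvested records: titles and repository labels are invented.
records = [
    {"repository": "soton", "title": "Example item A",
     "authors": ["Carr, Leslie", "Other, A. N."]},
    {"repository": "ecs", "title": "Example item B",
     "authors": ["Carr, Les"]},
    {"repository": "ecs", "title": "Example item C",
     "authors": ["Carr, Les A.", "Carr, Les"]},
    {"repository": "soton", "title": "Example item D",
     "authors": ["Smith, Jo"]},
]


def search(records, term):
    """Return records where any author string contains the term (case-insensitive)."""
    needle = term.lower()
    return [r for r in records
            if any(needle in author.lower() for author in r["authors"])]


def facet_counts(records, field):
    """Count how many records carry each distinct value of a field.

    List-valued fields (such as authors) count each value once per record.
    """
    counts = Counter()
    for record in records:
        value = record[field]
        values = value if isinstance(value, list) else [value]
        counts.update(set(values))
    return counts


hits = search(records, "carr")
author_facet = facet_counts(hits, "authors")
repo_facet = facet_counts(hits, "repository")

for name, count in author_facet.most_common():
    print(f"{name} ({count})")
```

Nothing is normalized here, which is the point: every name form survives in the index, and the facet simply shows them side by side so a person can pick the one they meant. Note that co-authors of matching items show up in the facet too, which is why the list above is described as an edited view.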

Note that this was just a quick 'n' dirty experiment; we're not sure if we harvested everything and we did zero special configuration, but it seems like a pretty useful view of the Soton repositories to me.

I think EPrints has unbeatable workflows for item management and ingest and so on, but the interface is very beatable.

Stepping back a bit, I note that these EPrints sites are not that easy to find. I got to them by typing in the URL, but if you go to the university home page and browse to Research it's not obvious where the EPrints are. A search for Les Carr works, as you can get to his home page, where there are of course links to recent publications in EPrints. (That's better than what you get searching for Peter Sefton on the USQ site, where a lot of the hits are from things like the test content that ships with ICE. No sign of my EPrints stuff in the first couple of pages.)

At USQ we're using Google, which seems to pick up the ICE site over other content, presumably because it has more incoming links than the other pages at USQ with my name (mostly from here at this blog I guess :-), whereas the Southampton search seems to be driven by SharePoint. In both cases, I think a discovery-service approach could be a real winner; instead of the Google-style metadata-poor view why not build your own faceted view of your own university's stuff?

Here at USQ I'd like to see the main site brought together with EPrints, and the ePortfolio system and maybe courseware (particularly if we get around to taking Open Courseware seriously), and with any other databases of our stuff, like the library catalogue, which is already powered by Solr.

I'll finish with a quick comment on name authorities. I talked about them on the CAIRSS blog recently. Of particular interest is the NicNames software, which can help to link up all the instances of Les Carr, Carr, Leslie and Carr, Dr Leslie. NicNames should be able to suck up data from any repository, try to work out via subject codes and so on who is who, and let a repository administrator have the final say on name-authority matters. Once that's done and the system is trained to know that we have, say, a Carr, L working in web systems and maybe another in physiology, then when you're putting in a new record the system will be able to give you a choice between the two, and add an ID as well as the variant of the name string that happens to be on the item. There are two ways to approach this: one would be to integrate this service directly in EPrints, but another would be to leave them as separate services and bring them together at the discovery level. More on that soon.
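As a rough illustration of the first step in that kind of name-authority work, here is a sketch in Python that groups name strings into candidate clusters by surname and first initial. This is not the NicNames algorithm, which I haven't described here; the function names and the list of titles to strip are assumptions made up for the example.

```python
import re
from collections import defaultdict

# Assumed list of honorifics to ignore; a real service would need a longer one.
TITLES = {"dr", "prof", "mr", "mrs", "ms"}


def candidate_key(name):
    """Reduce 'Surname, Forenames' to (surname, first initial), lower-cased.

    A deliberately crude heuristic: it only proposes candidates for a human
    to confirm, and will happily merge two different people.
    """
    surname, _, rest = name.partition(",")
    tokens = [t for t in re.split(r"[\s.]+", rest) if t]
    tokens = [t for t in tokens if t.lower() not in TITLES]
    initial = tokens[0][0].lower() if tokens else ""
    return (surname.strip().lower(), initial)


def suggest_clusters(names):
    """Group name strings that share a candidate key."""
    clusters = defaultdict(list)
    for name in names:
        clusters[candidate_key(name)].append(name)
    return dict(clusters)


names = [
    "Carr, Les", "Carr, Leslie", "Carr, Les A.", "Carr, LA",
    "Carr, L", "Carr, Leslie A.", "Carr, Dr Leslie",
]
clusters = suggest_clusters(names)

for key, variants in clusters.items():
    print(key, variants)
```

A key this coarse would also lump the web-systems Carr, L in with the physiology one, which is exactly why extra evidence such as subject codes, and an administrator with the final say, are needed before an ID is attached to a record.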
