I’m at the University of Newcastle visiting repository rat extraordinaire Vicki Picasso (actually at this bushland campus she should be a repository wallaby or something) and her colleague Dave Huthnance from IT. We are working on a model for how research data collections destined for Research Data Australia might be described and managed in the local institutional repository.
(Please ANDS can we have some advice on this metadata issue? Some of you say to use RIF-CS and some say that’s a bad idea.)
At eResearch Australasia 2009, Vicki presented a model of how metadata about research data could be ingested into the VITAL repository they use at Newcastle. It featured the VALET system, which is a very simple repository ingest tool, and what Vicki calls “Institutional Data Triggers”: events, such as those in a grants database, which would fire off a metadata ingest workflow.
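To make the trigger idea concrete, here is a minimal Python sketch of an event firing an ingest step. Everything in it is hypothetical: the event name `grant_awarded`, the field names and the draft record shape are invented for illustration and are not taken from VALET or from Vicki's model.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical registry: event type -> handlers to run when it occurs.
_handlers: dict[str, list[Callable[[dict], dict]]] = defaultdict(list)


def on_event(event_type: str):
    """Register a handler for one kind of institutional event."""
    def register(func):
        _handlers[event_type].append(func)
        return func
    return register


@on_event("grant_awarded")  # assumed event name, e.g. a new row in a grants database
def draft_collection_record(event: dict) -> dict:
    """Start a stub collection description for someone to complete later."""
    return {
        "type": "collection",
        "title": f"Data from grant {event['grant_id']}",
        "chief_investigator": event["chief_investigator"],
        "status": "draft",
    }


def fire(event_type: str, event: dict) -> list[dict]:
    """Run every handler registered for this event and collect the drafts."""
    return [handler(event) for handler in _handlers[event_type]]


drafts = fire(
    "grant_awarded",
    {"grant_id": "G-0001", "chief_investigator": "A. Researcher"},
)
print(drafts[0]["title"])
```

The point of the pattern is that the grants system only has to announce that something happened; what gets drafted in the repository is decided by the handlers.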
The new model
Today we refined that diagram. As it says, the blue bits represent the current Nova repository infrastructure (VITAL + VALET + Fedora), which feeds data to (amongst other things) the National Library’s harvesting systems.
The red bits are new proposed infrastructure, to be developed, to enable collections metadata to be captured and feeds of RIF-CS metadata to Research Data Australia. The new red box labelled “Research Data Collections”, should it be built, will be a more sophisticated version of VALET, probably written in Java so it can work in the same Tomcat web container as Fedora – it would have a VALET-style simple forms interface for walk-up submissions (this could be used to replace the existing publications ingest and staging workflows too, as shown by the dotted red line – if this were a requirement).
Green is for external services. One of the very interesting green bits is the Research Storage system, which is being provided by university IT and administered by the Research Office. I gather that this is essentially a file-store; we are proposing to add an interface that lets researchers see their files in the new (red) ingest system, add metadata to them, and flag them as candidates for RDA. I think Newcastle’s policy will be that if you want data to be available via Research Data Australia then it is desirable that it goes in the Research Storage system. Sounds good to me. To bridge the gap between files on a storage system and records in a repository, we are proposing a bit of middleware to link the file view of data to a web/repository view.
As discussed before here, the ANDS stakeholders in this project are keen for us to take a linked-data approach to metadata (slogan: Less typing, more linking!). I talked a bit about how this might work in the previous post on name identities; potential integration with services like the NLA’s PIP/People Australia and possible services like an ARC website for grant information are shown in green at the bottom right of the diagram. (I have some input from Basil at the NLA I need to process, but at this stage I think we’re looking at having NicNames in there so institutions can manage their own metadata.)
One assumption we’re making is that the core class of item we’re describing is a collection, which should fit with the kind of material that is already in the repository, which is research outputs, like data.
There are some questions, of course.
- What metadata schema to use for describing data collections?
- And where would the ISO2146 notion of Services fit in? The services listed in the RIF-CS documentation are all repository-type search/feed services so it seems appropriate to either tie them in to the OAI-PMH ‘identify’ verb or to let repository managers simply enter them into an ANDS system directly.
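For readers who have not looked at OAI-PMH, the requests involved are just URLs with a `verb` parameter. The sketch below builds two of them in Python. The base URL is a made-up placeholder, and the `rif` metadata prefix is an assumption; a real repository lists the prefixes it supports in its `ListMetadataFormats` response.

```python
from urllib.parse import urlencode


def oai_request(base_url: str, verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


# Hypothetical endpoint, not a real Newcastle or ANDS address.
BASE = "https://repository.example.edu/oai"

# 'Identify' returns a description of the repository itself.
identify_url = oai_request(BASE, "Identify")

# 'ListRecords' is what a harvester would call to pull records;
# the prefix "rif" is assumed here for a RIF-CS feed.
records_url = oai_request(BASE, "ListRecords", metadataPrefix="rif")

print(identify_url)
print(records_url)
```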
Going further
One idea that has come up is that VITAL sites might want to use Fedora and the OAI-PMH feeds available off it but not expose them via a web portal at all. In conversation with Teula Morgan from Swinburne today, Vicki proposed a model where there is no portal interface. I call this a ‘headless’ approach; there would be a local management interface for research data collection metadata (the red box) but it could be that the primary discovery mechanism is outsourced to RDA. This is pretty common for university web sites – USQ uses Google for our website search service, for example.
I am also exploring the idea that this ingest tool, which will be able to put records into Fedora (which as far as I know nobody has ever been fired for acquiring), could form the basis for our major deliverable on our ANDS metadata stores project: a specification for a stand-alone metadata-about-research-data-collection system.
Copyright Peter Sefton, 2010. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>
This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project and published to WordPress using The Fascinator.

@Peter – re the metadata issue. Use of RIF-CS as a metadata storage format is a bad idea. The reasons are covered in an earlier post. It is an interchange format. If people are telling you otherwise send them to the previous post.
Thanks @scott – I am interested in what Simon Porter is doing at Melbourne Uni with an RDF based schema for metadata about data. Trying to find out more.
Scott, – What metadata schemas are ANDS suggesting to the community as appropriate for storage and for mapping to RIF-CS? Also, are there any specific ontologies that are being recommended to use with RIF-CS, or is it more of the case that the use of ontologies is recommended generally?
I would agree with Scott here – use RIF-CS as an interchange format and map to it from richer collection descriptions. The same goes for Party records by the way!
In terms of the modelling of collection descriptions Michael Heaney’s “An Analytical Model of Collections and their Catalogues” (http://www.ukoln.ac.uk/metadata/rslp/model/amcc-v31.pdf) may be worth a read.
Hi Peter,
we have provisionally adopted the swrc Ontology for describing people projects, departments, publications, grants and the relationships between them.
http://ontoware.org/swrc/swrc_v0.3.owl
we have added to this a draft ontology that describes research data, to facilitate all processes associated with our policy on research data and records. It will be instructive to compare this with the ukoln work.
we map to RIF-CS at harvest time.
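As a rough illustration of what mapping at harvest time could look like, here is a Python sketch that turns a locally stored collection description into a cut-down RIF-CS-style XML record. It is not the Melbourne implementation: the input dictionary, its field names, and the `group` and source values are all hypothetical, and the output is a simplified subset that has not been validated against the RIF-CS schema.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for RIF-CS registry objects.
RIFCS_NS = "http://ands.org.au/standards/rif-cs/registryObjects"


def q(tag: str) -> str:
    """Qualify a tag name with the RIF-CS namespace."""
    return f"{{{RIFCS_NS}}}{tag}"


def to_rifcs(record: dict, group: str, source: str) -> str:
    """Serialise one local collection record as simplified RIF-CS-style XML."""
    ET.register_namespace("", RIFCS_NS)
    root = ET.Element(q("registryObjects"))
    obj = ET.SubElement(root, q("registryObject"), {"group": group})
    ET.SubElement(obj, q("key")).text = record["id"]
    ET.SubElement(obj, q("originatingSource")).text = source
    coll = ET.SubElement(obj, q("collection"), {"type": "dataset"})
    name = ET.SubElement(coll, q("name"), {"type": "primary"})
    ET.SubElement(name, q("namePart")).text = record["title"]
    desc = ET.SubElement(coll, q("description"), {"type": "brief"})
    desc.text = record["summary"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical local record; a real store would hold far richer metadata.
local_record = {
    "id": "example.edu/collection/1",
    "title": "Example survey data",
    "summary": "Responses collected for an example project.",
}
rifcs_xml = to_rifcs(local_record, "Example University", "https://repository.example.edu")
print(rifcs_xml)
```

The richer local description stays in whatever store suits the institution; only this narrow export needs to know about RIF-CS.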
@Vicki – Although I can’t be 100% sure as this is really a Seeding the Commons question, I don’t believe that ANDS is recommending any particular schema/ontology/standard for collections or other related objects. The reality is (as with item-level descriptions) the metadata (descriptive/administrative) required by, and able to be captured by, individual institutions and applications will vary. RIF-CS is a subset of metadata for supporting ANDS registry related services.
Thanks @scott & @simon for helping out here.
@simon – I couldn’t find much documentation on SWRC – would be interested in anything you can make public on your schema.
@scott – I am hoping that since ANDS has considerable resources they can at least lead a discussion about metadata formats. At the very least it is in ANDS’s interest to encourage the use of defined terms for parties, activities and resource types, using well-maintained identifier infrastructure and ontologies, so you don’t end up with the kind of inconsistencies we see in our IRs (Article vs Journal Article vs Article (Peer reviewed) etc) – instead people would use a term like:
dc:type:http://purl.org/ontology/bibo/AcademicArticle
And then also maybe: http://purl.org/ontology/bibo/status/peerReviewed
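A small Python sketch of how a repository might attach agreed term URIs to its existing free-text types while keeping the local label. The lookup table is purely illustrative; which local strings map to which terms is an assumption that each institution would have to settle for itself.

```python
BIBO = "http://purl.org/ontology/bibo/"

# Illustrative mapping from local type strings to shared term URIs.
LOCAL_TYPE_TO_TERMS = {
    "article": [BIBO + "AcademicArticle"],
    "journal article": [BIBO + "AcademicArticle"],
    "article (peer reviewed)": [
        BIBO + "AcademicArticle",
        BIBO + "status/peerReviewed",
    ],
}


def normalise_type(local_label: str) -> dict:
    """Keep the local label and add any shared term URIs known for it."""
    terms = LOCAL_TYPE_TO_TERMS.get(local_label.strip().lower(), [])
    return {"label": local_label, "terms": terms}


for label in ["Journal Article", "Article (Peer reviewed)", "Dest Category C"]:
    print(normalise_type(label))
```

Labels with no entry in the table come back with an empty list of terms, which makes the unmapped cases easy to find and review.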
@Peter – Metadata formats and ontologies are really aspects you’d need to discuss with your ANDS project contact. I’m not sure what would be considered in or out of scope for ANDS and so can’t really offer any comment on these.
All depends on what you mean by and what you want from a storage format.
We ask that organisations deliver to ANDS information about their data collections, parties, research activities and services using the Registry Interchange Format (Collections and Services) RIF-CS. This is an XML profile optimised for machine serialisation and designed for processing into a national registry; RIF-CS will be managed into the future with that end in mind.
Each research organisation will store information in various internal systems about these things in different ways and for different purposes. ANDS recommends an analysis of your organisation’s information management needs in this area to ascertain the appropriate storage format and storage environment.
“Storage formats” might be relational tables in many cases; RDF triples in other cases. XML fragments in others. Decisions about this would depend upon how the organisation wants to use that information as part of their research enterprise. For example ANDS itself does not use RIF-CS as a storage format in our registry nor in Research Data Australia. Relational tables suit our business better.
And there is the point – storage format depends on what your organisation wants to do with that information – what queries you want to run – what other systems you want to integrate with. Exporting to ANDS in RIF-CS is likely to be only one requirement of the overall system.
Thanks Adrian, that makes sense, but in the VITAL/Fedora world, where we are dealing with digital objects which are “done”, the typical approach is to describe items with a metadata file in XML – usually MARCXML or MODS, which are essentially equivalent.
The obvious question coming from a VITAL point of view is “what schema should I use?” And as far as I know some ANDS operatives have been suggesting RIF-CS.
Another question, where I think ANDS should take an interest, is “What _terms_ should I use for things?” To reuse an example from the IR world, today we have stuff being harvested where the dc:Type is expressed as a string: “Journal Article”, “Article (Peer Review)”, “Dest Category C” (I don’t know my DEST categories). It would be much better for the integrity of the whole system if we agreed to use a term like: http://purl.org/ontology/bibo/AcademicArticle as well as the locally-relevant string. ANDS and the RDA registry are still young enough to attack this issue NOW rather than end up in the situation we have with our IRs where the NLA has to normalize the data they harvest.
Is it in scope for ANDS to help with either of these? (a) Recommend a metadata schema for collection descriptions in static repositories, analogous to ARROW’s recommendation of MARCXML and (b) recommend some properly defined terms from a set of ontologies to use therein?