[This is a re-post. I put this up a couple of days ago but my hosting provider lost the server for a day and had to restore from backup.]
At the Australian Digital Futures Institute we’ve landed a contract with ANDS (the Australian National Data Service) to write some software specifications for key bits of infrastructure for the data commons. I’m going to be doing most of the initial work on this one, and I’ll use my blog and Twitter to communicate what I’m up to as I go, as per the communication plan for the project. I know that some of my regular readers are interested in this stuff; please comment if you have anything to add.
This first post will look at the scope of the consultancy, and talk about some of the major issues that I think we’ll need to resolve.
Scope & Deliverables
Our contract is to provide specifications for software applications that help institutions manage metadata about data collections, along with project plans to build those applications. There are two models for these applications. The first is a stand-alone system, while the second uses the existing institutional repository to store metadata – here’s what the project plan says:
The proposed [agreed now] process is for ADFI staff to prepare at least one scoped and costed software specification, with an agreed development methodology for a:
[…] standalone metadata store that can be used to augment either an existing institutional repository or data store, and which will manage metadata about data objects and data collections. This metadata will need to complement the existing object-level metadata. The interface should be designed to be web-based and easy to use by repository managers without specific technical skills. The metadata store should be able to generate collection descriptions as RIF-CS (http://www.globalregistries.org/rifcs.html) and make these available for harvesting using OAI-PMH and/or direct harvest of XML.
Source: ANDS-internal document.
Additionally we will develop one or more specifications for providing metadata capture solutions based on existing institutional repository software.
So the main thing we’ll be looking at is a kind of repository or registry-like thing where institutions can describe their data collections, and publish those descriptions to Research Data Australia. If you visit that site you can see the data model that’s being used by ANDS with four classes of thing. Here’s a snapshot of how it looks today:
Collections (877)
Where a collection is a useful grouping of physical or digital items.
Parties (260)
Where a party is a person or organisation that has some relationship to a collection, service, activity, or party.
Services (1)
Where a service is a mechanism for gaining some kind of access to or information about a collection (or items within a collection).
Activities (2)
Where an activity is an undertaking or process related to the creation, update, or maintenance of a collection. [I think projects are activities - PS]
This model, which is expressed in a schema known as RIF-CS, is based on a yet-to-be-approved ISO Standard (ISO2146). RIF-CS implements quite a different approach from the way most Institutional Repositories (IRs) are structured, where the repository items themselves are the only primary objects.
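To make the four-class model concrete, here is a minimal Python sketch. The class names come from the list above; everything else (the field names, the `managedBy` relation label, the example.org keys) is an assumption made up for illustration and is not taken from the RIF-CS schema or ISO2146.

```python
from dataclasses import dataclass, field

# The four classes listed on Research Data Australia.
CLASSES = {"collection", "party", "service", "activity"}


@dataclass
class RegistryObject:
    key: str           # identifier for this object (hypothetical URI)
    object_class: str  # must be one of CLASSES
    name: str
    # (relation label, key of the related object) pairs
    related: list = field(default_factory=list)

    def __post_init__(self):
        if self.object_class not in CLASSES:
            raise ValueError(f"unknown class: {self.object_class}")


# A party is a first-class object with its own key...
party = RegistryObject("http://example.org/party/1", "party", "Example Institute")

# ...and a collection points at it by key rather than by repeating its name.
collection = RegistryObject(
    "http://example.org/collection/1",
    "collection",
    "Example survey data",
    related=[("managedBy", party.key)],  # hypothetical relation label
)
```

The point of the sketch is only that all four kinds of thing are primary objects that refer to each other by key, which is the contrast with the string-based approach described next.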
In an IR you typically have metadata which refers to parties and services and so on using text-strings. Parties, services and activities are not things in the repository (speaking in general here; there are some exceptions, and increasingly people have some status as objects in repositories like Fez). A publisher, which is a party, will be described in an XML document conforming to some metadata schema (in Australia this is typically Dublin Core, MODS or MARC XML) using its name expressed as a string. If two records refer to the same party using different strings then things start to get messy. And as we all know, there is one kind of party around which strings proliferate and cause confusion: people. There is an ANDS effort under way to provide services for describing people outside of a repository so you can refer to them using some kind of ID, but that service is a way off still.
RIF-CS is described here as an interchange format (it is, after all, the Registry Interchange Format – Collections and Services), which I gather was originally cooked up to support the Global Registries Initiative. But I think some people are thinking of using it as a storage format within repositories. If we decide to do that we will have to do so carefully. For example, you would not want to store this example RIF-CS as a single repository object, trust me, as the file contains a number of things that really should be treated as discrete objects in a repository, including metadata about people and projects.
Issues for a new standalone metadata store
If the goal for the standalone metadata store is to support the abstract model behind RIF-CS then one of the challenges of this project will be working out how to support this kind of a model. How do we make sure that parties and so on are described once, and as accurately as possible? And where that fails, how do we make sure there is infrastructure to assert that two differently described parties are the same, and that two identically described parties are in fact different? If you want to refer to a party that is not described yet, how will we support that? Will people and objects and so on be primary objects?
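One way to picture the "same" and "different" assertions is a small bookkeeping structure that sits alongside the descriptions themselves. The Python sketch below is only an illustration under assumed names (`IdentityAssertions` and the `p1`, `p2` style keys are hypothetical); it groups keys asserted to be the same with a union-find structure and refuses assertions that contradict earlier ones.

```python
class IdentityAssertions:
    """Records 'same as' and 'different from' statements between keys."""

    def __init__(self):
        self._parent = {}        # union-find forest over keys
        self._different = set()  # frozenset pairs asserted to be distinct

    def _find(self, key):
        """Return the representative key of the group containing `key`."""
        self._parent.setdefault(key, key)
        while self._parent[key] != key:
            # Path halving keeps lookups cheap as groups grow.
            self._parent[key] = self._parent[self._parent[key]]
            key = self._parent[key]
        return key

    def are_same(self, a, b):
        return self._find(a) == self._find(b)

    def are_different(self, a, b):
        """True if some 'different' assertion separates the two groups."""
        ra, rb = self._find(a), self._find(b)
        if ra == rb:
            return False
        for x, y in self._different:
            if {self._find(x), self._find(y)} == {ra, rb}:
                return True
        return False

    def assert_same(self, a, b):
        if self.are_different(a, b):
            raise ValueError(f"{a} and {b} were asserted to be different")
        self._parent[self._find(a)] = self._find(b)

    def assert_different(self, a, b):
        if self.are_same(a, b):
            raise ValueError(f"{a} and {b} were asserted to be the same")
        self._different.add(frozenset((a, b)))


ids = IdentityAssertions()
ids.assert_same("p1", "p2")       # two differently described parties, one entity
ids.assert_different("p1", "p3")  # identical descriptions, distinct entities
```

A real store would also need to record who made each assertion and when, which this sketch leaves out.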
In discussions so far, we have tentative agreement between our ANDS stakeholders and ADFI on one key point: as far as possible we’d like the stand-alone metadata store to be Linked (Open) Data ready. This means that all Collections, Parties, Services and Activities would have URIs, and the metadata store would allow data entry that uses the URIs behind the scenes. (I checked with Scott Yeadon at ANDS and he tells me that RIF-CS ‘keys’ can be URIs and are expected to be globally unique, even though this is not entirely clear from the schema documentation.) This is a user interface challenge, but if we can pull it off we should be able to avoid stuff like this, where a mistyped string results in two parties where there should be one:
http://services.ands.org.au/home/orca/rda/list.php?group=&class=Party&page=2
Australian Institute od Marine Science
Australian Institute of Marine Science
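As a rough sketch of what "URIs behind the scenes" could mean at data-entry time, the Python below looks up what the user typed against the parties already in the store and offers close matches, so the record ends up holding an existing URI instead of a new string. The `suggest_party` helper, the 0.85 cutoff and the example.org URI are all assumptions for illustration; only `difflib.get_close_matches` is a real standard-library call.

```python
from difflib import get_close_matches

# Hypothetical store of known parties: display name -> URI (URI is made up).
parties = {
    "Australian Institute of Marine Science": "http://example.org/party/aims",
}


def suggest_party(typed, registry, cutoff=0.85):
    """Return (name, uri) pairs for existing parties resembling the typed text."""
    # Compare case-insensitively, but report the stored display name.
    by_lower = {name.lower(): name for name in registry}
    hits = get_close_matches(typed.lower(), list(by_lower), n=3, cutoff=cutoff)
    return [(by_lower[hit], registry[by_lower[hit]]) for hit in hits]


# The mistyped name from the listing above still finds the existing party,
# so the interface can offer its URI instead of minting a second party.
suggestions = suggest_party("Australian Institute od Marine Science", parties)
```

Fuzzy matching like this only helps with near-misses; it does nothing for genuinely different names for the same party, which is where explicit identifiers are needed.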
So this is all pointing to a repository which uses RDF to describe things, drawing on appropriate vocabularies for relations and values, such as FOAF for parties, the bibliographic ontology for documents, and maybe the Dublin Core Collections for data collections. We could, of course, build a classic database-backed system with a database schema that reflects the abstract model directly, but that’s very inflexible and not easily extensible.
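To give a feel for what such RDF descriptions might look like, here is a small Python sketch that holds a handful of triples as plain tuples and prints them as N-Triples. The FOAF, Dublin Core terms and DCMI Type namespaces are real; the example.org URIs, the collection title and the choice of `dcterms:publisher` to link the collection to the party are assumptions for illustration only. A real system would use an RDF library and a triple store rather than hand-rolled serialisation.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
DCTERMS = "http://purl.org/dc/terms/"
DCMITYPE = "http://purl.org/dc/dcmitype/"

# Hypothetical URIs for one party and one collection.
party_uri = "http://example.org/party/aims"
collection_uri = "http://example.org/collection/reef-survey"

# Each object is tagged as a URI or a literal so it can be serialised correctly.
triples = [
    (party_uri, RDF_TYPE, ("uri", FOAF + "Organization")),
    (party_uri, FOAF + "name", ("literal", "Australian Institute of Marine Science")),
    (collection_uri, RDF_TYPE, ("uri", DCMITYPE + "Collection")),
    (collection_uri, DCTERMS + "title", ("literal", "Example reef survey data")),
    # The collection refers to the party by URI, not by a name string.
    (collection_uri, DCTERMS + "publisher", ("uri", party_uri)),
]


def to_ntriples(rows):
    """Serialise (subject, predicate, (kind, value)) rows as N-Triples lines."""
    lines = []
    for subject, predicate, (kind, value) in rows:
        if kind == "uri":
            obj = f"<{value}>"
        else:
            # Escape backslashes and double quotes inside literals.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{subject}> <{predicate}> {obj} .")
    return "\n".join(lines)


print(to_ntriples(triples))
```

Adding a new kind of relation here means adding another triple with a new predicate, with no schema change, which is the flexibility argument against a fixed database schema.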
(We don’t have a lot of experience with large-scale RDF at ADFI, but I am thinking that in addition to the RDF, and a triple store to keep it in, you would probably still have a high-performance index of the repository using Apache Solr to provide the search/browse interface.)
Issues with adapting an IR
If we work on storing data collection metadata in IRs then there will be additional challenges. All institutional repositories in Australia already support Dublin Core metadata, at least in their OAI-PMH feeds. But a lot of repository software in use in Australia is limited to storing metadata as plain strings. Because of this, most of the repository people are used to thinking in terms of flat metadata models serialised in XML, not networks of relations, RDF style. Anything we do to augment IR software will have to fit in with the way IRs are used now, mainly for document content. There are lots of design challenges there for the information model and user interface. We have started thinking about this in detail for a particular repository platform at a particular uni, more on that soon.
Copyright Peter Sefton, 2010. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>
This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project and published to WordPress using The Fascinator.
The statement beginning “If the goal for the standalone metadata store is to support the abstract model behind RIF-CS…” should be changed to “If the goal for the standalone metadata store is to support ISO2146 (the abstract model behind RIF-CS)…” since RIF-CS is only a subset of 2146 applicable to ORCA.
On your point that “…people are thinking of using [RIF-CS] as a storage format within repositories”, this is an exceptionally bad idea and should be discouraged. Besides the headaches that schema changes will cause if used as a storage format, RIF-CS is an interchange format for the specific purpose of supporting the provision of descriptive metadata to collection registries and is not compatible with the requirements of institutional metadata/content repositories. Repositories would be better looking at mapping ISO2146 object classes to their own requirements and capabilities.
Just a minor note on your point that keys “are expected to be globally unique, even though this is not entirely clear from the schema documentation”. In the RIF-CS guidelines I think we make this clearer with the statement: “ANDS Collections Registry providers: In the Collections Registry context, the key must be a globally unique identifier.” The schema documentation is separate from the guidelines because at the time the schema was created we had no idea whether and how Global Registry requirements would differ from ORCA (or other applications which might be developed that would use the same schema), so the specifics of element usage in a particular context are largely separated from the schema documentation.
@Scott
One problem with ISO2146 is that right now it’s not a standard, so it’s a bit difficult to get hold of an authoritative version. How can we get hold of it?
Another is that people are not used to working with abstract data models and will just reach for the nearest XML schema; ‘best’ practice in ARROW for example is to store stuff in MARCXML or MODS.
So if we are to avoid the use of RIF-CS as the main storage format for metadata about data I think we’re going to need a lot more info from ANDS about the abstract model behind it and some advice (maybe part of our project?) on implementing the model in a sensible way. I am proposing that RDF might be a good way to realise the abstract model in a stand-alone metadata store, but would this be practical in DSpace or ePrints or VITAL?
Looks good!
A couple of comments:
It would be good if a “delete” function could be implemented in your OAI-PMH provider (either in the “standalone metadata warehouse” or the “IR”).
NLA has a discovery service for people and organisations (a persistent identifier is assigned to each individual person and organisation).
http://trove.nla.gov.au/people
Of course, it has limited info at this stage. If NLA can collect people and organisation info across all Australian universities and other research organisations, it may be a good idea to use their service to identify individual researchers and organisations to reduce the “mess”.
Storing URIs is a good idea, but what happens if clients change their domain name? Will you consider a “handle” instead of a URI? At least you can have a “globally unique identifier” if you use a handle.
For access to ISO2146 you should take this up with your ANDS contact point, as access to a copy of this seems a fundamental issue. I don’t know whether there is any other option than paying ISO for a copy but it is worth raising.
In regard to people just reaching for the nearest XML schema, I don’t have a problem if people simply use the RIF-CS schema as a reference point (as opposed to paying for a copy of an “open” standard which isn’t even officially a standard yet), but it is not really a storage format. If people choose to hold RIF-CS as a storage format that is fine, but I emphasise that it is an interchange format controlled by an external party (ANDS) specific to an application context and is not suitable as a master storage format. If they choose to store information in this form they do so at their own risk. I would also argue that if RIF-CS is seen as a suitable storage format then appropriate business analysis has not been undertaken. A RIF-CS feed certainly meets ANDS collection registry requirements but it can by no means meet the requirements for effectively managing and maintaining the object classes (people, activities, collections and services) in a core institutional system/information context.
In regard to the technology choice, from the blog I’m not sure you have enough information to really know whether an RDF store or other underlying technology is applicable at this point. If your scope is “…to help institutions manage metadata about data collections…” then this implies that this would be a core institutional information management system which not only provides descriptive metadata (which RIF-CS is interested in) but also more sophisticated record management and auditing. I don’t have an opinion on whether RDF (and presumably an underlying triplestore) would make sense in this context. I imagine you could make it work, but it really depends on the requirements of the metadata store and the capabilities of the available triplestores.
I must admit I’m somewhat confused by this project. The usefulness of a stand-alone metadata store seems dubious. Researcher metadata, project/funding metadata, and collection metadata will often be held in different repositories for good reasons. Being able to link them is important, being able to make information findable is important, aggregating information is important. But will a stand-alone metadata store help? And would an ISO2146-compliant metadata store be much different to, say, ORCA?