My favourite issue in the work we're doing with institutional
repositories is what I call the Affiliation Issue. It's all about how
you record authors and their institutional affiliation in your
repository. In the Australian context there are two very good reasons to
do this:
1. **For reporting** to the Australian Department of Education, Science
and Training, known to its friends as
[DEST](http://dest.gov.au/), institutions need to keep track of
research output.
2. For their own purposes institutions often also want to know not just
the total number of publications, but **which departments are
publishing**.
Sounds simple, but there are lots of issues. Affiliation can change over
time. Departments change their names or merge and, as with any metadata,
consistency is important. Am I the same Peter Sefton who also calls
himself Petie Sefton, whose honours thesis is cited in [this
paper](http://portal.acm.org/citation.cfm?id=972659&dl=GUIDE&coll=GUIDE)?
(Answer: yes. Google Scholar seems to be able to work out that Peter
Sefton and Petie Sefton are synonyms. Cuts my vanity searching time in
half. But I'm not the only Peter Sefton mentioned.)
I'll talk about the four repository software packages with which I have
some experience through the RUBRIC project. They are:
1. [ePrints](http://www.eprints.org/software/) (not part of RUBRIC but
used by USQ)
2. [Fez](http://sourceforge.net/projects/fez/)
3. [DSpace](http://www.dspace.org/)
4. [VTLS Vital](http://www.vtls.com/Products/vital.shtml)
But first, a few words on identity management.
# Identity management
Broadly speaking, there are two approaches to identity management in
repository software. Either it's managed or it ain't.
That is, in some systems there is a table of Authors, and the metadata
for a record contains a key that points to that table. So instead of
putting my name in the author field I pick it off a list and the system
stores the fact that author 8342947 (that's my student number from the
University of Sydney) wrote this item.
Other systems just store a name, as a string.
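To make the two approaches concrete, here is a minimal sketch in Python. It is not how ePrints, Fez, DSpace or VITAL actually store things; the `AUTHORS` table, the record classes and the title are hypothetical names made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical author table for a "managed" system: key -> current details.
AUTHORS = {8342947: {"name": "Peter Sefton"}}


@dataclass
class ManagedRecord:
    title: str
    author_id: int  # a key that points into AUTHORS


@dataclass
class UnmanagedRecord:
    title: str
    author: str  # just a name, stored as a string


def managed_author_name(record):
    """Resolve the author's name through the table at display time."""
    return AUTHORS[record.author_id]["name"]


managed = ManagedRecord("Some paper", 8342947)
unmanaged = UnmanagedRecord("Some paper", "Petie Sefton")

name_before = managed_author_name(managed)
# Editing the table changes what every managed record displays...
AUTHORS[8342947]["name"] = "ptsefton"
name_after = managed_author_name(managed)
# ...while the unmanaged record keeps whatever string was typed in.
```

In the managed case the record itself never holds a name, so whatever the table says today is what the record says today.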
Author identities managed:
: ePrints and Fez both do it this way. Fez has a lot of functionality
in this area for describing the structure of your organization.
Not managed:
: DSpace and VTLS Vital / VALET.
I think that managed identities need to be used with caution.
Affiliations change, potentially over short timeframes: someone
might be publishing from two departments at once, and departments can
change their names, split and merge.
# Affiliation
So the repository solutions we're looking at divide neatly into ID
Managed and ID Not Managed, but it gets better. They also divide equally
on the issue of whether affiliation can be stored with the author's name
as a bundle or not.
Some repository software, like DSpace, uses flat metadata consisting of
name-value pairs. So you can say:
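+-------------+---------------------+
| Name        | Value               |
+=============+=====================+
| Name        | Peter Sefton        |
+-------------+---------------------+
| Affiliation | Rubric Project, USQ |
+-------------+---------------------+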
Which is OK until we get to two authors. Now which affiliation belongs
with which person? There are no guarantees that the order of the
metadata will stay the same as it goes in and out of repositories.
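+--------------------+------------------------------------------------+
| Name               | Value                                          |
+====================+================================================+
| Author Name        | Peter Sefton                                   |
+--------------------+------------------------------------------------+
| Author Affiliation | Rubric Project, USQ                            |
+--------------------+------------------------------------------------+
| Author Name        | Catherine Sefton                               |
+--------------------+------------------------------------------------+
| Author Affiliation | Department of Presents, Summer Hill University |
+--------------------+------------------------------------------------+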
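The ordering problem with the two-author example can be sketched in a few lines of Python. This is only an illustration: `pair_by_position` is a hypothetical helper, and the reordering is a made-up stand-in for whatever a real import or export might do to the pairs.

```python
# The two-author record as flat name-value pairs.
flat = [
    ("Author Name", "Peter Sefton"),
    ("Author Affiliation", "Rubric Project, USQ"),
    ("Author Name", "Catherine Sefton"),
    ("Author Affiliation", "Department of Presents, Summer Hill University"),
]


def pair_by_position(pairs):
    """Guess who belongs where by matching the nth name to the nth affiliation."""
    names = [value for key, value in pairs if key == "Author Name"]
    affiliations = [value for key, value in pairs if key == "Author Affiliation"]
    return dict(zip(names, affiliations))


original = pair_by_position(flat)

# The same four pairs, but two of them have swapped places in transit.
reordered = [flat[0], flat[3], flat[2], flat[1]]
scrambled = pair_by_position(reordered)
```

Nothing was added or removed in `reordered`, yet `scrambled` now puts me in the Department of Presents.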
There are a number of ugly hacks one might use to record author
affiliation in systems that use flat metadata:
1. Make a field with author and affiliation concatenated. Yuck.
Violates several basic principles of computer science, the names of
which I have forgotten. Error prone.
2. Create another data stream that **does** do nested metadata and then
change the repository to be aware of it. This is a possible DSpace
solution, but would take some work.
3. Add special fields like author1, affiliation1, author2, affiliation2.
Workable, but really, really ugly.
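As a rough illustration of why the first hack is error prone, here is a Python sketch. The comma separator and both helper functions are assumptions made for the example, not anything a particular repository does.

```python
def concatenate(name, affiliation, sep=", "):
    """Hack 1: squash author and affiliation into a single field."""
    return f"{name}{sep}{affiliation}"


def split_naively(value, sep=", "):
    """Try to recover the two parts by splitting on the first separator."""
    name, affiliation = value.split(sep, 1)
    return name, affiliation


# Works while nobody puts the separator inside the name...
ok = split_naively(concatenate("Peter Sefton", "Rubric Project, USQ"))

# ...and falls over as soon as someone enters "Surname, Given".
broken = split_naively(concatenate("Sefton, Peter", "Rubric Project, USQ"))
```

Any separator you pick will eventually turn up inside a name or a department, so the field can't be taken apart reliably.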
Some solutions do allow nested metadata so if I were to write a paper
with my little sister the repository could keep track of our affiliation
like so:
+--------+------------------------------------------------------------------+
| Name   | Value                                                            |
+========+==================================================================+
| Author | - Name: Peter Sefton                                             |
|        | - Affiliation: Rubric Project, University of Southern Queensland |
+--------+------------------------------------------------------------------+
| Author | - Name: Catherine Sefton                                         |
|        | - Affiliation: Department of Presents, Summer Hill University    |
+--------+------------------------------------------------------------------+
That's nested metadata. Now our simple taxonomy of repository solutions
looks like this. The two that support nested metadata are both based on
the [Fedora repository backend](http://www.fedora.info/), which is no
surprise as it is a very flexible component that allows multiple streams
of structured metadata.
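Here is a small Python sketch of the nested idea. The dictionary layout and the JSON round trip are assumptions for illustration only, not a description of how Fedora-based systems actually hold their metadata streams.

```python
import json

# Each author is a bundle: the name and the affiliation travel together.
record = {
    "authors": [
        {
            "name": "Peter Sefton",
            "affiliation": "Rubric Project, University of Southern Queensland",
        },
        {
            "name": "Catherine Sefton",
            "affiliation": "Department of Presents, Summer Hill University",
        },
    ]
}

# Send the record out of one system and into another.
round_tripped = json.loads(json.dumps(record))

# Even if the authors come back in a different order, each bundle is intact.
reversed_authors = list(reversed(round_tripped["authors"]))
affiliation_of = {a["name"]: a["affiliation"] for a in reversed_authors}
```

Because the affiliation lives inside the author entry, there is no pairing-by-position to get wrong.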
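+---------------------+--------------------------------+--------------------+
|                     | **ID Managed**                 | **ID Not Managed** |
+=====================+================================+====================+
| **Flat metadata**   | GNU ePrints                    | DSpace             |
+---------------------+--------------------------------+--------------------+
| **Nested metadata** | Fez (sort of: the feature is   | VTLS VITAL         |
|                     | there but I ran into some bugs |                    |
|                     | trying it out)                 |                    |
+---------------------+--------------------------------+--------------------+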
# My opinion
This is my opinion, which is why this is being posted here rather than
on the RUBRIC site, but I think that storing references to database
tables is not a good idea for repositories. Repositories should reflect
the state of an item when it was created, with the name of the person
and department involved **as they were at the time**. If the repository
software instead stores references to information held elsewhere, then
important information will be lost when that information changes.
Lookup tables that help you pick canonical versions of author names and
correctly spelled department names are a good idea. But I think that the
details of author affiliation need to be a snapshot in time, and not a
reference to database tables that might change. A unique ID as well
might be a good idea, though, so that if I change my name from Petie
Sefton to Peter Sefton to ptsefton you can still track me.
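A minimal Python sketch of what I mean by a snapshot plus an ID follows. The class names, the item titles and the affiliation strings are placeholders I've made up for the example; only the names and the ID number come from this post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AuthorSnapshot:
    name: str                        # name exactly as it appeared on the item
    affiliation: str                 # affiliation as it was at the time
    author_id: Optional[int] = None  # optional stable ID, used only for tracking


@dataclass(frozen=True)
class Item:
    title: str
    authors: List[AuthorSnapshot]


def names_used_by(items, author_id):
    """Collect every name one author ID has published under, in item order."""
    return [a.name for item in items for a in item.authors if a.author_id == author_id]


items = [
    Item("Honours thesis", [AuthorSnapshot("Petie Sefton", "University of Sydney", 8342947)]),
    Item("Repository paper", [AuthorSnapshot("Peter Sefton", "Rubric Project, USQ", 8342947)]),
]
names = names_used_by(items, 8342947)
```

The snapshot strings never change after the item is deposited, and the ID is just an extra hook for finding everything one person wrote.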
I want to be clear that I don't think one should or should not use any
of the above software just because of this issue. What I do think is
that repository implementors need to be aware of the affiliation issue
and work out what they're going to do about it.
My feeling (not a RUBRIC recommendation, my feeling) is that even if one
is using one of the ID-managed repositories there should be a plain-text
'snapshot' of author affiliation as well as any managed ID, so managed
IDs are not a barrier to using something like Fez or ePrints. For
recording affiliation, repositories that allow nested metadata make
things much more straightforward.
Note that I left out a lot of technical detail here – and generalized a
fair bit to make this simple taxonomy of repositories.
(I'll leave the comments open here for a change, so please contribute if
you have anything to add.)