ptsefton

2008-12-02

Embargoes on bits of theses: skating on thin ICE?

Filed under: Uncategorized — ptsefton @ 6:07 am

I gave myself a task* after the last TheOREM-ICE teleconference to look into how ICE might be used for fine-grained thesis embargoes.

I have not seen a full spec but I gather from the conversation that sometimes you want to make a thesis available but place some of the data, or maybe a chapter or two, under embargo. Jim Downing proposed that a repository would still advertise the ORE resource map of the thesis on the web but some parts would be unfetchable by anonymous access until the embargo period has expired. Could we use ICE to do that?

Maybe not such a good idea.

I’ll talk here about why not, and outline some other possibilities. Of course you should be cautious when I start talking like this ‘cos based on prior form I’m probably really just trying to get money to start a new software development or integration project.

For a start, embargoes are really quite different from the sorts of access you want in an authoring system: they are time-based, they have to work on the wide web rather than the intranet, and all this happens when the thesis is finished, so it really should be out of the document authoring system at that point.

ICE is designed for document authoring and collaboration and only has fairly broad-brush access control. In the courseware we work with, access is by a whole team to a whole course. For a thesis it is similarly simple: you have the ability to add stuff and your supervisor can comment. I’m not at all sure that it would make sense to add complex document-level access features to ICE. Instead, why not concentrate on the ICE templates and conversion system? That is the bit we do better than anyone else that I know of. We could integrate the templates into other systems and leave the business of writing content or document management systems to all the other contenders. (One reason why not is that many CMSs are pretty hopeless at managing multiple renditions of the same content and don’t have plug-in convertors, but there must be at least some we could work with.)

I’m not at all sure that ICE itself should be pushed too much further into thesis management past the authoring stage. It certainly looks like it would be good for drafting a thesis and getting your supervisor to comment, and it’s definitely on the cutting edge of being able to mashup document content with data visualization, but we don’t have all the elaborate approval and review steps that you’d need for the internal and external processes that follow. One promising lead springs from the way that the Maths & Computing department here at USQ use the Open Journal Systems (OJS) to manage theses. We’re exploring the idea of an ICE/OJS mashup. Stay tuned for a report from the APSR Open Access Publishing Workshop later this week where I will put this idea and many others to the technical stream.

For delivery I think The Fascinator might be a good way to manage the kind of embargoed access that the TheOREM team have identified as a requirement. ICE could manage the authoring, with examination via something like OJS, and then use an ORE resource map to pass the thesis to a departmental or institutional repository running The Fascinator, which IS designed to do access control. It could then republish the resource map to the world-wide repository grid and manage the embargoes. Either at the examination stage or at the repository deposit stage, the candidate would set up the embargoes, which would need to be kept as metadata against the components of the thesis.
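To make the idea of embargo metadata kept against components a bit more concrete, here is a minimal sketch in Python. It is not how The Fascinator or ICE work; the `Component` class, the file names and the dates are all made up for illustration. The only point is that each part of the thesis carries its own embargo date and anonymous access is decided part by part.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Component:
    """One part of a thesis (hypothetical model, for illustration only)."""
    uri: str
    embargo_until: Optional[date] = None  # None means no embargo


def fetchable_by_anonymous(component: Component, today: date) -> bool:
    # A part is open if it was never embargoed or the embargo date has passed.
    return component.embargo_until is None or today >= component.embargo_until


def visible_parts(components: List[Component], today: date) -> List[str]:
    # The resource map could still list every part; only these are fetchable.
    return [c.uri for c in components if fetchable_by_anonymous(c, today)]


# Made-up thesis with one open chapter and two embargoed parts.
thesis = [
    Component("thesis/chapter1.pdf"),
    Component("thesis/chapter4.pdf", embargo_until=date(2010, 1, 1)),
    Component("thesis/data.csv", embargo_until=date(2009, 6, 30)),
]

print(visible_parts(thesis, date(2008, 12, 2)))
```

Because the check is time-based, the same data gives a different answer once an embargo date has passed, with nobody needing to touch the thesis again.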

Thinking about this led me to the idea of putting something like The Fascinator on the desktop, letting it find all your stuff, giving you a simple way to organize it into projects, embargo bits of it and so on, and then automate the process of disseminating it to the institutional and other places you’d like it to go. I’m thinking of something like Picasa (which finds all your pictures on your hard drive no matter how embarrassing or not safe for work they are) and iTunes, which, although in my opinion potentially evil, has some nice ways of browsing and organizing content, but with a connection to the world wide repository grid. More on this idea soon.


* We’re doing this project more or less in the open so you can poke around and see what we’re up to.

2008-11-18

The Fascinator @ !dea MMVIII

Filed under: Uncategorized — ptsefton @ 4:15 pm

The Fascinator is a bit of repository software, designed to show off stuff that resides in a Fedora Commons store. It was designed and built at the Australian Digital Futures Institute at The University of Southern Queensland (USQ). The software development was funded by the ARROW project with a contribution from USQ. It is based on some work that was part of FRED (Federated Repositories for Education). None of it, including the ARROW software, would be possible without the free and open source software on which it was built, and in particular the indexer for Fedora that the Muradora team adapted.

The Fascinator was largely written by Oliver Lucido, with task management and web-skinning by Bron Chandler, and some programming on the harvesting component from Tim McCallum, not to mention the screencast. Neil Dickson at ARROW was our chief stakeholder. Also holding a stake was Alison Dellit from the National Library of Australia (NLA).

Why was The Fascinator relevant to the audience at !dea? The motto there was ‘Enabling Technology: Forging a Digital Future’; it was a cross-sectoral meeting for educational technologists (at least that’s what I thought it was).

I had to think about that a bit. In addition to covering the software I tried to explore these themes:

  1. Talk about some contrasting software development/procurement approaches.

  2. Look at standards and interop in Higher Education, particularly how the higher-ed sector has gone trying to harmonize, standardize and align their metadata.

  3. Is the e-Framework any use to a dev shop like ours? (If you don’t know what that is, don’t worry, the e-Framework crowd can comment below if they like)

  4. Open Source as a conversation.

To understand the motivation for The Fascinator we have to look a little at the ARROW project. ARROW (Australian Research Repositories Online to the World) was (it ends at the end of 2008) an Australian government funded project. ARROW got involved in an interesting software development model, partnering with a commercial systems vendor from the USA to build some Institutional Repository software on top of an open source base.

The result was an, um, interesting mix of free and commercial software.

There was some talk on Thursday morning about building vs buying technology. The ARROW-sponsored application was written as part of a complex interaction in which some technology had been bought but some was open, and we built something that complemented the bought bit and filled in some gaps. That’s much more complex than just ‘build or buy’.

I’m not up to expressing this in ‘proper’ eFramework-speak, but broadly speaking the services the ARROW community are after include:

  1. Collection management; the ability to browse and search groups of like or related things.

  2. Full text search.

  3. Simple access control such as on-campus access only to some kinds of document, rather than more complex far-fetched requirements.

  4. Separate views or ‘portals’ for different campuses or purposes. One thing that would be nice would be to separate images from documents for example. Can I have a document portal and an image portal so I don’t keep seeing pictures of cathedrals when I want to read theses?

(Apparently one of the eFrameworkistas was working on a Service Usage Model as I spoke. Can I see it please?)

The ARROW software, which is called VITAL, currently does 1 and 2 above, and has very rudimentary support for 3, while 4 is a feature that is coming in the new not yet released version 4 of VITAL.

graphics1

While the VITAL software in the above diagram is represented as a nice clean box, what goes on inside is probably rather complicated. Of course it’s not possible to find out exactly what VITAL does as it is a proprietary closed-source product. What I know about VITAL comes from talking to people from the vendor. I don’t want to get into too many specifics as I know the product is evolving and they might not like me to share all of the things they have shared with me. I have based this discussion on knowledge that is common in the ARROW community.

Now the current version of VITAL can’t fulfill all the requirements I outlined above, but let’s think about an imaginary repository at an imaginary university, the University of the East Coast. That institution might have a repository which serves multiple campuses (Bondi Beach, Surfers Paradise, North Stradbroke Island, etc), and which has ‘collections’ of documents such as research articles or theses. Some of the material is not for public consumption, such as theses which are under embargo because they contain stuff that’s patentable.

This diagram shows what I imagine might go on inside a repository as it grinds through the process of doing a search. This involves doing four separate lookups. I have no idea how VITAL really works so please don’t take this literally; it’s a thought experiment.

If I search for something, say the word surfboard in the thesis collection, from the portal belonging to one of the campuses then:

First it must merge the results from three of its indexes:

  1. A text index.

  2. A collections index using the RDF triple-store.

  3. A portal table which might be implemented using a SQL database.

When that’s done it has to compare all the results to the access policy store in a process known as post-filtering.

graphics2

When I gave this talk an audience member wanted to know why I wanted to post-filter search results. Answer? I don’t. This slide was to illustrate the problem with this kind of architecture. Got that? I don’t think that the above is a very good idea unless you want to spend a lot of time and resources on optimizing out all the problems. It’s an anti-pattern.

This example illustrates a potential problem with using an architecture with multiple different indexes where queries need to be merged. The application designers will either have to accept slow performance or write a lot of optimizations themselves to merge all those results (using the mergerator which is a word I just made up) from searches against different indexes. This is complex, hard stuff.
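Here is the thought experiment as a toy Python sketch, in the same spirit as the diagram: it is my imagined architecture, not how VITAL works, and every index, document id and policy value below is invented. Three lookups are merged and then a fourth lookup post-filters the merged set against the access policy.

```python
# Three separate "indexes", each answering one part of the question.
# All ids and values are invented for this thought experiment.
text_index = {"surfboard": {"t1", "t2", "t3", "a1"}}
collection_index = {"thesis": {"t1", "t2", "t3"}, "article": {"a1"}}
portal_table = {"bondi": {"t1", "t2", "a1"}, "surfers": {"t3"}}

# The access policy store, consulted only after everything else is merged.
access_policy = {"t1": "public", "t2": "embargoed", "t3": "public", "a1": "public"}


def search(term, collection, portal, user="guest"):
    # Lookups 1 to 3: fetch a result set from each index and merge them.
    merged = (
        text_index.get(term, set())
        & collection_index.get(collection, set())
        & portal_table.get(portal, set())
    )
    # Lookup 4: post-filter every merged hit against the access policy.
    if user == "guest":
        merged = {doc for doc in merged if access_policy[doc] == "public"}
    return sorted(merged)


print(search("surfboard", "thesis", "bondi"))
print(search("surfboard", "thesis", "bondi", user="staff"))
```

With four documents this is instant. The trouble starts when each lookup returns thousands of ids from a different technology and the policy check has to run once per hit.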

The main problem with this architecture is post-filtering search results for access control. The impact of having to do this could be very great in some situations. I have concocted a ludicrous example to illustrate this.

At the University of the East Coast, there’s this rumour going around that someone is onto a new kind of skeg with a special shape. In fact there is a (very, stupidly,) large number of PhD theses that have been written on the topic but they are all under embargo and the people who wrote them are sworn to secrecy.

So when a designer from a surf company searches the uni repository it’s important not to reveal that, yes in fact we do have a lot of hits for the search term ‘helical skeg’.

graphics3

You can tell that I made this example up, right? But it does illustrate an important point. If you have some non open access content you need to be very sure that you do not expose even the fact that you have a document that contains a particular string.
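A small Python sketch of how that kind of leak can happen with post-filtering. The names and numbers are invented to match my ludicrous example: the hit count is taken before the access check, so the guest sees an empty result list alongside a very telling total.

```python
# Invented data: 500 embargoed theses that all mention the search term.
hits = {"helical skeg": [f"thesis-{i}" for i in range(1, 501)]}
embargoed = set(hits["helical skeg"])


def leaky_search(term):
    matches = hits.get(term, [])
    visible = [doc for doc in matches if doc not in embargoed]
    # Bug: the total is counted before the access filter is applied.
    return {"total": len(matches), "results": visible}


def safe_search(term):
    matches = hits.get(term, [])
    visible = [doc for doc in matches if doc not in embargoed]
    # The total only ever reflects what this user is allowed to see.
    return {"total": len(visible), "results": visible}


print(leaky_search("helical skeg")["total"])
print(safe_search("helical skeg")["total"])
```

The same thing can happen with facet counts, pagination ("page 1 of 50") or spelling suggestions, which is why I would rather the access restriction were part of the query itself.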

This illustrates only what I see as a potential problem with using an architecture with multiple different indexes where queries need to be merged. The application designers will either have to accept slow performance or write a lot of optimizations themselves. This is complex, hard stuff.

The Fascinator takes a different approach; it uses just one technology to drive the search: Apache Solr. Want to do a text search? Apache Solr. Limit results by campus? Apache Solr. Limit the amount of stuff guests can see? Apache Solr.

One big advantage of this approach is that Solr already handles lots of the internal caching of results. If you search for everything that guests can see it remembers the result. If you search for everything with an index entry of Collection='thesis' it can remember that. When you perform a complex query smart programmers have already worked out the fastest way to merge all the pre-cached results. It’s built in.
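As a rough sketch of what such a query might look like, here is some Python that builds a Solr request using `q` for the text search and repeated `fq` (filter query) parameters for the restrictions. The `q` and `fq` parameters are standard Solr; the field names `campus`, `collection` and `access`, and the `/solr/select` path, are assumptions for illustration and not The Fascinator’s actual schema.

```python
from urllib.parse import urlencode, parse_qs


def build_solr_query(text, campus=None, collection=None, guest=True):
    """Build a Solr select URL (field names are hypothetical)."""
    params = [("q", text)]
    if campus:
        # Portal restriction expressed as a filter query.
        params.append(("fq", f'campus:"{campus}"'))
    if collection:
        params.append(("fq", f'collection:"{collection}"'))
    if guest:
        # Access control is just another filter, applied inside the search.
        params.append(("fq", "access:guest"))
    return "/solr/select?" + urlencode(params)


url = build_solr_query("surfboard", campus="Bondi Beach", collection="thesis")
print(url)
print(parse_qs(url.split("?", 1)[1]))
```

Each `fq` is a separate filter that Solr can cache independently of the main query, so ‘everything guests can see’ is worked out once and reused. Because the access filter is part of the query, hit counts never include documents the user can’t see.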

Main features:

  1. Flexible indexing.

  2. Faceted search.

  3. New portals using facets.

  4. Access control using facets.

graphics4

You can see the Fascinator in action in a couple of places:

  • Our demo site. Currently has content from a variety of places including the Université catholique de Louvain where we have harvested open content for demonstration purposes but your mileage may vary.

  • The AUIRC site where we will eventually harvest all the Australian University Repositories we can get our paws on.

The Fascinator is mission-accomplished in terms of its goals, which were really about making a proof-of-concept statement in an ongoing conversation with the vendor and others in our repository community. We could park it now and be happy but that would be a pity. It looks like the software could have a life beyond the initial project; there has been interest expressed in quite a few places, but everyone involved needs to understand that support from now on is based on whatever we can manage in between other projects, unless we can work out a way to make it sustainable.

graphics5

There are a couple of things we can do now:

In any case we will offer the indexing component, which we (meaning Oliver) built on top of work by the Muradora team, back to the Fedora Commons community.

In conclusion, I offered the folks at !dea 2008 the following. (Don’t think I really offered much re software development methodology):

  1. I observed that from what I’ve heard about the vocational (VET) sector they have done much better with metadata standardization. From what I hear there was a bit of cynicism from some people about that, but on the whole I think that’s right: they’re more focussed and have done much better at aggregation and sharing. See LORN.

  2. I’m still not sure how I would use the e-Framework to design software; what we did here feels more like improvisation, mashing up a couple of well-known patterns. Sort of like playing a Reggae song using country licks, which is what I do at home as my hobby. But talking to Lyle Winton, he says that what we have done is identify an anti-pattern (the imagined architecture above) and a new pattern which works. The e-Framework team are welcome to use us as a case study…

  3. This open source thing has been really important to us. Instead of being stuck in conference rooms and conference calls trying to explain to a software vendor what ARROW wants, we were able to spend a comparatively small amount of money and say it with code, building on what others have already done.

2008-10-20

Towards (Australian) repository interoperability using OAI-PMH

Filed under: Uncategorized — ptsefton @ 12:52 pm

[Update: fixed typo in title]

Jim Downing tagged the presentation I posted on Tuesday, What the OAI-ORE protocol can do for you, as ‘Apart from the ORE parts, this contains a nice exposition of the difference between standards and interoperability’.

That tag nails a lot of what the talk was about. ICT Standards are nice, but they don’t always guarantee interop. The main example I gave of this was the National Library of Australia’s ARROW Discovery Service. It creates a normalized view of what’s in Australia’s institutional repositories. But the half-finished harvest we’re doing with USQ’s Australian University Institutional Repository Census (AUIRC) shows the underlying chaos, more politely described as the diverse range of ways repositories describe their content.

The reasonably rosy picture you see at the ARROW Discovery Service is not a result of true interop and I fear that it is not scalable. Instead it depends on a great deal of ongoing work by the maintainers to keep all the normalizing rules up to date.

There are other situations where data need to be moved around where the recipient of the data is not going to patiently and laboriously normalize your ad-hoc content. Let’s digress to look at one: the forthcoming ERA for Australia, which replaces the never-ran RQF.

I think I’m right in saying that the Australian Government is not going to be particularly understanding if your institution submits its data using a local ‘standard’. If they think that a particular kind of research output is of type journalArticle then you won’t be able to submit something called Article, Journal.

If you want to use the repository of research outputs to feed a report to the government system then you will have to source or create some kind of report, or adaptor or something. Contrary to the fears of at least one repository manager I have spoken to recently this doesn’t necessarily mean that you have to change what’s in your repository (unless the metadata you’ve been collecting is inadequate to make the required distinctions). But it will mean that some bit of software needs to be created to provide the reports that you need.

I think what’s causing some stress is that our repositories don’t really have adaptors in all the places they need them yet. We don’t have anything like those little power adaptors that let you plug European appliances into Aussie wall-outlets. (By the way, Rick Jelliffe has a great post on interop relating power points to office document standards.)

Returning to the ARROW discovery service, the available-on-request guidelines say (very reasonably):

Populate as many fields as possible. It is a good idea to populate the Type element at all times, and to use the MACAR list to do so.

But there is no straightforward way using standard IR software to use a MACAR type for the external view of a repository while keeping a different type for the internal view. At USQ, for example, the official university nomenclature for theses and dissertations is at odds with the MACAR types.

So, here’s a three-step process that I think could move us towards better interop, one that could be more sustainable and less of a drain on the NLA:

  1. The NLA publish all the rules they use to normalize repository content in a public place. This might not be in the form of a standard, but it can be made human readable.

    At USQ we’ve been adapting NLA rules in our work on The Fascinator and with the AUIRC, and they’re going to be in The Fascinator’s distribution as an example.

  2. The repository community works to create adaptors for the various repositories so that they can move the normalization closer to home. Instead of feeding whatever you happen to have to the long-suffering ARROW discovery service you take responsibility for mapping your local view of your data to the shared standardized view.

  3. After a testing period the NLA switch over to the new system and start rejecting out-of-band input.

This is of course, just an example for discussion, to illuminate the general issue of interoperability. I don’t set NLA policy. It’s not my service.

There are lots of ways this normalization could be done, but how about using an OAI-PMH adaptor. Hook one side of it to your repository and expose the other side to the harvester. It would sit there quietly normalizing the content.

It doesn’t matter where the adaptor runs, but it does matter who looks after the normalization rules. It’s not sustainable to expect staff at the consumer end to keep adapting to non-standard inputs unless there is a very clear case for ongoing funding for such activities.
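A minimal Python sketch of the normalizing step such an adaptor might perform, using only the standard library. The record below is heavily simplified (a real OAI-PMH response wraps each record in more envelope elements), and the single rule is just the journalArticle example from earlier in this post, not an actual NLA or MACAR rule.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace

# Local value -> shared value. Illustrative only, not a real rule set.
RULES = {"Article, Journal": "journalArticle"}


def normalize_record(xml_text, rules=RULES):
    """Rewrite dc:type values on the way from repository to harvester."""
    root = ET.fromstring(xml_text)
    for element in root.iter(f"{{{DC}}}type"):
        value = (element.text or "").strip()
        if value in rules:
            element.text = rules[value]
    return root


# A cut-down record as the local repository might expose it.
record = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Helical skegs</dc:title>
  <dc:type>Article, Journal</dc:type>
</record>"""

normalized = normalize_record(record)
print(normalized.find(f"{{{DC}}}type").text)
```

The repository keeps its local nomenclature and the harvester sees the shared one. Whoever edits `RULES` is the person taking responsibility for the mapping, which is really the question this post is about.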

Being a techo I think of this normalizing OAI-PMH adaptor as a proxy and it turns out that a proxy is actually one of the recommendations from:

a small pilot project to set up a European doctoral e-theses Demonstrator. The work was funded by the Joint Information Systems Committee (JISC) in the UK, the National Library of Sweden and SURFfoundation in the Netherlands. The project has been performed by SURFfoundation.

http://www.surffoundation.nl/download/ETD_LessonsLearned_Full-Report+Annex.pdf

It mentions proxies:

The best solution is to create national proxy services. These proxies are gateways that on one side harvest the metadata from local Institutional Repositories, and on the other side have an OAI-PMH gate that can be used for service providers to harvest the national proxy. The advantage of a national proxy is that it can centrally normalise the metadata, and deliver unambiguous metadata to service providers.

The ARROW discovery service is already a national proxy (it normalizes and acts as a gateway) but I am concerned about the centralized maintenance of the normalizing rules. There’s not much of a disincentive to changing stuff around if someone is going to clean up after you, is there?

From the same report, the tedium of normalizing:

This has to be done for every type of field of every repository, which is a labour intensive enterprise when offering a high quality service. This is a short term solution, the long term solution will be that repositories offer standardised content that can be easily used for interoperable services.

Right, what I’m suggesting is finding a way to give the NLA proxy/normalizer back to the providers. One way would be to make software available that lets people adapt their OAI-PMH feed without having to change their repository. Or the NLA could make the rule-sets editable by staff from the sites they harvest.

Given the current state of the repository art I would not propose that a harvester service suddenly start enforcing standards (i.e. rejecting non-standard input). But, if we could come up with some freely available adaptor infrastructure then would it be reasonable for the NLA to, after giving people time to adapt, stop accepting ad-hoc data and stop being responsible for making the service consistent? (That’s a genuine question. Would it? Use the comments below if you have an opinion).

There are other models.

The Zotero research tool project, for example relies on volunteers writing vast numbers of adaptors to let Zotero harvest metadata from various kinds of web sites. I don’t see anybody seriously suggesting that all this work is pushed back to the sites. I wonder about opening adaptor rules up to everyone wiki style. What would happen?

If it turns out that the ARROW discovery service is not all that important to the owners of the repositories it harvests then there’s no use expecting them to take over responsibility for normalizing their outputs. In that case either it’s worth supporting at a national level, via a community, or it’s not worth having. Repository owners, do you value the service?

Finally, a reflection on the process I went through in writing this post. What I thought I was sitting down to write was a survey of potential ways to share and reuse the normalizing cross-walks that have been developed at the NLA. But in the process of thinking it through I came to the conclusion that a normalizing OAI-PMH adaptor is a pretty good idea. Then I realized that the ARROW Discovery Service is already doing that job. I was thinking about all sorts of complicated machinery but now I think that we have nearly all the technology we really need; it’s just a matter of distributing it the right way.

Returning to Jim’s perceptive comment re standards versus interop: the OAI-PMH Standard is more or less working for IRs in Australia but the interoperability is limping along. I think we now have a great opportunity to think about what interoperability is worth to us as a community and put the power into the right hands to make it happen. Should it be central, at each repository site, or crowd-sourced? Choose one or more, or give up.

I’m very interested in this as a test case for how we handle content models in OAI-ORE. If we want to start swapping theses and journal issues and the like, are we going to be able to get working interop from the start or will we need to go through the same kind of process as we’re going through with OAI-PMH?

2008-10-16

Happy Open Access day

Filed under: Uncategorized — ptsefton @ 4:16 pm

Chris Rusbridge points out that at the ARROW day on Tuesday week nobody mentioned Open Access Day.

I knew it was Open Access day when I was preparing my talk, and I meant to mention it but I forgot*, as did everyone else, apparently. So happy Open Access day everyone. I hope this doesn’t say anything too significant about the ARROW community. (I do have a theory that because ARROW became involved in a rather drawn out and complex software development and deployment process there was a tendency to focus on technical matters over policy for some people some of the time, and that some of us may have lost sight of why we were doing this in the first place.)

Ironically, I had an approach very shortly after I posted my talk, from the publisher of a toll-access publication, asking if I’d like to work up my talk into a paper. My first response was that yes I am interested. But maybe I should only bother with full OA publishers. [Update: I should have mentioned that Chris says that that’s what he does]

Any advice? Should those of us in this business be making a point of doing everything as full open access (is that gold?)? Green? Some other colour I don’t know about? The terminology in this space is very confusing; I don’t even know how to express my question!

[Update: I made a couple of minor edits to this post]


* I did remember that Wednesday was Ride to Work day and mentioned that in my presentation. I was sorry to miss out on riding (and two breakfasts) in Toowoomba but I didn’t go so far as to bring a bike and take it for a gratuitous ride across the Brisbane CBD.

ARROW week

Filed under: Uncategorized — ptsefton @ 11:38 am

This week I’m in Brisbane for an ARROW week. On Tuesday there was the ARROW repositories day, which was live-blogged [1, 2, 3] by Chris Rusbridge. Chris and I met via the blogosphere talking about how many clicks it should take to put stuff in a repository. His presentation at the ARROW day was a good vision of how the repository might disappear into the walls like plumbing. (It also turns out that we nearly ended up working for the same institution at one stage.)

I talked about OAI-ORE but spent a bit of time on the impact that other standards have had on the repository world so far. Maybe too much, as David Flanders tagged the post as ‘This post identifies some issues with XACML in the ARROW project (it also talk about ORE a little)’.

One mildly surprising thing I got from the first couple of ARROW days is that a number of people seem to think that repository is a dirty word now and they try to avoid using it. Maybe we should use the French-language version dépôt, which brings to my mind the place where omnibuses go to sleep every night. (Renaud Michotte showed us the work they’ve been doing at the Université catholique de Louvain.)

Wednesday was the ARROW community day which was a slightly uneasy mix of celebration and trepidation as ARROW central prepare to move on to new challenges while the ARROW repository rats (depot dingos?) wonder about how they’ll go in a world where there’s less support. In a HHGTTGesque gesture ARROW-central gave us all a towel. In addition to the towels, which should come in handy, there is going to be a new repository support service in Australia along the lines of the Repositories Support Service in the UK but we don’t know what that will look like yet.

I showed off The Fascinator.

Today and tomorrow (Thursday and Friday) I’m at the second VALET code-fest/camp thing at QUT. The developers have just started work, so nothing to report yet but I’ll report here.

2008-10-14

What the OAI-ORE protocol can do for you

Filed under: Uncategorized — ptsefton @ 1:47 pm

Peter Sefton

University of Southern Queensland

sefton@usq.edu.au

A presentation for the ARROW Repository day 2008-10-14

Abstract: Open Archives Initiative Object Reuse and Exchange (OAI-ORE) is an important new protocol for representing compound objects, or aggregations, in a web environment. The system is generating a lot of development activity in the repository community, some of which will be reported in this presentation.

One of the main contributions of ORE will be a way to describe an item that is made up of several parts. The classic example is an HTML document and its images; until the advent of OAI-ORE there has been no standardized way to draw a line around such an aggregation. What may seem obvious to a reader has not been obvious to machines, for which the document and its images are treated as equivalent in status. Likewise ORE will help to define what is an item in a repository in a way that can help to make items portable between systems. It will also allow systems to exchange objects that are made up of multiple parts, such as a thesis with multiple chapters and data files.

The presentation will include some ORE demonstrations and discussion showing some USQ work on how documents from a content management system can be automatically ingested into repository systems (we will demo with ICE, ePrints and The Fascinator), and how items can be migrated from one repository to another. There will also be some more speculative discussion of future possibilities and some examples of other work.

OAI-ORE is a conceptually complex system with its own very tightly defined set of terms and some very refined and nuanced design. It is likely that for the most part, developers will work with OAI-ORE libraries to get things done, and for end users and repositarians the system will be completely transparent in much the same way most of us never need to look inside an OAI-PMH feed.

1 Introduction: Standards

Before I go on to talk about ORE (that’s Object Reuse and Exchange) I thought I’d talk a little bit about standards in general and then about some experiences with standards in the Australian repository domain, particularly in the ARROW community.

I used to work at Standards Australia. They liked to say there, ‘The good thing about standards is that there are so many of them.’ It was a laugh a minute at Standards, at least when it was in Homebush in the good old days, before they moved it to the CBD and floated it on the stock exchange.

Me, I always wanted a little sign in the office that said ‘We have standards to maintain’.

Eventually I got sick of wearing a tie and moved to Queensland.

So what’s the point of a Standard?

In the software world one thing people worry about is, can I replace this bit of software with a different one later? Can I shop around for a database? Even when we’re dealing with open standards and open software the answer is often, well, sort of.

Standard components

You might be able to drop in / screw in components such as:

  1. A house brick.

  2. A tap.

  3. A database (given the right drivers). Eg Fedora works with several different databases.

  4. A screw (given the right drivers)

  5. An authentication system like LDAP.

graphics5

It’s not just about components, though. The big thing is interoperability. Can my systems interoperate with yours now? And will my digital assets interoperate with my own future systems?

The thing that’s important about those standard components is their interfaces. That is, how they fit together.

And how they behave if you move them from one site to another.

I make the point about interop because I want to come back to it later. You need to think very hard about whether it is worth using a standard to do something that you can’t later reuse.

2 Two examples of standards

Let’s look at two standards that have been used and/or promoted in the Repository space before going on to look at OAI-ORE: OAI-PMH and XACML.

2.1 OAI-PMH

The Open Archives Initiative Protocol for Metadata Harvesting is a must-have standard for repositories. It’s used for disseminating repository content to registries and indexes that aggregate content.
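For anyone who hasn’t seen one, a harvest is just a series of HTTP requests with a `verb` parameter. Here is a minimal sketch of building a `ListRecords` request URL. The base URL is a made-up placeholder and the helper name is mine; only the `verb`, `metadataPrefix`, `set` and `from` parameters come from the protocol.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # optional: restrict the harvest to one set
    if from_date:
        params["from"] = from_date    # optional: only records changed since this date
    return base_url + "?" + urlencode(params)


# The endpoint below is an assumption for illustration, not a real repository.
url = list_records_url("https://repository.example.edu/oai", from_date="2008-01-01")
print(url)
```

The harvester then has to make sense of whatever comes back, which is where the trouble starts.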

OAI-PMH

  • Mostly works.

  • A standard for moving metadata from one place to another.

But:

OAI-PMH is basically A Good Thing. That’s not to say that it’s painless, though.

The OAI-PMH standard actually clashes with the Dublin Core standard in at least one place. Neil Godfrey nails the issue in a blog post.

But worse than that, in practical terms harvesting is a mess. While the interchange protocol (PMH) more or less works, the stuff that people interchange is very far from being standardized. At the National Library of Australia’s ARROW discovery service they have a normalization process built into their harvester that presents a coherent view. This is not the result of standardization; it’s the result of Alison Dellit and the NLA team’s hard work in writing rules that say things like:

For the USQ repository:
  If type is_equal_to "ADT_Thesis":
     set type to "australasian digital thesis"

These rules normalize the chaos that is Australian Institutional Repositories. This is certainly made easier by a pretty-good level of standardization in some areas. At least people put the resource type in the dc:type element!
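To make that concrete, here is a rough sketch of what a per-repository normalization rule table might look like in code. This is not the NLA’s implementation, which I haven’t seen; the table layout and function name are my own assumptions, and the only rule taken from above is the USQ one.

```python
# Hypothetical rule table: (repository, raw dc:type value) -> normalized type.
# Only the USQ rule comes from the example above; a real table would be much longer.
RULES = {
    ("USQ", "ADT_Thesis"): "australasian digital thesis",
}


def normalize_type(repository, raw_type):
    """Return the normalized resource type, or the raw value if no rule matches."""
    return RULES.get((repository, raw_type.strip()), raw_type)


print(normalize_type("USQ", "ADT_Thesis"))
print(normalize_type("USQ", "Journal Article"))
```

The point of the sketch is that every rule is keyed on the repository as well as the value, because the same string can mean different things at different institutions.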

Don’t shoot the messenger (PMH)

The Australian University Repository Census (AUIRC, pronounced OIK) uses OAI-PMH to harvest items from as many Australian Universities as it can.

Compare!

2.2 XACML

But not all standards have worked out so well. One thing that has definitely been much more painful for the ARROW community than OAI-PMH is XACML, the eXtensible Access Control Markup Language. There has been an expectation that this will be a key standard for repositories, but so far it has not turned out that way.

(Actually, it’s not even a markup language! Markup is something that you’d put in-line in something else, like a bold tag in an HTML page, or a structural element like chapter in XML.)

Great eXpectations for XACML

XACML is supposed to let you write role-based policies for items in your repository. For example:

Only initiated females from the Australian Labor Party and mathematicians are allowed to see this.

But:

  • How was your university going to share access policies for mathematicians with my university?

    See the eduPerson spec. Can you figure it out?

  • And did we all expect to be interchanging XACML policies? Really?

I was always vaguely worried about how XACML policies were going to work, but one day I met Kent Fitch, who really nailed it. On the subject of these use cases for XACML where you, an anthropologist, want to grant access to a repository to other anthropologists, he asked: “What’s an anthropologist?”

This is a very, very good question. Does an academic working in the education faculty who self-identifies as a visual ethnographer qualify? What if she’s got an honours degree in anthropology? In an access federation would the archaeologists who make up our anthropology department count as archaeologists?

Just look at the wide variety of names used to refer to a thesis in Australian repositories. Remember, this is librarians we’re talking about here (and maybe the research office). If these information professionals can’t agree on what to call something as well defined as a thesis, how will they go labeling archaeologists, anthropologists, ethnographers, linguists and so on in such a way that I can trust the labels enough to give those people access to research materials in my repository?

But that’s not the biggest problem with the unrealistic expectations heaped on XACML.

The problem is that an XACML policy tells you who has what kind of access to an item, but if the XACML is not able to be integrated with the search index then you can’t filter search results per-user without looking at every single item. Very slow, that. But if you don’t then you risk letting on that your repository contains something that you should not be disclosing, like the name of a chemical which is the subject of a patent application, or worse the name of a person that you must not disclose.

(The good news is that a project we’ve been doing at USQ called The Fascinator takes an approach to access control that is integrated with search and so far that seems to be working.)
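Here is a toy sketch of the general idea of access control that lives in the search index, as opposed to evaluating a separate policy for every hit. It is not how The Fascinator actually does it and it is not XACML; the `IndexedItem` structure, the role names and the sample items are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class IndexedItem:
    item_id: str
    text: str
    allowed_roles: frozenset  # stored in the index next to the searchable text


def search(index, query, user_roles):
    """Return ids of items that match the query AND are visible to the user.

    The access check is part of the same pass as the text match, so a hidden
    item never shows up in the results, not even as a title.
    """
    user_roles = frozenset(user_roles)
    return [
        item.item_id
        for item in index
        if query.lower() in item.text.lower() and item.allowed_roles & user_roles
    ]


# Invented sample data.
INDEX = [
    IndexedItem("thesis-1", "A thesis on visual ethnography", frozenset({"public"})),
    IndexedItem("data-7", "Compound X patent application data", frozenset({"patent-team"})),
]

print(search(INDEX, "patent", {"public"}))
print(search(INDEX, "patent", {"public", "patent-team"}))
```

In a real search engine the role check would be a filter clause on the query rather than a Python loop, but the principle is the same: the index has to know who can see what.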

So as I go on and I tell you about how I think ORE is going to be a very useful standard for your repository and the services that go around it, we all need to remember that standards are much less use if there’s no possibility of interoperability. For XACML the possibility that you could move an access policy from one repository to another is vanishingly small; remember, we all call a PhD thesis something different. There might be an advantage to learning just one policy expression language but it’s certainly not the same advantage as you’d get if you could share policies.

And don’t forget that the big problem with XACML is that it turns out to be no good for search. My advice is not to ask for a particular standard because someone else tells you it’s important; look for useful software but check that it is going to interoperate with other services and with your own services into the future. Standards do matter.

3 OAI-ORE

Now the real topic of this presentation.

Officially:

Open Archives Initiative Object Reuse and Exchange (OAI-ORE) defines standards for the description and exchange of aggregations of Web resources. These aggregations, sometimes called compound digital objects, may combine distributed resources with multiple media types including text, images, data, and video. The goal of these standards is to expose the rich content in these aggregations to applications that support authoring, deposit, exchange, visualization, reuse, and preservation. Although a motivating use case for the work is the changing nature of scholarship and scholarly communication, and the need for cyberinfrastructure to support that scholarship, the intent of the effort is to develop standards that generalize across all web-based information including the increasingly popular social networks of web 2.0.

http://www.openarchives.org/ore/

What is OAI-ORE?

Open Archives Initiative Object Reuse and Exchange* (OAI-ORE) defines standards for the description and exchange of aggregations of Web resources.

graphics3

Exchange and subsequent re-use of an object.

Used without permission.

*Shouldn’t it be exchange then reuse; OER?

Look at my blog. To put content on the blog I use AtomPub, a standard protocol for pushing web content to a server.

One of the really dumb things about blogs in general is that to the blog application an article usually consists of just the HTML part. In WordPress, for example, a post is a bit of HTML and some metadata. If I want to include a picture it has to go into an uploads directory. To the reader, it is perfectly reasonable to think of a blog post with a couple of images as a single document, but most blogging software and the Atom protocol don’t really support this. To further complicate matters, a blog could be laced with advertisements and adorned with web 2.0 widgets that most of us would not consider part of the content. In the example above there’s an automatically generated map which is actually not part of the content I typed in but which is generated from it.

ORE offers a web-based way to describe an aggregate resource, what I think of as a post, made up of the HTML part and the images that the software likes to think of as uploads but which I think of as part of my document.
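Stripped of the serialization details, a resource map for a post like that boils down to a handful of statements. The sketch below builds them as plain triples using the ORE vocabulary terms (`ore:describes`, `ore:aggregates`, `ore:ResourceMap`, `ore:Aggregation`). The URIs are invented and the helper is my own; a real resource map would also carry metadata such as a creator and a modification date, which I’ve left out.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_map(rem_uri, aggregation_uri, parts):
    """Return the core triples of an ORE resource map (hypothetical helper)."""
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    # One ore:aggregates statement per part of the post: the HTML and each image.
    triples += [(aggregation_uri, ORE + "aggregates", part) for part in parts]
    return triples


# Made-up URIs for a post with an HTML body and two uploaded images.
post = resource_map(
    "http://blog.example.org/2008/10/post.rdf",
    "http://blog.example.org/2008/10/post#aggregation",
    [
        "http://blog.example.org/2008/10/post.html",
        "http://blog.example.org/uploads/figure1.png",
        "http://blog.example.org/uploads/figure2.png",
    ],
)

for s, p, o in post:
    print(f"<{s}> <{p}> <{o}> .")
```

The interesting part is the last three statements: they say the images belong to the post, which is exactly what the blog software doesn’t know.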

Only it doesn’t work like that. Not for WordPress. The plugin I installed at my blog doesn’t treat stuff from the uploads directory as being part of the post. (But it will, it seems).

The ORE view of a blog post

Object1

The ORE tutorial starts off with simple stuff like this, but it quickly gets fiendishly complicated, and you have to deal with prose which is obviously very carefully worded but can be quite impenetrable.

I’m looking forward to a time when there is some consensus on a ‘blog post’ type I can attach to the aggregation (failing that we can use NLA-style normalization a la ARROW discovery).

What are other people doing with ORE?

(I don’t much like the entry which won the ORE challenge, which was just a worse way to visualize relationships than you could get using text).

What are we (USQ and friends) doing with it?

  1. Pushing theses and other research content around, as part of the JISC-funded ICE-TheOREM project.

  2. Harvesting repositories into The Fascinator.

  3. Ingesting image collections into The Fascinator, exploring possibilities for working repositories for eResearch.

Here’s a thesis in the ICE content management system.

4 Conclusion

But what will ORE do for me?

IMHO here are a few examples:

  1. Replace or supplement OAI-PMH for moving content between repositories: not just harvesting metadata, but even upgrading to new software.

  2. Improve research tools like Zotero by making it easier to tell the tool what to download when saving a local copy of a paper.

  3. Replace the use of METS packages in work like APSR’s OJS to DSpace demo. [I may be on my own here]

  4. Allow for thesis by publication in a very elegant way.

  5. Pave the way for a new repository architecture which understands content models. (No more discussion of atomistic versus compound objects).

Finally

To help ORE along I suggest a minor rebranding. From now on, call it…

2008-10-08

eResearch for Word users?

Filed under: Uncategorized — ptsefton @ 10:56 am

I’m back in Toowoomba after a week away at the eResearch Australasia 2008 conference in Melbourne. As usual I didn’t live-blog any of it but I will try to post on my thoughts from the conference over the next week or so. Not sure how I’ll go with that, as I have stuff to prepare for the ARROW event in Brisbane next week, where I’m looking at the new OAI-ORE standard, and for Open Standards 08 in Sydney, where I’m talking about standards in the ICE context.

The peer-reviewed paper from era08 is now available in the UQ repository. It’s just a PDF at present, as that’s the way research is typically published. Very Web 0.5.

The Web 1.0 version, i.e. one that is actually in HTML, will be out soon, as we are in the process of finalizing an ICE to ePrints hook-up at USQ. I’m aiming to post the paper, the poster and the presentation that I gave at the humanities workshop on Friday as an HTML package, with PDF versions as well.

I posted my poster here last week.

2008-09-30

ICE: eResearch for Word users

Filed under: Uncategorized — ptsefton @ 4:26 pm

    I’m just blogging this poster from OR08 to show that it can be done.

    About this hyperposter

    This poster is a hyperdocument designed to show some potential applications for eResearch publications.

    This document has embedded semantics.

    For example, it was written in:

    Embedded geographical data (via geohash) can be used to generate a map like the one here. On the web, this is an interactive, automatic process.

    graphics1

    OpenStreetMap data can be used freely under the terms of the Creative Commons Attribution-ShareAlike 2.0 license.

    The mythical datument

    The term Datument was coined in 2004 by Peter Murray-Rust and Henry Rzepa:

    A datument is a hyperdocument for transmitting “complete” information including content and behaviour. … where the machine is supplied with tools which are semantically aware of the document content. Examples of the latter are domain-specific XML components such as maps (GML), graphics (SVG) and molecules (Chemical Markup Language, CML)

    Murray-Rust, P. & Rzepa, H.S., 2004. The Next Big Thing: From Hypermedia to Datuments. Journal of Digital Information, 5(1), p.248. Available at: http://jodi.tamu.edu/Articles/v05/i01/Murray-Rust/?printable=1

    But they are far from common. This poster / blog post / presentation / map-mashup might be the closest you have ever been to one.

    It’s only 2008; be patient!


    Produce PDF, HTML and more from word processors

    1. Microsoft Word (Windows & Mac)

    2. OpenOffice.org Writer & derivatives (Windows, Mac and Linux)

    3. Applies styles behind the scenes to capture structure

    4. Command line or web service for integration

    5. Open source built on Python + OpenOffice.org

    6. Works with Zotero

    7. Built in version control via Subversion

    8. Integrated with ePrints and other repositories (coming soon via the ICE-TheOREM project)


    ICE: a hub for collaborative authoring


    Ask me how

    (Metadata is embedded in the hyperposter using styles)

    Peter Sefton

    {p-meta-author-name}

    The University of Southern Queensland

    {p-meta-author-affiliation}

    peter.sefton@usq.edu.au

    {p-meta-author-email}

    +61 (0) 410 326955

    {p-meta-author-phone-mobile}

    Also available in machine readable form:

    • Dublin Core

      <oai_dc:dc>
       <dc:title>ICE: eResearch for Word users</dc:title>
       <dc:creator>Peter Sefton</dc:creator>
      </oai_dc:dc>
    • RDF ORE resource map for migration to repositories

    2008-09-19

    Is this thing working?

    Filed under: Uncategorized — ptsefton @ 12:54 pm

    I’m working on my hyperposter for eResearch Australasia 2008. This is a test to see if the mapping system here is still working.

    This document has embedded semantics. It was written in:

    1. Toowoomba at USQ [Update: fixed spelling] (S 27.601335° E 151.930854°),

    2. for a conference in Melbourne (S 37.849925° E 144.978368°)
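For the curious, a geohash is just the latitude and longitude interleaved bit by bit and written out in a 32-character alphabet. Here is a minimal sketch of the standard encoding. I am assuming the mapping system uses the standard algorithm; the function name and default precision are my own choices.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair (decimal degrees) as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    value = 0         # bits collected for the current character
    bit_count = 0
    use_lon = True    # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every five bits become one base-32 character
            chars.append(BASE32[value])
            value = 0
            bit_count = 0
    return "".join(chars)


# Southern latitudes are negative, so S 27.601335° E 151.930854° becomes:
print(geohash_encode(-27.601335, 151.930854))
```

Longer hashes mean smaller boxes on the map, and two places that are close together usually share a prefix, which is what makes the scheme handy for embedding in a document.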

    2008-09-09

    Embedding XML in word processing documents (if you really must)

    Filed under: Uncategorized — ptsefton @ 1:38 pm

    Rick Jelliffe has posted a comparison of how foreign XML can be embedded in OOXML (that’s the XML format for Microsoft Office) and ODF (the Open Document Format).

    Rick starts with:

    First the caveat: Word and OpenOffice are not general-purpose XML editors.

    Right. That means that if you do decide that there’s a case for embedding extra XML in OOXML or ODF then you are going to have to supply add-ons to the applications in question to edit it. So what does this mean for the two formats? (As usual I’ll just talk about the word processing format here and ignore spreadsheets and the rest.)

    For OOXML, you would have to create a Word Addin such as the one I’ve looked at here before. There could be a business case for that, but you’d have to accept that your documents were only going to be editable in Word 2007+. I gather from recent posts that Rick does some work on projects where this does make good business sense.

    For OpenOffice.org you’re out of luck. Rick’s tests show that OpenOffice.org strips out foreign markup. It’s unclear whether this is conformant behaviour or not:

    But the bottom line for foreign elements as wrappers in ODF and OOXML is that ODF allows them to be stripped out while OOXML doesn’t allow that; neither of course require that any particular application understands them. The bottom line for OpenOffice and Office seems to be that OpenOffice strips them (dangerously, but perhaps allowed because of bad drafting of that part of the ODF standard) while Office 2007 does allow them.

    As I’ve covered here many times, ODF interoperability between applications is basically non-existent except between Microsoft Office and OpenOffice.org and its derivatives, where some things work quite well. Bottom line is, ODF doesn’t have any formal notion of what’s conformant; it’s up to application developers to implement the bits they feel like implementing.

    The OpenDocument specification does not specify which elements and attributes conforming application must, should, or may support. The intention behind this is to ensure that the OpenDocument specification can be used by as many implementations as possible, even if these applications do not support some or many of the elements and attributes defined in this specification. Viewer applications for instance may not support all editing relates elements and attributes (like change tracking), other application may support only the content related elements and attributes, but none of the style related ones.

    http://www.oasis-open.org/committees/download.php/12572/OpenDocument-v1.1-os.pdf

    I think for most uses a much better bet is to use microformats, which leverage the built-in features of the formats. These not only work in the aforementioned major applications for OOXML or ODF; in many cases they interchange between the formats quite nicely as well.

    What’s a word processing microformat? One example would be using a one-cell borderless table with a paragraph in it of style ‘h-warning’ to indicate a bit of content that’s a warning, to use Rick’s example. OK, so using a table is inelegant, but it works in both Word and OpenOffice.org Writer and will survive round-tripping between .doc, .odt and .docx. You could use a frame, which is a more semantically neutral element, and sacrifice some interop, or you could use styles only, which is a bit harder for users to manage and more error-prone. Actually, Rick gives an example of a styles-based microformat approach.

    We use this kind of technique to do things like generate slide shows from text embedded in documents, and we’re developing methods for embedding metadata in documents using styles.
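To give a feel for the styles-only approach, here is a small sketch that pulls metadata out of a document represented as (style name, text) pairs. It is not the ICE code. The `p-meta-` prefix follows the style names used on the hyperposter, but the input structure and the function are assumptions made up for the example; getting the pairs out of a real .odt or .docx file is a separate job.

```python
META_PREFIX = "p-meta-"


def extract_metadata(paragraphs):
    """Collect text from paragraphs whose style name starts with 'p-meta-'.

    `paragraphs` is an iterable of (style_name, text) pairs. Returns a dict
    mapping the part of the style name after the prefix to a list of values.
    """
    metadata = {}
    for style, text in paragraphs:
        if style.startswith(META_PREFIX):
            key = style[len(META_PREFIX):]
            metadata.setdefault(key, []).append(text.strip())
    return metadata


# Sample input modelled on the poster; the non-metadata styles are invented.
doc = [
    ("Title", "ICE: eResearch for Word users"),
    ("p-meta-author-name", "Peter Sefton"),
    ("p-meta-author-affiliation", "The University of Southern Queensland"),
    ("p", "Ask me how"),
]

print(extract_metadata(doc))
```

Because the metadata rides on ordinary paragraph styles, it should survive a trip through either word processor, which is the whole attraction over foreign XML.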
