PT’s blog

2008-07-31

Improving VALET - part 2

Filed under: Uncategorized — ptsefton @ 11:14 am

This is my second post on the VALET repository deposit tool. Again, if you’re not a repository aficionado you can probably move on [1].

Still here?

One of the issues we confronted with VALET was to rewrite in Java or not to rewrite in Java? VALET is written in Perl, and quite nicely written in my opinion, apart from the HTML forms, which are a big mess of non-valid HTML. There’s nothing wrong with Perl as such, but it does have a couple of downsides relative to Java:

  1. VALET requires a web server to be installed. VITAL used to ship with Apache but it no longer does, so to run VALET you can end up having to compile and install Apache, and obtain some other dependencies. If it were a Java application then you could just drop it into the same servlet container as you use for VITAL and Fedora.

  2. We have heard from some of the, um, younger techies in the ARROW community that Perl is a complete mystery. Others report difficulties in hiring Perl programmers, whereas everyone does Java at uni these days.

On the other hand, there are some reasons not to want to do a port:

  1. Some of the ARROW contingent have been using Perl since 1934 and can at least tolerate it. I’d count myself in that group. Fortran anyone?

  2. Hacking a Java program is not as simple as using a text editor to change a Perl file, because you need to compile (and worry about stuff like CLASSPATH, ugh).

  3. A port will create a huge fork.

All these points count for something, but Prashant from University of South Australia has pointed out that using JSP (to which I’m allergic, like PHP and ASP) gives a much easier entry point for ‘casual’ developers. Even if it does fork, VALET is actually a fairly small application, so the investment is not huge, and the gain for sites that just want to consume the software should be worth it.

In the end the group here at the VALET camp decided that there was enough interest in a Java version that they were going to go for it. Nobody would own up to being a Java expert but four or five confessed to having written production Java code.

They’re creating an application as I write this. While they do that Harry, Duncan and David are integrating all the changes that ARROW sites made to VALET and submitted to the Google group. So the Java team will have a moving target as they re-implement the Perl code.

The Perl version won’t be going away but it looks like at least some sites will move straight over to the Java version once it’s done.

So what are the Java team (Tim, Guy, Prashant and Cyrus) doing?

They’re starting a VALET-compatible clone. The idea is that you should be able to take an existing VALET workflow and its data entry forms and, with minimal effort, port them to run in the new application. Best case would be no work at all required; the new application will be a drop-in replacement for VALET. We’ll see if that can be achieved.

The new app rejoices in the working title of Squire, which is not an acronym; it shows that the developers know how to use a thesaurus. Or is it named for the fish? I reckon they should call it Alfred or Pennyworth [2]. Either way, it’s better than the original working title of Black Hole, which would be like calling your deposit interface Roach Motel. Although at least if you had a repository deposit called Black Hole you could claim very high rates of compression for data. Just don’t mention decompression.

The new Java platform will make it easier to do some of the other changes that the community are asking for (we’re discussing this on the ARROW Google group for those of you in the inner circle), in some cases because there are more repository-oriented libraries for Java than for Perl, but also just because as a community we have more competent Java programmers than Perl programmers these days.

Here are some enhancements that we will probably do at USQ at some stage. There are lots of other requirements too, which we are not going to forget; these are just the ones that I can speak for at this stage:

  1. A SWORD deposit so the application can push content to repositories other than Fedora. We’re going to look at deposit of complex objects over SWORD in the TheOREM-ICE project very soon so this will be a quick add-on.

  2. The inevitable ICE interface so that if you submit a styled word processing document to Squire it will generate good quality HTML and PDF renditions automatically. We’re working with Ian Barnes at ANU and talking to the PKP people about how we might be able to do a better job of inferring document structure than the standard, breathtakingly abysmal Save as HTML feature in word processors. Another step in my campaign to stamp out PDF-only Web 0.5 repositories, at least in Queensland.

  3. Automatic embedding of metadata and license in the PDF file in XMP format, based on some work which is apparently going on in collaboration between QUT and an Australian Government agency.

  4. A lightweight complete open source repository package with Squire for deposit plus Sun Of Fedora as a portal. Not a lot of features, or complexity, just the basics.


[1] If you don’t want to read about repositories, I recommend Bike Snob NYC. Which prominent fast-but-not-fast-enough Australian cyclist was he talking about last week?

Firstly, there was Saunier Duval’s impressive one-two finish, proving once again that there is no “I” in “team.” (Though there is a “moi” in “chamois.”) Secondly, ___ ____ (whose collarbones are only intact after yesterday’s crash because they have both been replaced by titanium) proved he is in fact a great stage racer by taking the Maillot Jaune by one second. (Anybody can blast his way up a mountainside in a distateful display of power, but it takes a certain dignified restraint to sidle up behind people and pilfer seconds the way ___ does, like an uninvited party guest nabbing cocktail weiners.)

http://bikesnobnyc.blogspot.com/2008/07/rest-day-roundup-stealing-seconds-and.html

[2] Bron Chandler points out that there is some potential for recursive naming in the tradition of GNU and HURD. Alfred Pennyworth is sometimes known as Batman’s batman. What would VALET’s nemesis be called? Do valets have nemeses? Do nemeses have valets?

2008-07-30

Improving VALET - part 1

Filed under: Uncategorized — ptsefton @ 4:33 pm

This week the ARROW community is having a get-together for developers to work on the VALET repository ingest tool. This is probably of little interest if you’re not a repository person (or rat) but if you are then this may be of interest whether you are associated with the VITAL / Fedora world or not.

VALET is a deposit tool designed to allow self-deposit of electronic stuff into a Fedora repository, specifically one running VTLS VITAL. The bit about VITAL is crucially important: Fedora is an underlying storage layer, a kind of database, and different software will use it in different ways. VITAL has some tricks for storing datastreams derived from other assets, such as full text extracted from PDF, that other software like Fez would not understand.

VALET comes in two versions.

  1. There’s an open source one, Valet for ETDs, which is set up initially just to deal with Electronic Theses and Dissertations (ETDs). It’s available from the VTLS website or from Google Code (last week the one at the VTLS site was out of date, and the package for download from Google Code was slightly less out of date, but I think they might be up-to-date now).

  2. The other version is mostly the same but is not free. It is important to make the distinction because if you customize the non-free version then you would have to ask VTLS for permission to redistribute it, possibly even within your own institution. I am not a lawyer (although I have a 10 year old who is threatening to become one) but I would be very cautious about changing a file that says “(c) <Some Corporation> All rights reserved”. (Her other potential career is being a computer programmer; it might be a good idea to do both so she can be rich and happy.)

So the outcome of the workshop will be to get a version of the open-source VALET with the best of the modifications that people have made at their sites, with maybe some new features.

One much-requested feature for VALET (and for VITAL too) is to be able to edit submissions that have already been approved and pushed through the VALET workflow into the repository. It’s kind of surprising that VALET doesn’t do this already, but it doesn’t.

I had an idea about how this might work last week, and Tim McCallum has implemented the first part of it already. To explain it we have to go into a little bit of detail about how VALET works. VALET takes a very simple approach to workflow, of which I for one approve. In simple terms:

  • An administrator defines a workflow with a set number of steps and says who can approve a submission at each step.

  • An administrator defines a web form, based on the example(s) shipped by VTLS to collect the metadata required for a submission.

  • At each stage the software simply serializes the information in the form into XML and saves it on disk.

  • For each new stage the program picks up the information from disk and puts the values back into the form.

  • At the final stage the program runs XSLT stylesheets (supplied by the administrator) to transform the serialized form data into the ‘proper’ metadata for the repository.
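The save-and-reload cycle in the middle of that list can be sketched in a few lines. This is only an illustration in Python, not VALET's actual Perl code: the `formdata` and `field` element names, the file name and the function names are all made up for the example, and the final XSLT step is left out.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def save_form(fields, path):
    """Serialize flat form fields (name -> value) to an XML file on disk."""
    root = ET.Element("formdata")  # hypothetical element name
    for name, value in fields.items():
        field = ET.SubElement(root, "field", name=name)  # hypothetical element name
        field.text = value
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def load_form(path):
    """Read the saved XML back into a dict so the form can be re-populated."""
    root = ET.parse(path).getroot()
    return {field.get("name"): field.text or "" for field in root.findall("field")}


def demo_round_trip():
    """One workflow step saves the form; the next step picks it up again."""
    submitted = {"title": "My thesis", "author_email": "student@example.edu"}
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "submission-0001.xml")  # made-up file name
        save_form(submitted, path)
        return load_form(path)
```

Because nothing in the saved file depends on a metadata schema, a field like an author email address survives every step of the workflow whether or not the final stylesheet knows what to do with it.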

What Tim has done is simply to create an additional data stream containing the form data along with the other data streams when an item is approved. This means that it will be there alongside the repository item and all the other metadata streams. I think this will be really useful in solving some of the ongoing issues people are having with their repositories. For example, you might want to capture author email addresses but there is no sensible place to put them in a MODS datastream.

I know, some of you are thinking about standards: how can I save my important data in a non-standard format? To which I say, better to save your data in a form which is not standard and not pretending to be standard, than to rush into inventing a new standard which only you support. Is there a standard out there that captures all the data you want to save? Then use it. If not, capture the data now and work with the community to define the standard you need.

I’m not the only one who had this idea. I found out that Vicki Picasso from Newcastle also thought it would be good to capture the VALET form.

This approach is actually very similar to what you do in ePrints: you can define any old metadata you want (as long as it’s flat name-value pairs) and map it to Dublin Core as you see fit for dissemination purposes.

In VITAL, and in our Sun Of Fedora repository portal project, you can index any XML datastream you like. So if you want to collect HERDC categories (that’s to do with reporting research publications to the Australian Government, very important stuff) then you can, without having to jam them into a metadata schema that was not designed to take them.
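As a rough sketch of that idea (in Python, with invented local field names; this is not how ePrints itself is configured), the mapping can be a plain lookup table applied at dissemination time, with unmapped fields simply staying behind in the stored record:

```python
# Hypothetical local field names mapped to Dublin Core element names.
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "abstract": "description",
    "year": "date",
}


def to_dublin_core(record):
    """Return (dc_element, value) pairs for the fields that have a mapping.

    Fields with no mapping (an author email, say) are left out of the
    Dublin Core view but remain untouched in the stored record.
    """
    return [
        (FIELD_TO_DC[name], value)
        for name, value in record.items()
        if name in FIELD_TO_DC
    ]


record = {
    "title": "My thesis",
    "author": "A. Student",
    "author_email": "student@example.edu",
    "year": "2008",
}
dc_view = to_dublin_core(record)
```

Here `dc_view` holds the title, creator and date, and the email address is still in `record` for whoever needs it later.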

Next steps in the work Tim started:

  1. Work out how to search for and retrieve an item to be re-edited, putting it back in the workflow.

  2. Work out how to create the formdata from existing items that did not get put in the repository. We already have some experience with generating VALET form data based on a very cool idea by Simon McMillan of UNE who can’t make it to the workshop. Get well Simon!

(I put it to my daughter that she could be a programmer and a lawyer and that would make her rich and happy. She said of course being a lawyer would make her rich and happy. I asked what would being a programmer make her? A nerd, apparently.)

2008-07-24

More on Buzzword

Filed under: Uncategorized — ptsefton @ 11:12 am

Two people have recently reminded me about Adobe’s online word processor, Buzzword. Coincidence? Groundswell of popularity? Probably not, as they are married to each other.

Anyway, it has improved a bit since I first looked at it. At least it has HTML export now (it handles lists wrongly, nesting lists inside lists instead of inside list items, but that’s a common mistake). Still no styles or headings and I fear that it is trying to get people to lock up their documents in some kind of proprietary Flash and/or PDF format.
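For anyone who wants to check exported HTML for that list mistake, here is a small sketch using Python's standard `html.parser`. The class and function names are mine, and it assumes the export writes explicit opening and closing tags; it simply counts lists whose direct parent is another list rather than a list item.

```python
from html.parser import HTMLParser

LIST_TAGS = ("ul", "ol")


class ListNestingChecker(HTMLParser):
    """Counts <ul>/<ol> elements opened directly inside another <ul>/<ol>."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open ul/ol/li tags
        self.misnested = 0   # lists whose parent is a list, not an <li>

    def handle_starttag(self, tag, attrs):
        if tag in LIST_TAGS or tag == "li":
            if tag in LIST_TAGS and self.stack and self.stack[-1] in LIST_TAGS:
                self.misnested += 1
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if (tag in LIST_TAGS or tag == "li") and tag in self.stack:
            # Pop back to the matching open tag.
            while self.stack.pop() != tag:
                pass


def count_misnested_lists(html):
    checker = ListNestingChecker()
    checker.feed(html)
    checker.close()
    return checker.misnested


wrong = "<ul><li>a</li><ul><li>b</li></ul></ul>"   # list inside list
right = "<ul><li>a<ul><li>b</li></ul></li></ul>"   # list inside list item
```

Calling `count_misnested_lists(wrong)` flags one problem and `count_misnested_lists(right)` flags none.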

Adobe are asking for feedback so I gave some over at the Acrobat.com blogs.

I think that there’s an opportunity for Adobe to do what I think Google should have done with Google Docs (used to be Writely). I suggested this:

What could be done differently over at Writely so they can reliably import documents and get the lists right, and better still, let people start off in Writely online and produce word processing docs to send out to others?

The Writely / Google people could design a well thought out, freely available generic word processing template that works more or less equally well in various different word processing environments (hint - you’ll need some clean-up code to help the poor word processors keep their lists straight).

http://ptsefton.com/blog/2006/03/21/writely,__meet_the_ice_template/

I think Buzzword should not only use styles, it should get a well designed set of generic styles as a basis, and the Adobe folks should build templates which are Buzzword compatible. The online service that does this first has the best chance of bridging the gap from the offline to the online world.

If I create a document in Buzzword why not make the default export to Word use some Adobe-defined styles and give the user a Buzzword-like toolbar to play with them, post the doc back to Buzzword etc? In all the online word processors I have tried, import and export is appalling and I’m sure this must slow adoption.

At the moment all the online word processors are far behind on features that are needed for some documents. You couldn’t write a thesis in Buzzword (not if you wanted tables of contents and figures and numbering and reference management) but you could draft some stuff in there or collaborate on papers, then export into Word, or FrameMaker or something to finish the job. Here a well thought out style set would really help with interop.

Adobe, if you want any advice on word processing templates, drop me a line. (Someone from Google did, but the conversation didn’t go anywhere.) The ICE project has some templates you might like to look at.

2008-07-15

Some architectural changes to ICE

Filed under: Uncategorized — ptsefton @ 4:59 pm

This post is a look at some architectural changes we’re looking at for the ICE system, as we hit the limits of what we could squeeze out of the old architecture.

Ron Ward has just finished a major rewrite of lots of the application, designed to make it work on a central web server with multiple users, in addition to the ‘classic’ mode where everyone has their own ICE server running on their own computer. He’s spent the last few months trying to get Subversion to do things it was clearly never meant to do.

ICE uses Subversion as a back-end version controlled data store. In the ICE classic mode multiple users work with checked-out working copies of a repository and hit ‘Sync’ to send their changes back to the server and get updates. Behind the Sync button is a fiendishly complicated bit of code that gets updates from the server, detects conflicts, tries to resolve them as gracefully as possible and provides a usable web GUI for the authors.

Figure 1: ICE Classic mode: each user has their own ICE application which looks after their working copy, ICE uses the Subversion protocol to synchronize everyone’s work

Ron’s big rewrite has lots of unit tests based on all the trouble we’ve come across (mis)using Subversion for the last couple of years so we’re happy that it will be robust when running in classic mode.

But the new server version is a problem. If you have multiple users trying to access the same working copy all at once, then Subversion gets in the way; it starts locking files all over the place, for example. One simple solution is just to put out a server version that doesn’t allow distributed editing like ICE classic does, but our courseware authors really need the ability to manage large volumes of stuff on their own PCs, as some courses are pretty big with a lot of digital assets, while we want to have web access for reviewers and casual contributors to the same courses via a central web service.

So we’re looking at a new server mode where ICE still has a working copy but it knows that it is the only user-agent who has it checked out so it doesn’t need to do updates, it can just do commits. If all you want is a web based content management system then this will be all you need to install and it should run pretty well.

If you are following this technobabble then you’ll be asking: but how does that help the ICE classic users work when there’s an ICE server? That would mean that changes made on an ICE client would never make it to the server!

Figure 2: ICE Server mode: No Subversion updates required as it is the only user-agent committing changes to the working copy

That’s the tricky part: we need to create a new mode of operation for ICE where people want the benefits of the server version AND the classic distributed mode of working. In this mode the ICE application will work in a new ‘client’ mode. It will only ever get updates from the central repository. Any additions or changes won’t be fed back to Subversion directly; the ICE client will post them into the ICE server just like any other user.

This will require some more coding, but probably not as much as it would have taken to get the ICE server working any other way, and it opens up the possibility that we can replace Subversion and use a simpler version control system, possibly of our own devising, in future. So a future model might have the ICE server acting as an interface not only for humans but also for other ICE systems.

Figure 3: ICE Client mode: Users can update their local repository but all changes go via the ICE server. We will automate this so it is seamless for users.

Having made this architectural decision we can press on with testing the ICE server straight away, even without making any changes to the client version. Here’s the plan, which we will roll through over the next few weeks:

  1. For the repositories which currently allow both server and classic access we turn off the ability for users to commit using ICE classic. If people want to check out their own copy of the content they can, as long as they post their changes back in through the server version manually.

  2. We modify the ICE server so it now assumes that it has THE working copy and only commits changes, never updates. This will mean we can support multiple users with no dramas (that’s the plan anyway).

  3. We will make a new client mode for ICE which automates the process of detecting changes and posting them from the client version of ICE through the ‘front door’ of the server version, pretty much like any other user. Updates will happen as they do now, from the Subversion repository.
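To make the data flow in that plan concrete, here is a toy model in Python. None of it is real ICE or Subversion code: the class names, the `post`, `update` and `sync` methods and the in-memory dictionaries are stand-ins I made up, and there is no conflict handling at all. It only shows the direction changes travel: the server commits and never updates, while a client updates from the repository and sends its edits in through the server.

```python
class Repository:
    """Stand-in for the central Subversion repository (path -> content)."""

    def __init__(self):
        self.files = {}
        self.revision = 0

    def commit(self, changes):
        self.files.update(changes)
        self.revision += 1
        return self.revision


class IceServer:
    """Holds THE working copy, so it only ever commits and never updates."""

    def __init__(self, repo):
        self.repo = repo
        self.working_copy = dict(repo.files)
        self.log = []  # (user, revision) pairs, just for illustration

    def post(self, user, changes):
        # Web users and ICE clients both come in through this 'front door'.
        self.working_copy.update(changes)
        revision = self.repo.commit(changes)
        self.log.append((user, revision))
        return revision


class IceClient:
    """Local copy: gets updates from the repository, posts changes to the server."""

    def __init__(self, repo, server):
        self.repo = repo
        self.server = server
        self.local = {}
        self.pending = {}  # local edits not yet sent to the server

    def update(self):
        self.local = dict(self.repo.files)
        self.local.update(self.pending)  # keep unsent local edits on top

    def edit(self, path, content):
        self.local[path] = content
        self.pending[path] = content

    def sync(self, user):
        if self.pending:
            self.server.post(user, self.pending)
            self.pending = {}
        self.update()


repo = Repository()
server = IceServer(repo)
alice = IceClient(repo, server)

server.post("reviewer", {"intro.odt": "v1"})   # edit made through the web interface
alice.update()                                 # client pulls from the repository
alice.edit("module1.odt", "draft")             # local edit on Alice's PC
alice.sync("alice")                            # goes in via the server, then updates
```

After the last line the repository, the server's working copy and Alice's local copy all hold the same two files, and the server never had to run an update.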
