PT’s blog

2008-07-31

Improving VALET - part 2

Filed under: Uncategorized — ptsefton @ 11:14 am

This is my second post on the VALET repository deposit tool. Again, if you’re not a repository aficionado you can probably move on [1].

Still here?

One of the issues we confronted with VALET was to rewrite in Java or not to rewrite in Java? VALET is written in Perl and quite nicely written in my opinion, apart from the HTML forms which are a big mess of non-valid HTML. There’s nothing wrong with that as such, but it does have a couple of downsides relative to Java:

  1. VALET requires a web server to be installed. VITAL used to ship with Apache but it no longer does, so to run VALET you can end up having to compile and install Apache, and obtain some other dependencies. If it were a Java application then you could just drop it into the same servlet container as you use for VITAL and Fedora.

  2. We have heard from some of the, um, younger techies in the ARROW community that Perl is a complete mystery. Others report difficulties in hiring Perl programmers, whereas everyone does Java at uni these days.

On the other hand, there are some reasons not to want to do a port:

  1. Some of the ARROW contingent have been using Perl since 1934 and can at least tolerate it. I’d count myself in that group. Fortran anyone?

  2. Hacking a Java program is not as simple as using a text editor to change a Perl file, because you need to compile (and worry about stuff like CLASSPATH, ugh).

  3. A port will create a huge fork.

All these points count for something, but Prashant from University of South Australia has pointed out that using JSP (to which I’m allergic, like PHP and ASP) gives a much easier entry point for ‘casual’ developers. Even if it does fork, VALET is actually a fairly small application, so the investment is not huge and the gain for sites where they want to just consume the software should be worth it.

In the end the group here at the VALET camp decided that there was enough interest in a Java version that they were going to go for it. Nobody would own up to being a Java expert but four or five confessed to having written production Java code.

They’re creating an application as I write this. While they do that Harry, Duncan and David are integrating all the changes that ARROW sites made to VALET and submitted to the Google group. So the Java team will have a moving target as they re-implement the Perl code.

The Perl version won’t be going away but it looks like at least some sites will move straight over to the Java version once it’s done.

So what are the Java team (Tim, Guy, Prashant and Cyrus) doing?

They’re starting a VALET compatible clone. The idea is that you should be able to take an existing VALET workflow and data entry forms and with minimal effort, port it to run in the new application. Best case would be no work at all required; the new application will be a drop-in replacement for VALET. We’ll see if that can be achieved.

The new app rejoices in the working title of Squire, which is not an acronym; it shows that the developers know how to use a thesaurus. Or is it named for the fish? I reckon they should call it Alfred or Pennyworth [2]. Either way, it’s better than the original working title of Black Hole, which would be like calling your deposit interface Roach Motel. Although at least if you had a repository deposit called Black Hole you could claim very high rates of compression for data. Just don’t mention decompression.

The new Java platform will make it easier to do some of the other changes that the community are asking for (we’re discussing this on the ARROW Google group for those of you in the inner-circle), in some cases because there are more repository-oriented libraries for Java than for Perl but also just because as a community we have more competent Java programmers than Perl programmers these days.

Here are some enhancements that we will probably do at USQ at some stage. There are lots of other requirements too, which we are not going to forget; these are just the ones that I can speak for at this stage:

  1. A SWORD deposit so the application can push content to repositories other than Fedora. We’re going to look at deposit of complex objects over SWORD in the TheOREM-ICE project very soon so this will be a quick add-on.

  2. The inevitable ICE interface so that if you submit a styled word processing document to Squire it will generate good quality HTML and PDF renditions automatically. We’re working with Ian Barnes at ANU and talking to the PKP people about how we might be able to do a better job of inferring document structure than the standard, breathtakingly abysmal Save as HTML feature in word processors. Another step in my campaign to stamp out PDF-only Web 0.5 repositories, at least in Queensland.

  3. Automatic embedding of metadata and license in the PDF file in XMP format, based on some work which is apparently going on in collaboration between QUT and an Australian Government agency.

  4. A lightweight complete open source repository package with Squire for deposit plus Sun Of Fedora as a portal. Not a lot of features, or complexity, just the basics.


[1] If you don’t want to read about repositories, I recommend Bike Snob NYC. Which prominent fast but not fast enough Australian cyclist was he talking about last week?

Firstly, there was Saunier Duval’s impressive one-two finish, proving once again that there is no “I” in “team.” (Though there is a “moi” in “chamois.”) Secondly, ___ ____ (whose collarbones are only intact after yesterday’s crash because they have both been replaced by titanium) proved he is in fact a great stage racer by taking the Maillot Jaune by one second. (Anybody can blast his way up a mountainside in a distateful display of power, but it takes a certain dignified restraint to sidle up behind people and pilfer seconds the way ___ does, like an uninvited party guest nabbing cocktail weiners.)

http://bikesnobnyc.blogspot.com/2008/07/rest-day-roundup-stealing-seconds-and.html

[2] Bron Chandler points out that there is some potential for recursive naming in the tradition of GNU and HURD. Alfred Pennyworth is sometimes known as Batman’s batman. What would VALET’s nemesis be called? Do valets have nemeses? Do nemeses have valets?

2008-07-30

Improving VALET - part 1

Filed under: Uncategorized — ptsefton @ 4:33 pm

This week the ARROW community is having a get-together for developers to work on the VALET repository ingest tool. This is probably of little interest if you’re not a repository person (or rat) but if you are then this may be of interest whether you are associated with the VITAL / Fedora world or not.

VALET is a deposit tool designed to allow self-deposit of electronic stuff into a Fedora repository, specifically one running VTLS VITAL. The bit about VITAL is crucially important: Fedora is an underlying storage layer, a kind of database, and different software will use it in different ways. VITAL has some tricks for storing datastreams derived from other assets, such as full-text extracted from PDF, that other software like Fez would not understand.

VALET comes in two versions.

  1. There’s an open source one, Valet for ETDs, which is set up initially just to deal with Electronic Theses and Dissertations (ETDs). It’s available from the VTLS website or from Google Code (last week the one at the VTLS site was out of date, and the package for download from Google Code was slightly less out of date, but I think they might be up-to-date now).

  2. The other version is mostly the same but is not free. It is important to make the distinction because if you customize the non-free version then you would have to ask VTLS for permission to redistribute it, possibly even within your own institution. I am not a lawyer (although I have a 10 year old who is threatening to become one) but I would be very cautious about changing a file that says (c) <Some Corporation> All rights reserved. (Her other potential career is being a computer programmer; it might be a good idea to do both so she can be rich and happy.)

So the outcome of the workshop will be to get a version of the open-source VALET with the best of the modifications that people have made at their sites, with maybe some new features.

One much requested feature for VALET (and for VITAL too) is to be able to edit submissions that have already been approved and pushed through VALET workflow into the repository. It’s kind-of surprising that VALET doesn’t do this already but it doesn’t.

I had an idea about how this might work last week, and Tim McCallum has implemented the first part of it already. To explain it we have to go into a little bit of detail about how VALET works. VALET takes a very simple approach to workflow, of which I for one approve. In simple terms:

  • An administrator defines a workflow with a set number of steps and says who can approve a submission at each step.

  • An administrator defines a web form, based on the example(s) shipped by VTLS to collect the metadata required for a submission.

  • At each stage the software simply serializes the information in the form into XML and saves it on disk.

  • For each new stage the program picks up the information from disk and puts the values back into the form.

  • At the final stage the program runs XSLT stylesheets (supplied by the administrator) to transform the serialized form data into the ‘proper’ metadata for the repository.
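
The save-and-restore steps in that list can be sketched in a few lines of Python. This is only an illustration, not VALET's actual Perl code: the element names (`formdata`, `field`) and the function names are assumptions made up for the example, writing to disk is left out, and so is the final XSLT step, which would need a third-party library.

```python
import xml.etree.ElementTree as ET


def serialize_form(fields):
    """Turn flat form data (a dict of name -> value) into an XML string.

    The <formdata>/<field> layout is hypothetical, not VALET's real format.
    """
    root = ET.Element("formdata")
    for name, value in fields.items():
        field = ET.SubElement(root, "field", name=name)
        field.text = value
    return ET.tostring(root, encoding="unicode")


def restore_form(xml_text):
    """Read the XML back into a dict so the values can be put back in the form."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): (f.text or "") for f in root.findall("field")}


# One workflow stage saves the form, the next stage picks it up again.
submission = {"title": "My thesis", "author_email": "someone@example.edu"}
saved = serialize_form(submission)
restored = restore_form(saved)
```

Because the serialized form is just XML, the same text could also be stored next to the item as one more datastream.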

What Tim has done is simply to create an additional data stream containing the form data along with the other data streams when an item is approved. This means that it will be there alongside the repository item and all the other metadata streams. I think this will be really useful in solving some of the ongoing issues people are having with their repositories. For example, you might want to capture author email addresses but there is no sensible place to put them in a MODS datastream.

I know, some of you are thinking about standards: how can I save my important data in a non-standard format? To which I say, better to save your data in a form which is not standard and not pretending to be standard, than to rush into inventing a new standard which only you support. Is there a standard out there that captures all the data you want to save? Then use it. If not, capture the data now and work with the community to define the standard you need.

I’m not the only one who had this idea. I found out that Vicki Picasso from Newcastle also thought it would be good to capture the VALET form.

This approach is actually very similar to what you do in ePrints: you can define any old metadata you want (as long as it’s flat name-value pairs) and map it to Dublin Core as you see fit for dissemination purposes.
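
As a rough sketch of that idea (not ePrints code, and the field names and the mapping table are invented for the example), flat name-value pairs can be kept in full locally while only the mapped ones are sent out as Dublin Core:

```python
# Hypothetical mapping from local field names to Dublin Core elements.
# Fields with no sensible Dublin Core home are simply left unmapped.
DC_MAPPING = {
    "title": "dc:title",
    "author": "dc:creator",
    "year": "dc:date",
}


def to_dublin_core(record):
    """Return (element, value) pairs for the fields that have a mapping."""
    return [(DC_MAPPING[name], value)
            for name, value in record.items()
            if name in DC_MAPPING]


record = {
    "title": "Improving VALET",
    "author": "A. Researcher",
    "year": "2008",
    "author_email": "a.researcher@example.edu",  # kept locally, not disseminated
}
dc_pairs = to_dublin_core(record)
```

The email address stays in the stored record even though it never appears in the Dublin Core view.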

In VITAL, and in our Sun Of Fedora repository portal project, you can index any XML datastream you like. So if you want to collect HERDC categories (that’s to do with reporting research publications to the Australian Government, very important stuff) then you can, without having to jam them into a metadata schema that was not designed to take them.

Next steps in the work Tim started:

  1. Work out how to search for and retrieve an item to be re-edited, putting it back in the workflow.

  2. Work out how to create the formdata from existing items that did not get put in the repository. We already have some experience with generating VALET form data based on a very cool idea by Simon McMillan of UNE who can’t make it to the workshop. Get well Simon!

(I put it to my daughter that she could be a programmer and a lawyer and that would make her rich and happy. She said of course being a lawyer would make her rich and happy. I asked what would being a programmer make her? A nerd, apparently.)

2008-07-24

More on Buzzword

Filed under: Uncategorized — ptsefton @ 11:12 am

Two people have recently reminded me about Adobe’s online word processor, Buzzword. Coincidence? Groundswell of popularity? Probably not, as they are married to each other.

Anyway, it has improved a bit since I first looked at it. At least it has HTML export now (it handles lists wrongly, nesting lists inside lists instead of inside list items, but that’s a common mistake). Still no styles or headings and I fear that it is trying to get people to lock up their documents in some kind of proprietary Flash and/or PDF format.
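
For anyone who wants to check their own exported HTML for that list mistake, here is a small Python sketch using the standard library parser. It is an assumption-laden toy: it expects every `<li>` to have an explicit closing tag, and the class and function names are made up for the example.

```python
from html.parser import HTMLParser

LISTS = {"ul", "ol"}


class NestedListChecker(HTMLParser):
    """Count lists that sit directly inside another list rather than inside an <li>."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open <ul>/<ol>/<li> elements, outermost first
        self.misnested = 0

    def handle_starttag(self, tag, attrs):
        if tag in LISTS and self.stack and self.stack[-1] in LISTS:
            self.misnested += 1
        if tag in LISTS or tag == "li":
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def count_misnested(html):
    checker = NestedListChecker()
    checker.feed(html)
    return checker.misnested


wrong = "<ul><li>one</li><ul><li>one point one</li></ul></ul>"
right = "<ul><li>one<ul><li>one point one</li></ul></li></ul>"
```

Here `count_misnested(wrong)` reports one problem and `count_misnested(right)` reports none.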

Adobe are asking for feedback so I gave some over at the Acrobat.com blogs.

I think that there’s an opportunity for Adobe to do what I think Google should have done with Google Docs (used to be Writely). I suggested this:

What could be done differently over at Writely so they can reliably import documents and get the lists right, and better still, let people start off in Writely online and produce word processing docs to send out to others?

The Writely / Google people could design a well thought out, freely available generic word processing template that works more or less equally well in various different word processing environments (hint - you’ll need some clean-up code to help the poor word processors keep their lists straight).

http://ptsefton.com/blog/2006/03/21/writely,__meet_the_ice_template/

I think Buzzword should not only use styles, it should get a well designed set of generic styles as a basis, and the Adobe folks should build templates which are Buzzword compatible. The online service that does this first has the best chance of bridging the gap from the offline to the online world.

If I create a document in Buzzword why not make the default export to Word use some Adobe-defined styles and give the user a buzzword-like toolbar to play with them, post the doc back to Buzzword etc? In all the online word processors I have tried import and export is appalling and I’m sure this must slow adoption.

At the moment all the online word processors are far behind on features that are needed for some documents. You couldn’t write a thesis in Buzzword (not if you wanted tables of contents and figures and numbering and reference management) but you could draft some stuff in there or collaborate on papers, then export into Word or FrameMaker or something to finish the job. Here a well thought out style set would really help with interop.

Adobe, if you want any advice on word processing templates, drop me a line. (Someone from Google did, but the conversation didn’t go anywhere.) The ICE project has some templates you might like to look at.

2008-07-15

Some architectural changes to ICE

Filed under: Uncategorized — ptsefton @ 4:59 pm

This post is a look at some architectural changes we’re looking at for the ICE system, as we hit the limits of what we could squeeze out of the old architecture.

Ron Ward has just finished a major rewrite of lots of the application, designed to make it work on a central web server with multiple users, in addition to the ‘classic’ mode where everyone has their own ICE server running on their own computer. He’s spent the last few months trying to get Subversion to do things it was clearly never meant to do.

ICE uses Subversion as a back-end version controlled data store. In the ICE classic mode multiple users work with checked-out working copies of a repository and hit ‘Sync’ to send their changes back to the server and get updates. Behind the Sync button is a fiendishly complicated bit of code that gets updates from the server, detects conflicts, tries to resolve them as gracefully as possible and provides a usable web GUI for the authors.

Figure 1: ICE Classic mode: each user has their own ICE application which looks after their working copy, ICE uses the Subversion protocol to synchronize everyone’s work

Ron’s big rewrite has lots of unit tests based on all the trouble we’ve come across (mis)using Subversion for the last couple of years so we’re happy that it will be robust when running in classic mode.

But the new server version is a problem. If you have multiple users trying to access the same working copy all at once, then Subversion gets in the way: it starts locking files all over the place, for example. One simple solution is just to put out a server version that doesn’t allow distributed editing like ICE classic does. But our courseware authors really need the ability to manage large volumes of stuff on their own PCs, as some courses are pretty big, with a lot of digital assets, while we want to have web access for reviewers and casual contributors to the same courses via a central web service.

So we’re looking at a new server mode where ICE still has a working copy but it knows that it is the only user-agent who has it checked out so it doesn’t need to do updates, it can just do commits. If all you want is a web based content management system then this will be all you need to install and it should run pretty well.

If you are following this technobabble then you’ll be asking: but how does that help the ICE classic users work when there’s an ICE server? That would mean that changes made on an ICE client would never make it to the server!

Figure 2: ICE Server mode: No Subversion updates required as it is the only user-agent committing changes to the working copy

That’s the tricky part: we need to create a new mode of operation for ICE where people want the benefits of the server version AND the classic distributed mode of working. In this mode the ICE application will work in a new ‘client’ mode. It will only ever get updates from the central repository. Any additions or changes won’t be fed back to Subversion directly; the ICE client will post them into the ICE server just like any other user.

This will require some more coding, but probably not as much as it would have taken to get the ICE server working any other way and it opens up the possibility that we can replace Subversion and use a simpler version control system, possibly of our own devising in future. So a future model might have the ICE server acting not only as interface for humans but for other ICE systems.

Figure 3: ICE Client mode: Users can update their local repository but all changes go via the ICE server. We will automate this so it is seamless for users.

Having made this architectural decision we can press on with testing the ICE server straight away, even without making any changes to the client version. Here’s the plan, which we will roll through over the next few weeks:

  1. For the repositories which currently allow both server and classic access we turn off the ability for users to commit using ICE classic. If people want to check out their own copy of the content they can, as long as they post their changes back in through the server version manually.

  2. We modify the ICE server so it now assumes that it has THE working copy and only commits changes, never updates. This will mean we can support multiple users with no dramas (that’s the plan anyway).

  3. We will make a new client mode for ICE which automates the process of detecting changes and posting them from the client version of ICE through the ‘front door’ of the server version, pretty much like any other user. Updates will happen as they do now, from the Subversion repository.
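
To show the direction the data flows in this plan, here is a toy model in Python. Nothing in it is real ICE or Subversion code: the class names are invented, the "repository" is just a list of snapshots, and conflicts, authentication and change detection are all ignored.

```python
class Repository:
    """Stand-in for the central Subversion repository: a list of revisions."""

    def __init__(self):
        self.revisions = [{}]          # revision 0 is empty

    def commit(self, files):
        self.revisions.append(dict(files))
        return len(self.revisions) - 1

    def latest(self):
        return dict(self.revisions[-1])


class IceServer:
    """Holds THE working copy: it commits changes but never runs an update."""

    def __init__(self, repo):
        self.repo = repo
        self.working_copy = repo.latest()

    def post(self, path, content):
        # Every change, from a browser or an ICE client, comes in this way.
        self.working_copy[path] = content
        return self.repo.commit(self.working_copy)


class IceClient:
    """Updates from the repository, but posts changes through the server."""

    def __init__(self, repo, server):
        self.repo = repo
        self.server = server
        self.local_copy = {}

    def update(self):
        self.local_copy = self.repo.latest()

    def save(self, path, content):
        self.local_copy[path] = content
        return self.server.post(path, content)


repo = Repository()
server = IceServer(repo)
alice, bob = IceClient(repo, server), IceClient(repo, server)
alice.save("course/intro.odt", "draft 1")
bob.update()
```

After this runs, Bob's local copy contains Alice's draft even though neither client ever committed to the repository directly; only the server did.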

2008-06-27

Tim McCallum shows off Sun of Fedora

Filed under: Uncategorized — ptsefton @ 5:12 pm

Here in the Repository Services group at USQ we have been working on a project funded by ARROW and in partnership with the National Library of Australia. It’s a bit of repository software originally designed to explore the Apache Solr search application.

We looked at Solr last year at USQ, and I blogged about it as part of a consulting job to compare VTLS Vital, Fez and Muradora. Since then, Muradora and Fez have both started using Solr, and there is a plugin for Fedora’s standard text search package to use Solr. As far as I know VTLS have not announced anything to do with Solr apart from their Visualizer product.

The goal of the current project is to create a simple interface to Fedora that uses a single technology (that’s Solr) to handle all browsing, searching and security. This contrasts with solutions that use RDF for browsing by ‘collection’, XACML for security and a text indexer for fulltext search, and in some cases relational database tables as well. We want to see if taking out some of these layers makes for a fast application which is easy to configure. So far so good.

This is not a replacement for VTLS Vital, and is not intended to replace the NLA’s ARROW Discovery service which is also based on Solr.

We now have a working demonstration with content pulled from a number of repositories, and are able to show the main things we set out to achieve. Administrators can set up a new portal which shows a subset of the main index with a few clicks, and we have a security model which can restrict access to metadata and data based on group roles.
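
One way a single Solr request could cover searching, a portal subset and role-based security is with filter queries. The sketch below is a guess at the general shape, not the project's actual code: `q` and `fq` are standard Solr parameters, but the field names `collection` and `allowed_roles` and the function name are assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_solr_params(text, portal_filter, user_roles):
    """Build query parameters for a Solr select request.

    q is the user's search, and each fq narrows the result set: one for the
    portal's subset of the index, one for what the user's roles may see.
    The field name 'allowed_roles' is an assumption made for this sketch.
    Expects at least one role (for example "guest" for anonymous users).
    """
    roles = " OR ".join('"%s"' % role for role in user_roles)
    return [
        ("q", text),
        ("fq", portal_filter),
        ("fq", "allowed_roles:(%s)" % roles),
    ]


params = build_solr_params("thesis", "collection:etd", ["guest", "staff"])
query_string = urlencode(params)
```

The resulting `query_string` would be appended to whatever select URL the Solr installation exposes.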

I will post some more information about the emerging architecture of the application soon, but for now Tim McCallum has put together a demo screencast, which had him slaving over a hot video editor over the weekend (forgive any glitches, it’s his first time). Or you can try it out for yourself (Demo URL may not work after October 2008). If you want to log in, contact me for a password.

Thanks to Oliver Lucido who did most of the development, building on work he did for the FRED project last year with David Levy. Tim has also been assisting, with project coordination from Bron Chandler and stake-holding from Neil Dickson at ARROW and Alison Dellit at the NLA.

2008-06-26

A few words on magic

Filed under: Uncategorized — ptsefton @ 1:00 pm

MJ Suhonos from PKP has patiently explained where I got some things wrong about Lemon8XML in my previous hasty post.

I’d like to pick up one theme from MJ’s post. MJ says (with emphasis by me):

The larger problem, of course, is that L8X is encumbered, in a way, by the common expectation that it should just “magically” work on whatever format the author or user is providing — it is an application that is designed to solve, in part, an infinitely-unsolvable problem. So, the user has to meet the application halfway.

I agree that this expectation that tools should perform magic is a problem. We see this in the HTML export from word processors; they take arbitrary input and turn it into HTML. In the inevitable absence of magic you typically get sub-standard output.

I understand the requirement to try to understand the structure of ad hoc documents if you can, but I don’t think it’s a good idea to encourage people to keep creating them; if L8X has a version of meet me half way which involves direct formatting instead of styles then that will be a step backwards in my opinion. My version of meet me half way would be at least to try to get people to use headings. If they don’t then the structure guesser will step in, try to guess and give them their document back to correct when the inevitable errors occur.

I took a look at the single sample document for L8X on the demo site. It’s clear that the structure-guesser part of the application is going to have to be very clever to work well. It seems, for example, that the goal is to detect captions either before or after a graphic or table even when they have no special formatting. Introducing edge cases like short paragraphs both before and after an image seems to cause it problems, including loss of text, but I could be wrong, again.

(I’ve had a look at the document parser code and it is taking into account paragraph length, and doing some reasoning based on text-size and formatting attributes).
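
To give a feel for what reasoning on those signals looks like, here is a deliberately naive Python sketch. It is not the L8X algorithm; it only uses the same kinds of input (paragraph length, text size, formatting), and the thresholds, the dict layout and the function name are all invented.

```python
def guess_role(paragraph, body_size=12.0):
    """Guess whether a paragraph is a heading or body text.

    `paragraph` is a dict with 'text', 'size' (points) and 'bold' keys.
    The thresholds are invented for illustration.
    """
    text = paragraph["text"].strip()
    short = len(text.split()) <= 10
    ends_like_sentence = text.endswith((".", "?", "!", ":"))
    emphasised = paragraph["bold"] or paragraph["size"] > body_size
    if short and emphasised and not ends_like_sentence:
        return "heading"
    return "body"


heading = {"text": "Results and discussion", "size": 14.0, "bold": True}
caption = {"text": "Figure 1 shows the sample site.", "size": 12.0, "bold": True}
```

Even this tiny example shows the problem: a short bold caption is only saved from being called a heading by its full stop, which is exactly the kind of edge case that a style applied by the author would settle.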

So, even though I had some of the architecture wrong, I still think that Lemon8 XML would be vastly more useful if it had a two part architecture:

  1. Styled word processing document to XML conversion, with the obvious caveat that if you’re turning a generic format into a domain specific one you’re going to be producing stuff that doesn’t use the whole of the target format and may have gaps that need to be filled in.

    Lemon8 XML has its own XML format, but I’m wondering if it couldn’t just use ODF which is a well specified standard, with the ability to give the document back to the user. (Checking with MJ via email about this).

    The goal would be to get as many people using this mode as possible because it is the least work for everyone: no guessing structure required if people can use markup.

  2. Ad hoc-formatting to styled word processing conversion using the best available heuristics to guess structure and give the document back to the author in an improved form. As far as I can tell that’s not a goal for the PKP team, but the code is out there so we could do it, using their algorithm. We’re looking into it.

It is important to help our colleagues who are authoring documents in word processors to use styles. It’s good for them. It will improve their working lives. And it will open the door for them to start dealing with real eResearch and the semantic web. A project like TheOREM-ICE would be impossible with documents like the L8X sample document.

2008-06-23

Lemon8 XML beta released

Filed under: Uncategorized — ptsefton @ 1:12 pm

The PKP people have released a beta of Lemon8-XML (L8X), their journal-oriented word processor-driven XML publishing system.

I tried out the demo server with an ICE test document.

The bad news is that the service had significant problems with my document: it could not locate author metadata, incorrectly identified some ordinary text as being citations, and lost most of the document text, which is obviously a very major issue.

The good news is that MJ Suhonos from PKP was onto me straight away with an email and is keen to work on support for styles in general and ICE styles in particular. (It’s in the FAQ that we will collaborate on this).

If the PKP team can get a decent structure guessing application to work on arbitrary input that would be great, but even better would be to close the loop and give back documents with more structure than you put in. At the ICE project we will help however we can.

If it was me doing this I would break this problem into two parts:

  1. Build a converter that can take structured word processing documents and map them to the NLM XML format used by L8X. ICE offers one well worked out structure for generic documents, others may exist for specific formats.

  2. Build a structure-guessing application to add structure to word processing documents (something which Ian Barnes has been chipping away at for a while).

With both of these in place you can improve documents in the wild as you go; every time someone submits a draft, add styles and give it back to them, rather than trying to guess structure at the end. I would like to see this embedded in the OJS journal management system from PKP so that authors get rapid and continual feedback every time they upload a draft. This would allow some editorial and review processes to take place in an HTML interface as well, rather than via PDF or word processing files.

If you leave L8X as the final step, authors will have little feedback as to how they can improve the structure of their drafts.

My two-part plan would make re-ordering sections in L8X redundant: word processors have outlining tools with which you can reorder content, so why try to do it through an HTML interface?

On a technical note, last time I looked at L8X I concluded that Docvert is a weak link: it tries to use XSLT to guess structure. Our experience with ICE was that XSLT (version one at least) was not a productive way to do this, as the austere functional programming environment in XSLT made the structure-reasoning code very hard to maintain and very slow, so we moved to a more traditional parser written in Python, which is much easier for typical programmers to work with.

2008-06-20

An ICE like ODF based web publishing system

Filed under: Uncategorized — ptsefton @ 3:55 pm

From Kay Ramme at the GullFOSS blog at Sun comes this demo of a wiki-like system using ODF as a document format and OpenOffice.org as an editor.

It seems to be using WebDAV to allow users to edit documents on a server, then convert them to HTML automatically when they load the document in a browser.

Good idea to have the user change a document and automatically render it to HTML on request.

Same idea, in fact as the ICE system.

Some differences with ICE:

  • ICE doesn’t use WebDAV because, well, it doesn’t work with Windows reliably and it doesn’t work with the Mac too well either.

  • ICE doesn’t rely on OpenOffice’s native save as HTML feature which will produce awful results on all but the simplest text documents. A few of several reasons not to use it:

    • It gets list formatting badly wrong.

    • It exports photos at full resolution and puts height and width attributes on them to resize them, meaning that you end up shipping megabytes when you should be shipping kilobytes.

    • It is not styles-based so you have no way of configuring it to do things like use preformatted text in the right places.

  • ICE is styles-driven which means it produces very clean HTML compared to the rubbish that office suites spit out.

  • ICE uses templates to help people apply styles.

  • ICE can deal with Microsoft Word documents and has cleanup code to correct some of the interop issues with OpenOffice.org.

  • ICE has a version-controlled back end courtesy of Subversion so it can be used by distributed teams.

  • ICE can create IMS content packages for courseware.

  • ICE has an Atom Publishing Protocol button which can send stuff to a blog and do a much better job of formatting than the Sun Weblog Publisher addin too.

  • ICE has a plugin architecture and a growing number of hooks for integrating other content types like chemistry data.

  • ICE doesn’t deal with spreadsheets, but we could add that pretty easily.

  • ICE doesn’t have a mechanism to create new pages by linking to a target that doesn’t exist; if we add that we’ll make it a bit smoother than what’s shown in the demo.

  • ICE can be used as a conversion service by other systems.

I could go on.

If you like the demo, check out some of ours, although I note that we don’t have a really basic one that shows what Kay shows in hers. We’ll get on to that.

2008-06-19

Adventures in Geocoding part 2: Embedding data points in documents

Filed under: Uncategorized — ptsefton @ 4:18 pm
[update: the map doesn’t seem to work well in IE - works well for me in Firefox.]

I have been thinking about how to start integrating more semantics into ICE documents. This is only a preliminary look, but it’s very promising so far.

I wrote a while ago about embedding metadata in pictures. This time I look at how one might embed geographical data in a document.

I was tempted to do a dog-poo map of East Toowoomba showing how my hounds like to defecate as far as possible from a rubbish bin so I get to carry the re-used plastic bag further, but I’ll spare you that and show you something else.

Take this cycle hazard for example. I have linked the picture to a web album where you can see it in context. The caption has the location in the text so if you download the PDF you can find the location for yourself.


One of many hazardous grates on Ruthven Street Toowoomba (-27.590334, 151.948166)

Or I could point out another dangerous place where the cycle lane disappears (-27.595667, 151.947174).

If everything is working correctly, you should see a map somewhere in this post (I’m still wavering about where to put it) showing those two points; and if you click on the little pins you will get the description for that point. I doubt it will work in places like Google Reader so click through to the post.

What I’ve developed here is actually not ICE specific. All I have done is adapt a little bit of JavaScript of Simon Willison’s to go through a page and look for HTML elements marked with the class attribute ‘geo’. It’s pretty dumb at the moment, and it relies on a convention that each location has an optional description followed by the coordinates in brackets. It only handles decimals, not degrees and minutes, and would spit the dummy if you said 27.6045° S instead of -27.6045.
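
The convention itself is easy to express in code. The sketch below is in Python rather than the JavaScript that actually runs on the page, and the pattern and function name are my own illustration of the convention, not a port of the script: an optional description, then signed decimal latitude and longitude in brackets.

```python
import re

# Optional description, then "(latitude, longitude)" as signed decimals.
GEO_PATTERN = re.compile(
    r"^\s*(?P<description>.*?)\s*"
    r"\(\s*(?P<lat>-?\d+(?:\.\d+)?)\s*,\s*(?P<lon>-?\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_geo(text):
    """Return (description, lat, lon), or None if the text doesn't fit the convention."""
    match = GEO_PATTERN.match(text)
    if match is None:
        return None
    return (match.group("description"),
            float(match.group("lat")),
            float(match.group("lon")))


point = parse_geo("One of many hazardous grates on Ruthven Street Toowoomba "
                  "(-27.590334, 151.948166)")
```

Text that uses degree signs and hemisphere letters does not match the pattern, so `parse_geo` returns `None` for it rather than guessing.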

To use this in ICE I had to set up some Javascript stuff to load everything in, install it in the blog server and so on, which took me a stupid amount of time, but in the documents themselves it couldn’t be easier. I defined a new style called i-geo (i is for inline) and ICE automatically converts that to HTML spans with class=geo when I generate the HTML.
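To make the convention concrete, here is a rough sketch in Python of the same idea. It is not the Javascript that runs on this blog; the names `parse_geo_text` and `GeoSpanCollector` are made up for illustration, and it assumes each class=geo element contains an optional description followed by "(latitude, longitude)" in signed decimal degrees.

```python
import re
from html.parser import HTMLParser

# Optional description, then "(lat, lon)" in signed decimal degrees.
GEO_PATTERN = re.compile(
    r"^\s*(?P<desc>.*?)\s*"
    r"\(\s*(?P<lat>-?\d+(?:\.\d+)?)\s*,\s*(?P<lon>-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.DOTALL,
)

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}


def parse_geo_text(text):
    """Return (description, lat, lon), or None if the text does not fit the convention."""
    match = GEO_PATTERN.match(text)
    if match is None:
        return None  # e.g. "27.6045° S" style coordinates are not handled
    lat, lon = float(match["lat"]), float(match["lon"])
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return (match["desc"], lat, lon)


class GeoSpanCollector(HTMLParser):
    """Collect the points found in elements whose class list includes 'geo'."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # greater than 0 while inside a class="geo" element
        self._buffer = []  # text fragments seen inside the current geo element
        self.points = []   # parsed (description, lat, lon) tuples

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif "geo" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            point = parse_geo_text("".join(self._buffer))
            if point is not None:
                self.points.append(point)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)
```

Feeding the generated HTML to a `GeoSpanCollector` and reading its `points` list gives the data a map script would need for its pins. Anything that does not match the pattern is silently skipped, which mirrors the "pretty dumb" behaviour described above.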

By coincidence, there was a post this week from Roderic Page on mining PDFs for geographical data. Great stuff. It’s very like the work that Peter Murray-Rust’s group and others do with mining chemistry data from PDFs. But there are problems:

The service uses a bunch of regular expressions to try and extract latitude and longitude pairs from the text (needless to say, there are nearly as many different ways to write a latitude and longitude as there are authors).

http://iphylo.blogspot.com/2008/06/from-pdfs-to-google-earth.html

What we want to do in ICE is provide authors with easy-to-use tools so they can unambiguously encode data and validate it before they hit the ‘publish’ button. One way we plan to do this is to adapt an application like Roderic’s tool. In this case I’d point it at my document and it could tag all the coordinates I’ve got and normalize them to my preferred method of expressing coordinates, then mark them up in some way. Ultimately this will be more robust than these fragile after-the-fact scraping services. My document will be able to advertise its own meaningful content, not cling to it jealously until it is exhumed by an application later on which has to pry it out of its cold dead PDF-fingers.

We’re going to do something like this with the TheOREM project, too. It’s in the work plan to run the OSCAR chemistry-sniffer-outer application over documents and get it to mark all the bits of chemistry, as well as give its automatic sanity check; once that’s done we can start pushing out chemistry with built-in semantics.

Now, I bet if Bruce D’Arcus is reading this he’d be saying ‘Use the new metadata support in ODF 1.2’ and I will investigate that. But given the user base I deal with, an OpenOffice.org-only solution is not optimal; we also need a solution that will work for groups who use other tools such as Microsoft Word. The style-based microformat approach is one such interoperable mechanism. Styles work, but are a bit tricky to apply. I like simple links even better.

In the geo-world geohash looks pretty cool. Geohash is an algorithm which can turn this:

graphics2

Into this short URL: http://geohash.org/r7h51ehscv0g

I have set up my little Javascript experiment so that if I add a link it will automatically push a pin into my map. The Geohash algorithm is open, so I don’t need the Geohash service to use it. I found an open source library easily enough, and the URL makes a perfectly good identifier in my opinion. Yeah yeah it’s got the http protocol on the front but it’s a unique string for a point on the earth and more importantly I can use it in pretty much any modern editor.
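Because the algorithm is open, decoding a hash locally takes very little code. The following is a minimal sketch in Python of standard Geohash decoding, not the open source library mentioned above; the function names `geohash_decode` and `geohash_from_url` are hypothetical, and the second one simply assumes the hash is the last path segment of the URL.

```python
# Geohash uses its own base-32 alphabet (no a, i, l or o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_decode(geohash):
    """Return the (lat, lon) at the centre of the cell named by a geohash."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    is_lon = True  # bits alternate, starting with longitude
    for char in geohash.lower():
        value = BASE32.index(char)  # raises ValueError for an invalid character
        for shift in range(4, -1, -1):  # five bits per character, high bit first
            bit = (value >> shift) & 1
            rng = lon_range if is_lon else lat_range
            mid = (rng[0] + rng[1]) / 2
            if bit:
                rng[0] = mid  # keep the upper half of the interval
            else:
                rng[1] = mid  # keep the lower half of the interval
            is_lon = not is_lon
    lat = (lat_range[0] + lat_range[1]) / 2
    lon = (lon_range[0] + lon_range[1]) / 2
    return (lat, lon)


def geohash_from_url(url):
    """Take the last path segment of a link, assumed here to be the hash itself."""
    return url.rstrip("/").rsplit("/", 1)[-1]
```

Something like `geohash_decode(geohash_from_url("http://geohash.org/r7h51ehscv0g"))` then yields a latitude and longitude pair that a map script could use for a pin, or that a print view could show in human readable form. Each extra character narrows the cell, so longer hashes give more precise points.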

So I could use a simple link to point out a place where the road is in terrible condition and have that point show up on my map. I don’t have to type in coordinates or mess with styles. I just link. ICE has a nice feature where it footnotes all the links in my documents for the PDF version, so if you grab the PDF version of this page you’ll see the information is there in a usable form in print too. If we wanted to get really fancy it could decode the Geohash into a human readable format for the print view.

What I’m thinking about now is a framework for semantic markup in word processors and beyond that takes into account all the prior art (smart-tags in Word for example) and the practical realities of mixed-application workgroups and a Microsoft-heavy world. I might try to put something together for the forthcoming e-Research in the Arts, Humanities and Cultural Heritage workshop. We have a part-written paper about embedding metadata in documents lying around that may serve as a base.

2008-06-16

More on negative click or net benefit repositories

Filed under: Uncategorized — ptsefton @ 2:03 pm

So the conversation that Chris Rusbridge started about low-effort repositories rolls on. Chris summarizes some of the responses, including mine, and broadens the discussion to bring in some of the stuff that Andy Powell has been saying:

Andy wants repositories to be more consistent with the web architecture. He spoke at a Talis workshop recently; his slides are here (on Slideshare, one of his models for a repository).

This reminded me that earlier this year people in my network were talking about Andy’s keynote at VALA. We responded to the ripples running through the Oz-repos community by putting a project proposal to ARROW in Australia to start working on a repository ingest application that is much more ‘of the web’ than those we have now.

The ARROW board didn’t approve that one. I’m sure it wasn’t just the name that was wrong, but I gather that was not popular. And it was a truly stupid name.

I had to think of something quickly so I called it VICE-SQUAD in the spirit of highly contrived acronyms that seems to pervade the ARROW community.

VICE-SQUAD stands for VITAL-compatible Integrated Content Environment-driven Service-oriented Queryable User-friendly Application for Data-acquisition.

Here’s a bit of the proposal we put, which seems to be along the lines of what Chris from the Logical Operator blog suggested in response to Chris R. From our proposal:

The goal of this project is to build a smart user-friendly repository ingest system for VITAL and/or other Fedora based repositories, which will be implemented in the Integrated Content Environment (ICE) service framework. The system will be released as open source software. The application will be a stand-alone ingest system with back-end coupling to ICE.

The project will attempt to create an innovative interface for repository ingest which is quite different from other approaches, allowing users to upload content into a working repository, or workbench from where it can be shared with the world (sharing with defined groups is out of scope for this project but will be dealt with in a separate USQ project) and/or submitted for ingest into the repository; ie pushed over the curation boundary1.

It will consist of three interfaces:

  1. A dead-simple user interface for academics to share their work as quickly as possible and tag it with free-form metadata. They will upload items to a workbench where they will be able to work on them further, or merely mark them for ingest into the repository.

    (see this blog post from the JISC repositories interest group for some thinking along the same lines, with pointers to a commercial service called box.net which could serve as a model for the sharing-features proposed here if adapted to an academic context.)

  2. A graphical user interface for repository staff or advanced users to edit MODS metadata for a record and turn the user’s initial tags into formal metadata, including the ability to edit existing metadata records from VITAL.

  3. A seamless tie-in to a structured authoring environment, so that papers authored in such an environment can be sent to a repository with a single click

In addition to the three interfaces there will be behind-the-scenes ‘smarts’ that can extract metadata from documents and produce HTML and PDF automatically, using technologies already developed by USQ.

I think the time has come for someone to build a repository which has the simple ePrints approach to collecting metadata, with an option to make it even simpler and just go with tags if that’s all the energy the depositor can muster.

Our proposal goes on to talk about MODS and MARC and METS but I think maybe the time is right to do RDF, especially if the Bibliographic Ontology makes it into Zotero. And we should look at ORE support rather than bother with METS.

For those who care to add more, higher-quality metadata (and often this is a librarian tidying up later), there needs to be just a little bit more smarts than ePrints or DSpace offer with their flat metadata, in areas like research affiliation and researcher identity, stored in RDF, with an option to serialize it in other metadata formats as required.

While we didn’t get that project up with ARROW we will have another opportunity to build on the forthcoming TheOREM-ICE work.

We have a big need for simple sharing in ICE right now, and I imagine that this will be true for thesis writing too. Wouldn’t it be great to share your PhD draft with reviewers and draft-readers in a simple way?

One thing I’d like to do is to turn on document sharing via an obscure non-guessable URL so that people can drop in and comment on my documents using ICE’s inline annotation systems without authentication. Or for more formal collaboration, I want to be able to create ad hoc workgroups preferably using a single sign on service of some kind. Once we get through some of the nasty issues we’re having with the ICE 2 beta version we will no doubt start adding those collaborative features.
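The non-guessable URL part is the easy bit. As a minimal sketch, assuming a made-up `/share/` path and a hypothetical helper called `make_share_url` (neither is an existing ICE feature), a random token from Python’s standard `secrets` module would do:

```python
import secrets


def make_share_url(base_url, token_bytes=32):
    """Build a sharing link ending in a random, URL-safe token.

    The "/share/" path is an illustrative assumption, not an ICE convention.
    """
    token = secrets.token_urlsafe(token_bytes)  # cryptographically strong randomness
    return f"{base_url.rstrip('/')}/share/{token}"
```

Anyone holding the link can get in, so this suits informal draft-reading rather than anything that needs real access control; the token would also have to be stored against the document so that the link can be revoked.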

Then when TheOREM kicks off we’ll have an ICE-to-repository gateway pumping content into DSpace, Fedora and ePrints.

What’s needed as well are some simple services to let people upload stuff and push it out. ICE already lets you push to a blog via ATOM (all the posts here are done that way), but we could add SlideShare and Flickr and suchlike as additional services, as well as a simple web sharing interface that is less controlled than the Institutional Repository. As Peter Murray-Rust says: Dont use Institutional Repositories put it on the web.
