PT’s blog

2008-05-20

Please comment on this abstract

Filed under: Uncategorized — ptsefton @ 9:44 am

I’m writing a paper that I hope to submit for the eResearch Australasia 2008 conference if I get it done in the next three days.

I’m putting the current draft of the abstract up here for comment. What do you think? It’s due on Friday, but there’s still time to change things around a bit. For those of you who are in the ICE research group, you can comment using the ICE-one server; everyone else please use the comments below.

Working title: eResearch for Word users

Abstract: This paper documents the plight of the ‘average’ modern researcher as they apply their academic writing skills in the new world of eResearch.

We might expect researchers to have mastered some of the basic generic writing tools: an office suite with a word processor; the ability to generate charts from tables of data; a reference manager that can insert citations; and tools of their discipline like statistics packages.

But the ‘ordinary’ researcher who tunes in to the clamour about ideas and tools from a conference like eResearch Australasia could easily be overwhelmed by the gap between the obvious potential and their own command of the technology they have to hand.

Nine things to which a tuned-in researcher might aspire: (a) to share data with colleagues, (b) to collaborate on semantically rich documents which include appropriate data visualizations, (c) to blog their research as it happens, (d) to annotate data and works in progress, (e) to submit to journals, (f) to deposit appropriate copies of papers into various discipline and institutional repositories, and not just in PDF format but (g) in HTML, with rich interactivity and links to their data. They might also aspire to (h) understand some of the services available to them on the World Wide Web, to become citizens of ‘Web 2.0’, but without compromising (i) the preservation of their data and their writing, and without accidentally infringing copyright or choosing a doomed data format.

The question is: how do we get there from here? The starting point is using Microsoft Word with references in EndNote, emailed around a workgroup then sent to a publisher. The goal is to collaborate on a document which has embedded rich semantics, such as, say, geographical data points that can be displayed on maps and overlaid with data from other sources. The document needs to be viewed on the web with interactive maps, and annotated, tagged and commented upon, as well as being distributed as a traditional paper paper and stored in the dreaded PDF file. Finally it must be automatically deposited in appropriate repositories, one of which is a publisher’s review queue.

Focussing on the writing process, this paper explores some of the aspirations listed above and suggests some practical advice for researchers and their support staff. There is a discussion at this point about the Integrated Content Environment (ICE), an academically focussed collaborative content management system with integration into repository systems, which can help with some of the aspirations of the modern eResearcher, but with a lot of work still to do. Other tools are also considered and found wanting.

The conclusion suggests some more areas for research and development, addressed to research funding bodies both in the Australasian context and globally. How can our researchers get there from here?

2008-05-13

More on ODF and OOXML

Filed under: Uncategorized — ptsefton @ 4:42 pm

I posted yesterday about document formats and applications.

Today a couple of additions and a correction.

I left the Sun ODF plugin for Microsoft Word off the list. So here’s an update of my summary table. I have not been able to test it yet.

Updated converter table

All the converters here apart from the last two are using the same technology: ODF Add-in for Microsoft Word, Excel, and PowerPoint

| Platform | Application | What does it do? |
|---|---|---|
| Windows | ODF in Microsoft Word 2003+ | The ODF add-in will read .odt files and turn them into .docx. It’s slow. (I couldn’t make it run on my version of Word 2007 because of what I think are conflicting versions of .NET) |
| Windows / Linux | OOXML in OpenOffice.org or another ODF aware application (but see below: are there any?) | OpenOffice.org Ninja; a little program that intervenes when you click on a .docx file and converts it to .odt, slowly. |
| Windows / Linux | OOXML in Novell’s version of OpenOffice.org Writer | Uses the ODF add-in and allows you to edit a .docx file. Open and save are, you guessed it, slow. |
| Mac OS X | Microsoft Word | Currently no options for reading ODF AFAIK |
| Mac OS X | NeoOffice | Contains the Novell plugin, so you can open and save .docx files (slowly, of course). |
| Windows, OS X, Linux | OpenOffice.org Writer version 3 (currently in beta) | Contains a new, different converter which is much faster than the ODF add-in. But Sun are not aiming to provide round-trip editing of .docx files. This is intended to be an import filter only. (Not based on the ODF add-in) |
| Windows | ODF in Microsoft Word | Sun’s ODF plugin for Word. (Uses StarOffice code) |

Revised view: how well does the ODF add-in work?

Yesterday I wrote:

You know, I’m pleasantly surprised to be reporting that, using NeoOffice on the Mac, a complex ICE test document seemed to round trip from Neo to Word 2008, where I made a few changes, and then back to Neo with no visible problems. This ODF add-in thing has improved a lot since the first time I tried it.

I was a bit hasty there. I tried again with an ICE sample document, saving it as .docx and reloading it. Most of the formatting is fine, with a few oddities in the numbering, but styles support is very lacking.

The biggest issue is that list styles get lost: the associated paragraph style name is kept in the OOXML but none of its formatting is specified. The result is a document which looks like it is using styles but isn’t really.

This is not that big a problem if you are using a known stylesheet like we do with ICE. We have the style name, so we can write macros or file-fixers to go through the document and repair it so that it does styles again. We already have a list-repair macro for Word and there’s code for Writer to create new styles, so this should be doable. Who knows, we may even be able to help fix the ODF add-in converter.
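Since the repair depends on style names surviving, one way to check a converter is to list the named paragraph styles in an .odt before and after a round trip. What follows is only a rough sketch in Python, not the ICE fix-up code: the function name and the style names in the usage note are made up for illustration, and it assumes the usual ODF layout in which paragraphs in content.xml often point at automatic styles (P1, P2 and so on) that name the real style as their parent.

```python
import xml.etree.ElementTree as ET

# Standard ODF namespace URIs.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def q(prefix, name):
    """Build the {namespace}name form that ElementTree uses internally."""
    return f"{{{NS[prefix]}}}{name}"


def named_paragraph_styles(content_xml):
    """Return the named styles used by paragraphs and headings.

    Automatic styles are resolved to their parent style, so a paragraph
    styled 'P1' whose parent is 'li1b' is reported as 'li1b'.
    """
    root = ET.fromstring(content_xml)
    parents = {}
    for st in root.iterfind(".//office:automatic-styles/style:style", NS):
        parents[st.get(q("style", "name"))] = st.get(q("style", "parent-style-name"))
    used = set()
    for tag in ("p", "h"):
        for el in root.iter(q("text", tag)):
            name = el.get(q("text", "style-name"))
            if name is not None:
                # Fall back to the name itself when there is no parent.
                used.add(parents.get(name) or name)
    return used
```

Run it over content.xml from the original and from the round-tripped file; the set difference `before - after` is the list of style names the converter lost. Note that ODF encodes some characters in style names, so the names reported may not match the display names exactly.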

This is the point of the work we’ve been doing to make sure that our users can have interoperable portable documents. I recommend checking these converters with documents from your own environment before trusting them.

We can report this issue and maybe look into helping with the converter. Daniel? You listening?

Claims about ODF support are typically meaningless

Filed under: Uncategorized — ptsefton @ 4:31 pm

I know I’m repeating myself a bit. But as you know there’s a Wikipedia page about applications that support the Open Document Format and it gets quoted and linked to. A lot.

I linked to Peter Murray-Rust yesterday, and one of the commenters on his blog also talks about the number of implementations.

OpenOffice.org is only one of the tools that can generate it, there are several others as well as various converters (e.g. SUNs MS Office plugin, Clever Age ODF translator) available for MS Word users.

But folks, as of 2008-05-13 mid afternoon Toowoomba time, that Wikipedia page is not much help to people who might want to, like, you know, work on real documents. GoogleDocs, for example, will throw away your styles if you happen to care about them. And why would you? It’s a web two-dot-oh world now; what do we need with styles?

I’m going to get around to editing that page, but I’m not an expert so I have held off.

I just know that it’s not useful. Lots of things on that list don’t work with my ODF documents, all created with applications derived from OpenOffice.org.

As part of my homework for when I do engage the page I went to the spec.

What does the ODF spec (v1.1) say about conformance? You need only read the first sentence of this extract, which I have highlighted for your convenience.

The OpenDocument specification does not specify which elements and attributes conforming application must, should, or may support. The intention behind this is to ensure that the OpenDocument specification can be used by as many implementations as possible, even if these applications do not support some or many of the elements and attributes defined in this specification. Viewer applications for instance may not support all editing relates elements and attributes (like change tracking), other application may support only the content related elements and attributes, but none of the style related ones.

Even typical office applications may only support a subset of the elements and attributes defined in this specification. They may for instance not support lists within text boxes or may not support some of the language related element and attributes.

So you don’t have to do anything in particular to claim to support ODF. Maybe just allowing the top level element would be enough?

We clearly need to add some detail to the Wikipedia page about who supports what specifically.

My working definition of support for ODF (the text format is actually what I care about) is people using our word processing templates to edit files interoperably with Writer and some other application. And you know what? There’s nothing outrageous in any of our templates, but the only applications that work are ones that are based on the original Star Office code base that spawned OpenOffice.org.

This is unsurprising. The file format was built around OpenOffice.org. Lots of people point out that nobody but Microsoft will be able to build an office suite that supports OOXML. People, this is the same process at work.

Let me tell you about a quick experiment I did. I looked up the bit in the ODF spec about lists. Apparently you can have a thing called a list-header at the start of the list.

A list header is a special kind of list item. It contains one or more paragraphs that are displayed before a list. The paragraphs are formatted like list items but they do not have a preceding number or bullet. The list header is represented by the list header element.

So I made a .odt using NeoOffice, put in a list using the default style List 1, saved it, unzipped it and added a list-header to the start of the list, then rezipped it and opened it in NeoOffice. Hmm, well it does display, but the Writer interface seems to know nothing about list headers. The only way to create one seems to be outside of the application. Having a feature like this creates some very serious weirdness. You can load up a document with multiple paragraphs in the list header and they are preserved, but if you try to add one then you just get a normal list item.
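The unzip, edit, rezip experiment could be scripted along these lines. This is a sketch in Python rather than a record of what was actually done by hand: the function names are invented, and the edit is a plain string insertion after the first text:list opening tag in content.xml, which leaves everything else in the file byte-for-byte alone. The zip is copied member by member in the original order so that the mimetype entry stays first and keeps its original compression setting.

```python
import re
import zipfile
from xml.sax.saxutils import escape

# Opening tag of a text:list element. The pattern deliberately does not
# match text:list-item, text:list-style or a self-closing <text:list/>.
LIST_OPEN = re.compile(r"<text:list(?:\s[^>]*)?(?<!/)>")


def add_list_header(content_xml, header_text):
    """Insert a text:list-header as the first child of the first list."""
    match = LIST_OPEN.search(content_xml)
    if match is None:
        raise ValueError("no <text:list> element found")
    header = (
        "<text:list-header><text:p>"
        + escape(header_text)
        + "</text:p></text:list-header>"
    )
    return content_xml[: match.end()] + header + content_xml[match.end():]


def rewrite_odt(src_path, dst_path, header_text):
    """Copy an .odt, patching content.xml and leaving other members as-is."""
    with zipfile.ZipFile(src_path) as zin, zipfile.ZipFile(dst_path, "w") as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "content.xml":
                text = data.decode("utf-8")
                data = add_list_header(text, header_text).encode("utf-8")
            # Passing the ZipInfo keeps each member's name and compression type.
            zout.writestr(item, data)
```

Something like `rewrite_odt("list.odt", "list-with-header.odt", "Header 1")` then gives a file to open in Writer or NeoOffice and see what the application makes of the header.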

Now I know this is not quite in line with what I’m saying about the file format being largely derived from OpenOffice.org. I don’t know the history of this element but OpenOffice.org doesn’t support it in any useful way. Looks like something added by a standards committee raised on SGML with a clear idea of what makes a ‘good’ document format and not much consideration about what makes a usable word processor interface but that’s just a guess.

Let’s talk about the formatting I got from my experiment. Is this what the spec really means?

graphics1

See how the list header (Header 1) is indented? Not formatting I can imagine using. But it seems to be what the spec says: ‘The paragraphs are formatted like list items but they do not have a preceding number or bullet.’

How many of the ODF cheer squad have read the standard? Dealt with document interop issues? Me, I have only glanced at the OOXML spec and so I don’t go around telling people about how bad it is, but on the few occasions I have looked at ODF and tested support for various things in Writer I have found problems or lack of support in OOo. The good thing with OpenOffice.org, of course, is that you can tell Sun about the bug and it will likely get fixed sooner or later, particularly if you can rally enough supporters to vote.

The bottom line is that if you want to work with ODT you have to check whether the other applications are going to support the bit you want to use. I will get around to changing that Wikipedia page to be more useful.

2008-05-12

Some comments on OOXML, ODF and Microsoft Word

Filed under: Uncategorized — ptsefton @ 11:31 am

There is a conversation about file formats and word processors going on between Peter Murray-Rust and Glyn Moody.

Peter made some comments about wanting access to chemistry publications in Word format so he can better extract chemical information embedded in them, which sparked some push-back regarding the Microsoft OOXML format.

Unsurprisingly I have a few opinions on all of this. In this post I’ll:

  1. Outline the state of conversion software between OOXML and ODF. Specifically, .odt and .docx, the word processor formats.

  2. Look at what we can do to support researchers right now.

  3. Comment on the issues Glyn Moody raises re Microsoft lock-in and OOXML.

File format conversions

Peter writes:

I may be optimstic but it [OOXML] can also be converted to ODT. See the WP entry:

http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=1078

Yep. He’s optimistic. There are two freely available converters that I know of, but both of them have a number of practical issues, starting with which one to choose?

The most mature converter is the ODF Add-in for Microsoft Word, Excel, and PowerPoint which is open source but sponsored by Microsoft. I put together a table to try to get my head around the options. All the converters use the ODF add-in apart from the last one. I only care about the word processor format, so that’s what I looked at.

| Platform | Application | What does it do? |
|---|---|---|
| Windows | Microsoft Word 2003+ | The ODF add-in will read .odt files and turn them into .docx. It’s slow. (I couldn’t make it run on my version of Word 2007 because of what I think are conflicting versions of .NET) |
| Windows / Linux | OpenOffice.org or another ODF aware application (but see below: are there any?) | OpenOffice.org Ninja; a little program that intervenes when you click on a .docx file and converts it to .odt, slowly. |
| Windows / Linux | Novell’s version of OpenOffice.org Writer | Uses the ODF add-in and allows you to edit a .docx file. Open and save are, you guessed it, slow. |
| Mac OS X | Microsoft Word | Currently no options for reading ODF AFAIK |
| Mac OS X | NeoOffice | Contains the Novell plugin, so you can open .docx files (slowly, of course). |
| Windows, OS X, Linux | OpenOffice.org Writer version 3 (currently in beta) | Contains a new, different converter which is much faster than the ODF add-in. But Sun are not aiming to provide round-trip editing of .docx files. This is intended to be an import filter only. |

So how do the two technologies perform?

You know, I’m pleasantly surprised to be reporting that, using NeoOffice on the Mac, a complex ICE test document seemed to round trip from Neo to Word 2008, where I made a few changes, and then back to Neo with no visible problems. This ODF add-in thing has improved a lot since the first time I tried it.

But I think Microsoft’s approach of sponsoring an open source project instead of doing the work themselves is very dodgy. The ODF converter site is hard to use, lots of the documentation is in Word and Excel downloads instead of proper web pages and it requires an obsolete version of the .NET framework to run. The site is also big on frightening lists of incompatibilities between the formats.

All this seems to me to be designed to say ‘Told you so! Word and ODF are not compatible cos they use fundamentally different document models.’

There will never be 100% translation of all the stuff you can do in the two formats because they are both built around existing implementations that had different feature sets and different document models. Glyn Moody puts it like this:

They probably won’t ever work very well because of the proprietary nature of the OOXML format: there’s just too much gunk in there ever to convert it cleanly to anything.

http://opendotdotdot.blogspot.com/2008/05/word-in-your-ear.html

You can call it gunk but it’s just the reality of a legacy format. Me, I’m glad to have all the gunk out in the open.

But it’s not just OOXML that’s gunky. Take a look at the list model in ODF some time; it has this hierarchical list style model that comes straight out of OpenOffice.org version 1. The user interface in Writer has never been any good at dealing with these list styles because the whole thing is just wrong for a word processor. KWord just ignores them, meaning that it won’t interoperate with Writer even though the developers claim ODF support. Gunk.

Back to testing the ODF converter add-in. Using an ICE document which is all styles based, the NeoOffice version of the ODF add-in converter seems to pretty much get it right. I’m impressed.

So how about Sun’s importer? In the beta of Writer 3 it has only one advantage over the ODF add-in. It’s fast. Fast but broken: lists and document outlines are an absolute mess. Of course we already have a document repair function for documents that follow the ICE style conventions, so we could work around this with our users, but it’s not ready for use in the real world.

And the politics! Sun have OOXML import in OpenOffice.org version 3 but don’t plan to give you the ability to edit OOXML files directly. And the Microsoft sponsored project takes a similar approach. You can open ODF files but they automatically get turned into OOXML although you can at least save to .odt.

What can we do now?

Now the obligatory ICE-plug.

In the ICE project we have devised a set of styles and associated templates that can be used for interoperable document authoring. Like Peter Murray-Rust I want to support Word because that’s what researchers are using. Unlike Peter Murray-Rust my group has not accepted any money from Microsoft.

Yet.

We have defined a set of structural and semantic styles that you can use in any word processor; then, even if the applications can’t understand each other’s formatting in detail, they can still understand your documents. If you use the ICE toolbar and follow a few guidelines then we can make good quality (X)HTML from your documents and more-or-less move documents from ODF to Word and back.

Not only that, we’re building plugins that understand various semantics, like CML for chemistry, graphs and tables, geographical information.

What we do currently is load Word documents into OpenOffice.org via the old .doc format, save them to .odt (OpenDocument Format Text) and run some fix-up code over them. The fixups do things like remove spurious list-styles that Writer adds and fix the document outline. We have not tackled OOXML, but come Writer version 3 or a version of the Novell plugin that works with mainstream OpenOffice.org then we’ll do the same thing with .docx. As long as the style names are preserved we can repair your document. The usual result for Word users is a pretty good XHTML output but some loss of WYSIWYG for the print view of your document.

ICE is still not that easy for people to try out casually, but we’re working on that: we have the server-based version running in beta now and that’s proving much easier for people to get into.

Some comments on the Microsoft issue

And a note to Peter’s correspondent Glyn Moody who writes:

Word? OOXML??? Come on, Peter, you want open formats and you’re willing to accept one of the most botched standards around, knocked up for purely political reasons, that includes gobs of proprietary elements and is probably impossible for anyone other than Microsoft to implement? *That’s* open? I don’t think so.

XHTML by all means, and if you want a document format the clear choice is ODF - a tight and widely-implemented standard. Anything but OOXML.

http://opendotdotdot.blogspot.com/2008/05/ooxml-for-petes-sake-no.html

ODT is not widely implemented in any meaningful sense. I’d love to be proved wrong, but see my take on this. Really. I’d love to see another word processor that can edit .odt files interoperably with Writer and that is not based on the same code, as most of the contenders are.

Take the example of lists that I raised earlier. To implement an ODF word processor you would have to implement something that dealt with hierarchical lists as specified in the standard, and work out how to cope with all the mess that results from a system where a paragraph can have a paragraph style as well as a list style and an item level within the list. And there’s a very loose connection between list and paragraph styles, and a very strange mechanism for embedding extra paragraphs in a list item. I have never seen a real-life .odt file that uses the list model properly.

I don’t think ‘anything but OOXML’ is the goal; I prefer ‘stick to the interoperable subset of OOXML and ODF’.

If we stick to the interoperable features that work with current software then back-end systems like an ICE server can transform Word documents into ODT and XHTML for archiving, without users having to worry about it. Even though as far as I can tell there’s only one code base that really works with ODT at least it is open source.

But at the moment we do tell our ICE power-users that they’re better off with Writer than Word even if they have to migrate from EndNote to Zotero.

And Microsoft itself seems happy to drive users to OpenOffice.org. I can pick which word processor and operating system I want to use, and I don’t like going near Word 2007 for Windows with its stupid ribbon thing. Too hard to find anything; Writer is much more like Word now than Word is and it’s easy to download and install.

And on the Mac, while Word 2008 doesn’t have the ribbon it also misses out on VBA, meaning that we can’t support it in our ICE system unless we rewrite our toolbar in AppleScript or whatever it’s called.

It’s unlikely there’s a business case for that. I imagine that the same will be true in lots of organizations where templates and tie-ins to corporate systems will no longer be supported on the Mac but it is possible to port macros to OpenOffice.org and work cross platform. For example, the current ICE toolbar code is actually a single Basic code-base that works on both Word and Writer.

And a final comment about the Microsoft sponsored work that PMR’s group are doing. I hope the tools they develop are going to work with OpenOffice.org and ODF as well, at least to the point that we can process the data in ICE.

2008-05-02

Adventures in geocoding part 1: The Toowoomba BUG Cycle Hazard Investigation Team does Ruthven Street

Filed under: Uncategorized — ptsefton @ 6:43 pm

One of the things we’re thinking about at USQ is how researchers might integrate data into their publications. This will be key to the Australian National Data Service.

I have posted a few things here before about stuff like embedding chemistry, maps, and graphing. I’ll start keeping track of those and other posts under the del.icio.us tag DataIntegrationForDocuments for want of a better tag.

(You know, I don’t bother with WordPress categories; I use del.icio.us as an outboard organizer for my site and have one less thing to worry about when I migrate it. Hmm, I wonder if my automated backup of my del.icio.us tags is still working.)

I think one of the big kinds of data integration we need to work on is getting geo-spatial data hooked in to documents. This is potentially useful for lots of disciplines and there are lots of tools out there waiting to be mashed up.

But as usual there are issues. Just like the stuff I talked about recently regarding metadata for images and potential vendor lock-in, we need to be very careful about how we encode our data for the long term, and use the sexy online services wisely and within their licenses.

Licenses are really important; you can’t just grab online maps and use them however you like. If you want preservation-quality maps it’s a whole different issue from just throwing up a Google map. More on that in the future.

I wanted to link to an example of some great mashing up of data and maps from the Bidwern project at ANU, but today the map page there is returning an error about keys for Google maps. That helps make my point that we need to think about storing important data independently of any particular service that might go away at any time. I’m sure that in the case of Bidwern the data are safe, but researchers, don’t just go and build a service where the data you need are all Google dependent and stored only on Google’s servers, OK?

Over a series of posts I will look at some stuff I’m learning about how to use documents and pictures with geographical information embedded in them, and think about how we should use services like Google Maps without locking ourselves in. On one level this is about something I did on the weekend, but it’s also relevant to the kinds of things our researchers are wanting to and will be wanting to do. Just be thankful I didn’t plot all my veggie plants on a map and bore you with that.

This time, I report on a rather exciting expedition I went on with the Toowoomba Bicycle Users Group Hazard Inspection Team aka the TBUG HIT Squad. The squad last Sunday was Hugh, David and me. Not sure what Hugh was doing there though ‘cos he rides a trike; I’ll see what I can do about getting him expelled or at least compelled to turn up on a proper conveyance. Really, those recumbent riders should have got the message back in 1934.

We took a slow ride from the southern edge of town to the Oxygen cafe, looking at Ruthven Street with new eyes. I have ridden parts of that hundreds of times in ‘just deal with it’ mode, but it’s an eye-opener to look in detail at all the hazards.

I strapped on a GPS (a three year old antique Garmin eTrex) and a compact digital camera.

Hugh kept notes and both David and I took pictures.

I have been able to synchronize the tracklog from my GPS with the pictures, by getting the clocks in both cameras set to within a minute or so of the spot-on GPS. We’ll see more and more software and hardware that can make this a smooth process, but for now I had to use a couple of command-line tools to get the job done. I will leave boring you with the how-to stuff for another post.
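The idea behind that synchronization is simple enough to sketch, even without the how-to. The Python below is not one of the command-line tools used; it is a minimal illustration that assumes the tracklog has already been read into a time-sorted list of (time, latitude, longitude) tuples and that the camera clock error is known. The function name, the `camera_offset` and `max_gap` parameters and the five minute default are all invented for the example.

```python
from bisect import bisect_left
from datetime import timedelta


def locate(track, photo_time, camera_offset=timedelta(0),
           max_gap=timedelta(minutes=5)):
    """Estimate where a photo was taken from a GPS tracklog.

    track         -- list of (datetime, lat, lon), sorted by time
    photo_time    -- timestamp recorded by the camera
    camera_offset -- how far the camera clock runs ahead of GPS time
    max_gap       -- give up if the surrounding fixes are further apart

    Returns (lat, lon), or None when the photo falls outside the track
    or inside a gap that is too long to interpolate across.
    """
    t = photo_time - camera_offset          # convert to GPS time
    times = [point[0] for point in track]
    i = bisect_left(times, t)
    if i < len(times) and times[i] == t:    # exact hit on a track point
        return track[i][1], track[i][2]
    if i == 0 or i == len(track):           # before the start or after the end
        return None
    (t0, lat0, lon0), (t1, lat1, lon1) = track[i - 1], track[i]
    if t1 - t0 > max_gap:
        return None
    f = (t - t0) / (t1 - t0)                # fraction of the way from t0 to t1
    return lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0)
```

Straight-line interpolation between fixes is a reasonable guess at bicycle speeds when the GPS logs every few seconds; the result would then be written into each picture’s metadata by whatever tagging tool you trust.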

Anyway the result is some very arty pictures complete with embedded metadata.

The view from Digikam is this:

graphics1

There’s Hugh at 612.5 metres above sea level. We reckoned he could have gone under that tray-back but he took his life in his hands and rode around.

Now that I had my geotagged piccies I uploaded them to Flickr. But guess what? While Flickr recognized the tags I had embedded (What/ToowoombaBUG/HazardInspectionTeam), it didn’t like the geo-metadata. It wants the metadata expressed as tags in its own Flickry way.

Google’s Picasa site, on the other hand, recognizes the geographical data nicely as long as you don’t let its uploader software scale down the images. But guess what? While it groks my mistyped captions, Google’s service doesn’t recognize the tags.

It’s the usual story. You have to figure out what works and what doesn’t. Mistakes with valuable data could be very upsetting so we need people helping our research communities with this stuff.

By the way, this is an interesting kind of narrative genre. I can type parts of a story into the metadata of the pictures and the computer can sort them for me using the time stamp to assemble the story. But they could be served up other ways, so the captions should try to be clear enough to understand on their own. I know I need practice at that.

Here’s a screenshot of the result:

graphics2

The pictures have metadata in them that should stick with them forever. My insightful comments and skillful photography will no doubt delight road geeks of the future. Google’s Picasa service might go away or I might take them down, but I’m reasonably happy that I have future-proof-ish data that will work with other services in the future. At the very least I can get a paper map or some old fashioned navigational instruments and a clock that knows what time it is in Greenwich and use the latitude and longitude.

When we get ICE version 2 out the door at the end of June we can consider how we might make ICE more aware of geotagged images, and potentially other data, but before that I will post on another quick experiment I did that finds points in web pages and shows them on a map.
