PT’s blog

2008-02-27

Google claims to support ODF over OOXML but Google Docs has awful ODF support

Filed under: Uncategorized — ptsefton @ 10:15 am

I wrote last week about the way supporters of the OpenDocument Format talk up the number of supporting applications, but for our purposes the support is not as widespread as is often claimed.

Now there’s a post on the Official Google Blog, A renewed wish for open document standards. Note the plural: standards.

They talk about supporting the OpenDocument Format, even though support in their Google Docs product is pretty limited. I think this is support in the political sense, not in the sense that matters to users. Here are the usual pointers to Wikipedia lists of supporting applications, most of which I reckon would disappoint if you tried them:

Google supports open document standards and the Open Document Format - ODF, the recognized international standard (ISO 26300). ODF is supported and implemented across the globe, and its communal creation and iteration has helped ensure the transparency, consistency and interoperability necessary in a workable standard.

http://googleblog.blogspot.com/2008/02/renewed-wish-for-open-document.html

But despite the plural in standards, they are not wanting to support the Microsoft-led OOXML standard on its journey through ISO. Maybe these reasons are valid, I don’t have time to investigate:

Google’s concerns about OOXML include, but are not limited to:

  • The limitations on the openness of OOXML format;

  • The lack of proper review as compared to other ISO standards;

  • The continued use of binary code tied to platform-specific features; and

  • Unclear licensing terms for third-party implementers.

http://www.odfalliance.org/resources/Google%20XML%20Q%20%20A%20(2).pdf

Me, I don’t particularly care about the bickering; as a practitioner working with document formats I know that the OpenOffice.org Writer + ODT combination, which AFAIK is the most mature implementation, does not support lots of Word features, whereas OOXML supports all of them. And Word doesn’t support some ODF features. We do the best we can given these realities. And using Google Docs to edit ODT? Forget it.

As far as I can tell, the ODF format is pretty much built around the OpenOffice.org application. See the comment from Gary Edwards on my previous post:

My own take is that ODF suffers from the same application specific problems that will forever doom OOXML. The presentation layers of each format are entirely reflective of the application specific idiosyncrasies, feature sets and layout engines unique to the originating applications.

http://ptsefton.com/2008/02/22/opendocument-format-support-not-as-great-as-the-cheer-squad-keep-saying.htm#comment-64

I don’t really care if it’s an ISO Standard, but it will be important for our work on document interoperability to have the OOXML spec out there, and to have MS Office saving into native XML file formats. For example, if Word import in Writer doesn’t improve then we’ll probably look for funding to extend our ICE system to work with MS Word. And yes, we are active in feeding back problems to the OOo team and we make contributions to the open source community as well as whingeing.

2008-02-26

ICEify this document!

Filed under: Uncategorized — ptsefton @ 4:05 pm

I have been working on some proposals to get us some interesting projects (with attendant money) for the Learning Futures Institute, where I work. To do this I have to work with other people’s templates, which is fine, but it could be so much better if I could use some of the tools we’ve developed on the ICE project.

What happens is, when I open up one of these templates that people supply, I still have my ICE toolbar sitting there (see the demo if you don’t know what I mean). Out of habit, I reach for the shortcut keys for ICE, or hit the toolbar buttons if my hand happens to be on the mouse. So when I want to whack in a bullet list it’s Esc 8 or Esc *. I really like the ICE toolbar interface. Really. I use it all the time.

(I first thought about building it in about 1996; it only took 11 years for me to get around to suggesting it to my team last year. Why take so long? I was afraid of macros: once you write them you are stuck with maintaining them. But it was worth it.)

I’m using an Ubuntu Linux machine at the moment, which means the word processor is OpenOffice.org Writer, a reasonable if uninspiring word processor, but the same thing applies on the Mac or Windows using Microsoft Word.

Now the ICE toolbar is pretty smart, so even though the document is not an ICE document, it goes ahead and creates a style for me, in this case li1b for List Item level 1, with a bullet. If the template designer has bothered with styles for list bullets they might be called List Bullet 1, but then again they might not. There are no standards for this stuff.

What I’m thinking is that most of these funding bodies (or even many journals) don’t really care about the styles in their documents, so long as they look right, so it would be great to be able to get the ICE toolbar to adapt a document to the ICE styles automatically, to ICEify it as it were.

In a lot of cases this would involve renaming Default or Normal to p, and Heading 1 to either h1 or h1n depending on whether it is numbered or not (I’m going to write on the subject of heading numbering soon for the ICE FAQ if you’re interested in why we did it the way we did). Actually a better way to do it would be to create new styles that are based on the existing ones and then search and replace from the old to the new.
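To make the idea a bit more concrete, here is a minimal sketch in Python of how an ICEifier might plan its work. It is not ICE code. The style names (Default, Normal, Heading 1, List Bullet 1, p, h1, h1n, li1b) come from this post; the function names, the plan format, and the guess that Heading N generalises to hN are assumptions made purely for illustration.

```python
import re

# Template style names mentioned in the post that behave like body text.
PARAGRAPH_STYLES = {"Default", "Normal"}


def ice_style_name(style_name, numbered=False):
    """Suggest an ICE style name for a template style, or None if unsure."""
    if style_name in PARAGRAPH_STYLES:
        return "p"
    heading = re.fullmatch(r"Heading (\d)", style_name)
    if heading:
        # Assumption: Heading N maps to hN, plus an 'n' suffix when numbered.
        return "h" + heading.group(1) + ("n" if numbered else "")
    if style_name == "List Bullet 1":
        return "li1b"
    return None  # leave unknown styles alone


def plan_iceify(styles):
    """Plan new styles based on existing ones, rather than renaming in place.

    styles is an iterable of (style_name, numbered) pairs (hypothetical input).
    """
    plan = []
    for name, numbered in styles:
        target = ice_style_name(name, numbered)
        if target is not None:
            plan.append({"new_style": target, "based_on": name})
    return plan
```

Keeping a record of which new style was based on which old one is also what would let a tool put the styles back the way they were afterwards.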

I’m not sure how interactive the process would need to be, but the idea would be to try to disturb the supplied template as little as possible, while making it possible to take advantage of the ICE toolbar, not to mention automated conversion to HTML and PDF and even possibly the ability to embed slides, if you’re going to have to present your proposal. You could even post it to a blog and have it come out in perfect HTML, although I’m not too sure how many funding bodies would appreciate that. (And if one of the proposals I’m working on comes off you will be able to submit a completed application to a repository, for long term preservation).

I covered the manual steps involved in ICEifying a document last year; I might have to have a look-see at how hard it would be to write a simple ICEifier.

Another trick might be for it to put the styles back the way they were when you’re done.

2008-02-22

Anyone know about research into what authoring tools academics use?

Filed under: Uncategorized — ptsefton @ 4:32 pm
[Update: Fixed a typo and added two more names]

I’m in paper-writing mode at the moment which means doing some actual reading-type research. I’m interested in issues like why repositories use PDF and not HTML.

It’s surprising how many papers there are on ‘Library 2.0’ or ‘Repositories and Web 2.0’ that manage to not mention HTML at all, or discuss file formats in technical terms but neglect to talk about what people actually use to write up their research.

XML is good and Microsoft Word is bad is a fairly common analysis. Great. What should I do next?

Found one lovely example of a paper reluctantly defending PDF, yet its own title is mangled in Google Scholar presumably cos it is only available as PDF.

I’m having trouble finding much about the following:

  1. Actual data on what authoring tools people use; we have plenty of anecdotal data that most authors use Word, with pockets of LaTeX in tech disciplines.

  2. Prior art in embedding metadata in word processing documents, inline, in-text, using styles. Found some really ambitious stuff about Web 2.0 but not much on the more modest goal of being able to identify authors’ names reliably.

Does anyone have any sources to get me started? There may be pockets of research I’m not finding through Google Scholar?

Dorothea Salo, you commented here recently and you know a lot about this stuff…

Andrew Treloar, anything come out of DART/ARCHER on this?

Ian Barnes, you’ve looked at this; any sources?

Chris Blackall I know you worry about these things.

Susan Gibbons & team have looked at this and Nathan Sarr is implementing (has implemented?) some tools

Comments are open but I may not get to moderate them over the weekend, so be patient.

Here’s the little collection I have tagged in Zotero so far (lots of detail missing):

Barnes, I., 2006. Preservation of word-processing documents. Retrieved October, 2, p.2006.

Eriksson, H., 2007. The semantic-document approach to combining documents and ontologies. International Journal of Human-Computer Studies, 65(7), p.624-639.

Sally Murray, 2008. Open science, open access and open source software at Open Medicine. Available at: http://www.openmedicine.ca/article/viewArticle/205/104 [Accessed February 22, 2008]. [Grabbed this cos it refers to Lemon8XML]

Tallis, M., Semantic Word Processing for Content Authors.

Witten, I.H. et al., 2002. Importing Documents and Metadata into Digital Libraries: Requirements Analysis and an Extensible Architecture. Proceedings of the European Conference on Digital Libraries, p.390-405.

OpenDocument Format support not as great as the cheer squad keep saying

Filed under: Uncategorized — ptsefton @ 11:31 am

I know I’ve said this before, but I can’t help it. Just because an application has a ‘Save as OpenDocument’ feature does not mean it has useful OpenDocument support.

Benjamin Horst at SolidOffice points to Dispelling Myths around ODF. I don’t think all the myths really are myths (and why do people keep saying that ODF uses SVG? I reckon that’s a myth), but I don’t have time to argue, so I will confine myself to a comment on this bit. Benjamin writes:

My favorite section is where Erwin lists some of the prominent applications that use ODF as their default, or one of their primary, formats. These include KOffice, OpenOffice.org, StarOffice, IBM Lotus Symphony, Corel WordPerfect, Apple TextEdit, Google Docs, and plenty more.

http://www.solidoffice.com/archives/719

I can’t comment on all of those, or the full list on Wikipedia but I do know these three things.

  1. OpenOffice.org and StarOffice share a codebase; they’re really variants on the same application, while IBM Lotus Symphony also uses a lot of the same code, with the menus scrambled and no HTML export. At least they’re likely to interoperate.

  2. Google Docs still has pretty useless ODT support. I just did another quick test and it still (a) produces awful HTML and (b) loses styles when you import ODT and (c) can’t export native Google Docs without significant weirdness.

  3. The current shipping version of KOffice uses ODT in such a different way to OpenOffice.org that interoperability doesn’t work for anything but straight-ahead text. KOffice doesn’t have any implementation of list styles for example. The new version might be better but I have not figured out how to get a copy to try out without building it from source.

I wonder if anyone has looked properly at ODF interop, with a matrix of supported features and notes on how various import/export pairs work? From what I have seen, it would not be a pretty picture.

2008-02-21

XHTML Challenge: Mozilla Seamonkey Composer is not suitable for writing academic papers

Filed under: Uncategorized — ptsefton @ 11:00 am

I have written a few posts about trying to write HTML, preferably XHTML, with various word processing tools. Today I was reminded of the Mozilla Seamonkey project by this post about open source alternatives to proprietary software packages. (Some things on the list are sensible, but suggesting that ‘DocBook’ might be a replacement for FrameMaker is a stretch. DocBook ain’t a software package, it’s a lifestyle.)

Remember Netscape Navigator, which came with email and an editor and all bundled together? That’s what Seamonkey is. I grabbed a copy of Seamonkey for the Ubuntu Linux machine I’m using at the moment and tried out the Composer application.

How would one go writing an academic paper with it?

Some technical stuff follows, but the conclusion is in the title of this post.

The first thing I looked at is how I might mark up a quote. There’s no obvious way to get a blockquote element. If you hit the indent button, then it produces code like this:

<h1>Here's a heading</h1>
And the first paragraph.<br>
<br>
<div style="margin-left: 40px;">Now this is a quote<br>
More quote.<br>
</div>
More text.<br>

This is garbage. There are no paragraphs, only line breaks (<br>), and instead of a blockquote element we have the semantically null margin-left: 40px. What I want from my editor is what the ICE toolbar for OpenOffice.org and Microsoft Word delivers: if I indent (demote) an ordinary paragraph it gives me a quote; hit the same button when I’m on a heading and it gives me a lower-level heading (i.e. h1 -> h2).
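As a rough illustration of that behaviour (this is not the actual ICE toolbar code, just a sketch in Python using HTML element names), the demote action can be thought of as a function of what the cursor is currently in. The function name demote and the decision to leave other elements untouched are assumptions of mine.

```python
import re


def demote(element):
    """Return the element an indent (demote) action should produce."""
    heading = re.fullmatch(r"h([1-6])", element)
    if heading:
        # A heading becomes a lower-level heading; HTML stops at h6.
        level = min(int(heading.group(1)) + 1, 6)
        return f"h{level}"
    if element == "p":
        # An ordinary paragraph becomes a quote, not a styled div.
        return "blockquote"
    # Assumption: anything else is left alone in this sketch.
    return element
```

The point is that the result carries meaning (a quote, a sub-heading) rather than a left margin.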

And the list editing is like most other HTML editors I have looked at recently, unintuitive and wrong.

Take this for example. The lists are not nested correctly (the ol should be inside the first li). And what’s that br doing at the end of my second bullet item?

<ul>
<li>Bullet</li>
<ol>
<li>Numbered</li>
<li>Numbered</li>
</ol>
<li>Bullet<br>
</li>
</ul>

This editor doesn’t think hard enough about what you might mean. It is easy to produce all sorts of terrible markup. If you click the toolbar buttons in the wrong order then you can produce very wrong code. Why should it be different if I change to numbered list format then indent (which is what I did in the above example), rather than indent then change to numbered list format (below)?

<ul>

<li>Bullet</li>
</ul>
<ol style="margin-left: 40px;">
<li>Numbered</li>
<li>Numbered</li>
</ol>
<ul>
<li>Bullet<br>
</li>
</ul>

What I meant was this, but I have yet to see an HTML editor that can do this properly (note, I don’t look at HTML editors much, I tend to specialize in complaining about word processors.):

<ul>

<li>Bullet
<ol>
<li>Numbered</li>
<li>Numbered</li>
</ol>
</li>
<li>Bullet</li>
</ul>
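The first kind of mistake shown earlier (an ol sitting directly inside a ul instead of inside an li) is easy to detect mechanically. Here is a minimal sketch using Python's standard html.parser module; the class and function names are made up for this example, and it only tracks list elements, so it will not catch the second kind of mistake, where the nesting is faked with a margin-left style.

```python
from html.parser import HTMLParser

LISTS = {"ul", "ol"}


class ListNestingChecker(HTMLParser):
    """Collects lists that are direct children of another list."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open ul/ol/li elements only
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # A list whose nearest open list-related ancestor is itself a list
        # (rather than an li) is nested incorrectly.
        if tag in LISTS and self.stack and self.stack[-1] in LISTS:
            self.problems.append(f"<{tag}> directly inside <{self.stack[-1]}>")
        if tag in LISTS or tag == "li":
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if (tag in LISTS or tag == "li") and tag in self.stack:
            # Pop back to the matching open element.
            while self.stack.pop() != tag:
                pass


def misnested_lists(html):
    """Return a list of human-readable nesting problems found in html."""
    checker = ListNestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems
```

Feeding it the first Composer sample reports one problem; feeding it the markup I actually meant reports none.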

So, the bottom line is that even though Mozilla Seamonkey Composer comes with a button to fire the document off to a validation service, it would be useless for writing a paper, even before we get to the issue of how you might make a PDF version, or deal with your bibliography.

2008-02-20

Cyclone Peter Murray-Rust moves away from Toowoomba. Cleanup continues.

Filed under: Uncategorized — ptsefton @ 11:26 am

Last week we hosted Peter Murray-Rust in Toowoomba. The ICE team have been busy getting ready for some other visitors, so I have not had time to write about the visit. Peter has blogged about his stay a few times, describing it as intensive talk and hacking. Intensive indeed.

He talked about ICE (emphasis is mine):

I’ve had the chance to look closely at ICE: The Integrated Content Environment - an authoring environment for academic material. USQ eat their own dog-food and over 100 academic staff at USQ use it routinely for authoring their course material. USQ is very committed to high-quality distance education - they have an impressive enrollment from overseas and they put a lot of work into the material which supports it. So their material can be repurposed as notes, lecturers copies, slides, summaries, etc. All this is managed through stylesheets - which are the key to ICE. The content is written once but delivered in many ways. Because the material is in XML it is also possible to amend it with XML-aware tools or to generate new material programmatically. A key aspect is that the structure of the document(s) can be managed in XML. So I am now convinced that for academic work it is (a) fit-for-purpose (b) reconfigurable (c) powerful. It’s still early-adopter for theses, but as it can do so many new things I can’t see any real competition.

http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=982

Here’s a quick summary of what the ICE team did with Peter on his visit.

We introduced Peter to ICE, and showed him how to blog with it. We’ll be very interested in feedback on this, does it work? Is it better for some posts than the WordPress editor? If so, which types of post? We’ll also try some other collaboration with Peter & team.

Oliver Lucido extended our work on embedding Chemical Markup Language (CML) into publications. CML lets you describe molecules and reactions and the like. We’ve come a long way since my first post on CML last year. You can now put a CML file into your working directory, and ICE will turn it into a variety of formats for you automatically. See Daniel’s screencast. Now we start to think about similar services for other disciplines; drop me a line or comment here if you have any suggestions.

Also related to CML, PMR is interested in being able to create what he calls a data overlay journal, taking bits of data gathered from various publications and aggregating them by type. So Ron Ward worked on something that turns Atom feeds with chemistry in them into OpenDocument Format documents. Peter seemed pleased with the results, which are really just a proof of concept for now. Using Ron’s new code we scraped some data from the Crystaleye service, turned it into an ODF document, then PMR sent it to his blog over the Atom Publishing Protocol. You can’t see the PDF version just yet, or interact with the molecules via JMOL, but we’ll sort that out soon. (When I say ‘we’ obviously I mean Oliver.)
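For anyone wondering what the first half of such a pipeline involves, here is a minimal sketch in Python that pulls the entries out of an Atom feed using only the standard library. It is not Ron’s code, and it stops short of writing ODF; the function name atom_entries is an assumption for illustration. It only handles plain-text titles and bodies; content of type xhtml, where the text lives in child elements, would need more work.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Atom Syndication Format (RFC 4287).
ATOM = "{http://www.w3.org/2005/Atom}"


def atom_entries(feed_xml):
    """Return (title, body) text pairs for each entry in an Atom feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        # Prefer the full content; fall back to the summary if there is none.
        body = entry.findtext(ATOM + "content")
        if body is None:
            body = entry.findtext(ATOM + "summary", default="")
        entries.append((title.strip(), body.strip()))
    return entries
```

Each pair could then become a heading and a paragraph in a word processing document.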

We’re wondering if this might be useful in other situations, like for celebrity bloggers who want to turn their blog into a book. Point it at a feed of the relevant content and it will put it into a word processing format you can edit and send off to a publisher. Might also be useful when quoting web material. Me, I’d like to be able to get hold of posts from sources that I quote often in ODF rather than having to copy and paste HTML and re-style it.

Peter supplied a chemistry thesis and with a bit of word processor magic, I broke it up into a number of chapters which we fed into ICE. Daniel de Byl did a bit more work to fix up some character encoding issues. Oliver worked with Peter to find the chemistry that’s embedded in the thesis, which is authored using an application called ChemDraw. Peter has some code that can break open the proprietary ChemDraw format and extract open standard CML, and Oliver was able to hook this in to ICE to automate the process: find the ChemDraw pictures, turn them into CML and then re-render molecules using non-proprietary open source software. Still a lot of rough edges, but when Daniel’s back from a break in a couple of weeks we will show how ICE can be used to create a thesis in both HTML and PDF with the data linked-in and machine accessible.

Peter gave a spirited and inspiring performance of his genre of talk, this time on semantic publishing. I see lots of potential for the semantic web in some domains, particularly where the data are very bounded and structured like chemistry. But rather than trying to deal with the huge mess of unstructured data out there, the ICE team try to find or create the right tools to help people create a huge mess of structured data, with some semantic information embedded. It’s slow work, but PMR thinks 2008 is the year some of this will start to break out of the labs.

Peter presents using his presentation toolkit, which contains lots of material on a range of topics which he jumps around during the talk.

I gave my talk in HTML as normal - a series of ca. 100 major topic with 2-20 slides under each. I select each slide as I go along and stop at the time limit. At least this means I never overrun. The system has evolved over the years and now has a vertical menu for each topic and a horizontal one within the topic.

http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=960

I have been looking at how we might be able to capture Peter’s slides in ICE, and provide an interface that can assemble them into presentation packages (not necessarily linear). One thing that would be nice is for the slides themselves to have some notes. I’m going to see if we can do something like the ICE brochure where there is a document you can read, with a slide presentation that is automagically derived from it.

I was feeling a bit weathered after all of that, hence the title of this post.

I was curious about cyclone naming. Turns out that Pete is on the list for a cyclone in the Brisbane region. The way I read it, it’s about 40th on the list. Found this:

Requests by the public for tropical cyclone names

The Bureau of Meteorology receives many requests from the public to name Tropical Cyclones after themselves, friends, etc. The Bureau is unable to grant all these requests as they far out-number the number of Tropical Cyclones that occur in the Australian region.

The Bureau will only accept requests received in writing (not e-mail). The request cannot be immediately granted but the name will be added to a supplementary list. When a name is retired of similar gender and initial, a name can be included from this supplementary list (subject to checks to ensure it is not on the Southern Hemisphere retired name list or offensive in any of the languages of our international neighbours.)
Note that it can take many decades for a suitable slot to become available, then a further 10-20 years for the names to cycle through, so it is likely to be well over 50 years before your requested name is allocated to a cyclone.

http://www.bom.gov.au/weather/cyclone/about/cyclone-names.shtml

2008-02-06

This site is now Zotero friendly, courtesy of the unAPI plugin for WordPress

Filed under: Uncategorized — ptsefton @ 2:22 pm

If you’re using the Zotero research tool (and if you read this blog you probably should) you should now see the little icons to add posts to Zotero up in your Firefox address bar. Save metadata for your favourite posts! Cite me!

This will only work for the WordPress part of the site, post November 2007.

Reading the list of ingredients it seems like I am now metadata central:

The server provides records for each WordPress post and page in the following formats: OAI-Dublin Core (oai_dc), Metadata Object Description Schema (MODS), SRW-Dublin Core (srw_dc), MARCXML, and RSS. The specification makes use of LINK tag auto-discovery and an unendorsed microformat for machine-readable metadata records.

http://lackoftalent.org/michael/blog/unapi-wordpress-plug-in/

This is courtesy of the unAPI plugin, found via the Zotero site.

Off the top of my head I can think of a few other WordPress sites I’d like to have this so I can use them in my research.

The download page has clear instructions about how to install the plugin.

2008-02-05

Mini websites of supporting material for scholarly articles

Filed under: Uncategorized — ptsefton @ 12:19 pm

[Update: reposted with the spell checker on; the text was in Aussie English so the spell checker ignored it.]

Via First Author, I see that BioMed Central has added the ability for authors to include extra web material with a paper:

… to make it possible to upload collections of files that can be conveniently navigated in the web browser - essentially a miniature website associated with the article. This functionality has now been added to the BioMed Central publication system.

The BioMed Central homepage offers instructions for uploading these mini-websites as a ZIP file. Readers of the published article will have a choice of whether to download the ZIP file to view locally on their own machine, or alternatively they can follow a link to view the contents of the ZIP file via the BioMed Central website…

http://www.firstauthor.org/blog/?p=159

There’s a link there to an article about butterflies, with some extra material in HTML.

This sounds like something that ICE could do really well. ICE lets you create mini-websites complete with internal navigation, with an export to ZIP. The ZIP files exported by ICE are actually IMS packages, designed to slot into learning management systems, but they work as stand-alone websites. You can see examples at the USQ open courseware site; see the automatically generated navigation at the left? This could be a good way to package materials. You can tie just about anything together into a package using ICE, and we’re working hard right now to make sure that there are tools to help embed data visualizations into documents as seamlessly as possible.

I’m working on a paper at the moment about exporting HTML from word processors. One of the things I’ve been doing is documenting some of the issues with the built-in HTML export in widely available software packages. I was just going to link to the blog, but one idea would be to include all these blog posts and a set of sample documents as a package to go along with the paper. Whether or not a publisher would be set up to take it is another matter.

Another packaging mechanism that might be useful given tools to support it would be METS packages. We saw first hand at our repository interop workshop how APSR developed an Australian METS profile for packaging journals, to enable automated deposit from a journal management system into a repository. But who knows how to make a METS package? Apart from geeks like the ICE team, or some special repository-rats1, that is. BioMed Central’s choice of ZIP is sensible. No fancy metadata required, just make sure your ZIP has an entry point named index.html. But it could do more if it knew how to deal with IMS content packages or a METS profile as well.

To submit such a ‘mini-website’ as an additional material file, all you need to do is to ensure that the homepage is named index.html, and sits in the root folder of the content you wish to submit. Then convert the folder hierarchy into a ZIP archive, and upload this ZIP file using the regular manuscript submission system, which will recognize and process it automatically. Full guidelines are provided in each journal’s Instructions for Authors (example).

http://blogs.openaccesscentral.com/blogs/bmcblog/entry/additional_material_gets_additional_features
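The packaging rule quoted above (index.html in the root folder, then zip the folder hierarchy) is simple enough to sketch in a few lines of Python with the standard zipfile module. This is only an illustration of the rule as described, not a BioMed Central or ICE tool; the function names are hypothetical.

```python
import zipfile
from pathlib import Path


def zip_mini_website(folder, zip_path):
    """Zip a folder so that index.html sits at the root of the archive.

    Write the ZIP somewhere outside the folder, or it may try to include itself.
    """
    folder = Path(folder)
    if not (folder / "index.html").is_file():
        raise FileNotFoundError(f"expected index.html in the root of {folder}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Store paths relative to the folder, not to the filesystem.
                archive.write(path, path.relative_to(folder).as_posix())
    return zip_path


def has_root_index(zip_path):
    """Check that an existing ZIP has index.html at the top level."""
    with zipfile.ZipFile(zip_path) as archive:
        return "index.html" in archive.namelist()
```

A common way to get this wrong by hand is to zip the parent directory, which puts index.html one level down; has_root_index would catch that.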


1 Dorothea Salo, A Messy Metaphor, Caveat Lector , Blog, January 9, 2006, http://cavlec.yarinareth.net/archives/2006/01/09/a-messy-metaphor/
