PT’s blog

2008-09-05

More ideas about online and offline word processor integration - is anybody listening?

Filed under: Uncategorized — ptsefton @ 11:47 am

Via Glyn Moody, who doesn’t want to say he told us so, I see that Adobe is discontinuing support for Flashpaper, a proprietary Adobe (via Macromedia) technology for disseminating documents online. This means that anyone who has put stuff in there now has to migrate it all to some other format. That’s what you get for using technology that’s controlled by a single vendor.

That reminded me that I had this piece I’ve been working on about Adobe Buzzword, another Adobe proprietary document format.

Following my last post on Buzzword, I had an email from Tad Staley at Adobe which seemed encouraging:

You had an interesting point about exporting named styles to Word. By this, I assume you mean that we create a handful of styles that correspond to Buzzword fonts and paragraph settings, and use them within the .doc or .docx file we create on exporting? This would then allow us to “round trip” the document better from Buzzword to Word and back again.

We’d like to hear any other thoughts you have with respect to styles - as I said, we’re working on them now, so the timing is good.

So I drafted something along the lines of what you’re about to read and sent it off to Tad. It was pretty clear from Tad’s reply that Adobe are not thinking along the same lines as me at all. They don’t see it as important to be able to interchange with other word processors because they’re going to make theirs broadly available, and they don’t care much for HTML because they care too much about controlling the fine details of document presentation. What this means is they’d like you to use Buzzword / Flash / PDF to disseminate your work rather than an open web format. (OK, some PDF is a bit open, but it is very much page oriented and much harder to integrate with other services than HTML.) I think that’s terribly short sighted and reduces considerably what people can do with their documents. Mashups and so on that are built for the open web would all have to be redone for the Buzzword world, for example my geocoding example.

I’d be cautious of Buzzword if I were you, because this format could easily go the way of Flashpaper.

Anyway here’s the gist of what I sent to Tad at Adobe.

I think it would be a good idea for you to map Buzzword docs to a set of styles when you export to .doc or .docx; I’d like to see .odt as well. I’ll go into specifics below but first some general comments.

There are so many issues with styles in word processors, covering interop and HTML export, that it’s a bit hard to summarize them in a short email or blog post, but here are some of the main problems. It would be great for someone to get it right for once:

  1. No standard set of styles: Nobody ships a rational set of styles by default - I’d be looking for something that covers headings (both numbered and not in the same document), lists, block-quotes and preformatted text; the list is actually very similar to the set of elements in HTML, which is no coincidence as that’s a generic schema.

  2. Awful HTML export: Word processors almost always try to reproduce whatever the user inputs in the way of formatting resulting in all sorts of crap in the HTML they output. Building a new product is a great opportunity to do it differently.

    (Buzzword’s HTML isn’t bad by comparison with some, within its current limitations. But really, you should fix the list nesting. I think the HTML model is silly too, but it is what it is.)

  3. No ’structure-only’ mode: Why not have a UI mode where you can’t do gratuitous formatting, only structure your document using headings, lists, blockquotes etc. and then choose from a menu of stylesheets? That is, turn off the font panel. This may have been hard to sell in the old days but now I think people would get it, particularly when they are writing for multiple media where the same document could be published as both HTML and PDF. If you restrict users to a known style set then you can reliably change the presentation of their documents automatically. If not then you have problems. A couple of examples:

    1. If people have chosen colours then you can’t change the background colour of a page in case you have readability problems.

    2. If you allow absolute indents (say 4cm) then you might not be able to reformat into multiple columns and still have the document look OK.

  4. Extreme confusion in the area of lists: Both Word (and by extension OOXML), and Writer (and by extension ODF) have mind-blowingly crazy list models.

    • Word has paragraph styles to which you can attach list formatting, and it has named outlines (with one of the worst GUIs ever even before Word 2007 took it to new heights) AND it has list styles which came along circa Word 2003.

    • Writer has paragraph styles and list styles, both of which can be applied independently, and lists are represented as a hierarchy in the file format, although the GUI gives almost no clues as to what the hierarchy actually is.

    In the ICE project we deal with this by automatically creating paragraph styles and list styles / named outlines and providing toolbars to apply both at once, resulting in much more stable, interoperable documents than you get if you leave users to deal with all this by themselves.
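To give a feel for how small a restricted style set could be, here is a minimal sketch in Python of a style vocabulary mapped onto HTML elements. The names p and bq1 are ICE style names mentioned on this blog; every other name, and the mapping itself, is an assumption made up for illustration rather than ICE’s or Buzzword’s actual table.

```python
# Hypothetical restricted style set: style name -> HTML element.
# Only "p" and "bq1" come from ICE as described on this blog;
# the remaining names are invented for this sketch.
STYLE_TO_HTML = {
    "p": "p",
    "h1": "h1",
    "h2": "h2",
    "bq1": "blockquote",
    "pre1": "pre",
    "li1b": "ul",  # level-1 bulleted list item
    "li1n": "ol",  # level-1 numbered list item
}


def html_element_for(style_name):
    """Return the HTML element for a known style, or raise for ad hoc formatting."""
    try:
        return STYLE_TO_HTML[style_name]
    except KeyError:
        raise ValueError(
            f"unknown style {style_name!r}: not in the restricted set"
        ) from None
```

Because every paragraph carries one of a known handful of names, an exporter never has to guess what a run of 14pt bold text was meant to be.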

So here’s what I would do if I had a chance to influence Buzzword, in addition to building in a standard kind of word processor style system.

Based on my observation of the behavior of the list formatting, Buzzword obviously has some notion of structure built into it even if it doesn’t (yet) have headings. So let’s look at what you could do with lists.

I think the Buzzword UI for lists is pretty cool. One thing I like is that lists stay connected. In most online editors if you change an item in the middle of a list into a plain paragraph and then back into a list item you get two disconnected lists, something that makes no structural or practical sense. Buzzword gets this right and makes sure that list items adjacent to each other are part of the same list.
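One way to get this behaviour is to derive lists from adjacency rather than storing them as separate objects. The following is a rough sketch of that idea, assuming a made-up document model of (kind, text) pairs; it is not a description of how Buzzword actually does it.

```python
from itertools import groupby


def group_paragraphs(paragraphs):
    """Group consecutive list items into single lists.

    `paragraphs` is a sequence of (kind, text) pairs, where kind is "li"
    for a list item and "p" for a plain paragraph (a hypothetical model).
    """
    blocks = []
    for kind, run in groupby(paragraphs, key=lambda para: para[0]):
        texts = [text for _, text in run]
        if kind == "li":
            # Every run of adjacent list items becomes exactly one list.
            blocks.append(("list", texts))
        else:
            blocks.extend(("p", text) for text in texts)
    return blocks
```

Turning the middle item of a three-item list into a paragraph yields two lists with a paragraph between them; turning it back yields one list again, because nothing remembers the split.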

Here’s a test-list in Buzzword:

graphics1

The UI is really slick. It actually understands the structure of the list, so when you hit the promote (<-) and demote (->) buttons it does The Right Thing. My only quibble is the way it insists on all the items at a particular level being the same kind of list item even if they are not siblings.

Oh, and don’t call the list level ‘outline level’ because in other word processors that term is used for the heading structure.

My proposal is that on export Buzzword should not just use formatting, it should create styles. As I mentioned before this is more complex than it needs to be, due to the legacy of gratuitous features in the target applications, but it is doable.

Let’s take the example of .odt export for use in the OpenOffice.org family of word processors. I’ll use the ICE version of the style names, chosen for their brevity, but you could use longer versions.

Here’s the same test list embedded in this document which I’m writing using NeoOffice. The paragraph style names are shown in curly-braces at the end of each paragraph. Behind the scenes my toolbar also applies a list style of the same name.

Buzzword test document. {p}

When exporting, you could embed some macros that provide a Buzzword-like interface via a toolbar. In ICE we have a toolbar which tries to Do The Right Thing (doesn’t always succeed I have to admit, but we’re getting there). We take a different approach from Buzzword’s modal interface and re-use the same buttons in different contexts. So the promote button in a list will move your list item to the left and it should pick up the right list style by looking back through the document to see what is appropriate - whereas for a heading it would change the heading level in the document outline.

Why do we do this? It’s all about interoperability. The styles mean that we can produce good HTML, and also move documents between Word and Writer pretty easily, correcting for the differences between their wacky, annoying, productivity-sapping list models. And we give users on both word processors the same toolbar running the same code.

One advantage for Adobe and their Buzzword product would be the same good interoperability with offline word processors. But there’s another potential benefit, the same one I suggested to Google. Adobe could start ‘infecting’ documents with a benign structure virus. Let’s see how this could work:

  1. I draft a blog post like this one in Buzzword and send it via Buzzword’s sharing feature to a colleague to add their contribution. I was going to say ‘a paper’ but Buzzword is a long way from ready for that.

  2. My colleague doesn’t want to sign up to yet another online service, and besides is going to be editing the document later at home, so chooses the option to download it as a Word document and saves it on a USB drive.

  3. Later at home Word prompts to say that the document contains macros and should they be allowed to run? If no, then it’s not the end of the world as we still have a Word document that can be re-imported to Buzzword later. If yes then read on.

  4. On opening the document, it’s got a Buzzword-style or ICE-style toolbar, so my colleague is able to make some changes to the document without realizing that they are dealing with the styles that were added to the document automatically on export.

  5. When the editing is done, they can save the document locally, but since there’s a toolbar installed they can click the ‘Return to sender’ button and it gets automatically uploaded back into my Buzzword account via an inbox.

    Because they used the toolbar all the headings are set properly and the lists are nice and orderly.

    (If you don’t understand why I’m going on about this go over to Google docs and try importing and exporting documents using OpenOffice.org Writer).

  6. Later, if my colleague decides that they did like the Buzzword experience they can click the ‘Install the Buzzword template’ button and have the toolbar show up all the time. If they go further and sign up for an account then they can draft things in Buzzword and have them save automatically into Buzzword.

You can see how this could spread the Buzzword way of life not by replacing offline word processors but by providing a bridge into the online service. If the online way is better then people will naturally stop using their offline programs.

A couple of other things that would help drive the service:

  • AtomPub support so you can post to your blog, both from the online service and from your word processor. ICE does this already.

  • Simple web page publishing. At the moment Buzzword does HTML export in a Zip file. Why can’t it just put the page up online for you?

  • An import feature where when a user uploads an unstyled word processing document Buzzword gives it back with added styleage. (See the ongoing conversation I’m having with MJ Suhonos).
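For the AtomPub item, the client side is not much work. Here is a minimal sketch of building an Atom entry from exported HTML using only the Python standard library. The function name and its parameters are hypothetical, and this is not ICE’s code; a real client would also need authentication and whatever extra elements a given blog server expects.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "http://www.w3.org/2005/Atom"
# Serialize Atom elements with a default namespace rather than an ns0: prefix.
ET.register_namespace("", ATOM_NS)


def build_atom_entry(title, html_body, author_name):
    """Build a minimal Atom entry document as a string (illustrative helper)."""

    def atom(tag):
        return f"{{{ATOM_NS}}}{tag}"

    entry = ET.Element(atom("entry"))
    ET.SubElement(entry, atom("title")).text = title
    # Placeholder id and timestamp; a server may replace both on creation.
    ET.SubElement(entry, atom("id")).text = f"urn:uuid:{uuid.uuid4()}"
    ET.SubElement(entry, atom("updated")).text = datetime.now(timezone.utc).isoformat()
    author = ET.SubElement(entry, atom("author"))
    ET.SubElement(author, atom("name")).text = author_name
    content = ET.SubElement(entry, atom("content"), {"type": "html"})
    content.text = html_body  # ElementTree escapes the markup on output
    return ET.tostring(entry, encoding="unicode")
```

The resulting document would be sent in an HTTP POST to the blog’s collection URL with a Content-Type of application/atom+xml;type=entry.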

There would be a couple of ways for the online word processor vendors to approach this. One would be to work with the ICE team. As far as I know there is nobody else out there with our commitment to generic word processing based web and print content management. The first mover would have an advantage and if it worked others would follow. The users would win.

Another would be to invent a proprietary set of styles and toolbars and go for more of a lock-in effect. Might work. Wouldn’t be so great for the users.

I am reminded writing this that all the recent activity on word processing standards hasn’t changed things much for users. For complex documents, like business documents with embedded fields and so on, interoperability between packages both online and offline is still really poor, and interoperability between word processing packages and the web is terrible. It’s not about whether you’re doing OOXML or ODF. It’s about what you’re doing with them.

2008-09-03

Put on The Fascinator

Filed under: Uncategorized — ptsefton @ 9:30 am

At the Australian Digital Futures Institute (ADFI, née LFII) we have been working on a software project, funded by our friends at ARROW, to build a lightweight web front-end to the Fedora Commons repository software. It used to go by the name of Sun of Fedora, which was just a temporary off the cuff in-joke kind of a name. (It uses the Apache Solr search engine).

It now has a new name.

Choosing a name mainly consisted of the ADFI doing a lot of ‘research’ on Google and Wikipedia and IMing each other like crazy. The process threatened to consume what remains of the project budget so we cut it short after a couple of hours.

I suggested Christine after the Siouxsie and the Banshees song about a person with multiple personalities on account of the software is used to show the same repository in many different ways. Most of the ADFI staff turn out to be too young, too inattentive or too sheltered to remember Christine although I’m pretty sure it would have been on Countdown. It would have made for a good tag-line for the software.

Now she’s in purple, now she’s a turtle.

Anyway, Bron Chandler suggested Fascinator, amongst many many other names. I liked that one, as it’s a kind of add-on to a hat and is typically smaller than a Fez. It also sounds a bit like ‘facet’ which is nice, as the software uses facets to help you discover stuff in the repository. I think having an ‘F’ is nice too. The Fascinator, powered by Fedora.

This, from the current Wikipedia page, is apt for a bit of open source software:

They are available pre-made, but are also quite easy and cost effective to self assemble. They are also sold in kit form.

Turns out The Fascinator is also the name of a ragtime tune by James Scott. I haven’t been able to source an Open Access version you can listen to, but maybe someone out there can knock it out for us using the sheet music. *

graphics1

The Fascinator it is.

It is not an acronym and very importantly it is not in upper case but we await construction of a gratuitous backronym, from the man who brought you ARROW, ARCHER and DART or from the creator of FABULOUS and Absolutely Fabulous.

We have soft-released the software before but now there is a new, open project site where you can download it, if you’re comfortable with Subversion and installing software on Linux and such. There are instructions for Ubuntu.

The Fascinator will also be used in a project that Caroline Drury and I are leading to take a snapshot of the contents of Australian university institutional repositories, partly to test the software and partly to give a series of point in time snapshots of what is in them for research purposes. We’d like to look at the range of ways people describe their content and compare the way different repository platforms are used.

I road tested the name on The Long Suffering Sandra.

PT: You know that software I’ve been working on called Sun of Fedora?

SC: No.

PT: Well anyway, we’re going to call it The Fascinator. Is that a good name?

SC: Only if it’s a project to do with hats.

PT: Well it is, it builds on Fedora, which is a kind of repository.

SC: In that case it’s a stupid name, you don’t put a fascinator on a fedora.

Oh yes we do. Here’s the demo site. And besides, here’s a thing which is both a fascinator AND a Fedora. Unfortunately it’s already sold. (I hope Glamour Bomb doesn’t mind me borrowing this image).

graphics2


* I wonder if anyone in the ADFI happens to be a piano teacher in her spare time? (There are a couple of tracks on eMusic in case you’re interested (no, I’m not an eMusic affiliate cos the form was too scary)).

2008-08-26

More thoughts on an application to find structure in word processing documents

Filed under: Uncategorized — ptsefton @ 1:24 pm

In my last post I said I’d write more about how Ian Barnes’ Structure Guesser AKA Structure Sniffer1 might work, and how it might be able to leverage Schematron.

The sniffer is part of Ian’s Digital Scholar’s Workbench concept, where you can upload an unstructured word processing document, and use the workbench to add explicit structure in as automated a way as possible. Explicit structure really helps in being able to convert the document to other formats such as HTML for the web, or structured PDF with a table of contents, but also for preservation formats that might keep the words and other content for posterity without necessarily worrying about exact formatting. Ian has looked at using DocBook for this, but I reckon HTML might be good enough, and I know others are thinking the same thing2.

Ian’s looked at the statistical approach to guessing structure used in the Lemon8-XML project, found that particular implementation wanting and is now thinking about more of a machine learning approach.

I too have been thinking about how this application might work for a while now and I’m getting increasingly enamored of the idea of using an HTML interface, something like this:

  1. Upload a style-free word processing document to a web site.

  2. You see an interactive preview of an HTML version of the document, complete with a full table of contents (so you can see where the sniffer application thought the headings were).

    Interactive? Hover the mouse over a top level (h1) heading in the preview and see some details about why the machine formatted it that way, such as Paragraphs at 18pt (10 instances) and 19pt (1 instance) Helvetica look like Heading 1. You’d be able to correct the machine, either on a case by case basis or wholesale.

    Another area where some interaction might be needed would be in disambiguating various kinds of indented text: some indentation might mean block-quote, some might be an example, while other text might just be, you know, indented. We had to add an indent style in addition to the bq1 (block-quote) style to ICE to support this because some authors just, you know, want to indent stuff.

  3. Once you were happy with the HTML view of the document, there would be an option to improve your original by adding styles without changing its presentation too much (Did I mention? You too should use styles.) or you could just use the rendition and leave the original alone. Either way, the choices you made would constitute feedback to the learning system. So even if you don’t choose to use styles, the next time it sees the same document it will be able to handle it better.
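To make the hover explanation concrete, here is a deliberately naive sketch of a font-size heuristic that also records why it made each guess. It assumes paragraphs arrive as (font size, text) pairs, which is an invented model, and it is a simple counting rule rather than the machine learning approach Ian has in mind.

```python
from collections import Counter


def guess_headings(paragraphs):
    """Guess heading levels from font sizes and explain each guess.

    `paragraphs` is a list of (font_size_pt, text) pairs. The most common
    size is assumed to be body text; each distinct larger size becomes a
    heading level, largest first. Returns {font_size: (level, reason)}.
    """
    if not paragraphs:
        return {}
    counts = Counter(size for size, _ in paragraphs)
    body_size = counts.most_common(1)[0][0]
    heading_sizes = sorted((s for s in counts if s > body_size), reverse=True)
    guesses = {}
    for level, size in enumerate(heading_sizes, start=1):
        noun = "instance" if counts[size] == 1 else "instances"
        reason = (
            f"Paragraphs at {size}pt ({counts[size]} {noun}) "
            f"look like Heading {level}"
        )
        guesses[size] = (level, reason)
    return guesses
```

Unlike the 18pt and 19pt example above, this sketch treats every distinct size as its own level; merging near-identical sizes is exactly the kind of correction the interactive preview would let a person make.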

So where does Schematron come into this? Well, leaving aside the (very) hard problem of actually writing the learning system, that system could generate Schematron rules, which could be used to annotate the original document with suggested styles for each paragraph. Having done that, you could then feed the document into the existing ICE HTML formatter, which is style-driven and it could use the suggested styles to render the document.

These rules can be hierarchical, meaning that based on certain cues different sets of rules might apply. For example, there might be a family of documents which all come from a user who uses Palatino 11pt for the main text and makes use of an idiosyncratic mixture of formatting and styles; the learner could derive rules for that situation. I know nothing about this kind of thing, but I wonder if it would be like the Naïve Bayesian in the Old Bailey, where a machine is trained to classify trials.

Using Schematron rules would mean that they could also be written or tweaked by humans. Returning to the example before, a human could add a rule that if a bit of text is indented relative to the text around it and it contains something that looks like a citation (which could mean either that it uses something like a Zotero field, or that it is formatted like a citation with brackets or a footnote) then it’s a blockquote.
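The logic of that hand-written rule can be sketched in a few lines. This is Python standing in for what would really be a Schematron rule, the citation pattern is a crude assumption that only catches bracketed author-year citations, and the function and its parameters are hypothetical.

```python
import re

# Crude pattern for a bracketed author-year citation, e.g. "(Shah & Kesan 2008)".
CITATION = re.compile(r"\([^()]*\b(?:19|20)\d{2}[a-z]?\)")


def classify_indented(indent_cm, surrounding_indent_cm, text):
    """Classify a paragraph as 'blockquote', 'indent' or 'p'."""
    if indent_cm <= surrounding_indent_cm:
        return "p"  # not indented relative to its neighbours
    if CITATION.search(text):
        return "blockquote"  # indented and carries a citation
    return "indent"  # just, you know, indented
```

Zotero fields and footnote-style citations would need their own tests; the point is only that such a rule is small enough for a person to read and tweak.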

This would be a nice modular approach. Chances are we’re going to be looking at Rick Jelliffe’s in-zip Schematron for use on Open Document Format documents, so the sniffer could piggyback on that1.


1 Also known as that by me, at least.

2 And no, OOXML and ODF are not necessarily the answer for preservation although they are important. I’ll expand on this in a future post as I think about a presentation for Open Standards 08.

1 Actually there is an issue with this: it’s not that simple to write rules that work on the formatting in an ODT file, cos it uses these automatically defined styles that introduce a layer of indirection. We could consider a pre-processor that remembers these automatic styles between documents. It would also probably need to annotate documents with some kind of weighted score like they use in Lemon8-XML.
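The layer of indirection in that last footnote looks roughly like this: a paragraph in content.xml refers to an automatic style such as P1, and that style in turn names the style the author actually chose. Here is a minimal sketch of resolving it, assuming the standard ODF namespaces and an input that is the content.xml text; the function name is made up.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def automatic_style_parents(content_xml):
    """Map automatic style names (e.g. 'P1') to their named parent styles."""
    root = ET.fromstring(content_xml)
    auto = root.find(f"{{{OFFICE_NS}}}automatic-styles")
    if auto is None:
        return {}
    parents = {}
    for style in auto.iter(f"{{{STYLE_NS}}}style"):
        name = style.get(f"{{{STYLE_NS}}}name")
        # The parent is None when the automatic style has no named parent.
        parents[name] = style.get(f"{{{STYLE_NS}}}parent-style-name")
    return parents
```

A pre-processor could use a table like this to rewrite P1 back to its parent before any rules run, so the rules only ever see named styles plus whatever direct formatting is left over.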

A courseware authoring dashboard using Schematron

Filed under: Uncategorized — ptsefton @ 10:58 am

As with busses, sometimes you can wait ages for a Schematron and suddenly a whole pack of them come along together*.

For those of you who don’t know:

In Markup Languages, Schematron is a rule-based validation language for making assertions about the presence or absence of patterns in XML trees. It is a simple and powerful structural schema language. It typically uses XPath to describe patterns.

(Wikipedia contributors 2008)

Instead of the all or nothing syntactic approach that you get with other kinds of schemas, Schematron lets you pick and choose things to worry about. So instead of saying all course books must begin with a Learning Outcomes section, you can write a rule that simply reports on whether there’s a Learning Outcomes section or not, without ruling out any variation. Why? In some courses it might be important to add something before that section, while I have heard arguments that in some situations specifying learning outcomes upfront scares off potential students.

We’ve discussed using Schematron to provide reports on ICE content but have never got around to using it. This week it has resurfaced in a couple of contexts.

Relevant to ICE as a course-authoring system, the Learning and Teaching Support Group at USQ have a checklist, The USQ course writing guide, which authors can use to see if their courses meet our standards for fleximode courseware. At the moment it’s a manual process to tick the boxes. We met with Michael Sankey from LTSU this week, and it’s pretty clear that Schematron could play a part in automating lots of the checklist.
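The difference between reporting and requiring is easy to show. Here is a small sketch in Python, standing in for a Schematron report rule, that looks for a Learning Outcomes heading anywhere in an HTML-like document and says what it found. The un-namespaced element names and the message wording are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def report_learning_outcomes(xhtml):
    """Report on, rather than require, a Learning Outcomes heading."""
    root = ET.fromstring(xhtml)
    headings = [el for el in root.iter() if el.tag in ("h1", "h2", "h3")]
    titles = ["".join(h.itertext()).strip() for h in headings]
    matches = [t for t in titles if t.lower() == "learning outcomes"]
    if not matches:
        return "No Learning Outcomes section found"
    position = titles.index(matches[0]) + 1
    return f"Learning Outcomes section found (heading {position} of {len(titles)})"
```

A document with a welcome section first still gets a positive report; the rule notes where the section is without insisting that it comes first.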

As part of our ongoing exploration of how we might create an automatic or semi-automatic system for inferring structure in documents Ian Barnes has pointed out that Schematron might play a role there too.

Ian’s insight was prompted by a recent post of Rick Jelliffe’s about a project to add annotations to a corpus of (presumably) Word documents in the OOXML zip package format:

The brief was for an organization with a large number of documents from multiple sources, but with each source supposed to use stylesheets. The idea was to make a rules base that would distinguish all the different ways that a few structures (titles, table of contents, potentially citations, etc) were represented. This would allow classification of documents according to the structures found, the discovery of outliers and exceptions (e.g. incorrectly marked up documents, or where additional rules were needed), and automated annotation back to the original documents.

http://news.oreilly.com/2008/08/a-standardsbased-expert-system.html

I’ll come back to Ian’s structure guesser (or as I like to call it the structure sniffer) in another post and talk here about the possibilities for adding validation or dashboard services for courseware written using ICE, via Schematron.

Rick’s idea of Schematron rules that can reach inside Zip files would be perfect for the USQ courseware context as our content is in Open Document Format files (actually some of it is Word docs but we convert it to ODF as part of the process). We could translate a lot of the checkboxes in the USQ course writing guide into Schematron rules to do things like check that there is an acknowledgements section in the course introduction. Not only could the system report issues, it could open up the documents in question for you and take you to the trouble spots and insert comments in the documents.

Not everything needs to be seen as a validation issue though; just some reporting would be useful to create a kind of dashboard for courseware. Module 4 contains no activities might be a worthwhile thing to report, along with word counts for various modules, how many citations there are, etc.
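Reaching inside the Zip is the easy part. As a rough sketch of the acknowledgements check, written in Python rather than Schematron, the following reads content.xml out of an ODT package and looks at the headings. It assumes the section is marked with a real heading (text:h) whose text starts with ‘Acknowledgement’, and both function names are invented.

```python
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_headings(odt_file):
    """Return the text of every heading (text:h) in an ODT package.

    `odt_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(odt_file) as package:
        root = ET.fromstring(package.read("content.xml"))
    return ["".join(h.itertext()).strip() for h in root.iter(f"{{{TEXT_NS}}}h")]


def has_acknowledgements(odt_file):
    """Check for a heading whose text starts with 'Acknowledgement'."""
    return any(
        heading.lower().startswith("acknowledgement")
        for heading in odt_headings(odt_file)
    )
```

This only works for documents that use heading styles properly, which is one more reason to care about styles in the first place.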

Another place we could use Schematron to report on course structure would be in the course organizer, which is part of the IMS package manifest file in every ICE course. An organizer is a kind of table of contents for the course, and it is used to generate the navigation. Schematron could easily be used to validate things such as There must be a Study schedule, and check things like whether the links to study modules have names that are not just like Module 1 but convey a bit more about what’s in the module.
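The organizer checks could look something like the sketch below, again in Python as a stand-in for Schematron rules. It assumes the manifest uses the IMS Content Packaging 1.1 namespace, which may not match the version ICE writes, and the second message’s wording is my own.

```python
import re
import xml.etree.ElementTree as ET

# Assumed namespace: IMS Content Packaging 1.1.
IMSCP_NS = "http://www.imsglobal.org/xsd/imscp_v1p1"


def organizer_report(manifest_xml):
    """Report on the item titles in an IMS package organizer."""
    root = ET.fromstring(manifest_xml)
    titles = [
        (item.findtext(f"{{{IMSCP_NS}}}title") or "").strip()
        for item in root.iter(f"{{{IMSCP_NS}}}item")
    ]
    messages = []
    if not any(title.lower() == "study schedule" for title in titles):
        messages.append("There must be a Study schedule")
    for title in titles:
        # A bare "Module 1" conveys nothing about the module's content.
        if re.fullmatch(r"Module \d+", title):
            messages.append(f"'{title}' says nothing about what is in the module")
    return messages
```

An empty list means there is nothing to report, which fits the dashboard idea better than a pass or fail result.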

A few years ago Ron Ward and I were involved in a project that used Schematron. There we used it to validate metadata for documents as they were uploaded into a content management system. Schematron would look for patterns in the metadata and complain when it was wrong. The complaint took the form of an HTML form that the user could fill out to fix the metadata to the Schematron system’s satisfaction. The Schematron rules worked well to create a true declaratively specified interface, but our implementation was a bit inflexible, like my attitude at the time, so usability suffered. Lesson learnt, I hope.

I think that presenting this as a dashboard that lets you know what your course is like will be better than presenting it as validation which has connotations of centralized control, something that doesn’t always go down well in a university, even when we do have agreed standards to maintain.

It will be a little while before we get to implementing this. I just wanted to record our current thinking.


* Although come to think of it I don’t think I’ve ever seen two busses in a row in Toowoomba.

2008-08-20

Compound documents in ICE and beyond: referencing parts of things

Filed under: Uncategorized — ptsefton @ 1:44 pm

Ben O’Steen has put up some thoughts on what he refers to as ‘compound’ documents and how to store them in repositories and allow for referencing of parts of a document, such as a table, a graph or even a paragraph.

Why did I add the scare quotes to compound?

While to a computer scientist a research paper with its graphs and tables and paragraphs might be compound, I suspect most authors tend to think of a research article as a single entity. Until we start giving them access to services that make it clear that it’s not monolithic, that is.

As background, Ben gives four rules:

Note that the four rules of the web (well, of Linked Data technically) are in essence:

  • give everything a name,

  • make that name a URL …

  • which results in data about that thing,

  • and have it link to other related things.

I strongly believe that applying this to the individual components of a document is a very good and useful thing.

http://oxfordrepo.blogspot.com/2008/08/four-rules-of-web-and-compound.html

Agreed.

He goes on to talk about how repository services will have to have an explicit contract with authors that lets them know that their document is not just going to be presented in one monolithic format, by default the dreaded PDF.

One thing first, we have to get over the legal issue of just storing and presenting a bitwise perfect copy of what an author gives us. We need to let author’s know that we may present alternate versions, based on a user’s demands. This actually needs to be the case for preservation and the repository needs to make it part of their submission policy to allow for format migrations, accessibility requirements and so on.

As we get authors using a system like ICE then this will be:

  1. Easier for them to understand because they can see multiple formats generated automatically.

  2. Easy to implement, by hooking up ICE (or similar) directly to repositories. Just this week Oliver Lucido has ICE putting content straight into ePrints via OAI-ORE, automatically adding an HTML and PDF view.

So far with ICE we have done a number of demo hook-ups to repository software. It’s now time to turn this on for real: we will get ICE hooked up to USQ ePrints ASAP. This will mean that all the images in a document will automatically become referenceable. That is, in Ben’s terms each image will have a name which is a URL.

Going beyond images, we have already done some work in ICE on making paragraphs referenceable, not in a repository context but in an editorial workflow. For example, this blog post has been created in ICE. Here’s a screenshot of an earlier version of this very paragraph in the HTML view.

graphics1

See the blue pilcrow? That’s the symbol that Tim Bray uses on his blog to make each paragraph referenceable. Go and have a look, you can link to or refer to any part of any post on his site. In ICE, however, the pilcrow is not for referencing elsewhere, it’s for commenting.

See the spelling error? I can annotate the document:

graphics2

Now, if I fix the paragraph, the comment will disappear from the main body of the text but the old, broken version of the paragraph is kept. It shows at the bottom of the page until I delete it.

So, ICE already knows how to identify any paragraph and has some rudimentary version control for document parts*, but the context matters. In an authoring context we needed something that was not too sensitive to document order, and it had to work with documents created by word processors, so we can’t just assign unique IDs to paragraphs the way Tim Bray can in his bespoke workflow. But when it comes to pushing (or pulling) a document into a repository, where there is some expectation that it will not change, there is no reason that we can’t mint IDs for parts of a document, and figure out a way to make them obviously citable along the lines of Tim’s purple pilcrows.
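One way to get identifiers that don’t depend on document order is to derive them from the paragraph text itself. The sketch below shows that idea only; it is not a description of how ICE identifies paragraphs, and the function names and the p- prefix are made up.

```python
import hashlib
import re


def paragraph_id(text, length=8):
    """Derive an ID from a paragraph's normalized text, not its position."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return "p-" + digest[:length]


def label_paragraphs(paragraphs):
    """Return (id, text) pairs; IDs survive reordering but not rewording."""
    return [(paragraph_id(text), text) for text in paragraphs]
```

The trade-off is that fixing a typo changes the ID and identical paragraphs collide, so this suits a deposited copy that is not expected to change better than a live draft.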

Coming back to Ben’s post. Why not make the HTML view the ‘normal’ way to look at an article where possible? This would mean that you don’t have to store a document in fragments, merely label the parts of the HTML. I guess I’m agreeing with Ben’s tentative suggestion that HTML might be a good format to hang this on:

I have yet to settle on basing it on the content XML format inside the OpenDocument format, or on something very lightweight, using HTML elements, which would have a double benefit of being able to be sent directly to a browser to ‘recreate’ the document roughly.

Forget ‘roughly’, at least for documents created with an HTML-ready workflow like ICE. It would be even less rough if authors choose something like the Article Authoring Add-in for Microsoft Office Word 2007. But Ben’s right; for documents that are deposited in PDF or in unstructured word processing formats, HTML is going to be rough.

Just how we might handle the user interface issues for exposing names (URLs) of the parts of a document is unresolved, but we’ll give it a go here at USQ with our ICE and ePrints systems.


* There’s the current version and then there are obsolete versions. ICE of course has rich version control at the document level courtesy of Subversion.

2008-08-11

Study shows real-world ODF/OOXML interoperability is not great

Filed under: Uncategorized — ptsefton @ 10:31 am

Via Doug Mahugh at Microsoft comes this study (Shah & Kesan 2008) on interoperability of word processing applications using the Open Document Format and Office Open XML.

After outlining some possible approaches to testing conformance of applications against the standards and pointing out what a gargantuan task that would be, they settle on a pragmatic approach: test interoperability with the dominant application for each format.

This research tested the interoperability for ODF and OOXML document formats based on a reference implementation approach. For ODF, the test documents are developed in OpenOffice, which is currently the dominant implementation for ODF. For OOXML, the test documents are developed in Microsoft Office 2007 for Windows. These are not reference implementations in a true sense, because they do not perfectly implement the standard. However, they act as de facto reference implementations, because they are the dominant implementations that all developers seek compatibility with.

This makes perfect sense for real-world testing. The results are interesting and unsurprising (to me, at least). Basically the best interoperability is between Microsoft Office Word and OpenOffice.org Writer even when they are reading each other’s formats. I reckon that would be because the OOo team have invested person-decades of effort in reverse engineering the Word document model, and Writer is more or less able to deal with Word docs. The document serialization format is not that relevant. It’s the document models that count. And some of the applications they test are not really even word processors.

This paper makes a great case that it is interop that counts and then goes on to show how poor interop really is.

Unfortunately, this study didn’t get as far as looking at styles compatibility as that’s one area where there are some frustrating problems but also great opportunities to help in interoperability. If you use styles then at least the semantics and structure of documents can be preserved even if page fidelity is not.

And there’s a way to improve interoperability. You don’t have to leave users to their own devices, you can advise them of which features of which applications to use for particular tasks. This is what we try to do on the ICE project. We provide templates and advice to help people create interoperable documents.

Inspired by this paper, I’m off to start work on a paper looking at proactive interoperability, by helping users to pick features that will interoperate. As noted in this study there’s not much out there to choose from apart from Writer and Word. That’s why we will continue to work with Writer and Word looking for practical solutions.

Shah, R.C. & Kesan, J.P., 2008. Lost in Translation: Interoperability Issues for Open Standards - ODF and OOXML as Examples. In The proceedings of the 36th Research Conference on Communication, Information and Internet Policy (TPRC), Arlington, VA, Sept. 26-28, 2008. Available at: http://ssrn.com/abstract=1201708 [Accessed August 10, 2008].

2008-08-05

Another look at the Article Authoring Add-in for Microsoft Office Word 2007

Filed under: Uncategorized — ptsefton @ 3:32 pm

The Article Authoring Add-in for Microsoft Office Word 2007 (AAAiMOW1) has been turned loose as a release candidate. I looked at an earlier version of this a while ago.

The name of the thing doesn’t let on that it is targeting just one version of what an article looks like, in the form of the NLM schema. I’m not sure if that reflects confidence that the NLM schema is generic enough to cope with all articles or anticipates a future version which can support multiple formats.

I had a lot of questions in my previous post, most of which I think are not yet answered, although Pablo Fernicola did drop by my blog and shed light on some of the issues.

This time, with a fresh virtual installation of Windows XP running under VirtualBox on OS X, the plugin worked a bit better for me so I could see it in full flight. I still have some serious concerns about this add-in thing and what it might mean for organizations.

I was going to make a few quick comments about usability, preservation and lock-in but this post kept growing: I emailed Jon Udell for his take and did a few tests, and it’s ended up well on the way to 2000 words.

[Update : (Minor edits and fixes)

I should point out that while this post is quite picky I’m glad to see this work going on in Microsoft and I’d love to see how it works out.

Look, if you’re not concerned about using an application which is only for Word 2007 on Windows XP or Vista to create articles which you don’t need to re-use or archive then most of what I’ve got to say here is irrelevant.]

Usability?

I can’t find any reports of how this plugin works in real life. Has anyone tried it? Are you all under NDAs?

I’m concerned about the way that you can add NLM structural elements all over the place, and nested inside each other in bizarre ways, but then you can’t save to the new proprietary .nlmx format because of validation errors.

It would be pretty easy to show how you can create invalid structures using this plugin but I don’t really think that’s a useful stunt to pull. What I want to see is what real problems, or lack of them, people have with the structural stuff.

Me, I found it a bit weird but as I said I didn’t try to write an article with it.

There’s one interface device that I really like. Each ’section’ element gets a little handle above it so you can drag the whole thing around:

graphics1

It would be really nice if this applied to the document outline as well, as part of the normal Word interface, not just to the special embedded XML sections. I could just style a bit of content as Heading 2, which is part of the document outline structure, and be able to drag around the whole of that implied section. Word already does something very like this if you use the Outline view. Of course, dragging sections in an NLM document doesn’t make sense as they’re supposed to be in a particular order, but I don’t imagine most people would drag the high-level sections. (There’s some kind of complex process for dealing with section ordering or editors, I think).

I’m not sure if I get why the embedded XML is any better than just recognizing that the text ‘Abstract’ in Heading 2 style is the start of the Abstract section. Or you could define sub-classes of heading if you really wanted to, such as Heading 2 - Abstract.
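As a rough illustration of what I mean by recognizing sections from styles alone, here is a sketch in Python. It assumes the document has already been reduced to a list of (style name, text) pairs; that input format and the function names are made up for the example and are not taken from ICE or from the add-in.

```python
def split_sections(paragraphs, heading_style="Heading 2"):
    """Group (style, text) pairs into sections that start at each heading.

    Paragraphs that appear before the first heading are ignored in this
    sketch; a real converter would have to decide what to do with them.
    """
    sections = []
    current = None
    for style, text in paragraphs:
        if style == heading_style:
            # A heading paragraph opens a new implied section.
            current = {"title": text.strip(), "body": []}
            sections.append(current)
        elif current is not None:
            current["body"].append(text)
    return sections


def find_section(sections, title):
    """Return the first section whose heading matches title, ignoring case."""
    for section in sections:
        if section["title"].lower() == title.lower():
            return section
    return None
```

With this, `find_section(split_sections(doc), "Abstract")` picks out the abstract without any embedded XML, as long as the author used the heading style.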

You could still have a toolbar like this so that people can drop in sections where they want them:

graphics2

Lock in

This add-in represents a new opportunity for Microsoft to lock users in to Word, having just moved on from the proprietary .doc format. This is not just a matter of trying to sell more copies of Microsoft Office; it’s about encouraging users to create documents that only work with a particular version of Office.

We have just been through a great long debate about standardizing word processing formats. Microsoft got their way and had their OOXML format accepted as an ISO Standard (ISO/IEC DIS 29500). The benefit is supposed to be that when you write a word processing document it can be managed and edited in more than one application, but I have always been very dubious about how this fits with the way you can embed arbitrary foreign XML in Word documents. By contrast the Open Document Format approach is an RDF based extension mechanism which seems a lot cleaner.

I tried out some simple interop with an AAAiMOW document.

  1. Word 2008 on OS X can open it, and you can edit the document at least a little bit, apparently without breaking it, but anything you add doesn’t have the magic embedded XML. It round tripped without error but I assume you could break some of the XML.

  2. NeoOffice Writer on the Mac can open the .docx file and you can edit, but if you save it and re-open in Word then you get an error. The good news was that Word 2007 was apparently able to rescue the content but the bad news was that the embedded XML went AWOL.

At the moment I would not have any confidence that anything except Word 2007 can deal with documents created with the add-in, which is as advertised. Of course, if that’s what your team of scientists is using then no problem, provided you think about how you will preserve the outputs (see below).

That quick interop test was using the new .docx format which is not the same as ISO/IEC DIS 29500, which won’t be available as a Word format until the next version.

One of the features of the AAAiMOW is a new file format. Yes. A new non-standard file format which is a misbegotten mashup of OOXML and NLM. I’m not sure how this is different from the way the content is embedded in the .docx file. From the readme file:

Both the article contents and metadata authored through the add-in are stored using XML, as part of a single file, using the Open XML format for the content and the NLM tagset for the metadata. Content which does not have an equivalent in Word, or extends existing Word elements, is stored as custom XML elements within the Open XML data stream. When a file is saved in the NLM format, the resulting XML file is stored within a nlmx file, using the same Open Packaging Conventions used by docx files, providing a single file which can package all related content (such as images) and supports extensibility.

Meanwhile, the next service pack for Word 2007 will add support for the Open Document Format (ODF) as a native file format. I’m assuming the plugin won’t work with ODF. (Pablo, am I wrong?)

There’s some very alarming use of the passive voice in the documentation too, a classic computer industry trick. Say it can be done without mentioning who’s going to do it and how much it’s going to cost.

Based on the use of Open Packaging Conventions, the Open XML format, and the NLM tagset, tools can be built to access any part of the file, content or metadata, and extract, validate, or add information to the file, as part of the publishing workflow.

Can be built? Please. We have one format mixed in to another format using a user interface that is only accessible from an expensive proprietary application. I’m sure I could write a script to pull the NLM bits out of the Open XML, but for each new kind of embedded XML I would have to rewrite my code and test that it works with the user interface code that has been added to Word. In this case it involves dealing with some special attributes to re-order sections (I think), which doesn’t look easy or pleasant to me.
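For what it’s worth, the kind of script I have in mind would start out something like the Python sketch below. It assumes the embedded markup shows up as w:customXml elements in word/document.xml, each carrying w:element and w:uri attributes, which is how WordprocessingML represents custom XML in general; I have not checked that against what the add-in actually writes, and the function name is made up.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_custom_xml(docx_file):
    """Return a list of (element name, schema URI, text) tuples.

    docx_file may be a path or a binary file-like object. Only
    word/document.xml is read; headers, footers and any other parts
    of the package are ignored in this sketch.
    """
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    found = []
    for node in root.iter(f"{{{W_NS}}}customXml"):
        name = node.get(f"{{{W_NS}}}element")
        uri = node.get(f"{{{W_NS}}}uri")
        # Concatenate the text runs nested anywhere inside the custom element.
        text = "".join(t.text or "" for t in node.iter(f"{{{W_NS}}}t"))
        found.append((name, uri, text))
    return found
```

That gets you a flat inventory of the embedded elements. Turning it into valid NLM XML, in the right order, is where the real work (and the rewriting for each new schema) would be.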

[Update: to be clear, there is no way that I can see for an author to export the NLM XML format only. I’m assuming that must be something that happens using a different tool.]

And it is worth remembering that this plugin is not accessible to the majority even of Windows users. For example here at USQ Word 2007 has not been rolled out yet2. And the plugin is not available at all on platforms other than Windows. That’s not what I hoped the new standards-wielding Microsoft was on about.

Preservation

There are going to be serious issues with preservation. What are archivists supposed to do with bastard mashed-up formats like this which depend on a particular package to make sense of them?

It is true that for documents that make it to publication in the NLM XML format this should not be an issue: the resulting XML should be perfect for archiving. But I can see that a lot of things that are of value might not make it through to XML. What about archived author’s manuscripts which are one of the backbones of Open Access? What about the original editable files for images drawn using Microsoft Office tools, which are embedded in the source file?

Think about what would happen if this approach became common for different XML formats: there could be a proliferation of non-standard polluted Word documents to deal with in repositories.

This add-in represents the Microsoft business model in action. See Brian Jones’ response to my probing on the issue of how this bastard mashup stuff is supposed to work. I quoted this last time, but it’s worth reminding ourselves that this is what Microsoft is about, never mind the standards:

There is a huge market that exists today for custom Office solutions. People customize the Office applications in all kinds of ways to try to get more out of their documents. By adding the support for custom defined schemas, we made it much easier to build semi-structured solutions on top of Word. Rather than rely on hacks with styles or bookmarks, folks could create a simple schema and add some XML tags into their existing document solutions.

http://blogs.msdn.com/brian_jones/archive/2005/07/08/436973.aspx#452483

Brian Jones calls using styles to carry semantics ‘a hack’ and yet embedding foreign XML in a Word document and hand-crafting a user interface to deal with the resulting mishmash of tags is somehow not a hack? I agree that styles and bookmarks (and tables; we use them a lot) are somewhat limited carriers for microformats, but the XML embedding thing has always looked like a trap to me: too expensive to set up and maintain and too much embedded in the Windows world. As I mentioned above, I think the new extension mechanism for ODF may be a better compromise; maybe we’ll see that in the ODF support in the service pack release in 2009.

An alternative

There’s an alternative approach which is to use features that are common to word processors in general and which are expressed in the underlying file formats directly, which I wrote about in my last post. There would be some interesting challenges in finding interoperable ways to embed all the ’special’ items that are allowed. Some of these are already supported in our ICE templates but not quite with the same structural rigor as in the add-in.

graphics3

Chris Rusbridge from the lamented in the comments of my last post that we don’t do NLM export from ICE but I reckon we could produce NLM XML from ICE documents with no more subsequent work required of editors than you would get using the AAAiMOW (I’m guessing; we have no data about how well it [the add-in] works and I have yet to work through the section of the documentation for editors).

The readme tells us that styles don’t work (emphasis mine):

Custom XML elements are used to represent other abstractions that exist in the NLM tagset, but that are not found in Word, and to do so in a manner that can be presented to the author for editing in a robust way (unlike the use of custom styles, which was one of the ways to try to solve this problem in earlier versions of Word, and was not very reliable).

I have no doubt that there are lots of terrible style based systems out there, but we have worked hard on making styles usable, interoperable and easy to apply and providing robust rapid-feedback document conversion.

(Maybe both of us are wrong: MJ Suhonos at PKP thinks that you can create XML without using either styles or embedded XML by using document formatting to infer structure.)

Would anyone care to fund a small project to see if we can use ICE to produce similar overall results (in terms of overall ROI) to the AAAiMOW but in a cross-platform solution? Microsoft? Anyone in the UK with access to JISC funds? A publisher?


1 Sounds like a noise emanating from a petulant feline.

2 We’re bracing for the onslaught on the help desk as hundreds of users have to re-learn commands they’ve been using for their whole working lives. It seems to us on the ICE team that this is a perfect time to introduce our users to the copy of OpenOffice.org which will be on their computer.

2008-07-31

Improving VALET - part 2

Filed under: Uncategorized — ptsefton @ 11:14 am

This is my second post on the VALET repository deposit tool. Again, if you’re not a repository aficionado you can probably move on1.

Still here?

One of the issues we confronted with VALET was to rewrite in Java or not to rewrite in Java? VALET is written in Perl and quite nicely written in my opinion, apart from the HTML forms which are a big mess of non-valid HTML. There’s nothing wrong with that as such, but it does have a couple of downsides relative to Java:

  1. VALET requires a web server to be installed. VITAL used to ship with Apache but it no longer does, so to run VALET you can end up having to compile and install Apache, and obtain some other dependencies. If it were a Java application then you could just drop it in to the same servlet container as you use for VITAL and Fedora.

  2. We have heard from some of the, um, younger techies in the ARROW community that Perl is a complete mystery. Others report difficulties in hiring Perl programmers, whereas everyone does Java at uni these days.

On the other hand, there are some reasons not to want to do a port:

  1. Some of the ARROW contingent have been using Perl since 1934 and can at least tolerate it. I’d count myself in that group. Fortran anyone?

  2. Hacking a Java program is not as simple as using a text editor to change a Perl file, because you need to compile (and worry about stuff like CLASSPATH, ugh).

  3. A port will create a huge fork.

All these points count for something, but Prashant from University of South Australia has pointed out that using JSP (to which I’m allergic, like PHP and ASP) gives a much easier entry point for ‘casual’ developers. Even if it does fork, VALET is actually a fairly small application, so the investment is not huge and the gain for sites where they want to just consume the software should be worth it.

In the end the group here at the VALET camp decided that there was enough interest in a Java version that they were going to go for it. Nobody would own up to being a Java expert but four or five confessed to having written production Java code.

They’re creating an application as I write this. While they do that Harry, Duncan and David are integrating all the changes that ARROW sites made to VALET and submitted to the Google group. So the Java team will have a moving target as they re-implement the Perl code.

The Perl version won’t be going away but it looks like at least some sites will move straight over to the Java version once it’s done.

So what are the Java team (Tim, Guy, Prashant and Cyrus) doing?

They’re starting a VALET compatible clone. The idea is that you should be able to take an existing VALET workflow and data entry forms and with minimal effort, port it to run in the new application. Best case would be no work at all required; the new application will be a drop-in replacement for VALET. We’ll see if that can be achieved.

The new app rejoices in the working title of Squire, which is not an acronym; it shows that the developers know how to use a thesaurus. Or is it named for the fish? I reckon they should call it Alfred or Pennyworth2. Either way, it’s better than the original working title of Black Hole, which would be like calling your deposit interface Roach Motel. Although at least if you had a repository deposit called Black Hole you could claim very high rates of compression for data. Just don’t mention decompression.

The new Java platform will make it easier to do some of the other changes that the community are asking for (we’re discussing this on the ARROW Google group for those of you in the inner-circle), in some cases because there are more repository-oriented libraries for Java than for Perl but also just because as a community we have more competent Java programmers than Perl programmers these days.

Here are some enhancements that we will probably do at USQ at some stage. There are lots of other requirements too, which we are not going to forget; these are just the ones that I can speak for at this stage:

  1. A SWORD deposit so the application can push content to repositories other than Fedora. We’re going to look at deposit of complex objects over SWORD in the TheOREM-ICE project very soon so this will be a quick add-on.

  2. The inevitable ICE interface so that if you submit a styled word processing document to Squire it will generate good quality HTML and PDF renditions automatically. We’re working with Ian Barnes at ANU and talking to the PKP people about how we might be able to do a better job of inferring document structure than the standard, breathtakingly abysmal Save as HTML feature in word processors. Another step in my campaign to stamp out PDF-only Web 0.5 repositories, at least in Queensland.

  3. Automatic embedding of metadata and license in the PDF file in XMP format, based on some work which is apparently going on in collaboration between QUT and an Australian Government agency.

  4. A lightweight complete open source repository package with Squire for deposit plus Sun Of Fedora as a portal. Not a lot of features, or complexity, just the basics.


1 If you don’t want to read about repositories, I recommend Bike Snob NYC. Which prominent fast but not fast enough Australian cyclist was he talking about last week?

Firstly, there was Saunier Duval’s impressive one-two finish, proving once again that there is no “I” in “team.” (Though there is a “moi” in “chamois.”) Secondly, ___ ____ (whose collarbones are only intact after yesterday’s crash because they have both been replaced by titanium) proved he is in fact a great stage racer by taking the Maillot Jaune by one second. (Anybody can blast his way up a mountainside in a distateful display of power, but it takes a certain dignified restraint to sidle up behind people and pilfer seconds the way ___ does, like an uninvited party guest nabbing cocktail weiners.)

http://bikesnobnyc.blogspot.com/2008/07/rest-day-roundup-stealing-seconds-and.html

2 Bron Chandler points out that there is some potential for recursive naming in the tradition of GNU and HURD. Alfred Pennyworth is sometimes known as Batman’s batman. What would VALET’s nemesis be called? Do valets have nemeses? Do nemeses have valets?

2008-07-30

Improving VALET - part 1

Filed under: Uncategorized — ptsefton @ 4:33 pm

This week the ARROW community is having a get-together for developers to work on the VALET repository ingest tool. This is probably of little interest if you’re not a repository person (or rat) but if you are then this may be of interest whether you are associated with the VITAL / Fedora world or not.

VALET is a deposit tool designed to allow self-deposit of electronic stuff into a Fedora repository, specifically one running VTLS VITAL. The bit about VITAL is crucially important: Fedora is an underlying storage layer, a kind of database, and different software will use it in different ways. VITAL has some tricks for storing datastreams derived from other assets, such as full-text extracted from PDF, that other software like Fez would not understand.

VALET comes in two versions.

  1. There’s an open source one, Valet for ETDs, which is set up initially just to deal with Electronic Theses and Dissertations (ETDs). It’s available from the VTLS website or from Google Code (last week the one at the VTLS site was out of date, and the package for download from Google Code was slightly less out of date, but I think they might be up-to-date now).

  2. The other version is mostly the same but is not free. It is important to make the distinction because if you customize the non-free version then you would have to ask VTLS for permission to redistribute it, possibly even within your own institution. I am not a lawyer (although I have a 10 year old who is threatening to become one) but I would be very cautious about changing a file that says (c) <Some Corporation> All rights reserved. (Her other potential career is being a computer programmer; it might be a good idea to do both so she can be rich and happy.)

So the outcome of the workshop will be to get a version of the open-source VALET with the best of the modifications that people have made at their sites, with maybe some new features.

One much requested feature for VALET (and for VITAL too) is to be able to edit submissions that have already been approved and pushed through VALET workflow into the repository. It’s kind-of surprising that VALET doesn’t do this already but it doesn’t.

I had an idea about how this might work last week, and Tim McCallum has implemented the first part of it already. To explain it we have to go into a little bit of detail about how VALET works. VALET takes a very simple approach to workflow, of which I for one approve. In simple terms:

  • An administrator defines a workflow with a set number of steps and says who can approve a submission at each step.

  • An administrator defines a web form, based on the example(s) shipped by VTLS to collect the metadata required for a submission.

  • At each stage the software simply serializes the information in the form into XML and saves it on disk.

  • For each new stage the program picks up the information from disk and puts the values back into the form.

  • At the final stage the program runs XSLT stylesheets (supplied by the administrator) to transform the serialized form data into the ‘proper’ metadata for the repository.
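The save-and-reload cycle in the steps above can be sketched in a few lines of Python. This is not VALET’s code (VALET is Perl) and the element names formdata and field are invented for the example; the final XSLT step is left out because the Python standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET


def save_form(fields, path):
    """Serialize a flat dict of form values to an XML file on disk."""
    root = ET.Element("formdata")
    for name, value in fields.items():
        field = ET.SubElement(root, "field", name=name)
        field.text = value
    ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)


def load_form(path):
    """Read the XML back into a dict so the next stage can refill the form."""
    root = ET.parse(str(path)).getroot()
    # An empty element parses with text None, so map that back to "".
    return {field.get("name"): field.text or "" for field in root.findall("field")}
```

Each workflow stage would call `load_form`, let the approver edit the values, then call `save_form` again. Keeping a copy of this file next to the deposited item is all the extra datastream amounts to.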

What Tim has done is simply to create an additional data stream containing the form data along with the other data streams when an item is approved. This means that it will be there alongside the repository item and all the other metadata streams. I think this will be really useful in solving some of the ongoing issues people are having with their repositories. For example, you might want to capture author email addresses but there is no sensible place to put them in a MODS datastream.

I know, some of you are thinking about standards: how can I save my important data in a non-standard format? To which I say, better to save your data in a form which is not standard and not pretending to be standard, than to rush into inventing a new standard which only you support. Is there a standard out there that captures all the data you want to save? Then use it. If not, capture the data now and work with the community to define the standard you need.

I’m not the only one who had this idea. I found out that Vicki Picasso from Newcastle also thought it would be good to capture the VALET form.

This approach is actually very similar to what you do in ePrints: you can define any old metadata you want (as long as it’s flat name-value pairs) and map it to Dublin Core as you see fit for dissemination purposes.

In VITAL, and in our Sun Of Fedora repository portal project, you can index any XML datastream you like. So if you want to collect HERDC categories (that’s to do with reporting research publications to the Australian Government, very important stuff) then you can, without having to jam them into a metadata schema that was not designed to take them.
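A sketch of that kind of mapping, in Python, might look like this. The local field names and the mapping table are hypothetical, not ePrints configuration; the Dublin Core element names on the right-hand side are the real ones.

```python
# Hypothetical local field names mapped to Dublin Core elements.
DC_MAP = {
    "title": "dc:title",
    "author": "dc:creator",
    "abstract": "dc:description",
    "year": "dc:date",
}


def to_dublin_core(record, mapping=DC_MAP):
    """Map flat name-value pairs to (Dublin Core element, value) pairs.

    Fields with no entry in the mapping are left out of the result, so
    they stay in the local record but are not disseminated.
    """
    dc = []
    for name, value in record.items():
        target = mapping.get(name)
        if target is not None:
            dc.append((target, value))
    return dc
```

A field like an author email address simply has no entry in the table: you keep it with the item, and it never has to be squeezed into the disseminated metadata.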

Next steps in the work Tim started:

  1. Work out how to search for and retrieve an item to be re-edited, putting it back in the workflow.

  2. Work out how to create the formdata from existing items that did not get put in the repository. We already have some experience with generating VALET form data based on a very cool idea by Simon McMillan of UNE who can’t make it to the workshop. Get well Simon!

(I put it to my daughter that she could be a programmer and a lawyer and that would make her rich and happy. She said of course being a lawyer would make her rich and happy. I asked what would being a programmer make her? A nerd, apparently.)

2008-07-24

More on Buzzword

Filed under: Uncategorized — ptsefton @ 11:12 am

Two people have recently reminded me about Adobe’s online word processor, Buzzword. Coincidence? Groundswell of popularity? Probably not, as they are married to each other.

Anyway, it has improved a bit since I first looked at it. At least it has HTML export now (it handles lists wrongly, nesting lists inside lists instead of inside list items, but that’s a common mistake). Still no styles or headings and I fear that it is trying to get people to lock up their documents in some kind of proprietary Flash and/or PDF format.
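The list mistake is easy to test for. In valid HTML a nested ul or ol has to sit inside an li, not directly inside the parent list. Here is a small checker in Python using only the standard library; the class and function names are mine, and it is a sketch rather than a validator, since it only tracks ul, ol and li and ignores everything else.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ul", "ol"}


class ListNestingChecker(HTMLParser):
    """Count lists opened directly inside another list instead of inside an li."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open ul / ol / li elements, innermost last
        self.problems = 0

    def handle_starttag(self, tag, attrs):
        if tag in LIST_TAGS and self.stack and self.stack[-1] in LIST_TAGS:
            self.problems += 1
        if tag in LIST_TAGS or tag == "li":
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if (tag in LIST_TAGS or tag == "li") and tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed li elements.
            while self.stack.pop() != tag:
                pass


def count_misnested_lists(markup):
    """Return how many lists in markup are direct children of another list."""
    checker = ListNestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems
```

Running it over an exported document gives a quick count of how many lists would need clean-up before the HTML could be called well structured.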

Adobe are asking for feedback so I gave some over at the Acrobat.com blogs.

I think that there’s an opportunity for Adobe to do what I think Google should have done with Google Docs (used to be Writely). I suggested this:

What could be done differently over at Writely so they can reliably import documents and get the lists right, and better still, let people start off in Writely online and produce word processing docs to send out to others?

The Writely / Google people could design a well thought out, freely available generic word processing template that works more or less equally well in various different word processing environments (hint - you’ll need some clean-up code to help the poor word processors keep their lists straight).

http://ptsefton.com/blog/2006/03/21/writely,__meet_the_ice_template/

I think Buzzword should not only use styles, it should get a well designed set of generic styles as a basis, and the Adobe folks should build templates which are Buzzword compatible. The online service that does this first has the best chance of bridging the gap from the offline to the online world.

If I create a document in Buzzword why not make the default export to Word use some Adobe-defined styles and give the user a Buzzword-like toolbar to play with them, post the doc back to Buzzword etc? In all the online word processors I have tried, import and export are appalling and I’m sure this must slow adoption.

At the moment all the online word processors are far behind on features that are needed for some documents. You couldn’t write a thesis in Buzzword (not if you wanted tables of contents and figures and numbering and reference management) but you could draft some stuff in there or collaborate on papers, then export into Word, or FrameMaker or something to finish the job. Here a well thought out style set would really help with interop.

Adobe, if you want any advice on word processing templates, drop me a line. (Someone from Google did, but the conversation didn’t go anywhere). The ICE project has some templates you might like to look at.
