More on Microsoft Word and non-interoperable standards compliance

Glyn Moody was pleased with my response to his rant on Microsoft Collaboration.

Others were less so. And Jim Downing would like me to expand on what I think might be good ways to do interoperable plugins. I’ll start by dealing with the comments, which leads to looking at how plugins like Chem4Word might work. I don’t have time right now to do a lot of background research for this post so I will pose a few questions and I’d appreciate more feedback.

First up, Ian Easson has some advice for me.

The essence of your objection to custom XML embedded in OOXML files is this:

This custom XML is an insidious trick in my opinion as it makes documents non-interoperable. As soon as you use custom XML via Word 2007 you are guaranteeing that information will be lost when you share documents with OpenOffice.org users and potentially users of earlier versions of Word.

You should learn more about something before you make a big fuss about it.

Thanks Ian, I’ll try to fit that into my busy schedule.

He then says:

First, it is entirely a question of the *consumer* of this information as to whether the custom XML gets lost. In the case of older versions of Word using Microsoft’s free add-on to read and write OOXML, THE INFORMATION IS NOT LOST. The write part of the add-in preserves any such information.

Right. The information is written out into OOXML then when you open it in an earlier version of Word you can’t see it. I tried this with the NLM plugin and as far as I recall there was no structure apparent if you open the file in an earlier version of Word. The result is that if you try to edit the document you break the structure at which point THE INFORMATION IS LOST.

And then there’s OpenOffice.org:

As for Open Office not preserving it (if it does so, I don’t know), the fault is with OpenOffice, not with Word or with the OOXML ability to embed custom XML. Talk to them about it.

I’m pretty sure that the official Sun OOo build can read OOXML but not write it (thanks Sun for your commitment to interoperability), and whether it discards the XML is irrelevant in that case as there is no equivalent mechanism in the Open Document Format. There is another version of OOo with some code contributed by (I think) Novell which may preserve the XML, but the situation is the same as with earlier versions of Word: the XML is unlikely to survive editing of the document.

As for the general argument that the custom XML in IS29500 documents acts to reduce interoperability, you are simply wrong, for two reasons:

1) Applications that consume and generate IS29500 (OOXML) documents, if properly written, are supposed to ignore (not strip out) any such custom XML if they don’t understand it. So, the custom XML is totally invisible to such applications.
2) For applications that are specially written to consume or generate a specific variety of custom XML, the ability to have such XML embedded in ordinary office documents acts greatly to IMPROVE their specific kind of interoperability.

So, the end result is: improved interoperability, with no downside.

I disagree. As I noted above, this doesn’t promote interoperability in practice; it promotes sales of Word 2007. I think this ‘feature’ is a trick.

What does promote interoperability is the approach that Microsoft itself used to use with OLE embedding. We saw this in action with Peter Murray-Rust last year. He brought a Word document with embedded ChemDraw chemistry pictures in it. Guess what? When you open them in OpenOffice.org Writer you can view the rendered image, even if you don’t have ChemDraw. Even better, when you crack open the ODT file the original ChemDraw binary is there, and we were able to feed that to some open source software that PMR and team were involved in writing and extract data. The result of using OLE was: workable interoperability. That’s because OLE objects are embedded using standard interfaces not vague hand-wavy whatever-you-like Custom XML.

If he’d turned up with a document authored with the Microsoft Word ontology plugin we would have seen none of the embedded semantics in OOo Writer. So, the end result would have been: reduced interoperability and no upside.

You should rethink your attitude towards this matter after you have checked out the facts first.

Thanks for visiting. Feel free to correct any factual mistakes.

Doug Mahugh of the Microsoft Office interoperability project also dropped by.

I agree with your view of the value of interoperability with open software. Unfortunately, that isn’t possible yet for this issue, because there is no approach to semantic tagging that has been implemented in open-source applications. When the RDF approach in ODF 1.2 becomes widely implemented (assuming it does, of course), there may be some interesting options there, but for now implementers need to either use a product like Word, or overload styles (or something similar) to get the job done.

Actually, Doug, my question was for the specific case being covered in the ontology Add-in. If I want to link a bit of my text to a node in an ontology why could I not just use a link, as in Linked Data? Say I’m talking about my long suffering mongrel dog Spensa. I might want to link ‘Spensa’ using, say, his OpenID and maybe the words ‘mongrel dog’ to some kind of taxonomy and/or ontology (I guess a taxonomy is an ontology, or is that wrong?). This link could be styled to look unobtrusive in text, and could possibly be automatically footnoted for print viewing and linked in some cool way on the web. I had a go at this but I was unable to find a useful endpoint for my link for whatever species he is, canus-spensis I suppose.

Or the case I mentioned of wanting to assert authorship. What if my local ePrints repository had a page for me-as-author? It might look like http://eprints.usw.edu.au/authors/PeterMacolmSefton and resolve to a page that describes me and the works that are attributed to me, with a note there that says if you want to indicate that this person is an author of a paper, link their name as it appears on the work to this page. That is way simpler to implement than the stuff we talked about in our paper on embedding semantics in word processing documents.

Now, this linking is something anyone could do with a little training but you could build a Word plugin to make it easier. I’d have no objection to that, particularly because it is possible to build code that can work in both Word (for Windows) and OOo as we have shown in our ICE project.
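
To make the plain-link idea a bit more concrete, here is a minimal sketch in Python. It assumes the document has already been saved or converted to HTML, and it simply pulls out every (linked text, target URL) pair so that downstream software could treat each one as an assertion. The class and function names and both example.org URLs are made up for illustration; they are not real endpoints or part of any existing tool.

```python
from html.parser import HTMLParser


class LinkAssertions(HTMLParser):
    """Collect (link text, target URL) pairs from ordinary hyperlinks."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (text, href) pairs
        self._href = None    # href of the <a> we are currently inside, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_assertions(html):
    """Return every (linked text, URL) pair found in an HTML string."""
    parser = LinkAssertions()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Hypothetical sample: both URLs are placeholders, not real endpoints.
sample = (
    '<p>My long suffering <a href="http://example.org/taxonomy/mongrel-dog">'
    'mongrel dog</a> <a href="http://example.org/id/spensa">Spensa</a>.</p>'
)
print(extract_assertions(sample))
```

Nothing here depends on which word processor wrote the links, which is the point: a hyperlink survives a round trip through pretty much any editor.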

If I’m wrong about simple hyperlinks being a workable solution there are a couple of mechanisms in Word for embedding semantics. Styles are probably no use in this case, but fields could work. A custom field could contain the semantic info; it will get stored in the underlying XML just fine and it will work with any version of Word. Why didn’t the ontology tool get built using fields to store the info? It wouldn’t have anything to do with wanting to sell more copies of Word 2007 now would it? (I know, OOo dumps Word fields but it would still be an improvement.)
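
As a rough illustration of the field idea, the sketch below reads field instructions back out of a piece of WordprocessingML using Python's standard library. It assumes the semantic info is stored in an ADDIN-style field; the `SEMTAG` keyword, the example.org URL and the function name are hypothetical, and the sample XML is hand-written rather than taken from a real Word file.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def field_instructions(document_xml):
    """Return the instruction text of every field in a WordprocessingML string."""
    root = ET.fromstring(document_xml)
    found = []
    # Simple fields keep their instruction in the w:instr attribute.
    for fld in root.iter(f"{{{W}}}fldSimple"):
        found.append(fld.get(f"{{{W}}}instr", "").strip())
    # Complex fields spread the instruction over one or more w:instrText runs;
    # this sketch reports each run separately rather than stitching them together.
    for instr in root.iter(f"{{{W}}}instrText"):
        found.append((instr.text or "").strip())
    return found


# Hand-written sample; "SEMTAG" and the URL are hypothetical placeholders.
sample = f"""<w:document xmlns:w="{W}"><w:body><w:p>
  <w:fldSimple w:instr=" ADDIN SEMTAG http://example.org/taxonomy/mongrel-dog ">
    <w:r><w:t>mongrel dog</w:t></w:r>
  </w:fldSimple>
</w:p></w:body></w:document>"""

print(field_instructions(sample))
```

A real tool would also need to pair each instruction with the text the field displays, and to cope with instructions split across several runs.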

Bookmarks are another (clumsy) way to do interop. I know this stuff because I did a lot of checking of facts when we worked to build the first interoperable citation toolbar for Zotero that would let you round trip documents from Word to Writer and back.

I am well aware that this is a jungle. Microsoft is a big cat whose DNA tells it to sell more copies of Word 2007. That’s fine, none of my business. My business, amongst other things, is to help the people at USQ make web pages and books from their writing. And they don’t all have Word 2007 and they don’t all want Word 2007. I suspect that with the ontology project there would be a more interoperable way to do the in-document tagging. Maybe someone from the project could share some of the background. Did you try to use fields? Did you consider interop with OpenOffice.org? Do you test tagged documents that have been passed to older versions of Word, edited, saved and reopened?

Now, onto the final bit where I think Doug is talking about the use case where you want to author a document in Word that is later going to be mapped onto a complex schema. After working in this field since 1995 I am pretty well convinced that attempting to write a fully featured editor layered on a word processor for the likes of the NLM DTD or DocBook is not going to end well. I have seen enough of these things marched out at SGML and XML conferences with great hopes and waited in vain for the follow up presentation next year. Anyone remember BladeRunner? Microsoft SGML Author?

In most scenarios people have a specific schema in mind for tagging the content of the document, so with the styles approach you end up imitating XML structure through nested styles. That can get very complicated, and can cause interop issues for consumers that don’t handle nested styles well. I like the microformats/customXml approach because it’s simple for developers to work with, and consumers that don’t care about custom semantics can just ignore it without losing any formatting (as would happen if styles were ignored). It’s also easy to round-trip the customXml element, since it’s in the WordprocessingML namespace.

Easy to round trip in Word 2007, but what happens if someone without the plugin edits a document with the custom XML in it, particularly one like an NLM document where the stuff is all over the place? What if they reorder sections, or delete a section heading? Has this been tested or is the intention with the NLM plugin to recommend that people only use Word 2007?
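
The kind of test I have in mind could be sketched roughly like this: list the `w:customXml` elements in the main document XML before and after an edit, and report any that have gone missing. This is only a sketch under assumptions; the function names, the example.org schema URI and the `term` element name are invented for illustration, and the two XML samples are hand-written, not output from Word or the NLM plugin.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def custom_xml_tags(document_xml):
    """List (uri, element, tagged text) for each w:customXml element."""
    root = ET.fromstring(document_xml)
    tags = []
    for node in root.iter(f"{{{W}}}customXml"):
        # Gather the visible text wrapped by this custom XML element.
        text = "".join(t.text or "" for t in node.iter(f"{{{W}}}t"))
        tags.append((node.get(f"{{{W}}}uri"), node.get(f"{{{W}}}element"), text))
    return tags


def lost_tags(before_xml, after_xml):
    """Tags present before editing but missing afterwards."""
    after = custom_xml_tags(after_xml)
    return [tag for tag in custom_xml_tags(before_xml) if tag not in after]


# Hand-written samples; the schema URI and element name are hypothetical.
BEFORE = f"""<w:document xmlns:w="{W}"><w:body><w:p>
  <w:r><w:t>I have </w:t></w:r>
  <w:customXml w:uri="http://example.org/ontology" w:element="term">
    <w:r><w:t>Potter syndrome</w:t></w:r>
  </w:customXml>
</w:p></w:body></w:document>"""

AFTER = f"""<w:document xmlns:w="{W}"><w:body><w:p>
  <w:r><w:t>I have Potter syndrome</w:t></w:r>
</w:p></w:body></w:document>"""

print(lost_tags(BEFORE, AFTER))
```

For real files, the XML could be read with something like `zipfile.ZipFile(path).read("word/document.xml")`, assuming the main document part sits at that conventional location. Running a check like this over documents that have had sections reordered or headings deleted would answer the questions above with evidence rather than argument.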

Doug is right that overly complex style systems don’t work for authors. Been there. Done that. That’s why we have devised a simple, generic set of styles that users can use to create structured HTML, on the ICE project. Ian Barnes used the same styles to create DocBook, and I’m pretty sure we could do the same for NLM but all we would be doing is producing a small subset of those formats.

There are so many words flying around here it may not be clear what I’m complaining about. So I will try again. What I am saying is:

We in the Academic community have an interest in promoting interoperable authoring practices so our users across platforms can collaborate on documents together. Therefore if we get involved in working with proprietary software we should make a serious effort to make sure that what work we do will work as widely as possible. Our goal is not to promote Word 2007. If we could get the work done using an older version or a free alternative then we might save some money that we can use for whatever it is we do here in the academy.

I think that the XML format under Microsoft Office is being used as a distraction from real practical issues of interoperability. Whether this is good for Microsoft in the long run is for them to work out but I am pretty convinced that some of the solutions coming out of Microsoft Research are no good for me or my colleagues and I will continue to say so and continue to work with my team to do things in a way that benefits all of us.

  • http://universal-interop-council.org Paul E. Merrell, J.D. (Marbux)

    I can’t share Doug Mahugh’s optimism for the RDFa support coming in ODF 1.2. As originally drafted, the Metadata SC’s work included a provision that mandated xml:id preservation. But Sun launched a successful last-minute proposal to change “shall preserve” to “should preserve,” which grants discretion to destroy xml:ids whilst still claiming conformance.

    There was a fiery battle over it and Sun ignored all requests for a use case exposing the need for the change, but the Sun proposal was duly rubber-stamped by the big vendor-dominated ODF TC.

    Gary Edwards and I resigned from the TC on that occasion, which was only the latest in a series of TC decisions on ODF 1.2 that broke interoperability with MS Office. Two notable prior examples are the switch from ordered list tuples to list triples, and Sun’s refusal to even place on the TC agenda five Novell proposals aimed at compatibility with MS Office.

    So I don’t understand Doug Mahugh’s optimism about RDFa support in ODF 1.2, particularly since the Sun “should preserve” language appeared to be directly aimed at blocking the use of RDFa to work around the destruction of (conformant) foreign elements and attributes by OpenOffice.org in order to create MS Office ODF plug-ins that can interoperate with OOo.

    My sense is that these issues will get fixed at ISO/IEC JTC 1 after ODF 1.2 gets there. But between the time it takes to get ODF 1.2 through JTC 1 and the lag time before JTC 1’s changes are implemented, we’re probably talking about a multi-year delay in interoperability.

    I think it vitally important that people realize Microsoft is not the only big vendor who misbehaves in regard to document format standards.

  • Ian Easson

    The essence of your misunderstanding is in the following sentence:

    “The information is written out into OOXML then when you open it in an earlier version of Word you can’t see it. I tried this with the NLM plugin and as far as I recall there was no structure apparent if you open the file in an earlier version of Word. The result is that if you try to edit the document you break the structure at which point THE INFORMATION IS LOST.”

    Just because no structure is *apparent* in another application doesn’t mean the information is lost. As long as you don’t inadvertently (or deliberately) destroy the information, it won’t be lost. To prove that, all you have to do is re-open the edited document in Word 2007, and voila – the schema is intact and all tagged terms are still tagged.

    Repeat the following experiment, which I just did a minute ago, and then you will understand:

    1. Install and configure the ontology add-in to Word 2007 on one PC.
    2. Create a Word 2007 document that uses a term in one of the ontologies you said to use during the configuration of the add-in.
    In my case, the entire document was the sentence: “I have potter syndrome.” (“Potter syndrome” is a term in the Human Diseases ontology.)
    3. Save the document as a Word 2007 file.
    4. On another PC that has an earlier version of Word (in my case, Word 2003) and that of course has the “Microsoft Office Compatibility Pack for Word, Excel, and PowerPoint 2007 File Formats” installed, open the Word 2007 file.
    5. You will see that it doesn’t recognize the term (just as you mentioned).
    6. Edit the document in some way.
    In my case, I made 2 copies of the sentence, so I had three copies overall. The first I left intact. The second, I changed the words “potter syndrome” to “blah blah”. The third, I added the word “don’t” to the sentence, so it now read “I don’t have potter syndrome”. So, my document now read:
    I have Potter syndrome.
    I have blah blah.
    I don’t have Potter syndrome.
    7. In Word 2003, save the edited document in Word 2007 format.
    8. Open the edited document in Word 2007 on the original machine.

    According to you, the information is destroyed. Well, it isn’t.

    To be specific:

    - The schema is still there, intact, which is easily verifiable if you have the Developer tab installed.
    - The first, unedited sentence is still recognized by the ontology add-in as containing a recognized term (potter syndrome) with all its tags and meanings.
    - The second sentence (which has “blah blah”) is not recognized as using the custom XML (because it doesn’t).
    - The third sentence, which says I don’t have potter syndrome, is recognized as having the “potter syndrome” tag. Everything is fully functional.

    So, the information is NOT DESTROYED (sorry for the shouting!).

    This fact shows that the custom XML feature in IS29500 is *not* a trick.

    In my humble opinion, it finally realizes the original vision of SGML in the 80s, in that it provides a way for office documents to contain custom vocabularies (which can, of course, also be ones covered by international standards). That GREATLY increases interoperability between applications that need to consume, produce, or transform information in those custom vocabularies. There is no *inherent* information loss in OOXML applications like Word 2003 with the Compatibility Pack (i.e., not counting deliberate user destruction of information) that are properly written not to destroy any custom schemas or custom schema tags.

    As for OpenOffice, talk to whoever maintains it. From what you have said, it is not a properly conforming OOXML application in that it destroys the information. But that does not take away anything I have said.

  • Ian Easson

    One more clarification.

    You said that:
    “That’s because OLE objects are embedded using standard interfaces not vague hand-wavy whatever-you-like Custom XML.”

    You seem to think that the following are either undefined or non-standardized:

    1) The method of embedding a custom xml schema in the OPC container for an OOXML document
    2) The method of tagging words or sections in the body of a document, for the case in which those tags are from the custom XML schema.

    Neither of these is true. They are both clearly documented as part of the IS29500 and ECMA-376 standards.

    Those methods are INDEPENDENT (sorry for the shout) of the particular custom XML schema used. It is not that custom schema that defines the embeddings; it is the IS29500 or ECMA-376 standards. If the former were true, then you would have reason to worry, because there would indeed have been no standardized way of handling the interoperability of such documents.

    So, you can see why the use of custom XML in OOXML greatly *increases* the interoperability for applications that consume, produce, or modify files that make use of the hundreds of standardized XML vocabularies that have been defined for specific domains of knowledge. When you combine this fact with what I showed in my earlier comment about no loss of information by conforming applications that do not understand the specific custom XML vocabulary, then you come to my main conclusion: a big win for interoperability overall.

    I hope this helps.

  • Ian Easson

    Minor editing glitch: Please replace the phrase “If the former were true” by “If that were not the case”.

  • http://blogs.msdn.com/dmahugh Doug Mahugh

    Marbux, regarding RDF in ODF 1.2, I’m just hoping for the best. I hear you on the challenges there, and we’ll continue to work with the ODF TC to try to improve interoperability. We’re in a position where we need to balance our perspective on various issues with the “OMG, Microsoft is trying to dominate things” reaction that we seem to get for just falling out of bed some days.

    Peter, I’m with Ian on the fact that IS29500’s custom XML support is every bit as well-defined as OLE embedding is. I agree with your point that it’s messy for two different editing applications to round-trip documents with custom markup in them, especially if the document is edited in both apps, but that’s not the typical use case I’ve seen in custom XML scenarios. The tagging of content is usually done immediately before some type of automated processing by a non-desktop app, so the issue of what an editor should do with that markup doesn’t come up very often.

    As for whether custom XML is a thinly veiled plot to lock people into Word, I think another way to say it is that Word is the richest implementation of custom XML concepts in a word processing application today. We would love to have other implementations support these concepts as thoroughly as we do, exactly as they’re specified in IS29500 and ECMA-376, and I think many organizations would get great value out of the resulting interoperability. We’re the proud owners of the first fax machine, but the protocols have been standardized, wires have been strung between towns, and we’re looking forward to more fax machines coming online soon. :-)

  • Bruce D’Arcus

    Sigh … I see Paul is trotting out the same distortions about the ODF metadata process again. I see that again, elsewhere, he’s presenting conspiracy theories about backroom deals and such, suggesting my position on an issue was due to some ethically-dubious quid pro quo.

    Having been involved in all of those conversations (which Paul was not, by his own choice), I cannot recall a single engineer who agreed with Paul and Gary’s view. In fact, the strongest voice against their position in the end was Thomas Zander, who is one of the lead developers of KOffice. Hard to pin that entirely on Sun, then, or “big vendors.”

    I’m certainly biased, but I am optimistic that the new RDF support in ODF 1.2 will provide room for enhancing interop. But as I’ve said recently, this is about much more than just document format specs. Office 2007/2008 has citation and bibliography support, for example. But AFAIK, no third-party extensions actually use it. There are probably many reasons for this, but one big one is the inadequate API that MS exposes.