Opening up Microsoft

Glyn Moody has posted Open Science, Closed Source in which he takes John Wilbanks to task for collaborating with Microsoft and effectively perpetuating Microsoft’s stranglehold on word processing.

I agree with this analysis:

Working with Microsoft on open source plugins might seem innocent enough, but it’s really just entrenching Microsoft’s power yet further in the scientific community, weakening openness in general – which means, ultimately, undermining all the other excellent work of the Science Commons.

It would have been far better to work with OpenOffice.org to produce similar plugins, making the free office suite even more attractive, and thus giving scientists yet another reason to go truly open, with all the attendant benefits, rather than making do with a hobbled, faux-openness, as here.

Looking at the example here and reading Pablo’s Blog I share Glyn Moody’s concern. They show a chunk of custom XML which gets embedded in a Word document. This custom XML is an insidious trick in my opinion, as it makes documents non-interoperable. As soon as you use custom XML via Word 2007 you are guaranteeing that information will be lost when you share documents with OpenOffice.org users and potentially users of earlier versions of Word.
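
To make the failure mode concrete, here is a toy sketch of two ways a consumer might treat a wrapper element it does not understand. The markup is invented for illustration: it is not real WordprocessingML, the `custom` element and `ex:hypothetical-term` value are made up, and neither function is a description of what any particular converter actually does.

```python
import xml.etree.ElementTree as ET

# Toy markup, NOT real WordprocessingML: a <custom> wrapper tags a run of text.
# Assumption: all text lives inside <r> runs, never directly in <p> or <custom>.
DOC = (
    '<p><custom term="ex:hypothetical-term"><r>Aspirin</r></custom>'
    "<r> is a drug.</r></p>"
)

KNOWN = {"p", "r"}  # elements this imaginary consumer understands


def strip_unknown(elem):
    """Drop unknown elements together with everything inside them."""
    for child in list(elem):
        if child.tag not in KNOWN:
            elem.remove(child)
        else:
            strip_unknown(child)


def unwrap_unknown(elem):
    """Replace unknown elements with their children, keeping the text."""
    i = 0
    while i < len(elem):
        child = elem[i]
        if child.tag not in KNOWN:
            elem.remove(child)
            for offset, grandchild in enumerate(list(child)):
                elem.insert(i + offset, grandchild)
            # Do not advance: the promoted children are examined next.
        else:
            unwrap_unknown(child)
            i += 1


def text_of(xml, transform):
    """Apply a transform to the parsed document and return its plain text."""
    root = ET.fromstring(xml)
    transform(root)
    return "".join(root.itertext())


if __name__ == "__main__":
    print(repr(text_of(DOC, strip_unknown)))
    print(repr(text_of(DOC, unwrap_unknown)))
```

In this toy the stripping consumer loses the word itself, while the unwrapping consumer keeps the word but still loses the association with the term. Either way the semantic tag does not survive the trip.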

For something like an ontology this is completely unnecessary: all you should need to do is link to a web endpoint for a term to associate a word in your text with an ontology. This is similar to something I discussed with Les Carr: a repository like ePrints could provide endpoint pages such that linking to one asserts, for example, that Les Carr is an author of the linking document. That is, the endpoint bundles the predicate and the object of an RDF assertion, and the document that links to it is the subject. All the users have to do is link to a page. That is a nanoformat that will work with any web-capable tool on the planet, including wikis and text editors. Am I missing something here about how this could work at the file format level?
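
As a rough sketch of the link-as-assertion idea, a consumer could read the assertions out of any HTML rendering of a document. Everything specific here is an assumption made up for illustration: the `repository.example.org` host, the `/assert/<kind>/<identifier>` path scheme, and the mapping to `dc:creator` and `dc:subject`. None of it is an existing ePrints feature.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical endpoint: each page under it bundles a predicate and an
# object, e.g. /assert/author/les-carr means "has author Les Carr".
ENDPOINT_HOST = "repository.example.org"
PREDICATES = {"author": "dc:creator", "term": "dc:subject"}  # illustrative


class AssertionLinkParser(HTMLParser):
    """Collect (predicate, object) pairs from links to the endpoint."""

    def __init__(self):
        super().__init__()
        self.assertions = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        url = urlparse(href)
        parts = url.path.strip("/").split("/")
        # Expect paths shaped like assert/<kind>/<identifier>.
        if url.netloc == ENDPOINT_HOST and len(parts) == 3 and parts[0] == "assert":
            predicate = PREDICATES.get(parts[1])
            if predicate:
                self.assertions.append((predicate, href))


def extract_assertions(document_uri, html):
    """Return (subject, predicate, object) triples for one document."""
    parser = AssertionLinkParser()
    parser.feed(html)
    return [(document_uri, p, o) for p, o in parser.assertions]


if __name__ == "__main__":
    sample = (
        '<p>By <a href="http://repository.example.org/assert/author/les-carr">'
        'Les Carr</a>, see also <a href="http://example.com/other">this</a>.</p>'
    )
    for triple in extract_assertions("http://example.org/my-paper", sample):
        print(triple)
```

The point of the design is that the authoring side needs nothing beyond an ordinary hyperlink, so the assertion survives in any format that can carry a link; only the consumer needs to know the endpoint convention.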

On top of the interoperability it might be really useful to have some custom code embedded in Word to help people apply these links, and if Microsoft are involved in that and the stuff is available open source then that’s fine. It might help them sell a few more copies of Office, but on the basis of ease of use, not impossibility of escape. Sun could even be involved in a similar effort for OpenOffice.org or Star Office if they wanted; they just don’t seem to care too much about eResearch or getting their word processor to talk to the web at the moment.

Another example I have been vocal about is the Microsoft NLM XML Add-in. If it works to allow ordinary Word users to create XML to the NLM schema then it too will be one of these open-yet-closed Microsoft systems. Open source, yes; based on an open format, yes[1] (in fact, in this case, two open formats); yet it will only run in Word 2007 on the Windows platform, and the source documents you create with it will only work on that platform. It may be useful, but it will also continue the Microsoft stranglehold that Glyn Moody was complaining about. I just don’t buy the argument that they can’t implement the same thing using styles and have something that would at least interoperate with other word processors (including other Word versions).

I think we are seeing a new kind of format lock-in; a kind of monopolistic wolf in open-standards lambskin. I’m not saying that this is deliberate, at least not at Microsoft Research where the staff seem to be well meaning, open, communicative and friendly. They keep talking to me even though I keep ranting at them, at least so far.

I warned Jim Downing from Cambridge about this kind of lock-in when I was over in Cambridge last year and he has taken up this issue with the Microsoft developers working on Chem4Word as discussed here by Peter Murray-Rust, who also offers this in defense of the work they’re doing there and a follow-up.

In conclusion I offer this: I would consider getting our team working with Microsoft (actually I’m actively courting them as they are doing some good work in the eResearch space) but it would be on the basis that:

  • The product (e.g. a document) of the code must be interoperable with open software. In our case this means Word must produce stuff that can be used in and round-tripped with OpenOffice.org, and with earlier versions and Mac versions of Microsoft’s products. (This is not as simple as it could be when we have to deal with stuff like Sun refusing to implement import and preservation for data stored in Word fields as used by applications like EndNote.)

    The NLM add-in is an odd one here, as on one level it does qualify in that it spits out XML, but the intent is to create Word-only authoring, so that rules it out. Not that we have been asked to work on that project other than to comment; I am merely using it as an example.

  • The code must be open source and as portable as possible. Of course if it is interface code it will only work with Microsoft’s toll-access software but at least others can read the code and re-implement elsewhere. If it’s not interface code then it must be written in a portable language and/or framework.

[1] Glyn Moody is adamant that the Office format is a pseudo-standard that has harmed the ISO Standards forum. That may be so. Me, I welcome having the format well specified.

  • Jim Downing

    Hi Peter,

    if I’ve understood this issue correctly; OOo has an equivalent to custom xml files, but there’s no interop because both MS and Sun converters chuck them away? I know you’ve mentioned that OLE objects offer better interoperability, and I’ll look at how we might transcode for that.


  • ptsefton

    @jim – No there is no direct equivalent to embedding ad hoc, custom XML in an ODT file. There is an extension mechanism based on RDF coming in ODT 1.2 but I don’t think it is implemented yet. See Rick Jelliffe’s post:

  • Doug Mahugh

    I agree with your view of the value of interoperability with open software. Unfortunately, that isn’t possible yet for this issue, because there is no approach to semantic tagging that has been implemented in open-source applications. When the RDF approach in ODF 1.2 becomes widely implemented (assuming it does, of course), there may be some interesting options there, but for now implementers need to either use a product like Word, or overload styles (or something similar) to get the job done.

    In most scenarios people have a specific schema in mind for tagging the content of the document, so with the styles approach you end up imitating XML structure through nested styles. That can get very complicated, and can cause interop issues for consumers that don’t handle nested styles well. I like the microformats/customXml approach because it’s simple for developers to work with, and consumers that don’t care about custom semantics can just ignore it without losing any formatting (as would happen if styles were ignored). It’s also easy to round-trip the customXml element, since it’s in the WordprocessingML namespace.

  • Ian Easson

    The essence of your objection to custom XML embedded in OOXML files is this:

    “This custom XML is an insidious trick in my opinion as it makes documents non-interoperable. As soon as you use custom XML via Word 2007 you are guaranteeing that information will be lost when you share documents with OpenOffice.org users and potentially users of earlier versions of Word.”

    You should learn more about something before you make a big fuss about it.

    First, it is entirely a question of the *consumer* of this information as to whether the custom XML gets lost. In the case of older versions of Word using Microsoft’s free add-on to read and write OOXML, THE INFORMATION IS NOT LOST. The “write” part of the add-in preserves any such information.

    As for Open Office not preserving it (if it does so, I don’t know), the fault is with OpenOffice, not with Word or with the OOXML ability to embed custom XML. Talk to them about it.

    As for the general argument that the custom XML in IS29500 documents acts to reduce interoperability, you are simply wrong, for two reasons:

    1) Applications that consume and generate IS29500 (OOXML) documents, if properly written, are supposed to ignore (not strip out) any such custom XML if they don’t understand it. So, the custom XML is totally invisible to such applications.
    2) For applications that are specially written to consume or generate a specific variety of custom XML, the ability to have such XML embedded in ordinary office documents acts greatly to IMPROVE their specific kind of interoperability.

    So, the end result is: improved interoperability, with no downside.

    You should rethink your attitude towards this matter after you have checked out the facts first.

  • Rick Jelliffe

    I think the comment that ODF has no equivalent to custom XML (i.e. the ability to stick in arbitrary XML) is kind-of incorrect.

    ODF has no equivalent *elements* to the customXML tags that allow foreign markup by indirection. However, it doesn’t need them, because it allows foreign markup directly, and allows applications to strip it (which is what OpenOffice does, for example): where is the significant difference between the formats for guaranteed interop?

    (This is complicated because OpenOffice implemented this in a way that makes it useless for wrappers. However, tables and sections could presumably be used, instead.)

    The other aspect of this is that it seems a bit strange to criticize a general feature that is specifically designed to allow value-added documents to participate in custom toolchains on the grounds that it impairs interoperability. It is like saying that the pen impedes interoperability because it may be used to write in Mongolian, which few people can read.

    Interoperability is not the be-all and end-all of qualities for a document format. Customizability and extensibility are useful things, in their place.

    So it seems to me that the question should be whether the customXML features are useful or appropriate or optimal for a certain case (or class of cases), rather than blanket statements. I think I am allowed to have different use cases, and therefore features, to you, aren’t I?