ptsefton

2008-04-24

Some comments on the NLM XML plugin for Word 2007

Filed under: Uncategorized — ptsefton @ 4:59 pm

I have been very slow getting to this, and I missed catching up with Microsoft people at Open Repositories 2008, but I wanted to make a couple of comments on the new Microsoft Word 2007 plugin for authoring journal articles in the (USA) National Library of Medicine XML format. I saw this originally via Brian Jones at Microsoft.

I have several questions about this plugin. Actually, they’re concerns more than questions.

I will look briefly at the plugin here, then try to put it in historical context, and then, as always, suggest another approach using the Integrated Content Environment (ICE), which we’re probably not going to have time to try out in the short term.

Microsoft’s Pablo Fernicola says:

The goal is to simplify several activities in the publishing workflow, from authoring to publishing and archiving, with this last step including conversion to the XML format from the National Library of Medicine. The current process of getting an article from the authors to a journal (increasingly electronic only) is a bit complicated and many times lossy, especially in relation to the metadata related to the article, we hope that the add-in will help simplify and improve the process.

http://blogs.msdn.com/exscientia/archive/2008/03/20/Technology-Preview-Launch.aspx

There’s a video, which shows Pablo creating a new document from a template and adding some stuff to it. After watching the demo I downloaded the software and tried it out. You need Word 2007, so that meant installing Windows XP in a virtual machine on the MacBook, along with Office and other stuff. I’m pretty sure this plugin will never be coming to Mac Word or to Linux, which means cutting out a growing number of authors.

I’m not sure if the Add-In is really working for me. I get a template but I can’t find any new user interface items or figure out how to add metadata. But then I can’t find anything in that new Word 2007 interface because I have not invested time in learning it.

A few of the questions that occur to me:

  1. Why does the demo show putting a table in the abstract? I checked, and it is allowed by the DTD, but I don’t think that articles usually have tables in the abstract. Also, I don’t think the demo shows valid NLM XML; shouldn’t the table be wrapped in a section element?

  2. What’s with the ‘let’s go ahead and make some stuff bold and underlined’ in the demo? Are there some more semantic elements that could be used instead?

  3. How come I can pick up the introduction and move it around my document? If I can do that what’s the difference from a style-based system where the structure is implied by styles? Also, I’m a bit confused by the empty paragraphs between sections. Are they meant to be there? What if I type in one of them?

  4. How would preservation people rate the resulting Word document? As far as I can see it’s a kind of mishmash of two standards, so to make sense of it you’re going to need lots of user interface code.

  5. What’s the development cost on an effort like this versus the likely saving in time at the journal?

  6. Are there any examples of similar efforts that work with decent ROI? Has anyone yet made a usable DocBook editor in Word? What about XHTML? TEI? I suspect the answer remains no, just as it was back in 2005 when Word 2003 had been around for a while. This suggests to me that Word is still not a good platform on which to build an XML editor. Happy to be proved wrong, though.

  7. How do I repurpose my article for other journals when it’s rejected? How about copy and paste with my other documents?

  8. How are the pilot authors liking it?

  9. How much support do we expect authors to need?

Now, let’s go back in history a bit.

In 2005 I corresponded with Brian Jones of Microsoft’s Word team about this kind of use for Word as an XML editor. I questioned just how the feature in MS Word that lets you mix in foreign XML schemas is supposed to work. Back then I called it the bizarre feature that lets you mix schema-controlled XML content in amongst Word’s own structure. See Brian’s reply in his comments with emphasis by me:

The first point is that our main scenarios weren’t about turning Word into an XML editor. As you can imagine, we have a fairly large user base, and investing the amount of resources we did into our XML support just to target the XML editor market wouldn’t have made a lot of sense. The XML support is really for a much broader set of scenarios.

There is a huge market that exists today for custom Office solutions. People customize the Office applications in all kinds of ways to try to get more out of their documents. By adding the support for custom defined schemas, we made it much easier to build semi-structured solutions on top of Word. Rather than rely on hacks with styles or bookmarks, folks could create a simple schema and add some XML tags into their existing document solutions.

We provide a fairly rich object model on top of the XML functionality, as well as the ability to save an entire Word document as XML (using the WordprocessingML schema). These tools make it much easier to build document generation and consumption solutions, as well as more reliable add-ins that act on the document while it’s being authored.

http://blogs.msdn.com/brian_jones/archive/2005/07/08/436973.aspx#452483

So Word’s XML schema integration is going to work at the level of small structural chunks like metadata but is not likely to work for schemas with several hundred elements in them.

I keep an eye out and I have not been able to find a single example of a full DocBook, or XHTML, or TEI or whatever application that’s built on Word 2003+.

Going back further in history, who remembers BladeRunner? Never even made it to market.

What about Microsoft’s own SGML author? Terrible thing, more than ten years dead.

With my limited imagination I can’t see how you can mash together a generic word processor and an XML editor and get something that’s going to be widely usable. Adobe FrameMaker had a sort-of XML editing mode, but it was not the sort of thing I’d expect ordinary authors to deal with, and setting up new documents required us to hire a consultant for a week just to get started. WordPerfect had a structured mode, but the way I remember it, it was very much segmented off from the ordinary word processing part of the application.

And even if this approach did work, it represents a new kind of vendor lock-in. Instead of having a standard format for your document, as with OOXML or the NLM XML format, you have an unholy mishmash of the two which requires custom code for users to edit it. The custom code runs in Word, so you have just lost one of the main benefits you’re supposed to get from a standard format, which is the ability to switch editing applications. Yeah, I know, the only thing on the planet that can edit full OOXML is MS Office and that’s not likely to change in the near future, but if you take care about which features you use then you can get reasonable interop between word processors, which you will lose with this kind of mishmash.

In my opinion one of the great things about word processors, and particularly Word, is that they allow you to manage structure by implying it. Word has a great outliner which lets you structure your work and move stuff around, while still allowing lots of freedom to copy and paste. It’s simply not true what a lot of XML people say, that word processing files are ‘flat’. Out of the box using headings, your Word document has an outline. Just because it’s not serialised into nested XML doesn’t mean the structure doesn’t exist.
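To make the point about implied structure concrete, here is a minimal sketch in Python. It assumes the document has already been reduced to a flat list of (style name, text) pairs, and that headings use style names of the form Heading 1 to Heading 9. The Section class and build_outline function are made up for illustration; they are not part of Word or ICE.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of the outline: a heading plus whatever sits under it."""
    title: str
    level: int
    paragraphs: list = field(default_factory=list)
    children: list = field(default_factory=list)


def build_outline(paragraphs):
    """Turn a flat list of (style_name, text) pairs into a nested outline.

    Assumption: heading styles are named 'Heading 1' .. 'Heading 9';
    every other style is treated as body text.
    """
    root = Section(title="(document)", level=0)
    stack = [root]  # the chain of currently open sections
    for style, text in paragraphs:
        match = re.fullmatch(r"Heading ([1-9])", style)
        if match is None:
            # Body text belongs to the most recently opened section.
            stack[-1].paragraphs.append(text)
            continue
        level = int(match.group(1))
        # Close every open section at the same or a deeper level.
        while stack[-1].level >= level:
            stack.pop()
        section = Section(title=text, level=level)
        stack[-1].children.append(section)
        stack.append(section)
    return root


def print_outline(section, indent=0):
    """Print the section titles, indented by nesting depth."""
    for child in section.children:
        print("  " * indent + child.title)
        print_outline(child, indent + 1)


flat_document = [
    ("Heading 1", "Introduction"),
    ("Normal", "Some opening text."),
    ("Heading 2", "Background"),
    ("Normal", "More text."),
    ("Heading 1", "Method"),
]
print_outline(build_outline(flat_document))
```

Nothing in the flat list is nested, yet the hierarchy falls out of the heading levels alone, which is all I mean by saying the structure is implied rather than serialised.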

You can turn something into a heading by adding a style and turn it back into ordinary text the same way. That kind of operation can require some serious gymnastics if you try it with a validating XML editor, cos a heading is not just a heading, it’s typically a magic element that’s part of a bigger element which is part of an explicit hierarchy. Me, I prefer working with implicit hierarchy and WYSIOOTMFYG (What You See Is One Of The Many Formats You Get), which is why I choose a word processor over an XML editor or a text editor, even though I’m pretty sure I’m smart enough to use Emacs + DocBook if I wanted to.

I promised an alternative suggestion. So here goes.

Why not work out a generic set of word processor styles and microformats that can be used for academic authoring, which can be assembled into downloadable templates with the right look and feel for various journals and conferences, so authors need only learn one system of styles? Write some software that can render these documents as high-quality HTML, and PDF using the word processor itself, for low-end publishing. For higher-end work, transform from XHTML to the XML DTD of your choice, or just pump the content into Adobe InDesign or similar and forget the DTD, like I said nearly four years ago in support of Tim Bray.

As an author, I’d prefer to have a generic way to write structured documents that I can use for all my writing. I’d really dread being expected to learn several variations on the XML embedded in Word theme. Or adapt to several different templates all with a different name for a block-quote, or a bullet list, or deal with templates that don’t use styles at all, which is pretty much the way things are at the moment for academic authors.

Actually, the ICE team have done lots of work on this already, including a proof-of-concept we knocked up in FrameMaker that used its structured mode to render XHTML back when most of us worked for the ill-fated NextEd.

We’re working with a research group at USQ to test out these very ideas; they will work in ICE, either using MS Word or OpenOffice.org Writer, and we’ll try to help out by providing import and export from the various Word templates that journals provide.

I hope we can find the time and resources to explore this idea with the National Library of Medicine XML format as well; it would be nice to contrast the development cost and usability of a cross-platform ICE-based authoring environment versus the work that Microsoft has been doing on their Word 2007 solution.

Another nice-to-have would be a Word version of our word-processor-to-XHTML code. At the moment we use OpenOffice.org as a conversion hub, but it has some limits to its Word support. A native implementation would be better for Windows users who want to use Word. At the moment we recommend serious ICE users use OpenOffice.org or a derivative, but we do support Word.

Some thoughts on vendor lock-in, from the domestic to the institutional (is Apple Mac OS X evil?)

Filed under: Uncategorized — ptsefton @ 9:42 am

I spent a fair bit of time during a period of enforced physical inactivity in March sorting out the home music and picture collections, getting rid of stray MP3s that nobody wants and trying to work out how to start organizing our photos a bit better. This is another exercise in self preservation, like the ongoing but currently stalled work I’m doing on my theses.

We have a couple of Macs around the house now, a new PowerBook that I recently bought, and sometimes the work laptops (a brand new Intel MacBook Pro and an old G4 PowerBook). The patched-together home PC now runs Linux (Ubuntu 7.10) exclusively, no Windows, but we do have a couple of Windows XP licenses if we really need them and XP is installed on the work PowerBook so I can test stuff in Windows when I absolutely can’t find anyone else to do it.

So let’s talk about managing digital pictures, and how we might start to organize them and add metadata. How can we label stuff so it’s findable and sortable, and remains so years and years from now?

The basic requirement is that you can add metadata to images, in the form of tags and captions, and have them stored in the image in a standard way so that future software can use the tags and captions.

You need to be smart about this. I could tag an image Peter Sefton but does that mean that I took the picture, or I’m in it or I own the copyright? There are a couple of approaches to sorting this out:

  1. Use tags which include a predicate and object, what the Flickr folks call Machine Tagging, AKA Triple Tagging; if I took the picture I could tag it dc:creator=Peter Sefton, or if I’m in it I might tag it foaf:Person=Peter Sefton or maybe foaf:Person=pt@ptsefton.com. If you’re wondering what the DC and FOAF bits mean then you can read up on it.

  2. Use hierarchical tags. So if I’m in a picture it might be tagged People/PeterSefton or People/Sefton/Peter or suchlike. This kind of hierarchical tagging is built in to some tools. If you do this then it should be possible to map these tags onto a more formal metadata system later on, given the right software. (Note: don’t use spaces in tags; while some tools support them, not all do.)
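The two tagging approaches above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not anybody’s real API: parse_machine_tag assumes the namespace:predicate=value shape shown in the examples, and HIERARCHY_MAP is a hypothetical mapping that I made up to show how a hierarchical tag like People/PeterSefton might later be mapped onto a more formal predicate.

```python
from typing import NamedTuple, Optional


class MachineTag(NamedTuple):
    namespace: str
    predicate: str
    value: str


def parse_machine_tag(tag: str) -> Optional[MachineTag]:
    """Split 'namespace:predicate=value' into its parts.

    Returns None for ordinary tags that do not have that shape.
    """
    head, equals, value = tag.partition("=")
    if not equals or not value:
        return None
    namespace, colon, predicate = head.partition(":")
    if not colon or not namespace or not predicate:
        return None
    return MachineTag(namespace, predicate, value)


# Hypothetical mapping from the top level of a hierarchical tag to a
# (namespace, predicate) pair. Which vocabulary to use is a judgment call.
HIERARCHY_MAP = {"People": ("foaf", "Person")}


def hierarchical_to_machine(tag: str) -> Optional[MachineTag]:
    """Map 'People/PeterSefton' onto a machine tag, if the top level is known."""
    top, slash, rest = tag.partition("/")
    if not slash or not rest or top not in HIERARCHY_MAP:
        return None
    namespace, predicate = HIERARCHY_MAP[top]
    return MachineTag(namespace, predicate, rest)


for raw in ["dc:creator=Peter Sefton", "People/PeterSefton", "holiday"]:
    print(raw, "->", parse_machine_tag(raw) or hierarchical_to_machine(raw))
```

Plain tags like holiday fall through both functions and come back as None, which is the point: only tags that carry a predicate, explicitly or via the hierarchy, say what the name actually means.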

Anyway, provided you can capture one of the above kinds of tag and make sure they stay with your images then that’s a good start. So I looked for software which would:

  1. Respect your EXIF metadata which is embedded in pictures that come from modern digital cameras, and not mess it up.

  2. Write captions and tags into the EXIF metadata.

  3. Deal with the IPTC standard at least for captions and tags.

  4. Understand Adobe’s XMP which is better than plain IPTC even if all you’re doing is tagging because it deals with more than just ASCII characters.

I’ve oversimplified here, I know; feel free to elaborate in the comments.
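For the curious, here is a rough sketch of what a caption and tags look like when expressed the XMP way, using only the Python standard library. It builds the RDF/XML with a caption in dc:description and tags in dc:subject, which is my understanding of the usual Dublin Core mapping in XMP. It does not embed anything in an image file and it leaves out the packet wrapper that an embedded XMP block carries, so treat it as an illustration of the data model rather than a working tagger. The function name build_xmp_sidecar and the sample caption are made up.

```python
import xml.etree.ElementTree as ET

# Namespaces used by XMP for this small subset of metadata.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(prefix, name):
    """Build an ElementTree qualified name such as '{uri}name'."""
    return "{%s}%s" % (NS[prefix], name)


def build_xmp_sidecar(caption, tags):
    """Return UTF-8 encoded RDF/XML holding one caption and a bag of tags."""
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # Caption: a language alternative with a single default entry.
    description = ET.SubElement(desc, q("dc", "description"))
    alt = ET.SubElement(description, q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = caption

    # Tags: an unordered bag of keywords.
    subject = ET.SubElement(desc, q("dc", "subject"))
    bag = ET.SubElement(subject, q("rdf", "Bag"))
    for tag in tags:
        ET.SubElement(bag, q("rdf", "li")).text = tag

    return ET.tostring(xmpmeta, encoding="utf-8")


print(build_xmp_sidecar("Café by the beach", ["People/PeterSefton", "日本"]).decode("utf-8"))
```

Because the whole thing is just UTF-8 XML, the accented and non-Latin characters survive intact, which is the advantage over plain IPTC mentioned in point 4.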

So what are the options available on the Mac?

One of the first impressions I got when I started using a Mac a couple of years ago was just how bad the operating system level support for images is. In the Finder you can look at thumbnails and that’s about it.

Even on Windows XP there’s a pretty useful built in photo viewer that lets you flip through a directory full of pictures, use keys to rotate them and print them and so on. Not so on the Mac. Even with the new CoverFlow feature in OS 10.5 you can look, but you can’t touch. That is, you can’t rotate an image or do anything useful from the Finder. And you can’t mess with picture metadata in the Get Info dialogue either. There is basically no operating system level support for images.

Why is the Mac like this when it’s supposed to be so great for multimedia?

I think it’s to do with Apple’s plans to lock you in to their hardware/software platform. You want photo management? Well, every new Mac comes with iPhoto. Shut up and use it. Better still fork out hundreds of dollars for their up-market photo software.

So what if I were to simply let iPhoto import all my images?

Two very bad things would happen.

  1. By default on OS 10.5, Leopard, images all disappear into a black hole called iPhoto Library in the Pictures folder. It looks like this:

    [Screenshot: the iPhoto Library package in the Pictures folder]

    Click on iPhoto Library and iPhoto opens. But where are the files? Have they been eaten by iPhoto?

    If you right-click you can Show Package Contents and see the files in a complicated directory structure that keeps track of originals and changes you have applied.

    But good luck knowing which file to click if you want to grab one and do something with it without using iPhoto. And who knows what would happen if you changed one of the files in the ‘package’.

  2. iPhoto will steal my hard-earned metadata. There’s a nifty looking tagging interface, but the tags live in the iPhoto database, wherever that is, not in the images themselves. So if you invest in tagging up the photos then you’d better be prepared to stick with iPhoto for the rest of your life.

    From what I can tell there used to be scripts that could round trip metadata from the iPhoto database to your images themselves, and the other way round, but iPhoto ‘08 broke that and I gather that Apple have taken away the scripting hooks that made it work. Oops, sorry valued customers.

    OK, so there’s a ‘rudimentary’ export that will put the tags into your images’ IPTC metadata, but if you do that then iPhoto will just spew them all out into a single directory, losing any directory structure you might have devised.

These two points make me think that this is a very deliberate, cynical, contemptuous move by Apple. Unsuspecting Mac users are being treated like crayfish and being ever so gently boiled alive. By the time you wake up it will be 2015 and Apple will own your entire digital life and you’ll be sliced in half and served to Steve Jobs for supper. I am beginning to wonder whether the Finder is such a useless file manager because Apple want to actually move as many of your files as possible into ‘packages’ so you have to use Apple software to manage them. Just wondering.

Mark Pilgrim wrote in 2006 why he was leaving the Mac:

I’m creating things now that I want to be able to read, hear, watch, search, and filter 50 years from now. Despite all their emphasis on content creators, Apple has made it clear that they do not share this goal. Openness is not a cargo cult. Some get it, some don’t. Apple doesn’t.

http://diveintomark.org/archives/2006/06/02/when-the-bough-breaks

Me, I think Apple get it. They know what’s open. They can read blogs and forums and know that users want iPhoto to write metadata like tags back to their images. I mean, who wouldn’t if you explained it to them? But if Apple can get away with not doing it then they will. This is the story of Microsoft Word’s file-format lock-in all over again. Ten years from now we can enjoy the fireworks when Apple’s media database standards go through ISO, OOXML style. Tim Bray will write an essay.

Take iTunes for another example. It does everything it can to tie you to the iTunes store, and it wants to organize your files for you. It actually does write metadata changes back into the music files, unlike iPhoto, but for such a slick application it has some surprising dumb spots, such as the way it can’t detect changes to your music library if you add or change files using another tool. If you delete a bunch of songs from underneath iTunes it gets awfully confused and if you add new ones it won’t notice until you point it back at its own library and tell it to ‘import’ a directory into itself.

Why? I think it’s just designed to wear you out so you give in and let iTunes manage your stuff. I wonder how long it will be before the iTunes library is an impenetrable package like the iPhoto library.

To summarize the state of just these two media offerings from Apple, there are really, really obvious features that are in the customer’s interest that aren’t there. And there’s no excuse.

For me, the Mac’s still a worthwhile platform, and I can even live with iTunes, but I’m not letting iPhoto boil me alive however tasty the stock I get boiled in.

The really funny part is that Microsoft’s much maligned Windows Vista sounds like it has really good, really open built in image support. I don’t have the constitution to go down that particular rabbit hole and invite yet another version of Windows into my home, but no doubt we’ll have a look at work and see how it all works.

Outside of installing Vista, and without spending money on pro tools that may have their own lock in, looks like the solution is Digikam. It’s open source and Mark Pilgrim likes it so it must be good. And it plays nice with metadata.

Digikam can be installed on OS X to run under the X Window System, if you’re very persistent and learn how to compile and use the Fink package manager. A native version should be available soon. I also got Digikam running in a virtual machine using Virtualbox and Ubuntu. That worked but it was a bit sluggish.

Plan B is Mapivi. It’s far from pretty but it seems to be a serviceable photo manager. I have not bothered to get it running on OS X yet. (In testing this I worked out how to make tags that will interoperate nicely between Digikam and Mapivi, I’ll post a how-to at some stage.)

Plan C is Google’s non-free but very slick Picasa. Runs on Linux and is rumored to be coming to OS X this year. I have managed to get the old Windows 98 compatible version to run on the Mac using Wine, but it doesn’t do tagging the right way.

I’m talking about the home situation here, but lots of this applies to institutional contexts too; it may be that the repository you’re using, or about to be using, suffers from some of the same problems as consumer-level software.

Is your repository keeping some of your important data or metadata in its own database and not giving lossless export? Maybe researchers in your university are having trouble looking after their data. Some of the same kinds of commercial forces I describe above may be at work to subvert your interests.
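One way to put the lossless export question to the test is a simple round-trip check: compare what you entered with what comes back out of an export. The sketch below assumes you can get both sides into plain Python dictionaries keyed by item identifier; how you do that depends entirely on your repository, and the function name find_export_losses and the sample records are invented for illustration.

```python
def find_export_losses(stored, exported):
    """Report metadata fields that did not survive an export unchanged.

    Both arguments map an item id to a dict of field name -> value.
    Returns {item_id: {field: (stored_value, exported_value)}} for every
    field that is missing from, or different in, the export.
    """
    losses = {}
    for item_id, fields in stored.items():
        exported_fields = exported.get(item_id, {})
        changed = {
            name: (value, exported_fields.get(name))
            for name, value in fields.items()
            if exported_fields.get(name) != value
        }
        if changed:
            losses[item_id] = changed
    return losses


# Invented sample data: the export has dropped the access control field.
stored = {
    "img-001": {"caption": "Dig site", "tags": ["People/PeterSefton"], "access": "staff-only"},
}
exported = {
    "img-001": {"caption": "Dig site", "tags": ["People/PeterSefton"]},
}
print(find_export_losses(stored, exported))
```

An empty result is what you want to see. Anything else is a list of the things your repository knows about your stuff that you cannot take with you.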

What does this mean for the institutional context?

Two main things:

  1. We need to make sure that repository software vendors/developers are not pulling a Steve Jobs on us. If you enter metadata into your repository to make a collection, is it written back to your storage layer in a way that you can re-use it in future? Likewise for access control information, and any other ‘feature’ that seems like you really must have it.

    I won’t name names, but repository vendors and open source developers, you know who you are and you know we are watching you.

  2. For the forthcoming Australian National Data Service we turn our attention to what researchers do with all their digital stuff. A lot of that stuff is photos that they’re pulling off digital cameras. How should they look after it all? How should they add metadata?

    I mentioned the Field Helper application here before, from the University of Sydney Archaeological Computing Laboratory. But I’m beginning to wonder if we might not be able to go a long way by tying together existing tools, like Digikam or Picasa, to help out our researchers. Lots more work is needed in this space, but if you’re using iPhoto to look after important research data then I suggest you seek help before it’s too late. Maybe a switch to Windows Vista, which cares about your metadata, would be in order?
