ptsefton

2008-08-26

More thoughts on an application to find structure in word processing documents

Filed under: Uncategorized — ptsefton @ 1:24 pm

In my last post I said I’d write more about how Ian Barnes’ Structure Guesser AKA Structure Sniffer1 might work, and how it might be able to leverage Schematron.

The sniffer is part of Ian’s Digital Scholar’s Workbench concept, where you can upload an unstructured word processing document, and use the workbench to add explicit structure in as automated a way as possible. Explicit structure really helps in being able to convert the document to other formats such as HTML for the web, or structured PDF with a table of contents, but also for preservation formats that might keep the words and other content for posterity without necessarily worrying about exact formatting. Ian has looked at using DocBook for this, but I reckon HTML might be good enough, and I know others are thinking the same thing2.

Ian’s looked at the statistical approach to guessing structure used in the Lemon8-XML project, found that particular implementation wanting, and is now thinking about more of a machine learning approach.

I too have been thinking about how this application might work for a while now and I’m getting increasingly enamored of the idea of using an HTML interface, something like this:

  1. Upload a style-free word processing document to a web site.

  2. You see an interactive preview of an HTML version of the document, complete with a full table of contents (so you can see where the sniffer application thought the headings were).

    Interactive? Hover the mouse over a top level (h1) heading in the preview and see some details about why the machine formatted it that way, such as ‘Paragraphs at 18pt (10 instances) and 19pt (1 instance) Helvetica look like Heading 1’. You’d be able to correct the machine, either on a case by case basis or wholesale.

    Another area where some interaction might be needed would be in disambiguating various kinds of indented text: some indentation might mean block-quote, some might be an example, while other text might just be, you know, indented. We had to add an indent style in addition to the bq1 (block-quote) style to ICE to support this because some authors just, you know, want to indent stuff.

  3. Once you were happy with the HTML view of the document, there would be an option to improve your original by adding styles without changing its presentation too much (Did I mention? You too should use styles.) or you could just use the rendition and leave the original alone. Either way, the choices you made would constitute feedback to the learning system. So even if you don’t choose to use styles, the next time it sees the same document it will be able to handle it better.
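The hover explanation in step 2 could be backed by something as simple as the following Python sketch. It assumes paragraphs have already been reduced to font size, font name and text; the `Para` type, the `guess_headings` function and the "largest sizes win" heuristic are all hypothetical illustrations, not how Ian's sniffer or ICE actually work.

```python
from collections import Counter
from typing import NamedTuple


class Para(NamedTuple):
    size_pt: float  # font size of the paragraph, in points
    font: str       # font family name
    text: str


def guess_headings(paras, tolerance_pt=1.0):
    """Guess top-level headings and return (paragraphs, human-readable reason).

    Assumed heuristic: body text is the most common font size, and any
    larger paragraph within `tolerance_pt` of the largest size is Heading 1.
    """
    if not paras:
        return [], "The document has no paragraphs."
    sizes = Counter(p.size_pt for p in paras)
    body_size = sizes.most_common(1)[0][0]
    largest = max(sizes)
    hits = [p for p in paras
            if p.size_pt > body_size and largest - p.size_pt <= tolerance_pt]
    if not hits:
        return [], "No paragraphs larger than the body text were found."

    # Build the kind of explanation a user would see on hover.
    by_size = Counter(p.size_pt for p in hits)
    parts = [f"{size:g}pt ({n} instance{'s' if n != 1 else ''})"
             for size, n in sorted(by_size.items())]
    fonts = "/".join(sorted({p.font for p in hits}))
    reason = f"Paragraphs at {' and '.join(parts)} {fonts} look like Heading 1"
    return hits, reason
```

A real learner would weigh many more cues (weight, spacing, numbering, position), and each correction the user makes would adjust those weights rather than a fixed tolerance.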

So where does Schematron come into this? Well, leaving aside the (very) hard problem of actually writing the learning system, that system could generate Schematron rules, which could be used to annotate the original document with suggested styles for each paragraph. Having done that, you could then feed the document into the existing ICE HTML formatter, which is style-driven and it could use the suggested styles to render the document.

These rules can be hierarchical, meaning that based on certain cues different sets of rules might apply. For example, there might be a family of documents which all come from a user who uses Palatino 11pt for the main text and makes use of an idiosyncratic mixture of formatting and styles; the learner could derive rules for that situation. I know nothing about this kind of thing; I wonder if it would be like the Naïve Bayesian in the Old Bailey, where a machine is trained to classify trials.

Using Schematron rules would mean that they could also be written or tweaked by humans. Returning to the example before, a human could add a rule that if a bit of text is indented relative to the text around it, and it contains something that looks like a citation (which could mean either that it uses something like a Zotero field, or that it is formatted like a citation with brackets or a footnote), then it’s a blockquote.
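To make the human-written rule concrete, here is a small Python sketch of the same logic. It is not Schematron and not ICE code: the `Para` fields, the `CITATION` pattern and the `suggest_style` function are assumptions made up for illustration, and `p` is assumed to be the name of the plain paragraph style. Only the `bq1` and `indent` style names come from the discussion above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern for "formatted like a citation with brackets",
# e.g. "(Smith 2008)" or "[12]". A real rule would need to be broader.
CITATION = re.compile(r"\([A-Z][^()]*\b\d{4}[a-z]?\)|\[\d+\]")


@dataclass
class Para:
    indent_cm: float            # left indent of the paragraph
    text: str
    has_footnote: bool = False  # stands in for a footnote or citation field


def suggest_style(prev: Optional[Para], para: Para, nxt: Optional[Para]) -> str:
    """Suggest a style name for `para` given its neighbours."""
    neighbours = [p.indent_cm for p in (prev, nxt) if p is not None]
    indented = bool(neighbours) and all(para.indent_cm > i for i in neighbours)
    if not indented:
        return "p"
    cited = para.has_footnote or CITATION.search(para.text) is not None
    # Indented and cited: block-quote. Indented only: just, you know, indented.
    return "bq1" if cited else "indent"
```

The useful property is the same one Schematron rules would have: the rule is short enough for a person to read, disagree with, and tweak.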

This would be a nice modular approach. Chances are we’re going to be looking at Rick Jelliffe’s in-zip Schematron for use on Open Document Format documents, so the sniffer could piggyback on that1.


1 Also known as that by me, at least.

2 And no, OOXML and ODF are not necessarily the answer for preservation, although they are important. I’ll expand on this in a future post as I think about a presentation for Open Standards 08.

1 Actually there is an issue with this: it’s not that simple to write rules that work on the formatting in an ODT file, cos it uses these automatically defined styles that introduce a layer of indirection. We could consider a pre-processor that remembers these automatic styles between documents; it would also probably need to annotate documents with some kind of weighted score like they use in Lemon8-XML.
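The indirection mentioned in that last footnote can be illustrated with a rough Python sketch. It assumes the usual ODF layout, where `content.xml` holds an `office:automatic-styles` section and paragraphs refer to generated names such as `P1`; the `resolve_paragraphs` function is hypothetical and ignores styles defined in `styles.xml`, inheritance chains and character-level spans.

```python
import xml.etree.ElementTree as ET

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def q(prefix, local):
    """Build a namespace-qualified name in ElementTree's {uri}local form."""
    return f"{{{NS[prefix]}}}{local}"


def resolve_paragraphs(content_xml):
    """Return (parent style, direct formatting, text) for every paragraph.

    Automatic style names like "P1" are replaced by the named style they
    derive from plus the formatting properties they add on top of it.
    """
    root = ET.fromstring(content_xml)
    auto = {}
    for st in root.findall("office:automatic-styles/style:style", NS):
        props = {}
        for child in st:  # e.g. style:text-properties, style:paragraph-properties
            for attr, value in child.attrib.items():
                props[attr.split("}")[-1]] = value  # drop the namespace part
        auto[st.get(q("style", "name"))] = (
            st.get(q("style", "parent-style-name")), props)

    resolved = []
    for p in root.iter(q("text", "p")):
        name = p.get(q("text", "style-name"))
        parent, props = auto.get(name, (name, {}))
        resolved.append((parent, props, "".join(p.itertext())))
    return resolved
```

Rules could then be written against the resolved formatting (for example `font-size` of `18pt`) rather than against names like `P1`, which mean something different in every document.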

A courseware authoring dashboard using Schematron

Filed under: Uncategorized — ptsefton @ 10:58 am

As with busses, sometimes you can wait ages for a Schematron and suddenly a whole pack of them come along together*.

For those of you who don’t know:

In Markup Languages, Schematron is a rule-based validation language for making assertions about the presence or absence of patterns in XML trees. It is a simple and powerful structural schema language. It typically uses XPath to describe patterns.

(Wikipedia contributors 2008)

Instead of the all or nothing syntactic approach that you get with other kinds of schemas, Schematron lets you pick and choose things to worry about. So instead of saying ‘all course books must begin with a Learning Outcomes section’ you can write a rule that simply reports on whether there’s a Learning Outcomes section or not, without ruling out any variation. Why? In some courses it might be important to add something before that section, while I have heard arguments that in some situations specifying learning outcomes upfront scares off potential students.

We’ve discussed using Schematron to provide reports on ICE content but have never got around to using it. This week it has resurfaced in a couple of contexts.

Relevant to ICE as a course-authoring system, the Learning and Teaching Support Group at USQ have a checklist, The USQ course writing guide, which authors can use to see if their courses meet our standards for fleximode courseware. At the moment it’s a manual process to tick the boxes. We met with Michael Sankey from LTSU this week, and it’s pretty clear that Schematron could play a part in automating lots of the checklist.

As part of our ongoing exploration of how we might create an automatic or semi-automatic system for inferring structure in documents Ian Barnes has pointed out that Schematron might play a role there too.

Ian’s insight was prompted by a recent post of Rick Jelliffe’s about a project to add annotations to a corpus of (presumably) Word documents in the OOXML zip package format:

The brief was for an organization with a large number of documents from multiple sources, but with each source supposed to use stylesheets. The idea was to make a rules base that would distinguish all the different ways that a few structures (titles, table of contents, potentially citations, etc) were represented. This would allow classification of documents according to the structures found, the discovery of outliers and exceptions (e.g. incorrectly marked up documents, or where additional rules were needed), and automated annotation back to the original documents.

http://news.oreilly.com/2008/08/a-standardsbased-expert-system.html

I’ll come back to Ian’s structure guesser (or as I like to call it the structure sniffer) in another post and talk here about the possibilities for adding validation or dashboard services for courseware written using ICE, via Schematron.

Rick’s idea of Schematron rules that can reach inside Zip files would be perfect for the USQ courseware context as our content is in Open Document Format files (actually some of it is Word docs but we convert it to ODF as part of the process). We could translate a lot of the checkboxes in the USQ course writing guide into Schematron rules to do things like check that there is an acknowledgements section in the course introduction. Not only could the system report issues, it could open up the documents in question for you, take you to the trouble spots and insert comments in the documents.
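As a stand-in for what such rules would do, here is a minimal Python sketch that reaches inside an ODF zip package and reports on missing sections. It is not Schematron; the `CHECKLIST` entries and the function names are invented for illustration, and it assumes that sections are marked by `text:h` headings in `content.xml`.

```python
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical checklist: (heading to look for, message if it is missing).
CHECKLIST = [
    ("Acknowledgements", "The course introduction has no acknowledgements section."),
    ("Learning outcomes", "There is no Learning Outcomes section."),
]


def headings(odt):
    """Return the text of every heading in an ODF text document.

    `odt` may be a path or a file-like object containing the zip package.
    """
    with zipfile.ZipFile(odt) as package:
        root = ET.fromstring(package.read("content.xml"))
    return ["".join(h.itertext()).strip() for h in root.iter(f"{{{TEXT_NS}}}h")]


def report(odt):
    """Report on the checklist without rejecting the document."""
    found = {h.lower() for h in headings(odt)}
    return [message for title, message in CHECKLIST if title.lower() not in found]
```

The point of returning messages rather than raising an error is the same as Schematron's report style: the document is never "invalid", it just has things worth pointing out.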

Not everything needs to be seen as a validation issue though; just some reporting would be useful to create a kind of dashboard for courseware. ‘Module 4 contains no activities’ might be a worthwhile thing to report, along with word counts for various modules, how many citations there are, etc.
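A dashboard of that kind needs very little machinery. The following Python sketch assumes each module has already been reduced to a list of (style name, text) pairs, and that activities are marked with a style called `activity`; both the data shape and that style name are assumptions for illustration only.

```python
def dashboard(modules):
    """Return one (name, word count, activity count, note) row per module.

    `modules` maps a module name to a list of (style, text) paragraphs.
    """
    rows = []
    for name, paras in modules.items():
        words = sum(len(text.split()) for _, text in paras)
        activities = sum(1 for style, _ in paras if style == "activity")
        # A note, not an error: the author decides whether it matters.
        note = f"{name} contains no activities" if activities == 0 else ""
        rows.append((name, words, activities, note))
    return rows
```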

Another place we could use Schematron to report on course structure would be in the course organizer, which is part of the IMS package manifest file in every ICE course. An organizer is a kind of table of contents for the course, and it is used to generate the navigation. Schematron could easily be used to validate things such as ‘There must be a Study schedule’, and check things like whether the links to study modules have names that are not just like ‘Module 1’ but convey a bit more about what’s in the module.

A few years ago Ron Ward and I were involved in a project that used Schematron. There we used it to validate metadata for documents as they were uploaded into a content management system. Schematron would look for patterns in the metadata and complain when it was wrong. The complaint took the form of an HTML form that the user could fill out to fix the metadata to the Schematron system’s satisfaction. The Schematron rules worked well to create a true declaratively specified interface, but our implementation was a bit inflexible, like my attitude at the time, so usability suffered. Lesson learnt, I hope.
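The two organizer checks just mentioned could be sketched in Python as follows. This assumes the manifest uses the IMS Content Packaging 1.1 namespace and that each navigation entry is an `item` with a `title` child; the `check_organizer` function, the `BARE_MODULE` pattern and the message wording are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed namespace: IMS Content Packaging v1.1.
IMSCP = "http://www.imsglobal.org/xsd/imscp_v1p1"

# Matches titles that are nothing more than "Module 1", "module 12", etc.
BARE_MODULE = re.compile(r"^\s*module\s+\d+\s*$", re.IGNORECASE)


def check_organizer(manifest_xml):
    """Return a list of messages about the course organizer."""
    root = ET.fromstring(manifest_xml)
    titles = []
    for item in root.iter(f"{{{IMSCP}}}item"):
        title = item.find(f"{{{IMSCP}}}title")
        titles.append((title.text or "").strip() if title is not None else "")

    messages = []
    if not any(t.lower() == "study schedule" for t in titles):
        messages.append("There must be a Study schedule")
    for t in titles:
        if BARE_MODULE.match(t):
            messages.append(f"'{t}' says nothing about what is in the module")
    return messages
```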

I think that presenting this as a dashboard that lets you know what your course is like will be better than presenting it as validation which has connotations of centralized control, something that doesn’t always go down well in a university, even when we do have agreed standards to maintain.

It will be a little while before we get to implementing this; I just wanted to record our current thinking.


* Although come to think of it I don’t think I’ve ever seen two busses in a row in Toowoomba.

2008-08-20

Compound documents in ICE and beyond: referencing parts of things

Filed under: Uncategorized — ptsefton @ 1:44 pm

Ben O’Steen has put up some thoughts on what he refers to as ‘compound’ documents and how to store them in repositories and allow for referencing of parts of a document, such as a table, a graph or even a paragraph.

Why did I add the scare quotes to compound?

While to a computer scientist a research paper with its graphs and tables and paragraphs might be compound, I suspect most authors tend to think of a research article as a single entity. Until we start giving them access to services that make it clear that it’s not monolithic, that is.

As background, Ben gives four rules:

Note that the four rules of the web (well, of Linked Data technically) are in essence:

  • give everything a name,

  • make that name a URL …

  • which results in data about that thing,

  • and have it link to other related things.

I strongly believe that applying this to the individual components of a document is a very good and useful thing.

http://oxfordrepo.blogspot.com/2008/08/four-rules-of-web-and-compound.html

Agreed.

He goes on to talk about how repository services will have to have an explicit contract with authors that lets them know that their document is not just going to be presented in one monolithic format, by default the dreaded PDF.

One thing first, we have to get over the legal issue of just storing and presenting a bitwise perfect copy of what an author gives us. We need to let author’s know that we may present alternate versions, based on a user’s demands. This actually needs to be the case for preservation and the repository needs to make it part of their submission policy to allow for format migrations, accessibility requirements and so on.

As we get authors using a system like ICE then this will be:

  1. Easier for them to understand because they can see multiple formats generated automatically.

  2. Easy to implement, by hooking up ICE (or similar) directly to repositories. Just this week Oliver Lucido has ICE putting content straight into ePrints via OAI-ORE, automatically adding an HTML and PDF view.

So far with ICE we have done a number of demo hook-ups to repository software. It’s now time to turn this on for real: we will get ICE hooked up to USQ ePrints ASAP. This will mean that all the images in a document will automatically become referenceable. That is, in Ben’s terms each image will have a name which is a URL.

Going beyond images, we have already done some work in ICE on making paragraphs referenceable, not in a repository context but in an editorial workflow. For example, this blog post has been created in ICE. Here’s a screenshot of an earlier version of this very paragraph in the HTML view.

graphics1

See the blue pilcrow? That’s the symbol that Tim Bray uses on his blog to make each paragraph referenceable. Go and have a look, you can link to or refer to any part of any post on his site. In ICE, however, the pilcrow is not for referencing elsewhere, it’s for commenting.

See the spelling error? I can annotate the document:

graphics2

Now, if I fix the paragraph, the comment will disappear from the main body of the text but the old, broken version of the paragraph is kept; it shows at the bottom of the page until I delete it.

So, ICE already knows how to identify any paragraph and has some rudimentary version control for document parts*, but the context matters. In an authoring context we needed something that was not too sensitive to document order, and it had to work with documents created by word processors, so we can’t just assign unique IDs to paragraphs the way Tim Bray can in his bespoke workflow. But when it comes to pushing (or pulling) a document into a repository, where there is some expectation that it will not change, there is no reason that we can’t mint IDs for parts of a document, and figure out a way to make them obviously citable along the lines of Tim’s purple pilcrows.
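One way to mint such IDs at deposit time, sketched in Python, is to derive them from the paragraph text itself so that they do not depend on where the paragraph sits in the document. This is an illustration only, not how ICE identifies paragraphs; the `paragraph_id` and `mint_ids` names and the `p-` prefix are made up.

```python
import hashlib


def paragraph_id(text, length=8):
    """Derive a short ID from normalised paragraph text."""
    normalised = " ".join(text.split()).lower()  # ignore case and spacing
    digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
    return "p-" + digest[:length]


def mint_ids(paragraphs):
    """Return one ID per paragraph, in document order.

    Identical paragraphs get a numeric suffix so every ID stays unique.
    """
    seen = {}
    ids = []
    for text in paragraphs:
        base = paragraph_id(text)
        count = seen.get(base, 0)
        seen[base] = count + 1
        ids.append(base if count == 0 else f"{base}-{count}")
    return ids
```

Each ID could then become a fragment on the document URL. The obvious trade-off is that any edit to a paragraph mints a new ID, which is acceptable in a repository where the deposited version is not expected to change, but would not do for the authoring workflow described above.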

Coming back to Ben’s post. Why not make the HTML view the ‘normal’ way to look at an article where possible? This would mean that you don’t have to store a document in fragments, merely label the parts of the HTML. I guess I’m agreeing with Ben’s tentative suggestion that HTML might be a good format to hang this on:

I have yet to settle on basing it on the content XML format inside the OpenDocument format, or on something very lightweight, using HTML elements, which would have a double benefit of being able to be sent directly to a browser to ‘recreate’ the document roughly.

Forget ‘roughly’, at least for documents created with an HTML-ready workflow like ICE. It would be even less rough if authors chose something like the Article Authoring Add-in for Microsoft Office Word 2007. But Ben’s right; for documents that are deposited in PDF or in unstructured word processing formats, HTML is going to be rough.

Just how we might handle the user interface issues for exposing names (URLs) of the parts of a document is unresolved, but we’ll give it a go here at USQ with our ICE and ePrints systems.


* There’s the current version and then there are obsolete versions. ICE of course has rich version control at the document level courtesy of Subversion.

2008-08-11

Study shows real-world ODF/OOXML interoperability is not great

Filed under: Uncategorized — ptsefton @ 10:31 am

Via Doug Mahugh at Microsoft comes this study (Shah & Kesan 2008) on interoperability of word processing applications using the Open Document Format and Office Open XML.

After outlining some possible approaches to testing conformance of applications against the standards and pointing out what a gargantuan task that would be, they settle on a pragmatic approach: test interoperability with the dominant application for each format.

This research tested the interoperability for ODF and OOXML document formats based on a reference implementation approach. For ODF, the test documents are developed in OpenOffice, which is currently the dominant implementation for ODF. For OOXML, the test documents are developed in Microsoft Office 2007 for Windows. These are not reference implementations in a true sense, because they do not perfectly implement the standard. However, they act as de facto reference implementations, because they are the dominant implementations that all developers seek compatibility with.

This makes perfect sense for real-world testing. The results are interesting and unsurprising (to me, at least). Basically the best interoperability is between Microsoft Office Word and OpenOffice.org Writer even when they are reading each other’s formats. I reckon that would be because the OOo team have invested person-decades of effort in reverse engineering the Word document model, and Writer is more or less able to deal with Word docs. The document serialization format is not that relevant. It’s the document models that count. And some of the applications they test are not really even word processors.

This paper makes a great case that it is interop that counts and then goes on to show how poor interop really is.

Unfortunately, this study didn’t get as far as looking at styles compatibility; that’s one area where there are some frustrating problems but also great opportunities to help in interoperability. If you use styles then at least the semantics and structure of documents can be preserved even if page fidelity is not.

And there’s a way to improve interoperability. You don’t have to leave users to their own devices, you can advise them of which features of which applications to use for particular tasks. This is what we try to do on the ICE project. We provide templates and advice to help people create interoperable documents.

Inspired by this paper, I’m off to start work on a paper looking at proactive interoperability, by helping users to pick features that will interoperate. As noted in this study there’s not much out there to choose from apart from Writer and Word. That’s why we will continue to work with Writer and Word looking for practical solutions.

Shah, R.C. & Kesan, J.P., 2008. Lost in Translation: Interoperability Issues for Open Standards - ODF and OOXML as Examples. In The proceedings of the 36th Research Conference on Communication, Information and Internet Policy (TPRC), Arlington, VA, Sept. 26-28, 2008. Available at: http://ssrn.com/abstract=1201708 [Accessed August 10, 2008].

2008-08-05

Another look at the Article Authoring Add-in for Microsoft Office Word 2007

Filed under: Uncategorized — ptsefton @ 3:32 pm

The Article Authoring Add-in for Microsoft Office Word 2007 (AAAiMOW1) has been turned loose as a release candidate. I looked at an earlier version of this a while ago.

The name of the thing doesn’t let on that it is targeting just one version of what an article looks like, in the form of the NLM schema. I’m not sure if that reflects confidence that the NLM schema is generic enough to cope with all articles or anticipates a future version which can support multiple formats.

I had a lot of questions in my previous post, most of which I think are not yet answered, although Pablo Fernicola did drop by my blog and shed light on some of the issues.

This time, with a fresh virtual installation of Windows XP running under VirtualBox on OS X, the plugin worked a bit better for me so I could see it in full flight. I still have some serious concerns about this add-in thing and what it might mean for organizations.

I was going to make a few quick comments about usability, preservation and lock-in but this post kept growing, I emailed Jon Udell for his take and did a few tests, and it’s ended up well on the way to 2000 words.

[Update: (Minor edits and fixes)

I should point out that while this post is quite picky I’m glad to see this work going on in Microsoft and I’d love to see how it works out.

Look, if you’re not concerned about using an application which is only for Word 2007 on Windows XP or Vista to create articles which you don’t need to re-use or archive then most of what I’ve got to say here is irrelevant.]

Usability?

I can’t find any reports of how this plugin works in real life. Has anyone tried it? Are you all under NDAs?

I’m concerned about the way that you can add NLM structural elements all over the place, and nested inside each other in bizarre ways, but then you can’t save to the new proprietary .nlmx format because of validation errors.

It would be pretty easy to show how you can create invalid structures using this plugin but I don’t really think that’s a useful stunt to pull; what I want to see is what real problems, or lack of them, people have with the structural stuff.

Me, I found it a bit weird but as I said I didn’t try to write an article with it.

There’s one interface device that I really like. Each ‘section’ element gets a little handle above it so you can drag the whole thing around:

graphics1

It would be really nice if this applied to the document outline as well, as part of the normal Word interface, not just to the special embedded XML sections. I could just style a bit of content as Heading 2, which is part of the document outline structure, and be able to drag around the whole of that implied section. Word already does something very like this if you use the Outline view. Of course, dragging sections in an NLM document doesn’t make sense as they’re supposed to be in a particular order, but I don’t imagine most people would drag the high-level sections. (There’s some kind of complex process for dealing with section ordering or editors, I think).

I’m not sure if I get why the embedded XML is any better than just recognizing that the text ‘Abstract’ in Heading 2 style is the start of the Abstract section. Or you could define sub-classes of heading if you really wanted to, such as Heading 2 - Abstract.
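The "implied section" idea is easy to express in code. The Python sketch below assumes a document has been flattened to (style name, text) pairs and that heading styles are named `Heading N`, optionally followed by a sub-class such as `Heading 2 - Abstract`; the function names are hypothetical and nothing here reflects how the add-in or ICE work internally.

```python
def heading_level(style):
    """Return N for styles like 'Heading N' or 'Heading N - Abstract', else None."""
    if style.startswith("Heading "):
        parts = style[len("Heading "):].split()
        if parts and parts[0].isdigit():
            return int(parts[0])
    return None


def implied_sections(paras, level=2):
    """Group paragraphs into the sections implied by headings at `level`.

    A heading at `level` starts a section, a higher-ranked heading
    (smaller number) ends it, and everything else is section body.
    """
    sections = []
    current = None
    for style, text in paras:
        found = heading_level(style)
        if found == level:
            current = {"title": text.strip(), "body": []}
            sections.append(current)
        elif found is not None and found < level:
            current = None
        elif current is not None:
            current["body"].append(text)
    return sections
```

Finding the abstract is then just a matter of picking the section whose title is ‘Abstract’, and dragging a section around amounts to moving one of these groups as a unit.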

You could still have a toolbar like this so that people can drop in sections where they want them:

graphics2

Lock in

This add-in represents a new opportunity for Microsoft to lock users in to Word, having just moved on from the proprietary .doc format. This is not just a matter of trying to sell more copies of Microsoft Office; it’s about encouraging users to create documents that only work with a particular version of Office.

We have just been through a great long debate about standardizing word processing formats. Microsoft got their way and had their OOXML format accepted as an ISO Standard (ISO/IEC DIS 29500). The benefit is supposed to be that when you write a word processing document it can be managed and edited in more than one application, but I have always been very dubious about how this fits with the way you can embed arbitrary foreign XML in Word documents. By contrast, the Open Document Format approach is an RDF based extension mechanism which seems a lot cleaner.

I tried out some simple interop with an AAAiMOW document.

  1. Word 2008 on OS X can open it, and you can edit the document at least a little bit, apparently without breaking it, but anything you add doesn’t have the magic embedded XML. It round tripped without error but I assume you could break some of the XML.

  2. NeoOffice Writer on the Mac can open the .docx file and you can edit, but if you save it and re-open in Word then you get an error. The good news was that Word 2007 was apparently able to rescue the content but the bad news was that embedded XML went AWOL.

At the moment I would not have any confidence that anything except Word 2007 can deal with documents created with the add-in, which is as advertised. Of course, if that’s what your team of scientists is using then no problem, provided you think about how you will preserve the outputs (see below).

That quick interop test was using the new .docx format which is not the same as ISO/IEC DIS 29500, which won’t be available as a Word format until the next version.

One of the features of the AAAiMOW is a new file format. Yes. A new non-standard file format which is a misbegotten mashup of OOXML and NLM. I’m not sure how this is different from the way the content is embedded in the .docx file. From the readme file:

Both the article contents and metadata authored through the add-in are stored using XML, as part of a single file, using the Open XML format for the content and the NLM tagset for the metadata. Content which does not have an equivalent in Word, or extends existing Word elements, is stored as custom XML elements within the Open XML data stream. When a file is saved in the NLM format, the resulting XML file is stored within a nlmx file, using the same Open Packaging Conventions used by docx files, providing a single file which can package all related content (such as images) and supports extensibility.

Meanwhile, the next service pack for Word 2007 will add support for the Open Document Format (ODF) as a native file format. I’m assuming the plugin won’t work with ODF. (Pablo, am I wrong?)

There’s some very alarming use of the passive voice in the documentation too, a classic computer industry trick. Say it can be done without mentioning who’s going to do it and how much it’s going to cost.

Based on the use of Open Packaging Conventions, the Open XML format, and the NLM tagset, tools can be built to access any part of the file, content or metadata, and extract, validate, or add information to the file, as part of the publishing workflow.

Can be built? Please. We have one format mixed in to another format using a user interface that is only accessible from an expensive proprietary application. I’m sure I could write a script to pull the NLM bits out of the Open XML, but for each new kind of embedded XML I would have to rewrite my code and test that it works with the user interface code that has been added to Word. In this case it involves dealing with some special attributes to re-order sections (I think), which doesn’t look easy or pleasant to me.

[Update: to be clear, there is no way that I can see for an author to export the NLM XML format only. I’m assuming that must be something that happens using a different tool.]
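For what it is worth, the naive version of such a script is short. The Python sketch below assumes the embedded elements appear as WordprocessingML `w:customXml` wrappers in `word/document.xml`, each carrying its element name in a `w:element` attribute; whether the add-in's files look exactly like this is an assumption, and the `custom_xml_text` function is hypothetical. It deliberately ignores the section re-ordering attributes mentioned above.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def custom_xml_text(docx):
    """Return (element name, contained text) for each custom XML wrapper.

    `docx` may be a path or a file-like object containing the package.
    Nested wrappers are listed separately, and an outer wrapper's text
    includes the text of anything nested inside it.
    """
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    found = []
    for node in root.iter(f"{{{W}}}customXml"):
        name = node.get(f"{{{W}}}element")
        text = "".join(t.text or "" for t in node.iter(f"{{{W}}}t"))
        found.append((name, text))
    return found
```

Getting from a list like this to valid NLM XML, in the right order and with all the metadata, is where the real work would be.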

And it is worth remembering that this plugin is not accessible to the majority even of Windows users. For example here at USQ Word 2007 has not been rolled out yet2. And the plugin is not available at all on platforms other than Windows. That’s not what I hoped the new standards-wielding Microsoft was on about.

Preservation

There are going to be serious issues with preservation. What are archivists supposed to do with bastard mashed-up formats like this which depend on a particular package to make sense of them?

It is true that for documents that make it to publication in the NLM XML format this should not be an issue: the resulting XML should be perfect for archiving. But I can see that a lot of things that are of value might not make it through to XML. What about archived author’s manuscripts which are one of the backbones of Open Access? What about the original editable files for images drawn using Microsoft Office tools, which are embedded in the source file?

Think about what would happen if this approach became common for different XML formats: there could be a proliferation of non-standard polluted Word documents to deal with in repositories.

This add-in represents the Microsoft business model in action. See Brian Jones’ response to my probing on the issue of how this bastard mashup stuff is supposed to work. I quoted this last time, but it’s worth reminding ourselves that this is what Microsoft is about, never mind the standards:

There is a huge market that exists today for custom Office solutions. People customize the Office applications in all kinds of ways to try to get more out of their documents. By adding the support for custom defined schemas, we made it much easier to build semi-structured solutions on top of Word. Rather than rely on hacks with styles or bookmarks, folks could create a simple schema and add some XML tags into their existing document solutions.

http://blogs.msdn.com/brian_jones/archive/2005/07/08/436973.aspx#452483

Brian Jones calls using styles to carry semantics ‘a hack’, and yet embedding foreign XML in a Word document and hand-crafting a user interface to deal with the resulting mishmash of tags is somehow not a hack? I agree that styles and bookmarks (and tables; we use them a lot) are somewhat limited carriers for microformats, but the XML embedding thing has always looked like a trap to me: too expensive to set up and maintain and too much embedded in the Windows world. As I mentioned above, I think the new extension mechanism for ODF may be a better compromise; maybe we’ll see that in the ODF support in the service pack release in 2009.

An alternative

There’s an alternative approach, which is to use features that are common to word processors in general and which are expressed in the underlying file formats directly, which I wrote about in my last post. There would be some interesting challenges in finding interoperable ways to embed all the ‘special’ items that are allowed. Some of these are already supported in our ICE templates, but not quite with the same structural rigor as in the add-in.

graphics3

Chris Rusbridge from the lamented in the comments of my last post that we don’t do NLM export from ICE but I reckon we could produce NLM XML from ICE documents with no more subsequent work required of editors than you would get using the AAAiMOW (I’m guessing we have no data about how well it [the add-in] works and I have yet to work through the section of the documentation for editors).

The readme tells us that styles don’t work (emphasis mine):

Custom XML elements are used to represent other abstractions that exist in the NLM tagset, but that are not found in Word, and to do so in a manner that can be presented to the author for editing in a robust way (unlike the use of custom styles, which was one of the ways to try to solve this problem in earlier versions of Word, and was not very reliable).

I have no doubt that there are lots of terrible style based systems out there, but we have worked hard on making styles usable, interoperable and easy to apply and providing robust rapid-feedback document conversion.

(Maybe both of us are wrong: MJ Suhonos at PKP thinks that you can create XML without using either styles or embedded XML by using document formatting to infer structure.)

Would anyone care to fund a small project to see if we can use ICE to produce similar overall results (in terms of overall ROI) to the AAAiMOW but in a cross-platform solution? Microsoft? Anyone in the UK with access to JISC funds? A publisher?


1 Sounds like a noise emanating from a petulant feline.

2 We’re bracing for the onslaught on the help desk as hundreds of users have to re-learn commands they’ve been using for their whole working lives. It seems to us on the ICE team that this is a perfect time to introduce our users to the copy of OpenOffice.org which will be on their computer.
