
Yet more about XHTML export: sustainability


Last post I replied to Peter Albion's comments on a previous post about why I obsess about XHTML export. That post was mostly in defense of the ICE application.

Helpfully, Ian Barnes of ANU posted a nice summary of one of the really good reasons to worry about XHTML export: sustainability. If you care about the long-term viability of documents then worrying about export to a standard XML format is essential, to save being locked in to an unreadable format.

Ian explains sustainability well, so I'll just add that people who have used styles consistently in the past now, as Ian puts it, “get sustainability for free”. I wrote about this a while ago, pointing out that the real value is in having consistent structure that can be mapped to other structures:

What has happened in the last ten years is the word processors I work with (Microsoft Word and now OpenOffice.org Writer) have made it easier and easier to get XML out and back in again with no major innovations in the way they work. Although Word is clearly getting worse.

Using styles has been essential. Documents created in 1997 at Standards Australia would slot straight into the work I've been doing on the Word Processor Interoperability Project with a few simple style-replacements.

The message here is that it is consistent styles that are the most help, not 'doing XML' (yes, yes exporting to XML is an essential insurance policy and a requirement for any system). Standards Australia's strategy of keeping the Standards in Word format, using it to render them, then 'siphoning off' XML for web output has worked out well, but only because of the styles.
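To make the point about consistent styles concrete, here is a minimal sketch of why they matter: if every paragraph carries a predictable style name, mapping a document to another structure is little more than a lookup. The style names, the `STYLE_TO_TAG` table and the `paragraphs_to_xhtml` function are all hypothetical, invented for illustration; they are not the ICE template or anything used at Standards Australia.

```python
from html import escape

# Hypothetical mapping from word-processor paragraph style names to XHTML
# elements. These names are illustrative only, not from any real template.
STYLE_TO_TAG = {
    "Title": "h1",
    "Heading 1": "h2",
    "Heading 2": "h3",
    "Body Text": "p",
    "Quote": "blockquote",
}


def paragraphs_to_xhtml(paragraphs, default_tag="p"):
    """Convert (style_name, text) pairs into a list of XHTML element strings.

    Paragraphs with an unknown style fall back to default_tag, which is
    exactly where inconsistent styling starts to lose structure.
    """
    out = []
    for style, text in paragraphs:
        tag = STYLE_TO_TAG.get(style, default_tag)
        out.append(f"<{tag}>{escape(text)}</{tag}>")
    return out


# A tiny stand-in for a consistently styled document.
doc = [
    ("Title", "Sustainability"),
    ("Body Text", "Styles & structure matter."),
    ("Quote", "get sustainability for free"),
]
xhtml = paragraphs_to_xhtml(doc)
print("\n".join(xhtml))
```

Swapping the table for a different one (DocBook element names, say) would retarget the same document without touching its content, which is the sense in which the structure, not the XML, carries the value.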

Another key issue is usability. In conversation Peter Albion has wondered if maybe we should not bother with HTML and just use PDF.

But PDF Files for Online Reading is the number two usability mistake in the Top Ten Mistakes in Web Design (according to usability expert Jakob Nielsen).

Users hate coming across a PDF file while browsing, because it breaks their flow. Even simple things like printing or saving documents are difficult because standard browser commands don't work. Layouts are often optimized for a sheet of paper, which rarely matches the size of the user's browser window. Bye-bye smooth scrolling. Hello tiny fonts.

Worst of all, PDF is an undifferentiated blob of content that's hard to navigate.

PDF is great for printing and for distributing manuals and other big documents that need to be printed. Reserve it for this purpose and convert any information that needs to be browsed or read on the screen into real web pages.


So, I ask again: if we can painlessly make courses that offer HTML for online reading, so people can scan and find the bits they are interested in and follow links that will work, and PDF for printing, why wouldn't we write some cost-effective software to make that happen?

Peter Albion's closing paragraph questions the point of all this work I've been doing (I think I've answered all his points in the last couple of posts):

So, given the pain that seems to be involved, is the answer to the original question [Why do I keep going on about HTML export from word processors?] a matter of satisfaction at bending the machine to the will of a “hard master” (Sherry Turkle in The Second Self) or of masochism? It doesn’t seem to be a matter of need. Word produces respectable PDF and, when I need or want (X)HTML, there are adequate tools available for that too. http://edux.usq.edu.au/~albion/weblog/?p=117

I'm not sure where Peter gets this notion that I'm suffering pain in this process. Certainly my team and I do things that others might choose not to, but I for one am in this line of work because I find it rewarding, even financially (I've had well-paying jobs in this field). Most rewarding for me is not writing programs, but fixing situations that are obviously broken with new processes that work around the limitations and in-built stupidity of mass-market software like Word and Writer.

Hang on, why am I engaging with this name-calling?

Back to the matter under discussion, Peter seems to be implying that:

  1. you can easily generate PDF from word processors, which is true, and

  2. there are adequate tools for making XHTML, which is maybe true, depending on who you are.

But what about doing both HTML and PDF, really well, from the same source? It's not clear that he's saying that there are tools for that. I'm asking – what are good tools for that? Ones that might have a hope of broad uptake in a university context?

If someone else would help complete this stuff I would move on to something else. And it seems that the work Ian Barnes is doing may help here:

Personally I think a well-designed, standardised, structured format like DocBook is even better. That's what I'm aiming at in my work at ANU for the Australian Partnership for Sustainable Repositories. I'm using Peter's ICE template as the starting point, and converting the resulting documents into DocBook, which can be put under good version control, transferred to any platform, converted using pre-existing software into good HTML and ever-improving PDF.


If Ian's software can help us hook ICE documents into the established set of tools for DocBook, with ways of generating HTML and PDF and so on, then that's great – I'll stop playing with HTML export and move on.