
Beyond blogging: style-driven HTML export from 2007. Please.

2006-05-13

Via Brian Jones, who writes about the new Office XML formats for Microsoft, I welcome the news from Joe Friend that there will be built-in blogging in Word 2007.

This is good news not so much for the blogging bit but for the way that Word will be able to make clean HTML from styles. Joe Friend only mentions a couple of styles (h1 and, I assume, quotes, or does he mean any paragraph enclosed in quote marks and indented?):

Go ahead, click View, Source in your browser and look at the HTML starting with "Word is a great tool..." We really are going pretty basic here. Bold become <strong>, Italic becomes <em>, Heading 1 become <h1>, Quotes become <blockquote> and on it goes. There are definitely kinks in Beta 2. For example we are encoding smart quotes incorrectly so I had to turn off that feature in Word, but the goal is to output just what is needed to make your blog post clean and readable (code and rendered HTML).

That's fine, but what about lists, and pre-formatted text embedded in quotes and so on? (And actually I think bold should map to <b>, or nothing, and you should use a style called 'strong' if that's what you want).

Well, at the ICE project we have developed a stylesheet that can drive clean HTML output, and we have templates for both Word and OpenOffice.org – so I can post to this blog from Microsoft Word or any OpenDocument-aware application, like OpenOffice.org. I have covered this in a number of previous posts. So look here for an example, and here for some stuff about the ICE approach to blogging from a word processor, and in the pre-print of the paper I'll be giving on ICE at Ausweb 06 for some more detail about how the mapping works. I'll quote that paper here:

The core styles are listed below.

| Family | Type | Style names |
|---|---|---|
| Paragraph (p) | | p |
| Heading (h) | | h1, h2, h3, h4, h5 |
| Heading (h) | Numbered (number) | h1n, h2n, h3n, h4n, h5n |
| List item (li) | Numbered (number) | li1n, li2n, li3n, li4n, li5n |
| List item (li) | Bullet (bullet) | li1b, li2b, li3b, li4b, li5b |
| List item (li) | Uppercase Alpha (A) | li1A, li2A, li3A, li4A, li5A |
| List item (li) | Lowercase Alpha (a) | li1a, li2a, li3a, li4a, li5a |
| List item (li) | Lowercase Roman (i) | li1i, li2i, li3i, li4i, li5i |
| List item (li) | Uppercase Roman (I) | li1I, li2I, li3I, li4I, li5I |
| List item (li) | Continuing paragraph (p) | li1p, li2p, li3p, li4p, li5p |
| Blockquote (bq) | | bq1, bq2, bq3, bq4, bq5 |
| Definition List | Term (dt) | dt1, dt2, dt3, dt4, dt5 |
| Definition List | Description (dd) | dd1, dd2, dd3, dd4, dd5 |
| Pre formatted (pre) | | pre1, pre2, pre3, pre4, pre5 |
| Metadata: title (title) | | Title |

Table of style names for paragraph styles in ICE.

The set of style names is designed to be different to those that ship by default with major word processors in order to emphasize that this is a self-contained system. For example, a first level heading is called h1, rather than Heading 1 in Word or OpenOffice.org while a first level bulleted list item would be li1b for “list item, level 1, bullet”.
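To make the naming scheme concrete, here is a rough Python sketch that splits a style name into its family, level and type suffix. It is only an illustration built from the table of style names: the `IceStyle` tuple and the `parse_ice_style` function are hypothetical names invented for this example, not part of the ICE code.

```python
import re
from typing import NamedTuple, Optional


class IceStyle(NamedTuple):
    family: str            # "p", "h", "li", "bq", "dt", "dd", "pre" or "title"
    level: Optional[int]   # 1-5, or None for styles that have no level
    kind: str              # type suffix such as "n" or "b"; "" when absent


# Suffixes each levelled family accepts, read off the table of style names.
# (Assumption: no other combinations are valid.)
_ALLOWED_KINDS = {
    "h": {"", "n"},
    "li": {"n", "b", "A", "a", "i", "I", "p"},
    "bq": {""},
    "dt": {""},
    "dd": {""},
    "pre": {""},
}

_STYLE_RE = re.compile(r"^(h|li|bq|dt|dd|pre)([1-5])([A-Za-z]?)$")


def parse_ice_style(name: str) -> Optional[IceStyle]:
    """Return the parts of a style name, or None if it is not in the table."""
    if name == "p":
        return IceStyle("p", None, "")
    if name == "Title":
        return IceStyle("title", None, "")
    match = _STYLE_RE.match(name)
    if match is None:
        return None
    family, level, kind = match.groups()
    if kind not in _ALLOWED_KINDS[family]:
        return None
    return IceStyle(family, int(level), kind)
```

Because everything a converter needs is encoded in the short name, `parse_ice_style("li1b")` yields the family `li`, level 1 and type `b`, while a vendor default such as `"Heading 1"` is simply not recognised.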

In the default style-sets that come with other word processors this kind of list item might be “List 1” in OpenOffice.org, or “List Bullet 1” in Word. The Word style name is more readable than the ICE style, but at the cost of being so long that it can be difficult to work with in Word itself, when trying to view style names in the left margin (a feature denied to users of OpenOffice.org).

So, what if Word 2007 finally shipped with the Normal template containing a complete set of styles, like the ICE styles, that would cover pretty much the same territory as HTML? Not just headings, but different flavours of numbered list, definition lists, pre-formatted text and blockquotes in a number of levels that could be combined. Something a bit better than the feeble, incomplete set of styles Microsoft has been shipping for years.

Hey Joe, you can contact me if you'd like some help – I've been working on this issue for ten years.

(And what if the much-hyped new clean Word interface defaulted to using styles for its formatting? Imagine if pressing those little list-icons gave you not only list-like formatting but style-driven list-based formatting. That would mean that you could export clean HTML and really interoperate with other packages.)

Given a decent set of styles, the default Save as HTML... in Word could finally produce nice clean HTML. Please, please, Microsoft, don't tell us that you've continued to bury and de-value styles, and make templates even harder to find in the interface.

For example, if Word's HTML export system saw a paragraph with the style List Bullet 1 followed by List Bullet 2, it would know how to output a nested list in HTML. At the moment HTML export in any word processor is severely handicapped by having to divine good mappings to HTML from a completely open-ended formatting palette, with the result that clean export is pretty much impossible. You can read about my frustrations with the OpenOffice.org Writer application here.
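As a rough illustration of why levelled list styles make nesting easy, the sketch below turns a sequence of styled paragraphs into nested HTML lists. It is a minimal example under stated assumptions, not the ICE converter or anything Word does: the `bullets_to_html` function is hypothetical, it only handles bullet styles named like the ICE ones (`li1b` to `li5b`), and it rejects input that skips a level.

```python
import re
from html import escape

# Assumption: bullet paragraphs use ICE-like names, li1b .. li5b.
_BULLET_RE = re.compile(r"^li([1-5])b$")


def bullets_to_html(paragraphs):
    """Convert (style_name, text) pairs into nested <ul> markup."""
    out = []
    depth = 0  # number of <ul> elements currently open
    for style, text in paragraphs:
        match = _BULLET_RE.match(style)
        if match is None:
            raise ValueError(f"not a bullet list style: {style!r}")
        level = int(match.group(1))
        if level > depth + 1:
            raise ValueError(f"list level skipped before {style!r}")
        if level == depth + 1:
            # Going one level deeper: the new list sits inside the
            # parent item, which is still open.
            out.append("<ul>")
            depth += 1
        else:
            # Same level or shallower: close any deeper lists, then
            # close the previous item at this level.
            while depth > level:
                out.append("</li></ul>")
                depth -= 1
            out.append("</li>")
        out.append("<li>" + escape(text))
    # Close whatever is still open at the end of the document.
    while depth > 0:
        out.append("</li></ul>")
        depth -= 1
    return "".join(out)
```

The converter never has to guess from indentation or bullet glyphs; the level in the style name is the only thing that decides when a list opens or closes.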

And going a bit further, wouldn't it be great if OpenOffice.org and Microsoft Word and Google's Writely (see my post) all understood the same set of styles and could make clean HTML from them? (They all agree on “Heading 1, Heading 2” but that's as far as it goes.)

OK, so maybe Microsoft and Sun and Google don't care. But we do, so we'll continue in our struggle to provide good word processor interoperability even if we have to code it ourselves. It would just be so much easier if the vendors helped the community.