Beyond blogging: style-driven HTML export from Word 2007. Please.
2006-05-13
Via Brian Jones, who writes about the new Office XML formats for Microsoft, I welcome the news from Joe Friend that there will be built-in blogging in Word 2007.
This is good news not so much for the blogging bit but for the way that Word will be able to make clean HTML from styles. Joe Friend only mentions a couple of styles (h1 and, I assume, quotes, or does he mean any paragraph enclosed in quote marks and indented?):
Go ahead, click View, Source in your browser and look at the HTML starting with "Word is a great tool..." We really are going pretty basic here. Bold become <strong>, Italic becomes <em>, Heading 1 become <h1>, Quotes become <blockquote> and on it goes. There are definitely kinks in Beta 2. For example we are encoding smart quotes incorrectly so I had to turn off that feature in Word, but the goal is to output just what is needed to make your blog post clean and readable (code and rendered HTML).
That's fine, but what about lists, and pre-formatted text embedded in quotes and so on? (And actually I think bold should map to <b>, or nothing, and you should use a style called 'strong' if that's what you want).
Well at the ICE project we have developed a stylesheet that can drive clean HTML output, and we have templates for both Word and OpenOffice.org – so I can post to this blog from Microsoft Word or any OpenDocument aware application, like OpenOffice.org. I have covered this in a number of previous posts. So look here for an example, and here for some stuff about the ICE approach to blogging from a word processor, and in the pre-print of the paper I'll be giving on ICE at Ausweb 06 for some more detail about how the mapping works. I'll quote that paper here:
The core styles are listed below.
Family              | Type                     | Style names
--------------------|--------------------------|---------------------------
Paragraph (p)       |                          | p
Heading (h)         |                          | h1, h2, h3, h4, h5
Heading (h)         | Numbered (number)        | h1n, h2n, h3n, h4n, h5n
List item (li)      | Numbered (number)        | li1n, li2n, li3n, li4n, li5n
List item (li)      | Bullet (bullet)          | li1b, li2b, li3b, li4b, li5b
List item (li)      | Uppercase Alpha (A)      | li1A, li2A, li3A, li4A, li5A
List item (li)      | Lowercase Alpha (a)      | li1a, li2a, li3a, li4a, li5a
List item (li)      | Lowercase Roman (i)      | li1i, li2i, li3i, li4i, li5i
List item (li)      | Uppercase Roman (I)      | li1I, li2I, li3I, li4I, li5I
List item (li)      | Continuing paragraph (p) | li1p, li2p, li3p, li4p, li5p
Blockquote (bq)     |                          | bq1, bq2, bq3, bq4, bq5
Definition List     | Term (dt)                | dt1, dt2, dt3, dt4, dt5
Definition List     | Description (dd)         | dd1, dd2, dd3, dd4, dd5
Pre formatted (pre) |                          | pre1, pre2, pre3, pre4, pre5
Metadata: title     | (title)                  | Title
Table of style names for paragraph styles in ICE.
The set of style names is designed to be different to those that ship by default with major word processors in order to emphasize that this is a self-contained system. For example, a first level heading is called h1, rather than Heading 1 in Word or OpenOffice.org while a first level bulleted list item would be li1b for “list item, level 1, bullet”.
In the default style-sets that come with other word processors this kind of list item might be “List 1” in OpenOffice.org, or “List Bullet 1” in Word. The Word style name is more readable than the ICE style, but at the cost of being so long that it can be difficult to work with in Word itself, when trying to view style names in the left margin (a feature denied to users of OpenOffice.org).
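To make the naming scheme concrete, here is a minimal Python sketch of how a style name from the table could be split into family, level and type. This is not the ICE code; the names parse_style, IceStyle and SUFFIXES are made up for illustration, and the rules are only my reading of the table above.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical helper, not ICE's actual implementation.
# Family prefix, level 1-5, then an optional one-letter type suffix.
STYLE_RE = re.compile(r"^(h|li|bq|dt|dd|pre)([1-5])([nbAaIip]?)$")

# Suffix letters as listed in the table of style names.
SUFFIXES = {
    "n": "number",
    "b": "bullet",
    "A": "upper-alpha",
    "a": "lower-alpha",
    "I": "upper-roman",
    "i": "lower-roman",
    "p": "continuing paragraph",
}


class IceStyle(NamedTuple):
    family: str
    level: int           # 0 for styles that have no level (p, Title)
    kind: Optional[str]  # list or numbering type, if any


def parse_style(name: str) -> Optional[IceStyle]:
    """Return the parts of a style name, or None if it is not in the table."""
    if name == "p":
        return IceStyle("p", 0, None)
    if name == "Title":
        return IceStyle("title", 0, None)
    m = STYLE_RE.match(name)
    if m is None:
        return None
    family, level, suffix = m.groups()
    if family == "li" and not suffix:
        return None  # list items always carry a type suffix
    if family == "h" and suffix not in ("", "n"):
        return None  # headings are either plain or numbered
    if family not in ("h", "li") and suffix:
        return None  # bq, dt, dd and pre take no suffix
    return IceStyle(family, int(level), SUFFIXES.get(suffix))
```

So parse_style("li1b") gives the family li, level 1 and type bullet, while a stock name such as "Heading 1" is rejected, which is the point of keeping the ICE names distinct from the defaults.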
So, what if Word 2007 finally shipped with the Normal template containing a complete set of styles, like the ICE styles, that would cover pretty much the same territory as HTML? Not just headings, but different flavours of numbered list, definition lists, pre-formatted text and blockquotes in a number of levels that could be combined. Something a bit better than the feeble, incomplete set of styles Microsoft has been shipping for years.
Hey Joe, you can contact me if you'd like some help – I've been working on this issue for ten years.
(And what if the much-hyped new clean Word interface defaulted to using styles for its formatting? Imagine if pressing those little list-icons gave you not only list-like formatting but style-driven list-based formatting. That would mean that you could export clean HTML and really interoperate with other packages.)
Given a decent set of styles then finally the default Save as HTML... in Word could produce nice clean HTML. Please, please, Microsoft don't tell us that you've continued to bury and de-value styles, and make templates even harder to find in the interface.
For example, if Word's HTML export system saw a paragraph with the style “List Bullet 1” followed by “List Bullet 2”, it would know how to output a nested list in HTML. At the moment HTML export in any word processor is severely handicapped by having to divine good mappings to HTML from a completely open-ended formatting palette, with the result that clean export is pretty much impossible. You can read about my frustrations with the OpenOffice.org Writer application here.
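Here is a rough Python sketch of how a run of styled paragraphs could drive nested list output. It is an illustration only, not the ICE converter and not anything Word does: the function name lists_to_html is made up, the input is assumed to be a list of (style name, text) pairs, and it uses the ICE names li1b to li5b and li1n to li5n rather than Word's “List Bullet 1”.

```python
import re
from html import escape

# Hypothetical sketch: handles bullet (b) and numbered (n) list items only.
LIST_STYLE = re.compile(r"^li([1-5])([bn])$")
TAGS = {"b": "ul", "n": "ol"}


def lists_to_html(paragraphs):
    """Turn (style, text) pairs into HTML, nesting lists by style level."""
    out = []
    stack = []  # tags of the lists currently open, outermost first

    def close_one():
        # Close the innermost item and the list that contains it.
        out.append(f"</li></{stack.pop()}>")

    for style, text in paragraphs:
        m = LIST_STYLE.match(style)
        if m is None:
            # Anything that is not a list item ends all open lists.
            while stack:
                close_one()
            out.append(f"<p>{escape(text)}</p>")
            continue
        level, tag = int(m.group(1)), TAGS[m.group(2)]
        while len(stack) > level:
            close_one()  # coming back up from deeper levels
        if len(stack) == level:
            if stack[-1] == tag:
                out.append("</li>")  # next item in the same list
            else:
                close_one()  # same level, different list type
        while len(stack) < level:
            stack.append(tag)
            out.append(f"<{tag}>")
            if len(stack) < level:
                out.append("<li>")  # empty wrapper item for a skipped level
        out.append(f"<li>{escape(text)}")
    while stack:
        close_one()
    return "".join(out)


print(lists_to_html([("li1b", "A"), ("li2b", "B"), ("li1b", "C")]))
# prints <ul><li>A<ul><li>B</li></ul></li><li>C</li></ul>
```

The level number in the style name is all the exporter needs to decide when to open or close a list, so there is nothing to guess from indents or bullet characters.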
And going a bit further, wouldn't it be great if OpenOffice.org and Microsoft Word and Google's Writely (see my post) all understood the same set of styles and could make clean HTML from them? (They all agree on “Heading 1, Heading 2”, but that's as far as it goes.)
OK, so maybe Microsoft and Sun and Google don't care. But we do, so we'll continue in our struggle to provide good word processor interoperability even if we have to code it ourselves. It would just be so much easier if the vendors helped the community.