<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.3.1" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>PT's blog</title>
	<link>http://ptsefton.com</link>
	<description>Why can't word processors make decent HTML?</description>
	<pubDate>Thu, 02 Oct 2008 22:33:51 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.3.1</generator>
	<language>en</language>
			<item>
		<title>ICE: eResearch for Word users</title>
		<link>http://ptsefton.com/2008/09/30/ice-eresearch-for-word-users.htm</link>
		<comments>http://ptsefton.com/2008/09/30/ice-eresearch-for-word-users.htm#comments</comments>
		<pubDate>Tue, 30 Sep 2008 06:26:40 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/09/30/ice-eresearch-for-word-users.htm</guid>
		<description><![CDATA[My poster from eResearch Australasia 2008]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=203"><!-- &nbsp; --></abbr>
<script type="text/javascript" src="/jquery.js"><!-- --></script><script type="text/javascript" src="/geo.js"><!-- --></script><div><span class='pdf-rendition-link'><a href='http://ptsefton.com/wp-content/uploads/2008/09/eresearch-word-poster2.pdf'>View as PDF</a></span><div class='page-toc'><ul/></div><div><p>I&#8217;m just blogging this poster from OR08 to show that it can be done. </p><div class="slide"><h1>About this hyperposter</h1><div class="Table8" style="width: 100%; margin: 0px; padding: 0px; text-align:left;"><table class="Table8" style="border-spacing: 0;empty-cells: show; margin-left:0cm; margin-right:0cm; width:100%; border-collapse: collapse; "><colgroup><col style="width:20.313cm;"/></colgroup><tbody><tr><td class="Table8_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><p class="P3">This poster is a hyperdocument designed to show some potential applications for eResearch publications.</p><p>This document has embedded semantics.</p><p>For example, it w<span class="T2">as written in:  </span></p><ul class="lib"><li><p><a href="http://geohash.org/r7h4cr2dt6pz"><span class="T2">Toowoomba at USQ</span></a><span class="T2"> (S 27.601335<span class="spCh spChxb0">°</span> E 151.930854<span class="spCh spChxb0">°</span>),  </span></p></li><li><p><span class="T2">for a </span><a href="http://geohash.org/r1r0ejdh0yd6"><span class="T2">conference in Melbourne</span></a><span class="T2"> (S 37.849925<span class="spCh spChxb0">°</span> E 144.978368<span class="spCh spChxb0">°</span>)</span></p></li></ul><p><span class="T2">Embedded geographical da</span>ta (via <a href="http://geohash.org/">geohash</a>) can be used to generate a map like the one here. 
On the web, this is an interactive, automatic process.</p></td><td class="Table8_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><p class="center"><a name="graphics1"/><img alt="graphics1" class="fr1" height="333" src="http://ptsefton.com/wp-content/uploads/2008/09/m3924e4b2s534x3332.jpg" style="border:0px;" width="534"/></p><p><a href="http://www.openstreetmap.org/">OpenStreetMap</a> data can be used freely under the terms of the Creative Commons Attribution-ShareAlike 2.0 license.</p></td></tr></tbody></table></div><p/></div><div class="slide"><p/><h1>The mythical datument</h1><p>The term Datument was coined in 2004 by Peter Murray-Rust and Henry Rzepa:</p><blockquote class="bq"><p>A datument is a hyperdocument for transmitting &#8220;complete&#8221; information including content and behaviour. &#8230;  where the machine is supplied with tools which are semantically aware of the document content. Examples of the latter are domain-specific XML components such as maps (GML), graphics (SVG) and molecules (Chemical Markup Language, CML)</p><p>Murray-Rust, P. &amp; Rzepa, H.S., 2004. The Next Big Thing: From Hypermedia to Datuments. Journal of Digital Information, 5(1), p.248. Available at: <a href="http://jodi.tamu.edu/Articles/v05/i01/Murray-Rust/?printable=1">http://jodi.tamu.edu/Articles/v05/i01/Murray-Rust/?printable=1</a> </p></blockquote><p>But they are far from common. 
This  poster / blog post / presentation / map-mashup  might be the closest you have ever been to one.</p></div><p class="P6"/><div class="slide"><h1>It&#8217;s only 2008 <span class="spCh spChx2013">–</span> be patient!</h1><p class="center"><a name="Object3"/><img alt="Object3" class="fr2" height="564" src="http://ptsefton.com/wp-content/uploads/2008/09/m22fe7558s971x5642.jpg" style="border:0px;" width="971"/></p><p/></div><div class="slide"><h1>Produce PDF, HTML and more from word processors</h1><div class="Table6" style="width: 100%; margin: 0px; padding: 0px; text-align:left;"><table class="Table6" style="border-spacing: 0;empty-cells: show; margin-left:0cm; margin-right:0cm; width:100%; border-collapse: collapse; "><colgroup><col style="width:20.41cm;"/></colgroup><tbody><tr><td class="Table6_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><ol class="lin" style="list-style: decimal;"><li><p>Microsoft <b>Word</b> (Windows &amp; Mac)</p></li><li><p><a href="http://openoffice.org/"><b>OpenOffice.org</b></a> Writer &amp; derivatives (Windows, Mac and Linux)</p></li><li><p>Applies <b>styles behind the scenes</b> to capture structure</p></li><li><p><b>Command line or web</b> service for integration</p></li><li><p><b>Open source</b> <span class="spCh spChx2013">–</span> built on Python + OpenOffice.org</p></li><li><p>Works with <a href="http://www.zotero.org/">Zotero</a></p></li><li><p>Built in <b>version control</b> via <a href="http://subversion.tigris.org/">Subversion</a></p></li><li><p>Integrated with ePrints and other repositories (coming soon via the <a href="http://www.jisc.ac.uk/whatwedo/programmes/digitalrepositories2007/theorem-ice.aspx">ICE-TheOREM</a> project)</p></li></ol></td><td class="Table6_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><p><span style="display: block"><a name="Object2"/><img alt="Object2" class="fr3" height="477" src="http://ptsefton.com/wp-content/uploads/2008/09/m36029705s714x4772.jpg" 
style="border:0px; vertical-align: baseline; margin-bottom: 0.000376911479647px;" width="714"/></span></p></td></tr></tbody></table></div><p/></div><div class="slide"><h1><a href="http://ice.usq.edu.au/">ICE</a>: a hub for collaborative authoring</h1><p class="P2"><a name="Object1"/><img alt="Object1" class="fr4" height="641" src="http://ptsefton.com/wp-content/uploads/2008/09/7af7e9e8s903x6412.jpg" style="border:0px;" width="903"/></p></div><p/><div class="slide"><h1>Ask me how</h1><div class="Table5" style="width: 100%; margin: 0px; padding: 0px; text-align:left;"><table class="Table5" style="border-spacing: 0;empty-cells: show; margin-left:0cm; margin-right:0cm; width:100%; border-collapse: collapse; "><colgroup><col style="width:20.41cm;"/></colgroup><tbody><tr><td class="Table5_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><p>(Metadata is embedded in the hyperposter using styles)</p><div class="Table10" style="width: 100%; margin: 0px; padding: 0px; text-align:left;"><table class="Table10" style="border-spacing: 0;empty-cells: show; margin-left:0cm; margin-right:0cm; width:100%; border-collapse: collapse; "><colgroup><col style="width:11.005cm;"/><col style="width:9.209cm;"/></colgroup><tbody><tr><td class="Table10_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><p class="center">Peter Sefton </p></td><td class="Table10_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><p>{p-meta-author-name}</p></td></tr><tr><td class="Table10_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><p class="center">The University of Southern Queensland</p></td><td class="Table10_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><p>{p-meta-author-affiliation}</p></td></tr><tr><td class="Table10_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><p class="center"><a href="mailto:peter.sefton@usq.edu.au"><span class="T2">peter.sefton@usq.edu.au</span></a></p></td><td 
class="Table10_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><p>{p-meta-author-email}</p></td></tr><tr><td class="Table10_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><p class="center">+61 (0) 410 326955</p></td><td class="Table10_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><p>{p-meta-author-phone-mobile}</p></td></tr></tbody></table></div><p> </p></td><td class="Table5_A1" style="vertical-align: top;  border:none;  padding:0.097cm; "><p>Also available in machine readable form:</p><ul class="lib"><li><p>Dublin Core</p>
<pre>&lt;oai_dc:dc&gt;</pre>
<pre> &lt;dc:title&gt;ICE: eResearch for Word users&lt;/dc:title&gt;</pre>
<pre> &lt;dc:creator&gt;Peter Sefton&lt;/dc:creator&gt;</pre>
<pre>&lt;/oai_dc:dc&gt;</pre></li><li><p>RDF <span class="spCh spChx2013">–</span> <a href="http://www.google.com.au/search?q=ore+rdf&amp;ie=utf-8&amp;oe=utf-8&amp;aq=t&amp;rls=org.mozilla:en-US:official&amp;client=firefox-a">ORE</a> resource map for migration to repositories</p></li></ul></td></tr></tbody></table></div><p/></div><p/></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/09/30/ice-eresearch-for-word-users.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>Is this thing working?</title>
		<link>http://ptsefton.com/2008/09/19/is-this-thing-working.htm</link>
		<comments>http://ptsefton.com/2008/09/19/is-this-thing-working.htm#comments</comments>
		<pubDate>Fri, 19 Sep 2008 02:54:47 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/09/19/is-this-thing-working.htm</guid>
		<description><![CDATA[Just a geohash test]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=197"><!-- &nbsp; --></abbr>
<script type="text/javascript" src="/jquery.js"><!-- --></script><script type="text/javascript" src="/geo.js"><!-- --></script><div><div class='page-toc'></div><div><p>I&#8217;m working on my hyperposter for <a href="http://www.eresearch.edu.au/programme">eResearch Australasia 2008</a>. This is a test to see if the mapping system here is still working.</p><p>This document has embedded semantics. It was written in: </p><ol class="lin" style="list-style: decimal;"><li><p> <a href="http://geohash.org/r7h4cr2dt6pz">Toowoomba at USQ</a> [Update: fixed spelling] (S 27.601335<span class="spCh spChxb0">°</span> E 151.930854<span class="spCh spChxb0">°</span>),  </p></li><li><p>for a <a href="http://geohash.org/r1r0ejdh0yd6">conference in Melbourne</a> (S 37.849925<span class="spCh spChxb0">°</span> E 144.978368<span class="spCh spChxb0">°</span>)</p></li></ol><p/><p class="meta-abstract" style="margin-left:0cm; margin-right:1.27cm; text-indent:0cm; "/></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/09/19/is-this-thing-working.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>Embedding XML in word processing documents (if you really must)</title>
		<link>http://ptsefton.com/2008/09/09/embedding-xml-in-word-processing-documents-if-you-really-must.htm</link>
		<comments>http://ptsefton.com/2008/09/09/embedding-xml-in-word-processing-documents-if-you-really-must.htm#comments</comments>
		<pubDate>Tue, 09 Sep 2008 03:38:31 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/09/09/embedding-xml-in-word-processing-documents-if-you-really-must.htm</guid>
		<description><![CDATA[Rick Jelliffe has posted a comparison of how foreign XML can be embedded in OOXML (that&apos;s the XML format for Microsoft Office) and ODF (the Open Document Format).]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=195"><!-- &nbsp; --></abbr>
<div><div class='page-toc'></div><div><p>Rick Jelliffe has posted a <a href="http://news.oreilly.com/2008/08/wrapping-with-foreign-elements.html">comparison</a> of how foreign XML can be embedded in OOXML (that&#8217;s the XML format for Microsoft Office) and ODF (the Open Document Format).</p><p>Rick starts with:</p><blockquote class="bq"><p>First the caveat: Word and OpenOffice are not general-purpose XML editors. </p></blockquote><p>Right. That means that if you do decide that there&#8217;s a case for embedding extra XML in OOXML or ODF then you are going to have to supply add-ons to the applications in question to edit it. So what does this mean for the two formats? (As usual I&#8217;ll just talk about the word processing format here and ignore spreadsheets and the rest.)</p><p><b>For OOXML</b>, you would have to create a Word Addin such as the one I&#8217;ve looked at here before. There could be a business case for that, but you&#8217;d have to accept that your documents were only going to be editable in Word 2007+. I gather from recent posts that Rick does some work on projects where this does make good business sense. </p><p><b>For OpenOffice.org</b> you&#8217;re out of luck. Rick&#8217;s tests show that OpenOffice.org strips out foreign markup. It&#8217;s unclear whether this is conformant behaviour or not:</p><blockquote class="bq"><p>But the bottom line for foreign elements as wrappers in ODF and OOXML is that ODF allows them to be stripped out while OOXML doesn&#8217;t allow that; neither of course require that any particular application understands them. 
The bottom line for OpenOffice and Office seems to be that OpenOffice strips them (dangerously, but perhaps allowed because of bad drafting of that part of the ODF standard) while Office 2007 does allow them.</p></blockquote><p>As <a href="http://delicious.com/ptsefton/odfconformance+ptsefton">I&#8217;ve covered here many times, ODF interoperability between applications is basically non-existent</a> except between Microsoft Office and OpenOffice.org and its derivatives, where some things work quite well. The bottom line is that ODF doesn&#8217;t have any formal notion of what&#8217;s conformant <span class="spCh spChx2013">–</span> it&#8217;s up to application developers to implement the bits they feel like implementing. </p><blockquote class="bq"><p>The OpenDocument specification does not specify which elements and attributes conforming  application must, should, or may support. The intention behind this is to ensure that the  OpenDocument specification can be used by as many implementations as possible, even if these applications do not support some or many of the elements and attributes defined in this specification. Viewer applications for instance may not support all editing relates elements and attributes (like change tracking), other application may support only the content related elements and attributes, but none of the style related ones. </p><p><a href="http://www.oasis-open.org/committees/download.php/12572/OpenDocument-v1.1-os.pdf">http://www.oasis-open.org/committees/download.php/12572/OpenDocument-v1.1-os.pdf</a> </p></blockquote><p>I think for most uses a much better bet is to use microformats which leverage the built-in features of the formats. Not only do these work in the aforementioned major applications for OOXML or ODF; in many cases they interchange between the formats quite nicely as well.</p><p>What&#8217;s a word processing microformat? 
One example would be using a one-cell borderless table with a paragraph in it of style &#8216;h-warning&#8217; to indicate a bit of content that&#8217;s a warning, to use Rick&#8217;s example. OK, so using a table is inelegant, but it works in both Word and OpenOffice.org Writer and will survive round tripping between .doc and .odt and .docx. You could use a frame, which is a more semantically neutral element, and sacrifice some interop, or you could use styles only, which is a bit harder for users to manage and more error-prone. Actually, Rick gives an example of a styles-based microformat approach.</p><p><a href="http://ice.usq.edu.au/">We</a> use this kind of technique to do things like <a href="http://ice.usq.edu.au/introduction/about.slide.htm">generate slide shows</a> from text embedded in <a href="http://ice.usq.edu.au/introduction/about.htm">documents</a>, and we&#8217;re developing methods for embedding metadata in documents using styles.</p></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/09/09/embedding-xml-in-word-processing-documents-if-you-really-must.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>More ideas about online and offline word processor integration - is anybody listening?</title>
		<link>http://ptsefton.com/2008/09/05/more-ideas-about-online-and-offline-word-processor-integration-is-anybody-listening.htm</link>
		<comments>http://ptsefton.com/2008/09/05/more-ideas-about-online-and-offline-word-processor-integration-is-anybody-listening.htm#comments</comments>
		<pubDate>Fri, 05 Sep 2008 01:47:17 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/09/05/more-ideas-about-online-and-offline-word-processor-integration-is-anybody-listening.htm</guid>
		<description><![CDATA[Adobe are not thinking along the same lines as me at all. They don&apos;t see it as important to be able to interchange with other word processors because they&apos;re going to make theirs broadly available and they don&apos;t care much for HTML]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=193"><!-- &nbsp; --></abbr>
<div><div class='page-toc'></div><div><p>Via Glyn Moody, who <a href="http://opendotdotdot.blogspot.com/2008/09/i-dont-want-to-say-we-told-you-so.html">doesn&#8217;t want to say he told us so</a>, I see that Adobe is <a href="http://uk.techcrunch.com/2008/09/04/startups-in-chaos-as-adobes-flashpaper-discontinues/">discontinuing support</a> for Flashpaper, a proprietary Adobe (via Macromedia) technology for disseminating documents online. This means that anyone who has put stuff in there now has to migrate all their stuff to some other format. That&#8217;s what you get for using technology that&#8217;s controlled by a single vendor.</p><p>That reminded me that I had this piece I&#8217;ve been working on about Adobe Buzzword, another Adobe proprietary document format. </p><p>Following my <a href="http://ptsefton.com/2008/07/24/more-on-buzzword.htm">last post on Buzzword</a>, I had an email from Tad Staley at Adobe which seemed encouraging:</p><blockquote class="bq"><p>You had an interesting point about exporting named styles to Word. By this, I assume you mean that we create a handful of styles that correspond to Buzzword fonts and paragraph settings, and use them within the .doc or .docx file we create on exporting? This would then allow us to &#8220;round trip&#8221; the document better from Buzzword to Word and back again.</p><p>We&#8217;d like to hear any other thoughts you have with respect to styles - as I said, we&#8217;re working on them now, so the timing is good.</p></blockquote><p>So I drafted something along the lines of what you&#8217;re about to read and sent it off to Tad. It was pretty clear from Tad&#8217;s reply that Adobe are not thinking along the same lines as me at all. They don&#8217;t see it as important to be able to interchange with other word processors because they&#8217;re going to make theirs broadly available, and they don&#8217;t care much for HTML because they care too much about controlling the fine details of document presentation. 
What this means is they&#8217;d like you to use Buzzword / Flash / PDF to disseminate your work rather than an open web format <span class="spCh spChx2013">–</span> OK, some PDF is a bit open, but it is very much page-oriented and much harder to integrate with other services than HTML. I think that&#8217;s terribly short-sighted and reduces considerably what people can do with their documents. Mashups and so on that are built for the open web would all have to be redone for the Buzzword world, for example my <a href="http://ptsefton.com/2008/06/19/adventures-in-geocoding-part-2-embedding-data-points-in-documents.htm">geocoding example</a>.</p><p>I&#8217;d be cautious of Buzzword if I were you, because this format could easily go the way of Flashpaper.</p><p>Anyway, here&#8217;s the gist of what I sent to Tad at Adobe. </p><p>I think it would be a good idea for you to map Buzzword docs to a set of styles when you export to .doc or .docx; I&#8217;d like to see .odt as well. I&#8217;ll go into specifics below but first some general comments.</p><p>There are so many issues in word processors regarding styles, interop and HTML export that it&#8217;s a bit hard to summarize in a short email or blog post, but here are some of the main problems. It would be great for someone to get it right for once:</p><ol class="lin" style="list-style: decimal;"><li><p><b>No standard set of styles: </b>Nobody ships a rational set of styles by default - I&#8217;d be looking for something that covers headings (both numbered and not in the same document), lists, block-quotes, preformatted text; the list is actually very similar to the set of elements in HTML, which is no coincidence as that&#8217;s a generic schema.</p></li><li><p><b>Awful HTML export:</b> Word processors almost always try to reproduce whatever the user inputs in the way of formatting, resulting in all sorts of crap in the HTML they output. 
Building a new product is a great opportunity to do it differently.</p><p>(Buzzword&#8217;s HTML isn&#8217;t bad by comparison with some, within its current limitations. But really, you should fix the list nesting. I think the HTML model is silly too, but it is what it is.)</p></li><li><p><b>No &#8217;structure-only&#8217; mode:</b> Why not have UI mode where you can&#8217;t do gratuitous formatting, only structure your document using headings, do lists and blockquotes etc and then choose from a menu of stylesheets? That is, turn off the font panel. This may have been hard to sell in the old days but now I think people would get it <span class="spCh spChx2013">–</span> particularly when they are writing for multiple media <span class="spCh spChx2013">–</span> where the same document could be published as both HTML and PDF. If you restrict users to a known style set then you can reliably change the presentation of their documents automatically. If not then you have problems. A couple of examples:</p><ol class="li-lower-roman" style="list-style: lower-roman;"><li><p>If people have chosen colours then you can&#8217;t change the background colour of a page in case you have readability problems.</p></li><li><p>If you allow absolute indents (say 4cm) then you might not be able to reformat into multiple columns and still have the document look OK.</p></li></ol></li><li><p><b>Extreme confusion in the area of lists:</b> Both Word (and by extension OOXML), and Writer (and by extension ODF) have mind-blowingly crazy list models. </p><ul class="lib"><li><p>Word has paragraph styles to which you can attach list formatting, and it has named outlines (with one of the worst GUIs <b>ever </b>even before Word 2007 took it to new heights) AND it has list styles which came along circa Word 2003. 
</p></li><li><p>Writer has paragraph styles and list styles both of which can be applied independently, and lists are represented as a hierarchy in the file format, although the GUI gives almost no clues as to what the hierarchy actually is.</p></li></ul><p>In the ICE project we deal with this by automatically creating paragraph styles and list styles / named outlines and providing toolbars to apply both at once, resulting in much more stable, interoperable documents than you get if you leave users to deal with all this by themselves.</p></li></ol><p>So here&#8217;s what I would do if I had a chance to influence Buzzword, in addition to building in a standard kind of word processor style system. </p><p>Based on my observation of the behavior of the list formatting, Buzzword obviously has some notion of structure built into it, even if it doesn&#8217;t (yet) have headings. So let&#8217;s look at what you could do with lists. </p><p>I think the Buzzword UI for lists is pretty cool <span class="spCh spChx2013">–</span> one thing I like is that lists stay connected. In most online editors if you change an item in the middle of a list into a plain paragraph and then back into a list item you get two disconnected lists, something that makes no structural or practical sense. Buzzword gets this right and makes sure that list items adjacent to each other are part of the same list.</p><p>Here&#8217;s a test-list in Buzzword:</p><p><span style="display: block"><a name="graphics1"/><img alt="graphics1" class="fr1" height="241" src="http://ptsefton.com/wp-content/uploads/2008/09/m7409a702s552x241.jpg" style="border:0px;" width="552"/></span></p><p>The UI is really slick <span class="spCh spChx2013">–</span> it actually understands the structure of the list so when you hit the promote (&lt;-) and demote (-&gt;) buttons it does The Right Thing. My only quibble is the way it insists on all the items at a particular level being the same kind of list item even if they are not siblings. 
</p><p>Oh, and don&#8217;t call the list level &#8216;outline level&#8217; because in other word processors that term is used for the heading structure.</p><p>My proposal is that on export Buzzword should not just use formatting; it should create styles. As I mentioned before, this is more complex than it needs to be, due to the legacy of gratuitous features in the target applications, but it is doable. </p><p>Let&#8217;s take the example of .odt export for use in the OpenOffice.org family of word processors. I&#8217;ll use the ICE version of the style names, chosen for their brevity, but you could use longer versions. </p><p>Here&#8217;s the same test list embedded in this document, which I&#8217;m writing using NeoOffice. The paragraph style names are shown in curly-braces at the end of each paragraph; behind the scenes my toolbar also applies a list style of the same name.</p><blockquote class="bq"><p>Buzzword test document. {p}</p><ul class="lib"><li><p>    Bullet list first item. {li1b}</p></li><li><p>    Bullet list second item. {li1b}</p><ol class="li-lower-alpha" style="list-style: lower-alpha;"><li><p>    Embedded list with lowercase-alpha first. {li2a}</p></li><li><p>Embedded list with lowercase-alpha second {li2a}</p><p> Continuing paragraph (&#8217;Skip&#8217; in Buzzword-speak) {li2p}</p></li><li><p>Embedded list with lowercase-alpha third.{li2a}</p></li></ol></li><li><p> Bullet list third item. {li1b}</p></li></ul></blockquote><p>When exporting, you could embed some macros that provide a Buzzword-like interface via a toolbar. In ICE we have a <a href="http://ice.usq.edu.au/instructions/templates/toolbars_and_templates.htm">toolbar</a> which tries to Do The Right Thing (doesn&#8217;t always succeed I have to admit, but we&#8217;re getting there). We take a different approach from Buzzword&#8217;s modal interface and re-use the same buttons in different contexts. 
So the promote button in a list will move your list item to the left and it should pick up the correct list style by looking back through the document to see what is appropriate, whereas for a heading it would change the heading level in the document outline.</p><p>Why do we do this? It&#8217;s all about interoperability. The styles mean that we can produce good HTML, and also move documents between Word and Writer pretty easily, correcting for the differences between their wacky, annoying, productivity-sapping list models. And we give users on both word processors the same toolbar running the same code.</p><p>One advantage for Adobe and their Buzzword product would be the same <span class="spCh spChx2013">–</span> good interoperability with offline word processors. But there&#8217;s another potential benefit, the same one I suggested to Google. Adobe could start &#8216;infecting&#8217; documents with a benign structure virus. Let&#8217;s see how this could work:</p><ol class="lin" style="list-style: decimal;"><li><p>I draft a blog post like this one in Buzzword and send it via Buzzword&#8217;s sharing feature to a colleague to add their contribution. I was going to say &#8216;a paper&#8217; but Buzzword is a long way from ready for that.</p></li><li><p>My colleague doesn&#8217;t want to sign up to yet another online service, and besides is going to be editing the document later at home, so chooses the option to download it as a Word document and saves it on a USB drive.</p></li><li><p>Later at home Word prompts to say that the document contains macros and should they be allowed to run? If no, then it&#8217;s not the end of the world as we still have a Word document that can be re-imported to Buzzword later. 
If yes <span class="spCh spChx2013">–</span> then read on.</p></li><li><p>On opening the document, it&#8217;s got a Buzzword-style or ICE-style toolbar, so my colleague is able to make some changes to the document without realizing that they are dealing with the styles that were added to the document automatically on export.</p></li><li><p>When the editing is done, they can save the document locally, but since there&#8217;s a toolbar installed they can click the &#8216;Return to sender&#8217; button and it gets automatically uploaded back into my Buzzword account via an inbox.</p><p>Because they used the toolbar all the headings are set properly and the lists are nice and orderly. </p><p>(If you don&#8217;t understand why I&#8217;m going on about this go over to Google docs and try importing and exporting documents using OpenOffice.org Writer).</p></li><li><p>Later, if my colleague decides that they <i>did</i> like the Buzzword experience they can click the &#8216;Install the Buzzword template&#8217; button and have the toolbar show up all the time. If they go further and sign up for an account then they can draft things in Buzzword and have them save automatically into  Buzzword.</p></li></ol><p>You can see how this could spread the Buzzword way of life not by replacing offline word processors but by providing a bridge into the online service. If the online way is better then people will naturally stop using their offline programs.</p><p>A couple of other things that would help drive the service:</p><ul class="lib"><li><p>AtomPub support so you can post to your blog, both from the online service and from your word processor. ICE does this already.</p></li><li><p>Simple web page publishing. At the moment Buzzword does HTML export in a Zip file <span class="spCh spChx2013">–</span> why can&#8217;t it just put the page up online for you? 
</p></li><li><p>An import feature where when a user uploads an unstyled word processing document Buzzword gives it back with added styleage. (See the ongoing conversation I&#8217;m having with <a href="http://ptsefton.com/2008/06/26/a-few-words-on-magic.htm">MJ Suhonos</a>).</p></li></ul><p>There would be a couple of ways for the online word processor vendors to approach this. One would be to work with the ICE team. As far as I know there is nobody else out there with our commitment to generic word processing based web and print content management. The first mover would have an advantage and if it worked others would follow. The users would win.</p><p>Another would be to invent a proprietary set of styles and toolbars and go for more of a lockin effect. Might work. Wouldn&#8217;t be so great for the users.</p><p>I am reminded writing this that all the recent activity on word processing standards hasn&#8217;t changed things much for users. For complex documents, like business documents with embedded fields and so on interoperability between packages both online and offline is still really poor, and interoperability between word processing packages and the web is terrible. It&#8217;s not about whether you&#8217;re doing OOXML or ODF. It&#8217;s about <b>what</b> you&#8217;re doing with them.</p><p/><p/></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/09/05/more-ideas-about-online-and-offline-word-processor-integration-is-anybody-listening.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>Put on The Fascinator</title>
		<link>http://ptsefton.com/2008/09/03/put-on-the-fascinator.htm</link>
		<comments>http://ptsefton.com/2008/09/03/put-on-the-fascinator.htm#comments</comments>
		<pubDate>Tue, 02 Sep 2008 23:30:26 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/09/03/put-on-the-fascinator.htm</guid>
		<description><![CDATA[A new name for our Fedora front end]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=190"><!-- &nbsp; --></abbr>
<div><span class='pdf-rendition-link'><a href='http://ptsefton.com/wp-content/uploads/2008/09/thefascinator.pdf'>View as PDF</a></span><div class='page-toc'></div><div><p>At the <a href="http://www.usq.edu.au/adfi">Australian Digital Futures Institute</a> (ADFI, n<span class="spCh spChxe9">é</span>e LFII) we have been working on a software project, funded by our friends at <a href="http://arrow.edu.au/">ARROW</a>, to build a lightweight web front-end to the <a href="http://www.fedora-commons.org/">Fedora Commons</a> repository software. It used to go by the name of Sun of Fedora, which was just a temporary, off-the-cuff, in-joke kind of a name. (It uses the <a href="http://lucene.apache.org/solr/">Apache Solr</a> search engine).</p><p>It now has a new name.</p><p>Choosing a name mainly consisted of the ADFI doing a lot of &#8216;research&#8217; on Google and Wikipedia and IMing each other like crazy. The process threatened to consume what remains of the project budget so we cut it short after a couple of hours.</p><p>I suggested <i>Christine</i> after the <i>Siouxsie and the Banshees</i> <a href="http://www.youtube.com/watch?v=OMcLAsAzCmM">song</a> about a person with multiple personalities, on account of the software being used to show the same repository in many different ways. Most of the ADFI staff turn out to be too young, too inattentive or too sheltered to remember <i>Christine</i> although I&#8217;m pretty sure it would have been on <a href="http://en.wikipedia.org/wiki/Countdown_(Australian_TV_series)">Countdown</a>. It would have made for a good tag-line for the software. </p><blockquote class="bq"><p>Now she&#8217;s in purple, now she&#8217;s a turtle.</p></blockquote><p>Anyway, Bron Chandler suggested <a href="http://en.wikipedia.org/wiki/Fascinator">Fascinator</a>, amongst many many other names. I liked that one, as it&#8217;s a kind of add-on to a hat and is typically smaller than a <a href="http://espace.library.uq.edu.au/documentation/">Fez</a>. 
It also sounds a bit like &#8216;facet&#8217; which is nice, as the software uses facets to help you discover stuff in the repository. I think having an &#8216;F&#8217; is nice too. <span class="spCh spChx201c">“</span>The Fascinator, powered by Fedora<span class="spCh spChx201d">”</span>. </p><p>This, from the <a href="http://en.wikipedia.org/w/index.php?title=Fascinator&amp;oldid=219836545">current</a> Wikipedia page, is apt for a bit of open source software:</p><blockquote class="bq"><p>They are available pre-made, but are also quite easy and cost effective to self assemble. They are also sold in kit form.</p></blockquote><p>Turns out <i>The Fascinator</i> is also the name of a <a href="http://www.ragtimepiano.ca/rags/scott.htm#2">ragtime tune by James Scott.</a> I haven&#8217;t been able to source an Open Access version you can listen to, but maybe someone out there can knock it out for us using the <a href="http://www.ragtimepiano.ca/images/fascinator.pdf">sheet music</a>. <a href="#ftn1" name="ftn1-text"><span style="vertical-align: super;"><span class="footnote">*</span></span></a></p><p><a name="graphics1"/><img alt="graphics1" class="fr1" height="170" src="http://ptsefton.com/wp-content/uploads/2008/09/111e4460s134x170.jpg" style="border:0px;" width="134"/></p><p><i>The Fascinator</i> it is. 
</p><p>It is not an acronym and very importantly it is not in upper case but we await construction of a gratuitous <a href="http://en.wikipedia.org/wiki/Backronym">backronym</a>, from <a href="http://andrew.treloar.net/">the man</a> who brought you <a href="http://arrow.edu.au/">ARROW</a>, <a href="http://archer.edu.au/">ARCHER</a> and <a href="http://dart.edu.au/">DART</a> or from the <a href="http://www.unisanet.unisa.edu.au/Staff/homepage.asp?Name=Prashant.Pandey">creator</a> of <a href="http://www.fedora.info/wiki/index.php/Fedora_Tools#FABULOUS">FABULOUS</a> and Absolutely Fabulous.</p><p>We have soft-released the software before but now there is a new, open <a href="http://ice.usq.edu.au/projects/fascinator/trac/wiki">project site</a> where you can download it, if you&#8217;re comfortable with Subversion and installing software on Linux and such. There are instructions for Ubuntu.</p><p><i>The Fascinator</i> will also be used in a project that <a href="http://sophiaca.wordpress.com/">Caroline Drury</a> and I are leading to take a snapshot of the contents of Australian university institutional repositories, partly to test the software and partly to give a series of point in time snapshots of what is in them for research purposes. We&#8217;d like to look at the range of ways people describe their content and compare the way different repository platforms are used.</p><p>I road tested the name on <i>The Long Suffering Sandra</i>.</p><p>PT: You know that software I&#8217;ve been working on called Sun of Fedora?</p><p>SC: No.</p><p>PT: Well anyway, we&#8217;re going to call it The Fascinator. Is that a good name?</p><p>SC: Only if it&#8217;s a project to do with hats.</p><p>PT: Well it is, it builds on Fedora, which is a kind of repository.</p><p>SC: In that case it&#8217;s a stupid name, you don&#8217;t put a fascinator on a fedora.</p><p>Oh yes we do. Here&#8217;s the <a href="http://rspilot.usq.edu.au:8080/sun-of-fedora">demo site</a>. 
And besides, here&#8217;s a <a href="http://www.etsy.com/view_transaction.php?transaction_id=8166760">thing</a> which is both a fascinator AND a Fedora. Unfortunately it&#8217;s already sold. (I hope <a href="http://www.etsy.com/shop.php?user_id=9048">Glamour Bomb</a> doesn&#8217;t mind me borrowing this image).</p><p><a name="graphics2"/><img alt="graphics2" class="fr1" height="152" src="http://ptsefton.com/wp-content/uploads/2008/09/3fbba5f4s121x152.jpg" style="border:0px;" width="121"/></p><hr/><p><div style="font-size: .9em;"><span class="footnote"><p><a href="#ftn1-text" name="ftn1">*</a> I wonder if anyone in the ADFI happens to be a piano teacher in her spare time? (There are a couple of <a href="http://www.emusic.com/search.html?mode=x&amp;QT=%22The+Fascinator%22&amp;x=0&amp;y=0">tracks on eMusic</a> in case you&#8217;re interested (no, I&#8217;m not an eMusic affiliate cos the form was too scary)).</p></span></div></p></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/09/03/put-on-the-fascinator.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>More thoughts on an application to find structure in word processing documents</title>
		<link>http://ptsefton.com/2008/08/26/more-thoughts-on-an-application-to-find-structure-in-word-processing-documents.htm</link>
		<comments>http://ptsefton.com/2008/08/26/more-thoughts-on-an-application-to-find-structure-in-word-processing-documents.htm#comments</comments>
		<pubDate>Tue, 26 Aug 2008 03:24:02 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/08/26/more-thoughts-on-an-application-to-find-structure-in-word-processing-documents.htm</guid>
		<description><![CDATA[Some reflections on how Ian Barnes&apos; Digital Scholar&apos;s Workbench might be enhanced with added structure-sniffing powers]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=186"><!-- &nbsp; --></abbr>
<div><div class='page-toc'></div><div><p>In my <a href="http://ptsefton.com/2008/08/26/a-courseware-authoring-dashboard-using-schematron.htm">last post I said I&#8217;d write more</a> about how Ian Barnes&#8217; <i>Structure Guesser</i> AKA <i>Structure Sniffer</i><i><a href="#ftn1" name="ftn1-text"><span style="vertical-align: super;"><span class="footnote">1</span></span></a></i> might work, and how it might be able to leverage <a href="http://en.wikipedia.org/wiki/Schematron">Schematron</a>.</p><p>The sniffer is part of Ian&#8217;s <a href="http://www.apsr.edu.au/word/">Digital Scholar&#8217;s Workbench</a> concept, where you can upload an unstructured word processing document, and use the workbench to add explicit structure in as automated a way as possible. Explicit structure really helps in converting the document to other formats such as HTML for the web, or structured PDF with a table of contents, and also to preservation formats that might keep the words and other content for posterity without necessarily worrying about exact formatting. Ian has looked at using <a href="http://docbook.org/">DocBook</a> for this, but I reckon HTML might be good enough, and I know others are thinking the same thing<span class="T5"><a href="#ftn0" name="ftn0-text"><span style="vertical-align: super;"><span class="footnote">2</span></span></a></span>.</p><p>Ian&#8217;s looked at the statistical approach to guessing structure used in the <a href="http://pkp.sfu.ca/lemon8">Lemon8-XML</a> project, found that particular implementation wanting and is now thinking about more of a machine learning approach. 
</p><p>I too have been thinking about how this application might work for a while now and I&#8217;m getting increasingly enamored of the idea of using an HTML interface, something like this:</p><ol class="lin" style="list-style: decimal;"><li><p><b>Upload</b> a style-free word processing document to a web site.</p></li><li><p>You see an <b>interactive preview of an HTML version of the document</b>, complete with a full table of contents (so you can see where the sniffer application thought the headings were). </p><p>Interactive? Hover the mouse over a top-level (h1) heading in the preview and see some details about why the machine formatted it that way, such as <span class="spCh spChx201c">“</span>Paragraphs at 18pt (10 instances) and 19pt (1 instance) Helvetica look like Heading 1<span class="spCh spChx201d">”</span>. You&#8217;d be able to correct the machine, either on a case-by-case basis or wholesale. </p><p>Another area where some interaction might be needed would be in disambiguating various kinds of indented text: some indentation might mean <i>block-quote</i>, some might be an example, while other text might just be, you know, indented. We had to add an indent style in addition to the <code><span style="font-style:normal; "><span class="T4">bq1</span></span></code> (block-quote) style to ICE to support this because some authors just, you know, want to indent stuff.</p></li><li><p>Once you were happy with the HTML view of the document, there would be an option to <b>improve your original by adding styles</b> without changing its presentation too much (Did I mention? You too should <a href="http://delicious.com/ptsefton/usestyles">use styles</a>.) or you could just use the rendition and leave the original alone. Either way, the choices you made would constitute feedback to the learning system. 
So even if you don&#8217;t choose to use styles, the next time it sees the same document it will be able to handle it better.</p></li></ol><p>So where does Schematron come into this? Well, leaving aside the (very) hard problem of actually writing the learning system, that system could <b>generate Schematron rules</b>, which could be used to annotate the original document with suggested styles for each paragraph. Having done that, you could then feed the document into the existing ICE HTML formatter, which is style-driven, and it could use the suggested styles to render the document. </p><p>These rules can be hierarchical, meaning that, based on certain cues, different sets of rules might apply. For example, there might be a family of documents which all come from a user who uses Palatino 11pt for the main text, and makes use of an idiosyncratic mixture of formatting and styles <span class="spCh spChx2013">–</span> the learner could derive rules for that situation. I know nothing about this kind of thing; I wonder if it would be like the <a href="http://digitalhistoryhacks.blogspot.com/2008/05/naive-bayesian-in-old-bailey-part-1.html">Na<span class="spCh spChxef">ï</span>ve Bayesian in the Old Bailey</a> where a machine is trained to classify trials.</p><p>Using Schematron rules would mean that they could also be written or tweaked by humans. Returning to the example before, a human could add a rule that if a bit of text is indented relative to the text around it and it contains something that looks like a citation <span class="spCh spChx2013">–</span> which could mean either that it uses something like a Zotero field, or is formatted like a citation with brackets or a footnote <span class="spCh spChx2013">–</span> then it&#8217;s a blockquote.</p><p>This would be a nice modular approach. 
Chances are we&#8217;re going to be looking at Rick Jelliffe&#8217;s in-zip Schematron for use on Open Document Format documents, so the sniffer could piggyback on that<span class="T5"><a href="#ftn2" name="ftn2-text"><span style="vertical-align: super;"><span class="footnote">3</span></span></a></span>.</p><p/><p class="meta-abstract" style="margin-left:0cm; margin-right:1.27cm; text-indent:0cm; "/><hr/><p><div style="font-size: .9em;"><span class="footnote"><a href="#ftn1-text" name="ftn1">1</a> Also known as that by <b>me</b>, at least.</span></div></p><p><div style="font-size: .9em;"><span class="footnote"><a href="#ftn0-text" name="ftn0">2</a> And no, OOXML and ODF are not necessarily the answer for preservation although they are important. I&#8217;ll expand on this in a future post as I think about a presentation for <a href="http://www.open-standards.com/">Open Standards 08</a>.</span></div></p><p><div style="font-size: .9em;"><span class="footnote"><a href="#ftn2-text" name="ftn2">3</a> Actually there is an issue with this: it&#8217;s not that simple to write rules that work on the formatting in an ODT file, cos it uses these automatically defined styles that introduce a layer of indirection. We could consider a pre-processor that <i>remembers</i> these automatic styles between documents. It would also probably need to annotate documents with some kind of weighted score like they use in Lemon8-XML.</span></div></p></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/08/26/more-thoughts-on-an-application-to-find-structure-in-word-processing-documents.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>A courseware authoring dashboard using Schematron</title>
		<link>http://ptsefton.com/2008/08/26/a-courseware-authoring-dashboard-using-schematron.htm</link>
		<comments>http://ptsefton.com/2008/08/26/a-courseware-authoring-dashboard-using-schematron.htm#comments</comments>
		<pubDate>Tue, 26 Aug 2008 00:58:29 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/08/26/a-courseware-authoring-dashboard-using-schematron.htm</guid>
		<description><![CDATA[
As with busses, sometimes you can wait ages for a Schematron and suddenly a whole pack of them come along together*. For those of you who don&#8217;t know: In Markup Languages, Schematron is a rule-based validation language for making assertions about the presence or absence of patterns in XML trees. It is a simple and powerful structural [...]]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=184"><!-- &nbsp; --></abbr>
<div><div class='page-toc'></div><div><p>As with busses, sometimes you can wait ages for a <a href="http://en.wikipedia.org/wiki/Schematron">Schematron</a> and suddenly a whole pack of them come along together<a href="#ftn0" name="ftn0-text"><span style="vertical-align: super;"><span class="footnote">*</span></span></a>.</p><p>For those of you who don&#8217;t know:</p><blockquote class="bq"><p>In Markup Languages, Schematron is a rule-based validation language for making assertions about the presence or absence of patterns in XML trees. It is a simple and powerful structural schema language. It typically uses XPath to describe patterns.</p><p>(Wikipedia contributors 2008)<span class="spCh spChx2060">⁠</span></p></blockquote><p>Instead of the all-or-nothing syntactic approach that you get with other kinds of schemas, Schematron lets you pick and choose things to worry about. So instead of saying <span class="spCh spChx201c">“</span>all course books must begin with a Learning Outcomes section<span class="spCh spChx201d">”</span>, without letting there be any variation, you can write a rule that simply reports on whether there&#8217;s a Learning Outcomes section or not. Why? In some courses it might be important to add something before that section, while I have heard arguments that in some situations specifying learning outcomes upfront scares off potential students.</p><p>We&#8217;ve discussed using Schematron to provide reports on <a href="http://ice.usq.edu.au/">ICE</a> content but have never got around to using it. This week it has resurfaced in a couple of contexts.</p><p>Relevant to <b>ICE as a course-authoring system</b>, the Learning and Teaching Support Group at USQ have a checklist, <i>The USQ course writing guide</i><span style="font-style:normal; "><span class="T3">, which authors can use to see if their courses meet our standards for fleximode courseware. At the moment it&#8217;s a manual process to tick the boxes. 
We met with Michael Sankey from LTSU this week, and it&#8217;s pretty clear that Schematron could play a part in automating lots of the checklist.</span></span></p><p><span style="font-style:normal; "><span class="T3">As part of our ongoing exploration of how we might create an automatic or semi-automatic </span></span><b style="font-style:normal; "><span>system for inferring structure in documents</span></b><span style="font-style:normal; "><span class="T3">, Ian Barnes has pointed out that Schematron might play a role there too. </span></span></p><p>Ian&#8217;s insight was <span style="font-style:normal; "><span class="T3">prompted by a recent post of Rick Jelliffe&#8217;s about a project to add annotations to a corpus of (presumably) Word documents in the OOXML zip package format:</span></span></p><blockquote class="bq"><p>The brief was for an organization with a large number of documents from multiple sources, but with each source supposed to use stylesheets. The idea was to make a rules base that would distinguish all the different ways that a few structures (titles, table of contents, potentially citations, etc) were represented. This would allow classification of documents according to the structures found, the discovery of outliers and exceptions (e.g. 
incorrectly marked up documents, or where additional rules were needed), and automated annotation back to the original documents.</p><p><a href="http://news.oreilly.com/2008/08/a-standardsbased-expert-system.html">http://news.oreilly.com/2008/08/a-standardsbased-expert-system.html</a> </p></blockquote><p>I&#8217;ll come back to Ian&#8217;s structure guesser (or, as I like to call it, the structure sniffer) in another post and talk here about the possibilities for adding validation or <b>dashboard</b> services for courseware written using ICE, via Schematron.</p><p>Rick&#8217;s idea of Schematron rules that can reach inside Zip files would be perfect for the USQ courseware context as our content is in Open Document Format files (actually some of it is Word docs but we convert it to ODF as part of the process). We could translate a lot of the checkboxes in the <i>USQ course writing guide</i> into Schematron rules to do things like check that there is an acknowledgements section in the course introduction. Not only could the system report issues, it could open up the documents in question for you and take you to the trouble spots and insert comments in the documents. </p><p>Not everything needs to be seen as a validation issue though; just some reporting would be useful to create a kind of dashboard for courseware. <span class="spCh spChx201c">“</span>Module 4 contains no activities<span class="spCh spChx201d">”</span> might be a worthwhile thing to report along with word counts for various modules and how many citations there are, etc.</p><p>Another place we could use Schematron to report on course structure would be in the course organizer, which is part of the IMS package manifest file in every ICE course. An organizer is a kind of table of contents for the course, and it is used to generate the navigation. 
Schematron could easily be used to validate things such as <span class="spCh spChx201c">“</span>There must be a <i>Study schedule</i><span class="spCh spChx201d">”</span>, and check things like whether the links to study modules have names that are not just like <span class="spCh spChx201c">“</span>Module 1<span class="spCh spChx201d">”</span> but convey a bit more about what&#8217;s in the module.</p><p>A few years ago Ron Ward and I were involved in a project that used Schematron. There we used it to validate metadata for documents as they were uploaded into a content management system <span class="spCh spChx2013">–</span> Schematron would look for patterns in the metadata and complain when it was wrong. The complaint took the form of an HTML form that the user could fill out to fix the metadata to the Schematron system&#8217;s satisfaction. The Schematron rules worked well to create a true declaratively specified interface, but our implementation was a bit inflexible, like my attitude at the time, so usability suffered. Lesson learnt, I hope.</p><p>I think that <b>presenting this as a dashboard</b> that lets you know what your course is like will be better than presenting it as <i>validation</i>, which has connotations of centralized control, something that doesn&#8217;t always go down well in a university, even when we do have agreed standards to maintain.</p><p>It will be a little while before we get to implementing this; I just wanted to record our current thinking.</p><p/><p class="meta-abstract" style="margin-left:0cm; margin-right:1.27cm; text-indent:0cm; "/><hr/><p><div style="font-size: .9em;"><span class="footnote"><p><a href="#ftn0-text" name="ftn0">*</a>  Although come to think of it I don&#8217;t think I&#8217;ve ever seen two busses in a row in Toowoomba. </p></span></div></p></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/08/26/a-courseware-authoring-dashboard-using-schematron.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>Compound documents in ICE and beyond: referencing parts of things</title>
		<link>http://ptsefton.com/2008/08/20/compound-documents-in-ice-and-beyond-referencing-parts-of-things.htm</link>
		<comments>http://ptsefton.com/2008/08/20/compound-documents-in-ice-and-beyond-referencing-parts-of-things.htm#comments</comments>
		<pubDate>Wed, 20 Aug 2008 03:44:40 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/08/20/compound-documents-in-ice-and-beyond-referencing-parts-of-things.htm</guid>
		<description><![CDATA[Some ICE related comments on a post by Ben O&apos;Steen]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=182"><!-- &nbsp; --></abbr>
<div><div class='page-toc'></div><div><p>Ben O&#8217;Steen has put up <a href="http://oxfordrepo.blogspot.com/2008/08/four-rules-of-web-and-compound.html">some thoughts</a> on what he refers to as &#8216;compound&#8217; documents and how to store them in repositories and allow for referencing of parts of a document, such as a table, a graph or even a paragraph. </p><p>Why did I add the scare quotes to <i>compound</i>? </p><p>While to a computer scientist a research paper with its graphs and tables and paragraphs might be compound, I suspect most authors tend to think of a research article as a single entity. Until we start giving them access to services that make it clear that it&#8217;s not monolithic, that is.</p><p>As background, Ben gives four rules:</p><blockquote class="bq"><p>Note that the four rules of the web (well, of Linked Data technically) are in essence:</p><ul class="lib"><li><p>give everything a name,</p></li><li><p>make that name a URL &#8230;</p></li><li><p>which results in data about that thing,</p></li><li><p>and have it link to other related things. </p></li></ul><p>I strongly believe that applying this to the individual components of a document is a very good and useful thing.</p><p><a href="http://oxfordrepo.blogspot.com/2008/08/four-rules-of-web-and-compound.html">http://oxfordrepo.blogspot.com/2008/08/four-rules-of-web-and-compound.html</a></p></blockquote><p>Agreed.</p><p>He goes on to talk about how repository services will have to have an explicit contract with authors that lets them know that their document is not just going to be presented in one monolithic format, by default the dreaded PDF. </p><blockquote class="bq"><p>One thing first, we have to get over the legal issue of just storing and presenting a bitwise perfect copy of what an author gives us. We need to let author&#8217;s know that we may present alternate versions, based on a user&#8217;s demands. 
This actually needs to be the case for preservation and the repository needs to make it part of their submission policy to allow for format migrations, accessibility requirements and so on.</p></blockquote><p>As we get authors using a system like <a href="http://ice.usq.edu.au/">ICE</a> then this will be:</p><ol class="li-lower-alpha" style="list-style: lower-alpha;"><li><p>Easier for them to understand because they can see multiple formats generated automatically.</p></li><li><p>Easy to implement, by hooking up ICE (or similar) directly to repositories. Just this week Oliver Lucido has ICE putting content straight in to ePrints via OAI-ORE <span class="spCh spChx2013">–</span> that&#8217;s automatically adding an HTML and PDF view. </p></li></ol><p>So far with ICE we have done a number of demo hook-ups to repository software. It&#8217;s now time to turn this on for real <span class="spCh spChx2013">–</span> we will get ICE hooked up to USQ ePrints ASAP. This will mean that all the images in a document will automatically become referenceable. That is, in Ben&#8217;s terms each image will have a name which is a URL.</p><p>Going beyond images, we have already done some work in ICE on making paragraphs referenceable, not in a repository context but in an editorial workflow. For example, this blog post has been created in ICE. Here&#8217;s a screenshot of an earlier version of this very paragraph in the HTML view.</p><p><span style="display: block"><a name="graphics1"/><img alt="graphics1" class="fr1" height="54" src="http://ptsefton.com/wp-content/uploads/2008/08/7d9da2b3s554x54.jpg" style="border:0px;" width="554"/></span>See the blue pilcrow? That&#8217;s the symbol that <a href="http://www.tbray.org/ongoing/When/200x/2004/05/31/PurpleAgain">Tim Bray uses on his blog</a> to make each paragraph referenceable. Go and have a look, you can link to or refer to any part of any post on his site. 
In ICE, however, the plicrow is not for referencing elsewhere, it&#8217;s for commenting. </p><p>See the spelling error? I can annotate the document:</p><p><a name="graphics2"/><img alt="graphics2" class="fr2" height="140" src="http://ptsefton.com/wp-content/uploads/2008/08/m5b3fb20as554x140.jpg" style="border:0px;" width="554"/></p><p>Now, if I fix the paragraph, the comment will disappear from the main body of the text but the old, broken version of the paragraph is kept <span class="spCh spChx2013">–</span> it shows at the bottom of the page until I delete it.</p><p>So, ICE already knows how to identify any paragraph and has some rudimentary version control for document parts<span class="T3"><a href="#ftn0" name="ftn0-text"><span style="vertical-align: super;"><span class="footnote">*</span></span></a></span>, but the context matters. In an authoring context we needed something that was not too sensitive to document order, and it had to work with documents created by word processors, so we can&#8217;t just assign unique IDs to paragraphs the way Tim Bray can in his bespoke workflow. But when it comes to pushing (or pulling) a document into a repository, where there is some expectation that it will not change, there is no reason that we can&#8217;t mint IDs for parts of a document, and figure out a way to make them obviously citable along the lines of Tim&#8217;s purple pilcrows.  </p><p>Coming back to Ben&#8217;s post. Why not make the HTML view the &#8216;normal&#8217; way to look at an article where possible? This would mean that you don&#8217;t have to store a document in fragments, merely label the parts of the HTML. 
I guess I&#8217;m agreeing with Ben&#8217;s tentative suggestion that HTML might be a good format to hang this on:</p><blockquote class="bq"><p>I have yet to settle on basing it on the content XML format inside the OpenDocument format, or on something very lightweight, using HTML elements, which would have a double benefit of being able to be sent directly to a browser to &#8216;recreate&#8217; the document roughly.</p></blockquote><p>Forget &#8216;roughly&#8217;, at least for documents created with an HTML-ready workflow like ICE. It would be even less rough if authors choose something like the <a href="http://ptsefton.com/2008/08/05/another-look-at-the-article-authoring-add-in-for-microsoft-office-word-2007.htm">Article Authoring Add-in for Microsoft Office Word 2007</a>. But Ben&#8217;s right; for documents that are deposited in PDF or in unstructured word processing formats, HTML is going to be rough. </p><p>Just how we might handle the user interface issues for exposing names (URLs) of the parts of a document is unresolved, but we&#8217;ll give it a go here at USQ with our ICE and ePrints systems.</p><hr/><p><div style="font-size: .9em;"><span class="footnote"><p><a href="#ftn0-text" name="ftn0">*</a>  There&#8217;s the current version and then there are obsolete versions. ICE of course has rich version control at the document level courtesy of Subversion.</p></span></div></p></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/08/20/compound-documents-in-ice-and-beyond-referencing-parts-of-things.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>Study shows real-world ODF/OOXML interoperability is not great</title>
		<link>http://ptsefton.com/2008/08/11/study-shows-real-world-odfooxml-interoperability-is-not-great.htm</link>
		<comments>http://ptsefton.com/2008/08/11/study-shows-real-world-odfooxml-interoperability-is-not-great.htm#comments</comments>
		<pubDate>Mon, 11 Aug 2008 00:31:24 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/08/11/study-shows-real-world-odfooxml-interoperability-is-not-great.htm</guid>
		<description><![CDATA[Via Doug Mahugh at Microsoft comes this study (Shah &#38; Kesan 2008) on interoperability of word processing applications using the Open Document Format and Office Open XML.]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=178"><!-- &nbsp; --></abbr>
<div><div class='page-toc'></div><div><p>Via <a href="http://blogs.msdn.com/dmahugh/archive/2008/08/09/links-for-08-09-2008.aspx">Doug Mahugh</a> at Microsoft comes <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1201708">this study</a> (Shah &amp; Kesan 2008)<span class="spCh spChx2060">⁠</span> on interoperability of word processing applications using the Open Document Format and  Office Open XML.</p><p>After outlining some possible approaches to testing conformance of applications against the standards and pointing out what a gargantuan task that would be, they settle on a pragmatic approach: <b>test interoperability with the dominant application for each format</b>.</p><blockquote class="bq"><p>This research tested the interoperability for ODF and OOXML document formats based on a reference implementation approach.  For ODF, the test documents are developed in OpenOffice, which is currently the dominant implementation for ODF.  For OOXML, the test documents are developed in Microsoft Office 2007 for Windows.  These are not reference implementations in a true sense, because they do not perfectly implement the standard.  However, they act as de facto reference implementations, because they are the dominant implementations that all developers seek compatibility with.  </p></blockquote><p>This makes perfect sense for real-world testing. The results are interesting and unsurprising (<a href="http://ptsefton.com/2008/05/13/claims-about-odf-support-are-typically-meaningless.htm">to me, at least</a>). Basically the best interoperability is between Microsoft Office Word and OpenOffice.org Writer <span class="spCh spChx2013">–</span> even when they are reading each other&#8217;s formats. I reckon that would be because the OOo team have invested person-decades of effort in reverse engineering the Word document model, and Writer is more or less able to deal with Word docs. The document serialization format is not that relevant. 
It&#8217;s the document models that count. And some of the applications they test are not really even word processors.</p><p>This paper makes a great case that it is interop that counts and then goes on to show how poor interop really is.</p><p>Unfortunately, this study didn&#8217;t get as far as looking at styles compatibility, as that&#8217;s one area where there are some frustrating problems but also great opportunities to help in interoperability.  If you <a href="http://delicious.com/ptsefton/ptsefton+usestyles">use styles</a> then at least the semantics and structure of documents can be preserved even if page fidelity is not.</p><p>And there&#8217;s a way to <b>improve interoperability</b>. You don&#8217;t have to leave users to their own devices; you can advise them of which features of which applications to use for particular tasks. This is what we try to do on the <a href="http://ice.usq.edu.au/">ICE project</a>. We provide <a href="http://ice.usq.edu.au/instructions/templates/toolbars_and_templates.htm">templates</a> and <a href="http://ice.usq.edu.au/packages/user_guide/default.htm">advice</a> to help people create interoperable documents.</p><p>Inspired by this paper, I&#8217;m off to start work on a paper looking at <b>proactive interoperability</b>, by helping users to pick features that <b>will</b> interoperate. As noted in this study there&#8217;s not much out there to choose from apart from Writer and Word. That&#8217;s why we will continue to work with Writer and Word looking for practical solutions.</p><p/><p class="P3">Shah, R.C. &amp; Kesan, J.P., 2008. Lost in Translation: Interoperability Issues for Open Standards - ODF and OOXML as Examples by Rajiv Shah, Jay Kesan. In <i>The proceedings of the 36th Research Conference on Communication, Information and Internet Policy (TPRC), Arlington, VA Sept. 26-28, 2008</i>. Available at: http://ssrn.com/abstract=1201708 [Accessed August 10, 2008].</p></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/08/11/study-shows-real-world-odfooxml-interoperability-is-not-great.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>Another look at the Article Authoring Add-in for Microsoft Office Word 2007</title>
		<link>http://ptsefton.com/2008/08/05/another-look-at-the-article-authoring-add-in-for-microsoft-office-word-2007.htm</link>
		<comments>http://ptsefton.com/2008/08/05/another-look-at-the-article-authoring-add-in-for-microsoft-office-word-2007.htm#comments</comments>
		<pubDate>Tue, 05 Aug 2008 05:32:51 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/08/05/another-look-at-the-article-authoring-add-in-for-microsoft-office-word-2007.htm</guid>
		<description><![CDATA[The &#8220;Article Authoring Add-in for Microsoft Office Word 2007&#8221; (AAAiMOW) has been turned loose as a release candidate.]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=169"><!-- &nbsp; --></abbr>
<div><div class='page-toc'><ul><li><a href="#id1">Usability?</a></li><li><a href="#id2">Lock in</a></li><li><a href="#id3">Preservation</a></li><li><a href="#id4">An alternative</a></li></ul></div><div><p>The <span class="spCh spChx201c">“</span><a href="http://www.microsoft.com/downloads/details.aspx?familyid=09c55527-0759-4d6d-ae02-51e90131997e">Article Authoring Add-in for Microsoft Office Word 2007</a><span class="spCh spChx201d">”</span> (AAAiMOW<a href="#ftn0" name="ftn0-text"><span style="vertical-align: super;"><span class="footnote">1</span></span></a>) has been turned loose as a release candidate. I <a href="http://ptsefton.com/2008/04/24/some-comments-on-the-nlm-xml-plugin-for-word-2007.htm">looked at an earlier version of this a while ago</a>. </p><p>The name of the thing doesn&#8217;t let on that it is targeting just one version of what an article looks like, in the form of the <a href="http://dtd.nlm.nih.gov/publishing/tag-library/2.2/">NLM schema</a>.  I&#8217;m not sure if that reflects confidence that the NLM schema is generic enough to cope with all articles or anticipates a future version which can support multiple formats.</p><p>I had a lot of questions in my previous post <span class="spCh spChx2013">–</span> most of which I think are not yet answered, although Pablo Fernicola did drop by my blog and shed light on some of the issues.</p><p>This time, with a fresh virtual installation of Windows XP running under VirtualBox on OS X the plugin worked a bit better for me so I could see it in full flight. I still have some serious concerns about this add-in thing and what it might mean for organizations. 
</p><p>I was going to make a few quick comments about usability, preservation and lock-in but this post kept growing,  I emailed Jon Udell for his take and did a few tests, and it&#8217;s ended up well on the way to 2000 words.</p><p>[Update : (Minor edits and fixes)</p><p> I should point out that while this post is quite picky I&#8217;m glad to see this work going on in Microsoft and I&#8217;d love to see how it works out.</p><p>Look, if you&#8217;re not concerned about using an application which is only for Word 2007 on Windows XP or Vista to create articles which you don&#8217;t need to re-use or archive then most of what I&#8217;ve got to say here is irrelevant.]</p><h1><a id="id1" name="id1"><!--id1--></a>Usability?</h1><p>I can&#8217;t find any reports of how this plugin works in real life. Has anyone tried it? Are you all under NDAs?</p><p>I&#8217;m concerned about the way that you can add NLM structural elements all over the place, and nested inside each other in bizarre ways, but then you can&#8217;t save to the new proprietary .nlmx format because of validation errors. </p><p>It would be pretty easy to show how you can create invalid structures using this plugin but I don&#8217;t really think that&#8217;s a useful stunt to pull <span class="spCh spChx2013">–</span> what I want to see is what <b>real</b> problems, or lack of them people have with the structural stuff. </p><p>Me, I found it a bit weird but as I said I didn&#8217;t try to write an article with it. </p><p>There&#8217;s one interface device that I really like. 
Each &#8217;section&#8217; element gets a little handle above it so you can drag the whole thing around:</p><p><span style="display: block"><a name="graphics1"/><img alt="graphics1" class="fr1" height="87" src="http://ptsefton.com/wp-content/uploads/2008/08/m11c0c37s248x871.jpg" style="border:0px;" width="248"/></span>It would be really nice if this applied to the document outline as well, as part of the normal Word interface, not just to the special embedded XML sections. I could just style a bit of content as <code>Heading 2</code>, which is part of the document outline structure, and be able to drag around the whole of that implied section. Word already does something very like this if you use the Outline view. Of course, dragging sections in an NLM document doesn&#8217;t make sense as they&#8217;re supposed to be in a particular order, but I don&#8217;t imagine most people would drag the high-level sections. (There&#8217;s some kind of complex process for dealing with section ordering or editors, I think). </p><p>I&#8217;m not sure if I get why the embedded XML is any better than just recognizing that the text &#8216;Abstract&#8217; in <code>Heading 2</code> style is the start of the Abstract section. Or you could define sub-classes of heading if you really wanted to, such as <code>Heading 2 - Abstract</code>.</p><p>You could still have a toolbar like this so that people can drop in sections where they want them:</p><h1><a id="id2" name="id2"><!--id2--></a><span style="display: block"><a name="graphics2"/><img alt="graphics2" class="fr2" height="120" src="http://ptsefton.com/wp-content/uploads/2008/08/m43505a27s552x1201.jpg" style="border:0px;" width="552"/></span>Lock in</h1><p>This add-in represents a new opportunity for Microsoft to lock users in to Word, having just moved on from the proprietary .doc format. 
This is not just a matter of trying to sell more copies of Microsoft Office; it&#8217;s about encouraging users to create documents that only work with a particular version of Office.</p><p>We have just been through a great long debate about standardizing word processing formats. Microsoft got their way and had their OOXML format accepted as an ISO Standard (<a href="http://www.iso.org/iso/pressrelease.htm?refid=Ref1123">ISO/IEC DIS 29500</a>). The benefit is supposed to be that when you write a word processing document it can be managed and edited in more than one application but I have always been very dubious about how this fits with the way you can embed arbitrary foreign XML in Word documents. By contrast the Open Document Format approach is an <a href="http://blogs.sun.com/GullFOSS/entry/new_extensible_metadata_support_with">RDF based extension mechanism</a> which seems a lot cleaner.</p><p>I tried out some simple interop with an AAAiMOW document. </p><ol class="lin" style="list-style: decimal;"><li><p>Word 2008 on OS X can open it, and you can edit the document at least a little bit, apparently without breaking it, but anything you add doesn&#8217;t have the magic embedded XML. It round tripped without error but I assume you could break some of the XML.</p></li><li><p>NeoOffice Writer on the Mac can open the .docx file and you can edit, but if you save it and re-open in Word then you get an error. The good news was that Word 2007 was apparently able to rescue the content but the bad news was that embedded XML went AWOL.</p></li></ol><p>At the moment I would not have any confidence that anything except Word 2007 can deal with documents created with the add-in, which is as advertised. 
Of course, if that&#8217;s what your team of scientists is using then no problem, provided you think about how you will preserve the outputs (see below).</p><p>That quick interop test was using the new .docx format which is <b>not </b>the same as <a href="http://www.iso.org/iso/pressrelease.htm?refid=Ref1123">ISO/IEC DIS 29500</a>, which won&#8217;t be available as a Word format until the next version. </p><p>One of the features of the AAAiMOW is a new file format. Yes. A new non-standard file format which is a misbegotten mashup of OOXML and NLM. I&#8217;m not sure how this is different from the way the content is embedded in the .docx file.  From the readme file:</p><blockquote class="bq"><p>Both the article contents and metadata authored through the add-in are stored using XML, as part of a single file, using the Open XML format for the content and the NLM tagset for the metadata.  Content which does not have an equivalent in Word, or extends existing Word elements, is stored as custom XML elements within the Open XML data stream.  When a file is saved in the NLM format, the resulting XML file is stored within a nlmx file, using the same Open Packaging Conventions used by docx files, providing a single file which can package all related content (such as images) and supports extensibility.</p></blockquote><p>Meanwhile, the next service pack for Word 2007 will add support for the Open Document Format (ODF) as a native file format. I&#8217;m assuming the plugin won&#8217;t work with ODF. (Pablo, am I wrong?)</p><p>There&#8217;s some very alarming use of the passive voice in the documentation too, a classic computer industry trick. 
Say it <b>can</b> be done without mentioning who&#8217;s going to do it and how much it&#8217;s going to cost.</p><blockquote class="bq"><p>Based on the use of Open Packaging Conventions, the Open XML format, and the NLM tagset, tools can be built to access any part of the file, content or metadata, and extract, validate, or add information to the file, as part of the publishing workflow.</p></blockquote><p><span class="spCh spChx201c">“</span>Can be built?<span class="spCh spChx201d">”</span>  Please. We have one format mixed in to another format using a user interface that is only accessible from an expensive proprietary application. I&#8217;m sure I could write a script to pull the NLM bits out of the Open XML but for each new kind of embedded XML I would have to rewrite my code and test that it works with the user interface code that has been added to Word <span class="spCh spChx2013">–</span> in this case it involves dealing with some special attributes to re-order sections (I think) <span class="spCh spChx2013">–</span> doesn&#8217;t look easy or pleasant to me.</p><p>[Update <span class="spCh spChx2013">–</span> to be clear there is no way that I can see for an author to export the NLM XML format only. I&#8217;m assuming that must be something that happens using a different tool.]</p><p>And it is worth remembering that this plugin is not accessible to the majority even of Windows users. For example here at USQ Word 2007 has not been rolled out yet<a href="#ftn1" name="ftn1-text"><span style="vertical-align: super;"><span class="footnote">2</span></span></a>.  And the plugin is not available at all on platforms other than Windows. That&#8217;s not what I hoped the new standards-wielding Microsoft was on about. </p><h1><a id="id3" name="id3"><!--id3--></a>Preservation</h1><p>There are going to be serious issues with preservation. 
What are archivists supposed to do with bastard mashed-up formats like this which depend on a particular package to make sense of them?</p><p>It is true that for documents that make it to publication in the NLM XML format this should not be an issue: the resulting XML should be perfect for archiving. But I can see that a lot of things that are of value might not make it through to XML. What about archived author&#8217;s manuscripts which are one of the backbones of Open Access? What about the original editable files for images drawn using Microsoft Office tools, which are embedded in the source file? </p><p>Think about what would happen if this approach became common for different XML formats <span class="spCh spChx2013">–</span> there could be a proliferation of non-standard polluted Word documents to deal with in repositories.</p><p>This add-in represents the Microsoft business model in action. See Brian Jones&#8217; response to my probing on the issue of how this bastard mashup stuff is supposed to work. I quoted <a href="http://blogs.msdn.com/brian_jones/archive/2005/07/08/436973.aspx#452483">this</a> last time, but it&#8217;s worth reminding ourselves that this is what Microsoft is about, never mind the standards:</p><blockquote class="bq"><p>There is a huge market that exists today for custom Office solutions. People customize the Office applications in all kinds of ways to try to get more out of their documents. By adding the support for custom defined schemas, we made it much easier to build semi-structured solutions on top of Word. 
Rather than rely on hacks with styles or bookmarks, folks could create a simple schema and add some XML tags into their existing document solutions.</p><p><a href="http://blogs.msdn.com/brian_jones/archive/2005/07/08/436973.aspx#452483">http://blogs.msdn.com/brian_jones/archive/2005/07/08/436973.aspx#452483</a> </p></blockquote><p>Brian Jones calls using styles to carry semantics &#8216;a hack&#8217; and yet embedding foreign XML in a Word document and hand-crafting a user interface to deal with the resulting mishmash of tags is somehow not a hack? I agree that styles and bookmarks (and tables <span class="spCh spChx2013">–</span> we use them a lot) are somewhat limited carriers for microformats but the XML embedding thing has always looked like a trap to me <span class="spCh spChx2013">–</span> too expensive to set up and maintain and too much embedded in the Windows world. As I mentioned above, I think the new extension mechanism for ODF may be a better compromise <span class="spCh spChx2013">–</span> maybe we&#8217;ll see that in the ODF support in the service pack release in 2009.</p><h1><a id="id4" name="id4"><!--id4--></a>An alternative</h1><p>There&#8217;s an alternative approach which is to use features that are common to word processors in general and which are expressed in the underlying file formats directly, which I wrote about in my last post. 
There would be some interesting challenges in finding interoperable ways to embed all the &#8217;special&#8217; items that are allowed <span class="spCh spChx2013">–</span> some of these are already supported in our ICE templates but not quite with the same structural rigor as in the add-in.</p><p/><p><span style="display: block"><a name="graphics3"/><img alt="graphics3" class="fr2" height="73" src="http://ptsefton.com/wp-content/uploads/2008/08/m24c568es552x731.jpg" style="border:0px;" width="552"/></span></p><p>Chris Rusbridge from the  <a href="http://ptsefton.com/2008/04/24/some-comments-on-the-nlm-xml-plugin-for-word-2007.htm#comment-1542">lamented in the comments of my last post</a> that we don&#8217;t do NLM export from <a href="http://ice.usq.edu.au/">ICE</a> <span class="spCh spChx2013">–</span> but I reckon we could produce NLM XML from ICE documents with no more subsequent work required of editors than you would get using the AAAiMOW (I&#8217;m guessing <span class="spCh spChx2013">–</span> we have no data about how well it [the add-in] works and I have yet to work through the section of the documentation for editors).</p><p>The readme tells us that styles don&#8217;t work (emphasis mine):</p><blockquote class="bq"><p>Custom XML elements are used to represent other abstractions that exist in the NLM tagset, but that are not found in Word, and to do so in a manner that can be presented to the author for editing in a robust way <b>(unlike the use of custom styles, which was one of the ways to try to solve this problem in earlier versions of Word, and was not very reliable).</b></p></blockquote><p class="P6">I have no doubt that there are lots of terrible style based systems out there, but we have worked hard on making styles usable, interoperable and easy to apply and providing robust rapid-feedback document conversion. 
</p><p class="P6">(Maybe both of us are wrong <span class="spCh spChx2013">–</span> MJ Suhonos at <a href="http://pkp.sfu.ca/lemon8">PKP thinks that you can create XML</a> without using either styles or embedded XML by using document formatting to infer structure.)</p><p>Would anyone care to fund a small project to see if we can use ICE to produce similar overall results (in terms of overall ROI) to the AAAiMOW but in a cross-platform solution? Microsoft? Anyone in the UK with access to JISC funds? A publisher?</p><hr/><p><div style="font-size: .9em;"><span class="footnote"><p><a href="#ftn0-text" name="ftn0">1</a>  Sounds like a noise emanating from a petulant feline.</p></span></div></p><p><div style="font-size: .9em;"><span class="footnote"><p><a href="#ftn1-text" name="ftn1">2</a>  We&#8217;re bracing for the onslaught on the help desk as hundreds of users have to re-learn commands they&#8217;ve been using for their whole working lives. It seems to us on the <a href="http://ice.usq.edu.au/">ICE</a> team that this is a perfect time to introduce our users to the copy of OpenOffice.org which will be on their computer.</p></span></div></p></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/08/05/another-look-at-the-article-authoring-add-in-for-microsoft-office-word-2007.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>Improving VALET - part 2</title>
		<link>http://ptsefton.com/2008/07/31/improving-valet-part-2.htm</link>
		<comments>http://ptsefton.com/2008/07/31/improving-valet-part-2.htm#comments</comments>
		<pubDate>Thu, 31 Jul 2008 01:14:02 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/07/31/improving-valet-part-2.htm</guid>
		<description><![CDATA[More on VALET camp 2008]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=164"><!-- &nbsp; --></abbr>
<div><div class='page-toc'></div><div><p>This is my second post on the VALET repository deposit tool. Again, if you&#8217;re not a repository aficionado you can probably move on<a href="#ftn0" name="ftn0-text"><span style="vertical-align: super;"><span class="footnote">1</span></span></a>.</p><p>Still here?</p><p>One of the issues we confronted with VALET was to rewrite in Java or not to rewrite in Java?  VALET is written in Perl and quite nicely written in my opinion, apart from the HTML forms which are a big mess of non-valid HTML. There&#8217;s nothing wrong with that as such, but it does have a couple of downsides relative to Java:</p><ol class="lin" style="list-style: decimal;"><li><p>VALET requires a web server to be installed. VITAL used to ship with Apache but it no longer does, so to run VALET you can end up having to compile and install Apache, and obtain some other dependencies. If it were a Java application then you could just drop it in to the same servlet container as you use for VITAL and Fedora.</p></li><li><p>We have heard from some of the, um, younger techies in the ARROW community that Perl is a complete mystery. Others report difficulties in hiring Perl programmers, whereas everyone does Java at uni these days.</p></li></ol><p>On the other hand, there are some reasons not to want to do a port:</p><ol class="lin" style="list-style: decimal;"><li><p>Some of the ARROW contingent have been using Perl since 1934 and can at least tolerate it. I&#8217;d count myself in that group. Fortran anyone?</p></li><li><p>Hacking a Java program is not as simple as using a text editor to change a Perl file, because you need to compile (and worry about stuff like CLASSPATH, ugh).</p></li><li><p>A port will create a huge fork. 
</p></li></ol><p>All these points count for something, but Prashant from University of South Australia has pointed out that using JSP (to which I&#8217;m allergic, like PHP and ASP) gives a much easier entry point for &#8216;casual&#8217; developers and even if it does fork VALET is actually a fairly small application so the investment is not huge and the gain for sites where they want to just consume the software should be worth it.</p><p>In the end the group here at the VALET camp decided that there was enough interest in a Java version that they were going to go for it. Nobody would own up to being a Java expert but four or five confessed to having written production Java code. </p><p>They&#8217;re creating an application as I write this. While they do that Harry, Duncan and David are integrating all the changes that ARROW sites made to VALET and submitted to the Google group. So the Java team will have a moving target as they re-implement the Perl code.</p><p>The Perl version won&#8217;t be going away <span class="spCh spChx2013">–</span> but it looks like at least some sites will move straight over to the Java version once it&#8217;s done.</p><p>So what are the Java team (Tim, Guy, Prashant and Cyrus) doing?</p><p>They&#8217;re starting a VALET compatible clone. The idea is that you should be able to take an existing VALET workflow and data entry forms and with minimal effort, port it to run in the new application. Best case would be no work at all required; the new application will be a drop-in replacement for VALET. We&#8217;ll see if that can be achieved.</p><p>The new app rejoices in the working title of <i>Squire</i>, which is not an acronym; it shows that the developers know how to use a thesaurus. Or is it named for the <a href="http://www2.dpi.qld.gov.au/fishweb/2532.html">fish</a>? 
I reckon they should call it <a href="http://en.wikipedia.org/wiki/Alfred_Pennyworth">Alfred</a> or Pennyworth<a href="#ftn1" name="ftn1-text"><span style="vertical-align: super;"><span class="footnote">2</span></span></a>. Either way, it&#8217;s better than the original working title of <i>Black Hole</i>, which would be like calling your deposit interface <a href="http://digital.library.wisc.edu/1793/22088">Roach Motel</a>. Although at least if you had a repository deposit called Black Hole you could claim very high rates of compression for data. Just don&#8217;t mention decompression.</p><p>The new Java platform will make it easier to do some of the other changes that the community are asking for (we&#8217;re discussing this on the ARROW Google group for those of you in the inner-circle), in some cases because there are more repository-oriented libraries for Java than for Perl but also just because as a community we have more competent Java programmers than Perl programmers these days. </p><p>Here are some enhancements that we will probably do at USQ at some stage <span class="spCh spChx2013">–</span> there are lots of other requirements too which we are not going to forget; these are just the ones that I can speak for at this stage:</p><ol class="lin" style="list-style: decimal;"><li><p>A <a href="http://sourceforge.net/projects/sword-app/#item3rd-1">SWORD </a>deposit so the application can push content to repositories other than Fedora. We&#8217;re going to look at deposit of complex objects over SWORD in the TheOREM-ICE project very soon so this will be a quick add-on.</p></li><li><p>The inevitable ICE interface so that if you submit a styled word processing document to Squire it will generate good quality HTML and PDF renditions automatically. 
We&#8217;re working with Ian Barnes at ANU and <a href="http://ptsefton.com/2008/06/26/a-few-words-on-magic.htm">talking to the PKP people</a> about how we might be able to do a better job of inferring document structure than the standard, breathtakingly abysmal <span class="spCh spChx201c">“</span>Save as HTML<span class="spCh spChx201d">”</span> feature in word processors. Another step in my campaign to stamp out PDF-only Web 0.5 repositories, at least in Queensland.</p></li><li><p>Automatic embedding of metadata and license in the PDF file in XMP format, based on some work which is apparently going on in collaboration between QUT and an Australian Government agency.</p></li><li><p>A lightweight complete open source repository package with Squire for deposit plus <a href="http://ptsefton.com/2008/06/27/tim-mccallum-shows-off-sun-of-fedora.htm">Sun Of Fedora</a> as a portal. Not a lot of features, or complexity, just the basics.</p></li></ol><p/><hr/><p><div style="font-size: .9em;"><span class="footnote"><p><a href="#ftn0-text" name="ftn0">1</a>  If you don&#8217;t want to read about repositories, I recommend Bike Snob NYC. Which prominent fast but not fast enough Australian cyclist was he talking about last week?</p><blockquote class="bq"><p>Firstly, there was Saunier Duval&#8217;s impressive one-two finish, proving once again that there is no &#8220;I&#8221; in &#8220;team.&#8221; (Though there is a &#8220;moi&#8221; in &#8220;chamois.&#8221;) Secondly, ___ ____ (whose collarbones are only intact after yesterday&#8217;s crash because they have both been replaced by titanium) proved he is in fact a great stage racer by taking the Maillot Jaune by one second. 
(Anybody can blast his way up a mountainside in a distateful display of power, but it takes a certain dignified restraint to sidle up behind people and pilfer seconds the way ___ does, like an uninvited party guest nabbing cocktail weiners.)</p><p><a href="http://bikesnobnyc.blogspot.com/2008/07/rest-day-roundup-stealing-seconds-and.html">http://bikesnobnyc.blogspot.com/2008/07/rest-day-roundup-stealing-seconds-and.html</a> </p></blockquote></span></div></p><p><div style="font-size: .9em;"><span class="footnote"><a href="#ftn1-text" name="ftn1">2</a> Bron Chandler points out that there is some potential for <a href="http://en.wikipedia.org/wiki/Recursive_acronym">recursive naming</a>  in the tradition of GNU and HURD. Alfred Pennyworth is sometimes known as Batman&#8217;s batman. What would VALET&#8217;s nemesis be called? Do valets have nemeses? Do nemeses have valets?</span></div></p></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/07/31/improving-valet-part-2.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>Improving VALET - part 1</title>
		<link>http://ptsefton.com/2008/07/30/improving-valet-part-1.htm</link>
		<comments>http://ptsefton.com/2008/07/30/improving-valet-part-1.htm#comments</comments>
		<pubDate>Wed, 30 Jul 2008 06:33:35 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/07/30/improving-valet-part-1.htm</guid>
		<description><![CDATA[Some notes on a workshop on the VALET-camp being held this week at QUT]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=162"><!-- &nbsp; --></abbr>
<div><div class='page-toc'></div><div><p>This week the ARROW community is having a get-together for developers to work on the VALET repository ingest tool. This is probably of little interest if you&#8217;re not a repository person (or rat) but if you are then this may be of interest whether you are associated with the VITAL / Fedora world or not.</p><p>VALET is a deposit tool designed to allow self-deposit of electronic stuff into a <a href="http://fedora-commons.org/">Fedora</a> repository, specifically one running <a href="http://www.vtls.com/vital">VTLS VITAL</a>. The bit about VITAL is crucially important <span class="spCh spChx2013">–</span> Fedora is an underlying storage layer, a kind of database, and different software will use it in different ways. VITAL has some tricks for storing datastreams derived from other assets, such as full-text extracted from PDF, that other software like <a href="http://fez.sourceforge.net/">Fez</a> would not understand.</p><p>VALET comes in two versions. </p><ol class="lin" style="list-style: decimal;"><li><p>There&#8217;s an open source one <span class="spCh spChx201c">“</span>Valet for ETDs<span class="spCh spChx201d">”</span> which is set up initially just to deal with Electronic Theses and Dissertations (ETDs). It&#8217;s available from the VTLS website or from Google Code (last week the one at the VTLS site was out of date, and the package for download from Google Code was slightly less out of date but I think they might be up-to-date now). </p></li><li><p>The other version is mostly the same but is not free. It is important to make the distinction because if you customize the non-free version then you would have to ask VTLS for permission to redistribute it, possibly even within your own institution. 
I am not a lawyer (although I have a 10 year old who is threatening to become one) but I would be very cautious about changing a file that says <span class="spCh spChx201c">“</span>(c) &lt;Some Corporation&gt; All rights reserved<span class="spCh spChx201d">”</span> (Her other potential career is being a computer programmer <span class="spCh spChx2013">–</span> might be a good idea to do both so she can be rich <b>and</b> happy).</p></li></ol><p>So the outcome of the workshop will be to get a version of the open-source VALET with the best of the modifications that people have made at their sites, with maybe some new features.</p><p>One much requested feature for VALET (and for VITAL too) is to be able to edit submissions that have already been approved and pushed through VALET workflow into the repository. It&#8217;s kind-of surprising that VALET doesn&#8217;t do this already but it doesn&#8217;t.</p><p>I had an idea about how this might work last week, and Tim McCallum has implemented the first part of it already. To explain it we have to go into a little bit of detail about how VALET works. VALET takes a very simple approach to workflow, of which I for one approve. In simple terms:</p><ul class="lib"><li><p>An administrator defines a workflow with a set number of steps and says who can approve a submission at each step.</p></li><li><p>An administrator defines a web form, based on the example(s) shipped by VTLS to collect the metadata required for a submission. 
</p></li><li><p>At each stage the software simply serializes the information in the form into XML and saves it on disk.</p></li><li><p>For each new stage the program picks up the information from disk and puts the values back into the form.</p></li><li><p>At the final stage the program runs XSLT stylesheets (supplied by the administrator) to transform the serialized form data into the &#8216;proper&#8217; metadata for the repository.</p></li></ul><p>What Tim has done is simply to create an additional data stream containing the form data along with the other data streams when an item is approved. This means that it will be there alongside the repository item and all the other metadata streams. I think this will be really useful in solving some of the ongoing issues people are having with their repositories. For example, you might want to capture author email addresses but there is no sensible place to put them in a MODS datastream.</p><p>I know, some of you are thinking about standards <span class="spCh spChx2013">–</span> how can I save my important data in a non-standard format? To which I say, better to save your data in a form which is not standard and not pretending to be standard, than to rush into inventing a new standard which only you support. Is there a standard out there that captures all the data you want to save? Then use it. If not, capture the data now and work with the community to define the standard you need. </p><p>I&#8217;m not the only one who had this idea. 
I found out that Vicki Picasso from Newcastle also thought it would be good to capture the VALET form.</p><p>This approach is actually very similar to what you do in ePrints <span class="spCh spChx2013">–</span> you can define any old metadata you want (as long as it&#8217;s flat name-value pairs) and map it to Dublin Core as you see fit for dissemination purposes.</p><p>In VITAL, and in our <a href="http://ptsefton.com/2008/06/27/tim-mccallum-shows-off-sun-of-fedora.htm">Sun Of Fedora</a> repository portal project you can index any XML datastream you like. So if you want to collect HERDC categories  (that&#8217;s to do with reporting research publications to the Australian Government <span class="spCh spChx2013">–</span> very important stuff) then you can, without having to jam them into a metadata schema that was not designed to take them.</p><p>Next steps in the work Tim started:</p><ol class="lin" style="list-style: decimal;"><li><p>Work out how to search for and retrieve an item to be re-edited, putting it back in the workflow.</p></li><li><p>Work out how to create the formdata from existing items that did not get put in the repository. We already have some experience with generating VALET form data based on a very cool idea by Simon McMillan of UNE who can&#8217;t make it to the workshop. Get well Simon!</p></li></ol><p>(I put it to my daughter that she could be a programmer and a lawyer and that would make her rich and happy. She said of course being a lawyer would make her rich and happy. I asked what would being a programmer make her? A nerd, apparently.)</p><p/><p/><p/><p/></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/07/30/improving-valet-part-1.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>More on Buzzword</title>
		<link>http://ptsefton.com/2008/07/24/more-on-buzzword.htm</link>
		<comments>http://ptsefton.com/2008/07/24/more-on-buzzword.htm#comments</comments>
		<pubDate>Thu, 24 Jul 2008 01:12:13 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/07/24/more-on-buzzword.htm</guid>
		<description><![CDATA[Two people have recently reminded me about Adobe&apos;s online word processor, Buzzword. Coincidence? Groundswell of popularity? Probably not as they are married to each other.]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=160"><!-- &nbsp; --></abbr>
<div><div class='page-toc'></div><div><p>Two people have recently reminded me about Adobe&#8217;s online word processor, <a href="https://buzzword.acrobat.com/">Buzzword</a>. Coincidence? Groundswell of popularity? Probably not as they are married to each other.</p><p>Anyway, it has improved a bit since I first <a href="http://ptsefton.com/blog/2007/10/03/10-01-45.649977/">looked at it</a>. At least it has HTML export now (it handles lists wrongly, nesting lists inside lists instead of inside list items, but that&#8217;s a common mistake). Still no styles or headings and I fear that it is trying to get people to lock up their documents in some kind of proprietary Flash and/or PDF format.</p><p>Adobe are asking for feedback so I <a href="http://blogs.adobe.com/acom/2008/06/buzzword_looking_ahead.html">gave some over at the Acrobat.com blogs</a>. </p><p>I think that there&#8217;s an opportunity for Adobe to do what I think Google should have done with Google Docs (used to be Writely). I suggested this:</p><blockquote class="bq"><p>What could be done differently over at Writely so they can reliably import documents and get the lists right, and better still, let people start off in Writely online and produce word processing docs to send out to others?</p><p>The Writely / Google people could design a well thought out, freely available generic word processing template that works more or less equally well in various different word processing environments (hint - you&#8217;ll need some clean-up code to help the poor word processors keep their lists straight). 
</p><p><a href="http://ptsefton.com/blog/2006/03/21/writely,__meet_the_ice_template/">http://ptsefton.com/blog/2006/03/21/writely,__meet_the_ice_template/</a> </p></blockquote><p>I think Buzzword should not only <a href="http://del.icio.us/ptsefton/usestyles">use styles</a>, it should get a well designed set of generic styles as a basis and the Adobe folks should build templates which are Buzzword compatible <span class="spCh spChx2013">–</span> the online service that does this first has the best chance of bridging the gap from the offline to the online world.</p><p>If I create a document in Buzzword why not make the default export to Word use some Adobe-defined styles and give the user a buzzword-like toolbar to play with them, post the doc back to Buzzword etc? In all the online word processors I have tried import and export is appalling and I&#8217;m sure this must slow adoption.</p><p>At the moment all the online word processors are far behind on features that are needed for some documents, you couldn&#8217;t write a thesis in Buzzword (not if you wanted tables of contents and figures and numbering and reference management) but you could draft some stuff in there or collaborate on papers then export into Word, or FrameMaker or something to finish the job. Here a well thought out style set would really help with interop.</p><p>Adobe <span class="spCh spChx2013">–</span> if you want any advice on word processing templates <a href="mailto:pt@ptsefton.com">drop me a line.</a> (Someone from Google did, but the conversation didn&#8217;t go anywhere). The <a href="http://ice.usq.edu.au/">ICE project</a> has some templates you might like to look at.</p><p/><p/><p/></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/07/24/more-on-buzzword.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>Some architectural changes to ICE</title>
		<link>http://ptsefton.com/2008/07/15/some-architectural-changes-to-ice.htm</link>
		<comments>http://ptsefton.com/2008/07/15/some-architectural-changes-to-ice.htm#comments</comments>
		<pubDate>Tue, 15 Jul 2008 06:59:23 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/07/15/some-architectural-changes-to-ice.htm</guid>
		<description><![CDATA[Getting over Subversion issues with some new modes of operation for ICE]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=156"><!-- &nbsp; --></abbr>
<div><span class='pdf-rendition-link'><a href='http://ptsefton.com/wp-content/uploads/2008/07/ice-architecture.pdf'>View as PDF</a></span><div class='page-toc'></div><div><p>This post is a look at some architectural changes we&#8217;re looking at for the <a href="http://ice.usq.edu.au/">ICE </a>system, as we hit the limits of what we could squeeze out of the old architecture. </p><p>Ron Ward has just finished a major rewrite of lots of the application, designed to make it work on a central web server with multiple users, in addition to the &#8216;classic&#8217; mode where everyone has their own ICE server running on their own computer. He&#8217;s spent the last few months trying to get Subversion to do things it was clearly never meant to do. </p><p>ICE uses Subversion as a back-end version controlled data store. In the ICE classic mode multiple users work with checked-out working copies of a repository and hit &#8216;Sync&#8217; to send their changes back to the server and get updates. Behind the Sync button is a fiendishly complicated bit of code that gets updates from the server, detects conflicts, tries to resolve them as gracefully as possible and provide a usable web GUI for the authors. </p><p><p><span style="display: block"><a name="Object1"/><img alt="Object1" class="fr3" height="210" src="http://ptsefton.com/wp-content/uploads/2008/07/785b09d7.gif" style="border:0px;" width="534"/></span>Figure 1: ICE Classic mode: each user has their own ICE application which looks after their working copy, ICE uses the Subversion protocol to synchronize everyone&#8217;s work</p></p><p>Ron&#8217;s big rewrite has lots of unit tests based on all the trouble we&#8217;ve come across (mis)using Subversion for the last couple of years so we&#8217;re happy that it will be robust when running in classic mode.</p><p>But the new server version is a problem. 
If you have multiple users  trying to access the same working copy all at once, then Subversion gets in the way <span class="spCh spChx2013">–</span> it starts locking files all over the place for example. One simple solution is just to put out a server version that doesn&#8217;t allow distributed editing like ICE classic does, but our courseware authors really need the ability to manage large volumes of stuff on their own PCs as some courses are pretty big, with a lot of digital assets, while we want to have web access for reviewers and casual contributors to the same courses via a central web service.</p><p>So we&#8217;re looking at a new server mode where ICE still has a working copy but it knows that it is the only user-agent who has it checked out so it doesn&#8217;t need to do updates, it can just do commits. If all you want is a web based content management system then this will be all you need to install and it should run pretty well.</p><p>If you are following this technobabble then you&#8217;ll be asking <span class="spCh spChx201c">“</span>but how does that help the ICE classic users work when there&#8217;s an ICE server? That would mean that changes made on an ICE client would never make it to the server!<span class="spCh spChx201d">”</span></p><p/><p><p><span style="display: block"><a name="Object2"/><img alt="Object2" class="fr4" height="210" src="http://ptsefton.com/wp-content/uploads/2008/07/m423f1e98.gif" style="border:0px;" width="534"/></span>Figure 2: ICE Server mode: No subversion updates required as it is the only user-agent committing changes to the working copy</p>That&#8217;s the tricky part <span class="spCh spChx2013">–</span> we need to create a new mode of operation for ICE where people want the benefits of the server version AND the classic distributed mode of working. In this mode the ICE application will work in a new &#8216;client&#8217; mode. It will only ever get updates from the central repository. 
Any additions or changes won&#8217;t be fed back to subversion directly <span class="spCh spChx2013">–</span> the ICE client will post them just like any other user into the ICE server. </p><p>This will require some more coding, but probably not as much as it would have taken to get the ICE server working any other way <span class="spCh spChx2013">–</span> and it opens up the possibility that we can replace Subversion and use a simpler version control system, possibly of our own devising in future. So a future model might have the ICE server acting not only as an interface for humans but also for other ICE systems.</p><p><p><span style="display: block"><a name="Object3"/><img alt="Object3" class="fr4" height="335" src="http://ptsefton.com/wp-content/uploads/2008/07/m1c10f8b2.gif" style="border:0px;" width="534"/></span>Figure 3: ICE Client mode: Users can update their local repository but all changes go via the ICE server. We will automate this so it is seamless for users.</p></p><p class="P6">Having made this architectural decision we can press on with testing the ICE server straight away, even without making any changes to the client version. Here&#8217;s the plan which we will roll through over the next few weeks:</p><ol class="lin" style="list-style: decimal;"><li><p>For the repositories which currently allow both server and classic access we turn off the ability for users to commit using ICE classic. 
If people want to check out their own copy of the content they can, as long as they post their changes back in through the server version manually.</p></li><li><p>We modify the ICE server so it now assumes that it has THE working copy and only commits changes <span class="spCh spChx2013">–</span> never updates <span class="spCh spChx2013">–</span> this will mean we can support multiple users with no dramas (that&#8217;s the plan anyway).</p></li><li><p>We will make a new client mode for ICE which automates the process of detecting changes and posting them from the client version of ICE through the &#8216;front door&#8217; of the server version pretty much like any other user. Updates will happen as they do now, from the subversion repository. </p></li></ol><p/></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/07/15/some-architectural-changes-to-ice.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>Tim McCallum shows off Sun of Fedora</title>
		<link>http://ptsefton.com/2008/06/27/tim-mccallum-shows-off-sun-of-fedora.htm</link>
		<comments>http://ptsefton.com/2008/06/27/tim-mccallum-shows-off-sun-of-fedora.htm#comments</comments>
		<pubDate>Fri, 27 Jun 2008 07:12:09 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/06/27/tim-mccallum-shows-off-sun-of-fedora.htm</guid>
		<description><![CDATA[A pointer to a screencast of our new Solr portal for Fedora]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=151"><!-- &nbsp; --></abbr>
<div><div class='page-toc'></div><div><p>Here in the Repository Services group at USQ we have been working on a project funded by <a href="http://www.arrow.edu.au/">ARROW</a> and in partnership with the <a href="http://nla.gov.au/">National Library of Australia</a>. It&#8217;s a bit of repository software originally designed to explore the <a href="http://lucene.apache.org/solr/">Apache Solr </a>search application.</p><p>We looked at Solr last year at USQ, and <a href="http://aanro-repo.blogspot.com/2007/10/solr-demo-is-up.html">I blogged about it as part of a consulting job</a> to compare <a href="http://vtls.com/products/vital">VTLS Vital</a>, <a href="http://espace.library.uq.edu.au/view.php?pid=UQ:11924">Fez</a> and <a href="http://drama.ramp.org.au/">Muradora</a>. Since then, Muradora and Fez have both started using Solr, and there is a plugin for Fedora&#8217;s standard text search package to use Solr. As far as I know VTLS have not announced anything to do with Solr apart from their Visualizer product.</p><p>The goal of the current project is to create a simple interface to Fedora that uses a single technology <span class="spCh spChx2013">–</span> that&#8217;s Solr <span class="spCh spChx2013">–</span> to handle all browsing, searching and security. This contrasts with  solutions that use RDF for browsing by &#8216;collection&#8217;, <a href="http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xacml">XACML</a> for security and a text indexer for fulltext search, and in some cases relational database tables as well. We want to see if taking out some of these layers makes for a fast application which is easy to configure. 
So far so good.</p><p>This is not a replacement for <a href="http://vtls.com/products/vital">VTLS Vital</a>, and is not intended to replace the <a href="http://search.arrow.edu.au/">NLA&#8217;s ARROW Discovery service</a> which is also based on Solr.</p><p>We now have a working demonstration with content pulled from a number of repositories, and are able to show the main things we set out to achieve. Administrators can set up a new portal which shows a subset of the main index with a few clicks, and we have a security model which can restrict access to metadata and data based on group roles.</p><p>I will post some more information about the emerging architecture of the application soon, but for now Tim McCallum has put together a <a href="http://www.youtube.com/watch?v=NLVRjh2af1Y">demo screencast</a>, which had him slaving over a hot video editor over the weekend (forgive any glitches, it&#8217;s his first time). Or <a href="http://rspilot.usq.edu.au:8080/sun-of-fedora/">you can try it out for yourself</a> (Demo URL may not work after October 2008). If you want to log in contact me for a password.</p><p>Thanks to Oliver Lucido who did most of the development, building on work he did for the FRED project last year with David Levy. Tim has also been assisting, with project coordination from Bron Chandler and stake-holding from Neil Dickson at ARROW and Alison Dellit at the NLA.</p><p/></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/06/27/tim-mccallum-shows-off-sun-of-fedora.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>A few words on magic</title>
		<link>http://ptsefton.com/2008/06/26/a-few-words-on-magic.htm</link>
		<comments>http://ptsefton.com/2008/06/26/a-few-words-on-magic.htm#comments</comments>
		<pubDate>Thu, 26 Jun 2008 03:00:13 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/06/26/a-few-words-on-magic.htm</guid>
		<description><![CDATA[MJ Suhonos from PKP has patiently explained where I got some things wrong about Lemon8XML in my previous hasty post.  I&apos;d like to pick up one theme from MJ&apos;s post.]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=149"><!-- &nbsp; --></abbr>
<div><div class='page-toc'></div><div><p>MJ Suhonos from PKP has patiently explained where I got some things wrong about Lemon8XML in <a href="http://ptsefton.com/2008/06/23/lemon8-xml-beta-released.htm">my previous hasty post</a>.</p><p>I&#8217;d like to pick up one theme from MJ&#8217;s post. MJ says (with emphasis by me):</p><blockquote class="bq"><p>The larger problem, of course, is that <b>L8X is encumbered, in a way, by </b><a href="http://cavlec.yarinareth.net/2007/08/30/apologies-and-musings-on-progress/"><b>the common expectation</b></a><b> that it should just &#8220;magically&#8221; work on whatever format the author or user is providing</b> &#8212; it is an application that is designed to solve, in part, an infinitely-unsolvable problem. So, the user has to meet the application halfway.</p></blockquote><p>I agree that this expectation that tools should perform <b>magic</b> is a problem. We see this in the HTML export from word processors; they take arbitrary input and turn it into HTML. In the inevitable absence of magic you typically get sub-standard output.</p><p>I understand the requirement to try to understand the structure of ad hoc documents if you can, but I don&#8217;t think it&#8217;s a good idea to encourage people to keep creating them; if L8X has a version of  <span class="spCh spChx201c">“</span>meet me half way<span class="spCh spChx201d">”</span> which involves direct formatting instead of styles then that will be a step backwards in my opinion. My version of meet me half way would be at least to try to get people to use headings.  If they don&#8217;t then the structure guesser will step in, try to guess and <b>give them their document back to correct</b> when the inevitable errors occur. </p><p>I took a look at the single sample document for L8X on the demo site. It&#8217;s clear that the structure-guesser part of the application is going to have to be very clever to work well. 
It seems, for example, that the goal is to detect captions either before or after a graphic or table even when they have no special formatting. Introducing edge cases like short paragraphs both before and after an image seems to cause it problems, including loss of text, but I could be wrong, again.</p><p>(I&#8217;ve had a look at the document parser code and it is taking into account paragraph length, and doing some reasoning based on text-size and formatting attributes).</p><p>So, even though I had some of the architecture wrong, I <b>still</b> think that Lemon8 XML would be vastly more useful if it had a two part architecture:</p><ol class="lin" style="list-style: decimal;"><li><p><b>Styled word processing document to XML conversion</b>, with the obvious caveat that if you&#8217;re turning a generic format into a domain specific one you&#8217;re going to be producing stuff that doesn&#8217;t use the whole of the target format and may have gaps that need to be filled in.</p><p>Lemon8 XML has its own XML format, but I&#8217;m wondering if it couldn&#8217;t just use ODF which is a well specified standard, with the ability to give the document back to the user. (Checking with MJ via email about this).</p><p>The goal would be to get as many people using this mode as possible because it is the least work for everyone <span class="spCh spChx2013">–</span> no guessing structure required if people can use markup. </p></li><li><p><b>Ad hoc-formatting to styled word processing conversion</b> using the best available heuristics to guess structure and <b>give the document back to the author in an improved form</b>. As far as I can tell that&#8217;s not a goal for the PKP team, but the code is out there so we could do it, using their algorithm. We&#8217;re looking into it.</p></li></ol><p>It is important to help our colleagues who are authoring documents in word processors to<a href="http://del.icio.us/ptsefton/usestyles/"> use styles</a>. It&#8217;s good for them. 
It will improve their working lives. And it will open the door for them to start dealing with real eResearch and the semantic web. A project like the <a href="http://www.jisc.ac.uk/whatwedo/programmes/digitalrepositories2007/theorem-ice.aspx">TheOREM-ICE</a> would be impossible with documents like the L8X sample document.</p></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/06/26/a-few-words-on-magic.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>Lemon8 XML beta released</title>
		<link>http://ptsefton.com/2008/06/23/lemon8-xml-beta-released.htm</link>
		<comments>http://ptsefton.com/2008/06/23/lemon8-xml-beta-released.htm#comments</comments>
		<pubDate>Mon, 23 Jun 2008 03:12:58 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/06/23/lemon8-xml-beta-released.htm</guid>
		<description><![CDATA[Some impressions of the L8X beta release with suggestions for improvement]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=147"><!-- &nbsp; --></abbr>
<div><div class='page-toc'></div><div><p>The PKP people have released a beta of <a href="http://pkp.sfu.ca/lemon8">Lemon8-XML</a>, (L8X) their journal-oriented word processor-driven XML publishing system.</p><p>I tried out the demo server with an <a href="http://ice.usq.edu.au/">ICE</a> test document. </p><p><b>The bad news </b>is that the service had significant problems with my document; it could not locate author metadata, incorrectly identified some ordinary text as being citations, and lost most of the document text, which is obviously a very major issue.</p><p><b>The good news</b> is that MJ Suhonos from PKP was onto me straight away with an email and is keen to work on support for styles in general and ICE styles in particular. (It&#8217;s <a href="http://pkp.sfu.ca/lemon8_faq">in the FAQ</a> that we will collaborate on this).</p><p>If the PKP team can get a decent structure guessing application to work on arbitrary input that would be great, but even better would be to close the loop and give back documents with more structure than you put in. At the ICE project we will help however we can.</p><p>If it was me doing this I would break this problem into two parts:</p><ol class="lin" style="list-style: decimal;"><li><p>Build a converter that can take <b>structured</b> word processing documents and map them to the NLM XML format used by L8X. ICE offers one well worked out structure for generic documents, others may exist for specific formats.</p></li><li><p>Build a structure-guessing application to <b>add structure</b> to word processing documents (something which Ian Barnes has been chipping away at for a while).</p></li></ol><p>With both of these in place you can improve documents in the wild as you go; every time someone submits a draft add styles and give it back to them, rather than trying to guess structure at the end.  
I would like to see this embedded in the OJS journal management system from PKP so that authors get rapid and continual feedback every time they upload a draft. This would allow some editorial and review processes to take place in an HTML interface as well <span class="spCh spChx2013">–</span> rather than via PDF or word processing files.</p><p>If you leave L8X as the final step, authors will have little feedback as to how they can improve the structure of their drafts.</p><p>My two-part plan would make re-ordering sections in L8X redundant <span class="spCh spChx2013">–</span> word processors have outlining tools with which you can reorder content, so why try to do it through an HTML interface?</p><p>On a technical note, last time I looked at L8X I concluded that Docvert is a weak link <span class="spCh spChx2013">–</span> it tries to use XSLT to guess structure; our experience with ICE was that XSLT (version one at least) was not a productive way to do this as the austere functional programming environment in XSLT made the structure-reasoning code very hard to maintain and very slow, so we moved to a more traditional parser written in Python which is much easier for typical programmers to work with.</p></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/06/23/lemon8-xml-beta-released.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>An ICE like ODF based web publishing system</title>
		<link>http://ptsefton.com/2008/06/20/an-ice-like-odf-based-web-publishing-system.htm</link>
		<comments>http://ptsefton.com/2008/06/20/an-ice-like-odf-based-web-publishing-system.htm#comments</comments>
		<pubDate>Fri, 20 Jun 2008 05:55:10 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/06/20/an-ice-like-odf-based-web-publishing-system.htm</guid>
		<description><![CDATA[Sun have put up a basic demo of an ODF driven web publishing system]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=145"><!-- &nbsp; --></abbr>
<div><div class='page-toc'></div><div><p>From Kay Ramme at the GullFOSS blog at Sun comes <a href="http://blogs.sun.com/GullFOSS/entry/odf_www_an_odf_wiki">this demo</a> of a wiki-like system using ODF as a document format and OpenOffice.org as an editor.</p><p>It seems to be using WebDAV to allow users to edit documents on a server, then convert them to HTML automatically when they load the document in a browser.</p><p>Good idea to have the user change a document and automatically render it to HTML on request.</p><p>Same idea, in fact as the <a href="http://ice.usq.edu.au/">ICE system.</a></p><p>Some differences with ICE:</p><ul class="lib"><li><p>ICE doesn&#8217;t use WebDAV because, well, it doesn&#8217;t work with Windows reliably and it doesn&#8217;t work with the Mac too well either. </p></li><li><p>ICE doesn&#8217;t rely on OpenOffice&#8217;s native save as HTML feature which will produce awful results on all but the simplest text documents. A few of several reasons not to use it:</p><ul class="lib"><li><p>It gets list formatting badly wrong.</p></li><li><p>It exports photos at full resolution and puts height and width attributes on them to resize them meaning that you end up shipping megabytes when you should be shipping kilobytes.</p></li><li><p>It is not styles-based so you have no way of configuring it to do things like use preformatted text in the right places.</p></li></ul></li><li><p>ICE is styles-driven which means it produces very clean HTML compared to the rubbish that office suites spit out.</p></li><li><p>ICE uses templates to help people apply styles.</p></li><li><p>ICE can deal with Microsoft Word documents and has cleanup code to correct some of the interop issues with OpenOffice.org.</p></li><li><p>ICE has a version-controlled back end courtesy of Subversion so it can be used by distributed teams.</p></li><li><p>ICE can create IMS content packages for courseware.</p></li><li><p>ICE has an Atom Publishing Protocol button which can send stuff to 
a blog <span class="spCh spChx2013">–</span> and do a much better job of formatting than the Sun Weblog Publisher addin too.</p></li><li><p>ICE has a plugin architecture and a growing number of hooks for integrating other content types like chemistry data.</p></li><li><p>ICE doesn&#8217;t deal with spreadsheets, but we could add that pretty easily.</p></li><li><p>ICE doesn&#8217;t have a mechanism to create new pages by linking to a target that doesn&#8217;t exist <span class="spCh spChx2013">–</span> if we add that we&#8217;ll make it a bit smoother than what&#8217;s shown in the demo.</p></li><li><p>ICE can be used as a conversion service by other systems.</p></li></ul><p>I could go on. </p><p>If you like the demo, check out <a href="http://ice.usq.edu.au/presentations/demos/index.htm">some of ours</a> although I note that we don&#8217;t have a really basic one that shows what Kay shows in hers. We&#8217;ll get on to that.</p></div></div>]]></content:encoded>
			<wfw:commentRss>http://ptsefton.com/2008/06/20/an-ice-like-odf-based-web-publishing-system.htm/feed</wfw:commentRss>
		</item>
		<item>
		<title>Adventures in Geocoding part 2: Embedding data points in documents</title>
		<link>http://ptsefton.com/2008/06/19/adventures-in-geocoding-part-2-embedding-data-points-in-documents.htm</link>
		<comments>http://ptsefton.com/2008/06/19/adventures-in-geocoding-part-2-embedding-data-points-in-documents.htm#comments</comments>
		<pubDate>Thu, 19 Jun 2008 06:18:39 +0000</pubDate>
		<dc:creator>ptsefton</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://ptsefton.com/2008/06/19/adventures-in-geocoding-part-2-embedding-data-points-in-documents.htm</guid>
		<description><![CDATA[Autogenerating maps showing points of interest mentioned in a document]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="http://ptsefton.com/?p=140"><!-- &nbsp; --></abbr>
<div>[update: the map doesn&#8217;t seem to work well in IE - works well for me in Firefox.]  <script type="text/javascript" src="/jquery.js"><!-- --></script><script type="text/javascript" src="/geo.js"><!-- --></script><span class='pdf-rendition-link'><a href='http://ptsefton.