WordDown: Word to HTML5 conversion tool
2011-10-18
HTML5 Case Study
Document details
Author: Peter Sefton
Author’s ID (URI): http://nla.gov.au/nla.party-541658
Document Type: http://purl.org/ontology/bibo/Report
Date: 2011-09-22
Version: 0.1
File Name:
QANotes: This is a first draft – I will post it to my blog, using the tool it describes, to seek feedback.
Rights
This work has been published under a Creative Commons Attribution-ShareAlike 2.0 licence.
Acknowledgements
UKOLN is funded by the Joint Information Systems Committee (JISC) of the Higher and Further Education Funding Councils, as well as by project funding from the JISC and the European Union. UKOLN also receives support from the University of Bath where it is based.
About This Case Study
Target Audience
The main target for this work is tool developers building authoring systems, repositories and publishing infrastructure for academic documents.
It may also be useful to committed academic authors who are already comfortable with HTML and have some technical skills, and who would be able to install a bookmarklet and possibly run Python code on a Windows machine (i.e. not, at this stage, the broad academic community).
What Is Covered
This case study examines ways that academic authors working with word processors such as Microsoft Word, the OpenOffice.org family and Google Docs would be able to produce compliant Scholarly HTML5. Due to time constraints, the tool developed for this project handles Microsoft Word documents only, but the principles outlined here apply more broadly.
Word processors are used very widely in academia for all sorts of document authoring. Yet the articles, essays, theses, course materials and so on that the Higher Education sector produces in huge volumes in Microsoft Word and similar tools are not easy to convert to good quality, clean, semantically rich HTML5 of the type being discussed in the jiscHTML5 project.
This is a critical piece of work for the overall project of bringing scholarship to the (semantic) web. If the tools still being used to create academic content do not create HTML natively then who will do the markup? What new tools are needed if any? This case study will consider these questions, as well as produce some demonstrations of what is possible with the demonstration application, WordDown.
What Isn’t Covered
The questions posed above about tools for creating HTML5 are even more pressing if we are targeting XML for scholarly documents – XML is in its teens now, and there have been no widely available tools produced to create XML for DTDs such as DocBook or TEI that have gained any kind of traction with large user groups. This is an interesting question for the Higher Education sector but it is out of scope for this case study.
Use Case
There are no formal studies that we are aware of that show the usage rates of different academic authoring tools by country or discipline, but it is quite clear that Microsoft Word (along with other word processors) is a very widely used document creation tool in many disciplines at our Higher Education institutions and research organisations. For example, for this study the UKOLN team have requested that case studies be submitted in Word format, using a template supplied by UKOLN. While the template gives the case study documents some structure, using ‘Save as HTML…’ from Word will not produce good quality HTML. Far from the kind of structured documents that are being produced as exemplars in the jiscHTML5 project, Word’s output is focussed on a decade-old approach of trying to match paper formatting.
The use case here is using Microsoft Word in the production of any academically oriented document that’s destined for the web, including articles, theses and other student work, course materials, academically oriented blog posts or other web pages, and reports such as this one.
[TODO: more review of other approaches – Lemon8XML etc]
Solution
The solution to the use case of using Word to create HTML5 is a tool called WordDown, created for this JISC project. This is a JavaScript application which runs in a web browser and processes Word documents into clean HTML5. It takes Word’s HTML output as its input.
The wiki for the jiscHTML5 project at Google Code covers how to run the application and the basics of how to format documents.
Background to WordDown
The Word 2000 HTML format has been a feature of Microsoft’s flagship word processor since Word 2000 and has been the target of much criticism. At the time it was introduced it was capable of rendering almost all features of Word documents into a kind of HTML, using a combination of extended CSS formatting and islands of very obscure non-standard markup. It was, however, actually not very far away from XML, and could be processed into XML with a small transformation program^1^.
The solution presented here revisits the format with modern tools by loading it into a modern web browser and using the jQuery framework to interrogate various aspects of the formatting. Recent versions of web browsers are all coded to deal gracefully with the ‘mutant markup’ in Word’s HTML output, which hides Word-specific code in comments, because HTML5 parsing rules take account of all kinds of legacy issues like Microsoft’s non-standard markup.
The application WordDown is inspired by the success of lightweight wiki-style markup languages which allow users to create HTML (and PDF in some cases) from simple text files. One of the foremost examples of this class of language is the Markdown format, used by the Pandoc processing framework. (Peter Krautzberger has a useful introduction to Markdown and Pandoc for academic authors and explains how it may be considered what we might call ‘the new LaTeX’ for academic authoring.)
In Markdown, one way of making a heading is to preface some text with #.
# Introduction (turns into <h1> in HTML)
## About markdown (turns into <h2>)
In WordDown, to accomplish the same thing the author uses the built-in heading styles – Heading 1 and Heading 2 respectively. These styles are the only widely used standard way of structuring documents in the word processing world – other elements such as quotes or lists have no equivalent standard implementations.
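As a rough illustration of the heading correspondence described above, the following Python sketch maps a built-in heading style name to an HTML heading element. It is not WordDown's code (WordDown is a JavaScript application); the function name `heading_to_html` is hypothetical, and the sketch assumes English style names of the form "Heading N".

```python
import re
from html import escape

# Assumption: style names use Word's English built-in names, "Heading 1", "Heading 2", ...
# HTML only defines h1 to h6, so deeper levels fall back to a plain paragraph here.
HEADING_STYLE = re.compile(r"^Heading ([1-6])$")


def heading_to_html(style_name: str, text: str) -> str:
    """Return an HTML heading for a built-in heading style, else a paragraph."""
    match = HEADING_STYLE.match(style_name.strip())
    tag = f"h{match.group(1)}" if match else "p"
    # Escape the text so characters such as "<" cannot break the markup.
    return f"<{tag}>{escape(text)}</{tag}>"
```

For example, `heading_to_html("Heading 2", "About markdown")` yields an `<h2>` element, mirroring what `## About markdown` produces in Markdown.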
To make a block-quote in Markdown, you use a greater-than character:
> This is a block-quote.
In WordDown, just indent the paragraph, either using the formatting tools on the Word ribbon, toolbar or menus, or by defining a style. At this stage at least, the WordDown processor does not use style names other than for detecting some headings; it uses indenting. So any indented paragraph which is not a list or a heading will be treated as a block-quote.
The algorithm WordDown uses is being documented on the Google Code site, initially via the code. In essence it is designed around the assumption that the user wants to create clean HTML, not to recreate the look of a paper document, so, in a similar way to the lightweight wiki markup languages, it uses formatting and indenting as structural cues. The main device used is to look at the left margin of each paragraph.
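The left-margin idea can be sketched as follows. This is a minimal Python illustration, not the WordDown algorithm itself: it assumes the indent appears as an inline `margin-left` declaration measured in points, and the function names and the `is_heading` and `is_list_item` flags are hypothetical stand-ins for whatever detection the real code performs. Margins given in other units are not handled.

```python
import re

# Matches a CSS margin-left declaration given in points, e.g. "margin-left:36.0pt".
# Assumption: the indent is expressed in points; other units are treated as no indent.
MARGIN_LEFT_PT = re.compile(r"margin-left\s*:\s*(-?\d+(?:\.\d+)?)pt", re.IGNORECASE)


def left_margin_pt(style: str) -> float:
    """Return the left margin in points found in an inline style string, or 0.0."""
    match = MARGIN_LEFT_PT.search(style)
    return float(match.group(1)) if match else 0.0


def classify(style: str, is_heading: bool = False, is_list_item: bool = False) -> str:
    """Classify a paragraph using its left margin as the structural cue."""
    if is_heading:
        return "heading"
    if is_list_item:
        return "list-item"
    # Any indented paragraph that is neither a heading nor a list item
    # is treated as a block-quote.
    return "blockquote" if left_margin_pt(style) > 0 else "paragraph"
```

Under these assumptions, a paragraph styled `margin-left:36.0pt` is classified as a block-quote, while the same indent on a list item leaves it as a list item.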
Features summary
WordDown has the following features, described in more detail on the Google Code site for the software:
- Creates HTML from Word documents saved using “Save as HTML…” on Word for Windows versions from 2000 to 2010. The code runs in a web browser and is packaged both as a bookmarklet and as a small Python web server that the user needs to run from their documents directory.
- Works with Zotero citations and embeds them inline using best-practice Scholarly HTML5 conventions.
- Can create rich semantic HTML5 with embedded microdata, given microformats in the source document.
Demonstration – screenshots
The simplest way to run WordDown is to manually save documents as HTML, load them into a web browser and use the WordDown bookmarklet as documented on the Google Code wiki. A slightly easier workflow (which is harder to set up) is to run the WordDown server:
Figure 1: Browsing local files using the WordDown web server
When the user selects a Word document, the WordDown server runs Microsoft Word in the background, saves the document as HTML, inserts JavaScript into the head and serves the result back to the user’s browser. The result is that the user is presented with an HTML version of the Word document using a stylesheet derived from the one used by the W3C for their standards documents:
Figure 2: A Word document (this one) converted to HTML5 by WordDown running in the browser
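The step of inserting JavaScript into the head can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function name `inject_script` and the script URL are hypothetical, the step of driving Microsoft Word is left out entirely, and the actual WordDown server may do this differently.

```python
import re
from html import escape

# Case-insensitive match for the closing head tag.
HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def inject_script(html: str, script_url: str) -> str:
    """Insert a script element just before </head> in an HTML string."""
    tag = f'<script src="{escape(script_url, quote=True)}"></script>'
    # A function replacement avoids backslash-escape processing of the tag text.
    new_html, count = HEAD_CLOSE.subn(lambda m: tag + m.group(0), html, count=1)
    if count == 0:
        # No closing head tag found: fall back to putting the script first.
        return tag + html
    return new_html
```

A server built along these lines would call `inject_script` on the HTML that Word saved, using a hypothetical URL such as `/worddown.js`, before returning the page to the browser.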
The resulting document is HTML5 – and can be saved by the user for reuse. Alternatively, using another JavaScript application developed for the jiscHTML5 project, parts of the document can be copied and pasted via the Show5ource bookmarklet.
Figure 3: The Show5 bookmarklet identifies the HTML5 sections in a document and lets the user click to see copy and paste-ready source or grab the whole document as a Zip file with all images.
Figure 4: Show5 encodes image data in data URIs so entire web pages can be copied and pasted, for example into a CMS such as WordPress
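Encoding an image as a data URI, as mentioned in the caption above, amounts to base64-encoding the image bytes and prefixing them with the media type. The following Python sketch shows the idea; it is not the Show5 code, and the function name `to_data_uri` is hypothetical.

```python
import base64
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw file bytes as a base64 data URI, guessing the type from the name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        # Unknown extension: fall back to a generic binary media type.
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used directly as the `src` of an `img` element, so the page no longer depends on a separate image file.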
Finally, the tool can create semantically rich documents. The data embedded in the page can be extracted in JSON format by clicking on the {} link.
This data was embedded in the document details table, using a microformat-like technique which is documented on the Google Code wiki.
Demonstration web documents
TODO: More demonstrations.
- Demonstration documents in Word format (as noted above) that can be automatically transformed to HTML5 with embedded document semantics and re-processable citations.
Impact
This work has had no impact so far as it is very new, but it could be important to the uptake of HTML5 in academia if it is picked up by user communities, for example the authors who publish to KnowledgeBlogs, or agencies such as UKOLN involved in publishing a variety of academic materials.
To have a substantial impact there would need to be a driver for people to create HTML5 materials for academia. Except in pockets of activity (e.g. academic blogging) this is not currently the case. One current trend – the move away from paper to ebooks – may finally tip the balance and have academic authors looking for tools that can create the HTML they need as the building block for EPUB and Amazon Kindle ebook publications.
This tool would need work to make it easier to deploy in academic contexts.
Of course, an official HTML and/or EPUB plugin from Microsoft itself working along similar lines could make this work obsolete overnight.
Challenges
The biggest challenge in this project has been the cross-site scripting restrictions in web browsers (the same-origin policy), which prevent code from accessing resources on certain domains. In this case, if the Word document is loaded from the local file system, code in the browser may not access images from the local file system. This means the plugin is prevented from doing any processing on images, such as creating data URIs, or creating a zip file of the entire document with all its images. To get around this, a simple web service was created using Python, repeating a design pattern used in a previous project, the Integrated Content Environment (ICE)^2^, which used a local web server on users’ machines to convert office documents to HTML. At the moment this web server is only suitable for use by technically adept users who can install Python 3 and run it, as well as download the source code, but it could be packaged as a Windows executable given the resources.
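The local web server pattern can be sketched with the Python 3 standard library. This is a minimal illustration and not the WordDown server: the function name `make_local_server` is hypothetical, and the sketch only serves static files from a directory, without converting Word documents. Pages and images served this way share one HTTP origin, so scripts in the page can read the images.

```python
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler


def make_local_server(directory: str, port: int = 0) -> HTTPServer:
    """Create an HTTP server that serves files from the given directory.

    A port of 0 asks the operating system to pick any free port.
    """
    # The directory argument of SimpleHTTPRequestHandler needs Python 3.7 or later.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    # Bind to the loopback address only, so the files are not exposed to the network.
    return HTTPServer(("127.0.0.1", port), handler)
```

A user would create the server for their documents directory, read the chosen port from `server.server_address[1]`, call `server.serve_forever()` and then browse to that port on 127.0.0.1.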
Things Done Differently / Lessons Learnt
TODO: closer to the project end.
Conclusions
TODO after some feedback from UKOLN.
References
2. Sefton, P. The integrated content environment. AUSWEB 2006 (2006). At <http://eprints.usq.edu.au/archive/00000697/01/Sefton_ICE-ausweb06-paper-revised-3.pdf>
- HTML5 itself.
- The jQuery framework.