Scholarly HTML5: experimenting on myself with microdata and Schema.org vocabs

2011-09-12

[Update 2011-09-12 & 2011-09-13 Fixed formatting/encoding issues from using my new Word2HTML converter without proper testing]

Introduction: I’m embedding machine readable data about me and my career in a web page

Things have been very quiet on the Scholarly HTML front, but I have been working away in the background, on a number of things that I will report on here soon.

First up, I have just put up my CV here on this site. New! With added Semantics! (Check out what’s hidden in the page using the Structured Data Linter here). The CV started life as a word processing document but I wanted to publish it in HTML using embedded attributes to make it clear to machines which bit is my name, which is my phone number and so on. To do this I worked out a way to capture the semantics in Word and automated the conversion from Word to HTML (I know, I know I just can’t help it). This process is at an early trial stage, a proof of concept. The resulting HTML document uses microdata, which is a standard part of HTML5. My experimental microdata uses the vocabulary defined by the Schema.org consortium as far as it can. The Schema.org people are search engine vendors so they are interested in being able to find semantically marked up pages to provide better previews in search results, amongst other things which I guess we have to hope are not too evil. If the experiment works, it should help out a bit with my Search Engine Optimisation, but I’m actually more interested in providing rich semantics and metadata in web pages for scholarly resources for other reasons. Like, say:

being able to do away with hand-formatting and re-typing of references and citations forever, via reliable machine-readable citation data and/or URIs embedded in documents,
pulling out all the activities from an online course guide and putting them in one page,
being able to label the rhetorical parts of an article (method, conclusion etc) or
making it easy to mine data from publications.

The CV is a bit ugly at the moment because of the tool-chain I used to make it. The thing is, to be able to capture the semantics needed to identify me, and data about me, such as my address, you need some kind of framework in Word to be able to make sure the data is applied to the text reliably. I tried to do this in a way that didn’t turn the original document into a big tabulated form and left it as a printable document, but both the print and the resulting HTML are a bit compromised in this first attempt. This builds on previous work a bunch of us did at USQ, ANU and Cambridge on embedding data in word processing documents ^1^.

I have tried other approaches than microdata in the past, including microformats and RDFa. My understanding is that, broadly speaking, RDFa and microdata are similar beasts: both are syntactic frameworks, so you source your meaning from a vocabulary, such as the Bibliographic Ontology, and then use the syntax of RDFa or microdata to embed it. Microformats, on the other hand, tend to be a bit of both, mixing up the syntactic framework with the meaning layer. So, in a microformat you might say something like “when an element has a class attribute of myMicroformatBook, it’s about a book”, whereas in RDFa you choose your ontology or vocabulary and use a URI for book (say http://purl.org/ontology/bibo/Book ). In RDFa, once you have chosen a meaningful URI, all that’s left to do is try to remember how you say something ‘is about’ something else. It’s the ‘rel’ attribute, right? Oh no, that’s for relations, so it must be ‘about’. Or is it ‘property’? RDFa is really confusing. More than one way to do it and all of them are complicated.

There is a new site all about this at structureddata.org, which has links to tutorials and tools for playing with embedded semantics. Like me, they recommend Mark Pilgrim’s Dive into HTML5 as a starting point to learn about microdata. And Jeni Tennison has a great series of posts on the ins and outs of microdata and RDFa, which is a resource I’m sure I’ll be drawing on a lot.

What I’m doing here is just a little experiment. I’m not endorsing microdata or Schema.org as the One True Way or anything, OK? I will say, though, that my adventures with microdata have been enjoyable, and while there aren’t many tools to help, it’s been much, much easier than trying to work with RDFa. Microdata has some limits compared to RDFa, but I didn’t find myself reaching for the spec every time I wanted to do anything. It has a few simple bits of syntax to learn, at a level of complexity that feels about right to me. And microdata’s official status of being built in to HTML5 makes it pretty compelling to consider for new applications. We’ll see how it goes with some of the other use-cases I listed above, particularly labelling rhetorical/textual structure.

The tool chain

I will go through the process I used to make the CV.

The goal is to produce markup like this:

Curriculum Vitae – Dr Peter Sefton …

The Microdata itemtype and itemscope attributes say that this section is about a person. A bit further down, we add the person’s name.

Peter Malcolm Sefton

Now this ‘name’ property is not just a string, it is actually a full URI, courtesy of the scope established in the section above making it: http://schema.org/Person/name.
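The two pieces of markup described above can be sketched as a small JavaScript function that builds them as a string. This is only an illustration, not the output of the converter: the function names `escapeHtml` and `personSection` are hypothetical, and the choice of `<h1>` and `<p>` elements is an assumption. Only the `itemscope`, `itemtype` and `itemprop` attributes come from the description above.

```javascript
// Escape the characters that are special in HTML text and attribute values.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a <section> that is "about" a person.
// itemscope starts a new item, itemtype names its type with a URI,
// and itemprop marks a piece of text as one property of that item.
// The <h1> and <p> elements are assumptions made for this sketch.
function personSection(heading, name) {
  return [
    '<section itemscope itemtype="http://schema.org/Person">',
    "  <h1>" + escapeHtml(heading) + "</h1>",
    '  <p itemprop="name">' + escapeHtml(name) + "</p>",
    "</section>",
  ].join("\n");
}

console.log(personSection("Curriculum Vitae – Dr Peter Sefton", "Peter Malcolm Sefton"));
```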

Using Microsoft Word to make microdata

I have been working on a new Word to HTML5 conversion application, written in JavaScript plus jQuery, that runs in a web browser. I will write that up soon and release it. For now, all you need to know to follow along here is that it is set up to run as a bookmarklet. You use Word’s Save as HTML (the Windows version), load the page into a modern web browser and click the bookmarklet to have it changed into HTML5, which you can then save.

Remember, the challenge in this exercise was to be able to create a Word document where it was explicitly stated that the main part of the document was about a person, ie a “http://schema.org/Person”. To be able to do that we need to attach the microdata attributes we’re using to structural elements in HTML.

Microsoft Word and other word processors are not ‘unstructured’ authoring tools, although they are often accused of it. They all have outlines and headings and sections that can be used to create a logically hierarchical document. But they don’t do this in the same way as a typical XML schema, with nested structural elements like <chapter> or <section>. Word processing documents tend to be a flat series of paragraphs, rather than tree-shaped, with the structure carried by properties of those paragraphs. But the structure is still there, and it can still be used to navigate the document, and to cut it into pieces. HTML5 falls between these two approaches to structure, the hierarchical XML way and the flat word-processor way. It has explicit elements such as <article> and <section>, but it also has headings that you can use to scaffold a document without using nested sections. That is, an HTML5 document containing headings, with no section elements, has an outline which can be worked out according to an algorithm which is part of the HTML5 standard. You can explore this with the h5o outline tool. That said, if you want to start creating semantically rich documents in HTML then the implied structure approach is probably not the best way to go, because microdata needs to be attached to some kind of container, so you’ll want to use sections.

Having to have containers is a challenge in Word, because of the way it uses implied structure. There is certainly no way to add a Word section with attributes itemscope and itemtype to a Word document. But one thing we can do is to add some containers on conversion to HTML5 and my tool does that, wrapping headings in <section> elements – if you attach the information you need about the Type to a heading then you are in business. The other kind of container available is the table, more on that below.
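To make the idea of adding containers on conversion concrete, here is a minimal sketch that turns a flat, Word-like run of headings and paragraphs into nested section nodes. It is not the converter described here and it is much simpler than the HTML5 outline algorithm. The plain-object document model and the names `headingLevel` and `nestSections` are assumptions made for illustration.

```javascript
// A flat, Word-like document: a sequence of blocks whose structure is
// carried only by the heading levels (hypothetical sample data).
const flatBlocks = [
  { tag: "h1", text: "Curriculum Vitae" },
  { tag: "p", text: "Contact details" },
  { tag: "h2", text: "Experience" },
  { tag: "p", text: "First job" },
  { tag: "h2", text: "Education" },
  { tag: "p", text: "First degree" },
];

// Return the heading level (1-6) for h1..h6, or 0 for anything else.
function headingLevel(block) {
  const match = /^h([1-6])$/.exec(block.tag);
  return match ? Number(match[1]) : 0;
}

// Wrap each heading, and everything up to the next heading of the same or
// a higher rank, in a section node. Returns a tree rooted at a body node.
function nestSections(blocks) {
  const root = { tag: "body", level: 0, children: [] };
  const stack = [root]; // currently open containers, outermost first
  for (const block of blocks) {
    const level = headingLevel(block);
    if (level > 0) {
      // Close every open section that is as deep as, or deeper than,
      // the new heading. The root has level 0, so it is never closed.
      while (stack[stack.length - 1].level >= level) stack.pop();
      const section = { tag: "section", level: level, children: [] };
      stack[stack.length - 1].children.push(section);
      stack.push(section);
    }
    // The block itself goes into whichever container is innermost now.
    stack[stack.length - 1].children.push(block);
  }
  return root;
}

console.log(JSON.stringify(nestSections(flatBlocks), null, 2));
```

Each heading ends up as the first child of its own section, which gives microdata attributes somewhere to live.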

Having sorted out our containers, how do we say that the stuff inside a container is ‘about’ something using a type? My trial uses hyperlinks for itemtypes. So, on the heading for my section in the Word document I put a link to the Schema.org Person type. I like links here because they are robust and well supported in word processors, including interop with other applications like the OpenOffice family.

The link has been styled to look the same as the rest of the text but in the screenshot above you can see that it’s there. Now, my conversion tool will spot that link to Schema.org and attach the type Person to the nearest container element. Since this is a heading, the nearest container is a section in the HTML5 output, so the tool produces the markup I showed above: `<section itemscope itemtype="http://schema.org/Person">`. This will also work for other containers. For example, for the jobs in my Experience section, I used tables. A job table has a slot for the start and end dates, job title and description, and the first cell has a link to this: http://schema.org/Event/Job . Here I am using the schema.org extension mechanism and subclassing the Event type. Within the new type I didn’t need to reinvent anything; the properties I have used are all part of Schema.org already.
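The link-spotting step can be sketched roughly as follows. This is an assumption-laden illustration rather than the converter’s actual code. The document is modelled as plain objects, a link is stored as a `link` field on the node that contains it, and `attachItemTypes` and `SCHEMA_PREFIX` are hypothetical names.

```javascript
const SCHEMA_PREFIX = "http://schema.org/";

// Walk the tree, remembering the nearest enclosing container (a section
// or a table). When a node carries a link into Schema.org, mark that
// container as an item of the linked type.
function attachItemTypes(node, nearest) {
  const isContainer = node.tag === "section" || node.tag === "table";
  const container = isContainer ? node : nearest;
  if (container && node.link && node.link.startsWith(SCHEMA_PREFIX)) {
    container.attrs = { itemscope: true, itemtype: node.link };
  }
  for (const child of node.children || []) {
    attachItemTypes(child, container);
  }
  return node;
}

// Hypothetical sample: a section whose heading links to Person, holding
// a job table whose first cell links to the extended Job type.
const doc = {
  tag: "body",
  children: [
    {
      tag: "section",
      children: [
        { tag: "h1", text: "Curriculum Vitae", link: "http://schema.org/Person" },
        {
          tag: "table",
          children: [
            {
              tag: "tr",
              children: [{ tag: "td", text: "Job", link: "http://schema.org/Event/Job" }],
            },
          ],
        },
      ],
    },
  ],
};

attachItemTypes(doc, null);
console.log(JSON.stringify(doc.children[0].attrs));
```

Passing the nearest container down the recursion is what lets the same rule serve both headings inside sections and cells inside tables.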

(I know, this has one glaring problem – if I want to refer to Schema.org URIs in my text then the tool will add all sorts of spurious semantics to the page. The way around this will be to use a technique like the one I used for ‘triplinks’, adding a magic parameter to the URI such as http://schema.org/Event/Job?triplink=http:// or I will fix this up later.)

The other technique is to use styles for the properties. For my name, I used a paragraph with the style itemprop-name; the converter catches that and adds the itemprop attribute to the HTML it generates. This works, and it’s easy to build into a template, but the problem is that styles are easy to change by accident, if you hit backspace too often or copy and paste, and it’s very fiddly to fix the document once that’s happened (same goes for microdata in HTML5 source; only very committed people are going to want to type this stuff by hand). The best compromise I can come up with is to lock down the layout of the document a bit using tables. Putting the basic name and address data in a table, and creating a table for each job in the experience section, makes it much less likely that styles will get mixed up. We used this technique on the ICE project at USQ to add little semantic islands to courseware documents; the trick is to use a one or two cell table, which is easy to work with in Word, but to turn it into a <div> or <section> in the HTML later.
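The style-to-property mapping is simple enough to sketch in a few lines. Only the `itemprop-name` style and the resulting `itemprop` attribute come from the text above. The helper names `propertyFromStyle` and `renderParagraph`, and the idea that a paragraph arrives as an object with `style` and `text` fields, are assumptions for illustration.

```javascript
const STYLE_PREFIX = "itemprop-";

// Turn a style name such as "itemprop-name" into the property name
// "name". Return null for styles that do not follow the convention.
function propertyFromStyle(styleName) {
  if (typeof styleName !== "string" || !styleName.startsWith(STYLE_PREFIX)) {
    return null;
  }
  const prop = styleName.slice(STYLE_PREFIX.length);
  return prop.length > 0 ? prop : null;
}

// Render one paragraph, adding itemprop when its style asks for it.
// The text is assumed to be HTML-escaped already.
function renderParagraph(paragraph) {
  const prop = propertyFromStyle(paragraph.style);
  const attr = prop ? ' itemprop="' + prop + '"' : "";
  return "<p" + attr + ">" + paragraph.text + "</p>";
}

console.log(renderParagraph({ style: "itemprop-name", text: "Peter Malcolm Sefton" }));
console.log(renderParagraph({ style: "Normal", text: "An ordinary paragraph" }));
```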

In future, it would be possible to reformat the tables into something else because every piece of text in the table has explicit semantics. Each of the jobs, for example, is completely captured in microdata – if there is a good way to display this in HTML + CSS then re-casting the page should be easy.

And there is an interesting possibility here. Why not just collect the data in Excel or a database application and generate the HTML from that? The more field-structured a document becomes, the more it would make sense to turn it into a series of forms or a spreadsheet, and separate the document view from the data. Resumes might be a class of document where that makes sense, but research articles or theses certainly aren’t, and I want to explore these techniques for documents. You could of course design an XML schema, an approach which has only one minor drawback. Nobody will ever use it except you.

Tools for the web view

Along the way I needed to be able to check what I was doing. When Schema.org launched there was no way of checking Microdata. The schema.org site points you to the Google Rich Snippets Testing Tool, but that turns out not to understand Microdata, at least not for people. So, I put together a little tool to help me extract the data for myself, based on a library from the people on the MicrodataJS project. I made a bookmarklet which can go through a page and find the places where it looks like there is Microdata, and show it to you using JSON. So, if I click my [Show data](javascript:(function()%7bvar%20head=document.getElementsByTagName('head')[0],script=document.createElement('script');script.type='text/javascript';script.src='http://tools.scholarlyhtml.org/showsource/showsource-bookmarklet.js?'%20+%20Math.floor(Math.random()*99999);head.appendChild(script);%7d)();%20void%200) bookmarklet while I’m looking at the CV page, it adds little text buttons {} where it finds “itemprop” attributes. For example, on each job in my experience section I have microdata attached to the tables I have used to lay out each position:

The interface is pretty crude, but it does work; it pops up a text area from which you can copy and paste the data in JSON format.

    {
      "items": [
        {
          "type": "http://schema.org/Event/Job",
          "properties": {
            "startDate": [ "2011-03" ],
            "endDate": [ "2011-08" ],
            "jobTitle": [ "Word processing expert,\n Digital Monograph Technical Landscape study " ]
          }
        }
      ]
    }
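For readers who want to see how data of this shape can be pulled out of a page, here is a minimal sketch of the extraction step. It is not MicrodataJS and not the bookmarklet. It walks a plain-object stand-in for the DOM, where `itemscope`, `itemtype` and `itemprop` are ordinary fields, and it assumes that items are never used as property values. The function names are hypothetical.

```javascript
// Gather the properties that belong to one item. A nested item owns its
// own properties, so the walk does not descend into it.
function collectProperties(node, properties) {
  for (const child of node.children || []) {
    if (child.itemscope) continue;
    if (child.itemprop) {
      properties[child.itemprop] = properties[child.itemprop] || [];
      properties[child.itemprop].push(child.text);
    }
    collectProperties(child, properties);
  }
  return properties;
}

// Find every node that starts an item and record its type and properties.
function extractItems(node, items) {
  if (node.itemscope) {
    items.push({ type: node.itemtype, properties: collectProperties(node, {}) });
  }
  for (const child of node.children || []) {
    extractItems(child, items);
  }
  return items;
}

function toMicrodataJson(root) {
  return { items: extractItems(root, []) };
}

// One job table, using the values from the JSON shown above.
const page = {
  tag: "body",
  children: [
    {
      tag: "table",
      itemscope: true,
      itemtype: "http://schema.org/Event/Job",
      children: [
        { tag: "td", itemprop: "startDate", text: "2011-03" },
        { tag: "td", itemprop: "endDate", text: "2011-08" },
        {
          tag: "td",
          itemprop: "jobTitle",
          text: "Word processing expert,\n Digital Monograph Technical Landscape study ",
        },
      ],
    },
  ],
};

console.log(JSON.stringify(toMicrodataJson(page), null, 2));
```

Every property value is stored in an array, as in the output above, because a property can occur more than once within an item.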

Conclusion and next steps

According to my testing, I now have working Schema.org metadata expressed in microdata. It hasn’t made a difference to how my pages show up in Google, but that’s not the main consideration here. We have never had very much control over what the search engines do with our pages and our data anyway. What we do have some control over is how the systems under our control interact. For the Scholarly HTML project I think Microdata shows promise – it remains to be seen how useful Schema.org is going to be, but it is certainly time to be trying it out.

For the next step, there’s a JISC-funded project ramping up at the moment to look at HTML5, and there will be case studies looking at some of the uses of this in the academy. I think that one very important place to start is with citations/references, so my next bit of work, with others, will look at how to embed data with the citations in a page. One thing I’d like to look at if there’s time is automating the formatting of citations and reference lists using the Citation Style Language (CSL). All the pieces are there:

Zotero now embeds JSON metadata about references when you cite in your word processor.
There is a CSL processor written in JavaScript, with which I did some proof of concept work back in March, over at the Scholarly HTML hackfest in Cambridge organized by Peter Murray-Rust. If you feed it reference data in JSON format it can re-format your document for you with in-text citations and a reference list.
The open bibliography people are building a large scale data store for bibliography, meaning it should become increasingly viable to cite by URI – just link to what you’re citing and a machine can find metadata about it and include it in a reference list.
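To give a feel for what ‘reference data in JSON, formatted by machine’ means, here is a toy sketch. It is a hand-rolled formatter for one fixed style, not the CSL processor mentioned above and not a CSL style. The field names follow CSL-JSON conventions as I understand them, so treat them as an assumption. `formatAuthors` and `formatReference` are hypothetical names, and the data is the reference cited in this post.

```javascript
// Reference data for the paper cited in this post.
const reference = {
  type: "article-journal",
  author: [
    { family: "Sefton", given: "P." },
    { family: "Barnes", given: "I." },
    { family: "Ward", given: "R." },
    { family: "Downing", given: "J." },
  ],
  title: "Embedding Metadata and Other Semantics in Word Processing Documents",
  "container-title": "International Journal of Digital Curation",
  volume: "4",
  issued: { "date-parts": [[2009]] },
};

// "Family, G." for each author, with an ampersand before the last one.
function formatAuthors(authors) {
  const names = authors.map((a) => a.family + ", " + a.given);
  if (names.length < 2) return names.join("");
  return names.slice(0, -1).join(", ") + " & " + names[names.length - 1];
}

// Assemble one reference list entry in a single hard-coded style.
function formatReference(ref) {
  const year = ref.issued["date-parts"][0][0];
  return (
    formatAuthors(ref.author) + " " + ref.title + ". " +
    ref["container-title"] + " " + ref.volume + ", (" + year + ")."
  );
}

console.log(formatReference(reference));
```

A real CSL processor reads the style from a separate style file, which is what makes it possible to re-format the same data for a different journal without touching the document.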

1. Sefton, P., Barnes, I., Ward, R. & Downing, J. Embedding Metadata and Other Semantics in Word Processing Documents. International Journal of Digital Curation 4, (2009).
