A few weeks ago, I felt there was something in the air to do with desktop tools for eResearch. We went with that in the form of The Desktop Fascinator and have been enjoying a productive conversation with others in the, um, what’s the word for the twitter-o-blog-o-sphere?
Dennis Hamilton over at the ODF TC made an interesting point recently about floors and ceilings and interoperability: both are needed. I think one of Peter’s main points is this issue of having an adequate floor: that HTML should be this floor but its support by desktop vendors in their applications continues to be ratty.
The difference is that .005% of all web users gets us Wikipedia. .005% of geneticists gets us a table at T.G.I. Friday’s. My point was that the math breaks down for crowds and science.
Anyway, I now realize that there were some hidden assumptions behind my design decisions in 2000. Some of those assumptions turned out to be wrong, or at least not-completely-right. Sure, a lot of people downloaded DiP, but it still pales in comparison to the number of visitors I got from search traffic. In 2000, I fretted about my “home page” and my “navigation aids.” Nobody cares about any of that anymore, and I have nine years of access logs to prove it.
So, I am writing DiP3 in pure HTML and, modulo some lossless minimizations, publishing exactly what I write. This makes the proofreading feedback cycle faster: instead of “building” the HTML output, I just hit Ctrl-R. I expected it to make some things more complicated, but they turn out not to matter very much.
Why should documents have page numbers? Why should we have two columns per page? Because the publishers force it on us.
Musing on the European Commission’s findings, Tim Bray wondered if maybe XHTML had gotten short shrift:
They considered, and rejected, XHTML as a standard office document format. I think that it can do most things you need in a modern office document and has remarkably few real drawbacks. [ongoing]
I’m not ready to go along with the other conclusion he reaches in that posting — that custom schemas are a red herring. But I agree that XHTML is more valuable than most people think. For the vast majority of useful documents, it can have as much structure as we need, and for the rest it can be extended internally with namespaced inclusions. But the real power arises from its hypertextual nature. For me, increasingly, there is no office, and there is no desktop, there is only a network of linked documents. A successful open document format will have to be supremely well-adapted to that environment, as XHTML is.
All content on this site is licensed under <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">a Creative Commons License</a>
Headings. Documents should definitely have headings. They don’t need nested sections because the headings are enough for a machine to work out the nesting. If a journal/discipline really really must have some fixed structure you could validate that the first heading element contains the text ‘Introduction’ and so on.
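To make the "headings are enough" claim concrete, here is a minimal Python sketch of how a machine might work out the nesting from a flat run of headings. The names `Section`, `HeadingCollector` and `build_outline` are made up for illustration, and it assumes reasonably clean HTML; it is not taken from any existing tool.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Section:
    level: int
    title: str
    children: list = field(default_factory=list)


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1..h6 elements in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def build_outline(html):
    """Turn a flat sequence of headings into a tree of nested sections."""
    collector = HeadingCollector()
    collector.feed(html)
    root = Section(level=0, title="(document)")
    stack = [root]
    for level, title in collector.headings:
        # Close any open sections at the same or a deeper level.
        while stack[-1].level >= level:
            stack.pop()
        section = Section(level, title)
        stack[-1].children.append(section)
        stack.append(section)
    return root
```

A journal that insists on a fixed structure could then check something like `build_outline(page).children[0].title == "Introduction"` rather than demanding nested section markup from authors.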
The page should have a title, and the title should be put at the top of the document in an <h1 class="title"> element. Thereafter the next heading should be a plain-old h1. If naughty users jump from h1 to h3 we’ll just normalize that back to h2.
If you really really need more formal structure then how about putting an empty link in the heading to a web page for something like an NLM section, meaning this ‘h1’ is the same as that NLM element. But do you really need all those elements if the paper is the HTML?
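Normalizing skipped heading levels could be as small as the following Python sketch. It assumes the rule is "a heading may sit at most one level below its nearest shallower heading", which is my reading of the idea rather than a spec; `normalize_levels` is a hypothetical name.

```python
def normalize_levels(levels):
    """Map heading levels to depths so that no level is ever skipped.

    `levels` is the sequence of heading levels as the author wrote them,
    for example [1, 3, 3] for an h1 followed by two h3 elements.
    """
    open_levels = []  # original levels of the headings currently "open"
    normalized = []
    for level in levels:
        # A heading closes every open heading at its own level or deeper.
        while open_levels and open_levels[-1] >= level:
            open_levels.pop()
        open_levels.append(level)
        # The new level is simply how deeply the heading is nested.
        normalized.append(len(open_levels))
    return normalized
```

Tracking the original levels, instead of just clamping each heading to "previous plus one", keeps two sibling h3 elements under an h1 as siblings: both come out as h2.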
Protocols for representing things like examples. Mark Pilgrim says he uses blockquotes, but there should be some class that lets you tell an example of your own from a quotation. We don’t need a standards committee yet, we need some examples.
Metadata, using a linked data approach (which I am going to describe in another post, but which I have touched on before).
What if my local ePrints repository had a page for me-as-author? It might look like http://eprints.usw.edu.au/authors/PeterMacolmSefton and resolve to a page that describes me and the works that are attributed to me, with a note there that says “if you want to indicate that this person is an author of a paper, link their name as it appears on the work to this page”. That is way simpler to implement than the stuff we talked about in our paper on embedding semantics in word processing documents.
The journal itself could provide is-author pages for those who don’t have them yet as they collect data about the authors in their journal websites anyway. We’ll work out a way to relate the is-author pages together later. We have to start somewhere.
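On the reading side, picking up authorship from those links needs very little machinery. Here is a rough Python sketch that assumes author pages all live under one known URL prefix (the example address above); `AUTHOR_PREFIX`, `AuthorLinkFinder` and `find_authors` are hypothetical names, and a real service would need a better way to recognise is-author pages than a prefix match.

```python
from html.parser import HTMLParser

# Assumption: every is-author page lives under this prefix (example URL from the text).
AUTHOR_PREFIX = "http://eprints.usw.edu.au/authors/"


class AuthorLinkFinder(HTMLParser):
    """Collect (name as written, author page URL) for links to author pages."""

    def __init__(self, prefix=AUTHOR_PREFIX):
        super().__init__()
        self.prefix = prefix
        self.authors = []
        self._href = None  # set while we are inside a matching <a> element
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.prefix):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.authors.append(("".join(self._text).strip(), self._href))
            self._href = None


def find_authors(html, prefix=AUTHOR_PREFIX):
    finder = AuthorLinkFinder(prefix)
    finder.feed(html)
    return finder.authors
```

The author's name stays exactly as it appears on the work, and the link target carries the identity.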
Links from terms mentioned in the text to ontologies that describe them. This is going to be kind of silly if all we do is link the same term to the same ontology over and over again, like the discussion about ‘Potter syndrome’ we had here last week, as that string is unlikely to be ambiguous. But it could be useful if you link different strings of text to the same ontology. Like, say, you could link “distinctive lightning shaped scarring on forehead” to Potter syndrome.
Linkable paragraphs – via a mechanism like Tim Bray’s purple pilcrows. These could be used to cite the paper but could also be key to stand-off annotation/discussion services. This does not need to be done by the author – we can make scripts that do that when the page crosses the curation boundary. ICE has cute tricks it uses to provide Ids for pages that are partly based on their content so we can tell when a paragraph has changed, but keep it around if someone has commented on it.
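I have not shown how ICE actually builds its ids, so treat the following Python sketch as one possible scheme and not a description of ICE: hash the normalized text of each paragraph, so the id survives reformatting but changes when the words change. `paragraph_id` and `assign_ids` are hypothetical names.

```python
import hashlib
import re


def paragraph_id(text, length=8):
    """Derive a short, stable id from the content of a paragraph."""
    # Collapse whitespace and case so trivial reformatting keeps the same id.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return "p-" + digest[:length]


def assign_ids(paragraphs):
    """Give every paragraph an id, adding a suffix when two have identical text."""
    seen = {}
    ids = []
    for text in paragraphs:
        base = paragraph_id(text)
        count = seen.get(base, 0)
        seen[base] = count + 1
        ids.append(base if count == 0 else f"{base}-{count}")
    return ids
```

A script at the curation boundary could write these into `id` attributes. Comparing the ids of two versions of a page then tells an annotation service which commented paragraphs have changed, and it can keep the old text around for those.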
Dead simple reference management via links to trusted sources. In many disciplines the manuscript should be able to do what Peter Murray-Rust said and just use a DOI; the publisher can be responsible for fetching the metadata and making a bibliography. How about you let the user choose the referencing style and select it on the fly using Zotero’s growing CSL library? Or skip that altogether and let readers import the references into their own libraries.
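The publisher's first step, finding out what a manuscript cites, might look something like this Python sketch. It assumes authors cite by linking to doi.org (or the older dx.doi.org) addresses and only collects the DOIs; fetching metadata and formatting a bibliography are left out. `DOI_LINK`, `DoiCollector` and `cited_dois` are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Assumption: citations are plain links to a doi.org resolver address.
DOI_LINK = re.compile(r"^https?://(?:dx\.)?doi\.org/(10\.\d{4,9}/\S+)$", re.IGNORECASE)


class DoiCollector(HTMLParser):
    """Collect the DOIs that a page links to, in order of first citation."""

    def __init__(self):
        super().__init__()
        self.dois = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        match = DOI_LINK.match(dict(attrs).get("href") or "")
        # Record each DOI once, however many times it is cited.
        if match and match.group(1) not in self.dois:
            self.dois.append(match.group(1))


def cited_dois(html):
    collector = DoiCollector()
    collector.feed(html)
    return collector.dois
```

The resulting list is the reference list in citation order; everything else (titles, authors, referencing style) can be looked up and rendered downstream.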
Illustrations? I don’t know, that’s one we will have to work through. Maths likewise.
Out of scope for now is how ORE would be used to describe a data-supported paper with the text and all its associated images and data files as an aggregation. This is partly because there are very few end-user tools that would let someone package a document in this way but I think they will start to appear. There’s one for WordPress for example.
Beyond just putting stuff on the web, I think this approach would make it simpler to write lots of the tools that are going to machine-read the semantic-scholarly web; make them all work on Scholarly HTML rather than trying to deal with several different upstream formats.