Australian Digital Futures Institute, University of Southern Queensland

I’m posting this as a kind of extended abstract of my proposed presentation at the Beyond the PDF workshop. I want to demonstrate some of the services we have built in my group, but more importantly I’d like to discuss what researchers and publishers would like to see, find out how they react to some of the things we’ve done and the reasons we’ve done them, and work out how these might help with a post-workshop project. Some of what I will be talking about, like using heading styles in a word processor, seems quite mundane compared with the work that will be presented at the workshop on stuff like text mining and automated article recommendation (Kurtz et al. 2009), but I think it’s important – we need to be able to add metadata to word processing documents in a robust way, for example, and come up with simple ways for people to link to data so that downstream services like journals or repositories can do useful things with the link.
- Bron Chandler
- Daniel de Byl
- Duncan Dickinson
- Pamela Glossop
- Oliver Lucido
- [Linda Octalina UPDATE: 2010-12-14 – Sorry Linda!]
- Greg Pendlebury
- Ron Ward
- Cynthia Wong
- Jason Zejfert
- Managing draft documents and local data sets together.
- Formatting draft documents with as much semantics as possible, using a word processor.
- Pre-publication collaboration with immediate collaborators and via the web using annotations.
- To make all resources part of a repository as soon as they are created or acquired.
- To provide a web view of all resources as early as possible in their production process. A major motivation for this is that the web provides a platform for doing a new kind of research where data are available and linked to publications.
- To provide a hub from which resources can be pushed to other services – journal review processes, blogs, repositories etc (if we go Beyond the PDF then these things might all become part of one repository-journal-blog-thing).
- To make interoperable, reusable services, not monolithic systems, so the stuff I will demo here for formatting documents, annotations, adding semantics is all designed to be used not just in our systems but in others as well.
- Microsoft have produced a Word (2007 plus) Add-in that allows authors to create rich XML to the NLM DTD. This is much more specialised than our tool.
- The PKP team have a product called Lemon8XML which tries to create rich documents without using style information at all.
p-meta-abstract. An author would not need to apply this; it could be built into an article template. The ICE service can extract this metadata and provide it in Dublin Core:
But as I said in a recent post, this use of styles is a bit fragile and hard to manage for anything more complicated than an abstract, so we are working on other ways to embed document semantics. In that post I wrote about how to encode metadata about authorship inline using hyperlinks. My team has done some more work on this now, thanks to Ron Ward and Linda Octalina. To demonstrate on the sample paper this week I:
<dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Proteome-wide analysis of protein-drug interaction network on multiple scales, ... the methodology is ready for other pathogen genomes. </dc:description>
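As a rough illustration of how a paragraph style such as p-meta-abstract could be turned into the Dublin Core element above, here is a minimal Python sketch. It is not the ICE implementation: the `(style, text)` input shape, the `STYLE_TO_DC` mapping and the function name are assumptions made for the example, and only the style name and the Dublin Core namespace come from this post.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical mapping from paragraph style names to Dublin Core elements.
# Only p-meta-abstract is mentioned in the post; a real service may use others.
STYLE_TO_DC = {"p-meta-abstract": "description"}


def styled_paragraphs_to_dc(paragraphs):
    """Turn (style_name, text) pairs into serialised Dublin Core elements."""
    elements = []
    for style, text in paragraphs:
        dc_name = STYLE_TO_DC.get(style)
        if dc_name is None:
            continue  # not a metadata style, so treat it as ordinary body text
        element = ET.Element(f"{{{DC_NS}}}{dc_name}")
        element.text = text.strip()
        elements.append(ET.tostring(element, encoding="unicode"))
    return elements
```

Calling `styled_paragraphs_to_dc([("p-meta-abstract", "Some abstract text.")])` yields a one-item list holding a `dc:description` element; paragraphs in any other style are skipped.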
- Made myself a bookmarklet, AuthorIze. What this does is generate a URI that says someone is an author. (This is an early prototype – it does not work on URLs which contain parameters, at the moment).
- Went to the Thomson Researcher ID site and looked for IDs for the authors. Only the last one, Bourne, seems to have one. I could have just linked his name to his page at Researcher ID but that’s just a link; it does not express authorship.
- I clicked my new bookmarklet, which gave me this machine parseable URL: http://ontologize.me/?tl_p=http://purl.org/dc/terms/creator&triplink=http://purl.org/triplink/v/0.1&tl_o=http://www.researcherid.com/rid/C-2073-2008
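The link format is simple enough that a few lines of code can build and read it. The sketch below is Python rather than the bookmarklet's own JavaScript, and the function names are made up for the example; the parameter names `tl_p`, `triplink` and `tl_o` and the URLs are taken from the bookmarklet output above.

```python
from urllib.parse import parse_qs, urlsplit

TRIPLINK_VERSION = "http://purl.org/triplink/v/0.1"


def make_triplink(predicate, obj, base="http://ontologize.me/"):
    """Build a link in the same shape as the bookmarklet output above."""
    # The URIs are left unescaped to match the example. That is also why an
    # object URL carrying several parameters of its own would be ambiguous.
    return f"{base}?tl_p={predicate}&triplink={TRIPLINK_VERSION}&tl_o={obj}"


def parse_triplink(url):
    """Return (predicate, object) if url looks like a triplink, else None."""
    params = parse_qs(urlsplit(url).query)
    if params.get("triplink") != [TRIPLINK_VERSION]:
        return None  # no triplink marker, so this is just an ordinary link
    try:
        return params["tl_p"][0], params["tl_o"][0]
    except KeyError:
        return None


link = make_triplink(
    "http://purl.org/dc/terms/creator",
    "http://www.researcherid.com/rid/C-2073-2008",
)
print(parse_triplink(link))
# ('http://purl.org/dc/terms/creator', 'http://www.researcherid.com/rid/C-2073-2008')
```

The subject of the statement is not in the URL at all: it is implied by the document that contains the link, which is what keeps the link short enough to paste into a word processor.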
And the ICE service has added RDFa (v 1.1) to the page which means that it should be possible to auto-discover the relationship.
<dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Philip E.
We can do the same trick with document semantics. I have made a similar bookmarklet for inline mentions of document subject matter. The idea is to enable markup similar to that used by the Microsoft Word Ontology Add-in and by a text mining tool called Whatizit, both of which automatically scan a document for terms and match them to formal ontologies. In the version of the MS plugin we looked at a couple of years ago, though, it was simply linking the ontological term to the text without saying what the relationship was; in other words, it imparted exactly as much information as a simple link to the ontological term. I asked on the list for an example in the sample paper where linking a term to an ontology might be useful for disambiguation, and I got this from Tudor Groza explaining that katG means two different things:
<span rel="dc:creator" resource="
<span property="foaf:name" resource="
Philip E. Bourne
On Mon, Dec 6, 2010 at 6:50 PM, Tudor Groza <email-removed> wrote:
> Hi Peter,
> Not sure if this will help … you could look at the term _katG_ on page 4 in the FinalPaper.pdf (8 lines from the end of the page), which can act as a vaccine (http://purl.obolibrary.org/obo/VO_0012369 – DNA vaccines encoding KatG antigen …), or in that particular context as a protein (http://purl.obolibrary.org/obo/PRO_000023043 – A protein that is a translation product of the katG gene or a 1:1 ortholog thereof.)
> Regards,
> Tudor

So in the demo document I marked up the term katG with this link, which says that it is the subject of the document: http://ontologize.me/?tl_p=http://purl.org/dc/terms/subject&triplink=http://purl.org/triplink/v/0.1&tl_o=http://pir.georgetown.edu/cgi-bin/pro/entry_pro?id=PRO_000023043. As with identifying authors, it should be possible to embed lookup widgets in the authoring environment to make this much easier to use. The point of all this is that this information is now embedded in the document, and my desktop repository can extract it:
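To give a feel for what that extraction involves, here is a minimal Python sketch that scans an HTML rendition of a document for links of this kind. It is not the code the desktop repository runs; the class and function names are invented for the example, and it assumes the links survive conversion as ordinary `<a href>` elements.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


class TriplinkExtractor(HTMLParser):
    """Collect (predicate, object, link text) from ontologize.me-style links."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._current = None  # (predicate, object) of the link being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        params = parse_qs(urlsplit(href).query)
        if "triplink" in params and "tl_p" in params and "tl_o" in params:
            self._current = (params["tl_p"][0], params["tl_o"][0])
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            predicate, obj = self._current
            self.triples.append((predicate, obj, "".join(self._text).strip()))
            self._current = None


def extract_triples(html):
    """Return every embedded statement found in an HTML string."""
    parser = TriplinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.triples
```

Run over a page containing the katG link above, `extract_triples` returns one tuple holding the dc:terms subject predicate, the PRO entry URL and the link text "katG". Ordinary links are ignored because they carry no `triplink` parameter.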
Improvement 2: Think of all resources, including data and documents, as being on the web and in a managed repository from birth

I have talked about how our services convert documents to web pages, but it is not only word processing documents that ICE understands – there is an extensible collection of conversion plugins that work with different formats. I’ll have a quick look at a few here. Here’s the data spreadsheet from the sample files, rendered for the web automatically by the Fascinator, which called the ICE conversion service. The ICE conversion service just processes the spreadsheet as a table, but for known kinds of data we could add a plugin that did useful things with the data – in particular we could work out conventions for linking from the paper to the spreadsheet. So the opportunity here would be to host services for scientific workflows in the desktop or team repository environment rather than in a word processing document, cf. (Mesirov 2010). Our conversion services also deal with a wide range of common media types:
- PowerPoint is rendered as a series of images (lots more work to be done on this to break decks into slides and let people re-purpose them and re-publish new slide shows). Here’s one of the sample files rendered for the web.
- Images: the system generates thumbnails – this is highly configurable.
- Video: we have worked on scripts to convert common video formats to web-ready files at multiple sizes, and for the key platforms including HTML 5 and iOS devices. I’ll come back to images and video below in the section on annotations.
- Scientific: we’ve done work on adding visualisations for stuff like Chemical Markup Language, including a project with Peter Murray Rust’s group at Cambridge to handle the packaging and delivery of web-ready chemical theses (Sefton & Downing 2010). There is a demo on the ICE site. The way this works is via a simple link in the original document to the CML – when the ICE service sees the link it inserts the viewer. This idea could be generalised for all sorts of visualisation and workflow tools very cheaply.
- Check out the current supported format conversions at our site.
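The link-triggers-a-viewer convention described for CML is easy to prototype. The Python sketch below shows the general shape under some loud assumptions: the `VIEWERS` registry, the function names and the placeholder `<div>` markup are all hypothetical and are not what the ICE service actually emits, and a regular expression stands in for real HTML processing.

```python
import re


def cml_viewer(href):
    """Return placeholder viewer markup for a CML file (hypothetical markup)."""
    # href is reused exactly as it appeared in the source attribute.
    return f'<div class="cml-viewer" data-src="{href}"></div>'


# Hypothetical registry: file extension -> function that builds viewer markup.
# Supporting another format would mean adding one more entry here.
VIEWERS = {".cml": cml_viewer}

LINK_RE = re.compile(
    r'<a\s[^>]*href="([^"]+)"[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL
)


def insert_viewers(html):
    """Append a viewer after each link whose target has a known extension."""

    def add_viewer(match):
        href = match.group(1)
        for extension, build in VIEWERS.items():
            if href.lower().endswith(extension):
                return match.group(0) + build(href)
        return match.group(0)  # unknown type: leave the link untouched

    return LINK_RE.sub(add_viewer, html)
```

The original link is kept and the viewer is added next to it, so the document still makes sense anywhere the viewer is not available.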
- To make implementing packaging systems in web applications easy, via re-usable interface code.
- To provide a pure-web way to read packages – see our demo. The hope here is that HTML 5 browsers will be able to save packages like this using ‘Save as App’ – giving us a way, at last, to package rich sets of web material, including potentially workflow-engines in a way that can be moved around.
- Tags – both ad hoc and using ontologies.
- Image region annotation
- Video time-span annotation.
- Whatever YOU want: rating systems, peer-review voting, thesis examination, markup of chemical structures. What DO you want?
- Lack of a good way to turn word processing documents into HTML (or XML), and lack of standard style-name sets to assist in that process. Publishers must be spending huge amounts of money on document formatting because of poor-quality input documents.
- Lack of a way to embed useful semantics in documents being authored in non-specialist tools like Word or WordPress.
- Lack of repository-services that reach back upstream to where the users are doing their research and authoring (there are lots of other groups working in this space (Leggott n.d.; Green & Awre 2009; Tennant 2009)).
- Lack of software components that are easy for developers to use to build annotation services – to build the range of annotation services in Anotar we had to hunt around for software libraries and create a framework.
- Lack of ways to package multiple resources into a single file that can be moved around as easily as a PDF.
- How important is interop? I think the answer is very. This includes making sure that user-communities can choose their own tools, and within those communities, can work together. For the stuff that I’m interested in this means supporting at least Microsoft Word, OpenOffice.org and probably Google Docs for word processor-based authoring, and for dissemination, making sure we can deposit to Eprints, DSpace, Fedora et al, and interact with Drupal, SharePoint, WordPress and the gang.
- What to package where? This is a complex trade off between interoperability, usability, future proofing, cost to develop, cost to deploy and risk.
- Word processor files as containers.
- Zip packages (Epub, IMS packages, Plain old ZIP with manifests such as Bagit, OAI-ORE, METS)
- “PDF plus” – jamming extra semantics into PDF.
- Web-native conventions such as HTML 5 apps.
- Can we come up with some best-practice guidelines for tool developers to decide where to build their tool? That is, does a particular tool make sense as a web service, a word processor plugin, a new PDF viewer or all of the above?
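To make the "plain old ZIP with a manifest" option above a little more concrete, here is a minimal Python sketch using only the standard library. The layout (payload files under `data/`, one checksum line per file in `manifest-sha256.txt`) is loosely modelled on BagIt but is an illustration only, not a conforming bag, and the function names are invented for the example.

```python
import hashlib
import io
import zipfile


def build_package(files):
    """Return the bytes of a zip holding the files plus a checksum manifest.

    `files` maps relative paths to bytes. Payload goes under data/ and the
    manifest has one "checksum  path" line per file (BagIt-like, not BagIt).
    """
    buffer = io.BytesIO()
    manifest_lines = []
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for path in sorted(files):
            payload_path = f"data/{path}"
            package.writestr(payload_path, files[path])
            digest = hashlib.sha256(files[path]).hexdigest()
            manifest_lines.append(f"{digest}  {payload_path}")
        package.writestr("manifest-sha256.txt", "\n".join(manifest_lines) + "\n")
    return buffer.getvalue()


def verify_package(package_bytes):
    """Check every manifest entry against the packaged file contents."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        manifest = package.read("manifest-sha256.txt").decode("utf-8")
        for line in manifest.splitlines():
            digest, path = line.split("  ", 1)
            if hashlib.sha256(package.read(path)).hexdigest() != digest:
                return False
    return True
```

A paper and its spreadsheet could be bundled with `build_package({"paper.html": b"...", "data.csv": b"..."})` and checked on arrival with `verify_package`. The trade-off raised above still applies: a package like this is easy to move around and verify, but nothing opens it directly the way a PDF viewer opens a PDF.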
Dickinson, D. & Sefton, P., 2009. Creating an eResearch desktop for the Humanities. In eResearch Australasia 2009. Sydney. Available at: http://eprints.usq.edu.au/6090/ [Accessed December 9, 2009].
Green, R. & Awre, C., 2009. Towards a Repository-enabled Scholar’s Workbench. D-Lib Magazine, 15(5/6). Available at: http://www.dlib.org/dlib/may09/green/05green.html [Accessed June 25, 2009].
Hunter, J. et al., 2010. The Open Annotation Collaboration: A Data Model to Support Sharing and Interoperability of Scholarly Annotations. In Digital Humanities 2010. pp. 175-177. Available at: http://dh2010.cch.kcl.ac.uk/academic-programme/abstracts/papers/pdf/book-final.pdf#page=201.
Kurtz, M.J. et al., 2009. Using Multipartite Graphs for Recommendation and Discovery. 0912.5235. Available at: http://arxiv.org/abs/0912.5235 [Accessed December 8, 2010].
Leggott, M.A., n.d. Islandora: a Drupal/Fedora Repository System. Available at: http://smartech.gatech.edu/handle/1853/28495 [Accessed November 30, 2010].
Mesirov, J.P., 2010. Accessible reproducible research. Science, 327(5964), p.415.
Sefton, P. et al., 2009. Embedding Metadata and Other Semantics in Word Processing Documents. International Journal of Digital Curation, 4(2). Available at: http://www.ijdc.net/index.php/ijdc/article/view/121 [Accessed October 22, 2009].
Sefton, P. & Downing, J., 2010. ICE-Theorem – End to end semantically aware eResearch infrastructure for theses. Journal of Digital Information, 11(1). Available at: http://journals.tdl.org/jodi/article/viewArticle/754 [Accessed March 24, 2010].
Tennant, R., 2009. Rochester Releases Their “IR+” Repository Platform « Tennant: Digital Libraries. Available at: http://blog.libraryjournal.com/tennantdigitallibraries/2009/12/16/rochester-releases-their-ir-repository-platform/ [Accessed November 30, 2010].
Copyright Peter Sefton, 2010. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>