Category Archives: jiscPUB

New Avatars of the Book in Digital Culture

Why are we here?

This week the University of Western Sydney held a symposium, New Avatars of the Book in Digital Culture. I was invited to contribute to the event, in my capacity as eResearch manager. One member of my team thought it meant I was going to get to slap on the blue body paint, but sadly it wasn’t that kind of Avatar.

This symposium focuses on the changing nature and status of those peculiarly useful interactive objects we call ‘books’ in online contexts. In contrast to web-pages, files, ‘sites’ and ‘contexts’ of reading, the book still presents a useful model of rich ‘containment’ and productive constraint.

But are books possible in digital form? What elements of the book survive online? Which ones are transformed and in what ways? The end of the book has long been prophesied, but how do we replicate the particular functionality (such as searching without knowing what you’re looking for), and forms of knowledge particular to books in online environments? What forms of interface can be envisioned bearing in mind the characteristic feedback loops between (especially) literary reading and writing? Distinguishing between the form of the book and its functionality, this symposium will explore possibilities for replicating the book’s useful functions in online environments.

(Anna Gibbs and Maria Angel, from the invitation to the event)

In this context I didn’t want to present too formally, but prepared some notes for the discussion after the keynote talks. This post contains my pre-event notes/slides and a few observations from the event.

The keynotes

The keynotes were both really engaging discussions of digital culture (not so much the book).

Mitchell Whitelaw, from the University of Canberra, talked about building “generous interfaces” to online databases. The idea is to enable a browse-based interface that attempts to show as many entry points to a collection as possible. Mitchell’s visual tools let you see the shape of large archival collections and explore them using both visual cues and metadata facets such as dates or creators. Dan Cohen recently asked how this might work for university web sites; I can’t see how, but it’s an interesting design challenge (hint for uni web teams: the big images that take up a third or more of the screen and change all the time don’t improve the utility of the site).

Mitchell talked a bit in the discussion about how a ‘generous’ book interface might look: showing the ‘weight’ of items in the table of contents, giving a sense of the shape of a work, a potential site for further research.

Jason Nelson, from Griffith University, talked about his practice as a digital poet, which didn’t involve any pretentious reading out loud, but did involve an array of amusing, rude and occasionally moving online stuff, which I won’t attempt to describe or critique.

As Jason demoed various Flash and HTML stuff he’d made, I was thinking about all the problems he’d create for archivists, and indeed he does: he told us how he’d had to deposit screen-capture walk-throughs into the Griffith repository as proxies for the actual games and interactive poem-things he creates[1]. He’s a textbook example of the difficult-to-archive new media artist. See game game game and again game, which apparently may ruin your life.

I was glad I didn’t have to follow Jason directly with my rather more sedate stuff about techno-socio-political considerations, standards and such. But at least my cut-price embedded slide-shows don’t look too shabby compared to his low-tech, low-culture hand-drawn stuff.

The readings

There were two readings for this symposium. One is a long essay, Graphesis: Visual Knowledge Production and Representation by Johanna Drucker (2010). It’s a well-referenced survey of different approaches to “critical understanding of visual knowledge production”, with only passing reference to books. I thought this was a really useful way of mapping the jagged edges of a huge theoretical hole, but it doesn’t provide any answers about how we might explore bookishness, or redefine the book.

The second reading is a piece from Alan Liu (2009) which, I think, assumes that the end of the book is obvious, apparently because we can put books online and search and tag them and dismember them in an online environment. He cites three scholarly web applications to demonstrate that books are dead, but I don’t really buy it. One of them is the Open Journal Systems software, which is not about books anyway; it’s a workflow system for managing journal production. Last time I checked it was mostly used by editors to produce journals consisting of PDF files: on the web but not of the web.

The end of the sign?

We don’t herald the “end of the sign” just because people have annotated a sign. And remember, books have always been ‘hypertextual’ via referencing, quoting and wholesale stealing.

The end of the book?

Liu says, in reference to some other symposium:

As suggested by the title of this symposium (Bookishness: The New Fate of Reading in the Digital Age), the best way to think about the book in the digital age may well be to focus on bookishness. From the point of view of the digital, the book has already gone away. So the remaining question is “what happens to bookishness?” Or, again, “where does bookishness go?”

I don’t know where bookishness goes, or what happens to it.

From the point of view of this eResearch person books have not gone away. And from the point of view of the digital (who knew the digital had a point of view?) books have most emphatically not gone away and do have digital analogues:

Rumours of demise of book greatly exaggerated

That’s actually quite a lot of books, with no apparent existential crisis. Not so many magazines; they’re not making the transition to eReaders and tablets as happily as books. I think Craig Mod’s post on ‘subcompact publishing’ is worth reading on this issue; he looks at new models for magazine-like publications which might work better than the current approach of putting entire magazines into tablet apps complete with page layout and ads.

It goes without saying that an ebook is different in many ways from a paper book, but nobody in the mainstream is having any trouble with the concept of delivering or buying books digitally (though they may be less relaxed when they finally realise that they’ve been renting the books, not building the family library). Scholars can have all the fun they like debating the demise of the form, and speculating about function, but the book is obviously very much alive.

Craig Mod’s subcompact recipe

I propose Subcompact Publishing tools and editorial ethos begin (but not end) with the following qualities:

  • Small issue sizes (3-7 articles / issue)

  • Small file sizes

  • Digital-aware subscription prices

  • Fluid publishing schedule

  • Scroll (don’t paginate)

  • Clear navigation

  • HTML(ish) based

  • Touching the open web

I think Mod’s list of qualities is at least a good checklist for any bookish project to consider.

The book is not the only bounded object to make the transition from the physical to the virtual.

Take today’s Wikipedia entry on album:

In musical usage the word was used for collections of short pieces of printed music from the early nineteenth century.[2] Later, collections of related 78rpm records were bundled in book-like albums.[3] When long-playing records were introduced, a collection of pieces on a single record was called an album; the word was extended to other recording media such as compact disc, MiniDisc, compact audio cassette, and digital or MP3 albums, as they were introduced.[4]

(Anon. 2012)

And the ‘album’ is still alive even in all-digital online distribution. If the recorded-music album can survive, and it obviously has, then the much older book is in no danger. In a nice bit of back-formation, the scholars at Amazon have made the link from the book-omnibus or magazine to the record-album and come up with a way of describing long essays that they presumably feel a bit coy about marketing as eBooks.

Singles:

I should point out that this workshop was not about the end of the book. In fact, Maria Angel asked in the introduction: how might the constraints and bounds of the book create an interesting space for innovation?

Interesting things about C21 books

Mapping the enlightenment book trade

The book trade no longer looks like this:

Figure 1 Searching for sales destinations for an author

(From a project led by Professor Simon Burrows of Leeds, who is joining UWS soon)

A map

Figure 2 The search returns a map

Potential research project: What does a map of the modern book/eBook trade look like?

How to visualise this?

  • Vertical sales channels*

  • Software distribution across channels and device types.**

  • Geographic distribution of channels

  • DRM: Digital Rights (Restrictions) Management.

  • Copyright territories

  • Geographic distribution of DMCA-type laws

  • The naughty-net DRM-free distribution system.

  • * There is some ‘leakage’ between channels, eg Kindle apps on Apple iOS, but there are complicated dynamics at play, such as no in-app purchasing on Kindle.

  • **Globally, the PC is still the most popular eBook device, but eReaders win in the US and UK (according to this summary of a 2012 report).

    Useful functions of the book?

     Persistence, archive-ability

    Figure 3 Don’t let your scholarly works/tools end up all, like, 404

    The above screenshot is one of the sights I saw when looking for PreE, the post-book research environment mentioned by Liu. All the researchers at our symposium should be worrying about this. How are you going to preserve all your works, tools and experiments?

    Preservation?

    When I raised preservation as an issue, Jason Nelson said (more or less):

    Don’t obsess about preserving everything. Build things that people love – and they’ll work to preserve.

    Good point – but don’t lose sight of the value of a professional portfolio (and not everyone’s a rock-star internet poet like Jason :).

    eBooks are little web sites

    • HTML has more or less ‘won’ as the basis for most eBook formats.

    • EPUB is the standard. A book is essentially a (complicated) zipped-up website; see the sketch after this list.

    • Beware of licensing/platform traps like Apple’s iBooks Author application.

    • See the JISC EPUB project, on which I worked in 2011.

    • Think about how to design things, and interfaces to things, that are driven by declarative markup (ie be explicit first, delightful second)
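
    Since an EPUB really is just zipped-up web content, a toy sketch makes the point. This is illustrative Python only (file names invented; a real EPUB also needs a content.opf package manifest and navigation, which are omitted here):

      import zipfile

      # The mimetype entry must come first in the archive and be stored
      # uncompressed, so reading systems can sniff it
      with zipfile.ZipFile("book.epub", "w") as epub:
          epub.writestr("mimetype", "application/epub+zip",
                        compress_type=zipfile.ZIP_STORED)
          # container.xml tells readers where the package manifest lives
          epub.writestr("META-INF/container.xml",
              '<?xml version="1.0"?>\n'
              '<container version="1.0" '
              'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">\n'
              '  <rootfiles>\n'
              '    <rootfile full-path="OEBPS/content.opf" '
              'media-type="application/oebps-package+xml"/>\n'
              '  </rootfiles>\n'
              '</container>')
          # Everything else is ordinary web content
          epub.writestr("OEBPS/chapter1.html",
                        "<html><body><h1>Chapter 1</h1></body></html>")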

    What about apps? Games?

    –  see Jason Nelson –

    Potential projects?

    I figured there might be postgrads and others at the symposium interested in where to direct their bookish work, so here’s a list from the technical perspective.

    Arising from the JISC EPUB project

    Recommendations for further work:

    1. Provide rich search tools for individual collections of ebooks

    2. Tools for generating or traversing ebook citations

    3. Development of a pilot to produce ebooks with linked-data content*

    4. Native EPUB output for Microsoft Word or Open Office

    5. LaTeX to EPUB 3/MathML

    6. Ereading systems with scholarly annotation systems

    7. Community resources for individual scholars wishing to epublish

    8. Aggregate resources for digital conversion for small scholarly presses

    9. Maximize use of orphan works

    10. Community resources for institutions with digital collections

    Think about separating data from presentation

    To finish, I tried to show a little of what is possible with declarative, semantically marked-up HTML5, as one potential means to create new bookish things that might last.

    • Think about how to separate content from presentation and ‘engine’. (This might not be possible in some avant-garde experiments, but there are usually ways.)

    • Consider standards-based experiments like this one from Tim Sherratt.

    Conclusion

    Well, there isn’t a conclusion really, as these are just notes, but I’m sure that continued work led by Anna and Maria and the Writing and Society Research Centre will be critically important to UWS, seeing as we’re the first Australian university to give Every New Student a Free iPad.

    Anon. 2012. “Album.” Wikipedia, the Free Encyclopedia. http://en.wikipedia.org/w/index.php?title=Album&oldid=526533373.

    Drucker, Johanna. 2010. “Graphesis: Visual Knowledge Production and Representation.” Poetess Archive Journal 2 (1): 1–50.

    Liu, Alan. 2009. “The End of the End of the Book: Dead Books, Lively Margins, and Social Computing.” http://hdl.handle.net/2027/spo.act2080.0048.404.



    [1] There was one aspect of Jason’s work which I can’t talk about here which I think will make a fantastic online case study in research data archiving. I’m going to suggest it to the eResearch and repository staff at Griffith.

    Sliding towards declarative Scholarly HTML

    [Update 2012-09-18 - Changed "microformats" to "microdata"]

    For years I have been working with others on ways to embed simple PowerPoint-like overhead slide presentations in longer documents. I like to combine an essay or blog post with the presentation in one source document, to give some context to the slides.

    I also find it much easier to develop some kinds of presentations if I write them out essay-style at the same time I am developing the slides.

    Two things I think you should be able to do
    • Write in document mode and drop in some ‘slides’, like the thing you’re reading now.

    • Write and present in presentation mode (you know, PowerPoint), then save as a document where you can embellish the notes but see the slides. (I’m working on that part; more on that soon.)

      This is what I’m working towards with posts like this one, which shows slides and text.

    BTW: Did you know that the latest versions of PowerPoint don’t even have a Save As HTML option? Sure, you can link to a presentation in Microsoft’s cloud, just like you can with Google, but do you trust that to be around forever?

    This week, I’ve put together a demo of how to merge presentation and explanation using declarative Scholarly HTML. The resulting presentations are not as slick as PowerPoint, with no transitions or point-by-point slide building, but they’re light and portable. And beyond the slides themselves, the declarative Scholarly HTML technique I’m talking about is a really important part of how we publish, present and preserve scholarship.

    The declarative Slide format

    First, let’s look at the HTML that drives this; then I’ll come back to how you author it, how you get it to display, and why this matters.

    In an HTML document, mark up any block-level (container) element you’d like treated as a slide to say “this is a slide”, by referring to the Bibliographic Ontology.

    Declarative slide markup in HTML using Microdata*
    <section itemtype="http://purl.org/ontology/bibo/Slide" itemscope="">
    …
    </section>

    *RDFa 1.1 would also work; I don’t care which format wins, even though I know there are good arguments why one should care.

    This is like the approach used in the venerable Slidy, and several other HTML-based slideshow systems, although most of them use a microformat convention such as <div class="slide"> rather than one of the standard mechanisms for HTML semantics, and as far as I know none are designed to mix slides with other content.

    What’s wrong with <div class=’slide’>?
    • It’s a mere convention.

    • Slide is a noun that sometimes means things other than the parts of a Powerpoint presentation.

    • And it’s a verb too.

    • We can do better using microdata or RDFa structures and a standard ontology to say “This section here, this is a slide”*

    * Feel free to start a debate in the comments about whether it really says “this is a slide” or is about a slide. You can refer to this, which will make everything much clearer.

    My aim is to produce documents that are independent of the systems used to store, serve and process them, hence the declarative approach. You can drop one of these documents into any CMS or processing tool and it will behave as normal HTML. But a system which is aware of the semantics can do something special with the content. To demonstrate this I have written a very small WordPress plugin which you should be able to see in action here in this post, and in my previous post, on collaboration around research data.

    Slidyfy WordPress plugin; initial, crude modus operandi
    1. Look for slides: elements of type http://purl.org/ontology/bibo/Slide (see the sketch after this list)

    2. Wrap said slides in a border, with a link to view slideshow in new window. Then in new window:

      1. Build a new <body> element with just the slides.

      2. Replace existing Body with new slides-only version.

      3. Load the W3C Slidy CSS and Javascript.

        TODO: Support some of the other presentation engines.

      4. Result: user sees slideshow.
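
    For the curious, here is a rough Python sketch of step 1 (the actual plugin is PHP/JavaScript; this is just to show how cheap finding slides becomes once the semantics are declared, assuming BeautifulSoup):

      from bs4 import BeautifulSoup

      SLIDE_TYPE = "http://purl.org/ontology/bibo/Slide"

      def find_slides(html):
          # Pick out every block element declared to be a slide via microdata,
          # ignoring whatever else the theme has wrapped around the content
          soup = BeautifulSoup(html, "html.parser")
          return soup.find_all(attrs={"itemtype": SLIDE_TYPE})

      # A slides-only <body> for the pop-up window would be built from these
      slides = find_slides(open("post.html").read())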

    Authoring workflow: separating content from delivery

    How would someone use this? Well, they’d reach for their Scholarly HTML editing environment. This might be a text editor, where they’d make HTML documents by hand, or they might work in Word or Open Office Writer, or PowerPoint, or a combination of both (more on that later), via WordDown. Or, if they were prepared to do a little coding, they could use something like Markdown, which is one of the family of wiki-like markup languages; Asciidoc is another. Or this could be done in an online content management system like WordPress or SharePoint. None of these is really that convenient for many users, but then, strangely, there is no really convenient way to make good HTML.

    The important part is to separate content from the delivery mechanism.

    Right now, if you want to make a presentation for the web, it’s pretty hard to beat Pandoc. It takes Markdown or HTML and makes slideshows. This is not quite what I want for my use case, but it’s on the right track. Pandoc automates taking plain old content and making it into a slideshow using your choice of HTML-based slideshow systems. Most of these systems require you to author not just a specially formatted HTML document but to include one or more scripts and CSS stylesheets; Pandoc takes all the hard work out of that. You can just give it Markdown or HTML with headings and you’re away.
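
    For illustration, that step is a one-liner; a sketch assuming pandoc is installed and on the PATH, targeting Slidy (Pandoc supports several other HTML slideshow formats too):

      import subprocess

      # Turn a Markdown file with headings into a standalone Slidy slideshow
      subprocess.run(["pandoc", "-t", "slidy", "-s", "notes.md",
                      "-o", "slides.html"], check=True)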

    But the result is still useful only for stand-alone documents, for people who are prepared to use command-line tools and have access to a web server, or to run the slides from their local machine. It would be next to impossible to post the result as a WordPress post, for example. And it doesn’t allow mixing free-form document content with embedded slides.

    The Pandoc approach – full marks for separation of content from presentation*

    *Pun intended

    As far as I know, though, Pandoc and the Markdown format it supports don’t recognise embedded semantics via RDFa 1.1 or Microdata. That’s something to look into, and I’ll be exploring it in work we’re doing at the University of Western Sydney on embedding references to data in publications, because we want to be able to refer to data sets and code etc in research articles and README files, and a text-based markup language may well be ideal for authoring those.

    Anyway, Pandoc’s nice, but it’s not quite what we’re after here, which is something that runs in the content management system, or at the time of viewing.

    Some workflow options for making Scholarly Slides

    Why?

    This is not just about one person’s hopefully harmless fetish for mixing up blog posts with slides; there are lots of things that go on the scholarly web where embedding semantics is important.

    One example I have looked at before is chemistry. To embed a visualisation of a molecule you don’t want to have to hand-code the loading of, say, the JMOL molecule-viewer applet into a page, and leave it on the web: JMOL will get updated, and may be obsoleted. It would be better to use a more future-proof declarative markup.

    To illustrate this, see a previous post of mine with a plugin to show a 3D molecule. When I started writing this, something had broken in that plugin, caused (I think) by a new version of the jQuery library, so the page just showed a link to a CML file instead of the molecule viewer. This is the whole point of using declarative markup: the pretty part broke but the science didn’t.

    Another example, which I don’t think anyone has yet done in HTML plus semantic markup, would be to format a research article into its parts. Instead of slides, you’d be marking up Method, Abstract, Results and so on.

    Reasons this kind of technique is important

    This allows us to:

    1. Control and host our own stuff (cf embedding a Slideshare player)

    2. Produce long-term maintainable web documents that may degrade but will still be readable

    3. Continue to build the semantic web*

    4. Improve indexing and discovery

    5. Pave the way for robots to read and process documents

    *Even if we don’t believe in it

    Getting it?

    If this kind of cheap web stunt appeals, you can get it from the UWS eResearch Google Code repository. You’d need to check it out into your WordPress plugins directory, so at this stage it’s only for those of you who know what that means.

    Copyright  Peter Sefton, 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

    The repository is watching: automated harvesting from replicated filesystems

    [This is a repost; please comment over there: http://jiscpub.blogs.edina.ac.uk/2011/07/15/the-repository-is-watching-automated-harvesting-from-replicated-filesystems-2/]

    One of the final things I’m looking at on this jiscPUB project is a demonstration of a new class of tool for managing academic projects, not just documents. For a while we were calling this idea the Desktop Repository: the idea being that there would be repository services watching your entire hard disk and exposing all the content in a local website with repository and content management services. That’s possibly a very useful class of application for some academics, but in this project we are looking at a slightly different slant on the idea.

    The core use case I’m illustrating here is thesis writing, but the same workflow would be useful across a lot of academic projects, including all the things we’re focussing on in the jiscPUB project: academic users managing their portfolios of work, project reporting and courseware management. This tool is about a lot more than just ebook publishing, but I will look at that aspect of it, of course.

    In this post I will show some screenshots of The Fascinator repository in action, talk about how you can get involved in trying it out, and finish with some technical notes about installation and setup. I was responsible for leading the team that built this software at the University of Southern Queensland. Development is now being done at the University of Central Queensland and the Queensland Cyber Infrastructure Foundation, where Duncan Dickinson and Greg Pendlebury continue work on the ReDBox research data repository, which is based on the same platform.

    I know Theo Andrew at Edinburgh is keen to get some people trying this, so this blog post will serve to introduce it and give his team some ideas; we’ll follow up on their experiences if there are useful findings.

    Managing a thesis

    The short version of how this thesis story might work is:

    • The university supplies the candidate with a Dropbox-like shared file system they can use from pretty much any device to access their stuff. But there’s a twist: a web-based repository is watching the shared folder and exposing everything in it to the web.

    • The university helpfully adds into the share a thesis template that’s ready to go, complete with all the cover-page stuff, margins all set, automated tables of contents for sections and tables and figures, and the right styles, and trains the candidate in the basics of word processing.

    • The candidate works away on their project, keeping all their data, presentations, notes and so on in the Dropbox and filling out the thesis template as they go.

    • The supervisor can drop in on the work in progress and leave comments via an annotation system.

    • At any time, the candidate can grab a group of things, which we call a package, to publish to a blog or deposit to a repository at the click of a button. This includes not just documents but data files (the ones that are small enough to keep in a replicated file system), images, presentations etc.

    • The final examination process could be handled using the same infrastructure, and the university could make its own packages of all the examiners’ reports etc for deposit into a closed repository.

    The result is web-based, web-native scholarship where everything is available in HTML, not just PDF or application file formats, and there are easy ways to route content to other repositories or publish it in various ways.

    Where might ebook dissemination fit into this?

    Well, pretty much anywhere in the above where someone wants to either take a digital object ‘on the road’ or deposit it in a repository of some kind as a bounded digital thing.

    Demonstration

    I have put a copy of Joss Winn’s MA thesis into the system to show how it works. It is available in the live system (note that this might change if people play around with it). I took an old OpenOffice .sxw file Joss sent me and changed the styles a little bit to use the ICE conventions. I’m writing up a much more detailed post about templates in general, so stay tuned for a discussion of the pros and cons of various options for choosing style names and conventions, and whether or not to manage the document as a single file or multiple chapters.

    Illustration 1: The author puts their stuff in the local file system, in this case replicated by Dropbox.

    Illustration 2: A web view of Joss Winn’s thesis.


    The interface provides a range of actions.

    Illustration 3: You can do things with content in The Fascinator, including blogging and export to zip or (experimental) EPUB.

    The EPUB export was put together as a demonstration for the Beyond The PDF effort by Ron Ward. At the moment it only works on packages, not individual documents, and it uses some internal Python code to stitch together documents, rather than calling out to Calibre as I did in earlier work on this project. The advantage of doing it this way is that you don’t have Calibre adding extra stuff and reprocessing documents to add CSS; the disadvantage is that a lot of what Calibre does is useful, for example working around known bugs in reader software, though it does tend to change formatting on you, not always in useful ways.

    I put the EPUB into the Dropbox so it is available in the demo site (you need to expand the Attachments box to get the download; that’s not great usability, I know). Or you can go to the package and export it yourself. Log in first, using admin as the username and the same for the password.

    Illustration 4: Joss Winn’s thesis exported as EPUB.

    I looked at a different way of creating an EPUB book from the same thesis a while ago; that will be available for a while here at the Calibre server I set up.

    One of the features of this software is that more than one person can look at the web site, and there are extensive opportunities for collaboration.

    Illustration 5: Colleagues and supervisors can leave comments via inline annotation (including annotating pictures and videos).

    Illustration 6: Annotations are threaded discussions.

    Illustration 7: Images and videos can be annotated too. At USQ we developed a Javascript toolkit called Anotar for this, the idea being that you could add annotation services to any web site quickly and easily.

    This thesis package only contains documents, but one of the strengths of The Fascinator platform is that it can aggregate all kinds of data, including images, spreadsheets and presentations, and can be extended to deal with any kind of data file via plugins. I have added another package, modestly calling itself the research object of the future, using some files supplied by Phil Bourne for the Beyond the PDF group. The Fascinator makes web views of all the content and can package it all as a zip file or an EPUB.

    Illustration 8: A spreadsheet rendered into HTML and published into an EPUB file (demo quality only).

    This includes turning PowerPoint into a flat web page.

    Illustration 9: A presentation exported to EPUB along with data and all the other parts of a research object.

    Installation notes

    Installing The Fascinator (I did it on Amazon’s EC2 cloud on Ubuntu 10.04.1 LTS) is straightforward. These are my notes: not intended to be a detailed how-to, but possibly enough for experienced programmers/sysadmins to work it out.

    • Check it out.

      sudo svn co https://the-fascinator.googlecode.com/svn/the-fascinator/trunk /opt/fascinator
    • Install Sun’s Java

      sudo apt-get install python-software-properties
      sudo add-apt-repository ppa:sun-java-community-team/sun-java6
      sudo apt-get update
      sudo apt-get install sun-java6-jdk

      http://stackoverflow.com/questions/3747789/how-to-install-the-sun-java-jdk-on-ubuntu-10-10-maverick-meerkat/3997220#3997220

    • Install Maven 2.

      sudo apt-get install maven2
    • Install ICE or point your config at an ICE service. I have one running for the jiscPUB project; you can point to it by changing the ~/.fascinator/system-config.json file.

    • Install Dropbox or your file-replication service of choice. This is a little bit of work on a headless server, but there are instructions linked from the Dropbox.com site.

    • Make some configuration changes, see below.

    • To run ICE and The Fascinator on their default ports on the same machine, add this to /etc/apache2/apache.conf (I think the proxy modules I’m using here are non-standard).

      LoadModule  proxy_module /usr/lib/apache2/modules/mod_proxy.so
      LoadModule  proxy_http_module /usr/lib/apache2/modules/mod_proxy_http.so
      ProxyRequests Off
      <Proxy *>
      Order deny,allow
      Allow from all
      </Proxy>
      ProxyPass        /api/ http://localhost:8000/api/
      ProxyPassReverse /api/  http://localhost:8000/api/
      ProxyPass       /portal/ http://localhost:9997/portal/
      ProxyPassReverse /portal/ http://localhost:9997/portal/
    • Run it.

      cd /opt/fascinator
      ./tf.sh restart

    Configuration follows:

    • To set up the harvester, add this to the empty jobs list in ~/.fascinator/system-config.json

    "jobs" : [
                       {
                           "name": "dropbox-public",
                           "type": "harvest",
                           "configFile":
    "${fascinator.home}/harvest/local-files.json",
                           "timing": "0/30 * * * * ?"
                       } 

    And change ${fascinator.home}/harvest/local-files.json to point at the Dropbox directory:

    "harvester": {
            "type": "file-system",
            "file-system": {
                "targets": [
                    {
                        "baseDir": "${user.home}/Dropbox/",
                        "facetDir": "${user.home}/Dropbox/",
                        "ignoreFilter": ".svn|.ice|.*|~*|Thumbs.db|.DS_Store",
                        "recursive": true,
                        "force": false,
                        "link": true
                    }
                ],
                "caching": "basic",
                "cacheId": "default"
            }

    To add the EPUB support and the red branding, unzip the skin files in this zip file into the portal/default/ directory: http://ec2-50-19-86-198.compute-1.amazonaws.com/portal/default/download/551148ce6d80bfc0c9c36914f9df4f91/jiscpub.zip

    unzip -d /opt/fascinator/portal/src/main/config/portal/default/ jiscpub.zip

    [This is a repost; please comment over there: http://jiscpub.blogs.edina.ac.uk/2011/07/15/the-repository-is-watching-automated-harvesting-from-replicated-filesystems-2/]

    Copyright Peter Sefton, 2011-07-12. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


    This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

    Making EPUB from WordPress (and other) web collections

    [This is a re-post from the jiscPUB project; please make any comments over there: http://jiscpub.blogs.edina.ac.uk/2011/05/25/making-epub-from-wordpress-and-other-web-collections/]

    Background

    As part of Workpackage 3 I have been looking at WordPress as a way of creating scholarly monographs. This post carries on from the last couple, but it’s not really about EPUB or about WordPress; it’s about interoperability, and how tools might work together in a Scholarly HTML mode so that people can package and repackage their resources much more reliably and flexibly than they can now.

    While exploring WordPress I had a look at the JISC-funded KnowledgeBlog project. The team there has released a plugin for WordPress to show a table of contents made up of all the posts in a particular category. It seemed that with a bit of enhancement this could be a useful component of a production workflow for book-like projects, particularly for project reports and theses (where they are being written online in content management systems; maybe not so common now, but likely to become more common) and for course materials.

    Recently I looked at Anthologize, a WordPress-based way of creating ebooks from HTML resources sourced from around the web (I noted a number of limitations, which I am sure will be dealt with sooner or later). Anthologize uses a design pattern that I have seen a couple of times with EPUB: converting the multiple parts of a project to an XML format that already has some tools for rendering, and using those tools to generate outputs like PDF or EPUB. Asciidoc does this using the DocBook tool-chain, and Anthologize uses TEI tools. I will write more on this design pattern and its implications soon. There is another obvious approach: to leave things in HTML and build books from that, for example using Calibre, which already has ways to build ebooks from HTML sources. This is an approach which could be added to Anthologize very easily, to complement the TEI approach.

    So, I have put together a workflow using Calibre to build EPUBs straight from a blog.

    Why would you want to do this? Two main reasons. Firstly, to read a report, thesis or course, or an entire blog on a mobile device. Secondly, to be able to deposit a snapshot of same into a repository.

    In this post I will talk about some academic works:

    The key to this effort is the KnowledgeBlog table of contents plugin, ktoc, with some enhancements I have added to make it easier to harvest web content into a book.

    The results are available on a Calibre server I’m running in the Amazon cloud just for the duration of this project. (The server is really intended for local use; the way I am running it, behind an Apache reverse proxy, it doesn’t seem very happy, and you may have to refresh a couple of times until it comes good.) This is rough. It is certainly not production quality.


    These books are created using Calibre ‘recipes’, available here. You run them like this:

    ebook-convert thesis-demo.recipe .epub --test

    If you are just trying this out, to be kind to site owners --test will cause it to fetch only a couple of articles per feed.

    I added them to the calibre server like this:

    calibredb add --library-path=./books thesis-demo.epub

    The projects page at my site has two TOCs for two different projects.

    The title is used to create sections in the book; in both cases the posts are displayed in date order, and I am not showing the name of the author on the page because that’s not needed when it is all me.

    The resulting book has a nested table of contents, seen here in Adobe Digital Editions.

    Illustration 1: A book built from a WordPress page with two table-of-contents blocks generated from WordPress categories.

    Read on for more detail about the process of developing these things and some comments about the problems I encountered working with multiple conflicting WordPress plugins, etc.

    The Scholarly HTML way to EPUB

    The first thing I tried in this exploration was writing a recipe to make an EPUB book from a KnowledgeBlog, for the Ontogenesis project. It is a kind of encyclopaedia of ontology development maintained in a WordPress site with multiple contributors. It worked well, for a demonstration, and did not take long to develop. The Ontogenesis recipe is available here and the resulting book is available on the Calibre server.

    But there was a problem.

    The second blog I wanted to try it on was my own, so I installed ktoc, changed the URL in the recipe and ran it. Nothing. The problem is that Ontogenesis and my blog use different WordPress themes, so the structure is different. Recipes have stuff like this in them to locate the parts of a page, such as <p class='details_small'>:

    remove_tags_before = dict(name='p', attrs={'class':'details_small'})

    remove_tags_after = dict(name='div', attrs={'class':'post_content'})

    That’s for Ontogenesis; different rules are needed for other sites. You also need code to find the table of contents amongst all the links on a WordPress page, and to deal with pages that might have two or more ktoc-generated tables for different sections of a journal, or parts of a project report.
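
    To make that concrete, here is a hypothetical skeleton of such a recipe; the class name, URL and link-filtering logic are invented, but parse_index() is the standard BasicNewsRecipe hook for assembling the book’s sections and articles:

      from calibre.web.feeds.news import BasicNewsRecipe

      class OntogenesisBook(BasicNewsRecipe):
          title = 'Ontogenesis'
          # Theme-specific scraping rules; these are the bits that break
          # when you point the recipe at a differently themed blog
          remove_tags_before = dict(name='p', attrs={'class': 'details_small'})
          remove_tags_after = dict(name='div', attrs={'class': 'post_content'})

          def parse_index(self):
              # Fetch the page carrying the ktoc table of contents and
              # scrape the post links out of it
              soup = self.index_to_soup('http://ontogenesis.knowledgeblog.org/')
              articles = [{'title': self.tag_to_string(a), 'url': a['href']}
                          for a in soup.findAll('a')]  # real code must filter
              return [('Articles', articles)]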

    Anyway, I wrote a different recipe for my site, but as I was doing so I was thinking about how to make this easier. What if:

    • The ktoc plugin output a little more information in its list of posts, making the list easy to find no matter what WordPress theme was being used.

    • The actual post part of each page (ie not the navigation, or ads) identified itself as such.

    • The same technique could be extended to other websites in general.

    There is already a standard way to do the most important part of this, listing a set of resources that make up an aggregated resource: the Object Reuse and Exchange (ORE) specification, embedded in HTML using RDFa. ORE in RDFa. Simple.

    Well no, it’s not, unfortunately. ORE is complicated, and has some very important but hard-to-grasp abstractions, such as the difference between an Aggregation and a Resource Map. An Aggregation is a collection of resources which has a URI, while a Resource Map describes the relationship between the Aggregation and the resources it aggregates. These things are supposed to have different URIs. Now, for a simple task like making a table of contents of WordPress posts machine-readable so you can throw together a book, these abstractions are not really helpful to developers or consumers. But what if there were a simple recipe/microformat (what we call a convention in Scholarly HTML) to follow, which was ORE compliant and also simple to implement at both the server and client end?

    What I have been doing over the last couple of days, as I continue this EPUB exploration, is trying to use the ORE spec in a way that will be easy to implement, say in the Digress.it TOC page, or in Anthologize, while still being ORE compliant. That discussion is ongoing, and will take place in the Google groups for Scholarly HTML and ORE. It is worth pursuing because, if we can get it sorted out, then with a few very simple additions to the HTML they spit out, any web system can get EPUB export quickly and cheaply by adhering to a narrowly defined profile of ORE, subject to the donor service being able to supply reasonable-quality HTML. More sophisticated tools that do understand RDFa and ORE will be able to process arbitrary pages that use the Scholarly HTML convention, but developers can choose the simpler convention over a full implementation for some tasks.

    The details may change, as I seek advice from experts, but basically, there are two parts to this.

    Firstly there’s adding ORE semantics to the ktoc (or any) table of contents. It used to be a plain-old unordered list, with list items in it:

    <p><strong>Articles</strong></p>
    <ul>
    <li><a href="http://ontogenesis.knowledgeblog.org/49">Automatic
    maintenance of multiple inheritance ontologies</a> by Mikel Egana
    Aranguren</li>
    <li><a href="http://ontogenesis.knowledgeblog.org/257">Characterising
    Representation</a> by Sean Bechhofer and Robert Stevens</li>
    <li><a href="http://ontogenesis.knowledgeblog.org/1001">Closing Down
    the Open World: Covering Axioms and Closure Axioms</a> by Robert
    Stevens</li>
    </ul>

    The list items now explicitly say what is being aggregated. The plain old <li> becomes:

    <li  rel="http://www.openarchives.org/ore/terms/aggregates"
    resource="http://ontogenesis.knowledgeblog.org/49">

    (The fact that this is an <li> does not matter, it could be any element.)

    And there is a separate URI for the Aggregation and resource map courtesy of different IDs. And the resource map says that it describes the Aggregation as per the ORE spec.

    <div id="AggregationScholarlyHTML">

    <div rel="http://www.openarchives.org/ore/terms/describes" resource="#AggregationScholarlyHTML" id="ResourceMapScholarlyHTML" about="#ResourceMapScholarlyHTML">

    It is verbose, but nobody will have to type this stuff. What I have tried to do here (and it is a work in progress) is to take an existing standard, which could be applied in any number of ways, and boil it down to a simple convention that’s easy to implement but still honours the more complicated specifications in the background. (Experts will realise that I have used an RDFa 1.1 approach here, meaning that current RDFa processors will not understand it; this is so that we don’t have to deal with namespaces and CURIEs, which complicate processing for non-native tools.)

    Secondly, the plugin wraps a <div> element around the content of every post to label it as being Scholarly HTML; this is a way of saying that this part of the whole page is the content that makes up the article, thesis chapter or similar. Without a marker like this, finding the content is a real challenge: pages are loaded up with all sorts of navigation, decoration and advertisements, it is different on just about every site, and it can change at the whim of the blog owner if they change themes.

    <div rel="http://scholarly-html.org/schtml">

    Why not define an even simpler format?

    It would be possible to come up with a simple microformat that had nice human-readable class attributes or something to mark the parts of a TOC page. I didn’t do that because then people would rightly point out that ORE exists, and we would end up with a convention that covered a subset of the existing spec, making it harder for tool makers to cover both and less likely that services will interoperate.

    So why not just use general ORE and RDFa?

    There are several reasons:

    • Tool support is extremely limited for client- and server-side processing of full RDFa, for example in supporting the way namespaces are handled in RDFa using CURIEs. (Sam Adams has pointed out that it would be a lot easier to debug my code if I did use CURIEs and RDFa 1.0, so I followed his advice, did some search-and-replacing, and checked that the work I am doing here is indeed ORE compliant.)

    • The ORE spec is suited only for experienced developers with a lot of patience for complexities like the difference between an aggregation and a resource map.

    • RDFa needs to apply to a whole page, with the correct document type, and that’s not always possible when we’re dealing with systems like WordPress. The convention approach means you can at least produce something that can become proper RDFa if put into the right context.

    Why not use RSS/Atom feeds?

    Another way to approach this would be to use a feed, in RSS or Atom format. WordPress has good support for feeds; there’s one for just about everything. So you can look at all the posts on my website:

    http://ptsefton.com/category/uncategorized/feed/atom

    or use Tony Hirst’s approach to fetch a single post from the jiscPUB blog:

    http://jiscpub.blogs.edina.ac.uk/2011/05/23/a-view-from-academia-on-digital-humanities/feed/?withoutcomments=1

    The nice thing about this single-post technique is that it gives you just the content in a content element, so there is no screen scraping involved. The problem is that the site has to be set up to provide full HTML versions of all posts in its feeds, or you only get a summary. There’s a problem with using feeds on categories too, I believe, in that there is an upper limit to how many posts a WordPress site will serve. The site admin can change that to a larger number, but then that will affect subscribers to the general-purpose feeds as well; they probably don’t want to see three hundred posts in Google Reader when they sign up to a new blog.
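
    Where a site does serve full content, consuming the feed is straightforward; a sketch using the feedparser library (it assumes the blog puts full HTML in the content element):

      import feedparser

      feed = feedparser.parse(
          "http://ptsefton.com/category/uncategorized/feed/atom")
      for entry in feed.entries:
          # entry.content holds the full post HTML only if the site publishes
          # full-text feeds; otherwise all you get is a summary
          html = entry.content[0].value
          print(entry.title, len(html))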

    Given that Atom (the best-standardised and most modern feed format) is one of the official serialisation formats for ORE, it is probably worth revisiting this question later if someone, such as JISC, decides to invest more in this kind of web-to-ebook-compiling application.

    What next?

    There are some obvious things that could be done to further this work:

    • Set up a more complete and robust book server which builds and rebuilds books from particular sites and distributes them in some way, using the Open Publication Distribution System (OPDS) or something like this thing that sends stuff to your Kindle.

    • Write a ‘recipe factory’. With a little more work the Scholarly HTML recipe could get to the point where the only required variable is a single page URL; everything else can be harvested from the page or overridden by the recipe.

    • Combine the above to make a WordPress plugin that can create EPUBs from collections of in-built content (tricky because of the present Calibre dependency, but it could be re-coded in PHP).

    • Add the same Scholarly HTML convention for ORE to other web systems such as the Digress.it plugin and Anthologize. Anthologize is appealing because it allows you to order resources in ‘projects’ and nest them into ‘parts’ rather than being based on simple queries, but at the moment it does not actually have a way to publish a project directly to the web.

    • Explore the same technique in the next phase of Workpackage 3, when I return to looking at word processing tools and examine how cloud replication services like Dropbox might help people to manage book-like projects that consist of multiple parts.

    Postscript: lessons and things that need fixing or investigating

    I encountered some issues. Some of these are mentioned above but I wanted to list them here as fodder for potential new projects.

    • As with Anthologize, if you use the WordPress RSS importer to bring in content, it does not change the links between posts so that they point to the new location. Likewise with importing a WordPress export file.

    • The RSS importer applied to the thesis created hundreds of blank categories.

    • I tried to add my ktoc plugin to a Digress.it site, but ran into problems. It uses PHP’s SimpleXML parser, which chokes in unpredictable ways on what I am convinced is perfectly valid XML. And the default Digress.it configuration expects posts to be formatted in a particular way, as a list of top-level paragraphs rather than nested divs. I will follow this up with the developers.

    • Calibre does a pretty good job of taking HTML and making it into EPUBs but it does have its issues. I will work through these on the relevant forums as time permits.

      • There are some encoding problems with the table of contents in some places. Might be an issue with my coding in the recipes.

      • Unlike other Calibre workflows, such as creating books from raw HTML, ebook-convert adds navigation to each HTML page in the book created by a recipe. This navigation is redundant in an EPUB, but apparently it would require a source code change to get rid of it.

      • It does something complicated to give each book its style information. There are some odd presentation glitches in the samples as a result of Calibre’s algorithms. This requires more investigation.

      • It doesn’t find local links between parts of a book (ie links from one post to another, which occur a lot in my work and in Tony’s course), but I have coded around that in the Scholarly HTML recipes.

    It will be up to Theo Andrew, the project manager, whether any of these next steps or issues get attention during the rest of this project.

    [This is a re-post from the jiscPUB project; please make any comments over there: http://jiscpub.blogs.edina.ac.uk/2011/05/25/making-epub-from-wordpress-and-other-web-collections/]

    Copyright Peter Sefton, 2011-05-25. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

    graphics3

    This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

    Anthologize: a WordPress based collection tool

    [This is a copy of a post on the jiscPUB project; if you have comments please make them over there: http://jiscpub.blogs.edina.ac.uk/2011/05/11/anthologize-a-wordpress-based-collection-tool/] In this post I’ll look at Anthologize. Anthologize lets you write or import content into a WordPress instance, organise the ‘parts’ of your ‘project’, and publish to PDF, EPUB, HTML or TEI XML format. This is what I referred to in my last post about WordPress as an aggregation platform.

    Anthologize background and use-cases

    Anthologize was created in an interesting way. It is the (as yet unfinished) outcome of a one-week workshop conducted at the Center for History and New Media, the same group that brought us Zotero and Omeka, which is one good reason to take it seriously. They produce very high quality software.
    Anthologize is a project of One Week | One Tool, a project of the Center for History and New Media, George Mason University. Funding provided by the National Endowment for the Humanities. © 2010, Center for History and New Media. For more information, contact infoATanthologizeDOTorg. Follow @anthologize.

    Anthologize is a WordPress plugin that adds import and organisation features to WordPress. You can author posts and pages as normal, or you can import anything with an RSS/Atom feed. The imported documents don’t seem to be publishable for others to view, but you can edit them locally. This could be useful, but introduces a whole lot of management issues around provenance and version control. When you import a post from somewhere else, the images stay on the other site, so you have a partial copy of the work with references back to a different site. I can see some potential problems with that if other sites go offline or change.

    Let’s remind ourselves about the use-cases in workpackage 3:

    The three main use cases identified in the current plan, and a fourth proposed one: [numbering added for this post]
    1. Postgrad serializing PhD (or conference paper etc) for mobile devices
    2. Retiring academic publishing their best-of research (books)
    3. Present final report as epub
    4. Publish course materials as an eBook (extra use-case proposed by Sefton)
    http://jiscpub.blogs.edina.ac.uk/2011/03/03/workpackage-3/

    Many documents like (a) theses or (c) reports are likely to be written as monolithic documents in the first place, so it would be a bit strange to write, say, a report in Word, or LaTeX or Asciidoc (which is how I think Liza Daly will go about writing the landscape paper for this project), export that as a bunch of WordPress posts for dissemination, then reprocess it back into an Anthologize project, and then to EPUB. There’s much more to go wrong with that, and more information to be lost, than going straight from the source document to EPUB. It is conceivable that this would be a good tool for thesis by publication, where the publications were available as HTML that could be fed or pasted into WordPress.

    I do see some potential with (d), courseware. It seems to me that it might make sense to author course materials in a blog-post-like way, covering topics one by one. I have put some feelers out for someone who might like to test publishing course materials, without spending too much of this project’s time, as this is not one of the core use cases. If anyone wants to try this, or can point me to some suitable open materials somewhere with categories and feeds I can use, then I will give it a go.

    There is also some potential with (c), project reports, particularly if anyone takes up the JiscPress way of doing things and creates their project outputs directly in WordPress + Digress.it. It would also be ideal for compiling stuff that happens on the project blog as a supporting appendix. So an EPUB that gathers together, say, all the blog posts I have made on Workpackage 3, or the whole of the jiscPUB blog, might make sense. These could be distributed to JISC and stakeholders as EPUB documents to read on the train, or deposited in a repository.

    The retiring academic (b) (or any academic, really) might want to make use of Anthologize too, particularly if they’ve been publishing online. If not, they could paste their works into WordPress as posts, and deal with the HTML conversion issues inherent in that, or try to post from Word to WordPress. The test project I chose was to convert the blog posts I have done for jiscPUB into an EPUB book. That’s use case (c), more or less.

    How did the experiment go?

    I have documented the basic process of creating an EPUB using Anthologize below, with lots of screenshots, but here is a summary of the outcomes. Some things went really well.
    • Using the control panel at my web host I was able to set up a new WordPress website on my domain, add the Anthologize plugin and make my first EPUB in well under an hour. (But as usual, it takes a lot longer to back-track and investigate and try different options, and read the Google group to see if bugs have been reported, and so on.)
    • The application is easy to install and easy to use with some issues I note below.
    • Importing a feed just works if you search to find out how to do it on a standard WordPress host (although I think there might be issues trying to get large amounts of content if the source does not include everything in the feed).
    • Creating parts and dragging in content is simple.
    • Anthologize looks good.
    The good looks and simple interface are deceptive; lots of functionality I was expecting to be there just isn’t yet. I have been in contact with the developers and noted my biggest concerns, but here’s a list of the major issues I see with the product at this stage of its development:
    • There does not seem to be a way to publish the project (or the imported docs) directly to the web rather than exporting it; seems like an obvious win to add that. I can see that being really useful with Digress.it, for one thing. The other big win would be if the table of contents could have some semantics embedded in it so it could act as an ORE resource map, meaning that machines would be able to interpret the content. (I will come back to this idea soon with a demo of using Calibre to make an EPUB.)
    • There are no TOC entries for the posts within a ‘part’; that is, if you pull in a lot of WordPress posts, they don’t get individual entries in the EPUB ToC.
    • Links, even internal ones, like the table of contents links on my posts, all point back to the original post. This makes packaging stuff up much less useful: you’d need to be online, and you lose the context of an intra-linked resource. This is a known problem, and the developers say they are going to fix it.
    • Potentially a problem is the way Anthologize EPUB export puts all the HTML content for the whole project into one HTML file; I gather from poking around with Calibre etc that many book readers need their content chunked into multiple files.
    • There’s a wizard for exporting your EPUB, and you can enter some metadata and choose some options all of which is immediately forgotten by the application, so if you do it again, you have to re-enter all the information.
    • Epubcheck complains about the test book I made:
      • It says the mimetype (a simple file that MUST be there in all EPUB) is wrong looks OK to me.
      • It complains about the XHTML containing stuff from the TEI namespace and a few other things.
    • Finally, PDF export fails on my blog with a timeout error but that’s not an issue for this investigation.
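    For reference, this is roughly how I run epubcheck from the command line (the jar name varies with the version you download, and anthologize-export.epub is a stand-in for whatever your export is called):

        java -jar epubcheck.jar anthologize-export.epub

    It reports each problem against the offending file, which is how I spotted the mimetype and TEI-namespace complaints above.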

    Summary

    For the use case of bundling together a bunch of blog posts (or anything that has a feed) into a curated whole, Anthologize is a promising application, but unless your needs are very simple it's probably not quite ready for production use. I spent a bit of time looking at it, though, as it shows great promise and comes from a good stable. Here's the result I got importing the first handful of posts from my work on this project.

    Illustration 1: The test book in Adobe Digital Editions – note some encoding problems bottom right and the lack of depth in the table of contents. There are several posts but no way to navigate to them. Also, clicking on those table of contents links takes you back to the jiscPUB blog, not to the heading.

    Walk through

    Illustration 2: Anthologize uses 'projects'. These are aggregated resources; in many cases they will be books, but 'project' seems like a nice media-neutral term.

    Illustration 3: A new project in a fresh WordPress install – only two things can be added to it until you write or import some content.

    Illustration 4: Importing the feed for workpackage 3 in the jiscPUB project. http://jiscpub.blogs.edina.ac.uk/category/workpackage-3/feed/atom/

    Illustration 5: You can select which things to keep from the feed. Ordering is done later. Remember that imported documents are copies, so there is potential for confusion if you edit them in Anthologize.

    Illustration 6: Exporting content is via a wizard – easy to use, but frustrating because it asks some of the same questions every time you export.

    Illustration 7: Having to retype the export information is a real problem, as you can only export one format at a time. Exported material is not stored in the WordPress site, either – it is downloaded – so there is no audit trail of versions. [This is a copy of a post on the jiscPUB project – if you have comments please do so over there: http://jiscpub.blogs.edina.ac.uk/2011/05/11/anthologize-a-wordpress-based-collection-tool/]

    Copyright Peter Sefton, 2011-05-04. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


    This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

    WordPress [and the jiscPUB project]

    [This is a repost from the jiscPub project – please comment over there: http://jiscpub.blogs.edina.ac.uk/2011/05/10/wordpress/ ]

    Introduction

    So far in the jiscPUB project I have been looking at word processing applications and EPUB, as well as how repositories and other web applications might support EPUB document production. One of the tasks in workpackage 3 is to look at WordPress as an example of an online tool that’s being used quite a bit in academia for both writing and publishing.
    Here are the three main use cases identified in the current plan, plus a fourth proposed one [lettering added for this post]:
    a. Postgrad serializing PhD (or conference paper etc.) for mobile devices
    b. Retiring academic publishing their best-of research (books)
    c. Present final report as EPUB
    d. Publish course materials as an eBook (extra use case proposed by Sefton)
    The next few posts will explore web-based authoring and publishing, with a focus on WordPress and how it relates to packaging content as electronic books. WordPress can be used in a number of different ways. For this project I am thinking of it as:
    • A publishing platform.
    • A collaboration platform.
    • A content aggregation platform.
    • An authoring environment where people might write academic content. (I put this last, because I think it’s the most controversial).
    All of these overlap, and the same installation of WP might be doing all or none, as might other content management systems used in academia. In future posts I'm going to look at building ebooks via aggregation using the Anthologize plugin, look at an alternative way of building EPUB books from lists of WordPress posts using Calibre (there's a one-line taste of that just after the list below), and take a look at Martin Fenner's EPUB plugin for WordPress. In this post I will look at some of the issues around WordPress as used in a couple of projects related to this one, looking particularly at JISC-funded or JISC-friendly work. This is not a survey of how WordPress is being used in academia everywhere – there's no time for that. Please use the comments below if I've missed something that's important to this project. At the moment, I am thinking that the most compelling match-ups between the use cases for this project and what is being done with WordPress are these:
    • b: Retiring academic publishing their best-of research (books): not so much books but using a tool like Anthologize to draw together papers or other documents.
    • d: Publish course materials as an eBook (Proposed extra use-case proposed by Sefton): I see great potential for tools like Anthologize as a way of compiling reading packages from web resources and packaging them to take-away on mobile devices, likewise for conference proceedings and programs and other aggregated documents.
    And possibly, where people are using JiscPress, this use case too: c: Present final report as EPUB.
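    As a small taste of the Calibre approach I'll cover in a later post: if you can get a set of posts into a single HTML file, a one-liner along these lines (the file names and metadata are stand-ins) will wrap it up as an EPUB:

        ebook-convert posts.html posts.epub --title "jiscPUB: WorkPackage 3" --authors "Peter Sefton"

    Calibre handles the HTML-to-EPUB chunking and packaging; the interesting work is getting the posts into sensible HTML in the first place.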

    Publishing platform

    A great example of using a blogging platform for scholarship is the KnowledgeBlog project:
    We are investigating a new, light-weight way of publishing scientific, academic and technical knowledge on the web. Currently, Knowledge Blog is being funded by a JISC grant.
    And the sites it has under its wing. KnowledgeBlog uses the WordPress platform to publish articles and to manage article review, and serves as a live example of a new mode of scholarship. It's a publisher, but not as we know it. A new entrant in the WordPress-backed publishing space (and in the authoring space) is Annotum, which has not released any code but has very lofty ambitions. I'll come back to Annotum below.

    An aggregation platform bringing together content from elsewhere.

    I’ll cover this in my next post, looking at Anthologize, which is a promising but immature tool for pulling together stuff from multiple sources and/or authoring it locally, then grouping it with a customized table of contents and publishing to a variety of media.

    An authoring platform

    It has to be said that WordPress as an editor gets some bad press from time to time. Phillip Lord at KnowledgeBlog advises against using it for authoring – WordPress, he argues, is not an authoring environment:
    http://www.knowledgeblog.org is hosted using WordPress. Its a very good tool in many ways, but it was intended for and is most suited for use as a publishing tool; most blogs are written by single authors who wish to place their thoughts on the web either for authors or themselves to be able to read. It is not an authoring tool, however. It does not provide a particularly rich environment for editing, and particularly not for collaborative editing. Most people get tired of the wordpress authoring tool very quickly, as its just not suited for serious scientific authoring. Nor does it provide good facilities for collaborative editing; normally, only one person can see a draft post, so you cannot pass this around between several authors. http://process.knowledgeblog.org/3
    The KnowledgeBlog site encourages people to use their current authoring tools and treat the KnowledgeBlog WordPress platform as a publishing and review system. Others are more positive about WordPress as an editor. Martin Fenner, for example, is a tireless promoter of the practice. And the Digress.it help recommends using WordPress to create content from scratch – the opposite of the advice coming from KnowledgeBlog:
    We recommend using the WordPress editor directly for a number of reasons:
    • Multiple authors can easily collaborate on a single document;
    • A complete revision history of the document is maintained with the ability to roll-back to earlier versions;
    • This method produces a web-ready document, native to WordPress, and avoids the two-stage process of re-publishing on your Digress.it site; and
    • You can easily embed video and other objects.
    And then there’s Annotum. The site says:
     Annotum will build upon the WordPress platform as a foundation, filling in the gaps by providing the following additional features:
    • Rich, web-based authoring and editing:
      • What you see is what you get (WYSIWYG) authoring with rich toolset (equations, figures, tables, citations and references)
      • coauthoring, comments, version tracking, and revision comparisons
      • Strict conformance to a subset of the NLM journal article publishing tag set
    And a long list of other features. There is no code to show yet, though.

    Collaboration platform

    Others are seeing WordPress as a place for collaborative authoring and editing. Annotum promises this on a grand scale. For those who would like to get started, Martin Fenner listed some resources late last year:
    The Co-Authors Plus Plugin enables multiple authors per article. Each author can be linked to an author page for displaying biographical info. WordPress could be extended to include additional info such as institution or past publications. Linking the WordPress user account to the unique author identifier ORCID, and describing the role of the author in the paper (e.g. conceived and designed the experiments or analyzed the data) would be particularly interesting. Plugins such as Edit Flow can extend the workflow by adding custom status messages (e.g. resubmission), reviewer comments, and email notifications. http://blogs.plos.org/mfenner/2010/12/05/blogging-beyond-the-pdf/
    Collaboration post-publication is handled by a WordPress tool that's been a hit in the UK, and with JISC. Digress.it is a tool for public annotation and discussion of long-form documents. The JISC incarnation is at jiscpress.org. Digress.it is related to CommentPress. (They're different things, although sometimes confused with each other – at least by me. See them compared here.) For a JiscPress example see this document, which has a number of comments.

    Issues

    Some issues I have observed with WordPress in the past include the problems with its authoring environment, covered above, but also a number of other considerations.

    There is the WordPress version of Microsoft's DLL hell – plugin hell. Many WordPress plugins and/or themes interact with each other in unpredictable ways. I found this out first hand, trying to show off some work my team at USQ had done on an annotation system. It worked (with bugs) in a plain WordPress site, but failed completely in Martin Fenner's demo site, where there are many other plugins installed. I never got to the bottom of that. Plugins also go out of sync with WordPress as it evolves, so a site with lots of plugins can be hard to maintain; this is also the case with systems like Drupal, which have their own enthusiastic following.

    Some of the above systems require the content management system to be used in very particular ways – for example, Digress.it treats each document as a new WordPress site and asks you to upload posts in a particular order so that the Table of Contents for the site looks right. There are two issues with this kind of approach. I'm not saying that people are not already aware of these issues, but noting that they are there:
    • There’s sometimes a fair bit of overhead involved in setting things up just so. Sometimes, it would make sense to automate some of the processes. Other times maybe a re-think to reduce complexity might be in order.
    • There is a risk of creating a new form of the proprietary lock-in we had up until recently (and arguably still have) with document formats like Microsoft's .doc. The documents we create in some of these systems may end up being unusable in other systems. If you author a long document in Digress.it and depend on a particular configuration of WP, having posts in a certain order, and so on for the document's integrity, then it is essential to consider an exit strategy and an archiving strategy (more on that soon – an EPUB export might be just the ticket). There are similar issues/risks with stuff like WordPress shortcodes, such as KCite from KnowledgeBlog. It's a great tool for authors, allowing them to cite things in a rational way:
      DOI Example [cite source=doi]10.1021/jf904082b[/cite] PMID example [cite source=pubmed]17237047[/cite]
      But it’s proprietary to a particular processing environment. If one wants to be able to re-used these documents or archive them then it is important to consider which version of the documents in WP to keep. (I’d argue that in this case best practice would be to transform the above to an RDFa representation in HTML and treat the HTML version as the version of record more on this later in the project).
    All this adds up to saying that WordPress + plugins can be fragile. The application itself needs to be updated frequently for security reasons, and so does the operating system underneath, and inevitably stuff breaks. The more complex the plugin-set, and the further you stray from straight WordPress, the worse the risk. Even on simple sites there can be issues. For example, one of the WordPress sites I use regularly currently has a bug with remote publishing via AtomPub and XML-RPC. One day it was working and the next, all my attempts to post from the tools I use every day, as per the best-practice advice from the KnowledgeBlog people, were minus the characters < and > in the document source, both of which are obviously essential to the web. For those interested in learning more about WordPress for scholarship, there's a Google Group called WordPress for Scientists that is worth joining even if you are not a scientist, and a test site that Martin Fenner has set up for WordPress plugins. [This is a repost from the jiscPub project – please comment over there: http://jiscpub.blogs.edina.ac.uk/2011/05/10/wordpress/ ]

    Copyright Peter Sefton, 2011-05-09. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


    This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

    How to add EPUB support to EPrints

    [This is a repost from the JISCPub project – please comment over there: http://jiscpub.blogs.edina.ac.uk/2011/05/03/how-to-add-epub-support-to-eprints-8/ ]

    In a previous post here on the jiscPUB project I said it would be good for the EPrints repository software to support EPUB uploads.

    I'd love to do something with a repository – I'm thinking that it would be great to deposit theses in EPUB format – and the repository could provide a web-based reader, along the lines of IbisReader, which Liza Daly and company created. I'm looking at you, Eprints! Eprints already almost supports this: if you upload a zip file it will stash all the parts for you in a single record. All we would need would be something like this little reader my colleagues at USQ made. It would just be a matter of transforming the EPUB TOC into JSON, and loading the JavaScript into an Eprints page.

    I Called Les Carr’s attention to the post and he responded:

    lescarr @ptsefton just tell us what to do and we’ll do it.

    OK. Here goes with my specification for how EPrints could add at least basic support for EPUB.

    Putting EPUB into EPrints as-is

    To explore this, I ran the EPrints live CD (livecd_v3.1-x.iso) under VirtualBox on Windows 7. This worked well once I gave it a decent amount of memory – it didn't manage to boot in several hours at 256MB. (Note that no repositories were harmed in the making of this post – I did not change the EPrints code at all.)

    The EPUB format is a zipfile containing some XHTML payload documents, a manifest, and a table of contents. On one level EPrints already supports this, in that there is support for uploading ZIP files. I tested this using Danny Kingsley's thesis (as received, with no massaging or added metadata, apart from tweaking the title in Word) converted to EPUB via the ICE service I have been working on.
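    To make the 'zipfile with a manifest' point concrete, here is a minimal Python sketch (thesis.epub is a stand-in file name) that lists an EPUB's parts and finds the OPF manifest the way a reading system does, via the standard META-INF/container.xml pointer:

        import zipfile
        import xml.etree.ElementTree as ET

        CONTAINER_NS = '{urn:oasis:names:tc:opendocument:xmlns:container}'

        def describe_epub(path):
            # An EPUB is just a zip: list the payload, then follow the
            # standard META-INF/container.xml pointer to the OPF manifest.
            with zipfile.ZipFile(path) as z:
                for name in z.namelist():
                    print(name)
                container = ET.fromstring(z.read('META-INF/container.xml'))
                rootfile = container.find(
                    '{ns}rootfiles/{ns}rootfile'.format(ns=CONTAINER_NS))
                print('OPF manifest is at:', rootfile.get('full-path'))

        describe_epub('thesis.epub')

    Something along these lines is all a repository needs in order to treat an uploaded EPUB as a book rather than an opaque blob.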

    The procedure:

    1. Generated an EPUB using ICE.

    2. Changed the file extension to .zip.

    3. Uploaded it into EPrints.

    The result is an EPrints item with many parts. If you click on any of the HTML files that make up the thesis, they work as web pages, i.e. the table of contents (if you can find it amongst the many files) links to the other pages. But there is no navigation to tie it all together – you have to keep hitting back, because each HTML page from the EPUB is a stand-alone fragment.

    Illustration 1: The management interface in EPrints showing all the parts of an EPUB file which has been uploaded and saved as a series of parts in a single record.

    At this point I went off on a side trip, and wrote this little tool to add an HTML view to an EPUB file.

    Putting enhanced EPUB into Eprints

    Now, let's try that again with the version where I added an HTML index page to the EPUB using the new demo tool, epub2html. I uploaded the file, clicked around semi-randomly until I figured out how to see all the files listed from the zip, and selected index.html as the 'main' file. From memory I thought the repository would do that for me, but it didn't. Anyway, I ended up with this:

    Illustration 2: The details screen that users see – clicking on the description takes you to the HTML page I picked as the main file.

    Illustration 3: A rudimentary ebook reader using an inline frame.

    If I click on the link starting with 'Other', there we have it – more-or-less working navigation, within the limits of this demo-quality software. All I had to do was change the extension from .epub to .zip and select the entry page, and I had a working, navigable document.

    The initial version of epub2html used the unsupported epubjs as a web-based reader application, but Liza Daly suggested I use the more up-to-date Monocle.js library instead. I tried that, but I'm afraid the amount of setup required is too much for the moment, so what you see here is an HTML page with an inline frame for the content.

    What does the repository need to do?

    So what does the EPrints team need to do to support EPUB a bit better?

    • Add EPUB to the list of recognised files.

    • Upon recognising an EPUB…

      • Use a service like epub2html that can generate an HTML view of the EPUB. I wrote mine in Python; Eprints is written in Perl, but I'm sure that can be sorted out via a re-write or a web service or something*.

      • Allow the user to download the whole EPUB, or choose to use an online viewer. Could be static HTML, frames (not nice), or some kind of JavaScript based viewer.

      • Embed some kind of viewer in the EPrints page itself, or at least provide a back-link in the document viewer to the EPrints page. (A sketch of one building block for such a viewer – turning the EPUB TOC into JSON – follows this list.)
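    To make the viewer items above concrete, here is a rough Python sketch of the TOC-to-JSON step for a reader like the USQ one – it pulls the NCX table of contents out of the EPUB and re-serialises it (the exact JSON shape the reader expects is an assumption on my part):

        import json
        import zipfile
        import xml.etree.ElementTree as ET

        NCX = '{http://www.daisy.org/z3986/2005/ncx/}'

        def toc_to_json(epub_path):
            # Read the NCX table of contents and emit a flat list of
            # {title, href} entries for a JavaScript reader to consume.
            with zipfile.ZipFile(epub_path) as z:
                ncx_name = next(n for n in z.namelist() if n.endswith('.ncx'))
                tree = ET.fromstring(z.read(ncx_name))
            entries = []
            for nav in tree.iter(NCX + 'navPoint'):
                title = nav.find(NCX + 'navLabel/' + NCX + 'text').text
                href = nav.find(NCX + 'content').get('src')
                entries.append({'title': title, 'href': href})
            return json.dumps(entries, indent=2)

        print(toc_to_json('thesis.epub'))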

    Does that make sense, Les?

    [This is a repost from the JISCPub project – please comment over there: http://jiscpub.blogs.edina.ac.uk/2011/05/03/how-to-add-epub-support-to-eprints-8/ ]

    Copyright Peter Sefton, 2011-04-15. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

    This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.


    * Maybe there’s a Python interpreter written in Perl?

    Introducing Epub2Html – adding a plain HTML view to an EPUB

    [This was originally posted on the jiscPub blog – if you have any comments please go there.]

    Background

    EPUB ebook files are useful if you have an application to read them, but not everyone does. We have been discussing this in the Scholarly HTML movement; to some of us EPUB looks like a good general-purpose packaging format for scholarship – not just for HTML (if you can make it XHTML, that is) but potentially for other stuff that makes up a research object, such as data files or provenance information. One of the big problems, though, is that the format is still not that widely known; what is a researcher to do when they are given a file ending in .epub? That question remains unresolved at the moment, but in this post I will talk about one small step towards making EPUB potentially more useful in the general academic community.

    This week, I was looking at the potential for EPUB support in repositories, which I will cover in my next post. An EPUB is full of HTML, but it's not something that is necessarily straightforward to display on the web. jiscPUB colleague Liza Daly's company has a thing called IbisReader that serves EPUB over the web, and worked on BookWorm, parts of which are also available as open source.

    What I wanted was a bit different – I wanted to be able to add something equivalent to a README file to an EPUB, so that people could read the content, and web site or repository managers would be able to do something with it. So, I wrote a small tool – intended as a demonstrator only, and sketched in code after this list – which:

    • Generates a plain HTML table of contents.

    • Adds an index.html page to the root of an EPUB (this is legit – it gets added to the manifest as well, but not the TOC) with a simple frame-based navigation system, so if you can open the EPUB zip, you can browse it.

    • Bundles in a lightweight JavaScript viewer. Initially I tried the Paquete system from USQ, but it turned out to have a few more issues than I had hoped. For this first release I have used a bit of Liza's code from a couple of years ago, epubjs, with a couple of modifications. Status? Works for me. [Update a day later: not so good for long docs, but the point on the jiscPUB project is to show the kind of thing that can be done; we can look for other toolkits or improve this one.]
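    For the curious, the core of the index.html trick is small enough to sketch here. This is not the epub2html code itself, just a simplified illustration in Python (the real tool also registers index.html in the OPF manifest, which I skip here):

        import zipfile

        def add_index_html(epub_path, links):
            # Append a plain index.html to an existing EPUB so anyone who
            # can open the zip can browse the book without a reader.
            # links is a list of (title, href) pairs, e.g. from the NCX.
            items = '\n'.join(
                '<li><a href="%s" target="content">%s</a></li>' % (href, title)
                for title, href in links)
            html = ('<html><body><ul>%s</ul>'
                    '<iframe name="content" width="100%%" height="80%%">'
                    '</iframe></body></html>') % items
            with zipfile.ZipFile(epub_path, 'a') as z:
                z.writestr('index.html', html)

        add_index_html('thesis-html.epub', [('Chapter 1', 'chapter1.html')])

    The links open in the named inline frame, which is what gives the no-JavaScript fallback shown below.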

    Demo

    So here’s what it looks like in real life, warts and all.

    I used the test file I was working on earlier in the week with embedded metadata.

    Illustration 1: Test epub from Edinburgh thesis template, with added metadata, in Adobe Digital Editions

    I ran the new code:

    python epub2html.py Edinburgh-ThesisSingleSided-plus-inline-metadata.epub

    Which made a new file. (It does make epubcheck complain, but that's mostly to do with HTML attributes it doesn't like, not EPUB structural problems.)

    Edinburgh-ThesisSingleSided-plus-inline-metadata-html.epub

    Now, if I unzip it there is an index.html, and some JavaScript from epubjs. In Firefox that looks like this.

    Illustration 2: HTML view of the EPUB being served from the file system, using epubjs for navigation

    But, if the JavaScript is not working, then you can still see the content courtesy of the less-than-ideal inline frame:

    Illustration 3: Fall-back to plain HTML with no JavaScript – the index.html file has an inline frame for the EPUB content. Not elegant, but lets the content be seen.

    Trying it out / the future

    If you want to try this out, or help out, you can get the tool from Google code.

    svn co https://integrated-content-environment.googlecode.com/svn/branches/temp-2011/epub2html

    There are lots of things to do, like adding command-line options for output files, extracting the EPUB+HTML for immediate use (after safety-checking it), and choosing whether to bundle the JavaScript in the EPUB or link to it via the web. Does anyone want this? Let us know.

    One of the things I like about Paquete is that it generates # URLs for the different pages you view, making it possible to bookmark chapters, like this: http://demo.adfi.usq.edu.au/paquete/demo/#configuration.htm. I will explore whether this can be added to epubjs, or whether it is worth pressing on with Paquete, which does have some more options, like navigation buttons and a tree-widget for the table of contents.

    Like I said, I did this as part of the notes I was putting together on how repositories might support EPUB and maybe, finally, start serving real web content rather than exclusively PDF – more on that soon.

    This approach might also help us add previews to web services, so people can see their content in ereader-mode – something I know David Flanders, the JISC manager on this project, is keen on.

    And finally, something like this approach might be part of a tool-chain that could help people break long documents into parts, package them in EPUB, and upload them to services like http://digress.it which want things broken up into parts.

    [This was originally posted on the jiscPub blog – if you have any comments please go there.]

    Copyright Peter Sefton, 2011-04-14. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


    This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

    Some questions about EPUB, WordPress, tools

    [This is a repost from the jiscPUB project – please comment, but do so over there]

    I have a couple of questions for discussion in this jiscPUB project – please, any and all of you, use the comments!

    If you publish EPUBs now, what tools do you use?

    I asked jiscPUB team member Liza Daly via email what she uses to make EPUBs, and she said asciidoc.

    Asciidoc lets you create documents in your text editor of choice, using one of a family of lightweight wiki-style text formatting languages. Unlike wiki formats, though, asciidoc is designed to create richly structured documents, as discussed on this page. This post from an O'Reilly author explains how it works to create multiple output files (the basic invocations are sketched after the list below). I'll do a post on how these tools work with EPUB.

    Now, I am interested in who uses what:

    • Anyone else use asciidoc?

    • Are there pandoc users reading this? Bruce D'Arcus, have you made EPUB? I tried, but it does not support intra-document links.

    • Are some of you hand-crafting HTML like Mark Pilgrim, then feeding it through something like Calibre?

    • Anyone use their word processor to make HTML and get EPUB from that?

    (And just on the off chance, has anyone done a pandoc/markdown to asciidoc converter?)
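    For anyone wanting to try these, the basic invocations are along the following lines, as far as I know (check your installed versions; the file names are stand-ins):

        a2x -f epub mydocument.txt
        pandoc mydocument.md -o mydocument.epub

    a2x is the driver script in the asciidoc toolchain; pandoc picks its EPUB writer from the output file extension.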

    What’s considered best practice for EPUBs?

    I have been making EPUBs by feeding things through various processors. Different tools apply different levels of styling by default.

    What’s best practice, in terms of what level of CSS styling to put in and so on? The top hit I got on Google for this was an Adobe page from 2008 that didn’t actually tell me anything useful.

    I think that when we’re talking about word processing documents being transformed for the web what often works best is to have consistent styling for headings and plain paragraphs but authors do need some control over what goes on in tables, for example. This will require some figuring out for EPUB I know the team at USQ had problems with large and complex tables in their testing with USQ courseware, mainly using iOS devices.

    JISC project people: What do you have to do to get your reports up in JISCPress?

    JISCPress is a site where a variety of project output documents can be annotated by the community. It uses the digress.it comment system to allow paragraph-level annotation. It says on the site: "We are currently operating JISCPress on a trial basis, with a view to making it a fully fledged JISC service if the trial goes well."

    I wondered if anyone reading this has used it, and what the experience of contributing to it is like. This is relevant both to this project and to potential future explorations of how something like JISCPress might work in an environment where some people might be commenting on documents using ebook reader software and some using the plain old web, with some way of aggregating both.

    When I called for sample documents for this project, Owen Stephens (@ostephens) sent me a test document; I am still working on making a nice EPUB out of it, fiddling with the tool as I go. He tells me it was 'converted by hand' to go on this site, which is not quite like JISCPress but does allow comments.

    Anyway, I am wondering:

    • How much effort are people putting in to getting JISC project outcome documents on the web?

    • I know there are templates for JISC reports, which seem pretty light and simple, but what about JISC deliverables, like toolkit documents etc.?

    • Assuming most of this kind of output is written in Word or other word processors, would people be interested in a template (and tools) that had:

      • Embedded metadata that could be used by machines to process documents.

      • A way to preview your work quickly and easily to make sure that the final output is going to be OK?

      • Enough styling cues to create good web pages, maybe ebooks via automated uploads.

        There’s a trade-off here between having something that’s easy for authors to use, like treating the word processor like a typewriter (which is usually more costly in the long run) and getting people to invest in learning tools.

    Comments?

    [This is a repost from the jiscPUB project – please comment, but do so over there]

    Copyright Peter Sefton, 2011-04-12. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


    This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

    Metadata in word processing monographs

    [This is a repost of a document I posted to the jiscPub blog – posting here as well to reach more people, but please use the comments over there.]

    Introduction – why worry about metadata?

    I have been working on a simple service to take word processing documents – Word and OpenOffice.org mainly – and create mobile-readable EPUBs from them. One of the issues in this process is metadata: how do we get quality metadata into the EPUB format?

    EPUB readers, like music applications, use metadata to provide browse and search access to content.

    Illustration 1: Calibre's metadata-driven management interface

    Obviously, for books to be useful to readers, and to store-owners, publishers and repositories, metadata is an issue.

    But it’s not just for ebook delivery that this is an issue. A thesis has to be submitted for examination, and sent to an institutional repository, and maybe to a discipline repository or a publisher. And papers are often submitted to multiple sites over their lives conference management systems, journal management systems, repositories and so on. The current state of scholarship is that every time you make such a submission you have to re-enter metadata. Upload a paper to a conference site, and chances are you will have to enter the author names into a form, even if they are already on the paper. Not to mention that every time you type in a name, you are generating low-quality string-based non-linked data. Some of us think there is a slow revolution happening in metadata, using URIs and making links.

    So one of the things I would like to consider for this project is how to embed metadata within documents so that the various applications that process them can do all the hard work. And I want to think not just about strings but high-quality linked-data metadata. To discuss this I will work through one of the use cases for the jiscPUB project and look at the life-cycle of a thesis.

    Thesis workflow

    The aspect of workflow we’re interested in here is that:

    1. If the candidate is lucky, the university or supervisor provides them with a template for writing up their thesis.

    2. The candidate writes up the thesis and sends it to their supervisor and possibly other reviewers during this process.

    3. Depending on the quality of the template, there is work to do for submission: generating tables of contents, making PDF files and maybe, probably, in future, making web and mobile-ready versions.

    4. Someone deposits the thesis file into (at least) the repository at the university, maybe also other databases, entering metadata about it who knows how many times.

    5. Also, in the future, making sure all the provenance for all claims is available via data that is linked to or bundled with the thesis. (Out of scope for this post, but I will come back to it.)

    In this post I am going to look at 1–4 above, considering how template design might aid in preparing a thesis for mobile delivery. I've been thinking for a few years now that the university should not just provide a template but pre-fill as much of it as possible with machine-readable metadata. And note that there's probably a much more compelling case for machine-readable metadata in articles, which tend to be submitted to more places.

    Thesis metadata

    The University of Edinburgh, host of this jiscPUB project via EDINA, has a Word template for PhD theses on its wiki. I showed in the last post that if you feed that template, sans any content, through the experimental Word-to-EPUB converter I've been working on, then it more or less works, but without very much metadata (it was also dropping heading numbering, which I have now, sort-of, kinda, fixed).

    To add the metadata that should be in the EPUB you would have to type it in somewhere. Either I could add fields to the conversion service, or you could use something like Calibre, but the thing is, most of the metadata you need is in the document – it's just not marked as such. The title page has the Title (in AUTHOR style), the author's name, and the name of the institution, degree and date in the footer.

    Illustration 2: Thesis metadata is there – in the text, just not marked as such

    So it should be possible, given that this metadata is all there, to mark it up in such a way that downstream processing systems can recognise it. One of the best places to start is with the document metadata fields. The Edinburgh template does use document metadata for the title.

    Illustration 3: Document metadata in Word 2010

    But it could go one step further, and instead of requiring the author to enter the same thing in two places, use a field to show the title on the title page. In Word 2010 the field function is hiding in the Ribbon under Insert, Quick Parts.

    Illustration 4: Adding a field so the title entered in the document metadata can be placed on the title page without re-typing.

    Now the title is linked to the document properties, and any application, such as a search engine, can extract that metadata. But there is a cost – you have to be able to explain to your authors that they need to set the title in the properties, and how to do it, for the different word processing applications they're using.
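    As an aside, 'any application can extract that metadata' is not hand-waving – the document properties of a .docx file are plain Dublin Core inside the zip. A minimal Python sketch (thesis.docx is a stand-in name):

        import zipfile
        import xml.etree.ElementTree as ET

        DC = '{http://purl.org/dc/elements/1.1/}'

        def docx_metadata(path):
            # Word's document properties live in docProps/core.xml inside
            # the .docx zip; title and creator are Dublin Core elements.
            with zipfile.ZipFile(path) as z:
                core = ET.fromstring(z.read('docProps/core.xml'))
            title = core.find(DC + 'title')
            creator = core.find(DC + 'creator')
            return {'title': getattr(title, 'text', None),
                    'creator': getattr(creator, 'text', None)}

        print(docx_metadata('thesis.docx'))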

    The same thing works for the author field as well. That's OK for theses, but it is less useful for other kinds of scholarly content where there are often many authors. Word 2010 supports multiple authors in its metadata, but the fields don't – all you can get using a field is a semicolon-separated list of authors, which is not useful for laying out the content. An approach I think is useful for scholarly templates in general is to embed the metadata in-line.

    Some colleagues and I wrote up some of the approaches for embedding inline metadata for the International Journal of Digital Curation [1]. The short version of that is that the most reliable cross-platform way of adding semantics like metadata in-line to documents is to use styles, or a newer technique I have been developing since that work, using links. Both styles and links are supported by major word processors, so they tend to survive being loaded into different word processors or different versions of the same word processor. I will give examples of both approaches here.

    Styles are fiddly to apply if you are expecting people to manage the process for themselves, but in the case of a template like this one for theses they should be robust enough – thesis candidates are not going to be changing the title page except to fill in their details. Even better, why doesn't the university do it for the candidate? I'll come back to this idea. Using tables for metadata, like the one at the top of this document, is also a reliable approach – the metadata can be identified using style, or just text in a cell adjacent to each metadata item.

    So to demonstrate the use of styles for metadata in the Edinburgh thesis template, I:

    1. Used style p-meta-author instead of AUTHOR so the ICE conversion system would recognise it.

      Illustration 5: Applying the style p-meta-author to the author name in the template. This dialogue box is a bit hard to find – good luck.

    2. Added an inline/character style for the date i-date. [TODO: get this working or remove from post]

      Illustration 6: The inline style for the date, i-meta-date. It has no special formatting.

    Getting both of these to work required a bit of hacking on ICE itself, as this metadata handling was only partially implemented.

    The result is that both author and date are now included in the metadata for the EPUB file.

    There is a problem with this approach, though, in that it is not giving us very high quality metadata in a linked-data sense. The author name is just a string, which as we know is not a good way to uniquely identify an author. More than one person might be identified by a string, and more than one string often identifies an author [2]. It would be much better if we could give the author an HTTP URI – that is, to name them using a URL that will be stable and unambiguous whether they are called 'Name of Author' or 'Author, N' or they change their name to 'Nom de Plume', which might occur as a string like 'de Plume, N' or many other variants.

    There’s a big project coming, ORCID, which will aim to give researchers URIs, but an university could easily give each candidate a URI now, and match up with ORCID later.

    I have included a demonstration of how to identify a party – the publisher – using a URI. Here's a walk-through of a possible technique for including URIs for metadata in a template. Remember, only the template designer has to do this, not the poor candidate. And if we wanted to use this technique for personal names we could automate it and use a university-assigned URI for each candidate:

    1. I chose a URI for the university: http://www.ed.ac.uk/. Just using that as a link does not amount to metadata, though. Instead, I:

    2. Visited http://www.ed.ac.uk/ – which redirects to http://www.ed.ac.uk/home

    3. Clicked my Publisherize.me bookmarklet.

    4. Copied the resulting link, which encodes an RDF statement/triple, and wrapped it around the text in the template.

      http://ontologize.me/?tl_p=http://purl.org/dc/terms/publisher&triplink=http://purl.org/triplink/v/0.1&tl_o=http://www.ed.ac.uk/home

    5. Now, when documents using that template are fed through ICE, including the word-processing-to-EPUB service I have been prototyping, ICE recognises the metadata and extracts it into a data structure so it can be passed on to Calibre, which makes the EPUB.

      ebook-convert … --title "Title of Thesis" --authors "Author-name" --publisher "The University of Edinburgh (http://www.ed.ac.uk/home)" --pubdate "2011-05-01"

      But wait, there’s more! ICE also embeds the metadata in the HTML it produces, like so (I did edit out some cruft that it should not be producing):

      <span rel="http://purl.org/dc/elements/1.1/publisher" resource="http://www.ed.ac.uk/home">

      <span property="http://xmlns.com/foaf/0.1/name" resource="http://www.ed.ac.uk/home">

      <a href="http://ontologize.me/?tl_p=http://purl.org/dc/terms/publisher&amp;triplink=http://purl.org/triplink/v/0.1&amp;tl_o=http://www.ed.ac.uk/home">The University of Edinburgh

      </a>

      </span></span>

      This is intended to be compatible with RDFa 1.1, and this approach for embedding metadata in scholarly documents is one of the approaches we're promoting in the nascent Scholarly HTML movement. (A tiny consumer sketch follows.)
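    Here is that consumer sketch: a tool that wants the publisher statement out of the HTML above only has to look for elements carrying both rel and resource attributes. Minimal Python, fed with a trimmed version of the snippet:

        from html.parser import HTMLParser

        class TripleScanner(HTMLParser):
            # Collect (rel, resource) pairs from RDFa-style attributes.
            def __init__(self):
                super().__init__()
                self.pairs = []

            def handle_starttag(self, tag, attrs):
                a = dict(attrs)
                if 'rel' in a and 'resource' in a:
                    self.pairs.append((a['rel'], a['resource']))

        html = ('<span rel="http://purl.org/dc/elements/1.1/publisher" '
                'resource="http://www.ed.ac.uk/home">'
                'The University of Edinburgh</span>')
        scanner = TripleScanner()
        scanner.feed(html)
        print(scanner.pairs)

    A real RDFa processor does far more, of course; the point is that the metadata sits in ordinary attributes that any tool can reach.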

    Summary

    In this post I have looked at three ways to embed metadata in a word processing document, so that when people use the template, the metadata they – or the template designer – enter can be machine-processed from then on.

    1. Using the metadata fields in the document: good for very basic metadata like titles, but limited and not particularly interoperable for other kinds of metadata.

    2. Using styles: flexible but fragile, and requires that each processing system knows about the styles you are using.

    3. Using my proposed way of making linked-data metadata statements encoded in links – triplinks – as seen on my demo site: http://ontologize.me. This is potentially quite robust, and could be supported by tool-chains that are much easier to use than the current half-baked infrastructure provided by yours truly. (A sketch of decoding one of these links follows this list.)
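    And here is how a consuming tool might unpack one of those triplink URLs into a triple (a sketch only – the subject would come from whatever document carries the link):

        import urllib.parse

        def decode_triplink(subject, link):
            # The predicate is in the tl_p query parameter, the object
            # in tl_o; the subject is the linking document itself.
            q = urllib.parse.parse_qs(urllib.parse.urlparse(link).query)
            return (subject, q['tl_p'][0], q['tl_o'][0])

        link = ('http://ontologize.me/?tl_p=http://purl.org/dc/terms/publisher'
                '&triplink=http://purl.org/triplink/v/0.1'
                '&tl_o=http://www.ed.ac.uk/home')
        print(decode_triplink('my-thesis.html', link))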

    Here’s a final screenshot showing how the embedded metadata has made its way from the sample template using those three methods to the EPUB metadata, as seen in the Firefox EPUB plugin.

    Illustration 7: Metadata from the thesis template demo, in the Firefox EPUB plugin.

    All three of these require that software systems know how to find and process metadata – what we're trying to achieve over at the Scholarly HTML site (when I get time to add pages on conventions for encoding metadata) is to document common ways of doing this, so that tool-builders can create interoperable systems.

    To try this out for yourself:

    1. Go here in your browser: http://ec2-50-16-170-243.compute-1.amazonaws.com/api/convert/doc

    2. Either:

    The default check-boxes at that service will make you an EPUB. If you don't have an EPUB reader you can change the file extension to .zip, open it up and have a look. If you do, you'll see something like this:

    Illustration 9: Test thesis template in Adobe Digital Editions – note the title and author have been automatically extracted from the Word document.

    Where now?

    There’s potential here to test some of this stuff out with the folks who support thesis candidates and their supervisors, or in journal templates.

    • I will keep working on the Edinburgh template to show how we might add to it in ways that increase the utility of the documents it produces, by making it easier to build ebooks. My thinking is to provide demos of what can be done for Word and OpenOffice.org/LibreOffice, both using generic styles and, for people prepared to invest a little more time, using the ICE styles.

    • I’d love to do something with a repository I’m thinking that it would be great to deposit theses in EPUB format and the repository could provided a web-based reader, along the lines of IbisReader, which Liza Daly and company created. I’m looking at you, Eprints! Eprints already almost supports this, if you upload a zip file it will stash all the parts for you in a single record. All we would need would be something like this little reader my colleagues at USQ made. It would just be a matter of transforming the EPUB TOC into JSON, and loading the JavaScript into an Eprints page.

    • There are improvements to be made to ICE – currently the style-based metadata does not produce Scholarly HTML / RDFa output, and it lives in a separate part of the code from the link-based metadata; these could be brought together.

    • Is it worth adding Scholarly HTML / RDFa metadata support to Calibre so it can auto-detect metadata in HTML input?

    Longer term I would like to see:

    • A properly resourced end-to-end thesis project, looking at how an institution could provide technical resources to candidates and supervisors, from templates to a content, data and annotation management system. I will be showing a demo service for some of this later in the project, but at the moment the demos are just toys – we need some real users and some institutional commitment to trying this stuff out.

    • A journal and conference paper service where authors can write once and then submit to multiple journals. This idea comes from Timo Hannay, who I met when I was in the UK – he's worked with Nature, where there is a 95%-ish rejection rate, so a service that could automatically re-work your document and submit it for you would be really useful. It also sounds a bit like the Repository Junction project that Theo Andrew is involved in.

    1. Sefton P, Barnes I, Ward R, Downing J. Embedding Metadata and Other Semantics in Word Processing Documents. International Journal of Digital Curation. 2009;4(2). Available at: http://www.ijdc.net/index.php/ijdc/article/view/121. Accessed October 22, 2009.

    2. Salo D. Name Authority Control in Institutional Repositories. Cataloging & Classification Quarterly. 2009;47(3–4):249–261. Accessed September 9, 2009.

    [This is a repost of a document I posted to the jiscPub blog – posting here as well to reach more people, but please use the comments over there.]

    Copyright Peter Sefton, 2011. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


    This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.