
Notes on ownCloud robustness

I’m on my way to a meeting at Intersect about the next phase of the Cr8it data packaging and publishing project. Cr8it is an ownCloud plugin, and ownCloud is widely regarded as THE open source dropbox-like service, but it is not without its problems.

Dropbox has been a huge hit, a killer app with what I call powers to "Share, Sync & See". Syncing between devices, including mobile (where it's not really syncing), is what made Dropbox so pervasive, giving us a distributed file-system with almost frictionless sharing via emailed requests and easy signup for new users. The 'see' part refers to the fact that you can look at your stuff via the web too. And there is a growing ecosystem of apps that can use Dropbox as an underlying distributed filesystem.

ownCloud is (amongst other things) an open source alternative to Dropbox.com's file-sync service. A number of institutions and service providers in the academic world are now looking at it because it promises some of the killer-app qualities of Dropbox in an open source form, meaning that, if all goes well, it can be used to manage research data, on local or cloud infrastructure, at scale, with the ease of use and virality of Dropbox. If all goes well.

There are a few reasons Dropbox and other commercial services are not great for a university:

  • We need to be able to control where data are stored and have the flexibility to bring data close to large facilities. This is why CERN have the largest ownCloud test lab in the world, so I've heard.

  • It is important to be able to write applications such as Cr8it without being beholden to a company like Dropbox.com, Apple, Google or Microsoft, who can approve or deny access to their APIs at their pleasure, and can change or drop the underlying product. (Google seem to pose a particular risk in this department; they play fast and loose with products like Google Docs, dumping features when it suits them.)

But ownCloud has some problems. The ownCloud forum is full of people saying, "tried this out for my company/workgroup/school. Showed promise but there's too many bugs. Bye." At UWS eResearch we have been using it more or less successfully for several months, and have experienced some fairly major issues to do with case-sensitivity and other incompatibilities between various file systems on Windows, OS X and Linux.

From my point of view as an eResearch manager, I’d like to see the emphasis at ownCloud be on getting the core share-sync-see stuff working, and then on getting a framework in place to support plugins in a robust way.

What I don’t want to see is more of this:

Last week, the first version of OwnCloud Documents was released as a part of OwnCloud 6. This incorporates a subset of editing features from the upstream WebODF project that is considered stable and well-tested enough for collaborative editing.

We tried this editor at eResearch UWS as a shared scratchpad in a strategy session and it was a complete disaster: our browsers kept losing contact with the document, and when we tried to copy-paste the text to safety it turned out that copying text is not supported. In the end we had to rescue our content by copying HTML out of the browser and stripping out the tags.

In my opinion, ownCloud is not going to reach its potential while the focus remains on getting shiny new stuff out all the time. Far from making ownCloud shine, every broken app like this editor tarnishes its reputation substantially. By all means release these things for people to play with, but the ownCloud team needs to have a very hard think about what they mean by "stable and well tested".

Along with others I’ve talked to in eResearch, I’d like to see work at owncloud.com focus on:

  • Define sync behaviour in detail, complete with automated tests, and have a community-wide push to get the ongoing sync problems sorted. For example, fix this bug reported by a former member of my team, along with several others to do with differences between file systems.

  • Create a standard way to generate and store file derivatives, such as image thumbnails or HTML document previews, as well as additional file metadata. At the moment plugins are left to their own devices, so there is no way for apps to reliably access each other's data. I have put together a simple alpha-quality framework for generating web-views of things via the file system, Of the Web, but I'd really like to be able to hook it into ownCloud properly.

  • Get the search onto a single index rather than the current approach of having an index per user. Something like Elasticsearch, Solr or Lucene could easily handle a single metadata-and-text index with information about sharing, with changes to files on the server fed to the indexer as they happen (see the sketch after this list).

  • [Update 2014-04-11] Get the sync client to handle connecting to multiple ownCloud servers. In academia we will definitely have researchers wanting to use more than one service, e.g. AARNet's Cloudstor+ and an institutional ownCloud. (Not to mention proper Dropbox integration.)
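
To make the single-index idea concrete, here is a minimal sketch using the official Elasticsearch Python client (recent versions). The index name, field names and the idea of a server-side change hook are my assumptions for illustration, not anything ownCloud provides today:

# Sketch only: one shared index for all users, with sharing information
# stored alongside the extracted text so it can be filtered at query time.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def on_file_changed(file_id, path, owner, shared_with, text):
    # Imagined server-side hook, fired whenever a file is created or updated
    es.index(index="owncloud", id=file_id, document={
        "path": path,
        "owner": owner,
        "shared_with": shared_with,   # list of users/groups with access
        "text": text,                 # extracted full text
    })

def search(user, query_text):
    # One query over one index; sharing is just another filter
    return es.search(index="owncloud", query={
        "bool": {
            "must": {"match": {"text": query_text}},
            "filter": {"terms": {"shared_with": [user]}},
        }
    })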

Creative Commons License
Notes on ownCloud robustness by Peter Sefton is licensed under a Creative Commons Attribution 4.0 International License

Round table on vocabularies for describing research data: where’s my semantic web?

[UPDATE: Fixed some formatting]

Creative Commons Licence
Round table on vocabularies for describing research data: where’s my semantic web? by Peter Sefton is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Summary: in this post I talk about an experimental semantic website for describing what I'm calling 'research context', wondering if such a site can be used as a 'source of truth' for metadata entry, for example when someone is uploading a file into a research data repository. The post assumes some knowledge of linked data and RDF and/or an interest in eResearch software architecture.

Thanks to twitter correspondents Jodi Schneider, Kristi Holmes and Christopher Gutteridge.

On Friday 7th September I attended a meeting at Intersect about metadata vocabularies for managing research data, in the context of projects sponsored by the Australian National Data Service (ANDS). Ingrid Mason asked me to talk about my experiences describing research data. I approached this by starting with a run-through of the talk Peter Bugeia and I put together for Open Repositories with an emphasis on our attempts to use Linked Data principles for metadata. In this work we encountered two big problems, which I brought to the round-table session as questions.

  1. It's really hard to work out which ontology, or set of vocabulary terms, to use to describe research context. Take 'experiment': what is a good linked data term for that?

    Q. What to use as a URI for an experiment?

  2. In trying to build linked-data systems I have not found any easy-to-use tools. (I got lots of useful leads from Kristi Holmes and Jodi Schneider on Twitter; more on that below.)

    Q. Where’s my semantic web!

Answers at the end of the post, but you have to read the whole thing anyway.

The problem I’m working on at the moment with colleagues at the University of Western Sydney is how we can provide a framework for metadata about research data. We’re after efficient interfaces for researchers to contextualise research data sets, across lots of different research domains where the research context looks quite different.

For example, take the HIEv system at the Hawkesbury Institute for the Environment (HIE). HIEv is basically a file repository for research data files. It has information about each file (size, type, date range etc) and contextual metadata about the research context, in this case using a two-part hierarchy: Facility / Experiment, where facilities are associated with multiple experiments and files are associated with experiments. Associating a data file with research context is easy in HIEv because it's built in to the system. A human or machine uploading a data file associates it with an experiment using a form or a JSON data structure, respectively. The framework for describing research context is built in to the application, and the data lives in its internal database.
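
For illustration, a machine upload of that kind might look something like the sketch below. The endpoint, field names and token handling are hypothetical stand-ins, not HIEv's actual API:

# Hypothetical sketch: attach a data file to an experiment in a
# HIEv-like repository. Endpoint and parameter names are invented.
import requests

API = "https://hiev.example.edu/api/files"        # placeholder URL

metadata = {
    "experiment_id": 42,                          # the research-context link
    "start_time": "2013-09-01",
    "end_time": "2013-09-07",
    "description": "Glasshouse CO2 sensor readings",
}

with open("GHS30_EXP1_PROJ2_CO2_RAW_20130901-20130907.csv", "rb") as f:
    response = requests.post(API,
                             params={"auth_token": "SECRET"},  # placeholder
                             data=metadata,
                             files={"file": f})
response.raise_for_status()
print(response.json())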

This approach works well, until:

  1. We try to re-use the software behind HIEv in another context, maybe one where the research domain does not centre on facilities, or experiment is not quite the right concept, or the model needs to be further elaborated.

    Example: In the MyTardis project, a development team added an extra element to that package’s research hierarchy – porting the application to new domains means substantial rework. See this message on their mailing list.

  2. We want to re-use the same contextual descriptions to describe research data in another system where we are faced with either programming a whole new framework for the same context, or adding a new interface for our new system to talk to the research context framework in the old one.

    Example: At HIE, with the help of some computing students, Gerry Devine and I are exploring the use of ownCloud (the dropbox-like Share/Sync/See application) to manage working files, with a simple forms interface to add them to HIEv. As it stands the students have to replicate the Facility / Experiment data in their system, meaning they are hard-coding Facility / Experiment hierarchies into HTML forms.


Gerry Devine and I have been sketching an architecture designed to help out in both of these situations. The idea is to break out the description of the research context into a well-structured application. This temporary site of Gerry's shows what it might look like in one aspect: a web site which describes stuff at HIE; facilities and their locations, experiments taking place at those facilities, and projects. The question we're exploring is: can we maintain a description of the HIE research context in one place, such as an institute research site or wiki, and have our various data-management applications use that context, rather than having to build the same research-context framework into each app and populate it with lists of values? Using a human-readable website as the core home for research context information is appealing because it solves another problem: getting some much-needed documentation on the research happening at our organisation online.

Here’s an interaction diagram showing what might transpire when a researcher wants to use a file management application, such as ownCloud (app) to upload some data to HIEv, the working data repository at the institute:


We don't have much of this implemented, but last week I had a play with the research context website part of the picture (the system labelled 'web' in the above diagram). I wanted to see if I could create a web site like the one Gerry made, but with added semantics, so that when an application such as an ownCloud plugin asks 'gimme research context', the site can return a list of facilities, experiments and projects in machine-readable form.

For a real institute or organisation-wide research context management app, you'd want to have an easy-to-use point-and-click interface, but for the purposes of this experiment I decided to go with one of the many markdown-to-HTML tools. See this page, which summarises why you'd want to use one and lists an A-Z of alternatives. This is the way many of the cool kids make their sites these days – they maintain pages as markdown text files, kept under version control, and run a script to spit out a static website. Probably the best-known of these is Jekyll, which is built in to GitHub. I chose Poole because it's Python, a language in which I can get by, and it is super-simple, and this is after all just an experiment.

So, here's what a page looks like in Markdown. The top part of the file, up to the '---' line, is metadata which can be used to lay out the page in a consistent way. Below the line is structured markup. # means 'Heading level 1' (h1), ## is 'h2', and so on.

title: Glasshouse S30
long: 150.7465
lat:  -33.6112
typeOf: @facility
full_name: Glasshouse facility at UWS Hawkesbury building S30
code: GHS30
description: Glasshouse in the S-precinct of the University of Western Sydney, Hawkesbury Campus, containing eight naturally lit and temperature-controlled compartments (3 x 5 x 3.5m, width x length x height). This glasshouse is widely used for short-term projects, often with a duration of 2-3 months. Air temperature is measured and controlled by an automated system at user-defined targets (+/- 4 degrees C) within each compartment. The concentration of atmospheric carbon dioxide is controlled within each compartment using a system of infrared gas analyzers and carbon dioxide injectors. Supplementary lighting will be installed in 2013.
---

Contact: Renee Smith (technician, R.Smith@uws.edu.au), John Drake (Post-doc, je.drake@uws.edu.au), Mike Aspinwall (Post-doc, m.aspinwall@uws.edu.au).

# References: 

Smith, R. A., J. D. Lewis, O. Ghannoum, and D. T. Tissue. 2012. Leaf structural responses to pre-industrial, current and elevated atmospheric CO2 and temperature affect leaf function in Eucalyptus sideroxylon. Functional Plant Biology 39:285-296.

Ghannoum, O., N. G. Phillips, J. P. Conroy, R. A. Smith, R. D. Attard, R. Woodfield, B. A. Logan, J. D. Lewis, and D. T. Tissue. 2010. Exposure to preindustrial, current and future atmospheric CO2 and temperature differentially affects growth and photosynthesis in Eucalyptus. Global Change Biology 16:303-319.



# Data organisation overview

There have been a large number of relatively short-duration experiments in the Glasshouse S30 facility, often with multiple nested projects within each experiment.  The file naming convention captures this hierarchy.



# File Naming Convention

Convention: GHS30_<EXPERIMENT>_<PROJECT>_<VARIABLE COLLECTION CODE>_<DATA PROCESSING>_<DATE or DATERANGE>[_<VERSION>].<filetype>

The resulting HTML looks like this:

But wait, there's more! Inside the human-readable HTML page is some machine-readable code to say what this page is about, using linked-data principles. The best way I have been able to work out how to describe a facility is using the Eagle-I ontology, where I think the appropriate term for what HIE calls a facility is 'core-laboratory'. You can browse the ontology and tell me if I'm right. This says that the glasshouse facility is a type of core-laboratory.

<section
resource="http://uws.edu.au/facilities/glasshouse-s30.html"
typeof="http://vivoweb.org/ontology/core#CoreLaboratory">

<h1 property="dc:title">Glasshouse facility at UWS Hawkesbury building S30</h1>

(I'm not an RDF expert, so if I have this wrong somebody please tell me! And yes, I know there are issues to consider here: what URIs should we use for naming facilities and other contextual things? Should we use Handles? PURLs? Plain old URLs like the one above?)


The code that produced this snippet is really simple, but I did have to code it:


def hook_postconvert_semantics():
    # Wrap each typed page in an RDFa section pointing at its research-context URI
    for p in pages:
        if p.typeOf is not None:
            p.html = ("\n\n<section resource='http://hie.uws.edu.au/research-context/%s' "
                      "typeof='%s'>\n\n%s\n\n</section>\n\n" % (p.url, types[p.typeOf], p.html))

Now, the part that I’m quite excited about is that if you point an RDFa distiller at this you get the following. This is JSON-LD format which is (sort of) RDF wrapped up in JSON. Part time programmers like me often find RDF difficult to deal with, but everyone loves JSON, you can slurp it up into a variable in your language of choice and access the data using native idioms.

{
    "@context": {
        "dcterms": "http://purl.org/dc/terms/"
    }, 
    "@graph": [
      
        {
            "@id": "facilities/glasshouse-s30.html", 
            "@type": "http://vivoweb.org/ontology/core#CoreLaboratory", 
            "http://www.w3.org/2003/01/geo/wgs84_pos#long": {
                "@value": "150.7465", 
                "@language": "en"
            }, 
            "dcterms:title": {
                "@value": "Glasshouse facility at UWS Hawkesbury building S30", 
                "@language": "en"
            }, 
            "http://www.w3.org/2003/01/geo/wgs84_pos#lat": {
                "@value": "-33.6112", 
                "@language": "en"
            }
        }
    ]
}

That might look horrible to some, but it should be easy for our third-year comp-sci students to deal with. Iterate over the items in the @graph array, find those where @type is equal to "http://vivoweb.org/ontology/core#CoreLaboratory", get the title, and build a drop-down list for the user to associate their data file with this facility (using the ID). This potentially lets us de-couple our file management app from our HIEv repository and from our Research Data repository, and let them all share the same 'source of truth' about research context. In library terms, my hacked-up version of Gerry's website is acting as a name-authority for entities in the HIE space.
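
That recipe is only a few lines of Python, working directly on the distilled JSON-LD shown above (a sketch; the file name is just whatever you saved the distiller output as):

# Sketch: build a (facility id, title) list from the distilled JSON-LD,
# e.g. to populate a drop-down in a file uploader.
import json

CORE_LAB = "http://vivoweb.org/ontology/core#CoreLaboratory"

with open("research-context.json") as f:    # saved distiller output
    context = json.load(f)

facilities = []
for item in context.get("@graph", []):
    types = item.get("@type", [])
    if isinstance(types, str):              # @type may be a string or a list
        types = [types]
    if CORE_LAB in types:
        title = item.get("dcterms:title", {}).get("@value", item["@id"])
        facilities.append((item["@id"], title))

for facility_id, title in facilities:
    print(facility_id, "-", title)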

There is a lot more to cover here, including how experiments are associated with facilities, and how, when a user publishes a data set from HIEv, a file can be linked to a facility/experiment combination using the relation "wasGeneratedBy" from the World Wide Web Consortium's PROV (provenance) ontology.
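
As a taste of what that link might look like in RDF terms, here is a minimal sketch using rdflib; the file and experiment URIs are invented examples:

# Sketch: record that a published data file was generated by an experiment,
# using prov:wasGeneratedBy. The URIs below are invented for illustration.
from rdflib import Graph, Namespace, URIRef

PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
data_file = URIRef("http://hie.uws.edu.au/data/GHS30_EXP1_CO2_RAW_2013.csv")
experiment = URIRef("http://hie.uws.edu.au/research-context/experiments/exp1")

g.add((data_file, PROV.wasGeneratedBy, experiment))
print(g.serialize(format="turtle"))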

As I noted above, the markdown-based approach is not going to work for some user communities. What is needed to support this general design pattern, assuming that one would want to, is some kind of combination of a research-context database application and a web content management system (CMS). A few people, including Jodi Schneider, suggested I look at Drupal, the open source CMS. Drupal does 'do' RDF, but not without some serious configuration.

Jodi also pointed me to VIVO, which is used for describing research networks, usually focussing on people more than on infrastructure or context. I remember from a few years ago a presentation from one of the VIVO people that said very explicitly that VIVO was not designed to be a source of primary data, so I wondered whether it was appropriate to even consider it as a place to enter, rather than index and display, data. The VIVO wiki says it is possible, but building a site with the same kind of content as Gerry's would be a lot of work, just as it would be in Drupal.

Oh, and those answers? Well, thanks to Arif Shaonn from the University of New South Wales, I know that http://www.w3.org/ns/prov#Activity is probably a good general type for experiments (no, I'm not going to define an ontology of my own, I already have enough pets).

And where’s my semantic web? Well, I think we may need to build a little more proof-of-concept infrastructure to see if the idea of a research-context CMS acting as a source of truth for metadata makes sense, and if so, make the case for building it as part of future eResearch data-management apps.

My dodgy code, including the input and output files for a small part of Gerry's website, is on GitHub; to run it you'll need to install Poole first.

Trip report: Open Repositories 2013

Creative Commons License
Trip report: Open Repositories 2013 by Peter Sefton is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

From July 8th to July 12th I was on Prince Edward Island in Canada for the Open Repositories conference. I try to participate as a member of the OR committee, particularly in matters relating to the developer challenge, which is a key part of the conference, when I can manage the early-morning conference calls with Europe and North America. My trip was funded by my employer, the University of Western Sydney. In this report I'll talk about the conference overall, the developer challenge and the paper I presented, written with my colleague Peter Bugeia.

This was my first trip to Canada. I liked it.

Prince Edward Island's largest land mammal is the lighthouse.

Summary

The main-track conference started and ended with talks which, to me at least, were above all about Research Integrity. We started with Victoria Stodden's Re-use and Reproducibility: Opportunities and Challenges, which I'll cover in more detail below. One of Stodden's main themes was the gap in our scholarly practice and infrastructure where code and data repositories should be. It is no longer enough to publish research articles that make claims if those claims are impossible to evaluate or substantiate in the absence of the data and the code that support them. The closing talk touched on some of the same issues, looking at the current flawed and corruptible publishing system, claiming for example that the journal-based rewards system encourages cheating. Both of these relate to repositories, in overlapping ways.

But OR is not just about the main track, which was well put together by Sarah Shreeves and Jon Dunn, it remains a practical, somewhat technical conference where software user and developer groups are also important strands and the Developer’s Challenge is a key part of the event.

The conference: the “Two Rs and a U”

First up, the main conference. The theme this year was “Use, Reuse, Reproduce”. The call for proposals said:

Some specific areas of interest for OR2013 are:

  • Effective re-use of content–particularly research data–enabled by embedded repository tools and services

  • Effective re-use of software, services, and infrastructure to support repository development

  • Facilitation of reproducible research through access to data, workflows, and code

  • Services making use of repository metadata

  • Focused, disciplinary or community-based software, services, and infrastructure for use and reuse of content

  • Integration of data, including linked data, and external services with repositories to provide solutions to specific domains

  • Added-value services for repositories

  • Long-term preservation of repositories and their contents

  • Role and impact of repositories in the research ecosystem

These are all great things to talk about, and show how repositories, at least in universities, are expanding from publications to data. The catch-phrase "Use, Reuse, Reproduce" is worthy, but I think maybe we're not there yet. What I saw and heard, which was of course just a sample, was more along the lines of "Here's what we're doing with research data" rather than stories about re-use of repository content or reproducible research. I hope that some of the work that's happening in the Australian eResearch scene on Virtual Labs and eResearch tools finds its way to OR2014, as I think that these projects are starting to really join up some of the basic data management infrastructure we've been building courtesy of the Australian National Data Service (ANDS) with research practices and workflows. It's the labs that will start to show the Use and Reuse and maybe some Reproduction.

Keynote:

Victoria Stodden's opening keynote was a coherent statement of some of the challenges facing scholarship, which is currently evaluated on the basis of publications, citations and journals. But publications are most often not supported by data and/or code that can be used to check them. Stodden talked mainly about computationally-based research, but the problem affects many disciplines. For a keynote I found it a little dry – there was only one picture – and I would have preferred a few stories or metaphors to make it more engaging. I was also hoping she'd talk about the difference between repeatability and reproducibility, which she did in another talk. Our community needs to get on top of this, so here's an 'aside-slide' from another of her talks:

Aside: Terminology

  • Replicability (King 1995)*: Now: regenerate results from existing code, data.

  • Reproducibility (Claerbout 1992)*: Now: independent recreation of results without existing code or data.

  • Repeatability: re-run experiments to determine the sensitivity of results when underlying measurements are retaken.

  • Verification: the accuracy with which a computational model delivers the results of the underlying mathematical model.

  • Validation: the accuracy of a computational model with respect to the underlying data (model error, measurement error).

See: V. Stodden, "Trust Your Science? Open Your Data and Code!" Amstat News, 1 July 2011. http://magazine.amstat.org/blog/2011/07/01/trust-your-science/

*These citations are not in the reference list in the slide-deck.

Stodden made some references to repositories, summarized thus on The Twitter:

Simon Hodson @simonhodson99, 10 Jul

@sparrowbarley: #OR2013 keynote, V Stodden called for sharing of data & code to “perfect the scholarly record” & “root out error”” #jiscmrd

Peter Ruijgrok @pruijgrok, 9 Jul

#or2013 Victoria Stodden: A publication is actually an advertisement. Data and software code is what it is about as proof / reproducing

This was a useful contribution to Open Repositories – Reuse, Replicability, Reproducibility et al have to be amongst our raisons d'être. Just as the Open Access movement drove the initial wave of institutional publications repositories, the R words will drive the development of data and code repositories, both institutional and disciplinary. OR is a very grounded conference, for practitioners more than theorists, so I would expect that over the next decade we'll be talking about how to build the infrastructure to support researchers like Victoria, and the researchers we've been working with at UWS, which brings us to our talk.

Our paper: 4a Data Management

The presentation I gave, written with Peter Bugeia, talks about how UWS collaborated with Intersect, using money from the Australian National Data Service, to work on a suite of projects that cover the four As. It's up on the UWS eResearch blog and with slightly cleaned-up formatting on my own site.

At the University of Western Sydney we’ve been working on end-to-end data management. The goal of being able to support the R words for researchers is certainly behind what we’re doing but before we get results in that area we have to build some basic infrastructure. For the purposes of this paper, then, we settled on the idea of talking about a set of ‘A’ words that we have tried to address with useful services:

  1. Acquiring data

  2. Acting on data

  3. Archiving data

(we could maybe have made more of the importance of including as much context about research as possible, including code, but we certainly did mention it).

  4. Advertising data

(note the accidental alignment with Victoria Stodden’s comment that an article is an ad.)

Note that the A words have been retrofitted to the project as a rhetorical device; this is not the framework used to develop the services.

Everyone in the repository world knows that "if we build it they will come" is not going to work, which is why this is not just about Archiving and Advertising, two core traditional repository roles; it's about providing services for Acquiring and Acting on data for researchers. Reproducibility et al are going to be more and more important to the conduct of research, and as awareness of this spreads the most successful researchers will be the ones who are already making sure their data and code are well looked after and well described.

The closing plenary

Jean-Claude Guédon's wrap-up complemented the opening well, drawing together a lot of familiar threads around open access and looking at the way the scholarly rewards system has created a crisis in research integrity. He rehearsed the familiar argument that journal-level metrics are not useful, and can be counter-productive, calling for an article-level measuring system which can operate independently of the publishing companies who control so much of our current environment. He also warned against letting corporations take 'our' data and sell it back to us. There was nothing really ground-breaking here, but it was a timely reminder to think about why we're even at a conference called Open Repositories.

Like Stodden, Guédon didn't offer much of a roadmap for the repository movement, which after all is our job, although he did try to talk in context maybe a little more than Stodden's opening, which, while it did reference repositories, had the air of a well-practised stump speech.


The developer challenge

This year the developer challenge judging panel was decisively chaired by Sarah Shreeves, who was also on the program committee. We struggled to get entrants this time – this still needs some analysis, but at this stage it looks like the relatively remote location meant that many developers didn't get funding to attend, and we had a little confusion around a new experiment for this year, a non-compulsory hackfest a few kilometers from the main venue, which left a couple of people thinking they'd missed out on a chance to join in. And the big one was that there was no dedicated on-the-ground dev wrangler on hand; for the last several years JISC have been able to send staff, notably Mahendra Mahey. I did try to encourage teams to enter, with modest success, but Mahendra was definitely missed.

So who won?

William J Nixon @williamjnixon, 11 Jul

#OR2013 Developers challenge winners – Team Raven’s PDF/A and Team ORCID. Congratulations. More details on these at http://or2013.net/content/or-2013-dev-challenge-event …

This year we based the judging on the criteria I put together last year in the form of a manifesto about the values of the conference. I think that helped focus the judging process, and feedback from the panel was generally good, but we'll see how people feel after some reflection. I talked about this in the wrap-up. Torsten Reimer got to the heart of it:

Torsten Reimer @torstenreimer, 11 Jul

#OR2013 Developer Challenge co-judge @ptsefton summarises the dev manifesto http://bit.ly/1dmQ8UQ  creating new networks is key

Robin Rice (front left) had organized a photo of the smuggler’s den in which we convened:

Robin Rice @sparrowbarley, 10 Jul

The Dev Challenge judges are convening in an appropriate venue. #OR2013 pic.twitter.com/ToIkzQueeb

The committee is putting together a manual for future organizers, and I will be suggesting something along these lines:

  • Dev facilities should be as close as possible to the main conference rooms; even remote rooms in the same facility cause problems, as people need to be able to be in and out of presentations.

  • There needs to be a dedicated mentor for the dev challenge to help teams coalesce and do stuff like make sure that winners are announced formally.

The future of Fedora

I was really interested in the new Fedora Futures / Fedora 4 project (FF). Fedora is the back-end repository component behind projects like Islandora, Hydra (parts of which power the HCS virtual laboratory) and, optionally, ReDBox.

The new Fedora project is not ready for real use yet, but it shows lots of promise as a simple-to-use linked-data-ready data storage layer for eResearch projects, where you want to keep data and metadata about that data. New built-in self-repairing clustering and a simple REST interface make it appealing.
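
As a flavour of that REST interface, here is a rough sketch using plain HTTP from Python. The base URL and paths are assumptions (the default fcrepo4 setup at the time exposed something like /rest/), and the details were still changing as the project developed:

# Sketch only: create a container and store a data file in a Fedora 4-style
# repository over its REST interface. Base URL and paths are assumptions.
import requests

BASE = "http://localhost:8080/rest"

# Create (or update) a container to hold a dataset
requests.put(f"{BASE}/datasets/glasshouse-co2")

# Add the data file itself as a binary resource beneath the container
with open("co2-readings.csv", "rb") as f:
    requests.put(f"{BASE}/datasets/glasshouse-co2/co2-readings.csv",
                 data=f,
                 headers={"Content-Type": "text/csv"})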

I was particularly excited about the (promised) ability for FF to be able to run on top of an existing file tree and provide repository-like services over the files. This is exactly what we have been looking for in the Cr8it project, where the idea is to bridge the yawning chasm from work-in-progress research files to well-described data sets. Small detail: it doesn't actually do what I fondly hoped it would yet. I wanted to be able to point it at a set of files, have it extract metadata and generate derived views and previews, allow extra metadata to be linked to files, and watch for changes. Working on that.

The venue and all that

Prince Edward Island took some getting to for some of us, but it was worth it. Mark Leggot’s team at UPEI and the related spin-off company Discovery Garden were consummate hosts and Charlottetown was a great size for a few hundred conference attendees; it was impossible not to network unless you stayed in your hotel room. I didn’t. I particularly appreciated the local CD in the conference bag. Mine is Clocks and Hearts Keep Going by Tanya Davis, laid-back pop-folk, which kind of reminded me of The Be Good Tanyas, in a good way, and no not because of the Tanya thing, but it may be a Canadian thing. And there were decent local bands at the conference dinner, which was held in three adjacent restaurants and involved oysters and lobsters, like just about every other meal on PEI.


Just about everyone I went to dinner with was obsessed with these things, and I do think they might be better than the ones we have here.

There was pretty good fish and chips too.

HTML Slide presentations, students to the rescue

Creative Commons Licence
HTML Slide presentations, students to the rescue by Peter Sefton is licensed under a Creative Commons Attribution 3.0 Unported License.

Thanks to Andrew Leahy’s organising skills I am now the client for a group of third year computing students from the School of Computing, Engineering and Mathematics at the University of Western Sydney who have chosen to work on an HTML slide viewer project for their major project. I’m not going to name them here or point to their work without their permission, but who knows, they might start up an open source project as part of this assignment.

You might have noticed that on this blog I have been experimenting with embedding slide presentations in posts, like this conference presentation on research data management which embeds slide images originally created in the Google Drive presentation app along with speaker notes, or this one on the same topic where the slides are HTML sections rather than images. These posts mix slides with text, so you can choose to read the story or watch the show using an in browser HTML viewer. I think this idea has potential to be a much better way of preserving and presenting slides than throwing slide-decks online, but at the moment the whole experience on this blog is more than a bit clunky and leaves lots to be desired, which is where the students come in.

Hang on, there are dozens of HTML slide-viewer applications out there – so why do I think we need a new one?

There are a few main requirements I had which are not met by any of the existing frameworks that I know of. These are:

  • It should be possible to mix slide and discursive content.

    That is, slides should be able to be sprinkled through an otherwise 'normal' HTML document which should display as plain old HTML without any tricks.

  • Slide markup should be declarative and use modern semantic-web conventions.

    That is, the slides and their parts should be identified by markup using URIs instead of the framework assuming, for example, that <section> or <div> means 'this is a slide'. Potentially, different viewing engines could consume the same content. You could have a dedicated viewer for use in-venue with speaker notes on one screen and the presentation on another, and another viewer to show a slide presentation embedded in a WordPress post.

  • Following from (2), the slide show behaviour should be completely independent of the format for the slides.

    That is, adding the behaviour should be a one- or two-liner added at the top, or, even better, dropping the HTML into a 'slide-ready' content management system like, um, my blog.

There are plenty of frameworks with some kind of open license that students should be able to adapt for this project. That's what I did with my attempt: I wrote a few lines of code to take slides embedded in blog posts, get rid of other HTML and marshal the result into the venerable W3C Slidy format. The format is declarative, and the documents don't 'do' anything at all until a WordPress plugin sniffs out slide markup hiding in them.
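
The gist of that kind of conversion, sketched with BeautifulSoup: find the declaratively marked slides and wrap them in the structure W3C Slidy expects (a div with class "slide"). The slide-identifying URI below is a made-up placeholder, not an agreed vocabulary:

# Sketch: pull marked-up slides out of a blog post and rebuild them as a
# Slidy-style document. The SLIDE_TYPE URI is a hypothetical placeholder.
from bs4 import BeautifulSoup

SLIDE_TYPE = "http://example.org/terms/Slide"

with open("post.html") as f:
    post = BeautifulSoup(f, "html.parser")

slidy = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
for section in post.find_all(attrs={"typeof": SLIDE_TYPE}):
    slide = slidy.new_tag("div")
    slide["class"] = "slide"
    slide.append(section)        # move the slide content; everything else is dropped
    slidy.body.append(slide)

print(slidy.prettify())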

I’m going to be working with the team to negotiate what seems like a reasonable set of goals for this project, but my current thinking is something like the following:

  • In consultation with me, define a declarative format for embedding slides in HTML that can cover (at least):

    • Identifying slides using a URI.

    • Identify parts of slides (the slide vs notes etc).

  • Allow slides to consist of one or both of an image of the slide or a text/HTML version of the same thing. Eg a nicely rendered image of some bullet points from PowerPoint with equivalent HTML formatting also available to search engines and screen-readers.

  • Improve on the current slide-viewing experience in WordPress with:

    • Some kind of player that works in-post (ie without going fullscreen). A simple solution that came up in our meeting would be to automatically add navigation that just skips between slides, with some kind of care taken to show the slide at the top of the screen with context below it.

    • An improved full-screen player that can (at least) recognise when a full-screen image version of the slide is available and display that scaled to fit, rather than the sub-optimal thing I have going on now with Slidy putting a heading at the top and the image below.

There are lots more things that could be done with this, given time, which might make good material for future projects:

  1. Adding support for this format to Pandoc or similar.

  2. Creating a converter or exporter for slide presentations in common formats (.pptx, odp) targeting the new format.

  3. Extending the support I have already built into WordDown and the ICE content converter to allow authors to embed slides in word processing documents.

  4. Adding support for synchronised audio and video.

  5. Allowing more hyper-presentations, like Prezi.

  6. Dealing with progressive slide builds.

  7. Slide transitions.

  8. Different language versions of the same content.

  9. Synchronising display on multiple machines, e.g. students' iPads or a second computer.

  10. Master slides and branding – point to a slide template somewhere? Include a suggested slide layout somehow?

  11. Adding a presenter mode with slides on one screen and notes on another.

  12. For use with multi-screen rigs like Andrew Leahy's Wonderama, maybe the extra screens could be used to show more context: slides on one screen, video of the presenter on another, photos and maps on other screens. E.g. a Wonderama presentation rig could look for geo-coded stuff in the presentation and throw up maps or Google Earth viz on spare screens, or other contextual material.

Of course depending on which framework, if any, the students decide to adopt and/or adapt some of the above may come for free.

Putting data on the web

I attended this data newsroom (#datanews) event in Melbourne Monday Feb 3rd [Correction – it was the 4th] 2013. David Flanders asked me to come prepared to give a talk on tools and techniques for embedding data into web pages, particularly using Schema.org, the corporate sponsored ontology of everything that matters for commerce.

So here are my semantically rich[i] notes for the presentation. This is neither a tutorial nor a coherent story, so you may want to leave now, but there is a picture of Tim Berners-Lee about half way through.

Why embed data in web pages?

You can make new things happen. Let other people or machines do things with the data. Here’s an example by Tim Sherratt showing how data embedded in the page (left) can drive new behaviour (the stuff on the right).

What is this Schema.org?

(I have added a couple of tags to discuss later)

Many sites are generated from structured data, which is often stored in databases. When this data is formatted into HTML, it becomes very difficult to recover the original structured data. Many applications, especially search engines, can benefit greatly from direct access to this structured data. On-page markup [#inlinedata] enables search engines to understand the information on web pages and provide richer search results in order to make it easier for users to find relevant information on the web. Markup [#semanticsyntax] can also enable new tools and applications that make use of the structure.

A shared markup vocabulary [#sharedvocab] makes it easier for webmasters to decide on a markup schema and get the maximum benefit for their efforts. So, in the spirit of sitemaps.org, search engines have come together to provide a shared collection of schemas that webmasters can use.

http://schema.org/

Use Schema.org – get snippets

The point of this, for the vast majority of web practitioners is to get into the world of ‘rich snippets’. If you use the schema.org way then you get ‘better’ search results – but you’re also allowing the search engines, and anyone else who views your page to use the data. Right now if I search for movie times at my local cinema I enter a Google-trap which shows me films and when they’re on with no link to the cinema site itself. It’s also hard to tell what role Schema.org plays in all these search engine things – some of the data you see is harvested using older conventions for data services and who knows, maybe the cinemas just give Google a spreadsheet with the movie times in it.

For data journalism and research, we presumably want to get the data out in a form that it can be reused so the concerns are different – you want the data to be used, and your part in its collection or creation to be cited.

The other thing you need to know about: RDF

RDF is the Resource Description Framework.

The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model. It has come to be used as a general method for conceptual description or modeling of information [#sharedvocabularies] that is implemented in web resources, using a variety of syntax formats [#semanticsyntax].

http://en.wikipedia.org/wiki/Resource_Description_Framework

In the work I do in eResearch systems and repositories, RDF is clearly a very good framework for extensible metadata, and the associated “Linked data” approach of using URIs to describe things and concepts is a good way to implement shared vocabularies, but RDF is very hard to get to grips with as a general modelling framework.

Now it’s time to over-simplify the process of getting data into web pages via schema.org et al.

Putting data on the web?

  • Is it in some kind of web ready format?*

    • Yes: Put it on the web as-is #justpublish

    • No:  Make it into a web ready format. Options:

      1. Reformat to a spreadsheet or something #justpublish

      2. Embed the data in human readable HTML

        #inlinedata and #semanticmarkup  

      3. Publish as a stand-alone RDF resource**

  • In any case publish a web page about it***

  • Include metadata in the web page. #pagelevelmetadata

  • Make the metadata standards-based and proper****. #sharedvocab

  • Choose a syntax for the embedding #semanticsyntax

The fine print

*What is a web-ready format depends on how much of a pedant you are – for some only gold-plated RDF is good enough

**And, you know, keep the web page UP.

*** At Tim Berners-Lee’s talk in Melbourne that night David Flanders asked him what advice he had for researchers re data – should they put it on the web?

Tim's response was that researchers should work with their data in the format that suits them, but they should get a 'shim' or adaptor built to provide an RDF interface to the data so others could use it as part of the semantic web.

I think that’s easy for Sir Tim to say and he’s right that it would be a Good Thing, but experience has shown that projects like that run to about $200K in Australia and don’t always get results, so I’d add “and while you’re working on the RDF adaptor, publish what you have in the format in which you have it with as much metadata as you can manage”.

****Good luck. If anybody comments at all it will be to ask “why didn’t you use the European/ISO/W3C Standard” (which will turn out to be a document that has been in development for 5 years but expected to be released in six months for the last four of those years)

Figure 1 Tim Berners-Lee (right) dwarfed by the happy head on a sponsor’s banner, which in turn is dwarfed by Art – at the University of Melbourne

Google Scholar

Case study: Getting a scholarly work into Google Scholar

  • A repository somewhere advertised the existence of the work via extensive use of the venerable meta-tag (see the sketch after this list). #page-level-metadata

  • Google found the data, entered it in its database.

  • When you search, it puts the metadata back in the page so other software can scrape it out #microformats*
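
To make the first step concrete: the page-level metadata is just meta tags in the HTML head, which anyone can read back out. A small sketch (the URL is a placeholder; the Highwire-style citation_* names are the ones Google Scholar's inclusion guidelines ask repositories to use):

# Sketch: read the page-level metadata a repository page advertises for
# Google Scholar (Highwire-style citation_* meta tags). URL is a placeholder.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://repository.example.edu/items/1234").text
soup = BeautifulSoup(html, "html.parser")

record = {}
for meta in soup.find_all("meta"):
    name = meta.get("name", "")
    if name and name.startswith("citation_"):
        record.setdefault(name, []).append(meta.get("content"))

print(record.get("citation_title"))
print(record.get("citation_author"))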

*Microformats mean:

Worst-case: maintaining a web-load of converters – see this from a patch to keep the Zotero reference manager working with Google Scholar. Google changes their page? You change your code and redeploy to millions of people.

'//div[@class="gs_r"]/div[@class="gs_fl"]/a[contains(@href,"q=related:")]') + '//div[@class="gs_r"]//div[@class="gs_fl"]/a[contains(@href,"q=related:")]')

These are XPath expressions looking in the webpage for stuff that Google coded for their own reasons, probably to make it look right, not primarily for data interchange.

You see what’s happening there? Google indexes pages that conform to a standard they defined (not the one the repository community uses for its own interchange). Then to get the data back out the scholarly community has to keep track of a non-standard convention, again invented by Google.

Sounds like a case for Schema.org?

You’d certainly think so.

But don’t underestimate the power of commercial interests to distort the shape of the semantic web.

There are (at least) two things to be standardised in web semantics

  • The (hopefully) shared vocabulary / world view – “ontology” #sharedvocab

  • The encoding method; how the meaning is embroidered on to the web #semanticsyntax

And of course we have multiple overlapping but incomplete standards, best practices, worst practices and flame-wars for both.

There are four basic ways to embed data into web pages.

Four ways to #inlinedata

  • Metadata about a whole page via meta tags in the head #pagelevelmetadata  #traditional

  • Metadata/data about parts of a page: #semanticsyntax

    • Microformats (obsolete but persisting) using conventions #byconvention

    • Microdata – part of the (non-W3C) HTML5 spec: simple, flawed, controversial #worksbutpissedpeopleoff

    • RDFa – obscenely complicated unless you use RDFa 1.1 lite #theonetrueway

I have been working with researchers at the Hawkesbury Institute for the Environment at UWS and the technical folks at Intersect NSW to implement an HTML readme file to accompany environmental research data sets – we're working on a case-study that goes into how we made the choice of RDFa (#semanticsyntax) and how we chose which vocabularies and terms to use (#sharedvocab), which we'll publish as soon as possible.



[i] Semantically rich? Look at the source – I’ve used a web-police-approved mechanism for embedding slides in my prose. That is, I have used a standard vocabulary (the bibliographic ontology #sharedvocab) and a syntactic specification (RDFa 1.1 lite #semanticsyntax) for saying that some parts of the page are special.

The repository is watching: automated harvesting from replicated filesystems

[This is a repost of http://jiscpub.blogs.edina.ac.uk/2011/07/15/the-repository-is-watching-automated-harvesting-from-replicated-filesystems-2/ please comment over there]

One of the final things I'm looking at on this jiscPUB project is a demonstration of a new class of tool for managing academic projects, not just documents. For a while we were calling this idea the Desktop Repository, the idea being that there would be repository services watching your entire hard disk and exposing all the content in a local website with repository and content management services. That's possibly a very useful class of application for some academics, but in this project we are looking at a slightly different slant on that idea.

The core use case I'm illustrating here is thesis writing, but the same workflow would be useful across a lot of academic projects, including all the things we're focussing on in the jiscPUB project: academic users managing their portfolio of work, project reporting and courseware management. This tool is about a lot more than just ebook publishing, but I will look at that aspect of it, of course.

In this post I will show some screenshots of The Fascinator repository in action, talk about how you can get involved in trying it out, and finish with some technical notes about installation and setup. I was responsible for leading the team that built this software at the University of Southern Queensland. Development is now being done at the University of Central Queensland and the Queensland Cyber Infrastructure Foundation where Duncan Dickinson and Greg Pendlebury continue work on the ReDBox research data repository which is based on the same platform.

I know Theo Andrew at Edinburgh is keen to get some people trying this. So this blog post will serve to introduce it and give his team some ideas; we'll follow up on their experiences if there are useful findings.

Managing a thesis

The short version of how this thesis story might work is:

  • The university supplies the candidate with a dropbox-like shared file system they can use from pretty much any device to access their stuff. But there's a twist: there is a web-based repository watching the shared folder and exposing everything there to the web.

  • The university helpfully adds into the share a thesis template that's ready to go, complete with all the cover-page stuff, margins all set, automated tables of contents for sections, tables and figures, and the right styles, and trains the candidate in the basics of word processing.

  • The candidate works away on their project, keeping all their data, presentations, notes and so on in the Dropbox and filling out the thesis template as they go.

  • The supervisor can drop in on the work in progress and leave comments via an annotation system.

  • At any time, the candidate can grab a group of things, which we call a package, to publish to a blog or deposit to a repository at the click of a button. This includes not just documents, but data files (the ones that are small enough to keep in a replicated file system), images, presentations etc.

  • The final examination process could be handled using the same infrastructure and the university could make its own packages of all the examiners reports etc for deposit into a closed repository.

The result is web-based, web-native scholarship where everything is available in HTML, not just PDF or application file formats and there are easy ways to route content to other repositories or publish it in various ways.

Where might ebook dissemination fit into this?

Well, pretty much anywhere in the above that someone wants to either take a digital object ‘on the road’ or deposit it in a repository of some kind as a bounded digital thing.

Demonstration

I have put a copy of Joss Winn's MA thesis into the system to show how it works. It is available in the live system (note that this might change if people play around with it). I took an old OpenOffice .sxw file Joss sent me and changed the styles a little bit to use the ICE conventions. I'm writing up a much more detailed post about templates in general, so stay tuned for a discussion of the pros and cons of various options for choosing style names and conventions, and whether or not to manage the document as a single file or multiple chapters.

Illustration 1: The author puts their stuff in the local file system, in this case replicated by Dropbox.

Illustration 2: A web view of Joss Winn's thesis.


The interface provides a range of actions.

Illustration 3: You can do things with content in The Fascinator, including blogging and export to zip or (experimental) EPUB.

The EPUB export was put together as a demonstration for the Beyond The PDF effort by Ron Ward. At the moment it only works on packages, not individual documents, and it is using some internal Python code to stitch together documents, rather than calling out to Calibre as I did in earlier work on this project. The advantage of doing it this way is that you don't have Calibre adding extra stuff and reprocessing documents to add CSS, but the disadvantage is that a lot of what Calibre does is useful, for example working around known bugs in reader software (though it does tend to change formatting on you, not always in useful ways).

I put the EPUB into the Dropbox so it is available in the demo site (you need to expand the Attachments box to get the download; that's not great usability, I know). Or you can go to the package and export it yourself. Log in first, using admin as the username and the same for the password.

Illustration 4: Joss Winn's thesis exported as EPUB.

I looked at a different way of creating an EPUB book from the same thesis a while ago, which will be available for a while here at the Calibre server I set up.

One of the features of this software is that more than one person can look at the web site and there are extensive opportunities for collaboration.

Illustration 5: Colleagues and supervisors can leave comments via inline annotation (including annotating pictures and videos).

Illustration 6: Annotations are threaded discussions.

Illustration 7: Images and videos can be annotated too. At USQ we developed a Javascript toolkit called Anotar for this, the idea being that you could add annotation services to any web site quickly and easily.

This thesis package only contains documents, but one of the strengths of The Fascinator platform is that it can aggregate all kinds of data, including images, spreadsheets and presentations, and can be extended to deal with any kind of data file via plugins. I have added another package, modestly calling itself the research object of the future, using some files supplied by Phil Bourne for the Beyond the PDF group. The Fascinator makes web views of all the content and can package it all as a zip file or an EPUB.

Illustration 8: A spreadsheet rendered into HTML and published into an EPUB file (demo quality only).

This includes turning PowerPoint into a flat web page.

Illustration 9: A presentation exported to EPUB along with data and all the other parts of a research object.

Installation notes

Installing The Fascinator (I did it on Amazon's EC2 cloud on Ubuntu 10.04.1 LTS) is straightforward. These are my notes, not intended to be a detailed how-to, but possibly enough for experienced programmers/sysadmins to work it out.

  • Check it out.

    sudo svn co https://the-fascinator.googlecode.com/svn/the-fascinator/trunk /opt/fascinator
  • Install Sun’s Java

    sudo apt-get install python-software-properties
    sudo add-apt-repository ppa:sun-java-community-team/sun-java6
    sudo apt-get update
    sudo apt-get install sun-java6-jdk

    http://stackoverflow.com/questions/3747789/how-to-install-the-sun-java-jdk-on-ubuntu-10-10-maverick-meerkat/3997220#3997220

  • Install Maven 2.

    sudo apt-get install maven2
  • Install ICE or point your config at an ICE service. I have one running for the jiscPUB project; you can point to it by changing the ~/.fascinator/system-config.json file.

  • Install Dropbox or your file replication service of choice. This is a little bit of work on a headless server, but there are instructions linked from the Dropbox.com site.

  • Make some configuration changes, see below.

  • To run ICE and The Fascinator on their default ports on the same machine, add this stuff to /etc/apache2/apache.conf (I think the proxy modules I'm using here are non-standard).

    LoadModule  proxy_module /usr/lib/apache2/modules/mod_proxy.so
    LoadModule  proxy_http_module /usr/lib/apache2/modules/mod_proxy_http.so
    ProxyRequests Off
    <Proxy *>
    Order deny,allow
    Allow from all
    </Proxy>
    ProxyPass        /api/ http://localhost:8000/api/
    ProxyPassReverse /api/  http://localhost:8000/api/
    ProxyPass       /portal/ http://localhost:9997/portal/
    ProxyPassReverse /portal/ http://localhost:9997/portal/
  • Run it.

    cd /opt/fascinator
    ./tf.sh restart

Configuration follows:

  • To set up the harvester, add this to the empty jobs list in ~/.fascinator/system-config.json

"jobs" : [
    {
        "name": "dropbox-public",
        "type": "harvest",
        "configFile": "${fascinator.home}/harvest/local-files.json",
        "timing": "0/30 * * * * ?"
    }
]

And change /harvest/local-files.json to point at the Dropbox directory:

"harvester": {
    "type": "file-system",
    "file-system": {
        "targets": [
            {
                "baseDir": "${user.home}/Dropbox/",
                "facetDir": "${user.home}/Dropbox/",
                "ignoreFilter": ".svn|.ice|.*|~*|Thumbs.db|.DS_Store",
                "recursive": true,
                "force": false,
                "link": true
            }
        ],
        "caching": "basic",
        "cacheId": "default"
    }
}

To add the EPUB support and the red branding, unzip the skin files in this zip file into the portal/default/ directory: http://ec2-50-19-86-198.compute-1.amazonaws.com/portal/default/download/551148ce6d80bfc0c9c36914f9df4f91/jiscpub.zip

unzip -d /opt/fascinator/portal/src/main/config/portal/default/ jiscpub.zip

[This is a repost of http://jiscpub.blogs.edina.ac.uk/2011/07/15/the-repository-is-watching-automated-harvesting-from-replicated-filesystems-2/; please comment over there.]

Copyright Peter Sefton, 2011-07-12. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

Real life scenarios for creating and disseminating linked-data publications

In this presentation for Semantic Web Technologies for Libraries and Readers (STLR 2011), which I can't attend in person, I want to talk about what happens before things hit the library. I have pre-recorded a couple of demos and asked Jodi Schneider if she would mind introducing the talk for me; I know she has been following this work for a while, so maybe she can read out this brief blog post and pretend to be me?

When I submitted this paper I selected three categories for it.

  • Strategies for semantic publishing (technical, social, and economic)

  • Approaches for consuming semantic representations of digital documents and electronic media
    and collaboration

  • Social semantic approaches for using, publishing, and filtering scholarly
    objects and personal electronic media

But really it’s mainly about the first one, about getting linked data semantics into digital libraries.

The proposal for this paper also had some stuff in it about 'big themes', but given that I am not going to be there in person, and it is only a short demo slot, I will not attempt to address those themes.

The ongoing issue with publishing to the web

Ever since the web began to hit the mainstream, there has been a big gap between what's on the web as HTML and the kinds of documents that people write in academia: papers, theses, reports, and so on.

  1. For research publications, PDF files rule. PDF is what is deposited in institutional repositories, and what people manage (or mismanage) in their personal digital libraries. PDF is not conducive to rich semantics. (Yes, it can be done, but the web is already ready for linked data and all the action is on the web. And yes, some publishers are doing good things with the web, but they don't typically allow DIY authoring and repository deposit of rich semantic materials.)

  2. Word processors and tools like LaTeX don't Just Work for making web documents; it's more like Just Doesn't Work.

  3. When we start talking about semantics and wanting to have stuff like RDFa in web pages, it is really hard to do with run-of-the-mill scholarship, because our authoring tools don't support formal semantics. (Yes, there are XML tool-chains such as TEI, but the heavy-duty XML approach has never been shown to work for large cohorts of non-technical users.)

Demos

I want to show a couple of demos:

  • A method for encoding Linked Data statements in URLs, so they can be used in any system that supports simple HTTP hyperlinks. URLs are supported everywhere and will survive being saved as .doc, copied and pasted, emailed and so on, if not a nuclear winter. See the screencast. This includes a lightning look at The Integrated Content Environment, which tries to close the gap between document authoring tools like word processors and the web.

  • Packaging all of the above using EPUB, the open ebook standard, using various tools and techniques from the Digital Monograph Technical Landscape study and ORE resource maps. See the screencast. One thing I didn't mention in the screencast is that this work builds on work done by the KnowledgeBlog project in the UK, who are on in the next slot in the workshop; we've never met. Hello!

  • I also wanted to look at new models for scholarly objects that allow documents, data and provenance to be managed, reposited and disseminated, by demoing The Fascinator Desktop, but unfortunately that's not possible right now as the software in question is being moved to Google Code and the builds are broken; it should be fixed next week. I refer you to a long posting I put together for the Beyond the PDF workshop, which has a lot of screenshots.

    The key feature of this tool is that it can work with a wide variety of things and bundle them together into a single object. The idea is to provide a web interface to your hard disk, or a Dropbox-like share. In the post I look at a research object consisting of a paper, some data in a spreadsheet, some provenance information, and touch on how semantics about the scientific content of the paper could be marked up using the 'triplink' technique I demonstrated above. (I'll comment below when I am able to post a screencast.)

Sorry I couldn’t make it to Ottawa, hope you all enjoy the workshop.

Copyright Peter Sefton, 2011-06-16. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

Making EPUB from WordPress (and other) web collections

[This is a re-post from the jiscPUB project; please make any comments over there: http://jiscpub.blogs.edina.ac.uk/2011/05/25/making-epub-from-wordpress-and-other-web-collections/]

Background

As part of Workpackage 3 I have been looking at WordPress as a way of creating scholarly monographs. This post carries on from the last couple, but it's not really about EPUB or about WordPress; it's about interoperability and how tools might work together in a Scholarly HTML mode so that people can package and repackage their resources much more reliably and flexibly than they can now.

While exploring WordPress I had a look at the JISC funded KnowledgeBlog project. The team there has released a plugin for WordPress to show a table of contents made up of all the posts in a particular category. It seemed that with a bit of enhancement this could be a useful component of a production workflow for book-like projects, particularly for project reports and theses (where they are being written online in content management systems; maybe not so common now, but likely to become more common) and for course materials.

Recently I looked at Anthologize, a WordPress-based way of creating ebooks from HTML resources sourced from around the web (I noted a number of limitations which I am sure will be dealt with sooner or later). Anthologize uses a design pattern that I have seen a couple of times with EPUB: converting the multiple parts of a project to an XML format that already has some rendering tools, and using those tools to generate outputs like PDF or EPUB. Asciidoc does this using the DocBook tool-chain and Anthologize uses TEI tools. I will write more on this design pattern and its implications soon. There is another obvious approach: to leave things in HTML and build books from that, for example using Calibre, which already has ways to build ebooks from HTML sources. This is an approach which could be added to Anthologize very easily, to complement the TEI approach.

So, I have put together a workflow using Calibre to build EPUBs straight from a blog.

Why would you want to do this? Two main reasons. Firstly, to read a report, thesis or course, or an entire blog on a mobile device. Secondly, to be able to deposit a snapshot of same into a repository.

In this post I will talk about some academic works:

The key to this effort is the KnowledgeBlog table of contents plugin ktoc, with some enhancements I have added to make it easier to harvest web content into a book.

The results are available on a Calibre server I'm running in the Amazon cloud just for the duration of this project. (The server is really intended for local use; the way I am running it behind an Apache reverse proxy it doesn't seem very happy, so you may have to refresh a couple of times until it comes good.) This is rough. It is certainly not production quality.


These books are created using Calibre 'recipes', available here. You run them like this:

ebook-convert thesis-demo.recipe .epub --test

If you are just trying this out, be kind to site owners: --test will cause it to fetch only a couple of articles per feed.

I added them to the calibre server like this:

calibredb add --library-path=./books thesis-demo.epub

The projects page at my site has two TOCs for two different projects.

The title is used to create sections in the book; in both cases the posts are displayed in date order, and I am not showing the name of the author on the page because that's not needed when it is all me.

The resulting book has a nested table of contents, seen here in Adobe Digital Editions.

Illustration 1: A book built from a WordPress page with two table of contents blocks generated from WordPress categories.

Read on for more detail about the process of developing these things and some comments about the problems I encountered working with multiple conflicting WordPress plugins, etc.

The Scholarly HTML way to EPUB

The first thing I tried in this exploration was writing a recipe to make an EPUB book from a Knowledge Blog, for the Ontogenesis project. It is a kind of encyclopaedia of ontology development maintained in a WordPress site with multiple contributors. It worked well, for a demonstration, and did not take long to develop. The Ontogenesis recipe is available here and the resulting book is available on the Calibre server.

But there was a problem.

The second blog I wanted to try it on was my own, so I installed ktoc, changed the URL in the recipe and ran it. Nothing. The problem is that Ontogenesis and my blog use different WordPress themes, so the structure is different. Recipes have stuff like this in them to locate the parts of a page, such as <p class='details_small'>:

remove_tags_before = dict(name='p', attrs={'class':'details_small'})

remove_tags_after = dict(name='div', attrs={'class':'post_content'})

That's for Ontogenesis; different rules are needed for other sites. You also need code to find the table of contents amongst all the links on a WordPress page, and to deal with pages that might have two or more ktoc-generated tables for different sections of a journal, or parts of a project report.
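For readers who have not written one, here is a rough sketch of how fragments like those above sit inside a Calibre recipe. This is not one of the real jiscPUB recipes: the class name is invented, and the link gathering in parse_index is far cruder than what the real recipes have to do to find the ktoc-generated list.

from calibre.web.feeds.news import BasicNewsRecipe

class OntogenesisSketch(BasicNewsRecipe):
    title = 'Ontogenesis (sketch)'
    no_stylesheets = True

    # Theme-specific clean-up rules like the ones quoted above: strip
    # everything before the first <p class='details_small'> and after
    # <div class='post_content'> on each article page.
    remove_tags_before = dict(name='p', attrs={'class': 'details_small'})
    remove_tags_after = dict(name='div', attrs={'class': 'post_content'})

    def parse_index(self):
        # Fetch the table-of-contents page and turn its links into articles.
        # A real recipe restricts this to the ktoc-generated list rather
        # than every link on the page.
        soup = self.index_to_soup('http://ontogenesis.knowledgeblog.org/')
        articles = [{'title': self.tag_to_string(a), 'url': a['href']}
                    for a in soup.findAll('a', href=True)]
        return [('Articles', articles)]

A recipe like this is run with ebook-convert, exactly as in the thesis-demo example above.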

Anyway, I wrote a different recipe for my site, but as I was doing so I was thinking about how to make this easier. What if:

  • The ktoc plugin output a little more information in its list of posts that made it easy to find no matter what WordPress theme was being used.

  • The actual post part of each page (ie not the navigation, or ads) identified itself as such.

  • The same technique could be extended to other websites in general.

There is already a standard way to do the most important part of this, listing a set of resources that make up an aggregated resource; the Object Reuse and Exchange specification, embedded in HTML using RDFa. ORE in RDFa. Simple.

Well no, it's not, unfortunately. ORE is complicated and has some very important but hard-to-grasp abstractions, such as the difference between an Aggregation and a Resource Map. An Aggregation is a collection of resources which has a URI, while a Resource Map describes the relationship between the Aggregation and the resources it aggregates. These things are supposed to have different URIs. Now, for a simple task like making a table of contents of WordPress posts machine-readable so you can throw together a book, these abstractions are not really helpful to developers or consumers. But what if there were a simple recipe/microformat (what we call a convention in Scholarly HTML) to follow, which was ORE compliant and also simple to implement at both the server and client ends?

What I have been doing over the last couple of days, as I continue this EPUB exploration, is trying to use the ORE spec in a way that will be easy to implement, say in the Digress.it TOC page, or in Anthologize, while still being ORE compliant. That discussion is ongoing, and will take place in the Google groups for Scholarly HTML and ORE. It is worth pursuing because, if we can get it sorted out, then with a few very simple additions to the HTML they spit out, any web system can get EPUB export quickly and cheaply by adhering to a narrowly defined profile of ORE, subject to the donor service being able to supply reasonable-quality HTML. More sophisticated tools that do understand RDFa and ORE will be able to process arbitrary pages that use the Scholarly HTML convention, but developers can choose the simpler convention over a full implementation for some tasks.

The details may change, as I seek advice from experts, but basically, there are two parts to this.

Firstly there’s adding ORE semantics to the ktoc (or any) table of contents. It used to be a plain-old unordered list, with list items in it:

<p><strong>Articles</strong></p>
<ul>
<li><a href="http://ontogenesis.knowledgeblog.org/49">Automatic
maintenance of multiple inheritance ontologies</a> by Mikel Egana
Aranguren</li>
<li><a href="http://ontogenesis.knowledgeblog.org/257">Characterising
Representation</a> by Sean Bechhofer and Robert Stevens</li>
<li><a href="http://ontogenesis.knowledgeblog.org/1001">Closing Down
the Open World: Covering Axioms and Closure Axioms</a> by Robert
Stevens</li>
</ul>

The list items now explicitly say what is being aggregated. The plain old <li> becomes:

<li  rel="http://www.openarchives.org/ore/terms/aggregates"
resource="http://ontogenesis.knowledgeblog.org/49">

(The fact that this is an <li> does not matter, it could be any element.)

And there are separate URIs for the Aggregation and the Resource Map, courtesy of different IDs, and the Resource Map says that it describes the Aggregation, as per the ORE spec.

<div id="AggregationScholarlyHTML">

<div rel="http://www.openarchives.org/ore/terms/describes" resource="#AggregationScholarlyHTML" id="ResourceMapScholarlyHTML" about="#ResourceMapScholarlyHTML">

It is verbose, but nobody will have to type this stuff. What I have tried to do here (and it is a work in progress) is to simplify an existing standard which could be applied in any number of ways, and boil it down to a simple convention that's easy to implement but still honours the more complicated specifications in the background. (Experts will realise that I have used an RDFa 1.1 approach here, meaning that current RDFa processors will not understand it; this is so that we don't have to deal with namespaces and CURIEs, which complicate processing for non-native tools.)

Secondly, the plugin wraps a <div> element around the content of every post to label it as Scholarly HTML; this is a way of saying that this part of the whole page is the content that makes up the article, thesis chapter or similar. Without a marker like this, finding the content is a real challenge: pages are loaded up with all sorts of navigation, decoration and advertisements, it is different on just about every site, and it can change at the whim of the blog owner if they change themes.

<div rel="http://scholarly-html.org/schtml">
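To show how little a consuming tool needs to know, here is a minimal sketch, assuming the Python requests and BeautifulSoup (bs4) libraries, of harvesting the aggregated posts from a page that follows this convention. The harvest function is mine, for illustration only; it is not part of ktoc or of the Calibre recipes.

import requests
from bs4 import BeautifulSoup

ORE_AGGREGATES = 'http://www.openarchives.org/ore/terms/aggregates'
SCHTML_CONTENT = 'http://scholarly-html.org/schtml'

def has_rel(tag, value):
    # rel may come back as a list (bs4 treats it as multi-valued) or a string.
    rel = tag.get('rel') or []
    if isinstance(rel, str):
        rel = rel.split()
    return value in rel

def harvest(toc_url):
    """Return a list of (url, content_html) pairs, one per aggregated post."""
    toc = BeautifulSoup(requests.get(toc_url).text, 'html.parser')
    chapters = []
    for el in toc.find_all(attrs={'resource': True}):
        if not has_rel(el, ORE_AGGREGATES):
            continue
        url = el['resource']
        page = BeautifulSoup(requests.get(url).text, 'html.parser')
        # The article proper is the div marked as Scholarly HTML content.
        content = next((d for d in page.find_all('div')
                        if has_rel(d, SCHTML_CONTENT)), None)
        if content is not None:
            chapters.append((url, str(content)))
    return chapters

From there it is a short step to feeding the chapters to an EPUB builder such as Calibre.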

Why not define an even simpler format?

It would be possible to come up with a simple microformat that had nice human-readable class attributes or something to mark the parts of a TOC page. I didn't do that because people would rightly point out that ORE exists, and we would end up with a convention that covered a subset of the existing spec, making it harder for tool makers to cover both and less likely that services will interoperate.

So why not just use general ORE and RDFa?

There are several reasons:

  • Tool support is extremely limited for client- and server-side processing of full RDFa, for example in supporting the way namespaces are handled in RDFa using CURIEs. (Sam Adams has pointed out that it would be a lot easier to debug my code if I did use CURIEs and RDFa 1.0, so I followed his advice, did some search-and-replacing, and checked that the work I am doing here is indeed ORE compliant.)

  • The ORE spec is suited only for experienced developers with a lot of patience for complexities like the difference between an aggregation and a resource map.

  • RDFa needs to apply to a whole page, with the correct document type, and that's not always possible when we're dealing with systems like WordPress. The convention approach means you can at least produce something that can become proper RDFa if put into the right context.

Why not use RSS/Atom feeds?

Another way to approach this would be to use a feed, in RSS or Atom format. WordPress has good support for feeds; there's one for just about everything. So you can look at all the posts on my website:

http://ptsefton.com/category/uncategorized/feed/atom

or use Tony Hirst's approach to fetch a single post from the jiscPUB blog:

http://jiscpub.blogs.edina.ac.uk/2011/05/23/a-view-from-academia-on-digital-humanities/feed/?withoutcomments=1

The nice thing about this single-post technique is that it gives you just the content in a content element, so there is no screen scraping involved. The problem is that the site has to be set up to provide full HTML versions of all posts in its feeds, or you only get a summary. There's a problem with using feeds on categories too, I believe, in that there is an upper limit to how many posts a WordPress site will serve. The site admin can change that to a larger number, but then that will affect subscribers to the general-purpose feeds as well; they probably don't want to see three hundred posts in Google Reader when they sign up to a new blog.
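As a quick illustration of the single-post technique, here is a sketch using the Python feedparser library (it is not part of any of the recipes above); it only does something useful if the blog puts full content, rather than summaries, in its feeds.

import feedparser

feed_url = ('http://jiscpub.blogs.edina.ac.uk/2011/05/23/'
            'a-view-from-academia-on-digital-humanities/feed/?withoutcomments=1')
feed = feedparser.parse(feed_url)
entry = feed.entries[0]
# entry.content is only present when the feed carries the full post body;
# otherwise all you get is entry.summary.
html = entry.content[0].value if 'content' in entry else entry.summary
print(entry.title)
print(len(html), 'characters of post HTML, with no screen scraping')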

Given that Atom (the best standardised and most modern feed format) is one of the official serialisation formats for ORE it is probably worth revisiting this question later if someone, such as JISC, decides to invest more in this kind of web-to-ebook-compiling application.

What next?

There are some obvious things that could be done to further this work:

  • Set up a more complete and robust book server which builds and rebuilds books from particular sites and distributes them in some way, using Open Publication Distribution System (OPDS) or something like this thing that sends stuff to your Kindle.

  • Write a 'recipe factory'. With a little more work the ScholarlyHTML recipe can be got to the point where the only required variable is a single page URL; everything else can be harvested from the page or overridden by the recipe.

  • Combine the above to make a WordPress plugin that can create EPUBs from collections of in-built content (tricky because of the present Calibre dependency, but it could be re-coded in PHP).

  • Add the same ScholarlyHTML convention for ORE to other web systems such as the Digress.it plugin and Anthologize. Anthologize is appealing because it allows you to order resources in 'projects' and nest them into 'parts' rather than being based on simple queries, but at the moment it does not actually have a way to publish a project directly to the web.

  • Explore the same technique in the next phase of WorkPackage 3 when I return to looking at word processing tools and examine how cloud replication services like DropBox might help people to manage book-like projects that consist of multiple parts.

Postscript: Lessons and things that need fixing or investigating

I encountered some issues. Some of these are mentioned above but I wanted to list them here as fodder for potential new projects.

  • As with Anthologize, if you use the WordPress RSS importer to bring in content, it does not change the links between posts so that they point to the new location. Likewise with importing a WordPress export file.

  • The RSS importer applied to the thesis created hundreds of blank categories.

  • I tried to add my ktoc plugin to a Digress.it site, but ran into problems. It uses PHP's SimpleXML parser, which chokes on what I am convinced is perfectly valid XML in unpredictable ways. And the default Digress.it configuration expects posts to be formatted in a particular way, as a list of top-level paragraphs rather than with nested divs. I will follow this up with the developers.

  • Calibre does a pretty good job of taking HTML and making it into EPUBs but it does have its issues. I will work through these on the relevant forums as time permits.

    • There are some encoding problems with the table of contents in some places. Might be an issue with my coding in the recipes.

    • Unlike other Calibre workflows, such as creating books from raw HTML, ebook-convert adds navigation to each HTML page in the book created by a recipe. This navigation is redundant in an EPUB, but apparently it would require a source code change to get rid of it.

    • It does something complicated to give each book its style information. There are some odd presentation glitches in the samples as a result of Calibre’s algorithms. This requires more investigation.

    • It doesn't find local links between parts of a book (i.e. links from one post to another, which occur a lot in my work and in Tony's course), but I have coded around that in the Scholarly HTML recipes.

It will be up to Theo Andrew, the project manager, whether any of these next steps or issues get any attention during the rest of this project.

[This is a re-post from the jiscPUB project; please make any comments over there: http://jiscpub.blogs.edina.ac.uk/2011/05/25/making-epub-from-wordpress-and-other-web-collections/]

Copyright Peter Sefton, 2011-05-25. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

Anthologize: a WordPress based collection tool

[This is a copy of a post on the jiscPUB project; if you have comments please make them over there: http://jiscpub.blogs.edina.ac.uk/2011/05/11/anthologize-a-wordpress-based-collection-tool/] In this post I'll look at Anthologize. Anthologize lets you write or import content into a WordPress instance, organise the 'parts' of your 'project', and publish to PDF, EPUB, HTML or TEI XML format. This is what I referred to in my last post about WordPress as an aggregation platform.

Anthologize background and use-cases

Anthologize was created in an interesting way. It is the (as yet unfinished) outcome of a one-week workshop conducted at the Centre for History and New Media, the same group that brought us Zotero and Omeka, which is one good reason to take it seriously. They produce very high quality software.
Anthologize is a project of One Week | One Tool, a project of the Center for History and New Media, George Mason University. Funding provided by the National Endowment for the Humanities. © 2010, Center for History and New Media. For more information, contact infoATanthologizeDOTorg. Follow @anthologize.

Anthologize is a WordPress plugin that adds import and organisation features to WordPress. You can author posts and pages as normal, or you can import anything with an RSS/Atom feed. The imported documents don't seem to be publishable for others to view, but you can edit them locally. This could be useful, but it introduces a whole lot of management issues around provenance and version control. When you import a post from somewhere else the images stay on the other site, so you have a partial copy of the work with references back to a different site. I can see some potential problems with that if other sites go offline or change.

Let’s remind ourselves about the use-cases in workpackage 3:

The three main use cases identified in the current plan, and a fourth proposed one: [lettering added for this post, to match the references below]
  (a) Postgrad serializing PhD (or conference paper etc) for mobile devices
  (b) Retiring academic publishing their best-of research (books)
  (c) Present final report as epub
  (d) Publish course materials as an eBook (extra use-case proposed by Sefton)
http://jiscpub.blogs.edina.ac.uk/2011/03/03/workpackage-3/

Many documents like (a) theses or (c) reports are likely to be written as monolithic documents in the first place, so it would be a bit strange to write, say, a report in Word, LaTeX or asciidoc (which is how I think Liza Daly will go about writing the landscape paper for this project), export that as a bunch of WordPress posts for dissemination, then reprocess it back into an Anthologize project, and then to EPUB. There's much more to go wrong with that, and information to be lost, than going straight from the source document to EPUB. It is conceivable that this would be a good tool for thesis by publication, where the publications were available as HTML that could be fed or pasted into WordPress.

I do see some potential with (d), courseware: it seems to me that it might make sense to author course materials in a blog-post-like way, covering topics one by one. I have put some feelers out for someone who might like to test publishing course materials, without spending too much of this project's time, as this is not one of the core use cases. If anyone wants to try this, or can point me to some suitable open materials somewhere with categories and feeds I can use, then I will give it a go.

There is also some potential with (c), project reports, particularly if anyone takes up the JiscPress way of doing things and creates their project outputs directly in WordPress+digress.it. It would also be ideal for compiling stuff that happens on the project blog as a supporting Appendix. So an EPUB that gathers together, say, all the blog posts I have made on WorkPackage 3, or the whole of the jiscPUB blog, might make sense. These could be distributed to JISC and stakeholders as EPUB documents to read on the train, or deposited in a repository.

The retiring academic (b) (or any academic, really) might want to make use of Anthologize too, particularly if they've been publishing online. If not, they could paste their works into WordPress as posts, and deal with the HTML conversion issues inherent in that, or try to post from Word to WordPress. The test project I chose was to convert the blog posts I have done for jiscPUB into an EPUB book. That's use case (c), more or less.

How did the experiment go?

I have documented the basic process of creating an EPUB using Anthologize below, with lots of screenshots, but here is a summary of the outcomes. Some things went really well.
  • Using the control panel at my web host I was able to set up a new WordPress website on my domain, add the Anthologize plugin and make my first EPUB in well under an hour. (But as usual, it takes a lot longer to back-track, investigate, try different options, read the Google group to see if bugs have been reported, and so on.)
  • The application is easy to install and easy to use with some issues I note below.
  • Importing a feed just works if you search to find out how to do it on a standard WordPress host (although I think there might be issues trying to get large amounts of content if the source does not include everything in the feed).
  • Creating parts and dragging in content is simple.
  • Anthologize looks good.
The good looks and simple interface are deceptive; lots of functionality I was expecting to be there just isn't yet. I have been in contact with the developers and noted my biggest concerns, but here's a list of the major issues I see with the product at this stage of its development:
  • There does not seem to be a way to publish the project (or the imported docs) directly to the web rather than export it. Seems like an obvious win to add that. I can see that being really useful with Digress.it for one thing. The other big win there would be if the Table of Contents could have some semantics embedded in it so it could act like an ORE resource map – meaning that machines would be able to interpret the content. (I will come back to this idea soon with a demo of using Calibre to make an EPUB)
  • There are no TOC entries for the posts within a 'part'; that is, if you pull in a lot of WordPress posts, they don't get individual entries in the EPUB ToC.
  • Links, even internal ones like the table of contents links on my posts, all point back to the original post. This makes packaging stuff up much less useful: you'd need to be online, and you lose the context of an intra-linked resource. This is a known problem, and the developers say they are going to fix it.
  • Potentially a problem is the way Anthologize EPUB export puts all the HTML content for the whole project into one HTML file; I gather from poking around with Calibre etc. that many book readers need their content chunked into multiple files.
  • There's a wizard for exporting your EPUB, and you can enter some metadata and choose some options, all of which is immediately forgotten by the application, so if you do it again you have to re-enter all the information.
  • Epubcheck complains about the test book I made:
    • It says the mimetype (a simple file that MUST be there in all EPUBs) is wrong, though it looks OK to me (see the sketch after this list).
    • It complains about the XHTML containing stuff from the TEI namespace and a few other things.
  • Finally, PDF export fails on my blog with a timeout error, but that's not an issue for this investigation.
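On the mimetype complaint, my guess is that the file's contents are fine but its packaging isn't: the EPUB (OCF) spec requires mimetype to be the very first entry in the zip and to be stored uncompressed, which an ordinary unzip listing won't show you. Here is a small sketch of the check using Python's standard zipfile module (the file name is just an example):

import zipfile

def mimetype_ok(epub_path):
    # The OCF container rules: 'mimetype' must be the first zip entry,
    # stored without compression, with exactly this content.
    with zipfile.ZipFile(epub_path) as z:
        first = z.infolist()[0]
        return (first.filename == 'mimetype'
                and first.compress_type == zipfile.ZIP_STORED
                and z.read('mimetype') == b'application/epub+zip')

print(mimetype_ok('anthologize-test.epub'))  # example file name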

Summary

For the use case of bundling together a bunch of blog posts (or anything that has a feed) into a curated whole, Anthologize is a promising application, but unless your needs are very simple it's probably not quite ready for production use. I spent a bit of time looking at it though, as it shows great promise and comes from a good stable. Here's the result I got importing the first handful of posts from my work on this project.

Illustration 1: The test book in Adobe Digital Editions – note some encoding problems bottom right and the lack of depth in the table of contents. There are several posts but no way to navigate to them. Also, clicking on those table of contents links takes you back to the jiscPUB blog, not to the heading.

Walk through

Illustration 2: Anthologize uses 'projects'. These are aggregated resources; in many cases they will be books, but 'project' seems like a nice media-neutral term.

Illustration 3: A new project in a fresh WordPress install; only two things can be added to it until you write or import some content.

Illustration 4: Importing the feed for workpackage 3 in the jiscPUB project. http://jiscpub.blogs.edina.ac.uk/category/workpackage-3/feed/atom/

Illustration 5: You can select which things to keep from the feed. Ordering is done later. Remember that imported documents are copies, so there is potential for confusion if you edit them in Anthologize.

Illustration 6: Exporting content is via a wizard, easy to use but frustrating because it asks some of the same questions every time you export.

Illustration 7: Having to retype the export information is a real problem, as you can only export one format at a time. Exported material is not stored in the WordPress site, either; it is downloaded, so there is no audit trail of versions.

[This is a copy of a post on the jiscPUB project; if you have comments please make them over there: http://jiscpub.blogs.edina.ac.uk/2011/05/11/anthologize-a-wordpress-based-collection-tool/]

Copyright Peter Sefton, 2011-05-04. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.

Scholarly HTML website up at http://scholarlyhtml.org

I have set up a website for Scholarly HTML at http://scholarlyhtml.org. The site is intended to hold some key documents about Scholarly HTML: what it is, lists of tools, and so on. It will be populated as time allows. I will announce this on the Scholarly HTML mailing list now, and invite people from Beyond the PDF when it is a bit more mature.

We are going to continue with document authoring taking place over on the EtherPads provided by the Open Knowledge Foundation. I will call for people to review and update documents from time to time, and when there is consensus I will post them to the site. I am happy to give other responsible adults those powers as well if they want them. The EtherPad entry point is: http://scholarly-html.okfnpad.org/1.

I am trying out a WordPress deployment pattern I have been thinking about for a while: use the WordPress 'stack of posts' as a version control mechanism, where every version of every document is a post and is intended to be immutable (that's a governance issue). There will be a WordPress page for each node in the site, but it won't have any content; rather, it will run a query to find the latest post (as opposed to page) from a particular category. For example, the faq page at http://scholarlyhtml.org/faq/ runs a query to find the latest post labelled as faq, e.g. http://scholarlyhtml.org/2011/03/25/faq-2011-03-25/. The point of this is to get a full revision history in a simple way.
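The site itself resolves this with a WordPress query on the server, but the same idea can be sketched from the outside using the standard WordPress category feed, which lists posts newest first. This sketch assumes the Python feedparser library and uses the faq category as an example:

import feedparser

def current_version(site, category):
    """Return (title, link) for the newest post in a category, which under
    this pattern is the current version of the corresponding page."""
    feed = feedparser.parse('%s/category/%s/feed/' % (site, category))
    if not feed.entries:
        return None
    # WordPress feeds are newest-first, so the first entry is the latest revision.
    return feed.entries[0].title, feed.entries[0].link

print(current_version('http://scholarlyhtml.org', 'faq'))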

My experience with EtherPad is that it is great for collaboration and awful for formatting, so I am proposing to use wiki-style markup in the pad, making the job of publishing much easier. There are a number of such formats. I was introduced to asciidoc recently. It has rich formatting for technical documents and an established toolchain for creating HTML, PDF and EPUB. It is a bit finicky and I'm not sure if it is the best candidate for a format to support authoring of Scholarly HTML, but it does seem to be more complete than many. Rendering EtherPad documents is as easy as this:

curl http://okfnpad.org/ep/pad/export/schtml-core/latest?format=txt | asciidoc -vs - > core.html

That creates a core.html file. To post it to the site I can use this command to push the content to WordPress as an unpublished document with the category ‘core':

blogpost.py -vu -d html -t "Scholarly HTML core" -c core post core.html

As I get time I will change the markup in the EtherPads over to asciidoc and invite the collaborators back to work on them; I'm happy to discuss alternative formatting arrangements if anyone objects to asciidoc.

Copyright Peter Sefton, 2011-04-15. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project.