Archive for June, 2009

Living Holding Hands?

Friday, June 19th, 2009

So in my last post I mentioned the household ePrints repository. I was joking, kids. But it sort of got me thinking. I bet the Les Carr household has one. Maybe I should do something for the family heritage.

Anyway, here’s a bit of ephemera I found in a plastic sleeve in my song-lyrics folder. A poster for a band I was in in 1987 which won (I think we tied for first, or did we come second?) the Sydney Uni Band comp. We were originally called something else but when we found out we were in a heat up against a band called ‘The Fuckers of the Dead’ we changed the name and revised the repertoire. As far as I recall that did the trick and they didn’t show up. We didn’t have a drummer so we made a backdrop out of orange plastic with a black silhouette of a drum kit and a spiky haired drummer and insisted that the sound guy mic it up just like it was a real one.

Can you spot which one is me and which is Ken Maeda* and which is Maxine Whelan**? (Hint, I have the bass guitar and Max looks alarmingly like her mic is plugged into a powerpoint.) She made this picture.

[Image: the band poster]

There you go. A bit of cultural heritage. I’ll see what I can do with some of this stuff I have lying around as we toil away on our desktop repository project at work. How do you scan, label, annotate and tell stories about this kind of important historical material? I’m learning about how to get the music from those shows off cassette tape onto the computer right now, with a Walkman I found in the cupboard. I’d forgotten how long it takes just to rewind.

Should I expose all my band-related digital objects as a RIF-CS collection? And what if I move families? Are there protocols for taking the digital objects with me? Is there a census date like with the ERA?


* I think Ken was at that stage still playing with Tim Freedman in Penguins on Safari.

** Maxine and I shared a house owned by Victor Kelleher with his son.

Very Early Career Researchers (Engineering)

Thursday, June 18th, 2009

In our mission to revolutionize scholarly communications we often talk about getting the early career researchers, well, early and teaching them new tricks. (Peter Murray-Rust adds that the late career researchers die, too).

At my house we have an early career researcher, who has been writing a sixth-grade report on a substantial kite project. The document has been written using the ICE [Update: fixed link] template*, of course, so it is ready to go straight onto her non-existent blog and into the household ePrints repository, in both HTML and PDF format. We’re still working on the linked data, but there are photos.

I was away when the document was finalized but it is reported that when it was, she went looking for her little brother (Grade 4) for Peer, um, mummy what do you call that.

S: You mean peer review?

F: Yeah, review. I mean even though he’s not my peer I’ll see what he thinks.

I’m so proud.


* She tried to use the template for a set of cupcake labels, cos you can’t just take cupcakes to the school sports day for a fundraiser anymore; you have to detail the nutritional content. That didn’t go so well, I hear.

eResearch happiness in Australia?

Thursday, June 18th, 2009

I have posted here before about work at ADFI that may be of interest to ANDS.

After a recent visit to Canberra I have another couple of ideas presented here in a very quick blog post.

1 An implementation fund for CAIRSS

Currently we run CAIRSS as a support service, but as some of our discussions with stakeholders move along I foresee a time when people will start to say stuff like ‘we need a normalizing OAI-PMH proxy’ (people other than me, that is). We could have a mini project fund administered by the CAIRSS steering committee or a subcommittee which could dispense funds on the basis of short proposals. $1,000,000 over two years would be fine.

2 A developer engagement project

I have been trying to put my finger on what is lacking in the spread of projects we have now: ANDS, ARCS, NeAT and the outcomes of previous things like our own ICE-RS project and the ARCHER and DART projects. There’s lots of good stuff going on but it feels to me like the community is somewhat fragmented by discipline, or by repository people vs grid computing people, or along other dimensions.

So how about a program to do something like we did on RUBRIC but for software developers instead of repository managers – and for the new role of eResearch Analysts around the country. RUBRIC was Regional Universities Building Research Infrastructure Collaboratively. This would be Something-or-other Building Research Infrastructure Collaboratively. The acronym we can leave to Dr Treloar.

I am basing this suggestion on my experience of what JISC is doing to build professional networks amongst their developer community by running competitions alongside conferences, hosting loosely structured meetings like the developer happiness days and so on. So why don’t we hire a few people: one to organize events and communications and one or more experienced open source developers with good communication skills (can we clone David Flanders or entice him out here?) who can promote:

  1. Community. Getting developers in touch with each other and with the implementers, the eResearch analysts. For example we could really use the help of some semantic web people on our work on The Fascinator Desktop, do I know who to ask? (Well, actually yes I do, but that’s just an example).

  2. Harmony. Getting teams working in similar areas to work together and share rather than re-inventing.

  3. Openness. Encourage and assist projects to work with their code under an open trunk from day zero so that others can join in at any stage and offer assistance, rather than working in a closed environment and then (maybe) releasing the code after the project has finished and it is too late for others to offer useful assistance. I won’t name names, but some publicly funded software that should be open is not and we have had to rewrite it.

  4. Friendly competition. I was forced by David Flanders to enter the developer happiness days competition and made two new connections (hi Mia, hi Ian), and I met Anna Gerber from down the road at UQ partly because I was a judge on the Developer Challenge at OR09 and now I will take my semantic web questions to her.

    Via Les Carr comes the idea of giving developers hard disks full of assorted data and documents and saying ‘what can you do with that lot’. We could do that at eResearch Australasia 2009, mix up data from the humanities and sciences to encourage discipline specialists to work with each other and throw in some documents to get the repository people involved.

I have not worked out a budget but to have decent events, travel and guests from overseas this might need $1,000,000 or more a year.

Desktop Repositories: Smashing up PowerPoint

Friday, June 12th, 2009

Les Carr has been experimenting with desktop repository services. He started by wondering how he might manage the thousands of PowerPoint slides and presentations he has, moved on to converting them into images, with embedded textual metadata, then put them in ePrints on the desktop and started speculating on how slides might be reassembled into new presentations and exported.

These workflows are exactly what we have been looking at with The Fascinator Desktop, our nascent eResearch repository platform. Our goal is to index and understand everything on an academic’s desktop, including presentations, documents, video, images, audio, data of all kinds, everything, via a plugin architecture which will be easily scriptable. We’re in the middle of a two-week development sprint getting some of the pieces in place for this, so I thought that picking up on Les Carr’s PowerPoint work would make for a good target for the end of next week.

The goal is that by next week we have an automated system that can:

  1. Watch your home directory.

  2. Extract metadata and index it so we can construct a faceted browse interface.

  3. And in particular, break all your Microsoft PowerPoint or OpenOffice.org Impress files into a set of searchable images, just as Les has done.

    We think we will be able to build a pretty cool interface for this, so that you can text-search for individual slides but also get a sense of their context. So if you search for dog cat, it will find the presentation with ‘dog’ on slide one and ‘cat’ on slide seventeen with a neat interface to show how they are related.
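To make the ‘dog cat’ example concrete, here is a minimal sketch of a slide-level index. It is not how The Fascinator Desktop does it; the function names `build_index` and `search`, and the file names in the example, are made up for illustration, and words are split on whitespace only. The point is that each hit keeps its slide number, so the interface can show where each term sits within a presentation.

```python
from collections import defaultdict


def build_index(presentations):
    """Map each lowercased word to the set of (presentation, slide number) pairs containing it.

    `presentations` maps a file name to a list of slide texts, in slide order.
    """
    index = defaultdict(set)
    for name, slides in presentations.items():
        for number, text in enumerate(slides, start=1):
            for word in text.lower().split():
                index[word].add((name, number))
    return index


def search(index, query):
    """Return {presentation: {term: [slide numbers]}} for presentations that contain every term."""
    terms = query.lower().split()
    if not terms:
        return {}
    per_term = []
    for term in terms:
        hits = defaultdict(list)
        for name, number in sorted(index.get(term, ())):
            hits[name].append(number)
        per_term.append(hits)
    # Keep only the presentations in which every search term appears somewhere.
    common = set(per_term[0])
    for hits in per_term[1:]:
        common &= set(hits)
    return {
        name: {term: hits[name] for term, hits in zip(terms, per_term)}
        for name in sorted(common)
    }


# A seventeen-slide talk with 'dog' on slide 1 and 'cat' on slide 17.
talk = ["My dog"] + ["filler"] * 15 + ["My cat"]
index = build_index({"pets.odp": talk, "other.ppt": ["dog only"]})
print(search(index, "dog cat"))
```

The result for ‘dog cat’ lists pets.odp only, with slide 1 for ‘dog’ and slide 17 for ‘cat’, which is the raw material a context view would need.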

We have made one step towards our goal by doing something we should have done ages ago, adding PowerPoint support to ICE. It’s still a bit rough, but ICE can now turn ODP and PPT into an HTML slide presentation with images of each slide embedded in the page using a method inspired directly by Les Carr’s work. Here it is with a presentation by Les himself (Carr, Leslie, Coles, Simon and Lyon, Liz (2004) Archiving research data and research publications. At, Research Councils UK Workshop on Publication of Research Results, London, UK, 18 Oct 2004.)

I put it in ICE where you can see it in the ICE file manager.

[Image: the presentation in the ICE file manager]

I click the link and ICE renders the document into an HTML file, with an image for each slide. OK, so at the moment we’re not getting the title of each slide quite right but the text for each slide is sitting in the EXIF metadata, same as Les does it, I think.

[Image: the presentation rendered as HTML by ICE]

Now, using ICE’s inbuilt presentation mode, I can press the button and get this; that’s the slide with presentation controls:

[Image: a slide with presentation controls]

This will be useful in ICE where it will give people more options: flip through the slides in web mode, grab a PDF of the whole lot, or get the original format. But this is not the end of the story; we’re going to use ICE as a conversion service, and view the slides using The Fascinator Desktop instead. That’s next week.

Once we have this done we could look at services such as:

(If we had all that in place we could finally help Peter Murray-Rust with his presentations, which are made up of web pages selected from a huge library of un-slides many of which included embedded data visualizations. By indexing all his individual pages we could let him ‘shop’ for the ones he wants, order them and then create a presentation-by-reference which could be de-referenced and blogged or reposited. Peter, can you make your slide library available to us for experimentation?)

So far we have made some good progress on this goal of having a continuously updated repository view of all your files.

  1. Oliver Lucido has built a new abstract interface on Fedora so that we can swap in other data stores. Ron Ward will be trying out CouchDB, and we’ll probably build a simple repository layer using Pairtree and Dflat. I guess we could use ePrints, or Zentity if we were so inclined.

  2. Cynthia Wong and Linda Octalina have built slide handling into ICE as seen above. The approach taken is to load the presentation into OpenOffice.org, save as PDF, then use ImageMagick to break it into images, one per page, then look inside the .odp (ODF presentation) XML to find slide content and use ExifTool to add the slide content as metadata to each image. Lots more to do here, but even as-is it’s useful.

  3. Linda built a file watcher application, which is going to be cross platform but which at the moment is for Linux only. It, you know, watches your files and other services can ask it via HTTP for a list of recent changes, which it gives in a simple JSON format with RDF inside.

  4. Duncan Dickinson and Bron Chandler have been building a harvester; the bit that sits in between the data store where we will be storing all these slide images and so on, and the file watcher which will notice when you add a new PowerPoint, or any other file to the system:

    1. A metadata extractor: This will be based around a bit of software called Aperture which can extract text and RDF metadata from all sorts of files, such as images and PDFs. It puts the metadata in the data store to be indexed.

    2. A transformer application which can render PPT to HTML/images, render word processing documents to HTML, generate low-res video from master files and so on. We already have lots of file transformers in ICE so the transformer will be able to call ICE if it needs to. (Some of this is slow, so we have plans to move this to another queue so it doesn’t slow down the main harvester). The renditions go in the data store.
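The slide-handling steps described in the list above (presentation to PDF, PDF to one image per page, slide text into image metadata) can be sketched roughly as follows. This is not the ICE code. It assumes a current LibreOffice/OpenOffice `soffice` binary that accepts `--headless --convert-to`, ImageMagick’s `convert` command (called `magick` in ImageMagick 7), and ExifTool on the path; the function names and the `slide-%03d.jpg` naming are made up for illustration. The commands are only built, not run, so you can inspect them first.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace of the draw:page elements that hold each slide in an ODF presentation.
DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"


def slide_texts_from_odp(odp_path):
    """Return one string per slide, read from content.xml inside the .odp zip.

    Caveat: any speaker notes stored inside a slide are included in its text.
    """
    with zipfile.ZipFile(odp_path) as odp:
        root = ET.fromstring(odp.read("content.xml"))
    texts = []
    for page in root.iter(f"{{{DRAW_NS}}}page"):
        chunks = (chunk.strip() for chunk in page.itertext())
        texts.append(" ".join(chunk for chunk in chunks if chunk))
    return texts


def slide_pipeline_commands(presentation, workdir, slide_texts):
    """Build (but do not run) the command lines for the three conversion steps.

    Each command is an argument list suitable for subprocess.run().
    """
    presentation = Path(presentation)
    workdir = Path(workdir)
    pdf = workdir / (presentation.stem + ".pdf")
    commands = [
        # Step 1: have the office suite save the presentation as PDF.
        ["soffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(workdir), str(presentation)],
        # Step 2: split the PDF into one JPEG per page (numbering starts at 000).
        ["convert", "-density", "150", str(pdf), str(workdir / "slide-%03d.jpg")],
    ]
    # Step 3: write each slide's text into the EXIF ImageDescription of its image.
    for number, text in enumerate(slide_texts):
        image = workdir / f"slide-{number:03d}.jpg"
        commands.append(
            ["exiftool", "-overwrite_original", f"-ImageDescription={text}", str(image)]
        )
    return commands


# Print the commands for a hypothetical one-slide presentation.
for command in slide_pipeline_commands("talk.odp", "out", ["Title slide"]):
    print(" ".join(command))
```

Keeping the command construction separate from execution makes it easy to hand the slow steps to a queue later, which is where the transformer is said to be heading.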

A lot of this is similar to Jim Downing’s Lensfield project; we have talked about harmonizing our projects.

[Diagram]

This may look like a lot of stuff, but we think that it will provide a very flexible platform for doing useful work for academia, discovering and routing files of all kinds. While parts of this are connected by HTTP calls we are prepared to optimize if that proves too slow, but we think on a desktop scale this will probably work alright. What’s missing from that diagram are all the other things you will be able to do via the web interface: tag stuff, label with


1 If I cycle in to work this weekend and pick up the 7GB virtual machine we made I will be able to demo depositing the PowerPoint into ePrints from ICE.

Trip report: visit to Microsoft

Thursday, June 11th, 2009

I have just returned from the USA, where I attended Open Repositories 2009 in Atlanta. The second part of my trip was a visit to Microsoft Research in Redmond, a suburb of Seattle. It was my idea to visit Redmond, and Microsoft Research kindly set up a day’s visit for me. USQ funded my trip and accommodation. Pablo Fernicola took me sightseeing and to lunch on the weekend* (thanks, Pablo) and Microsoft fed me lunch and put me in a town car to the airport.

Discussions

I spent Tuesday 28th May with Lee Dirks, Pablo and Alex Wade. I wanted to talk to them mainly because I would like to see Microsoft Word able to play better with the web; more on that below. I got to meet a few other people too:

  1. Sumit Chawla talked about the Microsoft interoperability group and a number of their projects. I am particularly interested in the attempt to build an OOXML (.docx) to HTML converter which can work without Word. One of the bits of homework I set myself was to look into how well this actually works. If it can handle embedded graphics and so on then we could look at using it with ICE, or with a Word Add-in for scholars (I have to say I’m pessimistic but I have not got it working yet). I think the converter is a product of The Planets project.

  2. Chris Wilson, who spent a long time on the Internet Explorer team and chairs the W3C HTML working group, made himself available to see if I had any particular issues with IE; we don’t really, apart from the well-known ones that are being attended to. Talking to Chris put our little scholarly corner of the web in perspective; for example, ORE is not on his radar at all. I think ORE is useful and interesting but it is very, very far from being mainstream, something we need to remind ourselves of lest we become over-excited and overestimate our own importance. Chris did encourage me to bring our requirements and concerns to the W3C though, for consideration in HTML 5 and beyond.

  3. Brian Jones of OOXML fame joined us for lunch at a pretty good Indian place on the Microsoft campus. I have exchanged a few blog comments and emails with Brian over the years; it was great to meet him in person. Talking to Brian about the way custom XML is embedded in Word documents and the interface design challenges that introduces reinforced some of my opinions of how the various Add-ins MS Research are producing are likely to fare; more on that below. I also tried to fill in a bit of my knowledge about the various XML and nearly-XML formats that Word has supported over the years**.

Thanks all. It was a really interesting and useful visit for me.

In this post I will report and reflect on discussions with and about Microsoft Research’s work on eScholarship. There are a few things of interest:

  1. The Zentity repository, which is exciting in that it is built on top of an RDF-style data model that lets you express pretty much anything, and potentially a huge maintenance problem because it is built on top of an RDF-style data model that lets you express pretty much anything.

  2. The MS Word Ontology Add-in, which I think is a fine idea, about which I have commented here before. After my visit I am still of the opinion that it is going to be hard to make it usable, and that its lack of interoperability even with older versions of Word is a major problem; nothing will make people drop a tool like this quicker than a few disasters sharing with colleagues or taking a document home and finding that an older version of Word mangles it. I am also concerned that the fragility that results from imposing a strict hierarchy model on top of Word’s native implied hierarchy will be an ongoing headache for developers and it will be hard to provide a bomb-proof implementation. I hope the MS Research people will take a look at the work we did to transform the markup the Add-in uses to plain-old links, but if they don’t like that, it’s open source so we can go ahead and do it anyway.

  3. Chem4Word, being done by Microsoft with our associates from Cambridge. I think its heart is in the right place, but I am concerned about the same things as I am with the Ontology Add-in; interoperability and fragility of embedded custom XML being the two big ones.

  4. The SWORD Add-in, which lets you post documents straight from Microsoft Word to a repository. I’m very interested in this one. I’d love to see it integrated with some kind of HTML conversion so that people can put web pages into web based repositories instead of filling them up with virtual paper. There was a meeting at OR09 to look at how Word might work with repositories, with an incredibly strong response from the ePrints team, who are very enthusiastic about supporting this kind of ingest. Me, I think it’s a reasonable thing to work on, but it is only one workflow, and I think that we are probably going to see a lot of action in intermediate content management systems that manage authoring and data rather than the typical repository focus on dissemination and preservation; most of the interaction from Word is likely to be with those kinds of content systems, in my opinion.

  5. The Microsoft Word Article Authoring Add-in. I’d have to say that nothing that I saw in Atlanta or Seattle has made me change my mind; I still think that this approach of trying to edit documents conforming to a large complex XML schema inside Word is going to end up as an unhappy compromise. If it takes off at all it will be at the publishers who use XML, rather than with ordinary authors. Pablo Fernicola thinks otherwise, obviously. Time will tell.

Scholarly HTML

Against this background I will confine myself to the dimensions I really care about, which are how to make word processors produce good quality HTML, and document interoperability. I’ve been over and over why this is important here, but here’s a summary.

  1. On the authoring side, offline word processors like Microsoft Word and OpenOffice.org Writer are probably still the best all-round compromise for academic authoring in those disciplines which don’t use some other format like LaTeX. For now. I expect this to change soon; we are starting to see document drafting in Google Docs (which lacks citation services and styles and easy embedding of diagrams so far), and if Google Wave realises its promise then I think it could be an end-to-end scholarly communications platform.

  2. On the delivery side, academia is one of the few places where PDF is considered acceptable as a means of communication whereas on a normal website it is regarded as an impediment to usability. We need to be getting scholarly works into HTML so we can do more with them; meshing them with data and visualisations and delivering them to mobile devices.

While we wait for Google Wave to take over the world, what I’d like to see is a Word toolbar much like the ICE toolbar to support scholarly authoring but with better integration into Word than we have had the resources to make so far here in Toowoomba. It should let people create well structured documents which can be pushed to academic systems (journals, repositories and learning systems) and not just in PDF or Word format, but in some kind of formally specified Scholarly HTML. I think that idea had some support at our meeting, but Lee Dirks in particular pointed out that it would need to be done with reference to a stakeholder group who can help define and own this Scholarly HTML thing. I’d be interested in ideas on who these stakeholders might be:

  1. Publishers obviously, where MS Research have great contacts.

  2. Repository owners, particularly the discipline repositories like arXiv and PubMed Central.

  3. The eResearch community; I hope that I can get the Australian National Data Service (ANDS) interested in this stuff.

  4. The Electronic Thesis and Dissertation (ETD) movement. (My group is involved in this via our CAIRSS repository support service; the Australasian Digital Thesis program in Australia will come to CAIRSS at some point.)

  5. The eLearning community, maybe.

But actually, where this matters most is on the long tail:

  1. Thousands of small repositories and journals are stuck with paper-on-screen because that’s all their tools support.

  2. The small but growing group of users who want to do more with the versions of their documents they deposit in repositories.

I’d appreciate any thoughts about who might be interested in defining a scholarly profile of HTML; a few people told me they’re following these posts so please speak up in the comments.

We are not working together

I’d like to make it clear that while we had a good talk there is no project imminent between MS Research and ADFI; I think the discussion was reasonably encouraging that there might possibly be some room for collaboration.

First some ground rules about the kind of collaboration I think we should entertain with any commercial entity.

I think it would be fair that this works only in the latest version of Word, provided the documents it produced could be used in other editors, such as OpenOffice.org Writer or earlier versions of Word. I will quote an earlier post here:

In conclusion I offer this: I would consider getting our team working with Microsoft (actually I’m actively courting them as they are doing some good work in the eResearch space) but it would be on the basis that:

  • The product (eg a document) of the code must be interoperable with open software. In our case this means Word must produce stuff that can be used in and round tripped with OpenOffice.org and with earlier versions, and Mac versions of Microsoft’s products. (This is not as simple as it could be when we have to deal with stuff like Sun refusing to implement import and preservation for data stored in Word fields as used by applications like EndNote.)

  • The NLM add-in is an odd one here, as on one level it does qualify in that it spits out XML, but the intent is to create Word-only authoring so that rules it out; not that we have been asked to work on that project other than to comment, I am merely using it as an example.

  • The code must be open source and as portable as possible. Of course if it is interface code it will only work with Microsoft’s toll-access software but at least others can read the code and re-implement elsewhere. If it’s not interface code then it must be written in a portable language and/or framework.

http://ptsefton.com/2009/03/16/opening-up-microsoft.htm

A potential Add-in

So in a world where we did have an idea of what Scholarly HTML looks like what would a Scholarly HTML Word Add-in do?

The basic requirement is that it would allow scholars to:

  1. Write papers, books and theses as well as more ephemeral or less formal documents using a single interface, built on top of Word, working with its strengths rather than against its limitations.

    • It should be able to adapt itself to pre existing journal templates,

    • but also be able to help journals and institutions build useful, usable templates that will produce Scholarly HTML automatically and if necessary map to formats like NLM XML.

  2. Post that work to the web, including to content management sites, blogs, repositories and journal submission sites; everywhere, from within Word, where that makes sense (and it doesn’t always).

  3. Make documents which are as data-integrated and machine readable as possible, with the things I outlined for Scholarly HTML; citations, metadata, embedded semantics and linked visualizable data. The idea would be to encode all this stuff in a way that could be safely moved between systems and then to build web based plugins that sites could use to make the documents come alive.

Some of this stuff we have already looked at in ICE, obviously, but it would be good to look again at how we have done things; as Alex Wade said, we should think about requirements and leave implementation aside. So, in a departure from form, I’ll just leave the requirements above as-is and spare you my thoughts about implementation other than to comment that it would be interesting for MS Research to make their plugins work with the new Open Document Format support in Word via simple interop strategies like encoding information in URIs or in styles.
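For anyone wondering what ‘encoding information in URIs’ might look like in practice, here is a small sketch of the general idea, not a description of the ICE or MS Research code. The resolver address `http://example.org/link` and the query parameter names `predicate` and `object` are assumptions invented for this example. A tool that understands the convention can recover the relationship from the link, while any other editor just sees an ordinary hyperlink and leaves it alone.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical resolver service; any stable, agreed-upon base URL would do.
RESOLVER = "http://example.org/link"


def encode_link(predicate, obj):
    """Pack a relationship (predicate URI, object URI) into an ordinary-looking link."""
    return RESOLVER + "?" + urlencode({"predicate": predicate, "object": obj})


def decode_link(url):
    """Return (predicate, object) if the link follows the convention, else None."""
    if not url.startswith(RESOLVER + "?"):
        return None
    query = parse_qs(urlparse(url).query)
    try:
        return query["predicate"][0], query["object"][0]
    except KeyError:
        # Looks like a resolver link but does not carry both parts.
        return None


# Say that the linked text names the creator of the document.
link = encode_link("http://purl.org/dc/terms/creator", "http://ptsefton.com/")
print(link)
print(decode_link(link))
print(decode_link("http://example.org/some/other/page"))
```

Because the whole statement lives in the href, it survives a round trip through any word processor that preserves hyperlinks, which is the interoperability property at stake here.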


* I was unlucky enough to be stuck in Seattle for the Memorial Day long weekend, and suffered unrelenting sunshine and blue skies, instead of the hoped-for drizzle. This was exactly as Tom Robbins put it in Jitterbug Perfume:

With the absence of the cloud cover that normally caused the sky over Seattle to resemble cottage cheese that had been dragged nine miles behind a cement truck, the city, for the first time in memory, would have an unobstructed view of one of nature’s most mystical spectacles.

My hotel was right near the Seattle Center where the NW Folklife festival was going on. Sort of like taking the Woodford folk festival and dropping it in Brisbane’s Southbank. I’m not sure if it was one of the world’s most mystical spectacles but it was very inconvenient if you’re trying to walk through the park. Buskers kept blocking my path, such as the Black Death Allstars with their infectious but downright unpatriotic ‘This van is your van’ (I didn’t tell the travel office I had been exposed to the black death, but they didn’t want me to come back to work anyway). As you can understand, movement was so difficult that I got stuck there at the festival at times. I met some musicians, for example guitar god Yusuf Kilgore, who told me about a Bob Dylan tribute night at the Conor Byrne Pub, where people kept talking to me instead of politely ignoring me like they do in Australia. One lady even misquoted the above line about clouds, apparently in defence of Seattle’s weather. To be fair there was a duo there that did a great stripped back ukulele-extreme-Bossa-Nova version of Tambourine Man, and Yusuf’s playing was pretty good when he turned up, about 5 hours after he told me he was going to be on. I didn’t run into Tom Robbins as I was hoping, but Yusuf played at his birthday party and says he’s really old, so that’s something. To keep away from the swine flu virus and the Black Death Allstars and maintain the peak fitness required for my job at USQ I rented a bike from Classic Cycles and took myself on a tour of Bainbridge Island past all the rich people’s summer houses, at least 20 miles’ worth I reckon.

** I still think the Save As HTML format in Word 2000 could have been a round trippable XML format which would have been much more approachable for developers as it was HTML based. I wrote an article for XML.com back in 2004 about how to transform this format in and out of XML and I think that might still be a useful technique.

eResearch at ADFI as summary and potential projects

Wednesday, June 3rd, 2009

At the Australian Digital Futures Institute we work on eLearning and eResearch. In this post I will summarize where we are at with the latter, how I suspect it fits with the work that’s going on at the Australian National Data Service (ANDS) and where I think we could do more, in the form of a few project suggestions that might spark some debate if not some funding.

There are two software applications that form the backbone of our eResearch work. ICE is an established content management system which we are taking into the research realm and The Fascinator is a new bit of infrastructure designed to bridge the gap between the desktop (or the laptop, or the small lab) and the data commons; it is designed to bring repository services to the desktop, but it can also work at the server level. Over the next year I would expect these two systems to merge somewhat so that ICE content services are available as part of The Fascinator.

eResearch research themes

There are two broad themes to our research:

  1. Scholarly HTML, endorsed by John Wilbanks as a good tag for what we need to do to nudge the web towards semantics in his keynote at Open Repositories 2009: getting academic writing onto the web with as much inbuilt machine-readable meaning as possible. Under this heading we have:

    1. ICE is all about being able to make web pages from academic documents. That might sound like a problem that was solved a long time ago, but no, most research is still published in metaphorical paper format (that’s PDF). As I argue in my forthcoming paper for Serials Review (Peter Sefton, Towards Scholarly HTML, Serials Review (2009), doi:10.1016/j.serrev.2009.05.001; doesn’t resolve to anything yet), only the big publishers have the tools to make HTML from research articles as a matter of course.

      It is important to get HTML into our scholarly communications platforms because HTML is what the web is made of. If we want to have rich interactive documents with embedded semantics and linked data then that’s going to happen with HTML, not PDF, or Flash.

      With ICE we now have the basics sorted out; what’s needed is some work on ways to adapt journal and thesis templates, and support for LaTeX.

      Maybe we could do something with Microsoft Research along the lines of their article authoring add-in (more on my visit to Redmond soon).

      I know that this word processing / authoring stuff doesn’t always seem important to eResearch but I think it’s key, as articles and theses are the jumping off points for people to discover a lot of data. We need to work on researcher practice, probably starting with the newest ones.

    2. Related to ICE we have the work we have been slowly chipping away at: getting theses onto the web, not just as pretend paper (PDF) in a repository, but as real scholarly HTML that can play nicely with the semantic web. The ICE-TheOREM collaboration with Peter Murray-Rust’s group at Cambridge has wrapped up, but produced a really useful proof of concept for Scholarly HTML thesis publishing, presented by me and Jim Downing at OR09 a couple of weeks ago.

    3. Recent work on ways to encode document semantics in an interoperable way. I have made some progress on jamming semantic relationships into URLs so they can be used in all manner of authoring tools, and Duncan Dickinson had a go too. For example, Linda Octalina wrote some code that can take semantic stuff embedded in a .docx file using the MS Word Add-in and turn it into a plain-old link. The idea is not to litter the web with ugly links but to allow tools like Word or a repository to recognize when a link is encoding some semantics and let you do useful things with it while still allowing interop with tools which do not understand the magic links.

      I have now convinced myself that we could really give the semantic web a good kick-along if there was a trusted service which could construct links that encode metadata, citations, references to concepts, and data embedding. More below on a potential project.

  2. Closing the gap between the desktop and the data grid with our work on The Fascinator Desktop. This new, in-progress application indexes local files and watches the file system for changes, then exposes all your stuff to you via a web interface. The idea is to allow researchers to tag and classify their research materials then create virtual collections of material which will be routed to the right downstream repositories automatically. This ties in with the Seeding the Commons program:

    Program Aims

    • To improve the state of data capture and management across the research sector

    • To improve the fabric for data management and the amount of content in the data commons

    http://www.ands.org.au/repositories.html

    We’ve had feedback from all over the place, including inside ANDS, that the Fascinator Desktop’s ‘sucker upper’ functionality is a missing link in the eResearch software stack. There are obvious links to the ANDS work:

    1. The Register My Data service. Using the RIF-CS collection description language, can we get authors to label their collections and then see their data routed from the desktop to the data commons and become discoverable?

    2. The Identify My Data service. Can we get some real life researchers (including ourselves) assigning IDs to bits of data when we write about them? I have wanted to be able to do this with stuff like sample word processing files for ages; these are the data we work with a lot. Can we associate a paper with a bunch of sample data, identify it all and then make sure it is all deposited in a repository at the right time? Not to mention stuff like the 7GB virtual computer we created for ICE-TheOREM. Where can I put that?
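
The index-and-watch behaviour described above for The Fascinator Desktop can be illustrated with a minimal polling sketch. This is not how The Fascinator itself works (this post doesn’t go into that); the names `snapshot` and `diff_snapshots` are made up for illustration, and a real tool would more likely lean on operating-system file notifications than on polling.

```python
import os
from typing import Dict, List, Tuple

Snapshot = Dict[str, Tuple[float, int]]  # path -> (modification time, size)


def snapshot(root: str) -> Snapshot:
    """Walk `root` and record modification time and size for every file."""
    seen: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished between listing and stat
            seen[path] = (st.st_mtime, st.st_size)
    return seen


def diff_snapshots(old: Snapshot, new: Snapshot) -> Dict[str, List[str]]:
    """Compare two snapshots and report what an indexer would need to revisit."""
    return {
        "added": sorted(p for p in new if p not in old),
        "removed": sorted(p for p in old if p not in new),
        "changed": sorted(p for p in new if p in old and new[p] != old[p]),
    }
```

Take a snapshot, wait, take another, and hand the "added" and "changed" paths to whatever does the indexing.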

Some potential projects

Building on (1) our ongoing work on Scholarly HTML and associated services, and (2) closing the gap between the desktop and the data Commons, there are a few projects which I think might help the ANDS agenda.

Project: A real version of ontologize.me

The point of my toy site ontologize.me is to show how a service might be created where people can do semantic web stuff. I think this would solve several real problems in a simple way:

  • Let people embed metadata in a document that is produced using pretty much any authoring system. With the People Australia IDs now coming onstream this is becoming a practical reality.

    Here you go: I assert that I, Dr Petey, am the author of this here blog post.

    If I link to another Sefton, my sister Catherine, then I’m not asserting anything, I’m just linking. Click and you’ll see. And note how we have all these different forms of our names: she publishes as Cath1, even though she’s really a Catherine; I publish as Peter, but use Petie a lot, and the NLA has me as P. M. Sefton. The URI-as-identifier makes this no longer a problem.

    (Every time I mention this idea Bruce D’Arcus shows up on my blog and points me to RDFa. Yes, RDFa is a way to encode this stuff in a web page, and yes, we will do that so that it can be indexed, but it’s too hard and too fragile for authors, and flat out impossible to use in word-processor authoring workflows.)

  • Provide standard ways for people to link data into a publication (email, thesis, paper, blog post, Google Wave :-) in such a way that downstream sites can choose to do something useful with the link. Like, for example if you link to a sound file, do so in a way that makes it clear to downstream applications that that’s what you have done, so they can embed a player, or if you link to chemistry then they could embed a chemical viewer (that rotating 3d molecule trick is still very popular around here).

  • Allow you to mark up the content of a document to show what it’s about, like the Microsoft Word Ontology Add-in, but in a form that will work for everyone.

  • Provide some standard machine-readable ways to do citations using links.
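
As a rough illustration of the data-linking bullet above, a downstream site could look for a hint carried in the link itself and fall back to an ordinary link when it finds none. The `embed` query parameter and its values (`audio`, `molecule`) are invented for this sketch; they are not an existing convention, and the molecule element is only a placeholder for a real viewer.

```python
from html import escape
from urllib.parse import urlsplit, parse_qs


def render_link(url: str, text: str) -> str:
    """Return HTML for a link, embedding a viewer when the link asks for one."""
    hints = parse_qs(urlsplit(url).query).get("embed", [])
    hint = hints[0] if hints else None
    safe_url = escape(url, quote=True)
    if hint == "audio":
        return f'<audio controls src="{safe_url}"></audio>'
    if hint == "molecule":
        # Placeholder element: a real site would start its chemical viewer here.
        return f'<div class="molecule-viewer" data-src="{safe_url}"></div>'
    # Unknown or missing hint: behave like any tool that ignores the convention.
    return f'<a href="{safe_url}">{escape(text)}</a>'
```

The last branch is the important one: a tool that has never heard of the hint still ends up with a perfectly good link.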

I’m imagining a kind of site with a nice web wizard that lets you construct these links so you can drop them into a blog post or a paper or a wiki. The ANDS Identify my Data service, for example, could give you a wizard to not only identify the data but also say what you mean by it. And ANDS and/or People Australia could provide URIs not only for a person, but for that person in different roles: author, editor, subject etc. Duncan Dickinson points out that a usable interface between authors and ontologies is going to be really important. Most of them wouldn’t know an ontology if it crawled into their lap.
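
What such a wizard might emit can be sketched like this. Everything specific here is an assumption: the resolver address `http://ontologize.example/link` and the parameter names `rel` and `to` are placeholders, not the scheme ontologize.me actually uses. The only real identifier in the usage note is the Dublin Core `creator` term.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "http://ontologize.example/link"  # hypothetical resolver address


def make_link(predicate: str, target: str) -> str:
    """Encode 'this document --predicate--> target' as one ordinary URL."""
    return BASE + "?" + urlencode({"rel": predicate, "to": target})


def read_link(url: str) -> Optional[Tuple[str, str]]:
    """Return (predicate, target) for a semantic link, or None for a plain one."""
    parts = urlsplit(url)
    if f"{parts.scheme}://{parts.netloc}{parts.path}" != BASE:
        return None
    query = parse_qs(parts.query)
    if "rel" not in query or "to" not in query:
        return None
    return query["rel"][0], query["to"][0]
```

So `make_link("http://purl.org/dc/terms/creator", "http://example.org/people/petey")` gives you something you can paste into Word or a blog editor as a normal hyperlink, and `read_link` is what a repository or toolbar would run to get the assertion back out. A role such as author or editor could simply be a different predicate or a different person URI.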

I think that the ANDS crew might be working on some kind of online storehouse for ontologies and taxonomies and I reckon my extra wizard service would be a really simple but useful add-on. We would love to be able to build support for this into our word processing toolbars, our conversion services and into institutional repositories, not to mention The Fascinator, which could expose these semantically rich URIs for every bit of data on researchers’ local disks.

Project: Trying the ANDS RIF-CS collection metadata schema with real desktops

We have been working on The Fascinator desktop, with a broad range of users in mind, but we have one key user (not that she has the software yet but we’re getting there). That’s Leonie Jones who has loads of video, transcripts, geo data around the battle of Fire Support Base Coral. Leonie has made a film with this material but there is lots more that goes with her PhD thesis. I think we should look at how Leonie and colleagues in Chris Lee’s Public Memory Research Centre might be able to describe their collections in such a way that we can advertise them to the data Commons using the Registry Interchange Format – Collections and Services (RIF-CS). To pull this off we need ways for users to label data and state which bits can be released publicly and we will probably need a research-centre-level repository server which can do the talking to the ANDS systems.
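
The labelling step could start as simply as a list of items with a release flag, from which only the public ones get passed on. The class and field names below are invented for illustration and are not RIF-CS; mapping a description like this onto RIF-CS would be a separate job for the repository server.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    path: str
    label: str
    public: bool = False  # nothing is released unless the researcher says so


@dataclass
class Collection:
    title: str
    items: List[Item] = field(default_factory=list)

    def releasable(self) -> List[Item]:
        """Items the researcher has explicitly marked as public."""
        return [item for item in self.items if item.public]
```

Defaulting `public` to `False` is the design choice that matters: forgetting to label something should keep it private, not publish it.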

We’re hoping to build software that can be customized to work with any data set but which will generalize across disciplines.

Project: ASHT – Australasian Scholarly HTML Theses

Finally, something I mentioned in my post on what USQ could do to assert itself as an open access institution: theses. The Australasian Digital Thesis program (ADT) is now mature, and theses from across Australia routinely go onto the web, if only in PDF format. One of the best ways we could seed a data commons and influence researcher practice would be to start with the earliest of early career researchers. I’d love to see a project which took a few small cohorts of PhD, Masters and Honours students and gave them the kinds of tools we are developing to manage their stuff and to write about it, in a way that will bring on the semantic web. This is definitely under consideration at USQ, but it may be of interest to the broader community here in Australia. (USQ has some responsibilities to ADT via our hosting of CAIRSS but we are still working out what our role will be.)


1 I dare you to call her Cathy. Go on.