
You won't believe this shocking semantic web trick I use to avoid publishing my own ontologies! Will I end up going to hell for this?

2020-03-24

[Update - as soon as this went live I spotted an error in the final example and fixed it].

In this post I describe a disgusting, filthy, but possibly beautiful hack* I devised to get around a common problem in data description using semantic web techniques, specifically JSON-LD and schema.org. How can we let people who don't happen to be Semantic Web über-geeks define their own vocabularies when they need to go beyond common vocabularies like schema.org?

* You tell me - beautiful or evil?

Jump to the spoiler at the end - actually there are two hacks

For the last few years most of the posts on this blog have been presentations I've given at conferences; for example, there's a series of posts on RO-Crate, the most recent of which was from eResearch Australasia. RO-Crate is a specification for describing and packaging research data (could be any data, really, but the main use cases that drove development come from research).

RO-Crate uses JSON-LD as its main metadata format, with vocabulary terms which mostly come from Schema.org: this makes it reasonably easy for developers to write tools to generate good-quality low-ambiguity metadata in an extensible way. I'm not going to do a full JSON-LD tutorial, but to give you an idea, RO-Crate JSON-LD looks like this:

{
    "@id": "./",
    "@type": "Dataset",
    "name": "My dataset"
}

This is easy to work with because it's just JSON, with a trick up its sleeve - the keys in the JSON object, such as name, are defined in a 'context'. At its simplest, the context is just a lookup between a key and a URI - in this case the context is defined like this:

{
    "@context": "https://w3id.org/ro/crate/1.0/context",
    
}

And in the JSON document you get from https://w3id.org/ro/crate/1.0/context, among a few hundred other properties is:

     "name": "http://schema.org/name",

If you go to http://schema.org/name you can read a definition of the name.

Having definitions is important. Let's take the metadata term "title". In Dublin Core, title is the name of a resource; in FOAF, it's a title as in Mr, Mrs, Dr, Reverend etc.; in schema.org, title is a job title.
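Here's a rough Python sketch of that lookup idea, showing why the URI matters more than the key. The three toy contexts below are made up for illustration (they are not the contexts those vocabularies actually publish), though the term URIs themselves are the real ones.

```python
# Toy contexts: each maps the short key "title" to a different term URI.
# These dicts are illustrative only, not the real published contexts.
CONTEXTS = {
    "Dublin Core": {"title": "http://purl.org/dc/terms/title"},
    "FOAF": {"title": "http://xmlns.com/foaf/0.1/title"},
    "schema.org": {"title": "http://schema.org/title"},
}


def resolve(key, context):
    """Look a JSON key up in a context and return the URI it stands for."""
    return context.get(key)


for vocabulary, context in CONTEXTS.items():
    print(vocabulary, "->", resolve("title", context))
```

Same key, three different URIs, three different meanings. A key with no entry in the context simply has no URI, and so no definition to go and read.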

With all these terms I can use a URI to get to a human-readable page to read about the term, but here's our problem: not everyone has the resources to define metadata terms by making an ontology and hosting it somewhere online.

So, what to do when you want to describe something, and provide some definitions but there's no obvious ontology to hand?

Let's look at an example. Dr Alana Piper at UTS has some criminal history data, which includes transcriptions of prison records - she sent me a spreadsheet with data on about 2600 prison records in PDF format.

Some of the variables in Alana's data were easy to map to schema.org vocabulary, like name and birthDate, but some others are not defined in a handy online ontology, like sentence and offence. You can see some sample data of Alana's here in an RO-Crate. RO-Crates come with web-previews - this is a bit of data that refers to a sentence for one Nora Abbot for the offence of Vagrancy.

I like JSON-LD in general, but if you don't define mappings for the keys you're using, then when you use JSON-LD software to process the files the undefined keys and their values disappear from your document, which is not user or developer friendly. I don't like that at all - nobody expects their data to be discarded when a self-important, opinionated software library feels like it.
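To show what that disappearing act looks like, here's a deliberately crude Python imitation of what happens to keys during JSON-LD processing. It is not a real JSON-LD processor (those do far more), and the function name and the "#record1" identifier are made up for this example.

```python
def expand_node(node, context):
    """Crude imitation of JSON-LD key handling for one flat object.

    Keywords (keys starting with "@") are kept, keys found in the context
    are replaced by their URI, and everything else is dropped.
    """
    expanded = {}
    for key, value in node.items():
        if key.startswith("@"):
            expanded[key] = value
        elif key in context:
            expanded[context[key]] = value
        # Any other key has no mapping, so it is silently left out.
    return expanded


context = {"name": "http://schema.org/name"}
record = {"@id": "#record1", "name": "Nora Abbot", "offence": "Vagrancy"}

result = expand_node(record, context)
print(result)
```

The name survives under its full URI, but the offence key and its value are gone from the result, with no warning.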

And more annoyingly, if you have an ad-hoc vocabulary there's no way to define that in your JSON-LD file or even the data that ships with it. Context keys MUST map to complete URLs.

There's a workaround. You can use a catch-all @vocab key in the @context for your JSON-LD which points to a URI, so that any undefined terms get forced into a particular vocabulary. Schema.org does this - so if you use that context then you can use whatever terms you like and JSON-LD processors won't swallow your data, BUT that's cheating and it's not useful, as you don't get real URIs that can be resolved to read a definition of the term. You get FAKE URIs.
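A small Python sketch of how the @vocab fallback manufactures those URIs. Again this is a simplification rather than the real JSON-LD algorithm, and the function name is made up.

```python
def expand_term(term, context):
    """Return the URI for a term: an explicit mapping wins, otherwise
    fall back to gluing the term onto @vocab (if the context has one)."""
    if term != "@vocab" and term in context:
        return context[term]
    vocab = context.get("@vocab")
    return vocab + term if vocab else None


with_vocab = {"@vocab": "http://schema.org/", "name": "http://schema.org/name"}
without_vocab = {"name": "http://schema.org/name"}

print(expand_term("name", with_vocab))
print(expand_term("offence", with_vocab))
print(expand_term("offence", without_vocab))
```

With @vocab the undefined term gets a URI that looks just as official as the one for name, but nobody ever wrote a definition to go with it.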

Here's a screenshot of what the initial RO-Crate I made with Alana's data looked like - it shows that some of the terms (like name, startTime and location) are defined - these have a question-mark link beside them you can follow to read the definition. But a couple of others (sentence and offence) didn't have definitions - cos while they do map to a URI, the URI is a FAKE and it's not listed in the official RO-Crate @context.

Screen capture of metadata in RO-CRATE, described above

I've been thinking about this a lot, and looking for approaches to publishing light-weight semantic web vocabs (about which I found pretty much nothing) and eventually I came across some very interesting work from ten years ago, which looked at how to encode semantic statements into a URL for use in content authoring systems that don't allow entry of linked data directly. The solution was to encode semantic web stuff like this-document has-author https://orcid.org/0000-0002-3545-944X into a URL. Hey, that idea could be adapted to this situation.

Anyway, what I came up with this decade was the idea of coding the entire definition for a property into a URL so we can put that URL into a local context. Stupid? Probably. Naughty fun that's likely to get disapproving looks from computer scientists and semantic-web purists? Certainly.

Ok, so what does a definition for a property look like? We can ask the schema.org server about that from the command line:

curl -L -H "Accept: application/ld+json" schema.org/name

If we do that, we get some JSON-LD, which I've pruned a bit here:

{
  "@context": {
    "schema": "http://schema.org/",
    ...
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfa": "http://www.w3.org/ns/rdfa#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    ...
  },
  "@id": "schema:name",
  "@type": "rdf:Property",
  
  ...
  "rdfs:comment": "The name of the item.",
  "rdfs:label": "name",
   ...
}

Seems to me the MVD (that's "Minimum Viable Document") for defining a property is an @id, an rdfs:label and an rdfs:comment, so I threw together a simple single-page web thing on the examples bit of our repository server at work that would decode those out of a URL and, well, just show them to you (and I linked to it via the venerable PURL service so it doesn't need a domain name).

Hack 1

So, my first filthy web hack was to change the code I used to generate the RO-Crate of Alana's data - I fed it some extra config with definitions of her metadata terms, then set up a super simple one-page web app which acts like it's the documentation for an ontology, but actually you supply your own documentation, in the form of a link, like this:

http://purl.org/adhoc?@type=rdf:Property&rdfs:label=sentence&rdfs:comment=Penalty imposed by court for criminal conviction. As the data is drawn from prison records, this will usually consist of a specified term of imprisonment and the type of imprisonment conditions, e.g. 6 months hard labour. However, during this historical period it was common for persons convicted of a minor offence to be sentenced to a fine 'with the option' of a prison sentence if they were unable to pay it. After the introduction in Victoria of the Indeterminate Sentences Act in 1907, prisoners who had been declared 'habitual criminals' could also receive an indefinite sentence that meant they were imprisoned until the government authorities determined that they had sufficiently reformed. Some prisoners also faced additional penalties in addition to their prison sentence, such as periods of solitary confinement, in irons or whippings.&@id=http%3A%2F%2Fpurl.org%2Fadhoc%3F%40type%3Drdf%3AProperty%26rdfs%3Alabel%3Dsentence

Follow that and you get a page something like this (it will change and may be removed by the internet police):

Property: sentence

@id: http://purl.org/adhoc?@type=rdf:Property&rdfs:label=sentence

Label: sentence

Description:

Penalty imposed by court for criminal conviction. As the data is drawn from prison records, this will usually consist of a specified term of imprisonment and the type of imprisonment conditions, e.g. 6 months hard labour. However, during this historical period it was common for persons convicted of a minor offence to be sentenced to a fine 'with the option' of a prison sentence if they were unable to pay it. After the introduction in Victoria of the Indeterminate Sentences Act in 1907, prisoners who had been declared 'habitual criminals' could also receive an indefinite sentence that meant they were imprisoned until the government authorities determined that they had sufficiently reformed. Some prisoners also faced additional penalties in addition to their prison sentence, such as periods of solitary confinement, in irons or whippings.

To use this property use this text Copy to clipboard <...>

See what I did there? I got around the limitations of JSON-LD and its (I think harmful) insistence that context terms must resolve to URLs, and supplied a self-documenting URL which, at least in the context of RO-Crate, will allow a user to see something useful when they view the data.

(If JSON-LD is Linked Data encoded in JSON then this must be URL-LD, linked data encoded in URLs - or is it URI-LD?)
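For the curious, here's roughly how Hack 1's encoding and decoding could be done in Python with the standard library. This is a sketch, not the code behind the page above: the function names are made up, and the choice to leave "@" and ":" unescaped is just to keep the result looking like the example URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://purl.org/adhoc"  # the address used in the example URL above


def encode_definition(label, comment, base=BASE):
    """Build a self-describing URL that carries a property definition."""
    core = {"@type": "rdf:Property", "rdfs:label": label}
    # The @id is the short form of the URL: type and label, no comment.
    term_id = base + "?" + urlencode(core, safe="@:")
    full = {**core, "rdfs:comment": comment, "@id": term_id}
    return base + "?" + urlencode(full, safe="@:")


def decode_definition(url):
    """Pull the definition fields back out of such a URL."""
    fields = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in fields.items()}


url = encode_definition(
    "sentence", "Penalty imposed by court for criminal conviction."
)
definition = decode_definition(url)
print(url)
print(definition["rdfs:label"], "-", definition["rdfs:comment"])
```

Anything that can parse a query string can recover the label, the comment and the short @id without needing an ontology server.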

Hack 2

Having done this work, and actually put up that web page to illustrate it, I then came up with what might be a more elegant solution to the actual problem at hand, which is shipping usable definitions of ad-hoc terms around in an RO-Crate Metadata File. The trick is similar to something we already do in RO-Crate to make metadata as useful as possible. The thing is, some URIs in the semantic web world don't actually resolve to anything usable by most humans, which means that in the RO-Crate Website that can accompany a crate the explanatory links are not helpful. So we came up with a way to provide links that are useful, by adding an item to the RO-Crate metadata that specifies a more useful link using the sameAs property.

The example in the RO-Crate spec uses the BIBO interviewee property. Its URL does not resolve to a useful page (that used to be because it went to an RDF file not a web page, but is doubly so at time of writing because it resolves to an error page at purl.org).

{
  "@context": [
    "https://w3id.org/ro/crate/1.0/context",
    {"interviewee": "http://purl.org/ontology/bibo/interviewee"}
  ],
  "@graph": [
    {
      "@id": "http://purl.org/ontology/bibo/interviewee",
      "sameAs": "http://neologism.ecs.soton.ac.uk/bibo.html#interviewee",
      "@type": "Thing"
    }
  ]
}

The above offers a more useful alternative URL, and the code that generates the HTML summary of the RO-Crate can use that to provide a gloss [?]. But what if we actually also include the definition?

Instead of the above outrageously big URL with all the info to define sentence in it, we could add this to our metadata, with a made-up URL for the term and an on-board rdf:Property to define it:

{
  "@context": [
    "https://w3id.org/ro/crate/1.0/context",
    {"sentence": "http://example.com/criminal-characters/sentence"}
  ],
  "@graph": [
    {
      "@id": "http://example.com/criminal-characters/sentence",
      "@type": "rdf:Property",
      "rdfs:label": "sentence",
      "rdfs:comment": "Penalty imposed by court for criminal conviction. As the data is drawn from prison records, this will usually consist of a specified term of imprisonment and the type of imprisonment conditions, e.g. 6 months hard labour. However, during this historical period it was common for persons convicted of a minor offence to be sentenced to a fine 'with the option' of a prison sentence if they were unable to pay it. After the introduction in Victoria of the Indeterminate Sentences Act in 1907, prisoners who had been declared 'habitual criminals' could also receive an indefinite sentence that meant they were imprisoned until the government authorities determined that they had sufficiently reformed. Some prisoners also faced additional penalties in addition to their prison sentence, such as periods of solitary confinement, in irons or whippings."
    }
  ]
}

I could then hack the RO-Crate web viewer to use the label and comment supplied here.
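Something like the following Python sketch is what I have in mind for that lookup. It is not the actual RO-Crate viewer code: the function name and the shape of the returned dict are made up, and it only looks at term mappings written inline in the crate's own @context.

```python
def describe_term(term, crate):
    """Find what the crate itself says about a context term.

    Returns the term's URI plus any on-board label and comment, and a link
    that prefers a sameAs alternative over the (possibly useless) URI.
    """
    contexts = crate.get("@context", [])
    if not isinstance(contexts, list):
        contexts = [contexts]
    uri = None
    for context in contexts:
        # Only inline context objects can be searched without fetching.
        if isinstance(context, dict) and term in context:
            uri = context[term]
    if uri is None:
        return None
    graph = crate.get("@graph", [])
    entity = next((item for item in graph if item.get("@id") == uri), {})
    return {
        "uri": uri,
        "label": entity.get("rdfs:label", term),
        "comment": entity.get("rdfs:comment"),
        "link": entity.get("sameAs", uri),
    }


crate = {
    "@context": [
        "https://w3id.org/ro/crate/1.0/context",
        {
            "sentence": "http://example.com/criminal-characters/sentence",
            "interviewee": "http://purl.org/ontology/bibo/interviewee",
        },
    ],
    "@graph": [
        {
            "@id": "http://example.com/criminal-characters/sentence",
            "@type": "rdf:Property",
            "rdfs:label": "sentence",
            "rdfs:comment": "Penalty imposed by court for criminal conviction.",
        },
        {
            "@id": "http://purl.org/ontology/bibo/interviewee",
            "@type": "Thing",
            "sameAs": "http://neologism.ecs.soton.ac.uk/bibo.html#interviewee",
        },
    ],
}

print(describe_term("sentence", crate))
print(describe_term("interviewee", crate))
```

The viewer could show the comment as the gloss when there is one, and otherwise fall back to the sameAs link, so both tricks can live side by side in the same crate.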

So - what do you think?

a. Hack 1? b. Hack 2? c. Both? d. Neither?

Will I go blind?

I think Hack 2 will work but Hack 1 is funnier.