You won't believe this shocking semantic web trick I use to avoid publishing my own ontologies! Will I end up going to hell for this?
2020-03-24
[Update - as soon as this went live I spotted an error in the final example and fixed it].
In this post I describe a disgusting, filthy, but possibly beautiful hack* I devised to get around a common problem in data description using semantic web techniques, specifically JSON-LD and schema.org. How can we allow people who don't happen to be Semantic Web über-geeks to define their own vocabularies when they need to go beyond common vocabularies like schema.org?
* You tell me - beautiful or evil?
Jump to the spoiler at end - actually there are two hacks
For the last few years most of the posts on this blog have been presentations I've given at conferences; for example, there's a series of posts on RO-Crate, the most recent of which was from eResearch Australasia. RO-Crate is a specification for describing and packaging research data (could be any data, really, but the main use cases that drove development come from research).
RO-Crate uses JSON-LD as its main metadata format, with vocabulary terms which
mostly come from Schema.org: this makes it reasonably
easy for developers to write tools to generate good-quality low-ambiguity metadata in an
extensible way. I'm not going to do a full JSON-LD tutorial, but to give you an idea,
RO-Crate JSON-LD looks like this:
{
"@id": "./",
"@type": "Dataset",
"name": "My dataset"
}
This is easy to work with because it's just JSON, with a trick up its sleeve - the keys in the JSON object, such as name, are defined in a 'context'. At its simplest, the context is just a lookup between a key and a URI - in this case the context is defined like this:
{
"@context": "https://w3id.org/ro/crate/1.0/context"
}
And in the JSON document you get from https://w3id.org/ro/crate/1.0/context, among a few hundred other properties is:
"name": "http://schema.org/name",
If you go to http://schema.org/name you can read a definition of the name property.
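To make the lookup idea concrete, here is a minimal Python sketch. The `context` dictionary holds only the single entry quoted above, not the few hundred in the real RO-Crate context, and `term_uris` is a made-up helper name, not part of any JSON-LD library.

```python
# One entry copied from the RO-Crate context quoted above; the real
# context document has a few hundred more.
context = {"name": "http://schema.org/name"}

# The example item from earlier in the post.
item = {"@id": "./", "@type": "Dataset", "name": "My dataset"}


def term_uris(item, context):
    """Map each plain key in a JSON-LD item to the URI the context gives it.

    Keys starting with "@" are JSON-LD keywords rather than vocabulary
    terms, so they are skipped. Keys missing from the context map to None.
    """
    return {key: context.get(key) for key in item if not key.startswith("@")}


print(term_uris(item, context))
```

A key that is not in the context comes back as `None`, which is exactly the situation the rest of this post is about.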
Having definitions is important. Let's take the metadata term "title". In Dublin Core, title is the name of a resource; in FOAF, it's a title as in Mr, Mrs, Dr, Reverend etc; in schema.org, title is a job title.
With all these terms I can use a URI to get to a human-readable page to read about the term, but here's our problem: not everyone has the resources to define metadata terms by making an ontology and hosting it somewhere online.
So, what to do when you want to describe something and provide some definitions, but there's no obvious ontology to hand?
Let's look at an example. Dr Alana Piper at UTS has some criminal history data, which includes transcriptions of prison records - she sent me a spreadsheet with data on about 2600 prison records in PDF format.
Some of the variables in Alana's data were easy to map to schema.org vocabulary, like name and birthDate, but some others are not defined in a handy online ontology, like sentence and offence. You can see some sample data of Alana's here in an RO-Crate.
RO-Crates come with web-previews - this is a bit of data that refers to a sentence for one Nora Abbot for the offence of Vagrancy.
I like JSON-LD in general, but if you don't define mappings for the keys you're using, then when you use JSON-LD software to process the files the undefined keys and their values disappear from your document, which is not user or developer friendly. I don't like that at all - nobody expects their data to be discarded when a self-important, opinionated software library feels like it.
And more annoyingly, if you have an ad-hoc vocabulary there's no way to define that in your JSON-LD file or even the data that ships with it. Context keys MUST map to complete URLs.
There's a workaround. You can use a catch-all @vocab key in the @context for your JSON-LD which points to a URI so that any undefined terms get forced into a particular vocabulary. Schema.org does this - so if you use that context then you can use whatever terms you like and JSON-LD processors won't swallow your data, BUT that's cheating and it's not useful as you don't get real URIs that can be resolved to read a definition of the term. You get FAKE URIs.
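The difference between "undefined keys vanish" and "@vocab rescues them with a fake URI" can be shown with a toy model in Python. This is a deliberately simplified stand-in for what a JSON-LD processor does when it expands a document, not the real algorithm (which also handles compact IRIs, nested contexts and much more). The function name `expand_keys` and the `#record1` identifier are made up for illustration.

```python
def expand_keys(item, context):
    """Toy model of JSON-LD key expansion for a single flat object.

    - JSON-LD keywords (keys starting with "@") pass through untouched.
    - Keys defined in the context are replaced by their URI.
    - Undefined keys are prefixed with @vocab if the context has one;
      otherwise they are silently dropped, along with their values.
    """
    vocab = context.get("@vocab")
    expanded = {}
    for key, value in item.items():
        if key.startswith("@"):
            expanded[key] = value
        elif key in context:
            expanded[context[key]] = value
        elif vocab:
            expanded[vocab + key] = value
        # no mapping and no @vocab: the key and its value are lost
    return expanded


record = {"@id": "#record1", "name": "Nora Abbot", "offence": "Vagrancy"}

strict = {"name": "http://schema.org/name"}
catch_all = {"@vocab": "http://schema.org/", "name": "http://schema.org/name"}

print(expand_keys(record, strict))     # "offence" is gone
print(expand_keys(record, catch_all))  # "offence" survives under a made-up URI
```

With the catch-all context the offence survives as `http://schema.org/offence`, a URI that looks real but does not lead to a definition.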
Here's a screenshot of what the initial RO-Crate I made with Alana's data looked like - it shows that some of the terms (like name, startTime and location) are defined - these have a question-mark link beside them you can follow to read the definition. But a couple of others (sentence and offence) didn't have definitions - cos while they do map to a URI, the URI is a FAKE and it's not listed in the official RO-Crate @context.
I've been thinking about this a lot, and looking for approaches to publishing light-weight semantic web vocabs (about which I found pretty much nothing) and eventually I came across some very interesting work from ten years ago, which looked at how to encode semantic statements into a URL for use in content authoring systems that don't allow entry of linked data directly. The solution was to encode semantic web stuff like this-document has-author https://orcid.org/0000-0002-3545-944X into a URL. Hey, that idea could be adapted to this situation.
Anyway, what I came up with this decade was the idea of coding the entire definition for a property into a URL so we can put that URL into a local context. Stupid? Probably. Naughty Fun that's likely to get disapproving looks from computer scientists and semantic-web purists? Certainly.
Ok, so what does a definition for a property look like? We can ask the schema.org server about that from the command line:
curl -L -H "Accept: application/ld+json" schema.org/name
If we do that, we get some JSON-LD, which I've pruned a bit here:
{
"@context": {
"schema": "http://schema.org/",
...
"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
"rdfa": "http://www.w3.org/ns/rdfa#",
"rdfs": "http://www.w3.org/2000/01/rdf-schema#",
"schema": "http://schema.org/",
...
},
"@id": "schema:name",
"@type": "rdf:Property",
...
"rdfs:comment": "The name of the item.",
"rdfs:label": "name",
...
}
Seems to me the MVD (that's "Minimum Viable Document") for defining a property is an @id, rdfs:label and the rdfs:comment, so I threw together a simple single-page web thing on the examples bit of our repository server at work that would decode those out of a URL and, well, just show them to you (and I linked to it via the venerable PURL service so it doesn't need a domain name).
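A rough sketch of the decoding step, in Python rather than whatever the real page runs. It assumes the definition travels in the URL's query string under the parameter names @type, rdfs:label and rdfs:comment; the last of those is a guess at the parameter name, and `decode_definition` is a hypothetical helper, not the actual code behind the page.

```python
from urllib.parse import parse_qs, urlsplit


def decode_definition(url):
    """Pull a minimal property definition out of a URL's query string.

    Returns the three pieces of the "Minimum Viable Document" plus the
    type. Any parameter that is absent from the URL comes back as None.
    """
    query = parse_qs(urlsplit(url).query)

    def first(key):
        # parse_qs returns a list per key; take the first value if any
        return query.get(key, [None])[0]

    return {
        "@id": url,
        "@type": first("@type"),
        "rdfs:label": first("rdfs:label"),
        "rdfs:comment": first("rdfs:comment"),  # assumed parameter name
    }


definition = decode_definition(
    "http://purl.org/adhoc?@type=rdf:Property&rdfs:label=sentence"
    "&rdfs:comment=Penalty%20imposed%20by%20court"
)
print(definition["rdfs:label"], "-", definition["rdfs:comment"])
```

The page that displays the definition would then only need to render those values as HTML.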
Hack 1
So, my first filthy web hack was to change the code I used to generate the RO-Crate of Alana's data - I fed it some extra config with definitions of her metadata terms, then set up a super simple one-page web app which acts like it's the documentation for an ontology, but actually you supply your own documentation, in the form of a link, like this:
Follow that and you get a page something like this (it will change and may be removed by the internet police):
Property: sentence
@id: http://purl.org/adhoc?@type=rdf:Property&rdfs:label=sentence
Label: sentence
Description:
Penalty imposed by court for criminal conviction. As the data is drawn from prison records, this will usually consist of a specified term of imprisonment and the type of imprisonment conditions, e.g. 6 months hard labour. However, during this historical period it was common for persons convicted of a minor offence to be sentenced to a fine 'with the option' of a prison sentence if they were unable to pay it. After the introduction in Victoria of the Indeterminate Sentences Act in 1907, prisoners who had been declared 'habitual criminals' could also receive an indefinite sentence that meant they were imprisoned until the government authorities determined that they had sufficiently reformed. Some prisoners also faced additional penalties in addition to their prison sentence, such as periods of solitary confinement, in irons or whippings.
To use this property use this text Copy to clipboard <...>
See what I did there? Got around the limitations of JSON-LD and its (I think harmful) insistence that context terms must resolve to URLs, and supplied a self-documenting URL which, at least in the context of RO-Crate, will allow a user to see something useful when they view the data.
(If JSON-LD is Linked Data encoded in JSON then this must be URL-LD, linked data encoded in URLs - or is it URI-LD?)
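Going the other way, generating such a URL for a local context entry might look something like the Python sketch below. The base URL is the one shown in the @id above; the rdfs:comment parameter name and the helper `adhoc_context_entry` are assumptions made for illustration, not the code actually used to build the crate.

```python
from urllib.parse import quote, urlencode

ADHOC_BASE = "http://purl.org/adhoc"  # base URL as shown in the @id above


def adhoc_context_entry(label, comment, base=ADHOC_BASE):
    """Build a one-term JSON-LD context whose URL carries its own definition."""
    params = {
        "@type": "rdf:Property",
        "rdfs:label": label,
        "rdfs:comment": comment,  # assumed parameter name
    }
    # Keep "@" and ":" readable in the URL; everything else unsafe
    # (spaces, quotes, ampersands) gets percent-encoded.
    url = base + "?" + urlencode(params, quote_via=quote, safe="@:")
    return {label: url}


entry = adhoc_context_entry("sentence", "Penalty imposed by court")
print(entry["sentence"])
```

The resulting dictionary can be appended to the @context array of the crate's metadata alongside the standard RO-Crate context.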
Hack 2
Having done this work, and actually put up that web page to illustrate it, I then came up with what might be a more elegant solution to the actual problem at hand, which is shipping usable definitions of ad-hoc terms around in an RO-Crate Metadata File. The trick is similar to something we already do in RO-Crate to make metadata as useful as possible. The thing is, some URIs in the semantic web world don't actually resolve to anything usable by most humans, which means that in the RO-Crate Website that can accompany a crate the explanatory links are not helpful. So we came up with a way to provide links that are useful by adding an item to the RO-Crate metadata that specifies a more useful link using the sameAs property.
The example in the RO-Crate spec uses the BIBO interviewee property. Its URL does not resolve to a useful page (that used to be because it went to an RDF file not a web page, but is doubly so at time of writing because it resolves to an error page at purl.org).
{
"@context": [
"https://w3id.org/ro/crate/1.0/context",
{"interviewee": "http://purl.org/ontology/bibo/interviewee"}
],
"@graph": [
{
"@id": "http://purl.org/ontology/bibo/interviewee",
"sameAs": "http://neologism.ecs.soton.ac.uk/bibo.html#interviewee",
"@type": "Thing"
}
]
}
The above offers a more useful alternative URL and the code that generates the HTML summary of the RO-Crate can use that to provide a gloss [?]. But what if we actually also include the definition?
Instead of the above outrageously big URL with all the info to define sentence in it, we could add this to our metadata, with a made-up URL for the term and an on-board rdf:Property to define it:
{
"@context": [
"https://w3id.org/ro/crate/1.0/context",
{"sentence": "http://example.com/criminal-characters/sentence"}
],
"@graph": [
{
"@id": "http://example.com/criminal-characters/sentence",
"@type": "rdf:Property",
"rdfs:label": "sentence",
"rdfs:comment": "Penalty imposed by court for criminal conviction. As the data is drawn from prison records, this will usually consist of a specified term of imprisonment and the type of imprisonment conditions, e.g. 6 months hard labour. However, during this historical period it was common for persons convicted of a minor offence to be sentenced to a fine 'with the option' of a prison sentence if they were unable to pay it. After the introduction in Victoria of the Indeterminate Sentences Act in 1907, prisoners who had been declared 'habitual criminals' could also receive an indefinite sentence that meant they were imprisoned until the government authorities determined that they had sufficiently reformed. Some prisoners also faced additional penalties in addition to their prison sentence, such as periods of solitary confinement, in irons or whippings."
}
]
}
I could then hack the RO-Crate web viewer to use the label and comment supplied here.
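What that viewer hack might involve, sketched in Python: resolve the term through whatever inline context objects the crate carries, then look for an entity in the @graph with that URI as its @id. This is only an illustration under assumptions. `local_definition` is a hypothetical function, not part of any RO-Crate tool; it ignores remote contexts and expanded term definitions, and the sample comment is shortened to the first sentence of the real one.

```python
def local_definition(term, crate):
    """Find an on-board definition for a term inside the crate itself.

    Returns a dict with the term's URI, label, comment and sameAs link
    (any of which may be None), or None if the term is not mapped in an
    inline context or no entity in the @graph describes it.
    """
    contexts = crate.get("@context", [])
    if not isinstance(contexts, list):
        contexts = [contexts]

    uri = None
    for ctx in contexts:
        # String entries are remote contexts; only inline objects are checked.
        if isinstance(ctx, dict) and isinstance(ctx.get(term), str):
            uri = ctx[term]
    if uri is None:
        return None

    for entity in crate.get("@graph", []):
        if entity.get("@id") == uri:
            return {
                "uri": uri,
                "label": entity.get("rdfs:label"),
                "comment": entity.get("rdfs:comment"),
                "sameAs": entity.get("sameAs"),
            }
    return None


crate = {
    "@context": [
        "https://w3id.org/ro/crate/1.0/context",
        {"sentence": "http://example.com/criminal-characters/sentence"},
    ],
    "@graph": [
        {
            "@id": "http://example.com/criminal-characters/sentence",
            "@type": "rdf:Property",
            "rdfs:label": "sentence",
            "rdfs:comment": "Penalty imposed by court for criminal conviction.",
        }
    ],
}

print(local_definition("sentence", crate))
```

The same lookup returns the sameAs link for terms like interviewee, so one code path could serve both the existing trick and the new one.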
So - what do you think?
a. Hack 1? b. Hack 2? c. Both? d. Neither?
Will I go blind?
I think Hack 2 will work but Hack 1 is funnier.