Reusing Standard RDF Vocabularies - Part 1
In Phase 1 of Project Omega at TNA (The National Archives) we evaluated several different existing models and vocabularies/ontologies to ascertain their suitability for expressing the data of TNA's new Pan-Archival Catalogue. We published a fairly comprehensive report of our findings and proposed a way forward: Catalogue Model Proposal.
In summary, we felt that none of the existing models was a perfect fit. We recognised that ICA's RiC (the International Council on Archives' Records in Contexts) was very promising but currently under-developed for TNA's needs. Ultimately, we felt that the approach taken in developing The Matterhorn RDF Data Model had a lot of strengths and that we would take a similar path.
We decided that the new Data Model for Project Omega would:
- attempt to adhere to the broader principles of RiC's Conceptual Model, but discard RiC's Ontology.
- follow the approach of The Matterhorn RDF Data Model, i.e. reuse existing vocabularies and NOT create our own.
We started with the model specified in Matterhorn and added properties and classes from other shared and standardised vocabularies as we needed them. The (work-in-progress) documentation of our data model is the Omega Catalogue Data Model.
Now that we are in Phase 2 of Project Omega and exporting data into this data model in the form of Turtle RDF, we are starting to revisit some of our initial assumptions about reuse.
Shared Language vs. Precision
The beauty of reusing existing vocabularies (assuming that you choose popular and standardised ones) is that any developer, data scientist, or user who has worked with RDF before can likely already understand and work with our data. For example, let's consider a simplified description of a Record:
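As a minimal sketch of such a description in Turtle (the record URI, the identifier value, and the description text are hypothetical placeholders, not real catalogue data):

```turtle
@prefix dct: <http://purl.org/dc/terms/> .

# Hypothetical record URI and values, for illustration only.
<http://example.org/record/1>
    dct:identifier  "EXAMPLE/1/2" ;
    dct:description "Correspondence and papers (placeholder text)." .
```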
DCT (Dublin Core Terms) is a vocabulary that has been around since 2008 and its use is ubiquitous. Even if the user was somehow not aware of Dublin Core, the naming of the terms is straightforward. As a human I can likely infer the meaning of dct:identifier as holding an identifier for the resource, and dct:description as holding a description of the resource. If you felt the need to, you could confirm your suspicions by checking the Dublin Core standard document itself; the point here, however, is that you didn't have to, as the meaning is already known or at least almost obvious. This is a major benefit of reusing popular vocabularies, as it both reduces the cognitive load for those working with the data and enables us to form and use a shared language even when working with vastly different datasets.
On the flip side, the disadvantage of reusing popular shared vocabularies is that they are often, by design, quite generic in their definitions. This is of course by necessity: common terms aimed at a wide audience need to be agreeable to that audience, and so generic and/or vaguely defined terms are more palatable.
Defining your own vocabulary has the absolute advantage of allowing you to precisely define your world-view and exactly what you mean. That's powerful stuff!
By way of contrast, consider the same record expressed in a (fictional) bespoke vocabulary:
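A sketch of how that might look, assuming a hypothetical namespace URI for the fictional tna: vocabulary and the same placeholder values as before:

```turtle
# Hypothetical namespace for the fictional bespoke vocabulary.
@prefix tna: <http://example.org/tna/vocab#> .

# Hypothetical record URI and values, for illustration only.
<http://example.org/record/1>
    tna:oci           "EXAMPLE/1/2" ;
    tna:scope-content "Correspondence and papers (placeholder text)." .
```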
If you work in the Archives sector you might well guess that tna:scope-content holds the Scope and Content of the record... but how many people outside of the Archives sector know exactly what is meant by the "Scope and Content" of a record? Even then, you likely wouldn't know the meaning of tna:oci! It's the Omega Catalogue Identifier, and awareness of that is not even organisation-wide throughout TNA yet.
We would of course write OWL and documentation to define exactly what tna:oci and tna:scope-content mean, but the user has to go and read those before they can work with the data.
The trade-off is ultimately: Ease of consumption through reuse of Shared Vocabularies vs. Precisely/Correctly expressing your domain and data.
When and how to trade-off?
In Omega our underlying principle is to always attempt reuse first. We are discovering, however, that sometimes there just isn't an appropriate Property or Class that can be reused from a popular standardised vocabulary.
By way of an example, let me explain a use-case that we recently had to solve for Project Omega. TNA's Catalogue currently contains Covering Dates for each Unit of Description (a single document or folder). These covering dates are the period of time during which the record(s) being described were created. They are expressed using between 1 and 3 values: the Date Text (as it appears on the unit of description), the First Date (the start of the period), and the Last Date (the end of the period).
Originally we had decided to use a property from a common vocabulary to express these, Dublin Core Terms (perhaps you know it!). The property we initially selected was dct:temporal. As I interpret the DCT (Dublin Core Terms) standard, it appears to me that dct:temporal is intended to describe the temporal coverage of the resource, i.e. the time period discussed/indicated within the resource, as opposed to the date that the resource was created. So after further consideration, we decided to use something else instead of dct:temporal, and this is where we had to start making trade-offs.
The options we considered:
- Use dct:created instead.
  Unfortunately dct:created is a Data Type Property and so requires a literal value, yet we need to store 3 literal values (Date Text, First Date, and Last Date). To achieve this we could either:
  a) Encode the 3 literal values into 1 literal value using ISO 8601-1, W3CDTF, EDTF, or DCMI Period. This has the downside that querying this with SPARQL becomes complex and requires various string split operations. For example, encoding using DCMI Period might produce the single literal string value: name=1941-1951; scheme=W3C-DTF; start=1941-01-01Z; end=1951-01-01Z.
  b) Ignore Dublin Core specifics here, and use an Object Property. We could somewhat enforce this approach with SHACL and documentation. However, those who are used to Dublin Core may be surprised; SPARQL queries for dct:created would be different in our system than in other systems. This negates the advantage of using a property from a shared standardised vocabulary!
- Use time:hasTime instead.
  This is a generic property from the W3C Time Ontology in OWL. It is an Object Property that allows us to express our covering dates exactly as we would need. Unfortunately time:hasTime only tells us that there is a time, not what that time represents. It is too generic and fails to adequately describe that these are the created dates of the records; dct:created would have been much more precise!
- Create our own vocabulary property.
  We have two main options of how to approach this:
  a) Define our own standalone property in our own vocabulary.
  b) If there is a property from a common vocabulary that is close to what we need, we can define our own property which is derived from that.
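To make the first two options more concrete, here is a rough sketch of how the same covering dates might be written under each. The record URIs are hypothetical, and the use of time:ProperInterval, time:hasBeginning, time:hasEnd, and time:inXSDDate from the W3C Time Ontology is one possible modelling choice rather than a description of what Project Omega settled on:

```turtle
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Option 1a: all three values packed into a single DCMI Period literal.
# Any query on the start or end date has to pick this string apart.
<http://example.org/record/1>
    dct:created "name=1941-1951; scheme=W3C-DTF; start=1941-01-01Z; end=1951-01-01Z" .

# Option 2: a structured interval that is easy to query, although
# time:hasTime does not say what the time represents.
<http://example.org/record/2>
    time:hasTime [
        a time:ProperInterval ;
        time:hasBeginning [ a time:Instant ; time:inXSDDate "1941-01-01Z"^^xsd:date ] ;
        time:hasEnd       [ a time:Instant ; time:inXSDDate "1951-01-01Z"^^xsd:date ]
    ] .
```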
Whilst dct:created implies (to a human) the meaning that we are looking for, it doesn't allow us to store the information we need. The time:hasTime property is the opposite: it lacks a sufficiently precise meaning, but allows great flexibility in how we store our covering dates. Therefore, as there is no readily suitable property from a common vocabulary, we have little choice but to create our own!
The property time:hasTime allows us to store the data we need, but lacks sufficient descriptive power. So rather than defining our own standalone property, we can instead derive our property from time:hasTime and add further descriptive information. Our new derived property will be tna:created and could look something like this:
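A minimal sketch of such a definition follows. The tna: namespace URI, the label, and the comment text are hypothetical; the rdae: namespace URI is assumed to be that of the RDA Registry's Expression properties; and rdfs:seeAlso is just one plausible way of pointing at related properties:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix time: <http://www.w3.org/2006/time#> .
# Assumed namespace URI for the rdae: prefix.
@prefix rdae: <http://rdaregistry.info/Elements/e/> .
# Hypothetical namespace URI for the tna: prefix.
@prefix tna:  <http://example.org/tna/vocab#> .

tna:created
    a owl:ObjectProperty ;
    # Derived from the generic W3C Time Ontology property.
    rdfs:subPropertyOf time:hasTime ;
    rdfs:label "created"@en ;
    rdfs:comment "The period of time during which the record(s) being described were created."@en ;
    # Pointers to related properties in other vocabularies.
    rdfs:seeAlso dct:created , rdae:P20214 .
```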
The above definition of tna:created declares it as a sub-property of time:hasTime but gives further information about its use, and also informs us that additional information can be found by looking at dct:created and rdae:P20214.
In practical use our earlier RDF augmented with our Covering Dates now finally looks something like:
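One way this could be written, as a sketch only: the record URI, the values, and the tna: namespace URI are hypothetical, and holding the Date Text in rdfs:label is an assumption made for illustration.

```turtle
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
# Hypothetical namespace URI for the tna: prefix.
@prefix tna:  <http://example.org/tna/vocab#> .

<http://example.org/record/1>
    dct:identifier  "EXAMPLE/1/2" ;
    dct:description "Correspondence and papers (placeholder text)." ;
    tna:created [
        a time:ProperInterval ;
        # Date Text (property choice is an assumption)
        rdfs:label "1941-1951" ;
        # First Date
        time:hasBeginning [ a time:Instant ; time:inXSDDate "1941-01-01Z"^^xsd:date ] ;
        # Last Date
        time:hasEnd [ a time:Instant ; time:inXSDDate "1951-01-01Z"^^xsd:date ]
    ] .
```

Because each of the three values sits in its own triple, a query can filter on the First Date or Last Date directly, without any string splitting.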
We still utilise dct:identifier and dct:description above, because they are a good fit for our reuse, whilst the new tna:created demonstrates our trade-off perfectly!
Earlier, I explained that we had wanted to reuse dct:created because it is a property that is widely used and understood, but that it was unsuitable for storing our covering dates (as it is specified as a Data Type Property).
As we could not find a suitable property from an existing popular vocabulary that we could reuse, we were forced to create our own. This property, tna:created, has two important design aspects:
- Its name is straightforward. Even someone from outside of the Archival sector can likely guess its meaning and purpose. It's unlikely that someone would have to look at our OWL definition or documentation to be able to start working with it. It's very much intentional that at a glance it looks a lot like dct:created.
- Whilst this property is TNA specific and explicates a precise meaning, it does not stand alone. Instead, it reuses the W3C Time Ontology in OWL (a popular vocabulary itself) by virtue of being derived from the time:hasTime property.
Conclusion
For Project Omega, we still prefer reuse wherever possible as it enables easier consumption by others. Creating a new Property or Class (even if derived from a common vocabulary) is sometimes unavoidable, but should be considered an absolute last resort and undertaken only when no common property exists, or when that property fails to adequately describe the data.
Hopefully this article has provided you with some insight into the challenges that arise when strictly trying to reuse existing vocabularies, and the trade-offs that may have to be made.
In Part 2, I work through a second use-case from Project Omega, and show a further example of where vocabulary reuse can be challenging.