In Part 1 we looked at the challenges of strictly sticking to a policy of reusing existing vocabularies within Project Omega at TNA (The National Archives) and why you may occasionally have to make concessions to correctly express your data.
The example discussed in Part 1 concerned the scenario where one cannot find a suitable property in an existing popular standardised vocabulary. In this shorter article, we will look at a second scenario where a reusable property may be available, but its implementation is problematic.
Primary and Secondary Identifiers
For this example, let me explain another use-case that we recently had to solve for Project Omega. TNA's current Catalogue contains multiple identifiers for each Unit of Description (a single document or folder):
- Database Table Primary Key, e.g. tbl_item.-4653191
- CCR (Classic Catalogue Reference), e.g. AIR 79/1064/118667
- Optional - The Former Reference - Creating Department, e.g. R515333
- Optional - The Former Reference - PRO (Public Record Office), e.g. E 315/509/Fo. 11
In addition, in the new Omega Catalogue, every Resource has:
- An OCI (Omega Catalogue Identifier), e.g. FO.2020.3J.P.1
- Optional - A related identifier from the Discovery system called an IAID (Information Asset Identifier), e.g. 01d43d64-d7a6-4250-a2f2-4153a606a948
The Primary Identifier in Omega is the OCI, and we can happily reuse the Dublin Core Terms' dct:identifier property for this. For example:
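A minimal Turtle sketch of this might look as follows. The record IRI under example.org is a placeholder of our own invention, not a real Omega IRI; only the OCI value is taken from the list above.

```turtle
@prefix dct: <http://purl.org/dc/terms/> .

# Hypothetical record IRI, used purely for illustration.
<http://example.org/record/FO.2020.3J.P.1>
    dct:identifier "FO.2020.3J.P.1" .   # the OCI, our Primary Identifier
```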
This leads us to the question: what is the best way to express our Secondary Identifiers?
Expressing Secondary Identifiers
If we were to also express our Secondary Identifiers using dct:identifier, it becomes difficult, impossible even, to tell which scheme each identifier belongs to, as dct:identifier is a Data Type Property and each value can only be a single literal, with no room to name its scheme. Consider for example:
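As a sketch of the problem, the following Turtle places every identifier on the same property. The record IRI is a placeholder, and the values are simply borrowed from the example lists above, so they would not necessarily all belong to one real record.

```turtle
@prefix dct: <http://purl.org/dc/terms/> .

# Hypothetical record IRI. The trailing comments are for the reader only;
# nothing in the data itself says which scheme each value belongs to.
<http://example.org/record/FO.2020.3J.P.1>
    dct:identifier "FO.2020.3J.P.1" ,                        # OCI
                   "AIR 79/1064/118667" ,                    # CCR
                   "R515333" ,                               # Former Reference - Creating Department
                   "01d43d64-d7a6-4250-a2f2-4153a606a948" .  # IAID
```

Strip the comments, as any RDF parser will, and a consumer is left with four unlabelled strings.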
The difficulty in working with the above data is that it raises questions such as:
- Why are there so many identifiers?
- To which schemes do these identifiers belong, and where can I find more information about those?
- Which identifier should I use?
- If I perform a query involving dct:identifier, then I am querying across identifier schemes, but am I guaranteed that there are no duplicate or conflicting identifiers across those schemes?
- As a maintainer of the data, are all of the identifiers that are needed present?
Ideally, we instead want a mechanism for Secondary Identifiers that expresses not only the identifier itself, but also the scheme that defines the use and syntax of that identifier.
After looking through several popular and standardised vocabularies, Schema.org's identifier property, schema:identifier, appears to suit our needs.
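One way this could be written is with a schema:PropertyValue node per Secondary Identifier, as sketched below. The record IRI is a placeholder, and the wording of the schema:name and schema:description strings is illustrative rather than taken from the Omega data.

```turtle
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix schema: <https://schema.org/> .

# Hypothetical record IRI; name and description strings are illustrative.
<http://example.org/record/FO.2020.3J.P.1>
    dct:identifier "FO.2020.3J.P.1" ;            # Primary Identifier (OCI)
    schema:identifier [
        a schema:PropertyValue ;
        schema:name "CCR" ;
        schema:description "Classic Catalogue Reference" ;
        schema:value "AIR 79/1064/118667"
    ] , [
        a schema:PropertyValue ;
        schema:name "Former Reference - Creating Department" ;
        schema:description "Reference previously assigned by the creating department" ;
        schema:value "R515333"
    ] .
```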
This is an improvement over the sole and repeated use of dct:identifier, as it allows us to reserve dct:identifier for our Primary Identifier, while our Secondary Identifiers are now easily located via schema:identifier. In addition, each Secondary Identifier carries information explaining its purpose.
Further Concessions Against Reuse
We could refactor this to eliminate duplication and make it easier to query for specific Secondary Identifiers, yielding:
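A sketch of such a refactoring, assuming each scheme is identified by a term of our own via schema:propertyID. The tna: namespace IRI and the record IRI are placeholders, not the real Omega ones.

```turtle
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix schema: <https://schema.org/> .
@prefix tna:    <http://example.org/tna/vocab#> .   # placeholder namespace

# Hypothetical record IRI. The name and description strings are no longer
# repeated here; each scheme is referenced through schema:propertyID.
<http://example.org/record/FO.2020.3J.P.1>
    dct:identifier "FO.2020.3J.P.1" ;
    schema:identifier [
        a schema:PropertyValue ;
        schema:propertyID tna:ccr ;
        schema:value "AIR 79/1064/118667"
    ] , [
        a schema:PropertyValue ;
        schema:propertyID tna:frcd ;
        schema:value "R515333"
    ] .
```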
The above involves trading off the reuse of existing vocabulary properties for further precision of meaning.
We gain:
- A reduction in duplicated strings, e.g. the schema:name and schema:description being placed on each secondary identifier.
- The ability to easily and confidently search the data: we can match on ? schema:propertyID tna:ccr instead of ? schema:name "CCR". This becomes even more important where data may have been mis-keyed.
We lose:
- The ability for strangers to interpret our data easily by glancing at a Secondary Identifier (schema:identifier) and immediately knowing what it is by reading its inline schema:name and schema:description.
- The ability to express our data without needing to define our own vocabulary.
These trade-offs are quite severe and I think we lose too much for the (maybe as yet unknown) humans who want to work with our data. Of course we gain for the machines, but if we were only concerned about machines we would just use the most efficient binary encoding possible and this article would be redundant.
To complete the example above, we should also define tna:ccr and tna:frcd in a new vocabulary of our own:
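How these two terms should be typed is a design decision that the text does not settle, so the sketch below only attaches a label and a comment to each. The tna: namespace IRI is a placeholder and the comment wording is illustrative.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tna:  <http://example.org/tna/vocab#> .   # placeholder namespace

# The strings that were previously repeated on every identifier now live
# once, on the term that names the scheme.
tna:ccr
    rdfs:label "CCR" ;
    rdfs:comment "Classic Catalogue Reference" .

tna:frcd
    rdfs:label "Former Reference - Creating Department" ;
    rdfs:comment "Reference previously assigned by the creating department" .
```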
Further Compromise for Human Use
We had chosen to reuse schema:identifier for our secondary identifiers because we had reserved dct:identifier for our primary identifier, and we wanted to follow our guiding rule of reusing common vocabularies wherever possible.
However, I feel that we have not yet arrived at a good solution. Perhaps there is a different approach that we might take that would yield a more favourable balance between human understandability, precision of our data, and computability by machines?
What if we took a similar approach to that which we ultimately proposed in Part 1? That is to say, we could derive our own properties for secondary identifiers from an existing one, thus reusing the common definition yet adding further meaning. Of course we must still be very considerate of human users, and wisely choose straightforward or obvious generic names for such properties so as to help them infer their purpose.
Whilst we can't use dct:identifier directly for our secondary identifiers, there is nothing to stop us deriving our own properties for secondary identifiers from it!
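One plausible reading of "deriving" is to declare each new property an rdfs:subPropertyOf dct:identifier, as sketched below. That reading is our assumption. The tna: namespace IRI is a placeholder, and tna:classicCatalogueReference is a hypothetical name chosen to sit alongside tna:formerReferenceFromDepartment.

```turtle
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tna:  <http://example.org/tna/vocab#> .   # placeholder namespace

# Hypothetical property name, for illustration only.
tna:classicCatalogueReference
    a owl:DatatypeProperty ;
    rdfs:subPropertyOf dct:identifier ;
    rdfs:label "Classic Catalogue Reference" .

tna:formerReferenceFromDepartment
    a owl:DatatypeProperty ;
    rdfs:subPropertyOf dct:identifier ;
    rdfs:label "Former Reference - Creating Department" .
```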
Using our own derived properties would then yield an expression for our Record that looks something like:
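A sketch of such a Record is shown below. The record IRI and the tna: namespace IRI are placeholders, and tna:classicCatalogueReference is a hypothetical property name used only for illustration.

```turtle
@prefix dct: <http://purl.org/dc/terms/> .
@prefix tna: <http://example.org/tna/vocab#> .   # placeholder namespace

# Hypothetical record IRI. Each identifier is a single, self-describing line.
<http://example.org/record/FO.2020.3J.P.1>
    dct:identifier                    "FO.2020.3J.P.1" ;       # Primary Identifier (OCI)
    tna:classicCatalogueReference     "AIR 79/1064/118667" ;   # hypothetical property name
    tna:formerReferenceFromDepartment "R515333" .
```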
I believe that this final approach strikes a good compromise. Whilst we are not directly reusing an existing common vocabulary here for our secondary identifiers, we have a good reason, which is that schema:identifier is not a good fit for our use-case. However, all is not lost: whilst we have had to create our own properties, they themselves are derived from a property (dct:identifier) from an existing common vocabulary (Dublin Core Terms). Additionally, and very subjectively, I would argue that it is much easier for humans to understand a single line which says tna:formerReferenceFromDepartment than to look within the data or object properties of schema:identifier.
Conclusion
Although we started out with a different problem, and tried different approaches along the way, we ultimately ended up with an approach that looks remarkably similar to that in Part 1.
Hopefully this has reinforced the idea that when attempting to solely reuse existing popular vocabularies, if you falter due to a lack of suitable available classes and properties, there are options available, but there are trade-offs that have to be made between reuse and precision.