In Part 1 we looked at the challenges of strictly sticking to a policy of reusing existing vocabularies within Project Omega at TNA (The National Archives) and why you may occasionally have to make concessions to correctly express your data.

The example discussed in Part 1 was concerned with the scenario where one cannot find a suitable property in an existing, popular, standardised vocabulary. In this shorter article, we will look at a second scenario where a reusable property may be available, but its implementation is problematic.

Primary and Secondary Identifiers

For this example, let me explain another use-case that we recently had to solve for Project Omega. TNA's current Catalogue contains multiple identifiers for each Unit of Description (a single document or folder):

  1. Database Table Primary Key, e.g. tbl_item.-4653191
  2. CCR (Classic Catalogue Reference), e.g. AIR 79/1064/118667
  3. Optional - The Former Reference - Creating Department, e.g. R515333
  4. Optional - The Former Reference - PRO (Public Record Office), e.g. E 315/509/Fo. 11

In addition, in the new Omega Catalogue, every Resource has:

  1. An OCI (Omega Catalogue Identifier), e.g. FO.2020.3J.P.1
  2. Optional - A related identifier from the Discovery system called an IAID (Information Asset Identifier), e.g. 01d43d64-d7a6-4250-a2f2-4153a606a948.

The Primary Identifier in Omega is the OCI, and we can happily reuse the Dublin Core Terms' dct:identifier property for this. For example:

@prefix tna:     <http://www.nationalarchives.gov.uk/> .
@prefix dct:     <http://purl.org/dc/terms/> .

tna:res.FO.2020.3J.P.1
    dct:identifier "FO.2020.3J.P.1" ;
    .
A Record with a Primary Identifier

This leads us to the question: what is the best way to express our Secondary Identifiers?

Expressing Secondary Identifiers

If we were to express our Secondary Identifiers using dct:identifier as well, it becomes difficult, impossible even, to tell which scheme each identifier belongs to, because dct:identifier is a Datatype Property: each of its values can only be a literal, which leaves no room to name the scheme. Consider for example:

@prefix tna:     <http://www.nationalarchives.gov.uk/> .
@prefix dct:     <http://purl.org/dc/terms/> .

tna:res.FO.2020.3J.P.1
    dct:identifier "FO.2020.3J.P.1" ;
    dct:identifier "01d43d64-d7a6-4250-a2f2-4153a606a948" ;
    dct:identifier "tbl_item.-4653191" ;
    dct:identifier "AIR 79/1064/118667" ;
    dct:identifier "R515333" ;
    dct:identifier "E 315/509/Fo. 11" ;
    .
A Record with Many Identifiers; problematic?

The difficulty in working with the above data is that it raises questions such as:

  1. Why are there so many identifiers?
  2. To which schemes do these identifiers belong, and where can I find more information about those?
  3. Which identifier should I use?
  4. If I perform a query involving dct:identifier, then I am querying across identifier schemes, but am I guaranteed that there are no duplicate or conflicting identifiers across those schemes?
  5. As a maintainer of the data, are all of the identifiers that are needed present?
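
The second question can be made concrete with a small sketch. The Python below is a toy in-memory model, not a real triple store: it treats each triple from the listing above as a tuple and looks up every dct:identifier value of the record. The names RECORD, triples and objects are illustrative assumptions.

```python
# Toy model: each RDF triple is a (subject, predicate, object) tuple.
RECORD = "tna:res.FO.2020.3J.P.1"

triples = [
    (RECORD, "dct:identifier", "FO.2020.3J.P.1"),
    (RECORD, "dct:identifier", "01d43d64-d7a6-4250-a2f2-4153a606a948"),
    (RECORD, "dct:identifier", "tbl_item.-4653191"),
    (RECORD, "dct:identifier", "AIR 79/1064/118667"),
    (RECORD, "dct:identifier", "R515333"),
    (RECORD, "dct:identifier", "E 315/509/Fo. 11"),
]


def objects(subject, predicate):
    """Return every object of the triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


identifiers = objects(RECORD, "dct:identifier")

# Six bare strings come back; nothing in the data says which scheme
# each one belongs to, or which of them is the primary identifier.
for value in identifiers:
    print(value)
```

Whatever tool is used to query, the result has the same shape: a flat list of strings, with the scheme recoverable only by guessing from each value's format.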

Ideally, we instead want a mechanism for Secondary Identifiers that expresses not only the identifier itself, but also the scheme that defines the use and syntax of that identifier.

After looking through several popular and standardised vocabularies, Schema.org's identifier property (schema:identifier) appears to suit our needs:

@prefix tna:     <http://www.nationalarchives.gov.uk/> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix schema:  <https://schema.org/> .

tna:res.FO.2020.3J.P.1
    dct:identifier "FO.2020.3J.P.1" ;
    schema:identifier [
        a schema:PropertyValue ;
        schema:name "CCR" ;
        schema:description "The Classic Catalogue Reference" ;
        schema:value "FO 12/34/56"
    ] ;
    schema:identifier [
        a schema:PropertyValue ;
        schema:name "FRCD" ;
        schema:description "The Former Reference - Creating Department" ;
        schema:value "R123456"
    ]
    .
A Record with Primary and Secondary Identifier - Literals

This is an improvement over the sole and repeated use of dct:identifier as it allows us to reserve dct:identifier to indicate our Primary Identifier, and our secondary identifiers are now easily located via schema:identifier. In addition, each Secondary Identifier carries information explaining its purpose.

Further Concessions Against Reuse

We could refactor this to eliminate duplication and make it easier to query against specific secondary identifiers, yielding:

@prefix tna:     <http://www.nationalarchives.gov.uk/> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix schema:  <https://schema.org/> .

tna:res.FO.2020.3J.P.1
    dct:identifier "FO.2020.3J.P.1" ;
    schema:identifier [
        a schema:PropertyValue ;
        schema:propertyID tna:ccr ;
        schema:value "FO 12/34/56"
    ] ;
    schema:identifier [
        a schema:PropertyValue ;
        schema:propertyID tna:frcd ;
        schema:value "R123456"
    ]
    .
A Record with Primary and Secondary Identifier - URIs

The above involves trading off the reuse of existing vocabulary properties for further precision of meaning.

We gain:

  1. A reduction in duplicated strings, e.g. the schema:name and schema:description being placed on each secondary identifier.
  2. The ability to search the data easily and confidently: we can match on ? schema:propertyID tna:ccr instead of ? schema:name "CCR". This becomes even more important where data may have been mis-keyed.
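
To illustrate the second gain, here is a minimal sketch in Python. It is a toy model that represents each schema:PropertyValue node as a dictionary rather than using a real triple store. The second record value ("FO 12/34/57") and the mis-keyed label ("ccr ") are invented purely for illustration, as is the helper values_where.

```python
# Toy model: each schema:PropertyValue node is a plain dictionary.
# "FO 12/34/57" and the mis-keyed label "ccr " are invented examples.
nodes_with_names = [
    {"schema:name": "CCR", "schema:value": "FO 12/34/56"},
    {"schema:name": "ccr ", "schema:value": "FO 12/34/57"},  # mis-keyed label
]

nodes_with_property_ids = [
    {"schema:propertyID": "tna:ccr", "schema:value": "FO 12/34/56"},
    {"schema:propertyID": "tna:ccr", "schema:value": "FO 12/34/57"},
]


def values_where(nodes, key, expected):
    """Return the schema:value of every node whose key equals expected."""
    return [n["schema:value"] for n in nodes if n.get(key) == expected]


found_by_name = values_where(nodes_with_names, "schema:name", "CCR")
found_by_id = values_where(nodes_with_property_ids, "schema:propertyID", "tna:ccr")

print(found_by_name)  # the mis-keyed node is silently missed
print(found_by_id)    # both nodes match the shared URI
```

A URI can of course be mistyped as well, but a single shared term is easier to validate against the vocabulary than free text that is repeated on every identifier.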

We lose:

  1. The ability for strangers to interpret our data easily by glancing at a Secondary Identifier (schema:identifier) and immediately knowing what it is from its inline schema:name and schema:description.
  2. The ability to express our data without needing to define our own vocabulary.

These trade-offs are quite severe and I think we lose too much for the (maybe as yet unknown) humans who want to work with our data. Of course we gain for the machines, but if we were only concerned about machines we would just use the most efficient binary encoding possible and this article would be redundant.

To complete the example above, we should also define tna:ccr and tna:frcd in a new vocabulary of our own:

@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tna:     <http://www.nationalarchives.gov.uk/> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix schema:  <https://schema.org/> .

# Multi-line literals must use triple quotes in Turtle.
tna:ccr
    a owl:Class ;
    rdfs:label "Classic Catalogue Reference" ;
    rdfs:comment """The CCR (Classic Catalogue Reference) is a secondary
                  identifier for a Unit of Description. It reflects the
                  historic ISAD(G)-like archival arrangement of the unit, i.e.
                  Department, Series, Piece, and Item. It has been in use
                  at The National Archives since the 19th century and is
                  aligned to the ISAD(G) standard. It is defined on page 13
                  of the document: TNA-CS13 (Cataloguing Standards - Part A
                  Data Elements, June 2013)."""@en ;
    rdfs:seeAlso tna:cs13-a, schema:identifier, dct:identifier
    .

tna:frcd
    a owl:Class ;
    rdfs:label "Former Reference - Creating Department" ;
    rdfs:comment """The FRCD (Former Reference - Creating Department) is
                  a secondary identifier for a Unit of Description. It holds
                  the unique identifier given to the material by the
                  originating creator. It is defined on page 17 of the
                  document: TNA-CS13 (Cataloguing Standards - Part A Data
                  Elements, June 2013)."""@en ;
    rdfs:seeAlso tna:cs13-a, schema:identifier, dct:identifier
    .

tna:cs13-a
    a owl:Class ;
    rdfs:label "Cataloguing Standards - Part A Data Elements, June 2013" ;
    rdfs:comment """Describes the various Catalogue Elements derived from
                  ISAD(G) to manage descriptive data which is available
                  in PROCAT Editorial."""@en
    .
Describing the Schemes of our Secondary Identifiers

Further Compromise for Human Use

We had chosen to reuse schema:identifier for our secondary identifiers because we had reserved dct:identifier for our primary identifier, and we wanted to follow our guiding rule of reusing common vocabularies wherever possible.

However, I feel that we have not yet arrived at a good solution. Perhaps there is a different approach that would yield a more favourable balance between human understandability, precision of our data, and computability by machines?

What if we took a similar approach to the one we ultimately proposed in Part 1? That is to say, we could derive our own properties for secondary identifiers from an existing one, thus reusing the common definition while adding further meaning. Of course, we must still be very considerate of human users, and wisely choose straightforward or obvious generic names for such properties to help them infer their purpose.

Whilst we can't use dct:identifier directly for our secondary identifiers, there is nothing to stop us deriving our own properties for secondary identifiers from it!

@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix tna:     <http://www.nationalarchives.gov.uk/> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix schema:  <https://schema.org/> .

# Each property is declared as a sub-property of dct:identifier.
tna:classicCatalogueReference
    a owl:DatatypeProperty ;
    rdfs:subPropertyOf dct:identifier ;
    rdfs:range xsd:string ;
    rdfs:label "Classic Catalogue Reference" ;
    rdfs:comment """The Classic Catalogue Reference is a secondary
                  identifier for a Unit of Description. It reflects the
                  historic ISAD(G)-like archival arrangement of the unit, i.e.
                  Department, Series, Piece, and Item. It has been in use
                  at The National Archives since the 19th century and is
                  aligned to the ISAD(G) standard. It is defined on page 13
                  of the document: TNA-CS13 (Cataloguing Standards - Part A
                  Data Elements, June 2013)."""@en ;
    rdfs:seeAlso tna:cs13-a, schema:identifier, dct:identifier
    .

tna:formerReferenceFromDepartment
    a owl:DatatypeProperty ;
    rdfs:range xsd:string ;
    rdfs:subPropertyOf dct:identifier ;
    rdfs:label "Former Reference - Creating Department" ;
    rdfs:comment """The 'Former Reference - Creating Department' is a secondary
                  identifier for a Unit of Description. It holds the unique
                  identifier given to the material by the originating
                  creator. It is defined on page 17 of the document:
                  TNA-CS13 (Cataloguing Standards - Part A Data Elements,
                  June 2013)."""@en ;
    rdfs:seeAlso tna:cs13-a
    .
Definition of our derived properties for two of our Secondary Identifiers

Using our own derived properties would then yield an expression for our Record that looks something like:

@prefix tna:     <http://www.nationalarchives.gov.uk/> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix schema:  <https://schema.org/> .

tna:res.FO.2020.3J.P.1
    dct:identifier "FO.2020.3J.P.1" ;
    tna:classicCatalogueReference "FO 12/34/56" ;
    tna:formerReferenceFromDepartment "R123456"
    .
A Record with Primary and Secondary Identifier - Bespoke Vocabulary

I believe that this final approach strikes a good compromise. Whilst we are not directly reusing an existing common vocabulary for our secondary identifiers, we have a good reason: schema:identifier is not a good fit for our use-case. However, all is not lost. Although we have had to create our own properties, they are themselves derived from a property (dct:identifier) in an existing common vocabulary (Dublin Core Terms). Additionally, and very subjectively, I would argue that it is much easier for humans to understand a single line which says tna:formerReferenceFromDepartment than to look within the data or object properties of schema:identifier.
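
One consequence of deriving from dct:identifier is worth spelling out: a consumer that applies RDFS sub-property inference can still reach the secondary identifiers through dct:identifier, while a consumer that knows our vocabulary can ask for one scheme precisely. The Python below is a minimal sketch of that inference rule over a toy in-memory model, not a real reasoner. The function infer_super_properties is hypothetical, and it assumes each property has at most one super-property and that the hierarchy contains no cycles.

```python
RECORD = "tna:res.FO.2020.3J.P.1"

# Asserted triples, as in the record above.
triples = {
    (RECORD, "dct:identifier", "FO.2020.3J.P.1"),
    (RECORD, "tna:classicCatalogueReference", "FO 12/34/56"),
    (RECORD, "tna:formerReferenceFromDepartment", "R123456"),
}

# The rdfs:subPropertyOf statements from our vocabulary definition.
sub_property_of = {
    "tna:classicCatalogueReference": "dct:identifier",
    "tna:formerReferenceFromDepartment": "dct:identifier",
}


def infer_super_properties(asserted, hierarchy):
    """Add (s, q, o) for every (s, p, o) where p is a sub-property of q."""
    inferred = set(asserted)
    for s, p, o in asserted:
        while p in hierarchy:  # walk up the chain of super-properties
            p = hierarchy[p]
            inferred.add((s, p, o))
    return inferred


entailed = infer_super_properties(triples, sub_property_of)

# A generic consumer asking for dct:identifier sees all three values...
all_identifiers = sorted(o for s, p, o in entailed if p == "dct:identifier")

# ...while a consumer that knows our vocabulary can ask for one scheme.
ccr_only = [o for s, p, o in entailed if p == "tna:classicCatalogueReference"]

print(all_identifiers)
print(ccr_only)
```

Without inference, a query on dct:identifier alone returns only the primary identifier, which matches the intent of reserving that property for the OCI.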

Conclusion

Although we started out with a different problem, and tried different approaches along the way, we ultimately ended up with an approach that looks remarkably similar to that in Part 1.

Hopefully this has reinforced the idea that, when attempting to solely reuse existing popular vocabularies, if you falter due to a lack of suitable classes and properties, there are options available, but trade-offs have to be made between reuse and precision.