Archival Catalogue Record Identifiers
As previously described, in Project Omega at TNA (The National Archives), we will be using a Graph Model to hold all of the catalogue information, specifically an RDF model.
One of the key things in RDF is that every resource has a URI. In the Omega Catalogue we will have a plethora of different types of resources - (Archival Records, Locations, People, Organisations, etc.). Every one of these resources will need a URI. For our purposes these URI are composed of two parts, a base and an identifier.
In Omega we will be using a flat addressing scheme (i.e. no sub-folders/paths in the URI), and so our base is fixed and could be something like either:
http://cat.nationalarchives.gov.uk/
or
http://cat.nationalarchives.gov.uk#
Determining which is best to use depends on which approach of the HashVsSlash schemes is most advantageous to your application. I have decided that in Omega we will be using the Slash scheme, i.e. http://cat.nationalarchives.gov.uk/ as our base. With the Slash scheme each resource can be served as its own document, whereas with the Hash scheme everything after the # is a fragment of a single document. As our catalogue will be very large, the single-document approach of the Hash scheme would be unwieldy.
Now, the interesting part is the identifier! Every resource in our RDF graph needs a URI and therefore an identifier. In this article I will focus solely on identifiers for Archival Records.
Requirements for a Good Identifier
When choosing identifiers for our resources, there are some properties that they must/should/could exhibit:
1. Must be Unique (within our domain).
   We can't have one identifier identify more than one resource without breaking our RDF model.
2. Must be Persistent.
   We don't want our resources disappearing and/or reappearing with different URI over time, otherwise we end up with broken links. Therefore the identifier must be immutable, which means that we must exclude any changeable properties of a resource from use within its identifier. This is also for archival purposes, as ideally we don't want to have to retrieve records from potentially distant locations or media to modify their identifiers.
3. Must be Computable.
   Whatever form the identifier takes it must be computationally valid for use within a URI. Ideally it should be possible to generate such an identifier computationally without requiring a manual (human) registration/validation process.
4. Must be Uniform.
   By ensuring that every identifier follows a prescribed format and length, it is easy to validate what is an identifier and what is not; that is not to say that a valid identifier necessarily leads to a resource.
5. Should be Humane.
   The identifier when considered as part of a larger URI is often used by humans, perhaps within SPARQL statements that they construct to query the data, or by de-referencing such URI via the Web.
   Additionally the identifier as a stand-alone element may have value in itself and could conceivably be used by humans to communicate about resources, for example a visitor to TNA could imaginably ask to see the record with identifier X. Consideration should be given to making the identifiers communicable by humans, which implies that there are additional desirable properties, such as:
   - Should be succinct.
     Typically humans are better at accurately communicating short sequences of data.
   - Should be easy to transcribe.
     Human transcription errors can be reduced by using a commonly known alphabet. The Latin or Roman alphabet would seem sensible for an archive based in the UK. This would suggest excluding any non-alphanumeric characters from the identifier.
   - Should be easy to verbalise.
     Records of TNA are often discussed or requested verbally. For example, collaboration between staff members, or by a member of the public telephoning or making an enquiry face-to-face on site.
6. Could Convey Knowledge.
   If the identifier is able to convey some knowledge about the resource that can be interpreted by machines and/or humans, we gain the advantage of being able to determine certain facts about the resource just from its identifier. This has to be carefully balanced with (2).
There are two interesting articles from the W3C about designing URI schemes both for the Web and RDF that may be of further interest to the reader:
Existing Identifiers for Archival Records
Those of you already familiar with record keeping may of course be thinking that TNA must have an existing identifier scheme for its records, and you would be right. In fact TNA has several different schemes in use today for identifying its records. For brevity's sake I will discount those used within various internal systems, and focus on what the general public tend to see.
The predominant identifier used by TNA for its records that the general public (and most staff) see and work with is simply known as a "Catalogue Reference". In actuality there are two different identifier schemes in use today, and a Catalogue Reference may be expressed in one or the other scheme. The schemes are:
- CCR (Classic Catalogue Reference)
  Before the advent of GCRs, this was simply known as "The Catalogue Reference" and was the de-facto identifier for any record catalogued by TNA. It was developed before the advent of digital records.
- GCR (Generated Catalogue Reference)
  I developed this scheme for TNA in 2012 to allow computational generation of identifiers for digital records.
At present, TNA uses CCRs for physical (think paper) records, and GCRs for Born-Digital records. CCRs were previously used for Digitised records, but GCRs are now starting to be used for those too.
We will briefly take a look at each existing identifier scheme, and show that unfortunately both have properties which make them unsuitable for use as identifiers within URI.
CCR (Classic Catalogue References)
To understand CCRs, you need to first understand a little bit about how archival records are arranged. Ultimately it all comes down to the principle of Respect des fonds, in the simplest of terms - we must respect the arrangement of the records as defined by their creator. In more concrete terms TNA uses an internal standard known simply as TNA-CS13 (The National Archives - Cataloguing Standards 2013) which is itself derived from ISAD(G) (General International Standard Archival Description).
TNA-CS13 basically stipulates that each record is arranged according to a mono-hierarchical structure, that structure may have between 3 and 7 levels. These levels are known by the names (from top-to-bottom): Department, Division, Series, Sub-series, Sub-sub-series, Piece, and Item; you can read more about them in the article Citing records in The National Archives.
A CCR identifier basically encodes all the references used for 3 or 4 levels of the record's arrangement. The CCR scheme has one of two forms, for records catalogued to Piece level it is:
{Department Reference} {Series Reference}/{Piece Reference}
For records catalogued to Item level the scheme is:
{Department Reference} {Series Reference}/{Piece Reference}/{Item Reference}
Here are five examples of valid CCRs that are in use:
1. MH 55/2713
2. AIR 79/1064/118667
3. E 317/Devon/1
4. AIR 1/1983/204/273/89-M-N-O-P-R-S
5. T 1/440/15,65-66107-113,116-142,166,180-189,etc
From what I have explained so far, hopefully you have recognised that Example (1) is a CCR for a Piece, whereas Example (2) is a CCR for an Item.
Now, I would not blame you for thinking that Example (3) is also an Item, however you would be mistaken! Unfortunately whilst the / character is used as a separator between the Series, Piece, and Item references, at some point in the past it was also introduced as a valid character within the Piece and Item references themselves. We will cover the reason for that decision shortly, but for now we can safely assert that it causes problems: as a human I can no longer visually determine whether the CCR refers to a Piece or an Item, and perhaps worse yet, if I try to parse the identifier using a software program I get an ambiguous result. Sadly the intention for a CCR to carry information that is helpful to understanding the record has not held up well. Instead many CCRs are ambiguous, which may lead to confusion, and ultimately means that the arrangement of the record cannot be known without going into the physical archival stacks and retrieving it.
I chose Example (4) and Example (5) to further illustrate the non-uniformity of CCRs; they refer to a Piece and an Item respectively.
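To make the parsing ambiguity concrete, here is a minimal Python sketch. It is not part of any TNA tooling; the function name `naive_ccr_level` is hypothetical, and the "true" levels in the table are simply those stated in the text above (Example (3) is taken to be a Piece, since the text says it is not an Item). The function applies the naive reading of the two published CCR forms: count the `/`-separated parts after the Department reference.

```python
def naive_ccr_level(ccr: str) -> str:
    """Guess the level of a CCR by counting '/'-separated parts.

    Two parts (Series/Piece) are read as a Piece, three parts
    (Series/Piece/Item) as an Item. This is the naive reading of the
    two published forms, and it breaks once '/' may appear inside a
    Piece or Item reference.
    """
    _department, rest = ccr.split(" ", 1)
    parts = rest.split("/")
    if len(parts) == 2:
        return "Piece"
    if len(parts) == 3:
        return "Item"
    return "Unknown"


# Levels as stated in the article text for the five examples.
stated_levels = {
    "MH 55/2713": "Piece",
    "AIR 79/1064/118667": "Item",
    "E 317/Devon/1": "Piece",
    "AIR 1/1983/204/273/89-M-N-O-P-R-S": "Piece",
    "T 1/440/15,65-66107-113,116-142,166,180-189,etc": "Item",
}

for ccr, stated in stated_levels.items():
    guessed = naive_ccr_level(ccr)
    marker = "ok" if guessed == stated else "MISMATCH"
    print(f"{ccr!r}: stated={stated}, guessed={guessed} [{marker}]")
```

The naive reading gets Examples (1), (2), and (5) right, but misreads Example (3) as an Item and cannot classify Example (4) at all. No purely syntactic rule can fix this, because the string alone does not say which `/` characters are separators.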
We can identify several issues that make CCRs unsuitable for our identifier needs in Omega:
- A CCR may be ambiguous and therefore does not meet our requirement for unique identifiers.
- A CCR encodes the arrangement of the record. Whilst one would hope the arrangement is fixed at the time of accession, the reality is that mistakes can be made and from there the record may need to be re-catalogued, which could also involve a change to its CCR. Therefore, CCRs do not meet our requirement for persistent identifiers.
- Each CCR is allocated and registered manually by an archivist whereas we would need to be able to compute such identifiers within Omega. Additionally their ambiguity and non-uniformity means that they cannot be computationally validated.
- For CCRs with Piece and Item identifiers containing non-alphanumeric characters (e.g. / or ,), such characters would require URI Encoding to be able to use the CCR as part of a URI. Unfortunately URI Encoding is non-intuitive to humans.
- The '/' character was introduced into the Piece and Item references within a CCR to allow the archivist to hint at further levels of arrangement which were prohibited by TNA-CS13. A goal of Project Omega is to provide a catalogue that works for any record regardless of its medium (e.g. physical or digital). One known axiom of digital records preservation is that such records have far more complex arrangements than their paper counterparts, often requiring arbitrarily deep levels of hierarchy or poly-hierarchical arrangement. For this reason the encoding of level identifiers into CCRs will not scale for digital records, and was in fact one of the drivers for creating the GCR scheme.
Whilst CCRs may not be perfect, it should be recognised that they have until now largely been successful in providing an identifier for the retrieval of a record, as demonstrated by the fact that they are still used daily to access millions of archival records.
GCR (Generated Catalogue References)
I designed the GCR scheme for TNA back in 2012 when I was leading the design and implementation of their DRI (Digital Repository Infrastructure) project. The goal of that 3 year project was to design and implement a new Digital Archive for preserving digital records.
DRI needed to be able to accession Born Digital records. The practice of archiving and cataloguing physical records is rather well established and understood. At that time the practice of archiving and cataloguing digital records was still in its infancy with much international discussion, and arguably best practice is still being refined. In particular, Born Digital records have several aspects that make them much more complex to handle than physical records: if we are to apply the principle of Respect des fonds then we must preserve the creator's arrangement of the digital files comprising the records. Generally digital files are organised according to a mono-hierarchical file-system or file-plan, however such a hierarchy may be of an arbitrarily deep number of levels and operate without any global constraints on the naming of each level. In addition there are some systems (e.g. Content/Document Management Systems and/or Cloud Office Suites) which offer label based arrangements of documents, thus resulting in arbitrarily deep poly-hierarchical structures and again without restrictions on the naming of labels (i.e. levels).
By recognising that CCRs reflected an arrangement of 3 or 4 levels, and that Born Digital files could have many more levels of arrangement, we realised that adding additional level identifiers to CCRs would not scale; as we could end up with very long CCRs which are encoding file-system paths with each component of arbitrary length. In addition whilst TNA may receive a large collection of paper records and these can be catalogued and accessioned by humans, the volume of digital files for Born Digital records is much much higher, to the extent that cataloguing such records manually becomes impossible with the resources available.
To solve this problem I developed GCRs, with the goals of:
- Eliminating the encoding of multiple levels of arrangement into the Catalogue Reference.
- Computing Catalogue References automatically during the automated accessioning process for a collection of digital records.
- Creating Catalogue References that are unambiguous, uniform, and can be easily validated.
- Ensuring that the GCR scheme is still easily communicable by humans by both written and verbal mechanisms.
A GCR starts just like a CCR by encoding the Department and Series References. From there it deviates: instead of encoding further levels, it uses a sequentially allocated Record Number, and finally an optional Revision Number. The GCR scheme for most records, i.e. those with a single manifestation, looks like:
{Department Reference} {Series Reference} {Record Number} Z
For records with more than one manifestation, the additional manifestations can be identified by the GCR scheme:
{Department Reference} {Series Reference} {Record Number} Z{Revision Number}
Each record number is monotonically increasing per Department and Series pair. To ensure that the GCR remains succinct even when there are many records, I then encoded the record number using a custom Base25 alphabet. This encoding results in a significant compression of the number of characters needed to express the record number (three Base25 symbols can express 15,625 distinct values, compared with 1,000 for three decimal digits). The Base25 alphabet was carefully chosen to eliminate characters which could be confused when communicated by humans, for example 0 (the digit) and O (the letter). In addition I removed vowels so that we were not incidentally generating recognisable words. The Z character, which I also removed from the alphabet, is carefully placed to enable a GCR to be easily distinguishable from a CCR.
GCR Base25 Alphabet
Numeric Value | Encoded Symbol |
---|---|
0 | B |
1 | C |
2 | D |
3 | F |
4 | G |
5 | H |
6 | J |
7 | K |
8 | L |
9 | M |
10 | N |
11 | P |
12 | Q |
13 | R |
14 | S |
15 | T |
16 | V |
17 | W |
18 | X |
19 | 2 |
20 | 3 |
21 | 4 |
22 | 5 |
23 | 6 |
24 | 7 |
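As an illustration of how the table above could be applied, here is a minimal Python sketch of encoding and decoding a record number. The article only gives the alphabet, so two things here are assumptions: that the encoding is ordinary positional Base25 with the most significant symbol first, and that no padding is applied. The names `gcr_encode` and `gcr_decode` are hypothetical and are not taken from the published OCI Tools.

```python
# Symbols for values 0..24, in the order given by the GCR Base25 table.
GCR_ALPHABET = "BCDFGHJKLMNPQRSTVWX234567"


def gcr_encode(number: int) -> str:
    """Encode a non-negative integer using the GCR Base25 alphabet.

    Assumption: plain positional notation, most significant symbol first.
    """
    if number < 0:
        raise ValueError("record numbers must be non-negative")
    if number == 0:
        return GCR_ALPHABET[0]
    symbols = []
    while number > 0:
        number, digit = divmod(number, len(GCR_ALPHABET))
        symbols.append(GCR_ALPHABET[digit])
    return "".join(reversed(symbols))


def gcr_decode(encoded: str) -> int:
    """Decode a GCR Base25 string back to an integer.

    Raises ValueError if a symbol is not in the alphabet.
    """
    value = 0
    for symbol in encoded:
        value = value * len(GCR_ALPHABET) + GCR_ALPHABET.index(symbol)
    return value


print(gcr_decode("CWG"))  # 1054 under the assumptions above
print(gcr_encode(1054))   # CWG
```

Under these assumptions the three-symbol record number `CWG` corresponds to 1054, which would need four decimal digits.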
Examples of valid GCRs:
1. LOC 5 CWG Z
2. LOC 5/FPF/Z
3. LOC 5 CWG Z3
Example (1) and Example (2) are both valid GCRs. The GCR scheme does not actually stipulate that there should be a / used between the Series, Record, and Z components, however, for visual continuity TNA have elected to use this when presenting them. Example (3) shows the Revision Number component, which serves to allow multiple manifestations of a record to be addressed by a GCR; for example you may have a Microsoft Word 2000 Document original, and a migrated PDF manifestation.
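Because GCRs are uniform, they lend themselves to mechanical validation. The Python sketch below shows one way this might look. It is based only on the three examples above, so several details are assumptions rather than the GCR specification: that a Department reference is one or more uppercase letters, that a Series reference is decimal digits, that either a space or a `/` may separate the later components, and that the Revision Number is written in decimal digits. The names `GCR_PATTERN` and `parse_gcr` are hypothetical.

```python
import re
from typing import Optional

# Assumed shape: DEPT SERIES <sep> RECORD <sep> Z[REVISION], sep = ' ' or '/'.
# The record number may only use symbols from the GCR Base25 alphabet.
GCR_PATTERN = re.compile(
    r"^(?P<department>[A-Z]+) (?P<series>[0-9]+)"
    r"[ /](?P<record>[BCDFGHJKLMNPQRSTVWX2-7]+)"
    r"[ /]Z(?P<revision>[0-9]+)?$"
)


def parse_gcr(text: str) -> Optional[dict]:
    """Return the components of a GCR, or None if the text is not one."""
    match = GCR_PATTERN.match(text)
    return match.groupdict() if match else None


for candidate in ["LOC 5 CWG Z", "LOC 5/FPF/Z", "LOC 5 CWG Z3", "MH 55/2713"]:
    print(candidate, "->", parse_gcr(candidate))
```

The last candidate is a CCR and is rejected: it has no `Z` indicator, and its Piece reference contains `1`, which is not in the GCR alphabet.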
TNA have now been using GCRs for digital records for a few years. Looking back at the design of GCRs, I have to admit that I am still quite happy with them; my younger self must have been having a particularly good day when he sat down to design the GCR scheme! It is certainly humbling to think that these innocuous little identifiers are forming a small part of the UK's permanent history, and that I had a hand in defining them.
For the purpose of considering them for use as the identifier scheme in Omega, GCRs have many of the properties that we require in a good identifier - they are unique, they are persistent (for the vast majority of records), they are computable, they are uniform, and they are humane (in many ways more so than CCRs, although perhaps not as memorable).
Indeed, we could casually adopt GCRs for use in Omega as our identifier scheme. Yet with further thought there are a couple of minor issues with that, and as we have the opportunity, I could perhaps even improve on GCRs yet. The issues that I perceive with adopting GCRs for Omega are:
- In Omega we want our identifiers to be persistent. If the records need to be re-arranged, whilst it is extremely unlikely that the Department reference would change, it is possible that the Series reference could. With a GCR, the Series reference is encoded in the identifier, which means the re-arrangement would unfortunately result in a change to the identifier.
- To support our goal of building an immutable history of our records in Omega, we have a very clear distinction between the enduring form of the record (i.e. the concept of the record), and our temporal understanding of the record (i.e. descriptions of the record). To this end, we need identifiers that can indicate both the concept of the record, and its descriptions as our understanding of the record accumulates and evolves through time. Whilst a GCR has a revision number to identify manifestations of the record, and one might consider re-purposing that for indicating revisions of description, that would then fall short as we also have the concept of manifestations in Omega.
- Adopting GCRs as identifiers for all records would mean that physical records would also gain a GCR alongside their CCR (born-digital records already have GCRs). There is in fact precedent for this: TNA-CS13 allows records to have Former References alongside their Catalogue Reference; one such Former Reference is the PRO (Public Record Office) reference, the PRO of course being the predecessor of TNA. The issue I perceive is one of mindset: staff know that GCRs are used only for digital records, so when they see the Z character in a GCR they infer that it refers to a digital record. This is perhaps unfortunate. Whilst it was never previously envisaged that the GCR scheme would be used for physical records, it didn't impose any such limitation; its specification in fact states: "not solely limited to Born Digital Records" and that the Z character (the Generated Catalogue Reference Indicator) is for the purpose of "allow[ing] users to visually differentiate a GCR from a CCR easily".
Additionally, one place where I think I could improve upon the GCR scheme is where CCRs have the advantage of being able to convey more information directly to the user about the record, thus reducing the need for the user to actually retrieve the record. Sometimes this information may be ambiguous and/or confusing, but more often than not, it is helpful to the user. GCRs removed a lot of that information to meet their goals, and ultimately ended up as a much more persistent identifier scheme, which is a good thing. In Omega we have a clear split and definition of information that we believe changes over time, and information which we believe is enduring. If there is enduring information that is useful to the user, and assuming it is sensible for use in an identifier and URI, we can place that into the identifier without compromising on our requirement for persistent identifiers.
OCI (Omega Catalogue Identifier)
From what I have learnt about CCRs and how TNA use GCRs, in consultation with TNA's Catalogue Team I have developed a new identifier scheme which rather unimaginatively I am simply calling OCI (Omega Catalogue Identifier).
Let me be clear, my driver for this is solely the requirements of Project Omega. I believe that these identifiers will work well for the URI of our catalogue resources in our RDF graph, and would equally work well within a Linked Data context.
OCIs can be used for any type of record held by TNA. I might be suggesting, but I am NOT proposing, that the canonical Catalogue Reference of a record catalogued by TNA change from the existing CCR or GCR scheme. At this stage, I see OCIs as complementing CCRs and GCRs, whereby in some applications, such as Omega, the OCI is the primary identifier of the record. Regardless, OCIs will be generated for all existing catalogue records at TNA when they are imported into the Project Omega system. Could TNA start using OCIs instead of CCRs and GCRs for all new records that it accessions? Yes, of course! Will TNA do that? I am the wrong person to ask... that level of decision making is high above my position!
The basic OCI scheme for records has the following components:
{Creator Reference}.{Accession Year}.{Record Number}.{Accession Format}
- Creator Reference
  This is some identifier that uniquely identifies the organisation, group, or individual that created the records. Historically at TNA this is most often the Government Department, known as "Department reference" in CCR terms.
- Accession Year
  The year in which the record was accessioned into the archive.
- Record Number
  A monotonically increasing number, initialised per Creator Reference and Accession Year pair. This number is encoded using a special purpose alphabet. This is not the same Base25 alphabet used in GCRs for some important reasons covered below.
- Accession Format
  A single character to indicate the format of the accessioned record. Currently limited to P for physical records, or D for digital records. Note that accession format is useful to indicate the format of the public record that was initially accessioned, but it should be remembered that there might also be additional manifestations of the record available in complementary formats, e.g. a digitisation of a physical record.
As already mentioned, Omega separates the concept (or enduring form) of a record from TNA's descriptions of the record, which evolve through time. The OCI scheme illustrated above allows one to identify a record, but how are we to identify the descriptions of the record? To identify a specific description of a record, we simply number them, and add an additional component to indicate the description:
{Creator}.{Accession Year}.{Record Number}.{Accession Format}.{Description Number}
- Description Number
  A monotonically increasing number, initialised per record concept. This is not encoded. Comparing these numbers only implies an ordering through time of descriptions; it does not imply which description is correct, as there may be multiple competing descriptions from different sources.
It is perhaps worth pointing out that within the RDF graph for Omega there are explicit relationships that link a record with all of its descriptions, and also indicate the latest description. From a Linked Data perspective, if a user wished to resolve the record using the web of data, we would provide just the data about the concept and links to its descriptions. However, if another user was to resolve the record via a Web Browser, then we would likely redirect them to an HTML page of the latest description of the record.
Similarly to how we have multiple descriptions of a record, Omega also offers multiple manifestations of a record. A manifestation of a record can take many different forms, but there is always the original manifestation of the record as accessioned by TNA, for example, a parchment or digital file. There may also be additional manifestations of the record created for preservation or presentation purposes, for example, copies, digitisation, thumbnails, language translation, transcription, redaction, or file-format migration. The OCI scheme adds a component for manifestation, which is numbered in a similar manner to descriptions.
{Creator}.{Accession Year}.{Record Number}.{Accession Format}.{M{Manifestation Number}}
- Manifestation Number
  Prefixed by an M character, this is a monotonically increasing number, initialised per record concept. This is not encoded. Comparing these numbers does not imply an ordering of manifestations.
Note, it is important to realise that descriptions and manifestations are both numbered per record concept. Descriptions and manifestations do not have a hierarchical relationship, instead they are orthogonal to each other.
Here are some examples of (fictional) OCIs:
- MSW.1970.7GH.P
  This is the OCI for a physical record numbered 7GH which was created by MSW (The Ministry of Silly Walks) and accessioned by TNA in 1970.
- MSW.2014.L4F.D
  This is the OCI for a digital record numbered L4F which was created by MSW and accessioned by TNA in 2014.
- MSW.1981.HGF.P.1
  This is the OCI for the 1st description of the physical record numbered HGF which was created by MSW and accessioned by TNA in 1981.
- MSW.1981.HGF.P.5
  This is the OCI for the 5th description of the physical record numbered HGF which was created by MSW and accessioned by TNA in 1981.
- MSW.1999.TSF.P.M1
  This is the OCI for the 1st manifestation of the physical record numbered TSF which was created by MSW and accessioned by TNA in 1999.
- MSW.1999.TSF.P.M5
  This is the OCI for the 5th manifestation of the physical record numbered TSF which was created by MSW and accessioned by TNA in 1999.
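The component structure described above can be sketched as a small parser. This is an illustrative Python sketch only, not the published OCI Tools. It assumes that a Creator Reference consists of uppercase letters and that the Accession Year is always four digits, neither of which the scheme description states explicitly; the names `Oci`, `OCI_PATTERN`, and `parse_oci` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Record numbers may only use symbols from the OCI Base25 alphabet.
# The final component is optional: digits for a description,
# or 'M' followed by digits for a manifestation.
OCI_PATTERN = re.compile(
    r"^(?P<creator>[A-Z]+)"
    r"\.(?P<year>[0-9]{4})"
    r"\.(?P<record>[1-9CFGHJKLNQRSTVWXY]+)"
    r"\.(?P<format>[PD])"
    r"(?:\.(?:(?P<description>[0-9]+)|M(?P<manifestation>[0-9]+)))?$"
)


class Oci(NamedTuple):
    creator: str
    accession_year: int
    record_number: str           # still Base25-encoded
    accession_format: str        # 'P' (physical) or 'D' (digital)
    description: Optional[int]   # set only for description identifiers
    manifestation: Optional[int] # set only for manifestation identifiers


def parse_oci(text: str) -> Oci:
    """Split an OCI into its components, raising ValueError if malformed."""
    match = OCI_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a valid OCI: {text!r}")
    description = match["description"]
    manifestation = match["manifestation"]
    return Oci(
        creator=match["creator"],
        accession_year=int(match["year"]),
        record_number=match["record"],
        accession_format=match["format"],
        description=int(description) if description else None,
        manifestation=int(manifestation) if manifestation else None,
    )


for example in ["MSW.1970.7GH.P", "MSW.1981.HGF.P.5", "MSW.1999.TSF.P.M1"]:
    print(parse_oci(example))
```

Because `P`, `D`, and `M` are excluded from the record number alphabet, the pattern can tell every component apart without backtracking surprises, which is the uniformity property the CCR scheme lacks.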
Astute readers may have noticed that we could potentially remove the . character between the components in an OCI without losing precision or introducing ambiguity. This is an interesting idea, and one that I discussed with the Catalogue Team. Whilst it would make no difference to a machine, the majority felt that it was clearer for human use if they remained.
The alphabet for encoding record numbers in OCIs was created by:
- Starting with the Base32 alphabet from RFC 4648 (the steps and table below are consistent with its "Extended Hex" variant, 0-9 followed by A-V).
- Eliminating the English vowels - A, E, I, O, and U. We don't want to incidentally create meaningful words!
- Removing the characters P, D, and M, as they are reserved to signify Physical, Digital, and Manifestation.
- Removing the digit 0 (zero) as it could be misconstrued as numeric padding.
- Removing the character B as it could be confused with the digit 8 (eight) when read or written by humans.
- Adding the characters W, X, and Y. We opted not to add Z so as to avoid any confusion with GCRs.
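The steps above can be replayed mechanically. The Python sketch below does so under one assumption: that the starting point is the "Extended Hex" Base32 alphabet of RFC 4648 (0-9 followed by A-V), since that is the variant which contains the digits 0 and 8 mentioned in the steps. The function name `derive_oci_alphabet` is hypothetical.

```python
import string

# RFC 4648 "base32hex" alphabet: the digits 0-9 followed by the letters A-V.
BASE32HEX = string.digits + string.ascii_uppercase[:22]


def derive_oci_alphabet() -> str:
    """Apply the listed steps, in order, to the starting alphabet."""
    symbols = [c for c in BASE32HEX if c not in "AEIOU"]  # drop vowels
    symbols = [c for c in symbols if c not in "PDM"]      # reserved markers
    symbols = [c for c in symbols if c != "0"]            # looks like padding
    symbols = [c for c in symbols if c != "B"]            # confusable with 8
    symbols += ["W", "X", "Y"]                            # top back up to 25
    return "".join(symbols)


alphabet = derive_oci_alphabet()
print(alphabet)       # 123456789CFGHJKLNQRSTVWXY
print(len(alphabet))  # 25
```

Reading the result left to right gives the symbols for values 0 to 24, matching the OCI Base25 table.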
OCI Base25 Alphabet
Numeric Value | Encoded Symbol |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
4 | 5 |
5 | 6 |
6 | 7 |
7 | 8 |
8 | 9 |
9 | C |
10 | F |
11 | G |
12 | H |
13 | J |
14 | K |
15 | L |
16 | N |
17 | Q |
18 | R |
19 | S |
20 | T |
21 | V |
22 | W |
23 | X |
24 | Y |
The Base25 alphabet as used in OCIs has a small advantage over that used in GCRs - encoded data maintains its sort order when it is compared bit-wise.
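The sort-order property can be checked with a short Python sketch. It assumes plain positional Base25 with the most significant symbol first, and it pads every code to the same width using the symbol for zero; the fixed width is an assumption made here for the comparison, because strings of different lengths do not sort numerically under any alphabet. The names `encode` and `decode` are hypothetical and are not taken from the published OCI Tools.

```python
OCI_ALPHABET = "123456789CFGHJKLNQRSTVWXY"
GCR_ALPHABET = "BCDFGHJKLMNPQRSTVWX234567"


def encode(number: int, alphabet: str, width: int = 1) -> str:
    """Positional encoding, left-padded with the alphabet's zero symbol."""
    if number < 0:
        raise ValueError("number must be non-negative")
    symbols = []
    while number > 0:
        number, digit = divmod(number, len(alphabet))
        symbols.append(alphabet[digit])
    return "".join(reversed(symbols)).rjust(width, alphabet[0])


def decode(encoded: str, alphabet: str) -> int:
    """Inverse of encode; raises ValueError on an unknown symbol."""
    value = 0
    for symbol in encoded:
        value = value * len(alphabet) + alphabet.index(symbol)
    return value


numbers = range(2000)
oci_codes = [encode(n, OCI_ALPHABET, width=3) for n in numbers]
gcr_codes = [encode(n, GCR_ALPHABET, width=3) for n in numbers]

# Sorting the strings compares them character by character, which for
# ASCII text is the same as comparing the underlying bytes.
print(oci_codes == sorted(oci_codes))  # True
print(gcr_codes == sorted(gcr_codes))  # False
```

The difference comes from symbol order alone: the OCI symbols are listed in ascending ASCII order, whereas the GCR alphabet places the digits 2-7 after the letters even though digits sort before letters.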
I believe that the OCI scheme has two key advantages over GCRs:
-
100% Persistent.
A URI using an OCI will never change, even if the description or arrangement of the record changes. Once an OCI is created in Omega it lives forever. -
Conveys Knowledge.
Like a CCR an OCI confers some information about the record that it identifies, however unlike a CCR this is done without compromising on persistence of the identifier.
The OCI scheme should certainly be considered as a draft at the moment, and I am looking forward to both experimenting with it and receiving further feedback.
We have placed the source code for two software tools for encoding/decoding OCI Base25 (and also GCR Base25) onto GitHub: OCI Tools (Scala) and OCI Tools (TypeScript).
Full Circle back to URIs
As discussed at the start of this article... for Project Omega I defined a static base URI for expressing TNA's resources in RDF, and I have now also defined a suitable identifier scheme - OCI.
The URI for our data now look like this for a record (concept):
http://cat.nationalarchives.gov.uk/MSW.1970.7GH.P
and this for a record's description:
http://cat.nationalarchives.gov.uk/MSW.1970.7GH.P.2
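Composing these URI is then a matter of appending the identifier to the base. The Python sketch below illustrates this and, for contrast, shows what happens to a CCR when it is percent-encoded for use in a URI. The function names `record_uri` and `description_uri` are hypothetical; the base is the Slash-scheme base given earlier in this article.

```python
from urllib.parse import quote

BASE_URI = "http://cat.nationalarchives.gov.uk/"


def record_uri(oci: str) -> str:
    """URI for a record (concept): the base followed by the OCI.

    quote() is applied defensively; an OCI only uses letters, digits,
    and '.', none of which need percent-encoding.
    """
    return BASE_URI + quote(oci, safe="")


def description_uri(oci: str, description_number: int) -> str:
    """URI for one numbered description of a record."""
    return f"{record_uri(oci)}.{description_number}"


print(record_uri("MSW.1970.7GH.P"))
print(description_uri("MSW.1970.7GH.P", 2))

# A CCR, by contrast, does not survive percent-encoding in readable form.
print(quote("E 317/Devon/1", safe=""))  # E%20317%2FDevon%2F1
```

The OCI passes through unchanged, whereas the CCR gains `%20` and `%2F` escapes, which is the non-intuitive URI Encoding problem noted for CCRs earlier.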
Just above I said: "URI for our data"; ideally such URI should be resolvable via the Web with various content negotiation options. At present alongside Project Omega, TNA is also operating Project Alpha. Alpha is focused upon the User Experience around the discoverability of records through TNA's website. There has already been some collaboration and information sharing between the two projects around URI. Yet, it is important to keep in mind that URI for addressing records (Omega) are not necessarily the same as URI for finding records (Alpha).
I expect that there will be further collaboration in the near-future between the Alpha and Omega projects to ensure that TNA can benefit from the exciting Linked Data on the Web applications that Omega is unlocking! :-)