Archival Identifiers for Digital Files

As part of Project Omega for TNA (The National Archives), I have been thinking about how identifiers for Digital Files should be constructed. This blog entry continues on from my previous entry: Archival Catalogue Record Identifiers.

When considering the development of a new archival catalogue that can describe physical, digitised, and born-digital records, we quickly realised that, unlike its predecessors, this catalogue will also need to describe digital files.

At this point you might think that I am mixing concerns between what archives have often thought of as two separate systems: 1) their Archival catalogue, and 2) their Digital Preservation system. Yes, I am, and intentionally so! However, I would argue that this soup has been cooking for some time; until now, digital preservation systems have had to include some aspect of cataloguing (for their digital records) because the traditional archival catalogues that were already in place were ill-equipped to describe the new digital world. I believe that a clean and mutually beneficial separation between cataloguing and (digital) preservation activities can be established, but that as practitioners we are still very much writing the book on digital preservation.

Anyway, I digress! The archival concept of a Digital File is a complex one. As archivists we have to ask difficult questions like:

What is a digital file?
How do I describe a digital file?
Is a copy of a file the same digital file?
If I change the name of the file, is it still the same digital file?

All of these things have to be considered when designing a scheme for local identifiers of Digital Files. Without writing an extended article on the various principles of digital preservation, it is perhaps enough to say that a file's path and/or name are not suitable for use as an identifier, in no small part because they are transient, and because combining them with files from other systems may give rise to naming conflicts.

The Current Approach

To date the predominant approach in digital preservation for generating identifiers for digital files has been to simply assign them a UUID (Universally Unique Identifier); more specifically a Version 4 UUID. This approach has several nice properties:

These can be generated independently of each other. You can just magic a UUID into existence without concern for other UUIDs that have gone before or come after it. The chance of a collision is incredibly small - "the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion".

They are relatively compact and presentable. A UUID is just a 128-bit number. This is typically formatted for presentation as a hexadecimal string of five components, totaling 36 printable characters, although they are not very human-friendly.

They are cheap to compute. On a modern laptop we can easily generate over 500,000 every second.
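The properties above can be checked with a few lines of Python. This is only an illustrative sketch using the standard library's uuid module; the variable names are my own, and the collision figure uses the usual birthday-bound approximation p ≈ n² / (2 × 2^122), on the assumption that 122 of a Version 4 UUID's 128 bits are random.

```python
import uuid

# Generate a Version 4 (random) UUID; no coordination with any other
# generator, past or future, is needed.
file_id = uuid.uuid4()

# Canonical presentation: five hyphen-separated hexadecimal components.
text = str(file_id)
components = text.split("-")

# Birthday-bound approximation for the quoted collision figure.
# Assumption: 6 of the 128 bits are fixed version/variant bits,
# leaving 122 random bits.
RANDOM_BITS = 122
n = 103 * 10**12  # 103 trillion UUIDs
collision_probability = n**2 / (2 * 2**RANDOM_BITS)

print(text)
print(len(text), "characters in", len(components), "components")
print(f"approximate collision probability: {collision_probability:.2e}")
```

The computed probability comes out at roughly one in a billion, which agrees with the quoted figure.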

A New Approach - Content Identifiers

As an alternative to UUIDs, I am proposing a new approach for generating an identifier for a Digital File, in which the identifier is computed from the content of the file itself.

I should be clear that this is not some stroke of genius on my part; similar approaches are already widely used in other domains. For example, the Git SCM (Source Code Management) system uses SHA-1 digests to identify files and changes. Likewise, IPFS (InterPlanetary File System) uses its own definition of a CID (Content Identifier), which is built from a hash function's digest of a file's content, to address that file.

To avoid any confusion between IPFS CIDs and our "Content Identifiers", I will herein use the abbreviation ACID (Archival Content Identifier) to refer to my proposal for identifiers.

The main part of an ACID is generated by computing the digest of the byte-stream (i.e. content) of the digital file via a hash function. This raises the question of which hash function should be used. There is a wealth of different hash algorithms available, with various properties and different trade-offs. That being said, I am going to suggest that we use a BLAKE2b-256 hash for the following reasons:

Based on BLAKE, a finalist in the NIST SHA-3 competition, and specified in RFC 7693.
Likelihood of collision is incredibly small.
Much faster to generate than equivalents such as SHA-256.
At least as secure as SHA-3.

For example, if we wanted to generate a BLAKE2b-256 hash digest for the Apache 2.0 License file, we could run:

curl https://www.apache.org/licenses/LICENSE-2.0.txt | b2sum --length 256 --binary

This yields a 256-bit number encoded into a hexadecimal string totaling 64 printable characters:

3cbae8f16217ad44981e5843100092cd582202e69d452eb094480f2d24abdb49
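A rough Python equivalent of the command above might look like the following sketch. It relies on hashlib.blake2b from the standard library with digest_size=32 (32 bytes, i.e. 256 bits); the function name, the chunk size, and the sample file are illustrative choices of mine rather than part of the proposal.

```python
import hashlib
import tempfile
from pathlib import Path


def blake2b_256_hex(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the BLAKE2b-256 digest of a file's content as a hex string."""
    hasher = hashlib.blake2b(digest_size=32)  # 32 bytes = 256 bits
    with path.open("rb") as stream:
        # Read in chunks so that large files never need to fit in memory.
        while chunk := stream.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


# Usage: hash a small temporary file standing in for a digital file.
with tempfile.TemporaryDirectory() as directory:
    sample = Path(directory) / "sample.txt"
    sample.write_bytes(b"An example digital file.\n")
    digest = blake2b_256_hex(sample)

print(digest)
print(len(digest), "hexadecimal characters")
```

Only the bytes of the file are fed to the hash function, so the file's name and path play no part in the result.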

This hexadecimal string has some interesting properties:

It can be used as an identifier for the Digital File.

Verifiable Descriptions. Provided with both 1) the description and identifier of a digital file, and 2) the file itself, we can verify that the description is indeed about the file by re-computing the hash digest of the file and comparing the result with the digital file identifier.

Verifiable Preservation. Similarly to the above, if the hash digest of the file changes over time, then we can assert that there has been an issue with its preservation, e.g. data-rot.
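Both kinds of verification come down to the same check: re-compute the digest and compare it with the recorded identifier. A minimal sketch, assuming the identifier is the bare hexadecimal BLAKE2b-256 digest, might look like this; the function names and sample content are hypothetical.

```python
import hashlib


def content_digest(content: bytes) -> str:
    """BLAKE2b-256 digest of the content, as a lowercase hex string."""
    return hashlib.blake2b(content, digest_size=32).hexdigest()


def is_unchanged(identifier: str, content: bytes) -> bool:
    """True if the content still produces the recorded identifier."""
    return content_digest(content) == identifier.lower()


# The identifier is recorded when the file is first catalogued.
original = b"Minutes of the meeting held on 1 April."
identifier = content_digest(original)

# Simulate data-rot by flipping a single bit in a copy of the content.
damaged = bytearray(original)
damaged[0] ^= 0x01

print(is_unchanged(identifier, original))        # True
print(is_unchanged(identifier, bytes(damaged)))  # False
```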

There are some down-sides to using a hash digest as opposed to a UUID:

More expensive to compute. A hash digest is much more expensive to compute than a UUID, and the larger the file being digested the more expensive it becomes.

Less compact. Our 256-bit hash generates a result which is twice as long as a UUID.

I believe that the down-sides of a hash digest are outweighed by its advantage of offering verifiability.

Which Hash Function was it?

For the purposes of preservation and interoperability, one thing that we have not yet considered is how one determines which hash function was used to generate an identifier. Sure, I said we would use BLAKE2b-256, but what if you want to use a different hash function? Also, from a digital archaeological perspective, given an identifier like:

3cbae8f16217ad44981e5843100092cd582202e69d452eb094480f2d24abdb49

You might be able to infer that it is a hash digest, and the selection of characters used and the number of them would indicate that it could be a 256-bit hash... but which hash function was used?

Ideally, we need a mechanism to also communicate the hash function that was used. In fact IPFS already thought about this, and they use an encoding called Multihash which prefixes their CIDs with a code indicating the hash function used. Whilst we could adopt Multihash here, it's much more complex than we need (famous last words?!?). Instead, I propose that our ACIDs have a single ASCII character at the start that indicates the hash function that was used. A single ASCII character has the advantage of a fixed-length numeric encoding, and it makes the number of characters in the hexadecimal string representation an odd number, thus providing a hint to a digital archaeologist that perhaps this ACID is similar to a digest but with an extra character. I will go one step further and say that this character should be outside of the hexadecimal alphabet (regardless of case); this should make it glaringly obvious to such a digital archaeologist that the prefix character has a meaning which is distinct from the rest of the string.

An ACID is then formatted from a template like this:

{Hash Function Type}{Hash Digest}

For the Hash Function Type, I am going to reserve the ! character to indicate BLAKE2b-256. Why? Because I think it looks cool! This would mean that our earlier digital file identifier now simply becomes:

!3cbae8f16217ad44981e5843100092cd582202e69d452eb094480f2d24abdb49
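To make the template concrete, here is a small sketch of how an ACID could be formatted and taken apart again. The ! prefix for BLAKE2b-256 comes from the text above; the function names, the lookup table, and the validation rules are assumptions made for illustration only.

```python
import hashlib
from typing import Tuple

HEX_DIGITS = set("0123456789abcdef")

# "!" is the prefix reserved above for BLAKE2b-256. The table layout
# (prefix -> (name, digest length in hex characters)) is hypothetical.
HASH_FUNCTION_TYPES = {"!": ("BLAKE2b-256", 64)}


def make_acid(content: bytes) -> str:
    """Format an ACID as {Hash Function Type}{Hash Digest}."""
    return "!" + hashlib.blake2b(content, digest_size=32).hexdigest()


def parse_acid(acid: str) -> Tuple[str, str]:
    """Split an ACID into (hash function name, hex digest)."""
    prefix, digest = acid[:1], acid[1:].lower()
    if prefix not in HASH_FUNCTION_TYPES:
        raise ValueError(f"unknown hash function type: {prefix!r}")
    name, expected_length = HASH_FUNCTION_TYPES[prefix]
    if len(digest) != expected_length or not set(digest) <= HEX_DIGITS:
        raise ValueError("digest is not a hex string of the expected length")
    return name, digest


example = "!3cbae8f16217ad44981e5843100092cd582202e69d452eb094480f2d24abdb49"
print(parse_acid(example)[0])          # BLAKE2b-256
print(len(make_acid(b"hello world")))  # 65, an odd number of characters
```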

What about Collisions?

Sure, generating a digital file identifier with BLAKE2b-256 has a very small chance of generating a collision (i.e. two different files with the same identifier), but what if...?

If you detect a collision, I will build you a new digital archive system for free... Nope! Just joking! We actually already have a mechanism for coping with this: the Hash Function Type. For the new file which creates the collision, you could switch to a different hash function, perhaps a 512-bit one! This would at least give you a different unique identifier. But... what to do about the original file which is on the other side of the collision? It's probably deeply embedded in your archive by now! You could re-catalogue it, but maybe you don't even need to???
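The fallback described above could be sketched roughly as follows. Everything beyond the ! prefix is hypothetical: the $ prefix for a 512-bit BLAKE2b digest is invented purely for this example, and the in-memory dictionary holding file content is only a stand-in for whatever comparison a real catalogue would perform.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

HashFunction = Callable[[bytes], str]

# "!" (BLAKE2b-256) is from the text above; "$" (BLAKE2b-512) is a
# hypothetical second Hash Function Type used only for illustration.
HASH_FUNCTIONS: List[Tuple[str, HashFunction]] = [
    ("!", lambda data: hashlib.blake2b(data, digest_size=32).hexdigest()),
    ("$", lambda data: hashlib.blake2b(data, digest_size=64).hexdigest()),
]


def assign_acid(
    content: bytes,
    catalogue: Dict[str, bytes],
    hash_functions: List[Tuple[str, HashFunction]] = HASH_FUNCTIONS,
) -> str:
    """Assign an identifier, moving to the next hash function on collision."""
    for prefix, hash_function in hash_functions:
        acid = prefix + hash_function(content)
        existing = catalogue.get(acid)
        if existing is None or existing == content:
            catalogue[acid] = content
            return acid
        # Same identifier but different content: a collision, so fall
        # through and try the next hash function in the list.
    raise RuntimeError("every configured hash function produced a collision")


# Demonstration with a deliberately broken first hash function, since a
# genuine BLAKE2b-256 collision cannot be produced on demand.
weak_functions = [("!", lambda data: "00"), HASH_FUNCTIONS[1]]
catalogue: Dict[str, bytes] = {}
first = assign_acid(b"file one", catalogue, weak_functions)
second = assign_acid(b"file two", catalogue, weak_functions)
print(first)       # !00
print(second[:1])  # $
```

Note that in this sketch the original file keeps its existing identifier; only the newcomer is given one from the second hash function.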

I have in mind to write another article about further encoding such ACIDs for compact machine use. Okay… that’s enough for today!