<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Down the Code Mine]]></title><description><![CDATA[Ruminations of a software developer]]></description><link>https://blog.adamretter.org.uk/</link><image><url>https://blog.adamretter.org.uk/favicon.png</url><title>Down the Code Mine</title><link>https://blog.adamretter.org.uk/</link></image><generator>Ghost 2.29</generator><lastBuildDate>Thu, 09 Apr 2026 14:34:06 GMT</lastBuildDate><atom:link href="https://blog.adamretter.org.uk/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Refreshing Apache XML Infrastructure]]></title><description><![CDATA[I have published new public versions of the Apache XML-RPC and Web Services Commons Open Source libraries. We have fixed a number of security issues, added embedded XML serialization/deserialization functionality, and addressed a long standing issue with namespace processing.]]></description><link>https://blog.adamretter.org.uk/refreshing-apache-xml-infrastructure/</link><guid isPermaLink="false">6912605b2beb3f025e6ac55a</guid><category><![CDATA[xml]]></category><category><![CDATA[Java]]></category><category><![CDATA[Maven]]></category><category><![CDATA[RPC]]></category><category><![CDATA[Open Source]]></category><category><![CDATA[XPath]]></category><category><![CDATA[Markup Monday]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Mon, 10 Nov 2025 22:33:24 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1558939608-7e8f4c8336d2?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGplbmdhfGVufDB8fHx8MTc2MjgxMzYzOXww&amp;ixlib=rb-4.1.0&amp;q=80&amp;w=1080" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1558939608-7e8f4c8336d2?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fGplbmdhfGVufDB8fHx8MTc2MjgxMzYzOXww&ixlib=rb-4.1.0&q=80&w=1080" alt="Refreshing Apache XML Infrastructure"><p>At my company (<a href="https://www.evolvedbinary.com">Evolved Binary</a>) we recently had to address a series of bugs in <a href="https://www.elemental.xyz">Elemental</a> that involved the Serialization and Deserialization of <a href="https://www.w3.org/TR/xpath-datamodel-31/">XDM (XQuery and XPath Data Model)</a> values. The issues occurred when transferring the values of certain XDM types over the REST and XML-RPC APIs. As part of the <em>#markupmonday</em> effort from the XML community to post more articles, I decided to write up what we have been working on in this blog post.</p><p>The issues in the REST API were relatively easy for us to address as we control almost all of the code used there. However, Elemental inherited the XML:RPC API from eXist-db, and that relies heavily on the <a href="https://ws.apache.org/xmlrpc/">Apache XML-RPC</a> library. The last <em>official</em> release of Apache XML-RPC was version <a href="https://svn.apache.org/repos/asf/webservices/archive/xmlrpc/tags/xmlrpc-3.1.3/">3.1.3</a> back in February of 2010 - some 15+ years ago! Since then it has not been maintained by the Apache project.</p><p>In fact, we had already visited the topic of what to do with Apache XML-RPC in the past...</p><p>In June 2018, we identified that there were known security issues in Apache XML-RPC. 
I noted that a number of Linux distributions were shipping security patches to protect against those issues, but that those patches were not included in the official Apache XML-RPC release. At that time we decided to create <a href="https://github.com/evolvedbinary/apache-xmlrpc">our own fork of the Apache XML-RPC project</a> because:</p><ol><li>The Apache XML-RPC project appeared to have been archived and was unmaintained.</li><li>We had a direct dependency on it.</li><li>It would not be easy to replace as we needed to maintain 100% forwards compatibility.</li></ol><p>At that time I imported the Apache XML-RPC code from Subversion to our GitHub, fixed the issues, and cut our own release. We did this work publicly and in accordance with the original license. We chose a new major version number to try and signal that this was a major departure from 3.1.3, i.e. a new fork. We published a new release of Apache XML-RPC version <a href="https://github.com/evolvedbinary/apache-xmlrpc/tree/xmlrpc-4.0.0">4.0.0</a> (in our own public namespace). This enabled anyone to use Apache XML-RPC with the security fixes already included.</p><p>In September 2022, I needed to switch a project over from Javax Servlet to <a href="https://jakarta.ee/specifications/servlet/">Jakarta Servlet</a>. Unfortunately, such a change in the Java world is not simple. Java Servlets (or those from any libraries you use) that utilise the Javax Servlet API have to be updated to compile against the newer Jakarta Servlet API. Apache XML-RPC was an example of one such library that delivers its Web Endpoint(s) using Javax Servlet. This time it was easier, as we already had our own fork of Apache XML-RPC. From there we made the necessary changes, chose a new major version number to indicate the breaking API change, and then publicly released <a href="https://github.com/evolvedbinary/apache-xmlrpc/tree/xmlrpc-5.0.0">Apache XML-RPC version 5.0.0</a>, which is compatible with Jakarta Servlet.</p><p>More recently, for Elemental, we wanted to be able to serialize and deserialize all XDM types over XML-RPC. If you squint a little, then the XDM Node types are a subset of the XML DOM (Document Object Model) types, and so we thought it might be nice if you could send and receive any XML over XML-RPC.</p><p>Now you might be forgiven for some confusion here if you are thinking: <em>Wait a minute, this is XML-RPC! Why can't he send XML over his XML-RPC?</em> Let's step back for a moment. The XML in <a href="http://1998.xmlrpc.com/">XML-RPC</a> is just a wire format for RPC; the focus is still on RPC (Remote Procedure Call), and the XML is only used to describe function calls, their parameters, and their results.</p><p>For example, imagine we had a Java function like <code>public String sayHello(String name)</code>; an XML-RPC call to that function would produce the following XML document request that is sent from the client to the server:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-xml">&lt;methodCall&gt;
  &lt;methodName&gt;sayHello&lt;/methodName&gt;
  &lt;params&gt;
    &lt;param&gt;
      &lt;value&gt;Adam&lt;/value&gt;
    &lt;/param&gt;
  &lt;/params&gt;
&lt;/methodCall&gt;</code></pre><figcaption>Simple XML-RPC Request</figcaption></figure><!--kg-card-end: code--><p>The server would then execute the <code>sayHello</code> function, and all being well, might send this XML response document back to the client:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-xml">&lt;methodResponse&gt;
  &lt;params&gt;
    &lt;param&gt;
      &lt;value&gt;Hello Adam!&lt;/value&gt;
    &lt;/param&gt;
  &lt;/params&gt;
&lt;/methodResponse&gt;</code></pre><figcaption>Simple XML-RPC Response</figcaption></figure><!--kg-card-end: code--><p>So far so good, but our use case is more complicated, as we want to be able to send any XML DOM type (e.g. <code>org.w3c.dom.Document</code>, <code>org.w3c.dom.Element</code>, <code>org.w3c.dom.Text</code>, etc.) as either function parameters or the function result type. Imagine that we had another Java function like: <code>public Document createInvoice(Attr id, Element address, Element[] items)</code>. We now need to be able to put XML inside our XML-RPC request and/or response document. By default, XML-RPC does not support that. XML-RPC has quite a limited type system that supports just:</p><ol><li>32-bit signed integers.</li><li>64-bit double-precision signed floating-point numbers.</li><li>Boolean values.</li><li>Strings.</li><li>ISO 8601 date/time.</li><li>Base64.</li><li>Arrays and Structures composed of the prior types.</li></ol><p>Apache XML-RPC allows you to add extensions in the form of custom XML serializers/deserializers for your own types. However, as our goal here was the serialization/deserialization of general XML and not some custom type, and furthermore as we found some incomplete support for this in Apache XML-RPC, we decided it was best to extend that prior work to completion. After our additions, an XML-RPC call to the <code>createInvoice</code> function now produces the following XML document request that is sent from the client to the server:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-xml">&lt;methodCall xmlns:ex="http://ws.apache.org/xmlrpc/namespaces/extensions" xmlns:dom="http://ws.apache.org/xmlrpc/namespaces/extensions/dom"&gt;
  &lt;methodName&gt;createInvoice&lt;/methodName&gt;
  &lt;params&gt;
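    &lt;!-- NOTE: the dom:type attribute appears to correspond to the org.w3c.dom.Node node-type constants, e.g. 1 = Element, 2 = Attr, 9 = Document --&gt;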
    &lt;param&gt;
      &lt;value&gt;
        &lt;ex:dom dom:type="2"&gt;&lt;dom:attribute my-identifier="id1"/&gt;&lt;/ex:dom&gt;
      &lt;/value&gt;
    &lt;/param&gt;
    &lt;param&gt;
      &lt;value&gt;
        &lt;ex:dom dom:type="1"&gt;&lt;address&gt;&lt;number&gt;99&lt;/number&gt;&lt;street&gt;Via Medail&lt;/street&gt;&lt;city&gt;Bardonecchia&lt;/city&gt;&lt;province&gt;TO&lt;/province&gt;&lt;/address&gt;&lt;/ex:dom&gt;
      &lt;/value&gt;
    &lt;/param&gt;
    &lt;param&gt;
      &lt;value&gt;
        &lt;array&gt;
          &lt;data&gt;
            &lt;ex:dom dom:type="1"&gt;&lt;item&gt;&lt;name&gt;Sprockets&lt;/name&gt;&lt;quantity&gt;5&lt;/quantity&gt;&lt;unit-cost currency="GBP"&gt;7.50&lt;/unit-cost&gt;&lt;/item&gt;&lt;/ex:dom&gt;
            &lt;ex:dom dom:type="1"&gt;&lt;item&gt;&lt;name&gt;Springets&lt;/name&gt;&lt;quantity&gt;12&lt;/quantity&gt;&lt;unit-cost currency="EUR"&gt;19.20&lt;/unit-cost&gt;&lt;/item&gt;&lt;/ex:dom&gt;
          &lt;/data&gt;
        &lt;/array&gt;
      &lt;/value&gt;
    &lt;/param&gt;
  &lt;/params&gt;
&lt;/methodCall&gt;</code></pre><figcaption>XML-RPC Request containing embedded XML</figcaption></figure><!--kg-card-end: code--><p>The server would then execute the <code>createInvoice</code> function, and all being well, might send back this XML response document to the client:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-xml">&lt;methodResponse xmlns:ex="http://ws.apache.org/xmlrpc/namespaces/extensions" xmlns:dom="http://ws.apache.org/xmlrpc/namespaces/extensions/dom"&gt;
  &lt;params&gt;
    &lt;param&gt;
      &lt;value&gt;
        &lt;ex:dom dom:type="9"&gt;&lt;invoice&gt; ... &lt;/invoice&gt;&lt;/ex:dom&gt;
      &lt;/value&gt;
    &lt;/param&gt;
  &lt;/params&gt;
&lt;/methodResponse&gt;</code></pre><figcaption>XML-RPC Response containing embedded XML</figcaption></figure><!--kg-card-end: code--><p>We wrapped that nice new feature up, and to signal that it could cause backward incompatibilities with previous versions, we released it as major version <a href="https://github.com/evolvedbinary/apache-xmlrpc/tree/xmlrpc-6.0.0">6.0.0</a>.</p><p>Unfortunately, during testing of 6.0.0 within our system, we found occasional exceptions being thrown with particular XML documents that we wanted to use as parameters to our functions. This initially puzzled me, as these were completely valid XML documents in their own right, and should have been handled correctly by our changes. Here is an example of such a document that caused an exception when used as a parameter (or function return type) within our Apache XML-RPC 6.0.0:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-xml">&lt;c:Site xmlns="urn:content" xmlns:c="urn:content"&gt;
  &lt;config xmlns="urn:config"&gt;123&lt;/config&gt;
  &lt;serverconfig xmlns="urn:config"&gt;123&lt;/serverconfig&gt;
&lt;/c:Site&gt;</code></pre><figcaption>Innocuous-looking XML document</figcaption></figure><!--kg-card-end: code--><p>Trying to use such a document within XML-RPC raised an exception like:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-java">java.lang.IllegalStateException: The prefix  isn't the prefix, which has been defined last.
	at org.apache.ws.commons.util.NamespaceContextImpl.endPrefixMapping(NamespaceContextImpl.java:95)
	at org.apache.ws.commons.util.test.NamespaceContextIT$NamespaceContextHandler.endPrefixMapping(NamespaceContextIT.java:90)
	...</code></pre><figcaption>Java Exception raised by processing the innocuous-looking XML document</figcaption></figure><!--kg-card-end: code--><p>A quick look into <a href="https://github.com/evolvedbinary/apache-ws-commons/blob/ws-commons-util-1.0.2/modules/util/src/main/java/org/apache/ws/commons/util/NamespaceContextImpl.java#L75"><code>org.apache.ws.commons.util.NamespaceContextImpl#endPrefixMapping(String)</code></a>, and we find this little nugget of Javadoc:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-java">* @throws IllegalStateException The prefix is not the prefix, which
* has been defined last. In other words, the calls to
* {@link #startPrefixMapping(String, String)}, and
* {@link #endPrefixMapping(String)} aren't in LIFO order.</code></pre><figcaption>Javadoc from NamespaceContextImpl</figcaption></figure><!--kg-card-end: code--><p>This, and a further study of the complete code in <code>NamespaceContextImpl</code>, confirmed that it expects the interleaved calls to <code>startPrefixMapping</code> and <code>endPrefixMapping</code> to happen in LIFO (Last In, First Out) order. There's one big problem with that though, which is that this is not how prefix mappings work in the real world! Typically, such start/end prefix mapping methods are fired by an XML Parser or Serializer, and if we look at SAX (Simple API for XML), which is an approach used in many XML parsers and serializers, we see that it explicitly states <a href="https://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html#startPrefixMapping(java.lang.String,%20java.lang.String)">here</a>:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-java">Note that start/endPrefixMapping events are not guaranteed to be properly nested relative to each other</code></pre><figcaption>Javadoc from SAX's ContentHandler</figcaption></figure><!--kg-card-end: code--><p>Through using XML-RPC to process XML Elements that have complex namespace arrangements, it seems we have found another bug. Unfortunately for us, this bug was not actually in Apache XML-RPC, but was in a library that it depends on - <code>ws-commons-util</code> (Apache Web Services Common Utilities). The Apache <code>ws-commons-util</code> library is even older than Apache XML-RPC, with its last <em>official</em> release version <a href="https://svn.apache.org/repos/asf/webservices/commons/tags/util/1.0.2/">1.0.2</a> being published back in August 2007 - some 18+ years ago! Since then it has not been maintained by the Apache project.</p><p>So we need to fix this bug too, of course, but how should we do that?</p><p>With Apache XML-RPC we had felt comfortable forking the project as we had a direct dependency on it ourselves, but this is another level removed, where we have an indirect dependency on Apache ws-commons-util through Apache XML-RPC. In addition, whilst Apache XML-RPC is quite application-specific, ws-commons-util is known to be used as a general utility library in lots of other XML projects too. In the Java world, it is effectively <strong>XML infrastructure</strong>. Initially, forking ws-commons-util felt like we were overreaching, and that any fix that we produced would be better upstreamed to Apache for all users to receive easily. We contacted two of the main authors of ws-commons-util, and we received a very kind response from Jochen Wiedmann, who apologised that it was unmaintained and explained that, whilst it may not be the response I hoped for, my best option was probably to fork it. Thank you very much Jochen :-)</p><p>Well, with one of the original authors' blessing in hand, and without any other option in sight, we forked the ws-commons-util project. I imported the Apache code from Subversion to our GitHub, fixed the bug by writing a completely new specification-compliant implementation of <code>NamespaceContextImpl</code>, and cut our own release. We did this work publicly and in accordance with the original license. We bumped the feature version number to try and signal that this was an incremental release, i.e. a new implementation of <code>NamespaceContextImpl</code> that, whilst maintaining the existing API, also adds a new and improved API.
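To make the failure mode concrete, below is a minimal, self-contained Java sketch (my own reconstruction, not code from either library) that prints the order in which a SAX parser reports the prefix mapping events for the innocuous-looking document from earlier. Depending on the parser, the <code>endPrefixMapping</code> calls for the root element may be reported in the same order as the corresponding <code>startPrefixMapping</code> calls, i.e. not in LIFO order; note the blank prefix in the exception message above - that is the default namespace mapping being ended out of (LIFO) order:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-java">import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PrefixMappingOrder {
    public static void main(final String[] args) throws Exception {
        final String xml =
                "&lt;c:Site xmlns=\"urn:content\" xmlns:c=\"urn:content\"&gt;" +
                "&lt;config xmlns=\"urn:config\"&gt;123&lt;/config&gt;" +
                "&lt;serverconfig xmlns=\"urn:config\"&gt;123&lt;/serverconfig&gt;" +
                "&lt;/c:Site&gt;";

        final SAXParserFactory factory = SAXParserFactory.newInstance();
        // namespace awareness is required, otherwise no prefix mapping events are fired
        factory.setNamespaceAware(true);

        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startPrefixMapping(final String prefix, final String uri) {
                System.out.println("startPrefixMapping('" + prefix + "', '" + uri + "')");
            }

            @Override
            public void endPrefixMapping(final String prefix) {
                // SAX makes no LIFO guarantee here, so a strict LIFO check (as in the
                // old NamespaceContextImpl) can throw for perfectly valid documents
                System.out.println("endPrefixMapping('" + prefix + "')");
            }
        });
    }
}</code></pre><figcaption>Observing the (not necessarily LIFO) SAX prefix mapping event order - a reconstruction, not library code</figcaption></figure><!--kg-card-end: code--><p>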
We published a new release of Apache ws-commons-util version <a href="https://github.com/evolvedbinary/apache-ws-commons/tree/ws-commons-1.1.0">1.1.0</a> (in our own public namespace).</p><p>Finally, we released <a href="https://github.com/evolvedbinary/apache-xmlrpc/tree/xmlrpc-6.1.0">Apache XML-RPC version 6.1.0</a>, where we changed its dependency from the official Apache ws-commons-util 1.0.2 to our newer fork: ws-commons-util 1.1.0.</p><p>If you are a user of Apache XML-RPC or ws-commons-util, we would love to hear from you if this has been helpful for you too.</p><p>It seems that at my company, <a href="https://www.evolvedbinary.com">Evolved Binary</a>, we are creating a bit of a pattern of forking and maintaining XML infrastructure. I recognise that this is not without its own risks, and I hope to discuss those on this blog in detail soon.</p><p>In the meantime, for your perusal, Evolved Binary now maintain the following XML infrastructure projects:</p><ol><li>Builds of Apache Xerces, with and without XML Schema 1.1 support, and with and without Java 14+ support - <a href="https://central.sonatype.com/artifact/com.evolvedbinary.thirdparty.xerces/xercesImpl">https://central.sonatype.com/artifact/com.evolvedbinary.thirdparty.xerces/xercesImpl</a><br></li><li>Apache XML APIs - <a href="https://central.sonatype.com/artifact/com.evolvedbinary.thirdparty.xml-apis/xml-apis">https://central.sonatype.com/artifact/com.evolvedbinary.thirdparty.xml-apis/xml-apis</a><br></li><li>Eclipse XPath 2 Engine - <a href="https://central.sonatype.com/artifact/com.evolvedbinary.thirdparty.org.eclipse.wst.xml/xpath2">https://central.sonatype.com/artifact/com.evolvedbinary.thirdparty.org.eclipse.wst.xml/xpath2</a><br></li><li>Milton WebDAV library - <a href="https://central.sonatype.com/namespace/org.exist-db.thirdparty.com.ettrema">https://central.sonatype.com/namespace/org.exist-db.thirdparty.com.ettrema</a><br></li><li>Builds of XML Mind DITAC - <a href="https://central.sonatype.com/artifact/com.evolvedbinary.thirdparty.com.xmlmind/ditac">https://central.sonatype.com/artifact/com.evolvedbinary.thirdparty.com.xmlmind/ditac</a><br></li><li>JVNet JAXB Maven Plugin - <a href="https://github.com/evolvedbinary/jvnet-jaxb-maven-plugin">https://github.com/evolvedbinary/jvnet-jaxb-maven-plugin</a><br></li><li>Mojohaus JAXB Maven Plugin - <a href="https://github.com/evolvedbinary/mojohaus-jaxb-maven-plugin">https://github.com/evolvedbinary/mojohaus-jaxb-maven-plugin</a></li></ol>]]></content:encoded></item><item><title><![CDATA[Routing Success and Failure in XProc 3.0]]></title><description><![CDATA[An example of how to route success and failure from try/catch in XProc 3.0 to different Output Ports.]]></description><link>https://blog.adamretter.org.uk/routing-success-and-failure-in-xproc-3-0/</link><guid isPermaLink="false">67433c73decd9d025c804265</guid><category><![CDATA[XProc]]></category><category><![CDATA[xml]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Sun, 24 Nov 2024 16:30:38 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1694674818352-f6061a0561a1?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEwfHxwaXBlbGluZSUyMHxlbnwwfHx8fDE3MzI0NjU2ODN8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" medium="image"/><content:encoded><![CDATA[<img
src="https://images.unsplash.com/photo-1694674818352-f6061a0561a1?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEwfHxwaXBlbGluZSUyMHxlbnwwfHx8fDE3MzI0NjU2ODN8MA&ixlib=rb-4.0.3&q=80&w=1080" alt="Routing Success and Failure in XProc 3.0"><p>Recently on a new project where I needed to transform documents from a variety of source formats into a standard output format, I decided to try and orchestrate this by using XProc 3.0. In the past when faced with a similar challenge I have typically built the pipeline itself in Java or XQuery, but I felt that the goals and purpose of XProc should mean that it is a better fit for such projects.</p><p>One aspect that I have repeatedly been struggling with when building my XProc 3.0 pipeline is that of comprehending the exact flow of documents and/or data through the pipeline. Specifically, I want to be able to have separate success and failure flows in my pipeline, so that I can manage them with independent steps. To achieve this, I want to be able to capture any failures and route them to separate sub-pipelines that handle or recover from such failures.</p><p>I have tried to illustrate below an example of a simple pipeline where the main step of consequence is "Parse Document", whose tasks is to parse an XML document. If the parsing succeeds the pipeline should store the document to a location on the filesystem, and then return a copy of the document as the output of the pipeline. If the parsing fails, perhaps because the document is not well-formed, then it should create an error report of the details and store the error report to a different location on the filesystem.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2024/11/example-simple-parse-pipeline.png" class="kg-image" alt="Routing Success and Failure in XProc 3.0"><figcaption>Illustration of an example Parsing pipeline with separate Success and Failure routes</figcaption></figure><!--kg-card-end: image--><p>The point I am trying to illustrate is that I would like a different flow through the pipeline depending on whether the "Parse Document" step succeeds or fails.</p><p>In XProc 3.0 we can encapsulate any other step within a <em>try/catch</em> step. In this case we could encapsulate a <code>p:load</code> step within a <code>p:try</code> sub-pipeline. This enables us to catch and handle any error that is raised by the <code>p:load</code> step. XProc 3.0 also has the concept of Output Ports, and we can use them in this case to setup two Output Ports, one for success, and a separate one for any failure.</p><p>The theory, at least to me, of what we need to do is simple and straight-forward. Unfortunately, I spent a couple of days struggling to get this working in XProc 3.0; I experimented with many syntactic variations to try and implement this. Ultimately, after some kind pointers from both Norman Walsh and Achim Berndzen (thank you both!), I was able to construct an example that finally worked. I am reproducing this below for both my own reference, and for anyone else that might want to achieve the same thing.</p><h2 id="routing-from-try-catch-in-xproc-3-0">Routing from Try/Catch in XProc 3.0</h2><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;p:declare-step
  xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:my="http://me"
  version="3.0"&gt;

  &lt;!-- TODO(AR) switch which one of these is commented out to see the result or the error --&gt;
  &lt;!--
  &lt;p:output port="result" sequence="true" primary="true" pipe="result@two-outputs-example"/&gt;
  --&gt;
  &lt;p:output port="result" sequence="true" primary="true" pipe="error@two-outputs-example"/&gt;



  &lt;!-- This step that has two output ports, only one will contain documents.
        * 'result' has documents if everything executes successfully.
        * 'error' has a document describing any error that occurred.
  --&gt;
  &lt;p:declare-step type="my:two-outputs-example"&gt;
    &lt;p:output port="result" sequence="true" primary="true"/&gt;
    &lt;p:output port="error"  sequence="true" pipe="failure@try1"/&gt;
    

    &lt;p:try name="try1"&gt;
      &lt;p:output port="result" sequence="true" primary="true"/&gt;

      &lt;!-- NOTE(AR) within this group could be a `p:load` or anything else that might raise an error --&gt;
      &lt;p:group&gt;
        &lt;p:identity&gt;
          &lt;p:with-input&gt;
            &lt;p:inline&gt;&lt;TRY-RESULT&gt;FROM TRY&lt;/TRY-RESULT&gt;&lt;/p:inline&gt;
          &lt;/p:with-input&gt;
        &lt;/p:identity&gt;
        
        &lt;!-- TODO(AR) comment out the line below out to instead get the 'result' of the try sub-pipeline --&gt;
        &lt;p:error code="ADAM-ERROR-01"/&gt;
      &lt;/p:group&gt;
      

      &lt;p:catch&gt;
        &lt;p:output port="result"  sequence="true" primary="true"/&gt;
        &lt;p:output port="failure" sequence="true" pipe="result@caught"/&gt;
        
        &lt;p:wrap name="caught"&gt;
          &lt;p:with-input port="source" pipe="error"/&gt;
          &lt;p:with-option name="wrapper" select="xs:QName('my:error-report')"/&gt;
          &lt;p:with-option name="match" select="'/'"/&gt;
        &lt;/p:wrap&gt;
        
        
        &lt;!-- NOTE(AR) we must discard output from catch sub-pipeline to stop it going to the primary 'result' output port --&gt;
        &lt;p:sink/&gt;
        
      &lt;/p:catch&gt;
    &lt;/p:try&gt;

  &lt;/p:declare-step&gt;
  
 
  &lt;my:two-outputs-example name="two-outputs-example"/&gt;


&lt;/p:declare-step&gt;</code></pre><figcaption>example-try-catch-routing.xproc - XProc 3.0 Pipeline with Success/Failure routing from Try/Catch</figcaption></figure><!--kg-card-end: code--><p>To experiment with this XProc 3.0 pipeline, you can run it using <a href="https://www.xml-project.com/morganaxproc-iiise.html">Morgana XProc IIIse</a> by executing:</p><!--kg-card-begin: code--><pre><code>$ ./Morgana.sh example-try-catch-routing.xproc</code></pre><!--kg-card-end: code--><p>If you want to run and see the result of the Success flow:</p><ol><li>Uncomment the line: <code>&lt;p:output port="result" sequence="true" primary="true" pipe="result@two-outputs-example"/&gt;</code></li><li>Comment out the line: <code>&lt;p:output port="result" sequence="true" primary="true" pipe="error@two-outputs-example"/&gt;</code></li><li>Comment out the line: <code>&lt;p:error code="ADAM-ERROR-01"/&gt;</code></li></ol><p>If you want to run and see the result of the Failure flow:</p><ol><li>Comment out the line: <code>&lt;p:output port="result" sequence="true" primary="true" pipe="result@two-outputs-example"/&gt;</code></li><li>Uncomment the line: <code>&lt;p:output port="result" sequence="true" primary="true" pipe="error@two-outputs-example"/&gt;</code></li><li>Uncomment the line: <code>&lt;p:error code="ADAM-ERROR-01"/&gt;</code></li></ol><p>I have tried to show an illustration below of the data flow through such a pipeline. I admit that it is not a very good illustration; it might have been better to flatten all of the steps and connect them by their ports only, rather than showing them in the same nested fashion as the XProc syntax.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2024/11/example-try-catch-routing-pipeline.png" class="kg-image" alt="Routing Success and Failure in XProc 3.0"><figcaption>Illustration of an XProc 3.0 Pipeline with Success/Failure routing from Try/Catch</figcaption></figure><!--kg-card-end: image-->]]></content:encoded></item><item><title><![CDATA[Running OpenBSD 7.4 under UTM on macOS]]></title><description><![CDATA[<p>I had some difficulty with installing and running OpenBSD 7.4 (and previously 7.3) under UTM on macOS. My main issue appears to be that OpenBSD cannot access the Mouse or Keyboard from the Mac. I am unsure whether this is a problem caused by one or more of</p>]]></description><link>https://blog.adamretter.org.uk/running-openbsd-74-under-utm/</link><guid isPermaLink="false">6594787b312ce60ea93d74c6</guid><category><![CDATA[OpenBSD]]></category><category><![CDATA[macOS]]></category><category><![CDATA[UTM]]></category><category><![CDATA[Virtual Machine]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Tue, 16 Jan 2024 19:07:30 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1557053503-0c252e5c8093?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fHB1ZmZlciUyMGZpc2h8ZW58MHx8fHwxNzA0MjMyMTI5fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1557053503-0c252e5c8093?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=M3wxMTc3M3wwfDF8c2VhcmNofDJ8fHB1ZmZlciUyMGZpc2h8ZW58MHx8fHwxNzA0MjMyMTI5fDA&ixlib=rb-4.0.3&q=80&w=1080" alt="Running OpenBSD 7.4 under UTM on macOS"><p>I had some difficulty with installing and running OpenBSD 7.4 (and previously 7.3) under UTM on macOS.
My main issue appears to be that OpenBSD cannot access the Mouse or Keyboard from the Mac. I am unsure whether this is a problem caused by one or more of UTM, QEMU, OpenBSD, or possibly macOS, but I can report that I have no problems with Ubuntu under UTM on macOS.</p><p>I have a MacBook Pro 2019 (Intel Core i9, 32 GB RAM) running macOS Sonoma 14.1.1. I am using UTM 4.4.5, and have downloaded the OpenBSD 7.4 image for the amd64 architecture.</p><p>Thankfully, I have managed to find workarounds for each of the issues, which I am documenting below for my own memory, and because hopefully they may be useful for others.</p><p><strong>Update 2024-01-17:</strong> <a href="#update-2024-01-17-simpler-workaround-i440fx-instead-of-q35">A simpler workaround in UTM to fix the unresponsive Keyboard and Mouse</a>.</p><h2 id="how-to-boot-the-openbsd-install-image-in-utm">How to Boot the OpenBSD Install Image in UTM</h2><p>From the <a href="https://www.openbsd.org">openbsd.org</a> website, I downloaded the OpenBSD version 7.4 bootable image for amd64 systems from my closest mirror; the file is called <code>install74.img</code>.</p><p>I created a new VM (Virtual Machine) in UTM by selecting <em>Virtualize</em> and choosing the <em>Other</em> category; I then assigned it 2 vCPUs, 8 GB RAM, and a 40 GB hard disk. However, as the <code>install74.img</code> file is not an ISO file, I had to click "<em>Skip ISO boot</em>" in the first stage of the configuration wizard, and then "<em>Open VM Settings</em>" in the last stage. From the VM settings I added a second drive of type <em>Removable USB</em>, set it to <em>Read-Only,</em> and selected the <code>install74.img</code> file as the source.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2024/01/Screenshot-2024-01-02-at-20.09.26.png" class="kg-image" alt="Running OpenBSD 7.4 under UTM on macOS"><figcaption>USB Disk Image for install74.img</figcaption></figure><!--kg-card-end: image--><h2 id="how-to-use-the-keyboard-in-openbsd-installer-in-utm">How to use the Keyboard in OpenBSD Installer in UTM</h2><p>Initially, after booting the OpenBSD image described above, I found that I was unable to use the keyboard within the OpenBSD installer inside UTM, or at least OpenBSD did not seem to be receiving any keystrokes!</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2024/01/Screenshot-2024-01-02-at-20.26.14.png" class="kg-image" alt="Running OpenBSD 7.4 under UTM on macOS"><figcaption>Keyboard appears non-responsive within the OpenBSD Installer inside UTM</figcaption></figure><!--kg-card-end: image--><p>A little bit of experimentation showed, however, that before the installer starts, whilst at the initial OpenBSD kernel <code>boot&gt;</code> prompt, the keyboard was actually working. With that knowledge, I was then able to add a Serial Port to the VM in UTM.
One of the nice features of UTM is that it will start up a terminal in a window that is connected to the new serial port.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2024/01/Screenshot-2024-01-02-at-22.27.07.png" class="kg-image" alt="Running OpenBSD 7.4 under UTM on macOS"><figcaption>Configuring a Serial Port for the VM in UTM</figcaption></figure><!--kg-card-end: image--><p>After adding the Serial Port to the VM, I rebooted OpenBSD and told it to redirect its default console to the serial port by entering the command <code>set tty com0</code> at the <code>boot&gt;</code> prompt.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2024/01/Screenshot-2024-01-02-at-21.32.06.png" class="kg-image" alt="Running OpenBSD 7.4 under UTM on macOS"><figcaption>Switching the default console to the com0 Serial Port during OpenBSD boot</figcaption></figure><!--kg-card-end: image--><p>With that done, I was able to boot and install OpenBSD as normal, albeit using the serial port terminal for interacting with it via the keyboard.</p><p>One more thing to note is that during the install process, the installer will ask whether to permanently switch the default console to the serial port; due to the issue in UTM we should answer <em>yes</em>, and accept the defaults for the serial port speed, i.e.:</p><!--kg-card-begin: code--><pre><code class="language-sh">Change the default console to com0? [yes] yes
Available speeds are: 9600 19200 38400 57600 115200.
Which speed should com0 use? (or 'done') [9600] 9600</code></pre><!--kg-card-end: code--><h2 id="how-to-use-the-keyboard-and-mouse-with-x-window-system-in-openbsd-in-utm">How to use the Keyboard and Mouse with X Window System in OpenBSD in UTM</h2><p>During the installation process I opted to install X; however, after the install finished and the system had rebooted, I could see the X Window System on the default console but I could not interact with it via the Mouse or Keyboard.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2024/01/Screenshot-2024-01-02-at-21.54.44.png" class="kg-image" alt="Running OpenBSD 7.4 under UTM on macOS"><figcaption>Mouse and Keyboard appear non-responsive within the OpenBSD X Window System inside UTM</figcaption></figure><!--kg-card-end: image--><p>We can work around this by installing a VNC Server and then remotely controlling the X Window System via a VNC Client. We can install and configure the VNC Server using the Serial Console that we set up earlier:</p><!--kg-card-begin: code--><pre><code>su -
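# (as root) install the TigerVNC package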
pkg_add tigervnc
exit

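# back as the regular user: create a VNC password file at ~/.vnc/passwd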
vncpasswd

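# share the current X display (:0) over VNC; note that the xenodm authority
# file name (here 'A:0-r4dlnM') is generated per-host, so list
# /etc/X11/xenodm/authdir/authfiles/ to find the name on your system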
XAUTHORITY=/etc/X11/xenodm/authdir/authfiles/A:0-r4dlnM x0vncserver -display :0 -PasswordFile ~/.vnc/passwd</code></pre><!--kg-card-end: code--><p>You can then connect to the desktop via a VNC Client from the host Mac. Any VNC-compatible client should work; personally I am using <a href="https://www.realvnc.com/en/">Real VNC</a> for this:</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2024/01/Screenshot-2024-01-16-at-19.52.14.png" class="kg-image" alt="Running OpenBSD 7.4 under UTM on macOS"><figcaption>Connecting via VNC from Mac to OpenBSD X Window System</figcaption></figure><!--kg-card-end: image--><h3 id="additional-utm-vm-settings-for-openbsd">Additional UTM VM Settings for OpenBSD</h3><p>Whilst the above is enough to get OpenBSD 7.4 running in UTM, I also made some additional changes to the VM configuration in UTM:</p><ul><li>Enable the <em>QEMU</em> -&gt; <em>Balloon Device.</em></li><li>Enable the <em>Display</em> -&gt; <em>Retina Mode.</em></li><li>Change the <em>Network</em> -&gt; <em>Emulated Network Card</em> to <code>virtio-net-pci</code>.</li><li>Change the <em>IDE Drive</em> -&gt; <em>Interface </em>from <code>IDE</code> to <code>VirtIO</code>.</li></ul><h2 id="update-2024-01-17-simpler-workaround-i440fx-instead-of-q35">Update 2024-01-17: Simpler Workaround - i440FX instead of Q35</h2><p>After submitting this article to <a href="https://lobste.rs/">Lobste.rs</a>, I had <a href="https://lobste.rs/s/oaqtsr/running_openbsd_7_4_under_utm_on_macos#c_penhdj">some feedback</a> from a user named <code>vtech</code> who suggested that changing the system type from Q35 (the default in UTM) to an i440FX would resolve the unresponsive Keyboard and Mouse issues I had been experiencing. I can confirm that this does appear to also work around the issue in a simpler fashion. Thanks <code>vtech</code>!</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2024/01/Screenshot-2024-01-17-at-16.10.34.png" class="kg-image" alt="Running OpenBSD 7.4 under UTM on macOS"><figcaption>Setting the System to i440FX instead of Q35 in UTM makes the Keyboard and Mouse responsive again</figcaption></figure><!--kg-card-end: image-->]]></content:encoded></item><item><title><![CDATA[Ubuntu RISC-V 64 Guest on an x86_64 KVM Host]]></title><description><![CDATA[<p>Ubuntu's <a href="https://ubuntu.com/server/docs/virtualization-uvt">uvtool</a> utility is fantastic for easily creating small Cloud VMs (Virtual Machines). Sadly, it didn't yet support creating VMs that provide <a href="https://en.wikipedia.org/wiki/RISC-V">RISC-V </a>emulation on x86_64 hosts.
However, with a little bit of patching to Ubuntu 22.04's <code>uvtool</code> package, I was able to get it to create an</p>]]></description><link>https://blog.adamretter.org.uk/ubuntu-riscv64-guest-on-kvm/</link><guid isPermaLink="false">657759ec312ce60ea93d7334</guid><category><![CDATA[RISC-V]]></category><category><![CDATA[KVM]]></category><category><![CDATA[QEMU]]></category><category><![CDATA[hosting]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Mon, 11 Dec 2023 21:09:01 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1598406444638-d4e842c982ab?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fHNpbGljb258ZW58MHx8fHwxNzAyMzI4MzY4fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1598406444638-d4e842c982ab?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=M3wxMTc3M3wwfDF8c2VhcmNofDh8fHNpbGljb258ZW58MHx8fHwxNzAyMzI4MzY4fDA&ixlib=rb-4.0.3&q=80&w=1080" alt="Ubuntu RISC-V 64 Guest on an x86_64 KVM Host"><p>Ubuntu's <a href="https://ubuntu.com/server/docs/virtualization-uvt">uvtool</a> utility is fantastic for easily creating small Cloud VMs (Virtual Machines). Sadly, it didn't yet support creating VMs that provide <a href="https://en.wikipedia.org/wiki/RISC-V">RISC-V </a>emulation on x86_64 hosts. However, with a little bit of patching to Ubuntu 22.04's <code>uvtool</code> package, I was able to get it to create an (emulated) RISC-V Ubuntu 22.04 KVM guest VM.</p><p>My Host system was:</p><ul><li>Intel Xeon E5-1650v3</li><li><a href="https://ubuntu.com/">Ubuntu</a> 22.04.3 LTS</li><li><a href="https://www.qemu.org/">QEMU</a> emulator version 6.2.0 (installed from Ubuntu's apt)</li><li><a href="https://libvirt.org/">libvirt</a> 8.0.0-1 (installed from Ubuntu's apt)</li><li><a href="https://launchpad.net/uvtool">uvtool</a> 0~git178-0 (installed from Ubuntu's apt)</li></ul><p>If you don't already have QEMU, KVM, and libvirt installed you will need to run:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-shell">sudo apt install -y cpu-checker qemu qemu-kvm libvirt-daemon libvirt-clients bridge-utils dnsmasq</code></pre><figcaption>Install QEMU, KVM, and libvirt</figcaption></figure><!--kg-card-end: code--><p>To install uvtool you will need to run:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-shell">sudo apt-get install -y uvtool</code></pre><figcaption>Install uvtool</figcaption></figure><!--kg-card-end: code--><p><br>To install support for RISC-V emulation you will need to run:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-shell">sudo apt-get install -y qemu-system-misc u-boot-qemu</code></pre><figcaption>Install RISC-V emulation support</figcaption></figure><!--kg-card-end: code--><h3 id="patching-uvtool-to-support-riscv64-emulation">Patching uvtool to Support riscv64 Emulation</h3><p>The latest version of uvtool (version 0~git178-0) at the time of writing does not support emulation of riscv64 via QEMU. 
However, by applying my patches you can get this to work.</p><p>If you are familiar with Git and Python, you can find my patches here: <a href="https://code.launchpad.net/~adam-retter/ubuntu/+source/uvtool/+git/uvtool/+merge/457260">https://code.launchpad.net/~adam-retter/ubuntu/+source/uvtool/+git/uvtool/+merge/457260</a></p><p>If you are not, then you can take the following steps to apply my changes manually:</p><!--kg-card-begin: markdown--><ol>
<li>
<p>Make a backup of the file: <code>/usr/lib/python3/dist-packages/uvtool/libvirt/kvm.py</code>, and then make the following changes to the original by replacing:</p>
<pre><code class="language-python">if target_arch == 'armhf':
    return '/usr/share/uvtool/libvirt/template-emu-armhf.xml'
</code></pre>
<p>with:</p>
<pre><code class="language-python">if target_arch == 'armhf':
    return '/usr/share/uvtool/libvirt/template-emu-armhf.xml'
if target_arch == 'riscv64':
    return '/usr/share/uvtool/libvirt/template-emu-riscv64.xml'
</code></pre>
</li>
<li>
<p>Make a backup of the file: <code>/usr/lib/python3/dist-packages/uvtool/libvirt/__init__.py</code>, and then make the following changes to the original by replacing:</p>
<pre><code class="language-python">'ppc64le': 'ppc64el',
</code></pre>
<p>with:</p>
<pre><code class="language-python">'ppc64le': 'ppc64el',
'riscv64': 'riscv64',
</code></pre>
<p>and, by replacing:</p>
<pre><code class="language-python"># early exit on not supported emulations
if target_arch != 'armhf':
    return False
</code></pre>
<p>with:</p>
<pre><code class="language-python"># early exit on not supported emulations
if target_arch != 'armhf' and target_arch != 'riscv64':
    return False
</code></pre>
</li>
<li>
<p>Delete the folder <code>/usr/lib/python3/dist-packages/uvtool/libvirt/__pycache__</code></p>
</li>
<li>
<p>Create the following file at <code>/usr/share/uvtool/libvirt/template-emu-riscv64.xml</code>:</p>
<pre><code class="language-xml">&lt;domain type='qemu'&gt;
    &lt;os&gt;
        &lt;type arch='riscv64' machine='virt'&gt;hvm&lt;/type&gt;
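        &lt;!-- boot via the S-mode U-Boot ELF installed by the u-boot-qemu package --&gt;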
        &lt;kernel&gt;/usr/lib/u-boot/qemu-riscv64_smode/uboot.elf&lt;/kernel&gt;
        &lt;cmdline&gt;root=/dev/vda1&lt;/cmdline&gt;
        &lt;boot dev='hd'/&gt;
    &lt;/os&gt;
    &lt;features&gt;
        &lt;acpi/&gt;
    &lt;/features&gt;
    &lt;devices&gt;
        &lt;emulator&gt;/usr/bin/qemu-system-riscv64&lt;/emulator&gt;
        &lt;interface type='network'&gt;
            &lt;source network='default'/&gt;
            &lt;target dev='vnet0'/&gt;
        &lt;/interface&gt;
        &lt;console type='pty'&gt;
            &lt;target type='serial' port='0'/&gt;
        &lt;/console&gt;
    &lt;/devices&gt;
&lt;/domain&gt;
</code></pre>
</li>
</ol>
<!--kg-card-end: markdown--><h3 id="creating-a-riscv64-emulated-kvm-guest-with-uvtool">Creating a riscv64 Emulated KVM Guest with uvtool</h3><p>First, we will use uvtool to download a Cloud Image of Ubuntu 22.04 (Jammy Jellyfish) for RISC-V 64; this can be done by running:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-shell">sudo uvt-simplestreams-libvirt sync arch=riscv64 release=jammy</code></pre><figcaption>Download the Cloud Image of Ubuntu 22.04 for RISC-V 64</figcaption></figure><!--kg-card-end: code--><p>After that completes, you can check that you have successfully downloaded the image by running the following command and observing the output:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-shell">sudo uvt-simplestreams-libvirt query

release=jammy arch=riscv64 label=release (20231211)</code></pre><figcaption>Show downloaded Ubuntu Cloud Images</figcaption></figure><!--kg-card-end: code--><p>To create an emulated RISC-V 64 KVM guest with <code>uvtool</code>, you need to pass the argument <code>--guest-arch riscv64</code> to its <code>uvt-kvm</code> command. For example:</p><!--kg-card-begin: code--><pre><code class="language-shell">sudo uvt-kvm create \
    --guest-arch riscv64 --cpu 2 --memory 16384 --disk 40 \
    --bridge virbr1 --network-config /tmp/myvm-net.yaml \
    --ssh-public-key-file /tmp/ssh/myvm.pub \
    myvm \
    arch=riscv64 release=jammy label=release</code></pre><!--kg-card-end: code--><p>You may need to vary the above configuration depending on your network settings:</p><ul><li>I already have <code>virbr1</code> configured as a virtual network bridge on the host.</li><li>I have already generated an SSH key-pair at <code>/tmp/ssh</code> using <code>ssh-keygen</code>.</li><li>I have created a <a href="https://netplan.io/">Netplan</a> configuration file at <code>/tmp/myvm-net.yaml</code> for <code>myvm</code> that will be deployed by <a href="https://cloud-init.io/">cloud-init</a> when the VM first boots:</li></ul><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-yaml">version: 2
ethernets:
  enp1s0:
    addresses:
      - 10.0.0.10/32
      - fc00:10:0:0::10/128
    nameservers:
      addresses:
        - 10.0.0.253
        - fc00:10:0:0::253
      search:
        - home.dom
    routes:
      - to: 0.0.0.0/0
        via: 10.0.0.254
        on-link: true
      - to: "::/0"
        via: "fc00:10:0:0::254"
        on-link: true</code></pre><figcaption>Example Netplan configuration file for the VM</figcaption></figure><!--kg-card-end: code--><p>After <code>uvt-kvm</code> completes successfully, we can check the status of the VM by running:</p><!--kg-card-begin: code--><pre><code>virsh list --all

 Id   Name         State
-----------------------------
 1    myvm         running</code></pre><!--kg-card-end: code--><p>You can then observe the console of the VM by running <code>virsh console myvm</code>, or by SSH'ing to the IP address of the VM using the SSH key-pair previously generated and the username <code>ubuntu</code>, e.g. <code>ssh -i /tmp/ssh/myvm ubuntu@10.0.0.10</code>. After logging in, you will see a banner like this:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-shell">Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 5.19.0-1021-generic riscv64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Mon Dec 11 18:28:49 UTC 2023

  System load:             0.15380859375
  Usage of /:              4.1% of 38.60GB
  Memory usage:            1%
  Swap usage:              0%
  Processes:               101
  Users logged in:         0
  IPv4 address for enp1s0: 10.0.0.10
  IPv6 address for enp1s0: fc00:10:0:0::10</code></pre><figcaption>Ubuntu 22.04 RISC-V Login MOTD</figcaption></figure><!--kg-card-end: code--><p>Finally, running <code>uname -av</code> and <code>cat /proc/cpuinfo</code> confirms that this is indeed a RISC-V 64 system:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-shell">uname -av

Linux blackberry 5.19.0-1021-generic #23~22.04.1-Ubuntu SMP Thu Jun 22 12:49:35 UTC 2023 riscv64 riscv64 riscv64 GNU/Linux


cat /proc/cpuinfo

processor	: 0
hart		: 0
isa		: rv64imafdc
mmu		: sv48

processor	: 1
hart		: 1
isa		: rv64imafdc
mmu		: sv48</code></pre><figcaption>Ubuntu 22.04 VM Emulated RISC-V CPU Details</figcaption></figure><!--kg-card-end: code-->]]></content:encoded></item><item><title><![CDATA[AWS Client VPN with Manually Provisioned Certificates using Terraform]]></title><description><![CDATA[<p>This is a brief explanation of how to use Terraform to set up an AWS CVPN (Client VPN) where the certificates (for VPN authentication) are manually provisioned by yourself and then uploaded into ACM (AWS Certificate Manager).</p><p>The advantage of using a manually provisioned approach is that the cost is significantly</p>]]></description><link>https://blog.adamretter.org.uk/terraform-aws-client-vpn-with-openssl-certificates/</link><guid isPermaLink="false">6506dd07312ce60ea93d704a</guid><category><![CDATA[AWS]]></category><category><![CDATA[VPN]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[OpenSSL]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Mon, 25 Sep 2023 12:30:01 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1599959464432-458179a13352?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEyfHxsb2NrfGVufDB8fHx8MTY5NTU4OTUyM3ww&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1599959464432-458179a13352?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=M3wxMTc3M3wwfDF8c2VhcmNofDEyfHxsb2NrfGVufDB8fHx8MTY5NTU4OTUyM3ww&ixlib=rb-4.0.3&q=80&w=1080" alt="AWS Client VPN with Manually Provisioned Certificates using Terraform"><p>This is a brief explanation of how to use Terraform to set up an AWS CVPN (Client VPN) where the certificates (for VPN authentication) are manually provisioned by yourself and then uploaded into ACM (AWS Certificate Manager).</p><p>The advantage of using a manually provisioned approach is that the cost is significantly lower than a managed approach that utilises AWS Private CA (Certificate Authority). Previously, the cost of operating AWS Private CA for us was $800 / month. Apart from the initial manual steps required to generate the certificate chain, the main disadvantages of this approach over a fully managed approach are that (1) you also have to store your certificates and keys outside of the cloud in a secure location, and (2) you have to remember to renew certificates before they expire and then replace the expiring certificates with their newer counterparts in ACM.</p><p>If you are looking for a fully managed approach instead, you can find this in a previous blog article that I wrote - <a href="https://blog.adamretter.org.uk/terraform-aws-client-vpn-with-managed-certificates/">AWS Client VPN with Managed Certificates using Terraform</a>.</p><p>The approach taken in this article was informed by the Timeular blog post: <a href="https://timeular.com/blog/creating-an-aws-client-vpn-with-terraform/">Creating an AWS Client VPN with Terraform</a>. Unfortunately, I could not get their approach to work correctly, and so the approach detailed in my article is an adaptation/extension of their approach, but I want to thank them as my work builds upon theirs.</p><h2 id="chain-of-trust">Chain of Trust</h2><p>We will create a number of certificates that will form a chain of trust. Each subordinate certificate is signed by its parent certificate.
Our chain of trust will look like:</p><ol><li>Root CA Certificate.</li><li>Intermediate CA Certificate - this CA certificate is used as an intermediary between the Root CA and our other CAs. This makes it easier to revoke subordinate CA certificates should we need to. This certificate is signed by the Root CA Certificate.</li><li>CVPN CA Certificate - this CA certificate is used for issuing and signing certificates that are used by the CVPN Server and its clients (e.g. users) to connect to the CVPN Server. This certificate is signed by the Intermediate CA Certificate.</li><li>CVPN Server Certificate - this endpoint certificate is for the CVPN server itself in AWS. This certificate is signed by the CVPN CA Certificate.</li><li>CVPN Client Certificate - we create one of these endpoint certificates for each client (e.g. user) that wishes to connect to the CVPN Server. These certificates will also be signed by the CVPN CA Certificate.</li></ol><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2023/09/AWS-Client-VPN-Certificates---Approaches---Manually-Managed-Certificates.png" class="kg-image" alt="AWS Client VPN with Manually Provisioned Certificates using Terraform"><figcaption>Chain of Trust diagram</figcaption></figure><!--kg-card-end: image--><p>The purpose of having a separate Root CA, Intermediate CA, and CVPN CA is that we have delegated trust from the Root to the CVPN via the Intermediate. We can therefore easily isolate the CVPN if needed and manage CVPN certificate issuance and revocation completely separately to the Root. In future it may be desirable to create further delegated CAs from the Intermediate CA for other purposes.</p><h2 id="creating-the-certificates"><strong>Creating the Certificates</strong></h2><p>We will manually create all of the CA certificates and endpoint certificates. Whilst various tools are available for this purpose, for convenience, the first time I undertook this I used the GUI tool <a href="https://hohnstaedt.de/xca/">XCA</a> (which is available for macOS, Linux, and Windows). In this article I will show both the settings required for XCA, and also provide the equivalent <a href="https://www.openssl.org/">OpenSSL</a> commands. After the certificates are generated, we will deploy them and the Client VPN with Terraform further below.<br><br><strong>Note</strong>: For the OpenSSL commands, I tested against OpenSSL version 3.0.11.</p><h3 id="root-ca"><strong>Root CA</strong></h3><p>In XCA, you need to navigate to the <code>Certificates</code> tab, and then press the <code>New Certificate</code> button. This presents a dialog with a number of tabs that should be configured with the following information and settings, after which a new Root CA certificate will be created in the XCA database for you.</p><!--kg-card-begin: markdown--><ul>
<li><strong>Type:</strong> <code>x509 Certificate</code></li>
<li><strong>Source:</strong>
<ul>
<li><strong>Signing:</strong> Create a self signed certificate</li>
<li><strong>Signature algorithm:</strong> <code>SHA 512</code></li>
<li><strong>Template for new certificate:</strong> <code>[default] Empty template</code></li>
</ul>
</li>
<li><strong>Subject:</strong>
<ul>
<li><strong>Internal Name:</strong> <code>My Private Root CA</code></li>
<li><strong>Distinguished Name:</strong>
<ul>
<li><strong>countryName:</strong> <code>GB</code></li>
<li><strong>stateOrProvinceName:</strong> <code>My State</code></li>
<li><strong>localityName:</strong> <code>My Locality</code></li>
<li><strong>organizationName:</strong> <code>My Company</code></li>
<li><strong>organizationalUnitName:</strong> <code>DevOps</code></li>
<li><strong>commonName:</strong> <code>root.ca.cert.private.mydomain.com</code></li>
<li><strong>emailAddress:</strong> <code>devops@mydomain.com</code></li>
</ul>
</li>
<li><strong>Private Key:</strong>
<ul>
<li><strong>Name:</strong> <code>My Private Root CA</code></li>
<li><strong>Type:</strong> <code>RSA</code></li>
<li><strong>Size:</strong> <code>4096 bit</code></li>
</ul>
</li>
</ul>
</li>
<li><strong>Extensions:</strong>
<ul>
<li><strong>x509v3 Basic Constraints:</strong>
<ul>
<li><strong>Type:</strong> <code>Certification Authority</code></li>
</ul>
</li>
<li><strong>Key Identifier:</strong>
<ul>
<li><code>x509v3 Subject Key Identifier</code></li>
<li><code>x509v3 Authority Key Identifier</code></li>
</ul>
</li>
<li><strong>Validity:</strong> (10 Years)
<ul>
<li><strong>Not Before:</strong> <code>2023-09-01 00:00 GMT</code></li>
<li><strong>Not After:</strong> <code>2033-08-31 23:59 GMT</code></li>
</ul>
</li>
</ul>
</li>
<li><strong>Key usage</strong>:
<ul>
<li><strong>x509v3 Key Usage:</strong>
<ul>
<li><code>Certificate Sign</code></li>
<li><code>CRL Sign</code></li>
</ul>
</li>
</ul>
</li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2023/09/Screenshot-2023-09-24-at-18.19.37.png" class="kg-image" alt="AWS Client VPN with Manually Provisioned Certificates using Terraform"><figcaption>Root CA visible in XCA</figcaption></figure><!--kg-card-end: image--><p>If you prefer to generate the Root CA with OpenSSL then you can use the following commands:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-bash">cat &lt;&lt; EOF &gt; openssl.conf
default_ca = my_ca

[ my_ca ]
unique_subject = yes
database = openssl.index.txt
serial = openssl.serial
default_md = sha1
default_crl_days = 730

[ req_ca ]
x509_extensions = v3_ca

[ v3_ca ]
basicConstraints = CA:TRUE
subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid
keyUsage=Certificate Sign, CRL Sign

[ req_tls_server ]
x509_extensions = v3_tls_server

[ v3_tls_server ]
basicConstraints = CA:FALSE
subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid
keyUsage=Digital Signature
extendedKeyUsage=TLS Web Server Authentication

[ req_tls_client ]
x509_extensions = v3_tls_client

[ v3_tls_client ]
basicConstraints = CA:FALSE
subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid
keyUsage=Digital Signature
extendedKeyUsage=TLS Web Client Authentication
EOF

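# Generate a 4096-bit RSA private key for the Root CA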
openssl genrsa -out root.ca.private.key.pem 4096

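# Create a self-signed Root CA certificate (valid 10 years) using the [req_ca] settings above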
openssl req -new -x509 -config openssl.conf -section req_ca -sha512 \
    -subj "/C=GB/ST=My State/L=My Locality/O=My Company/OU=DevOps/CN=root.ca.cert.private.mydomain.com/emailAddress=devops@mydomain.com" \
    -days 3650 -set_serial 0x1 -key root.ca.private.key.pem -out root.ca.crt.pem</code></pre><figcaption>Generating a Root CA with OpenSSL</figcaption></figure><!--kg-card-end: code--><h3 id="intermediate-ca"><strong>Intermediate CA</strong></h3><p>To generate the subordinate Intermediate CA in XCA you need to make sure you have the <code>My Private Root CA</code> certificate selected in the <code>Certificates</code> tab, and then press the <code>New Certificate</code> button. This presents a dialog with a number of tabs, that should be configured with the following information and settings, after which a new Intermediate CA certificate will be created (as a subordinate of the Root CA) in the XCA database for you.</p><!--kg-card-begin: markdown--><ul>
<li><strong>Type:</strong> <code>x509 Certificate</code></li>
<li><strong>Source:</strong>
<ul>
<li><strong>Signing:</strong> <code>My Private Root CA</code></li>
<li><strong>Signature algorithm:</strong> <code>SHA 512</code></li>
<li><strong>Template for new certificate:</strong> <code>[default] Empty template</code></li>
</ul>
</li>
<li><strong>Subject:</strong>
<ul>
<li><strong>Internal Name:</strong> <code>My Private Intermediate CA</code></li>
<li><strong>Distinguished Name:</strong>
<ul>
<li><strong>countryName:</strong> <code>GB</code></li>
<li><strong>stateOrProvinceName:</strong> <code>My State</code></li>
<li><strong>localityName:</strong> <code>My Locality</code></li>
<li><strong>organizationName:</strong> <code>My Company</code></li>
<li><strong>organizationalUnitName:</strong> <code>DevOps</code></li>
<li><strong>commonName:</strong> <code>intermediate.ca.cert.private.mydomain.com</code></li>
<li><strong>emailAddress:</strong> <code>devops@mydomain.com</code></li>
</ul>
</li>
<li><strong>Private Key:</strong>
<ul>
<li><strong>Name:</strong> <code>My Private Intermediate CA</code></li>
<li><strong>Type:</strong> <code>RSA</code></li>
<li><strong>Size:</strong> <code>4096 bit</code></li>
</ul>
</li>
</ul>
</li>
<li><strong>Extensions:</strong>
<ul>
<li><strong>x509v3 Basic Constraints:</strong>
<ul>
<li><strong>Type:</strong> <code>Certification Authority</code></li>
</ul>
</li>
<li><strong>Key Identifier:</strong>
<ul>
<li><code>x509v3 Subject Key Identifier</code></li>
<li><code>x509v3 Authority Key Identifier</code></li>
</ul>
</li>
<li><strong>Validity:</strong> (5 Years)
<ul>
<li><strong>Not Before:</strong> <code>2023-09-01 00:00 GMT</code></li>
<li><strong>Not After:</strong> <code>2028-08-31 23:59 GMT</code></li>
</ul>
</li>
</ul>
</li>
<li><strong>Key usage</strong>:
<ul>
<li><strong>x509v3 Key Usage:</strong>
<ul>
<li><code>Certificate Sign</code></li>
<li><code>CRL Sign</code></li>
</ul>
</li>
</ul>
</li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2023/09/Screenshot-2023-09-24-at-19.27.36.png" class="kg-image" alt="AWS Client VPN with Manually Provisioned Certificates using Terraform"><figcaption>Intermediate CA visible in XCA</figcaption></figure><!--kg-card-end: image--><p>If you prefer to generate the Intermediate CA with OpenSSL then you can use the following commands:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-bash">openssl genrsa -out intermediate.ca.private.key.pem 4096

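# Create a Certificate Signing Request (CSR) for the Intermediate CA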
openssl req -new -config openssl.conf -section req_ca -sha512 \
    -subj "/C=GB/ST=My State/L=My Locality/O=My Company/OU=DevOps/CN=intermediate.ca.cert.private.mydomain.com/emailAddress=devops@mydomain.com" \
    -key intermediate.ca.private.key.pem -out intermediate.ca.csr.pem

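# Sign the Intermediate CA CSR with the Root CA (valid 5 years)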
openssl req -x509 -config openssl.conf -section req_ca -sha512 \
    -CA root.ca.crt.pem -CAkey root.ca.private.key.pem \
    -days 1825 -set_serial 0x2 -in intermediate.ca.csr.pem -key intermediate.ca.private.key.pem -out intermediate.ca.crt.pem</code></pre><figcaption>Generating an Intermediate CA with OpenSSL</figcaption></figure><!--kg-card-end: code--><h3 id="cvpn-ca"><strong>CVPN CA</strong></h3><p>To generate the subordinate CVPN CA in XCA you need to make sure you have the <code>My Private Intermediate CA</code> certificate selected in the <code>Certificates</code> tab, and then press the <code>New Certificate</code> button. This presents a dialog with a number of tabs, that should be configured with the following information and settings, after which a new CVPN CA certificate will be created (as a subordinate of the Intermediate CA) in the XCA database for you.</p><!--kg-card-begin: markdown--><ul>
<li><strong>Type:</strong> <code>x509 Certificate</code></li>
<li><strong>Source:</strong>
<ul>
<li><strong>Signing:</strong> <code>My Private Intermediate CA</code></li>
<li><strong>Signature algorithm:</strong> <code>SHA 512</code></li>
<li><strong>Template for new certificate:</strong> <code>[default] Empty template</code></li>
</ul>
</li>
<li><strong>Subject:</strong>
<ul>
<li><strong>Internal Name:</strong> <code>My Private CVPN CA</code></li>
<li><strong>Distinguished Name:</strong>
<ul>
<li><strong>countryName:</strong> <code>GB</code></li>
<li><strong>stateOrProvinceName:</strong> <code>My State</code></li>
<li><strong>localityName:</strong> <code>My Locality</code></li>
<li><strong>organizationName:</strong> <code>My Company</code></li>
<li><strong>organizationalUnitName:</strong> <code>DevOps</code></li>
<li><strong>commonName:</strong> <code>cvpn.ca.cert.private.mydomain.com</code></li>
<li><strong>emailAddress:</strong> <code>devops@mydomain.com</code></li>
</ul>
</li>
<li><strong>Private Key:</strong>
<ul>
<li><strong>Name:</strong> <code>My Private CVPN CA</code></li>
<li><strong>Type:</strong> <code>RSA</code></li>
<li><strong>Size:</strong> <code>2048 bit</code></li>
</ul>
</li>
</ul>
</li>
<li><strong>Extensions:</strong>
<ul>
<li><strong>x509v3 Basic Constraints:</strong>
<ul>
<li><strong>Type:</strong> <code>Certification Authority</code></li>
</ul>
</li>
<li><strong>Key Identifier:</strong>
<ul>
<li><code>x509v3 Subject Key Identifier</code></li>
<li><code>x509v3 Authority Key Identifier</code></li>
</ul>
</li>
<li><strong>Validity:</strong> (3 Years)
<ul>
<li><strong>Not Before:</strong> <code>2023-09-01 00:00 GMT</code></li>
<li><strong>Not After:</strong> <code>2026-08-31 23:59 GMT</code></li>
</ul>
</li>
</ul>
</li>
<li><strong>Key usage</strong>:
<ul>
<li><strong>x509v3 Key Usage:</strong>
<ul>
<li><code>Certificate Sign</code></li>
<li><code>CRL Sign</code></li>
</ul>
</li>
</ul>
</li>
</ul>
<!--kg-card-end: markdown--><p>Note that for the CVPN CA, the Private Key size is only 2048 bits and not 4096 bits (as used by the Root and Intermediate CA); this is due to a limitation with the AWS Client VPN only supporting a maximum key size of 2048 bits.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2023/09/Screenshot-2023-09-24-at-20.49.00.png" class="kg-image" alt="AWS Client VPN with Manually Provisioned Certificates using Terraform"><figcaption>CVPN CA visible in XCA</figcaption></figure><!--kg-card-end: image--><p>If you prefer to generate the CVPN CA with OpenSSL then you can use the following commands:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-bash">openssl genrsa -out cvpn.ca.private.key.pem 2048

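# Create a CSR for the CVPN CA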
openssl req -new -config openssl.conf -section req_ca -sha512 \
    -subj "/C=GB/ST=My State/L=My Locality/O=My Company/OU=DevOps/CN=cvpn.ca.cert.private.mydomain.com/emailAddress=devops@mydomain.com" \
    -key cvpn.ca.private.key.pem -out cvpn.ca.csr.pem

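# Sign the CVPN CA CSR with the Intermediate CA (valid 3 years)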
openssl req -x509 -config openssl.conf -section req_ca -sha512 \
    -CA intermediate.ca.crt.pem -CAkey intermediate.ca.private.key.pem \
    -days 1095 -set_serial 0x3 -in cvpn.ca.csr.pem -key cvpn.ca.private.key.pem -out cvpn.ca.crt.pem
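
# Optional sanity check: confirm the CVPN CA certificate chains back to the
# Root CA (assumes the files generated above are in the current directory)
openssl verify -CAfile root.ca.crt.pem -untrusted intermediate.ca.crt.pem cvpn.ca.crt.pem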

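# Build the CA chain file (Root + Intermediate) that we will upload to ACM alongside the CVPN CA certificate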
cat root.ca.crt.pem intermediate.ca.crt.pem &gt; cvpn.ca.crt.chain.pem</code></pre><figcaption>Generating a CVPN CA with OpenSSL</figcaption></figure><!--kg-card-end: code--><h3 id="cvpn-server-certificate"><strong>CVPN Server Certificate</strong></h3><p>To generate the endpoint CVPN Server certificate in XCA you need to make sure you have the <code>My Private CVPN CA</code> certificate selected in the <code>Certificates</code> tab, and then press the <code>New Certificate</code> button. This presents a dialog with a number of tabs, that should be configured with the following information and settings, after which a new CVPN Server certificate will be created (as a subordinate of the CVPN CA) in the XCA database for you.</p><!--kg-card-begin: markdown--><ul>
<li><strong>Type:</strong> <code>x509 Certificate</code></li>
<li><strong>Source:</strong>
<ul>
<li><strong>Signing:</strong> <code>My Private CVPN CA</code></li>
<li><strong>Signature algorithm:</strong> <code>SHA 512</code></li>
<li><strong>Template for new certificate:</strong> <code>[default] Empty template</code></li>
</ul>
</li>
<li><strong>Subject:</strong>
<ul>
<li><strong>Internal Name:</strong> <code>My Private CVPN - Server</code></li>
<li><strong>Distinguished Name:</strong>
<ul>
<li><strong>countryName:</strong> <code>GB</code></li>
<li><strong>stateOrProvinceName:</strong> <code>My State</code></li>
<li><strong>localityName:</strong> <code>My Locality</code></li>
<li><strong>organizationName:</strong> <code>My Company</code></li>
<li><strong>organizationalUnitName:</strong> <code>DevOps</code></li>
<li><strong>commonName:</strong> <code>cvpn-server.cvpn.cert.private.mydomain.com</code></li>
<li><strong>emailAddress:</strong> <code>devops@mydomain.com</code></li>
</ul>
</li>
<li><strong>Private Key:</strong>
<ul>
<li><strong>Name:</strong> <code>My Private CVPN - Server</code></li>
<li><strong>Type:</strong> <code>RSA</code></li>
<li><strong>Size:</strong> <code>2048 bit</code></li>
</ul>
</li>
</ul>
</li>
<li><strong>Extensions:</strong>
<ul>
<li><strong>x509v3 Basic Constraints:</strong>
<ul>
<li><strong>Type:</strong> <code>End Entity</code></li>
</ul>
</li>
<li><strong>Key Identifier:</strong>
<ul>
<li><code>x509v3 Subject Key Identifier</code></li>
<li><code>x509v3 Authority Key Identifier</code></li>
</ul>
</li>
<li><strong>Validity:</strong> (3 Years)
<ul>
<li><strong>Not Before:</strong> <code>2023-09-01 00:00 GMT</code></li>
<li><strong>Not After:</strong> <code>2026-08-31 23:59 GMT</code></li>
</ul>
</li>
</ul>
</li>
<li><strong>Key usage</strong>:
<ul>
<li><strong>x509v3 Key Usage:</strong>
<ul>
<li><code>Digital Signature</code></li>
</ul>
</li>
<li><strong>x509v3 Extended Key Usage:</strong>
<ul>
<li><code>TLS Web Server Authentication</code></li>
</ul>
</li>
</ul>
</li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2023/09/Screenshot-2023-09-24-at-21.16.43.png" class="kg-image" alt="AWS Client VPN with Manually Provisioned Certificates using Terraform"><figcaption>CVPN Server certificate visible in XCA</figcaption></figure><!--kg-card-end: image--><p>If you prefer to generate the CVPN Server certificate with OpenSSL then you can use the following commands:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-bash">openssl genrsa -out cvpn-server.private.key.pem 2048

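# Create a CSR for the CVPN Server certificate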
openssl req -new -config openssl.conf -section req_tls_server -sha512 \
    -subj "/C=GB/ST=My State/L=My Locality/O=My Company/OU=DevOps/CN=cvpn-server.cvpn.cert.private.mydomain.com/emailAddress=devops@mydomain.com" \
    -key cvpn-server.private.key.pem -out cvpn-server.csr.pem

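# Sign the server CSR with the CVPN CA (valid 3 years)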
openssl req -x509 -config openssl.conf -section req_tls_server -sha512 \
    -CA cvpn.ca.crt.pem -CAkey cvpn.ca.private.key.pem \
    -days 1095 -set_serial 0x4 -in cvpn-server.csr.pem -key cvpn-server.private.key.pem -out cvpn-server.crt.pem
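
# Optional sanity check: confirm the server certificate verifies against the
# full chain (repeating -untrusted is supported by OpenSSL 3.x)
openssl verify -CAfile root.ca.crt.pem \
    -untrusted intermediate.ca.crt.pem -untrusted cvpn.ca.crt.pem \
    cvpn-server.crt.pem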

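# Build the full chain file (Root + Intermediate + CVPN CA) for ACM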
cat root.ca.crt.pem intermediate.ca.crt.pem cvpn.ca.crt.pem &gt; cvpn-server.crt.chain.pem</code></pre><figcaption>Generating a CVPN Server certificate with OpenSSL</figcaption></figure><!--kg-card-end: code--><h3 id="cvpn-client-certificate-s-"><strong>CVPN Client Certificate(s)</strong></h3><p>To generate the endpoint CVPN Client certificate(s) that will enable a client (e.g. user) to connect to the CVPN Server, you may use the following as a template and repeat it for as many clients as you need; just replace <code>&lt;&lt;username&gt;&gt;</code> with the username of the user, etc.</p><p>In XCA you need to make sure you have the <code>My Private CVPN CA</code> certificate selected in the <code>Certificates</code> tab, and then press the <code>New Certificate</code> button. This presents a dialog with a number of tabs, that should be configured with the following information and settings, after which a new CVPN Client certificate will be created (as a subordinate of the CVPN CA) in the XCA database for you.</p><!--kg-card-begin: markdown--><ul>
<li><strong>Type:</strong> <code>x509 Certificate</code></li>
<li><strong>Source:</strong>
<ul>
<li><strong>Signing:</strong> <code>My Private CVPN CA</code></li>
<li><strong>Signature algorithm:</strong> <code>SHA 512</code></li>
<li><strong>Template for new certificate:</strong> <code>[default] Empty template</code></li>
</ul>
</li>
<li><strong>Subject:</strong>
<ul>
<li><strong>Internal Name:</strong> <code>My Private CVPN - User - &lt;&lt;username&gt;&gt;</code></li>
<li><strong>Distinguished Name:</strong>
<ul>
<li><strong>countryName:</strong> <code>&lt;&lt;countryCode&gt;&gt;</code></li>
<li><strong>stateOrProvinceName:</strong> <code>&lt;&lt;county&gt;&gt;</code></li>
<li><strong>localityName:</strong> <code>&lt;&lt;city&gt;&gt;</code></li>
<li><strong>organizationName:</strong> <code>&lt;&lt;organisation&gt;&gt;</code></li>
<li><strong>organizationalUnitName:</strong> <code>&lt;&lt;department&gt;&gt;</code></li>
<li><strong>commonName:</strong> <code>&lt;&lt;username&gt;&gt;.cvpn.cert.private.mydomain.com</code></li>
<li><strong>emailAddress:</strong> <code>&lt;&lt;email&gt;&gt;</code></li>
</ul>
</li>
<li><strong>Private Key:</strong>
<ul>
<li><strong>Name:</strong> <code>My Private CVPN - User - &lt;&lt;username&gt;&gt;</code></li>
<li><strong>Type:</strong> <code>RSA</code></li>
<li><strong>Size:</strong> <code>2048 bit</code></li>
</ul>
</li>
</ul>
</li>
<li><strong>Extensions:</strong>
<ul>
<li><strong>x509v3 Basic Constraints:</strong>
<ul>
<li><strong>Type:</strong> <code>End Entity</code></li>
</ul>
</li>
<li><strong>Key Identifier:</strong>
<ul>
<li><code>x509v3 Subject Key Identifier</code></li>
<li><code>x509v3 Authority Key Identifier</code></li>
</ul>
</li>
<li><strong>Validity:</strong> (1 Year)
<ul>
<li><strong>Not Before:</strong> <code>2023-09-01 00:00 GMT</code></li>
<li><strong>Not After:</strong> <code>2024-08-31 23:59 GMT</code></li>
</ul>
</li>
</ul>
</li>
<li><strong>Key usage</strong>:
<ul>
<li><strong>x509v3 Key Usage:</strong>
<ul>
<li><code>Digital Signature</code></li>
</ul>
</li>
<li><strong>x509v3 Extended Key Usage:</strong>
<ul>
<li><code>TLS Web Client Authentication</code></li>
</ul>
</li>
</ul>
</li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2023/09/Screenshot-2023-09-24-at-21.25.37.png" class="kg-image" alt="AWS Client VPN with Manually Provisioned Certificates using Terraform"><figcaption>A CVPN client certificate for the user <code>aretter</code> visible in XCA</figcaption></figure><!--kg-card-end: image--><p>If you prefer to generate the CVPN Client certificate for the user with OpenSSL then you can use the following commands:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-bash">export CVPN_USERNAME=aretter

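# Generate a 2048-bit RSA private key for this user's client certificate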
openssl genrsa -out $CVPN_USERNAME.private.key.pem 2048

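# Create a CSR for the user's client certificate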
openssl req -new -config openssl.conf -section req_tls_client -sha512 \
    -subj "/C=GB/ST=My State/L=My Locality/O=My Company/OU=DevOps/CN=$CVPN_USERNAME.cvpn.cert.private.mydomain.com/emailAddress=$CVPN_USERNAME@mydomain.com" \
    -key $CVPN_USERNAME.private.key.pem -out $CVPN_USERNAME.csr.pem

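# Sign the client CSR with the CVPN CA (valid 1 year); note that each issued
# certificate should be given a unique serial number via -set_serial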
openssl req -x509 -config openssl.conf -section req_tls_client -sha512 \
    -CA cvpn.ca.crt.pem -CAkey cvpn.ca.private.key.pem \
    -days 365 -set_serial 0x5 -in $CVPN_USERNAME.csr.pem -key $CVPN_USERNAME.private.key.pem -out $CVPN_USERNAME.crt.pem</code></pre><figcaption>Generating a CVPN Client certificate with OpenSSL</figcaption></figure><!--kg-card-end: code--><h2 id="deploying-the-certificates-in-terraform">Deploying the Certificates in Terraform</h2><p>Now that we have generated CA certificates and endpoint certificates, we can deploy them to ACM (AWS Certificate Manager) by using Terraform.</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-hcl"># Deploy the Client VPN CA
data "local_sensitive_file" "my_private_cvpn_ca_private_key" {
  filename = "cvpn.ca.private.key.pem"
}

data "tls_certificate" "my_private_cvpn_ca_certificate" {
  content = file("cvpn.ca.crt.pem")
}

data "tls_certificate" "my_private_cvpn_ca_chain_certificate" {
  content = file("cvpn.ca.crt.chain.pem")
}

resource "aws_acm_certificate" "my_private_cvpn_ca_certificate" {
  private_key       = data.local_sensitive_file.my_private_cvpn_ca_private_key.content
  certificate_body  = data.tls_certificate.my_private_cvpn_ca_certificate.content
  certificate_chain = data.tls_certificate.my_private_cvpn_ca_chain_certificate.content

  lifecycle {
    create_before_destroy = true
  }

  tags = {
    Environment = "cvpn"
  }
}


# Deploy the Client VPN Server certificate
data "local_sensitive_file" "my_private_cvpn_server_private_key" {
  filename = "cvpn-server.private.key.pem"
}

data "tls_certificate" "my_private_cvpn_server_certificate" {
  content = file("cvpn-server.crt.pem")
}

data "tls_certificate" "my_private_cvpn_server_chain_certificate" {
  content = file("cvpn-server.crt.chain.pem")
}

resource "aws_acm_certificate" "my_private_cvpn_server_certificate" {
  private_key       = data.local_sensitive_file.my_private_cvpn_server_private_key.content
  certificate_body  = data.tls_certificate.my_private_cvpn_server_certificate.content
  certificate_chain = data.tls_certificate.my_private_cvpn_server_chain_certificate.content

  lifecycle {
    create_before_destroy = true
  }

  tags = {
    Environment = "cvpn"
  }
}</code></pre><figcaption>Terraform code to deploy the Client VPN Certificates to AWS Certificate Manager</figcaption></figure><!--kg-card-end: code--><p>Note that as the client (e.g. user) certificates are signed by the CVPN CA (i.e. the same CA as used by the Client VPN and the Client VPN Server certificate), they do not need to be uploaded into AWS ACM.</p><h2 id="creating-the-client-vpn-in-terraform">Creating the Client VPN in Terraform</h2><p>Now that we have the CA certificates and endpoint certificates in place, we can create and configure the AWS Client VPN by using Terraform.</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource "aws_ec2_client_vpn_endpoint" "cvpn" {
  description = "Client VPN"

  vpc_id = &lt;YOUR VPC ID&gt;

  client_cidr_block = "192.168.68.0/22"
  split_tunnel      = true

  server_certificate_arn = aws_acm_certificate.my_private_cvpn_server_certificate.arn

  authentication_options {
    type                       = "certificate-authentication"
    root_certificate_chain_arn = aws_acm_certificate.my_private_cvpn_ca_certificate.arn
  }

  self_service_portal = "disabled"

  security_group_ids = [
    module.cvpn_access_security_group.security_group_id
  ]

  tags = {
    Name        = "cvpn_endpoint"
    Environment = "cvpn"
  }
}

resource "aws_ec2_client_vpn_network_association" "cvpn" {
  count = 1

  client_vpn_endpoint_id = aws_ec2_client_vpn_endpoint.cvpn.id
  subnet_id              = &lt;ID OF YOUR IPv4 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;

  lifecycle {
    // We ignore changes to subnet_id because on every change
    // terraform screws up most of the cvpn associations,
    // see: https://github.com/hashicorp/terraform-provider-aws/issues/14717
    ignore_changes = [subnet_id]
  }
}

resource "aws_ec2_client_vpn_authorization_rule" "cvpn_auth" {
  client_vpn_endpoint_id = aws_ec2_client_vpn_endpoint.cvpn.id
  target_network_cidr    = &lt;YOUR IPv4 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;
  authorize_all_groups   = true
}


module "cvpn_access_security_group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "4.17.2"

  name        = "cvpn_access_security_group"
  description = "Security group for CVPN Access"

  vpc_id = &lt;YOUR VPC ID&gt;

  computed_ingress_with_cidr_blocks = [
    {
      description = "VPN TLS"
      from_port   = 443
      to_port     = 443
      protocol    = "udp"
      cidr_blocks = &lt;YOUR IPv4 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;
    }
  ]
  number_of_computed_ingress_with_cidr_blocks = 1

  computed_ingress_with_ipv6_cidr_blocks = [
    {
      description      = "VPN TLS (IPv6)"
      from_port        = 443
      to_port          = 443
      protocol         = "udp"
      ipv6_cidr_blocks = &lt;YOUR IPv6 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;
    }
  ]
  number_of_computed_ingress_with_ipv6_cidr_blocks = 1

  egress_with_cidr_blocks = [
    {
      description = "All"
      from_port   = -1
      to_port     = -1
      protocol    = -1
      cidr_blocks = "0.0.0.0/0"
    }
  ]

  egress_with_ipv6_cidr_blocks = [
    {
      description = "All (IPv6)"
      from_port   = -1
      to_port     = -1
      protocol    = -1
      ipv6_cidr_blocks = "2001:db8::/64"
    }
  ]

  tags = {
    Name        = "sg_cvpn"
    Type        = "security_group"
    Environment = "cvpn"
  }
}</code></pre><figcaption>Terraform code for a Client VPN</figcaption></figure><!--kg-card-end: code--><p>The following placeholders in the code above need to be filled out with your own Terraform AWS configuration values:</p><ul><li><code>&lt;YOUR VPC ID&gt;</code> - the AWS ID of your VPC in which you wish to deploy the Client VPN.</li><li><code>&lt;ID OF YOUR IPv4 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;</code> - the AWS ID of the IPv4 Subnet within your VPC where you wish VPN traffic to arrive.</li><li><code>&lt;YOUR IPv4 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;</code> - the IPv4 Subnet CIDR address within your VPC where you wish VPN traffic to arrive.</li><li><code>&lt;YOUR IPv6 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;</code> - the IPv6 Subnet CIDR address within your VPC where you wish VPN traffic to arrive.</li></ul><h2 id="connecting-to-the-client-vpn"><strong>Connecting to the Client VPN</strong></h2><p>Once you have the above setup, you can download a skeleton OpenVPN configuration file from the AWS Dashboard.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2023/09/Screenshot-2023-09-11-at-13.34.28-1.png" class="kg-image" alt="AWS Client VPN with Manually Provisioned Certificates using Terraform"><figcaption>Download OpenVPN Configuration file from AWS</figcaption></figure><!--kg-card-end: image--><p>The OpenVPN configuration file provided by AWS will be incomplete, and will look something like this:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-ovpn">client
dev tun
proto udp
remote cvpn-endpoint-006d2181ae8616b54.prod.clientvpn.eu-west-2.amazonaws.com 443
remote-random-hostname
resolv-retry infinite
nobind
remote-cert-tls server
cipher AES-256-GCM
verb 3
&lt;ca&gt;
-----BEGIN CERTIFICATE-----
SECRET STUFF HERE ;-)
-----END CERTIFICATE-----

&lt;/ca&gt;


reneg-sec 0

verify-x509-name cvpn-server.cvpn.cert.private.mydomain.com name</code></pre><figcaption>AWS provided OpenVPN configuration file</figcaption></figure><!--kg-card-end: code--><p>You will need to modify it to add:</p><ol><li>A <code>&lt;cert&gt;</code> section containing the certificate for one of the CVPN Client certificate(s) that you generated above with XCA or OpenSSL.</li><li>A <code>&lt;key&gt;</code> section containing the private key for the CVPN Client certificate that you are using.</li></ol><p>You should then have a complete OpenVPN configuration file that looks similar to this:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-ovpn">client
dev tun
proto udp
remote cvpn-endpoint-006d2181ae8616b54.prod.clientvpn.eu-west-2.amazonaws.com 443
remote-random-hostname
resolv-retry infinite
nobind
remote-cert-tls server
cipher AES-256-GCM
verb 3
&lt;ca&gt;
-----BEGIN CERTIFICATE-----
SECRET STUFF HERE ;-)
-----END CERTIFICATE-----

&lt;/ca&gt;
&lt;cert&gt;
-----BEGIN CERTIFICATE-----
SECRET STUFF HERE ;-)
-----END CERTIFICATE-----

&lt;/cert&gt;
&lt;key&gt;
-----BEGIN RSA PRIVATE KEY-----
SECRET STUFF HERE ;-)
-----END RSA PRIVATE KEY-----

&lt;/key&gt;

reneg-sec 0

verify-x509-name cvpn-server.cvpn.cert.private.mydomain.com name</code></pre><figcaption>Complete OpenVPN configuration file</figcaption></figure><!--kg-card-end: code--><p>You may now use your favourite OpenVPN client tool, with the above OpenVPN configuration file to connect to your new Client VPN; I personally use <a href="https://tunnelblick.net/">Tunnelblick</a>. Enjoy!</p>]]></content:encoded></item><item><title><![CDATA[AWS Client VPN with Managed Certificates using Terraform]]></title><description><![CDATA[<p>This is a brief explanation of how to use Terraform to setup an AWS CVPN (Client VPN) where the certificates (for VPN authentication) are entirely generated and managed by AWS.</p><p>The advantage of using a managed certificates approach is that you need not generate or directly manage any certificates or</p>]]></description><link>https://blog.adamretter.org.uk/terraform-aws-client-vpn-with-managed-certificates/</link><guid isPermaLink="false">64fede79312ce60ea93d6eb0</guid><category><![CDATA[Terraform]]></category><category><![CDATA[AWS]]></category><category><![CDATA[VPN]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Mon, 11 Sep 2023 12:58:48 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1605003823507-22247a88bf4d?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDg0fHxjZXJ0aWZpY2F0ZXxlbnwwfHx8fDE2OTQ0MzcwNzd8MA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=1080" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1605003823507-22247a88bf4d?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=M3wxMTc3M3wwfDF8c2VhcmNofDg0fHxjZXJ0aWZpY2F0ZXxlbnwwfHx8fDE2OTQ0MzcwNzd8MA&ixlib=rb-4.0.3&q=80&w=1080" alt="AWS Client VPN with Managed Certificates using Terraform"><p>This is a brief explanation of how to use Terraform to setup an AWS CVPN (Client VPN) where the certificates (for VPN authentication) are entirely generated and managed by AWS.</p><p>The advantage of using a managed certificates approach is that you need not generate or directly manage any certificates or private keys manually. To avoid such manual processes, this approach makes use of AWS Private CA (Certificate Authority) and AWS Certificate Manager. The main disadvantage of this approach is that AWS Private CA is an expensive proposition, at the time of writing, it is priced at $400 / Month / CA Certificate; we will use two CA Certificates (although you could just use one if you wish), at a total cost of $800 / Month.</p><p>If you want to avoid the cost of AWS Private CA, you can take an alternative approach where you manually provision and manage the certificates. You can find details of how to achieve that in my blog article - <a href="https://blog.adamretter.org.uk/terraform-aws-client-vpn-with-openssl-certificates/">AWS Client VPN with Manually Provisioned Certificates using Terraform</a><strong>.</strong></p><h2 id="chain-of-trust">Chain of Trust</h2><p>We will create a number of certificates that will form a chain of trust. Each subordinate certificate is signed by its parent certificate. Our chain of trust will look like:</p><ol><li>Root CA Certificate.</li><li>CVPN Server Certificate - this endpoint certificate is for the CVPN server itself in AWS. This certificate is signed by the Root CA Certificate.</li><li>CVPN Client CA Certificate - this CA certificate is used for issuing and signing certificates that are used by clients (e.g. users) to connect to the CVPN Server. 
This certificate is also signed by the Root CA Certificate.</li><li>CVPN Root Client Certificate - this endpoint certificate is for configuring the CVPN server with a client certificate. This certificate is signed by the CVPN Client CA Certificate. This certificate is needed due to a peculiarity with how AWS Client VPN is configured in AWS; their CVPN Server configuration requires a client certificate so that it can access the chain-of-trust for the client certificates (e.g. Client CA Certificate and Root CA certificate)! No client (e.g. user) will actually ever use this certificate directly.</li><li>CVPN Client Certificate - we create one of these endpoint certificates for each client (e.g. user) that wishes to connect to the CVPN Server. These certificates will also be signed by the CVPN Client CA Certificate.</li></ol><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2023/09/AWS-Client-VPN-Managed-Certificates--1-.png" class="kg-image" alt="AWS Client VPN with Managed Certificates using Terraform"><figcaption>Chain of Trust diagram</figcaption></figure><!--kg-card-end: image--><p>The purpose of having a separate Root CA and CVPN Client CA is that we have delegated trust from the Root to the CVPN. We can therefore easily isolate the CVPN if needed and manage CVPN certificate issuance and revocation completely separately to the Root. In future it may be desirable to create further delegated CAs from the Root CA for other purposes.</p><h2 id="creating-the-certificates-in-terraform">Creating the Certificates in Terraform</h2><p>We can create all of the CA certificates and endpoint certificates, and store them in AWS Private CA and AWS Certificate Manager by using Terraform.</p><h3 id="terraform-root-ca">Terraform Root CA</h3><p>The Terraform code below will create a Root CA and certificate:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource "aws_acmpca_certificate_authority" "root_ca" {
  type = "ROOT"

  certificate_authority_configuration {
    key_algorithm     = "RSA_4096"
    signing_algorithm = "SHA512WITHRSA"

    subject {
      common_name         = "root.ca.cert.private.mydomain.com"
      organizational_unit = "DevOps"
      organization        = "My Company"
      locality            = "My City"
      state               = "My State"
      country             = "GB"
    }
  }

  tags = {
    Name        = "certificate_authority"
    Environment = "cvpn"
  }
}

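# Resolves the current AWS partition (e.g. "aws"); referenced by the ACM PCA
# certificate template ARNs below
data "aws_partition" "current" {}
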
resource "aws_acmpca_certificate" "root_ca_certificate" {
  certificate_authority_arn   = aws_acmpca_certificate_authority.root_ca.arn
  certificate_signing_request = aws_acmpca_certificate_authority.root_ca.certificate_signing_request
  signing_algorithm           = "SHA512WITHRSA"

  template_arn = "arn:${data.aws_partition.current.partition}:acm-pca:::template/RootCACertificate/V1"

  validity {
    type  = "YEARS"
    value = 5
  }
}

resource "aws_acmpca_certificate_authority_certificate" "root_ca_certificate_association" {
  certificate_authority_arn = aws_acmpca_certificate_authority.root_ca.arn

  certificate       = aws_acmpca_certificate.root_ca_certificate.certificate
  certificate_chain = aws_acmpca_certificate.root_ca_certificate.certificate_chain
}</code></pre><figcaption>Terraform code for a Root CA with AWS Private CA</figcaption></figure><!--kg-card-end: code--><p>Note that the <code>type</code> of the <code>aws_acmpca_certificate_authority</code> resource is set to <code>ROOT</code>.</p><h3 id="terraform-cvpn-server-certificate">Terraform CVPN Server Certificate</h3><p>The Terraform code below will create a certificate for use as the CVPN Server certificate:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource "tls_private_key" "cvpn_server_certificate_private_key" {
  algorithm = "RSA"
  rsa_bits  = "2048"
}

resource "tls_cert_request" "cvpn_server_certificate_signing_request" {
  private_key_pem = tls_private_key.cvpn_server_certificate_private_key.private_key_pem

  subject {
    common_name         = "cvpn-server.cvpn.cert.private.mydomain.com"
    organizational_unit = "DevOps"
    organization        = "My Company"
    street_address      = ["My Street"]
    locality            = "My City"
    state               = "My State"
    country             = "GB"
    postal_code         = "XX1 2XX"
  }
}

resource "aws_acmpca_certificate" "cvpn_server_certificate" {
  certificate_authority_arn   = aws_acmpca_certificate_authority.root_ca.arn
  certificate_signing_request = tls_cert_request.cvpn_server_certificate_signing_request.cert_request_pem
  signing_algorithm           = "SHA512WITHRSA"
  validity {
    type  = "YEARS"
    value = 3
  }
}

resource "aws_acm_certificate" "cvpn_server_certificate" {
  private_key       = tls_private_key.cvpn_server_certificate_private_key.private_key_pem
  certificate_body  = aws_acmpca_certificate.cvpn_server_certificate.certificate
  certificate_chain = aws_acmpca_certificate.cvpn_server_certificate.certificate_chain

  lifecycle {
    create_before_destroy = true
  }

  tags = {
    Name        = "certificate"
    Scope       = "cvpn_server"
    Environment = "cvpn"
  }
}</code></pre><figcaption>Terraform code for a CVPN Server certificate</figcaption></figure><!--kg-card-end: code--><p>Note that the <code>certificate_authority_arn</code> of the <code>aws_acmpca_certificate</code> resource is set to the ARN (Amazon Resource Name) of our previously created Root CA, i.e.: <code>aws_acmpca_certificate_authority.root_ca.arn</code>.</p><h3 id="terraform-cvpn-client-ca">Terraform CVPN Client CA</h3><p>The Terraform code below will create a CVPN Client CA and certificate delegated from the Root CA:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource "aws_acmpca_certificate_authority" "cvpn_client_ca" {
  type = "SUBORDINATE"

  certificate_authority_configuration {
    key_algorithm     = "RSA_4096"
    signing_algorithm = "SHA512WITHRSA"

    subject {
      common_name         = "cvpn-client.ca.cert.private.mydomain.com"
      organizational_unit = "DevOps"
      organization        = "My Company"
      locality            = "My City"
      state               = "My State"
      country             = "GB"
    }
  }

  tags = {
    Name        = "certificate_authority"
    Environment = "cvpn"
  }
}

resource "aws_acmpca_certificate" "cvpn_client_ca_certificate" {
  certificate_authority_arn   = aws_acmpca_certificate_authority.root_ca.arn
  certificate_signing_request = aws_acmpca_certificate_authority.cvpn_client_ca.certificate_signing_request
  signing_algorithm           = "SHA512WITHRSA"

  template_arn = "arn:${data.aws_partition.current.partition}:acm-pca:::template/SubordinateCACertificate_PathLen0/V1"

  validity {
    type  = "YEARS"
    value = 3
  }
}

resource "aws_acmpca_certificate_authority_certificate" "cvpn_client_ca_certificate_association" {
  certificate_authority_arn = aws_acmpca_certificate_authority.cvpn_client_ca.arn

  certificate       = aws_acmpca_certificate.cvpn_client_ca_certificate.certificate
  certificate_chain = aws_acmpca_certificate.cvpn_client_ca_certificate.certificate_chain
}</code></pre><figcaption>Terraform code for a subordinate CVPN Client CA with AWS Private CA</figcaption></figure><!--kg-card-end: code--><p>Note that the <code>type</code> of the <code>aws_acmpca_certificate_authority</code> resource is set to <code>SUBORDINATE</code>, and that the <code>certificate_authority_arn</code> of the <code>aws_acmpca_certificate</code> resource is set to the ARN (Amazon Resource Name) of our previously created Root CA, i.e.: <code>aws_acmpca_certificate_authority.root_ca.arn</code>.</p><h3 id="terraform-cvpn-root-client-certificate">Terraform CVPN Root Client Certificate</h3><p>The Terraform code below will create a CVPN Root Client Certificate. It will only be used as part of the CVPN Server configuration in AWS.</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource "tls_private_key" "root_user_cvpn_client_certificate_private_key" {
  algorithm = "RSA"
  rsa_bits  = "2048"
}

resource "tls_cert_request" "root_user_cvpn_client_certificate_signing_request" {
  private_key_pem = tls_private_key.root_user_cvpn_client_certificate_private_key.private_key_pem

  subject {
    common_name         = "root-user.cvpn.cert.private.mydomain.com"
    organizational_unit = "DevOps"
    organization        = "My Company"
    street_address      = ["My Street"]
    locality            = "My City"
    state               = "My State"
    country             = "GB"
    postal_code         = "XX1 2XX"
  }
}

resource "aws_acmpca_certificate" "root_user_cvpn_client_certificate" {
  certificate_authority_arn   = aws_acmpca_certificate_authority.cvpn_client_ca.arn
  certificate_signing_request = tls_cert_request.root_user_cvpn_client_certificate_signing_request.cert_request_pem
  signing_algorithm           = "SHA512WITHRSA"
  validity {
    type  = "YEARS"
    value = 1
  }
}

resource "aws_acm_certificate" "root_user_cvpn_client_certificate" {
  private_key       = tls_private_key.root_user_cvpn_client_certificate_private_key.private_key_pem
  certificate_body  = aws_acmpca_certificate.root_user_cvpn_client_certificate.certificate
  certificate_chain = aws_acmpca_certificate.root_user_cvpn_client_certificate.certificate_chain

  lifecycle {
    create_before_destroy = true
  }

  tags = {
    Name        = "certificate"
    Scope       = "cvpn_server"
    Environment = "cvpn"
  }
}</code></pre><figcaption>Terraform code for a CVPN Root Client Certificate</figcaption></figure><!--kg-card-end: code--><p>Note that the <code>certificate_authority_arn</code> of the <code>aws_acmpca_certificate</code> resource is set to the ARN (Amazon Resource Name) of our previously created CVPN Client CA, i.e.: <code>aws_acmpca_certificate_authority.cvpn_client_ca.arn</code>.</p><h3 id="terraform-cvpn-client-certificate-s-">Terraform CVPN Client Certificate(s)</h3><p>The Terraform code below will create a CVPN Client Certificate that will enable a client (e.g. user) to connect to the CVPN Server. You may use this as a template and repeat it for as many clients as you need.</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource "tls_private_key" "user_1_cvpn_client_certificate_private_key" {
  algorithm = "RSA"
  rsa_bits  = "2048"
}

resource "tls_cert_request" "user_1_cvpn_client_certificate_signing_request" {
  private_key_pem = tls_private_key.user_1_cvpn_client_certificate_private_key.private_key_pem

  subject {
    common_name         = "user-1.cvpn.cert.private.mydomain.com"
    organizational_unit = "DevOps"
    organization        = "My Company"
    street_address      = ["My Street"]
    locality            = "My City"
    state               = "My State"
    country             = "GB"
    postal_code         = "XX1 2XX"
  }
}

resource "aws_acmpca_certificate" "user_1_cvpn_client_certificate" {
  certificate_authority_arn   = aws_acmpca_certificate_authority.cvpn_client_ca.arn
  certificate_signing_request = tls_cert_request.user_1_cvpn_client_certificate_signing_request.cert_request_pem
  signing_algorithm           = "SHA512WITHRSA"
  validity {
    type  = "YEARS"
    value = 1
  }
}

resource "aws_acm_certificate" "user_1_cvpn_client_certificate" {
  private_key       = tls_private_key.user_1_cvpn_client_certificate_private_key.private_key_pem
  certificate_body  = aws_acmpca_certificate.user_1_cvpn_client_certificate.certificate
  certificate_chain = aws_acmpca_certificate.user_1_cvpn_client_certificate.certificate_chain

  lifecycle {
    create_before_destroy = true
  }

  tags = {
    Name        = "certificate"
    Scope       = "cvpn_client"
    Environment = "cvpn"
  }
}</code></pre><figcaption>Terraform code to produce a CVPN Client (e.g. User) certificate</figcaption></figure><!--kg-card-end: code--><p>Note that the <code>certificate_authority_arn</code> of the <code>aws_acmpca_certificate</code> resource is set to the ARN (Amazon Resource Name) of our previously created CVPN Client CA, i.e.: <code>aws_acmpca_certificate_authority.cvpn_client_ca.arn</code>.</p><h2 id="creating-the-client-vpn-in-terraform">Creating the Client VPN in Terraform</h2><p>Now that we have the CA certificates and endpoint certificates in place, we can create and configure the AWS Client VPN by using Terraform.</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource "aws_ec2_client_vpn_endpoint" "cvpn" {
  description = "Client VPN"

  vpc_id = &lt;YOUR VPC ID&gt;

  client_cidr_block = "192.168.68.0/22"
  split_tunnel      = true

  server_certificate_arn = aws_acm_certificate.cvpn_server_certificate.arn

  authentication_options {
    type                       = "certificate-authentication"
    root_certificate_chain_arn = aws_acm_certificate.root_user_cvpn_client_certificate.arn
  }

  self_service_portal = "disabled"

  security_group_ids = [
    module.cvpn_access_security_group.security_group_id
  ]

  tags = {
    Name        = "cvpn_endpoint"
    Environment = "cvpn"
  }
}

resource "aws_ec2_client_vpn_network_association" "cvpn" {
  count = 1

  client_vpn_endpoint_id = aws_ec2_client_vpn_endpoint.cvpn.id
  subnet_id              = &lt;ID OF YOUR IPv4 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;

  lifecycle {
    // We ignore changes to subnet_id because on every change
    // terraform screws up most of the cvpn associations,
    // see: https://github.com/hashicorp/terraform-provider-aws/issues/14717
    ignore_changes = [subnet_id]
  }
}

resource "aws_ec2_client_vpn_authorization_rule" "cvpn_auth" {
  client_vpn_endpoint_id = aws_ec2_client_vpn_endpoint.cvpn.id
  target_network_cidr    = &lt;YOUR IPv4 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;
  authorize_all_groups   = true
}


module "cvpn_access_security_group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "4.17.2"

  name        = "cvpn_access_security_group"
  description = "Security group for CVPN Access"

  vpc_id = &lt;YOUR VPC ID&gt;

  computed_ingress_with_cidr_blocks = [
    {
      description = "VPN TLS"
      from_port   = 443
      to_port     = 443
      protocol    = "udp"
      cidr_blocks = &lt;YOUR IPv4 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;
    }
  ]
  number_of_computed_ingress_with_cidr_blocks = 1

  computed_ingress_with_ipv6_cidr_blocks = [
    {
      description      = "VPN TLS (IPv6)"
      from_port        = 443
      to_port          = 443
      protocol         = "udp"
      ipv6_cidr_blocks = &lt;YOUR IPv6 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;
    }
  ]
  number_of_computed_ingress_with_ipv6_cidr_blocks = 1

  egress_with_cidr_blocks = [
    {
      description = "All"
      from_port   = -1
      to_port     = -1
      protocol    = -1
      cidr_blocks = "0.0.0.0/0"
    }
  ]

  egress_with_ipv6_cidr_blocks = [
    {
      description = "All (IPv6)"
      from_port   = -1
      to_port     = -1
      protocol    = -1
      ipv6_cidr_blocks = "2001:db8::/64"
    }
  ]

  tags = {
    Name        = "sg_cvpn"
    Type        = "security_group"
    Environment = "cvpn"
  }
}</code></pre><figcaption>Terraform code for a Client VPN</figcaption></figure><!--kg-card-end: code--><p>The following placeholders in the code above need to be filled out with your own Terraform AWS configuration values:</p><ul><li><code>&lt;YOUR VPC ID&gt;</code> - the AWS ID of your VPC in which you wish to deploy the Client VPN.</li><li><code>&lt;ID OF YOUR IPv4 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;</code> - the AWS ID of the IPv4 Subnet within your VPC where you wish VPN traffic to arrive.</li><li><code>&lt;YOUR IPv4 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;</code> - the IPv4 Subnet CIDR address within your VPC where you wish VPN traffic to arrive.</li><li><code>&lt;YOUR IPv6 SUBNET WHERE THE CLIENT VPN WILL TERMINATE&gt;</code> - the IPv6 Subnet CIDR address within your VPC where you wish VPN traffic to arrive.</li></ul><h2 id="connecting-to-the-client-vpn">Connecting to the Client VPN</h2><p>Once you have the above setup, you can download a skeleton OpenVPN configuration file from the AWS Dashboard.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2023/09/Screenshot-2023-09-11-at-13.34.28.png" class="kg-image" alt="AWS Client VPN with Managed Certificates using Terraform"><figcaption>Download OpenVPN Configuration file from AWS</figcaption></figure><!--kg-card-end: image--><p>The OpenVPN configuration file provided by AWS will be incomplete, and will look something like this:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-ovpn">client
dev tun
proto udp
remote cvpn-endpoint-006d2181ae8616b54.prod.clientvpn.eu-west-2.amazonaws.com 443
remote-random-hostname
resolv-retry infinite
nobind
remote-cert-tls server
cipher AES-256-GCM
verb 3
&lt;ca&gt;
-----BEGIN CERTIFICATE-----
SECRET STUFF HERE ;-)
-----END CERTIFICATE-----

&lt;/ca&gt;


reneg-sec 0

verify-x509-name cvpn-server.cvpn.cert.private.mydomain.com name</code></pre><figcaption>AWS provided OpenVPN configuration file</figcaption></figure><!--kg-card-end: code--><p>You will need to modify it to add:</p><ol><li>A <code>&lt;cert&gt;</code> section containing the certificate for one of the CVPN Client Certificate(s) that you generated above.</li><li>A <code>&lt;key&gt;</code> section containing the private key for the CVPN Client Certificate that you are using.</li></ol><p>Note that the values needed for the <code>&lt;cert&gt;</code> and <code>&lt;key&gt;</code> can be found in your <code>terraform.tfstate</code> file. You should then have a complete OpenVPN configuration file that looks similar to this:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-ovpn">client
dev tun
proto udp
remote cvpn-endpoint-006d2181ae8616b54.prod.clientvpn.eu-west-2.amazonaws.com 443
remote-random-hostname
resolv-retry infinite
nobind
remote-cert-tls server
cipher AES-256-GCM
verb 3
&lt;ca&gt;
-----BEGIN CERTIFICATE-----
SECRET STUFF HERE ;-)
-----END CERTIFICATE-----

&lt;/ca&gt;
&lt;cert&gt;
-----BEGIN CERTIFICATE-----
SECRET STUFF HERE ;-)
-----END CERTIFICATE-----

&lt;/cert&gt;
&lt;key&gt;
-----BEGIN RSA PRIVATE KEY-----
SECRET STUFF HERE ;-)
-----END RSA PRIVATE KEY-----

&lt;/key&gt;

reneg-sec 0

verify-x509-name cvpn-server.cvpn.cert.private.mydomain.com name</code></pre><figcaption>Complete OpenVPN configuration file</figcaption></figure><!--kg-card-end: code--><p>You may now use your favourite OpenVPN client tool with the above OpenVPN configuration file to connect to your new Client VPN; I personally use <a href="https://tunnelblick.net/">Tunnelblick</a>. Enjoy!</p>]]></content:encoded></item><item><title><![CDATA[Processing Historical Dates]]></title><description><![CDATA[In Project Omega at The National Archives, processing of Historical Dates is complicated if the context of the dates is unknown. Software Engineers must consider Calendars, Time-zones, and Geography. This is further compounded when tools (e.g. Java) take different/unexpected approaches to dates.]]></description><link>https://blog.adamretter.org.uk/processing-historical-dates/</link><guid isPermaLink="false">60461bff55477602318d66ae</guid><category><![CDATA[Java]]></category><category><![CDATA[RDF]]></category><category><![CDATA[dates]]></category><category><![CDATA[calendars]]></category><category><![CDATA[catalogue]]></category><category><![CDATA[The National Archives]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Fri, 04 Jun 2021 10:41:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1582757557019-d878177b9e1b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDF8fG9ycmVyeSx8ZW58MHx8fHwxNjE1MjQzMDM2&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1582757557019-d878177b9e1b?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwxMTc3M3wwfDF8c2VhcmNofDF8fG9ycmVyeSx8ZW58MHx8fHwxNjE1MjQzMDM2&ixlib=rb-1.2.1&q=80&w=1080" alt="Processing Historical Dates"><p>Phase 2 of <a href="https://www.nationalarchives.gov.uk/about/our-role/plans-policies-performance-and-projects/our-plans/our-digital-cataloguing-practices/project-omega/">Project Omega</a> at TNA (<a href="https://www.nationalarchives.gov.uk">The National Archives</a>) commenced at the end of January 2021, and our first goal was to perform a large export of their Catalogue Records from their previous system, ILDB (Microsoft SQL Server), into our new pan-archival RDF data model. I <a href="https://blog.adamretter.org.uk/rdf-plugins-for-pentaho-kettle/">previously discussed</a> the tooling that we were building to enable such ETL (Extract Transform Load) processes.</p><p>Whilst improving the ETL pipeline I experienced some interesting problems when trying to parse and process what I believed to be already computed dates from ILDB. These values that looked like dates suitable for computation were expressed as serial numeric values, for example in that scheme the 17th March 2021 would be expressed as <code>20210317</code>.<br><br>I later found out that even before the year 2000, The National Archives had already encountered this problem. Difficulties that arose previously from attempting to store historical dates as computable dates gave rise to the decision at that time to store them as text and to convert regnal years (e.g.
3 Eliz I) into serial numeric values rather than computable date values.</p><p>The existing data in the ILDB system should adhere to the TNA-CS13 (The National Archives - Cataloguing Standards, June 2013) specification, which itself is an extension of <a href="https://www.ica.org/en/isadg-general-international-standard-archival-description-second-edition">ISAD(G)</a> (General International Standard Archival Description).</p><h3 id="covering-dates">Covering Dates</h3><p>The date Data Elements that I was trying to process are known in TNA-CS13 as the <code>Covering Dates</code>, and are described as:</p><blockquote>Identifies and records the date(s) of creation of the records being described.</blockquote><p>As a non-archivist, I personally find the term Covering Dates non-intuitive! The first time I encountered it some years ago I incorrectly assumed that it was the period of time discussed in the records, as opposed to the creation date(s) of the records themselves.</p><p>The expert archivist, however, understands the concept of Covering Dates exactly, and I am told that:</p><blockquote>An archival catalogue describes records/documents not historical events. The catalogue is agnostic to historical events and so is the archivist. All intrinsic metadata in the catalogue refers to the record. There may be extrinsic metadata providing contextual information (generally added later as a result of enrichment) but the dates refer to the records.</blockquote><p>Further explanation of how to complete the Covering Dates is also provided:</p><blockquote>"Give the covering dates of the creation of the records within the unit of description as a single calendar date or range of dates as appropriate."</blockquote><p>Within the database behind ILDB, the Covering Dates are stored using 3 fields:</p><ol><li><code>date_text</code> This is a textual description of the dates as they appear in the document or file (or the metadata for a born-digital file or folder). It outlines the date or range of dates when the document or file was created or accumulated.</li><li><code>first_date</code> This is the earliest creation date of any record within the unit of description. It is stored as a serial numeric value of the form <code>yyyyMMdd</code>.</li><li><code>last_date</code> If the unit of description encompasses more than one record and those records have different creation dates, then this is the latest creation date of any record within that unit. If there is only a single record, or all creation dates are the same within the unit, then this duplicates the <code>first_date</code>.
It is stored as a serial numeric value of the form <code>yyyyMMdd</code>.</li></ol><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2021/03/ildb-covering-dates-fields.png" class="kg-image" alt="Processing Historical Dates"><figcaption>ILDB Item Level Fields (Covering Date fields highlighted)</figcaption></figure><!--kg-card-end: image--><p>The system attempts to infer the <code>first_date</code> and <code>last_date</code> fields from the <code>date_text</code> as explained in TNA-CS13; however, ultimately the archivist can override this behaviour manually:</p><blockquote>Dates must be entered in a particular format because the covering date format automatically generates numeric start and end dates in the catalogue in order to enable date searching.</blockquote><p>Data for archival catalogues was often generated through the conversion of paper lists into digital form. In the days before the internet, archivists typed the calendar year before the month and the day as this helped readers to rule out or identify relevant information faster. If you are familiar with <a href="https://en.wikipedia.org/wiki/ISO_8601">ISO 8601</a> style dates, then this will be somewhat familiar to you already.</p><p>Some example Covering Dates:</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2021/03/Example-covering-dates.png" class="kg-image" alt="Processing Historical Dates"><figcaption>Example Covering Dates stored in ILDB</figcaption></figure><!--kg-card-end: image--><h3 id="problems-processing-covering-dates">Problems Processing Covering Dates</h3><p>For our RDF data model, we wanted to express the Covering Dates using <a href="https://www.w3.org/TR/owl-time/">W3C OWL-Time</a>. We had decided upon using either an <a href="https://www.w3.org/TR/owl-time/#time:Instant">Instant</a> for a single covering date where the first and last dates are the same, or a <a href="https://www.w3.org/TR/owl-time/#time:ProperInterval">ProperInterval</a> for a covering date where the first and last dates differ.</p><p>During our ETL process we have a step which parses the numeric <code>first_date</code> and <code>last_date</code> fields into <a href="https://docs.oracle.com/javase/8/docs/api/java/util/Date.html">Java Date</a> objects, and later a subsequent step that adds these to our RDF Model as <a href="https://www.w3.org/TR/xmlschema-2/#date">xsd:date</a> literals.</p><p>During execution of the ETL, which in total was processing ~8.2 million records at item level, we would occasionally see a perplexing error related to <code>first_date</code>:</p><!--kg-card-begin: code--><pre><code class="language-java">covering_date_start String(9) : couldn't convert string [15821011] to a date using format [yyyyMMdd] on offset location 8</code></pre><!--kg-card-end: code--><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2021/03/convert-covering-dates-composite-error.png" class="kg-image" alt="Processing Historical Dates"><figcaption>Pentaho - Date Conversion Error</figcaption></figure><!--kg-card-end: image--><p>To the uninitiated (i.e. my past self), the string <code>15821011</code> looks like it should be parsable using the pattern <code>yyyyMMdd</code>; I can see the year is 1582, the month is October, and the day of the month is the 11th.
So what's wrong with this?</p><h3 id="reproducing-the-issue">Reproducing the Issue</h3><p>As Pentaho is written in Java, my first thought was to try and reproduce the issue in a couple of lines of Java code, so that I could either rule in or rule out an issue with Pentaho. So I wrote the following idiomatic Java code:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-java">import java.text.ParseException;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;

public class DateTest {

    public static void main(final String args[]) throws ParseException {
        final String input = "15821011";
        final SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");

        final ParsePosition pp = new ParsePosition(0);
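        // parse(String, ParsePosition) does not throw on failure; instead it
        // returns null and records the failing offset in the ParsePosition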
        sdf.parse(input, pp);

        if (pp.getErrorIndex() &gt;= 0) {
            // error occurred
            throw new ParseException("Unable to parse: '" + input + "'", pp.getErrorIndex());
        }
    }
}</code></pre><figcaption>Parsing a yyyyMMdd date in Java</figcaption></figure><!--kg-card-end: code--><p>The above code did not throw a <code>ParseException</code>, meaning that it was able to parse the date just fine. This made me suspect that there must be some difference between the above and how Pentaho is parsing the date itself.</p><p>To confirm this, I started Pentaho Spoon with a Java debugging agent configured for remote access, connected to it from IntelliJ IDEA and set some break-points in the Select Values step class (<code>org.pentaho.di.trans.steps.selectvalues.SelectValues</code>) that is used for parsing the date into a Java Date object. Through running the ETL and stepping through the executing code using the Java Debugger I was able to ascertain that, by default, the step was calling <code><a href="https://docs.oracle.com/javase/8/docs/api/java/text/DateFormat.html#setLenient-boolean-">SimpleDateFormat#setLenient(boolean)</a></code> with the argument <code>false</code> before parsing the date.</p><p>Whether lenient parsing is enabled by default for SimpleDateFormat depends on the Calendar that backs it. This in itself depends on your JDK Platform and likely your locale. On my <em>en_GB.UTF-8 </em>system with OpenJDK 8, the lenient setting is inherited from <code>java.util.Calendar</code> where it is enabled by default.</p><p>Modifying our reproducible Java code to disable lenient parsing yields:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-java">import java.text.ParseException;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;

public class DateTest {

    public static void main(final String args[]) throws ParseException {
        final String input = "15821011";
        final SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd");

        sdf.setLenient(false);  // disable lenient parsing mode
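        // NOTE: with lenient parsing disabled, 15821011 will now fail to parse:
        // under the default Julian-Gregorian cutover (15 October 1582),
        // the 11th October 1582 does not exist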

        final ParsePosition pp = new ParsePosition(0);
        sdf.parse(input, pp);

        if (pp.getErrorIndex() &gt;= 0) {
            // error occurred
            throw new ParseException("Unable to parse: '" + input + "'", pp.getErrorIndex());
        }
    }
}</code></pre><figcaption>Strict parsing of a yyyyMMdd date in Java</figcaption></figure><!--kg-card-end: code--><p>The above modified code does indeed now throw a <code>ParseException</code> as a result of trying to parse the date <code>15821011</code>. So far so good, we have isolated and reproduced the issue!</p><h3 id="it-s-all-about-the-calendar-">It's all about the Calendar!</h3><p>So, <code>15821011</code> looks like a valid date... so why does it not parse when lenient parsing is disabled?</p><p>Our first hint comes from examining the result of <code>sdf.parse</code> (as a String) when lenient parsing is enabled. The result is:</p><!--kg-card-begin: code--><pre><code class="language-java">Thu Oct 21 00:00:00 CET 1582</code></pre><!--kg-card-end: code--><p>Interesting! I was expecting the 11th October 1582, but we are told that we have the 21st October 1582.</p><p>I previously had a basic understanding that there was a Julian Calendar, that it pre-dated the creation and use of the Gregorian Calendar, and that the switch-over happened in October 1582. That switch-over happened such that Thursday 4th October 1582 (Julian Calendar) was followed by Friday 15th October 1582 (Gregorian Calendar). I can only imagine that this must have confused a few people who woke up on that Friday morning ;-)</p><p>As the calendar switched from Julian to Gregorian in October 1582, we can see that according to the Julian-Gregorian hybrid calendar (which is what Java uses by default on my platform) there was no 11th October 1582, and therefore <code>15821011</code> is not a valid date (for that Calendar).</p><p>So what's up with lenient parsing? Basically <code>SimpleDateFormat</code> makes a best-effort attempt to interpret the supplied date. As 11th October 1582 falls within the days removed by the Julian-Gregorian switch-over, Java adds 10 days (the difference at that time between the calendars) to yield the 21st October 1582 in the Gregorian Calendar. However, that's not really what we want! We want non-lenient parsing as we may have other dates in the source data that are actually incorrect and we don't want them slipping through undetected.</p><p>...and if that was the end of it, it wouldn't be too bad as we could just correct an invalid date in the source data. However, that date is perfectly valid... keep reading!</p><h3 id="julian-gregorian-switch-over-wasn-t-universal-">Julian-Gregorian Switch-Over Wasn't Universal!</h3><p>Simply put, whilst most of Roman Catholic Europe switched from the Julian to the Gregorian calendar in October 1582, other countries followed later. The countries now making up the UK and Ireland are important in this story because this date comes from The Catalogue of The National Archives, and they didn't in fact switch over until almost 200 years later - 14th September 1752.</p><p>When given a date, I think you actually have to know two things to be able to parse it: 1) the date itself, and 2) the Calendar to which the date refers. Our date 11th October 1582 (<code>15821011</code>) is perfectly valid in the UK <em>at that time</em> (according to the Julian Calendar), as the UK had not yet switched over to the Gregorian calendar.</p><p>When parsing <code>15821011</code> in Java, we need to instruct Java to use the correct Calendar configuration. 
Initially I (incorrectly) assumed that Java would know the switch-over dates on a per-country basis and adjust its Julian-Gregorian calendar appropriately; as such, I tried setting both the Locale and Time Zone to the UK:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-java">import java.text.ParseException;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class DateTest {

    public static void main(final String args[]) throws ParseException {
        final String input = "15821011";

        final Locale ukLocale = Locale.UK;
        final SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd", ukLocale);

        final TimeZone ukTimeZone = TimeZone.getTimeZone("Europe/London");
        final GregorianCalendar ukCalendar = new GregorianCalendar(ukTimeZone, ukLocale);
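        // NOTE: the Locale and TimeZone do NOT adjust the Julian-Gregorian
        // cutover; it remains at the default of 15 October 1582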
        sdf.setCalendar(ukCalendar);

        sdf.setLenient(false);  // disable lenient parsing mode

        final ParsePosition pp = new ParsePosition(0);
        final Date result = sdf.parse(input, pp);

        if (pp.getErrorIndex() &gt;= 0) {
            // error occurred
            throw new ParseException("Unable to parse: '" + input + "'", pp.getErrorIndex());
        }

        System.out.println("Result: " + result.toString());
    }
}</code></pre><figcaption>Incorrect way to configure UK Julian-Gregorian switch-over in Java</figcaption></figure><!--kg-card-end: code--><p>Unfortunately, as mentioned above, my assumption that Java would configure the Julian-Gregorian switch-over date automatically once it knew the Locale and Time Zone was incorrect; the above code still throws a <code>ParseException</code>.</p><p>Once you know where to look, this is documented and expected behaviour. Java's <a href="https://docs.oracle.com/javase/8/docs/api/java/util/GregorianCalendar.html">GregorianCalendar </a>class documentation states:</p><blockquote><code>GregorianCalendar</code> is a hybrid calendar that supports both the Julian and Gregorian calendar systems with the support of a single discontinuity, which corresponds by default to the Gregorian date when the Gregorian calendar was instituted (October 15, 1582 in some countries, later in others). The cutover date may be changed by the caller by calling <a href="https://docs.oracle.com/javase/8/docs/api/java/util/GregorianCalendar.html#setGregorianChange-java.util.Date-"><code>setGregorianChange()</code></a>.</blockquote><blockquote>Historically, in those countries which adopted the Gregorian calendar first, October 4, 1582 (Julian) was thus followed by October 15, 1582 (Gregorian). This calendar models this correctly. Before the Gregorian cutover, <code>GregorianCalendar</code> implements the Julian calendar.</blockquote><p>Therefore, to construct a calendar that respects the UK Julian-Gregorian calendar switch-over we can call <code>setGregorianChange</code> with an argument of 14th September 1752. For example:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-java">import java.text.ParseException;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class DateTest {

    public static void main(final String args[]) throws ParseException {
        final String input = "15821011";

        final Locale locale = Locale.UK;
        final TimeZone timeZone = TimeZone.getTimeZone("Europe/London");

        // setup a UK Julian-Gregorian Calendar with the correct switch-over date for the UK
        final GregorianCalendar ukJulianGregorianCalendar = new GregorianCalendar(timeZone, locale);
        final GregorianCalendar ukGregorianCalendarCutoverDate = (GregorianCalendar) ukJulianGregorianCalendar.clone();
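        // 14th September 1752 was the first Gregorian date in the UK;
        // beware that Calendar months are zero-based: Calendar.SEPTEMBER == 8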
        ukGregorianCalendarCutoverDate.set(1752, Calendar.SEPTEMBER, 14);
        ukJulianGregorianCalendar.setGregorianChange(ukGregorianCalendarCutoverDate.getTime());

        final SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd", locale);
        sdf.setCalendar(ukJulianGregorianCalendar);

        sdf.setLenient(false);  // disable lenient parsing mode

        final ParsePosition pp = new ParsePosition(0);
        final Date result = sdf.parse(input, pp);

        if (pp.getErrorIndex() &gt;= 0) {
            // error occurred
            throw new ParseException(input, pp.getErrorIndex());
        }

        System.out.println("parsed='" + result.toString() + "'");
        System.out.println("serialized='" + sdf.format(result) + "'");
    }
}
</code></pre><figcaption>Strict parsing of a yyyyMMdd date in Java with correct UK Julian-Gregorian switch-over</figcaption></figure><!--kg-card-end: code--><p><strong>NOTE:</strong> In the above code the call to <code><a href="https://docs.oracle.com/javase/7/docs/api/java/util/Calendar.html#set(int,%20int,%20int)">Calendar#set(int, int, int)</a></code> takes arguments for a <em>year</em>, a <em>month</em>, and a <em>day</em>, but... the <em>month</em> argument is zero-based and not one-based! So for September you must enter <code>8</code> and not <code>9</code>! Alternatively you can use the constants from the Calendar API, e.g.: <code>set(1752, Calendar.SEPTEMBER, 14)</code>. This small difference confused me for some time :-(<br></p><p>The most important thing to note in the code above is that we have still explicitly disabled lenient parsing, but because we have correctly set the UK Julian-Gregorian switch-over date, we can now parse our date 11th October 1582 (<code>15821011</code>) without any errors. The result of the above code is:</p><!--kg-card-begin: code--><pre><code class="language-java">parsed='Thu Oct 21 01:00:00 CET 1582'
serialized='15821011'</code></pre><!--kg-card-end: code--><p>Now you might be thinking... Hang on one hot moment, the <code>parsed</code> line still says "21st October 1582"!<br>Indeed it does! However, that is correct because as explained in Java's <a href="https://docs.oracle.com/javase/8/docs/api/java/util/GregorianCalendar.html">GregorianCalendar </a>class documentation:</p><blockquote><code>GregorianCalendar</code> implements <em>proleptic</em> Gregorian and Julian calendars. That is, dates are computed by extrapolating the current rules indefinitely far backward and forward in time. As a result, <code>GregorianCalendar</code> may be used for all years to generate meaningful and consistent results.</blockquote><p>The important term above is "<em>proleptic</em>", and I will leave that for the reader to look up. I am simplifying, but you can expect the Java Date class to always represent dates internally according to the Gregorian Calendar. That's not a problem because we can convert forward and backward as needed between Calendar representations. This is demonstrated by the <code>serialized</code> output line in the results above.</p><p>Now that our Calendar is correctly configured (for the UK) in Java we have no problem parsing our Covering Dates. Unfortunately I have been unable to find any options for configuring this in Pentaho for our ETL, and as such we have 2 options:</p><ol><li>As Pentaho is Open Source, we could improve Pentaho's Select Values step to offer suitable configuration.</li><li>Create a custom Pentaho Plugin step which handles such date parsing.<br>I may cover such topics in a future post, but we won't consider Pentaho any further today.</li></ol><h3 id="what-about-that-rdf">What about that RDF?</h3><p>As briefly mentioned earlier, we want to output our Covering Dates into our RDF using a <a href="https://www.w3.org/TR/owl-time/">W3C OWL-Time</a> format. We need to incorporate the 3 fields: <code>date_text</code>, <code>first_date</code>, and the optional <code>last_date</code>. The <code>first_date</code> and <code>last_date</code> fields are now valid Java Date objects according to our calendar (as discussed above), whilst the <code>date_text</code> remains a string value.</p><p>An example of a Covering Date indicating a single point in time for a unit of description may have the <code>date_text</code> with a value of <code>1859 Aug 30</code>, and only a <code>first_date</code> with a value of <code>18590830</code>. This can be expressed in our RDF model for Project Omega as: </p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-turtle">omg:created [
    a time:Instant ;
    dct:description "1859 Aug 30" ;
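    # "1859 Aug 30" above is the original date_text; the parsed first_date
    # follows as an xsd:date (proleptic Gregorian Calendar)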
    time:inXSDDate "1859-08-30Z"^^xsd:date
] .</code></pre><figcaption>Project Omega - Example Record Covering Date for a Point in Time</figcaption></figure><!--kg-card-end: code--><p>Another example of a Covering Date indicating a period of time for a unit of description may have the <code>date_text</code> with a value of <code>11 Oct 1582 - 29 Nov 1582</code>, the <code>first_date</code> with a value of <code>15821011</code>, and <code>last_date</code> with a value of <code>15821129</code>. This can be expressed in our RDF model for Project Omega as:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-turtle">omg:created [
    a time:ProperInterval ;
    dct:description "11 Oct 1582 - 29 Nov 1582" ;
    time:hasBeginning [
        a time:Instant ;
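        # first_date 15821011, i.e. 11 Oct 1582 (Julian),
        # expressed here in the proleptic Gregorian Calendar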
        time:inXSDDate "1582-10-21Z"^^xsd:date
    ] ;
    time:hasEnd [
        a time:Instant ;
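        # last_date 15821129, i.e. 29 Nov 1582 (Julian),
        # expressed here in the proleptic Gregorian Calendar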
        time:inXSDDate "1582-12-09Z"^^xsd:date
    ]
] .</code></pre><figcaption>Project Omega - Example Record Covering Date for a Period of Time</figcaption></figure><!--kg-card-end: code--><p>Now, if you have been paying close attention so far, you will have noticed that the literal values of the <code>time:inXSDDate</code> properties don't look like the <code>first_date</code> and <code>last_date</code> values!</p><p>However, if I told you that these dates in RDF are stored according to the <a href="https://www.w3.org/TR/xmlschema-2/#date">xsd:date</a> (W3C XML Schema Date) data-type, and that it specifies a <em>proleptic</em> Gregorian Calendar, then perhaps you might have an "<em>Ah ha!</em>" moment.</p><p>If not, then let me explain that the dates in the RDF are the Gregorian equivalent of the Julian dates that were provided as input. No information has been lost as you can convert back and forward between these with relative ease.</p><h3 id="finally-the-archivist-vs-the-software-engineer">Finally, The Archivist vs. The Software Engineer</h3><p>The Digital Humanities require a fine and pragmatic balance between Human and Technical factors.</p><p><strong>Software Engineering - Technical Factors</strong></p><p>In the example of storing these Covering Dates in RDF the Software Engineer has proposed storing them as <code>xsd:date</code> typed values. The Software Engineer, recalling that TNA-CS13 states:</p><blockquote>the covering date format automatically generates numeric start and end dates in the catalogue in order to enable date searching.</blockquote><p>believes that the <code>date_text</code> is the important property from an archival descriptive perspective, and that the <code>first_date</code> and <code>last_date</code> are really just present to enable the access function of searching records by dates. The Software Engineer also assumes that the <code>first_date</code> and <code>last_date</code> should be in synchronisation with the <code>date_text</code> and therefore be a faithful representation of that period.</p><p>For the Software Engineer, how the <code>first_date</code> and <code>last_date</code> are stored is a technical consideration that centers around arguments of accuracy, performant indexing, and range searches over dates. The Software Engineer believes that for dates to be processed consistently in isolation, they must be expressed according to a Calendar and Time Zone. In effect, 3 facts are required to process a date: the date itself, the calendar in which the date is expressed, and any Time Zone information for how that date was recorded.</p><p>Ultimately, should there be a presentation requirement, the Software Engineer knows that they can convert the <code>first_date</code> and <code>last_date</code> and present/search them in any format requested by the user. The Software Engineer is confident that no information has been lost or destroyed.</p><p><strong>Archival - Preservation Factors</strong></p><p>The archivist is concerned that at present the <code>first_date</code> and <code>last_date</code> are recorded as simple sequential numbers. The archivist understands the context of the record, and is happy to glance at the serial date and interpret it within its historical context. The archivist believes that all 3 properties (<code>first_date</code>, <code>last_date</code>, and <code>date_text</code>) are equally important from a records keeping perspective and should be preserved as is.</p><p>The archivist worries that the raw expression of an <code>xsd:date</code> e.g. 
<code>1859-08-30Z</code> may be harder to interpret than the previous serial format: <code>18590830</code>; worse yet, dates that were written for the Julian Calendar (as that was the Calendar in use at that date) e.g. <code>15821129</code> might now be expressed as <code>1582-12-09Z</code> for the Gregorian Calendar. Without careful explanation to the user, the new Gregorian form of the Julian date is confusing, and use of the original serial date format perhaps makes more sense.</p><p><strong>In Conclusion - Human and Technical Together</strong><br><br>Ultimately, all of these dates, regardless of how they are formatted relative to a particular calendar, are stored as electro-magnetic <code>1</code>s and <code>0</code>s on a disk. The current serial formatted first and last dates stored in ILDB are SQL Integers laid down inside a complex proprietary MS SQL Server database format that likely few could hope to decipher!</p><p>Ignoring for the moment the field of Digital Preservation, we can have the best of both worlds. The Software Engineer can design correct and performant software which accurately records the dates, and the archivists and users can be presented with, and allowed to search, those dates in whichever format is most desirable. Such presentation could include displaying the calendar or adjusting the display dates to the historically relevant calendar for the record.</p><p>The Omega Catalogue system is designed to be an online system; for the purposes of Digital Preservation, one could imagine preserving frequent exports of our RDF data as perhaps Turtle or similar (UTF-8 encoded text). As a Software Engineer and potential Digital Archaeologist I would argue that (given pre-knowledge of the spec), it is much easier to interpret and process <code>1582-12-09Z</code> than <code>15821129</code>, as the first form can only be according to the Gregorian Calendar (as per the W3C specifications for the XML Schema Date Type) and also indicates the time-zone (the <code>Z</code> character denotes UTC). I therefore have everything I need within the date string itself. I don't need to research further into the history of Julian to Gregorian Calendar switch-overs, which I would otherwise have to do with the second form! 
</p>]]></content:encoded></item><item><title><![CDATA[Reusing Standard RDF Vocabularies - Part 2]]></title><description><![CDATA[Following on from Part 1, we look at a further example of trade-offs and compromise that I made with regards to reusing existing RDF vocabularies in Project Omega at The National Archives; thus reinforcing our earlier position on reuse through a different example.]]></description><link>https://blog.adamretter.org.uk/vocabulary-reuse-part2/</link><guid isPermaLink="false">605076c755477602318d6f75</guid><category><![CDATA[RDF]]></category><category><![CDATA[catalogue]]></category><category><![CDATA[The National Archives]]></category><category><![CDATA[Vocabulary]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Wed, 24 Mar 2021 09:49:45 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1485550409059-9afb054cada4?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGlkZW50aXR5fGVufDB8fHx8MTYxNTkyNTU3Nw&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1485550409059-9afb054cada4?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwxMTc3M3wwfDF8c2VhcmNofDN8fGlkZW50aXR5fGVufDB8fHx8MTYxNTkyNTU3Nw&ixlib=rb-1.2.1&q=80&w=1080" alt="Reusing Standard RDF Vocabularies - Part 2"><p>In <a href="https://blog.adamretter.org.uk/vocabulary-reuse-part1/">Part 1</a> we looked at the challenges of strictly sticking to a policy of reusing existing vocabularies within <a href="https://www.nationalarchives.gov.uk/about/our-role/plans-policies-performance-and-projects/our-plans/our-digital-cataloguing-practices/project-omega/">Project Omega</a> at TNA (<a href="https://www.nationalarchives.gov.uk">The National Archives</a>) and why you may occasionally have to make concessions to correctly express your data.</p><p>The example discussed in <a href="https://blog.adamretter.org.uk/vocabulary-reuse-part1/">Part 1</a> was concerned with the scenario where one cannot find a suitable property from an existing popular standardised vocabulary. In this shorter article, we will look at a second scenario where there may be a reusable property available, but its implementation is problematic.</p><h3 id="primary-and-secondary-identifiers">Primary and Secondary Identifiers</h3><p>For this example let me explain another use-case that we recently had to solve for Project Omega. TNA's current Catalogue contains multiple identifiers for each Unit of Description (a single document or folder):</p><ol><li>Database Table Primary Key, e.g. <code>tbl_item.-4653191</code></li><li>CCR (Classic Catalogue Reference), e.g. <code>AIR 79/1064/118667</code></li><li><em>Optional </em>- The Former Reference - Creating Department, e.g. <code>R515333</code></li><li><em>Optional </em>- The Former Reference - PRO (Public Records Office), e.g. <code>E 315/509/Fo. 11</code></li></ol><p>In addition, in the new Omega Catalogue, every Resource has:</p><ol><li>An OCI (Omega Catalogue Identifier), e.g. <code>FO.2020.3J.P.1</code></li><li><em>Optional</em> - A related identifier from the <a href="https://discovery.nationalarchives.gov.uk">Discovery</a> system called an IAID (Information Asset Identifier), e.g. <code>01d43d64-d7a6-4250-a2f2-4153a606a948</code>.</li></ol><p>The Primary Identifier in Omega is the OCI, and we can happily reuse the <a href="https://www.dublincore.org/specifications/dublin-core/dcmi-terms/">Dublin Core Terms</a>' <code>dct:identifier</code> property for this. 
For example:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-turtle">@prefix tna:     &lt;http://www.nationalarchives.gov.uk/&gt; .
@prefix dct:     &lt;http://purl.org/dc/terms/&gt; .

tna:res.FO.2020.3J.P.1
    dct:identifier "FO.2020.3J.P.1" ;
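    # the OCI (Omega Catalogue Identifier) is the Primary Identifier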
    .</code></pre><figcaption>A Record with a Primary Identifier</figcaption></figure><!--kg-card-end: code--><p>This leads us to the question of - <em>What is the best way to express our Secondary Identifiers?</em></p><h3 id="expressing-secondary-identifiers">Expressing Secondary Identifiers</h3><p>If we were to express our Secondary Identifiers also using <code>dct:identifier</code>, it becomes difficult, impossible even, to differentiate the scheme to which an identifier belongs, as <code>dct:identifier</code> is a Data Type Property and only permits a single literal value. Consider for example:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-turtle">@prefix tna:     &lt;http://www.nationalarchives.gov.uk/&gt; .
@prefix dct:     &lt;http://purl.org/dc/terms/&gt; .

tna:res.FO.2020.3J.P.1
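    # six identifiers follow, but nothing states to which scheme each belongs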
    dct:identifier "FO.2020.3J.P.1" ;
    dct:identifier "01d43d64-d7a6-4250-a2f2-4153a606a948" ;
    dct:identifier "tbl_item.-4653191" ;
    dct:identifier "AIR 79/1064/118667" ;
    dct:identifier "R515333" ;
    dct:identifier "E 315/509/Fo. 11" ;
    .</code></pre><figcaption>A Record with Many Identifiers; problematic?</figcaption></figure><!--kg-card-end: code--><p>The difficulty in working with the above data is that it raises questions such as:</p><ol><li>Why are there so many identifiers?</li><li>To which schemes do these identifiers belong, and where can I find more information about those?</li><li>Which identifier should I use?</li><li>If I perform a query involving <code>dct:identifier</code>, then I am querying across identifier schemes, but am I guaranteed that there are no duplicate or conflicting identifiers across those schemes?</li><li>As a maintainer of the data, are all of the identifiers that are needed present?</li></ol><p>Ideally we want instead a mechanism for Secondary Identifiers that not only expresses the identifier itself, but also the scheme which defines the use and syntax of that identifier.</p><p>After looking through several popular and standardised vocabularies, <a href="https://schema.org">Schema.org</a>'s identifier property appears to suit our needs - <code><a href="https://schema.org/identifier">schema:identifier</a></code>.</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-turtle">@prefix tna:     &lt;http://www.nationalarchives.gov.uk/&gt; .
@prefix dct:     &lt;http://purl.org/dc/terms/&gt; .
@prefix schema:  &lt;https://schema.org/&gt; .

tna:res.FO.2020.3J.P.1
    dct:identifier "FO.2020.3J.P.1" ;
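    # the Primary Identifier stays in dct:identifier; each Secondary
    # Identifier is now a schema:PropertyValue under schema:identifier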
    schema:identifier [
        a schema:PropertyValue ;
        schema:name "CCR" ;
        schema:description "The Classic Catalogue Reference" ;
        schema:value "FO 12/34/56"
    ] ;
    schema:identifier [
        a schema:PropertyValue ;
        schema:name "FRCD" ;
        schema:description "The Former Reference - Creating Department" ;
        schema:value "R123456"
    ]
    .</code></pre><figcaption>A Record with Primary and Secondary Identifier - Literals</figcaption></figure><!--kg-card-end: code--><p>This is an improvement over the sole and repeated use of <code>dct:identifier</code> as it allows us to reserve <code>dct:identifier</code> to indicate our Primary Identifier, and our secondary identifiers are now easily located via <code>schema:identifier</code>. In addition, each Secondary Identifier carries information explaining its purpose.</p><h3 id="further-concessions-against-reuse">Further Concessions Against Reuse</h3><p>We could refactor this to eliminate duplication and make it easier to query against specific secondary identifier(s). Thus yielding:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-turtle">@prefix tna:     &lt;http://www.nationalarchives.gov.uk/&gt; .
@prefix dct:     &lt;http://purl.org/dc/terms/&gt; .
@prefix schema:  &lt;https://schema.org/&gt; .

tna:res.FO.2020.3J.P.1
    dct:identifier "FO.2020.3J.P.1" ;
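    # each Secondary Identifier now references its scheme by URI,
    # so queries can match on schema:propertyID exactly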
    schema:identifier [
        a schema:PropertyValue ;
        schema:propertyID tna:ccr ;
        schema:value "FO 12/34/56"
    ] ;
    schema:identifier [
        a schema:PropertyValue ;
        schema:propertyID tna:frcd ;
        schema:value "R123456"
    ]
    .</code></pre><figcaption>A Record with Primary and Secondary Identifier - URIs</figcaption></figure><!--kg-card-end: code--><p>The above involves trading-off the reuse of existing vocabulary properties for further precision of meaning.</p><p>We gain:</p><ol><li>A reduction in duplicated strings, e.g. the <code>schema:name</code> and <code>schema:description</code> being placed on each secondary identifier.</li><li>The ability to easily and confidently search the data: we can match on <code>? schema:propertyID tna:ccr</code> instead of <code>? schema:name "CCR"</code>. This becomes even more important where data may have been mis-keyed.</li></ol><p>We lose:</p><ol><li>The ability for strangers to interpret our data easily by glancing at a Secondary Identifier (<code>schema:identifier</code>) and immediately know what it is by reading its inline <code>schema:name</code> and <code>schema:description</code>.</li><li>The ability to express our data without needing to define our own vocabulary. </li></ol><p>These trade-offs are quite severe and I think we lose too much for the (maybe as yet unknown) humans who want to work with our data. Of course we gain for the machines, but if we were only concerned about machines we would just use the most efficient binary encoding possible and this article would be redundant.</p><p>To complete the example above, we should also define <code>tna:ccr</code> and <code>tna:frcd</code> in a new vocabulary of our own:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-turtle">@prefix owl:     &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix rdfs:    &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix tna:     &lt;http://www.nationalarchives.gov.uk/&gt; .
@prefix dct:     &lt;http://purl.org/dc/terms/&gt; .
@prefix schema:  &lt;https://schema.org/&gt; .

tna:ccr
    a owl:Class ;
    rdfs:label "Classic Catalogue Reference" ;
    rdfs:comment "The CCR (Classic Catalogue Reference) is a secondary
                  identifier for a Unit of Description. It reflects the
                  historic ISAD(G) like archival arrangement of the unit, i.e
                  Department, Series, Piece, and Item. It has been in use
                  at The National Archives since the 19th century and is
                  aligned to the ISAD(G) standard. It is defined on page 13
                  of the document: TNA-CS13 (Cataloguing Standards - Part A
                  Data Elements, June 2013)."@en ;
    rdfs:seeAlso tna:cs13-a, schema:identifier, dct:identifier
    .
  
 tna:frcd
    a owl:Class ;
    rdfs:label "Former Reference - Creating Department" ;
    rdfs:comment "The FRCD (Former Reference - Creating Department) is
                  a secondary identifier for a Unit of Description. It holds
                  the unique identifier given to the material by the
                  originating creator. It is defined on page 17 of the
                  document: TNA-CS13 (Cataloguing Standards - Part A Data
                  Elements, June 2013)."@en ;
    rdfs:seeAlso tna:cs13-a, schema:identifier, dct:identifier
    .
    
tna:cs13-a
    a owl:Class ;
    rdfs:label "Cataloguing Standards - Part A Data Elements, June 2013" ;
    rdfs:comment "Describes the various Catalogue Elements derived from
                  ISAD(G) to manage descriptive data which is available
                  in PROCAT Editorial."@en
    .</code></pre><figcaption>Describing the Schemes of our Secondary Identifiers</figcaption></figure><!--kg-card-end: code--><h3 id="further-compromise-for-human-use">Further Compromise for Human Use</h3><p>We had chosen to reuse <code>schema:identifier</code> for our secondary identifiers because we had reserved <code>dct:identifier</code> for our primary identifier, and we wanted to follow our guiding rule of reusing common vocabularies wherever possible. </p><p>However, I feel that we have not yet arrived at a good solution. Perhaps there is a different approach that we might take that would yield a more favourable balance between human understandability, precision of our data, and computability by machines?</p><p>What about if we took a similar approach to that which we ultimately proposed in <a href="https://blog.adamretter.org.uk/vocabulary-reuse-part1/">Part 1</a>? That is to say, that we could derive our own property(s) for secondary identifiers from an existing one, thus reusing the common definition yet adding further meaning. Of course we must still be very considerate of human users, and wisely choose straight-forward or obvious generic names for such properties so as to help them infer their purpose.</p><p>Whilst we can't use <code>dct:identifier</code> directly for our secondary identifiers, there is nothing to stop us deriving our own properties for secondary identifiers from it!</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-turtle">@prefix owl:     &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix rdfs:    &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix xsd:     &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix tna:     &lt;http://www.nationalarchives.gov.uk/&gt; .
@prefix dct:     &lt;http://purl.org/dc/terms/&gt; .
@prefix schema:  &lt;https://schema.org/&gt; .

tna:classicCatalogueReference
    a owl:DatatypeProperty ;
    rdfs:subPropertyOf dct:identifier ;
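    # derived from dct:identifier, so consumers who only know
    # Dublin Core Terms can still discover this identifier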
    rdfs:range xsd:string ;
    rdfs:label "Classic Catalogue Reference" ;
    rdfs:comment "The Classic Catalogue Reference is a secondary
                  identifier for a Unit of Description. It reflects the
                  historic ISAD(G) like archival arrangement of the unit, i.e
                  Department, Series, Piece, and Item. It has been in use
                  at The National Archives since the 19th century and is
                  aligned to the ISAD(G) standard. It is defined on page 13
                  of the document: TNA-CS13 (Cataloguing Standards - Part A
                  Data Elements, June 2013)."@en ;
    rdfs:seeAlso tna:cs13-a, schema:identifier, dct:identifier
    .

tna:formerReferenceFromDepartment
    a owl:DatatypeProperty ;
    rdfs:range xsd:string ;
    rdfs:subPropertyOf dct:identifier ;
    rdfs:label "Former Reference - Creating Department" ;
    rdfs:comment "The 'Former Reference - Creating Department' is a secondary
                  identifier for a Unit of Description. It holds the unique
                  identifier given to the material by the originating
                  creator. It is defined on page 17 of the document:
                  TNA-CS13 (Cataloguing Standards - Part A Data Elements,
                  June 2013)."@en ;
    rdfs:seeAlso tna:cs13-a
    .</code></pre><figcaption>Definition of our derived properties for two of our Secondary Identifiers</figcaption></figure><!--kg-card-end: code--><p>Using our own derived properties would then yield an expression for our Record that looks something like:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-turtle">@prefix tna:     &lt;http://www.nationalarchives.gov.uk/&gt; .
@prefix dct:     &lt;http://purl.org/dc/terms/&gt; .
@prefix schema:  &lt;https://schema.org/&gt; .

tna:res.FO.2020.3J.P.1
    dct:identifier "FO.2020.3J.P.1" ;
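    # each secondary identifier is now a single, self-describing line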
    tna:classicCatalogueReference "FO 12/34/56" ;
    tna:formerReferenceFromDepartment "R123456"
    .</code></pre><figcaption>A Record with Primary and Secondary Identifier - Bespoke Vocabulary</figcaption></figure><!--kg-card-end: code--><p>I believe that this final approach strikes a good compromise. Whilst we are not directly reusing an existing common vocabulary here for our secondary identifiers, we have a good reason, which is that <code>schema:identifier</code> is not a good fit for us considering our use-case. However, all is not lost: whilst we have had to create our own properties, they themselves are derived from a property (<code>dct:identifier</code>) from an existing common vocabulary (Dublin Core Terms). Additionally, and very subjectively, I would argue that it is much easier for humans to understand a single line which says <code>tna:formerReferenceFromDepartment</code> than looking within data or object properties of <code>schema:identifier</code>.</p><h3 id="conclusion">Conclusion</h3><p>Although we started out with a different problem, and tried different approaches along the way, we ultimately ended up with an approach that looks remarkably similar to that in <a href="https://blog.adamretter.org.uk/vocabulary-reuse-part1/">Part 1</a>.</p><p>Hopefully this has reinforced the idea that when attempting to solely reuse existing popular vocabularies, if you falter due to a lack of suitable available classes and properties, there are options available, but there are trade-offs that have to be made between reuse and precision.</p>]]></content:encoded></item><item><title><![CDATA[Reusing Standard RDF Vocabularies - Part 1]]></title><description><![CDATA[For Project Omega at The National Archives, we are reusing existing RDF vocabularies for our new Catalogue. I examine what happens when a suitable property or class cannot be found within an existing vocabulary, and weigh-up the different approaches and compromises that may need to be made.]]></description><link>https://blog.adamretter.org.uk/vocabulary-reuse-part1/</link><guid isPermaLink="false">604f517f55477602318d6b9f</guid><category><![CDATA[RDF]]></category><category><![CDATA[catalogue]]></category><category><![CDATA[The National Archives]]></category><category><![CDATA[Vocabulary]]></category><category><![CDATA[Matterhorn]]></category><category><![CDATA[RiC]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Wed, 17 Mar 2021 17:05:09 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1518792528501-352f829886dc?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=MnwxMTc3M3wwfDF8c2VhcmNofDMwfHxyZWN5Y2xlfGVufDB8fHx8MTYxNTg0ODE4NQ&amp;ixlib=rb-1.2.1&amp;q=80&amp;w=1080" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1518792528501-352f829886dc?crop=entropy&cs=tinysrgb&fit=max&fm=jpg&ixid=MnwxMTc3M3wwfDF8c2VhcmNofDMwfHxyZWN5Y2xlfGVufDB8fHx8MTYxNTg0ODE4NQ&ixlib=rb-1.2.1&q=80&w=1080" alt="Reusing Standard RDF Vocabularies - Part 1"><p>In Phase 1 of Project Omega at TNA (<a href="https://www.nationalarchives.gov.uk">The National Archives</a>) we evaluated several different existing models and vocabularies/ontologies to ascertain their suitability for expressing the data of TNA's new Pan-Archival Catalogue. We published a fairly comprehensive report of our findings and proposed a way forward: <a href="https://www.nationalarchives.gov.uk/documents/omega-catalogue-model-proposal.pdf">Catalogue Model Proposal</a>.</p><p>In summary, we felt that none of the existing models were perfect. 
We recognised that ICA's RiC (International Council on Archives' Records in Contexts) was very promising but currently under-developed for TNA's needs. Ultimately, we felt that the approach taken in developing <a href="https://ipres2019.org/static/proceedings/iPRES2019.pdf#page=271">The Matterhorn RDF Data Model</a> had a lot of strengths and that we would take a similar path.</p><p>We decided that the new Data Model for Project Omega would:</p><ol><li>attempt to adhere to the broader principles of <a href="https://www.ica.org/en/egad-ric-conceptual-model">RiC's Conceptual Model</a>, but discard <a href="https://www.ica.org/en/records-in-contexts-ontology">RiC's Ontology</a>.</li><li>follow the approach of The Matterhorn RDF Data Model, i.e. reuse existing vocabularies and NOT create our own.</li></ol><p>We started with the model specified in Matterhorn and added additional properties and classes from other shared and standardised vocabularies as we needed. The (work-in-progress) documentation of our data model is available here: <a href="https://www.nationalarchives.gov.uk/documents/omega-catalogue-data-model.pdf">Omega Catalogue Data Model</a>. </p><p>Now that we are in Phase 2 of Project Omega and exporting data into this data model in the form of Turtle RDF, we are starting to revisit some of our initial assumptions about reuse.</p><h3 id="shared-language-vs-precision">Shared Language vs. Precision</h3><p>The beauty of reusing existing vocabularies (assuming that you choose popular and standardised ones) is that any developer, data scientist, or user who has worked with RDF before can likely already understand and work with our data. For example, let's consider a simplified description of a Record:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-turtle">@prefix tna:     &lt;http://www.nationalarchives.gov.uk/&gt; .
@prefix dct:     &lt;http://purl.org/dc/terms/&gt; .

tna:res.FO.2020.3J.P.1
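    # both properties below are reused from the Dublin Core Terms vocabulary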
    dct:identifier "FO.2020.3J.P.1" ;
    dct:description "This is a Foreign Office record about..." ;
    .</code></pre><figcaption>Record with an Identifier and Description that reuse common vocabularies</figcaption></figure><!--kg-card-end: code--><p>DCT (Dublin Core Terms) is a vocabulary that has been around since 2008 and its use is ubiquitous. Even if somehow the user was not aware of Dublin Core, the naming of the terms is straight-forward. As a human I can likely infer the meaning of <code>dct:identifier</code> as holding an identifier for the resource, and <code>dct:description</code> as holding a description of the resource. If you felt the need to, you could confirm your suspicions by checking the Dublin Core standard document itself; however, the point here is that you didn't have to, as the meaning is already known or at least almost-obvious. This is a major benefit of reusing popular vocabularies as it both reduces the cognitive load for those working with the data, and enables us to form and use a shared language even when working with vastly different datasets.</p><p>On the flip-side, the disadvantage of reusing popular shared vocabularies is that they are often, by design, quite generic in their definitions. This is of course by necessity: common terms acceptable to a wide audience need to be agreeable to that audience, and so generic and/or vaguely defined terms are more palatable.</p><p>Defining your own vocabulary has the absolute advantage of allowing you to precisely define your world-view and exactly what you mean. That's powerful stuff!</p><p>By way of contrast consider the same record expressed in a (fictional) bespoke vocabulary:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-turtle">@prefix tna:     &lt;http://www.nationalarchives.gov.uk/&gt; .

tna:res.FO.2020.3J.P.1
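    # precise, but opaque: tna:oci and tna:scope-content need TNA's own
    # documentation before a stranger can interpret them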
    tna:oci "FO.2020.3J.P.1" ;
    tna:scope-content "This is a Foreign Office record about..." ;
    .</code></pre><figcaption>Record with an Identifier and Description that use bespoke vocabulary</figcaption></figure><!--kg-card-end: code--><p>If you work in the Archives sector you might well guess that <code>tna:scope-content</code> holds the <em>Scope and Content</em> of the record... but how many people outside of the Archives sector know exactly what is meant by the "Scope and Content" of a record? Even then, you likely wouldn't know the meaning of <code>tna:oci</code>! It's the <em>Omega Catalogue Identifier</em>, and awareness of that is not even organisation-wide throughout TNA yet.<br>We would of course write OWL and documentation to define exactly what <code>tna:oci</code> and <code>tna:scope-content</code> mean, but the user has to go and read those before they can work with the data.</p><p><strong>The trade-off is ultimately: </strong>Ease of consumption through reuse of Shared Vocabularies vs. Precisely/Correctly expressing your domain and data.</p><h3 id="when-and-how-to-trade-off">When and how to trade-off?</h3><p>In Omega our underlying principle is to always attempt reuse first. We are discovering that sometimes, however, there just isn't an appropriate Property or Class that can be reused from a popular standardised vocabulary.</p><p>By way of an example let me explain a use-case that we recently had to solve for Project Omega. TNA's Catalogue currently contains <code>Covering Dates</code> for each Unit of Description (a single document or folder). These <em>covering dates</em> are the period-of-time during which the record(s) being described were created. They are expressed using between 1 and 3 values: The <em>Date Text</em> (as it appears on the unit of description), the <em>First Date</em> (the start of the period), and the <em>Last Date</em> (the end of the period).</p><p>Originally we had decided to use a property from a common vocabulary to express these, Dublin Core Terms (perhaps you know it!). The property we initially selected was <code>dct:temporal</code>. As I interpret the DCT (Dublin Core Terms) standard, it appears to me that <code>dct:temporal</code> is intended to describe the <em>temporal coverage</em> of the resource, i.e. the time period discussed/indicated within the resource as opposed to the date that the resource was created. So after further consideration, we decided to use something else instead of <code>dct:temporal</code>, and this is where we had to start making trade-offs.</p><p>The options we considered:</p><ol><li>Use <code>dct:created</code> instead.<br>Unfortunately <code>dct:created</code> is a Data Type Property and so requires a literal value, yet we need to store 3 literal values (<em>Date Text</em>,<em> First Date</em>, and <em>Last Date</em>). To achieve this we could either:<br><br>a) Encode the 3 literal values into 1 literal value using <a href="https://www.iso.org/iso-8601-date-and-time-format.html">ISO 8601-1</a>, <a href="https://www.w3.org/TR/NOTE-datetime">W3CDTF</a>, <a href="http://www.loc.gov/standards/datetime/">EDTF</a>, or <a href="https://www.dublincore.org/specifications/dublin-core/dcmi-period/">DCMI Period</a>. This has the downside that querying this with SPARQL becomes complex and requires various string split operations. For example, encoding using DCMI Period might produce the single literal string value:<br><code>name=1941-1951; scheme=W3C-DTF; start=1941-01-01Z; end=1951-01-01Z</code>.<br><br>b) Ignore Dublin Core specifics here, and use an Object Property. 
We could somewhat enforce this approach with SHACL and documentation. However, those that are used to Dublin Core may be surprised; SPARQL queries for <code>dct:created</code> would be different in our system than in other systems. This negates the advantage of using a property from a shared standardised vocabulary!<br></li><li>Use <code>time:hasTime</code> instead.<br>This is a generic property from the <a href="https://www.w3.org/TR/owl-time/">W3C Time Ontology in OWL</a>. This is an Object Property that allows us to express our covering dates exactly as we would need. Unfortunately <code>time:hasTime</code> only tells us that there is a time, not what that time represents. It is too generic and fails to adequately describe that these are the created dates of the records; <code>dct:created</code> would have been much more precise!<br></li><li>Create our own vocabulary property.<br>We have two main options of how to approach this:<br><br>a) Define our own standalone property in our own vocabulary.<br><br>b) If there is a property from a common vocabulary that is close to what we need, we can define our own property which is derived from that.</li></ol><p>Whilst <code>dct:created</code> infers (to a human) the meaning that we are looking for, it doesn't allow us to store the information we need. The <code>time:hasTime</code> property is the opposite: it lacks a sufficiently precise meaning, but allows great flexibility in how we store our covering dates. Therefore, as there is no readily suitable property from a common vocabulary we have little choice but to create our own!</p><p>As the property <code>time:hasTime</code> allows us to store the data we need, but is lacking in sufficient descriptive power, rather than defining our own standalone property, we can instead derive our property from <code>time:hasTime</code> and add further descriptive information. Our new derived property will be <code>tna:created</code> and could look something like this:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-turtle">@prefix owl:     &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix rdfs:    &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix tna:     &lt;http://www.nationalarchives.gov.uk/&gt; .
@prefix dct:     &lt;http://purl.org/dc/terms/&gt; .
@prefix time:    &lt;http://www.w3.org/2006/time#&gt; .
@prefix rdae:    &lt;http://rdaregistry.info/Elements/e/&gt; .

tna:created
    a owl:ObjectProperty ;
    rdfs:subPropertyOf time:hasTime ;
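    # derived from time:hasTime, so consumers that understand the
    # W3C Time Ontology can still process this property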
    rdfs:label "Created date" ;
    rdfs:comment "The date that the resource was created, or the date-period
                  during which the resource was created. Historically at The
                  National Archives, this has also been known as the
                  'Covering Dates' (of the Unit of Description)."@en ;
    rdfs:seeAlso dct:created, rdae:P20214
  .</code></pre><figcaption>Definition of our derived property for Covering Dates</figcaption></figure><!--kg-card-end: code--><p>The above definition of <code>tna:created</code> declares it as a sub-property of <code>time:hasTime</code> but gives further information about its use, and also informs us that additional information can be found by looking at <code>dct:created</code> and <code>rdae:P20214</code>.</p><p>In practical use our earlier RDF augmented with our Covering Dates now finally looks something like:</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-turtle">@prefix tna:     &lt;http://www.nationalarchives.gov.uk/&gt; .
@prefix dct:     &lt;http://purl.org/dc/terms/&gt; .
@prefix time:    &lt;http://www.w3.org/2006/time#&gt; .
@prefix xsd:     &lt;http://www.w3.org/2001/XMLSchema#&gt; .

tna:res.FO.2020.3J.P.1
    dct:identifier "FO.2020.3J.P.1" ;
    dct:description "This is a Foreign Office record about..." ;
    tna:created [
        a       time:ProperInterval ;
        dct:description "1941-1951" ;
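        # "1941-1951" above is the original Date Text; the First Date and
        # Last Date become the beginning and end of the interval below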
        time:hasBeginning [
            a       time:Instant ;
            time:inXSDDate "1941-01-01Z"^^xsd:date
        ] ;
        time:hasEnd [
            a       time:Instant ;
            time:inXSDDate "1951-12-31Z"^^xsd:date
        ]
    ]
    .</code></pre><figcaption>Record with an Identifier, Description, and Created Dates</figcaption></figure><!--kg-card-end: code--><p>We still utilise <code>dct:identifier</code> and <code>dct:description</code> above, because they are a good fit for our reuse, whilst the new <code>tna:created</code> demonstrates our trade-off perfectly! </p><p>Earlier, I explained that we had wanted to reuse <code>dct:created</code> because it is a property that is widely used and understood, but that it was unsuitable for storing our covering dates (as it is specified as a Data Type Property).<br>As we could not find a suitable property from an existing popular vocabulary that we could reuse, we were forced to create our own. This property, <code>tna:created</code>, has two important design aspects:</p><ol><li>Its name is straightforward. Even someone from outside of the Archival sector can likely guess its meaning and purpose. It's unlikely that someone would have to look at our OWL definition or documentation to be able to <em>start </em>working with it. It's very much intentional that at a glance it looks a lot like <code>dct:created</code>.</li><li>Whilst this property is TNA specific and explicates a precise meaning, it does not stand alone. Instead, it reuses the W3C Time Ontology in OWL (a popular vocabulary itself) by virtue of being derived from the <code>time:hasTime</code> property.</li></ol><h3 id="conclusion">Conclusion</h3><p>For Project Omega, we still prefer reuse wherever possible as it enables easier consumption by others. Creating a new Property or Class (even if derived from a common vocabulary) is sometimes unavoidable, but should be considered as an absolute last resort and undertaken only when no common property exists, or said property fails to adequately describe the data.</p><p>Hopefully this article has provided you with some insight into the challenges that arise when strictly trying to reuse existing vocabularies, and the trade-offs that may have to be made.</p><!--kg-card-begin: hr--><hr><!--kg-card-end: hr--><p>In <a href="https://blog.adamretter.org.uk/vocabulary-reuse-part2/">Part 2</a>, I work through a second use-case from Project Omega, and show a further example of where vocabulary reuse can be challenging.</p>]]></content:encoded></item><item><title><![CDATA[Extreme Identifiers (for use in URIs)]]></title><description><![CDATA[<p>Recently I have been thinking about how Identifiers for things should be constructed. More specifically, as part of Project Omega for TNA (<a href="https://www.nationalarchives.gov.uk/">The National Archives</a>), I have been thinking about identifiers for resources in RDF. 
This blog post continues on from my previous posts: <a href="https://blog.adamretter.org.uk/archival-identifiers-for-digital-files/">Archival Identifiers for Digital Files</a>, and</p>]]></description><link>https://blog.adamretter.org.uk/extreme-identifiers-for-uri/</link><guid isPermaLink="false">5ebc082c076ded022a3cafcb</guid><category><![CDATA[Archive]]></category><category><![CDATA[RDF]]></category><category><![CDATA[URI]]></category><category><![CDATA[catalogue]]></category><category><![CDATA[nationalarchives]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Wed, 17 Jun 2020 09:38:38 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1521673252667-e05da380b252?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1521673252667-e05da380b252?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=1080&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Extreme Identifiers (for use in URIs)"><p>Recently I have been thinking about how Identifiers for things should be constructed. More specifically, as part of Project Omega for TNA (<a href="https://www.nationalarchives.gov.uk/">The National Archives</a>), I have been thinking about identifiers for resources in RDF. This blog post continues on from my previous posts: <a href="https://blog.adamretter.org.uk/archival-identifiers-for-digital-files/">Archival Identifiers for Digital Files</a>, and <a href="https://blog.adamretter.org.uk/archival-catalog-identifiers/">Archival Catalogue Record Identifiers</a>.</p><p>Whilst this article frames the content in the context of RDF and Archives, the principles are much more widely applicable. This blog post shows how to efficiently encode any positive numeric integer into a URI.</p><p>A resource in RDF is identified by a URI and that URI often consists of two major components, first a <em>base URI</em>, and then some sort of (domain specific) <em>local</em> <em>identifier</em>. For example:</p><!--kg-card-begin: markdown--><ul>
<li>
<p>A Base URI: <code>http://cat.nationalarchives.gov.uk</code></p>
</li>
<li>
<p>A local identifier: <code>Record12345</code></p>
</li>
<li>
<p>The final URI: <code>http://cat.nationalarchives.gov.uk/Record12345</code></p>
</li>
</ul>
<!--kg-card-end: markdown--><p>One of the types of resource that we need to describe in RDF is that of a <em>Digital File</em>; you can think of it simply as a file on your computer. It is the identifiers of those digital files that we will concern ourselves with in this article.</p><p>In my recent article - <a href="https://blog.adamretter.org.uk/archival-identifiers-for-digital-files">Archival Identifiers for Digital Files</a>, I covered two different identifier schemes for Digital File: <a href="https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_(random)">UUID</a> (Universally Unique Identifier), and ACID (Archival Content Identifier). One thing that bothered me was the length of the presentation representation (i.e. <a href="https://en.wikipedia.org/wiki/Hexadecimal">hexadecimal</a> encoding) of those identifiers.</p><p>If for a moment we ignore the <em>Hash Function Type</em> prefix of the ACID scheme, we can recognise that each scheme is really just generating a large positive integer! As hexadecimal has a numeric base of 16, i.e. there are only 16 symbols in its alphabet, we might likewise recognise that it is perhaps not the most efficient presentation representation.</p><p>Our Digital File identifier will ultimately form the local identifier component of our RDF resource URI. What would be the most efficient way to encode that Digital File identifier into a URI?</p><h3 id="encoding-a-number-into-a-uri-path">Encoding a Number into a URI Path</h3><p>If we wanted to encode any positive integer (i.e. <a href="https://en.wikipedia.org/wiki/Decimal">Base10</a>) into a URI path as efficiently as possible, we first need to create an alphabet that utilises every possible character that can be legally expressed within the path of a URI. Once we have that alphabet we can encode a Base10 number into a Base<em>N</em> number, where <em>N</em> is the number of characters in our new alphabet.</p><p>I took the following steps to create such an alphabet:</p><!--kg-card-begin: markdown--><ol>
<li>
<p>Start with all possible path characters for use in a URI from <a href="https://www.ietf.org/rfc/rfc2396.txt">RFC 2396</a>. These are defined as <code>pchar</code> in section 3.3 of the RFC.</p>
</li>
<li>
<p>Eliminate escaped characters by removing the <code>%</code> character. An escaped character is modelled in a URI by using the <code>%</code> character and then two hexadecimal characters. We are interested in using the fewest characters possible, so using 3 characters here to represent 1 character is the opposite of what we want to do!</p>
</li>
<li>
<p>(optional) Eliminate the English vowels - <code>A</code>, <code>E</code>, <code>I</code>, <code>O</code>, and <code>U</code>, and <code>a</code>, <code>e</code>, <code>i</code>, <code>o</code>, and <code>u</code>. We don't want to incidentally create meaningful words! This is a requirement for TNA; if they were to publish their RDF as Linked Data, then such encoded numbers would become publicly visible.</p>
</li>
<li>
<p>Sort the remaining characters byte-wise by their UTF-8 numeric value. (These steps are sketched in code just after this list.)</p>
</li>
</ol>
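<p>To make the steps concrete, here is a minimal sketch of the construction, assuming we write out the RFC 2396 <code>pchar</code> set explicitly; this is illustrative only, not the tooling used for Project Omega:</p>
<pre><code class="language-scala">// Step 1: the RFC 2396 pchar set, written out explicitly
val alphanum: Seq[Char] = ('0' to '9') ++ ('A' to 'Z') ++ ('a' to 'z')
// the RFC 2396 "mark" characters, all legal in a URI path
val mark: Seq[Char] = Seq('-', '_', '.', '!', '~', '*', '\'', '(', ')')
// the extra pchar characters; "escaped" (i.e. '%') is already excluded (Step 2)
val extra: Seq[Char] = Seq(':', '@', '&amp;', '=', '+', '$', ',')

val vowels: Set[Char] = "AEIOUaeiou".toSet  // Step 3 (optional)

val alphabet: Seq[Char] =
  (alphanum ++ mark ++ extra)
    .filterNot(vowels.contains)  // Step 3: 78 - 10 vowels = 68 characters
    .sorted                      // Step 4: byte-wise (ASCII) order
</code></pre>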
<!--kg-card-end: markdown--><p>Following these steps yields an alphabet of 78 characters, or if you perform the optional Step 3, then 68 characters.</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>Numeric Value</th>
<th>Encoded Symbol</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>!</td>
</tr>
<tr>
<td>1</td>
<td>$</td>
</tr>
<tr>
<td>2</td>
<td>&amp;</td>
</tr>
<tr>
<td>3</td>
<td>'</td>
</tr>
<tr>
<td>4</td>
<td>(</td>
</tr>
<tr>
<td>5</td>
<td>)</td>
</tr>
<tr>
<td>6</td>
<td>*</td>
</tr>
<tr>
<td>7</td>
<td>+</td>
</tr>
<tr>
<td>8</td>
<td>,</td>
</tr>
<tr>
<td>9</td>
<td>-</td>
</tr>
<tr>
<td>10</td>
<td>.</td>
</tr>
<tr>
<td>11</td>
<td>0</td>
</tr>
<tr>
<td>12</td>
<td>1</td>
</tr>
<tr>
<td>13</td>
<td>2</td>
</tr>
<tr>
<td>14</td>
<td>3</td>
</tr>
<tr>
<td>15</td>
<td>4</td>
</tr>
<tr>
<td>16</td>
<td>5</td>
</tr>
<tr>
<td>17</td>
<td>6</td>
</tr>
<tr>
<td>18</td>
<td>7</td>
</tr>
<tr>
<td>19</td>
<td>8</td>
</tr>
<tr>
<td>20</td>
<td>9</td>
</tr>
<tr>
<td>21</td>
<td>:</td>
</tr>
<tr>
<td>22</td>
<td>=</td>
</tr>
<tr>
<td>23</td>
<td>@</td>
</tr>
<tr>
<td>24</td>
<td>B</td>
</tr>
<tr>
<td>25</td>
<td>C</td>
</tr>
<tr>
<td>26</td>
<td>D</td>
</tr>
<tr>
<td>27</td>
<td>F</td>
</tr>
<tr>
<td>28</td>
<td>G</td>
</tr>
<tr>
<td>29</td>
<td>H</td>
</tr>
<tr>
<td>30</td>
<td>J</td>
</tr>
<tr>
<td>31</td>
<td>K</td>
</tr>
<tr>
<td>32</td>
<td>L</td>
</tr>
<tr>
<td>33</td>
<td>M</td>
</tr>
<tr>
<td>34</td>
<td>N</td>
</tr>
<tr>
<td>35</td>
<td>P</td>
</tr>
<tr>
<td>36</td>
<td>Q</td>
</tr>
<tr>
<td>37</td>
<td>R</td>
</tr>
<tr>
<td>38</td>
<td>S</td>
</tr>
<tr>
<td>39</td>
<td>T</td>
</tr>
<tr>
<td>40</td>
<td>V</td>
</tr>
<tr>
<td>41</td>
<td>W</td>
</tr>
<tr>
<td>42</td>
<td>X</td>
</tr>
<tr>
<td>43</td>
<td>Y</td>
</tr>
<tr>
<td>44</td>
<td>Z</td>
</tr>
<tr>
<td>45</td>
<td>_</td>
</tr>
<tr>
<td>46</td>
<td>b</td>
</tr>
<tr>
<td>47</td>
<td>c</td>
</tr>
<tr>
<td>48</td>
<td>d</td>
</tr>
<tr>
<td>49</td>
<td>f</td>
</tr>
<tr>
<td>50</td>
<td>g</td>
</tr>
<tr>
<td>51</td>
<td>h</td>
</tr>
<tr>
<td>52</td>
<td>j</td>
</tr>
<tr>
<td>53</td>
<td>k</td>
</tr>
<tr>
<td>54</td>
<td>l</td>
</tr>
<tr>
<td>55</td>
<td>m</td>
</tr>
<tr>
<td>56</td>
<td>n</td>
</tr>
<tr>
<td>57</td>
<td>p</td>
</tr>
<tr>
<td>58</td>
<td>q</td>
</tr>
<tr>
<td>59</td>
<td>r</td>
</tr>
<tr>
<td>60</td>
<td>s</td>
</tr>
<tr>
<td>61</td>
<td>t</td>
</tr>
<tr>
<td>62</td>
<td>v</td>
</tr>
<tr>
<td>63</td>
<td>w</td>
</tr>
<tr>
<td>64</td>
<td>x</td>
</tr>
<tr>
<td>65</td>
<td>y</td>
</tr>
<tr>
<td>66</td>
<td>z</td>
</tr>
<tr>
<td>67</td>
<td>~</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Our 68 characters can yield a Base68 representation, which is much more compact than a Base16 (hexadecimal) representation.</p><p>Encoding from Base10 to any other base is performed by a common recursive mathematical function. How this is achieved is not particularly relevant, but for both completeness and those readers that are interested, I have written an expression of it for encoding to Base68 using the Scala programming language:</p><!--kg-card-begin: code--><pre><code class="language-scala">/**
 * Encodes a Base10 positive integer to a Base68 String.
 *
 * @param value a positive integer
 * @return the encoded representation
 */
def encode(value: BigInt): String = {

  val b68Alphabet = Seq(
    '!', '$', '&amp;', '\'', '(', ')', '*', '+', ',', '-', '.',
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':',
    '=', '@', 'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L',
    'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y',
    'Z', '_', 'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l',
    'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y',
    'z', '~'
  )
  val len = b68Alphabet.length

  // fully qualified so that no import of scala.annotation.tailrec is needed
  @scala.annotation.tailrec
  def encode(v: BigInt, accum: List[Char]): String = {
    if(v == 0 &amp;&amp; accum.nonEmpty) {
      // no digits remain, so we are done
      accum.mkString
    } else if(v &lt;= 1) {
      // emit the most-significant digit (also handles encoding zero itself)
      (b68Alphabet(v.toInt) :: accum).mkString
    } else {
      // emit the least-significant Base68 digit, then recurse on the remainder
      val div = v / len
      val mod = v % len
      encode(div, b68Alphabet(mod.toInt) :: accum)
    }
  }

  encode(value, List.empty[Char])
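  // e.g. for the UUID example given below:
  //   encode(BigInt("287746559179145117594110380901673968184")) == "xDZTz4*0-L0+S5V@4wFZB"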
}</code></pre><!--kg-card-end: code--><p><br><strong>Examples</strong></p><p>Let's look at some examples of what these encoded identifiers look like.<br></p><ol><li>A Version 4 UUID in its default hexadecimal representation: <code>d879f8b2-5f67-495d-8796-5ce5b06ba238</code>, consists of 36 characters and is equivalent to the Base10 number: <code>287746559179145117594110380901673968184</code>.<br><br>If we instead encode the Base10 number into our Base68 alphabet this yields the string: <code>xDZTz4*0-L0+S5V@4wFZB</code>, which is just 21 characters in length. Compared to the default hexadecimal representation, this is a saving of 15 characters, i.e. ~42%.<br></li><li>An ACID in its default hexadecimal representation: <code>!3cbae8f16217ad44981e5843100092cd582202e69d452eb094480f2d24abdb49</code>, after dropping the <code>!</code> character (for the <em>Hash Function Type</em> prefix), consists of 64 characters and is equivalent to the Base10 number: <code>27469012181874709647382529974656231352255408628267256258746051793118097562441</code>.<br><br>If we instead encode the Base10 number into our Base68 alphabet this yields the string: <code>94TTsZ-tsvNkZzcM2jWXYCy,ym4d1XZ8N7).8:N9v6</code>, which is just 42 characters in length. Compared to the default hexadecimal representation, this is a saving of 22 characters, i.e. ~34%.<br></li></ol><p>Our Digital File resource URI would now look like:</p><!--kg-card-begin: markdown--><ol>
<li>A UUID based identifier for a Digital File using Base68:</li>
</ol>
<pre><code>http://cat.nationalarchives.gov.uk/xDZTz4*0-L0+S5V@4wFZB
</code></pre>
<ol start="2">
<li>An ACID based identifier for a Digital File using Base68 (with the <em>Hash Function Type</em> prefix reinstated):</li>
</ol>
<pre><code>http://cat.nationalarchives.gov.uk/!94TTsZ-tsvNkZzcM2jWXYCy,ym4d1XZ8N7).8:N9v6
</code></pre>
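<p>For illustration, here is a hypothetical helper (the name and shape are mine, not part of any published tooling) showing how Example (1) above can be reproduced: strip the hyphens from the UUID, parse the remaining hexadecimal digits as one large positive integer, and then Base68-encode it with the earlier function:</p>
<pre><code class="language-scala">def uuidToUri(baseUri: String, uuid: java.util.UUID): String = {
  val hex = uuid.toString.replace("-", "")  // 32 hexadecimal characters
  val number = BigInt(hex, 16)              // the UUID as a 128-bit positive integer
  baseUri + encode(number)                  // append the Base68 local identifier
}

// uuidToUri("http://cat.nationalarchives.gov.uk/",
//           java.util.UUID.fromString("d879f8b2-5f67-495d-8796-5ce5b06ba238"))
// == "http://cat.nationalarchives.gov.uk/xDZTz4*0-L0+S5V@4wFZB"
</code></pre>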
<!--kg-card-end: markdown--><p><br>Do these Base68 encoded identifiers look ugly to the human eye? Absolutely!<br>But... I did not design them for human-use or even to be humane! ;-) Instead, this encoding scheme is designed to fit the identifier for a Digital File (a number) into a URI as succinctly as possible, whilst still yielding a valid URI. I expect these URIs to be used by machines and not humans!</p><p>Is this a good solution? It certainly meets the requirements I set out, and it feels computationally neat. In practice, it likely needs further thought and experimentation. Will these identifiers prove too cumbersome for developers writing SPARQL queries against our data? Possibly!</p><p><br><strong>Decoding from Base68</strong></p><p>Finally, decoding from Base<em>N</em> to Base10 is performed by a very simple table lookup against our alphabet. Like encoding, how this is achieved is not particularly relevant, but again for both completeness and those readers that are interested, I have included a decoder from Base68 using the Scala programming language:</p><!--kg-card-begin: code--><pre><code class="language-scala">/**
 * Decodes a Base68 string to a Base10 positive integer.
 *
 * @param str the encoded representation
 * @return a positive integer
 */
def decode(str: String): BigInt = {

  val b68Alphabet = Seq(
    '!', '$', '&amp;', '\'', '(', ')', '*', '+', ',', '-', '.',
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':',
    '=', '@', 'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L',
    'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y',
    'Z', '_', 'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l',
    'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y',
    'z', '~'
  )
  val len = b68Alphabet.length

  // look up each character's numeric value by its position in the alphabet
  val indices = str.map(b68Alphabet.indexOf(_))
  // multiply each digit by 68 raised to the power of its position
  val vs: Seq[BigInt] = for(i &lt;- 0 until indices.length) yield {
    val exp = (indices.length - i) - 1
    indices(i) * BigInt(len).pow(exp)
  }
  vs.reduceLeft(_ + _)
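  // e.g. decode("xDZTz4*0-L0+S5V@4wFZB") == BigInt("287746559179145117594110380901673968184"),
  // i.e. decode is the inverse of the encode function above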
}</code></pre><!--kg-card-end: code--><p><strong>Open Source</strong></p><p>More complete encoders and decoders can be found on GitHub implemented in both Scala: <a href="https://github.com/nationalarchives/oci-tools-scala">oci-tools-scala</a>, and TypeScript: <a href="https://github.com/nationalarchives/oci-tools-ts">oci-tools-typescript</a>.</p>]]></content:encoded></item><item><title><![CDATA[Archival Identifiers for Digital Files]]></title><description><![CDATA[When considering development of a new archival catalogue that can describe both physical, digitised, and born digital records, we quickly realised that unlike its predecessors this catalogue will also need to describe digital files. This blog post looks at using Content Identifiers for Digital File.]]></description><link>https://blog.adamretter.org.uk/archival-identifiers-for-digital-files/</link><guid isPermaLink="false">5ebd05cc076ded022a3cb096</guid><category><![CDATA[nationalarchives]]></category><category><![CDATA[catalogue]]></category><category><![CDATA[digital preservation]]></category><category><![CDATA[Persistent Identifiers]]></category><category><![CDATA[Records Management]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Tue, 09 Jun 2020 18:36:16 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1550221927-f7e52256370b?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1550221927-f7e52256370b?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=1080&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Archival Identifiers for Digital Files"><p>As part of Project Omega for TNA (<a href="https://www.nationalarchives.gov.uk/">The National Archives</a>), I have been thinking about how identifiers for Digital Files should be constructed. This blog entry continues on from my previous entry: <a href="https://blog.adamretter.org.uk/archival-catalog-identifiers/">Archival Catalogue Record Identifiers</a>.</p><p>When considering development of a new archival catalogue that can describe both physical, digitised, and born digital records, we quickly realised that unlike its predecessors this catalogue will also need to describe digital files.</p><p>At this point you might think that I am mixing current concerns between what archives have often thought of as two separate systems, 1) their Archival catalogue, and 2) their Digital Preservation system. Yes, I am, and intentionally so! However, I would argue that this soup has been cooking for some time; I have seen that until now digital preservation systems have had to include some aspect of cataloguing (for their digital records) as the traditional archival catalogues that were already in place were ill-equipped to describe the new digital world. I believe that a clean and mutually-beneficial separation between cataloguing and (digital) preservation activities can be established, but that as practitioners we are still very much writing the book on digital preservation.</p><p>Anyway, I digress! The archival concept of a Digital File is a complex one; as archivists we have to ask difficult questions like:</p><!--kg-card-begin: markdown--><ul>
<li>What is a digital file?</li>
<li>How do I describe a digital file?</li>
<li>Is a copy of a file the same digital file?</li>
<li>If I change the name of the file, is it still the same digital file?</li>
</ul>
<!--kg-card-end: markdown--><p>All of these things have to be considered when designing a scheme for local identifiers of Digital File. Without writing an extended article on various principles of digital preservation, it is perhaps enough to say that the file's path and/or name are not suitable for use as an identifier; in no small part due to both their transient nature and their inability to be combined with files from other systems, which may give rise to naming conflicts.</p><h3 id="the-current-approach">The Current Approach</h3><p>To date the predominant approach in digital preservation for generating identifiers for digital files has been to simply assign them a <a href="https://en.wikipedia.org/wiki/Universally_unique_identifier">UUID</a> (Universally Unique Identifier); more specifically a Version 4 UUID. This approach has several nice properties:</p><!--kg-card-begin: markdown--><ul>
<li>
<p>These can be generated independently of each other.</p>
<p>You can just <em>magic</em> a UUID into existence without concern for other UUIDs that have gone before or come after it.</p>
<p>The chance of a collision is <a href="https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_(random)#Collisions">incredibly small</a> - &quot;<em>the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion</em>&quot;.</p>
</li>
<li>
<p>They are relatively compact and presentable.</p>
<p>A UUID is just a 128-bit positive integer. This is typically formatted for presentation as a hexadecimal string of five components, totaling 36 printable characters, although they are not very human friendly.</p>
</li>
<li>
<p>They are cheap to compute.</p>
<p>On a modern laptop we can easily generate over 500,000 every second (see the one-liner just after this list).</p>
</li>
</ul>
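<p>As a trivial illustration of just how easily a UUID can be <em>magicked</em> into existence, on the JVM it is a one-liner:</p>
<pre><code class="language-scala">// mint a random (Version 4) UUID, e.g. "d879f8b2-5f67-495d-8796-5ce5b06ba238"
val id: java.util.UUID = java.util.UUID.randomUUID()
</code></pre>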
<!--kg-card-end: markdown--><h3 id="a-new-approach-content-identifiers"><br>A New Approach - Content Identifiers</h3><p>As an alternative to UUIDs, I am proposing a new approach for generating an identifier for Digital File which is computed from the content of the file itself.</p><p>I should be clear that this is not some stroke of genius on my part, similar approaches are already widely used in other domains. For example, the <a href="https://git-scm.com/">Git</a> SCM (Source Code Management) uses SHA1 digests to identify files and changes. Likewise, the <a href="https://en.wikipedia.org/wiki/InterPlanetary_File_System">IPFS</a> (InterPlanetary File System) uses <a href="https://docs.ipfs.io/guides/concepts/cid/">its definition of a CID</a> (Content Identifier), which is a hash function's digest of a file's content to address that file.</p><p>To avoid any confusion between IPFS CID's and our "Content Identifiers", I will herein use the abbreviation ACID (Archival Content Identifier) to refer to my proposal for identifiers.</p><p>The main part of an ACID is generated by computing the digest of the byte-stream (i.e. content) of the digital file via a hash function. This raises the question, of which hash function should be used? There is a wealth of <a href="https://en.wikipedia.org/wiki/List_of_hash_functions">different hash algorithms</a> available with various properties and different trade-offs. That being said, I am going to suggest that we use a <a href="https://blake2.net/">BLAKE2b</a>-256 hash for the following reasons:</p><!--kg-card-begin: markdown--><ul>
<li>Based on BLAKE, a finalist in NIST's SHA-3 competition.</li>
<li>Likelihood of collision is incredibly small.</li>
<li>Much faster to generate than equivalents such as SHA-256.</li>
<li>At least as secure as SHA-3.</li>
</ul>
<!--kg-card-end: markdown--><p>For example, if we wanted to generate a BLAKE2b-256 hash digest for the Apache 2.0 License file, we could run:</p><!--kg-card-begin: code--><pre><code class="language-bash">curl https://www.apache.org/licenses/LICENSE-2.0.txt | b2sum --length 256 --binary</code></pre><!--kg-card-end: code--><p>This yields a 256-bit number encoded into a hexadecimal string totaling 64 printable characters:</p><!--kg-card-begin: code--><pre><code>3cbae8f16217ad44981e5843100092cd582202e69d452eb094480f2d24abdb49</code></pre><!--kg-card-end: code--><p>This hexadecimal string has some interesting properties:</p><!--kg-card-begin: markdown--><ul>
<li>
<p>It can be used as an identifier for the Digital File.</p>
</li>
<li>
<p>Verifiable Descriptions.</p>
<p>Provided with both, 1) the description and identifier of a digital file, and 2) the file itself, we can verify that the description is indeed about the file by re-computing the hash digest of the file and comparing the result with the digital file identifier (see the sketch just after this list).</p>
</li>
<li>
<p>Verifiable Preservation.</p>
<p>Similarly to above, if the hash digest of the file changes over time, then we can assert that there has been an issue with its preservation, e.g. <a href="https://en.wikipedia.org/wiki/Data_degradation">data-rot</a>.</p>
</li>
</ul>
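<p>A minimal sketch of such a verification check, assuming the Bouncy Castle library (<code>org.bouncycastle:bcprov</code>) for its BLAKE2b implementation; this is illustrative only and not TNA tooling:</p>
<pre><code class="language-scala">import org.bouncycastle.crypto.digests.Blake2bDigest

// compute the BLAKE2b-256 digest of a file's bytes as lowercase hexadecimal
def blake2b256Hex(bytes: Array[Byte]): String = {
  val digest = new Blake2bDigest(256)  // 256-bit output
  digest.update(bytes, 0, bytes.length)
  val out = new Array[Byte](digest.getDigestSize)
  digest.doFinal(out, 0)
  out.map("%02x".format(_)).mkString
}

// true if the stored identifier still matches the file's content, i.e. the
// description is about this file and no data-rot has occurred
def verify(identifier: String, fileBytes: Array[Byte]): Boolean =
  identifier == blake2b256Hex(fileBytes)
</code></pre>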
<!--kg-card-end: markdown--><p>There are some down-sides to using a hash digest as opposed to a UUID:</p><!--kg-card-begin: markdown--><ul>
<li>
<p>More expensive to compute.</p>
<p>A hash digest is much more expensive to compute than a UUID, and the larger the file being digested the more expensive it becomes.</p>
</li>
<li>
<p>Less compact.</p>
<p>Our 256-bit hash generates a result which is twice as long as a UUID.</p>
</li>
</ul>
<!--kg-card-end: markdown--><p>I believe that the down-sides of a hash digest are outweighed by its advantage of offering verifiability.</p><p><strong>Which Hash Function was it?</strong></p><p>For the purposes of preservation and interoperability, one thing that we have not yet considered is how one determines which hash function was used to generate an identifier. Sure, I said we would use BLAKE2b-256, but what if you want to use a different hash function? Also, from a digital archaeological perspective, given an identifier like:</p><!--kg-card-begin: code--><pre><code>3cbae8f16217ad44981e5843100092cd582202e69d452eb094480f2d24abdb49</code></pre><!--kg-card-end: code--><p>You might be able to infer that it is a hash digest, and the selection of characters used and the number of them would indicate that it could be a 256-bit hash... but which hash function was used?</p><p>Ideally, we need a mechanism to also communicate the hash function that was used. In fact IPFS already thought about this, and they use an encoding called <a href="https://multiformats.io/multihash/">Multihash</a> which prefixes their CIDs with a code indicating the hash function used. Whilst we could adopt Multihash here, it's much more complex than we need (famous last words?!?). Instead, I propose that our ACIDs have a single ASCII character at the start that indicates the hash function that was used. A single ASCII character has the advantage of a fixed-length numeric encoding, and it makes the number of characters in the hexadecimal string representation an odd number, thus providing a hint to a digital archaeologist that perhaps this ACID is similar to a digest but with an extra character. I will go one step further and say that this character should be outside of the hexadecimal alphabet (even ignoring case); this should make it glaringly obvious to such a digital archaeologist that the prefix character has a meaning which is distinct from the rest of the string.</p><p>An ACID is then formatted from a template like this:</p><!--kg-card-begin: code--><pre><code>{Hash Function Type}{Hash Digest}</code></pre><!--kg-card-end: code--><p>For the Hash Function Type, I am going to reserve the <code>!</code> character to indicate BLAKE2b-256. Why? Because, I think it looks cool! This would mean that our earlier digital file identifier now simply becomes:</p><!--kg-card-begin: code--><pre><code>!3cbae8f16217ad44981e5843100092cd582202e69d452eb094480f2d24abdb49</code></pre><!--kg-card-end: code--><p><br><strong>What about Collisions?</strong></p><p>Sure, generating a digital file identifier with BLAKE2b-256 has a very small chance of generating a collision (i.e. two different files with the same identifier), but what if...?</p><p>If you detect a collision, I will build you a new digital archive system for free... Nope! Just joking! We actually already have a mechanism for coping with this, the Hash Function Type; for the new file which creates the collision, you could switch to a different hash function, perhaps a 512-bit one! This would at least give you a different unique identifier. But... what to do about the original file which is on the other side of the collision; it's probably deeply embedded in your archive by now! You could re-catalogue it, but maybe you don't even need to???<br></p><p><br>I have in mind the idea to write another article about further encoding such ACIDs for compact machine use. 
Okay… that's enough for today!</p>]]></content:encoded></item><item><title><![CDATA[Archival Catalogue Record Identifiers]]></title><description><![CDATA[For The National Archives, we examine existing schemes for identifying Archival Catalogue Records. We then propose a new modern scheme suitable for use by both humans and URI for Linked Data: Omega Catalogue Identifiers.]]></description><link>https://blog.adamretter.org.uk/archival-catalog-identifiers/</link><guid isPermaLink="false">5e99be634e02800232715fe2</guid><category><![CDATA[The National Archives]]></category><category><![CDATA[Persistent Identifiers]]></category><category><![CDATA[URI]]></category><category><![CDATA[RDF]]></category><category><![CDATA[Records Management]]></category><category><![CDATA[Archive]]></category><category><![CDATA[nationalarchives]]></category><category><![CDATA[OCI]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Wed, 03 Jun 2020 10:48:08 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1516496636080-14fb876e029d?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1516496636080-14fb876e029d?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=1080&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Archival Catalogue Record Identifiers"><p>As <a href="https://blog.adamretter.org.uk/rdf-plugins-for-pentaho-kettle/">previously described</a>, in Project Omega at TNA (<a href="https://www.nationalarchives.gov.uk">The National Archives</a>), we will be using a Graph Model to hold all of the catalogue information, specifically an RDF model.</p><p>One of the key things in RDF is that every resource has a URI. In the Omega Catalogue we will have a plethora of different types of resources - (Archival Records, Locations, People, Organisations, etc.). Every one of these resources will need a URI. For our purposes these URI are composed of two parts, a <em>base</em> and an <em>identifier</em>.</p><p>In Omega we will be using a flat addressing scheme (i.e. no sub-folders/paths in the URI), and so our <em>base</em> is fixed and could be something like either: <br><code>http://cat.nationalarchives.gov.uk/</code></p><p>or </p><p><code>http://cat.nationalarchives.gov.uk#</code>.</p><p>Determining which is best to use depends on which approach of the <a href="https://www.w3.org/wiki/HashVsSlash">HashVsSlash</a> schemes is most advantageous to your application. I have decided that in Omega we will be using the Slash scheme, i.e. <code>http://cat.nationalarchives.gov.uk/</code>. The Slash scheme has the advantage that each resource can be served as its own separate document. As our catalogue will be very large, the single document approach as used by the Hash scheme would be unwieldy.</p><p>Now, the interesting part is the <em>identifier</em>! Every resource in our RDF graph needs a URI and therefore an identifier; in this article I will focus solely on identifiers for Archival Records.</p><h2 id="requirements-for-a-good-identifier">Requirements for a Good Identifier</h2><p>When choosing identifiers for our resources, there are some properties that they <em>must</em>/<em>should</em>/<em>could</em> exhibit:</p><!--kg-card-begin: markdown--><ol>
<li>
<p><em>Must</em> be Unique (within our <em>domain</em>).<br>
We can't have one identifier identify more than one resource without breaking our RDF model.</p>
</li>
<li>
<p><em>Must</em> be Persistent.<br>
We don't want our resources disappearing and/or reappearing with different URI over time; otherwise we end up with broken links. Therefore the identifier must be immutable, meaning that we must exclude any changeable properties of a resource from use within its identifier. This is also for archival purposes, as ideally we don't want to have to retrieve records from potentially distant locations or media to modify their identifiers.</p>
</li>
<li>
<p><em>Must</em> be Computable.<br>
Whatever form the identifier takes it must be computationally valid for use within a URI. Ideally it should be possible to generate such an identifier computationally without requiring a manual (human) registration/validation process.</p>
</li>
<li>
<p><em>Must</em> be Uniform.<br>
By ensuring that every identifier follows a prescribed format and length, it is easy to validate what is an identifier and what is not; that said, a validly formatted identifier does not necessarily lead to an actual resource.</p>
</li>
<li>
<p><em>Should</em> be Humane.<br>
The <em>identifier</em> when considered as part of a larger URI is often used by humans, perhaps within SPARQL statements that they construct to query the data, or by de-referencing such URI via the Web.<br>
Additionally the <em>identifier</em> as a stand-alone element may have value in itself and could conceivably be used by humans to communicate about resources; for example, a visitor to TNA might ask to see the record with identifier X.</p>
<p>Consideration should be given to making the identifiers communicable by humans, which implies that there are additional desirable properties, such as:</p>
<ol>
<li>
<p><em>Should</em> be succinct.<br>
Typically humans are better at accurately communicating short sequences of data.</p>
</li>
<li>
<p><em>Should</em> be easy to transcribe.<br>
Human transcription errors can be reduced by using a commonly known alphabet. <a href="https://en.wikipedia.org/wiki/Latin_alphabet">The Latin or Roman alphabet</a> would seem sensible for an archive based in the UK. This would suggest excluding any non-alphanumeric characters from the identifier.</p>
</li>
<li>
<p><em>Should</em> be easy to verbalise.<br>
Records of TNA are often discussed or requested verbally. For example, collaboration between staff members, or by a member of the public telephoning or making an enquiry face-to-face on site.</p>
</li>
</ol>
</li>
<li>
<p><em>Could</em> Convey Knowledge.<br>
If the identifier is able to convey some knowledge about the resource that can be interpreted by machines and/or humans, we gain the advantage of being able to determine certain facts about the resource just from its identifier. This has to be carefully balanced with (2).</p>
</li>
</ol>
<!--kg-card-end: markdown--><p>There are two interesting articles from the W3C about designing URI schemes both for the Web and RDF that may be of further interest to the reader:</p><ul><li><a href="https://www.w3.org/Provider/Style/URI">Cool URIs don't change</a></li><li><a href="https://www.w3.org/TR/cooluris/">Cool URIs for the Semantic Web</a></li></ul><h2 id="existing-identifiers-for-archival-records">Existing Identifiers for Archival Records</h2><p>Those of you already familiar with record keeping may of course be thinking that TNA must have an existing identifier scheme for its records, and you would be right. In fact TNA has several different schemes in use today for identifying its records. For brevity's sake I will discount those used within various internal systems, and focus on what the general public tend to see.</p><p>The predominant identifier used by TNA for its records that the general public (and most staff) see and work with is simply known as a "<em>Catalogue Reference</em>". In actuality there are two different identifier schemes in use today, and a Catalogue Reference may be expressed in one or the other scheme. The schemes are:</p><!--kg-card-begin: markdown--><ol>
<li>
<p>CCR (Classic Catalogue Reference)<br>
Before the advent of GCRs, this was simply known as &quot;The Catalogue Reference&quot; and was the de-facto identifier for any record catalogued by TNA. It was developed before the advent of digital records.</p>
</li>
<li>
<p>GCR (Generated Catalogue Reference)<br>
I developed this scheme for TNA in 2012 to allow computational generation of identifiers for digital records.</p>
</li>
</ol>
<!--kg-card-end: markdown--><p>At present, TNA uses CCRs for physical (think paper) records, and GCRs for <a href="https://www.nationalarchives.gov.uk/information-management/manage-information/digital-records-transfer/what-are-born-digital-records/">Born-Digital</a> records. CCRs were previously used for Digitised records, but GCRs are now starting to be used for those too.</p><p>We will briefly take a look at each existing identifier scheme, and show that unfortunately both have properties which make them unsuitable for use as identifiers within URI.</p><h3 id="ccr-classic-catalogue-references-">CCR (Classic Catalogue References)</h3><p>To understand CCRs, you need to first understand a little bit about how archival records are arranged. Ultimately it all comes down to the principle of <a href="https://en.wikipedia.org/wiki/Respect_des_fonds"><em>Respect des fonds</em></a>; in the simplest of terms, we must respect the arrangement of the records as defined by their creator. In more concrete terms TNA uses an internal standard known simply as TNA-CS13 (The National Archives - Cataloguing Standards 2013) which is itself derived from <a href="https://www.ica.org/en/isadg-general-international-standard-archival-description-second-edition">ISAD(G)</a> (General International Standard Archival Description).</p><p>TNA-CS13 basically stipulates that each record is arranged according to a mono-hierarchical structure; that structure may have between 3 and 7 levels. These levels are known by the names (from top-to-bottom): Department, Division, Series, Sub-series, Sub-sub-series, Piece, and Item; you can read more about them in the article <a href="https://www.nationalarchives.gov.uk/help-with-your-research/citing-records-national-archives/#section3">Citing records in The National Archives</a>.</p><p>A CCR identifier basically encodes all the references used for 3 or 4 levels of the record's arrangement. The CCR scheme has one of two forms; for records catalogued to Piece level it is:</p><!--kg-card-begin: code--><pre><code>{Department Reference} {Series Reference}/{Piece Reference}
</code></pre><!--kg-card-end: code--><p>For records catalogued to Item level the scheme is:</p><!--kg-card-begin: code--><pre><code>{Department Reference} {Series Reference}/{Piece Reference}/{Item Reference}
</code></pre><!--kg-card-end: code--><p>Here are five examples of valid CCRs that are in use:</p><ol><li><a href="https://discovery.nationalarchives.gov.uk/details/r/C658130"><code>MH 55/2713</code></a></li><li><a href="https://discovery.nationalarchives.gov.uk/details/r/C11978467"><code>AIR 79/1064/118667</code></a></li><li><a href="https://discovery.nationalarchives.gov.uk/details/r/C5005305"><code>E 317/Devon/1</code></a></li><li><a href="https://discovery.nationalarchives.gov.uk/details/r/C4090372"><code>AIR 1/1983/204/273/89-M-N-O-P-R-S</code></a></li><li><a href="https://discovery.nationalarchives.gov.uk/details/r/C7655609"><code>T 1/440/15,65-66107-113,116-142,166,180-189,etc</code></a></li></ol><p>From what I have explained so far, hopefully you have recognised that Example (1) is a CCR for a Piece, whereas Example (2) is a CCR for an Item.</p><p><br>Now, I would not blame you for thinking that Example (3) is also an Item; however, you would be mistaken! Unfortunately whilst the <code>/</code> character is used as a separator between the Series, Piece, and Item references, at some point in the past it was also introduced as a valid character within the Piece and Item references themselves. We will cover the reason for that decision shortly, but for now we can safely assert that it causes problems: As a human I can no longer visually determine whether the CCR refers to a Piece or an Item, and perhaps worse yet, if I try and parse the identifier using a software program I get an ambiguous result. Sadly the intention for a CCR to carry information that is meant to be helpful to understanding the record has not held up well; instead, many CCRs are ambiguous, which may lead to confusion, and ultimately means that the arrangement of the record cannot be known without going into the physical archival stacks and retrieving it.<br>I chose Example (4) and Example (5) to further illustrate the non-uniformity of CCRs; they refer to a Piece and an Item respectively.</p><p>We can identify several issues that make CCRs unsuitable for our identifier needs in Omega:</p><!--kg-card-begin: markdown--><ol>
<li>
<p>A CCR may be ambiguous and therefore does not meet our requirement for unique identifiers (this ambiguity is illustrated in the sketch just after this list).</p>
</li>
<li>
<p>A CCR encodes the arrangement of the record, whilst one would hope the arrangement is fixed at the time of accession, the reality is that mistakes can be made and from there the record may need to be re-catalogued which could also involve a change to its CCR. Therefore, CCRs do not meet our requirement for persistent identifiers.</p>
</li>
<li>
<p>Each CCR is allocated and registered manually by an archivist whereas we would need to be able to compute such identifiers within Omega. Additionally their ambiguity and non-uniformity means that they cannot be computationally validated.</p>
</li>
<li>
<p>For CCRs with Piece and Item identifiers containing non-alphanumeric characters (e.g. <code>/</code>, or <code>,</code>), such characters would require <a href="https://en.wikipedia.org/wiki/Percent-encoding">URI Encoding</a> to be able to use the CCR as part of a URI. Unfortunately URI Encoding is non-intuitive to humans.</p>
</li>
<li>
<p>The '/' character was introduced into the Piece and Item references within a CCR to allow the archivist to <em>hint</em> at further levels of arrangement which were prohibited by TNA-CS13. A goal of Project Omega is to provide a catalogue that works for any record regardless of its medium (e.g. physical or digital), one known axiom of digital records preservation is that such records have far more complex arrangements than their paper counterparts, often requiring arbitrarily deep levels of hierarchy or poly-hierarchical arrangement. For this reason the encoding of level identifiers into CCRs will not scale for digital records, and was in fact one of the drivers for creating the GCR scheme.</p>
</li>
</ol>
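<p>To illustrate issue (1), here is a small sketch (purely illustrative, the structure names are mine) of why parsing a CCR is ambiguous once <code>/</code> may appear inside a Piece or Item reference:</p>
<pre><code class="language-scala">// splitting on '/' cannot distinguish a Piece reference containing '/'
// from a separate Piece and Item reference
val ccr = "E 317/Devon/1"
val parts = ccr.split('/')  // Array("E 317", "Devon", "1")

// both parses are legal, so a program gets an ambiguous result:
//   Piece = "Devon/1"            (the correct reading, per Example (3))
//   Piece = "Devon", Item = "1"  (the plausible but wrong reading)
</code></pre>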
<!--kg-card-end: markdown--><p>Whilst CCRs may not be perfect, it should be recognised they have until now largely been successful in providing an identifier for the retrieval of a record, as demonstrated by the fact that they are still used daily to access millions of archival records.</p><h3 id="gcr-generated-catalogue-references-">GCR (Generated Catalogue References)</h3><p>I designed the GCR scheme for TNA back in 2012 when I was leading the design and implementation of their DRI (Digital Repository Infrastructure) project. The goal of that 3-year project was to design and implement a new Digital Archive for preserving digital records.</p><p>DRI needed to be able to accession Born Digital records. The practice of archiving and cataloguing physical records is rather well established and understood. At that time the practice of archiving and cataloguing digital records was still in its infancy with much international discussion, and arguably best practice is still being refined. In particular, Born Digital records have several aspects that make them much more complex to handle than physical records; if we are to apply the principle of <em>Respect des fonds</em> then we must preserve the creator's arrangement of the digital files comprising the records. Generally digital files are organised according to a mono-hierarchical file-system or file-plan; however, such a hierarchy may be of an arbitrarily deep number of levels and operate without any global constraints on the naming of each level. In addition there are some systems (e.g. Content/Document Management Systems and/or Cloud Office Suites) which offer label-based arrangements of documents, thus resulting in arbitrarily deep poly-hierarchical structures and again without restrictions on the naming of labels (i.e. levels).</p><p>By recognising that CCRs reflected an arrangement of 3 or 4 levels, and that Born Digital files could have many more levels of arrangement, we realised that adding additional level identifiers to CCRs would not scale, as we could end up with very long CCRs which are encoding file-system paths with each component of arbitrary length. In addition whilst TNA may receive a large collection of paper records and these can be catalogued and accessioned by humans, the volume of digital files for Born Digital records is much, much higher, to the extent that cataloguing such records manually becomes impossible with the resources available.</p><p>To solve this problem I developed GCRs, with the goals of:</p><!--kg-card-begin: markdown--><ol>
<li>
<p>Eliminating the encoding of multiple levels of arrangement into the Catalogue Reference.</p>
</li>
<li>
<p>Computing Catalogue References automatically during the automated accessioning process for a collection of digital records.</p>
</li>
<li>
<p>Creating Catalogue References that are unambiguous, uniform, and can be easily validated.</p>
</li>
<li>
<p>Ensuring that the GCR scheme is still easily communicable by humans by both written and verbal mechanisms.</p>
</li>
</ol>
<!--kg-card-end: markdown--><p>A GCR starts just like a CCR by encoding the Department and Series References; however, from there it deviates: instead of encoding further levels, it uses a sequentially allocated Record Number, followed by an optional Revision Number. The GCR scheme for most records, i.e. those with a single manifestation, looks like:</p><!--kg-card-begin: code--><pre><code>{Department Reference} {Series Reference} {Record Number} Z
</code></pre><!--kg-card-end: code--><p>For records with more than one manifestation, the additional manifestations can be identified by the GCR scheme:</p><!--kg-card-begin: code--><pre><code>{Department Reference} {Series Reference} {Record Number} Z{Revision Number}
</code></pre><!--kg-card-end: code--><p>Each record number is monotonically increasing per Department and Series pair. To ensure that the GCR remains succinct even when there are many records, I then encoded the record number using a custom Base25 alphabet (a generalised encoding sketch follows the alphabet table below). This encoding results in a significant compression of the number of characters needed to express the record number. The Base25 alphabet was carefully chosen to eliminate characters which could be confused when communicated by humans, for example <code>0</code> (the digit) and <code>O</code> (the letter); in addition, I removed vowels so that we were not incidentally generating recognisable words. The <code>Z</code> character, which I also removed from the alphabet, is carefully placed to enable a GCR to be easily distinguishable from a CCR.</p><p><strong>GCR Base25 Alphabet</strong></p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>Numeric Value</th>
<th>Encoded Symbol</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>B</td>
</tr>
<tr>
<td>1</td>
<td>C</td>
</tr>
<tr>
<td>2</td>
<td>D</td>
</tr>
<tr>
<td>3</td>
<td>F</td>
</tr>
<tr>
<td>4</td>
<td>G</td>
</tr>
<tr>
<td>5</td>
<td>H</td>
</tr>
<tr>
<td>6</td>
<td>J</td>
</tr>
<tr>
<td>7</td>
<td>K</td>
</tr>
<tr>
<td>8</td>
<td>L</td>
</tr>
<tr>
<td>9</td>
<td>M</td>
</tr>
<tr>
<td>10</td>
<td>N</td>
</tr>
<tr>
<td>11</td>
<td>P</td>
</tr>
<tr>
<td>12</td>
<td>Q</td>
</tr>
<tr>
<td>13</td>
<td>R</td>
</tr>
<tr>
<td>14</td>
<td>S</td>
</tr>
<tr>
<td>15</td>
<td>T</td>
</tr>
<tr>
<td>16</td>
<td>V</td>
</tr>
<tr>
<td>17</td>
<td>W</td>
</tr>
<tr>
<td>18</td>
<td>X</td>
</tr>
<tr>
<td>19</td>
<td>2</td>
</tr>
<tr>
<td>20</td>
<td>3</td>
</tr>
<tr>
<td>21</td>
<td>4</td>
</tr>
<tr>
<td>22</td>
<td>5</td>
</tr>
<tr>
<td>23</td>
<td>6</td>
</tr>
<tr>
<td>24</td>
<td>7</td>
</tr>
</tbody>
</table>
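<p>A sketch of how such record numbers can be produced, assuming we generalise the earlier Base68 encoder so that the alphabet becomes a parameter (illustrative only; the published oci-tools libraries are the real implementations):</p>
<pre><code class="language-scala">// encode a positive integer using any positional alphabet
def encodeWith(alphabet: Seq[Char])(value: BigInt): String = {
  val len = alphabet.length
  @scala.annotation.tailrec
  def loop(v: BigInt, accum: List[Char]): String =
    if (v == 0 &amp;&amp; accum.nonEmpty) accum.mkString
    else if (v &lt;= 1) (alphabet(v.toInt) :: accum).mkString
    else loop(v / len, alphabet((v % len).toInt) :: accum)
  loop(value, Nil)
}

// the GCR Base25 alphabet from the table above
val gcrBase25Alphabet = Seq(
  'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L', 'M', 'N', 'P', 'Q',
  'R', 'S', 'T', 'V', 'W', 'X', '2', '3', '4', '5', '6', '7'
)

// e.g. record number 1234 encodes as "C7M" (1*625 + 24*25 + 9 = 1234)
encodeWith(gcrBase25Alphabet)(BigInt(1234))  // == "C7M"
</code></pre>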
<!--kg-card-end: markdown--><p>Examples of valid GCRs:</p><!--kg-card-begin: markdown--><ol>
<li><a href="https://discovery.nationalarchives.gov.uk/details/r/02a88459172048279324d7bbd5777059"><code>LOC 5 CWG Z</code></a></li>
<li><a href="https://discovery.nationalarchives.gov.uk/details/r/56b36e67c0f64058b3399982cd0a64a5"><code>LOC 5/FPF/Z</code></a></li>
<li><code>LOC 5 CWG Z3</code></li>
</ol>
<!--kg-card-end: markdown--><p>Example (1) and Example (2) are both valid GCRs. The GCR scheme does not actually stipulate that there should be a <code>/</code> used between the Series, Record, and <code>Z</code> components; however, for visual continuity TNA has elected to use this when presenting them. Example (3) shows the Revision number component, which serves to allow multiple <em>manifestations</em> of a record to be addressed by a GCR; for example, you may have a Microsoft Word 2000 Document original, and a migrated PDF manifestation.</p><p>TNA have now been using GCRs for digital records for a few years. Retrospectively looking back at the design of GCRs, I have to admit that I am still quite happy with them; my younger self must have been having a particularly good day when he sat down to design the GCR scheme! It is certainly humbling to think that these innocuous little identifiers are forming a small part of the UK's permanent history, and that I had a hand in defining them.</p><p>For the purpose of considering them for use as the identifier scheme in Omega, GCRs have many of the properties that we require in a good identifier - they are unique, they are persistent (for the vast majority of records), they are computable, they are uniform, and they are humane (in many ways more so than CCRs, although perhaps not as memorable).</p><p>Indeed, we could casually adopt GCRs for use in Omega as our identifier scheme. Yet with further thought there are a couple of minor issues with that, and as we have the opportunity, I could perhaps improve on GCRs further. The issues that I perceive with adopting GCRs for Omega are:</p><!--kg-card-begin: markdown--><ol>
<li>
<p>In Omega we want our identifiers to be persistent. If the records need to be re-arranged, whilst it is extremely unlikely that the Department reference would change, it is possible that the Series reference could. With a GCR, the Series reference is encoded in the identifier, which means the re-arrangement would unfortunately result in a change to the identifier.</p>
</li>
<li>
<p>To support our goal of building an immutable history of our records in Omega, we have a very clear distinction between the enduring form of the record (i.e. the concept of the record) and our temporal understanding of the record (i.e. descriptions of the record). To this end, we need identifiers that can indicate both the concept of the record and its descriptions, as our understanding of the record accumulates and evolves through time. A GCR has a revision number to identify manifestations of the record, and one might consider re-purposing that for indicating revisions of description, but that would then fall short as we also have the concept of manifestations in Omega.</p>
</li>
<li>
<p>Adopting GCRs as identifiers for all records would mean that physical records would also gain a GCR alongside their CCR; born-digital records already have GCRs. There is in fact precedent for this: TNA-CS13 allows records to have Former References alongside their Catalogue Reference, and one such Former Reference is the PRO (Public Record Office) reference; the PRO is of course the predecessor of TNA. The issue I perceive is one of mindset: staff know that GCRs are used only for digital records, so when they see the <code>Z</code> character in a GCR they infer that it refers to a digital record. This is perhaps unfortunate; whilst it was never previously envisaged that the GCR scheme would be used for physical records, it didn't impose any such limitation, and its specification in fact states: &quot;not solely limited to Born Digital Records&quot; and that the <code>Z</code> character (the Generated Catalogue Reference Indicator) is for the purpose of &quot;allow[ing] users to visually differentiate a GCR from a CCR easily&quot;.</p>
</li>
</ol>
<!--kg-card-end: markdown--><p>Additionally, one place where I think I could improve upon the GCR scheme is where CCRs have the advantage of being able to convey more information directly to the user about the record, thus reducing the need for the user to actually retrieve the record. Sometimes this information may be ambiguous and/or confusing, but more often than not, it is helpful to the user. GCRs removed a lot of that information to meet their goals, and ultimately ended up with a much more persistent identifier scheme, which is a good thing. In Omega we have a clear split and definition of information that we believe changes over time, and information which we believe is enduring. If there is enduring information that is useful to the user, and assuming it is sensible for use in an identifier and URI, we can place that into the identifier without compromising on our requirement for persistent identifiers.</p><h3 id="oci-omega-catalogue-identifier-">OCI (Omega Catalogue Identifier)</h3><p>From what I have learnt about CCRs and how TNA use GCRs, in consultation with TNA's Catalogue Team I have developed a new identifier scheme which rather unimaginatively I am simply calling OCI (Omega Catalogue Identifier).</p><p>Let me be clear: my driver for this is solely the requirements of Project Omega. I believe that these identifiers will work well for the URI of our catalogue resources in our RDF graph, and would equally work well within a Linked Data context.</p><p>OCIs can be used for any type of record held by TNA. I might be suggesting, but I am NOT proposing, that the canonical Catalogue Reference of a record catalogued by TNA change from the existing CCR or GCR scheme. At this stage, I see OCIs as complementing CCRs and GCRs, whereby in some applications, such as Omega, the OCI is the primary identifier of the record. Regardless, OCIs will be generated for all existing catalogue records at TNA when they are imported into the Project Omega system. Could TNA start using OCIs instead of CCRs and GCRs for all new records that it accessions? Yes, of course! Will TNA do that? I am the wrong person to ask... that level of decision making is high above my position!</p><p>The basic OCI scheme for records has the following components:</p><!--kg-card-begin: code--><pre><code>{Creator Reference}.{Accession Year}.{Record Number}.{Accession Format}</code></pre><!--kg-card-end: code--><!--kg-card-begin: markdown--><ul>
<li>
<p>Creator Reference<br>
This is some identifier that uniquely identifies the organisation, group, or individual that created the records. Historically at TNA this is most often the Government Department, known as &quot;Department reference&quot; in CCR terms.</p>
</li>
<li>
<p>Accession Year<br>
The year in which the record was accessioned into the archive.</p>
</li>
<li>
<p>Record Number<br>
A monotonically increasing number, initialised per Creator Reference and Accession Year pair. This number is encoded using a special purpose alphabet. This is not the same Base25 alphabet used in GCRs for some important reasons covered below.</p>
</li>
<li>
<p>Accession Format<br>
A single character to indicate the format of the accessioned record. Currently limited to <code>P</code> for physical records, or <code>D</code> for digital records. Note that accession format is useful to indicate the format of the public record that was initially accessioned, but it should be remembered that there might also be additional manifestations of the record available in complementary formats, e.g. a digitisation of a physical record.</p>
</li>
</ul>
<!--kg-card-end: markdown--><p>As already mentioned, Omega separates the concept (or enduring form) of a record from TNA's descriptions of the record, which evolve through time. The OCI scheme illustrated above allows one to identify a record, but how are we to identify the descriptions of the record? To identify a specific description of a record, we simply number them, and add an additional component to indicate the description:</p><!--kg-card-begin: code--><pre><code>{Creator}.{Accession Year}.{Record Number}.{Accession Format}.{Description Number}
</code></pre><!--kg-card-end: code--><!--kg-card-begin: markdown--><ul>
<li>Description Number<br>
A monotonically increasing number, initialised per record concept. This is not encoded. Comparing these numbers only implies an ordering of descriptions through time; it does not indicate the correct description, as there may be multiple competing descriptions from different sources.</li>
</ul>
<!--kg-card-end: markdown--><p>It is perhaps worth pointing out that within the RDF graph for Omega there are explicit relationships that link a record with all of its descriptions, and also indicate the latest description. From a Linked Data perspective, if a user wished to resolve the record using the web of data, we would provide just the data about the concept and links to its descriptions. However, if another user was to resolve the record via a Web Browser, then we would likely redirect them to an HTML page of the latest description of the record.</p><p>Similarly to how we have multiple descriptions of a record, Omega also offers multiple manifestations of a record. A manifestation of a record can take many different forms, but there is always the original manifestation of the record as accessioned by TNA, for example, a parchment or digital file. There may also be additional manifestations of the record created for preservation or presentation purposes, for example, copies, digitisation, thumbnails, language translation, transcription, redaction, or file-format migration. The OCI scheme adds a component for manifestation, which is numbered in a similar manner to descriptions.</p><!--kg-card-begin: code--><pre><code>{Creator}.{Accession Year}.{Record Number}.{Accession Format}.{M{Manifestation Number}}
</code></pre><!--kg-card-end: code--><!--kg-card-begin: markdown--><ul>
<li>Manifestation Number<br>
Prefixed by an <code>M</code> character, this is a monotonically increasing number, initialised per record concept. This is not encoded. Comparing these numbers does not imply an ordering of manifestations. (A sketch composing all three identifier forms follows just after this list.)</li>
</ul>
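<p>A hypothetical sketch (the class and method names are mine, purely for illustration, and are not part of any OCI specification) of composing the three identifier forms described above:</p>
<pre><code class="language-scala">case class Oci(creatorRef: String, accessionYear: Int, recordNumber: String, accessionFormat: Char) {
  require(accessionFormat == 'P' || accessionFormat == 'D', "P = physical, D = digital")

  // the record (concept) identifier
  override def toString: String = s"$creatorRef.$accessionYear.$recordNumber.$accessionFormat"

  // the n-th description of the record
  def description(n: Int): String = toString + "." + n

  // the n-th manifestation of the record
  def manifestation(n: Int): String = toString + ".M" + n
}

// Oci("MSW", 1970, "7GH", 'P').toString          == "MSW.1970.7GH.P"
// Oci("MSW", 1981, "HGF", 'P').description(5)    == "MSW.1981.HGF.P.5"
// Oci("MSW", 1999, "TSF", 'P').manifestation(1)  == "MSW.1999.TSF.P.M1"
</code></pre>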
<!--kg-card-end: markdown--><p>Note, it is important to realise that descriptions and manifestations are both numbered per record concept. Descriptions and manifestations do not have a hierarchical relationship; instead, they are orthogonal to each other.</p><p>Here are some examples of (fictional) OCIs:</p><!--kg-card-begin: markdown--><ol>
<li>
<p><code>MSW.1970.7GH.P</code><br>
This is the OCI for a physical record numbered 7GH which was created by MSW (<a href="https://en.wikipedia.org/wiki/The_Ministry_of_Silly_Walks">The Ministry of Silly Walks</a>) and accessioned by TNA in 1970.</p>
</li>
<li>
<p><code>MSW.2014.L4F.D</code><br>
This is the OCI for a digital record numbered L4F which was created by MSW and accessioned by TNA in 2014.</p>
</li>
<li>
<p><code>MSW.1981.HGF.P.1</code><br>
This is the OCI for the 1st description of the physical record numbered HGF, which was created by MSW and accessioned by TNA in 1981.</p>
</li>
<li>
<p><code>MSW.1981.HGF.P.5</code><br>
This is the OCI for the 5th description of the physical record numbered HGF, which was created by MSW and accessioned by TNA in 1981.</p>
</li>
<li>
<p><code>MSW.1999.TSF.P.M1</code><br>
This is the OCI for the 1st manifestation of the physical record numbered TSF, which was created by MSW and accessioned by TNA in 1999.</p>
</li>
<li>
<p><code>MSW.1999.TSF.P.M5</code><br>
This is the OCI for the 5th manifestation of the physical record numbered TSF, which was created by MSW and accessioned by TNA in 1999.</p>
</li>
</ol>
<!--kg-card-end: markdown--><p>Astute readers may have noticed that we could potentially remove the <code>.</code> character between the components in an OCI without losing precision or introducing ambiguity. This is an interesting idea, and one that I discussed with the Catalogue Team; whilst it would make no difference to a machine, the majority felt that it was clearer for human use if they remained.</p><p>The alphabet for encoding record numbers in OCIs was created by:</p><ol><li>Starting with the <a href="https://tools.ietf.org/html/rfc4648#page-10">Base32 alphabet from RFC 4648</a>.</li><li>Eliminating the English vowels - <code>A</code>, <code>E</code>, <code>I</code>, <code>O</code>, and <code>U</code>. We don't want to incidentally create meaningful words!</li><li>Removing the characters <code>P</code>, <code>D</code>, and <code>M</code>, as they are reserved to signify Physical, Digital, and Manifestation.</li><li>Removing the digit <code>0</code> (zero) as it could be misconstrued as numeric padding.</li><li>Removing the character <code>B</code> as it could be confused with the digit <code>8</code> (eight) when read or written by humans.</li><li>Adding the characters <code>W</code>, <code>X</code>, and <code>Y</code>. We opted not to add <code>Z</code> so as to avoid any confusion with GCRs.</li></ol><p><strong>OCI Base25 Alphabet</strong> (also sketched in code after the table below)</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>Numeric Value</th>
<th>Encoded Symbol</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td>5</td>
<td>6</td>
</tr>
<tr>
<td>6</td>
<td>7</td>
</tr>
<tr>
<td>7</td>
<td>8</td>
</tr>
<tr>
<td>8</td>
<td>9</td>
</tr>
<tr>
<td>9</td>
<td>C</td>
</tr>
<tr>
<td>10</td>
<td>F</td>
</tr>
<tr>
<td>11</td>
<td>G</td>
</tr>
<tr>
<td>12</td>
<td>H</td>
</tr>
<tr>
<td>13</td>
<td>J</td>
</tr>
<tr>
<td>14</td>
<td>K</td>
</tr>
<tr>
<td>15</td>
<td>L</td>
</tr>
<tr>
<td>16</td>
<td>N</td>
</tr>
<tr>
<td>17</td>
<td>Q</td>
</tr>
<tr>
<td>18</td>
<td>R</td>
</tr>
<tr>
<td>19</td>
<td>S</td>
</tr>
<tr>
<td>20</td>
<td>T</td>
</tr>
<tr>
<td>21</td>
<td>V</td>
</tr>
<tr>
<td>22</td>
<td>W</td>
</tr>
<tr>
<td>23</td>
<td>X</td>
</tr>
<tr>
<td>24</td>
<td>Y</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>The Base25 alphabet as used in OCIs has a small advantage over that used in GCRs - encoded data maintains its sort order when it is compared bit-wise.</p><p>I believe that the OCI scheme has two key advantages over GCRs:</p><!--kg-card-begin: markdown--><ol>
<li>
<p>100% Persistent.<br>
A URI using an OCI will never change, even if the description or arrangement of the record changes. Once an OCI is created in Omega it lives forever.</p>
</li>
<li>
<p>Conveys Knowledge.<br>
Like a CCR, an OCI confers some information about the record that it identifies; however, unlike a CCR, this is done without compromising on the persistence of the identifier.</p>
</li>
</ol>
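<!--kg-card-end: markdown--><p>To make the alphabet table concrete, here is a minimal sketch in Java of a Base25 encoder/decoder over the OCI alphabet. This is an illustration of the technique only, not the official implementation - for that, see the OCI Tools repositories linked just below. Because the symbols ascend in ASCII order, this sketch also exhibits the bit-wise sort-order property mentioned above.</p><!--kg-card-begin: code--><pre><code class="language-java">public class OciBase25 {

    // The numeric value 0..24 maps to the symbol at the same index
    private static final String ALPHABET = "123456789CFGHJKLNQRSTVWXY";

    public static String encode(long value) {
        final StringBuilder sb = new StringBuilder();
        do {
            sb.append(ALPHABET.charAt((int) (value % 25)));
            value /= 25;
        } while (value &gt; 0);
        return sb.reverse().toString();
    }

    public static long decode(final String encoded) {
        long value = 0;
        for (final char c : encoded.toCharArray()) {
            final int digit = ALPHABET.indexOf(c);
            if (digit &lt; 0) {
                throw new IllegalArgumentException("Invalid Base25 symbol: " + c);
            }
            value = (value * 25) + digit;
        }
        return value;
    }

    public static void main(final String[] args) {
        System.out.println(encode(4037));   // prints: 7GH
        System.out.println(decode("7GH"));  // prints: 4037
    }
}</code></pre><!--kg-card-end: code--><!--kg-card-begin: markdown-->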
<!--kg-card-end: markdown--><p>The OCI scheme <strong>should certainly be considered a draft</strong> at the moment, and I am looking forward to both experimenting with it and receiving further feedback.</p><p>We have placed the source code for two software tools for encoding/decoding OCI Base25 (and also GCR Base25) onto GitHub: <a href="https://github.com/nationalarchives/oci-tools-scala">OCI Tools (Scala)</a> and <a href="https://github.com/nationalarchives/oci-tools-ts">OCI Tools (TypeScript)</a>.</p><h2 id="full-circle-back-to-uris">Full Circle back to URIs</h2><p>As discussed at the start of this article... for Project Omega I defined a static base URI for expressing TNA's resources in RDF, and I have now also defined a suitable identifier scheme - OCI.</p><p>The URI for our <em>data</em> now looks like this for a record (concept):</p><!--kg-card-begin: code--><pre><code>http://cat.nationalarchives.gov.uk/MSW.1970.7GH.P
</code></pre><!--kg-card-end: code--><p>and this for a record's description:</p><!--kg-card-begin: code--><pre><code>http://cat.nationalarchives.gov.uk/MSW.1970.7GH.P.2
</code></pre><!--kg-card-end: code--><p>Just above I said: "URI for our <u><em>data</em></u>"; ideally such URIs <em>should</em> be resolvable via the Web with various content negotiation options. At present, alongside Project Omega, TNA is also operating Project Alpha. Alpha is focused upon the User Experience around the discoverability of records through TNA's website. There has already been some collaboration and information sharing between the two projects around URIs. Yet it is important to keep in mind that URIs for addressing records (Omega) are not necessarily the same as URIs for finding records (Alpha).</p><p>I expect that there will be further collaboration in the near-future between the Alpha and Omega projects to ensure that TNA can benefit from the exciting Linked Data on the Web applications that Omega is unlocking! :-)</p>]]></content:encoded></item><item><title><![CDATA[Business Source License Adoption]]></title><description><![CDATA[<p>I have been following the re-licensing upheaval which has been happening with Open Source database software over the last few years. Last year, I gave a couple of small talks on the subject at both, the <a href="https://www.meetup.com/London-Open-Source-Database-Meetup/">London Open Source Databases Meetup</a> (slides: <a href="https://slides.com/adamretter/database-licensing-chaos-losdbm19/">Database Licensing Chaos</a>), and at <a href="https://homepages.cwi.nl/~steven/declarative/">Declarative Amsterdam</a> (slides:</p>]]></description><link>https://blog.adamretter.org.uk/business-source-license-adoption/</link><guid isPermaLink="false">5e7c7428f3f56d02318b6e6e</guid><category><![CDATA[Open Source]]></category><category><![CDATA[Source Available]]></category><category><![CDATA[BSL]]></category><category><![CDATA[Licensing]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Thu, 26 Mar 2020 16:59:31 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1447023029226-ef8f6b52e3ea?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1447023029226-ef8f6b52e3ea?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=1080&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Business Source License Adoption"><p>I have been following the re-licensing upheaval which has been happening with Open Source database software over the last few years. Last year, I gave a couple of small talks on the subject at both the <a href="https://www.meetup.com/London-Open-Source-Database-Meetup/">London Open Source Databases Meetup</a> (slides: <a href="https://slides.com/adamretter/database-licensing-chaos-losdbm19/">Database Licensing Chaos</a>), and at <a href="https://homepages.cwi.nl/~steven/declarative/">Declarative Amsterdam</a> (slides: <a href="https://slides.com/adamretter/are-we-still-open-source/">Are we still Open Source?</a>).</p><p>The summary is that many well-known databases that were previously Open Source have either moved to an <a href="https://en.wikipedia.org/wiki/Open-core_model">Open-Core Model</a> and/or changed to <a href="https://en.wikipedia.org/wiki/Source-available_software">Source Available</a> licensing. There are various reasons behind this including refinement of business models and protecting investment in intellectual property. 
I won't debate the motivations or merits of such approaches in this article; there are already many other articles out there which do! Instead I will look briefly at one such Source Available license, the Business Source License, and who has adopted it and how.</p><h3 id="the-business-source-license">The Business Source License</h3><p>The <a href="https://mariadb.com/bsl11/">Business Source License</a> 1.1 (BSL) is a <a href="https://en.wikipedia.org/wiki/Source-available_software">Source Available</a> software license, which guarantees that the software's source code will become <a href="https://opensource.org/osd">Open Source</a> after a period of time (up to a maximum of 4 years).</p><p>The BSL was created by Michael Widenius and David Axmark and developed with Linus Nyman back in 2013 (see: <a href="https://timreview.ca/article/691">Introducing “Business Source”: The Future of Corporate Open Source Licensing?</a>). As you likely know, Michael and David have a long history and involvement in MySQL and MariaDB. Previously they had taken a <a href="https://en.wikipedia.org/wiki/Multi-licensing">Dual-Licensing</a> approach with MySQL; however, they felt that did not work well due to how customers wanted to use and deploy the software. As a response they developed the BSL for use with MariaDB products.</p><!--kg-card-begin: markdown--><blockquote>
<p>Finally, I hope that BSL will pave the way for a new business model that sustains software development without relying primarily on support.</p>
</blockquote>
<p><a href="http://monty-says.blogspot.com/2016/08/applying-business-source-licensing-bsl.html">Michael Widenius, 2016</a></p>
<!--kg-card-end: markdown--><p>The Business Source License was revised in <a href="https://perens.com/2017/02/14/bsl-1-1/">version 1.1 with the help of Bruce Perens</a> (a co-founder of the <a href="https://opensource.org/">OSI</a>, and creator of the <a href="https://opensource.org/osd">OSD</a>). Adjustments were made from BSL 1.0 to make it clearer to users what they get and when, and to impose constraints on the licensor as to both the period of the Change Date (a maximum of 4 years) and the choice of Change License (which must be GPL 2 or later compatible).</p><!--kg-card-begin: markdown--><p>BSL is a parameterised license, which allows the copyright holder some flexibility in how they apply it to their work. The three parameters are:</p>
<ol>
<li>
<p>Change License</p>
<p>A license which is compatible with GPL version 2.0 or later, which the <em>work</em> becomes licensed under after the &quot;Change Date&quot;.</p>
</li>
<li>
<p>Change Date</p>
<p>The date at which the <em>work</em> ceases to be licensed under BSL and instead becomes licensed under the &quot;Change License&quot;.</p>
</li>
<li>
<p>Additional Use Grant (Optional)</p>
<p>The BSL by default prohibits <em>production</em> use of the software. This parameter can optionally be used to grant additional rights to the licensee by the licensor. For example, so that it may be used with various restrictions in some form of production environment. It cannot be used to limit the other rights granted by the license.</p>
</li>
</ol>
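<!--kg-card-end: markdown--><p>For illustration, these parameters appear as a block of named fields at the top of the BSL 1.1 license text itself. The field names below follow MariaDB's published BSL 1.1 template; the licensor, work, grant, and dates are invented for this example:</p><!--kg-card-begin: code--><pre><code>Parameters

Licensor:             Example Corp Ltd.
Licensed Work:        ExampleDB 1.0
Additional Use Grant: You may use the Licensed Work in production on
                      up to two server instances.
Change Date:          2024-03-26
Change License:       Apache License, Version 2.0</code></pre><!--kg-card-end: code--><!--kg-card-begin: markdown-->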
<!--kg-card-end: markdown--><h3 id="who-is-using-the-bsl">Who is using the BSL?</h3><p>The obvious adopter of BSL is MariaDB, who are <a href="https://mariadb.com/projects-using-bsl-11/">using it for</a> their MaxScale, ColumnStore Backup Restore Tool, ColumnStore MaxScale CDC Data Adapter, and ColumnStore Kafka Data Adapter software products.</p><p>In addition to MariaDB, finding other software products that have adopted BSL has not been easy. Through Google, I was able to only identify four more products: <a href="https://github.com/cockroachdb/cockroach">CockroachDB</a>, <a href="https://github.com/getsentry/sentry">Sentry.io</a>, <a href="https://github.com/MaterializeInc/materialize/">Materialize</a>, and <a href="https://github.com/zerotier/ZeroTierOne">ZeroTier</a>. There are likely others, but even with relatively little effort I was surprised to find so few!</p><h3 id="how-is-the-bsl-being-used">How is the BSL being Used?</h3><p>I did a brief survey of how each of the adopters that I found are parameterizing the BSL for their needs...</p><p><strong>Change License</strong></p><!--kg-card-begin: markdown--><ol>
<li>MaxScale: GPL Version 2 or later</li>
<li>CockroachDB: Apache 2.0</li>
<li>Sentry.io: Apache 2.0</li>
<li>Materialize: Apache 2.0</li>
<li>ZeroTier: Apache 2.0</li>
</ol>
<!--kg-card-end: markdown--><p>The majority of those appear to be using Apache 2.0 as their Change License. BSL 1.1 states the following about the choice of Change License:</p><blockquote>a license that is compatible with GPL Version 2.0 or a later version, where “compatible” means that software provided under the Change License can be included in a program with software provided under GPL Version 2.0 or a later version.</blockquote><p>This is interesting because the Apache License, Version 2.0 is actually not compatible with GPL Version 2. However, it is compatible with GPL Version 3 (see: <a href="https://www.apache.org/licenses/GPL-compatibility.html">Apache's GPL Compatibility</a>), so I guess that the "<em>or later version</em>" text allows this to work.</p><p><strong>(Period of) Change Date</strong></p><!--kg-card-begin: markdown--><ol>
<li>MaxScale: 4 Years</li>
<li>CockroachDB: 3 Years</li>
<li>Sentry.io: 3 Years</li>
<li>Materialize: 4 Years</li>
<li>ZeroTier: 3 Years, 4 Months *</li>
</ol>
<p>* <em>This seems an unusual period; it was calculated based on their commits and stated Change Date</em>.</p>
<!--kg-card-end: markdown--><p><strong>Additional Use Grant</strong></p><!--kg-card-begin: markdown--><ol>
<li>
<p>MaxScale</p>
<p>Allows your application to use up to (and including) two server instances of MaxScale for any purpose (e.g. production).</p>
</li>
<li>
<p>CockroachDB</p>
<p>Allows you to use CockroachDB for any purpose (e.g. production) as long as you are not offering it as a commercial <a href="https://en.wikipedia.org/wiki/Cloud_database">DBaaS (Database as a Service)</a>. It seems likely that they are trying to prevent large Cloud Providers (e.g. Amazon AWS) from monetizing their work for free.</p>
</li>
<li>
<p>Sentry.io</p>
<p>Their wording is almost identical to that used for CockroachDB. You can use Sentry for any purpose (e.g. production) as long as you are not offering it as a commercial <a href="https://en.wikipedia.org/wiki/Software_as_a_service">SaaS (Software as a Service)</a>. Likewise, I suspect they also wanted to inhibit Cloud Providers from monetizing their work for free.</p>
</li>
<li>
<p>Materialize</p>
<p>Allows you to use as many non-clustered isolated server instances of Materialize as you want for any purpose (e.g. production) as long as you are not offering it as a commercial DBaaS.</p>
</li>
<li>
<p>ZeroTier</p>
<p>Their Additional Use Grant is the most complicated of those from the adopters that I looked at. In summary, I interpret it as allowing you to use ZeroTier (e.g. in production), providing you are not:<br>
1. offering a commercial service e.g. ZeroTier SaaS<br>
2. creating a non-open source commercial derivative work<br>
3. using it within Government, unless for physical or mental health care, family and social services, social welfare, senior care, child care, and the care of persons with disabilities.</p>
</li>
</ol>
<!--kg-card-end: markdown--><h3 id="applying-the-bsl-to-software">Applying the BSL to Software</h3><p>MariaDB provide guidance on adopting the BSL in the form of an <a href="https://mariadb.com/bsl-faq-adopting">FAQ</a>, however the details of how to correctly apply the BSL to your software are not crystal clear. In some places MariaDB also appear to offer contradictory advice. </p><p>Consider the following statement from <a href="https://mariadb.com/bsl-faq-adopting/#future">https://mariadb.com/bsl-faq-adopting/#future</a>:</p><blockquote><em><em>Q: How far in the future is the recommended Change Date?</em></em></blockquote><blockquote>A: At most five years from the initial alpha release of your BSL software. Picking the Change Date depends on how rapidly the software is changing. For most software, the recommendation is four years.</blockquote><p>This seems to directly conflict with the following statement from the BSL 1.1 itself:</p><blockquote>Effective on the Change Date, or the fourth anniversary of the first publicly available distribution of a specific version of the Licensed Work under this License, whichever comes first</blockquote><p>If the maximum period is the Change Date or the fourth anniversary, then what would be the purpose of the Change Date being "<em>At most five years</em>". The period of limitation is presumably 4 years maximum?</p><p>Likewise, consider the following statement from the <a href="https://mariadb.com/bsl-faq-adopting/#changedate">BSL Adopting FAQ</a> on Change Date expiry: </p><blockquote>All source files under BSL have a Change Date and the name of an Open Source license in the header.</blockquote><p>This advice seems to vary from <a href="https://mariadb.com/bsl-faq-mariadb/">https://mariadb.com/bsl-faq-mariadb/</a>, which by my interpretation, seems to imply that the Change Date need only appear in the BSL license file:</p><blockquote>To convert your software to BSL, you have to add the BSL header to all your software files and include the BSL license file in your software distribution. In addition, you have to add the usage limits and Change Date that suits your software in the header of the BSL license file.</blockquote><p>I have yet to find an explicit definition of what the "BSL header" is. In my opinion, understanding clearly how to apply the BSL to your software is at the moment difficult. Conversely, most Open Source licenses make this easy by having an explicit section with instructions on how to apply the license to your software, and often include a template header which can be copied and pasted into each source file. 
Similar explicit and precise instructions for BSL would be welcome.</p><p><strong>How has BSL been applied so far?</strong></p><p>Looking at each of the adopters that I previously identified, between them there seem to have been two distinct approaches taken to applying the BSL to their software:</p><ol><li>Maintaining the Change Date in both a central license file and within a license header at the top of each source code file.<br>This is the approach taken by both MaxScale (<a href="https://github.com/mariadb-corporation/MaxScale/blob/2.4/LICENSE24.TXT">LICENSE.TXT</a> / <a href="https://github.com/mariadb-corporation/MaxScale/blob/2.4/server/core/server.cc#L7">server.cc</a>) and ZeroTier (<a href="https://github.com/zerotier/ZeroTierOne/blob/1.4.6/LICENSE.txt">LICENSE.txt</a> / <a href="https://github.com/zerotier/ZeroTierOne/blob/master/node/Node.cpp#L7">Node.cpp</a>).</li><li>Maintaining the Change Date only in a central license file; the license header at the top of each source code file then references the central license file. This is the approach taken by both CockroachDB (<a href="https://github.com/cockroachdb/cockroach/blob/v19.2.5/licenses/BSL.txt#L18">BSL.txt</a> / <a href="https://github.com/cockroachdb/cockroach/blob/v19.2.5/pkg/server/server.go#L4">server.go</a>) and Materialize (<a href="https://github.com/MaterializeInc/materialize/blob/v0.1.3/LICENSE#L32">LICENSE</a> / <a href="https://github.com/MaterializeInc/materialize/blob/v0.1.3/src/dataflow/server.rs#L4">server.rs</a>). Sentry.io also takes a similar approach (<a href="https://github.com/getsentry/sentry/blob/master/LICENSE#L19">LICENSE</a>), but unusually (at least in my experience) does not include any license header or copyright notice at the top of their source files.</li></ol><p>The advantage of approach (2) for the developer/publisher is that they only have to update their license's Change Date in a single place. This removes any opportunity for inconsistency across multiple files.</p><p>However, from a user/developer (or even archivist) perspective I think there are advantages to approach (1), insofar as, if I am viewing the source code files individually, perhaps because they had been distributed individually or separately to the main body of work, then the Change Date is immediately apparent.</p><p>Which approach is correct? I don't know! It could be one or the other, both, or neither.</p><p>Regardless, at present there is no evidence of a consistent approach. 
I believe this is likely caused by a lack of explicit and precise guidance on how to apply the BSL to your software.</p><p>Personally, I rather like the BSL for the purpose that it serves, and hope for clarification on how it should be applied in the near-future.</p>]]></content:encoded></item><item><title><![CDATA[RDF Plugins for Pentaho KETTLE]]></title><description><![CDATA[For The National Archives we developed and open sourced custom workflow step plugins for Pentaho Kettle which use Apache Jena to allow you to generate RDF output.]]></description><link>https://blog.adamretter.org.uk/rdf-plugins-for-pentaho-kettle/</link><guid isPermaLink="false">5e5a7537a21a820224fea129</guid><category><![CDATA[ETL]]></category><category><![CDATA[Pentaho]]></category><category><![CDATA[KETTLE]]></category><category><![CDATA[RDF]]></category><category><![CDATA[Jena]]></category><category><![CDATA[SQL]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Mon, 02 Mar 2020 13:58:25 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1582978851931-4697a7c95ff5?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1582978851931-4697a7c95ff5?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=1080&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ" alt="RDF Plugins for Pentaho KETTLE"><p><strong><em>tl;dr</em></strong>: <a href="https://github.com/nationalarchives/kettle-jena-plugins">Jena Plugins for Pentaho Kettle (GitHub)</a>, and <a href="https://www.youtube.com/watch?v=2uqG_z2Qy9g">Demo of building a SQL to RDF Workflow (YouTube)</a>.</p><h3 id="background">Background</h3><p>At the end of 2019 TNA (<a href="https://nationalarchives.gov.uk">The National Archives</a>) launched a small Proof-of-Concept project called Project OMEGA. The goal of Project OMEGA was to investigate and prototype a potential replacement for their Catalogue. Initially the scope of the project was limited to PROCat and ILDB, the GUI and database respectively, that form their existing Catalogue system for physical (e.g. paper) records.</p><p>After much research into TNA's business of cataloguing, and at our suggestion, the project has been expanded. The goal of the project is now to build a new singular "Pan-Archival Catalogue" system, which will replace several existing systems, and be able to describe all types of records held by TNA, i.e. physical, born-digital, and digital surrogates.</p><p>One of the obvious complexities for a Pan-Archival Catalogue is bringing together data about records from multiple sources, each of which has a different logical model. We opted to first identify a suitable new data model for TNA's future Pan-Archival Catalogue. After evaluating the major models in use, and the latest research for archival records modelling, we settled upon a Graph based model. The Graph based model allows us to describe complex relationships between records, and to easily add new facts and relationships to the graph in future as TNA's knowledge and interpretation of its records is enriched.</p><p>In a previous post entitled - "<em><strong><a href="https://blog.adamretter.org.uk/pentaho-kettle-and-sql-server/">Pentaho KETTLE and SQL Server</a></strong></em>", I explained that we were using <a href="https://sourceforge.net/projects/pentaho/">Pentaho Kettle</a> to work with our initial data from ILDB which is a SQL Server database. 
Having chosen a Graph based data model, and specifically <a href="https://www.w3.org/RDF/">RDF</a> (Resource Description Framework), we needed to have our data transformation workflows in Kettle produce RDF output for us. Unfortunately, out of the box Kettle doesn't include any workflow steps for producing RDF, and we could not find any 3rd-party plugins to do this. So we built and Open Sourced our own...</p><h3 id="integrating-kettle-and-jena">Integrating Kettle and Jena</h3><p>Fortunately we were able to develop custom workflow steps to generate RDF in Java for Pentaho Kettle's <a href="https://help.pentaho.com/Documentation/8.3/Developer_center/Create_step_plugins">plugin API</a>. For the heavy RDF lifting, our custom workflow steps make use of <a href="https://jena.apache.org/">Apache Jena</a>, which is a Java framework for building <a href="https://www.w3.org/standards/semanticweb/">Semantic Web</a> and <a href="https://www.w3.org/standards/semanticweb/data">Linked Data</a> applications. Our workflow steps provide the UI dialogs for configuration in Kettle, and act as the glue between Kettle and Jena by mapping row fields from Kettle to RDF Resources and Properties in Jena.</p><p>We developed two custom workflow steps:</p><ol><li>Create Jena Model<br>For each input row provided by Kettle, a Jena Model is created and stored as a <em>target field</em> in the output row. The step allows the user to configure a mapping of fields from the input row to RDF Resources and Properties in the Jena Model of the output row.</li><li>Serialize Jena Model<br>This step is designed to receive input rows from Kettle which contain a Jena Model as one of their fields (created via a <em>Create Jena Model</em> step). The step allows the user to configure a file path and RDF serialization type, for a file that will be serialized to disk from the Jena Model.</li></ol><p>For anyone else who wants to create RDF with Pentaho Kettle, TNA have kindly agreed to release these custom workflow step plugins as Open Source under the <a href="https://opensource.org/licenses/MIT">MIT license</a>. If you are interested, you can read more about <a href="https://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/open-government-licence/open-software-licences/">TNA's Open Source licensing policy</a>. The plugins were developed in Java 8 and tested with Pentaho Kettle 8.3.0.9-719 and Apache Jena 3.14.0. You can find them on their GitHub: <a href="https://github.com/nationalarchives/kettle-jena-plugins">https://github.com/nationalarchives/kettle-jena-plugins</a></p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2020/02/demo-model-screenshot.png" class="kg-image" alt="RDF Plugins for Pentaho KETTLE"><figcaption>Example Kettle workflow with Jena RDF creation</figcaption></figure><!--kg-card-end: image--><h3 id="configuring-the-create-jena-model-step">Configuring the Create Jena Model step</h3><p>The Create Jena Model step is concerned with mapping fields from the input row to an RDF Resource and Properties in the output row. 
The step's configuration dialog is shown below.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2020/02/Screenshot-2020-02-29-at-17.13.44-1.png" class="kg-image" alt="RDF Plugins for Pentaho KETTLE"><figcaption>Create Jena Model step configuration dialog</figcaption></figure><!--kg-card-end: image--><!--kg-card-begin: markdown--><ul>
<li>Target Field Name<br>
This is the name of the field in the output row which will hold the Jena Model. You can call this anything you want, e.g. <code>my_jena_model</code>.</li>
<li>Remove Selected Fields?<br>
When this is selected, any fields added in the <em>Fields to RDF Properties</em> table will no longer be available in the output row.</li>
<li>Namespace Prefix / Namespace URI<br>
This table holds the namespace mappings. You must add entries in here for any prefixes which you use in the <em>Resource rdf:type</em> field, or <em>Fields to RDF Properties</em> table.</li>
<li>Resource rdf:type<br>
This is the name of the RDF class that your resource instantiates.</li>
<li>Resource URI (field)<br>
This is the field from the input row which contains the URI of your resource.</li>
<li>Fields to RDF Properties<br>
This table maps input row fields to RDF Properties. If you leave the <em>RDF Property type</em> empty, then <code>xsd:string</code> is assumed. Rather than properties, you can also map to other resources by setting the <em>RDF Resource type</em> to <code>Resource</code> and making sure that your field contains a URI or QName.</li>
</ul>
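<!--kg-card-end: markdown--><p>In effect, this dialog configures ordinary Apache Jena model-building calls. As a rough sketch of the equivalent Jena code - illustrative of the approach only, with invented namespaces and values, and not the plugin's actual source:</p><!--kg-card-begin: code--><pre><code class="language-java">import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class CreateModelExample {
    public static void main(final String[] args) {
        // Roughly what a "Create Jena Model" step does for one input row
        final Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("ex", "http://example.com/ns#");

        // "Resource URI (field)": the input row field holding the resource's URI
        final Resource record = model.createResource(
                "http://cat.nationalarchives.gov.uk/MSW.1970.7GH.P");

        // "Resource rdf:type": the RDF class that the resource instantiates
        record.addProperty(RDF.type,
                model.createResource("http://example.com/ns#Record"));

        // "Fields to RDF Properties": one statement per mapped row field
        record.addProperty(
                model.createProperty("http://example.com/ns#", "title"),
                "A literal value taken from an input row field");
    }
}</code></pre><!--kg-card-end: code--><!--kg-card-begin: markdown-->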
<!--kg-card-end: markdown--><h3 id="configuring-the-serialize-jena-model-step">Configuring the Serialize Jena Model step</h3><p>The Serialize Jena Model step is concerned with serializing a previously created Jena Model which is present in a field of the input row. Typically a Serialize Jena Model step follows a Create Jena Model step. The step's configuration dialog is shown below.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2020/02/Screenshot-2020-02-29-at-17.14.26.png" class="kg-image" alt="RDF Plugins for Pentaho KETTLE"><figcaption>Serialize Jena Model step configuration dialog</figcaption></figure><!--kg-card-end: image--><!--kg-card-begin: markdown--><ul>
<li>Field (Jena Model)<br>
This is the name of the field in the input row which contains the Jena Model. This should be the same as the <em>Target Field Name</em> from the corresponding Create Jena Model step.</li>
<li>Serialization Format<br>
The format of the RDF file that you wish to create, e.g. <code>RDF/XML</code>, <code>RDF/XML-Abbrev</code>, <code>N3</code>, <code>Turtle</code>, or <code>N-Triples</code>.</li>
<li>Filename<br>
This is the path and name of the file on disk that you wish to create.</li>
<li>Create Parent Folder<br>
When checked, the parent folder of the output file will be created if it does not already exist.</li>
<li>Include step number in Filename<br>
When checked, the Kettle step number will be appended into the output filename. This can help to uniquely identify the output file in complex workflows.</li>
<li>Include partition number in Filename<br>
When checked, the Kettle partition number will be appended into the output filename. This can help to uniquely identify the output file in complex workflows.</li>
<li>Include date in Filename<br>
When checked, the date will be appended into the output filename. This can help to uniquely identify the output file in complex workflows.</li>
<li>Include time in Filename<br>
When checked, the time will be appended into the output filename. This can help to uniquely identify the output file in complex workflows.</li>
</ul>
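<!--kg-card-end: markdown--><p>Again as a rough sketch, and not the plugin's actual source: serializing the model comes down to a single Jena call, where the format name plays the role of the <em>Serialization Format</em> option above. The file name and the empty model here are assumptions for illustration:</p><!--kg-card-begin: code--><pre><code class="language-java">import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class SerializeModelExample {
    public static void main(final String[] args) throws IOException {
        // In the plugin this would be the Model built by a preceding
        // Create Jena Model step; here it is simply empty
        final Model model = ModelFactory.createDefaultModel();

        // Jena's writer names include "RDF/XML", "RDF/XML-ABBREV", "N3",
        // "TURTLE", and "N-TRIPLE"
        try (final OutputStream os = new FileOutputStream("output.ttl")) {
            model.write(os, "TURTLE");
        }
    }
}</code></pre><!--kg-card-end: code--><!--kg-card-begin: markdown-->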
<!--kg-card-end: markdown--><h3 id="demo-of-building-a-sql-to-rdf-workflow">Demo of building a SQL to RDF Workflow</h3><p>We have also produced a simple screencast which demonstrates using our plugins to create RDF from a SQL database.</p><!--kg-card-begin: embed--><figure class="kg-card kg-embed-card"><iframe width="480" height="270" src="https://www.youtube.com/embed/2uqG_z2Qy9g?feature=oembed" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></figure><!--kg-card-end: embed-->]]></content:encoded></item><item><title><![CDATA[The Shell on my Mac]]></title><description><![CDATA[<p>On my previous Mac I used <a href="https://github.com/Powerlevel9k/powerlevel9k">powerlevel9k</a> to both give my shell a nice look, and to add some extra metadata to my prompt when working with git repositories (which for me is most of the time). I recently got a new Mac and somewhere between following the install instructions</p>]]></description><link>https://blog.adamretter.org.uk/the-shell-on-my-mac/</link><guid isPermaLink="false">5de10247be54db021b3a9fd3</guid><category><![CDATA[powerlevel10k]]></category><category><![CDATA[oh-my-zsh]]></category><category><![CDATA[zsh]]></category><category><![CDATA[iterm2]]></category><category><![CDATA[mac]]></category><dc:creator><![CDATA[Adam Retter]]></dc:creator><pubDate>Fri, 29 Nov 2019 12:34:33 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1563699182-58375278b2b9?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=1080&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1563699182-58375278b2b9?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=1080&fit=max&ixid=eyJhcHBfaWQiOjExNzczfQ" alt="The Shell on my Mac"><p>On my previous Mac I used <a href="https://github.com/Powerlevel9k/powerlevel9k">powerlevel9k</a> to both give my shell a nice look, and to add some extra metadata to my prompt when working with git repositories (which for me is most of the time). I recently got a new Mac and somewhere between following the install instructions and importing my old <code>.zshrc</code>, I ended up with a corrupted looking shell prompt, which I could not seem to fix.</p><p>So, I decided to reinstall all the shell stuff I use from scratch, along the way I discovered <a href="https://github.com/romkatv/powerlevel10k">powerlevel10k</a> which claims to be much faster than powerlevel9k.</p><p>This blog article is really for my own reference on how to setup a nice shell on a Mac, should I ever encounter such problems again. It might also be useful for anyone else who wants a fancy shell on their Mac.</p><p>The installation consists of <a href="https://iterm2.com/">iTerm2</a>, <a href="https://en.wikipedia.org/wiki/Z_shell">Zsh</a>, <a href="https://ohmyz.sh/">Oh My Zsh!</a>, and <a href="https://github.com/romkatv/powerlevel10k">powerlevel10k</a>, and it assumes that you have <a href="https://brew.sh/">Homebrew</a> already installed.</p><h2 id="install-iterm2">Install iTerm2</h2><p>From the Mac Terminal.app:</p><!--kg-card-begin: code--><pre><code class="language-bash">$ brew cask install iterm2
</code></pre><!--kg-card-end: code--><p>Then close Terminal.app and launch a new iTerm2 terminal and follow the instructions below in iTerm2.</p><h2 id="install-z-shell">Install Z shell</h2><!--kg-card-begin: code--><pre><code class="language-bash">$ brew install zsh
</code></pre><!--kg-card-end: code--><h3 id="z-shell-configuration">Z Shell Configuration</h3><p>My basic <code>~/.zshrc</code> reflects the fact that I have several tools for Java, Rust and Node installed, and that I have a little script that generates a nice custom MOTD (Message of the Day) for me; anyone else can likely ignore this configuration.</p><!--kg-card-begin: code--><pre><code class="language-zsh">fpath=(/usr/local/share/zsh-completions $fpath)

export PATH="/usr/local/bin:$PATH"
export PATH="$HOME/.cargo/bin:$PATH"
export PATH="/opt/local/bin:/opt/local/sbin:$PATH"
export PATH="/usr/local/maven/bin:$PATH"

export JAVA_HOME="/Library/Java/JavaVirtualMachines/zulu8"

export EDITOR=vim

export NVM_DIR="$HOME/.nvm"
source /usr/local/opt/nvm/nvm.sh

# My custom MOTD
$HOME/random-cowsay-fortune.sh
</code></pre><!--kg-card-end: code--><h2 id="install-oh-my-zsh-">Install Oh My Zsh!</h2><!--kg-card-begin: code--><pre><code class="language-zsh">sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"</code></pre><!--kg-card-end: code--><h3 id="updated-z-shell-configuration">Updated Z Shell configuration</h3><p>Installing Oh My Zsh will make changes to your Z Shell configuration file <code>~/.zshrc</code>, you can then further modify or customise it, mine looks like:</p><!--kg-card-begin: code--><pre><code class="language-zsh">export ZSH="$HOME/.oh-my-zsh"

ZSH_THEME="robbyrussell"
CASE_SENSITIVE="true"
HYPHEN_INSENSITIVE="false"
HIST_STAMPS="dd/mm/yyyy"

plugins=(
        brew
        colored-man-pages
        docker
        git
        mosh
        mvn
        osx
        ripgrep
        rust
        sbt
        scala
        sublime
        vscode
        xcode
)

source $ZSH/oh-my-zsh.sh


# User configuration

export PATH="/usr/local/bin:$PATH"
export PATH="$HOME/.cargo/bin:$PATH"
export PATH="/opt/local/bin:/opt/local/sbin:$PATH"
export PATH="/usr/local/maven/bin:$PATH"

export JAVA_HOME="/Library/Java/JavaVirtualMachines/zulu8"

export EDITOR=vim

export NVM_DIR="$HOME/.nvm"
source /usr/local/opt/nvm/nvm.sh

# My custom MOTD
$HOME/random-cowsay-fortune.sh</code></pre><!--kg-card-end: code--><h2 id="install-powerlevel10k">Install powerlevel10k</h2><!--kg-card-begin: code--><pre><code class="language-zsh">$ git clone --depth=1 https://github.com/romkatv/powerlevel10k.git $ZSH_CUSTOM/themes/powerlevel10k
</code></pre><!--kg-card-end: code--><h3 id="update-z-shell-configuration-for-powerlevel10k">Update Z Shell configuration for powerlevel10k</h3><p>You need to modify the <code>ZSH_THEME</code> variable in your Z Shell configuration to use the powerlevel10k theme. After which the start of my <code>~/.zshrc</code> file looks like:</p><!--kg-card-begin: code--><pre><code class="language-zsh">export ZSH="$HOME/.oh-my-zsh"

ZSH_THEME=powerlevel10k/powerlevel10k

...</code></pre><!--kg-card-end: code--><h3 id="configure-powerlevel10k">Configure powerlevel10k</h3><p>When you next open a terminal window, the powerlevel10k configuration script will run. If it prompts you to install a font and/or restart iTerm2, then do so. It will then prompt you with a number of questions about how you want the shell's prompt to visually appear. After which it will append the line <code>[[ ! -f ~/.p10k.zsh ]] || source ~/.p10k.zsh</code> to your <code>~./zshrc</code> for you.</p><p>Just for my own future reference, the answers I gave to the questions of the configuration script are recorded in the top of my <code>~/.p10k.zsh</code> file:</p><!--kg-card-begin: code--><pre><code class="language-zsh"># Generated by Powerlevel10k configuration wizard on 2019-11-29 at 12:18 CET.
# Based on romkatv/powerlevel10k/config/p10k-rainbow.zsh, checksum 20931.
# Wizard options: nerdfont-complete + powerline, small icons, rainbow, round separators,
# sharp heads, flat tails, 1 line, sparse, many icons, concise, transient_prompt.

...
</code></pre><!--kg-card-end: code--><h2 id="conclusion">Conclusion</h2><p>Based on the settings I chose, my iTerm2 terminal prompt now looks rather nice IMHO.</p><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2019/11/Screenshot-2019-11-29-at-12.32.53.png" class="kg-image" alt="The Shell on my Mac"><figcaption>A new Terminal shell</figcaption></figure><!--kg-card-end: image--><!--kg-card-begin: image--><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://blog.adamretter.org.uk/content/images/2019/11/Screenshot-2019-11-29-at-12.33.37.png" class="kg-image" alt="The Shell on my Mac"><figcaption>Listing directory contents of a Git repository; prompt shows the git branch and status.</figcaption></figure><!--kg-card-end: image--><h2 id="my-custom-motd-script">My custom MOTD script</h2><p>In case anyone is interested, I have included my MOTD script <code>random-cowsay-fortune.sh</code> below. It requires you to first install cowsay, fortune, and lolcat via Homebrew.</p><!--kg-card-begin: code--><figure class="kg-card kg-code-card"><pre><code class="language-bash">#!/usr/bin/env bash

set -e

file=$( ls /usr/local/Cellar/cowsay/3.04/share/cows/*.cow | sort -R | tail -1 )

/usr/local/Cellar/fortune/9708/bin/fortune  | cowsay -f "$file" | lolcat</code></pre><figcaption>random-cowsay-fortune.sh</figcaption></figure><!--kg-card-end: code-->]]></content:encoded></item></channel></rss>