Extreme Identifiers (for use in URIs)

Recently I have been thinking about how Identifiers for things should be constructed. More specifically, as part of Project Omega for TNA (The National Archives), I have been thinking about identifiers for resources in RDF. This blog post continues on from my previous posts: Archival Identifiers for Digital Files, and Archival Catalogue Record Identifiers.

Whilst this article frames the content in the context of RDF and Archives, the principles are much more widely applicable. This blog post shows how to efficiently encode any positive numeric integer into a URI.

A resource in RDF is identified by a URI and that URI often consists of two major components, first a base URI, and then some sort of (domain specific) local identifier. For example:

A Base URI: http://cat.nationalarchives.gov.uk
A local identifier: Record12345
The final URI: http://cat.nationalarchives.gov.uk/Record12345

One of the types of resource that we need to describe in RDF is that of a Digital File, you can think of it simply as a file on your computer. It is the identifiers of those digital files that we will concern ourselves with in this article.

In my recent article - Archival Identifiers for Digital Files, I covered two different identifier schemes for Digital File: UUID (Universally Unique Identifier), and ACID (Archival Content Identifier). One thing that bothered me was the length of the presentation representation (i.e. hexadecimal encoding) of those identifiers.

If for a moment we ignore the Hash Function Type prefix of the ACID scheme, we can recognise that each scheme is really just generating a large positive integer! As hexadecimal has a numeric base of 16, i.e. there are only 16 symbols in its alphabet, we might likewise recognise that it is perhaps not the most efficient presentation representation.

Our Digital File identifier will ultimately form the local identifier component of our RDF resource URI. What would be the most efficient way to encode that Digital File identifier into a URI?

Encoding a Number into a URI Path

If we wanted to encode any positive integer (i.e. Base10) into a URI path as efficiently as possible, we first need to create an alphabet that utilizes every possible character that can be legally expressed within the path of a URI. Once we have that alphabet we can encode a Base10 number into a BaseN number, where N is the number of characters in our new alphabet.

I took the following steps to create such an alphabet:

Start with all possible path characters for use in a URI from RFC 2396. These are defined as pchar in section 3.3 of the RFC.
Eliminate escaped characters by removing the % character. An escaped character is modelled in a URI by using the % character and then two hexadecimal characters. We are interested in using the least characters possible, so using 3 characters here to represent 1 character is the opposite of what we want to do!
(optional) Eliminate the English vowels - A, E, I, O, and U, and a, e, i, o, and u. We don't want to incidentally create meaningful words! This is a requirement for TNA; if they were to publish their RDF as Linked Data, then such encoded numbers would become publicly visible.
Sort the remaining characters byte-wise by their UTF-8 numeric value.

Following these steps yields an alphabet of 78 characters, or if you perform the optional Step 3, then 68 characters.

Numeric Value	Encoded Symbol
0	!
1	$
2	&
3	'
4	(
5	)
6	*
7	+
8	,
9	-
10	.
11	0
12	1
13	2
14	3
15	4
16	5
17	6
18	7
19	8
20	9
21	:
22	=
23	@
24	B
25	C
26	D
27	F
28	G
29	H
30	J
31	K
32	L
33	M
34	N
35	P
36	Q
37	R
38	S
39	T
40	V
41	W
42	X
43	Y
44	Z
45	_
46	b
47	c
48	d
49	f
50	g
51	h
52	j
53	k
54	l
55	m
56	n
57	p
58	q
59	r
60	s
61	t
62	v
63	w
64	x
65	y
66	z
67	~

Our 68 characters can yield a Base68 representation, which is much more compact than a Base16 (hexadecimal) representation.

Encoding from Base10 to any other base is performed by a common recursive mathematical function. How this is achieved is not particularly relevant, but for both completeness and those readers that are interested, I have written an expression of it for encoding to Base68 using the Scala programming language:

/**
 * Encodes a Base10 positive integer to a Base68 String.
 *
 * @param value a positive integer
 * @return the encoded representation
 */
def encode(value: BigInt): String = {

  val b68Alphabet = Seq(
    '!', '$', '&', ''', '(', ')', '*', '+', ',', '-', '.',
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':',
    '=', '@', 'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L',
    'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y',
    'Z', '_', 'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l',
    'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y',
    'z', '~'
  )
  val len = b68Alphabet.length

  @tailrec
  def encode(v: BigInt, accum: List[Char]): String = {
    if(v == 0 && accum.nonEmpty) {
      accum.mkString
    } else if(v <= 1) {
      (b68Alphabet(v.toInt) :: accum).mkString
    } else {
      val div = v / len
      val mod = v % len
      encode(div, b68Alphabet(mod.toInt) :: accum)
    }
  }

  encode(value, List.empty[Char])
}

Examples

Let's look at some examples of what these encoded identifiers look like.

A Version 4 UUID in its default hexadecimal representation: d879f8b2-5f67-495d-8796-5ce5b06ba238, consists of 36 characters and is equivalent to the Base10 number: 287746559179145117594110380901673968184.

If we instead encode the Base10 number into our Base68 alphabet this yields the string: xDZTz4*0-L0+S5V@4wFZB, which is just 21 characters in length. Compared to the default hexadecimal representation, this is a saving of 15 characters, i.e. ~42%.
An ACID in its default hexadecimal representation: !3cbae8f16217ad44981e5843100092cd582202e69d452eb094480f2d24abdb49, after dropping the ! character (for the Hash Function Type prefix), consists of 64 characters and is equivalent to the Base10 number: 27469012181874709647382529974656231352255408628267256258746051793118097562441.

If we instead encode the Base10 number into our Base68 alphabet this yields the string: 94TTsZ-tsvNkZzcM2jWXYCy,ym4d1XZ8N7).8:N9v6, which is just 42 characters in length. Compared to the default hexadecimal representation, this is a saving of 22 characters, i.e. ~34%.

Our Digital File resource URI would now look like:

A UUID based identifier for a Digital File using Base68:

http://cat.nationalarchives.gov.uk/xDZTz4*0-L0+S5V@4wFZB

An ACID based identifier for a Digital File using Base68 (with the Hash Function Type prefix reinstated):

http://cat.nationalarchives.gov.uk/!94TTsZ-tsvNkZzcM2jWXYCy,ym4d1XZ8N7).8:N9v6

Do these Base68 encoded identifiers look ugly to the human eye? Absolutely!
But... I did not design them for human-use or even to be humane! ;-) Instead, this encoding scheme is designed to fit the identifier for a Digital File (a number), into a URI as succinctly as possible, whilst still yielding a valid URI. I expect these URIs to be used by machines and not humans!

Is this a good solution? It certainly meets the requirements I set out, and it feels computationally neat. In practice, it likely needs further thought and experimentation. Will these identifies prove too cumbersome for developers writing SPAQL queries against our data? Possibly!

Decoding from Base68

Finally, decoding from BaseN to Base10 is performed by a very simple table lookup against our alphabet. Like encoding, how this is achieved is not particularly relevant, but again for both completeness and those readers that are interested, I have included a decoder from Base68 using the Scala programming language:

/**
 * Decodes a Base68 string to a Base10 positive integer.
 *
 * @param str the encoded representation
 * @return a positive integer
 */
def decode(str: String): BigInt = {

  val b68Alphabet = Seq(
    '!', '$', '&', ''', '(', ')', '*', '+', ',', '-', '.',
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':',
    '=', '@', 'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L',
    'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y',
    'Z', '_', 'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l',
    'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y',
    'z', '~'
  )
  val len = b68Alphabet.length

  val indicies = str.map(b68Alphabet.indexOf(_))
  val vs: Seq[BigInt] = for(i <- 0 to indicies.length - 1) yield {
    val exp = (indicies.length - i) - 1
    indicies(i) * BigInt(len).pow(exp)
  }
  vs.reduceLeft(_ + _)
}

Open Source

More complete encoders and decoders can be found on GitHub implemented in both Scala: oci-tools-scala, and TypeScript: oci-tools-typescript.

Numeric Value	Encoded Symbol
0	!
1	$
2	&
3	'
4	(
5	)
6	*
7	+
8	,
9	-
10	.
11	0
12	1
13	2
14	3
15	4
16	5
17	6
18	7
19	8
20	9
21	:
22	=
23	@
24	B
25	C
26	D
27	F
28	G
29	H
30	J
31	K
32	L
33	M
34	N
35	P
36	Q
37	R
38	S
39	T
40	V
41	W
42	X
43	Y
44	Z
45	_
46	b
47	c
48	d
49	f
50	g
51	h
52	j
53	k
54	l
55	m
56	n
57	p
58	q
59	r
60	s
61	t
62	v
63	w
64	x
65	y
66	z
67	~

Numeric Value	Encoded Symbol
0	!
1	$
2	&
3	'
4	(
5	)
6	*
7	+
8	,
9	-
10	.
11	0
12	1
13	2
14	3
15	4
16	5
17	6
18	7
19	8
20	9
21	:
22	=
23	@
24	B
25	C
26	D
27	F
28	G
29	H
30	J
31	K
32	L
33	M
34	N
35	P
36	Q
37	R
38	S
39	T
40	V
41	W
42	X
43	Y
44	Z
45	_
46	b
47	c
48	d
49	f
50	g
51	h
52	j
53	k
54	l
55	m
56	n
57	p
58	q
59	r
60	s
61	t
62	v
63	w
64	x
65	y
66	z
67	~

Numeric Value	Encoded Symbol
0	!
1	$
2	&
3	'
4	(
5	)
6	*
7	+
8	,
9	-
10	.
11	0
12	1
13	2
14	3
15	4
16	5
17	6
18	7
19	8
20	9
21	:
22	=
23	@
24	B
25	C
26	D
27	F
28	G
29	H
30	J
31	K
32	L
33	M
34	N
35	P
36	Q
37	R
38	S
39	T
40	V
41	W
42	X
43	Y
44	Z
45	_
46	b
47	c
48	d
49	f
50	g
51	h
52	j
53	k
54	l
55	m
56	n
57	p
58	q
59	r
60	s
61	t
62	v
63	w
64	x
65	y
66	z
67	~