GraalVM UTF-8 Validation

Many, many years ago (in the mid-2000s), when I was not long out of university and in my first job in private industry, we were working on a large business system written in Java which needed to ingest and process millions and millions of EPCs (Energy Performance Certificates). These EPCs were sent to us as XML documents from many different organisations.

Back in those dark years, many (other) developers liked to treat XML as "just strings", and it was not uncommon to see lots of XML elements hardcoded in strings inside Java applications, with much use of substring and concatenation functions to build XML documents. In this dark age, not only was XML hugely abused, but perhaps more sadly, so was character encoding!

We expected to receive valid, well-formed XML using a character encoding that matched the one declared in the XML Declaration, typically UTF-8. We often received XML documents which were neither in the default expected encoding (UTF-8) nor in the encoding that they declared, e.g. ISO 8859-1. Let's not even mention BOMs (Byte Order Marks). This problem was occurring even before we could check whether the XML document was well-formed or valid with respect to any prescribed XML Schemas.

In the best case, the XML parser would complain and tell us that the XML document had some sort of character encoding issue, but much more likely was that parsing would complete and later a developer would be looking at the XML, or a business analyst would be looking at a report or PDF produced from the XML, and wonder what all the strange or corrupted characters were!

To try and catch these problems early on, I wrote a small Open Source tool in my spare time which could validate that a file was valid UTF-8. We used this rather simple tool to pre-validate the incoming XML documents and ensure that their character encodings matched what we expected, immediately rejecting any that failed validation. This allowed us to catch problems early on, and greatly reduce unnecessary processing of bad documents.
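As a rough illustration of what such a pre-validation step involves, the sketch below checks a stream for strict UTF-8 using the JDK's CharsetDecoder, configured to report malformed input rather than silently replace it. This is not the UTF-8 Validator's actual implementation; the class name Utf8Check and its method names are hypothetical, chosen only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of the UTF-8 Validator project.
final class Utf8Check {

    /** Returns true if the whole stream decodes as strict UTF-8. */
    static boolean isValidUtf8(InputStream in) throws IOException {
        // REPORT makes the decoder signal an error instead of
        // substituting U+FFFD for malformed byte sequences.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes = ByteBuffer.allocate(8192);
        CharBuffer chars = CharBuffer.allocate(8192);
        boolean eof = false;
        while (!eof) {
            // Fill the free space after any bytes carried over from the last pass.
            int n = in.read(bytes.array(), bytes.position(), bytes.remaining());
            eof = (n == -1);
            if (!eof) {
                bytes.position(bytes.position() + n);
            }
            bytes.flip();
            CoderResult result;
            do {
                chars.clear(); // decoded characters are discarded, we only validate
                result = decoder.decode(bytes, chars, eof);
                if (result.isError()) {
                    return false;
                }
            } while (result.isOverflow());
            // Keep a trailing incomplete multi-byte sequence for the next read.
            bytes.compact();
        }
        chars.clear();
        return !decoder.flush(chars).isError();
    }

    /** Convenience overload for in-memory data. */
    static boolean isValidUtf8(byte[] data) {
        try {
            return isValidUtf8(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(isValidUtf8(in) ? "valid UTF-8" : "not valid UTF-8");
        }
    }
}
```

A document saved as ISO 8859-1 but containing accented characters would typically fail this check, because a lone byte such as 0xE9 followed by an ASCII letter is not a legal UTF-8 sequence.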

This tool first became publicly available many years later, whilst building the new Digital Archive at The National Archives. I donated this code to their Digital Preservation effort and it appeared on their GitHub: UTF-8 Validator.

Enter GraalVM

Whilst the original UTF-8 Validator was very simple and did the job it set out to do... it was quick, but it wasn't fast! I had often considered spending some time on making it faster, but never found the necessary hours.

I have been loosely following the development of GraalVM for some time and reading the various articles on Hacker News as they came along. GraalVM is a new polyglot VM which can execute code from both Java and other languages. GraalVM incorporates a number of new optimisations above and beyond those afforded by the JVM, such as complex Escape Analysis.

I have wanted to experiment with Graal for some time, and in particular with its native-image tool, which allows you to AOT (Ahead Of Time) compile your Java code into native machine code. The main benefit seems to be that you can avoid the JVM (Java Virtual Machine) startup time, and also any warm-up iterations that are needed for the critical parts of your application to be JIT (Just In Time) compiled to native machine code.

Graal's native-image does have some limitations on what it supports; for example, not all of Java's reflection is supported, and it can't support dynamically loading and unloading classes. My UTF-8 Validator tool seemed like a nice fit for experimenting with Graal, as it is written entirely in Java and has zero external dependencies.

If we ignore all the hard sciency stuff, and just drool over the claims of performance increases in the various articles out there on the web, we would expect to get a nice performance gain just from running our application on GraalVM when compared to the OpenJDK 8 JVM.

My test machine is a fairly stock MacBook Pro mid-2015 with 1TB SSD, 16GB RAM, and macOS High Sierra (10.13.6). I have OpenJDK 1.8.0_172 and GraalVM EE 1.0.0-rc7 (also based on OpenJDK 1.8.0_172). For testing the validation speed I am using a UTF-8 encoded XML file of 269,703,903 bytes (~257 MB) from PubMed.

First, we run the UTF-8 Validator with the OpenJDK JVM:

$ zulu8.30.0.1-jdk8.0.172-macosx_x64/bin/java -jar target/utf8-validator-1.3-SNAPSHOT.jar pubmed-257m.xml
Validating: pubmed-257m.xml
Valid OK (took 18323ms)

Then, for comparison, we run UTF-8 Validator with GraalVM:

$ graalvm-ee-1.0.0-rc7/Contents/Home/bin/java -jar target/utf8-validator-1.3-SNAPSHOT.jar pubmed-257m.xml
Validating: pubmed-257m.xml
Valid OK (took 17365ms)

NOTE: All timings are actually the average of multiple interleaved runs to reduce skew from other intermittent processes.

So we do indeed see that the process runs faster under GraalVM than the JVM, with a ~5.23% reduction in processing time. Not perhaps the increase in performance that we had quite hoped for! So, what about creating a native image?

Graal native-image

We can add the following profile to the pom.xml of UTF-8 Validator to have Maven produce a native image:

<profile>
    <id>native</id>
    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>exec</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <executable>native-image</executable>
                    <workingDirectory>${project.build.directory}</workingDirectory>
                    <arguments>
                        <argument>-da</argument>
                        <argument>--class-path</argument>
                        <classpath/>
                        <argument>uk.gov.nationalarchives.utf8.validator.Utf8ValidateCmd</argument>
                        <argument>utf8validate</argument>
                    </arguments>
                </configuration>
            </plugin>
        </plugins>
    </build>
</profile>

If we now run mvn clean compile package -P native we will find a native executable at the path target/utf8validate. This is pretty amazing, as we now don't even need a JVM to run our application! How does the performance compare?

$ target/utf8validate pubmed-257m.xml
Validating: pubmed-257m.xml
Valid OK (took 28198ms)

Hmm... the native image of the UTF-8 Validator is actually slower. Compared to the JVM it is ~53.89% slower, whilst compared to GraalVM it is ~62.38% slower. This is not quite what I expected. It would seem then that the problem with the performance of our UTF-8 Validator is in fact not anything related to JVM startup time or absent JIT compilation.

One last thing that we can try is using PGO (Profile Guided Optimization) with the native-image tool. Basically, we compile a native image which, when run, produces a profile of the running application; we then compile again, using this profile as a guide, to create a further optimised native image.

To get the profile guide, we need to add the argument --pgo-instrument the first time we invoke the native-image compilation. Then running the profile collecting native image we see much slower performance, which is to be expected as it is gathering and logging the profile data into the file default.iprof:

$ target/utf8validate pubmed-257m.xml
Validating: pubmed-257m.xml
Valid OK (took 43145ms)

$ ls -la default.iprof
-rw-r--r--  1 aretter  wheel  424208  7 Oct 18:09 default.iprof

To use the profile guide, we need to move the default.iprof file to the target/ folder and add the argument --pgo the second time we invoke the native-image compilation:

$ mvn clean compile
$ mv default.iprof target/
$ mvn package -P native

The resultant PGO native image when run gives us:

$ target/utf8validate pubmed-257m.xml
Validating: pubmed-257m.xml
Valid OK (took 21989ms)

Whilst the PGO native image is ~22% faster than the non-PGO native image, sadly it is still slower than both the JVM and GraalVM. Compared to the JVM it is ~20% slower, whilst compared to GraalVM it is ~26.63% slower.
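For anyone wanting to check the arithmetic, the percentages quoted in this post can be reproduced from the raw timings with a few lines of Java. This is only a sketch; the class and method names are made up for the example, and a negative result simply means the candidate run was quicker than the baseline.

```java
// Hypothetical helper for re-deriving the percentages from the timings above.
final class TimingComparison {

    /**
     * Percentage by which the candidate run takes longer than the baseline.
     * A negative value means the candidate was quicker.
     */
    static double percentSlower(double baselineMs, double candidateMs) {
        return (candidateMs - baselineMs) / baselineMs * 100.0;
    }

    public static void main(String[] args) {
        // Timings in milliseconds, as reported in this post.
        double jvm = 18323;
        double graalVm = 17365;
        double nativeImage = 28198;
        double pgoNativeImage = 21989;

        System.out.printf("GraalVM vs JVM:            %.2f%%%n", percentSlower(jvm, graalVm));
        System.out.printf("native image vs JVM:       %.2f%%%n", percentSlower(jvm, nativeImage));
        System.out.printf("native image vs GraalVM:   %.2f%%%n", percentSlower(graalVm, nativeImage));
        System.out.printf("PGO image vs native image: %.2f%%%n", percentSlower(nativeImage, pgoNativeImage));
        System.out.printf("PGO image vs JVM:          %.2f%%%n", percentSlower(jvm, pgoNativeImage));
        System.out.printf("PGO image vs GraalVM:      %.2f%%%n", percentSlower(graalVm, pgoNativeImage));
    }
}
```

Note that every figure is expressed relative to the baseline run's time, so "~22% faster" for the PGO image means it takes roughly 22% less time than the non-PGO native image.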

Conclusion

Whilst it is likely that the UTF-8 Validator spends most of its time in either byte comparisons or disk I/O operations, and it is unlikely that JVM startup is the biggest cost, I had still expected a native-image to be faster than running via the JVM. Exactly why it is not faster is not yet clear to me, and I have rather run out of personal time to investigate this further right now. I will continue to follow the progress of GraalVM as I believe it has a huge amount of potential. I also hope to have some time in the near future to revisit this and understand why the native image is slower.
