Preventing Large Range Counts from Crashing eXist-db
The Office of the Historian at the US State Department recently kindly provided the funding to enable me to finally build a modern XQTS (XQuery Test Suite) runner for eXist-db: https://github.com/exist-db/exist-xqts-runner/.
Unfortunately, one of the XQuery tests within the XQTS was causing eXist-db to exhaust all available memory and crash the JVM with the dreaded java.lang.OutOfMemoryError
(#1971). The test in question cbcl-subsequence-011
executes the following unassuming little XQuery:
count(subsequence(1 to 3000000000, -2147483649))
To understand why this was using all of the available memory, I did a little bit of digging to understand how eXist-db evaluates the query. There are three main steps in evaluating the query: (1) a range expression, (2) a subsequence function, and (3) a count function. Ultimately, we will discover that it is the interaction of the first two of these steps that are most important to understand when devising a solution.
The Range Expression
The range expression 1 to 3000000000
is implemented by org.exist.xquery.RangeSequence
. Very sensibly it is lazily evaluated, which means that we don't have to realise all 3000000000
integers which would otherwise cost us ~11.18GB of RAM; it just holds the start and end of the sequence, and provides an iterator for moving forward through the sequence. Something like:
private static class RangeSequenceIterator implements SequenceIterator {
private long current;
private final long end;
public RangeSequenceIterator(final long start, final long end) {
this.current = start;
this.end = end;
}
public Item nextItem() {
if (current <= end) {
return new IntegerValue(current++);
} else {
return null;
}
}
public boolean hasNext() {
return current <= end;
}
}
There is nothing obviously wrong with the above code. So let's keep digging...
The SubSequence Function
The subsequence function is implemented by org.exist.xquery.functions.fn.FunSubSequence
. Unfortunately it was not lazily evaluated. The key parts of its evaluation (abridged):
Sequence tmp = new ValueSequence();
// skip to the first item
Item item;
final SequenceIterator iterator = seq.iterate();
for(int i = 0; i < start; i++) {
item = iterator.nextItem();
}
// get all remaining items
int i=0;
while (iterator.hasNext() && i < length) {
item = iterator.nextItem();
tmp.add(item);
i++;
}
return tmp;
In the code above each call to iterator.nextItem()
is going to call RangeSequenceIterator#nextItem()
which is going to construct a new IntegerValue
object and add a reference to it to the tmp
value sequence.
The exact memory required for each IntegerValue
object in eXist-db is dependent on the length of the integer that it has to represent. For numbers between 1 and 3000000000, the memory use is between 80 and 84 bytes per object. So if we call nextItem
3000000000 times and keep a reference, then the memory required is between 80 * 3000000000 and 84 * 3000000000, i.e.: between 223.52 GB and 234.69 GB! My MacBook Pro which I used for testing only has 16GB of RAM, and so hence the OutOfMemoryError
.
So what can we do about this?
Actually, when considering just the subsequence function, there appears to be very little we can do on the surface. We could perhaps detect the amount of memory available to the JVM and impose a limit on the number of items that can be realized by the subsequence function, raising an error to the user if that is exceeded, rather than running out of memory and crashing. I personally consider that to be an option of last resort, and not a satisfactory solution for general use.
If, however, we consider the larger picture of how the subsequence function is used, it reveals two use-cases:
- As the final expression of an XQuery, where subsequence is used to limit the results that are serialized by the query processor and returned to the user.
In eXist-db when a query returns its results to a user, it does so by returning the results one after another in a streaming fashion. The problem is that the subsequence function is collecting all the results in memory, before they are returned to the user one-by-one. But... we don't need to memorize the full subsequence of the range expression, and then return it to the user. Instead, we could just calculate the first subsequence item (of the range expression) return it to the user, and then repeat that process over and over for each item in the subsequence. - As a sub-expression within a larger XQuery. Just like in our original query at the top of this article, where
subsequence
is a sub-expression inside thecount
function.
When the subequence function is used as a sub-expression, we have no advance idea what the outer-expression will do with the subseqence. For example, if we have a (simplified) query like:subsequence(1, 2, subsequence(1, 1000, $data))
we calculate the first 1000 items of$data
, but then it turned out that we actually only want 2 items. That initial calculation and memorisation of those 1000 items was very wasteful. Instead, if we lazily evaluatedsubsequence
and had the outer-expression pull each item, one after another, in a stream like fashion, we would only need to calculate 2 items from$data
.
So both use-cases identify that we can use some form of streaming and lazy evaluation to avoid storing all the intermittent items described by the subequence function into memory.
Our new lazily-evaluated subsequence function evaluation now looks like:
result = new SubSequence(fromInclusive, toExclusive, seq);
Super simple! eXist-db's fn:subsequence
now returns a new sequence type which is a place holder for the actual (subsequence of items). The key parts of my implementation of org.exist.xquery.value.SubSequence
look like:
public class SubSequence extends AbstractSequence {
private final long fromInclusive;
private final long toExclusive;
private final Sequence sequence;
@Override
public SequenceIterator iterate() throws XPathException {
if (isEmpty()) {
return SequenceIterator.EMPTY_ITERATOR;
}
return new SubSequenceIterator(fromInclusive, toExclusive, sequence);
}
private static class SubSequenceIterator implements SequenceIterator {
private long position;
private final long toExclusive;
private final SequenceIterator iterator;
public SubSequenceIterator(final long fromInclusive, final long toExclusive, final Sequence sequence) throws XPathException {
this.position = 1;
this.toExclusive = toExclusive;
this.iterator = sequence.iterate();
// move sequence iterator to start of sub-sequence
if (position != fromInclusive) {
// move to start
if (iterator.skip(fromInclusive - position) > -1) {
position = fromInclusive;
} else {
// SequenceIterator does not support skipping, we have to iterate through each item :-/
for (; position < fromInclusive; position++) {
iterator.nextItem();
}
}
}
}
@Override
public boolean hasNext() {
return iterator.hasNext() && position < toExclusive;
}
@Override
public Item nextItem() {
if (iterator.hasNext() && position < toExclusive) {
final Item item = iterator.nextItem();
position++;
return item;
}
return null;
}
@Override
public long skippable() {
return Math.min(iterator.skippable(), toExclusive - position);
}
@Override
public long skip(final long n) {
final long seqSkipable = iterator.skippable();
if (seqSkipable == -1) {
return -1; // underlying iterator does not support skipping
}
final long skip = Math.min(n, Math.min(seqSkipable, toExclusive - position));
if (skip <= 0) {
return 0; // nothing to skip
}
final long skipped = iterator.skip(skip);
position += skipped;
return skipped;
}
}
}
Those readers who have been paying close attention, will also notice that we have implemented a pair of new methods SequenceIterator#skippable()
and SequenceIterator#skip(long)
. These methods allow an expression which is operating on a sequence to reposition to the desired starting point in the sequence at no cost, i.e.: without having to call nextItem
repeatedly which we know is expensive as it allocates an object!
The Count Function
The count function is implemented by org.exist.xquery.functions.fn.FunCount
. The key part of its implementation is rather simple, it simply asks the sequence for the number of items it contains:
result = new IntegerValue(sequence.getItemCount());
Before our implementation of the SubSequence
sequence type, the sequence would have been a ValueSequence
, which calculates the number of items like so:
public int getItemCount() {
sortInDocumentOrder();
return size + 1;
}
One might ask, why are we (potentially) sorting the items of the sequence into document order when we just to count them? Well, sortInDocumentOrder()
also potentially does some de-duplication of items! Remember also, that when we were previously using ValueSequence
we could have many thousands of realised IntegerValue
objects in memory, so any deduplication and sorting here is going to be expensive. The remaining simple integer arithmetic of size + 1
is perfectly fine.
Now, that we have implemented a SubSequence
sequence type, we can do much better! As shown above, SubSequence
holds the fromInclusive
and toExclusive
positions of the subsequence within the larger sequence. When the count function is operating on a SubSequence
type, to calculate the item count, it is almost as simple as the formula toExclusive - fromInclusive
. We don't need to perform any expensive deduplication or sorting. My implementation of SubSequence#getItemCount()
looks like:
if (toExclusive < 1) {
return 0;
}
long subseqAvailable = sequence.getItemCountLong() - (fromInclusive - 1);
if (subseqAvailable < 0) {
subseqAvailable = 0;
}
long length = toExclusive - fromInclusive;
if (length < 0) {
length = 0;
}
return Math.min(length, subseqAvailable);
Conclusion
I was able to prepare a set of patches (#1977) for eXist-db which now allow for working with subsequences of unlimited length without running out of memory. This was achieved by switching eXist-db's subsequence function from eager evaluation to lazy evaluation, and having it return a new iterable sequence type (called SubSequence
) which streams the sub-items of the original sequence (e.g. a range expression) one by one.
I believe that the new org.exist.xquery.value.SubSequence
type also opens up some new avenues for further optimisations in eXist-db. For example, it could perhaps be used for optimising positional predicate expressions like: $data[12345]
.