Testing code for excessively large inputs

When writing unit tests we mostly focus on business correctness. We do our best to exercise the happy path and all the edge cases. We sometimes microbenchmark and measure throughput. But one aspect that is often missed is how our code behaves when the input is excessively large. We test how we handle normal input files, malformed files, empty files, missing files... but what about insanely large input files?

Let's start with a real-life use case. You were given a task to implement GPX (GPS Exchange Format, basically XML) to JSON transformation. I chose GPX for no particular reason; it's just another XML format that you might have come across, e.g. when recording your hike or bicycle ride with a GPS receiver. Also I thought it would be nice to use some standard rather than yet another "people database" in XML. Inside a GPX file there are hundreds of flat <wpt/> entries, each one representing one point in space-time. A representative entry (an illustrative sample modeled on the canonical GPX 1.0 example file) looks more or less like this:
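
<!-- illustrative sample, based on the canonical GPX 1.0 example file -->
<wpt lat="42.438878" lon="-71.119277">
    <ele>44.586548</ele>
    <time>2001-11-28T21:05:28Z</time>
    <name>5066</name>
    <sym>Crossing</sym>
</wpt>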

That's easy! First of all I started by generating JAXB classes using the xjc utility from the JDK and the GPX 1.0 XSD schema. Please note that GPX 1.1 is the most recent version as of this writing, but the examples I got use 1.0. For JSON marshalling I used Jackson. The complete, working and tested program looked more or less like the sketch below (class and method names such as GpxTransformation, loadWaypoints, toCoordinates, dumpJson and LatLong appear in the stack traces later in this post; the method bodies are a reconstruction):
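
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.commons.io.FileUtils;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;

import static java.nio.charset.StandardCharsets.UTF_8;
import static java.util.stream.Collectors.toList;

public class GpxTransformation {

    private static final ObjectMapper jsonMapper = new ObjectMapper();
    private final JAXBContext jaxbContext;

    public GpxTransformation() throws JAXBException {
        jaxbContext = JAXBContext.newInstance(Gpx.class);
    }

    public void transform(File inputFile, File outputFile) throws JAXBException, IOException {
        final List<Gpx.Wpt> waypoints = loadWaypoints(inputFile);
        final List<LatLong> coordinates = toCoordinates(waypoints);
        dumpJson(coordinates, outputFile);
    }

    private List<Gpx.Wpt> loadWaypoints(File inputFile) throws JAXBException, IOException {
        // trap #1: the whole file is read into a single String
        final String xml = FileUtils.readFileToString(inputFile, UTF_8);
        // trap #2: JAXB eagerly maps the whole document to an object graph
        final Gpx gpx = (Gpx) jaxbContext.createUnmarshaller().unmarshal(new StringReader(xml));
        return gpx.getWpt();
    }

    private List<LatLong> toCoordinates(List<Gpx.Wpt> waypoints) {
        // trap #3: yet another full copy of all the waypoints
        return waypoints.stream()
                .filter(wpt -> wpt.getLat() != null && wpt.getLon() != null)
                .map(LatLong::new)
                .collect(toList());
    }

    private void dumpJson(List<LatLong> coordinates, File outputFile) throws IOException {
        // trap #4: JSON rendered to an in-memory String first, then written to disk
        final String json = jsonMapper.writeValueAsString(coordinates);
        FileUtils.writeStringToFile(outputFile, json, UTF_8);
    }
}

class LatLong {
    private final double latitude;
    private final double longitude;

    LatLong(Gpx.Wpt wpt) {
        // lat/lon are BigDecimal in the xjc-generated classes
        this.latitude = wpt.getLat().doubleValue();
        this.longitude = wpt.getLon().doubleValue();
    }

    public double getLatitude()  { return latitude; }
    public double getLongitude() { return longitude; }
}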

Looks fairly good, despite a few traps I left intentionally. We load the GPX XML file, extract the waypoints to a List, and transform that list into lightweight LatLong objects, first filtering out broken waypoints. Finally we dump the List<LatLong> back to disk. However, one day an extremely long bicycle ride crashed our system with an OutOfMemoryError. Do you know what happened? The GPX file uploaded to our application was huge, much bigger than we ever expected to receive. Now look again at the implementation above and count in how many places we allocate more memory than necessary.

But if you want to refactor immediately, stop right there! We want to practice TDD, right? And we want to limit the WTF/minute factor in our code? I have a theory that many "WTFs" are not caused by careless and inexperienced programmers. Often they are the result of late Friday production issues, totally unexpected inputs and unpredicted side effects. Code gets more and more workarounds, hard-to-understand refactorings, logic more complex than anyone might anticipate. Sometimes bad code was not intended, but was required by circumstances we have long forgotten. So if one day you see a null check that can't possibly happen, or hand-written code that could've been replaced by a library - think about the context. That being said, let's start by writing tests proving that our future refactorings are needed. If one day someone "fixes" our code, assuming "this stupid programmer" complicated things without a good reason, automated tests will tell them precisely why.

Our test will simply try to transform insanely big input files. But before we begin we must refactor the original implementation a bit, so that it accepts an InputStream and an OutputStream rather than input and output Files - there is no reason to limit our implementation to the file system. A minimal sketch of the adjusted entry point (the helper methods change their signatures accordingly):
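
public void transform(InputStream input, OutputStream output) throws JAXBException, IOException {
    // loadWaypoints() still reads everything into one String via IOUtils.toString(),
    // as the stack trace in step 1 will confirm
    final List<Gpx.Wpt> waypoints = loadWaypoints(input);
    final List<LatLong> coordinates = toCoordinates(waypoints);
    dumpJson(coordinates, output);
}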

Step 0b: Writing input (stress) test

Input will be generated from scratch using the repeat(byte[] sample, int times) utility developed earlier. We will basically repeat the same <wpt/> item millions of times, wrapping it with a GPX header and footer so that it is well-formed. Normally I would consider placing samples in src/test/resources, but I wanted this code to be self-contained. Notice that we care about neither the actual input nor the output - that is already tested. If the transformation succeeds (we can add some timeout if we want), it's OK. If it fails with any exception, most likely OutOfMemoryError, it's a test failure (error). The original suite is a Spock specification (LargeInputSpec.groovy, as the stack traces below reveal); a rough JUnit 4 sketch of the same idea, assuming repeat() returns an InputStream, looks like this:
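
import static java.nio.charset.StandardCharsets.UTF_8;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Arrays;
import java.util.Collections;
import org.apache.commons.io.output.NullOutputStream;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameter;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class LargeInputTest {

    private static final byte[] HEADER =
            "<gpx xmlns=\"http://www.topografix.com/GPX/1/0\" version=\"1.0\">".getBytes(UTF_8);
    private static final byte[] WPT =
            "<wpt lat=\"54.78\" lon=\"18.41\"><name>Sample</name></wpt>".getBytes(UTF_8);
    private static final byte[] FOOTER = "</gpx>".getBytes(UTF_8);

    @Parameter
    public long sizeInBytes;

    @Parameters(name = "input of size {0} bytes")
    public static Iterable<Object[]> sizes() {
        // 1 KiB, 1 MiB, 10 MiB, 100 MiB, 1 GiB, 8 GiB and 32 GiB
        final long kib = 1024, mib = 1024 * kib, gib = 1024 * mib;
        return Arrays.asList(new Object[][]{
                {kib}, {mib}, {10 * mib}, {100 * mib}, {gib}, {8 * gib}, {32 * gib}
        });
    }

    @Test
    public void shouldNotFailWithOomForLargeInput() throws Exception {
        // repeat() comes from the earlier article and is assumed to return an InputStream
        final InputStream xml = new SequenceInputStream(Collections.enumeration(Arrays.asList(
                new ByteArrayInputStream(HEADER),
                repeat(WPT, (int) (sizeInBytes / WPT.length)),
                new ByteArrayInputStream(FOOTER))));

        new GpxTransformation().transform(xml, new NullOutputStream());
    }
}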

There are actually 7 tests here, running the GPX to JSON transformation for inputs of size: 1 KiB, 1 MiB, 10 MiB, 100 MiB, 1 GiB, 8 GiB and 32 GiB. I ran these tests on 64-bit JDK 8u11 with the following options: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xmx1g. 1 GiB of memory is a lot, but it clearly can't fit the whole input file in memory.

While the small tests pass, inputs of 1 GiB and above fail fast.

Step 1: Avoid keeping whole files in Strings

The stack trace reveals where the problem lies:

java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3326)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
at java.lang.StringBuilder.append(StringBuilder.java:190)
at org.apache.commons.io.output.StringBuilderWriter.write(StringBuilderWriter.java:138)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2002)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1980)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1957)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:1907)
at org.apache.commons.io.IOUtils.toString(IOUtils.java:778)
at com.nurkiewicz.gpx.GpxTransformation.loadWaypoints(GpxTransformation.java:56)
at com.nurkiewicz.gpx.GpxTransformation.transform(GpxTransformation.java:50)
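
As the top frames show, IOUtils.toString() buffers the whole input in a single String, backed by a constantly growing StringBuilder. A sketch of the fix: hand the InputStream straight to the JAXB unmarshaller and skip the intermediate String entirely:

private List<Gpx.Wpt> loadWaypoints(InputStream input) throws JAXBException {
    // no intermediate String: JAXB reads from the stream directly
    final Gpx gpx = (Gpx) jaxbContext.createUnmarshaller().unmarshal(input);
    return gpx.getWpt();
}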

Similarly we fixed dumpJson - it first rendered the entire JSON document into a String and only then copied that String into the OutputStream. A sketch of the fix (assuming Jackson's ObjectMapper from the initial implementation) writes straight to the stream:
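
private void dumpJson(List<LatLong> coordinates, OutputStream output) throws IOException {
    // no intermediate String: Jackson serializes directly into the OutputStream
    jsonMapper.writeValue(output, coordinates);
}

The results are slightly better, but the 1 GiB input fails yet again, this time by entering an endless loop of Full GC and finally throwing: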

java.lang.OutOfMemoryError: Java heap space
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.LeafPropertyLoader.text(LeafPropertyLoader.java:50)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallingContext.text(UnmarshallingContext.java:527)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.SAXConnector.processText(SAXConnector.java:208)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.SAXConnector.endElement(SAXConnector.java:171)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:609)
[...snip...]
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:243)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:214)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:157)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:204)
at com.nurkiewicz.gpx.GpxTransformation.loadWaypoints(GpxTransformation.java:54)
at com.nurkiewicz.gpx.GpxTransformation.transform(GpxTransformation.java:47)

Step 2: (Poorly) replacing JAXB with StAX

We can suspect that the main issue now is XML parsing with JAXB, which always eagerly maps the whole XML file to Java objects. It's easy to imagine why turning a 1 GiB file into an object graph fails. We would like to somehow take more control over reading the XML and consume it in chunks. SAX was traditionally used in such circumstances; however, the push programming model of the SAX API is very inconvenient. SAX uses a callback mechanism, which is invasive and not very readable. StAX (Streaming API for XML), working on a slightly higher level, exposes a pull model instead. It means the client code decides when, and how much, input to consume. This gives us better control over the input and allows more flexibility. To familiarize you with the API, here is a sketch roughly equivalent to loadWaypoints(), but I skip the attributes of <wpt/> which aren't needed later (lat/lon are assumed to be BigDecimal, as in the JAXB-generated classes):
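
private List<Gpx.Wpt> loadWaypoints(InputStream input) throws XMLStreamException {
    // javax.xml.stream.* - the StAX pull API shipped with the JDK
    final XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(input);
    final List<Gpx.Wpt> waypoints = new ArrayList<>();
    while (reader.hasNext()) {
        // pull model: we decide when to ask for the next event
        if (reader.next() == XMLStreamConstants.START_ELEMENT
                && "wpt".equals(reader.getLocalName())) {
            final Gpx.Wpt wpt = new Gpx.Wpt();
            wpt.setLat(new BigDecimal(reader.getAttributeValue(null, "lat")));
            wpt.setLon(new BigDecimal(reader.getAttributeValue(null, "lon")));
            waypoints.add(wpt);   // the remaining trap: still collecting everything
        }
    }
    return waypoints;
}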

See how we explicitly ask the XMLStreamReader for more data? However, the fact that we are using a more low-level API (and a lot more code) doesn't mean it has to be better if used incorrectly. We keep building a huge waypoints list, so it's no surprise we again see an OutOfMemoryError:

java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3204)
at java.util.Arrays.copyOf(Arrays.java:3175)
at java.util.ArrayList.grow(ArrayList.java:246)
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:220)
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:212)
at java.util.ArrayList.add(ArrayList.java:443)
at com.nurkiewicz.gpx.GpxTransformation.loadWaypoints(GpxTransformation.java:65)
at com.nurkiewicz.gpx.GpxTransformation.transform(GpxTransformation.java:52)

Exactly where we anticipated. The good news is that the 1 GiB test passed (with a 1 GiB heap), so we are sort of going in the right direction. But it took 1 minute to complete due to excessive GC.

Step 3: StAX implemented properly

Notice that the implementation using StAX in the previous example would be just as good with SAX. However, the reason I chose StAX is that we can now turn an XML file into an Iterator<Gpx.Wpt>. This iterator will consume the XML file in chunks, lazily and only when asked. We can later consume that iterator lazily as well, which means we no longer keep the whole file in memory. Iterators, while clumsy to work with, are still much better than working with XML directly or with SAX callbacks. A sketch (the class name GpxIterator is made up; Gpx.Wpt comes from the JAXB-generated classes):
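
import com.google.common.collect.AbstractIterator;

class GpxIterator extends AbstractIterator<Gpx.Wpt> {

    private final XMLStreamReader reader;

    GpxIterator(XMLStreamReader reader) {
        this.reader = reader;
    }

    @Override
    protected Gpx.Wpt computeNext() {
        try {
            // consume just enough XML to produce a single <wpt/> entry
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "wpt".equals(reader.getLocalName())) {
                    final Gpx.Wpt wpt = new Gpx.Wpt();
                    wpt.setLat(new BigDecimal(reader.getAttributeValue(null, "lat")));
                    wpt.setLon(new BigDecimal(reader.getAttributeValue(null, "lon")));
                    return wpt;
                }
            }
            return endOfData();   // </gpx> reached, signal the end of the iterator
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }
}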

This is getting complex! I'm using AbstractIterator from Guava to handle the tedious hasNext() state. Every time someone tries to pull the next Gpx.Wpt item from the iterator (or calls hasNext()) we consume a little bit of XML, just enough to return one entry. If XMLStreamReader encounters the end of the XML (the </gpx> tag), we signal the iterator end by returning endOfData(). This is a very handy pattern where XML is read lazily and served via a convenient iterator. This implementation alone consumes a very small, constant amount of memory. However we changed the API from List<Gpx.Wpt> to Iterator<Gpx.Wpt>, which forces changes to the rest of our implementation. A sketch of the adjusted toCoordinates() (the filter and mapping steps are reconstructed from the stack trace below):
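
private List<LatLong> toCoordinates(Iterator<Gpx.Wpt> waypoints) {
    final Spliterator<Gpx.Wpt> spliterator =
            Spliterators.spliteratorUnknownSize(waypoints, Spliterator.ORDERED);
    return StreamSupport.stream(spliterator, false)
            .filter(wpt -> wpt.getLat() != null && wpt.getLon() != null)
            .map(LatLong::new)
            .collect(toList());   // this collect() will turn out to be the next bottleneck
}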

toCoordinates() previously accepted a List<Gpx.Wpt>. Iterators can't be turned into a Stream directly, so we need this clunky transformation through Spliterator. Do you think it's over? The 1 GiB test passes a little bit faster, but the more demanding ones fail just like before:

java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3175)
at java.util.ArrayList.grow(ArrayList.java:246)
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:220)
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:212)
at java.util.ArrayList.add(ArrayList.java:443)
at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:502)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at com.nurkiewicz.gpx.GpxTransformation.toCoordinates(GpxTransformation.java:118)
at com.nurkiewicz.gpx.GpxTransformation.transform(GpxTransformation.java:58)
at com.nurkiewicz.LargeInputSpec.Should not fail with OOM for input of size #readableBytes(LargeInputSpec.groovy:49)

Remember that an OutOfMemoryError is not always thrown from the place that actually consumes the most memory. Luckily that's not the case this time. Look carefully at the bottom: collect(toList()).

Step 4: Avoiding streams and collectors

This is disappointing. Streams and collectors were designed from the ground up to support laziness. However, it's virtually impossible to implement a collector (see also: Introduction to writing custom collectors in Java 8 and Grouping, sampling and batching - custom collectors) from a stream to an iterator effectively, which is a big design flaw. Therefore we must forget about streams altogether and use plain iterators all the way down. Iterators aren't very elegant, but they allow consuming the input item by item, with full control over memory consumption. We need a way to filter() the input iterator, discarding broken items, and to map() entries to another representation. Guava, again, provides a few handy utilities for that, replacing stream() completely. A sketch (LatLong::new reuses the constructor from the earlier sketch):
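
import com.google.common.collect.Iterators;

private Iterator<LatLong> toCoordinates(Iterator<Gpx.Wpt> waypoints) {
    // Guava's Iterators.filter()/transform() wrap the source iterator lazily
    final Iterator<Gpx.Wpt> valid =
            Iterators.filter(waypoints, wpt -> wpt.getLat() != null && wpt.getLon() != null);
    return Iterators.transform(valid, LatLong::new);   // fully lazy, item by item
}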

Memory consumption is low and stable; I think we can safely assume it's constant. Our code processes about 40 MiB/s, so don't be surprised by the almost 14 minutes it took to process 32 GiB. Oh, and did I mention that I ran the last test with -Xmx32M? That's right, processing 32 GiB succeeded without any performance loss, using a thousand times less memory than the input size - and 3000 times less compared to the initial implementation. As a matter of fact the last solution, using iterators, is capable of handling even infinite streams of XML. It's not just a theoretical case; imagine some sort of streaming API that produces a never-ending flow of messages...

Summary (TL;DR)

If you were not patient enough to follow all steps, here are three main takeaways:

Your first goal is simplicity. The initial JAXB implementation was perfectly fine (with minor modifications); keep it like that if your code doesn't have to handle large inputs.

Test your code against insanely large inputs, e.g. using a generated InputStream producing gigabytes of input. A huge data set is just another example of an edge case. Don't test it manually, just once: one careless change or "improvement" might ruin your performance down the road.

Optimization is not an excuse for writing poor code. Notice that our implementation is still composable and easy to follow. If we had gone with SAX and simply inlined all the logic in SAX callbacks, maintainability would have suffered greatly.