8.3 From Raw I/O to Smokin' I/O

So far we have looked only at
general points about I/O and logging. Now we look at an example of
tuning I/O performance. The example consists of reading lines from a
large file. This section was inspired by an article from Sun
Engineering,[4] though I go somewhat further along the tuning cycle.

The initial attempt at file I/O might be to use the
FileInputStream to read through a file. Note that
DataInputStream has a readLine( ) method (now
deprecated because it is byte-based rather than
char-based, but ignore that for the moment), so
you wrap the FileInputStream with the
DataInputStream, and run. The code looks like:
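A sketch of the idea follows; the method name and the
doSomething( ) per-line processing hook are placeholders:

public static void readUnbuffered(String filename)
    throws IOException
{
    //No buffering: each readLine( ) call ends up pulling
    //the file through one byte at a time
    DataInputStream in = new DataInputStream(
        new FileInputStream(filename));
    String line;
    while ((line = in.readLine()) != null)
        doSomething(line);
    in.close();
}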

For these timing tests, I use two different files: a 1.8 MB file with
about 20,000 lines (long lines) and a one-third megabyte file
with about 34,000 lines (short lines). I test using several VMs to
show the variations across VMs and the challenges in improving
performance across different runtime environments. To make
comparisons simpler, I report the times as normalized to 100% for the
JDK 1.2.2 VM with JIT. The long-line case and the short-line case are
normalized separately. Tests are averages across at least three test
runs. For the baseline test, I have the following chart (see Table 8-1 and Table 8-2 for full results). Note that the server mode results show
the second run of tests, after HotSpot has had a chance to apply its
optimizations.

Short lines                 1.1.8  1.2.2[5]  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Unbuffered input stream       86%      100%   109%           98%   103%          115%        146%

Long lines                  1.1.8  1.2.2[5]  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Unbuffered input stream       83%      100%   108%           95%    95%          110%        170%

[5] The short-line 1.2 and long-line 1.2 cases have
been separately normalized to 100%. All short-line times are relative
to the short-line 1.2, and all long-line times are relative to the
long-line 1.2.

The first test in absolute times is really dreadful because you are
executing I/O one byte at a time. This performance is the result of
using a plain FileInputStream without buffering
the I/O, because the process is completely I/O-bound. For this
reason, I expected the absolute times of the various VMs to be
similar, since the CPU is not the bottleneck. But curiously, they
vary. Perhaps the underlying native-call implementation differs
between VM versions, but I am not interested enough to
spend time working out why there should be differences for the
unbuffered case. After all, no one uses unbuffered I/O. Everyone
knows you should buffer your I/O (except when memory is really at a
premium, as in an embedded system).

So let's immediately move to wrap the
FileInputStream with a
BufferedInputStream.[6]

[6] Buffering I/O does not require the use of buffered classes. You
can buffer I/O directly from the FileInputStream
class and other low-level classes by passing arrays to the
read( ) and write( ) methods.
This means you need to handle buffer overflows yourself.
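The change itself is one line of wrapping; a sketch along the same
lines as the previous one, again with placeholder names:

public static void readBuffered(String filename)
    throws IOException
{
    //Identical to before, except the stream is now buffered
    DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream(filename)));
    String line;
    while ((line = in.readLine()) != null)
        doSomething(line);
    in.close();
}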

The times are already faster by an order of magnitude, as
you can see in the following charts:

Short lines                 1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Unbuffered input stream       86%   100%   109%           98%   103%          115%        146%
Buffered input stream          7%     6%     3%            3%     3%            2%         21%

Long lines                  1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Unbuffered input stream       83%   100%   108%           95%    95%          110%        170%
Buffered input stream          6%     5%     3%            3%     3%            2%         24%

The lesson is clear, if you haven't already had it
drummed home somewhere else: buffered I/O performs much better than
unbuffered I/O. Having established that buffered I/O is better than
unbuffered, you renormalize your times on the buffered I/O case so
that you can compare any improvements against the normal case.

So far, we have used only the default
buffer, which is a 2048-byte buffer (contrary to the JDK 1.1.6
documentation, which states it is 512 bytes; always check the source
on easily changeable things like this). Perhaps a larger buffer would
be better. Let's try 8192 bytes:
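The only change is to pass the size to the
BufferedInputStream constructor; a sketch:

DataInputStream in = new DataInputStream(
    new BufferedInputStream(new FileInputStream(filename), 8192));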

The variations are large, but there is a mostly consistent pattern
(see Table 8-1 and Table 8-2 for the full numbers): the 8K buffer
doesn't seem to be significantly better than the default.

Let's get back to the fact that we are using a
deprecated method, readLine( ). You should really be using
Readers instead of
InputStreams, according to the Javadoc, for full
portability, etc. Let's move to
Readers and ascertain what this change costs us:
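The Reader-based equivalent wraps a FileReader in a
BufferedReader; a sketch, with the same placeholder names as before:

public static void readWithReader(String filename)
    throws IOException
{
    //BufferedReader supplies the char-based readLine( )
    BufferedReader in = new BufferedReader(new FileReader(filename));
    String line;
    while ((line = in.readLine()) != null)
        doSomething(line);
    in.close();
}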

The results (summarized in Table 8-1 and Table 8-2) tell us that
someone at Sun spent time optimizing Readers. You can reasonably use
Readers in most situations where you would have
used an InputStream. Some situations can show a
performance decrease, but generally there is a performance increase.
Note that if you are running your own versions of these tests, you
need to repeat some measurements within the VM, even in plain JIT
VMs, to eliminate the JIT compiler overhead.

Now let's get down to
some real tuning. So far we have just been working from bad coding to
good working practice. The final version so far uses buffered
Reader classes for I/O, as recommended by Sun. Can
we do better? Well of course, but now let's get down
and dirty. You know from general tuning practices that creating
objects is overhead you should try to avoid. Up until now, we have
used the readLine( ) method, which returns a
string. Suppose you work on that string and then discard it, as is
the typical situation. You would do better to avoid the
String creation altogether. Also, if you want to
process the String, then for performance purposes
you are better off working directly on the underlying
char array. Working on
char arrays is quicker since you can avoid the
String method overhead (or, more likely, the need
to copy the String into a char
array buffer to work on it). See Chapter 5 for more details on this
technique.

Basically, this means that you need to implement the
readLine( ) functionality with your own buffer
while passing the buffer to the method that does the string
processing. The following implementation uses its own
char array buffer. It reads in characters to fill
the buffer, then runs through the buffer looking for ends of lines.
Each time the end of a line is found, the buffer, together with the
start and end index of the line in that buffer, is passed to the
doSomethingWith( ) method for processing. This
implementation avoids both the String-creation
overhead and the subsequent String-processing
overhead, but these are not included in any timings here. The only
complication comes when you reach the end of the buffer and you need
to fill it with the next chunk from the file, but you also need to
retain the line fragment from the end of the last chunk. It is
unlikely your 8192-char chunk will end exactly on
an end of line, so there are almost always some characters left to be
carried over to the next chunk. To handle this, simply copy the
characters to the beginning of the buffer and read the next chunk
into the buffer starting from after those characters. The commented
code looks like this:

public static void myReader(String string)
    throws IOException
{
    //Do the processing myself, directly from a FileReader
    //But don't create strings for each line, just leave it
    //as a char array
    FileReader in = new FileReader(string);
    int defaultBufferSize = 8192;
    int nextChar = 0;
    char[] buffer = new char[defaultBufferSize];
    char c;
    int leftover;
    int length_read;
    int startLineIdx = 0;

    //First fill the buffer once before we start
    int nChars = in.read(buffer, 0, defaultBufferSize);
    boolean checkFirstOfChunk = false;

    for(;;)
    {
        //Work through the buffer looking for end of line characters.
        //Note that the JDK does the eol search as follows:
        //It hardcodes both of the characters \r and \n as end
        //of line characters, and considers either to signify the
        //end of the line. In addition, if the end of line character
        //is determined to be \r, and the next character is \n,
        //it winds past the \n. This way it allows the reading of
        //lines from files written on any of the three systems
        //currently supported (Unix with \n, Windows with \r\n,
        //and Mac with \r), even if you are not running on any of these.
        for (; nextChar < nChars; nextChar++)
        {
            if (((c = buffer[nextChar]) == '\n') || (c == '\r'))
            {
                //We found a line, so pass it for processing
                doSomethingWith(buffer, startLineIdx, nextChar-1);

                //And then increment the cursors. nextChar is
                //automatically incremented by the loop,
                //so only need to worry if 'c' is \r
                if (c == '\r')
                {
                    //need to consider if we are at end of buffer
                    if (nextChar == (nChars - 1))
                        checkFirstOfChunk = true;
                    else if (buffer[nextChar+1] == '\n')
                        nextChar++;
                }
                startLineIdx = nextChar + 1;
            }
        }

        leftover = 0;
        if (startLineIdx < nChars)
        {
            //We have some characters left over at the end of the chunk.
            //So carry them over to the beginning of the next chunk.
            leftover = nChars - startLineIdx;
            System.arraycopy(buffer, startLineIdx, buffer, 0, leftover);
        }
        do
        {
            length_read = in.read(buffer, leftover,
                                  buffer.length - leftover);
        } while (length_read == 0);
        if (length_read > 0)
        {
            nextChar -= nChars;
            nChars = leftover + length_read;
            startLineIdx = nextChar;
            if (checkFirstOfChunk)
            {
                checkFirstOfChunk = false;
                if (buffer[0] == '\n')
                {
                    nextChar++;
                    startLineIdx = nextChar;
                }
            }
        }
        else
        {   /* EOF */
            in.close();
            return;
        }
    }
}

The following chart shows the new times:

Short lines                 1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Buffered input stream        132%   100%    48%           60%    48%           36%        373%
Buffered reader              192%    96%    56%           24%    60%           24%        590%
Custom-built reader           28%    24%    36%          164%    34%           24%        420%

Long lines                  1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Buffered input stream        123%   100%    65%           59%    57%           37%        487%
Buffered reader               53%    43%    43%           20%    52%           24%        582%
Custom-built reader           28%    28%    38%           37%    37%           22%        547%

The timings are the best so far,
with the single exception of the 1.3.1 server mode VM, and most times
are significantly better than before.[7] The 1.3.1 server mode
results are quite peculiar and look as if a compiler bug of some sort
caused HotSpot to generate inefficient code. If you were running this
code under a 1.3.1 server VM, you would need to track down what
produced the anomalous times by changing a little bit of the code at
a time until you produced a workaround.

[7] Note that the
HotSpot
timings
are, once again, for the second run of the repeated tests. No other
VMs exhibited consistent variations between the first and second run
tests. See Table 8-1 and Table 8-2 for the full set of results.

You can try one more thing: performing the
byte-to-char
conversion yourself. The code comes from
Chapter 7, where we looked at this conversion in detail.
The changes are straightforward: change the
FileReader to a FileInputStream
and add a byte array buffer of the same size as
the char array buffer:
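A sketch of the changed lines (byte_buffer is my name for the added
buffer; the rest of myReader( ), including the scanning loop, stays
as before):

FileInputStream in = new FileInputStream(string);
byte[] byte_buffer = new byte[defaultBufferSize];
char[] buffer = new char[defaultBufferSize];

//... scanning loop unchanged; only the fills change ...

//Read bytes, then widen them to chars: for ISO 8859_1 each
//byte value maps directly to the char with the same value
do
{
    length_read = in.read(byte_buffer, 0, buffer.length - leftover);
    for (int i = 0; i < length_read; i++)
        buffer[leftover + i] = (char) (byte_buffer[i] & 0xFF);
} while (length_read == 0);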

Measuring the performance with these changes, the times are now
significantly better in almost every case:

Short lines                 1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Buffered input stream        132%   100%    48%           60%    48%           36%        373%
Custom-built reader           28%    24%    36%          164%    34%           24%        420%
Custom reader and converter   12%    12%    20%           12%    20%           12%        120%

Long lines                  1.1.8  1.2.2  1.3.1  1.3.1-server  1.4.0  1.4.0-server  1.4.0-Xint
Buffered input stream        123%   100%    65%           59%    57%           37%        487%
Custom-built reader           28%    28%    38%           37%    37%           22%        547%
Custom reader and converter    9%    10%    19%           29%    17%           14%        125%

Once again, all the VMs produce their best time except for the 1.3.1
server mode long-line case (which was faster with the
BufferedReader).[8] All
the times are now under one second, even on a slow machine. Subsecond
times are notoriously variable, although in my tests the results were
fairly consistent.

[8] This shows that
HotSpot is quite variable with its optimizations.
HotSpot sometimes makes an unoptimized loop faster, and sometimes the
manually unrolled loop comes out faster. Tables 8-1 and 8-2 show
HotSpot producing both faster and slower times
for the same manually unrolled loop, depending on the data being
processed (i.e., short lines or long lines).

We have, however, hardcoded the ISO 8859_1 type of
byte-to-char conversion rather
than supporting the generic case (where the conversion type is
specified as a property). But this conversion represents a common
class of character-encoding conversions, and you could
fall back on the method used in the previous test where the
conversion is specified differently (in the System
property file.encoding). Often, you will read from
files you know and whose format you understand and can predict. In
those cases, building in the appropriate encoding is not a problem.
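Falling back to the generic case needs no custom conversion code at
all; a sketch using the standard chained streams, with the encoding
taken from the file.encoding property:

String enc = System.getProperty("file.encoding");
BufferedReader in = new BufferedReader(
    new InputStreamReader(new FileInputStream(filename), enc));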

Using a buffered reader is adequate for most purposes. But we have
seen that it is possible to speed up I/O even further if
you're willing to spend the effort. Avoiding the
creation of intermediate Strings gives you a good
gain. This is true for both reading and writing and allows you to
work on the char arrays directly. Working directly
on char arrays is usually better for performance,
but also more work. In specialized cases, you might want to consider
taking control of every aspect of the I/O right down to the
byte-to-char encoding, but for
this you need to consider how to maintain compatibility with the JDK.

Table 8-1 and Table 8-2
summarize all the results from these experiments.