cLabs

At my current gig, we have a custom Java .jar that we also cross-compile to .NET with IKVM. Recently, we ran across a bug that occurred only intermittently with the .NET assembly (and very infrequently - maybe 3 times out of a 100). The .jar never had the problem. Here's some example code that demonstrates the issue:

(Since the production IDataProvider is essentially an InputStream, the MockProvider here is setup to hand over the data in chunks via multiple calls to read(byte[]). Also, the real code is dealing with much larger chunks of data, and uses a 1KiB buffer.)

Trouble with this test, is it's not real enough. The test passes, in Java or .NET 100% of the time. “1,2” is always returned, but sometimes in production, when the problem would occur, we'd only get “2”. So ... where's the problem?

After looking over this a couple of times, my eye was drawn to these lines in the production code:

When the read method reaches the end of the data, the buffer isn't always going to be full - only if the total amount of data is a multiple of 10. Yet the last line in the above snippet appends the complete buffer everytime. So, the last time through, there'll be some leftover bytes from the prior call added to the end of the data.

That would explain the intermittency wouldn't it? Well ... no. Even in our example code, the data would have to be a multiple of 10 most of the time, and the production data was not guaranteed to have any consistency in length, so it'd probably avg. out to 90% of the time there being garbage added to the end of the data. In the production code, the buffer is 1 KiB in length, which means on average, only 1 out of 1,024 times would the data align itself just so. And our test, which doesn't have aligned data, is passing 100% of the time.

Breaking out the debugger at this point proved futile, since the bug would never occur in the debugger. This could either be because it happened so rarely, or because we had a heisenbug on our hands**. It was time for some old school debugging.

...then setup a Ruby script which would run the code in question over and over again, parsing the console output and breaking whenever the console output was not what it should be.

Here's the correct string when output:

<data><bar foo=1/><bar foo=2/></data>2/>

(There's the garbage at the end, that doesn't seem to cause problems for our parser).

And here's an incorrect string when the bug was finally caught in the act:

<data><bar fata><baroo=1/><bar foo=2/></data>2/></

Ahhh ... interesting. Now we're getting somewhere. Naturally, with that string, we'd only be able to parse out the foo=2, since somehow the foo=1 was corrupt. And, I realize I'd been a bit short-sighted with the flaw in the buffer append.

Some more debug output to confirm my new thought:

int bytesRead = data.read(buffer); System.out.println(bytesRead);

New console output now when it works:

1010107-1<data><bar foo=1/><bar foo=2/></data>2/>

And when it doesn't:

10210105-1<data><bar fata><baroo=1/><bar foo=2/></data>2/></

Proof of my short-sightedness.

Earlier, I'd stated, “When the read method reaches the end of the data, the buffer isn't always going to be full.” True, but incomplete. The read method is never guaranteed to return the full buffer of data.

The production code in question was reading data over HTTP. Apparently, the code in Java never performed in such a way as to loop fast enough over the read calls to get a less-than-full buffer in the middle of the run, but the IKVM'd code in .NET would occasionally.

** We had a heisenbug. The reason we could never catch this in the debugger was because it slowed down the loop calling read over and over, and there was always a full packet of data waiting to be read by the time we stepped through that line of code.