Monday, April 26, 2004

Here's what we think is happening, thanks to Evan's suggestion to use gdb and Carol's assistance in recreating the loopback's IP address and route:

The first attempt by Java to open a socket is preceded by an initialization of its socket code.
The socket initialization code calls java.net.PlainDatagramSocketImpl.leave, as indicated in this stack trace from gdb:

The call in question seems to be to make the machine leave the multicast group. See here.

Leaving the multicast group must involve connecting to loopback on a random port (the port it chooses changes, but is always above 32768) and then shoving some bytes through. That's my theory. Update: this is almost certainly rtnetlink.

It just sits there trying to communicate with itself, and times out after ~3.5 minutes.
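For what it's worth, there's a crude shell-level way to see whether loopback traffic works at all (the port number here is arbitrary, and `/dev/udp` is a bash feature, not POSIX sh). This won't reproduce the 3.5-minute Java timeout, but on a box where lo has no address or route, the write fails immediately with "Network is unreachable" instead of succeeding instantly:

```shell
# Time a single datagram fired at loopback. On a healthy box this
# returns in milliseconds; with no address/route on lo, bash's
# connect() fails right away.
time bash -c 'echo ping > /dev/udp/127.0.0.1/33333'
```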

It appears the problem occurs because the loopback device is not configured with an IP address and is not in the route table.

After we issued the following commands, everything worked just fine and the 3.5-minute delay dropped to an 18 ms delay:

ip addr add 127.0.0.1/8 dev lo

/sbin/route add -net 127.0.0.0/8 dev lo
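For the record, here's a quick read-only check (no root needed) for whether a box is in the bad state before applying the fix; `ip` comes from the iproute2 package:

```shell
# lo should show an "inet 127.0.0.1/8" line; if it doesn't, Java's
# socket initialization will hang exactly as described above.
ip addr show lo

# Loopback routes live in the kernel's "local" routing table, so plain
# "ip route show" won't list them:
ip route show table local
```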

We found a most interesting thread on the Java forum that seems to mirror our problem. But the guy apparently never figured it out. Maybe I should post a solution there.

We went through the following other possible problems:

I always thought it was some kind of nfs file locking problem, but that's not the case at all.

We redirected the logging output to a local disk on the machine. That didn't help at all.

We saw the process was blocked reading /dev/random; /dev/random must use loopback to generate random numbers. To get around this we used /dev/urandom, which is not as random but doesn't block. The connection delay persisted, but the blocking read might explain why it took 50 minutes to send the first message on the SSL connection: once we removed the block on /dev/random, the 50-minute message-send delay seemed to go away.
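A couple of quick checks for the /dev/random theory: the kernel exposes the size of its entropy pool under /proc, and /dev/urandom can always be read without blocking. (I believe the JVM can also be pointed at urandom via the java.security.egd system property, though I haven't checked which JVM versions honor it.)

```shell
# How much entropy the kernel currently has pooled; reads from
# /dev/random block when this runs out.
cat /proc/sys/kernel/random/entropy_avail

# /dev/urandom never blocks; pull 16 bytes to prove it:
head -c 16 /dev/urandom | od -An -tx1
```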

We wrote some code to try to connect without SSL, but I'm not convinced that ever worked. It's still in the code and can be activated with a configuration setting, and I tested that configuration setting in my client.

We then thought it was a delay caused by doing a reverse DNS lookup on the peer's IP address--likely so it could do certificate validation/throw nice exceptions. We saw in gdb that the stack trace was deep in some Java code that was trying to do some kind of DNS operation. /etc/resolv.conf was empty, so we added our name server to it and rebooted the machine. That didn't help, but the stack trace changed.
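A sanity check we could have run at this point: getent exercises the same NSS lookup path libc uses (/etc/hosts first, then whatever resolv.conf points at), so a slow or hanging lookup shows up immediately:

```shell
# An empty resolv.conf means DNS queries have nowhere useful to go:
cat /etc/resolv.conf

# localhost should resolve instantly from /etc/hosts; if this stalls,
# the resolver path itself is broken:
time getent hosts localhost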

Then the stack trace was stuck in Java_java_net_PlainDatagramSocketImpl_leave; I thought that might have still been some DNS hosage, so I changed /etc/hosts to include the addresses of the peers. That didn't help and didn't change the stack trace.

Finally we typed /sbin/ifconfig. That showed us that lo did not have an IP address.

On some of our Linux boxes we would sometimes see a 3-minute delay between an attempt to open a socket and a successful connection. This did not make sense... but I eventually determined that it was caused by the loopback device not having an address, or not having a route in the route table.

/dev/random would block for about 3 minutes, probably because it depends on loopback to get its results.

Java trying to do a reverse DNS lookup would block for about 3 minutes, probably because it was trying to get results from 0.0.0.0, because /etc/resolv.conf was empty, and 0.0.0.0 was being interpreted as 127.0.0.1... Update: see this post about a related fix...

Here is a thread where I replied with this information; the thread also suggests other solutions, perhaps to the same or similar problems.

Now if I could just figure out how that Linux box got into that situation...