We have a rich Java client application that uses EJBs. We are running on JBoss 4.2.3.GA and have just clustered our application. Our code uses a ServiceFactory that calls JNDI to get the remote service interfaces the client needs, and it currently does NOT cache the returned service interface objects. We have an ENVIRONMENT table that contains the initial context factory class and the provider URL. Our provider URL is in the format "node1:1100,node2:1100", and all of this works well.

The problem we have is that when node1 is taken down and our rich Java client code calls the service factory to get a new remote service interface, the failover to node2 seems to take too long. Based on my reading, we send "node1:1100,node2:1100" in the provider URL property, and the system (Naming and/or HA-JNDI) looks for node1 first and fails over to node2 only when node1 is not found. I believe we need to reduce the time that failover takes. Will setting the jnp.timeout property help reduce this time?
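For reference, here is a minimal sketch of how those properties would be passed to the JNDI client. It assumes the JBoss JNP client properties "jnp.timeout" (connection timeout in milliseconds) and "jnp.sotimeout" (socket read timeout in milliseconds); the class name JnpClientEnvironment and the timeout values are hypothetical, and whether this actually shortens the failover is exactly the open question.

```java
import java.util.Properties;
import javax.naming.Context;

// Hypothetical helper: builds the JNDI environment that would be handed to
// new InitialContext(env). The factory class and provider URL are whatever
// the ENVIRONMENT table supplies; nothing here is specific to our app.
final class JnpClientEnvironment {
    private JnpClientEnvironment() {}

    static Properties build(String factoryClass, String providerUrl,
                            int connectTimeoutMs, int readTimeoutMs) {
        Properties env = new Properties();
        env.setProperty(Context.INITIAL_CONTEXT_FACTORY, factoryClass);
        env.setProperty(Context.PROVIDER_URL, providerUrl);
        // Assumption: the JNP client honours these two keys (values in ms).
        env.setProperty("jnp.timeout", Integer.toString(connectTimeoutMs));
        env.setProperty("jnp.sotimeout", Integer.toString(readTimeoutMs));
        return env;
    }
}
```

The result would be used as `new InitialContext(JnpClientEnvironment.build(factory, "node1:1100,node2:1100", 2000, 5000))`, where 2000 and 5000 are placeholder values, not recommendations.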

The problem may be compounded because EVERY time the client needs a service, it goes through this same process, and we seem to see a delay of at least 30 seconds each time.

Are there any suggestions on configuration settings, or better ways to handle this failover in a more timely manner? Our Java client is not very usable when node1 is down, because of the ordering of the provider URL setting.

Since we got no reply on this, we had to hack around it. As part of calling our ServiceFactory to get the remote EJB services, we added code that attempts to open a socket to each node in the provider URL list. Nodes that are down (they can't respond to the socket request) are moved to the end of the provider URL string. As soon as we find a node that is up, that node stays at the front of the provider URL string until it is no longer able to respond to a socket request.
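A minimal sketch of that reordering, not our actual code: the class name ProviderUrlReorderer, the method names, and the probe timeout are all made up for illustration, and it is written for a modern JDK (Java 8+) rather than whatever the client originally ran on. It probes nodes in order, stops at the first one that accepts a TCP connection, and moves the unreachable ones ahead of it to the end.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the socket-probe workaround.
final class ProviderUrlReorderer {
    private ProviderUrlReorderer() {}

    /** True if a TCP connection to "host:port" succeeds within timeoutMs. */
    static boolean isReachable(String node, int timeoutMs) {
        int colon = node.lastIndexOf(':');
        if (colon <= 0) {
            return false; // not in host:port form, treat as down
        }
        try (Socket socket = new Socket()) {
            int port = Integer.parseInt(node.substring(colon + 1).trim());
            String host = node.substring(0, colon).trim();
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException | IllegalArgumentException e) {
            // Refused, timed out, unknown host, or a malformed port.
            return false;
        }
    }

    /** Moves nodes that fail the probe behind the first node that passes it. */
    static String reorder(String providerUrl, int timeoutMs) {
        List<String> nodes = new ArrayList<>();
        for (String part : providerUrl.split(",")) {
            if (!part.trim().isEmpty()) {
                nodes.add(part.trim());
            }
        }
        List<String> down = new ArrayList<>();
        int firstUp = 0;
        while (firstUp < nodes.size() && !isReachable(nodes.get(firstUp), timeoutMs)) {
            down.add(nodes.get(firstUp));
            firstUp++;
        }
        // Nodes after the first live one are left unprobed and in place.
        List<String> result = new ArrayList<>(nodes.subList(firstUp, nodes.size()));
        result.addAll(down);
        return String.join(",", result);
    }
}
```

The ServiceFactory would call something like `ProviderUrlReorderer.reorder(providerUrl, 500)` before creating the InitialContext and keep the returned string for the next call. Note that a successful socket connect only shows the port is open, not that the naming service behind it is healthy.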

It's not pretty but it certainly is an improvement over waiting on the clustering code to failover.

Someone else on our team did the work because I got pulled into other issues, and he said that setting the timeout didn't work. I have not had time to verify that for myself. I am at the point where sometimes I am too busy and need to trust the team's response. In this case, I am a bit skeptical and hopefully will find the time to verify it myself.

Thanks for the log. Our test wasn't quite as involved, but we put a timer around a couple of the statements: before and after creating the InitialContext, and before and after the context.lookup() call. Ultimately it was the context.lookup() call that was taking around 30 seconds when the first node in the provider URL was down. The delay appeared to be linear in the number of nodes, but we only tried with a maximum of three nodes.
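A sketch of that kind of timing wrapper, in case anyone wants to repeat the measurement. This is not the exact code we ran; StepTimer and the JNDI name in the comment are hypothetical.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: runs one step and prints how long it took,
// even if the step throws.
final class StepTimer {
    private StepTimer() {}

    static <T> T timed(String label, Callable<T> step) throws Exception {
        long start = System.nanoTime();
        try {
            return step.call();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(label + " took " + elapsedMs + " ms");
        }
    }
}

// Intended use around the two JNDI calls (env and the name are placeholders):
//   Context ctx = StepTimer.timed("new InitialContext", () -> new InitialContext(env));
//   Object ref  = StepTimer.timed("context.lookup", () -> ctx.lookup("SomeService/remote"));
```

Timing the two calls separately is what showed us that the delay was in the lookup rather than in creating the context.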