Basically, in NamingContext we use a WeakReferenceMap to cache the JNDI server stub information on the client side. So during a server restart, unless the reference has been gc-ed (e.g., if -Dsun.rmi.dgc.client.gcInterval is defaulted to 60 seconds and the server restart takes longer than that), another client request (e.g., ic.lookup(name)) will result in an exception: java.rmi.NoSuchObjectException: no such object in table

So this is obviously a flaw, since we can't depend on the DGC to recycle the server stub.

The fix I propose is to catch NoSuchObjectException specifically in the lookup() code. When we encounter that exception, we flush the server from the cache and do a fresh lookup. It should be a simple fix that won't impact the code base.
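A minimal sketch of this flush-and-retry idea, with the "server" simulated by lambdas (names like Naming, resolveStub, and stubCache are hypothetical stand-ins for the NamingContext internals, not the actual code):

```java
import java.rmi.NoSuchObjectException;
import java.util.HashMap;
import java.util.Map;

public class LookupRetrySketch {
    interface Naming {
        Object lookup(String name) throws Exception;
    }

    // Stand-in for the weak-reference stub cache in NamingContext.
    private final Map<String, Naming> stubCache = new HashMap<>();
    private final Naming freshStub;

    LookupRetrySketch(Naming cachedStub, Naming freshStub) {
        this.stubCache.put("default", cachedStub);
        this.freshStub = freshStub;
    }

    // Stand-in for re-resolving the stub from the restarted server.
    private Naming resolveStub() {
        return freshStub;
    }

    public Object lookup(String name) throws Exception {
        Naming stub = stubCache.computeIfAbsent("default", k -> resolveStub());
        try {
            return stub.lookup(name);
        } catch (NoSuchObjectException stale) {
            // Server was restarted: flush the dead stub and retry once
            // with a freshly resolved one.
            stubCache.remove("default");
            Naming fresh = resolveStub();
            stubCache.put("default", fresh);
            return fresh.lookup(name);
        }
    }

    public static void main(String[] args) throws Exception {
        Naming stale = n -> { throw new NoSuchObjectException("no such object in table"); };
        Naming fresh = n -> "bound:" + n;
        LookupRetrySketch ctx = new LookupRetrySketch(stale, fresh);
        System.out.println(ctx.lookup("jmx/invoker")); // prints "bound:jmx/invoker"
    }
}
```

The key point is that the retry happens transparently inside lookup(), so callers see the fresh binding instead of the NoSuchObjectException.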

Please note that this fix won't cover the case where the home stub is cached by the application (that is, no JNDI lookup every time). The error would still occur, but in that case we can use the RetryInterceptor to catch the error and retry.

This would already be the case for the clustering call (i.e., SingleRetryInterceptor).

(I was able to track down the issue I was concerned about with deploying a 2nd NamingService and confirmed the test didn't cause a problem.)

The problem with HA-JNDI is as you described, Ben -- when HA-JNDI is stopped, the server-side Remote is unexported. The client-side NamingContext has a cached reference to the stub for that Remote; invoking on it fails.

I was able to determine why the test doesn't fail with regular JNDI:

1) For regular JNDI, the server-side Remote object is an instance of NamingServer. That instance is cached in a static field NamingContext.localServer. Thus the server-side object actually survives a restart of the NamingService.

2) NamingService doesn't actually unexport that NamingServer as part of its stop() processing. From org.jnp.server.Main.stop():

Field "isStubExported" is never set to "true", so the unexportObject call never happens.
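The stop() logic described here can be illustrated with a minimal, hypothetical sketch (field and method names only approximate org.jnp.server.Main; this is not the verbatim source):

```java
import java.rmi.Remote;

// Minimal illustration of the bug: the guard flag is never assigned true,
// so the unexport branch is dead code and the NamingServer stays exported.
public class StopSketch {
    Remote theServer = new Remote() {};  // stand-in for the exported NamingServer
    boolean isStubExported;              // never set to true anywhere

    public void stop() {
        if (isStubExported) {
            // In the real code this branch would call
            // UnicastRemoteObject.unexportObject(theServer, false);
            // it is never reached, so the stub survives a "restart".
            theServer = null;
        }
    }

    public static void main(String[] args) {
        StopSketch main = new StopSketch();
        main.stop();
        System.out.println(main.theServer != null); // true: never unexported
    }
}
```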

The effect is that a remote NamingContext still has a valid stub after a restart of the NamingService. If the test actually restarted the server rather than just bouncing a NamingService, it wouldn't work. This is what happens with Ben's manual test above. When I modified the test NamingService so it no longer used the static NamingServer from 1) above, the test failed.

1. OK, that explains why you weren't seeing any error on the non-HA JNDI restart.

2. Yes, since Naming is used everywhere, more extensive testing would be needed. But since this fix only adds an exception catch clause, normal lookups should work just as they do now.

3. As for other logic, I was proposing only the lookup() call, since retry logic already exists there (because of server overload, etc.), and I don't see retry logic in the other naming calls (e.g., bind/unbind). In those calls, we'd throw a CommunicationException directly.

What do people think? I am tempted to fix only lookup(), which should cover the majority of use cases, to minimize the code impact.

Just a note for the record: the fix we're implementing is based on catching java.rmi.NoSuchObjectException. If the naming service uses Remoting, or even the pooled or HTTP invokers, that exception won't be thrown. However, if those invokers are used, the client-side proxy will usually still be valid after a restart (unless the server address or port has changed). Fixing the edge case where the address or port has changed would involve catching and retrying after a large variety of exceptions, which is too big a behavior change.

I went ahead and fixed this for all the Context operations, not just lookup(). If we're going to fix this, we might as well really fix it. It was simple enough to encapsulate the error handling in a method and then apply it via a simple try/catch around each remote call. The class javadoc for NoSuchObjectException states:

If a NoSuchObjectException occurs attempting to invoke a method on a remote object, the call may be retransmitted and still preserve RMI's "at most once" call semantics.
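That pattern of a shared handler wrapped around each remote call can be sketched as follows (RemoteOp, withRetry, and flushServerCache are hypothetical names, not the actual NamingContext methods):

```java
import java.rmi.NoSuchObjectException;

public class RetryWrapperSketch {
    interface RemoteOp<T> {
        T run() throws Exception;
    }

    boolean flushed = false;

    // Stand-in for dropping the cached stub so the next call re-resolves it.
    private void flushServerCache() { flushed = true; }

    // One helper shared by lookup(), bind(), unbind(), rebind(), list(), ...
    <T> T withRetry(RemoteOp<T> op) throws Exception {
        try {
            return op.run();
        } catch (NoSuchObjectException stale) {
            // Per the NoSuchObjectException javadoc, retransmitting the call
            // still preserves RMI's "at most once" semantics.
            flushServerCache();
            return op.run();
        }
    }

    public static void main(String[] args) throws Exception {
        RetryWrapperSketch ctx = new RetryWrapperSketch();
        String result = ctx.withRetry(() -> {
            // Simulated remote call: fails until the cache has been flushed.
            if (!ctx.flushed) throw new NoSuchObjectException("no such object in table");
            return "ok";
        });
        System.out.println(result); // prints "ok" after one transparent retry
    }
}
```

Each Context operation then just wraps its remote invocation in withRetry(...), keeping the eviction-and-retry logic in one place.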

Just so there's no confusion: I built the branch Ben and you created and used that AS (I didn't try to copy fixed JARs, etc.). I also copied client/jbossall-client.jar to the client (it is the only JAR the client uses besides log4j.jar).

So, I performed the following test under the two scenarios listed below:

* Start 2 servers
* Start a client that uses the same proxy over and over
* Make a few requests
* Kill both servers
* Restart the second server
* Make more requests

Case #1: java.naming.provider.url specified as list

This now works as expected.

Case #2: java.naming.provider.url not specified (use discovery)

I could not always get this to work seamlessly.

Log info Jimmy provided:

2007-08-21 23:36:24,956 TRACE [org.jboss.proxy.ejb.RetryInterceptor] Begin reestablishInvokerProxy
2007-08-21 23:36:24,956 TRACE [org.jboss.proxy.ejb.RetryInterceptor] Using retry properties from NamingContextFactory
2007-08-21 23:36:25,057 TRACE [org.jboss.proxy.ejb.RetryInterceptor] Looking for invoker: Hello-RemoteInvoker
2007-08-21 23:36:25,060 TRACE [org.jboss.ha.framework.interfaces.HARMIClient] Invoking on target=HARMIServerImpl_Stub[UnicastRef2 [liveRef: [endpoint:[lo2:1101](remote),objID:[253c198f:1148bd97a0d:-8000, 2]]]]
2007-08-21 23:36:25,064 TRACE [org.jboss.ha.framework.interfaces.HARMIClient] Invoke failed, target=HARMIServerImpl_Stub[UnicastRef2 [liveRef: [endpoint:[lo2:1101](remote),objID:[253c198f:1148bd97a0d:-8000, 2]]]]
java.rmi.NoSuchObjectException: no such object in table
at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:247)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:223)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:126)
at org.jboss.ha.framework.server.HARMIServerImpl_Stub.invoke(Unknown Source)
at org.jboss.ha.framework.interfaces.HARMIClient.invokeRemote(HARMIClient.java:172)
at org.jboss.ha.framework.interfaces.HARMIClient.invoke(HARMIClient.java:267)
at $Proxy0.lookup(Unknown Source)
at org.jnp.interfaces.NamingContext.lookup(NamingContext.java:664)
at org.jnp.interfaces.NamingContext.lookup(NamingContext.java:624)
at javax.naming.InitialContext.lookup(InitialContext.java:351)
at org.jboss.proxy.ejb.RetryInterceptor.reestablishInvokerProxy(RetryInterceptor.java:247)
at org.jboss.proxy.ejb.RetryInterceptor.invoke(RetryInterceptor.java:185)
at org.jboss.proxy.TransactionInterceptor.invoke(TransactionInterceptor.java:61)
at org.jboss.proxy.SecurityInterceptor.invoke(SecurityInterceptor.java:70)
at org.jboss.proxy.ejb.StatelessSessionInterceptor.invoke(StatelessSessionInterceptor.java:112)
at org.jboss.proxy.ClientContainer.invoke(ClientContainer.java:100)
at $Proxy2.sayHello(Unknown Source)
at example.StdInClient.main(Unknown Source)
2007-08-21 23:36:25,066 TRACE [org.jboss.proxy.ejb.RetryInterceptor] Retry attempt 1: Failed to lookup proxy
javax.naming.CommunicationException [Root exception is java.rmi.RemoteException: Service unavailable.]
at org.jnp.interfaces.NamingContext.lookup(NamingContext.java:777)
at org.jnp.interfaces.NamingContext.lookup(NamingContext.java:624)
at javax.naming.InitialContext.lookup(InitialContext.java:351)
at org.jboss.proxy.ejb.RetryInterceptor.reestablishInvokerProxy(RetryInterceptor.java:247)
at org.jboss.proxy.ejb.RetryInterceptor.invoke(RetryInterceptor.java:185)
at org.jboss.proxy.TransactionInterceptor.invoke(TransactionInterceptor.java:61)
at org.jboss.proxy.SecurityInterceptor.invoke(SecurityInterceptor.java:70)
at org.jboss.proxy.ejb.StatelessSessionInterceptor.invoke(StatelessSessionInterceptor.java:112)
at org.jboss.proxy.ClientContainer.invoke(ClientContainer.java:100)
at $Proxy2.sayHello(Unknown Source)
at example.StdInClient.main(Unknown Source)
Caused by: java.rmi.RemoteException: Service unavailable.
at org.jboss.ha.framework.interfaces.HARMIClient.invokeRemote(HARMIClient.java:213)
at org.jboss.ha.framework.interfaces.HARMIClient.invoke(HARMIClient.java:267)
at $Proxy0.lookup(Unknown Source)
at org.jnp.interfaces.NamingContext.lookup(NamingContext.java:672)
... 10 more

The final line of the stack trace shows the failure occurred in the retry after flushing the cache. I'm looking into this, but I'm pretty sure the issue revolves around the fact that NamingContext.removeServer(Hashtable serverEnv) only removes an entry if java.naming.provider.url is specified.
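A minimal sketch of that suspected gap, assuming the cache is keyed by the provider URL taken from the environment (cachedServers and removeServer here are simplified stand-ins for the real NamingContext code):

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;
import javax.naming.Context;

public class RemoveServerSketch {
    static final Map<String, Object> cachedServers = new HashMap<>();

    static void removeServer(Hashtable<String, String> serverEnv) {
        String url = serverEnv.get(Context.PROVIDER_URL);
        if (url != null) {
            cachedServers.remove(url);   // only evicts when a URL was given
        }
        // With discovery, url == null and the dead stub is never flushed.
    }

    public static void main(String[] args) {
        cachedServers.put("jnp://lo2:1099", new Object());

        Hashtable<String, String> discoveryEnv = new Hashtable<>(); // no PROVIDER_URL set
        removeServer(discoveryEnv);
        System.out.println(cachedServers.size()); // prints 1: nothing was evicted
    }
}
```

If this is the cause, the discovery case would need some other key (or a full flush) so the stale entry can actually be removed.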

I'm seeing this problem in JBoss EAP 4.2 CR1. Did this fix get propagated to that code base? We have a Swing client using JBoss Remoting with the RMI transport and unified invoker. We have code to catch potential server problems and then re-authenticate the user, but subsequent lookups, when the server comes back online, display:

Caused by: java.rmi.NoSuchObjectException: no such object in table
at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:247)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:223)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:126)
at org.jboss.remoting.transport.rmi.RMIServerInvoker_Stub.transport(Unknown Source)
at org.jboss.remoting.transport.rmi.RMIClientInvoker.transport(RMIClientInvoker.java:238)
... 14 more

However, as I try to get a handle on all the various numbering schemes, the MANIFEST.MF is probably the most meaningful for those of you digging into the code who need an actual branch/tag to reference. As far as I can tell, 4.2.0.GA_CP01 *is* 4.2.0.CR1, but apparently 4.2.0.GA_CP01 is the more specific way to reference the build we're using...

So, any ideas whether the particular error discussed in this thread is indeed also occurring in this build, within the Remoting code?

OK, it sounds like you probably have 4.2.0.GA_CP01; I don't know why the jar was named "CR1". CR stands for "Candidate for Release", i.e., a release shortly before the GA; that's not something one should use after the GA comes out. CP stands for "Cumulative Patch"; that's a bug-fix release made after GA and provided to subscription customers.

CP releases don't include fixes for every issue found after a GA comes out; we find many customers prefer that changes in CPs be minimal and that fixes be limited to those requested by customers or to critical issues like security patches. Based on that, JBAS-4622 hasn't been ported to the EAP 4.2 CP branch yet. If you're a customer and want to see it fixed in the next EAP 4.2 CP release, I suggest you raise an issue on the Customer Support Portal.

All that said, you're actually interested in JBREM-906 anyway. :-) I don't know enough about the remoting code to know if the same basic problem is there, but from looking at the stack trace, it seems pretty likely. Basically, any RMI-based transport is vulnerable to the problem of a client holding onto an RMI stub that no longer matches the server. Do you need to use an RMI connector?

>> Basically, any RMI-based transport is vulnerable to the problem of a client holding onto an RMI stub that no longer matches the server. Do you need to use an RMI connector?

The short answer to this is no -- in fact, in production we are using the Socket transport, per a JBoss consultant's suggestion that we'd see better performance with it.

The reason we're investigating a switch to RMI is that the RMI transport "properly" handles aborted transactions, in that it immediately throws an exception back to the client when the transaction is rolled back. The Socket transport will not catch that the transaction is rolled back, so the client doesn't get notified until the thread completes (assuming it does). Even then, the client only receives notice that the long-running thread's transaction is dead: after making the client wait for the long-running thread to complete, all it gets is an InvalidStateException because the wrapping transaction has timed out.

So the long answer is that we're trying to find a transport that meets all our needs, which are pretty simple: notify the client immediately when a transaction times out (some type of exception is fine, TransactionRolledBack or whatever) AND handle a server restart.

Any advice on what the best approach is to get one of these transports fixed to meet our needs is very much appreciated.