The customer is running Solaris 8 on an eight-CPU system.
When a Full GC occurs under 1.4.2, their transaction server
throws DataValidationExceptions complaining that the integrity of the data
changed once the collection finished. This causes transaction requests
to be rolled back, so trades go unfilled and money is lost.
At that point the server is using only parallelgc, for cleaning the
young generation.
The only change they report in their trading environment is the switch
from 1.4.1_03 to 1.4.2_03. The transaction server is written in C++,
so they have native threads referencing Java objects. However, turning
on the CMS collector with UseParNewGC seems to hide the problem: they
never see the update exceptions after a Full GC when using these options
with 1.4.2.
The customer has run a couple of tests with the UseParNewGC collector for several
hours and did not see any UpdateExceptions from their transaction
server after a Full GC occurred. The heap options are listed below:
command: /usr/local/j2sdk1.4.2_01/bin/java
-server -showversion -Xms512m
-Xmx512m -XX:NewSize=500m -XX:MaxNewSize=500m -XX:InitialSurvivorRatio=4
-XX:TargetSurvivorRatio=100 -XX:+PrintCompilation -XX:+UseParNewGC
-XX:MaxPermSize=256MB -XX:PermSize=3m -XX:MinPermHeapExpansion=1m
-XX:MaxPermHeapExpansion=10m -XX:-UseAdaptiveSizePolicy
-XX:+DisableExplicitGC -XX:+PrintTenuringDistribution
-XX:+PrintHeapAtGC -verbose:gc -XX:+PrintGCTimeStamps -Xnoclassgc
Why would there be such a difference in behavior?
Background
-----------
The application is closer to 98% Java and 2% C++.
The C++ code handles some of their ORB transport code (using
a C-API to a 3rd-party sockets vendor). When the Java code talks
to another process, it calls down to the C++ layer. This C++ code
establishes the connection to the outside, and creates a
"Receive" thread to receive messages from the newly created socket.
Or, if another process initiates contact, the same receive thread is
created for incoming messages.
When a new message is received from a remote process, very
minimal processing is done at the C++ layer before the JNI up-call
takes place. The Java code invoked from JNI processes the ORB message
and figures out which handling thread (a pure Java thread) the
message should be dispatched to. The message is just put onto
an internal queue, and then the dispatch thread picks it up and
calls application code (such as plug-in code) to actually do
application-level work.
Objects in Question
-------------------
So, the C++ stuff is pretty thin, and just interacts with the
older C-API to the 3rd party vendor software (which itself is
really just a layer on top of sockets). The C++ threads that
are created are connected to the JVM so they can make calls to
the VM to create buffers, which the incoming messages are copied
into. That buffer is basically the only Java object that the
C++ thread creates, and it is passed up during the JNI up-call.
This buffer is copied into separate objects created by the ORB
code, so after the JNI call, the C++-created buffers are no
longer referenced.
So, the objects that get modified (unexpectedly) after the few
Full GCs in 1.4.2 are neither created by the C++ code nor
stored or referenced in it.
----------
The GC output and log information is available in the attachments.

EVALUATION
###@###.### 2004-04-18
PSPromotionManager::oop_promotion_failed() pushes an object that
could not be promoted onto the claimed stack:
    claimed_stack()->push(obj);
at or near line 345. The claimed stack is a per-thread stack that has
a fixed size. The code does not check whether the push succeeded. If
the push fails, the object may not be scanned during promotion-failure
handling.
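
A minimal sketch of this failure mode, not the HotSpot source: FixedStack, scanned_after_push, and the capacity are hypothetical stand-ins for the per-thread claimed stack. It shows how ignoring the result of a push onto a full fixed-size stack silently drops entries, so a later drain never visits them. The checked variant spills failed pushes into a growable overflow list, which is an assumed remedy for illustration and not necessarily the fix that was built.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a fixed-size per-thread stack.
// push() reports failure instead of growing when the stack is full.
template <typename T, std::size_t Capacity>
class FixedStack {
 public:
  bool push(const T& value) {
    if (size_ == Capacity) return false;  // full: the element is NOT stored
    data_[size_++] = value;
    return true;
  }
  bool pop(T& out) {
    if (size_ == 0) return false;
    out = data_[--size_];
    return true;
  }

 private:
  T data_[Capacity];
  std::size_t size_ = 0;
};

// Pushes ids 0..count-1, then drains the stack and returns how many ids
// were visited ("scanned"). With check_push == false the result of push()
// is ignored, as in the call quoted above. With check_push == true a failed
// push goes to a growable overflow list (assumed fix, for illustration).
template <std::size_t Capacity>
std::size_t scanned_after_push(int count, bool check_push) {
  assert(count >= 0);
  FixedStack<int, Capacity> stack;
  std::vector<int> overflow;
  for (int id = 0; id < count; ++id) {
    if (check_push) {
      if (!stack.push(id)) overflow.push_back(id);
    } else {
      stack.push(id);  // result ignored: ids beyond Capacity are lost
    }
  }
  std::size_t scanned = 0;
  int id = 0;
  while (stack.pop(id)) ++scanned;
  scanned += overflow.size();
  return scanned;
}
```

With a capacity of 4 and 10 pushes, the unchecked variant visits only 4 ids while the checked variant visits all 10. The lost ids correspond to objects that are never scanned.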
An application could then hold a reference to an object in eden or from-space that
does not get updated to the object's final location. This is consistent with some
of CBOE's observations. However, this should sometimes lead to a VM crash, which
is not observed.
This problem exists in 1.4.2 and 1.5.
1.4.1 does not use a claimed stack; it uses a GrowableArray for
saving such objects, so it would not have this problem.
I've built a 1.4.2 VM with a fix to see if it will eliminate the
problem.
###@###.### 2004-04-19
This bug can lead to two copies of the same object: one in to-space and the
other in eden or from-space. Suppose object A has a reference
to X and A is scanned, so that X is copied from eden to to-space.
Say object B also has a reference to X but is not scanned. Then
B's reference still points to the copy in eden. When the full collection
comes along to clean up after the failed promotion, A and B are
both scanned and their respective copies of X (both appearing
alive) survive the collection, resulting in two copies.
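
The A/B/X scenario can be modeled with a toy heap. This is a sketch under stated assumptions, not VM code: Obj, ToyHeap, live_copies, and copies_of_x are hypothetical names. Scanning is reduced to "copy the referent once, record a forwarding pointer, update the holder", and the full collection is reduced to counting the distinct reachable instances of X.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <vector>

struct Obj {
  int payload = 0;
  Obj* ref = nullptr;        // single outgoing reference
  Obj* forwardee = nullptr;  // set once this object has been copied
};

struct ToyHeap {
  std::vector<std::unique_ptr<Obj>> storage;  // owns every object

  Obj* allocate(int payload, Obj* ref = nullptr) {
    storage.emplace_back(new Obj());
    Obj* o = storage.back().get();
    o->payload = payload;
    o->ref = ref;
    return o;
  }

  // Models scanning one holder during a young collection: copy its
  // referent (or reuse the existing copy) and update the holder's reference.
  void scan(Obj* holder) {
    assert(holder != nullptr);
    Obj* target = holder->ref;
    if (target == nullptr) return;
    if (target->forwardee == nullptr) {
      target->forwardee = allocate(target->payload, target->ref);
    }
    holder->ref = target->forwardee;
  }
};

// Models what a later full collection would keep alive: counts distinct
// objects reachable from the roots that carry the given payload.
std::size_t live_copies(const std::vector<Obj*>& roots, int payload) {
  std::set<const Obj*> seen;
  std::vector<const Obj*> todo(roots.begin(), roots.end());
  std::size_t copies = 0;
  while (!todo.empty()) {
    const Obj* o = todo.back();
    todo.pop_back();
    if (o == nullptr || !seen.insert(o).second) continue;
    if (o->payload == payload) ++copies;
    todo.push_back(o->ref);
  }
  return copies;
}

// A and B both reference X. A is always scanned; B only if scan_b is true.
std::size_t copies_of_x(bool scan_b) {
  ToyHeap heap;
  const int kX = 42;  // payload that identifies X and its copies
  Obj* x = heap.allocate(kX);
  Obj* a = heap.allocate(1, x);
  Obj* b = heap.allocate(2, x);
  heap.scan(a);
  if (scan_b) heap.scan(b);
  return live_copies({a, b}, kX);
}
```

When B is skipped, copies_of_x(false) finds two live instances of X, so an update made through A is invisible through B. When B is scanned as well, both holders agree on a single copy.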