com.gemstone.gemfire
Class SystemFailure

This class represents a catastrophic failure of the system,
especially the Java virtual machine. Any class may,
at any time, indicate that a system failure has occurred by calling
initiateFailure(Error) (or, less commonly,
setFailure(Error)).

In practice, the most common type of failure that is likely to be
reported by an otherwise healthy JVM is OutOfMemoryError. However,
GemFire will report any occurrence of VirtualMachineError as
a JVM failure.

When a failure is reported, you must assume that the JVM has broken
its fundamental execution contract with your application.
No programming invariant can be assumed to be true, and your
entire application must be regarded as corrupted.

Failure Hooks

GemFire uses this class to disable its distributed system (group
communication) and any open caches. It also provides a hook that lets
you respond after GemFire has disabled itself.

Failure WatchDog

When this class is loaded, a "watchdog" Thread is started that
periodically checks to see if system corruption has been reported. When
system corruption is detected, this thread proceeds to:

Close GemFire -- Group communication is ceased (this cache
member recuses itself from the distributed system) and the cache
is further poisoned (it is pointless to try to cleanly close it at this
point).

After this has successfully ended, we launch a failure action, a
user-defined Runnable (see setFailureAction(Runnable)).
By default, this Runnable does nothing. If you feel you need to perform
an action before exiting the JVM, this hook gives you a
means of attempting some action. Whatever you attempt should be extremely
simple, since your Java execution environment has been corrupted.

GemStone recommends that you employ
Java Service Wrapper to detect when your JVM exits and to perform
appropriate failure and restart actions.

Finally, if the application has granted the watchdog permission to exit the JVM
(via setExitOK(boolean)), the watchdog calls System.exit(int) with
an argument of 1. If you have not granted this class permission to
close the JVM, you are strongly advised to call System.exit(int)
yourself in your failure action (in the previous step).

Each of these actions will be run exactly once in the above described
order. However, if either step throws any type of error (Throwable),
the watchdog will assume that the JVM is still under duress (esp. an
OutOfMemoryError), will wait a bit, and then retry the failed action.

It bears repeating that you should be very cautious of any Runnables you
ask this class to run. By definition the JVM is very sick
when failure has been signalled.
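
The "run each step once, in order, and retry on any Throwable" behavior
described above can be sketched roughly as follows. This is a simplified
illustration and not GemFire's implementation: the class WatchdogSketch,
its methods, and the retry delay parameter are hypothetical names chosen
for this example.

```java
// Hypothetical illustration of the watchdog's retry policy; NOT GemFire code.
final class WatchdogSketch {
    /**
     * Runs the step until it completes without throwing and returns the
     * number of attempts made. Gives up early if the thread is interrupted.
     */
    static int runStep(Runnable step, long retryDelayMillis) {
        int attempts = 0;
        for (;;) {
            attempts++;
            try {
                step.run();
                return attempts;
            } catch (Throwable t) {
                // The JVM may still be under duress: wait a bit, then retry.
                try {
                    Thread.sleep(retryDelayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return attempts;
                }
            }
        }
    }

    /** Runs the steps in order; a step starts only after the previous one finished. */
    static void runAll(long retryDelayMillis, Runnable... steps) {
        for (Runnable step : steps) {
            runStep(step, retryDelayMillis);
        }
    }
}
```

In this sketch a step that keeps throwing is retried indefinitely, which is
one more reason to keep any failure action as simple as possible.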

Failure Proctor

In addition to the failure watchdog, this class creates a second
thread (the "proctor") that monitors free memory. It does this by examining
free memory, total memory and maximum memory. If the amount of available
memory stays below a given threshold for more than WATCHDOG_WAIT seconds,
the watchdog is notified.

Note that the proctor can be effectively disabled by
setting the failure memory threshold
to a negative value.
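
A memory check of this kind can be sketched with the standard
java.lang.Runtime methods. The formula for "available" memory below
(free heap plus heap not yet claimed from the maximum) is an assumption
made for illustration, and MemoryCheckSketch and its method names are
hypothetical; the proctor's actual computation is not shown in this
document.

```java
// Hypothetical sketch of a proctor-style memory check; NOT GemFire code.
final class MemoryCheckSketch {
    /** Estimates available memory: unused heap plus heap not yet claimed. */
    static long availableMemory(long free, long total, long max) {
        return free + (max - total);
    }

    /**
     * Returns true when available memory is under the threshold. A negative
     * threshold can never be undercut, which effectively disables the check.
     */
    static boolean isBelowThreshold(long available, long thresholdBytes) {
        return available < thresholdBytes;
    }

    /** Samples the running JVM once and compares it against the threshold. */
    static boolean isLowNow(long thresholdBytes) {
        Runtime rt = Runtime.getRuntime();
        long available = availableMemory(rt.freeMemory(), rt.totalMemory(), rt.maxMemory());
        return isBelowThreshold(available, thresholdBytes);
    }
}
```

A monitoring thread would call isLowNow repeatedly and raise an alarm only
when the condition persists, rather than reacting to a single low sample.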

The proctor is a second line of defense, attempting to detect
OutOfMemoryError conditions in circumstances where nothing alerted the
watchdog. For instance, a third-party jar might incorrectly handle this
error and leave your virtual machine in a "stuck" state.

Note that the proctor does not relieve you of the obligation to
follow the best practices in the next section.

Periodically Check For Errors

Check for serious system errors at
appropriate points in your algorithms. You may elect to use
the checkFailure() utility function, but you are
not required to (you could just see if getFailure()
returns a non-null result).

A job processing loop is a good candidate. One example is
com.gemstone.org.jgroups.protocols.UDP#run(),
which implements Thread.run().
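
A periodic check inside a job loop might look like the sketch below. So
that the example runs without GemFire on the classpath, it uses a
hypothetical stand-in, FailureFlagSketch, in place of SystemFailure; real
code would call SystemFailure.checkFailure() at the marked line. Rethrowing
the recorded Error is a choice made for this sketch, and JobLoopSketch is
likewise a hypothetical name.

```java
import java.util.Queue;

// Hypothetical stand-in for the failure flag; NOT the GemFire class.
final class FailureFlagSketch {
    private static volatile Error failure; // volatile: read without locking

    static void setFailure(Error e) { failure = e; }

    static Error getFailure() { return failure; }

    /** Throws the recorded failure, if any (a choice made for this sketch). */
    static void checkFailure() {
        Error e = failure;
        if (e != null) {
            throw e;
        }
    }
}

final class JobLoopSketch {
    /** Runs queued jobs, checking for a reported failure before each one. */
    static int drain(Queue<Runnable> jobs) {
        int done = 0;
        Runnable job;
        while ((job = jobs.poll()) != null) {
            FailureFlagSketch.checkFailure(); // stop promptly if the JVM is corrupted
            job.run();
            done++;
        }
        return done;
    }
}
```

Placing the check at the top of each iteration keeps the cost low while
ensuring that a long-running loop stops soon after a failure is reported.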

Create Logging ThreadGroups

If you create any Thread, a best practice is to catch severe errors
and signal failure appropriately. One way to do this is to create a
ThreadGroup that handles uncaught exceptions by overriding
ThreadGroup.uncaughtException(Thread, Throwable) and to declare
your thread as a member of that ThreadGroup. A significant side
benefit is that most uncaught exceptions can then be detected.
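
A ThreadGroup of this kind can be sketched as follows. The class name
LoggingThreadGroupSketch and its two static fields are hypothetical; the
reportedFailure field stands in for a call such as
SystemFailure.setFailure(Error), so that the example runs without GemFire.

```java
// Hypothetical sketch of a logging ThreadGroup; NOT GemFire code.
final class LoggingThreadGroupSketch extends ThreadGroup {
    // Stand-in for reporting the failure, e.g. via SystemFailure.setFailure(Error).
    static volatile Error reportedFailure;
    // Last uncaught Throwable seen by any thread in a group of this type.
    static volatile Throwable lastUncaught;

    LoggingThreadGroupSketch(String name) {
        super(name);
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // Record a JVM failure *before* anything that allocates objects,
        // in case the cause is an OutOfMemoryError.
        if (e instanceof VirtualMachineError) {
            reportedFailure = (VirtualMachineError) e;
        }
        lastUncaught = e;
        System.err.println("Uncaught exception in thread " + t.getName() + ": " + e);
    }
}
```

A thread joins the group through the Thread(ThreadGroup, Runnable, String)
constructor, for example
new Thread(new LoggingThreadGroupSketch("Worker Threads"), task, "My Thread"),
where task is whatever Runnable you want to run.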

Catches of Error and Throwable Should Check for Failure

Keep in mind that peculiar or flat-out impossible exceptions may
ensue after a VirtualMachineError has been thrown anywhere in
your virtual machine. Whenever you catch Error or Throwable,
you should also make sure that you aren't dealing with a corrupted JVM:

    catch (Throwable t) {
        // Whenever you catch Error or Throwable, you must also
        // catch VirtualMachineError (see above). However, there is
        // _still_ a possibility that you are dealing with a cascading
        // error condition, so you also need to check to see if the JVM
        // is still usable:
        SystemFailure.checkFailure();
        ...
    }

DISTRIBUTION_HALTED_MESSAGE

DISTRIBUTED_SYSTEM_DISCONNECTED_MESSAGE

WATCHDOG_WAIT

public static final int WATCHDOG_WAIT

This is the interval, in seconds, at which the watchdog awakens
to see if the system has been corrupted.

The watchdog will be explicitly awakened by calls to
setFailure(Error) or initiateFailure(Error), but
it will awaken of its own accord periodically to check for failure even
if the above calls do not occur.

This can be set with the system property
gemfire.WATCHDOG_WAIT. The default is 15 sec.
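
One conventional way to read an integer setting like this is
Integer.getInteger, shown in the sketch below. Whether GemFire reads the
property in exactly this way is an assumption, and WatchdogConfigSketch is
a hypothetical name. Because WATCHDOG_WAIT is a static final field, the
property has to be set before the class is initialized, for example with
-Dgemfire.WATCHDOG_WAIT=30 on the java command line.

```java
// Hypothetical sketch of reading a tuning value from a system property.
final class WatchdogConfigSketch {
    /** Returns gemfire.WATCHDOG_WAIT, or 15 when it is unset or not a number. */
    static int watchdogWaitSeconds() {
        return Integer.getInteger("gemfire.WATCHDOG_WAIT", 15).intValue();
    }
}
```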

MEMORY_POLL_INTERVAL

public static final long MEMORY_POLL_INTERVAL

This is the interval, in seconds, at which the proctor
thread will awaken and poll system free memory.
The default is 1 sec. This can be set using the system property
gemfire.SystemFailure.MEMORY_POLL_INTERVAL.

MEMORY_MAX_WAIT

public static final long MEMORY_MAX_WAIT

This is the maximum amount of time, in seconds, that the proctor thread
will tolerate seeing free memory stay below
setFailureMemoryThreshold(long), after which point it will
declare a system failure.
The default is 15 sec. This can be set using the system property
gemfire.SystemFailure.MEMORY_MAX_WAIT.

MONITOR_MEMORY

public static final boolean MONITOR_MEMORY

Flag that determines whether or not we monitor memory on our own.
If this flag is set, we will check freeMemory, invoke GC if free memory
gets low, and start throwing our own OutOfMemoryException if ...

The default is false, so this monitoring is turned off. This monitoring
has been found to be unreliable in non-Sun VMs when the VM is under
stress or behaves in unpredictable ways.

TRACE_CLOSE

setExitOK

Indicate whether it is acceptable to call System.exit(int) after
failure processing has completed.

This may be dynamically modified while the system is running.

Parameters:

newVal - true if it is OK to exit the process

Returns:

the previous value

loadEmergencyClasses

public static void loadEmergencyClasses()

Since it requires object memory to unpack a jar file,
make sure this JVM has loaded the classes necessary for
closure before it becomes necessary to use them.

Note that just touching the class in order to load it
is usually sufficient, so all an implementation needs
to do is to reference the same classes used in
emergencyClose(). Just make sure to do it while
you still have memory to succeed!
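
As a rough illustration of loading classes ahead of time, the sketch below
uses Class.forName, which loads and initializes a class by name. This is a
different technique from simply referencing the classes in code, as
described above, and PreloadSketch is a hypothetical helper rather than
part of GemFire.

```java
// Hypothetical sketch of preloading classes needed by an emergency path.
final class PreloadSketch {
    /** Loads and initializes each named class; returns how many were found. */
    static int preload(String... classNames) {
        int loaded = 0;
        for (String name : classNames) {
            try {
                Class.forName(name); // forces loading and static initialization now
                loaded++;
            } catch (ClassNotFoundException e) {
                // Not on the classpath; nothing to preload for this name.
            }
        }
        return loaded;
    }
}
```

Calling such a helper during startup moves the cost of class loading to a
time when memory is still plentiful.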

emergencyClose

public static void emergencyClose()

Attempt to close any and all GemFire resources.
The contract of this method is that it should not
acquire any synchronization mutexes nor create any objects.

The former is because the system is in an undefined state and
attempting to acquire the mutex may cause a hang.

The latter is because the likelihood is that we are invoking
this method due to memory exhaustion, so any attempt to create
an object will also cause a hang.

This method is not meant to be called directly, although nothing
prevents it. It is public to document the contract
that is implemented by emergencyClose in other
parts of the system.

getFailure

A return value of null indicates that no system failure has yet been
detected.

Object synchronization can implicitly require object creation (fat locks
in JRockit for instance), so the underlying value is not synchronized
(it is a volatile). This means the return value from this call is not
necessarily the first failure reported by the JVM.

Note that even if it were synchronized, it would only be a
proximal indicator near the time that the JVM crashed, and may not
actually reflect the underlying root cause that generated the failure.
For instance, if your JVM is running short of memory, this Throwable is
probably an innocent victim and not the actual allocation (or
series of allocations) that caused your JVM to exhaust memory.

If this function returns a non-null value, keep in mind that the JVM is
very limited. In particular, any attempt to allocate objects may fail
if the original failure was an OutOfMemoryError.

Returns:

the failure, if any

setFailureAction

Sets a user-defined action that is run in the event
that failure has been detected.

This action is run after the GemFire cache has been shut down.
If it throws any error, it will be reattempted indefinitely until it
succeeds. This action may be dynamically modified while the system
is running.
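
A failure action that tries to stay "extremely simple" might look like the
sketch below. FailureActionSketch is a hypothetical class: it writes a
message that was encoded ahead of time, so that run() itself does not need
to build new strings, and it records that it ran. It would be registered
with setFailureAction(Runnable).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a minimal failure action; NOT GemFire code.
final class FailureActionSketch implements Runnable {
    // Encoded when the class is initialized, ideally long before any failure.
    private static final byte[] MESSAGE =
        "JVM failure detected\n".getBytes(StandardCharsets.US_ASCII);

    volatile boolean ran;

    @Override
    public void run() {
        // PrintStream.write(byte[], int, int) does not throw IOException.
        System.err.write(MESSAGE, 0, MESSAGE.length);
        System.err.flush();
        ran = true;
    }
}
```

Creating the instance during startup, rather than at failure time, avoids
having to load the class and allocate the object while the JVM is already
in trouble.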

setFailureMemoryThreshold

public static long setFailureMemoryThreshold(long newVal)

Set the memory threshold below which a system failure will be
reported.
This value may be dynamically modified while the system
is running. The default is 1048576 bytes. This can be set using the
system property gemfire.SystemFailure.chronic_memory_threshold.