This page describes an older version of the product. The latest stable version is 15.2.

Failure Detection

Failure detection time is how long it takes for the space and the client to detect that a failure has occurred. Failure detection consists of two main phases:

The backup space detects that the primary space is down, and takes over as primary.

The client detects that the machine running the primary space is down. If it is running against a clustered space, it routes its requests to the new primary space (the backup space that has just taken over as primary).

One of two main failure scenarios might occur:

Process failure or machine crash

Network cable disconnection

Reducing Failure Detection Time

Configuring failure detection time can help you handle extreme failure scenarios more effectively. For example, in extreme cases of network disconnection, you might want the failover process to take 2-3 seconds.

The following combination of space settings can be used to reduce the failover time. These settings should be used only with a fast network:

Jini Lookup Service Parameters

The LeaseRenewalManager settings in the advanced-space.config file also affect failure detection and recovery:

| Parameter | Parameter Description | Default Value |
|---|---|---|
| maxLeaseDuration | The time the system waits between every lease renewal, for example: if the parameter value is 8000, the system renews the space lease every 8000 [milliseconds]. As this value is reduced, renewal requests are performed more frequently while the service is up, and lease expiration occurs sooner when the service goes down. | 8000 |
| roundTripTime | This parameter instructs the renewal process to begin a certain amount of time (by default, 100 [milliseconds]) before the actual renewal time, thus making sure that the renewal process is successful. Significantly low values might result in failure to renew a lease. Durations of managed leases should exceed the roundTripTime. | 4000 |
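The relationship between the two parameters can be illustrated with a little arithmetic. The following is a minimal sketch under a simplified model, not the product's implementation: the class LeaseTimingSketch and its methods are hypothetical names introduced here, and the model assumes each granted lease lasts exactly maxLeaseDuration and that renewal starts roundTripTime before expiry.

```java
// Illustrative sketch only: LeaseTimingSketch and its methods are hypothetical
// and are not part of the GigaSpaces or Jini API.
final class LeaseTimingSketch {

    /**
     * Delay, counted from the moment a lease is granted, after which the
     * renewal attempt should begin (roundTripTime before the lease expires).
     */
    static long renewalDelayMillis(long maxLeaseDurationMillis, long roundTripTimeMillis) {
        // The text requires lease durations to exceed roundTripTime.
        if (roundTripTimeMillis >= maxLeaseDurationMillis) {
            throw new IllegalArgumentException("lease duration must exceed roundTripTime");
        }
        return maxLeaseDurationMillis - roundTripTimeMillis;
    }

    /**
     * Upper bound, in this simplified model, on how long a lease can remain
     * valid after its holder goes down: the holder fails right after renewing.
     */
    static long worstCaseExpiryMillis(long maxLeaseDurationMillis) {
        return maxLeaseDurationMillis;
    }

    public static void main(String[] args) {
        long maxLeaseDuration = 8000; // default value from the table
        long roundTripTime = 4000;    // default value from the table
        System.out.println("Renewal starts after (ms): "
                + renewalDelayMillis(maxLeaseDuration, roundTripTime));
        System.out.println("Worst-case lease expiry after failure (ms): "
                + worstCaseExpiryMillis(maxLeaseDuration));
    }
}
```

Under this model, lowering maxLeaseDuration shortens the worst-case expiry after a failure, but it also narrows the margin left once roundTripTime is subtracted, which is why very low values risk failed renewals.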

Lookup Service Unicast Discovery Parameters

When a Jini Lookup Service fails and is brought back online, a client (such as a GSC, a space, or a client with a space proxy) needs to rediscover it. The client uses Jini unicast discovery, retrying its connection to the failed remote lookup service. The default unicast retry protocol takes a graduated approach: with each invocation, it increases the amount of time to wait before the next discovery attempt, eventually reaching a maximum interval at which discovery continues to be retried. In this way, the network is not flooded with unicast discovery requests referencing a lookup service that may not be available for quite some time (if ever).

The downside is that this approach may delay the discovery of services that are not brought back up quickly. Discovery can be delayed by as much as 15 minutes. For example, if you have two GSMs and one fails but is brought back up only an hour later, its discovery may take up to ~15 minutes after it has loaded.
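The graduated retry behavior described above can be sketched as a capped backoff schedule. This is a minimal illustration, not the actual Jini retry schedule: the class UnicastRetrySketch, the doubling rule, and the 5-second starting interval are assumptions made for the example. Only the 15-minute maximum comes from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: UnicastRetrySketch is a hypothetical class, and the
// doubling rule is an assumption, not the real Jini unicast retry schedule.
final class UnicastRetrySketch {

    // Maximum wait between attempts (the 15-minute ceiling mentioned in the text).
    static final long MAX_INTERVAL_MILLIS = 15 * 60 * 1000L;

    /** Next wait interval: double the previous one, capped at the maximum. */
    static long nextIntervalMillis(long previousMillis) {
        return Math.min(previousMillis * 2, MAX_INTERVAL_MILLIS);
    }

    /** Wait intervals preceding each of the first {@code attempts} retries. */
    static List<Long> schedule(long initialMillis, int attempts) {
        List<Long> waits = new ArrayList<>();
        long wait = Math.min(initialMillis, MAX_INTERVAL_MILLIS);
        for (int i = 0; i < attempts; i++) {
            waits.add(wait);
            wait = nextIntervalMillis(wait);
        }
        return waits;
    }

    public static void main(String[] args) {
        // Assumed 5-second starting interval, for illustration only.
        for (long wait : schedule(5_000L, 10)) {
            System.out.println("wait " + (wait / 1000) + " s before next attempt");
        }
    }
}
```

Once the schedule reaches the cap, a lookup service that comes back just after a failed attempt is not contacted again until the full maximum interval has elapsed, which is the source of the delay described above.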