6.1 Overview of Setting Up Clusters

Coherence provides a default out-of-box cluster configuration that is used for demonstration purposes. It allows clusters to be quickly created and often requires little or no configuration changes. However, beyond demonstration, the default setup should not be used. Instead, unique clusters should be set up based on the network environment in which they run and based on the requirements of the applications that use them. A cluster that runs in single-server mode can be configured for unit testing and trivial development.

At a minimum, setting up a cluster includes defining the cluster's name and the cluster's multicast address. If multicast is undesirable or unavailable in an environment, then setting up the Well Known Addresses (WKA) feature is required. The rest of the tasks presented in this chapter are typically used when setting up a cluster and are completed when the default settings must be changed.

Clusters are set up within an operational override file (tangosol-coherence-override.xml). Each cluster member uses an override file to specify unique values that override the default configuration that is defined in the operational deployment descriptor. See "Specifying an Operational Configuration File" for detailed information on using an operational override file. In addition, refer to Appendix A, "Operational Configuration Elements," for descriptions and usage information for all the operational elements that are discussed in this chapter.

6.2 Specifying a Cluster's Name

A cluster name is a user-defined name that uniquely identifies a cluster from other clusters that run on the network. Cluster members must specify the same cluster name to join and form a cluster. A cluster member does not start if the wrong name is specified when attempting to join an existing cluster. A unique cluster name is often used with a unique multicast port to create distinct clusters on the same network.

Note:

A cluster member uses a system generated cluster name if a name is not explicitly specified. Using the system generated name (and the out-of-box multicast defaults) increases the chance of having overlapping cluster configurations on the network. This can lead to cluster members accidentally joining an unexpected cluster.

To specify a cluster name, edit the operational override file and add a <cluster-name> element, within the <member-identity> element, that includes the cluster name. For example:
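<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <member-identity>
         <!-- MyCluster is a sample value; use a name that is unique on the network -->
         <cluster-name>MyCluster</cluster-name>
      </member-identity>
   </cluster-config>
</coherence>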

The tangosol.coherence.cluster system property is used to specify the cluster name instead of using the operational override file. For example:

-Dtangosol.coherence.cluster=name

6.3 Specifying a Cluster Member's Identity

A set of identifiers are used to give a cluster member an identity within the cluster. The identity information is used to differentiate cluster members and conveys the members' role within the cluster. Some identifiers are also used by the cluster service when performing cluster tasks. Lastly, the identity information is valuable when displaying management information (for example, JMX) and facilitates interpreting log entries. The following list describes each of the identifiers:

Site Name – the name of the geographic site that hosts the cluster member. The server's domain name is used if no name is specified. For WAN clustering, this value identifies the datacenter where the member is located and can be used as the basis for intelligent routing, load balancing, and disaster recovery planning (that is, the explicit backing up of data on separate geographic sites).

Rack Name – the name of the location within the geographic site that hosts the cluster member. This is often a cage, rack, or bladeframe identifier and can be used as the basis for intelligent routing, load balancing and disaster recovery planning (that is, the explicit backing up of data on separate bladeframes).

Machine Name – the name of the server that hosts the cluster member. The server's host name is used if no name is specified. The name is used as the basis for creating an ID. The cluster service uses the ID to ensure that data are backed up on different computers to prevent single points of failure.

Process Name – the name of the JVM process that hosts the cluster member. The JVM process number is used if no name is specified. The process name makes it possible to easily differentiate among multiple JVMs running on the same computer.

Member Name – the cluster member's unique name. The name makes it easy to differentiate cluster members especially when multiple members run on the same computer or within the same JVM. Always specify a member name (as a best practice) even though it is not required to do so.

Role Name – the cluster member's role in the cluster. The role name allows an application to organize cluster members into specialized roles, such as cache servers and cache clients. Default role names (CoherenceServer for cache servers and application_class_name for cache clients) are used if no role name is specified.

To specify member identity information, edit the operational override file and add the member identity elements within the <member-identity> element as demonstrated below:
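<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <member-identity>
         <!-- all values below are samples -->
         <site-name>pa-1</site-name>
         <rack-name>100A</rack-name>
         <machine-name>prod001</machine-name>
         <process-name>JVM1</process-name>
         <member-name>C1</member-name>
         <role-name>Server</role-name>
      </member-identity>
   </cluster-config>
</coherence>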

6.4 Configuring Multicast Communication

Cluster members use multicast communication to discover other cluster members and when a message must be communicated to multiple members of the cluster. The cluster protocol makes very judicious use of multicast and avoids hazards such as multicast storms. By default, data is only transmitted over multicast if it is intended for more than 25% of the cluster members. The vast majority of traffic is transmitted using unicast even when multicast is enabled. For typical partitioned cache clusters, most transmission is point-to-point, and only cluster membership and partition ownership information is broadcast to the entire cluster.

Multicast communication is configured in an operational override file within the <multicast-listener> node. Many system properties are also available to configure multicast communication when starting a cluster member.

6.4.1 Specifying a Cluster's Multicast Address

A multicast address (IP address and port) can be specified for a cluster member. Cluster members must use the same multicast address and port to join and form a cluster. Distinct clusters on the same network must use different multicast addresses.

A cluster member uses a default multicast address if an address is not explicitly specified. The default value depends on the release version and follows the convention of {build}.{major version}.{minor version}.{patch} for the address and {major version}.{minor version}.{patch} for the port.

Note:

Using the default multicast address and port (and the system generated cluster name) increases the chance of having overlapping cluster configurations on the network. This can lead to cluster members accidentally joining an unexpected cluster. Always use a unique port value to create a distinct cluster.

To specify a cluster multicast address, edit the operational override file and add both an <address> and <port> element and specify the address and port to be used by the cluster member. For example:
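<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <multicast-listener>
         <!-- sample address and port; distinct clusters must use different values -->
         <address>224.3.6.0</address>
         <port>3059</port>
      </multicast-listener>
   </cluster-config>
</coherence>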

The tangosol.coherence.clusteraddress and tangosol.coherence.clusterport system properties are used to specify the cluster multicast address instead of using the operational override file. For example:
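-Dtangosol.coherence.clusteraddress=address -Dtangosol.coherence.clusterport=port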

6.4.1.1 Changing the Multicast Socket Interface

The multicast socket is bound to the same network interface (NIC) as the unicast listener IP address. A different NIC for multicast can be configured but, with rare exception, it is strongly discouraged as it can lead to partial failure of the cluster.

With two NICs, the interface (and thus network) used for multicast traffic is different from the interface (and thus network) used for unicast (UDP/IP) and TCP-ring (TCP/IP) traffic. Communication on one interface (or network) continues to succeed even if the other interface has failed; this scenario prolongs failure detection and failover. Since the clustering protocol handles member (and thus interface) failure, it is preferable to have all communication fail so that a failed member is quickly detected and removed from the cluster.

To change the default multicast network interface, edit the operational override file and add an <interface> element that specifies the IP address to which the multicast socket binds. For example:
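<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <multicast-listener>
         <!-- sample IP address of the NIC to which the multicast socket binds -->
         <interface>192.168.0.1</interface>
      </multicast-listener>
   </cluster-config>
</coherence>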

6.4.2 Disabling Multicast Communication

Multicast traffic may be undesirable or may be disallowed in some network environments. In this case, use the Well Known Addresses feature to prevent Coherence from using multicast. This disables multicast discovery and also disables multicast for all data transfers; unicast (point-to-point) is used instead. Coherence is designed to use point-to-point communication as much as possible, so most application profiles do not see a substantial performance impact. See "Using Well Known Addresses".

Note:

Disabling multicast does put a higher strain on the network. However, this only becomes an issue for large clusters with greater than 100 members.

6.4.3 Specifying the Multicast Time-to-Live

The time-to-live value (TTL) setting designates how far multicast UDP/IP packets can travel on a network. The TTL is expressed in terms of how many hops a packet survives; each network interface, router, and managed switch is considered one hop.

The TTL value should be set to the lowest integer value that works. Setting the value too high can use unnecessary bandwidth on other LAN segments and can even cause the operating system or network devices to disable multicast traffic. Typically, setting the TTL value to 1 works on a simple switched backbone. A value of 2 or more may be required on an advanced backbone with intelligent switching. A value of 0 is used for single server clusters that are used for development and testing. See "Enabling Single-Server Mode" for more information on single server clusters.

To specify the TTL, edit the operational override file and add a <time-to-live> element that includes the TTL value. For example:
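<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <multicast-listener>
         <!-- sample value; use the lowest TTL that works on the network -->
         <time-to-live>3</time-to-live>
      </multicast-listener>
   </cluster-config>
</coherence>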

The tangosol.coherence.ttl system property is used to specify the TTL value instead of using the operational override file. For example:

-Dtangosol.coherence.ttl=3

6.4.4 Specifying the Multicast Join Timeout

The multicast join timeout defines how much time a cluster member waits to join a cluster. If the timeout is reached and an existing cluster is not detected, then the cluster member starts its own cluster and elects itself as the senior cluster member. A short timeout can be specified during development and testing. A timeout of 30 seconds is generally adequate for production environments.

Note:

The first member of the cluster waits the full duration of the join timeout before it assumes the role of the senior member. If the cluster startup timeout is less than the join timeout, then the first member of the cluster fails during cluster startup. The cluster member timeout is specified using the packet publisher timeout (<timeout-milliseconds>). See "packet-delivery".

To specify the join timeout, edit the operational override file and add a <join-timeout-milliseconds> element that includes the timeout value. For example:
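<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <multicast-listener>
         <!-- sample value suitable for development; 30000 is generally adequate for production -->
         <join-timeout-milliseconds>6000</join-timeout-milliseconds>
      </multicast-listener>
   </cluster-config>
</coherence>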

6.4.5 Changing the Multicast Threshold

Cluster members use both multicast and unicast communication when sending cluster packets. The multicast threshold value is used to determine whether to use multicast for packet delivery or unicast. Setting the threshold higher or lower can force a cluster to favor one style of communication over the other. The threshold setting is not used if multicast communication is disabled.

The multicast threshold is a percentage value and is in the range of 1% to 100%. In a cluster of n members, a cluster member that is sending a packet to a set of destination nodes (not counting itself) of size d (in the range of 0 to n-1) sends a packet using multicast only if the following hold true:

The packet is being sent over the network to multiple nodes (d > 1).

The number of nodes is greater than the specified threshold (d > (n - 1) * (threshold/100)).

For example, in a 25 member cluster with a multicast threshold of 25%, a cluster member only uses multicast if the packet is destined for more than 6 members (24 * .25 = 6).

Setting this value to 1 allows the cluster to use multicast for basically all multi-point traffic. Setting this value to 100 forces the cluster to use unicast for all multi-point traffic except for explicit broadcast traffic (for example, cluster heartbeat and discovery) because the 100% threshold is never exceeded. With the setting of 25 (the default) a cluster member sends the packet using unicast if it is destined for less than one-fourth of all nodes, and sends the packet using multicast if it is destined for one-fourth or more of all nodes.

To specify the multicast threshold, edit the operational override file and add a <multicast-threshold-percent> element that includes the threshold value. For example:
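The following sketch assumes the element is placed within the <multicast-listener> element; the value shown is illustrative:

<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <multicast-listener>
         <!-- sample value; the default is 25 -->
         <multicast-threshold-percent>40</multicast-threshold-percent>
      </multicast-listener>
   </cluster-config>
</coherence>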

6.5 Specifying a Cluster Member's Unicast Address

Cluster members use unicast for direct member-to-member (point-to-point) communication, which makes up the majority of communication on the cluster. A default unicast address (IP address and ports) is used but can be specified as required within the <unicast-listener> element.

The unicast listener, as configured out-of-box, selects the unicast address as follows:

address – A cluster member attempts to obtain the IP to bind to using the java.net.InetAddress.getLocalHost() call. Explicitly specify the address on computers with multiple IPs or NICs, if required. Moreover, the localhost setting may not work on systems that define localhost as the loopback address; in that case, the computer name or the specific IP address must be specified. The multicast socket binds to the same interface as defined by this address. See "Changing the Multicast Socket Interface".

ports – A cluster member uses two unicast UDP ports. The default behavior is to attempt to use port 8088 for the first port (port1). If port 8088 is not available, automatic port adjustment is used to select the next available port. The second port (port2) is automatically opened and defaults to the next available port after port1 (port1 + 1 if available). Automatic port adjustment can be disabled. In this case, port1 must be available and the second port is always port1 + 1.

Two UDP ports are used because:

It reduces contention between inbound and outbound traffic and avoids doing both heavy transmits and receives on the same port.

It allows Coherence members to communicate at the optimal packet size based on the Maximum Transmission Unit (MTU) of the operating system. One UDP port is used for large packets, and the other port is for packets with sizes at or under the network MTU. The separation allows for separate packet buffers based on size.

It allows large clusters (> 500 members) to be run without artificially increasing the size of the socket buffers.

To specify a cluster member's unicast address, edit the operational override file and add both an <address> and <port> element (and optionally a <port-auto-adjust> element) and specify the address and port to be used by the cluster member. For example:
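<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <unicast-listener>
         <!-- sample values -->
         <address>192.168.0.1</address>
         <port>8088</port>
         <port-auto-adjust>true</port-auto-adjust>
      </unicast-listener>
   </cluster-config>
</coherence>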

The tangosol.coherence.localhost, tangosol.coherence.localport, and tangosol.coherence.localport.adjust system properties are used to specify the unicast address instead of using the operational override file. For example:
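-Dtangosol.coherence.localhost=address -Dtangosol.coherence.localport=port -Dtangosol.coherence.localport.adjust=true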

6.6 Using Well Known Addresses

The Well Known Addresses (WKA) feature is a mechanism that allows cluster members to discover and join a cluster using unicast instead of multicast. WKA is most often used when multicast networking is undesirable or unavailable in an environment or when an environment is not properly configured to support multicast. All cluster multicast communication is disabled if WKA is enabled.

WKA is enabled by specifying a small subset of cluster members (referred to as WKA members) that are able to start a cluster. The optimal number of WKA members varies based on the cluster size. Generally, WKA members should be about 10% of the cluster. One or two WKA members for each switch is recommended.

WKA members are expected to remain available over the lifetime of the cluster but are not required to be simultaneously active at any point in time. Only one WKA member must be operational for cluster members to discover and join the cluster. In addition, after a cluster member has joined the cluster, it receives the addresses of all cluster members and then broadcasts are performed by individually sending messages to each cluster member. This allows a cluster to operate even if all WKA members are stopped. However, new cluster members are not able to join the cluster unless they themselves are a WKA member or until a WKA member is started. In this case, the senior-most member of the cluster polls the WKA member list and allows the WKA member to rejoin the existing cluster.

There are two ways to specify WKA members. The first method explicitly defines a list of addresses. The second method uses an address provider implementation to get a list of WKA addresses. Both methods are configured in an operational override file within the <well-known-addresses> subelement of the <unicast-listener> element.

6.6.1 Specifying WKA Member Addresses

WKA members are explicitly specified within the <socket-address> element. Any number of <socket-address> elements can be specified and each must define both the address and port of a WKA member by using the <address> and <port> elements. If a cluster member specifies its own address, then the cluster member is a WKA member when it is started. The list of WKA members must be the same on every cluster member to ensure that different cluster members do not operate independently from the rest of the cluster. The following example specifies two WKA members:
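In the following sketch, the addresses are samples and the id attributes simply distinguish the entries:

<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <unicast-listener>
         <well-known-addresses>
            <socket-address id="1">
               <address>192.168.0.100</address>
               <port>8088</port>
            </socket-address>
            <socket-address id="2">
               <address>192.168.0.101</address>
               <port>8088</port>
            </socket-address>
         </well-known-addresses>
      </unicast-listener>
   </cluster-config>
</coherence>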

A single WKA member can be specified using the tangosol.coherence.wka and tangosol.coherence.wka.port system properties instead of specifying the address in an operational override file. The system properties are intended for demonstration and testing scenarios to quickly specify a single WKA member. For example:
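-Dtangosol.coherence.wka=address -Dtangosol.coherence.wka.port=port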

To create additional system properties to specify multiple WKA member addresses, an operational override file must be used to define multiple WKA member addresses and a system-property attribute must be defined for each WKA member address element. The attributes must include the system property names to be used to override the elements. The below example defines two addresses including system properties:
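In the following sketch, the second pair of system property names (tangosol.coherence.wka2 and tangosol.coherence.wka2.port) are user-defined names chosen for this example, and the addresses are samples:

<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <unicast-listener>
         <well-known-addresses>
            <socket-address id="1">
               <address system-property="tangosol.coherence.wka">192.168.0.100</address>
               <port system-property="tangosol.coherence.wka.port">8088</port>
            </socket-address>
            <socket-address id="2">
               <address system-property="tangosol.coherence.wka2">192.168.0.101</address>
               <port system-property="tangosol.coherence.wka2.port">8088</port>
            </socket-address>
         </well-known-addresses>
      </unicast-listener>
   </cluster-config>
</coherence>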

Note:

Defining additional system properties to specify a list of WKA members can be used during testing or in controlled production environments. However, the best practice is to exclusively use an operational override file to specify WKA members in production environments. This ensures that the same list of WKA members exists on each cluster member.

6.6.2 Specifying a WKA Address Provider

A WKA address provider offers a programmatic way to define WKA members. A WKA address provider must implement the com.tangosol.net.AddressProvider interface. Implementations may be as simple as a static list or as complex as using dynamic discovery protocols. The address provider must return a terminating null address to indicate that all available addresses have been returned. The address provider implementation is called when the cluster member starts.

Note:

Implementations must exercise extreme caution: any delay in returning addresses, or any unhandled exception, causes a discovery delay and may cause a complete shutdown of the cluster service on the member. Implementations that involve more expensive operations (for example, a network fetch) may choose to perform them asynchronously by extending the com.tangosol.net.RefreshableAddressProvider class.

To use a WKA address provider implementation, add an <address-provider> element and specify the fully qualified name of the implementation class within the <class-name> element. For example:
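In the following sketch, package.MyAddressProvider is a placeholder for the fully qualified implementation class name:

<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <unicast-listener>
         <well-known-addresses>
            <address-provider>
               <class-name>package.MyAddressProvider</class-name>
            </address-provider>
         </well-known-addresses>
      </unicast-listener>
   </cluster-config>
</coherence>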

As an alternative, the <address-provider> element supports the use of a <class-factory-name> element that is used to specify a factory class for creating AddressProvider instances, and a <method-name> element to specify the static factory method on the factory class that performs object instantiation. The following example gets an address provider instance using the getAddressProvider method on the MyAddressProviderFactory class.
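In the following sketch, the package prefix of the factory class is a placeholder:

<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <unicast-listener>
         <well-known-addresses>
            <address-provider>
               <class-factory-name>package.MyAddressProviderFactory</class-factory-name>
               <method-name>getAddressProvider</method-name>
            </address-provider>
         </well-known-addresses>
      </unicast-listener>
   </cluster-config>
</coherence>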

Any initialization parameters that are required for a class or class factory implementation can be specified using the <init-params> element. Initialization parameters are accessible by implementations that support the com.tangosol.run.xml.XmlConfigurable interface or implementations that include a public constructor with a matching signature. The following example sets the iMaxTime parameter to 2000.
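In the following sketch, package.MyAddressProvider is a placeholder for the implementation class name:

<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <unicast-listener>
         <well-known-addresses>
            <address-provider>
               <class-name>package.MyAddressProvider</class-name>
               <init-params>
                  <init-param>
                     <param-name>iMaxTime</param-name>
                     <param-value>2000</param-value>
                  </init-param>
               </init-params>
            </address-provider>
         </well-known-addresses>
      </unicast-listener>
   </cluster-config>
</coherence>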

6.7 Enabling Single-Server Mode

Single-Server mode is a cluster that is constrained to run on a single computer and does not access the network. Single-Server mode offers a quick way to start and stop a cluster and is typically used during unit testing or development.

To enable single-server mode, edit the operational override file and add a <time-to-live> element that is set to 0 and a unicast <address> element that is set to an address that is routed to loopback. On most computers, setting the address to 127.0.0.1 works. For example:
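<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <unicast-listener>
         <!-- bind to an address that is routed to loopback -->
         <address>127.0.0.1</address>
      </unicast-listener>
      <multicast-listener>
         <!-- a TTL of 0 keeps packets from leaving the computer -->
         <time-to-live>0</time-to-live>
      </multicast-listener>
   </cluster-config>
</coherence>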

The tangosol.coherence.ttl and tangosol.coherence.localhost system properties are used to enable single-server mode instead of using the operational override file. For example:

-Dtangosol.coherence.ttl=0 -Dtangosol.coherence.localhost=127.0.0.1

6.8 Configuring Death Detection

Death detection is a cluster mechanism that quickly detects when a cluster member has failed. Failed cluster members are removed from the cluster and all other cluster members are notified about the departed member. Death detection allows the cluster to differentiate between actual member failure and an unresponsive member, such as the case when a JVM conducts a full garbage collection.

Death detection works by creating a ring of TCP connections between all cluster members. TCP communication is sent on the same port that is used for cluster UDP communication. Each cluster member issues a unicast heartbeat, and the most senior cluster member issues the cluster heartbeat, which is a broadcast message. Each cluster member uses the TCP connection to detect the death of another node within the heartbeat interval. Death detection is enabled by default and is configured within the <tcp-ring-listener> element.

6.8.1 Changing TCP-Ring Settings

Several settings are used to change the default behavior of the TCP-ring listener. These include the number of attempts and the amount of time before determining that a computer that is hosting cluster members has become unreachable. These default to 3 attempts and 15 seconds, respectively. The TCP/IP server socket backlog queue can also be set and defaults to the value used by the operating system.

To change the TCP-ring settings, edit the operational override file and add the following TCP-Ring elements:
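The following sketch uses sample values; the <listen-backlog> element name for the server socket backlog queue is assumed here:

<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <tcp-ring-listener>
         <!-- sample values; the defaults are 15s and 3 -->
         <ip-timeout>25s</ip-timeout>
         <ip-attempts>5</ip-attempts>
         <listen-backlog>10</listen-backlog>
      </tcp-ring-listener>
   </cluster-config>
</coherence>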

Note:

The values of the <ip-timeout> and <ip-attempts> elements should be high enough to insulate against allowable temporary network outages.

6.8.3 Disabling Death Detection

Death detection is enabled by default and must be explicitly disabled. Disabling death detection can alleviate network traffic but also prolongs the detection of failed members. If disabled, a cluster member uses the packet publisher's resend timeout interval to determine that another member has stopped responding to UDP packets. By default, the timeout interval is set to 5 minutes. See "Changing the Packet Resend Timeout" for more details.

To disable death detection, edit the operational override file and add an <enabled> element that is set to false. For example:
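<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <tcp-ring-listener>
         <enabled>false</enabled>
      </tcp-ring-listener>
   </cluster-config>
</coherence>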

6.9.1 Specifying a Cluster Member's Priority

A cluster member's priority is used as the basis for determining tie-breakers between members. If a condition occurs in which one of two members is ejected from the cluster, and in the rare case that it is not possible to objectively determine which of the two is at fault and should be ejected, then the member with the lower priority is ejected.

To specify a cluster member's priority, edit the operational override file and add a <priority> element, within the <member-identity> node, that includes a priority value between 1 and 10 where 1 is the highest priority. For example:
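<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <member-identity>
         <!-- sample value -->
         <priority>1</priority>
      </member-identity>
   </cluster-config>
</coherence>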

The tangosol.coherence.priority system property can also be used to specify a cluster member's priority instead of using the operational override file. For example:

-Dtangosol.coherence.priority=1

6.9.2 Specifying Thread Priority

Multiple cluster components support thread priority. The priority is used as the basis for determining Java thread execution importance. The components include: the multicast listener, the unicast listener, the TCP ring listener, the packet speaker, the packet publisher, and the incoming message handler. The default priority setup gives the packet publisher the highest priority followed by the incoming message handler followed by the remaining components.

Thread priority is specified within each component's configuration element (<unicast-listener>, <multicast-listener>, <packet-speaker>, <packet-publisher>, <tcp-ring-listener>, and <incoming-message-handler> elements, respectively). For example, to specify a thread priority for the unicast listener, edit the operational override file and add a <priority> element, within the <unicast-listener> node, that includes a priority value between 1 and 10 where 10 is the highest priority:
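<?xml version='1.0'?>

<coherence>
   <cluster-config>
      <unicast-listener>
         <!-- sample value -->
         <priority>8</priority>
      </unicast-listener>
   </cluster-config>
</coherence>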