Routing for clusters

Diagnosing routing problems

Make sure each network interface is bound to an IP address
(check with ifconfig -a).
Every node should be able to ping every other node.
Make sure tracepath or traceroute
does not show more hops than necessary.
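
The ping check can be scripted. The following is a minimal sketch, assuming
a Linux ping that accepts -c (packet count) and -W (reply timeout in
seconds). The function name unreachable_nodes is hypothetical.

```shell
# Print every host in the argument list that does not answer a single ping.
# Assumes the iputils ping found on Linux: -c 1 sends one packet,
# -W 1 waits at most one second for a reply.
unreachable_nodes() {
    for host in "$@"; do
        if ! ping -c 1 -W 1 "$host" >/dev/null 2>&1; then
            echo "$host"
        fi
    done
}
```

Run it on each node with the names of all the other nodes as arguments,
for example "unreachable_nodes node1 node2" (hypothetical names). It prints
nothing when every listed node answers.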

Make sure you have direct routes to all other nodes.
Any of the following commands prints the routing table. The -n
variants show numeric IP addresses; the others show hostnames:

$ route -n
$ route
or
$ netstat -rn
$ netstat -r
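
To spot routes that are not direct, one option is to filter the numeric
table for entries carrying the G (gateway) flag. This is a sketch, assuming
the usual net-tools column layout (Destination, Gateway, Genmask, Flags, ...)
preceded by two header lines. The function name gateway_routes is
hypothetical.

```shell
# Read "route -n" output on stdin and print "destination gateway" for every
# route that goes through a gateway (Flags column contains G).
# Assumes the net-tools layout: two header lines, Flags in column 4.
gateway_routes() {
    awk 'NR > 2 && $4 ~ /G/ { print $1, $2 }'
}
```

Typical use is "route -n | gateway_routes". Addresses inside the cluster
should not appear among the destinations it prints.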

To diagnose problems with routing on a cluster, examine the
following information on the mayor node and on a worker node.

Make sure that your installation does not run a packet-filtering
firewall by default. If the ipchains or iptables packages are
installed, you can disable firewall support either by uninstalling
those packages or by running lokkit and selecting "no firewall".

See whether hosts are identified explicitly in /etc/hosts.

$ cat /etc/hosts

If so, make sure "hosts" are resolved by "files" first:

$ cat /etc/nsswitch.conf

We use the lookup order "hosts: files nis dns".
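
The lookup order can also be checked from a script. The sketch below assumes
the standard nsswitch.conf layout, with a single line beginning "hosts:".
The function names hosts_lookup_order and files_first are hypothetical.

```shell
# Print the sources on the "hosts:" line of an nsswitch.conf file.
# With no argument (or an empty one), /etc/nsswitch.conf is read.
hosts_lookup_order() {
    awk '$1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print; exit }' \
        "${1:-/etc/nsswitch.conf}"
}

# Succeed only if "files" is the first source for host lookups.
files_first() {
    case "$(hosts_lookup_order "$1")" in
        files|files\ *) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, "files_first || echo 'hosts are not resolved by files first'"
reports a problem on the node where it is run.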

If you are not using DNS, then /etc/resolv.conf
must be empty.
If you are using DNS, then specify the DNS nameserver
by IP address, and specify default domains for completing host names.
Note that DHCP may recreate this file every time you reboot,
overwriting any manual changes.

Internal and external addresses

A special routing problem may occur if a Linux
cluster mayor node has two Ethernet interfaces: one for
an external address and one for an internal cluster IP address.
If the mayor's hostname corresponds to the external address,
then the machine will misidentify itself to other cluster nodes.
Those nodes will route through the external interface, if they
are able to route at all. A conservative fix is to add a route on
the other cluster nodes that uses the internal address of the mayor
node as the first-hop gateway for the external address of the mayor node.

$ route add -net 146.27.172.254 netmask 255.255.255.255 gw 172.16.0.1

where 172.16.0.1 is the internal IP address
of the mayor node and 146.27.172.254 is the external
address of the mayor node.

Even better, set the route on
all cluster nodes to use the internal address of the mayor
as the first-hop gateway for any unknown external address:

$ route add -net 0.0.0.0 netmask 0.0.0.0 gw 172.16.0.1

This fix makes the previous fix unnecessary.
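
To see whether a node is affected in the first place, one can check which
address the mayor's hostname resolves to. This is a minimal sketch, assuming
getent is available (as on glibc-based Linux systems) and that internal
addresses fall in 172.16.0.0/16 as in the examples here. The function names
hostname_address and is_internal are hypothetical.

```shell
# Print the first address that the system resolver returns for a hostname.
hostname_address() {
    getent hosts "$1" | awk '{ print $1; exit }'
}

# Succeed if the address lies in the cluster-internal range.
# Assumption: the internal network is 172.16.0.0/16.
is_internal() {
    case "$1" in
        172.16.*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On the mayor node, a command such as
'is_internal "$(hostname_address "$(hostname)")" || echo external'
prints "external" when the hostname does not map to an internal address.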

Outside machines might not have a route to the cluster nodes
either. To add a route on a PC that needs to reach a cluster node,
set the route to use the external address of the mayor node
as the first-hop gateway to all cluster node addresses
(the command below uses the Windows route syntax):

$ route add 172.16.0.0 mask 255.255.0.0 146.27.172.254

where 172.16.0.0 with a mask of 255.255.0.0
specifies the address range of the cluster nodes,
and 146.27.172.254
is the external address of the cluster mayor node.