systhread.net

Fixing a Problem with Nmap

Ever had an IPv4 address that is supposed to migrate via a high
availability mechanism simply not work? Or, stranger still, had several
addresses of which some migrated and some did not? An experienced network
administrator has probably seen mysterious non-migrating addresses;
presented here is a rather interesting fix for one such case.

The Setup

For simplicity two addresses will be used. The idea is that if a service
or server in a two-node high availability cluster is detected as down via
a heartbeat check, the node that is up takes over the addresses (unless it
is the one already holding them). A few details need to be presented as
well:

The compute nodes are on the same logical but different physical networks
and communicate via a switch.

If node A goes offline, node B assumes the two addresses and the services
associated with them. It is also important to note that these systems
were not in production use, which allowed for a great deal of time to
troubleshoot.

The Symptoms

Simply put, when node A was powered down for testing the addresses were
reassigned to node B; however, they could not be accessed. Node B's
physical address (192.168.1.31) worked fine. Letting system administrator
instinct take over, the following steps were taken:

Services associated with the addresses restarted.

Node B rebooted (while node A remained powered down)

So came the time for a little more in-depth sleuthing.

Troubleshooting

Since it looked like there was a virtual split brain (that is, only one of
the two addresses appeared to be working), the next logical step seemed to
be running a traceroute. A trace to the working address looked fine, but a
trace to the non-working address hung just past the global site selector.
Pings showed a similar pattern: the working address answered and the
non-working address appeared to get no response.

Seeing the traceroute hang led to a suspicion: what if the network didn't
think the address had moved? To validate this, node A was brought back
online but without the shared addresses, which remained assigned to node B.
Another ping was kicked off against the non-working address while, at the
same time, a tcpdump was started on node A looking for the ICMP traffic
generated by the ping. The results were pretty clear: the ICMP requests
showed up on node A even though the address was no longer assigned to it.
Somewhere along the network path a device had the wrong information about
where the IP address resided.
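
The check described above can be sketched roughly as follows. The interface
name (eth0) and the shared address (192.168.1.40) are hypothetical
placeholders, not values from the original setup, and the helper functions
only print the commands so they can be reviewed before being run (tcpdump
typically needs root):

```shell
#!/bin/sh
# Hypothetical values; substitute the real interface and shared address.
IFACE="eth0"
SHARED_IP="192.168.1.40"

# Command to run on node A: show ICMP packets still addressed to the
# shared IP, without resolving names (-n).
capture_cmd() {
    echo "tcpdump -n -i $1 icmp and dst host $2"
}

# Command to run from another machine at the same time: send five
# echo requests to the shared IP.
ping_cmd() {
    echo "ping -c 5 $1"
}

capture_cmd "$IFACE" "$SHARED_IP"
ping_cmd "$SHARED_IP"
```

If echo requests appear in the capture on node A while the address is
assigned to node B, some device along the path still maps the address to
node A.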

The Problem

Actually there was no real error. The Cisco global site selector (GSS)
keeps a sticky MAC address table. It updated one address properly but not
the other. Remember that in reality there were many more addresses being
shared (roughly 20 or so) and only half of them appeared to function after
node B took them over. Node A was returned to operation with the shared
addresses; however, now the problem reversed itself. The previously
non-working address worked and vice versa. What is interesting is that
this held true for all of the shared addresses.

Fixes

The real solution to this particular issue would have been to reduce the
timeout for those particular addresses or use some other failover mechanism
(which is in fact what the solution ended up becoming). Unfortunately, at
the time the systems needed to come back online for application testing,
which raised the question: how to fix it right now?

This is where the network mapper, or Nmap, came into play. The issue at
this point was that with node A online, not all of the addresses had the
current MAC address in the global site selector's table, and the entries
needed to be updated without rebooting the global site selector. Using
Nmap's MAC spoofing, a scan against the global site selector with the
current interface's MAC address did the trick. Following is an example of
what it looked like:
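
A minimal sketch of the general form of such a scan, assuming Nmap's
--spoof-mac option and no other options. The wrapper function and the
RUN_SCAN switch are illustrative additions, not part of the original fix;
by default the command is only printed, since actually sending frames with
a chosen source MAC address requires root privileges:

```shell
#!/bin/sh
# Placeholders from the article; substitute real values before use.
MAC="xx:xx:xx:xx:xx:xx"    # MAC of the NIC holding the shared addresses
GSS="gss_IP_address"       # one of the global site selector addresses

# Print the scan command by default; set RUN_SCAN=1 to execute it.
spoofed_scan() {
    if [ "${RUN_SCAN:-0}" = "1" ]; then
        nmap --spoof-mac "$1" "$2"
    else
        echo "nmap --spoof-mac $1 $2"
    fi
}

spoofed_scan "$MAC" "$GSS"
```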

Where xx:xx:xx:xx:xx:xx was the MAC address of the physical
network card that the shared addresses were on and gss_IP_address
was one of the two global site selector addresses.

While tools like Nmap are great for seeing what is on a network or system,
they can also be used, as demonstrated, not just to aid in troubleshooting
but to actually help fix issues as well.