Mark Hlawatschek wrote:
> I'm seeing the same problem in a 4.7 cluster.
> Chrissie, is there a solution or another bz for the problem?
>
As I said (rather unhelpfully) in my email I have no more information
than that. No-one has provided any more detail and I haven't managed to
reproduce it. Not that I've had much chance with the work on RHEL5 and
later code I've been doing.
Actually, providing useful information for this is almost impossibly
difficult anyway unless you're some sort of magician with kernel cores.
We have one at Red Hat but, as you might imagine, he's in very high
demand ;-)
About the only thing that a normal mortal might be able to do that could
be useful is a tcpdump of the node talking successfully to the cluster
followed by it going down and failing to join. If you're feeling a
little more adventurous then compiling the code with DEBUG_COMMS enabled
would help enormously too.
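For anyone gathering such a capture, here is a minimal sketch (not part of any cluster tooling) of how its text output could be summarised to show who is talking to whom. It assumes raw tcpdump text such as that produced by something like `tcpdump -v udp port 6809`, with 6809 being the port visible in thijn's capture below. The function name `summarize_flows` and the sample data are purely illustrative.

```python
import re
from collections import Counter

# Matches the "src.port > dst.port: UDP, length N" part of a tcpdump text line.
FLOW_RE = re.compile(r"(\S+)\.(\d+) > (\S+)\.(\d+): UDP, length (\d+)")


def summarize_flows(capture_text, port=6809):
    """Count UDP packets per (source, destination) host pair on one port."""
    # Long tcpdump lines are often wrapped when pasted, so collapse all
    # whitespace into single spaces before matching.
    flat = " ".join(capture_text.split())
    flows = Counter()
    for src, sport, dst, dport, _length in FLOW_RE.findall(flat):
        if int(sport) == port and int(dport) == port:
            flows[(src, dst)] += 1
    return flows


# Two packets in the same shape as the capture quoted later in this thread.
SAMPLE = """\
14:51:28.703240 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
17, length: 56) server2.production.loc.6809 >
broadcast.production.loc.6809: UDP, length 28
14:51:28.703277 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
17, length: 140) server1.production.loc.6809 >
server2.production.loc.6809: UDP, length 112
"""

for (src, dst), count in sorted(summarize_flows(SAMPLE).items()):
    print(f"{count:4d}  {src} -> {dst}")
```

A node that only ever appears as a source towards the broadcast address, and never answers its peer directly, would stand out in such a summary. That is only a first hint of where to look, not a diagnosis.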
It's possible that the workaround program I posted in the BZ might
mitigate the problem a little, but without knowing much more about what
is happening I can't honestly be sure.
Chrissie
> -Mark
>
> On Wednesday 11 February 2009 10:17:30 Chrissie Caulfield wrote:
>> thijn wrote:
>>> Hi,
>>>
>>> I have the following problem.
>>> CMAN: removing node [server1] from the cluster : Missed too many
>>> heartbeats
>>> When the server comes back up:
>>> Feb 10 14:43:58 server1 kernel: CMAN: sending membership request
>>> after which it will try to join until the end of time.
>>>
>>> In the current problem, server2 is active and server1 is unable to
>>> join the cluster.
>>>
>>> The setup is a two-server cluster.
>>> We have had the problem on several clusters.
>>> We usually "fixed" it by rebooting the other node, at which point the
>>> cluster would repair itself and all ran smoothly from then on.
>>> Naturally this disrupts any services running on the cluster, and it's
>>> not really a solution that will win prizes.
>>> The problem is that server1 (the problem one) is in an inquorate state
>>> and we are unable to get it to a quorate state, nor do we see why this
>>> is the case.
>>> We tried to reproduce the problem on a test setup, but were unable to.
>>>
>>> So we decided to try to find a way to fix the state of the cluster using
>>> the tools the system provides.
>>>
>>> The problem we see presents itself after a fence action by either node.
>>> When we bring down both nodes to stabilize the issue, the cluster
>>> becomes healthy, and after that we can reboot either node and it
>>> will rejoin the cluster.
>>> It seems the problem presents itself when "pulling the plug" out of the
>>> server.
>>> We run on IBM Xservers using the SA-adapter as a fence device.
>>> The fence device is in a different subnet than the subnet on which the
>>> cluster communicates.
>>> Both fence devices are on the same subnet/vlan.
>>>
>>> CentOS release 4.6 (Final)
>>> Linux server2 2.6.9-67.ELsmp #1 SMP Fri Nov 16 12:48:03 EST 2007 i686
>>> i686 i386 GNU/Linux
>>> cman_tool 1.0.17 (built Mar 20 2007 17:10:52)
>>> Copyright (C) Red Hat, Inc. 2004 All rights reserved.
>>>
>>> The versions of all libraries, packages, kernel modules and everything
>>> else the GFS cluster depends on are identical on both nodes.
>>>
>>> Cluster.conf
>>> [root server1 log]# cat /etc/cluster/cluster.conf
>>> <?xml version="1.0"?>
>>> <cluster config_version="3" name="NAME_cluster">
>>>   <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>>   <clusternodes>
>>>     <clusternode name="server1.production.loc" votes="1">
>>>       <fence>
>>>         <method name="1">
>>>           <device name="saserver1"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="server2.production.loc" votes="1">
>>>       <fence>
>>>         <method name="1">
>>>           <device name="saserver2"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>   </clusternodes>
>>>   <cman expected_votes="1" two_node="1"/>
>>>   <fencedevices>
>>>     <fencedevice agent="fence_rsa" ipaddr="10.13.110.114"
>>>                  login="saadapter" name="saserver1" passwd="XXXXXXX"/>
>>>     <fencedevice agent="fence_rsa" ipaddr="10.13.110.115"
>>>                  login="saadapter" name="saserver2" passwd="XXXXXXX"/>
>>>   </fencedevices>
>>>   <rm>
>>>     <failoverdomains/>
>>>     <resources/>
>>>   </rm>
>>> </cluster>
>>>
>>> [root server1 log]# cat /etc/hosts
>>> 127.0.0.1 localhost.localdomain localhost
>>>
>>> Both servers are able to ping each other and also the broadcast address,
>>> so there is no firewall filtering UDP packets.
>>> When I tcpdump the line I see traffic going both ways:
>>>
>>> Both servers are in the same vlan
>>> 14:51:28.703240 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
>>> 17, length: 56) server2.production.loc.6809 >
>>> broadcast.production.loc.6809: UDP, length 28
>>> 14:51:28.703277 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
>>> 17, length: 140) server1.production.loc.6809 >
>>> server2.production.loc.6809: UDP, length 112
>>> 14:51:33.703240 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
>>> 17, length: 56) server2.production.loc.6809 >
>>> broadcast.production.loc.6809: UDP, length 28
>>> 14:51:33.703310 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
>>> 17, length: 140) server1.production.loc.6809 >
>>> server2.production.loc.6809: UDP, length 112
>>>
>>> Is this normal network behavior when a cluster is inquorate?
>>> I see that server1 is talking to server2, but server2 is only talking in
>>> broadcasts.
>>>
>>> When I start or try to join the cluster:
>>> Feb 10 09:36:06 server1 cman: cman_tool: Node is already active failed
>>>
>>> [root server1 ~]# cman_tool status
>>> Protocol version: 5.0.1
>>> Config version: 3
>>> Cluster name: NAME_cluster
>>> Cluster ID: 64692
>>> Cluster Member: No
>>> Membership state: Joining
>>>
>>> [root server2 log]# cman_tool status
>>> Protocol version: 5.0.1
>>> Config version: 3
>>> Cluster name: RWSEems_cluster
>>> Cluster ID: 64692
>>> Cluster Member: Yes
>>> Membership state: Cluster-Member
>>> Nodes: 1
>>> Expected_votes: 1
>>> Total_votes: 1
>>> Quorum: 1
>>> Active subsystems: 7
>>> Node name: server2.production.loc
>>> Node ID: 2
>>> Node addresses: server1.production.loc
>>>
>>> [root server1 ~]# cman_tool nodes
>>> Node  Votes Exp Sts  Name
>>>
>>> [root server2 log]# cman_tool nodes
>>> Node  Votes Exp Sts  Name
>>>    1    1    1   X   server1.production.loc
>>>    2    1    1   M   server2.production.loc
>>>
>>> When i start cman
>>> service cman start
>>>
>>> Feb 10 14:06:30 server1 kernel: CMAN: Waiting to join or form a
>>> Linux-cluster
>>> Feb 10 14:06:30 server1 ccsd[21964]: Connected to cluster infrastruture
>>> via: CMAN/SM Plugin v1.1.7.4
>>> Feb 10 14:06:30 server1 ccsd[21964]: Initial status:: Inquorate
>>>
>>>
>>> It seems to me that this should be fixable with the tools provided
>>> with the Red Hat Cluster Suite, without disturbing the running cluster.
>>> It seems quite insane if I need to restart my cluster to have it all
>>> working again.. kinda spoils the idea of running a cluster.
>>> This setup is running in an HA environment and we can afford next to
>>> no downtime.
>>>
>>> The logs on the healthy server (server2) do not mention or complain
>>> about any errors when rebooting, restarting cman or when server1 wants
>>> to join the cluster.
>>> We see no "disallowed", "refused" or anything else suggesting that
>>> server2 is unwilling to play with server1.
>>>
>>> I have been looking at this thing for a while now.. am I missing
>>> anything?
>> This is a known bug, see
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=475293
>>
>> It's fixed in 4.7 or you can run a program to set up a workaround.
>>
>> Having said that, I have heard reports of it still happening in some
>> circumstances ... but I don't have any more detail.
>>
>> --
>>
>> Chrissie
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster redhat com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>
--
Chrissie