High Availability with heartbeat and ldirectord

Hello,
I am setting up a highly available, load-balancing Apache cluster. I think I have everything in place, and everything works except the load balancing. Heartbeat handles the failover and works fine. I am using source hashing as the scheduling method for ldirectord. Ldirectord does see the two nodes, as the output of "ipvsadm -L -n" shows:
SLES9-CLUSTER1:~ # ipvsadm -L -n
IP Virtual Server version 1.2.0 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP 192.168.200.79:80 sh
-> 192.168.200.78:80 Route 1 0 0
-> 192.168.200.77:80 Local 1 0 0

And when I shut down one of the boxes, it is pulled from the pool, and the master role rolls over to the other box like it is supposed to. However, the actual web request on port 80 fails when it goes to the non-local node (192.168.200.78 in the above example); it comes through fine on the local node. So about half of the web requests fail. I did enable IP forwarding; is there anything else I need to do? Oh, it is SUSE Linux Enterprise Server 9, and the service address gets bound to eth0 as eth0:0. I don't know if this is right, but most of the examples I found online set up the service address as lo:0.
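For reference, a minimal ldirectord.cf matching the setup above might look like this. The VIP, real servers, and sh scheduler are taken from the ipvsadm output; the check timings, check page, and check string are assumptions, not my actual values:

```shell
# /etc/ha.d/ldirectord.cf -- hypothetical sketch matching the ipvsadm output
checktimeout=10        # assumed value
checkinterval=5        # assumed value
autoreload=yes
quiescent=no

# Virtual service 192.168.200.79:80, source-hashing scheduler,
# direct routing ("gate") to both real servers
virtual=192.168.200.79:80
        real=192.168.200.77:80 gate
        real=192.168.200.78:80 gate
        service=http
        request="ldirectord.html"   # assumed check page
        receive="Test Page"         # assumed check string
        scheduler=sh
        protocol=tcp
```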
I can post some config files if needed.

All I need is step 6; everything else works. However, I did mine with only two boxes instead of four. Each box runs both a load balancer and the HTTP service. Only one load balancer is active at a time, but I still want both boxes to serve the HTTP requests. There is an article I found on doing this very setup: http://www.ultramonkey.org/2.0.1/topologies/sl-ha-lb-eg.html

However, I needed to tweak mine a little as I am running SLES 9. I am basically using a hybrid config between the two tutorials.


thanks,

In an HA cluster, only one load balancer is active at a time. The other is a standby load balancer that becomes active when the primary load balancer fails.

Sorry, what I meant is that the output of the "ipvsadm -L -n" command lists the nodes as Local or Route:
TCP 192.168.200.79:80 sh
-> 192.168.200.78:80 Route 1 0 0
-> 192.168.200.77:80 Local 1 0 0

They are all private IP addresses. 192.168.200.79 is my virtual address; .77 is the active load balancer but is also an available node to receive HTTP requests. .78 is the other node, but I'm not sure whether the load balancer is passing requests to that node or not. Since half of my requests were failing, I assumed the failing ones were those getting forwarded to .78 and then dropped.

OK, I think I found my problem here:
The Linux Virtual Server has three different ways of forwarding packets: Network Address Translation (NAT), IP-IP encapsulation or tunnelling and Direct Routing.

* Direct Routing: Packets from end users are forwarded directly to the real server. The IP packet is not modified, so the real servers must be configured to accept traffic for the virtual server's IP address. This can be done using a dummy interface, or packet filtering to redirect traffic addressed to the virtual server's IP address to a local port. The real server may send replies directly back to the end user. That is, if a host-based layer 4 switch is used, it may not be in the return path.

I need to set up an IP alias on my loopback (lo:0) so the Apache web server will accept connections for the virtual IP address (192.168.200.79). However, the tutorial explains how to do it on Debian; do you know how it is done on SLES 9? I'll check in the meantime. Thanks,
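In case it helps, here is a sketch of what I believe the equivalent looks like on SLES 9 (a 2.6 kernel), done by hand rather than through YaST. The sysctl names are the standard kernel ones for suppressing ARP on the VIP; treat the whole thing as an assumption until tested:

```shell
# Bring up the VIP on the loopback with a /32 netmask so the real
# server accepts traffic for 192.168.200.79 without routing it on
ifconfig lo:0 192.168.200.79 netmask 255.255.255.255 up

# Stop this real server from answering ARP for the VIP; otherwise it
# can steal the address from the director (2.6 kernels)
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.eth0.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2
sysctl -w net.ipv4.conf.eth0.arp_announce=2
```

To survive a reboot, the sysctls would go in /etc/sysctl.conf and the alias in an ifcfg file under /etc/sysconfig/network/, though the exact ifcfg syntax on SLES 9 is something I'd have to verify.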

So, I did it as the howto described, but when I nmap the VIP, the HTTP and MySQL ports show as filtered. Is there anything else I have to do, e.g. change the default route or add a new route on the real servers?

Nmap-ing the VIP shows the HTTP and MySQL ports as filtered, but now only sometimes. It works for a couple of hours; then, after a restart of one real server, the ports change to filtered. Strange, isn't it?

I've been trying to do this, and the real server loses all contact with the outside world.
In fact, the server won't respond to any requests after I add such a loopback alias.
Any one else here having the same issue?

I even used the "correction" script from http://classcast.blogspot.com/2006/12/two-node-lvs-dr-setup-on-centos.html that was supposed to solve the loopback alias problem... except that the "correction" script locks out everything once it tries to raise the loopback alias. Also, the correction script wants an executable that doesn't exist: /etc/ha.d/rc.d/arptables-noarp-addr_takeip. (I did a yum search for arptables and ended up installing arptables_jf, but that didn't install such an executable either.)
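For what it's worth, the effect that missing script is after can apparently be had with plain arptables_jf rules. This is a sketch based on how the Red Hat direct-routing docs describe it, with 10.0.0.100 as the VIP and 10.0.0.11 as this node's real IP (both assumed), so adjust to taste:

```shell
# Drop incoming ARP requests for the VIP so only the director answers them
arptables -A IN -d 10.0.0.100 -j DROP

# Rewrite the source address of outgoing ARP traffic so this real
# server never advertises the VIP as its own
arptables -A OUT -s 10.0.0.100 -j mangle --mangle-ip-s 10.0.0.11

# With ARP suppressed, the VIP can then live on a normal alias instead
# of the troublesome loopback alias
ifconfig eth0:1 10.0.0.100 netmask 255.255.255.255 up
```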

I've tried experimenting with different configurations in ldirectord.cf, including changing gate to masq and (gasp!) ipip.

There is an "ldirectord.html" on each of the nodes that is successfully acknowledged... if the node is not running with a loopback alias. If I do set my node's loopback alias as follows:
ifconfig lo:0 10.0.0.100 netmask 255.255.255.255
...the node stops responding to the load balancer. However, I can still hit the node from anywhere else except the load balancer.

If I take the loopback alias down on the nodes, ldirectord says it can see the nodes, but any attempt to hit the virtual IP now times out.
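One thing that might pin down whether this is the classic LVS-DR ARP problem: from a third machine on the same subnet, check which MAC address answers for the VIP (10.0.0.100 here, matching the alias above):

```shell
# Ask the segment who owns the VIP; if the MAC that replies belongs to
# a real server rather than the director, the nodes are answering ARP
# for the VIP and hijacking it from the load balancer
arping -I eth0 -c 3 10.0.0.100

# Or watch the ARP traffic directly on the director
tcpdump -n -i eth0 arp
```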

...still no dice. However, since then, I've noticed some interesting other behavior...

I tried setting the LB's "checkinterval" value to 30, so that it checks whether it can reach the nodes 30 seconds apart (or a "tick," in old MUD parlance). At this point, the loopback alias on every node is down.

Then I fire up ldirectord, and let it see the nodes. (If the loopback alias on the nodes is currently up, then it won't get a response from the nodes, and will flag those nodes as unavailable.)

If I hit the virtual IP from a web browser at this point, it times out.
However, if I then bring up the loopback aliases on the nodes, everything works perfectly - the requests successfully route to a random node.

At least, until the next tick, a maximum of 30 seconds later, at which point the load balancer cannot make a request of the node and marks it as nonfunctional.

It is almost as if the load balancer does forward packets to the node but cannot receive confirmation that it has done so. Ldirectord marks the node disabled after "checkinterval" seconds have passed, because its check requests to the node never come back. It is obvious that the node is listening, but it is unable to respond to the LB because the node's loopback alias is set to the virtual IP.

Any help would be appreciated.

(From a loopback standpoint, I don't understand how a node is ever expected to communicate with another server when the node's loopback alias is set to the same address as that other server.)
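If my understanding is right, the short answer is that it can't: once the VIP sits on lo with a /32 netmask, the kernel installs a host route in its "local" table, and anything this node sends to the VIP is delivered to its own stack instead of the wire. In LVS-DR that is by design, since the real server only needs to accept traffic *for* the VIP, never to originate traffic *to* it; the director's health checks go to the node's real address. A way to see the kernel's decision (VIP 10.0.0.100 and real IP 10.0.0.11 assumed, as above):

```shell
# With the VIP on lo:0, traffic from this node to the VIP never leaves
# the box; expect output along the lines of
# "local 10.0.0.100 dev lo  src 10.0.0.100"
ip route get 10.0.0.100

# The node's real address, which the director's health checks target,
# is unaffected by the lo:0 alias
ip route get 10.0.0.11
```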