ssic-linux-devel

Hi,
I was going through the cluster documentation of Tru64 to figure out
what exactly CVIP is. It prompted me to prepare this wish list. I
wonder if any of this is already there.
1) Since we talk of a Single System Image, I guess we need the
external interfaces (I mean the interfaces connected to the external
network) of all the nodes to be shown on each node when I do
ifconfig (I guess Brian once mentioned this. Brian?)
i.e., on node1 if I do an ifconfig -a I should get
eth0
eth1
eth2
....
.... where eth1 and eth2 are interfaces on node2 and node3. I should be
able to do any ifconfig operation on these interfaces. I should be able
to run a server on node1 that listens on the IP associated with eth1 or
eth2 (I guess this brings in the requirement for bind/connect/listen to
be clusterwide. Am I correct, Bruce?); a sketch of what I mean follows
the list. I am not sure how we should name the cluster interconnect
interface in this particular case.
2) It would be great if I could see my cluster alias as an interface
on all the nodes and could do an "ifconfig clu0 down" that brings the
cluster IP down. If you have multiple cluster aliases, then ifconfig
should show only those aliases to which the particular node belongs.
3) I should be able to configure a cluster alias and say that node1,
node2, and node3 belong to this alias while node4, node5, and node6
belong to cluster alias2.
4) User-configurable load balancing on the basis of connections in
the case of a multi-instance configuration of different servers like
a webserver (LVS?)
5) ...... (Not there yet, but as I read more about CVIP I will update
this part :))
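To make item 1 concrete, here is a minimal C sketch of the kind of
bind/listen I mean. The address 192.0.2.2 is only a placeholder for the
IP assigned to eth1 on node2; nothing here is taken from any existing
implementation.

/* Minimal sketch of the clusterwide bind/listen wished for in item 1.
 * 192.0.2.2 is only a placeholder for the address assigned to eth1 on
 * node2; nothing here is specific to any existing implementation. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in sa;
    int fd;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        exit(1);
    }

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(8080);
    /* On one machine this bind only succeeds for a local address; the
     * wish is that on an SSI cluster it also succeeds when the address
     * belongs to another node's interface. */
    sa.sin_addr.s_addr = inet_addr("192.0.2.2");

    if (bind(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0) {
        perror("bind"); /* EADDRNOTAVAIL on an ordinary system */
        exit(1);
    }
    if (listen(fd, 5) < 0) {
        perror("listen");
        exit(1);
    }
    printf("listening on 192.0.2.2:8080\n");
    close(fd);
    return 0;
}

On an ordinary single machine the bind() fails with EADDRNOTAVAIL; the
wish is that the cluster accepts it and the listen works clusterwide.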
-aneesh

> I was going through the cluster documentation of Tru64 to figure out
> what exactly CVIP is. It prompted me to prepare this wish list. I
> wonder if any of this is already there.
Shouldn't you be looking at the UnixWare NSC doc rather than Tru64? Or
is the cvip implementation the same?
> 1) Since we talk of a Single System Image, I guess we need the
> external interfaces (I mean the interfaces connected to the external
> network) of all the nodes to be shown on each node when I do
> ifconfig (I guess Brian once mentioned this. Brian?)
Here's what I get on my NSC system: (net0.[12] is the cluster
interconnect, net1.[12] is the external ether, cvip0 is the cvip
and lo0 is the loopback.)
# onall ifconfig -a
(node 1)
lo0: flags=4049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
inet 127.0.0.1 netmask ff000000
net0.1: flags=24043<UP,BROADCAST,RUNNING,MULTICAST,ICS> mtu 1500
inet 192.168.255.1 netmask ffffff00 broadcast 192.168.255.255
ether 00:03:47:40:ce:a3
cvip0: flags=84041<UP,RUNNING,MULTICAST,CVIP> mtu 1500
inet 213.39.1.236 netmask ffffffe0
net1.1: flags=4043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 213.39.1.243 netmask ffffffe0 broadcast 213.39.1.255
ether 00:06:5b:38:49:cf
net1.2: flags=44043<UP,BROADCAST,RUNNING,MULTICAST,REMOTE> mtu 1500
inet 213.39.1.244 netmask ffffffe0 broadcast 213.39.1.255
ether 00:06:5b:38:49:ba
net0.2: flags=64043<UP,BROADCAST,RUNNING,MULTICAST,ICS,REMOTE> mtu 1500
inet 192.168.255.2 netmask ffffff00 broadcast 192.168.255.255
ether 00:03:47:3f:c9:0d
(node 2)
lo0: flags=4049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
inet 127.0.0.1 netmask ff000000
net0.2: flags=24043<UP,BROADCAST,RUNNING,MULTICAST,ICS> mtu 1500
inet 192.168.255.2 netmask ffffff00 broadcast 192.168.255.255
ether 00:03:47:3f:c9:0d
cvip0: flags=84041<UP,RUNNING,MULTICAST,CVIP> mtu 1500
inet 213.39.1.236 netmask ffffffe0
net1.2: flags=4043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 213.39.1.244 netmask ffffffe0 broadcast 213.39.1.255
ether 00:06:5b:38:49:ba
net0.1: flags=64043<UP,BROADCAST,RUNNING,MULTICAST,ICS,REMOTE> mtu 1500
inet 192.168.255.1 netmask ffffff00 broadcast 192.168.255.255
ether 00:03:47:40:ce:a3
net1.1: flags=44043<UP,BROADCAST,RUNNING,MULTICAST,REMOTE> mtu 1500
inet 213.39.1.243 netmask ffffffe0 broadcast 213.39.1.255
ether 00:06:5b:38:49:cf
I.e., the order of the output is different but the same information is
printed.
> .... where eth1 and eth2 are interfaces on node2 and node3. I should
> be able to do any ifconfig operation on these interfaces.
Can do on UW NSC.
> I should be able to run a server on node1 that listens on the IP
> associated with eth1 or eth2
Can do on UW NSC.
> 2) It would be great if I could see my cluster alias as an interface
> on all the nodes and could do an "ifconfig clu0 down" that brings the
> cluster IP down.
On UW NSC doing "ifconfig cvip0 down" takes down the cvip, but the
individual node interfaces (net1.1 and net1.2 for me) stay up.
> If you have multiple cluster aliases, then ifconfig should show
> only those aliases to which the particular node belongs.
> 3) I should be able to configure a cluster alias and say that node1,
> node2, and node3 belong to this alias while node4, node5, and node6
> belong to cluster alias2.
AFAIK the cvip (all the cvip's, you can have many) on UW NSC applies to
all nodes.
> 4) User-configurable load balancing on the basis of connections in
> the case of a multi-instance configuration of different servers like
> a webserver (LVS?)
On UW NSC it seems to be round robin.

Hi,
On Tue, 2002-03-26 at 14:28, John Hughes wrote:
> > I was going through the cluster documentation of Tru64 to figure out
> > what exactly CVIP is. It prompted me to prepare this wish list. I
> > wonder if any of this is already there.
>
> Shouldn't you be looking at the UnixWare NSC doc rather than Tru64? Or
> is the cvip implementation the same?
I don't have a UnixWare NSC here. :) I was trying to look into the
different features that CVIP provides.
>
> > 1) Since we talk of a Single System Image, I guess we need the
> > external interfaces (I mean the interfaces connected to the external
> > network) of all the nodes to be shown on each node when I do
> > ifconfig (I guess Brian once mentioned this. Brian?)
>
> Here's what I get on my NSC system: (net0.[12] is the cluster
> interconnect, net1.[12] is the external ether, cvip0 is the cvip
> and lo0 is the loopback.)
>
> # onall ifconfig -a
> (node 1)
> lo0: flags=4049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
> inet 127.0.0.1 netmask ff000000
> net0.1: flags=24043<UP,BROADCAST,RUNNING,MULTICAST,ICS> mtu 1500
> inet 192.168.255.1 netmask ffffff00 broadcast 192.168.255.255
> ether 00:03:47:40:ce:a3
> cvip0: flags=84041<UP,RUNNING,MULTICAST,CVIP> mtu 1500
> inet 213.39.1.236 netmask ffffffe0
> net1.1: flags=4043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
> inet 213.39.1.243 netmask ffffffe0 broadcast 213.39.1.255
> ether 00:06:5b:38:49:cf
> net1.2: flags=44043<UP,BROADCAST,RUNNING,MULTICAST,REMOTE> mtu 1500
> inet 213.39.1.244 netmask ffffffe0 broadcast 213.39.1.255
> ether 00:06:5b:38:49:ba
> net0.2: flags=64043<UP,BROADCAST,RUNNING,MULTICAST,ICS,REMOTE> mtu 1500
> inet 192.168.255.2 netmask ffffff00 broadcast 192.168.255.255
> ether 00:03:47:3f:c9:0d
> (node 2)
> lo0: flags=4049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
> inet 127.0.0.1 netmask ff000000
> net0.2: flags=24043<UP,BROADCAST,RUNNING,MULTICAST,ICS> mtu 1500
> inet 192.168.255.2 netmask ffffff00 broadcast 192.168.255.255
> ether 00:03:47:3f:c9:0d
> cvip0: flags=84041<UP,RUNNING,MULTICAST,CVIP> mtu 1500
> inet 213.39.1.236 netmask ffffffe0
> net1.2: flags=4043<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
> inet 213.39.1.244 netmask ffffffe0 broadcast 213.39.1.255
> ether 00:06:5b:38:49:ba
> net0.1: flags=64043<UP,BROADCAST,RUNNING,MULTICAST,ICS,REMOTE> mtu 1500
> inet 192.168.255.1 netmask ffffff00 broadcast 192.168.255.255
> ether 00:03:47:40:ce:a3
> net1.1: flags=44043<UP,BROADCAST,RUNNING,MULTICAST,REMOTE> mtu 1500
> inet 213.39.1.243 netmask ffffffe0 broadcast 213.39.1.255
> ether 00:06:5b:38:49:cf
>
> I.e., the order of the output is different but the same information is
> printed.
I didn't know that it is already there. Great.
>
> > .... where eth1 and eth2 are interfaces on node2 and node3. I should
> > be able to do any ifconfig operation on these interfaces.
>
> Can do on UW NSC.
>
> > I should be able to run a server on node1 that listens on the IP
> > associated with eth1 or eth2
>
> Can do on UW NSC.
>
>
> > 2) It would be great if I could see my cluster alias as an interface
> > on all the nodes and could do an "ifconfig clu0 down" that brings the
> > cluster IP down.
>
> On UW NSC doing "ifconfig cvip0 down" takes down the cvip, but the
> individual node interfaces (net1.1 and net1.2 for me) stay up.
That's fine. So that is also there :)
>
> > If you have multiple cluster aliases, then ifconfig should show
> > only those aliases to which the particular node belongs.
> > 3) I should be able to configure a cluster alias and say that node1,
> > node2, and node3 belong to this alias while node4, node5, and node6
> > belong to cluster alias2.
>
> AFAIK the cvip (all the cvip's, you can have many) on UW NSC applies to
> all nodes.
What happens if you have a cluster spread across different subnets? I
mean, let's say I have a three-node cluster where node1 is on subnet1
and node2 is on subnet2. Now I have a cluster IP configured, and the
cluster IP falls in subnet1. A client sitting on subnet1 will send
frames directly to the cluster IP (after resolving it with ARP) because
both of them fall on the same subnet. In this case node1 will reply to
the ARP request for the cluster IP and will receive the packet. Let's
assume the packet is destined for service A, which is running on node2
(yes, it is bound to the cluster IP). Now from the IP layer of node1 the
packet will get redirected to the TCP layer of node2 and the application
works smoothly. But what happens if node1 goes down? My application is
still running on node2 and the cluster is still there, but further ARP
requests from my client on subnet1 will not be answered, because node1
is down and there is no other node in the cluster connected to that
subnet.
How to prevent this? Make it a condition that a cluster cannot spread
across subnets, or always impose that the cluster IP falls in a
different subnet, so that packets always get sent to the router and
then to one of the nodes in the cluster.
Another solution is to allow the cluster IP to be configured per
subnet, that is, we should have a cluster IP per subnet. In this case
we should also ensure that the cluster IP contains only nodes that are
in that subnet; otherwise the previously stated problem will occur.
>
> > 4) User-configurable load balancing on the basis of connections in
> > the case of a multi-instance configuration of different servers like
> > a webserver (LVS?)
>
> On UW NSC it seems to be round robin.
>
It is better to have a priority-based weighted round robin. That's what
Tru64 has. It gives me fine control over load balancing.
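Something like this rough C sketch is what I have in mind. The node
list and weights are made up for illustration; this is not the Tru64 or
NSC algorithm.

/* Rough sketch of weighted round-robin node selection.  The node
 * weights are invented for illustration; a real implementation would
 * read them from the cluster alias configuration. */
#include <stdio.h>

struct node {
    int id;      /* node number */
    int weight;  /* higher weight = larger share of connections */
};

static struct node nodes[] = { {1, 3}, {2, 2}, {3, 1} };
#define NNODES (sizeof(nodes) / sizeof(nodes[0]))

/* Return the node that should accept the next incoming connection:
 * each node is handed 'weight' connections per round. */
static int next_node(void)
{
    static unsigned idx = 0, given = 0;

    if (given >= (unsigned)nodes[idx].weight) {
        given = 0;
        idx = (idx + 1) % NNODES;
    }
    given++;
    return nodes[idx].id;
}

int main(void)
{
    int i;

    /* With weights 3/2/1 each round comes out as: 1 1 1 2 2 3 */
    for (i = 0; i < 12; i++)
        printf("%d ", next_node());
    printf("\n");
    return 0;
}

(I believe LVS-style weighted schedulers interleave the nodes rather
than handing out each node's whole share in a burst, but the weighting
idea is the same.)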

>> (onall ifconfig -a ...)
> I didn't know that it is already there. Great.
On UnixWare NSC.
>> AFAIK the cvip (all the cvip's, you can have many) on UW NSC applies
>> to all nodes.
> What happens if you have a cluster spread across different subnets? I
> mean, let's say I have a three-node cluster where node1 is on subnet1
> and node2 is on subnet2.
AFAIK on UW NSC you can't do that. You can have multiple cvip's,
each on different subnets.
> (load balancing)
>
>> On UW NSC it seems to be round robin.
>>
> It is better to have a priority-based weighted round robin. That's
> what Tru64 has. It gives me fine control over load balancing.
That'd be nice if you have asymmetric clusters. How is the priority
set? Is it a per-node priority or a per-bind priority?

Hi,
> > It is better to have a priority-based weighted round robin. That's
> > what Tru64 has. It gives me fine control over load balancing.
>
> That'd be nice if you have asymmetric clusters. How is the priority
> set? Is it a per-node priority or a per-bind priority?
>
I think it is per node. But it would be nice to have a per-service
priority and weight, so that I could say that most web server
connections should go to node1 and most database connections to node2,
and when the web server connections reach their limit, send further
connections to node3; likewise, when the database connections reach
their limit, send them to node4.
-aneesh

>>> (re weighted round robin load balancing)
>>
>> That'd be nice if you have asymmetric clusters. How is the priority
>> set? Is it a per-node priority or a per-bind priority?
>>
>
> I think it is per node. But it would be nice to have a per-service
> priority and weight, so that I could say that most web server
> connections should go to node1 and most database connections to
> node2, and when the web server connections reach their limit, send
> further connections to node3; likewise, when the database connections
> reach their limit, send them to node4.
I think it'd have to be per service to be useful. And wouldn't you
normally send x% to node 1, y% to node 2, z% to node 3... rather than
filling each node to the limit before going on to the next?

"Aneesh Kumar K.V" wrote:
> ... My application is still running on node2 and the cluster is
> still there, but further ARP requests from my client on subnet1 will
> not be answered, because node1 is down and there is no other node in
> the cluster connected to that subnet.
>
> How to prevent this? Make it a condition that a cluster cannot
> spread across subnets, or always impose that the cluster IP falls in
> a different subnet, so that packets always get sent to the router and
> then to one of the nodes in the cluster.
I think a classic Unix mixed metaphor applies: "Give the user enough
rope to shoot themselves in the foot." The fear that a user might put
together a bad configuration does not justify restrictions on
potentially useful configurations. In the scenario described above, the
sysadmin should make sure that at least two nodes are up on subnet 1, so
that no single point of failure takes down service on that cluster IP
address.
> Another solution is to allow the cluster IP to be configured per
> subnet, that is, we should have a cluster IP per subnet. In this case
> we should also ensure that the cluster IP contains only nodes that
> are in that subnet; otherwise the previously stated problem will
> occur.
That could cause problems for an unmodified app that wants to read an IP
address out of a configuration file and do a listen on it. If it gets
started on a node that has no access to that cluster interface, the app
will get confused and probably exit.
Part of the SSI philosophy is that apps should be able to run without
modification on any node in the cluster, because it all looks like one
big machine. Sometimes it's acceptable to deviate from this philosophy
because it's not worth the effort to implement it 100%. An example is
not allowing a process to bind to another node's physical IP address,
because the process can just as easily bind to a highly available
cluster IP address. To not allow it the freedom to bind to any cluster
IP address, however, breaks the SSI model too much.
> It is better to have a priority-based weighted round robin. That's
> what Tru64 has. It gives me fine control over load balancing.
The round robin system on NSC was chosen for ease of implementation in
the face of higher priorities. A better system for load-leveling
connections is definitely desirable. Doesn't LVS already have algorithms
for this?
--
Brian Watson | "Now I don't know, but I been told it's
Linux Kernel Developer | hard to run with the weight of gold,
Open SSI Clustering Project | Other hand I heard it said, it's
Compaq Computer Corp | just as hard with the weight of lead."
Los Angeles, CA | -Robert Hunter, 1970
mailto:Brian.J.Watson@...
http://opensource.compaq.com/

"Aneesh Kumar K.V" wrote:
> 1) Since we talk of a Single System Image, I guess we need the
> external interfaces (I mean the interfaces connected to the external
> network) of all the nodes to be shown on each node when I do
> ifconfig (I guess Brian once mentioned this. Brian?)
>
> i.e., on node1 if I do an ifconfig -a I should get
>
> eth0
> eth1
> eth2
> ....
> .... where eth1 and eth2 are interfaces on node2 and node3. I should be
> able to do any ifconfig operation on these interfaces. I should be able
> to run a server on node1 that listens on the IP associated with eth1 or
> eth2
Yes, that's my opinion. I dislike the NSC convention of encoding the
node in the network device name (e.g., net0.1 on node 1, net0.2 on node
2). The SSI philosophy is that the cluster should look like one big
machine with a whole bunch of NICs, so the devices should be numbered
that way.
Note that the interface-to-node-number mapping would not necessarily be
eth0 on node 1, eth1 on node 2, etc. The interfaces would probably be
numbered in whatever order the cluster learns about them, although a
user interface to rearrange the numbering might be a good idea at some
point.
> (I guess this brings in the requirement for bind/connect/listen to
> be clusterwide. Am I correct, Bruce?)
Not exactly. I think NIC configuration should be clusterwide, so that a
user can ifconfig any interface from any node in the cluster. OTOH,
allowing a program to do a listen on any interface from anywhere in the
cluster may not be worth the work. Programs should just listen on one of
the cluster interfaces, which can be done from any node in the cluster
thanks to LVS. Correct me if I'm wrong, Kai or Bruce.
The reason for remote bind, connect, listen, etc., is a bit different.
Right now, processes can migrate, but sockets can't. Eventually, I would
like to make sockets migratable, but there could still be situations
where two or more processes on different nodes share a socket.
With the current SSI code, if a process does a socket() call, gets
migrated by the load-leveler, then tries to do a connect(), it'll break.
If it finishes the connect() before it migrates, then file ops like
read(), write(), poll(), etc., will work fine. I just need to do the
same for socket ops like connect() and listen().
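To illustrate the sequence, here is a minimal sketch; the address is a
placeholder, and the comments mark where migration matters:

/* Illustration of the call sequence described above.  The address is
 * a placeholder; the comments mark where migration matters. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in sa;
    char buf[128];
    int fd;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* If the load-leveler migrates the process HERE, the socket stays
     * behind on the original node, and the connect() below breaks
     * because connect() is not yet a clusterwide operation. */

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(80);
    sa.sin_addr.s_addr = inet_addr("192.0.2.10");
    if (connect(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0)
        perror("connect");

    /* If migration happens HERE instead, after the connect() has
     * completed, the file ops (read(), write(), poll()) already work
     * clusterwide, so the process keeps running normally. */
    if (read(fd, buf, sizeof(buf)) < 0)
        perror("read");

    close(fd);
    return 0;
}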
> 2) It would be great if I could see my cluster alias as an interface
> on all the nodes and could do an "ifconfig clu0 down" that brings the
> cluster IP down. If you have multiple cluster aliases, then ifconfig
> should show only those aliases to which the particular node belongs.
As with the physical NICs, a user should be able to ifconfig a cluster
interface from any node in the cluster.
--
Brian Watson | "Now I don't know, but I been told it's
Linux Kernel Developer | hard to run with the weight of gold,
Open SSI Clustering Project | Other hand I heard it said, it's
Compaq Computer Corp | just as hard with the weight of lead."
Los Angeles, CA | -Robert Hunter, 1970
mailto:Brian.J.Watson@...
http://opensource.compaq.com/

> Yes, that's my opinion. I dislike the NSC convention of encoding the
> node in the network device name (e.g., net0.1 on node 1, net0.2 on
> node 2). The SSI philosophy is that the cluster should look like one
> big machine with a whole bunch of NICs, so the devices should be
> numbered that way.
Personally I think that devices shouldn't be "numbered". What is the
point of knowing that this is the "nth" ethernet device?
In other words "eth23" is no more or less stupid than "net17.23" and
encodes less information. Maybe useless information, but I hate
discarding it before you know whether you need it.

John Hughes wrote:
> Personally I think that devices shouldn't be "numbered". What is the
> point of knowing that this is the "nth" ethernet device?
>
> In other words "eth23" is no more or less stupid than "net17.23" and
> encodes less information. Maybe useless information, but I hate
> discarding it before you know whether you need it.
The reason I dislike it is because it breaks the base naming convention,
which then breaks apps and utilities that depend on the base naming
convention. For a while, I had the "joy" of maintaining the NSC changes
to the UnixWare netcfg command. Most of the changes were required
because of the node-encoded naming convention for network devices.
An alternative way to get node information for network interfaces could
be an option to the where command.
--
Brian Watson | "Now I don't know, but I been told it's
Linux Kernel Developer | hard to run with the weight of gold,
Open SSI Clustering Project | Other hand I heard it said, it's
Compaq Computer Corp | just as hard with the weight of lead."
Los Angeles, CA | -Robert Hunter, 1970
mailto:Brian.J.Watson@...
http://opensource.compaq.com/

> The reason I dislike it is because it breaks the base naming
> convention, which then breaks apps and utilities that depend on the
> base naming convention. For a while, I had the "joy" of maintaining
> the NSC changes to the UnixWare netcfg command. Most of the changes
> were required because of the node-encoded naming convention for
> network devices.
OK, I see your point; reducing the number of changes to the base system
seems like a pretty important goal.