Dual NIC clients

We use dual-NIC client boxes, with each NIC on a different VLAN. The machines will PXE boot, but the screens go blank during that process. In earlier versions (0.28, 0.29 and 0.32) we’d get a TFTP timeout.

The quick and dirty fix is to kill the relevant switch ports on the LAN not used for FOG while imaging.

Since it works fine with only one active NIC, it would seem that the PXE-booted kernel gets confused about which interface to use for TFTP. I tried changing TFTP_ADDRESS="0.0.0.0:69" in /etc/default/tftpd-hpa to point at the server IP, but that didn’t help. I also tried adding option tftp-server-name X.X.X.X; to my dhcpd config file, but no luck.
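For what it’s worth, the directive that PXE firmware reliably honors in ISC dhcpd is next-server rather than option tftp-server-name (option 66), which some PXE ROMs ignore. A hypothetical dhcpd.conf fragment - all addresses are placeholders, not values from this thread:

```conf
# Hypothetical dhcpd.conf fragment -- addresses are placeholders.
subnet 192.168.10.0 netmask 255.255.255.0 {
  range 192.168.10.100 192.168.10.200;
  next-server 192.168.10.5;   # TFTP server handed to PXE clients
  filename "pxelinux.0";      # boot file fetched from that server
}
```

That said, if the client firmware is picking the wrong NIC entirely, no server-side DHCP option will fix it.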

I’ve not tested it, as I don’t readily have a machine with multiple interfaces available, but I think I’ve got a universal init that you can pass a custom kernel argument to, which will ensure the correct interface is up and the others are down. So far I’ve coded for three possible interfaces. I already had many of these functions written for another project I’ve been working on.

In the init, I’ve edited the file /usr/share/fog/lib/funcs.sh to include these functions:
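(The author’s actual functions aren’t reproduced in the thread. As a rough illustration of the approach described - a custom kernel argument names the NIC to keep, everything else goes down - here is a hypothetical sketch. The argument name fognic and the eth* naming are assumptions, not the original code.)

```shell
#!/bin/sh
# Hypothetical sketch (not the original funcs.sh code): pick the NIC named
# by a custom kernel argument, e.g. fognic=eth1 on the kernel line, and
# bring every other interface down.

getKernelArg() {
    # Print the value of key=value from the kernel command line.
    # $1 = key, $2 = cmdline string (defaults to /proc/cmdline).
    cmdline="${2:-$(cat /proc/cmdline)}"
    for arg in $cmdline; do
        case "$arg" in
            "$1"=*) echo "${arg#*=}" ;;
        esac
    done
}

selectNIC() {
    want="$(getKernelArg fognic)"
    [ -z "$want" ] && return 0      # no argument: leave everything alone
    for dev in /sys/class/net/eth*; do
        [ -e "$dev" ] || continue   # glob didn't match anything
        nic="${dev##*/}"
        if [ "$nic" = "$want" ]; then
            ip link set "$nic" up
        else
            ip link set "$nic" down
        fi
    done
}
```

With that in place, adding fognic=eth1 to the kernel arguments for a given host (or host group) would pin the imaging traffic to eth1.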

@tag Well, the way I write it, it’ll work with however many NICs a system has. We will need to use the new feature that @Tom-Elliott so kindly implemented maybe a month ago: the host’s Host Init field. Basically we will build an init for each of your subnets, then use groups to assign the right inits to the right computers, so that the computers in each subnet use the correct interface.

Sounds like a lot - but I really don’t think it is. I think this is going to be very easy.

Seems kind of inflexible, though… The same init is used for all clients, right? We even have some clients with three NICs at other locations. If it has to take various hardware scenarios into account, it might take some fancy scripting.

I know some basic scripting but nothing really fancy.

The only way to determine the correct interface would be to filter on IP, as I see it.

In my case that would yield the network ID of the correct network, and dissimilar outputs from the other interfaces in the list, which could then be compared against a known value for the correct network ID. Based on that comparison you could then bring the interfaces up or down.
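The comparison described above can be sketched in plain shell arithmetic - AND each octet of an interface’s IP with its netmask and compare the result to the known-good network ID. All addresses here are placeholders:

```shell
#!/bin/sh
# Hypothetical sketch: compute the network ID of an IP/netmask pair, so
# each interface's network can be compared to a known-good value.
network_id() {
    addr="$1"
    mask="$2"
    oldIFS=$IFS
    IFS=.
    set -- $addr; i1=$1 i2=$2 i3=$3 i4=$4
    set -- $mask; m1=$1 m2=$2 m3=$3 m4=$4
    IFS=$oldIFS
    # Bitwise AND of each octet gives the network address.
    echo "$((i1 & m1)).$((i2 & m2)).$((i3 & m3)).$((i4 & m4))"
}

# Placeholder value for the FOG network -- adjust per site.
FOG_NET="192.168.10.0"
if [ "$(network_id 192.168.10.55 255.255.255.0)" = "$FOG_NET" ]; then
    echo "this interface is on the FOG network"
fi
```

Looping that over every interface’s address and mask would identify the one to keep up.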

I’m sure someone else could do something a lot niftier.

I haven’t tested any of this and it might screw up if the number of interfaces actually present is different from the number in the list.

@tag I think you’re going to have to build a custom init. You can change the fog.upload and fog.download scripts. The idea would be to use shell script to determine which interface is on the right network, and then disable the other interface (or enable it). It should be pretty simple.
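A hypothetical sketch of what that edit to fog.download/fog.upload could look like - keep only the NIC whose IPv4 address sits in the FOG subnet and down the rest. The FOG_PREFIX value and the crude /24 prefix match are assumptions; adjust per site:

```shell
#!/bin/sh
# Hypothetical sketch: down every NIC that isn't on the FOG subnet.
# FOG_PREFIX is a placeholder -- a crude prefix match for a /24.
FOG_PREFIX="192.168.10."

inFogSubnet() {
    # Succeed if the given IPv4 address starts with the FOG prefix.
    case "$1" in
        "$FOG_PREFIX"*) return 0 ;;
        *)              return 1 ;;
    esac
}

keepFogNIC() {
    for dev in /sys/class/net/*; do
        nic="${dev##*/}"
        [ "$nic" = "lo" ] && continue
        # First IPv4 address on the interface, if any.
        addr="$(ip -4 -o addr show "$nic" 2>/dev/null | awk '{print $4}' | cut -d/ -f1)"
        if [ -n "$addr" ] && inFogSubnet "$addr"; then
            ip link set "$nic" up
        else
            ip link set "$nic" down
        fi
    done
}
```

Calling keepFogNIC early in the script, before the image transfer starts, should leave only the intended link active.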

I tried playing around with grcan.enable0=[0|1] and grcan.enable1=[0|1] as well as grcan.select=[0|1] but none had any effect. The kernel continues to choose the slower link in most cases. Seemed promising, though…

What I do notice for the first time, though, is that whichever NIC is not chosen is disabled. I hadn’t noticed before, as I can’t see the backs of the boxes very well. It’s also obvious here that the active NIC changes on occasion, since the LEDs go dark on the disabled NIC.

@Wayne-Workman The problem is, what arguments can we use to differentiate? It sounds like they’re basically identical NICs connecting to different network outlets, and probably getting inconsistent names as well, based on his results.

@Quazz
No, they are registered and deployed through tasks. Actually it’s not a question of one NIC being faster than the other - they’re more or less identical onboard dual NICs - it’s the link speed. They’re 100Mbps switch ports on a 100Mbps trunk to the layer 3 switch. That network was never intended for large data transfers - just remote access and so on.

The primary MAC in the host registration is the faster link, so that has no effect, I’m afraid. These are 1Gbps switch ports for the clients and two ports in ether channel for the server.

Might be possible to tell it to use the faster NIC by registering the hosts, assigning the faster NIC as the primary NIC and deploying that way, but I really can’t be certain of that. I was kind of hoping someone else would chime in on this, heh.

@Quazz
Yes, sometimes an image will deploy at 5GB/min - a few minutes later when trying again with the same client it will only deploy at a fraction of that speed…

Otherwise I agree; I too believe it has to do with the order of the NICs.

I tried swapping the cables at first to see if it just chose one specific hardware ID first, but that was not the case.

Since then I’ve done quite a few test deployments on the same eight machines. Mostly they’ll deploy slowly, but every now and again one will run on the faster NIC, which can be verified by pulling the cables and seeing which one makes the transfer pause.

I’m not saying it’s arbitrary, but I don’t see a pattern so it seems arbitrary to me. ;)

OK - did some thinking. It times out because there is no default gateway set on the secondary link. Once that’s set, it will connect. The problem is: how do I know which network it chooses? I’m getting inconsistent transfer speeds now - an average of 5GB/min versus 225MB/min - apparently depending on which NIC it connects through.
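One way to answer “which NIC is it actually using?” from a booted client is to ask the kernel’s routing decision directly with `ip route get` and read off the dev field. A hypothetical helper - the server IP in the usage comment is a placeholder:

```shell
#!/bin/sh
# Hypothetical diagnostic: which NIC will the kernel use to reach the FOG
# server? `ip route get <server>` answers; routeDev parses the dev field.
routeDev() {
    echo "$1" | sed -n 's/.* dev \([^ ]*\).*/\1/p'
}

# Usage on a booted client (server IP is a placeholder):
# routeDev "$(ip route get 192.168.10.5)"
```

Running that before a deployment would show whether the fast or the slow link was selected for that boot.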

I should mention that inter-VLAN routing is enabled on the layer 3 switch of the primary network. Removing the secondary network from the static route list or pulling the physical link kills it again - this time at trying to send an inventory before deploying.

If I pull the power on the secondary network they will all deploy at high speeds.
With the secondary network on (and inter-VLAN routing), some will deploy normally, others slowly - apparently arbitrarily, as the same machines will act differently from task to task.
It would seem the kernel arbitrarily assigns which NIC is eth0 from boot to boot? That would perhaps explain why it appears to use different NICs.

If I pull the plug on the secondary network while deploying, it stops deploying until plugged back in. So it’s using the “wrong” NIC…