Yesterday I spent 4 hours trying to get my network's DHCP/DNS/SMB server back online. Long story short, it took numerous wildly frustrated shots in the dark (no DNS = no internet resources for help) and no fewer than half a dozen reboots to finally restore my server to functioning order.

What precipitated this was configuring and enabling my server's second Ethernet port in /etc/network/interfaces. That's when it all hit the fan. I've finally gotten eth1 disabled again and eth0 is working as before, but this isn't the state I want this server to be in.

eth0 and eth1 are both gigabit ports built into the motherboard (an ASUS something-or-other), and previously they were both bonded together (round-robin, I think); however, the server's been completely reformatted and re-installed since then (hard drive failure precipitated that), so I would think that anything the bonding driver had configured would be dead and gone.

While the server was offline, ifconfig seemed to be showing that it was receiving packets just fine, but every single outgoing packet was being dropped. (I should have saved the output from ifconfig during the issue, but the 'TX' line showed "packets:0" and "dropped:123"; also "errors:0 ... overrun:0 carrier:0".)

eth0 is configured with a static IP; I did the same for eth1. Here is /etc/network/interfaces:

root@odin:~# cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet static
    address 10.12.0.50
    netmask 255.0.0.0
    gateway 10.12.0.2

# The secondary network interface
# Commented out now because this was the only way I could get it to work again
#auto eth1
#iface eth1 inet static
#    address 10.12.0.51
#    netmask 255.0.0.0
#    gateway 10.12.0.2

The output for eth1 is identical, except that it shows "Link detected: no" because it's disabled currently; "Link detected" was always "yes" for either interface when it was supposedly enabled, even when eth0 was apparently unable to send any packets.

The "from" address is different, but it's always eth1 and always "source" 10.12.0.50 or .51. That "martian" thing reminded me that I am running Shorewall, but turning it off (and verifying that iptables -L showed nothing but accepting everything from/to anywhere) had no effect whatsoever. I'm not even sure why eth1 would be seeing traffic intended for eth0's address in the first place, given that they're connected to a switch that (in my understanding, anyway) would only send packets to their intended destinations. (It is an unmanaged gigabit switch, Linksys I think.)

I don't even know how to begin to diagnose or troubleshoot what went wrong here. Frankly, I'm afraid to try to start eth1 again, especially since I don't even know what finally fixed the problem so I don't know that I could get it reverted again to its current state. What can I do to figure out what happened, and to fix it so that I can again turn on eth1 without blowing up the server's networking again? Could the hardware still be mis-configured from the previous system install using the bonding driver? How could I determine that and, if that's the case, fix it?

Both ports worked perfectly independently on the previous install before I set up bonding, and I had no issues at all during that time. I re-installed the system about 4-ish weeks ago, and eth1 has been disabled since then (Ubuntu detected it during the installation routine, but I of course chose eth0 as my "primary" interface during the install and Ubuntu apparently made no effort to configure eth1 after that).

One hint regarding your DNS problem: it's a good idea to keep the address of an external DNS server handy for exactly this case. Google's DNS is a good option because it's easy to remember: 8.8.8.8 and 8.8.4.4.
– Sven♦ May 12 '11 at 18:34

@Sven Yeah, that occurred to me after I finally got everything working again, even though I could have easily grabbed OpenDNS' addresses (my DNS server has them configured as the forwarders, and I was sitting at the KVM that connected to said server). Frustration and confusion at what was going on kept me from thinking of it at the time, though.
– Kromey May 12 '11 at 18:48

2 Answers

If you have a bond with two ports connected to the same unmanaged switch, the switch won't support the protocols needed to aggregate the ports. You must use mode=active-backup.

No, your previous configuration won't affect your setup now.

The martians are a result of having two NICs on the same subnet. They're being sent to eth1 as they're broadcast packets. Other than cluttering your logs, you shouldn't have trouble with these in your setup.
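To make the "same subnet" point concrete, here is a minimal sketch that masks both addresses from the question with the 255.0.0.0 netmask. The helper name network_of is made up for this illustration and is not part of any tool.

```shell
#!/bin/sh
# network_of ADDRESS NETMASK
# Prints the network address by ANDing each octet with the mask.
# Hypothetical helper for illustration only.
network_of() (
    # Runs in a subshell, so the IFS change does not leak out.
    IFS=.
    # Split both dotted quads into eight positional parameters.
    set -- $1 $2
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
)

eth0_net=$(network_of 10.12.0.50 255.0.0.0)
eth1_net=$(network_of 10.12.0.51 255.0.0.0)

if [ "$eth0_net" = "$eth1_net" ]; then
    echo "eth0 and eth1 share subnet $eth0_net"
else
    echo "eth0 ($eth0_net) and eth1 ($eth1_net) are on different subnets"
fi
```

With the addresses from the question, both interfaces come out on 10.0.0.0, which is the situation described above.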

The transmit timeouts look like some sort of hardware problem.

What you should do:

Try running: ip addr flush dev eth1; ip link set up dev eth1 to see if merely bringing up eth1 causes eth0 to fail. If it does, you likely have hardware problems.

Set up a single bonded interface (mode=active-backup) with both eth0 and eth1 as slaves and assign the server's IP address to that.
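The two steps above could be scripted roughly as follows. This is only a sketch under several assumptions: the addresses are taken from the question, the bond name bond0 and the miimon value of 100 are illustrative choices, the function names are made up, and the bond is created at runtime with iproute2 (so it needs a reasonably recent ip and does not survive a reboot). By default the script only prints what it would run; set DRY_RUN=0 and run as root to execute for real.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
GATEWAY="10.12.0.2"          # gateway from the question
SERVER_ADDR="10.12.0.50/8"   # eth0's address with netmask 255.0.0.0

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: bring eth1 up with no address, then check that eth0 still
# reaches the gateway. A failing ping here points toward hardware.
test_eth1_link() {
    run ip addr flush dev eth1 &&
    run ip link set up dev eth1 &&
    run ping -c 3 -I eth0 "$GATEWAY"
}

# Step 2: build an active-backup bond from eth0 and eth1 and move the
# server's address onto it. Interfaces must be down before enslaving.
setup_active_backup_bond() {
    run ip link add bond0 type bond mode active-backup miimon 100
    run ip addr flush dev eth0
    run ip link set dev eth0 down
    run ip link set dev eth0 master bond0
    run ip link set dev eth1 down
    run ip link set dev eth1 master bond0
    run ip addr add "$SERVER_ADDR" dev bond0
    run ip link set dev bond0 up
    run ip route add default via "$GATEWAY"
}

test_eth1_link
```

Only the link test runs by default; setup_active_backup_bond is meant to be called separately once the link test passes. Moving the address onto the bond drops connectivity for a moment, so it is best done from the console rather than over SSH.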

My understanding was that some of the other modes -- e.g. round-robin -- were also able to be used on an unmanaged switch. Regardless, though, I don't want to use bonding this time around, but I do want to move some services off to the second NIC. I'll try those commands next time I have physical access to the box -- not about to try to do this over SSH in case it all goes haywire again.
– Kromey May 12 '11 at 19:18


It may work for some value of "work", but I wouldn't trust it enough to put it into production. At best it'll cause all sorts of MAC address flapping; at worst it'll confuse and kill the switch. You'll certainly not get more than 1Gbps inbound. ---- If you're concerned about losing remote access, do: (ip link set up dev eth1; sleep 30; ip link set down dev eth1) inside a screen session.
– MikeyB May 12 '11 at 19:40
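The timed up-and-down from the comment above can be wrapped in a small script so that eth1 is taken down again even if the SSH session dies. This is a sketch, not a tested recipe: the 30-second hold comes from the comment, while the function name and the dry-run switch are additions for illustration. As written it only prints the commands; set DRY_RUN=0 and run as root to execute them.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
HOLD_SECONDS="${HOLD_SECONDS:-30}"   # how long eth1 stays up

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Bring eth1 up, wait, then bring it down again regardless of what
# happened in between (the steps are deliberately not chained with &&).
timed_eth1_test() {
    run ip link set up dev eth1
    run sleep "$HOLD_SECONDS"
    run ip link set down dev eth1
}

timed_eth1_test
```

Saved under a hypothetical name such as eth1-timed-test.sh, it could be started detached with something like: screen -dmS eth1test env DRY_RUN=0 sh eth1-timed-test.sh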

@MikeyB Good idea to do the up-and-down inside a screen session, but bringing eth1 down (with ifdown eth1) didn't solve my problem before, so I remain extremely wary (to the point of paranoia). Incidentally, I've never taken the time to learn to use the ip command effectively; can you point me toward a beginner's tutorial and/or cheat sheet for it? I know I'm supposed to be using it now, but old habits and all that...
– Kromey May 12 '11 at 19:49

I'm not sure of a reference, although the command help is quite good. A nice thing about using 'ip link' commands is that it only affects the link without touching anything else.
– MikeyB May 12 '11 at 20:12

Well, I'm not sure if this is what did it, but I followed your suggestion using ip to flush and then bring up eth1, and everything is hunky-dory now. So while I'm not sure if this is what solved the problem or not, I'm nonetheless accepting this as the answer.
– Kromey May 13 '11 at 2:28

If your NICs were previously bonded together, it's quite possible you need to reconfigure the switch ports. The ports may have been trunked; try plugging your NICs into untagged ports on the same VLAN.