WAN down momentarily = Reboot to recover

Have found a lot of people reporting what are likely related issues, with no resolution proposed yet…

While I was setting up a new pfSense 2.1 install on a repurposed IBM ThinkCentre, there was some plugging and unplugging of cables, and afterwards there was no connectivity to the Internet. The dashboard showed 100% CPU utilization and the WAN link state was alternating up, down, up, down, constantly. After trying a number of things to recover, I rebooted the machine, and it came back up with a working WAN; all was well. A little later I unplugged the WAN cable again temporarily, and the same thing happened. Again, the only solution was to reboot. Here is what may be helpful information for others experiencing similar issues:

- If WAN is DHCP and the cable is momentarily unplugged (ISP device reboot), CPU utilization goes to 100%, the WAN adapter link state cycles up/down constantly, and all Internet connectivity is lost. Rebooting the machine via Diagnostics or Terminal is the only fix.
- If WAN is Static and the cable is momentarily unplugged (ISP device reboot), CPU utilization goes to 100%, the WAN adapter link state cycles up/down constantly, and all Internet connectivity is lost. Rebooting the machine via Diagnostics or Terminal is the only fix.

I switched WAN from the onboard Intel NIC to a new Realtek Gigabit PCI card and repeated the above steps, with the same result. The only packages currently installed are Unbound, Squid v3 and SquidGuard.

The ISP is pretty reliable, unlike some of the reports from others having similar issues, but since the only solution is to reboot the machine, this seems to have the potential of being a big issue for anyone running critical services in-house. Any suggestions?

@niner:
> -If WAN is Static and cable is momentarily unplugged (ISP device reboot), CPU utilization goes to 100%, WAN adaptor link state cycles up/down constantly, and all Internet connectivity is lost. Rebooting the machine via Diagnostics or Terminal is the only fix.

While the dashboard shows 100% CPU utilization, what does the pfSense shell command
```
top -S -H
```
show as the top 3 or 4 CPU consumers, and what proportion of CPU is each consuming?

@niner:
> I switched WAN from the onboard Intel NIC to a new Realtek Gigabit PCI card and repeated the above steps,

What type of Intel NIC?

I'm running pfSense build 2.1-BETA1 (i386), built on Sat Feb 23 16:39:07 EST 2013. Its WAN interface is vr0 and it uses DHCP. I disconnected the WAN link from the switch for a few minutes and then plugged it back in, and there was no sign of the sort of meltdown you described: I could connect to the GUI over the WAN link, and the dashboard showed an uptime of over 25 days, so the WAN recovered without a reboot.

> While the dashboard shows 100% CPU utilization, what does the pfSense shell command `top -S -H` show as the top 3 or 4 CPU consumers and what proportion are they consuming?

Both DHCP and Static configurations yield basically the same results. The stats fluctuate, as seen in the following 2 screenshots, but check_reload_status (40%-70%) is almost always at the top of the list, along with syslogd (5%-25%), php (5%-75%) and unbound (5%-10%):

> I'm running pfSense build 2.1-BETA1 (i386), built on Sat Feb 23 16:39:07 EST 2013. Its WAN interface is vr0 and it uses DHCP. I disconnected the WAN link from the switch for a few minutes and plugged it back into the switch and there was no sign of the sort of meltdown you described in that I could connect to the GUI over the WAN link and the dashboard showed an uptime of over 25 days: the WAN recovered without a reboot.

My WAN interface is normally fxp0 and Static (or re0 and DHCP, if I am testing and using the Gigabit NIC) and the dashboard and LAN access remain available during the loss of WAN, so I can initiate a reboot locally…but if I were not onsite it would be another matter.

If there is other information I can provide, or if there are suggestions for further troubleshooting, please let me know. Thanks.

It would probably also be helpful to have an extract of the system log, starting a few lines BEFORE the WAN interface is reported down and continuing for maybe the next few minutes. You will probably need to display the whole log file with the pfSense shell command
```
clog /var/log/system.log
```

A packet capture on the WAN interface would also be interesting to see - is anybody trying to start a conversation?
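
For the log extract, something like the following might save some scrolling. It is only a sketch: `link_down_extract` is a made-up helper name, and it assumes the kernel reports the loss of link with its usual `link state changed to DOWN` wording. It prints a few lines before the first such message and a chosen number of lines after it.

```shell
# Print context around the first "link state changed to DOWN" line on stdin.
# Assumption: the kernel logs link loss as "<interface>: link state changed to DOWN".
# link_down_extract is a hypothetical helper; $1 = lines before, $2 = lines after.
link_down_extract() {
    awk -v before="${1:-5}" -v after="${2:-200}" '
        found {
            # After the first DOWN event, print up to "after" lines, then stop.
            if (left > 0) { print; left--; next }
            exit
        }
        /link state changed to DOWN/ {
            # Flush the buffered lines that preceded the first DOWN event.
            start = (NR - before > 1) ? NR - before : 1
            for (i = start; i < NR; i++) print buf[i % (before + 1)]
            print
            found = 1
            left = after
            next
        }
        # Ring buffer holding the most recent lines seen so far.
        { buf[NR % (before + 1)] = $0 }
    '
}
```

It reads from standard input, so it can be fed from the command above, for example `clog /var/log/system.log | link_down_extract 5 300`.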

As requested, more info below. After a reboot the log settled down and there was no activity until I logged into the GUI, so there is nothing interesting to show before the WAN interface goes down. Here is what happens when it does go down:

Thanks for the capture and the log extract. I presume these were with the WAN interface configured with static IP (no sign of DHCP in the packet capture) BUT I notice /etc/rc.linkup appears to write those DEVD messages on interfaces which don't have static IPs.

It would be more informative to have a packet capture covering the time span of the log extract but there are a number of things that seem strange to me:

Perhaps check_reload_status is setting the interface down in order to reset things before attempting to restart the link, and the consequent kernel report is sneaking past the check_reload_status announcement to the kernel log. Otherwise it is hard to explain why check_reload_status is starting the interface immediately after the kernel reports it down.

Perhaps the carrier signal from the modem is "flapping". Perhaps these messages are coming from independent processes that don't know what the other is doing.
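
One rough way to gauge how much flapping is going on would be to count the link transitions per interface in the log. The sketch below assumes the kernel lines contain `<interface>: link state changed to UP` or `... DOWN`; `count_link_flaps` is a hypothetical name, not an existing pfSense tool.

```shell
# Count link-state transitions per interface from log lines on stdin.
# Assumption: kernel lines contain "<interface>: link state changed to UP|DOWN".
# count_link_flaps is a hypothetical helper name.
count_link_flaps() {
    awk '
        /link state changed to (UP|DOWN)/ {
            for (i = 2; i < NF; i++) {
                if ($i == "link" && $(i + 1) == "state") {
                    ifname = $(i - 1)
                    sub(/:$/, "", ifname)      # "re0:" -> "re0"
                    count[ifname " " $NF]++    # key: "<interface> <UP|DOWN>"
                    break
                }
            }
        }
        END { for (key in count) print key, count[key] }
    ' | LC_ALL=C sort
}
```

Run as `clog /var/log/system.log | count_link_flaps`, it prints one line per interface and direction with the number of transitions seen. A large count over a short stretch of log would at least confirm the up/down cycling, though not which side is causing it.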

Actually, when first setting up the system and experiencing this issue the WAN was connected to an AP in bridge mode, and was only connected directly to the ISP modem later in the setup/troubleshooting process. So, while it is possible that the modem and the AP are contributing to this behavior, it's not likely.