Fixing high CPU use on Cisco 7600/6500

Oct 26, 2013
Categories: cisco,network

Recently, or rather some time ago (this blog post has also been
lying in draft for a while), someone came to me with a problem they had
with a Cisco 7600. It felt sluggish, and show proc cpu showed that
the weak CPU was very loaded.

This is how I fixed it.

show proc cpu history showed that the CPU use had been high for quite a while,
and too far back to check against any config changes. The CPU use of the router
was not being logged outside of what this command can show.

show proc cpu sorted showed that almost all the CPU time was spent in interrupt mode.
This is the number after the slash in the first line of the output, e.g. 15%.
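
As a rough sketch of how to read that first line, the Python below splits it into total, interrupt and process-level CPU. It assumes the usual IOS header format ("CPU utilization for five seconds: total%/interrupt%; ..."). The 18% total in the sample is made up for illustration (only the 15% interrupt figure comes from this post), and split_cpu is a hypothetical helper name.

```python
import re

# Matches the start of the "show processes cpu" header line (assumed format).
HEADER_RE = re.compile(r"CPU utilization for five seconds: (\d+)%/(\d+)%")


def split_cpu(header):
    """Return (total, interrupt, process) CPU percentages from the header line."""
    m = HEADER_RE.search(header)
    if m is None:
        raise ValueError("not a 'show processes cpu' header line")
    total, interrupt = int(m.group(1)), int(m.group(2))
    # Whatever is not interrupt time is spent in scheduled processes.
    return total, interrupt, total - interrupt


# Illustrative sample line, not captured from the router in this post.
sample = "CPU utilization for five seconds: 18%/15%; one minute: 17%; five minutes: 17%"
print(split_cpu(sample))
```

A high second number points at punted traffic. A high total with a low second number points at processes instead.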

Interrupt mode CPU time is (a bit simplified and restricted to the
topic at hand) used when the router has to react to some user
traffic. Now why would the 7600 use the CPU for the forwarding plane?
It’s a hardware-based platform, isn’t it? Yes and no. The “normal”
traffic path is handled in hardware, but if there’s anything
nonobvious that it has to do then it’s “punted” to the CPU and handled
there.

Ok, so packets are being sent to the CPU because there’s something
special about them. What? You can sniff the RP's traffic and send it using
ERSPAN to some UNIX box, and run tshark/wireshark there. But in some
cases there is no UNIX box (or other sniffer recipient) that can peel
off the ERSPAN header and look inside, and you have no machine more
directly attached so that you can run SPAN or RSPAN.

Enter debug netdr capture rx. It starts a sniff of punted packets
and puts them into a small buffer. When the buffer is full it stops
sniffing. If you’re punting a lot of packets this buffer will of
course fill fast. Then run show netdr captured-packets to see the
packets in the buffer. It’s safe to do on a live system.

When looking at the packets I saw no reason for them to be punted. They
were normal IPv4 packets, TTL wasn’t 0, they weren’t directed to the
router itself, a route existed in the routing table for them,
etc. However, all of the packets were destined to addresses within
the same /24. And sure enough, a traceroute to one of the addresses
showed an obvious routing loop.
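
The two checks in that paragraph (do the punted destinations cluster in one prefix, and does a traceroute revisit a hop) can be sketched in a few lines of Python. This is only an illustration: the addresses are made-up values from documentation ranges, and group_by_slash24 and find_loop are hypothetical helper names. It assumes you have already copied the destination addresses out of the capture by hand.

```python
import ipaddress
from collections import Counter


def group_by_slash24(destinations):
    """Count destination addresses per covering /24 network."""
    counts = Counter()
    for dst in destinations:
        # strict=False lets us pass a host address and get its /24 back.
        counts[ipaddress.ip_network(f"{dst}/24", strict=False)] += 1
    return counts


def find_loop(hops):
    """Return the first hop address that repeats in a traceroute, or None."""
    seen = set()
    for hop in hops:
        if hop in seen:
            return hop
        seen.add(hop)
    return None


# Hypothetical destinations read from the punted-packet capture.
punted = ["192.0.2.10", "192.0.2.77", "192.0.2.200", "192.0.2.10"]
for net, n in group_by_slash24(punted).most_common():
    print(net, n)

# Hypothetical traceroute hops bouncing between two routers.
trace = ["198.51.100.1", "198.51.100.5", "198.51.100.9",
         "198.51.100.5", "198.51.100.9"]
print(find_loop(trace))
```

If one /24 dominates the counts, that prefix is the first place to traceroute.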

I fixed the routing loop, and the CPU use dropped by ~10%. I then did
a new sniff and found more routing loops that had been misconfigured
at the same time. Eventually the interrupt CPU went down to 10%, which
was normal for the features and traffic patterns in use. But the total
CPU load was still at over 90%.

show proc cpu sorted showed a whole bunch of “Virtual Exec”
processes eating most of it. show users revealed that there were
multiple logins by user “rancid”. RANCID is a program that logs in
now and then and runs a few commands to save configuration
changes. Apparently these never had time to finish under the high
load, and having 10 of these logged in at the same time caused quite a
bit of load in itself, so I kicked them out with clear user vty X.

The CPU went back to normal, and everyone lived happily ever after.

Side notes

We couldn’t turn off ip unreachables or even rate-limit the
punting (mls rate-limit or CoPP) because (long story short) that
breaks stuff for us. Normally you can, though. But if you’re using
it as a “fix” then you may just be fixing the symptoms, not the
problem. Had we done it here we would have lowered the CPU use, but
the routing loop would still be there.