Throwing this out there: Cisco 6500 FIB test failure and high CPU usage

I have a number of Cisco 6500 layer 3 switches running in native IOS mode. A few have SUP720-3BXL supervisors, a few still have SUP2-MSFC2 supervisors. One in particular is having a weird issue.

It's a 6509 with dual Sup2/MSFC2 (with PFC2) and a mix of a 6148-GE, a couple 6248 10/100, and a couple 6516-GE line cards. It isn't particularly loaded as far as routes go (a couple hundred total at most); traffic is usually a few megabit per port on average, spiking up to 30 or 80 megabit for short intervals on some ports, and most of it is switched or locally routed. It is running the latest Advanced IP Services release as of a recent update (s222-advipservicesk9_wan-mz.122-18.SXF17b.bin).

In general the switch is fine. Over the last year, though, it has started a strange behavior. Every few months, for a couple weeks to a month at a time, it suddenly shows high CPU load for no apparent reason. We have tried all kinds of tweaks and tests with no discernible result. Most of the time the CPU average is around 3 or 4%; when the issue happens it goes up to 40 or even 50%, and the load varies through the day with the traffic patterns on the switch (CPU load increases during the day when more users are active, matching the increase in network traffic, and drops off at night).

Here is a show processes cpu sorted output from just now, for example, with about 30% CPU (it is still early enough in the day here):

You can see it is running CEF, and almost all interfaces are using CEF route caching per the default. Only a few interfaces have 'ip route-cache same-interface' applied, where they have multiple subnets on the same layer 3 interface. I have found that applying it lowers CPU load when this issue happens, and removing it (reverting those interfaces to plain CEF) only increases CPU load.
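
For reference, the relevant interface config looks roughly like this (the interface number and addresses here are made up for illustration):

interface Vlan50
 description example SVI carrying two subnets (hypothetical addressing)
 ip address 192.0.2.1 255.255.255.0
 ip address 198.51.100.1 255.255.255.0 secondary
 ip route-cache same-interface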

IP Input CPU load is low, and nothing else stands out. Occasionally SNMP load pops up to the top when we are pulling stats to generate graphs for traffic and system numbers. That is about it.

I have tried everything I can think of to figure this out. The issue has persisted from an older software release to this one, and across a failover from one CPU to the other (during the software upgrade we preload the new software on the standby card and then fail over to it, which is faster than a cold reboot of both at the same time).

An interesting wrinkle came up with this new software. Apparently the last version of software we had either didn't have the GOLD hardware monitoring features in it yet, or wasn't reporting them the same way, but after upgrading to the latest release we started seeing this error:

Most of the Google hits for any of the error codes in there seem to indicate that the problem is caused by the TestSPRPInbandPing test failing. Almost all references to this concern a bug that would cause the switch processor to reboot, or fail over to the standby supervisor (if there is one), after 10 consecutive failures of TestSPRPInbandPing or similar hardware monitoring tests. That bug was fixed a while ago, which is why I think ours is not rebooting or failing over to the second processor despite the fact that we have seen the failure count go as high as 90 before we get a recovery message like this:

Of course it reports the failure again in a couple of minutes or so. At most we have seen about 9 or 10 minutes with no failure report.

From the docs, this test appears to basically send 'pings' back and forth between the various parts of the PFC on the supervisor card to make sure the various functions of the switch and router systems are working. Supposedly it will restart the CPU or fail over to the secondary if it errors out enough, but we have not seen that happen yet. Again, other than high CPU we have not seen any issues with traffic or stability.

So this error being reported got me looking at other tests, and I found a couple of other tests that fail that are only run on demand (the one above runs automatically every 15 seconds). TestFibDevices and TestIpFibShortcut both fail with error code 0x1. No idea what that means. I assume that if the FIB is failing somehow, that would result in CEF not working right and a lot of traffic being software switched, resulting in high CPU. If that is a correct assumption, the question then follows: is this a hardware fault (across two previously known-good supervisors) or a software problem (across two different releases)?
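
For anyone who wants to poke at the same thing, the on-demand tests are kicked off and checked roughly like this (module 1 being whichever slot the active sup is in; check 'show diagnostic content module 1' for the exact test names on your release):

diagnostic start module 1 test TestFibDevices
diagnostic start module 1 test TestIpFibShortcut
show diagnostic result module 1 detail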

Unfortunately the Sup2/MSFC2 went end-of-life a long time back and support for it ends in a couple months, even though it is still a great option for our needs. We have a ticket open with Cisco on this and they are looking at a 'show tech' output now. Just wondering if anyone else has ever seen this kind of thing?

Sup2 is a little before my time from a deep-dive perspective. Generally in IOS, if the CPU usage is high but isn't shown for an individual process, the usage is typically occurring at the interrupt level. On software routers that's general packet forwarding; I couldn't say what it would be on that box, where I'd expect forwarding to be done in hardware or process switched. Total WAG, but you might check out TCAM usage, especially with FIB failures.

Yeah, I figure the same way really. The TCAM usage is pretty minimal since it has only a couple hundred routes at most. It is used for routing between different subnets for servers, and it connects to a couple of core routers which in turn connect to edge routers that take full Internet route tables. So it definitely shouldn't be TCAM exhaustion, and there are no logs to indicate that issue either. It definitely looks like a FIB or TCAM related problem though. I'm leaning toward a hardware issue that might be coupled with a software bug.

TCAM will also hold ACLs etc., so it's probably worth actually looking if you haven't. The other thing could be a feature enabled that isn't supported in hardware. Might check the non-CEF-switched statistics.
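
If memory serves, something along these lines shows what is falling out of the CEF path and why (the first command breaks punted packets down by reason), though I'm not positive how much of it is there on a Sup2 running SXF:

show cef not-cef-switched
show interfaces switching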

I'm too lazy to look it up, but is there an equivalent to show platform hardware capacity on the sup2 like on the sup720?

The only ACLs are used for VTY access control and on a couple of interfaces that have a 'set next hop' route-map applied, but there isn't that much traffic on those. IP address matching and set-next-hop route-map ACLs are supposed to be handled in hardware, so I think that is OK.

There is that 'show platform hardware ...' command, but only certain things give any output and most of what you get on the 720 is not there, unfortunately. Mostly you have to rely on correlating interface counters and packet counters.

We plan on going to 720s as soon as possible, it's just a pain because the thing was assembled with the processors at the top of the chassis instead of the middle. So some stuff has to be re-cabled or adjusted and there is a good bit of downtime involved. It's no fun.

It isn't clear if this is available on the Sup2, but it may be worth a shot to see if you can capture packets and see what is hitting the CPU so hard. Maybe someone is starting up a multicast stream with TTL=1 that is jacking the CPU.

I haven't had a chance to do that monitor session capture yet. I did check the output of 'show interfaces switching', and the biggest numbers for IP process packets are on the port-channel interfaces that go to the core routers, which is not surprising. The ratio of process-switched traffic to fast/autonomous-SSE-switched traffic is about 1:4 at most on the interfaces. I would think the ratio would be lower, but looking at other 6500 switches that do not have this CPU load issue, the ratios are about the same.

If I don't hear back from Cisco with some progress I will probably do the traffic capture. That is about the only thing I can see giving any more information.

If that doesn't yield an obvious result, we will probably just have to pull the trigger on sup720s and deal with the pain of swapping them out.

Process switching should show up as "IP Input", or at least it does on other platforms. I don't see why the Sup2 would be different, since process switching means being switched by a process rather than at the interrupt level or in hardware.

The IP Input numbers from show proc cpu are nice and small. At this point I am reasonably sure the config is correct and the switch is using CEF and the other switching paths properly; it's just that something bad happens when some of the traffic is processed in CEF and it somehow makes it back to the CPU to be processed. It just doesn't make any sense.

Oh, and multicast routing is disabled so it shouldn't be anything to do with that.

Doh. Cisco just replied after putting us off for about 36 hours saying they were researching it.

Their response: it's a bug! CSCsc33990. They are suggesting we update the IOS release.

Oh, except the sup is not resetting or failing over to the standby unit. That also doesn't address the high CPU issue (the primary concern) or the other failed FIB tests. And there's the fact that we are running a newer release that supposedly includes the bug fix already.

That earned a bit of a terse response to the ticket. Hopefully they will come back with something better now.

I chopped out most of the interface information. Almost none of them have any access lists applied or anything other than your standard access VLAN or trunk config. There are about 200 VLAN interfaces, not all active. Slots 3, 4, 5, 7, 8, and 9 are line cards with between 16 and 48 ports. Not all ports are used; about 70 ports are free. All IPs and descriptions and such have been changed to whatever I felt like typing at the time.

The other stupid hunch is the GRE tunnels. IIRC, many platforms use an ACL behind the scenes to kick GRE traffic up for special handling, and it doesn't necessarily match only GRE terminating on the local box, just GRE in general. Plain transit GRE traffic can impact the box.

The GRE tunnels are for the route-mapped traffic, which amounts to an average of about 400 kbit, with occasional spikes up to a megabit or two for a few minutes at a time. I don't think it is that, especially since we have been running it that way for about 7 years now. The CPU load issue just popped up a few days back; before that it was averaging 4% CPU load, and now it's 40%.

The OSPF timers are aggressive; we have been talking about backing off to the normal 10-second timers, but we haven't really had any issues besides once or twice when an improper OIR caused a stalled bus a couple years back.

I'd be surprised if it was GRE, but where I was going was that transit GRE traffic may matter, not just GRE traffic destined to the box itself. In that case it would come down to other changes in the environment, i.e. new or more GRE traffic passing through the 6500, rather than your static GRE config.

The ttl thing applies to any packets with ttl of 1, so it could be unicast as well as multicast. So you don't need multicast routing turned on for someone to start a stream with ttl of 1 to cause the high CPU.

I'll have to do the monitor/span session with the RP tomorrow morning to see if I can catch anything. The latest suggestion from Cisco was to either rate limit ip igmp snooping or turn it off altogether. Tried both and nothing happened. As far as I can tell there is almost no igmp stuff going on.

As for the tunnels, from graphs over the past year, nothing has changed. At least nothing significant, a few packets per second more or less.

Hmm, unfortunately it looks like a monitor session with the RP (CPU) as the source is not supported in the latest IOS, or maybe just not on the MSFC2 at all. Most of the docs refer to CatOS commands, and the IOS ones are not specific about which platform they are supported on, but the syntax isn't there and just entering the commands from the document fails.

An interesting wrinkle came up with this new software. Apparently the last version of software we had either didn't have the GOLD hardware monitoring features in it yet, or wasn't reporting them the same way, but after upgrading to the latest release we started seeing this error:

Are you seeing these errors when both supervisors are in primary mode, or just one? The reason I ask is this error, from my understanding, is usually indicative of a hardware issue with the supervisor. If you're seeing the high CPU on both supervisors, but this error on only one supervisor, the two issues are likely unrelated.

MaxIdiot was right with his first post... high CPU with no corresponding process being shown as responsible means the cycles are being spent at interrupt level. This means that traffic is being interrupt switched by the CPU, which could be caused by a number of things.

I'm pretty sure it's not GRE; I think GRE would be process-switched, not interrupt-switched, so you'd see high utilization there under the IP Input process.

Quote:

Hmm, unfortunately it looks like a monitor session with the RP (CPU) as the source is not supported in the latest IOS, or maybe just not on the MSFC2 at all.

It should be. Did the TAC engineer tell you to do this and not provide the appropriate commands to enable it?

I used to run a few 6509's with Sup2's, had them in production until, well, let's just say for too long. Had them in hybrid mode and migrated them to native about four or five years ago (not fun, but worth the hassle).

Anyway, I noticed it looks like you are running NetFlow. Is that something you could live without? I seem to recall issues with NetFlow coexisting with the 'ip route-cache same-interface' command. Might not be the case here, but I had hit-or-miss luck with that command; sometimes it would be fine, sometimes it would hurt. Mind you, that was approx 8-9 years ago on the same hardware you are running, but of course a much older software rev. Figured I'd pass that along just in case.

No, the TAC guy has been almost useless so far, offering up a bug that obviously wasn't what we were experiencing, that didn't address the high CPU issue, and that ignored the version of IOS we are running. Then when we called him out on that, he came back asking us to disable IGMP snooping or rate-limit it, even though there is nothing to indicate that IGMP has anything to do with the issue. We tried his suggestions and it did nothing. He also hasn't addressed the FIB test errors either.

This issue has occurred on and off over the last 6 or 8 months, with the dual sup2 cards in SSO mode (with the second in hot standby). We went back to RPR+ mode to do the software upgrade last week and that didn't change the CPU load. After reboot for the upgrade (and switching to the other CPU) the issue remained. So it happens on 2 different versions of IOS and on both CPU cards.

The router is still in SSO mode and reports that it is OK and ready to fail over if needed.

Uhlek wrote:

One other question -- are you running into any actual performance issues related to the high CPU utilization, or is it just something you're concerned about?

Well, yes and no. It performs fine under normal load, but we had what appeared to be a UDP flood attack from a server at Rackspace a couple weeks back, and that is what brought the issue to our attention again. We had been living with it in the hope that it would get fixed in our next scheduled software upgrade, or when we eventually went to Sup720s (mid 2012 if we keep to the schedule). The UDP flood wasn't anything we haven't seen before; on this router (before the problem) and on others, a similar attack doesn't really even register on the CPU usage graph. However, apparently because of this error, this attack caused the CPU to max out at 100% for the 15 minutes or so it continued unabated, which caused a lot of packet loss and latency, eventually causing the BGP session on the thing to flap because it was too busy to respond to updates from its peers. After we blocked the server that was sending the traffic on our upstream routers, things returned to normal.

SpacemanSpiff wrote:

I used to run a few 6509's with Sup2's, had them in production until, well, let's just say for too long. Had them in hybrid mode and migrated them to native about four or five years ago (not fun, but worth the hassle).

Anyway, I noticed it looks like you are running NetFlow. Is that something you could live without? I seem to recall issues with NetFlow coexisting with the 'ip route-cache same-interface' command. Might not be the case here, but I had hit-or-miss luck with that command; sometimes it would be fine, sometimes it would hurt. Mind you, that was approx 8-9 years ago on the same hardware you are running, but of course a much older software rev. Figured I'd pass that along just in case.

Best of luck!

Actually it is not required on this one, though we have been meaning to make more use of it. We have a couple other 6509s with Sup2s and NetFlow running in similar configs, where we use the NetFlow data to do traffic analysis and graphing for billing, but on this one we are still doing SNMP-based interface graphing.

This issue has occurred on and off over the last 6 or 8 months, with the dual sup2 cards in SSO mode (with the second in hot standby). We went back to RPR+ mode to do the software upgrade last week and that didn't change the CPU load. After reboot for the upgrade (and switching to the other CPU) the issue remained. So it happens on 2 different versions of IOS and on both CPU cards.

My apologies, but I'm not reading your response right. You say you're seeing the same CPU load issues when either supervisor is active, but are you seeing the syslog errors (the %CONST_DIAG-SP-3-HM_TEST_FAIL errors) when either supervisor is active as well?

Quote:

However, apparently because of this error, this attack caused the CPU to max out at 100% for the 15 minutes or so it continued unabated which caused a lot of packet loss and latency eventually causing the BGP session on the thing to flap because it was too busy to respond to updates from its peers.

If a UDP flooding attack caused a CPU spike that bad, odds are the extra 20% load wasn't what pushed you over the edge. I'd recommend setting up control plane policing to help prevent this from occurring again. If you need some assistance in setting that up, feel free to drop me a note.
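
The general shape is just MQC applied to the control plane, something like the sketch below. All the names, the ACL, and the rates are placeholders you would tune for your environment, and keep in mind that on a Sup2 this would be software-only (if SXF supports it there at all); hardware CoPP needs a Sup720/PFC3.

ip access-list extended COPP-UNDESIRABLE
 permit udp any any eq 1434          ! example: traffic you never want hitting the RP
!
class-map match-all CLASS-COPP-UNDESIRABLE
 match access-group name COPP-UNDESIRABLE
!
policy-map COPP-POLICY
 class CLASS-COPP-UNDESIRABLE
  police 32000 conform-action drop exceed-action drop
 class class-default
  police 1000000 conform-action transmit exceed-action drop
!
control-plane
 service-policy input COPP-POLICY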

Quote:

No, the TAC guy has been almost useless so far, offering up a bug that obviously wasn't what we were experiencing, that didn't address the high CPU issue, and that ignored the version of IOS we are running. Then when we called him out on that, he came back asking us to disable IGMP snooping or rate-limit it, even though there is nothing to indicate that IGMP has anything to do with the issue. We tried his suggestions and it did nothing. He also hasn't addressed the FIB test errors either.

If you're feeling like you're not getting the service you're expecting from your TAC engineer, you need to escalate the problem to the TAC duty manager. Call the TAC number, request the duty manager, explain the problem, and request the case be re-queued and escalated to a more senior engineer.

Thanks for the thoughts. We have not reloaded or failed over from one CPU to the other since the software upgrade, so I am not sure whether the other CPU will produce the same diag errors and FIB test errors the current one does, but they both have the same strange CPU load behavior (apparently handling a lot of interrupt-processed traffic for no obvious reason). By that I mean the problem existed before the software update and the switch from one CPU to the other, and continued after it.

The extra 20-30% CPU load we are seeing is, I believe, a sort of side effect of the actual problem. I think that because of whatever is causing it, the router is not processing some or a lot of traffic properly, so the 'normal' traffic causes an unusual load on the CPU, and therefore the spike in traffic (which was almost 500 megabit at peak) basically overwhelmed it.

The control plane traffic controls are something I have been wanting to look into. I have done it on a few smaller routers running 12.4 and 15 but not an 'older' version of IOS like this. I'll have to look into it and see if it looks easy to do.

My boss has been heading the ticket with Cisco and was out of the office yesterday so I'll probably try to prod him to escalate it with them today.

Thanks again.

Oh, and I tried leaving NetFlow disabled (removed the ip route-cache flow and everything) for about 24 hours and it made little or no difference; maybe a few CPU load percentage points, but on average almost none.

Folks here are correct. High CPU in interrupt context means software CEF switching. Some feature you are running, or some of the traffic, cannot be MLS (hardware) CEF switched, so it is punted to the CPU but is still CEF switched rather than process switched.

Please send the output of "show tcam interface Vlan102 acl in ip". You can mangle the IPs; it's just the actions and hit counters I care about.

Also, the suggestion to do an RP inband SPAN to get a look at what traffic is punted to the CPU is a very good one. If other ideas fail, this should be top priority.

Yeah, those SPRPInbandPing test failures are being reported on the currently active Sup. It just started spitting them out after the software upgrade we did in the hope that the high CPU load was a bug or other issue that would be fixed with a reboot. I think the GOLD tests either weren't part of the previous IOS version, or were not running by default, or just weren't logging the failure.

It is interesting: the SPRPInbandPing test reports up to 90 consecutive failures (the highest I have seen, anyway) before recovery (a successful test, I assume), but it has never rebooted or failed over to the hot standby Sup yet. According to the docs, it is supposed to do so after 10 unless you disable the test completely.
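
If it keeps crying wolf, I believe the health-monitoring test can be disabled with something along these lines, though I haven't actually tried it and the exact syntax may differ on SXF:

show diagnostic content module 1
! find the test number for TestSPRPInbandPing in that output, then (test number below is hypothetical):
no diagnostic monitor module 1 test 2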

As for the FIB test failures, I'm at least somewhat clear on what they mean, or at least I understand what role the FIB plays. That is why I ran the tests: to see if the FIB was working properly or possibly had anything to do with what appeared (and I think is now confirmed by majority opinion) to be CEF not working properly (interrupt switching).

I have tried to do the RP inband SPAN, but so far the docs I have found (the one linked in this thread and ones linked from it) suggest either CatOS commands or commands that appear to be unsupported/non-existent in this release of IOS, or maybe just on the Sup2/MSFC2. I would like to do that capture, though, if I can find out how. So far it seems the RP (or any 'local' source) is not a viable source for the monitor session. If you have info on how to do it with a Sup2 I would love to see it.

Thanks for the mention of BFD. I haven't even heard of it, that I remember. Sub-second timers are a big deal to my boss so I will definitely bring that up with him as a preferred method. It looks like a Big F-ing Deal.

As for new boxes being preferred... aren't they always?

The results of 'show tcam interface Vlan102 acl in ip' are not super revelatory to me. Here is the output (with randomized IPs):

Not knowing any better, I am going to assume that looks like it should. The policy route in question is a 'match IP, set next hop' policy, so I believe those are supposed to be done in hardware with CEF. Hopefully that is not the cause of this issue, especially since we have been running things that way for certain traffic for about 7 or 8 years now and it is only a few hundred kilobit of traffic on average.

Yeah, everything looks fine in the output. I was looking for "punt" statements. PBR on the 6500 does not always react well to an ACE that is an explicit deny in the ACL it references; in some scenarios this causes a punt to be programmed in the interface ACL TCAM.
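
As a made-up illustration of the pattern I mean (names and addresses invented), a PBR setup along these lines is the kind of thing that can end up with a punt entry programmed on some code/hardware combinations:

ip access-list extended PBR-INTERESTING
 permit ip 192.0.2.0 0.0.0.255 any
 deny   ip any any                    ! explicit deny ACE: this is what can trigger the punt
!
route-map SET-NEXT-HOP permit 10
 match ip address PBR-INTERESTING
 set ip next-hop 203.0.113.1
!
interface Vlan102
 ip policy route-map SET-NEXT-HOP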

For the RP inband SPAN: it's been a loooong time, but I'm pretty sure it's supported on a Sup2. The local source stuff is for SXH and newer. For SXF you want...

Thanks. It worked after I also specified a source interface (a shutdown port). The part I was missing before was running the 'remote command switch' in front of the monitor add part. I hadn't done anything like that since working on my old 5500 with an RPSP module in it. That was probably 8 years ago now.

Anyway, I got a few hundred MB of capture. So far nothing stands out as an obvious cause or common factor. Basically a lot of traffic is hitting the RP that shouldn't be. HTTP stuff, FTP, other common internet protocols.

My Cisco engineer actually sent an update late yesterday that I hadn't heard about until today. He found a good handful of similar cases where no real solution was ever found except to replace the Sup(s) with either the same or upgraded hardware. That is not a great answer, but it is at least an answer.

I'm going to continue sorting through the capture but if there is no obvious common thread there, I think we will be upgrading to sup 720s sooner than I had planned. It's a pain but we need to anyway for continued support and better management features, so... I guess it was unavoidable anyway.

Here is an example, as the last two are tough to follow without a concrete one. These are the ABCs of verifying unicast hardware forwarding on a 6500 (excluding features which just fuck up everything: PBR, NAT, WCCP).

First we look at the control plane (routing table).
Second is the software forwarding table (CEF).
Then the hardware forwarding table (MLS CEF, aka the FIB TCAM).
Lastly the hardware adjacency table.

Each one programs the one below it so all must agree and errors will flow down the tree.

Yours will look a little diff as this is SXI but still basically the same stuff.
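
Roughly, the commands for each of those steps look like this, using a made-up destination (exact 'show mls cef' syntax can vary a little between SXF and SXI, and the adjacency index comes from the previous command's output):

show ip route 1.2.3.4                         ! 1. control plane: the routing table
show ip cef 1.2.3.4 detail                    ! 2. software CEF table on the RP
show mls cef 1.2.3.0 24 detail                ! 3. hardware FIB TCAM entry for the prefix
show mls cef adjacency entry <index> detail   ! 4. hardware adjacency that entry points to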

Hehe, all those commands work great except 'sh mls cef 1.2.3.0 24 detail'; it just gives blank output. I think it doesn't work on the Sup2, because I tried it on my switches with the Sup720-3BXL and it works great there. That was the command that was most interesting to me, too.

Umm, it really should work. I don't have any Sup2s or SXF anywhere to test on, but it, or something very close, SHOULD work. If the command was not valid you should get the caret thingy. Sorry dude, without a Sup2 to mess around with I can't get past the cobwebs in my brain that my Sup2 knowledge is hidden behind.

Well, I am pretty sure I am doing it right. I assumed your example was meant to indicate that I should be looking up information about a specific IP with the higher-level commands and then the netblock for that IP with the lower-level ones, like the 'show mls cef ... detail' command. I followed the suggested syntax (both yours and the command-line format). That command just gives a blank result and returns the prompt. No error.

No, that's deff correct. I could be missing something. If you try just "show mls cef" it should give a brief dump of all the MLS CEF entries. If you get nothing from that, then I am missing something. If you get output from "show mls cef" but then nothing when checking an exact prefix, then hardware is not being programmed for some prefixes and they will deff punt to the CPU.

Yeah, that is where I am at too. I get good results from everything you listed except that command. That basically matches the problem I have, though: everything is correct, but the last bit fails and things get punted to the RP.

Gosh. I have to apologize for not coming back to this topic to complete it.

Of course, Cisco support was pretty much correct, I think. Replacing the Sup2 cards with Sup720s eliminated the problem, either because of a hardware failure of some kind or because of 'oversubscription'.

Of course, we had a wonderful snafu when we found that our research had been a bit incomplete. The line cards we were using were compatible with the Sup 720.

However.

They were not compatible with the Sup 720 running the 12.2(33)SXJ code release that we wanted to run. So on boot-up, after getting everything redone with the new supervisors, two 48-port Fast Ethernet cards failed to show up in the switch. Cue mad scramble to replace those cards. Fortunately I had a 48-port gigabit card I had been meaning to swap in there anyway, and I was able to cannibalize another from a switch that only had 2 ports in use, so I could easily move those to another card. After that, everything was back to normal. CPU use on the switch is back to averaging 6-9%.