The one page I found says "temporary/short outages" on OSPF, but honestly we've seen outages of many hours or days on OSPF for some sites, and it doesn't happen to all the sites.

Should I just disable the to-cpu on all ports of our aggregation switches? Is there a drawback to doing that if those switches are only QinQ and VLANs, with no L3 beyond the in-band management IP? Can I run the commands on a live network, or will it affect traffic flow on the transit switch?

I'm having trouble understanding the problem. There are 4 solutions listed, but no real explanation of how to figure out which switches are actually the issue, or what the drawback of each solution is.

None of the options seem to be for running on the actual OSPF switches (the ones with the IP interfaces), so is the issue only on switches that are layer 2 transit switches?

The hellos from the router that is in Init state are not reaching its neighbor, but it sees the hellos from its neighbor. You need to track down the path from the Init router to its neighbor, run "show igmp snooping vlan vlan-name" on each switch, look for the 224.0.0.5 entry in the senders list, and find where in the path the entry from one of the neighbors is missing.
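If it helps, the hop-by-hop check can be sketched in Python. This is only a rough sketch under assumptions: it expects you to have already captured the "show igmp snooping vlan vlan-name" text from each switch in path order, and it does not parse the real output format. It only checks whether the group address shows up anywhere in the captured text, so you still need to confirm by eye that the entry is in the senders list and belongs to the right neighbor. The function names and the switch names in the usage note are made up for illustration.

```python
import re


def group_seen(output: str, group: str = "224.0.0.5") -> bool:
    """Return True if `group` appears as a whole address in the captured text.

    The lookarounds keep 224.0.0.5 from matching inside 224.0.0.50 or 1.224.0.0.5.
    """
    pattern = r"(?<![\d.])" + re.escape(group) + r"(?!\.?\d)"
    return re.search(pattern, output) is not None


def first_switch_missing_group(path, group: str = "224.0.0.5"):
    """Walk the path in order and return the first switch without the group.

    `path` is a list of (switch_name, captured_output) tuples, ordered from
    the Init router toward its neighbor. Returns None if every hop shows it.
    """
    for name, output in path:
        if not group_seen(output, group):
            return name
    return None
```

Usage would look something like first_switch_missing_group([("agg1", text1), ("agg2", text2)]), where text1 and text2 are the saved command outputs. The first name it returns is where to start looking.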

Will test it next time it goes down to find which switch is doing it. Is there a list of affected firmware versions? Is it actually a bug that's fixed in a future release, or is it something we just have to apply workarounds for, such as the doc linked?

In addition, I just noticed we have a few sites that have one neighbor in FULL and one in EX_START, and on the core side it's the same: one FULL and one EX_START. Not sure if it's related or the same issue. In the EX_START case I checked, and I see 224.0.0.5 on the VLAN from what appears to be all switches.

Chris, we are a metro carrier running on a layer 2 core of 8900s and this has been a sporadic issue for years. The best workaround we have found is to disable IGMP snooping on your layer 2 only switches as much as possible. If you are not running video that needs to be pruned, then just kill it 100%...

Fix/workaround to help you find which switch you may be having issues with (the theory is that a mcast entry gets corrupted in hardware and then is not forwarded)... You need to make sure that your two routers can at least ping each other's IP interfaces... If not, then you have other issues.

One switch at a time, when you have OSPF or router adjacency issues, run:

clear igmp snooping
clear fdb
clear ipmc fdb

Check to see if your routers regain their adjacencies after each switch you clear in the path, until you find the one that was at fault... GTAC will tell you which code you need to be running for the switch and setup you have. 15. had some issues for sure.
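The clear-one-switch-then-check loop can be written down as a small Python sketch. Everything here is hypothetical scaffolding: clear_tables and adjacency_up are callables you would supply yourself (for example, something that logs in to a switch and runs the clear commands, and something that checks the neighbor state on a router you control), and settle_seconds is a guess at how long to wait for hellos to get through. It is meant to show the order of operations, not to be a tested tool.

```python
import time
from typing import Callable, Iterable, Optional


def find_faulty_switch(
    path: Iterable[str],
    clear_tables: Callable[[str], None],
    adjacency_up: Callable[[], bool],
    settle_seconds: float = 0.0,
) -> Optional[str]:
    """Clear tables on one switch at a time until the adjacency returns.

    Returns the name of the switch whose clear brought the adjacency back,
    or None if no switch in the path made a difference.
    """
    for switch in path:
        clear_tables(switch)        # hypothetical: run the clear commands on this switch
        time.sleep(settle_seconds)  # give the routers time to exchange hellos again
        if adjacency_up():          # hypothetical: check the neighbor state
            return switch
    return None
```

Stopping at the first switch that restores the adjacency is the point: clearing every switch in the path at once would fix the outage but would not tell you which one was holding the bad entry.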

We have found that this seems to be a very random thing and is usually triggered after a topology event where you have an EAPS failover. It seems to happen when we have a port in the rings that flaps multiple times in a short period.

Good luck. We have never found this to be a lack of resources; there are always open buckets in memory for more entries. By default, unless you have an ACL in place to block 224.0.0.0/24, all modern layer 2 switches should always forward the mcast traffic from router mcast IPs, period. The CPU only moves it to hardware the first time... so good luck with your efforts... I will be tracking this one closely too for new info or ideas.

And yes, it's a mcast issue, because ping always works between the L3 interfaces of the backbones.

Does disabling IGMP disrupt traffic when you do it on the layer 2 aggregation? Like, should I do it during a maintenance window?
Also, do I need to reboot the switch to clear the corrupted hardware entry, or will disabling the L2-CPU for all ports be enough?

Chris, disabling IGMP snooping should not break any routing. All disabling does is stop the switch from pruning back any mcast traffic; it treats it like a broadcast packet, so it forwards all mcast, including router announcements, to all ports in the VLAN the routers are in...

If it makes you feel better you can do it during a window, but we have done it under an outage scenario with 89K MACs and 2K VLANs on those 8900s at peak without issues. You can also clear the tables I listed and see if that gives you some relief. If you have one broken and can do it one switch at a time until it starts working, then you can maybe narrow down the culprit.

When all else fails and everything we try to restore the router adjacencies fails, we have had to delete the VLAN or VMAN and reprovision it to clean the hung table. We have never had to reboot to fix this.

Also, what cards are you running on the 8900s? XL cards with MSM 128 need to match up. If you put one of the c cards in a chassis, the whole chassis will drop down to the lesser card. Same thing for MSM. You cut your processing power in half by only running one card.

My problem has been that we don't have any visibility into or access to customers' routers, so when they report a problem I have to get them back up right away and have limited time to troubleshoot this kind of issue. We know there is an issue, but it is impossible to replicate on demand... We will go 2 or 3 months with no issues. With us it is always EIGRP because they are a Cisco shop... It has been one of those things we are all aware of, including the NOC, and know how to fix quickly when it is reported. It also seems to always be smaller, less used services of 1 to 5 Mbps, not any of the larger ones...

Disabling the L2-CPU has not seemed to make much difference with this problem. It may make your messages go away but not passing mcast router announcements is something different I believe.

I even followed your recommendation and did a full disable igmp snooping, but was still stuck, with sites dropping in and out of idle.

I guess the next option, unless Extreme or you have another recommendation, is upgrading from these releases to 21.x/22.x, as I'm really starting to get the feeling that 15.6 was just a buggy branch, and my agg switches and core switches are running 15.6.2.12 (no patch).

Chris, sorry to hear you are still having issues... here is a link to the recommended code: http://www.extremenetworks.com/extreme-hardwaresoftware-compatibility-recommendation-matrices/softwa... It seems you only did the 8900s in your core; are there other layer 2 only switches in the path? There is a chance that you have a resource issue in the blocks and tables, and I would start working with GTAC, if you have not already, to open a case and see if you can figure this out. I can tell you already, though, that they will ask you to get to current code before they will do very much of anything. If there are other layer 2 switches in your path you may also need to clear their tables too... Clearing the tables does not affect traffic, so it can't hurt.

You said yours are coming and going, right? This is a bit different from what we have seen in the past. Once we see the issue on a VLAN, the routers will not find their neighbors until we intervene and clear the tables. They go down and stay down. Since yours come and go, you may indeed have another issue.

Yeah, still having issues. As a note, we don't have 8900s; ours are all x460-g2s and x670-g2s.

I plan to upgrade to the latest recommended patches soon, but due to some traffic engineering issues our redundancy isn't fully redundant at the moment, so I'm trying to get that fixed before I do any reboots/upgrades. We created temporary static routes to back up the OSPF routes until we get the problem solved, so it isn't affecting customer traffic.

I've seen both sites that drop out and stay out seemingly forever and others that drop randomly, but I don't want to open a GTAC case until I upgrade the affected routers and my core to the recommended versions, to avoid reporting something that's already been fixed there.

Makes sense... I am very interested in how this proceeds and what the final fix is... We will be moving away from the 8900 in the next year or so and going to 670 G2s and 870s. I am in the same boat as you. Would love to reboot and upgrade our 8900s, but there are just too many single links and critical services... Some have been running over 1200 days, so they are way overdue...

If we have any additional insight in this I will drop you all an update too.