At 08:24 PM 11/20/2006, Mike Tancsa wrote:
>I was trying out the latest snapshot from the cvs with a simple config

This is pretty easy to reproduce. On one of the peers, if I make it busy forwarding a lot of packets/s, I can trigger the broken / stuck state. Is there any way to work around this? Any other debugging info I can provide?

On Tue, Nov 21, 2006 at 08:05:44AM -0500, Mike Tancsa wrote:
> This is pretty easy to reproduce. On one of the peers, if I make it
> busy forwarding a lot of packets/s, I can trigger the broken / stuck
> state. Is there any way to work around this? Any other debugging
> info I can provide?
>
> ---Mike

Turn on debugging of fsm and look whether you get something like

<date....> BGP: <ip> [FSM] Timer (holdtime timer expire)

to check whether it is an expiry of a hold timer.
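
For reference, a minimal way to turn that on from the bgpd vty (a sketch against stock Quagga; exact commands may vary by version):

    bgpd# debug bgp fsm
    bgpd# terminal monitor

"debug bgp fsm" logs FSM events such as timer expiries; "terminal monitor" mirrors log messages to the current vty session, so you can watch for the hold-timer line without digging through syslog.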

With a whole lot of packets you may lose too many keepalive packets, causing the hold timer to expire, which then tears down the session.

I added a possible fix on 2006-11-17 11:46 there.

If the fix fixes this (it did for me, but nobody else has said anything), then the peer will not get stuck in Clearing. So you'll have no problem any more with the other peer closing the session.

But, having a holdtime expire because of too much traffic is another problem. You do not want the session to be dropped in that situation, you want to continue forwarding your traffic! Possible fixes for that could be:

 - send more keepalives
 - increase the holdtime

Either one, or both; but this also increases the time bgpd will take to notice a real loss of a session, which increases the time for a re-connect.

So that trade-off won't get you far.
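
For reference, the knobs in question look like this in bgpd's configuration (a sketch: the AS number, neighbor address, and the 10s/90s values are arbitrary placeholders, not recommendations):

    ! keepalive 10s, holdtime 90s, session-wide default
    router bgp 64512
     timers bgp 10 90
     ! or overridden per neighbor:
     neighbor 192.0.2.1 timers 10 90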

You may have to give bgp packets precedence over the "bulk traffic" you are shovelling through if you really have to keep the traffic rolling, i.e. that would mean quality of service (which is not part of quagga, but of your underlying OS).
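
As an illustration of the OS-side approach, on a Linux router one could mark BGP's TCP port 179 traffic as network control so a queueing discipline can prioritise it (a sketch only; a BSD box would use pf/altq instead, and the marks still need a qdisc that honours them):

    # classify locally originated BGP packets (TCP port 179) as CS6
    iptables -t mangle -A OUTPUT -p tcp --dport 179 -j DSCP --set-dscp-class CS6
    iptables -t mangle -A OUTPUT -p tcp --sport 179 -j DSCP --set-dscp-class CS6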

At 09:09 AM 11/21/2006, Juergen Kammer wrote:
>Turn on debugging of fsm and look whether you get something like
>
><date....> BGP: <ip> [FSM] Timer (holdtime timer expire)
>
>to check whether it is an expiry of a hold timer.
>
>With a whole lot of packets you may lose too many keepalive packets,
>causing the hold timer to expire, which then tears down the session.

Hi,
Thanks for the info! In this case, it's all 3 test peers that can get stuck in the Clearing state. I have one box in the middle (which I posted the config for) that is doing the high-pps routing, and 2 test peers that just have a bunch of static routes defined that I am then advertising to the central router. Any one of the three can be made to get stuck in the Clearing state. I added debugging to syslog and I see the following over and over again

>If the fix fixes this (it did for me, but nobody else has said
>anything), then the peer will not get stuck in Clearing. So you'll
>have no problem any more with the other peer closing the session.
>
>But, having a holdtime expire because of too much traffic is
>another problem.

Actually, that problem is fine and I can live with it. There are only so many pps the box can deal with. The real problem I want to work around is when the routing software gets stuck to the point where I can't clear the session, i.e. a clear or a shut/no shut does not even work.
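
For concreteness, these are the sort of recovery attempts that have no effect once a peer is wedged in Clearing (the neighbor address and AS number are placeholders):

    bgpd# clear ip bgp 192.0.2.1
    bgpd# configure terminal
    bgpd(config)# router bgp 64512
    bgpd(config-router)# neighbor 192.0.2.1 shutdown
    bgpd(config-router)# no neighbor 192.0.2.1 shutdown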

>I added a possible fix on 2006-11-17 11:46 there.
>
>If the fix fixes this (it did for me, but nobody else has said
>anything), then the peer will not get stuck in Clearing. So you'll
>have no problem any more with the other peer closing the session.

Hi,
That indeed seems to have fixed it! Are there any other side effects with this change, or is it an obvious bug?

On Tue, Nov 21, 2006 at 04:50:32PM -0500, Mike Tancsa wrote:
> That indeed seems to have fixed it! Are there any other side
> effects with this change, or is it an obvious bug?

Splendid! A confirmed bug fix!

The only situation affected is when a hold-time expiry happens.

It is a bug, but not an obvious one; you have to think a bit in circles to find this ;-). I happened to stumble upon it in a test environment with a quagga hosted on a virtual machine - and because the clock on the virtual machine is not reliable, once a day the hold time expired on its peers, and they got stuck in Clearing. The fix Paul did cleaned up another race, but this one was buried deeper. Paul fixed a Clearing_Completed that went missing because of a race; this here is a bgp_stop which never gets executed - so this time the Clearing_Completed is missing because it never gets generated by bgp_stop.

The question is whether to fix *entering* Clearing, either by a change in the state machine (as I did), or by a change in the routine setting up for bgp_stop when a hold-time expiry happens - both of which are quite reliably free of side effects - or by handling events *in Clearing*, which could also be used to solve this but would have to be thought over more carefully. ... Paul, are you reading this?

At 05:40 PM 11/21/2006, Juergen Kammer wrote:
>Splendid! A confirmed bug fix!
>
>The only situation affected is when a hold-time expiry happens.

I have it deployed on one of my ibgp routers and so far so good. I haven't seen any ill effects.

> It is a bug, but not an obvious one; you have to think a bit in
> circles to find this ;-). I happened to stumble upon it in a test
> environment with a quagga hosted on a virtual machine - and because
> the clock on the virtual machine is not reliable, once a day the
> hold time expired on its peers, and they got stuck in Clearing.
> The fix Paul did cleaned up another race, but this one was buried
> deeper. Paul fixed a Clearing_Completed that went missing because
> of a race; this here is a bgp_stop which never gets executed - so
> this time the Clearing_Completed is missing because it never gets
> generated by bgp_stop.

Urg, wowser, yes. That's a silly bug. The Clearing state assumes:

 - it can only be entered from Established
 - clear_route_all gets called (i.e. via bgp_stop())

Technically the state machine is also buggy for ConnectRetry_timer_expired and TCP_connection_open_failed, but those events should never be raised in Established in practice.
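
For context, the Established row of bgpd's transition table pairs each event with an action function and a next state, roughly like this (an illustrative excerpt, not the verbatim source; it only shows the entries discussed here, as enumerated later in this thread):

    /* Established row (sketch): {action, next state}, indexed by event */
    {bgp_stop,                Clearing}, /* BGP_Stop                    */
    {bgp_fsm_holdtime_expire, Clearing}, /* Hold_Timer_expired: buggy   */
    {bgp_ignore,              Clearing}, /* ConnectRetry_timer_expired  */
    {bgp_ignore,              Clearing}, /* TCP_connection_open_failed  */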

> The question is whether to fix *entering* Clearing, either by a
> change in the state machine (as I did), or by a change in the
> routine setting up for bgp_stop when a hold-time expiry happens -
> both of which are quite reliably free of side effects - or by
> handling events *in Clearing*, which could also be used to solve
> this but would have to be thought over more carefully.

Yeah, might it be more robust to handle this unconditionally on the transition into Clearing, I wonder? E.g. as in:

> bgp_clear_route gets called when we are leaving Idle, oops: you do
> not check whether you change into Clearing at all there, you
> unconditionally clear all routes whenever a transition happens, uh,
> oh.

D'Oh. Ok, I didn't say I tested it. ;)

Just curious what you think of the approach at least, given your comment on the best place to do it.

Hm... to do something whenever the state changes into a specific one sounds more like sweeping something under the carpet. It would be better to ensure that we have done the right thing when we transition into Clearing.

A look into the transition table shows that we enter Clearing only after calling one of:

 - bgp_stop - already does the right thing
 - bgp_ignore - happens only in situations where no connection was there anyway
 - bgp_fsm_holdtime_expire - does not do the right thing
 - bgp_stop_with_error - calls bgp_stop, see there

So, if this is right, it should suffice that bgp_fsm_holdtime_expire does not enqueue BGP_Stop, but calls bgp_stop directly instead (or calls only bgp_clear_route, if that suffices).
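
For illustration, that variant might look roughly like this (a sketch only, not a tested patch; it assumes the shape of the action function in bgp_fsm.c of that era, and the real function also has to cope with temporary peers and other bookkeeping):

    /* FSM action for Hold_Timer_expired in Established; the transition
       table then moves the peer into Clearing. */
    static int
    bgp_fsm_holdtime_expire (struct peer *peer)
    {
      if (BGP_DEBUG (fsm, FSM))
        zlog_debug ("%s [FSM] Hold timer expire", peer->host);

      /* Tell the remote peer why we are dropping the session. */
      bgp_notify_send (peer, BGP_NOTIFY_HOLD_ERR, 0);

      /* Do not enqueue a BGP_Stop event here: once we are in Clearing
         it would never be acted upon, so no routes get cleared and no
         Clearing_Completed is ever generated.  Tear down directly. */
      return bgp_stop (peer);
    }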

> Hm... to do something whenever the state changes into a specific
> one sounds more like sweeping something under the carpet.

Nah, it's just to ensure that something which /must/ be done on transition into a state *does* get done.

We could add a field to the FSM table (or add another table) to specify actions specific to transitions /into/ some state, but given that for this we're only talking about one state, we might as well just have it in the state-change function, rather than bloat up the FSM table (and add more function-call indirection).
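
For illustration, such a hook in the state-change function might look roughly like this (a sketch; bgp_fsm_change_status and bgp_clear_route_all are the names in the Quagga source of that era, but treat the body as illustrative rather than as the committed fix):

    /* Change the peer's FSM status, performing any actions that must
       happen on entry to the new state. */
    void
    bgp_fsm_change_status (struct peer *peer, int status)
    {
      /* A transition into Clearing must always clear all routes,
         regardless of which event/action function got us here. */
      if (status == Clearing)
        bgp_clear_route_all (peer);

      /* ... existing bookkeeping: remember old status, set new one ... */
      peer->ostatus = peer->status;
      peer->status = status;
    }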

We did similar cleanups in ospfd's neighbour FSM btw, where there were actions common to whole classes of state-changes, and we cleaned up code by moving such actions to the state-change function, rather than replicating code/calls across several different transition action functions.

On Wed, Dec 06, 2006 at 09:37:43PM +0000, Paul Jakma wrote:
> We did similar cleanups in ospfd's neighbour FSM btw, where there
> were actions common to whole classes of state-changes, and we
> cleaned up code by moving such actions to the state-change function,
> rather than replicating code/calls across several different
> transition action functions.

OK, if there is a modus operandi for such thingies, I'm the last to complain.

> It's because the FSM state has reached "Clearing" and only a
> Clearing_Completed event can move it into the next state. No such
> event ever comes, and the work queue items increase because of that.

Sure. The question is whether or not this queue of items is being processed at all. E.g. have a look at "show work-queues", and "show thread cpu" from the bgpd vty.

I.e. is this some kind of "rate work arriving > rate work being done" backlog, or is there some bug with processing the work?
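
For reference, both of those are run from an enable-mode vty session on the live bgpd:

    bgpd# show work-queues
    bgpd# show thread cpu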

>Sure. The question is whether or not this queue of items is being
>processed at all. E.g. have a look at "show work-queues", and "show
>thread cpu" from the bgpd vty.
>
>I.e. is this some kind of "rate work arriving > rate work being done"
>backlog, or is there some bug with processing the work?

Paul,

One interesting thing I noticed is that when I run the same bgpd with a smaller number of routers (50, as compared to some 100+ routers), it runs fine and doesn't show any such symptom. So as a workaround I have now run two bgpd processes, peering with 50 routers each. Since this is production, I can't get you the output you are looking for.