EX9200 MC-LAG Failover Recovery Times https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/308551#M16349
<P>Hello All,</P>
<P>&nbsp;</P>
<P>I would appreciate any answers to help build some consensus on my current situation.&nbsp; I currently have two EX9204 chassis configured as an MC-LAG pair with VRRP. When I simulate a failover by rebooting either member, I see a network disruption of roughly 20 seconds before connectivity resumes, both internally and externally.&nbsp; By that I mean for multihomed nodes (switches and servers) connected directly to the MC-LAG and for external nodes upstream of the EX9204 switches.</P>
<P>&nbsp;</P>
<P>I’ve been working with support for the past month and we have not been able to reduce this recovery time.&nbsp; What I’m trying to find out is whether others consider this expected behavior for this platform.&nbsp; Our company chose this platform and is anxious to put it into production to replace antiquated equipment from another vendor. Although 20 seconds is not a lot of time, a failover and recovery period of that length is not really acceptable, in my opinion, if it were ever to happen during business hours.</P>
<P>&nbsp;</P>
<P>Please, anyone, share your experiences, guidance, and/or opinions.&nbsp; Thanks!</P>
Thu, 01 Jun 2017 06:25:44 GMT https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/308551#M16349 jhoseee1 2017-06-01T06:25:44Z
Re: EX9200 MC-LAG Failover Recovery Times https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/309451#M16450
<P>Hello,</P>
<P>&nbsp;</P>
<P>Could you please use the below configuration and update the status?</P>
<P>&nbsp;</P>
<P>set protocols iccp local-ip-addr 1.1.1.2<BR />set protocols iccp peer 1.1.1.1 session-establishment-hold-time 50<BR />set protocols iccp peer 1.1.1.1 backup-liveness-detection backup-peer-ip 30.30.30.2<BR />set protocols iccp peer 1.1.1.1 liveness-detection minimum-interval 300<BR />set protocols iccp peer 1.1.1.1 liveness-detection transmit-interval minimum-interval 300</P>
<P>&nbsp;</P>
<P>set interfaces ae0 aggregated-ether-options lacp active<BR />set interfaces ae0 aggregated-ether-options lacp system-id 00:00:00:00:00:01<BR />set interfaces ae0 aggregated-ether-options lacp admin-key 1<BR />set interfaces ae0 aggregated-ether-options mc-ae mc-ae-id 1<BR />set interfaces ae0 aggregated-ether-options mc-ae chassis-id 1<BR />set interfaces ae0 aggregated-ether-options mc-ae mode active-active<BR />set interfaces ae0 aggregated-ether-options mc-ae status-control standby<BR />set interfaces ae0 aggregated-ether-options mc-ae init-delay-time 30<BR />set interfaces ae0 aggregated-ether-options mc-ae events iccp-peer-down</P>
<P>&nbsp;</P>
Thu, 22 Jun 2017 07:02:34 GMT https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/309451#M16450 srinireddy 2017-06-22T07:02:34Z
Re: EX9200 MC-LAG Failover Recovery Times https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/309668#M16475
<P>I have no experience with that technology; however, are you running the latest version?</P>
<P class="Para1"><SPAN>Starting with Junos OS Release 14.2R3 on MX Series routers, enhanced convergence improves Layer 2 and Layer 3 convergence time when a multichassis aggregated Ethernet (MC-AE) link goes down or comes up in a bridge domain or <SPAN class="gterm"><A class="ct_orange_500" title="VLAN" href="https://www.juniper.net/documentation/en_US/junos/topics/concept/mc-lag-feature-summary-best-practices.html#jd0e565" rel="#jd0e565" target="_blank">VLAN</A></SPAN>.</SPAN> Convergence time is improved because the traffic on the MC-AE interface is switched to the interchassis link (ICL) without waiting for a MAC address update.</P>
<P class="Para1">If you have configured an <SPAN class="gterm"><A class="ct_orange_500" title="IRB" href="https://www.juniper.net/documentation/en_US/junos/topics/concept/mc-lag-feature-summary-best-practices.html#jd0e571" rel="#jd0e571" target="_blank">IRB</A></SPAN> interface over an MC-AE interface that has enhanced convergence enabled, then you must configure enhanced convergence on the IRB interface as well. Enhanced convergence must be enabled for both Layer 2 and Layer 3 interfaces.</P>
Tue, 27 Jun 2017 21:30:40 GMT https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/309668#M16475 lyndidon 2017-06-27T21:30:40Z
Re: EX9200 MC-LAG Failover Recovery Times https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/309789#M16483
<P>Hello&nbsp;@<SPAN class=""><A id="link_16" class="lia-link-navigation lia-page-link lia-user-name-link" href="https://forums.juniper.net/t5/user/viewprofilepage/user-id/151102" target="_self">srinireddy</A>,</SPAN></P>
<P>&nbsp;</P>
<P><SPAN class="">I will try to make those changes but I have a few questions:</SPAN></P>
<P>&nbsp;</P>
<P><SPAN class="">For the iccp settings, you don't specify the&nbsp;redundancy-group-id-list setting. &nbsp;Is this not needed? Here is what I currently have on the active member:</SPAN></P>
<P><SPAN class="">&nbsp;</SPAN></P>
<P><SPAN class="">set protocols iccp local-ip-addr 10.20.127.1<BR />set protocols iccp peer 10.20.127.2 session-establishment-hold-time 50<BR />set protocols iccp peer 10.20.127.2 redundancy-group-id-list 1<BR />set protocols iccp peer 10.20.127.2 backup-liveness-detection backup-peer-ip 10.20.18.5<BR />set protocols iccp peer 10.20.127.2 liveness-detection minimum-interval 2000<BR />set protocols iccp peer 10.20.127.2 liveness-detection multiplier 4<BR /></SPAN></P>
<P>&nbsp;</P>
<P><SPAN class="">Also, with ae0, you are referring to a typical mc-ae and not the ICCP link, correct? &nbsp;Further, for <SPAN>mc-ae events iccp-peer-down, should&nbsp;prefer-status-control-active be set or not? &nbsp;Thanks for your help,</SPAN></SPAN></P>
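<P>As a rough aid for comparing the two sets of ICCP liveness-detection values posted so far, the sketch below estimates detection time as interval times multiplier. It is only arithmetic on the numbers in this thread. It assumes ICCP liveness detection behaves like BFD, that the multiplier defaults to 3 when none is configured, and it ignores timer negotiation between the peers; the function name is made up for illustration.</P>

```python
# Rough estimate only: detection time ~= interval x multiplier.
# Assumptions (not confirmed in this thread): ICCP liveness detection
# behaves like BFD, the multiplier defaults to 3 when not configured,
# and timer negotiation between peers is ignored.

def detection_time_ms(min_interval_ms: int, multiplier: int = 3) -> int:
    """Return the approximate time to declare the peer down, in ms."""
    return min_interval_ms * multiplier


# Values taken from the configurations posted above.
current = detection_time_ms(2000, multiplier=4)  # minimum-interval 2000, multiplier 4
suggested = detection_time_ms(300)               # minimum-interval 300, no multiplier shown

print(f"current settings:   ~{current} ms")
print(f"suggested settings: ~{suggested} ms")
```

<P>Under those assumptions the current settings work out to roughly 8000 ms and the suggested ones to roughly 900 ms. Whether this timer accounts for the outage seen on failover was not established in the thread.</P>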
<P>&nbsp;</P>
Thu, 29 Jun 2017 20:01:23 GMT https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/309789#M16483 jhoseee1 2017-06-29T20:01:23Z
Re: EX9200 MC-LAG Failover Recovery Times https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/309790#M16484
<P>Hi, I'm currently running 17.2. &nbsp; Can you elaborate on how enhanced convergence is obtained on L2/L3 interfaces?</P>
<P>&nbsp;</P>
<P>As an update to the situation, this is the progress so far:</P>
<P>&nbsp;</P>
<P>1) For software upgrades, reboots, and graceful shutdowns for maintenance, taking down all ae’s on the member in question first works with little to no loss of traffic.&nbsp; It’s not a procedure I would expect to need for this class of networking equipment, but at least there is a method, I suppose.</P>
<P>&nbsp;</P>
<P>2) For a complete failure, hang, or power failure on a member of the core pair:</P>
<P>&nbsp;</P>
<P>If the standby member suffers a complete, unexpected, sudden outage, there is little to no (sub-second) loss of traffic.</P>
<P>&nbsp;</P>
<P>However, if the active member suffers a similar failure, there is about a 10 second outage, for layer 3 traffic only.&nbsp; It’s not a lot, but it’s still not acceptable for this level of network switching. I believe this happens because the downstream LACP links have to reconverge; I've noticed that traffic only resumes once that has completed.</P>
<P>&nbsp;</P>
<P>Is this expected behavior with MC-LAG on EX platforms in a power-down failure scenario? &nbsp;Thanks,</P>
<P>&nbsp;</P>
<P>&nbsp;</P>
Thu, 29 Jun 2017 20:06:30 GMT https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/309790#M16484 jhoseee1 2017-06-29T20:06:30Z
Re: EX9200 MC-LAG Failover Recovery Times https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/392933#M18935
<P>You just need to add the prefer-status-control-active knob on all MC-AE interfaces on both MC-LAG peers.</P>
<P>&nbsp;</P>
<P>set interfaces ae0 aggregated-ether-options mc-ae events iccp-peer-down prefer-status-control-active</P>
<P>&nbsp;</P>
<P>It makes sure that the LACP system ID is retained at its configured value when the MC-LAG peer is rebooted or shut down.&nbsp;&nbsp; You should see at most 1 or 2 ping drops once the knob is applied.</P>
Thu, 01 Nov 2018 20:58:24 GMT https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/392933#M18935 kashifnawaz 2018-11-01T20:58:24Z
Re: EX9200 MC-LAG Failover Recovery Times https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/415496#M19154
<P>In version 16 with a similar configuration, when you have&nbsp;<STRONG>mc-ae events iccp-peer-down prefer-status-control-active</STRONG> configured on both peers, you will get a warning like this one for every MC-AE when you try to commit the configuration:</P>
<P>&nbsp;</P>
<P>[edit interfaces ae83 aggregated-ether-options]</P>
<P>'mc-ae'</P>
<P>&nbsp;&nbsp;&nbsp; warning: prefer-status-control-active is used with status-control standby. Use this command only if BLD is configured</P>
<P>&nbsp;</P>
<P>However, when we removed the command, we experienced the same situation described here - 60 seconds of outage while the downstream port channels go down and back up due to LACP issues,&nbsp; and another 60 seconds when the primary chassis comes back online.&nbsp; There are no issues when the standby chassis goes down/up.</P>
Thu, 13 Dec 2018 20:14:35 GMT https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/415496#M19154 aswartz 2018-12-13T20:14:35Z
Re: EX9200 MC-LAG Failover Recovery Times https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/415575#M19160
<P>Please add hold timers greater than your BFD detection time (interval times multiplier) to the physical interfaces that make up your ISL, and test failover again.</P>
<P>&nbsp;</P>
<P>For example, with BFD = 3x1000ms:</P>
<P>&nbsp;</P>
<PRE>set interfaces et-0/0/52 hold-time up 100
set interfaces et-0/0/52 hold-time down 4000
set interfaces et-0/0/52 ether-options 802.3ad ae0
set interfaces et-0/0/53 hold-time up 100
set interfaces et-0/0/53 hold-time down 4000
set interfaces et-0/0/53 ether-options 802.3ad ae0</PRE>
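<P>The relationship between the hold-time and the BFD timers in the example above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the BFD detection time is simply interval times multiplier (3 x 1000 ms here); the function name is hypothetical and nothing in it talks to a switch.</P>

```python
# Sanity check for the example above: the "hold-time down" value should
# exceed the BFD detection time. Assumption: detection time is taken as
# interval x multiplier, which is a simplification.

def hold_down_margin_ms(hold_down_ms: int, bfd_interval_ms: int, bfd_multiplier: int) -> int:
    """Return how many ms the hold-down timer exceeds BFD detection by.

    A negative result means the hold-down timer is shorter than the
    BFD detection time.
    """
    detection_ms = bfd_interval_ms * bfd_multiplier
    return hold_down_ms - detection_ms


# hold-time down 4000 with BFD = 3 x 1000 ms, as in the example above.
margin = hold_down_margin_ms(4000, bfd_interval_ms=1000, bfd_multiplier=3)
print(f"hold-down margin over BFD detection: {margin} ms")
```

<P>With the values from the example the margin is positive (1000 ms), which matches the advice to keep the hold timer above the BFD timers.</P>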
<P>&nbsp;</P>
Fri, 14 Dec 2018 00:15:51 GMT https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/415575#M19160 smicker 2018-12-14T00:15:51Z
Re: EX9200 MC-LAG Failover Recovery Times https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/419898#M19181
<P>After implementing BLD and prefer-status-control-active on both MC-LAG peers, as well as the hold timers on the ISL link suggested above, I am still experiencing the same situation.&nbsp; If we manually shut down all the interfaces on either chassis, or reboot the standby chassis, we see no traffic interruption.&nbsp; However, issuing a "request system reboot" command on the primary node causes the VRRP gateways to become unreachable immediately, and endpoints on downstream switches become unreachable after 10 seconds.&nbsp; Everything comes back up at the same time, 50 seconds after issuing the reboot command.&nbsp; I'm guessing the same would be true in a hard failure situation, although I haven't tried physically powering down the node without the reboot command.</P>
<P>&nbsp;</P>
<P>I'm still a little confused about which link to adjust the hold-down timers on (ISL vs ICCP).&nbsp; In our environment we have an ICCP link configured, but we're peering with an IP address on an IRB.&nbsp; Is the link described as "ICCP" actually doing anything MC-LAG related for us, since we're peering with an IP associated with an IRB?&nbsp; See the config below for more details:</P>
<P>&nbsp;</P>
<PRE>admin@TEST-GSJA-re0&gt; show configuration interfaces ae0
apply-groups-except global-AE-PARAM;
description "[TEST-GSJA-to-TEST-GSJB ICCP Inter-chassis communications link ae0 ]";
aggregated-ether-options {
    lacp {
        active;
    }
}
unit 0 {
    family inet {
        address 10.144.200.1/30;
    }
}

{master}
admin@TEST-GSJA-re0&gt; show configuration interfaces ae10
apply-groups-except global-AE-PARAM;
description "[TEST-GSJA-to-TEST-GSJB ICL Inter-chassis communications link ae10 ]";
mtu 1518;
aggregated-ether-options {
    lacp {
        active;
    }
}
unit 0 {
    family ethernet-switching {
        interface-mode trunk;
        vlan {
            members all;
        }
    }
}

admin@TEST-GSJA-re0&gt; show configuration interfaces irb.10
family inet {
    address 10.144.100.1/30 {
        arp 10.144.100.2 l2-interface ae10.0 mac ec:13:db:11:8f:f0;
    }
}

admin@TEST-GSJA-re0&gt; show configuration protocols iccp
local-ip-addr 10.144.100.1;
peer 10.144.100.2 {
    session-establishment-hold-time 50;
    redundancy-group-id-list 10;
    backup-liveness-detection {
        backup-peer-ip 10.121.121.11;
    }
    liveness-detection {
        minimum-receive-interval 500;
        transmit-interval {
            minimum-interval 500;
        }
    }
}

admin@TEST-GSJA-re0&gt; show configuration groups global-AE-PARAM
interfaces {
    "&lt;ae[1-9]*&gt;" {
        aggregated-ether-options {
            lacp {
                active;
                system-id 00:01:02:03:04:05;
                admin-key 5;
            }
            mc-ae {
                redundancy-group 10;
                chassis-id 0;
                mode active-active;
                status-control active;
                events {
                    iccp-peer-down {
                        prefer-status-control-active;
                    }
                }
            }
        }
    }
}
</PRE>
<P>&nbsp;</P>
Wed, 19 Dec 2018 19:46:20 GMT https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/419898#M19181 aswartz 2018-12-19T19:46:20Z
Re: EX9200 MC-LAG Failover Recovery Times https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/419903#M19182
<P>&nbsp;</P>
<P>Does it work if you remove LACP from ae10? I’ve not seen LACP on the ISL in any of the Juniper recommended configs for MC-LAG; perhaps it interacts in some way with MC-AE?</P>
Wed, 19 Dec 2018 21:31:12 GMT https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/419903#M19182 smicker 2018-12-19T21:31:12Z
Re: EX9200 MC-LAG Failover Recovery Times https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/467398#M20957
<P>AE10 is ICL.</P>
<P>&nbsp;</P>
<P>From the 9214 MC-LAG document:</P>
<P><A href="https://www.juniper.net/documentation/en_US/release-independent/nce/information-products/pathway-pages/nce/nce-145-mc-lag-ex-core-campus.pdf" target="_blank">https://www.juniper.net/documentation/en_US/release-independent/nce/information-products/pathway-pages/nce/nce-145-mc-lag-ex-core-campus.pdf</A></P>
<P>&nbsp;</P>
<P>set interfaces ae1 description ICL-LINK</P>
<P>set interfaces ae1 aggregated-ether-options lacp active<BR />set interfaces ae1 aggregated-ether-options lacp periodic fast<BR />set interfaces ae1 unit 0 family ethernet-switching interface-mode trunk<BR />set interfaces ae1 unit 0 family ethernet-switching vlan members all</P>
Thu, 29 Aug 2019 17:27:07 GMT https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/467398#M20957 RogerSpalding 2019-08-29T17:27:07Z
Re: EX9200 MC-LAG Failover Recovery Times https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/467405#M20958
<P>@<SPAN class=""><A id="link_8e049d2a2aa343" class="lia-link-navigation lia-page-link lia-user-name-link" href="https://forums.juniper.net/t5/user/viewprofilepage/user-id/150719" target="_self">jhosee</A>&nbsp;- I'll give you my 2 cents worth.&nbsp; For MC-LAG type technology, which is now close to 15 years old, failback/recovery ALWAYS takes longer, and in general is never sub-second.&nbsp; For these systems, it is easy to note and stop forwarding for a link or switch down event.</SPAN></P>
<P>&nbsp;</P>
<P><SPAN class="">For a recovery event, a lot is going on, and all of it must be noted and learned by the system before it can forward traffic properly.&nbsp; L2 only should be faster than L3, but the system must learn that the link is back up and good (LACP), add the link back into the AE hash configuration, and restart sending while expecting (hoping) that the remote end has also adjusted properly.</SPAN></P>
<P>&nbsp;</P>
<P><SPAN class="">In general I have found recovery times of 3-5 seconds are "normal", but depending upon the size/scale of the network and exactly what is being measured, some flows are very likely to recover faster than others.&nbsp; Scale is a BIG determining factor: the more items the system needs to learn, the longer it is bound to take.&nbsp; Have you tested, say, a single L2 MAC on a link recovery to see what the recovery time is?</SPAN></P>
<P>&nbsp;</P>
<P><SPAN class="">I am not sure exactly what MC-LAG configuration you are using, as Juniper documentation for MC-LAG, although greatly improved, has never been 100% consistent across products.</SPAN></P>
<P>&nbsp;</P>
<P><SPAN class="">As someone else pointed out, there have also been some SW changes that deprecated some commands and made other commands (<SPAN>prefer-status-control?) require additional associated knob settings.</SPAN></SPAN></P>
<P>&nbsp;</P>
<P><SPAN class=""><SPAN>In general, I believe the most common recovery situation is a SW upgrade, where the whole switch must be rebooted and brought back online.&nbsp; For this situation, I recommend:</SPAN></SPAN></P>
<P>&nbsp;</P>
<P><SPAN class=""><SPAN>#1 - completely isolate the switch to be rebooted to start with; deactivate all interfaces.</SPAN></SPAN></P>
<P><SPAN class=""><SPAN>#2 - then activate just the interfaces associated with ICL/ICCP, which in my network designs are always a single common AE</SPAN></SPAN></P>
<P><SPAN class=""><SPAN>#3 - let this sit for a while, and make sure MACs/ARPs are being synced between the 2 nodes</SPAN></SPAN></P>
<P><SPAN class=""><SPAN>#4 - then bring up the links, either one at a time or all at once, and at this point measure recovery.&nbsp; This recovery should be way less than 10-20 seconds, in my experience.</SPAN></SPAN></P>
<P>&nbsp;</P>
<P><SPAN class=""><SPAN>Just FYI, I have been working with MC-LAG type technology since 2005/6 and I am quite aware of what is required SW wise to get recovery to happen quicker.</SPAN></SPAN></P>
<P>&nbsp;</P>
<P><SPAN class=""><SPAN>If you really want faster recovery, I would suggest looking at EVPN/ESI based network designs, vs MC-LAG.&nbsp; At this point in networking, MC-LAG technologies should be considered as really being out-dated, IMHO.</SPAN></SPAN></P>
<P>&nbsp;</P>
<P><SPAN class=""><SPAN>Again, my 2 cents worth.</SPAN></SPAN></P>
<P>&nbsp;</P>
<P><SPAN class=""><SPAN>HTH and good luck.</SPAN></SPAN></P>
Fri, 30 Aug 2019 00:09:13 GMT https://forums.juniper.net/t5/Ethernet-Switching/EX9200-MC-LAG-Failover-Recovery-Times/m-p/467405#M20958 rccpgm 2019-08-30T00:09:13Z