It seems as if this bug surfaces due to load issues. While the fix provided by Venkata in https://bugs.launchpad.net/neutron/+bug/1731595 (https://review.openstack.org/#/c/522641/) should help clean things up at the time of l3-agent restart, issues seem to come back later in some circumstances. xavpaice mentioned he saw individual routers active on multiple hosts at the same time when they had 464 routers configured on 3 neutron gateway hosts using L3HA, with each router scheduled to all 3 hosts. However, jhebden mentions that things seem stable at the 400 L3HA router mark, and it's worth noting this is the same deployment xavpaice was referring to.

keepalived has a patch upstream in 1.4.0 that provides a fix for removing left-over addresses if keepalived aborts. That patch will be cherry-picked to Ubuntu keepalived packages.

In order to avoid regression of existing consumers, the OpenStack team will run their continuous integration tests against the packages that are in -proposed. A successful run of all available tests will be required before the proposed packages can be let into -updates.

The OpenStack team will be in charge of attaching the output summary of the executed tests. The OpenStack team members will not mark ‘verification-done’ until this has happened.

[Regression Potential]
The regression potential is low, as the fix is cherry-picked from upstream without modification. In order to mitigate it further, the results of the aforementioned tests are attached to this bug.

1. First of all, we need to figure out why, in theory, multiple ACTIVE master HA nodes can appear.

Assume the master dies (at this point its status in the DB is still ACTIVE), and a slave is then selected as the new master. After the old master has recovered, enable_keepalived() (L444) [4] will be invoked to spawn a keepalived instance, so multiple ACTIVE master HA nodes occur. (Related patch: https://review.openstack.org/#/c/357458/)

So the key to solving this problem is to reset the status of all HA ports to DOWN at a suitable point in the code; the patch https://review.openstack.org/#/c/470905/ addresses exactly this. However, that patch sets status=DOWN in the code path 'fetch_and_sync_all_routers -> get_router_ids', which leads to a bigger problem when the load is high.
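For illustration only, here is a minimal, self-contained sketch of the idea behind that reset; the function and the in-memory port list are hypothetical stand-ins, not the actual neutron code, which operates on the database:

    # Hedged sketch: mark the HA (VRRP) ports of a restarting host DOWN so a
    # stale ACTIVE row cannot survive the restart.
    HA_DEVICE_OWNER = 'network:router_ha_interface'

    def reset_ha_port_statuses(ports, host):
        """Set every HA port bound to the restarting host to DOWN.

        The ports only return to ACTIVE after the l2-agent has re-wired
        them, and keepalived is only spawned once its HA port is ACTIVE,
        so a freshly restarted node cannot immediately claim master while
        a peer still holds the role.
        """
        for port in ports:
            if (port['device_owner'] == HA_DEVICE_OWNER
                    and port['host'] == host):
                port['status'] = 'DOWN'

    # Example: two HA ports, one on the gateway that just restarted.
    ports = [
        {'id': 'p1', 'device_owner': HA_DEVICE_OWNER, 'host': 'gw1', 'status': 'ACTIVE'},
        {'id': 'p2', 'device_owner': HA_DEVICE_OWNER, 'host': 'gw2', 'status': 'ACTIVE'},
    ]
    reset_ha_port_statuses(ports, host='gw1')
    assert [p['status'] for p in ports] == ['DOWN', 'ACTIVE']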

2. Why does setting status=DOWN in the code path 'fetch_and_sync_all_routers -> get_router_ids' lead to a bigger problem when the load is high?

If the l3-agent is seen as not alive by the heartbeat check, it will be marked AGENT_REVIVED [1], and the l3-agent will then be triggered to do a full sync (self.fullsync=True) [2], so the code path 'periodic_sync_routers_task -> fetch_and_sync_all_routers' will be called again and again [3].

All these operations aggravate the load on the l2-agent, the DB, the MQ, etc. Conversely, high load also aggravates the AGENT_REVIVED case.

So it's a vicious circle. The patch https://review.openstack.org/#/c/522792/ addresses this point: it performs the reset in the code path '__init__ -> get_service_plugin_list -> _update_ha_network_port_status' instead of 'periodic_sync_routers_task -> fetch_and_sync_all_routers'.
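To make the difference concrete, here is a hedged toy sketch contrasting the two placements; the counters stand in for the real DB/RPC work, and the method names merely mirror the ones quoted above rather than reproducing the neutron implementation:

    class FakeServer:
        def __init__(self):
            self.ha_port_resets = 0

        def get_service_plugin_list(self):
            # New placement (patch 522792): the reset runs once, when the
            # l3-agent process initialises, regardless of later heartbeat
            # trouble.
            self.ha_port_resets += 1

        def fetch_and_sync_all_routers(self):
            # Old placement (patch 470905): the reset runs on *every*
            # periodic full sync, and AGENT_REVIVED keeps forcing full
            # syncs when the system is already overloaded.
            self.ha_port_resets += 1

    old, new = FakeServer(), FakeServer()
    new.get_service_plugin_list()            # once, at agent start-up
    for _ in range(5):                       # five AGENT_REVIVED-driven full syncs
        old.fetch_and_sync_all_routers()
    print(old.ha_port_resets, new.ha_port_resets)   # 5 vs 1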

3. As we have seen, a small heartbeat value can cause AGENT_REVIVED and thus aggravate the load, and high load can cause other problems, such as the phenomena Xav mentioned before, which I paste below as well:

- We later found that openvswitch had run out of filehandles, see LP: #1737866
- Resolving that allowed ovs to create a ton more filehandles.

This is just one example; there may be other circumstances. All of these can lead us to mistakenly believe the fix doesn't fix the problem.

High load can also cause other, similar problems. For example:

a. It can cause the neutron-keepalived-state-change process to exit due to a TERM signal [5] (https://paste.ubuntu.com/26450042/). neutron-keepalived-state-change is used to monitor VRRP VIP changes and then report the HA router's status to neutron-server [6], so without it the l3-agent is unable to update the status of the HA ports, and we can end up with multiple ACTIVE instances, multiple STANDBY instances, or other inconsistent states (see the sketch after this list).

b. It can cause the RPC message sent from here [6] to not be handled properly.
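As an illustration of point (a), here is a hedged, self-contained sketch of what the state-change monitor does conceptually; the VIP, interface name, and sample event lines are made up for the example, and the real daemon drives this from a live 'ip -o monitor address' stream inside the router namespace:

    VIP = '169.254.0.1'    # hypothetical VRRP VIP watched on the ha- interface

    def state_from_event(line, current_state):
        """Derive the router state from one address-change event line."""
        if VIP in line:
            # VIP removed -> this node lost master; VIP added -> it won.
            return 'backup' if line.startswith('Deleted') else 'master'
        return current_state

    events = [
        '2: ha-6f9f3e1c    inet 169.254.0.1/24 scope global ha-6f9f3e1c',
        'Deleted 2: ha-6f9f3e1c    inet 169.254.0.1/24 scope global ha-6f9f3e1c',
    ]
    state = 'backup'
    for line in events:
        state = state_from_event(line, state)
        # In the real daemon this is the point where the new state is sent
        # back to the l3-agent; if the process was killed by SIGTERM, the
        # transition is never reported and the DB keeps the stale status.
    print(state)    # 'backup' once the VIP disappears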

FYI: Resubscribing field SLA. It was not raised as critical by Kiko R.; I raised it and it's still an active ongoing problem on a customer site. Please do not unsubscribe again without discussion with the correct people.

Just as another thing to consider: the deployment where this is happening also experienced bug 1749425, which resulted in packet loss. The network between network/gateway units is also built via OVS, so if OVS was dropping packets due to the large number of missing tap devices, it's possible this was also impacting connectivity between the keepalived instances for HA routers, resulting in active/active nastiness.

@cgregan I think this is really a situation where the desired scale/density is at odds with the fundamental design of neutron HA routers. It's not something to address in the charms or in the packaging. As such, I'd consider this a feature request against upstream Neutron, and I don't have an assignee for that.

Issue #718 reported that if keepalived terminates abnormally when
it has vrrp instances in master state, it doesn't remove the
left-over VIPs and eVIPs when it restarts. This is despite
commit f4c10426c saying that it resolved this problem.

It turns out that commit f4c10426c did not resolve the problem for VIPs
or eVIPs, although it did resolve the issue for iptables and ipset
configuration.

This commit now really resolves the problem, and residual VIPs and
eVIPs are removed at startup.

If keepalived terminates unexpectedly, for any instances for which
it was master, it leaves ip addresses configured on the interfaces.
When keepalived restarts, if it starts in backup mode, the addresses
must be removed. In addition, any iptables/ipsets entries added for
!accept_mode must also be removed, in order to avoid multiple entries
being created in iptables.

This commit removes any addresses and iptables/ipsets configuration
for any interfaces that exist when keepalived starts up. If keepalived
shut down cleanly, that will only be for non-vmac interfaces, but if
it terminated unexpectedly, it can also be for any left-over vmacs.
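To see the symptom the commit fixes on a running gateway, one can simply list the addresses left configured in a router's namespace on every host. A rough sketch, assuming the 'ip' tool is available and using a placeholder router UUID:

    import subprocess

    ROUTER_ID = '<router-uuid>'          # placeholder, substitute a real HA router
    NAMESPACE = 'qrouter-' + ROUTER_ID

    def configured_addresses(namespace):
        """Return the IPv4 addresses configured inside a router namespace."""
        out = subprocess.run(
            ['ip', 'netns', 'exec', namespace, 'ip', '-4', '-o', 'addr', 'show'],
            capture_output=True, text=True, check=True).stdout
        return [line.split()[3] for line in out.splitlines()]

    # With an un-fixed keepalived that aborted while master, the VIP/eVIP
    # addresses still show up here on more than one gateway at once; with
    # the fix they are cleaned up when keepalived restarts in backup state.
    print(configured_addresses(NAMESPACE))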

I've uploaded new versions of keepalived to cosmic, bionic (awaiting SRU team review), pike-staging, and ocata-staging. I need to confirm with other cloud archive admins that this is OK to backport to the pike/ocata cloud archives prior to promoting to pike-proposed/ocata-proposed. In the meantime, if you'd like to test the ocata fix (where this was initially reported) you can install from the staging PPA:

* d/p/fix-removing-left-over-addresses-if-keepalived-abort.patch:
Cherry-picked from upstream to ensure left-over VIPs and eVIPs are
properly removed on restart if keepalived terminates abnormally. This
fix is from the upstream 1.4.0 release (LP: #1744062).

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Ok I have now completed testing the bionic-proposed keepalived package with OpenStack Queens and am happy that it resolves the problem, ensuring that keepalived will tear down routes, VIPs, eVIPs etc. when it comes back up and transitions from master to backup. My test consisted of deploying Queens with 3 gateways, creating 100 users/projects each with 1 router, creating some instances with floating IPs, then forcibly killing both the keepalived and neutron-keepalived-state-change processes associated with a particular router for which I have an instance with a FIP. I then observed that the qrouter ns interfaces for that router were definitely unconfigured and the VRRP transition happened as expected. This is in contrast to e.g. keepalived 1:1.2.19-1ubuntu0.2 available with all Xenial releases of OpenStack, for which I consistently see the qrouter interfaces remain configured on > 1 gateway.
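For anyone repeating this verification, here is a rough sketch of the destructive step (killing both processes of one router). Matching on the router UUID in the process command line is an assumption about how the processes are launched, so confirm with ps on your own deployment first:

    import subprocess

    ROUTER_ID = '<router-uuid>'     # placeholder for the router under test

    def kill_router_ha_processes(router_id):
        # Both keepalived and neutron-keepalived-state-change normally carry
        # the router UUID in their arguments (config path / --router_id), so
        # a full-command-line match takes out both at once.
        subprocess.run(['pkill', '-9', '-f', router_id], check=False)

    kill_router_ha_processes(ROUTER_ID)
    # Afterwards, watch 'ip netns exec qrouter-<uuid> ip addr' on each
    # gateway: with the fixed package the VIPs end up configured on only
    # the new master once the agent respawns the processes.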

For completeness (although it has no bearing on the keepalived fix) I also still see the other issue on bionic, whereby neutron lists the router as active on > 1 host, e.g.

The reason for this is simple, and the good news is that with the fixed keepalived it is also benign. Neutron detects state changes by running ip monitor on the qrouter interfaces, and since my test involved killing both neutron-keepalived-state-change (which runs ip monitor) and keepalived, the VRRP transition appears to have happened before neutron had ip monitor running again. Looking at the l3-agent logs I see:

i.e. neutron starts keepalived BEFORE keepalived-state-change, so if the transition and teardown happen before the latter comes up and launches ip monitor, it never sees the changes and has nothing to report to neutron.
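A hedged toy timeline of that ordering, just to make the race explicit (this is not the l3-agent code):

    events_before_monitor = []

    def spawn_keepalived():
        # keepalived starts, loses the election, and immediately tears the
        # VIP down -- this is the only 'address deleted' event there will be.
        events_before_monitor.append('vip-removed')

    def spawn_state_change_monitor():
        # ip -o monitor only reports events that happen *after* it starts,
        # so anything that already happened is invisible to it.
        return []            # events observed from this point onwards

    # Current l3-agent ordering: keepalived first, monitor second.
    spawn_keepalived()
    seen = spawn_state_change_monitor()
    print(seen)   # [] -> the backup transition is never reported, so the
                  # DB keeps listing the router as ACTIVE on this host.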

The verification of the Stable Release Update for keepalived has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

* d/p/fix-removing-left-over-addresses-if-keepalived-abort.patch:
Cherry-picked from upstream to ensure left-over VIPs and eVIPs are
properly removed on restart if keepalived terminates abnormally. This
fix is from the upstream 1.4.0 release (LP: #1744062).

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

I see the same behavior: when the l3-agent respawns keepalived it does so prior to spawning the state change agent (a.k.a. ip -o monitor), which results in (a) keepalived transitioning to backup, (b) keepalived deconfiguring the VIP, eVIP etc., and (c) neutron never seeing the deconfiguration since it happens prior to ip -o monitor being run (by neutron-keepalived-state-change). So (a) and (b) confirm this patch is good, and (c) needs fixing in neutron.