Problem/Motivation

menu_rebuild() / Router::rebuild() has additional race conditions which aren't accounted for by the lock.

1. Processes that fail to acquire the lock still have the menu_rebuild_needed variable / state set and no menu masks in memory. If they subsequently call menu_get_item() / any function on the router that checks for a rebuild, that will trigger yet another menu_rebuild() / Router::rebuild().

2. Additionally, there's a window between the variable_initialize() / state() retrieval and menu_get_item() / Router::[public] during which another process may have fully completed a menu_rebuild() / Router::rebuild() and released the lock. In this case we don't need to rebuild the menu, but we do anyway.

#1 is a serious race condition - it can cause menu_rebuild() / Router::rebuild() to occur over and over again: one process acquires the lock, other processes lock_wait() / Lock::wait(), then rebuild, then acquire the lock, then more processes lock_wait() / Lock::wait(), etc. - a proper lock stampede.

#2 is more of an optimization to make the current stampede protection more robust.

Both problems exist in 8.x too, despite the code having changed a lot. They are shown in the code above with the / marking the D8 alternative implementation.
It's really just the naming that has changed.

Proposed resolution

Move the logic away from a lazy-rebuild model to an explicit rebuild model, which means there is no global (cross-request) flag to trigger a rebuild.
Instead, the router is marked for rebuild only in the current process. At the end of the request ($kernel->terminate()) the rebuild is triggered.
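A minimal sketch of that flow, using illustrative class and method names rather than Drupal's actual API:

```php
<?php
// Sketch of the explicit-rebuild model (names are illustrative, not
// Drupal's real API): the "rebuild needed" flag lives only in the current
// process, and the rebuild itself runs once at terminate time.
class SketchRouteBuilder {
  protected bool $rebuildNeeded = FALSE;
  public int $rebuildCount = 0;

  // Called by code that changed routes (module install, view save, ...).
  public function setRebuildNeeded(): void {
    $this->rebuildNeeded = TRUE;
  }

  // Called once from $kernel->terminate().
  public function rebuildIfNeeded(): bool {
    if (!$this->rebuildNeeded) {
      return FALSE;
    }
    $this->rebuildNeeded = FALSE;
    $this->rebuildCount++; // A real implementation would dump routes here.
    return TRUE;
  }
}

$builder = new SketchRouteBuilder();
// Several route-affecting changes during one request...
$builder->setRebuildNeeded();
$builder->setRebuildNeeded();
$builder->setRebuildNeeded();
// ...still only one rebuild at the end of the request.
$builder->rebuildIfNeeded();
printf("rebuilds: %d\n", $builder->rebuildCount); // prints "rebuilds: 1"
```

The point is that however many times code marks the router dirty during the request, the expensive rebuild runs at most once, at terminate time.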

That way we get an important performance improvement (or at least avoid regressions) with this patch.

User interface changes

none.

API changes

The RouteBuilderIndicator class, its interface, and its service are removed.

Original report

Every time menu_rebuild() is called, we get a crazy amount of waits on INSERT INTO menu_router and a bunch of DELETE FROM menu_router. I believe there is some sort of race condition around rebuilding menus. The only way I can kill it is to kill off a bunch of processes. I would imagine that setting a lock for the first user to rebuild that table would be a simple enough fix, but I wanted to open a ticket and see if anyone else had come up with a solution first.

Please help, this is crashing our site: http://www.divx.com/ every time we update a menu or clear the cache.

Comments

menu_rebuild clears the cache. Now our 400+ requests per second all decide they need to rebuild the menu_router table since they couldn't pull the menu from cache.

everybody gets in on the party at line 2199 and starts to rebuild.

at line 2288 they all DELETE FROM {menu_router} simultaneously and now menu_router is empty... sort of.

whoever is first starts to INSERT around line 2370 and now we have DELETE FROM and INSERT INTO happening at the same time.

the end result is 600-700 connections stuck in the db trying to figure out how to serve their request and a bunch of waiting processes on the web servers.

now we stack up the web requests until our load balancer starts dumping them into our sorry server and everybody gets our beautiful site maintenance page.

so my theory is that we need to lock before we rebuild and make sure we don't clear the cache until the initial menu is built from whoever has the lock, so that we can keep serving the stale menu cache until the menu is done rebuilding.

Is this really a dup? It sounds like a bug related to the fix we put in for the cases where the rebuild failed. This does indeed seem pretty critical. We also know that using a Drupal variable as a locking/semaphore mechanism is broken (also subject to race conditions).

FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch d7-menu_rebuild-fix.patch. Unable to apply patch. See the log in the details link for more information. View

I'm reopening this issue because it seems the "duplicate" has gone off track from what this issue was about and this is for Drupal 7 rather than Drupal 6.

I believe the core issue around menu_rebuild is its poor safeguards against rebuilding the menu when it doesn't need to. When menu_rebuild() is called it tries to acquire a lock. If it can't acquire the lock, it waits, assumes the menu has been rebuilt, and returns. It misses completing two extra tasks. One is to set menu_rebuild_needed to FALSE in the local process (using $conf) to prevent menu_rebuild being called again without menu_rebuild_needed explicitly being set correctly. The second is to reload menu_masks, which may have changed during the menu rebuild.

Failure to do either of these two things will force another menu_rebuild attempt on the same process if menu_get_item is called again. This can often be the case for taxonomy menus or rendering the output of l() among other things. It can lead to many attempts in a single process trying to rebuild the menu and each time, it may encounter a lock, wait, then continue. I've seen this utilize usleep for almost a minute in XHProf data. The more traffic, the more this race condition is experienced and the longer it takes for the lock to eventually free up.
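The two missing tasks can be sketched roughly like this, with stub functions standing in for Drupal's real lock and variable APIs (the mask values are made up for illustration):

```php
<?php
// Hedged sketch of the D7 fix described above, with stubbed lock/variable
// handling (not Drupal's real implementations): after a failed acquire we
// must (1) clear menu_rebuild_needed in the local $conf and (2) reload
// menu_masks, or the next menu_get_item() call re-triggers menu_rebuild().
$conf = ['menu_rebuild_needed' => TRUE, 'menu_masks' => NULL];
$rebuild_attempts = 0;

function sketch_menu_rebuild(array &$conf, bool $lock_available, int &$attempts): bool {
  $attempts++;
  if (!$lock_available) {
    // Another process holds the lock; wait for it to finish, then sync our
    // local state with what that process wrote.
    $conf['menu_rebuild_needed'] = FALSE;        // Task 1: clear local flag.
    $conf['menu_masks'] = [0b111, 0b110, 0b100]; // Task 2: reload masks (illustrative values).
    return FALSE;
  }
  // Lock acquired: do the real rebuild (elided in this sketch).
  $conf['menu_rebuild_needed'] = FALSE;
  return TRUE;
}

// Process fails to get the lock once; because the local flag was cleared,
// subsequent router lookups must not trigger further rebuild attempts.
sketch_menu_rebuild($conf, FALSE, $rebuild_attempts);
if ($conf['menu_rebuild_needed']) {
  sketch_menu_rebuild($conf, FALSE, $rebuild_attempts); // The stampede path - never taken now.
}
```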

This patch fixes this issue, though I can imagine there may be some contention around the workaround for menu_masks.

if (!$this->lock->acquire('router_rebuild')) {
  // Wait for another request that is already doing this work.
  // We choose to block here since otherwise the routes might not be
  // available, resulting in a 404.
  $this->lock->wait('router_rebuild');
  return FALSE;
}

If the state value has already been checked during the request, this won't reset it, so another call to rebuildIfNeeded() would attempt a rebuild again.

Menu masks also got ported to 8.x, but that is now an internal implementation detail of the database router, so not sure the best way to reset it. Easiest would be to re-init the state system if we get to this condition.

As Josh anticipated I'm not sure about the workaround for menu_masks in the 7.x patch - what about forcing a variable_init() again instead?

Under certain conditions e.g. using memcache or a MEMORY table for semaphore, the lock can be released _before_ the transaction is actually committed.

So what should happen is:

if (!lock_acquire('menu_rebuild')) {
  lock_wait('menu_rebuild');
  // Reset variables, load from cache.
  variable_reset();
  // Return early.
  return FALSE;
}
// Check whether someone else has already rebuilt the menu; this can happen
// if the request started with old variables loaded, but another process
// completed a menu rebuild in the meantime.
variable_reset();
// If a menu rebuild is no longer needed, bail out.
if (!variable_get('menu_rebuild_needed', FALSE)) {
  lock_release('menu_rebuild');
  return FALSE;
}
// Actually rebuild the menu.
$transaction = db_transaction();
try {
  // ...
}
catch (Exception $e) {
  $transaction->rollback();
  // @todo This is a problem, as now all the waiting processes will get old
  // entries or a 404. However, if menu_rebuild() fails we probably have
  // other problems...
  return FALSE;
}
// In D7 the transaction commits when $transaction goes out of scope.
unset($transaction);
// The lock was acquired outside a transaction, so it should be released
// outside a transaction.
lock_release('menu_rebuild');
return TRUE;

+++ b/includes/menu.inc
@@ -2712,9 +2727,15 @@ function menu_rebuild() {
- return FALSE;
- }
+ if (!_menu_check_rebuild()) {
+ // If we get here and menu_masks was not set, then it is possible a menu
+ // being reloaded, or that the process rebuilding the menu was unable to
+ // complete successfully. A missing menu_masks variable could result in a
+ // 404, so re-run the function.
+ return menu_rebuild();
+ }
+ }

Moved it, also did some renaming and documentation improvements. Still don't like the naming there; I'd rather have a specific method for this on the router builder indicator instead of a protected wrapper. And if possible, avoid talking about cache - that seems like an implementation detail. Thoughts?

+++ b/core/lib/Drupal/Core/Routing/RouteBuilderIndicator.php
@@ -52,4 +59,14 @@ public function setRebuildDone() {
+ // To absolutely ensure that the router rebuild is required, reset the cache
+ // in case they were set by another process.
+ $this->resetCache();

+++ b/core/tests/Drupal/Tests/Core/Routing/RouteBuilderTest.php
@@ -293,6 +297,9 @@ public function testRebuildIfNecessary() {
// This will not trigger a rebuild.
$this->assertFalse($this->routeBuilder->rebuildIfNeeded());
+
+ // This will not trigger a rebuild, because the indicator is not set at all.
+ $this->assertFalse($this->routeBuilder->rebuildIfNeeded());

You are absolutely right. This function is introduced by this patch, so it's not called from anywhere in HEAD.
I can't think of a valid use case; it's a pure implementation detail that state has static caching, so in case someone
hits that edge case, they could reset the static caching of $state.

Which calls into question why we have and call a function whose result we won't believe. At a minimum, I think we need documentation as to why this is necessary. Ideally, we'd be able to make the code more intuitive.

I also think the newly introduced recursion in rebuild() could be a bad implementation; if there are many processes, the recursion could really blow up the call stack with nested calls. If a process is never able to get the lock (possible at least in theory), it would keep recursing forever until the call stack runs out of memory. I'm used to seeing this kind of locking implemented using a while-loop to avoid the recursion.
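For illustration, a bounded retry loop along those lines might look like this (stubbed callables, not Drupal's actual Lock API):

```php
<?php
// Illustrative while/for-loop variant of the locking pattern: bounded
// retries instead of unbounded recursion, so a process that never wins the
// lock cannot blow up the call stack.
function sketch_rebuild_with_retries(callable $try_acquire, callable $wait, callable $do_rebuild, int $max_attempts = 5): bool {
  for ($attempt = 0; $attempt < $max_attempts; $attempt++) {
    if ($try_acquire()) {
      $do_rebuild();
      return TRUE;
    }
    // Someone else is rebuilding; wait, then re-check whether their rebuild
    // already made ours unnecessary (that check is elided in this sketch).
    $wait();
  }
  return FALSE; // Give up instead of recursing forever.
}

$acquired = sketch_rebuild_with_retries(
  fn() => TRUE, // Stub: lock acquired on the first try.
  fn() => NULL, // Stub: no-op wait.
  function () { /* rebuild elided */ },
);
```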

@berdir, @xjm and @dawehner discussed this issue and how to document the behaviour of what is going on.
The problems mentioned in the last part of your review won't appear, given that it's much easier to break things with too many MySQL connections, for example.

- the state flag is written to outside the lock, so the process rebuilding the router info can unset the flag just after another process sets it, with undefined results for an undefined period
- the first thing we should try to do is make the rebuild less expensive. at a first approximation, caching the .yml info seems like a fairly obvious candidate
- we're still taking the likely most frequent case (readers) and mashing them against the lock system. seems like it would be better to make the writers fight for the lock, because they are (99% of the time) a) less frequent, and b) more expensive requests

Switching rebuilds to happen on write (and keeping the lock protection) seems like we should definitely try to do it for 8.x

I'm not sure it'll be possible to backport that to 7.x - variable_set('menu_rebuild_needed') is part of the API. However we have #44 running successfully in production on the site where we ran into this, and it's fine if we end up with two different solutions in two different versions.

here's a rough patch that caches the yml stuff in getRouteDefinitions(), and makes the module and theme systems call ->rebuild() instead of setting a flag.

on my VM, i see RouteBuilder::getRouteDefinitions() go from ~70ms to ~15ms. would love to see someone else run this and insert some simple microtime measurements in getRouteDefinitions() and report back.

should i open a separate issue for caching the .yml stuff separate from the other changes?

@beejeebus
Note: it's not only the parsing of the YML files; we also have to dump the entries into the database and rebuild parts of the menu trees, as they become potentially invalid...
We certainly need numbers to be able to judge something here.

Here is an example run, which mostly consisted of a router rebuild. Note: this is a crappy, slow machine in general.

Out of those, the event dispatcher dominates (... this is mostly spent rebuilding the menu links and rebuilding the routes of views).
There is also quite some cost in dumping the routes themselves (Drupal\Core\Routing\MatcherDumper::dump).

ok, this patch rips out all usage of rebuildIfNeeded, and replaces those calls with rebuild. mostly. there seem to be places where we do a dance around rebuildIfNeeded() where we can just remove it. i hope. see the changes in RouteProvider::getRouteCollectionForRequest().

looks like RouterRebuildSubscriber can just die, but i left it in for now.

I did a quick bit of archaeology on the menu_rebuild_needed variable, this race condition was rediscovered by Fabianx and me on a live Drupal 7 site recently so the Drupal 7 issue here is extremely real - it's not introduced by the 8.x listener. I've also seen the symptoms of it before in NewRelic without having actually tracked it down on other sites (since it always eventually self-repairs after a few minutes and various contrib modules rebuild menus sometimes).

The variable was originally added in #202955: Access denied after install - menu_rebuild() calls - this was to allow a menu rebuild to be triggered when in 'maintenance mode' (where you might not have a complete module list) so that it ran outside of maintenance mode. Not an optimization then, just a workaround.

Then field UI started to use it to optimize test runs since menu_rebuild() was getting triggered multiple times per request.

So I think it'd be a good change to switch to setting a flag (in memory only, not state) that triggers a menu rebuild at the end of the request, and that will satisfy not wanting to rebuild more than once per request. Where the rebuild has to happen mid-request, we can just trigger it directly.

For Drupal 7 I think we'll have to leave the variable in, and probably use the 7.x version of this patch, but perhaps we could also set $conf['menu_rebuild_needed'] and add something in drupal_exit() that looks for it, in the many cases where core uses this, to try to reduce the chances of a read stampede there as well.

128.1 - should we add a note to that issue? or open a separate follow up?

128.2 - discussed this with dawehner in IRC. i'd like to keep the per-module clearing, because it will lead to better hit rates. however, i agree with him that it's kinda ugly - better implementation ideas most welcome.

128.3 - yes, but this is "normal". this patch brings the router builder cache back in line with most others - there's a window where a reader can get the "old" router information, just like every system all throughout Drupal with a cache. the "next" request, however, will get the new data. i don't think there's anything we need to do here - i think that's a better approach than blocking all requests and hammering the lock system.

128.4 - i like that suggestion, will put that in the next reroll.

128.5 - maybe? i'm not sure, to be honest i was just banging away at the frankentest stuff to make it go green.

The earlier version of the patch is still right for 7.x - we still have to support menu_rebuild_needed there whatever happens.
I'm fine with the end-of-request write in Drupal 8 though, definitely simplifies things. YAML caching definitely should not happen here though.

If necessary we could fork the issue between 7 and 8 since they'll be completely different patches.

I'd be tempted to split the issue - move this to 7.x (and note it's applied and working on the 7.x client site where we rediscovered this) then open another issue for beejeebus' approach for 8.x (still critical).

If it turns out the end of request rebuild doesn't work we've still got the 8.x patch from here, but looking through the commit history for the current situation we got here almost entirely by accident so stepping back from that seems good.

I recently got rid of a forced router rebuild after every module in my install profile (because there was an old call to it on default_content, it resulted in a 30% performance improvement: https://github.com/larowlan/default_content/pull/32#issuecomment-73381037). And that was *with* my ApcFileCache, that caches parsed yaml files. The expensive part is dynamic stuff like field_ui and views.

Yes, this only rebuilds at the end of the request if I understand correctly, but if you use the UI installer, that's still a lot of rebuilds.

That we need this in so many kernel tests is annoying, what exactly is triggering this?

This particular example is triggered by the rebuild in ThemeHandler

Shouldn't a views save do this automatically if needed?

Well, in HEAD we call setRebuildNeeded() in View::postSave() => views_invalidate_cache() ...
The problem for these tests is probably the one which led to the architecture we have now. #2164367: Rebuild router as few times as possible per request
So for example in that case we save the view, which does not trigger the rebuild. In a normal request we potentially end up in a race condition where the terminate has not happened yet, but another query is already executed in the test process.

in field_ui_entity_form_mode_presave and field_ui_entity_view_mode_presave? I changed it there cause it kept the FieldUIRouteTest from blowing up. I have no idea whether it is or is not the correct thing to do. I'm anxiously awaiting any kind of feedback, cause I know I have tons to learn about the whole system.
I was surprised that the change the other way in ThemeHandler didn't blow anything up.

- We have 'eventual consistency': requests coming in while the rebuilding request is running see the old router.
- If there is no route stored at all, the routes are built just once by whoever gets the lock - so no build stampede.
- We have almost direct consistency when ->rebuild() is called directly: requests see the new routes once the rebuild is finished.

talking over this with @dawehner in irc, we think the rebuildIfNeeded function on RouteBuilder can go away.

After talking, got to thinking:
It's not used anywhere in all of core, and with this internal change it really doesn't make sense to have the API oriented in such a way that client code is concerned at all with whether or not the routing table needs to be rebuilt RIGHT NOW. Client code should just notify that it made a change that requires a rebuild. When that rebuild occurs, clients don't care.

The form submit rebuild for the field_ui test it's probably just best to eat right now. I'm much more concerned with the results of Drupal\Tests\Core\Routing\RouteBuilderTest and Drupal\system\Tests\Routing\RouteProviderTest. Since this whole change is for a routing issue, those really should (if TDD really is the bee's knees) go to the heart of the issue.

I was actually looking at rebuild() earlier when banging my head against it, and it doesn't check for whether a rebuild is needed or not, it just does it, afaict. So a rebuildNow isn't necessary, client code can just call rebuild().
I may be (probably am) wrong, but I think I understand a little better now than when I started. The idea is two-fold.
Step 1 is to lower the number of times the routing table is rebuilt on each request, for a performance improvement. That lowers the probability of a race condition, but does not altogether eliminate it, which is why we need:
Step 2, fix the possibility of a race condition by some kind of locking mechanism.

Step 2 is where I get kinda fuzzy and have to consult Wikipedia. Being a WP dev, we don't concern ourselves with such things.
"A race condition occurs when two or more threads can access shared data and they try to change it at the same time"
So each request-response lives in its own thread, right? So are we talking about two apache (or nginx, whatevs) threads, each bootstrapping and executing, and what exactly is the shared data? The routing table stored in the DB, object caching? Or am I being too literal and detail oriented, and need to just let it go to the 'cache' or some other such abstraction?

I recently got rid of a forced router rebuild after every module in my install profile (because there was an old call to it on default_content, it resulted in a 30% performance improvement: https://github.com/larowlan/default_content/pull/32#issuecomment-73381037). And that was *with* my ApcFileCache, that caches parsed yaml files. The expensive part is dynamic stuff like field_ui and views.

Does that mean modules which run stuff that could potentially need a new router, like MenuLinkContent::preSave(), should call RouteBuilder::rebuildIfNeeded() as well as default_content?
With that we would move the laziness from the route provider into the calling code, which is arguably bad, but at least inserts don't happen often.

If we use the ApcFileCache for the parsing then we can drop it from here as suggested by @catch and don't need manual cache clears.

Just to be clear, the current patch indeed gets rid of the cache again and instead relies on the cache of YamlDiscovery

I'm not sure why some places call setRebuildNeeded() and most now call rebuild().

Well, places which previously called ->setRebuildNeeded() should still be able to call setRebuildNeeded() if we apply the logic mentioned above. It's tricky, but I think it's okay-ish to apply additional forced rebuilds in tests.

That we need this in so many kernel tests is annoying, what exactly is triggering this?

I don't think this is possible because of the installer calling router->setRebuild() several times - but then the wrong module directories would be 'cached' in the YamlDiscovery object.

What do you mean by possible? It happens at least for the UI installer already, because for every subsequent batch request in HEAD the router is marked, and so we do a rebuild. Note: this helper method exists already in HEAD and was only introduced for better testability.
In real life a module install triggers a container rebuild, which triggers a new instance of the router builder anyway, so AFAIK you can never have wrong data as part of the YamlDiscovery object.

@dawehner: Hm, so that looks bad. What exactly did you test? The whole install process? with drush si?

Yeah, those are basic drush si calls ... with standard ... but you have to keep in mind that by doing it explicitly, the other requests hitting the site won't have to deal with it any longer, which makes it easier for sites to recover.

@berdir suggested a different idea to test this.
Constantly request Drupal with for example ab.
While doing that enable a module or save a view, and keep looking at the load.
On HEAD, the lock should block those requests, with the patch, the rebuilding should be done explicitly, and the other requests just hit the previous router.

Experimenting a bit here with only setting the rebuild flag during install, let's see what goes wrong.

Ideally I think the logic could be:

- Don't rebuild by default.
- Rebuild in case any hook_install()/hook_modules_installed() etc. related code needs routing information.

But I think doing that right is tricky.

OK, so as dawehner said, I implemented a behavior change with this patch: routes are not automatically rebuilt during the module install process.

That way we have two important performance improvements (or at least avoiding regressions) with this patch:
* No more lock stampede where multiple requests try to rebuild routes.
* No multiple route rebuilding during module installation

Disadvantages of this patch:
* Edge case: modules that rely on updated route information in hook_modules_installed() will have to call \Drupal::service('router.builder')->rebuildIfNeeded(); themselves.
* Some test cases need to call \Drupal::service('router.builder')->rebuild() themselves to make sure that the route information is up to date before they make requests.
* "drush en mymodule" is broken now, because drush will also have to do a \Drupal::service('router.builder')->rebuild(); ==> but this can easily be fixed in drush.
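For the first edge case above, an affected module would do something like the following - sketched here with a stub class in place of the real 'router.builder' service so the example stands alone:

```php
<?php
// Hedged sketch of the hook_modules_installed() edge case: with this patch
// a module that needs fresh route information during install must trigger
// the rebuild itself. SketchRouterBuilderService stands in for the real
// service that \Drupal::service('router.builder') would return.
class SketchRouterBuilderService {
  public bool $rebuilt = FALSE;
  public function rebuildIfNeeded(): bool {
    $this->rebuilt = TRUE; // The real service checks the flag, then dumps routes.
    return TRUE;
  }
}

$router_builder = new SketchRouterBuilderService();

// In Drupal this body would live in mymodule_modules_installed() (the
// function name "mymodule" is hypothetical).
function sketch_mymodule_modules_installed(array $modules, SketchRouterBuilderService $router_builder): void {
  // This code needs up-to-date routes, so it can no longer rely on an
  // automatic rebuild happening during module installation.
  $router_builder->rebuildIfNeeded();
}

sketch_mymodule_modules_installed(['mymodule'], $router_builder);
```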

HEAD, without a view save: ~6 seconds taken for the tests
HEAD, with view saving: ~9 seconds taken for the tests
PATCH, without a view save: ~6 seconds taken for the tests
PATCH, with view save: ~7 seconds taken for the tests.

Interesting is probably the standard deviation for HEAD with view save and PATCH with view save, as it shows how slow the individual requests are:

Regarding #2352641: Break router.builder dependency: that issue was about a service dependency circle between route builder and route provider - this patch changes route provider to not depend on route builder at all. We just forgot to remove that dependency, I did that now with the latest patch, yay!