3. Go in to the context table, find two rows with contextlevel = 30 (user). In one set depth to 0, and in the other set path to NULL. Run the 'false' version of the test script again. Make sure it sets the data back to the correct values.

4. Again, edit rows in the user table, but set depth to something silly, like 100, and path to a random string. Run the 'true' version of the test script again. Make sure it sets the data back to the correct values.

1. Make a test script like
<?php
require_once('config.php');
context_helper::build_all_paths(true);
echo 'done';
run it, and verify that there are no errors.
2. Change the 'true' to 'false', and run the script again.
3. Go in to the context table, find two rows with contextlevel = 30 (user). In one set depth to 0, and in the other set path to NULL. Run the 'false' version of the test script again. Make sure it sets the data back to the correct values.
4. Again, edit rows in the user table, but set depth to something silly, like 100, and path to a random string. Run the 'true' version of the test script again. Make sure it sets the data back to the correct values.

Description

Cron and accesslib.php were both changed substantially in Moodle 2.2.

One bit in particular has been causing us problems since we upgraded from 2.1.x this morning. In cron-lib, there are now calls to

context_helper::cleanup_instances(); // Line 149 of lib/cronlib.php
context_helper::build_all_paths(false); // Line 151 of lib/cronlib.php
context_helper::create_instances(); // Line 183 of lib/cronlib.php

At least one of thase calls was bringing down our server, but we are not 100% sure which. We commented out all three lines of code, and now running cron does not bring down the server, and we don't really want to experiment enabling them one at a time, and see which crashes our live systemwink

Petr Skoda
added a comment - 16/Apr/12 11:55 PM How can I reproduce this on my test server? Could you tell us at least how many contexts of each level you have in database and what database are you using?

We think the problem only affects Postgres, and is to do with how Postgres handles concurrency. Basically, in pg, when you alter a row, it actually writes a new row, then when the transaction is committed, marks the old row as deleted, and the new row as live. Later, the deleted rows get cleaned up by a VACUUM.

At the moment, context_user::build_paths is doing an update that will alter every row (even though the data in almost all the rows will not change). We think that this, or rather the auto-vacuum that it triggers soon after wards, is killing our Moodle by locking the the context table for some time.

Therefore, this patch (by Derek Woolhead) changes the query add a where clause so that we only rewrite rows where the data will actually change.

Tim Hunt
added a comment - 17/Apr/12 12:04 AM We think the problem only affects Postgres, and is to do with how Postgres handles concurrency. Basically, in pg, when you alter a row, it actually writes a new row, then when the transaction is committed, marks the old row as deleted, and the new row as live. Later, the deleted rows get cleaned up by a VACUUM.
At the moment, context_user::build_paths is doing an update that will alter every row (even though the data in almost all the rows will not change). We think that this, or rather the auto-vacuum that it triggers soon after wards, is killing our Moodle by locking the the context table for some time.
Therefore, this patch (by Derek Woolhead) changes the query add a where clause so that we only rewrite rows where the data will actually change.

Petr Skoda
added a comment - 17/Apr/12 12:05 AM I think it could be build paths for user context level, I can try adding $force condition similar to other context and test it with a few tens of thousands of users.

I think that the code for most other context levels already has this sort of check built in, or is just not significant. At the OU we have half a million users, a few thousand courses, and about a hundred categories.

Tim Hunt
added a comment - 17/Apr/12 6:38 PM I think that the code for most other context levels already has this sort of check built in, or is just not significant. At the OU we have half a million users, a few thousand courses, and about a hundred categories.

Tim Hunt
added a comment - 18/Apr/12 6:41 PM OK. Patch revised. Please can you look at https://github.com/timhunt/moodle/compare/master...MDL-32462 again.
I note that the corresponding code for courses and modules is much more complex for some reason.

Tim Hunt
added a comment - 18/Apr/12 10:27 PM I tested it to the extent of making a test script that called context_helper::build_all_paths(false); and context_helper::build_all_paths(true); and ensuring there were no fatal errors.

Submitting for integration now. Thanks to everyone who looked at this.

Note that we are not yet running this fix on our live servers. We don't normally change core code locally if we can help it. Instead, we wait until the commit is integrated into the stable branch and then cherry-pick it. However, in this case we are considering patching it to the live servers next week (which means next Tuesday morning UK time) which means that we will know if it solves our cron performance problems, or not, before the weeklies have to be rolled.

Tim Hunt
added a comment - 20/Apr/12 5:40 PM Submitting for integration now. Thanks to everyone who looked at this.
Note that we are not yet running this fix on our live servers. We don't normally change core code locally if we can help it. Instead, we wait until the commit is integrated into the stable branch and then cherry-pick it. However, in this case we are considering patching it to the live servers next week (which means next Tuesday morning UK time) which means that we will know if it solves our cron performance problems, or not, before the weeklies have to be rolled.

Eloy Lafuente (stronk7)
added a comment - 27/Apr/12 11:13 PM This has been near becoming rejected, because it's not the best code you are able to produce.
But, luckily, at the end, it has landed and has been spread to all repos out there.
Many thanks and, don't forget it, keep improving your skills, you can!
Closing, ciao

Derek Woolhead
added a comment - 14/May/12 6:54 PM Eloy, please explain your last comment because as far as I am concerned no thought whatsoever has been given to code quality/performance/scalability with regard to the context changes for Moodle 2.2

Hi Derek, I think Eloy meant it as a joke. You are right that the performance/scalability was not my major concern when coding the accesslib changes, the changes in 2.2 actually fixed bugs, cleaned up the internal implementation and part of it was also introduction of unit tests + inline documentation. This should allow us to work on new features (such as cohort contexts, multitenant if we ever agree to implement it, etc.) and implement new performance improvements (such as memcahe caches, deduplication of caches in cron, etc.). The biggest problem is that majority of core developers do not have access to any meaningful large test datasets, you can not expect us to actually test scalability when we do not have access to test servers with configuration and data similar to real production sites, right? So for now we can only make sure that the code returns correct data, carefully review code and wait till somebody tests it with data from/on production servers and if any problems are reported we try to fix them as soon as possible. But then again we have to rely on somebody else to actually prove the improvements work as expected for large installs.

Petr Skoda
added a comment - 14/May/12 7:10 PM Hi Derek, I think Eloy meant it as a joke. You are right that the performance/scalability was not my major concern when coding the accesslib changes, the changes in 2.2 actually fixed bugs, cleaned up the internal implementation and part of it was also introduction of unit tests + inline documentation. This should allow us to work on new features (such as cohort contexts, multitenant if we ever agree to implement it, etc.) and implement new performance improvements (such as memcahe caches, deduplication of caches in cron, etc.). The biggest problem is that majority of core developers do not have access to any meaningful large test datasets, you can not expect us to actually test scalability when we do not have access to test servers with configuration and data similar to real production sites, right? So for now we can only make sure that the code returns correct data, carefully review code and wait till somebody tests it with data from/on production servers and if any problems are reported we try to fix them as soon as possible. But then again we have to rely on somebody else to actually prove the improvements work as expected for large installs.