Description

Since updating from 1.9.4+ to 1.9.5+ (version: May 27th 2009), we have observed, under heavy traffic, PHP crashes while executing "cache_context($context);" (see logs below), leading to Apache instances that die continuously right after being spawned. As a result, pages are no longer fully served, logged-in users can no longer perform any course- or context-related actions, and Moodle is completely messed up. A graceful Apache restart is sufficient to return to a defined operation level.

To Reproduce:
On our development server I tried hard to reproduce such a state, so the following is a somewhat "unrealistic" use case:
0. In php.ini, set "max_execution_time" to some uselessly short period of time, e.g. 10 seconds.
1. Set the My Moodle redirect after login as default ( admin/settings.php?section=mymoodle ) // mymoodleredirect
2. Set the maximum number of courses for the My Moodle site to something huge, e.g. 300. // mycoursesperpage
3. Pick a user who has many, many courses, e.g. 200 (or create one, simply by assigning a user, e.g. as teacher, within a category containing 200+ courses).
4. Log in as this particular user, "start using Moodle", and maybe have two or three more users do the same.
5. Watch your Apache and PHP logs.

Observation:
lib/accesslib.php is terminated because the maximum execution time is exceeded, and voilà, here's the mess.
In the best case, logged-in users accessing their My Moodle site will see "No course information to show".

Petr Skoda
added a comment - 27/Sep/09 3:51 AM Lowering the severity; unfortunately our current roles-related code uses a lot of memory. But in any case your eAccelerator should not be crashing like this - maybe there is a problem in it.


Jens Eremie
added a comment - 17/Oct/09 12:03 AM

Part I:
--------
In lib/accesslib.php the "caching" of the context table (cache_context()) is done by simply using an array with a limit of 5000 tuples ('MAX_CONTEXT_CACHE_SIZE').
When this limit is reached and more tuples need to be cached, the size of this array is calculated and a single tuple is deleted:

while (count($context_cache_id) >= MAX_CONTEXT_CACHE_SIZE) {
    $first = reset($context_cache_id);
    unset($context_cache_id[$first->id]);
    unset($context_cache[$first->contextlevel][$first->instanceid]);
}

The observation here is: when 5000 isn't enough, it is very likely that there is much more to cache (6000, 6500, ..., 7000), and in that case calculating the array size and deleting a single tuple 1000, 2000, ..., 3000 times over is far, far too expensive.

Solution:
1. Get rid of "count($context_cache_id)" by simply keeping a counter.
2. Remove bigger chunks of the array (e.g. 500 tuples at once), which is much faster, since we don't even know which tuples will be read first/next.

Implications:
1. Never allow direct write access to the cache array, otherwise the cache count will be incorrect.
2. When tuples are removed from the cache array, also remove them from $preloadedcourses.
3. When the cache array is unset, also unset $preloadedcourses.

Observation:
MAX_CONTEXT_CACHE_SIZE could be reduced to about 2500 (+/- 500), because above 2500 there is no performance gain and the cache is flushed several times anyway.

Part II:
---------
In course/lib.php the course modules "activity" is cached ( get_fast_modinfo() ) in the same way, and the array is limited via MAX_MODINFO_CACHE_SIZE=10.
get_fast_modinfo() is called, e.g., when an admin assigns some particular user as teacher within a category - once for every course within that category - or on every visit of the My Moodle site, once for every course he/she is subscribed to.

Observation:
The tuples enter the cache in order and are read back in the same order, and the first (oldest) tuple is the one removed from the cache once MAX_MODINFO_CACHE_SIZE=10 is reached. So with 10+ tuples the cache never holds the desired one - every access is a cache miss - and every tuple will be read at least 4 times, if present.

Solution:
1. Get rid of "count(...)" by simply keeping a counter.
2. Raise MAX_MODINFO_CACHE_SIZE to at least 20. // I chose 250, which leads Apache to approx. 100 MB, which should be OK for most setups.
3. Remove random tuples that have had no cache hit so far.

====
The performance gain of these two little changes is drastic!
E.g., My Moodle pages with more than 200 courses, which took 70+ seconds to be served, are now served within 2.5 seconds.
Please test & check!


Wen Hao Chuang
added a comment - 22/Oct/09 6:58 AM Hi, correct me if I'm wrong: I got the impression that this only happens when you upgrade from 1.9.4 to 1.9.5. If you start with a "fresh" installation (of 1.9.5 or 1.9.6), then you won't have this issue. Is this correct? Thanks!


Jens Eremie
added a comment - 22/Oct/09 4:19 PM I'm sure this problem is independent of upgrade vs. vanilla install. In fact you need enough data - courses, users - producing a sufficient number of contexts to "fill" the context cache, and this is probably only the case when updating an existing system.


David Mudrak
added a comment - 24/Mar/10 1:36 AM Jens, you are probably right. I am able to reproduce this PHP behaviour. It looks like count() is actually much slower than one might expect. See the attached demonstration PHP file. There are two functions: cache_context() with our current implementation, and a second one, cache_context2(), which just uses an external global counter to achieve the same functionality.
Anybody interested, please compare the script running time on your machines. On my PHP 5.2.12-pl0-gentoo it is something like 7 secs versus 0.05 sec.

Sam Marshall
added a comment - 24/Mar/10 1:42 AM Confirmed here (assuming the test script is correct; I just ran it, didn't actually look at it):
4.8820159435272
0.051156044006348
Not sure why we didn't reproduce this before. This is probably a duplicate of MDL-18768?


Eloy Lafuente (stronk7)
added a comment - 24/Mar/10 3:29 AM Well,
I'm not sure the test is a correct (real) one. As I commented in MDL-18768, before any change in this area we need to revise all uses of the cache, because sometimes we are accessing/modifying it directly, and that would break any static counter in the function (getting out of sync easily).
So, also as commented in that bug, all we can do is move the count() out of the loop. That will save time when we are deleting multiple entries, but won't help much when deleting on a 1-by-1 basis.
In my tests:
1) current: 6.02
2) 100% variable alternative (global/static): 0.15
3) count moved out of the loop: 3.8
I think we can only achieve 3, as commented.
Ciao
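The three eviction variants timed above can be sketched like this (a Python stand-in for illustration; the timing figures above came from the attached PHP demo, and these functions only model the structure, not the PHP performance):

```python
MAX_SIZE = 5000  # mirrors MAX_CONTEXT_CACHE_SIZE

def evict_current(cache):
    # Variant 1 (current code): re-count the array on every loop
    # iteration and delete a single oldest entry each time.
    while len(cache) >= MAX_SIZE:
        oldest = next(iter(cache))
        del cache[oldest]

def evict_counted(cache, counter):
    # Variant 2 (global/static counter): no counting call at all; the
    # caller maintains `counter` alongside the cache. Returns the
    # updated counter.
    while counter >= MAX_SIZE:
        oldest = next(iter(cache))
        del cache[oldest]
        counter -= 1
    return counter

def evict_hoisted(cache):
    # Variant 3 (count moved out of the loop): count once, then delete
    # the known number of surplus entries.
    surplus = len(cache) - MAX_SIZE + 1
    for oldest in list(cache)[:max(surplus, 0)]:
        del cache[oldest]
```

All three leave the cache one entry below the limit so a new entry can be added; they differ only in how often the size is (re)computed, which is the concern raised in points 1-3 above.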

Mike Churchward
added a comment - 24/Mar/10 4:04 AM - edited This really isn't a fair test of count() though... it is being called each time through the loop on an array that has been decreased by one element.
nm... didn't read it closely enough...


David Mudrak
added a comment - 24/Mar/10 7:44 AM Well, IMHO the test _is_ pretty fair, as it demonstrates how we really use the function at the moment - see

while ($context = rs_fetch_next_record($rs)) {
    cache_context($context);
}

in lib/accesslib.php, and the function definition there as well - I basically copied it into the demo script.
What I do not think is that we can generalize this problem too much. I just wrote some simple loops with count() involved in the comparison, and the figures are not that bad. I think we are just facing a malicious combination of several factors in this particular case: there are two arrays linking to the same objects, the array is reset too often, and the arrays' first elements are unset too often...
What if we just wrapped the cache in a class with an internal static counter, and all cache manipulation were done via public methods of this class only? There would be no direct access to the array, so the counter couldn't get out of sync, and we would never call count(). If my greps are correct, we do not use the global array anywhere but in lib/accesslib.php, so it should be quite easy refactoring fun...
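A minimal sketch of that encapsulation idea (in Python for illustration; the class and method names here are made up, not the names later used in Moodle, and the limits are the values discussed earlier in this thread):

```python
class ContextCache:
    """Cache with a private store and an internally maintained counter.

    All mutation goes through add()/remove()/reset(), so the counter can
    never drift out of sync with the store, and no count()/len() call is
    needed on the hot path.
    """

    MAX_SIZE = 2500   # assumption: the reduced limit discussed above
    REDUCE_BY = 500   # evict this many oldest entries per overflow

    def __init__(self):
        self._store = {}   # (contextlevel, instanceid) -> record
        self._count = 0

    def add(self, level, instanceid, record):
        if self._count >= self.MAX_SIZE:
            # Batch-evict the oldest entries (dict preserves insertion order).
            for key in list(self._store)[:self.REDUCE_BY]:
                del self._store[key]
            self._count -= self.REDUCE_BY
        key = (level, instanceid)
        if key not in self._store:
            self._count += 1
        self._store[key] = record

    def get(self, level, instanceid):
        return self._store.get((level, instanceid))

    def remove(self, level, instanceid):
        if (level, instanceid) in self._store:
            del self._store[(level, instanceid)]
            self._count -= 1

    def reset(self):
        self._store = {}
        self._count = 0
```

Because callers never touch the underlying array, every path that adds or removes an entry also updates the counter, which is exactly the invariant a bare global array cannot guarantee.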

Eloy Lafuente (stronk7)
added a comment - 24/Mar/10 7:54 AM Yeah, David, in better words that's 100% what I said (we need to encapsulate cache and counter into a bullet-proof class).
But first we need to review all current uses of the cache to see if there is some place where it's used in a strange way (like adding XXX contexts, or array-merging the cache with another... or whatever).
Ciao

Due to the unchanged usage of "count($context_cache_id)" inside "while" in the nearly unchanged functions cache_context() and get_fast_modinfo() (compared m19 to m20), I'm pretty sure the problem is still present... I may be wrong, because I didn't do any further (real data) measurement on this.


Sam Marshall
added a comment - 25/Oct/10 7:32 PM For info: just for today, I'm looking at this bug as part of the OU's Moodle Monday exercise. If I figure it out by the end of today, I'll attach a patch here. (If it turns out to be too difficult, I'll just leave it - sorry.)
Note: I have some experience with this area of code, so hopefully I can come up with a working solution, presuming one is still needed. As agreed by others on this bug, I plan to check all current uses and, if possible, encapsulate the cache(s) into a class.


Sam Marshall
added a comment - 26/Oct/10 12:01 AM - edited OK, I've addressed this issue. Here's what I did:
0. Verified that neither $ACCESSLIB_PRIVATE nor the function cache_context() is referenced anywhere outside accesslib.
1. Created a new class context_cache to replace the function cache_context().
2. Reduced MAX_CONTEXT_CACHE_SIZE to 2500 as per the original request.
3. Added a REDUCE_SIZE constant (1000) so that reductions are done in large batches as per the original request.
4. Made it keep an approximate cache size count and use this rather than counting the array.
5. Added a unit test for the new class which exercises 2,500 contexts in the cache (passes).
6. Verified existing unit tests (still pass, except one unrelated one which was failing already).
7. Verified basic usage by clicking around the website (still works).
The patch is attached - accesslib.cache.patch (14 kB)
I did NOT make any changes regarding modinfo. This is because:
i. I don't think the 100 MB assumption will always stand up; modinfo can get much bigger depending on what features you use and how large courses are.
ii. 100 MB might still cause it to run out of RAM.
iii. The count() thing is surely not a performance issue here, because it doesn't take long to count 10 elements. Similarly, obtaining modinfo is a quite heavyweight task anyway, so messing with a cache is probably going to be a small proportion of it. (Compared to context, where getting a context is easy.)
I have got Tim to review this patch. I think it's fairly important that it is committed, and it is quality unit-tested code, so could I get a +1 to commit it please (or somebody else commit it)?
We're fairly sure it is the right approach, so if nobody has time to review it, I'll commit in a few days.
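The behaviour described in steps 2-5 can be sketched as a quick test (a hypothetical Python stand-in; the real patch and its unit test are PHP, and only the two constants below come from the comment above - the class shape is illustrative):

```python
MAX_CONTEXT_CACHE_SIZE = 2500  # step 2: reduced limit
REDUCE_SIZE = 1000             # step 3: batch reduction size

class ContextCacheStandIn:
    def __init__(self):
        self.store = {}
        self.approx_count = 0  # step 4: approximate size, not count()

    def add(self, contextid, record):
        if self.approx_count >= MAX_CONTEXT_CACHE_SIZE:
            # Batch-reduce by REDUCE_SIZE oldest entries in one pass.
            for key in list(self.store)[:REDUCE_SIZE]:
                del self.store[key]
            self.approx_count = len(self.store)  # re-sync after the batch
        self.store[contextid] = record
        self.approx_count += 1

def test_cache_caps_at_limit():
    # Step 5 in spirit: fill the cache to its limit, then overflow it once.
    cache = ContextCacheStandIn()
    for i in range(2500):
        cache.add(i, {"id": i})
    assert len(cache.store) == 2500           # filled exactly to the limit
    cache.add(2500, {"id": 2500})             # one more triggers the reduce
    assert len(cache.store) == 2500 - REDUCE_SIZE + 1
    assert 0 not in cache.store               # oldest 1000 entries evicted
    assert 2500 in cache.store
```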

Petr Skoda
added a comment - 27/Oct/10 12:22 AM I have committed Sam's patch, renamed $ACCESSLIB_PRIVATE->contexcache, and added some tweaks for potential drift problems.
Keeping this open for now because some other parts could use similar caching. Thanks, Sam!