For developers:
1/ run the functional DB tests under sqlsrv. Verify that no error related with "test_session_locks" happens.
2/ no need to perform the tests under other DBs, nothing has changed there (last week we introduced and tested that).

Description

On many of our hosted sites we sometimes get periods of unavailability or out of memory conditions which trace back to a single user.

Most often in these cases, the user will be performing some long-running process (e.g backup, users upload) and attempts to continue opening things from another browser tab. The new requests get backed up behind the long request waiting to grab a session lock. This can also be triggered by bug, especially infinite loops in code sections with increased time limit.

This results in stuck server processes which may not be freed for hours, and can cause out-of-memory issues if more processes are started to handle other requests.

Over time, it should be a goal to migrate these long processes out of web requests and into background processing but it would be nice to have some kind of catchall to prevent these and bugs from causing site unavailability.

The suggestion is to, where possible, place a time limit on how long Moodle will wait to acquire a session lock before giving up. While this does cause an error to the user, they do not get hung browser windows forever, and the site remains available.

Please see the linked patch for an attempt at doing this for Postgres.

Petr Skoda
added a comment - 07/Nov/11 12:57 AM - edited Thanks a lot for the report and patch, I have cleaned up the session handler a bit more and fixed other drivers.
To integrators: make sure it works in mssql, sqlsrv and oracle and fix it if not, I do not have a test server for these (my MSDN license expired, sorry). I suppose it is too big for backporting.
ciao

Petr Skoda
added a comment - 07/Nov/11 1:08 AM oh, I had some problems with Chrome on OSX when testing this, it seemed to do some extra page access logging - weird. It worked 100% for me in Safari for both MySQL and PostgreSQL...

I did not test your patch much because I discovered major issues there while reviewing it - such as different timeout handling in each driver, double errors caused by exceptions in session driver inits, incorrect error reporting, saving of invalid session content, etc. So I fixed all these first and then tested the new code. I always need some time away to see my own bugs, sorry for the trouble...

Petr Skoda
added a comment - 07/Nov/11 4:45 PM I did not test your patch much because I discovered major issues there while reviewing it - such as different timeout handling in each driver, double errors caused by exceptions in session driver inits, incorrect error reporting, saving of invalid session content, etc. So I fixed all these first and then tested the new code. I always need some time away to see my own bugs, sorry for the trouble...

Tony Levi
added a comment - 08/Nov/11 7:15 AM - edited Hang on a minute, I only touched one driver; that doesn't even make any sense?
edit:
oh, I think you mean the rest of the existing code, OK
does this mean this issue has to wait for 2.3 now?

You touched only one driver which was not enough to make it work properly, we must support all drivers and there were other bugs which I had to fix. The Google Chrome trouble is a separate issue, I just wanted to warn the testers that it may behave in a different way than other browsers.

Making changes in session code is always risky, this may be backported to stable branches later, I think it is better to keep it in dev branch until more people test it on real sites with real browsers.

In any case the integrators have to decide if it gets backported and when, luckily it is not my job to do that...

Petr Skoda
added a comment - 08/Nov/11 3:45 PM You touched only one driver which was not enough to make it work properly, we must support all drivers and there were other bugs which I had to fix. The Google Chrome trouble is a separate issue, I just wanted to warn the testers that it may behave in a different way than other browsers.
Making changes in session code is always risky, this may be backported to stable branches later, I think it is better to keep it in dev branch until more people test it on real sites with real browsers.
In any case the integrators have to decide if it gets backported and when, luckily it is not my job to do that...

Eloy Lafuente (stronk7)
added a comment - 14/Nov/11 2:34 AM Hi, I've added 3 commits on top of your branch, doing:
Added unit tests for get/release_session_lock() + the 2/3mins typo.
Fixed behavior for mssql, sadly there is one bug somewhere in freetds/php causing scalar results from stored procedures not being returned ok. The implemented workaround solves that.
Fixed behavior for oracle, the execute call returns false if there is any problem in the execution of get_lock(), so now it's checking for that instead of the timing way.
Source: https://github.com/stronk7/moodle/compare/master...MDL-30026
Right now it is working under mysql/pgsql/mssql/oci. Only sqlsrv requires testing and feedback here. So I'm going to send this to integration, asking specially for sqlsrv testing... thanks and ciao

line ... of ...
line 4164 of \lib\dml\simpletest\testdml.php: call to UnitTestCase->assertTrue()
line ... of ...
line 53 of \admin\tool\unittest\ex_simple_test.php: call to TestSuite->run()
line ... of ...
line 113 of \admin\tool\unittest\dbtest.php: call to autogroup_test_coverage->run_with_external_coverage()

Let me know if there is anything more I can help with otherwise I'll wait for this to come up for testing again.

Sam Hemelryk
added a comment - 14/Nov/11 11:25 AM Hi guys,
I've just tested it now using sqlsrv and have had the following error occur consistently. (Using Server 2008)
Fail: lib/dml/simpletest/testdml.php / ► dml_test / ► test_session_locks
at [C:\www\integration\lib\dml\simpletest\testdml.php line 4164]
line ... of ...
line 4164 of \lib\dml\simpletest\testdml.php: call to UnitTestCase->assertTrue()
line ... of ...
line 53 of \admin\tool\unittest\ex_simple_test.php: call to TestSuite->run()
line ... of ...
line 113 of \admin\tool\unittest\dbtest.php: call to autogroup_test_coverage->run_with_external_coverage()
Let me know if there is anything more I can help with otherwise I'll wait for this to come up for testing again.
Cheers
Sam

Eloy Lafuente (stronk7)
added a comment - 14/Nov/11 5:05 PM I'm getting this to me. Has been agreed by Sam and me to keep this integrated (working under 4DBs) and immediately after running upstream, we'll work in a fix for sqlsrv.
Ciao

Aparup Banerjee
added a comment - 23/Nov/11 2:58 PM This has been integrated into master only. Open to more testing!
my tests:
For some reason i couldn't get past test_concurent_transactions() so i skipped that for my setup.
The test_session_locks() is working fine for me on sqlsrv on MSSQL server 2008.

Is session locking on all the time? Since 2.2 I'm seeing "Timed out while waiting for session lock." on all kinds of operations, after everything hangs for some minutes. And this is on a site with one user.

Martin Dougiamas
added a comment - 06/May/12 7:03 PM - edited Is session locking on all the time? Since 2.2 I'm seeing "Timed out while waiting for session lock." on all kinds of operations, after everything hangs for some minutes. And this is on a site with one user.
Are these regressions? MDL-31870
I'm going to try turning off database sessions (which are on by default).

That is the new expected behaviour, the session locks have timeout defined in SESSION_ACQUIRE_LOCK_TIMEOUT (by default 2 minutes). If you want longer timeout please define it in your config.php. Alternatively fix the script that executes for long time by closing session manually before starting the slow operations.

Petr Skoda
added a comment - 06/May/12 8:15 PM - edited That is the new expected behaviour, the session locks have timeout defined in SESSION_ACQUIRE_LOCK_TIMEOUT (by default 2 minutes). If you want longer timeout please define it in your config.php. Alternatively fix the script that executes for long time by closing session manually before starting the slow operations.

Tony Levi
added a comment - 06/May/12 8:24 PM Yep, this is the rather unfortunate but intended behavior, otherwise one user brings the system down by holding refresh (fills all server processes).
The only other way I think is to eventually lose the session locks - certainly not safe until everything uses transactions properly and so on. More like a "Moodle 3" feature.

From a user perspective (remember them?) there is something SERIOUSLY wrong if one little test site using MySQL is hanging completely just because one user is switching quickly between pages that use AJAX (such as the course page) on Moodle 2.2.

Martin Dougiamas
added a comment - 07/May/12 12:08 PM - edited From a user perspective (remember them?) there is something SERIOUSLY wrong if one little test site using MySQL is hanging completely just because one user is switching quickly between pages that use AJAX (such as the course page) on Moodle 2.2.
This was never such a problem before. Cure worse than the disease?
Perhaps database sessions should be default to being off.

When users starts clicking like crazy the http requests should be interrupted and the session locks released immediately. It would be great if you described the exact steps to reproduce this. Why do you think this is related to database sessions? Did you try file bases sessions? (it uses locking too) Did you try Postgresql to rule out mysql drivers problems?

Petr Skoda
added a comment - 07/May/12 2:39 PM When users starts clicking like crazy the http requests should be interrupted and the session locks released immediately. It would be great if you described the exact steps to reproduce this. Why do you think this is related to database sessions? Did you try file bases sessions? (it uses locking too) Did you try Postgresql to rule out mysql drivers problems?

It was pretty random, but one I remember was that I was in the course page, and it was partially loaded and I clicked a link to go to an activity page. Bang. It does seem better now on file-based sessions, problem has not happened again. I don't have an option to use Postgres for this site (and nor would many users).

Martin Dougiamas
added a comment - 07/May/12 6:40 PM It was pretty random, but one I remember was that I was in the course page, and it was partially loaded and I clicked a link to go to an activity page. Bang. It does seem better now on file-based sessions, problem has not happened again. I don't have an option to use Postgres for this site (and nor would many users).

I will try to work around mysql problems by trying to create new full featured file-based session driver and memcached session driver (so far it looks that MUC abstraction would be unnecessarily complicated by the session locking logic, so it might be better to be separate).

Petr Skoda
added a comment - 07/May/12 9:38 PM I will try to work around mysql problems by trying to create new full featured file-based session driver and memcached session driver (so far it looks that MUC abstraction would be unnecessarily complicated by the session locking logic, so it might be better to be separate).

If im logged in and put one of my fingers over the F5 for a few seconds, while im at "moodlesite/my" path. Then database is filled with conections and the request may be interrupted every "F5" pulse, but the database still making the queries necesary for close the session and then unlocking it. This causes a lot of apache threads waiting for the answer of the database and ends on filling memory on front end servers and connection slots on database server.

Even with the acquire_lock_timeout at 5 seconds, connections and request are filled faster speed that released...

Now scale this to a 300 concurrent and 21.000 potential users... and i think it becames on a big: God Luck my friend!

¿Its really necesary to lock the session just for browsing? I mean, i understand to do this on quiz attempts or whatever but just opening "my moodle" page or the course list...

Alberto Lorenzo Pulido
added a comment - 11/Oct/12 4:38 PM Hi all guys, using moodle 2.3.2+ and PostgreSQL here.
If im logged in and put one of my fingers over the F5 for a few seconds, while im at "moodlesite/my" path. Then database is filled with conections and the request may be interrupted every "F5" pulse, but the database still making the queries necesary for close the session and then unlocking it. This causes a lot of apache threads waiting for the answer of the database and ends on filling memory on front end servers and connection slots on database server.
Even with the acquire_lock_timeout at 5 seconds, connections and request are filled faster speed that released...
Now scale this to a 300 concurrent and 21.000 potential users... and i think it becames on a big: God Luck my friend!
¿Its really necesary to lock the session just for browsing? I mean, i understand to do this on quiz attempts or whatever but just opening "my moodle" page or the course list...

The basic problem is that pretty much all write activity in Moodle does not use proper transactional methods and so is prone to terrible race conditions. This is reduced only by the session lock though some cases can still cause hideous effects visible to other users for example when moving categories it is possible for other users to temporarily see broken context information...

There also is absolutely zero way to identify if a page will be performing this race-prone activity, especially at the time the session is locked so adding special mitigation to certain common pages is error prone.

Moodle really needs to grow up with proper transactional semantics where appropriate to ultimately remove the session lock. Unfortunately an apparent requirement to support 120% broken mysql configurations will likely block this forever.

Tony Levi
added a comment - 11/Oct/12 7:28 PM That's exactly right. It also isn't simple to fix at all...
The basic problem is that pretty much all write activity in Moodle does not use proper transactional methods and so is prone to terrible race conditions. This is reduced only by the session lock though some cases can still cause hideous effects visible to other users for example when moving categories it is possible for other users to temporarily see broken context information...
There also is absolutely zero way to identify if a page will be performing this race-prone activity, especially at the time the session is locked so adding special mitigation to certain common pages is error prone.
Moodle really needs to grow up with proper transactional semantics where appropriate to ultimately remove the session lock. Unfortunately an apparent requirement to support 120% broken mysql configurations will likely block this forever.

Replicating:
1) Access course with large amount/number of files in the course, and preferably a huge number of files in the Moodle instance in other courses.
2) Edit a section (type of content is not a problem)
3) Select 'Insert Image' -> 'Find and Upload an Image'
4) Select 'Server Files'. Display files as Tree (IMPORTANT)
5) Before 'server files' loads, select another item such as 'Recent Files' and then 'Upload a File' - clicking either.
6) Depending on the session time out, this will throw the "Session Timeout" error when the timeout is reached.
7) The window is then "frozen", with no ability to select anything on the page - such as close the upload window.
8) Reloading the window then also returns the "Session Timeout" error and the system is not usable until the window is closed and session cookie is lost (as happens when the window closes).

I have video of the issue being replicated if it will help the resolution process.

James McLean
added a comment - 02/Apr/14 9:33 AM - edited I have managed to replicate the issue Martin mentioned above reliably.
Moodle 2.6.1+
Chrome Version 33.0.1750.154 m
Firefox 28.0
IE (9, old version intentionally installed)
Replicating:
1) Access course with large amount/number of files in the course, and preferably a huge number of files in the Moodle instance in other courses.
2) Edit a section (type of content is not a problem)
3) Select 'Insert Image' -> 'Find and Upload an Image'
4) Select 'Server Files'. Display files as Tree (IMPORTANT)
5) Before 'server files' loads, select another item such as 'Recent Files' and then 'Upload a File' - clicking either.
6) Depending on the session time out, this will throw the "Session Timeout" error when the timeout is reached.
7) The window is then "frozen", with no ability to select anything on the page - such as close the upload window.
8) Reloading the window then also returns the "Session Timeout" error and the system is not usable until the window is closed and session cookie is lost (as happens when the window closes).
I have video of the issue being replicated if it will help the resolution process.

Hi James,
imo, you may have stumbled upon one of many ways to replicate the issue. Can i suggest the creation of a new issue and within that you copy paste the replication steps. It may turn out to be one of a number of fixes needed to grow moodle towards proper transactions, this way we could create a meta for a better overview plan.

Aparup Banerjee
added a comment - 03/Apr/14 11:08 AM Hi James,
imo, you may have stumbled upon one of many ways to replicate the issue. Can i suggest the creation of a new issue and within that you copy paste the replication steps. It may turn out to be one of a number of fixes needed to grow moodle towards proper transactions, this way we could create a meta for a better overview plan.