Bug Description

[27 Feb 13:35] Laurynas Biveinis
Description:
Another follow-up commits to the buffer pool mutex split patch (bug 75534) is [1]. We have debugged what seems to be exact same bug and believe a fix with less locking is possible.

The bug is one thread storing a buffer pool block pointer for optimistic access, with all the buffer pool latches released, and another thread freeing the page and reading in another page. Then it becomes possible that the 1st thread calls buf_page_optimistic_get, buffer-fixes the page, the 2nd thread reads in a different data page and resets the buf fix count to zero, and, the 1st thread unfixes the page, resulting in the count of UINT32_MAX.

[1] avoids the above by putting buffer pool page initialisation into a block mutex critical section, serializing it with buf_page_optimistic_get. But it is enough to compare modify_clock value instead without adding any new extra serialization. It is bumped in buf_LRU_free_page too, noting the file page -> empty transition, in a buffer block mutex critical section. Thus by adding an early modify_clock check in the first block mutex critical section of buf_page_optimistic_get, we use the already-existing serialisation between buf_page_optimistic_get and buf_LRU_free_page critical sections, and whenever the latter executes before the former, we take an early exit out of buf_page_optimistic_get, avoiding any buf_fix_count manipulation altogether. If buf_page_optimistic_get runs first, then the page is buffer-fixed by leaving the critical section, and buf_LRU_free_page will not evict it.

I'd also replace the buf_fix_count reset to zero in buf_page_init_low with a release build assert checking the same.

(Also in [1], I'd check if the added assert "ut_ad(count + 1 != 0)" to catch buf_block_unfix overflow would not be subject to some type promotion rules making it always false - in my debugging I could get it to fire only by rewriting it as "ut_ad(count != UINT32_MAX);")

There is a race: Thread 1, we set buf_fix_count to 0 in buf_page_init().
Thread 2, we decrease buf_fix_count to 0xffffffff in buf_page_optimistic_get().
Thread 2, we increase buf_fix_count to 0 in buf_page_get_gen(). Thread 3,
we evict the page from LRU list. Thread 2, assert fails: buf_page_in_file().