It was already said there that an allocated page is not associated with a
fileh. However the code does more than that - it also does not add the page
to ram->lru_list nor (obviously) to the fileh->dirty_pages list.
Document that explicitly.

Relicense to GPLv3+ with wide exception for all Free Software / Open Source projects + Business options.
Nexedi stack is licensed under Free Software licenses with various exceptions
that cover three business cases:
- Free Software
- Proprietary Software
- Rebranding
As long as one intends to develop Free Software based on the Nexedi stack, no
license cost is involved. Developing proprietary software based on the Nexedi
stack may require a proprietary exception license. Rebranding the Nexedi stack
is prohibited unless a rebranding license is acquired.
Through this licensing approach, Nexedi expects to encourage Free Software
development without restrictions and at the same time create a framework for
proprietary software to contribute to the long-term sustainability of the
Nexedi stack.
Please see https://www.nexedi.com/licensing for details, rationale and options.

Like with loadblk (see f49c11a3 "bigfile/virtmem: Do loadblk() with
virtmem lock released" for the reference) storeblk() calls are
potentially slow, and the external code that serves the call can take other
locks in addition to the virtmem lock taken by the virtmem subsystem.
If those "other locks" are also taken before external code calls e.g.
fileh_invalidate_page() in a different codepath, a deadlock can happen:

      T1                              T2

      commit                          invalidation-from-server received
      V -> storeblk
                                      Z <- ClientStorage.invalidateTransaction()
      Z -> zeo.store
                                      V <- fileh_invalidate_page (of unrelated page)

The solution to avoid the deadlock is, like in the loadblk case, to call
storeblk() with the virtmem lock released.
However, unlike loadblk, which can be invoked at any time, storeblk is
invoked only at commit time, so for the storeblk case the rules for making
sure virtmem stays consistent after the virtmem lock is retaken are handled
differently:
1. We disallow several parallel writeouts for one fileh. This way the
dirty-pages handling logic cannot get messed up. This restriction is also
consistent with the ZODB two-phase commit protocol, where for a transaction
the commit logic is invoked/handled from only 1 thread.
2. For the same reason we disallow discard while a writeout is in
progress. This is also consistent with the ZODB two-phase commit protocol,
where txn.tpc_abort() is not expected to be called at the same time
as txn.commit().
3. While a writeout is in progress, for that fileh we disallow page
modifications and page invalidations - because both operations would
change at least the fileh dirty pages list, which is iterated over by the
writeout code while releasing/retaking the virtmem lock. By
disallowing them we make sure the fileh dirty pages list stays constant
during the whole fileh writeout.
These restrictions are also consistent with ZODB commit semantics:
- while an object is being stored into ZODB it is not expected to
be further modified or explicitly invalidated by the client via
._p_invalidate()
- server-initiated invalidations come into effect only at transaction
boundaries - when a new transaction is started - not during commit time.
Also, since storeblk is now called with the virtmem lock released, for the
buffer to store we can no longer use the page's present mapping in some vma
directly, because while the virtmem lock is released those mappings can go away.
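To make the shape of this concrete, here is a minimal sketch of the writeout
flow under the rules above - one writeout at a time per fileh, dirty list
frozen, lock dropped around the slow external call. Names (virt_lock,
dirty_pages, storeblk) follow the text, but all types and signatures are
simplified assumptions, not the real virtmem code:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* simplified stand-ins for the real virtmem structures */
    typedef struct Page  { struct Page *next; size_t blk; void *buf; } Page;
    typedef struct FileH {
        Page *dirty_pages;           /* stays constant during writeout (rule 3) */
        bool  writeout_inprogress;
    } FileH;

    static pthread_mutex_t virt_lock = PTHREAD_MUTEX_INITIALIZER;

    /* external and potentially slow; may take its own locks (e.g. ZEO's) */
    extern int storeblk(FileH *fileh, size_t blk, const void *buf);

    int fileh_dirty_writeout(FileH *fileh)
    {
        pthread_mutex_lock(&virt_lock);

        /* rule 1: no parallel writeouts for one fileh */
        if (fileh->writeout_inprogress) {
            pthread_mutex_unlock(&virt_lock);
            return -1;
        }
        /* rules 2-3: discard/modify/invalidate check this flag and refuse */
        fileh->writeout_inprogress = true;

        for (Page *p = fileh->dirty_pages; p != NULL; p = p->next) {
            pthread_mutex_unlock(&virt_lock);   /* drop lock around slow store */
            storeblk(fileh, p->blk, p->buf);
            pthread_mutex_lock(&virt_lock);     /* retake; list did not change */
        }

        fileh->writeout_inprogress = false;
        pthread_mutex_unlock(&virt_lock);
        return 0;
    }
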
Fixes: nexedi/wendelin.core#6

This allows the writeout code not to scan the whole pagemap to find dirty
pages to write out, which should be faster.
But more importantly, iterating over the whole pagemap on writeout would
become unsafe when, in an upcoming patch, storeblk() will be called with
virt_lock released: the pagemap could then be modified, e.g. due to
processing other read accesses.
So maintain a fileh->dirty_pages list and use it when we need to go
through the dirtied pages.
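As a toy illustration of the idea, a kernel-style intrusive list lets
writeout reach only the dirtied pages without any pagemap scan (the real
code uses its own list.h; everything below is illustrative only):

    #include <stdio.h>
    #include <stddef.h>

    /* minimal kernel-style intrusive list */
    struct list_head { struct list_head *prev, *next; };

    static void list_init(struct list_head *h)  { h->prev = h->next = h; }

    static void list_add_tail(struct list_head *n, struct list_head *h)
    {
        n->prev = h->prev;  n->next = h;
        h->prev->next = n;  h->prev = n;
    }

    static void list_del(struct list_head *n)
    {
        n->prev->next = n->next;  n->next->prev = n->prev;
    }

    #define container_of(p, type, member) \
        ((type *)((char *)(p) - offsetof(type, member)))

    typedef struct Page {
        int pgoffset;
        struct list_head in_dirty;  /* linked into fileh->dirty_pages when dirty */
    } Page;

    int main(void)
    {
        struct list_head dirty_pages;           /* fileh->dirty_pages */
        list_init(&dirty_pages);

        Page p1 = {7, {0, 0}}, p2 = {42, {0, 0}};
        list_add_tail(&p1.in_dirty, &dirty_pages);  /* pages became dirty */
        list_add_tail(&p2.in_dirty, &dirty_pages);

        /* writeout walks only dirtied pages - no full pagemap scan */
        for (struct list_head *h = dirty_pages.next; h != &dirty_pages; h = h->next)
            printf("store page %d\n", container_of(h, Page, in_dirty)->pgoffset);

        list_del(&p1.in_dirty);   /* stored -> not dirty anymore */
        list_del(&p2.in_dirty);
        return 0;
    }
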
Updates: nexedi/wendelin.core#6

@kazuhiko reports that the wendelin.core build is currently broken on Python 3.5.
Indeed it was:
    In file included from bigfile/_bigfile.c:37:0:
    ./include/wendelin/compat_py2.h: In function ‘_PyThreadState_UncheckedGetx’:
    ./include/wendelin/compat_py2.h:66:28: warning: implicit declaration of function ‘_Py_atomic_load_relaxed’ [-Wimplicit-function-declaration]
        return (PyThreadState*)_Py_atomic_load_relaxed(&_PyThreadState_Current);
                               ^
    ./include/wendelin/compat_py2.h:66:53: error: ‘_PyThreadState_Current’ undeclared (first use in this function)
        return (PyThreadState*)_Py_atomic_load_relaxed(&_PyThreadState_Current);
                                                        ^
    ./include/wendelin/compat_py2.h:66:53: note: each undeclared identifier is reported only once for each function it appears in
    ./include/wendelin/compat_py2.h:67:1: warning: control reaches end of non-void function [-Wreturn-type]
     }
     ^
The story here is that in 3.5 they decided to remove direct access to
_PyThreadState_Current and the atomic implementations - because that might
semantically conflict with other headers implementing atomics - and to
provide access only via a function.
Starting from Python 3.5.2rc1 the function to get the current thread state
without asserting it is !NULL - _PyThreadState_UncheckedGet() - was added:
https://github.com/python/cpython/commit/df858591
so for those Python versions we can use it directly.
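A hedged sketch of the resulting version dispatch in the compat header (the
exact branches and cutoffs in the real compat_py2.h may differ):

    #include <Python.h>

    /* get the current thread state without asserting it is !NULL;
     * a sketch only - the real compat header handles more details */
    static inline PyThreadState * _PyThreadState_UncheckedGetx(void)
    {
    #if PY_VERSION_HEX >= 0x03050200
        /* Python 3.5.2rc1+ provides the accessor itself */
        return _PyThreadState_UncheckedGet();
    #elif PY_VERSION_HEX >= 0x03000000
        /* Python 3.x before 3.5: _PyThreadState_Current is an atomic variable
         * (3.5.0/3.5.1 expose neither and still fail here - see the issues below) */
        return (PyThreadState*)_Py_atomic_load_relaxed(&_PyThreadState_Current);
    #else
        /* Python 2.7: _PyThreadState_Current is a plain pointer */
        return _PyThreadState_Current;
    #endif
    }
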
After the fix wendelin.core tox tests pass under all of Python 2.7, 3.4 and 3.5.
More context here:
https://bugs.python.org/issue26154
https://bugs.python.org/issue25150
Fixes: nexedi/wendelin.core#1

_PyThreadState_Current_GET() is a function to get the current Python thread
state without asserting it is !NULL. It was added as part of d53271b9
(bigfile/virtmem: Big Virtmem lock).
We are going to adapt it to Python 3.5 (see the next patch), so before doing
that, move it to our compatibility place.
In the new place the name is _PyThreadState_UncheckedGet - the same name
such a function has in Python 3.5 (again, see the next patch).
Updates: nexedi/wendelin.core#1

Since the beginning of pagemap (45af76e6 "bigfile/pagemap: specialized
{} uint64 -> void * mapping") we had a bug sitting in
__pagemap_for_each_leaftab() (the non-leaf iterating logic behind
pagemap_for_each):
after an entry to stack down into was found, we did not update tailv[l]
accordingly. Thus, if there are non-adjacent entries, an entry could
e.g. be emitted many times:
    l 3 __down 0x7f79da1ee000
    tailv[4]: 0x7f79da1ee000
    -> tailv[4] 0x7f79da1ee000 __down 0x7f79da1ed000
    l 4 __down 0x7f79da1ed000
    tailv[5]: 0x7f79da1ed000
    h 5 l 5 leaftab: 0x7f79da1ed000 <--
        lvl 5 idx 169 page 0x55aa
    ok 9 - pagemap_for_each(0) == 21930
    l 5 __down (nil)
    tailv[4]: 0x7f79da1ee008
    -> tailv[4] 0x7f79da1ee008 __down 0x7f79da1ed000
    l 4 __down 0x7f79da1ed000
    tailv[5]: 0x7f79da1ed000
    h 5 l 5 leaftab: 0x7f79da1ed000 <--
        lvl 5 idx 169 page 0x55aa
    not ok 10 - pagemap_for_each(1) == 140724106500272
And many-times-emitted entries are not only incorrect, but can also lead
to unhandled segmentation faults in e.g. fileh_close():
https://lab.nexedi.com/nexedi/wendelin.core/blob/v0.6-1-gb0b2c52/bigfile/virtmem.c#L179
    /* drop all pages (dirty or not) associated with this fileh */
    pagemap_for_each(page, &fileh->pagemap) {
        /* it's an error to close fileh to mapping of which an access is
         * currently being done in another thread */
        BUG_ON(page->state == PAGE_LOADING);
        page_drop_memory(page);
        list_del(&page->lru);           <-- HERE
        bzero(page, sizeof(*page));     /* just in case */
        free(page);
    }
( because after the first bzero of a page, the page is all 0 bytes,
including page->lru{.next,.prev}, so the second time the same page is
emitted by pagemap_for_each, list_del(&page->lru) will try to set
page->lru.next = ... which will segfault. )
So fix it by properly updating tailv[l] while we scan/iterate the current level.
NOTE
This applies only to non-leaf pagemap levels, as the leaf level is scanned
with a separate loop in pagemap_for_each. That's probably why we did not
notice this earlier - up until now our usual workload was to change
data in adjacent batches, and that means adjacent pages.
Today though, @Tyagov was playing with wendelin.core in some other way and
it uncovered the bug.
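To see the failure mode in miniature, here is a toy two-level analogue of
the iteration. The single cursor-update line marked below is the analogue of
the missing tailv[l] update (the real pagemap code is multi-level and more
general):

    #include <stdio.h>

    #define FANOUT 4

    /* toy 2-level radix tree: root slots point to leaf tables of ints */
    static int  leafA[FANOUT] = {1, 2, 0, 0};
    static int  leafB[FANOUT] = {0, 3, 0, 4};
    static int *root [FANOUT] = {leafA, NULL, leafB, NULL};

    int main(void)
    {
        /* per-level cursor, like tailv[l] - where the scan of this level
         * resumes after coming back up from a deeper level */
        int **cur = root, **end = root + FANOUT;

        while (cur != end) {
            int *leaf = *cur;
            cur++;          /* THE FIX: step past the entry we descend into;
                             * without this, leafA would be emitted forever */
            if (!leaf)
                continue;
            for (int j = 0; j < FANOUT; j++)
                if (leaf[j])
                    printf("%d ", leaf[j]);     /* prints: 1 2 3 4 */
        }
        printf("\n");
        return 0;
    }
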

loadblk() calls are potentially slow and the external code that serves the
call can take other locks in addition to the virtmem lock taken by the
virtmem subsystem. If those "other locks" are also taken before external code
calls e.g. fileh_invalidate_page() in a different codepath, a deadlock can
happen, e.g.

      T1                              T2

      page-access                     invalidation-from-server received
      V -> loadblk
                                      Z <- ClientStorage.invalidateTransaction()
      Z -> zeo.load
                                      V <- fileh_invalidate_page

The solution to avoid the deadlock is to call loadblk() with the virtmem lock
released, and upon loadblk() completion recheck virtmem data structures
carefully.
To make that happen:
- a new page state is introduced:
PAGE_LOADING (file content loading is in progress)
- virtmem releases virt_lock before calling loadblk() when serving a pagefault
- because loading is now done with the virtmem lock released:
1. After loading completes we need to recheck fileh/vma data structures.
The recheck is done in full - vma_on_pagefault() just asks its driver (see
VM_RETRY and VM_HANDLED codes) to retry handling the fault completely. This
should work, as the freshly loaded page was just inserted into fileh->pagemap
and should be found there in the cache on the next lookup.
On the other hand this also works correctly if there was a concurrent change
- e.g. the vma was unmapped while we were loading the data - in that case the
fault will also be processed correctly, but the loaded data will stay in
fileh->pagemap (and if not used will eventually be evicted as not-needed
by RAM reclaim).
2. A similar retrying mechanism is used for cases when two threads
concurrently access the same page and would both try to load the
corresponding block - only one thread issues the actual loadblk() and the
other waits for the load to complete with polling and VM_RETRY.
3. To correctly invalidate loading-in-progress pages another new page state
is introduced:
PAGE_LOADING_INVALIDATED (file content loading was in progress
while a request to invalidate the page came in)
which fileh_invalidate_page() uses to propagate the invalidation message to
the loadblk() caller.
4. Block loading can now happen in parallel with other block loading and
other virtmem operations - e.g. invalidation. For such cases tests are added
to test_thread.py.
5. The virtmem lock now becomes just a regular lock, instead of being
recursive as previously.
The virtmem lock needed to be recursive for cases when code under
loadblk() could trigger other virtmem calls, e.g. due to GC calling
another VMA's dtor that would want to lock virtmem while the virtmem lock
was already held.
This is no longer needed.
6. To catch double faults we now cannot use just one static variable
in_on_pagefault. That variable thus becomes thread-local.
7. The old test in test_thread to "test that access vs access don't overlap"
no longer holds true - and is thus removed.
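A condensed sketch of the resulting pagefault flow for items 1-3; names
follow the text, but structures and signatures are simplified assumptions,
not the real virtmem code:

    #include <pthread.h>

    enum PageState { PAGE_EMPTY, PAGE_LOADING, PAGE_LOADED,
                     PAGE_LOADING_INVALIDATED };
    enum { VM_RETRY, VM_HANDLED };

    typedef struct Page { enum PageState state; } Page;

    static pthread_mutex_t virt_lock = PTHREAD_MUTEX_INITIALIZER;

    /* external, potentially slow, may take its own locks (e.g. ZEO's) */
    extern void loadblk(Page *page, void *buf);

    int vma_on_pagefault(Page *page, void *buf)
    {
        pthread_mutex_lock(&virt_lock);

        /* (2) another thread is already loading this page - tell the driver
         * to retry the whole fault; it will effectively poll via VM_RETRY */
        if (page->state == PAGE_LOADING) {
            pthread_mutex_unlock(&virt_lock);
            return VM_RETRY;
        }

        if (page->state == PAGE_EMPTY) {
            page->state = PAGE_LOADING;
            pthread_mutex_unlock(&virt_lock);   /* drop lock around slow load */
            loadblk(page, buf);
            pthread_mutex_lock(&virt_lock);     /* retake and recheck (1) */

            /* (3) an invalidation arrived while loading was in progress */
            if (page->state == PAGE_LOADING_INVALIDATED)
                page->state = PAGE_EMPTY;       /* discard what was loaded */
            else
                page->state = PAGE_LOADED;      /* stays in fileh->pagemap */
            pthread_mutex_unlock(&virt_lock);
            return VM_RETRY;    /* (1) retry the fault in full from scratch */
        }

        /* PAGE_LOADED - the page can be mapped in; fault is handled */
        pthread_mutex_unlock(&virt_lock);
        return VM_HANDLED;
    }
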
/cc @Tyagov, @klaus

FileH is a handle representing a snapshot of a file. If, for a pgoffset, the
fileh already has a loaded page, but we know the content of the file has
changed externally after the loading was done, we need to propagate to the
fileh that such-and-such page should be invalidated (and reloaded on
next access).
This patch introduces

    fileh_invalidate_page(fileh, pgoffset)

to do just that.
In the next patch we'll use this facility to propagate invalidations of
ZBlk ZODB objects to the virtmem subsystem.
NOTE
Since invalidation removes "dirtiness" from a page's state, several
subsequent invalidations can make a fileh completely non-dirty
(by invalidating all dirty pages). Previously fileh->dirty was just one
bit, so we needed to improve how we track dirtiness.
One way would be to have a dirty list of fileh pages and operate on
that. This has the advantage of also optimizing dirty-pages processing like
fileh_dirty_writeout(), where we currently scan through all fileh pages
just to write out only the PAGE_DIRTY ones.
Another, simpler way is to make fileh->dirty a counter and maintain that.
Since we are going to move the virtmem subsystem back into the kernel, here
the simpler, less-intrusive approach is used.
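A toy model of the counter approach (names mirror the text; the real
structures in virtmem.c differ):

    #include <assert.h>

    enum PageState { PAGE_EMPTY, PAGE_LOADED, PAGE_DIRTY };

    typedef struct { enum PageState state; } Page;
    typedef struct { int dirty; /* # of dirty pages, not a single bit */ } FileH;

    static void page_mkdirty(FileH *fileh, Page *page)
    {
        if (page->state != PAGE_DIRTY) {
            page->state = PAGE_DIRTY;
            fileh->dirty++;
        }
    }

    static void page_invalidate(FileH *fileh, Page *page)
    {
        if (page->state == PAGE_DIRTY)
            fileh->dirty--;          /* invalidation removes dirtiness */
        page->state = PAGE_EMPTY;    /* will be reloaded on next access */
    }

    int main(void)
    {
        FileH fh = {0};
        Page p1 = {PAGE_LOADED}, p2 = {PAGE_LOADED};
        page_mkdirty(&fh, &p1);
        page_mkdirty(&fh, &p2);
        page_invalidate(&fh, &p1);
        page_invalidate(&fh, &p2);
        assert(fh.dirty == 0);       /* fileh is completely non-dirty again */
        return 0;
    }
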

Previously we were limited to printing a traceback starting down from just
storeblk() via explicit PyErr_PrintEx() - because pybuf was attached to
memory which could go away right after return from the C function - so we
had to destroy that object for sure, not letting any traceback hold a
reference to it.
This turned out to be too limiting and to not show the full context where
errors happen.
So do the following trick: before returning, reattach pybuf to an empty
region at NULL; this way we don't need to worry about pybuf pointing to
memory which can go away - so instead of printing the exception locally,
just return it the usual way it is done with the C API in Python.
NOTE In contrast to PyMemoryViewObject, the PyBufferObject definition is not
public, so to support Python 2 we had to copy its definition to the PY2
compat header.
NOTE2 loadblk() is not touched - the loading is done from sighandler
context, which simulates as if it were working in a separate Python thread,
so it is left as-is for now.

At present several running threads can corrupt internal virtmem
data structures (e.g. ram->lru_list, fileh->pagemap, etc).
This can happen even if we have Zope instances with only 1 worker thread
- because there are other "system" threads, and Python garbage collection
can trigger in any thread, so if a virtmem object, e.g. a VMA or a FileH,
was sitting in the GC queue to be collected, its collection, and thus
e.g. vma_unmap() and fileh_close(), will be called from a
different-from-worker thread.
Because of that, virtmem simply has to be aware of threads in order not to
allow internal data-structure corruption.
On the other hand, the idea of introducing a userspace virtual memory
manager turned out to be not so good from the performance and complexity
points of view, and thus the plan is to try to move it back into the
kernel. Given that, it does not make sense to do a well-optimised locking
implementation for the userspace version.
So we do just a simple single "protect-all" big lock for virtmem.
Of particular note is the interaction with Python's GIL - any long-lived
lock has to be taken with the GIL released, because otherwise it can deadlock:

      t1          t2

      G
      V           G
      !G          V
      G

(G = take the GIL, V = take the virtmem lock, !G = release the GIL;
in the end t1 holds V and waits for G, while t2 holds G and waits for V)
so we introduce helpers to make sure the GIL is not taken, and to retake
it back if we were holding it initially.
Those helpers (py_gil_ensure_unlocked / py_gil_retake_if_waslocked) are
symmetrical opposites to what Python provides to make sure the GIL is
locked (PyGILState_Ensure / PyGILState_Release).
Otherwise, the patch is a more-or-less straightforward application of the
one-big-lock-to-protect-everything idea.
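A hedged sketch of what such a helper pair can look like (the exact CPython
calls used by the real helpers may differ; PyGILState_Check() shown here
exists since Python 3.4):

    #include <Python.h>

    /* release the GIL if this thread currently holds it;
     * returns the saved thread state to pass to the retake helper */
    static inline PyThreadState *py_gil_ensure_unlocked(void)
    {
        if (PyGILState_Check())
            return PyEval_SaveThread();     /* releases the GIL */
        return NULL;
    }

    /* symmetrical opposite: retake the GIL iff we were holding it initially */
    static inline void py_gil_retake_if_waslocked(PyThreadState *ts)
    {
        if (ts)
            PyEval_RestoreThread(ts);       /* reacquires the GIL */
    }

If the virtmem lock is only ever taken through this pair, the GIL is never
held while waiting for it, which breaks the inversion shown in the diagram
above.
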

We factored out SIGSEGV block/restore from fileh_dirty_writeout() to all
functions in cb7a7055 (bigfile/virtmem: Block/restore SIGSEGV in
non-pagefault-handling functions). The restoration however just sets the
whole thread sigmask.
It could be possible that between the block/restore calls the procmask for
other signals was changed, and this way - by setting the procmask directly -
we would overwrite those changes.
So be careful, and when restoring the SIGSEGV mask, touch the mask bit for
only that signal.
( we need an xsigismember helper to get this done, which is also introduced
in this patch )
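A minimal sketch of restoring just the SIGSEGV bit; plain sigismember is
used here, with xsigismember from the patch being its error-checked variant:

    #include <signal.h>
    #include <pthread.h>

    /* restore only SIGSEGV's blocked/unblocked state from `save`,
     * leaving all other signals in the thread sigmask untouched */
    static void sigsegv_restore(const sigset_t *save)
    {
        sigset_t segv;
        sigemptyset(&segv);
        sigaddset(&segv, SIGSEGV);

        if (sigismember(save, SIGSEGV))
            pthread_sigmask(SIG_BLOCK,   &segv, NULL);  /* was blocked */
        else
            pthread_sigmask(SIG_UNBLOCK, &segv, NULL);  /* was not blocked */
    }
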

This does similar things to what the kernel does - users can mmap file parts
into the address space and access them read/write. The manager gets invoked
by the hardware/OS kernel for cases when there is no page loaded for a read,
or when a previously read-only page is being written to.
In addition to the features provided by the kernel, it supports storing back
changes in a transactional way (see fileh_dirty_writeout()) and can
potentially use huge pages for mappings (though this is currently a TODO).

Users can inherit from BigFile and provide custom ->loadblk() and
->storeblk() to load/store file blocks from a database or some other
storage. The system can then use such files to memory-map them into
user address space (see the next patch).
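In C this takes the shape of an operations table that the file provider
fills in - roughly like the following (a sketch; the exact types and fields
in the real bigfile header may differ):

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t blk_t;                 /* block number */
    typedef struct BigFile BigFile;

    /* operations a file provider implements;
     * loadblk/storeblk return 0 on success and -1 on error */
    struct bigfile_ops {
        int (*loadblk) (BigFile *file, blk_t blk, void *buf);
        int (*storeblk)(BigFile *file, blk_t blk, const void *buf);
    };

    struct BigFile {
        const struct bigfile_ops *file_ops;
        size_t blksize;                     /* size of a block */
    };
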

This thing allows us to get aliasable RAM from the OS kernel and to manage it.
Currently we get memory from a tmpfs mount; hugetlbfs should also work, but
is a TODO because hugetlbfs in the kernel needs to be improved.
We need aliasing because we'll need to be able to memory-map the same
page into several places in the address space, e.g. for taking two
overlapping slices of the same array at different times.
Comes with test programs that show that aliasing does not work for
anonymous memory.
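A tiny standalone demonstration of the aliasing that file-backed memory
provides: the same page mapped at two addresses stays one page. Linux
memfd_create is used here as a stand-in for a file on a tmpfs mount (an
assumption for brevity; the actual code obtains memory from a real tmpfs
mount):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        /* memfd gives an fd to anonymous shmem - like a tmpfs file */
        int fd = memfd_create("alias-demo", 0);
        ftruncate(fd, 4096);

        /* map the SAME page at two different addresses */
        char *a = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
        char *b = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

        strcpy(a, "hello");
        printf("a=%p b=%p b reads: %s\n", (void*)a, (void*)b, b); /* "hello" */

        /* with MAP_ANONYMOUS there is no fd to map twice - no way to alias */
        return 0;
    }
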

For BigFiles we'll need to maintain a `{} offset-in-file -> void *` mapping. A
hash or a binary tree could be used there, but since we know files are
most of the time accessed sequentially and locally in batches of pages, we
can also organize the mapping in batches of keys.
Specifically, the offset bits are divided into parts such that every part
addresses 1 entry in a table of hardware-page size. To get to the actual
value, the system looks up the first table by the first part of the offset,
then, from the first table and the next part of the offset, the second
table, etc.
To clients this looks like a dictionary with get/set/del & clear methods,
but lookups are always O(1) time, and in contrast to hashes, values are
stored with locality (= adjacent lookups almost always access the same tables).
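A small illustration of the offset splitting, assuming 4096-byte hardware
pages and 8-byte pointers, i.e. 512 slots per table and 9 bits of the offset
per level (the real pagemap parameters may differ):

    #include <stdint.h>
    #include <stdio.h>

    #define SLOT_BITS  9                    /* log2(4096 / sizeof(void *)) */
    #define SLOTS      (1u << SLOT_BITS)    /* 512 entries per table */
    #define LEVELS     8                    /* ceil(64 / SLOT_BITS); the top
                                             * level sees only the leftover
                                             * high bit of the offset */
    int main(void)
    {
        uint64_t off = 0x123456789aULL;

        /* walk top table -> ... -> leaf table, one slot index per level */
        for (int l = 0; l < LEVELS; l++) {
            unsigned idx = (off >> (SLOT_BITS * (LEVELS - 1 - l))) & (SLOTS - 1);
            printf("level %d: slot %u\n", l, idx);
        }
        return 0;
    }
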