On Tue, 5 Jan 2010, KAMEZAWA Hiroyuki wrote:> > I'm sorry if I miss something...how does this patch series avoid> that vma is removed while __do_fault()->vma->vm_ops->fault() is called ?> ("vma is removed" means all other things as freeing file struct etc..)

I don't think you're missing anything.

Protecting the vma isn't enough. You need to protect the whole FS stack with rcu. Probably by moving _all_ of "free_vma()" into the RCU path (which means that the whole file/inode gets de-allocated at that later RCU point, rather than synchronously). Not just the actual kfree.

However, it's worth noting that that actually has some very subtle and major consequences. If you have a temporary file that was removed, where the mmap() was the last user that kind of delayed freeing would also delay the final fput of that file that actually deletes it.

Or put another way: if the vma was a writable mapping, a user may do

munmap(mapping, size);

and the backing file is still active and writable AFTER THE MUNMAP! This can be a huge problem for something that wants to unmount the volume, for example, or depends on the whole writability-vs-executability thing. The user may have unmapped it, and expects the file to be immediately non-busy, but with the delayed free that isn't the case any more.

In other words, now you may well need to make munmap() wait for the RCU grace period, so that the user who did the unmap really is synchronous wrt the file accesses. We've had things like that before, and they have been _huge_ performance problems (ie it may take just a timer tick or two, but then people do tens of thousands of munmaps, and now that takes many seconds just due to RCU grace period waiting.

I would say that this whole series is _very_ far from being mergeable. Peter seems to have been thinking about the details, while missing all the subtle big picture effects that seem to actually change semantics.