I shall do my best. I hope your WWW pages look good in Lynx! :-) I shall certainly read your paper and source (when I get the time...).

Werner> Also, my code only works with the ATM side of networking -
Werner> the whole IP stack is a bit more complex, but I think Alan
Werner> has started working on that.

It sounds like you have networking specifically in mind. Certainly, file serving and routing are obvious applications to benefit from this (they're what I had in mind too), but also general I/O in all applications is likely to benefit. There is a potential reduction in swapping too (see "Swapping Benefits").

I've noticed that when performance becomes important, people start adding `mmap' optimisations to their programs. INND springs to mind. And all my non-networking major I/O programs prefer to use `mmap' :-). Also there's the new `mmap' capability in the sound driver. Page sharing would reduce the need for that sort of thing, with the remaining major bottleneck being the need for `read' to wait for the status of the read operation. Even that could be eliminated when it is acceptable for an application to receive a SEGV when it tries to access a page that it thought it had read successfully, but for which asynchronous I/O failed. If it doesn't reference the page but does another `read' instead, it can receive the error code then.

The point is that these page sharing optimisations might well provide a worthwhile performance improvement to existing I/O intensive software, without requiring any changes to the software. They'd also leave more CPU for everything else.

>>>>> "Ingo" == Ingo Molnar <mingo@pc5829.hil.siemens.co.at> writes:

Ingo> havent checked out your source yet, but under AIX similar
Ingo> approach is used for async IO. Under Linux, if processes that
Ingo> get clone()-d with VM_SHARE, and run on another CPU (or even
Ingo> the same if the socket operation sleeps) and use the same
Ingo> page(s) which are locked, then we might get problems. If there
Ingo> is a local variable right in the same page as your IO buffer,
Ingo> doesnt this cause problems or unnecessary/unexpected sleeps?

Sharing semantics
=================
This is what I meant when I said that the sharing semantics get a bit more complicated. In this case, if either cloned task writes to the page, and they are not supposed to be able to modify the I/O buffer, a copy is made (copy-on-write), but the copy is still shared between the two tasks (though not with the I/O buffer). The original page may still be shared between I/O buffers, other tasks, and even other addresses within the same cloned tasks. Sometimes the tasks are allowed to modify the I/O buffer, even during I/O (see "Asynchronicity").

By the way, with this model a single page can be shared between different places within a single task, even within the same VM area, without being shared in the traditional sense. That is, a write to one instance of the page should not be reflected in other instances. This is easy to produce: the task writes some pages to a file, and then reads them back into another part of its data area. The data then occurs in two places in a single VM area in a single task, but modifications to each page must be made independent by copying the page on demand.

Of course, a single page can also be properly shared in the traditional sense, in all sorts of ways using the traditional mechanisms for this (`mmap' of /proc/../mem, `mmap' of files, shared-memory calls, VM_SHARE, etc.). Both sorts of sharing can occur at the same time, so it's all quite complicated.

Asynchronicity
==============
It is wise to treat all I/O as asynchronous, with synchronous I/O being a special case where the task sleeps until notified later. Whatever optimisations apply to the synchronous case (with no other "traditional" sharing), try to apply them in general.

Being asynchronous, it doesn't really matter if other tasks see the data being written directly into their VM space, does it? Similarly, other tasks are welcome to modify the data during an asynchronous write operation. It is only when some task (or group of them sharing in the traditional sense) or I/O subsystem considers a page to be all its own that you have to mark them copy-on-write.

Writing
=======
For example, just before the net subsystem calculates the checksum for a region (during a `write'), it will have to claim a "private" copy of the pages in question. This simply involves marking the pages in all VM areas in all tasks as requiring copy-on-write. Other information that applies to the tasks can be calculated lazily later. While the data is merely being moved, though (IP routing, no room in the ethernet card's TX buffer, etc.), there is no need to mark the pages in this way. Disk drivers, on the other hand, don't care if a task modifies the data as it is written to disk, as this is presumably allowed by asynchronous write semantics. No sleeping on behalf of the I/O systems is involved -- if a page is copied, it is the responsibility of the task that wrote to it to do the copy.

You get the maximum benefits from page sharing by delaying copying, and copy-on-write marking (which isn't free either), as much as possible.

Reading
=======
When data is read, I/O appears to be modifying a page. If all of the page mappings in all tasks involved in the read expect to see the new data, I/O can go ahead and simply modify the page. If not, the I/O can write to a new page, and when it has finished the page is mapped into all the places where the data is supposed to be seen. If the I/O doesn't fill the page with data, then after the I/O some data is copied from the old page to the new one (which is then remapped). If you know how much data you're expecting, then in principle this latter copying can happen in parallel with the I/O. Even if not, the initial part of the page can always be copied in parallel. Actually, after the I/O has completed you can check to see if the page still needs to be copied, and if not it may or may not be faster to copy the newly read data into the original page.

If you're reading from the network, involving headers and checksums and things, then the I/O won't be happening directly into running tasks' pages. In this case, everything happens as in the second case above: I/O reads the data, headers are processed, and eventually the data is copied into the desired page, or the rest of the original page is copied into the newly read data page.

In the special case of I/O reading a whole number of pages, both of the above scenarios avoid copying data. In general, they minimise the amount of data copied.

As for partial page copies, as ever it is worth treating the copying of a zero-mapped page as a special case.

Asynchronicity again
====================
In all situations, I think whatever copying needs doing can be done without locking the rest of the kernel. That is, the copying can be done on behalf of some task or other, and thus pre-empted safely if there are more pressing things to be doing. If you want to get silly, the post-read copies could be delayed until a task actually references the copied data, though in general tasks tend to reference the data they just requested very soon afterwards, so there's no point.

I see what you mean about unexpected sleeps -- in a device driver, when data would like to get written to a page, for example. Then there is sometimes a need to allocate a new page for the incoming data. This is a consequence of delaying (and sometimes avoiding) page allocations. It shouldn't ever lead to more sleeping, allocation or copying than the current system, but it can occur at less predictable times. Perhaps the best solution is to allocate the page as late as possible outside bottom halves, etc., shortly before commencing an I/O operation that needs a new page. At that point, check if the page needs to be duplicated, if a new one needs to be allocated, or if the existing one can be used. That is also a good time to check if the existing page is swapped out :-).

Page allocation and the zero-page pool
======================================

Ingo> currently, apart from a (usually very small) initial time, all
Ingo> buffers are in use. There is a small amount of free pages
Ingo> reserved for interrupt handlers and alike, but "pages that
Ingo> could be zeroed out" are not available.

As ever, it is worth maintaining a small pool of unallocated pages, so that last-minute page allocations tend not to have to sleep. This is done already of course. These pages can be shared with the zero-page pool. Uninitialised pages turn into zeroed ones when there's nothing better to do. If you need a new page and there's none in the uninitialised pool, just take one of the zeroed ones. If there aren't any of those, _then_ you have to steal another page, hopefully one mirrored in swap, but potentially requiring swapping something out and sleeping.

If you're doing a partial fill of a zero-mapped page, then you have the option of fetching a pre-zeroed page and partially filling it, or doing the partial fill into an uninitialised page and zeroing out the remainder, either in parallel or afterwards. The former avoids any zero-fill right now, but negates the benefits of having pre-zeroed the part that will be written over. That is bad if the page could have been used later, by something that involves more zero-filling.

Swapping benefits
=================
If I/O (or other data transfer) will update pages that are swapped out, all this memory-copying avoidance also has the advantage that the pages don't need to be read from disk before they're updated. The disk copies can simply be discarded.

Another benefit is that by sharing more data, everything requires less RAM and swap space. So as well as cutting down on memory copy times, you also reduce swap times.

The cost vs. benefit of maintaining pre-zeroed and free pages is a more difficult thing to quantify: it seems to be a good idea in some circumstances, and a bad one in others. Then there are decisions relating to which way to copy data, and when to use pre-zeroed pages for partial page fills. Someone will be able to find some reasonable heuristics to control all of this, I'm sure. One thing is clear: apart from the complexity of the implementation, this is unlikely to lead to worse performance than the current system under any reasonable circumstances.

Optimising competing strategies
===============================
There are costs and benefits to all of the various competing strategies. Delaying page copying, and thus avoiding much of it, seems to always be a good idea. Apart from the complexity, that is, but there is probably some clean way to model it. There is also some overhead in maintaining the more complex page data structures, and updating the mappings in all tasks sharing the pages. That also applies now when a shared page is swapped out, but will be worse with more sharing. Of course, with more sharing there is less swapping anyway.

You can delay adding page mappings to a task (though not unmapping or copy-on-write marking), which has its own benefits in that your statistics regarding a process' working set are more accurate (so you swap out the wrong pages less often), at the cost of more page fault processing. (There's something to be said for deliberately unmapping other, random pages from a task for this reason.) Again, costs vs. benefits. There is the decision as to whether to use a pre-zeroed page now, or to zero-fill part of a page either now or later. Not to mention the problem of how many free and pre-zeroed pages to aim for. Probably the best strategy is to maintain statistics estimating how effective each strategy would have been in the recent past, and use those to guide decisions when decisions need to be made. This is really a different problem though, and is tied in with optimising swapping strategies.

Finally
=======
The following silly shell command will go much, much faster: