I'd like to thank all the folks who emailed me with congratulations on the
third anniversary of Kernel Traffic. Even Slashdot picked it up, which was a nice
surprise.

Oddly enough, one of the biggest complaints I see about KT in its rare
moments in the limelight is that it isn't a complete summary of events on
the list. I think one of the Slashdot comments was that KT sometimes
leaves out important threads, while going into too much detail on irrelevant
side discussions.

As for leaving out important threads, that is certainly true. List
traffic has averaged 5.5 megs per week over the time I've been covering
it, and in my opinion almost all of that traffic is on topic. This does not
include the dozens of auxiliary mailing lists devoted to specific parts of the
kernel, or the never-ending IRC discussions in which much real development
takes place. Kernel Traffic, and other kernel-related publications, can
do little more than give the briefest flavor of what is going on in kernel
development. If you really want a thorough understanding, you have no choice:
you must subscribe to LKML and experience its tidal forces for yourself. Take
it from me: it is well worth it.

My goal with KT is to present the threads that most interest me personally.
And I am most interested in the way kernel development plays out as a
process. How are decisions made? Who is involved? What is free software
development? This development model was born with the Linux kernel. Before
then, although the sources may have been available under the GPL etc.,
the universally accepted wisdom was that high-quality software could only
be created by a small team of experts working for a long time in private,
putting out new releases only after many months or years of effort. This
method was as common in the GNU project as it was in proprietary software
companies. Linus cracked that idea wide open, and the essence of his
methods is now found in the organization of virtually every open source
software project out there. Even some commercial entities try to simulate
it in-house, with greater or lesser success.

These development processes are themselves still under development, and
I choose to search for them here, where it all began. Not all the threads I
cover focus on this desire, because the broad landscape of the kernel project
doesn't always reveal itself under tight focus. And some summaries are more
news-related than anything else, just presenting portions of the kernel as
they currently are.

I hope, above all, that people find Kernel Traffic enjoyable and interesting. If
its failings and insufficiencies provoke people to delve deeper into the
real kernel development forums, I count that as a complete success.

There's a user-level API and a couple of test programs in the attached
tarball. I haven't bothered with the vital security hash/signature thing
yet.

It all seems to work (i686 UP and SMP), but isn't without issues:

It leaks. How were you going to refcount the kernel
portions? Could they be attached to the VM mapping?
Would a lockfs be too expensive?

It doesn't have a timeout. Is there something like a
down_timeout() available?

I don't do the:

    if (kfs->user_address != fs)
        goto bad_sem;

because it doesn't seem to add anything, and prevents
putting these locks in a non-fixed file or SysV SHM
map.

Is that a problem?

To this last, Linus Torvalds replied that he'd suggested the test mainly as
another sanity check, one that wasn't strictly needed. Regarding the timeout
requirements, Linus said nothing existed for that as such in the kernel so far,
though in theory all the needed infrastructure would be provided eventually.
Finally, regarding the leaks, he said that attaching the refcounts to the VM
mappings would be an acceptable way to make sure all memory was freed at the
proper time. He added that he might
"also require
a flag at mmap time (MAP_SEMAPHORE - some other unixes have something like
it already) to tell the OS about the consistency issues that might come up
on some architectures (on x86 it would be a no-op)."
He and Matthew
exchanged a few words on how to implement the reference counting, and then
Linus said:

Note that there are other, potentially cleaner solutions. In particular,
some people like the "semaphore as file descriptor" approach, and I have to
say that I think they may be right. Then you just pass the file descriptor
along as the cookie, and you can do dup()/close() etc on it.

Mind trying that approach instead? It's not all that far off from
your current setup, and it would certainly have none of the security
implications..

After some off-list discussion, Matthew posted some code, and there
followed a technical discussion with Manfred Spraul, Matthew Kirkwood,
Alan Cox, and Rusty Russell.
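
For readers who want a feel for the "semaphore as file descriptor" idea,
here is a minimal user-space sketch. It is not the code Matthew posted. It
assumes a much later Linux facility, eventfd() with EFD_SEMAPHORE, which did
not exist at the time of this thread, and the fdsem_* names are invented
purely for illustration. It shows the descriptor acting as the cookie, and
a timed "down" built on poll().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create a counting semaphore; the returned fd is the only handle needed.
 * Returns -1 on failure. (Hypothetical helper name.) */
int fdsem_create(unsigned int initial)
{
    return eventfd(initial, EFD_SEMAPHORE | EFD_NONBLOCK);
}

/* "up": add one unit to the count. Returns 0 on success, -1 on error. */
int fdsem_up(int fd)
{
    uint64_t one = 1;

    assert(fd >= 0);
    return write(fd, &one, sizeof one) == (ssize_t)sizeof one ? 0 : -1;
}

/* "down" with a timeout in milliseconds (-1 waits forever).
 * Returns 0 once a unit has been taken, -1 on timeout or error. */
int fdsem_down_timeout(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    uint64_t got;

    assert(fd >= 0);
    for (;;) {
        int ready = poll(&pfd, 1, timeout_ms);
        if (ready <= 0)
            return -1;          /* 0 means timed out, -1 means poll failed */
        if (read(fd, &got, sizeof got) == (ssize_t)sizeof got)
            return 0;           /* EFD_SEMAPHORE: count dropped by one */
        if (errno != EAGAIN)
            return -1;
        /* Another holder of the fd took the unit first; wait again.
         * The timeout restarts here, which is acceptable for a sketch. */
    }
}
```

Because the handle is an ordinary descriptor, dup() yields a second reference
to the same count and close() drops one, so the kernel's usual descriptor
bookkeeping stands in for the reference counting discussed above.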

On the principle that success reports for a given patch will not result
in actual inclusion in the kernel sources unless first lauded on LKML,
Andre Hedrick forwarded some private praise from Rob Radez, regarding
Andre's ide.2.4.16.12102001 patch. Rob said,
"I'm
using your ide.2.4.16.12102001 patch with a Promise PDC20269 controller
and a Maxtor 160GB hard drive on 2.4.17, and I just wanted to tell you
that it's working great so far."
A lot of other people agreed that
Andre's code was working perfectly, and urged inclusion in 2.4 and 2.5;
at one point Andre remarked,
"I know the driver
is stable and effectively perfect in operations. So I do not understand the
total ignore I receive about it."
Elsewhere, Andrew Morton said,
"I spent a couple of hours beating the crap out of it, and
none actually came out. I'd vote for prompt inclusion in 2.5, and inclusion
in 2.4.x-pre1 when it's shown to be stable."
Oliver Xymoron put in,
"I vote for doing the reverse. The 2.4 codebase
is the more tested, the 2.5 is a forward-port. Given all the related block
changes still settling out in 2.5, changing IDE might make block layer/IDE
issues hard to sort out. Let's see it in the next 2.4.x-pre1."

Oleg Drokin posted a patch (originally by Andreas Dilger) against the
2.4 kernel, to reserve space for a volume label and UUID in the Reiserfs 3.6
superblock, and to generate a random UUID for volumes converted from 3.5 to 3.6
format by the kernel. He urged inclusion in the sources, but Chris Mason said,
"This should not be applied until an updated (non beta)
reiserfsprogs package that supports these features has been released."
Oleg felt there was no need to wait for outside support before applying
the patch. He said,
"when actual reiserfsprogs
and util-linux support will appear, people will just start to use these
features."
He also cautioned that bad things could happen if tools were released
that supported kernel features not yet implemented. He
added that Hans Reiser also felt the time had come for the patch.
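
As an aside, "generating a random UUID" amounts to very little code. The
sketch below is not taken from the patch. It is a user-space illustration
that assumes /dev/urandom as the randomness source and follows the usual
version 4 layout (16 random bytes with the version and variant bits forced).
The function names are made up for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Fill out[16] with a random version 4 UUID.
 * Returns 0 on success, -1 if the random source cannot be read. */
int random_uuid(unsigned char out[16])
{
    FILE *f;
    size_t n;

    assert(out != NULL);
    f = fopen("/dev/urandom", "rb");   /* assumed source of randomness */
    if (f == NULL)
        return -1;
    n = fread(out, 1, 16, f);
    fclose(f);
    if (n != 16)
        return -1;

    out[6] = (unsigned char)((out[6] & 0x0F) | 0x40);  /* version 4 */
    out[8] = (unsigned char)((out[8] & 0x3F) | 0x80);  /* variant 10xx */
    return 0;
}

/* Format a UUID in the usual 8-4-4-4-12 hex form; buf holds 37 bytes. */
void uuid_to_string(const unsigned char u[16], char buf[37])
{
    assert(u != NULL && buf != NULL);
    snprintf(buf, 37,
             "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
             "%02x%02x%02x%02x%02x%02x",
             u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
             u[8], u[9], u[10], u[11], u[12], u[13], u[14], u[15]);
}
```

A tool that merely preserves the 16 bytes does not need to understand any of
this, which is roughly the distinction Oleg and Chris argue over later in the
thread.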

Chris replied that applying the patch would force changes in the userland
tools, which as a matter of policy should never be done during a stable series. But he
went on,
"But, the progs are improving so quickly
that we should bend this rule a little bit. Another example is the unlink
truncate patch never should have been sent to Marcelo without a non-beta
reiserfsprogs that understood it. Neither should this patch (even though
it is a much smaller problem)."

Oleg pointed out that the patch would not force any changes on userland
programs, although
"if someone will update their
progs voluntarily, we cannot forbid them to! ;))".
Chris replied,
"The point is that we should never add something to
the kernel until our utils package understands it. Yes, this is a simple case,
but if we want to call reiserfs stable, there are some basic rules we need to
start following."
Oleg replied that actually, the latest reiserfsprogs
package did understand the new data organization, it just couldn't actually
change the content of the new fields itself. Chris, looking over the code,
didn't see how the tools were aware of the new design, or even what a UUID
was. Oleg said,
"It does not know about uuid per se,
but it know in that area some text data is stored."

At this point Oleg noticed,
"I see Marcello have
not applied this patch to 2.4.18-pre3, so we have some more time to prepare
reiserfsprogs ;)".
End of thread.

Adam J. Richter posted a large patch to clean up the SCSI modules in 2.5, and
said he'd post smaller incremental patches for Linus Torvalds unless there were
objections. Alan Cox replied:

I specifically told people not to hack on the old NCR5380 driver. You've
taken a semi broken driver, destroyed it completely and risked disk corruption
for anyone who uses it.

What really annoys me is that I've already asked you specifically not
to submit patches to that driver but to take the 2.4.18pre version of the
driver and port that one forward if you must fiddle with it. Instead you've
wasted your time, and tried to make the future merge harder.

Its absolutely obvious from the changes that you have no grasp how the
locking in that driver is handled, nor what it depends upon. If you had
understood that locking you'd have realised you were hacking on a driver
version that was totally flawed.

How many other maintainers have you ignored trying to send in untested
patches to their drivers ?

Adam objected that he'd never received such instructions from Alan. Going
over his email archive, he couldn't find any email like the one Alan described.
But he added,
"Now that I am aware of your
request regarding using the 2.4.18pre version of the NCR driver for future
maintenance of the 2.5 driver, I am happy to follow it."

Alan apologized for confusing him with someone else, and suggested,
"I think you'll find it"
(the 2.4 code)
"a lot easier to follow too. The thing to watch is that the
queue of devices to process on an IRQ is not per host but driver global. The
rest should be obvious, but watch the co-routine locking. If you get that
wrong the driver does occasionally recurse down the stack and explode
mysteriously."
End of thread.

Brian pointed out that CIPE was itself GPLed, and asked,
"I remember reading on l-k a few times some stuff about GPLONLY_
but I have no idea what to do now that I've run into whatever the problem
is that is caused by this?"
Alan Cox instructed:

Add

MODULE_LICENSE("GPL");

to the cipe code and all will be well

Brian tried this with complete success, and Olaf Titz pointed out that
the fix had already made it into the CIPE CVS tree.
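
For anyone puzzled by why one line makes the difference, here is a toy
user-space model of the idea behind GPL-only symbols: some exported symbols
are flagged, and the loader only resolves those for modules that declare a
GPL-compatible license. Everything here (the toy_* names, the symbol table,
the two accepted license strings) is invented for illustration and greatly
simplified. It is not the kernel's or modutils' actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_symbol {
    const char *name;
    int gpl_only;          /* 1 if restricted to GPL-compatible modules */
};

struct toy_module {
    const char *name;
    const char *license;   /* NULL when no license was declared */
};

/* A made-up export table with one ordinary and one GPL-only symbol. */
static const struct toy_symbol toy_symtab[] = {
    { "toy_plain_helper", 0 },
    { "toy_gpl_helper",   1 },
};

/* Simplified check: only two license strings are recognized here. */
int toy_license_is_gpl_compatible(const char *license)
{
    return license != NULL &&
           (strcmp(license, "GPL") == 0 || strcmp(license, "GPL v2") == 0);
}

/* Returns 1 if mod may link against symbol, 0 if the symbol is unknown
 * or is GPL-only and the module has not declared a compatible license. */
int toy_can_resolve(const struct toy_module *mod, const char *symbol)
{
    size_t i;

    assert(mod != NULL && symbol != NULL);
    for (i = 0; i < sizeof toy_symtab / sizeof toy_symtab[0]; i++) {
        if (strcmp(toy_symtab[i].name, symbol) != 0)
            continue;
        if (!toy_symtab[i].gpl_only)
            return 1;
        return toy_license_is_gpl_compatible(mod->license);
    }
    return 0;
}
```

In this model a module with no declared license is treated like a
non-GPL one, which is why GPLed code can still trip over the check until it
says so explicitly.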

Andreas Haumer reported,
"I'm seeing a problem
with SMP Linux-2.2.20 on an ASUS CUR-DLS motherboard. I noticed there were
similar reports in the past few months and I got the impression the problem
should already be fixed in 2.2.20, but seemingly it isn't."
Benjamin
LaHaise said this was fixed in 2.4, and Andreas asked if there would be a
back-port into 2.2.21; Benjamin replied,
"That's
unlikely: the improvements in smp locking are what 2.4 was all about, so
"backporting" them is basically reinventing 2.4."
And Alan Cox also said
to Andreas,
"2.2 does not support VIA SMP, its probably
not a good kernel to choose for the buggy VIA chipsets either."

People keep bugging me about the -ac tree stuff so this is whats in my
current internal diff with the ll patch and the ide changes excluded.

Much of this is stuff just waiting to go to Marcelo but it has the 32bit
uid quota that some folks consider pretty critical and the rmap-11b VM which
I consider pretty essential

(Marcelo I'll be sending you stuff I've done from this anyway, if there
is other stuff you want extracting just ask)

Adam Kropelin reported,
"For the sake of
completeness I ran my large inbound FTP transfer test (details in the
"Writeout in recent kernels..." thread) on this release. Performance and
observed writeout behavior was essentially the same as for 2.4.17, both stock
and with -rmap11a. Transfer time was 6:56 and writeout was uneven. 2.4.13-ac7
is still the winner by a significant margin."
Alan replied,
"That is very useful information actually. That does rather
imply that some of the performance hit came from the block I/O elevator
differences in the old ac tree (the ones Linus hated ;)). Now the question
(and part of the reason Linus didnt like them) - is why ?"
Benjamin
LaHaise said,
"Iirc, Linus just didn't like
the low/high watermarks for starting & stopping io. Personally, I liked
it and wanted to use that mechanism for deciding when to submit additional
blocks from the buffer cache for the device (it provides a nice means
of encouraging batching). The problem that started this whole mess was a
combination of the missing wake_up in the block layer that I found, plus the
horrendous io latency that we hit with a long io queue and no priorities.
The critical pages for swap in and program loading, as well as background
write outs need to have a priority boost so that interactive feel is better.
Of course, with quite a few improvements in when we wait on ios going into
the vm between 2.4.7 and 2.4.17, we don't wait as indiscriminately on io
as we did back then. But write out latency can still harm us. In effect,
it is a latency vs thruput tradeoff."

Robert Love announced an update to allow his preemptive-kernel patch to
be used with Ingo Molnar's O(1) scheduler. Several folks pounded on it,
and William Lee Irwin III said,
"I have
at least run it on my laptop, together with rmap even. No pathological
behavior that I can tell. Of course, the interactive response is wonderful,
but I haven't precisely measured anything, as I have enough other things to
measure precisely it's a bit far afield."

The new scheduler holds IRQs off across the call to context_switch.
UML's _switch_to expects them to be enabled when it is called, and things
go badly wrong when they are not.

Because UML has a host process for each UML thread, SIGIO needs to be
forwarded from one process to the next during a context switch. A SIGIO
arriving during the window between the disabling of IRQs and forwarding of
IRQs to the next process will be trapped on the process going out of context.
This happens fairly regularly and causes hangs because some process is waiting
for disk IO which never arrives because the process that was notified of
the completion is switched out.

So, is it possible to enable IRQs across the call to _switch_to?

Davide Libenzi posted a fix which seemed to work, but Ingo Molnar pointed
out that it was broken for SMP systems. Elsewhere, Ingo also replied to
Jeff's request, saying:

unfortunately this cannot be done, due to exit(), ptrace() and other SMP
races. On SMP, the 'previous' task is protected by the runqueue lock. If we
do the context switch outside the runqueue lock then a task might be freed
on another CPU while it's in fact still in use.

there are other heavy implications as well:

current->processor is no longer valid from IRQ handlers.

a CPU might execute the 'previous' task before we have switched away
from it. (nothing but the runqueue lock keeps the load balancer from taking
the task from the runqueue.)

in 2.4 i've implemented irq-enabled context switches, and it was a major
PITA. To do it correctly one has to do reintroduce __schedule_tail() and do a
task_lock/task_unlock to get context-switch atomicity via other means than the
local runqueue lock. On 2.4 i did this because global runqueue contention was
such an issue for certain workloads that even the task-unlocking overhead was
worth it. With the O(1) scheduler this is pretty much out of the question.

we could enable interrupts on UP - because UP is special, disabling
interrupts there is in essence a cheap 'global interrupt lock'. But that
doesnt help the SMP/UML situation much.

i'd suggest to find some other solution for UML, besides signals.
__switch_to is a very internal function that can very well be called with
spinlocks disabled, we just cannot guarantee that it will be called with
irqs enabled. Signals are something that is often 'heavy', it cannot be done
atomically in the generic case.

Jeff replied:

You suggest implementing interrupts with something other than signals?
What else is there?

In any case, I stuck a little kludge in _switch_to which checks for
pending SIGIO and, if there is one, hits the incoming process with a SIGIO.
This seems to do the trick.
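
The shape of that kludge can be sketched in ordinary user-space C. This is
not UML's code. It only assumes what Jeff describes, that each thread is a
host process and that SIGIO is blocked while the switch is in progress, and
the function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if a blocked SIGIO is pending for the caller, 0 if not,
 * -1 if the pending set could not be read. */
int sigio_pending(void)
{
    sigset_t pending;

    if (sigpending(&pending) != 0)
        return -1;
    return sigismember(&pending, SIGIO);
}

/* Called on the outgoing side of a context switch: if a SIGIO arrived
 * while signals were held off, pass one on to the incoming process.
 * Returns 1 if forwarded, 0 if nothing was pending, -1 on error. */
int forward_pending_sigio(pid_t next)
{
    int pending;

    assert(next > 0);
    pending = sigio_pending();
    if (pending <= 0)
        return pending;
    return kill(next, SIGIO) == 0 ? 1 : -1;
}
```

Note that the sketch leaves the original signal pending on the outgoing
process. Whether that matters depends on details of UML that the thread does
not go into.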