In the course of discussion, Linus Torvalds made some comments (quoted
below) about his requirements for a config system. In the middle of the
thread, Peter Samuelson had posted a patch against 2.4.20-pre1, to fix up the
kconfig language, which, as Greg Banks said in his reply, had been held together with spit and string. But Greg felt
Peter's changes might be too invasive for the 2.4 tree even if the patch was
great. Peter replied that his changes were trivial
enough, and easy enough to test, that I think it could go in 2.4, yes.
Obviously xconfig would need to be dealt with in sync with the others,
which I'm not doing during the prototyping / idea-mongering stage.
But Greg said, I think you're underestimating the
Gordian knot that is the CML1 corpus.

The patch itself (among other things) made the need for '$' in front of
dependency names optional instead of required. At one point Peter explained,
The main motivation for dropping the '$' was
to make possible the "" == "n" semantics. To this, Greg said, Changing the existing semantics, regardless of how broken we
all agree they are, is asking for a world of trouble. He went on:

But at this point in the menu tree for 14 of 17 architectures, CONFIG_SCSI
has not yet been defined. The result is that CONFIG_BLK_DEV_IDESCSI only
works in "make config" and "make allyesconfig" precisely because of the
semantic that you wish to change.

Peter said this was a bug in his code, and if Greg posted a list of all the
ones he found, Peter would go through and patch them. Greg said there were
thousands of occurrences, spread throughout 17 architectures. Sam Ravnborg
suggested, How about extending the language (side
effect) to automagically append (EXPERIMENTAL) or (OBSOLETE) to the menu line,
if dependent on those special tags?

Peter and Greg both agreed this was a good idea, and Greg pointed out that
CML2 had used that implementation as well. But Greg added, The trouble is actually achieving that in shell-based parsers where
shell code cannot tell whether $CONFIG_EXPERIMENTAL has been used in a
condition. Sam said, Remembering the CML2 war,
there were no serious objections about
shifting away from shell based parsers (but certainly a lot about the
alternative selected). He asked, Where comes
the requirement that we shall keep the existing shell
based config parsers? And Linus replied:

I use them exclusively.

It is far and away the most convenient parsing - just to do "make oldconfig"
(possibly by making changes by hand to the .config file first).

As far as I'm personally concerned, the shell parsers are the _only_
parser that really matter. So if you want to replace them with something else,
that something else had better be pretty much perfect and not take all that
long to build.

PCI, Random Number Generation, SMP, USB, Real-Time

Oliver Xymoron announced:

I've done an analysis of entropy collection and accounting in current
Linux kernels and found some major weaknesses and bugs. As entropy
accounting is only one part of the security of the random number
device, it's unlikely that these flaws are compromisable; nonetheless
it makes sense to fix them.

Broken analysis of entropy distribution

Spoofable delta model

Interrupt timing independence

Ignoring time scale of entropy sources

Confusion of unpredictable and merely complex sources and trusting the
latter

Broken pool transfers

Entropy pool can be overrun with untrusted data

Net effect: a typical box will claim to generate 2-5 _orders of magnitude_
more entropy than it actually does.

Note that entropy accounting is mostly useful for things like the
generation of large public key pairs where the number of bits of
entropy in the key is comparable to the size of the PRNG's internal
state. For most purposes, /dev/urandom is still more than strong
enough to make attacking a cipher directly more productive than
attacking the PRNG.

The following patches against 2.5.31 have been tested on x86, but
should compile elsewhere just fine.

I've tried to cover some of the issues in detail below:

Broken analysis of entropy distribution

I know the topic of entropy is rather poorly understood, so here are a couple
of useful pieces of background for kernel folks:

Mathematically defining entropy

For a uniform distribution of n bits of data, the entropy is
n. Anything other than a uniform distribution has less than n bits of
entropy.

Non-Uniform Distribution Of Timing

Unfortunately, our sample source is far from uniform. For starters, each
interrupt has a finite time associated with it - the interrupt latency.
Back to back interrupts will result in samples that are periodically
spaced by a fixed interval.

A priori, we might expect a typical interrupt to be a Poisson
process, resulting in a gamma-like distribution. It would also have
zero probability up to some minimum latency, have a peak at minimum
latency representing the likelihood of back-to-back interrupts, a
smooth hump around the average interrupt rate, and an infinite tail.

Not surprisingly, this distribution has less entropy in it than a
uniform distribution would. Linux takes the approach of assuming the
distribution is "scale invariant" (which is true for exponential
distributions and approximately true for the tails of gamma
distributions) and that the amount of entropy in a sample is in
relation to the number of bits in a given interrupt delta.

Assuming the interrupt actually has a nice gamma-like distribution
(which is unlikely in practice), then this is indeed true. The
trouble is that Linux assumes that if a delta is 13 bits, it contains
12 bits of actual entropy. A moment of thought will reveal that
binary numbers of the form 1xxxx can contain at most 4 bits of
entropy - it's a tautology that all binary numbers start with 1 when
you take off the leading zeros. This is actually a degenerate case of
Benford's Law (http://mathworld.wolfram.com/BenfordsLaw.html), which
governs the distribution of leading digits in scale invariant
distributions.

What we're concerned with is the entropy contained in digits
following the leading 1, which we can derive with a simple extension
of Benford's Law (and some Python):
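A minimal sketch of such a calculation (an illustration, not Oliver's actual script, assuming the simplest log-uniform model for a delta of known bit-length; Oliver's own estimate below is more pessimistic still):

from math import log2

def entropy(probs):
    # Shannon entropy in bits: H = -sum(p * log2(p))
    return -sum(p * log2(p) for p in probs if p)

def benford(lo, hi):
    # P(value == x) for integer x in [lo, hi) under a scale-invariant
    # (log-uniform) distribution - the general form of Benford's Law
    return [log2(x + 1) - log2(x) for x in range(lo, hi)]

# A 13-bit delta always starts with a 1 bit, so only its 12 trailing
# bits vary - and even under this mild model they carry less than the
# full 12 bits.
print(entropy(benford(2 ** 12, 2 ** 13)))

# The per-bit-length credits only need computing once, giving the
# lookup table mentioned below.
credit = [entropy(benford(2 ** (b - 1), 2 ** b)) for b in range(1, 14)]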

As it turns out, our 13-bit number has at most 9 bits of entropy, and
as we'll see in a bit, probably significantly less.

All that said, this is easily dealt with by lookup table.

Interrupt Timing Independence

Linux's entropy estimate also wrongly assumes independence of different
interrupt sources. While SMP complicates the matter, this is
generally not the case. Low-priority interrupts must wait on
high-priority ones, and back-to-back interrupts on shared lines will
serialize themselves ABABABAB. Further, system-wide CLI, cache flushes
and the like will skew -all- the timings and cause them to bunch up
in predictable fashion.

Furthermore, all this is observable from userspace in the same way
that worst-case latency is measured.

To protect against back to back measurements and userspace
observation, we insist that at least one context switch has occurred
since we last sampled before we trust a sample.
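A sketch of that rule (a hypothetical helper, not the actual patch): track the system's context-switch count and decline to credit any sample that isn't separated from the previous one by at least one switch.

last_nctxt = None

def sample_is_creditable(nctxt):
    # nctxt: a monotonically increasing count of context switches,
    # which the kernel already tracks.  Credit the sample only if at
    # least one switch separates it from the previous sample.
    global last_nctxt
    ok = last_nctxt is None or nctxt > last_nctxt
    last_nctxt = nctxt
    return ok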

Questionable Sources and Time Scales

Due to the vagaries of computer architecture, things like keyboard
and mouse interrupts occur on their respective scanning or serial
clock edges, and are clocked relatively slowly. Worse, devices like
USB keyboards, mice, and disks tend to share interrupts and probably
line up on USB clock boundaries. Even PCI interrupts have a
granularity on the order of 33MHz (or worse, depending on the
particular adapter), which, when timed by a fast processor's 2GHz
clock, makes the low six bits of the timing measurement predictable.

And as far as I can find, no one's tried to make a good model or
estimate of actual keyboard or mouse entropy. Randomness caused by
disk drive platter turbulence has actually been measured and is on
the order of 100bits/minute and is well correlated on timescales of
seconds - we're likely way overestimating it.

We can deal with this by having each trusted source declare its clock
resolution and removing extra timing resolution bits when we make samples.
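As a sketch of the idea (names hypothetical): if samples are stamped by a fast TSC but the source is clocked far more slowly, the timestamp bits below the source's own granularity say nothing about the event and can simply be shifted away.

def usable_delta(delta_cycles, tsc_hz, source_hz):
    # Bits finer than the source's clock period reflect the clock
    # ratio, not the event, so drop them before estimating entropy.
    excess = max(0, (tsc_hz // source_hz).bit_length() - 1)
    return delta_cycles >> excess

# A 33MHz PCI-clocked device timed by a 2GHz TSC: the low five or six
# bits of every delta are predictable and get discarded here.
print(usable_delta(6844, 2_000_000_000, 33_000_000))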

Trusting Predictable or Measurable Sources

What entropy can be measured from disk timings is very often leaked
by immediately relaying data to web, shell, or X clients. Further,
patterns of drive head movement can be remotely controlled by clients
talking to file and web servers. Thus, while disk timing might be an
attractive source of entropy, it can't be used in a typical server
environment without great caution.

Complexity of analyzing timing sources should not be confused with
unpredictability. Disk caching has no entropy, disk head movement has
entropy only to the extent that it creates turbulence. Network
traffic is potentially completely observable.

(Incidentally, tricks like Matt Blaze's truerand clock drift
technique probably don't work on most PCs these days as the
"realtime" clock source is often derived directly from the
bus/PCI/memory/CPU clock.)

If we're careful, we can still use these timings to seed our RNG, as
long as we don't account them as entropy.

Batching

Samples to be mixed are batched into a 256 element ring
buffer. Because this ring isn't allowed to wrap, it's dangerous to
store untrusted samples as they might flood out trusted ones.

We can allow untrusted data to be safely added to the pool by XORing
new samples in rather than copying and allowing the pool to wrap
around. As non-random data won't be correlated with random data, this
mixing won't destroy any entropy.
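A sketch of that mixing rule (a simplified illustration, not the patch itself):

def mix_sample(ring, head, sample):
    # XOR the new sample over whatever already occupies the slot.
    # Uncorrelated junk XORed over a trusted sample cannot reduce its
    # entropy, so the ring may now wrap safely - untrusted data is
    # mixed in but never credited.
    ring[head] ^= sample
    return (head + 1) % len(ring)

ring, head = [0] * 256, 0
for s in (0x1234, 0xBEEF, 0xCAFE):    # e.g. untrusted packet timings
    head = mix_sample(ring, head, s)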

Broken Pool Transfers

Worst of all, the accounting of entropy transfers between the
primary and secondary pools has been broken for quite some time and
produces thousands of bits of entropy out of thin air.

Linus Torvalds was skeptical. To Oliver's claim of two to five orders of
magnitude more entropy, Linus replied:

On the other hand, if you are _too_ anal you won't consider _anything_
"truly random", and /dev/random becomes practically useless on things that
don't have special randomness hardware.

To me it sounds from your description that you may well be on the edge
of "too anal". Real life _has_ to be taken into account, and not accepting
entropy because of theoretical issues is _not_ a good idea.

Quite frankly, I'd rather have a usable /dev/random than one that runs
out so quickly that it's unreasonable to use it for things like generating
4096-bit host keys for sshd etc.

In particular, if a machine needs to generate a strong random number,
and /dev/random cannot give that more than once per day because it refuses
to use things like bits from the TSC on network packets, then /dev/random
is no longer practically useful.

Theory is theory, practice is practice. And theory should be used to
_analyze_ practice, but should never EVER be the overriding concern.

So please also do a writeup on whether your patches are _practical_. I
will not apply them otherwise.

Oliver replied, My box has been up for about
the time it's taken to write this email and it's already got a full entropy
pool. A 4096-bit public key has significantly less than that many bits of
entropy in it (primes thin out in approximate proportion to log2(n)).
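(A back-of-the-envelope check on that parenthetical, under the Prime Number Theorem: roughly one integer in ln(n) near n is prime, so a uniformly chosen 4096-bit prime carries about log2(4096 * ln 2), or roughly 11.5, fewer bits than a uniform 4096-bit number; a real key, typically built from two smaller primes plus fixed structure, carries fewer still.)

from math import log, log2

# Prime Number Theorem: the density of primes near n is ~1/ln(n), so
# there are about 2**4096 / ln(2**4096) primes below 2**4096.
print(4096 - log2(4096 * log(2)))   # ~4084.5 bits of entropy, not 4096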
He went on:

Let me clarify that 2-5 orders thing. The kernel trusts about 10 times
as many samples as it should, and overestimates each sample's entropy by
about a factor of 10 (on x86 with TSC) or 1.3 (using 1kHz jiffies).

The 5 orders comes in when the pool is exhausted and the pool xfer function
magically manufactures 1024 bits or more the next time an entropy bit (or
.1 or 0 entropy bits, see above) comes in.

He concluded, The patches will be a nuisance
for anyone who's currently using /dev/random to generate session keys on
busy SSL servers. But [...] with the
old code, they were fooling themselves anyway. /dev/urandom is appropriate
for such applications, and this patch allows giving it more data without
sacrificing /dev/random.

Linus took a look at the code, and said, No,
it appears to be a nuisance even for people who have real issues, ie just
generating _occasional_ numbers on machines that just don't happen to run
much user space programs. He said that Oliver's code threw out a
lot of entropy sources that should have been kept. He spent another twenty
minutes looking at the code and replied to himself, saying:

Hmm.. After more reading, it looks like (if I understood correctly),
that since network activity isn't considered trusted -at-all-, your average
router / firewall / xxx box will not _ever_ get any output from /dev/random
what-so-ever. Quite regardless of the context switch issue, since that
only triggers for trusted sources. So it was even more draconian than I
expected.

Are you seriously trying to say that a TSC running at a gigahertz cannot
be considered to contain any random information just because you think you
can time the network activity so well from the outside?

Oliver, I really think this patch (which otherwise looks perfectly fine)
is just unrealistic. There are _real_ reasons why a firewall box (ie one
that probably comes with a flash memory disk, and runs a small web-server
for configuration) would want to have strong random numbers (exactly for
things like generating host keys when asked to by the sysadmin), yet you
seem to say that such a user would have to use /dev/urandom.

If I read the patch correctly, you give such a box _zero_ "trusted"
sources of randomness, and thus zero bits of information in /dev/random.
It obviously won't have a keyboard or anything like that.

This is ludicrous.

Alan Cox interjected:

The current policy has always been not to trust events that are
precisely externally controllable. Oliver hasn't changed the network
policy there at all.

It's probably true there are low bits of randomness available from such
sources providing we know the machine has a tsc, unless the I/O APIC is
clocked at a divider of the processor clock in which case our current
behaviour is probably much saner.

Oliver also replied to Linus, saying Linus' points were not false, but
that anyone who had the problem of zero trusted sources of entropy on their
system with his patch would have had the same problem before. His patch
only made that explicit. But Linus said:

Be realistic. This is what I ask of you. We want _real_world_ security, not
a completely made-up-example-for-the-NSA-that-is-useless-to-anybody-else.

All your arguments seem to boil down to "people shouldn't use /dev/random
at all, they should use /dev/urandom".

Which is just ridiculous.

But elsewhere, he qualified, I suspect that
Oliver is 100% correct in that the current code is just _too_ trusting. And
parts of his patches seem to be in the "obviously good" category.

SCTP, Henning P. Schmiedehausen

Jack Bloch asked if there were any plans to do an SCTP (Stream
Control Transmission Protocol) implementation as described in RFC 2960 under Linux. David S.
Miller replied, It's done, I'm going to merge it
in the next week or so into 2.5.x. Search the list archives for the SCTP project
site as I don't have the URL handy. Henning P. Schmiedehausen gave
a link to http://www.sctp.de/, and Philipp
Matthias gave a link to the SCTP
project page on Sourceforge.

Hyperthreading, SMP, Ingo Molnar

Timothy A Reed asked what kernel configuration options were needed in order
to make use of hyperthreading. James Bourne replied:

As long as you have a P4 and use the P4 support you will get hyperthreading
with 2.4.19 (CONFIG_MPENTIUM4=y). With 2.4.18 you also have to turn it on
with the lilo option acpismp=force on the kernel command line.

Hyperthreading will give you some performance boosts, but *only* if
you have many runnable processes a majority of the time, or very heavily
threaded applications running on the system (an example would be running
4 setiathome clients on a dual processor system).

Hugh Dickins added, You do need CONFIG_SMP and a
processor capable of HyperThreading, i.e. Pentium 4 XEON; but CONFIG_MPENTIUM4
is not necessary for HT, just appropriate to that processor in other
ways. James said he'd been under the impression that the P4 XEON
was the only processor capable of hyperthreading. Kelsey Hudson replied,
This is currently correct, although I believe
Intel has plans to release a Hyperthreading-capable version of its desktop
P4. There followed some speculation that other processors were capable
of hyperthreading, but had it disabled. Alan Cox remarked at one point:

If you want to know the full HT capabilities of the processor you need
to read cpuid 1 and check ebx bits 16-23.

There has been some interesting speculation as to whether you can enable
HT by undocumented mtrrs on cpus that have "ht" but claim not to be doing
HT. Clearly the value returned is settable somewhere but I've seen no proof
yet that you can enable HT on non-PIV Xeons this way.

Turning to ALSA, Christoph Hellwig asked, Any chance
you could stop that BK megachangeset and instead do one changeset per cvs
commit? Jaroslav Kysela replied, I'll do more
frequent syncing with the kernel tree in the future (I assume per week),
but creating changesets per CVS commit is overkill from the maintenance
point of view. Everybody interested in ALSA development might watch our CVSLOG
mailing list (archived) or use our CVS.

SMP

David Howells posted a patch and said, Here's a
patch to stop multiple simultaneous oopses on an SMP system from interleaving
with and overwriting bits of each other. It only permits lock breaking if
the printk lock is held by the same CPU. Benjamin LaHaise objected,
This is still wrong. It should attempt to
acquire the locks with a timeout before trampling on them, as there may
be a printk or other console output in progress on the other cpu.
But he thought better of it a few minutes later, and said instead, The patch is actually right, but bust_spinlocks still
blindly stomps on locks that may not need to be stomped on.

POSIX, Shlomi Fish

Matthew Wilcox posted a patch and explained, Shlomi Fish asked about including first-come, first-served style
locking for posix and flock locks. After some back-and-forth, we came up
with the following patch which seems unintrusive enough to bother including.
Personally, I doubt the utility of this, but someone might have an application
for it, and the code's already written.

Kernel Build System

Chris Friesen said, I noticed the other day that
on a kernel compile, the timestamps of some files are changed. The funny thing
is that all the changed ones are header files, but not all header files are
modified. Is this expected behaviour? Sam Ravnborg replied, I assume you are compiling a 2.4 kernel, in which case
this is expected behaviour. For the 2.5 kernel kbuild has been changed such
that header files are no longer 'touched' during the compile process.
And George Anzinger added that in the 2.4 case, it
has to do with how dependencies are propagated from header file to header file
(i.e. where a header file includes another).

Maintainership, Russell King

Someone asked about the serial driver. The maintainer hadn't updated the
web page in a long time, and the poster had some hardware he/she wanted to
support. But he/she didn't want to do the work unless there was some chance
it might be accepted into the driver. Without a maintainer, that looked
doubtful. Stuart MacDonald replied, Ted doesn't
seem to be maintaining it anymore. If you look in the linux-kernel archive
you'll find that Russell King is doing a rewrite for 2.5/6 anyway.
He added, Update the driver, make a patch and
send it to the list. If it's good likely it will be included. You may want
to check out linux-serial also.

Bug Tracking, FS: ext3

Stephen Tweedie posted a patch and explained:

Ext3's internal debugging has always assumed that it was illegal for
there to be parallel IO on a buffer-head which it is trying to modify.
That's reasonable --- if there is an IO collision, we end up with IOs hitting
disk out-of-order wrt the journal, so we lose recovery guarantees.

However, there are two cases where the test is a little over-zealous.
If user space is performing inherently non-transactional writes (eg. tune2fs
adding a label to a live filesystem and writing to the buffered device
superblock location) then we can hit the ext3 assertion.

More seriously, since 2.4.11 the page cache can lock a buffer_head for
read even if the bh is already under journal control. The tune2fs bug is
very rare: there have been no reports of it in Bugzilla or ext3-users lists,
and only one on 2.5 on linux-kernel. But now, a dump(8) on a live filesystem
can also give rise to the same condition, and in testing, dump + fs activity
reproduces the assertion-failure VERY rapidly.

This patch changes the jbd get-write-access code to take the buffer_head
lock before testing the uptodate and dirty state of a bh, and relaxes the
handling of unexpectedly-dirty buffers to be a printk warning, not a fatal
error. The lock will cure the dump(8) interaction, and the warning means
that we will still spot out-of-order writes, while not taking the whole
kernel down if we collide with a tune2fs(8).

The patch also removes a small potential hole in the recovery guarantees.
It is not safe for a transaction to steal buffers from checkpoint mode
until after that transaction has committed. Otherwise, a reboot at the
wrong moment might find the old copy of the buffer in the journal had been
removed from the recovery set before the new copy was written.

Corey Minyard announced:

I've split up the driver, creating working 2.4.19 and 2.5.31 versions
of the driver (and even tested them!) and split the emulation code into a
separate patch.

I also noticed that 2.5.31 timer interrupts occur at 1ms instead of 10ms,
so it should provide acceptable speed without high-res timers. 2.4 without
high-res timers or interrupts will still be very slow.

I have not yet tested interrupts, because I don't have a card that supports
them (it's on its way). However, that's pretty straightforward.

Please, try it out and tell me what you think. Again, I'm shooting for
getting this in the mainstream kernel.

FS, Alexander Viro

Vincent Hanquez wanted to submit a documentation patch for some filesystem
code, and asked who the current maintainer was. Alan Cox said, Generally the maintainer of the code the documentation covers,
or the author on the file. If you aren't sure send it to the list.
Someone else said that for filesystem docs, Alexander Viro was the place to
send patches.

Virtual Memory, Marcelo Tosatti, Alan Cox, Rik van Riel

Federico Di Gregorio posted a patch and announced:

this is my first try at a kernel patch, i hope i am doing everything right;
if not, please just tell me. (i sent this patch to both the drm maintainer and
the linux-kernel ML. should i send 2.4 patches directly to marcelo? mm..)

anyway, this is just a backport of the 2.5 DRM driver for Intel 830M to
the 2.4 series. It is against 2.4.19 but, consisting only of added files
it should work clean on later kernels (tested on 2.4.20pre). The patch is
quite big (67252 bytes) and can be downloaded from:

Christoph Hellwig replied:

Please don't do this. The 2.5 drm code is a piece of shit and even
crappier than the one in 2.4.

Alan, is there any chance you could send marcelo the -ac drm code?

Alan Cox invited Christoph to untangle the drm code from its rmap macro
dependencies and send it to Marcelo Tosatti himself. But Rik van Riel said that
those dependencies had been merged into 2.4 months before. Christoph said:

I've uploaded a patch that updates the mainline drm code to -ac, fixes
all compiler warnings and removes the remaining LINUX_VERSION_CODE checks
after most have already been removed in -ac.

Elsewhere, Larry McVoy announced some changes to the BitKeeper license:

No, we're not GPLing it but we are making a few adjustments and wanted
to make sure that it was an improvement, not a regression, in the eyes of
the free users. Sorry for the intrusion, I'll be as brief as possible.

3(a) Propagation to openlogging.org. The old license insisted that
you log your changes within 7 days; several people pointed out that
they are spending their dotcom dollars sitting on an island hacking the
kernel and they may not have connectivity every 7 days. Or something.
We upped the limit to 21 days, that should be enough, I have to believe
that you check your mail every three weeks if you are doing work.

3(c) Maintaining Open Source. Our intent was that the free use of
BitKeeper was for the purpose of helping the open source community;
it was not to provide commercial users a free product. We have had
a number of cases where managers up to VPs have told their engineers
"just don't put anything useful in the checkin comments and then we can
use it for free". Not what we had in mind. So we're adding a clause
which says that we reserve the right to insist that you make your
repositories available on a public port within 15 days of the request.

We understand that lots of legit open source users have very good
reasons for not wanting their changes made public, e.g., they are working
on a security fix. We are absolutely not going to ask these sorts of
repositories be forced out in the open and if you are concerned about
that we can work out some sort of written agreement to that effect.
We're very much committed to supporting open source development, in
particular the Linux kernel and even more specifically Linus, he's a
critical resource.

The only people we're going after are those people who are clearly
not part of the open source community. We thought about saying we
would only enforce this if they were working on source which did not
have an open source license and rejected it for the following reason:
there are commercial companies working on open source, using BitKeeper
to do so, and not sharing their changes for as long as they can to get a
competitive edge in the marketplace. There is nothing wrong with that
under the terms of the GPL, but we don't have to support what we view
as commercial activity for free. Open means open, it's about sharing,
not money, in our opinion.

It's a hard nut to crack, you can't just say "it's free if you are
doing everything out in the open" because there are legit reasons
for hiding. There are also commercial reasons for hiding, and our view is
that if that is what you are doing, you should be paying for the tools.
BK is free as a way to help people help each other.

4.4. Remove the $20,000 support clause. We had a clause that said that we
could shut you down if you cost us more than $20K in support. This was
a widely hated clause and we're aware of that. It was there as a way to
try and shut down those people who were really commercial. Since the
previous change will effectively do that, we don't need this clause.
That removes the fear that we'll shut down bkbits or the kernel's use
of BK.

That's it on the licensing stuff. Since I'm here, here's some BK
status.

We're in discussions with a very Linux friendly hosting service (4000
Linux servers hosted) to move bkbits.net and openlogging.org to their site
in exchange for BK licenses. This should make the bkbits.net service have
more bandwidth and the benefit of an extremely well connected and well run
hosting environment. We don't need the bandwidth, BK is super stingy with
bandwidth, but it's cool to have bkbits.net in an air conditioned, UPSed,
multi peered environment instead of my office. We're psyched about this,
it's a good thing.

We're working on closing the first commercial deal which we can tie to the
use of BK by the kernel team. If this actually happens, I'm going to take
$25K of the deal and "give" it to Linus as "BK bucks" which he can spend.
What that means is that he has $25K to spend on BK features that he wants.
This is above and beyond stuff that we're doing already, it's a way to give
him the power to insist that we do some work that we wouldn't do otherwise.
In general, we'd like to make a policy of doing this sort of thing. To date,
we can't credit the open source use of BK with any commercial business.
If that changes, that's good for us but it should also be good for the
kernel.

David Parsons said of item 3(c), This addendum
is somewhat [1] annoying, because I switched over to BK for _everything_
a couple of years ago and now I've got a moderately large body of stuff that
is NOT open source (my resume, my dns, little proofs of concept projects that
I did for people. I've not made one red penny off any of this [particularly
since the economy has gone south and put me out of work for the past year.
But I'm still not opensourcing my resume.]) that's under bitkeeper. If I
upgrade to a bk that uses the new license, then I get to play the exciting
game of ``break the new license and defraud my former employers'' [2], which
is about as appealing as Linus's alternative approach to resolving software
patent issues.

Sam Ravnborg had no comments about the license change, but did say:

I have a feature request. The view of changesets on bkbits is useful,
but the sorting does not give the full picture.

Follow this example:
bk pull http://linux.bkbits.net/linux-2.5

Do some editing

Check in changes

Test the changes a few days

Submit the cset(s) to Linus

Linus do a bk pull from my repository

When accessing bkbits via the web interface, the changes are listed sorted
by the time I did the modifications, not when Linus actually did the bk
pull, so they may be preceded by maybe 100 csets.

Is it possible somehow to sort the cset(s) according to the time they were
applied to the local tree, and not when they were originally committed?

Larry replied:

If this is a correct statement of what you want, we're building it:

Instead of seeing events in time order of creation, you want to
see the events in order of arrival in a particular repository.

I agree that the current view is useless when what you want to know is
when did this change finally make it into the tree?

We're working on a "stack" of incoming events. BK/Web will use this to
give you the display you want and bk undo will be able to use this to
roll your repository backwards by "popping" the stack. You could do

while true
do bk undo -sf
done

and when it gets done, you'll have no repository, it will have popped it
away. bk unpull will just become a special case of popping the stack.

Paolo Ciarrocchi reported:

I've just run a few dbench-based tests against:

2.4.18

2.4.18 + compressed cache -0.24pre3

2.4.19

2.5.31

Ok, I know that dbench is not a "good" test,
but it should be at least a good stress test.
I got neither oops nor BUG().

Elsewhere, a developer working on khttpd reported that Alan Cox had accepted
a recent patch to fix a checker warning in khttpd, but not an earlier patch
to fix an oops in khttpd. The poster went on:

That earlier patch must have hit some bogon filter... hmm. Yes.
It contained extraneous whitespace and style changes, was complex, and
had a poor description. So, here's a cleaner one with a better description.

This patch fixes four problems in khttpd:

An oops in DecodeHeader where Buffer[CPUNR] is NULL, happened whenever a
worker thread was restarted after being stopped. (The worker thread frees
its buffer on exit, but the manager thread neglected to allocate a buffer
for the worker thread when restarting it.)

A bug that caused worker threads to be spuriously restarted once on
startup (this made the previous bug much worse).

The end-user had to do a "sleep 1" after stopping the daemon before
restarting it. This was not documented, and was rather confusing.

There was no entry in /usr/src/linux/Documentation for khttpd, and
beginning users sometimes could not find the documentation.

Christoph Hellwig asked, BTW: would you
step up as khttpd maintainer? It seems no ones else cares for it and it's
always good to have someone to drop patches/complaints at, but there
was no reply. Elsewhere in the midst of a different thread under the Subject:
Linux 2.4.20-pre4-ac2, Christoph remarked, khttpd is gone in 2.5.

Disk Arrays: RAID, Ioctls, Jens Axboe

Lars Marowsky-Bree posted a patch and announced:

Jens Axboe did most of the work on this; I only stressed it a bit and fixed
some bugs in it. As he is currently on vacation, I would still like to present
it to you and solicit comments on it.

It compiles and passes my test script, so it can't be all wrong, I hope ;-) It
certainly isn't worse than the current code.

I've also done a small patch to mdadm to allow access to the new functionality
provided.

Paths can be set to either active or spare; a spare path will be used
in place of a failed active path but otherwise be disabled.

A path can be manually "cleared" (marked non-faulty). This is explicitly
only implemented for multipathing because it makes no sense for the other
RAID levels where this is definitely the job of the recovery process.

Automatic reprobing of failed paths was deliberately not implemented; this
can be done in user space, and the kernel shouldn't use live requests to do
so.

Some special cases in md.c for multipath were removed / fixed.

md will now enable all paths it finds during autorun. This leads to
"funny" messages ("Device changed to [07:04] from [00:00]" etc), but they
can be safely ignored.

Nested md devices are now also auto-detected; important for RAID1 on top of
multipath for example, required for a true disaster resilient configuration.
However, this isn't yet working perfectly and is subject to ongoing work
;-)

(If anyone has hints here, I would be grateful)

Killed some code which made no sense for the multipath module; ie,
code related to the md recovery.

The downside: We needed to add 3 additional ioctl()s for this.

Patch attached.

Of course, this is still subject to the general comments about the block
device error handling in 2.4.

Jens Axboe

Lars Marowsky-Bree posted a patch and said:

The attached small patch makes it possible to "fail" a loop device on demand. Any
further request to the loop device will simply fail.

Even though it of course doesn't simulate the failures one might see
in the field, it is kind of handy for automated tests, for example for
multipath I/O.

Done by Jens Axboe. The reason why I am sending it: I need it most and
he is on vacation.

FS, Patents, Web Servers

Frederic Roussel asked:

Mr Daniel Phillips started the TUX2 filesystem project some time ago.
The links to `tux2' are either dead or quite old.

Does any kernel developer know about the status of that project ?

Daniel Phillips replied:

It's well down my list of priorities because of uncertainties due to
the U.S. patent system.

Does anybody want to know if patent chill exists, and is it hurting
open source? The answer is yes.

Someone asked what the patent issues were surrounding Tux2.

Elsewhere, Ingo Molnar announced a hyperthreading-aware scheduler patch, explaining:

symmetric multithreading (hyperthreading) is an interesting new concept
that IMO deserves full scheduler support. Physical CPUs can have multiple
(typically 2) logical CPUs embedded, and can run multiple tasks 'in parallel'
by utilizing fast hardware-based context-switching between the two register
sets upon things like cache-misses or special instructions. To the OSs the
logical CPUs are almost undistinguishable from physical CPUs. In fact the
current scheduler treats each logical CPU as a separate physical CPU - which
works but does not maximize multiprocessing performance on SMT/HT boxes.

The following properties have to be provided by a scheduler that wants
to be 'fully HT-aware':

HT-aware passive load-balancing: the irq-driven balancing has to be
per-physical-CPU, not per-logical-CPU.

Otherwise it might happen that one physical CPU runs 2 tasks, while another
physical CPU runs no threads. The stock scheduler does not recognize
this condition as 'imbalance' - to the scheduler it appears as if the
first two CPUs had 1-1 task running, the second two CPUs had 0-0 tasks
running. The stock scheduler does not realize that the two logical CPUs
belong to the same physical CPU. (A toy illustration of this accounting
follows the list below.)

HT-aware active load-balancing.

This is a mechanism that simply does not exist in the stock 1:1
scheduler - the imbalance caused by an idle CPU can be solved via the normal
load-balancer. In the HT case the situation is special because the source
physical CPU might have just two tasks running, both runnable - this is a
situation that the stock load-balancer is unable to handle - running tasks are
hard to be migrated away. But it's essential to do this - otherwise a physical
CPU can get stuck running 2 tasks, while another physical CPU stays idle.

HT-aware task pickup.

When the scheduler picks a new task, it should prefer all tasks that
share the same physical CPU - before trying to pull in tasks from other
CPUs. The stock scheduler only picked tasks that were scheduled to that
particular logical CPU.

HT-aware affinity.

Tasks should attempt to 'stick' to physical CPUs, not logical CPUs.

HT-aware wakeup.

again this is something completely new - the stock scheduler only knows
about the 'current' CPU, it does not know about any sibling [== logical CPUs
on the same physical CPU] logical CPUs. On HT, if a thread is woken up on a
logical CPU that is already executing a task, and if a sibling CPU is idle,
then the sibling CPU has to be woken up and has to execute the newly woken
up task immediately.
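To make the passive-balancing point above concrete, here is a toy calculation (an illustration only, not Ingo's code): summing runqueue lengths per physical package exposes the imbalance that the per-logical-CPU view hides.

# Four logical CPUs, two per physical package, with the task counts
# from the example above: "1, 1, 0, 0" looks balanced pairwise.
siblings = {0: 0, 1: 0, 2: 1, 3: 1}   # logical CPU -> physical package
tasks = {0: 1, 1: 1, 2: 0, 3: 0}      # runqueue length per logical CPU

package_load = {}
for cpu, n in tasks.items():
    package_load[siblings[cpu]] = package_load.get(siblings[cpu], 0) + n

print(package_load)   # {0: 2, 1: 0} - package 0 runs both tasks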

the attached patch (against 2.5.31-BK-curr) implements all the above
HT-scheduling needs by introducing the concept of a shared runqueue:
multiple CPUs can share the same runqueue. A shared, per-physical-CPU
runqueue magically fulfills all the above HT-scheduling needs. Obviously
this complicates scheduling and load-balancing somewhat (see the patch for
details), so great care has been taken to not impact the non-HT schedulers
(SMP, UP). In fact the SMP scheduler is a compile-time special case of the
HT scheduler. (and the UP scheduler is a compile-time special case of the
SMP scheduler)

the patch is based on Jun Nakajima's prototyping work - the lowlevel
x86/Intel bits are still those from Jun, the sched.c bits are newly implemented
and generalized.

There's a single flexible interface for lowlevel boot code to set
up physical CPUs: sched_map_runqueue(cpu1, cpu2) maps cpu2 into cpu1's
runqueue. The patch also implements the lowlevel bits for P4 HT boxes for
the 2/package case.

(NUMA systems which have tightly coupled CPUs with a smaller cache and
protected by a large L3 cache might benefit from sharing the runqueue as
well - but the target for this concept is SMT.)

some numbers:

compiling a standalone floppy.c in an infinite loop takes 2.55 seconds
per iteration. Starting up two such loops in parallel, on a 2-physical,
2-logical (total of 4 logical CPUs) P4 HT box gives the following numbers:

2.5.31-BK-curr: fluctuates between 2.60 and 4.6 seconds.

BK-curr + sched-F3: stable 2.60 sec results.

the results under the stock scheduler depend on pure luck: which CPUs
get the tasks scheduled. In the HT-aware case each task gets scheduled on
a separate physical CPU, all the time.

compiling the kernel source via "make -j2" [under-utilizes CPUs]:

2.5.31-BK-curr: 45.3 sec

BK-curr + sched-F3: 41.3 sec

ie. a ~10% improvement. The tests were the best results picked from lots of
(>10) runs. The no-HT numbers fluctuate much more (again the randomness
effect), so the average compilation time in the no-HT case is higher.

saturated compilation "make -j5" results are roughly equivalent, as
expected - the one-runqueue-per-CPU concept works adequately when the number
of tasks is larger than the number of logical CPUs. The stock scheduler works
well on HT boxes in the boundary conditions: when there's 1 task running,
and when there are more than nr_cpus tasks running.

the patch also unifies some of the other code and removes a few more
#ifdef CONFIG_SMP branches from the scheduler proper.

(the patch compiles/boots/works just fine on UP and SMP as well, on the
P4 box and on another PIII SMP box as well.)

Rusty Russell was happy to see this, as it meant he wouldn't have to
do the implementation himself. But for Ingo's statement that "Tasks should
attempt to 'stick' to physical CPUs, not logical CPUs," Rusty replied:

Linus disagreed with this before when I discussed it with him, and with
the current (stupid, non-portable, broken) set_affinity syscall he's right.

You don't know if someone said "schedule me on cpu 0" because they really
want to be scheduled on CPU 0, or because they really *don't* want to be
scheduled on CPU 1 (where something else is running). You can't just assume
they are equivalent if they are the same physical CPU.

My modified set_affinity syscall (which takes a "include/exclude" flag)
allows the arch to make this decision (eventually) since you know what the
user wants (it also means that you know what to do if they give you a short
bitmap, or a new cpu comes online/goes offline).

Ingo replied that he didn't make assumptions on why a particular CPU was
chosen to receive a process. He said, There's also a
fair amount of code in the kernel that relies on binding threads to particular
CPUs, the patch does not break that in any way. And as far as Linus
Torvalds' opinion, Ingo countered, actually, affinity
still works just fine, users can bind tasks to logical CPUs as well. What
i meant was the affinity logic of the scheduler (ie. affinity decisions
done by the scheduler), not the externally visible affinity API.
This made sense to Rusty, who promised to go read the patch more carefully.

Greg Ungerer announced:

A new 2.5.31 MMU-less patch, linux-2.5.31uc1. Just a minor update,
couple of small fixes.

In another thread, Marc-Christian Petersen posted a patch taken from the
XFree86 repository, and Christoph Hellwig replied:

Don't do this. Alan already has a sane version in his tree which I've
made ready for and sent to Marcelo. It wouldn't hurt if you read lkml..

The patch you posted is the crap directly from the XFree repo and backs
out kernel changes. It might be enough for a random collection of junk patches
but certainly does not meet the quality criteria for official kernels.

Willy Tarreau replied:

why do you always feel the need to discourage people who offer their
contribution? Your first two sentences are quite enough to let Marc-Christian
understand that his patch isn't as good as YOURS. The rest of the mail is
pure gratuitous insults, just like every other mail you send these days
(except those in which you compliment yourself). For a few weeks now, each
time I see a mail from you, before opening it, I ask myself "well, who
is he killing today?".

Perhaps you're fed up with crap in the kernel, but IMHO that's not the
way you'll get rid of it. This list is a developer's list, so it tends
to be constructive by nature. So please be a little more tolerant with other
people, particularly when they are contributing.

Christoph replied, It's not MY patch.
It's Alan & Arjan's work, and I stated that clearly in the thread
a few days ago, where someone posted a patch to bring the XFree crap in.
I expect someone who thinks of himself as a kernel tree maintainer to
at least follow lkml, and watching the most important secondary tree (-ac)
won't hurt either. David Lang said:

for crying out loud, earlier this week we had a post from some of the
network maintainers chastising someone because they only sent the patch to
the kernel list and not to the network list, because many of the developers
don't read the kernel list.

if core kernel developers are telling people they don't read L-K then
a new person sending in a patch and not reading L-K all the time is very
reasonable. you can't have it both ways.

as for the -ac being the most important secondary tree, that's a matter
of opinion, in many cases it is, but in many cases a lot of stuff shows up
in it that never will make it to the main tree as well.

And Willy also said to Christoph, I'm sorry not
to agree with you, but with the high number of messages, not everyone has the
time to catch them all. It has happened that I missed a thread for several
days, and only noticed it once the discussion was already well advanced (OK, I'm
not a kernel tree maintainer, but I'm interested in what's being done). And I
didn't notice this XFree patch either, and I read nearly all messages. You're
lucky if you have all this time to spare here, really. But Randy
Dunlap remarked, Yes, Christoph must spend as much
time per day as Alan does on lk email and patches, but that's a good thing.
I certainly don't spend as much time as they do.

FS: XFS, MAINTAINERS File, POSIX, Version Control

Christoph Hellwig announced:

This patch includes only the core functionality of the SGI XFS filesystem
for Linux 2.5.32. It does NOT include changes for Posix ACLs, dmapi, kdb
or other code included in the XFS CVS tree.

The patch adds the self-contained XFS code and makes almost no modifications
to existing kernel code. Diffstat output with new files stripped:

In an unrelated announcement, one developer wrote:

I've written an advanced tracing API as a potential replacement for ptrace. It
isn't quite complete yet, but sufficient functionality should exist to
implement strace.

It works by adding a new system call that deals with file descriptors with
"special" files attached (much as sysvipc shm does). The fds are, however,
exposed and can be polled. Each fd manages a thread group.

It has full support for threads created with CLONE_THREAD.

Documentation is included in the trace-2532 patch.

Comments would be appreciated.

It is available as a pair of patches to 2.5.32 plus a test/demo program:

Apply the orn-2532 and then the trace-2532 patches to a 2.5.32 kernel, build
and install. The trctl2 program needs access to the header files from the
patched kernel at the moment.

Run trctl2 under the patched kernel. It will fork off an "inferior" process
and begin trapping and displaying certain events from it. The inferior process
will then create a set of threads which will then also be managed by the
"debugger". These threads can be hit with signals to make events happen.

Luca Barbieri posted a patch and explained, This
patch changes the CPU selection mechanism so that each CPU is an independent
y/n choice. The advantage of this is that the user knows exactly and has
full control over the range of CPUs supported by the kernel. Without this
patch it's not clear, for example, how to build a kernel that will work on
both K6s and WinChips. In addition to the processor selection, a choice
is added for the CPU that the kernel should be optimized for, which is used
for the -mcpu switch.

I2C

Albert Cranford posted a patch and said:

Attached are i2c patches that bring the kernel to the latest released and tested
version. Updates include: