The filp array would allow file descriptors to be redirected. It could be
terminated by a -1 and reference the file descriptors of the current
process (this could also potentially save some dup() syscalls).

If any of these parameters (excluding p_path) are NULL, then the
appropriate values are taken from the current process.

I originally was thinking of a name of fexec() for such a syscall, but
since there are already "f" variant syscalls (fchmod, fstat, ...), an
fexec() would make more sense for executing an already open file, so the
name spawn() came to mind.

I know almost all of my fork()-exec() code does almost the same thing. I
guess vfork() was a potential solution, but this somehow seems cleaner
(and still may be more efficient than having to issue two syscalls)...
the downside is, of course, another syscall.

There was not much enthusiasm. Only Rafael Costa dos Santos showed any
interest, offering to help code the thing up. Larry McVoy, while not
outright against the idea, felt there were significant compatibility issues. In
particular, he suggested ensuring compatibility with Windows NT. Elsewhere,
Matthias Andree took a somewhat dimmer view of the compatibility issue. He
said adding a new system call was
"a major
showstopper, because it'd only be useful to non-portable, Unix-specific
applications (thus it wouldn't be put to much use)."

Other folks pointed out that the small amount of time saved by avoiding an
extra system call during process creation would be completely overwhelmed by
the time it took to actually execute the program that would run in that process.
On the other hand, Davide Libenzi suggested doing this as part of the C library,
in user space. A new system call would be overkill for such small gains, but it
might be worth adding a library call.

Bas Mevissen asked if Linux had any support for Broadcom's BCM4306 or
BCM2050 WLAN chips. He saw that the BCM4401 ethernet chip had a Linux
driver, and was hopeful that maybe the WLAN chips did as well. Martin
List-Petersen replied,
"It seems, that
the specs haven't been released yet. There are quite a few Wlan cards out
there based on the Broadcom chips (nearly all cards, that support 802.11g),
so it's quite a shame. (Actually this fits the TrueMobile 1180, 1300
and 1400, speaking of Dell wireless lan cards)."
He added,
"The same problem is with the Intel Prowireless 2100
(Centrino) WLan card. No Linux support available yet, which is another choice
for the Dell notebooks at the moment."
But he also said there was a petition folks
could sign, regarding this very issue. Martin concluded,
"I've tried to contact Broadcom directly, but they are just
ignoring mails containing the word "Linux", so it seems."
David S. Miller
also said:

Don't expect specs or opensource drivers for any of these pieces
of hardware until these vendors figure out a way to hide the frequency
programming interface.

Ie. these cards can be programmed to transmit at any frequency, and
various government agencies don't like it when f.e. users can transmit on
military frequencies and stuff like that.

The only halfway plausible idea I've seen is to not document the frequency
programming registers, and users get a "region" key file that has opaque
register values to program into the appropriate registers. The file is
per-region (one for US, Germany, etc.) and the wireless kernel driver reads
in this file to do the frequency programming.

So don't blame the vendors on this one, several of them would love to
publish drivers public for their cards, but simply cannot without upsetting
federal regulators.

Alan Cox remarked that folks were already cracking the Windows interface
on those cards, and that non-US governments cared about this issue as well. He
said,
"The fact people are already abusing the technology
suggests that they will be forced to go the crypted settings route for next
generation hardware anyway."
And added,
"I talked
to one vendor about this stuff and fingers crossed we will see open drivers
except for the radio module. In the longer term I suspect vendors will move
to signed register sets, so you can load "US 802.11g" but you can't load
"police frequency, full power""

At some point Bas suggested that if these vendors were really willing to
release their specs, but were only holding back to satisfy government agencies,
then maybe they could release some binary drivers in the interim. Martin
replied to this,
"I totally agree on this. A
binary driver could be better than nothing at this point. Another thing that
wonders me, is why companies like Broadcom, if they are so open to releasing
the drivers at some point, where they can make the regulation agencies
somewhat happy, are so ignorant then. I've heard of several people, that
tried to get a statement on the possibility for Linux drivers from them and the
return is nothing. I've actually tried myself. No response at all."

Elsewhere, Carl-Daniel Hailfinger's eyes lit up at the prospect of
transmitting on military frequencies. He said he
"wants binary only driver for these cards to build opensource
driver with ability to set "interesting" frequency range."
Martin
said,
"It's there for Windows."
And at some point, Richard B. Johnson said:

Contrary to popular opinion, there is no FCC regulation prohibiting
one from receiving some particular frequency. There is, however, a federal
law prohibiting the disclosure of a radio message by a third party. This
means that the media, or even law enforcement can't listen to a private
radio (cell phone) conversation and then disclose its content. At one time,
cell phones used FM at 960 MHz. This could be readily received by receivers
designed for Amateur Radio use. For a time, the FCC refused to Type Approve
receivers that cover these frequencies. However, most Hams know how to fix
their receivers so they can receive whatever they want and Type Approval
was only required for receivers that were designed to be sold. You could
build anything you want for yourself. This refusal to Type Approve receivers
was a trick to make the usual receiver owner think that there was some dumb
regulation when, in fact, under the Communications Act of 1934 (as amended),
there can't be such a regulation without creating a new public law, which
hasn't happened and probably will not.

Recently, some broadcast satellite companies have tried to get the FCC
to declare that their transmissions are private and unauthorized reception
should be unlawful. The FCC has continually postponed any such declaration
because, if once broadcast, a radio signal doesn't become public, then
anybody could sue every radio transmitter operator to prevent the trespass
of "their" signals onto private property. You can't have it both ways,
either radio signals are public and, therefore cannot commit a trespass,
or they are private and can.

But, unlike some other countries' regulators, the FCC has steadfastly
refused to allow broadcasters, even satellite broadcasters, to pursue such
extortion. Basically, once a signal leaves an antenna, it becomes public
property.

The same is not true for cable and "guided waves". Satellite broadcasters
have not been able to convince the FCC that their transmissions are "guided
waves". However, some private RF link companies' signals, including some that
use satellites, are considered "guided waves" and cannot be used without
permission.

Various commercial interests have convinced governments of many other
countries that they "own" their radio signals and therefore different
regulations exist in many other countries. In the UK, for instance, one has to
purchase a license to use a receiver (you know, some Sony Walkman). This is,
in my opinion, extremely repressive. It would be nice for somebody to start
suing the BBC (and others) to recover damages for the criminal trespass of
"their" radio signals onto private property. After a few such lawsuits,
the ownership of such broadcast signals would revert to the public, just
like in the US.

Carl-Daniel replied,
"Here in Germany,
receiving some particular frequencies (e.g. those used by the police)
was prohibited a few years ago (I don't know exactly if they changed the
law). The argument was that some receiver types emitted a weak signal on the
frequency they were listening to (and could be tuned to become a private radio
station) which could interfere with the low-power police devices. However,
it was simply not sensible to prohibit all radios, so they were constrained
to a specific frequency range."

Close by, Alan took exception to Richard's statement that people needed
a license for things like a Sony Walkman in England. Alan said,
"You need a license to receive terrestrial TV but that is
rather different and relates to both cultural and historical tax differences
in philosophy between the US and UK. The big problem with 'soft' radios
is transmit. You can hotwire your centrino. People in the UK are already
trying to use US drivers in Windows XP because "they go further". If you
listen to police transmissions then it's ultimately poor police security,
if you transmit on their frequency then it's a lot more serious because you
might interfere with emergency services."

(Several
months later, in August, Ed Weinberg sent me an email saying that Richard's
"description of The Communications Act of 1934, as amended, is close. A few
years ago (less than 10, anyway) Congress passed a law forbidding radios to
receive signals in the bands used by mobile phones. Unauthorised listening
to satellite transmissions seems to also be prevented by new laws. Recent
reports I have read say that the satellite broadcasters are getting people
under the DMCA." -- Ed: [05 Aug 2003 00:00:00 -0800])

Below is a first cut at tracking the major work items which should be
completed for a 2.6 release.

When considering these items it would be useful to have a clear idea of
what a 2.6.0 release is actually _for_. Obviously, 2.6.0 doesn't mean
"it's finished, ship it".

I'd propose that 2.6.0 means that users can migrate from 2.4.x with a good
expectation that everything which they were using in 2.4 will continue to
work, and that the kernel doesn't crash, doesn't munch their data and
doesn't run like a dog. Other definitions are welcome.

I shall be maintaining this list so we can understand where we are with
respect to 2.6 readiness. And so we can look at features and say "no".
And so we can look at bugs and say "not gating 2.6.0".

Things we should not track here are:

Regular old bugs. Please use bugzilla.

Wishlist items. This list is not a route for getting commitment for
inclusion of $FAVEFEATURE. In fact it's probably a good way of getting the
feature shot down ;)

Driver problems. Most important drivers mostly work OK now. Please use
bugzilla.

Things which we should track here are significantly-sized outstanding
development activities which resolve big bugs or which address missing
features & speedups.

I've organised it into three main sections:

must-fix bugs which require significant amounts of work/restructuring
to fix.

late features and speedups.

Important driver bugs. This wasn't supposed to be here, but various
contributors sent me a lot of details, and it would be sad to lose them.

The list is already very long, and very incomplete. Additions (and
removals!!!) are sought. Thanks.

And thanks to the various contributors who helped pull this together.

Must-fix bugs

drivers/char/

TTY locking is broken (see FIXME in do_tty_hangup())

"One bug that was found is that the dropping of lock_kernel from do_exit
caused races in the exit tty cleanup. There was a patch for that, but I'm
not sure it was merged."

drivers/block/

RAID0 dies on strangely aligned BIOs

- Need to hoist BIO-split code out of device mapper, use that.

(neilb)

1/ RAID5 should work fine. It accepts any sort of bio and always
submits a 1-page bio to the underlying device, and if my
understanding is correct, every device must be able to handle a
single page bio, no matter what the alignment (which is why raid0
has a problem - it doesn't).

2/ RAID1 works pretty well. The only improvement needed is to define
a merge_bvec_fn function which passes the question down to lower
layers. This should be easy except for the small fact that it is
impossible :-) There is no enforced pairing between calls to
merge_bvec_fn and submit_bh, so it is possible that a hot spare
with different restrictions could get swapped in between the one
and the other and could confuse things. I suspect that can be
worked around somehow though...

Someone sent me a patch that is sorely needed - it allows you
to simply call blk_queue_stack() (or something like that), and it will
get your stacked limits set appropriately.

3/ I just realised that raid0 is easier than I had previously
thought. We don't need the completely functional bio splitting
that dm has. We only need to be able to split a bio that has just
one page as the use of merge_bvec_fn will ensure that we never get
a larger bio that we cannot handle. And splitting a bio with only
one page is a lot easier. I now have code in my tree that
implements this quite cleanly and will probably post a patch
during the week.
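
The single-page split described in point 3 comes down to a little arithmetic on sector numbers. The sketch below is illustrative only and is not the code Neil refers to: struct io_span and split_at_chunk() are hypothetical names, and it assumes a power-of-two chunk size that is at least as large as the request.

```c
#include <stdint.h>

/* Hypothetical description of a request; not the kernel's struct bio. */
struct io_span {
    uint64_t sector;    /* first 512-byte sector */
    uint32_t nsectors;  /* length in sectors */
};

/* Split `in` at the next chunk boundary. chunk_sectors is assumed to be a
 * power of two and no smaller than the request, so at most one boundary
 * can be crossed. Returns 1 when no split is needed (only *first is set),
 * 2 when both *first and *second are valid. */
int split_at_chunk(struct io_span in, uint32_t chunk_sectors,
                   struct io_span *first, struct io_span *second)
{
    uint64_t offset_in_chunk = in.sector & (uint64_t)(chunk_sectors - 1);
    uint64_t room = chunk_sectors - offset_in_chunk;

    if (in.nsectors <= room) {
        *first = in;
        return 1;
    }
    first->sector = in.sector;
    first->nsectors = (uint32_t)room;
    second->sector = in.sector + room;
    second->nsectors = in.nsectors - (uint32_t)room;
    return 2;
}
```

For example, with 128-sector chunks, an 8-sector (one page) request starting at sector 124 splits into 4 sectors ending the first chunk and 4 sectors starting the next.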

ideraid hasn't been ported to 2.5 at all yet.

CD burning. There are still a few quirks to solve wrt SG_IO and ide-cd.

Jens: The basic hang has been solved (double fault in ide-cd), there still
seems to be some cases that don't work too well. Don't really have a
handle on those :/

IDE tcq. Either kill it or fix it. Not a "big todo", as such.

drivers/video/

Lots of drivers don't compile, others do but don't work.

fs/

NFS client gets an OOM deadlock.

- Some fixes exist in -mm. Seem to mostly work.

NFS client runs very slowly consuming 100% CPU under heavy writeout.

- Unsubtle fix exists in -mm. (Looks like it's fixed anyway).

ext3 data=journal mode is bust.

ext3/htree doesn't play right with NFS server. 90% fixed in -mm.

AIO/direct-IO writes can race with truncate and wreck filesystems.

- Easy fix is to only allow the feature for S_ISBLK files.

davej: NFS seems to have a really bad time for some people. (Including
myself on one testbox). The common factor seems to be a high spec client
torturing an underpowered NFS server with lots of IO. (fsx/fsstress etc
show this up). Lots of "NFS server cheating" messages get dumped, and a
whole lot of bogus packets start appearing. They look severely corrupted,
(they even crashed ethereal once 8-)

kernel/

O(1) scheduler starvation, poor behaviour seems unresolved.

Jens: "I've been running 2.5.67-mm3 on my workstation for two days, and
it still doesn't feel as good as 2.4. It's not a disaster like some
revisions ago, but it still has occasional CPU "stalls" where it feels
like a process waits for half a second or so for CPU time. That is very
noticeable."

__module_get(): "I know I have a refcount already and I don't care
if they're doing rmmod --wait, gimme.". Keeps bouncing off Linus.

Per-cpu support inside modules (have patch, in testing).

driver class code is getting redone. I have this now working, and will
send it out in a few days.

net/

(davem)

UDP apps can in theory deadlock, because the ip_append_data path can end
up sleeping while the socket lock is held.

It is OK to sleep with the socket held, normally. But in this case
the sleep happens while waiting for socket memory/space to become
available, if another context needs to take the socket lock to free up the
space we could hang.

I sent a rough patch on how to fix this to Alexey, and he is analyzing
the situation. I expect a final fix from him next week or so.

Semantics for IPSEC during operations such as TCP connect suck currently.

When we first try to connect to a destination, we may need to ask the
IPSEC key management daemon to resolve the IPSEC routes for us. For the
purposes of what the kernel needs to do, you can think of it like ARP. We
can't send the packet out properly until we resolve the path.

What happens now for IPSEC is basically this:

O_NONBLOCK: returns -EAGAIN over and over until route is resolved

!O_NONBLOCK: Sleeps until route is resolved

These semantics are total crap. The solution, which Alexey is working
on, is to allow incomplete routes to exist. These "incomplete" routes
merely put the packet onto a "resolution queue", and once the key manager
does its thing we finish the output of the packet. This is precisely how
ARP works.

I don't know when Alexey will be done with this.
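
The "resolution queue" idea above can be pictured with a toy model. Everything in this sketch is hypothetical (struct route, route_output(), route_resolved(), the queue depth, and the use of integer ids in place of real packet buffers); it only shows the shape of the behaviour, where output never sleeps and parked packets go out once resolution completes.

```c
#include <stdbool.h>
#include <stddef.h>

#define PENDING_MAX 4            /* assumed queue depth */

struct route {
    bool resolved;               /* has the key manager answered yet? */
    int pending[PENDING_MAX];    /* packet ids stand in for real buffers */
    size_t npending;
    unsigned sent;               /* packets actually transmitted */
};

/* Stand-in for the real transmit path. */
static void transmit(struct route *rt, int pkt)
{
    (void)pkt;
    rt->sent++;
}

/* Never sleeps: sends at once on a complete route, otherwise parks the
 * packet. Returns 0 on success, -1 if the resolution queue is full. */
int route_output(struct route *rt, int pkt)
{
    if (rt->resolved) {
        transmit(rt, pkt);
        return 0;
    }
    if (rt->npending == PENDING_MAX)
        return -1;
    rt->pending[rt->npending++] = pkt;
    return 0;
}

/* Called once resolution finishes: flush parked packets in order. */
void route_resolved(struct route *rt)
{
    rt->resolved = true;
    for (size_t i = 0; i < rt->npending; i++)
        transmit(rt, rt->pending[i]);
    rt->npending = 0;
}
```

What to do when the queue overflows (drop, or report an error to the sender) is a policy choice the model leaves open.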

There are those mysterious TCP hangs of established state sockets.
Someone has to get a good log in order for us to effectively debug this.

net/*/netfilter/

(Rusty)

Handle non-linear skbs everywhere. This is going in via Dave now.

Rework conntrack hashing.

Module relationship bogosity fix (trivial, have patch).

global

Lots of 2.4 fixes including some security are not in 2.5

There are about 60 or 70 security related checks that need doing
(copy_user etc) from Stanford tools

A couple of hundred real looking bugzilla bugs

Not-ready features and speedups

drivers/block/

Framework for selecting IO schedulers. This is the main one really.
Once this is in place we can drop in new schedulers any old time, no
risk.

Dynamic disk request allocation. Patch exists.

Runtime-selectable disk scheduler framework.

Anticipatory scheduler. Working OK now, still has problems with seeky
OLTP-style loads.

CFQ scheduler. Seems to work but Jens planning significant rework.

The feral.com qlogic driver: needs work.

fs/

reiserfs_file_write() speedup. There are concerns that some applications
do the wrong thing with large stat.st_blksize.

ext3 lock_kernel() removal: that part works OK and is mergeable. But
we'll also need to make lock_journal() a spinlock, and that's deep
surgery.

I have several reasons for wanting to do this (all of
them related to NFS of course, but much of the reasoning applies
to *all* networked file systems).

1) The above sequence is simply not atomic on *any* networked
filesystem.

2) It introduces a sh*tload of completely unnecessary RPC calls (why
do a 'permission' RPC call when the server is in *any* case going to
tell you whether or not this operation is allowed. Why do a
'lookup()' when the 'create()' call can be made to tell you whether or
not a file already exists).

3) It is incompatible with some operations: the current create()
doesn't pass an 'EXCLUSIVE' flag down to the filesystems.

4) (NFS specific?) open() has very different cache consistency
requirements when compared to most other VFS operations.

I'd very much like for something like Peter Braam's 'lookup with
intent' or (better yet) for a proper dentry->open() to be integrated with
path_walk()/open_namei(). I'm still working on the latter (Peter has
already completed the lookup with intent stuff).

/proc/kallsyms. What most people really wanted from /proc/ksyms. Patch
exists.

Fix module-failed-init races by starting module "disabled". Patch
exists, requires some subsystems (ie. add_partition) to explicitly say
"make module live now". Without patch we are no worse off than 2.4 etc.

Integrate userspace irq balancing daemon.

mm/

objrmap: concerns over page reclaim performance at high sharing levels,
and interoperation with nonlinear mappings is hairy.

Readd and make /proc/sys/vm/freepages writable again so that boxes can be
tuned for heavy interrupt load.

net/

(davem)

Real serious use of IPSEC is hampered by lack of MPLS support. MPLS is a
switching technology that works by switching based upon fixed length labels
prepended to packets. Many people use this and IPSEC to implement VPNs
over public networks, it is also used for things like traffic engineering.

Anyways, an existing (crappy) implementation exists. I've almost
completed a rewrite, I should have something in the tree next week.

Sometimes we generate IP fragments when it truly isn't necessary.

The way IP fragmentation is specified, each fragment must be modulo 8
bytes in length. So suppose the device has an MTU that is not 0 modulo 8,
ethernet even classifies in this way. 1500 == (8 * 187) + 4

Our IP fragmenting engine can fragment on packets that are sized within
the last modulo 8 bytes of the MTU. This happens in obscure cases, but it
does happen.

I've proposed a fix to Alexey, whereby very late in the output path we
check the packet, if we fragmented but the data length would fit into the
MTU we unfragment the packet.

This is low priority, because technically it creates suboptimal behavior
rather than mis-operation.
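
The arithmetic behind this item can be sketched as follows. IP fragment offsets are counted in 8-byte units, so every fragment except the last must carry a multiple of 8 payload bytes. The helper names below are hypothetical, and needlessly_fragmented() is only one reading of the late check described above, not the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest payload a non-final fragment can carry: offsets are counted in
 * 8-byte units, so the space left after the header is rounded down to a
 * multiple of 8. */
uint32_t frag_payload(uint32_t mtu, uint32_t hdr_len)
{
    return (mtu - hdr_len) & ~7u;
}

/* True when the header plus the whole payload fits in a single packet. */
bool fits_unfragmented(uint32_t mtu, uint32_t hdr_len, uint32_t payload_len)
{
    return hdr_len + payload_len <= mtu;
}

/* Hypothetical late check in the spirit of the proposed fix: a payload that
 * exceeds the rounded fragment size but still fits within the MTU did not
 * need to be split. */
bool needlessly_fragmented(uint32_t mtu, uint32_t hdr_len, uint32_t payload_len)
{
    return payload_len > frag_payload(mtu, hdr_len) &&
           fits_unfragmented(mtu, hdr_len, payload_len);
}
```

With a 1500-byte MTU and a 24-byte header (20 bytes plus options), 1476 bytes of payload fit in one packet, but the rounded fragment size is only 1472, so payloads of 1473 to 1476 bytes fall into the window the item describes.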

IPV4 output engine changes for IPSEC need to be moved over to IPV6.

IPV6 ipsec works but gravely suboptimally in some cases. It is also for
this reason that the zerocopy UDP stuff isn't functional on the ipv6 side.

The USAGI project (www.linux-ipv6.org) is working with Alexey on this
work.

net/*/netfilter/

Lots of misc. cleanups, which are happening slowly.

davem: Netfilter needs to stop linearizing packets as much as possible.

Zerocopy output packets are basically undone by netfilter because all of
it assumed it was working with linear socket buffers.

Rusty is fixing this piece by piece. He is nearly done with this work.

power management

(Pat) There is some preliminary work at bk://ldm.bkbits.net/linux-2.5-power,
though I'm currently in the process of reworking it.

It includes:

New device power management core code, both for individual devices,
and for global state transitions.

A generic user interface for triggering system power state transitions.

Arch-independent code for performing state transitions, that calls
platform-specific methods along the way.

A better suspend-to-disk mechanism than swsusp.

There are various other details to be worked out, which are the real fun
part. And of course, driver support, but that is something that can happen
at any time.

(Alan)

PCI locking

Frame buffer restore codepaths (that requires some deep PCI magic)

XFree86 hooks

AGP restoration

DRI restoration

IDE suspend/resume without races (Ben is looking at this a little)

How to deal with devices that babble (some stuff we have to global IRQ
off to save, and global IRQ on -after- we recover with APM)

Pat's swsusp rework?

arch/i386/

Andi: i386 sub architectures for common boxes (in particular bigsmp and
summit) need to be runtime probed options, not compile time. Vendors
cannot ship an own kernel rpm for all these cases. (patch is in -mm, works
OK).

Also PC9800 merge needs finishing to the point we want for 2.6 (not
all).

ES7000 wants merging (now we are all happy with it). That shouldn't be a
big problem.

global

64-bit dev_t. Seems almost ready, but it's not really known how much
work is still to do. Patches exist in -mm but with the recent rise of the
neo-viro I'm not sure where things are at.

We need a kernel side API for reporting error events to userspace (could
be async to 2.6 itself)

(Prototype core based on netlink exists)

Kai: Introduce a sane, easy and standard way to build external modules

Alan: We have multiple drivers walking the pci device lists and also
using things like pci_find_device in unsafe ways with no refcounting. I
think we have to make pci_find_device etc refcount somewhere and add
pci_device_put as was done with networking.

Lots of network drivers don't even build

Alan: PCI hotplug is unsafe (locking is totally screwed)

Ditto cardbus

Alan: Cardbus/PCMCIA requires all Russell's stuff is merged to do
multiheader right and so on

drivers/acpi/

davej: ACPI has a number of failures right now. There are a number of
entries in bugzilla which could all be the same bug. It manifests as a
"network card doesn't receive packets" problem; booting with 'acpi=off noapic'
fixes it.

davej: There's also another nasty 'doesn't boot' bug which quite a few
people (myself included) are seeing on some boxes (especially laptops).

drivers/block/

Alan: Partition handling is hosed for DM users. (I have some partly
debugged patches in the -ac tree, but Andries objects to them and I think
his user knows magic options hack is unacceptable too. Mostly this is
figuring out the right answer)

Floppy is almost unusably buggy still

drivers/char/

Alan: Multiple serious bugs in the DRI drivers (most now with patches
thankfully). "The badness I know about is almost entirely IRQ mishandling.
DRI failing to mask PCI irqs on exit paths."

Various suspect things in AGP.

drivers/ide/

(Alan)

IDE requires bio walking

IDE PIO has occasional unexplained PIO disk eating reports

IDE has multiple zillions of races/hangs in 2.5 still

IDE eats disks with HPT372N on 2.5.x

IDE scsi needs rewriting

IDE needs significant reworking to handle Simplex right

IDE hotplug handling for 2.5 is completely broken still

drivers/isdn/

(Kai, rmk)

isdn_tty locking is completely broken (cli() and friends)

fix lots of remaining bugs in the isdn link layer / hisax protocol layer
/ hisax subdrivers, so that at least 99% of the users have a usable ISDN
subsystem

Alternatively, we could re-introduce the fallback to driver ioctl parsing
for these if not enough drivers get updated.

fixup the usb-serial core and drivers to provide support for this
patch.

drivers/net/

davej: Either Wireless network drivers or PCMCIA broke somewhen. A
configuration that worked fine under 2.4 doesn't receive any packets. Need
to look into this more to make sure I don't have any misconfiguration that
just 'happened to work' under 2.4

drivers/scsi/

Half of SCSI doesn't compile

arch/i386/

2.5.x won't boot on some 440GX

2.5.x doesn't handle VIA APIC right yet - don't know why

ACPI needs the relax patches merging to work on lots of laptops

ECC driver questions are not yet sorted (DaveJ is working on this)

arch/x86_64/

(Andi)

time handling is broken. Need to move up 2.4 time.c code.

memory corruption with IOMMU pci_free_consistent - often causes crashes
at shutdown. This is rather mysterious, the code is basically identical to
2.4 which works fine. Can only be seen on systems with >4GB of memory or
with iommu=force

Another report of a crash at shutdown on Simics with no iommu when all
memory was used. Could be related to the one above.

change_page_attr corrupts memory/crashes. Breaks some AGP users.

NMI watchdog seems to tick too fast

some fixes from 2.4 still need to be merged

not very well tested. probably more bugs lurking.

Andi Kleen replied:

I found a new bad class of bugs (slowly working on fixing them, also
present in 2.4)

Machine Check handlers use printk in an NMI like (ignoring cli) situation.
This can deadlock on the console or low level character driver (serial, vga)
locks. Not all MCEs are fatal (e.g. corrected ECC errors) and the kernel
should be safely able to continue.

Need to buffer the printk in an atomic fashion (e.g. in a ring buffer managed
with cmpxchg) and cause a self IPI that triggers an interrupt after
the next sti. This is easy with x86/APIC mode, but difficult with PIC
(the 8259 supports it in theory, but it's not clear that all clones in various
chipsets do; also changing the programming may be risky). Fallback: pick it
up with the next timer interrupt by adding a check there.
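
A ring buffer of the kind Andi describes can be sketched in user-space C with C11 atomics standing in for cmpxchg. This is an illustration under stated assumptions, not kernel code: the names log_defer() and log_pop(), the slot count, and the message length are invented here, and a single consumer is assumed (for instance the timer-interrupt fallback mentioned above).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define LOG_SLOTS   8    /* assumed ring size */
#define LOG_MSG_LEN 64   /* assumed maximum message length */

struct log_slot {
    atomic_bool ready;           /* set once msg is completely written */
    char msg[LOG_MSG_LEN];
};

struct log_ring {
    atomic_uint head;            /* next slot to reserve (producers) */
    atomic_uint tail;            /* next slot to drain (single consumer) */
    struct log_slot slot[LOG_SLOTS];
};

/* Takes no locks: reserve a slot with compare-and-exchange, then fill it.
 * Returns false, dropping the message, when the ring is full. */
bool log_defer(struct log_ring *r, const char *msg)
{
    unsigned head = atomic_load(&r->head);
    size_t i;

    do {
        if (head - atomic_load(&r->tail) >= LOG_SLOTS)
            return false;
        /* On failure the exchange reloads head and the loop retries. */
    } while (!atomic_compare_exchange_weak(&r->head, &head, head + 1));

    struct log_slot *s = &r->slot[head % LOG_SLOTS];
    for (i = 0; i < LOG_MSG_LEN - 1 && msg[i] != '\0'; i++)
        s->msg[i] = msg[i];
    s->msg[i] = '\0';
    atomic_store(&s->ready, true);
    return true;
}

/* Consumer side, run later from a context where printing is safe. Copies
 * the oldest message into out (LOG_MSG_LEN bytes) and returns true, or
 * returns false if nothing is ready yet. */
bool log_pop(struct log_ring *r, char *out)
{
    unsigned tail = atomic_load(&r->tail);
    struct log_slot *s = &r->slot[tail % LOG_SLOTS];

    if (tail == atomic_load(&r->head) || !atomic_load(&s->ready))
        return false;
    memcpy(out, s->msg, LOG_MSG_LEN);
    atomic_store(&s->ready, false);
    atomic_store(&r->tail, tail + 1);
    return true;
}
```

The per-slot ready flag covers the case where a producer has reserved a slot but has not yet finished writing it; the consumer simply stops there and retries on its next pass.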

New entries for the x86-64 list
(actually I'm not sure they are all x86-64 specific, just that the
bug has been seen there)

32bit core dumps do not dump 32bit SSE data currently. they should

AT_GID/AT_UID ELF environment vector contains crap currently
This breaks debugging of the shared linker for suid programs because
ld.so always thinks it is suid/not called by root and ignores environment
variables.

NIS/ypbind breaks with an abort() in glibc. Only happens on 2.5, 2.4
is fine.

need /proc/kcore access for kernel mappings that are outside vmalloc
(in particular the kernel and the modules are special mappings on x86-64;
other architectures have the same problem)

Best would be to put them in the vmalloc mappings list, but that requires
some more fixes in other code that uses it. Also /proc/kcore seems to have
some 64bit signedness bugs (patch for 2.4 exists)

Generic item:

need to share the ioctl 32bit emulation handlers between ports.
Pavel has a patch, but he's running into difficulties with merging it.

To the generic item, Pavel Machek replied that his patch had been accepted.
Andi replied that things were still quite broken; and Pavel said a new patch was
on its way to Linus.

Elsewhere, on Andrew's item about IDE suspend/resume without
races, Benjamin Herrenschmidt said,
"I
have something that work not too badly for PPC already but that need some
cleanup, to be tested/adapted to Pat's new work (especially tested against
his swsusp, and we shall still verify if it fits x86 needs)."

Elsewhere, Christoph Hellwig added his own items to Andrew's list:

drivers/scsi/

large parts of the locking are hosed or nonexistent

shost->my_devices isn't locked down at all

the host list is locked but not refcounted, mess can
happen when the spinlock is dropped

there are lots of members of struct Scsi_Host/scsi_device/scsi_cmnd
with very unclear locking, many of them probably want to become
atomic_t's or bitmaps (for the 1bit bitfields).

there's lots of volatile abuse in the scsi code that needs to
be thought about.

there's some global variables incremented without any locks

fs/devfs/

there's a fundamental lookup vs devfsd race that's only fixable
by introducing a lookup vs devfs deadlock. I can't see how this
is fixable without getting rid of the current devfsd design.
Mandrake seems to have a workaround for this so this is at least
not triggered so easily, but that's not what I'd consider a fix.

Martin Schlemmer got sound working on his ICH5, by simply adding the ICH5
IDs to the list. That worked for his system, but Jeff Garzik replied,
"Unfortunately this doesn't work on all ICH5s out there.
At the very minimum, for now, it would be nice to match up ich5 and codec
pairs, as codec differentiation seems to be what stops this patch from
working on all ICH5."
And Martin replied:

Hmm, right.

Anybody working on getting support for the 875 Chipset into 2.5? Can I
send a 'lspci -vv' to help ? I have a Asus P4C800 here (Intel 875p), so I
can do some testing if need be.

SCO-Caldera Senior Vice President Chris Sontag explicitly says that the
kernel.org kernel is *not* tainted, but that the other stuff that Red Hat
and SuSE are including *is*.

Quote from the interview:

"Chris Sontag: We're not talking about the Linux kernel that Linus and
others have helped develop. We're talking about what's on the periphery of
the Linux kernel."

He doesn't specify exactly what he's talking about, but he makes an
interesting claim:

"Chris Sontag: We are using objective third parties to do comparisons of
our UNIX System V [SCO-owned Unix] source code and Red Hat as an example. We
are coming across many instances where our proprietary software has simply
been copied and pasted or changed in order to hide the origin of our System
V code in Red Hat. This is the kind of thing that we will need to address
with many Linux distribution companies at some point."

"We're finding ... cases where there is line-by-line code in the Linux
kernel that is matching up to our UnixWare code.

We're finding code that looks like it's been obfuscated to make it look
like it wasn't UnixWare code -- but it was."

Chris Sontag should get his story straight with his boss before he opens
his mouth to the press.

Elsewhere, Christoph Hellwig replied to the original post as well,
saying:

As someone who worked for SCO (or rather Caldera how it was called at that
time) I can tell you this is utter crap. There were very few people actually
doing Linux kernel work then (and when the German office was closed down
all those left the company) and we really had better things to do than
trying to retrofit UnixWare code into the linux kernel. Especially given
that the kernel internals are so different that you'd need a big glue
layer to actually make it work and you can guess how that would be
ripped apart in a usual lkml review :)

It might be more interesting to look for stolen Linux code in Unixware,
I'd suggest with the support for a very well known Linux filesystem in
the Linux compat addon product for UnixWare..

Jim Nance said,
"Wouldn't it be hilarious if whatever
code SCO is talking about when they say there is Unix code in Linux turns out
to be code some SCO employee ripped out of some GPL program and stuck into
Unixware. That is actually far more likely than what they allege."

There were a few more quips, and the thread petered out inconclusively.

which contains lots of documentation on the whole linux-hotplug
process.

There are lots of changes in this release from the last one (which was
almost 8 months ago), most of them make things work better for systems
running 2.5, but some of them fix problems that 2.4 users will see.

Some of the major changes in this release are:

fix for the lack of a drivers file in usbfs in 2.5.

initial scsi.agent for 2.5, modprobes sd_mod or sr_mod

call devlabel if it's present

made /sbin/hotplug a tiny multiplexer program, moving the original
/sbin/hotplug program to /etc/hotplug.d/default/default.hotplug
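
The multiplexer idea in the last item can be pictured with a short Python
sketch. This is only an illustration, not the released program: the function
name hotplug_handlers, the per-subsystem directory and the '.hotplug' suffix
filter are assumptions inferred from the /etc/hotplug.d/default/default.hotplug
path mentioned above.

```python
import os


def hotplug_handlers(subsystem, base_dir="/etc/hotplug.d"):
    """Return the handler scripts a multiplexer might run for one event.

    Assumed layout (not confirmed by the announcement): handlers live in
    base_dir/<subsystem>/ and base_dir/default/, and are executable files
    whose names end in '.hotplug'.
    """
    handlers = []
    # dict.fromkeys() drops the duplicate when subsystem is itself "default".
    for subdir in dict.fromkeys((subsystem, "default")):
        directory = os.path.join(base_dir, subdir)
        if not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if (name.endswith(".hotplug")
                    and os.path.isfile(path)
                    and os.access(path, os.X_OK)):
                handlers.append(path)
    return handlers
```

A real multiplexer would then execute each returned path in turn, passing
along the event's arguments and environment; here the list is only computed
so the dispatch order is easy to inspect.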

The full ChangeLog extract since the last release is included below for
those who want to know everything that's been changed, and who to blame
for them :)

We are pleased to announce the first publicly available source code
release of a new kernel-based security feature called the "Exec Shield",
for Linux/x86. The kernel patch (against 2.4.21-rc1, released under the
GPL/OSL) can be downloaded from:

The exec-shield feature provides protection against stack, buffer or
function pointer overflows, and against other types of exploits that rely
on overwriting data structures and/or putting code into those structures.
The patch also makes it harder to pass in and execute the so-called
'shell-code' of exploits. The patch works transparently, ie. no
application recompilation is necessary.

Background:
-----------

It is commonly known that x86 pagetables do not support the so-called
executable bit in the pagetable entries - PROT_EXEC and PROT_READ are
merged into a single 'read or execute' flag. This means that even if an
application marks a certain memory area non-executable (by not providing
the PROT_EXEC flag upon mapping it) under x86, that area is still
executable, if the area is PROT_READ.
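
As a rough illustration of the paragraph above, the following Python sketch
contrasts what an application requests with what the classic x86 pagetable
format can enforce. The PROT_* values are the ones from Linux's <sys/mman.h>;
the two function names are made up for this example, and the model is a
simplification that follows the text rather than the hardware manuals.

```python
# Protection bits as defined in Linux's <sys/mman.h>.
PROT_READ = 0x1
PROT_WRITE = 0x2
PROT_EXEC = 0x4


def requested_executable(prot: int) -> bool:
    """What the application asked for: executable only with PROT_EXEC."""
    return bool(prot & PROT_EXEC)


def legacy_x86_executable(prot: int) -> bool:
    """What the text says classic x86 pagetables enforce: read and execute
    share one flag, so a PROT_READ area is executable as well."""
    return bool(prot & (PROT_READ | PROT_EXEC))


# A typical data mapping (heap, stack): readable and writable, not executable.
data_prot = PROT_READ | PROT_WRITE
print(requested_executable(data_prot))   # False
print(legacy_x86_executable(data_prot))  # True
```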

Furthermore, the x86 ELF ABI marks the process stack executable, which
requires that the stack is marked executable even on CPUs that support an
executable bit in the pagetables.

This problem has been addressed in the past by various kernel patches,
such as Solar Designer's excellent "non-exec stack patch". These patches
mostly operate by using the x86 segmentation feature to set the code
segment 'limit' value to a certain fixed value that points right below the
stack frame. The exec-shield tries to cover as much virtual memory via the
code segment limit as possible - not just the stack.

Implementation:
---------------

The exec-shield feature works via the kernel transparently tracking
executable mappings an application specifies, and maintains a 'maximum
executable address' value. This is called the 'exec-limit'. The scheduler
uses the exec-limit to update the code segment descriptor upon each
context-switch. Since each process (or thread) in the system can have a
different exec-limit, the scheduler sets the user code segment dynamically
so that always the correct code-segment limit is used.
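
The bookkeeping described above can be modelled in a few lines of Python.
This is a toy sketch under stated assumptions, not the patch's code: the
class ExecLimitTracker and its methods are hypothetical names, and details
such as unmapping or the granularity of segment limits are ignored.

```python
class ExecLimitTracker:
    """Toy model of one process's 'exec-limit' (hypothetical class)."""

    def __init__(self):
        # (start, end) pairs of executable mappings, end exclusive.
        self.exec_ranges = []

    def map_executable(self, start, length):
        """Record a mapping that was requested with PROT_EXEC."""
        self.exec_ranges.append((start, start + length))

    def exec_limit(self):
        """Highest executable address, or None if nothing is executable."""
        if not self.exec_ranges:
            return None
        return max(end for _start, end in self.exec_ranges) - 1

    def may_execute(self, addr):
        """A code segment limit only cuts off everything above the limit."""
        limit = self.exec_limit()
        return limit is not None and addr <= limit


proc = ExecLimitTracker()
proc.map_executable(0x01000000, 0x4000)   # e.g. a small relinked binary
print(hex(proc.exec_limit()))             # 0x1003fff
print(proc.may_execute(0xbfffe000))       # False: a stack address near 3GB
print(proc.may_execute(0x00200000))       # True: anything below the limit
```

The last line shows why the scheme is coarse: an address below the limit
stays executable whether or not it belongs to an executable mapping, so the
protection is only as good as the exec-limit is low.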

the kernel caches the user segment descriptor value, so the overhead in
the context-switch path is a very cheap, unconditional 6-byte write to the
GDT, costing 2-3 cycles at most.

Furthermore, the kernel also remaps all PROT_EXEC mappings to the
so-called ASCII-armor area, which on x86 is the addresses 0-16MB. These
addresses are special because they cannot be jumped to via ASCII-based
overflows. E.g. if a buggy application can be overflown via a long URL:

http://somehost/buggy.app?realyloooooooooooooooooooong.123489719875

then only ASCII (ie. value 1-255) characters can be used by attackers. If
all executable addresses are in the ASCII-armor, then no attack URL can be
used to jump into the executable code - ie. the attack cannot be
successful. (because no URL string can contain the \0 character.) E.g. the
recent sendmail remote root attack was an ASCII-based overflow as well.
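
The argument rests on how addresses look as bytes. A small Python sketch
(the function names are invented for this example) shows that a 32-bit
little-endian address below 16MB always contains a zero byte, while typical
library or stack addresses do not:

```python
import struct


def little_endian_bytes(addr: int) -> bytes:
    """The 4 bytes an x86 overflow would have to write for this address."""
    return struct.pack("<I", addr)


def fits_in_nul_free_string(addr: int) -> bool:
    """True if the address can be spelled without any \\0 byte."""
    return b"\x00" not in little_endian_bytes(addr)


for addr in (0x00123456, 0x01003fff, 0x40012345, 0xbfffe0c0):
    print(hex(addr), little_endian_bytes(addr).hex(), fits_in_nul_free_string(addr))
# 0x123456 56341200 False
# 0x1003fff ff3f0001 False
# 0x40012345 45230140 True
# 0xbfffe0c0 c0e0ffbf True
```

The sketch only checks for zero bytes inside the address itself. It does not
model every way a string copy can behave, so it illustrates the idea rather
than proving the protection.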

With the exec-shield activated, and the 'cat' binary relinked into the
ASCII-armor, the following layout is created:

In the above layout, the highest executable address is 0x01003fff, ie.
every executable address is in the ASCII-armor.

this means that not only the stack is non-executable, but lots of
mmap()-ed data areas and the malloc() heap is non-executable as well.
(some data areas are still executable, but most of them are not.)

the first 1MB of the ASCII-armor is left unused to provide NULL pointer
dereference protection and leave space for 16-bit emulation mappings used
by XFree86 and others.

In this layout none of the executable areas are in the ASCII-armor, plus
the exec-limit is 0xbfffffff (3GB) - ie. including all userspace mappings.

Note that the kernel will relocate every shared-library to the
ASCII-armor, but the binary address is determined at link-time. To ease
the relinking of applications to the ASCII-armor, Arjan Van de Ven has
written a binutils patch (binutils-2.13.90.0.18-elf-small.patch), which
adds a new 'ld' flag "ld -melf_i386_small" (or "gcc -Wl,-melf_i386_small")
to relink applications into the ASCII-armor. (The patch can be found at
the exec-shield URL as well.)

Overhead:
---------

the patch was designed to be as efficient as possible. There's a very
minimal (couple of cycles) tracking overhead for every PROT_EXEC mmap()
system-call, plus there's the 2-3 cycles cost per context-switch.

Limitations:
------------

This feature will not protect against every type of attack.

E.g. if an overflow can be used to overwrite a local variable which
changes the flow of control in a way that compromises the system. But we
do believe that this feature will stop every attack that is purely
operating by overflowing the return address on the stack, or overflowing a
function pointer in the heap. Furthermore, exec-shield makes it quite hard
to mount a successful attack even in the other cases, because it inhibits
the execution of exploit shell-code, in most cases.

also, if the overflow is within the exec-shield itself (e.g. within the
data section of one of the shared library objects in the ASCII-armor) then
the overflow might be possible to exploit.

All in all, exec-shield is one barrier against attacks, not blanket 100%
protection in any way. The most efficient security can be provided by
installing as many layers as possible.

To provide as good protection as possible, there's no trampoline
workaround in the exec-shield code - ie. exec-limit violations in the
trampoline case are never let through. Applications that need to rely on
gcc trampolines will have to use the per-binary ELF flag to make the stack
executable again. (The ELF flag is the same as used by Solar Designer's
non-exec stack patch, to provide as much compatibility with existing
non-exec-stack installations as possible.)

The exec-shield feature will uncover applications that incorrectly assumed
that PROT_READ allows execution on x86. One such example is the XFree86
module loader. The latest XFree86 on rawhide.redhat.com fixes this
problem. For those who cannot install the XFree86 bugfix at the moment
there's a workaround added by the patch, which can be activated via:

echo 1 > /proc/sys/kernel/X-workaround

This will make every iopl() using application (such as X) have the
exec-shield disabled. Other applications (sendmail, etc.) will still have
the exec-shield enabled. This workaround is default-off. We strongly
encourage to solve this problem by upgrading X, or by using the 'chkstk'
utility to make X's stack forced-executable.

Using it:
---------

Apply the exec-shield-2.4.21-rc1-B6 kernel patch to the 2.4.21-rc1 kernel,
recompile & install the kernel and reboot into it, that's all.

There is a new boot-time kernel command line option called exec-shield=,
which has 4 values. Each value represents a different level of security:

shared library address randomization, both within and outside the
ASCII-shield. This should make remote attacks a little bit more
difficult.

process stack randomization. A number of other patches did this as
well, it generally helps. (There's no memory wasted because the stack
area left out will simply not be paged in.)

turn off shlib relocation if the stack is executable. This is needed
for Wine, qemu and other apps that need the low memory range.

do not show the wchan field of non-owned processes, and do not show the
maps file either. This should make it a little bit harder to guess
library locations for local attackers.

most of the new stuff in this patch (randomization, information filtering)
has been done in other patches as well (such as PaX, grsecurity, non-exec
stack patch, etc.) - i tried to filter out and add the ones that matter
most, do not introduce constraints and are thus uncontroversial.

Various folks were very happy to see this work, and a bunch of people started
discussing the implementation and various security issues.

Daniele Pala asked,
"Trying to run 'make xconfig'
i got into the message 'you don't have installed qt!'...so the xconfig is
now dependant from qt? why? what about us poor guy who only use twm and
not kde? isn't qt pretty big and fat?"
Diego Calleja Garcia replied
that 'make gconfig' would use the gtk library; Balram Adlakha said,
"I think xconfig should be the "X" based one, qconfig
should be the qt based one and gconfig should be the gtk one."
Sam
Ravnborg invited him to contribute a generic X-based config program.

Ok, I finally found the reason for why some of my machines had trouble
with restarting the X server, and it turns out that it's been around since
very early February. I bet others must have seen it too, with random crashes
on X server restart when the server used AGP (which means that it mainly
hit either hw-accelerated 3D setups or the intel integrated graphics which
use a UMA model with AGP as the backing store).

That's a big relief for me, as it was the major thing I personally worried
about for 2.6.x.

Anyway, that's fixed here, along with a lot of other updates. Much of
2.5.69 is small one-liners to drivers to handle the new IRQ semantics, but
there's a lot of other cleanups in there too (Christoph Hellwig continued
on his devfs rampage, for example).

NOTE! As of this release I think I'll want to have patches either
be _really_ obvious, or they should go through one or more people for
approval. In particular, I'm hoping that the paperwork stuff with Andrew
should be getting closer to finalized, and that we could start moving over
towards a 2.6.x release schedule..

moved dvb_usercopy() to dvb_functions.c -- this is essentially
video_usercopy() which should be generic_usercopy() instead... ;-)

Made the dvb-core in dvbdev.c work with devfs again. I had to
introduce some #if KERNELVERSION magic again here, sorry. I'll fix it up
with the next patchset.

Christoph Hellwig had some criticism of the code, and gave Michael some
suggestions. He also said,
"your devfs stuff is
a mess. I already told one of the DVB folks (it wasn't you IIRC) that I'll
publish a 2.5 devfs API on 2.4 header. But first I have to fix the devfs API
on 2.5 and randomly bringing back old crap and lots of ifdefs in those changing
areas won't help. What's the problem with 2.5, dvb and devfs?"
Michael
replied:

The main problem is that our development "dvb-kernel" CVS tree *should*
compile under 2.4 as well, because most of the dvb-users don't want to
participate in kernel development in general, but only on the development
of the dvb subsystem. So work is done on the "dvb-kernel" tree, which should
be synced with the 2.5 kernel frequently.

So, regarding devfs, I introduced #ifdefs around the functions that have
changed recently. That's not nice, I know. But in my eyes it's important to
keep the CVS and the kernel version more in sync.

IIRC Gerd Knorr has the same problems with his driver packages (regarding
the i2c subsystem mainly), but he has written some perl scripts to remove
the #ifdef stuff before submitting his patches...

Christoph felt that it would be best to delay the dvb updates, because
"you don't just add ifdefs (which give me lots
of rejects and you much uglier code than just using the compat header I'll
send to lkml once I'm done with the API changes) but you also change the
code that's ifdefed for 2.5 to reverse change I did. There is a reason why I
removed every occurrence of devfs_handle_t from all drivers and the particular
reason is that it will go away in the next series of patches."
Michael
asked how best to proceed, and after a little wrangling, it was agreed that
Michael should continue to send updates to either Christoph or Alan Cox,
bearing in mind that 2.5 features shouldn't be broken by Michael's updates.

Here are some TTY changesets that do two different TTY related
things:

fix the MOD_DEC/MOD_INC warnings for a number of tty drivers. These
patches were previously sent to the Trivial Patch Monkey, but seem to have
disappeared in its bowels somewhere, never to be seen again. These are
the majority of the changes (25 different patches), and were all written by
Hanna Linder.

Make the tty class code actually work. With these changes, the
/sys/class/tty directory shows all current tty devices, along with their
device numbers, and a link to their device in the sysfs tree, if it has a
physical location in that tree. Yeah, the implementation of yet another
list of devices within the tty core isn't the nicest, but until the tty
layer switches over to having tty devices in memory before they are opened
(like all other device subsystems), this is necessary. For 2.7, this extra
list will go away.