
Last Tuesday you asked Ingo Molnar, Red Hat kernel hacker, about the means by which his TUX web server recently achieved such fantastic results in SPECweb99. He was kind enough to respond with at-length answers addressing licensing, the reality of threads under Linux, the realism of benchmarks, and more. Thanks, Ingo!

1) TUX Architecture
by dweezil

"You appear to have taken an "architectural" approach to designing TUX, so I have some architectural questions.

1. The choice of a kernel-space implementation is probably going to be
a controversial one. You suggest that HTTP is commonly used enough
to go in the kernel, just as TCP/IP did years ago. What performance or
architectural advantages do you see to moving application protocols
into the kernel that cannot be achieved in user space?

Ingo Molnar:
The biggest advantage i see is to have encapsulation, security and performance available to dynamic web applications *at the same
time*.

There are various popular ways to create dynamic web content.
Encapsulation and security are provided by CGI and the various
scripting and virtual-machine models - unprivileged/prototype/buggy
CGIs are sufficiently isolated from HTTP protocol details, from the
webserving context, from the web-client and from each other.

But all CGI/scripting/virtual-machine models (including FastCGI)
lack the possibility of performing with 'maximum performance' if
the webserver is in another process context. 'Maximum performance'
means that there should be only one context running (no context
switching done), and the application writer should have the freedom
to use C code, or even assembly code. (Not that using assembly in
web applications would be too common.)

ISAPI and NSAPI dynamic webserver applications can have maximum
performance because they run in the webserver's context, but they do
not provide enough security - ie. a buggy ISAPI DLL which is loaded
into IIS's address space can take down IIS and the whole site,
including unrelated virtual sites. Not a very safe programming model.
Additionally, ISAPI/NSAPI modules often have to care about HTTP
protocol details and can cause non-conformant HTTP replies to be
generated. Debugging ISAPI/NSAPI modules is also hard and painful,
and nobody does it on live servers. [i intentionally omitted Apache
modules, because Apache modules are primarily used to extend the
capabilities of Apache; they are not meant to provide a platform
for dynamic applications.]

The TUX dynamic API is implemented through a kernel-based subsystem,
accessible via a system-call that is available to 'unprivileged'
user-space code. There is no context switching (ie. we can have
maximum performance), and still no user-space code can take the
webserver down, because kernel-space is isolated from user-space.
Obviously the TUX subsystem checks user-space parameters with extreme
prejudice. It's a goal of TUX to also enforce RFC-conformant replies
being sent to the web-client - just like CGIs. Plus TUX modules can
be in separate process contexts as well (in this case there is going
to be casual context-switching between these process contexts).

TUX also has kernel-space modules - for the truly performance-hungry and
kernel-savvy web coder. Some TUX features (such as CGI execution) are
implemented as TUX kernel-space modules. Here are the various layers of
modularity in TUX:

'accelerated requests', automated by TUX. (static GETs, etc.)

'kernel-space TUX modules'

'user-space TUX modules'

'fast socket-redirection to other webservers'

'external CGIs'

2. What is your approach to concurrency? In particular, you refer to
"event driven". What do you mean by that, and what did you choose as
the core of your event engine? Also, how do you handle threading to
scale on SMP machines?

Ingo:
This is how it works, simplified:

there is an 'HTTP request' user-space structure that is manipulated
by the TUX kernel-space subsystem. Whenever a requested operation (eg.
'send data to client', or 'read web-object from disk') is about to
block, TUX 'schedules' another request context and returns it to
user-space. User-space can put event codes (or other private state)
into the request structure, so that it can track the state of the
request. This programming model is harder to code than other, more
'synchronous' solutions, but avoids the context-switch problem.
Whenever there is no more work to be done by user-space, TUX suspends
the process until there are new connections or other progress.

3. Are there any plans to generalize the infrastructure elements of TUX so
that other protocol servers can take advantage of the TUX architecture and
also go fast as hell?"

TUX includes a variety of kernel and Apache changes. Can you give a rough
measure of how each of the changes improved the HTTP performance? I'm
interested in the amount of improvement as well as why it improved
performance. Do those particular changes have negative impact on the
performance of other applications?

Ingo: i have no exact numbers because TUX advances in an 'evolutionary' way
and closely followed/follows the 2.3/2.4 kernel line - the impact of
particular changes is hard to judge exactly. Here are a couple of
new features of the 2.4 kernel that are used by TUX and make a visible
performance impact (without trying to list all the improvements):

the single biggest change that enabled TUX was the inclusion of the
new Linux TCP/IP architecture into 2.3 early this year. This is the
'softnet' architecture, written by David Miller and Alexey Kuznetsov -
a very SMP-centric rewrite of the Linux TCP/IP stack. In Windows
speak: 'the Linux 2.4 TCP/IP stack is completely deserialized'.

other important changes for TUX were the many VFS cleanups and
scalability improvements done by Alexander Viro and Linus.

a patch from Manfred Spraul went into the kernel recently: the
per-CPU SLAB cache. (The concept is this: when freeing buffers the
kernel keeps them in variable-size per-CPU pools, and if the same
type of buffer gets allocated shortly afterwards then the kernel
picks from the pool that belongs to that CPU. This leads to
dramatically better cache-locality.) Per-CPU SLAB pools are used
by TUX, obviously.

another feature which made a noticeable difference is the I/O
scheduler rewrite by Andrea Arcangeli and Jens Axboe.

large file support (up to 2TB files) from Matti Aarnio and Ben
LaHaise is crucial as well - the logfile of the 4200 connections
SPECweb99 result was bigger than 5 GB.

there are other important scalability improvements all around the
place in 2.4 Linux, i think it's fair to say that almost every
single line in the 'main kernel' got replaced by something else.

TUX includes no Apache changes right now - the 'redirection' feature
can be used to feed connections to an (unchanged) Apache setup.
But enabling a future mod_tux.c was definitely a goal, to get deeper
Apache integration.

3) Caching
by JohnZed

I have a few questions about TUX's caching system. Before I go any further,
I want to say that I'm incredibly impressed by the results. I've been
following specWeb99 for a while and have been wondering when someone would
manage to build a great dynamic cache like this one. I hope it'll get the
wide acceptance it seems to deserve.

First, it seems that basically the entire test file set was loaded into
memory ahead of time for use by TUX. [...]

Ingo: no, this was not the case actually. In the 1-CPU Dell system the fileset
size was 4182 MB, RAM size was 2GB. In the 4-CPU Dell system the total
fileset size was 13561 MB, RAM size was 8GB. So the IO and cache
capabilities of TUX and the hardware were tested heavily. Eg. the
best Windows 2000 + IIS result had a fileset size of 5241 MB, with 8GB
RAM. (ie. fully cached, only logging IO.)

[...] How adaptable is TUX to more dynamic,
limited-memory environments in terms of setting cache size limitations,
selectivity (e.g. "cache all .GIFs, but not .html files"), and
expiration/reloading algorithms?

Ingo: i do not want to raise expectations unnecessarily, but it's an integral
part of TUX to perform async IO and 'cachemiss' operations effectively.
Cached objects are timed out by an LRU mechanism, so objects accessed
less frequently will get deallocated if memory pressure rises.

Second, can a tux module programmer modify the basic tux commands, or do
they always do the same thing? For instance, if I were adapting TUX to work
with a web proxy cache, I'd want TUX_ACTION_GET_OBJECT to actually go out
over the network and do a GET request if it couldn't find a requested
object in the cache. You can imagine lots of other circumstances where this
would come up as well. [...]

Ingo: TUX is a webserver, and has no proxy capabilities yet. It would be a
reasonable and natural extension of the GET_OBJECT mechanism to fetch
objects from a cache hierarchy or from origin servers, yes.

Third, is it possible to execute more than one user-space TUX module at
one time? [...]

Ingo: yes. TUX modules right now are compiled as shared libraries and thus
can be loaded/unloaded at runtime. An unlimited number of user-space
modules can be used without performance impact.

Fourth, when can we play with the code? Thanks a lot!

Ingo: SPEC rules which regulate the acceptance of not-yet-released products require us to release TUX sometime in August - the whole TUX
codebase is going to be released under the GPL.

4) Integration into RedHat?
by ErMaC

How will the TUX Webserver integrate with RedHat's Linux distributions?
Will RedHat create a special distribution with an identical setup to yours?

Ingo: there are going to be RPMs which can be installed on Red Hat 6.2, plus
source-code which can be used with any distribution. Since TUX is a
kernel-subsystem mainly, TUX is fundamentally distribution-neutral.

Will RedHat start releasing more specialized distributions, preferably ones
more suited to a secure server environment but focused on performance like
your setup was?

Ingo: well i cannot comment on specific products (being a kernel guy), but
we always try to maximize the security and performance of all our
products. TUX itself can be safer than user-space webservers, simply
due to the fact that the kernel is a much more controlled, predictable
and dedicated programming environment. Nevertheless we try to minimize
the amount of code put into the kernel.

How would TUX perform using CGI/Servlets/PHP/etc. compared to Apache or
IIS? The ability to serve static pages fast is not that useful in the real
world, as all the sites that get really big hits-per-second are those with
dynamic content (Yahoo, Slashdot, Amazon.com, etc.)

Ingo: SPECweb99 is based on real-life logged web-traffic and as a result of
that it includes 30% dynamic content -- which dynamic content uses the
same fileset as static replies. So the dynamic workload of SPECweb99 is
neither trivial, nor isolated.

as explained earlier, TUX is designed to generate very fast dynamic
content; the possibility to accelerate static content is a natural
step enabled by the kernel-space HTTP protocol stack. After all, the
kernel does eg. the TCP handshake fully in kernel-space as well, and
the user gets the simpler 'connect()' functionality. Nevertheless
static content is still very important: if you take a look at the
workload of really big sites like Yahoo, Slashdot or CNN, you'll
notice that lots of content (especially content that attracts many
hits) is still static. But TUX tries to provide an environment for
'next generation' web-content (ie. web applications), where dynamic
output is just as natural as static.

Unix programmers seem to dislike using threads in their applications.
After all, they can just fork(); and run along instead of using the
thread functions. But that's not important right now.

What is your opinion on the current thread implementation in the Linux
kernel compared to systems designed from the ground up to support
threads (like BeOS, OS/2 and Windows NT)? In which way could the
kernel developers make the threads work better?

Ingo: that's a misconception. The Linux kernel is *fundamentally* 'threaded'.
Within the Linux kernel there are only threads. Full stop. Threads
either share or do not share various system resources like VM
(ie. page tables) or files. If a thread has 'all-private' resources
then it behaves like a process. If a thread has shared resources
(eg. shares files and page tables) then it's a 'thread'. Some OSs
have a rigid distinction between threads and processes - Linux is
more flexible: eg. you can have two threads that share all files but
have private page-tables, or you can have threads that have the
same page-tables but do not share files. Within the kernel i couldn't
even make a distinction between 'processes' and 'threads', because
everything is a thread to the kernel.

This means that in Linux every system-call is 'thread-safe', from
the ground up. You program 'threads' the same way as 'processes'.
There are some popular shared-VM thread APIs, and Linux implements
the pthreads API - which, btw., is a user-space wrapper exposing
already existing kernel-provided APIs. Just to show that the Linux
kernel has only one notion of 'context of execution': under Linux the
context-switch time between two 'threads' and two 'processes' is the
same - around 2 microseconds on a 500MHz PIII.

programming 'with threads' (ie. with Linux threads that share page
tables) is fundamentally more error-prone than coding isolated threads
(ie. processes). This is why you see all those 'lazy' Linux programmers
using processes (ie. isolated threads) - if there is no need to share
much state, why go the error-prone path? Under Linux, processes scale
just as well on SMP as threads.

the only area where 'all-shared-VM threads' are needed is where there
is massive and complex interaction between threads. 98% of the
programming tasks are not such. Additionally, on SMP systems threads
are *fundamentally slower*, because there has to be (inevitable,
hardware-mandated) synchronization between CPUs if shared VM is used.

this whole threading issue, i believe, comes from the fact that it's
so hard and slow to program isolated threads (processes) under NT
(NT processes are painfully slow to create, for example) - so all
programming tasks which are performance-sensitive are forced to use
all-shared-VM threads. Then this technological disadvantage of NT is
spun into a magical 'using threads is better' mantra. IMHO it's a
fundamentally bad (and rude) thing to force some stupid all-shared-VM
concept on all multi-context programming tasks.

for example, the submitted SPECweb99 TUX results were done in a
setup where every CPU was running an isolated thread. Windows 2000
will never be able to do stuff like this without redesigning the
whole OS, because processes are just so fscked up there, and all
the APIs (and programming tools) have this stupid bias towards
all-shared-VM threads.

7) Kernel modules decrease portability?
by 11223

You mentioned in the second Linux Today article that you intend to
integrate TUX with Apache. However, Apache has always been a cross-platform
server and is heavily used on *BSD and Solaris. Do you feel that this
integration will undermine the portability work of the Apache team, or will
it simply provide an incentive for web servers to be running Linux? [...]

Ingo: TUX is a kernel subsystem with a small amount of user-space glue code to
make it easier to use the TUX system-call. I believe that integrating
kernel-based HTTP protocol stacks into Apache makes sense - i don't think
this will 'undermine' anything; to the contrary, it will enable similar
solutions on other OSs as well.

If you intend to encourage people to move to Linux, can a similar
idea as TUX be applied to an SQL server to make up for the speed
deficit between Linux SQL servers and Microsoft SQL?

Ingo: i don't care about Microsoft SQL Server being too slow. ;-)

Database servers are not networking protocols - so the concepts of
TUX and SQL servers cannot be compared. While there are a couple of
simple RFCs describing the HTTP protocol, SQL is a few orders of
magnitude more complex, has programming extensions and all sorts of
proprietary flavors.

It works with Apache but is TUX generic enough to be interfaced with
another server?

Ingo: yes, i think so.

10) Version 1
By An Unnamed Correspondent

"This is version 1 of the web server, and it has proven itself to be
pretty nifty when it comes to serving both static webpages (through a
kernel level httpd) and dynamic webpages. Do you see TuX getting more lean
and faster as time wears on, past versions 2, 3, ... or do you see it
getting bogged down in mostly unnecessary cruft and bloat?

Ingo: actually, if you watch the development of Apache, Apache got faster
with every major release. No, i don't think that additional features
will slow TUX down; once the internal architecture is extensible
enough, there is no problem if additional features are *added*. There
is an obvious performance penalty if you *use* a given feature (eg.
SSI) - but a webserver is badly designed if it gets slower doing the
very same task.

Will there be a way to port an existing Apache configuration across to the
TuX configuration? How about IIS, Netscape, Zeus, etc? Will TuX have the
option of a GUI setup screen for those who don't like the command line?
Will TuX have a simple installer?

Ingo: i don't know. The initial release will have no 'fancy' tools; this
will implicitly 'encourage' the early adopters to be technically
savvy (and thus help TUX development indirectly). The initial TUX
release is expected to be raw and uncut.

Actually there is a specific feature that would probably make TUX
incompatible with the BSDs. TUX is licensed under the GPL and the BSD
maintainers would probably be very reluctant to port it to their OSes.
Especially since it is possible that this would require them to release the
derivative work under the GPL.

Which leads to the obvious question for Ingo. You mention a specific
disclaimer that would allow Apache to be linked with TUX; do the BSDs
get the same privilege?

Ingo: TUX is not 'linked' to Apache. Apache can use the TUX system-call, and
applications are 'isolated' from the GPL license of the kernel. Can
BSD-licensed software use the TUX system-call? Of course!

Not that I particularly care, as I am not a BSD user, but putting such
a nifty program as TUX under the GPL is bound to cause weeping and gnashing
of teeth in the BSD camp. Which brings up another question: how much
pressure do you get from your BSD compatriots to release software like this
under a more liberal BSD-friendly license?

Ingo:
TUX has to be under the GPL because it's a Linux kernel subsystem, and
because Red Hat releases all source code as open-source. That said,
i wouldn't write Linux kernel code if i didn't agree with the
ethics of the GPL. Putting it simply, the GPL guarantees that all
derivatives of source code that *we* have written stay under the GPL
as well.

Isn't this a reasonable thing? Nobody is forced to use *our* source code
as a base for their project, but if you freely chose to use our
source code, then isn't it very reasonable to ask for the same kind of
openness that enabled you to use this source code in the first place?

To put it another way: you get our source code only if you agree to be
just as open to us as we are open to you. You always have the option
to not use our source code. You are free to weigh the benefits of using
our source code against the 'disadvantage' of having to show us your
redistributed improvements.

but again, the GPL is only about *our* *source* code. You are
completely free to put your source code under whatever license you
wish. I respect BSD kernel hackers and closed-source kernel hackers
just as much as Linux kernel hackers, based on the code they produce.
It's their private matter to decide under what license they put their
code. It's their God-given right to do whatever they wish with the
source code they wrote.


So... basically what you're trying to say is that it's like a ham sandwich, right?

Yes you can. You can charge in the same way as shareware developers can charge for fully-working copies of their software.

You say, "You may distribute this software freely under the terms of the GPL. I would appreciate $20, as this is how I make my living and it would encourage me to do more of this work."

Maybe 99% of the people don't pay, but you'll get some.

Alternatively, you can charge whatever you want for GPL'd software and not give it to them until they pay. They can then go distribute it if they want, but they'd have to be motivated enough to do so. The only difference with commercial software in this regard is that it's not legal when they redistribute commercial software.

In years past, folks were worried that once RedHat became the dominant commercial distribution, they'd leverage their popularity and start closing up the source whenever possible. If Ingo provides any indication of what general attitudes are at RedHat about being dedicated to opensourcing all their code, then such fears are unfounded -- in response to question #11, Ingo waxes eloquent about GPL philosophy with only the slightest provocation. It looks like it runs a little deeper than enlightened corporate self-interest, bordering almost on zealotry, and I can't say I'm displeased.

You most definitely may charge for a GPL'ed program -- witness Red Hat, SuSE, et al. You can also dual-license it if you're the author -- GPL it, but also release it under a commercial license to people who want it under terms incompatible with the GPL.

Your posited GPL variant is really nothing more than a "community" license, anyway. Once redistribution can only be done with the explicit consent of the copyright holder, or to people already holding a license, then the whole point of the GPL has evaporated.

What bothers me about a lot of this discussion is the implication that if the author doesn't want to make money off it by keeping it secret, the least s/he can do is not be a spoilsport and require that others using the work play by the same rules. It comes across feeling like making money is the Highest Calling in Life, and helping others to make money at others' expense is next. Therefore, people releasing code under non-copyleft free licenses are doing a good deed, because they're enabling others to make money by basing proprietary products around it. Sorry, that won't wash.

If you're the AUTHOR of the code, you can, as I stated above, release it under any license you please. If you really want commercial advantage from it, you shouldn't be using the GPL, and so talking about a watered down GPL is nothing more than trying to use the good name (or "brand equity" for those of suitable disposition) of the GPL to bless outright commercial ventures having nothing to do with the goals of free software. Use a community source license or what have you, but don't try to invoke the name of the GPL.

If you're not the author, and your real aim is to use GPL'ed code in your proprietary product, then tough. As long as copyright exists, and gives the author the right to control distribution, an author is entitled to use the GPL to control the terms of the redistribution (or, in this case, ensure a controlled lack of control).

Finally, for the BSD folks (not all of them, but a vocal minority) who complain that Linux can freely take from BSD but that BSD can't put GPL'ed code in its kernel: that's the way you wanted it, right? The whole idea of the BSD license is to *permit* other people to use your code with basically no strings attached. Why is it suddenly so bad when Linux plays by those rules, but when Apple does the same thing, it's perfectly fine? If you really want to ensure that everyone using your code plays by the same rules you do, then use the GPL or the like!

I certainly have no problem with BSD, or with people using the BSD or MIT licenses for their work, but they shouldn't complain quite so loudly about other people following their choices to logical conclusions. And again, I don't want to paint all BSD advocates with the same brush, but there are some people who are rather vocal about this.

The Linux TCP/IP stack requires that all data be copied from user space to kernel space before transmission (and a server's ratio of transmit to receive traffic is roughly 10:1). In NT, no such copying needs to occur, and NT can take advantage of some of the newer NICs (eg. from Adaptec and Intel) which can offload things like TCP/IP checksumming to hardware. Solaris, of course, has had this capability for ages; Linux has no such capability at this time.

Adding support for zero-copy transmit in Linux will be a major chore, since all of the networking stack and drivers are designed around skbuffs which contain only a single linear buffer per packet, whereas other implementations (eg. BSD, Solaris, NT) allow fragmented packets where the header is in one fragment and the payload is in other fragments (the payload itself can be fragmented if it crosses a page boundary). Linux also lacks support for locking user-space pages for I/O operations.

Having procfs _replace_ sysctl is A Bad Idea.

I understand complaints about numbering sysctl branches and nodes, but it's not as big a deal as it's made out to be - in practice these hardly ever change drastically.

The fact is, being able to get your data with one or two syscalls is fast. A lot faster than having to drag the VFS code into things.

Procfs is great for users who want to read things, but programs need to have a lower level interface for the sake of speed, and code simplification. It's a lot easier to use sysctl than it is to open up a file and read the data from it, hate to break it to you.

And for those who somehow believe that the Linux TCP/IP stack is catching up, try packet capture/shaping on a Linux TCP/IP stack. Not fun. There are reasons why anyone doing serious network software would not choose the Linux TCP/IP stack... ask Juniper Networks why they chose FreeBSD. ;)

Now that I've seen Ingo's answers, I think that the question about versions of TUX for other OSes turned out to be phrased poorly. The questions I think were intended were:

1) Is the TUX interface (now revealed as the TUX system call) something that can be used and implemented by non-GPLed software? The use case has been answered affirmatively, and I can't see any reason why the answer would be different for the implementation case, but it hasn't been addressed yet.

2) If other OSes decide to implement the TUX system call (or another TUX-like interface), are you willing to work with other OSes to keep the interfaces compatible (to ease the headaches for software running on top of these interfaces, like Apache)? Again, I don't see why not, but it hasn't been addressed.

If you are using processes and they do need to share memory, a memcpy() (or two) - which also has the nice property of stomping all over your cache, from both the function call and the data being moved around - is slower than cache-coherency hardware, which should selectively move cache lines between the CPUs.

If you are using two threads that do not share any writable memory, they do not (or at least should not) invoke any cache-coherency traffic. If the shared pages are read-only (such as the pages holding the program text), cache coherency isn't a problem either, because writes, not reads, are what cause cache-coherency problems.

If you are using separate processes, for every context switch you have to reload the MMU's VM pointers (which takes time), which will blow away your TLBs. If you switch between two threads of the same process, you shouldn't have to reload the MMU, and the TLB is still valid because it is the same address space.

He didn't dispel anything you couldn't read in the clone man page.
In fact, what he's spread is a bit of FUD, in implying that threads won't provide simplicity in design and better performance if used correctly.
He also misstates the performance difference between NT and Linux in terms of process startup time. There is a very minuscule difference.
NT's threads, though, best Linux's.

In any event, prepooling threads and reusing them will provide better performance than on-demand construction of processes or threads. Doing this with threads is easier and faster than with processes, since it doesn't involve costly IPC mechanisms.

On a machine with multiple processors, this will work much better than spawning multiple processes, or multiplexing descriptors in a single process.
On a machine with one processor, processes and threads can both be considered costly in comparison to a properly designed single-process design.

> Alternatively, you can charge whatever you want for GPL'd software and not give it to them until they pay. They can then go distribute it if they want, but they'd have to be motivated enough to do so. The only difference with commercial software in this regard is that it's not legal when they redistribute commercial software.

Not completely true: you are allowed to charge as much as it costs for the media (last time I checked), but if you distribute, you must make the source code available.

> Is the TUX interface (now revealed as the TUX system call) something that can be used and implemented by non-GPLed software?

Of course. It can be used by non-GPL software because it is a part of the operating system and therefore acts more like LGPL than true GPL.

It can be implemented by non-GPL software as well. This should be obvious - licenses can't stop you from reimplementing the software yourself (certain software companies [microsoft.com] seem to think they can, but they can't) and releasing it under whatever license you see fit.

Given that a benchmark as popular as this will tend to have vendors adding, uhh, "features" to make their webservers run faster for the benchmark, how did he manage to beat them anyway? Did he modify the TCP/IP stack? DoS the other servers during the test? Connect a compulsator to a large coil?

Mentioning "threads" and "simplicity in design" in the same sentence? Hahahaha. In my experience, threads must be handled with extreme care, and require extremely careful and clever design.

The problem is that reasoning about asynchronous computation is inherently hard, and to minimize the amount of reasoning you have to do, you need to carefully limit the communication between your asynchronous components. A shared-everything thread model actively discourages and obstructs this goal.

> NT's threads, though, best Linux's.

Perhaps. I've heard a Linux kernel hacker claim that Linux switches processes faster than NT switches threads. He could have been lying, but it should be easy enough to check. If NT has an advantage, it's in the IO completion ports and other thread-related tricks.

You raise interesting questions which were not explained in the article, and i think under Linux there are easy answers to them:

1) if you need to share memory then you can do it with isolated processes as well - just use mmap(MAP_SHARED). The point is to have only as much sharing as absolutely necessary - to avoid any unwanted interaction between threads (isolate them as much as possible).

2) the MMU reload is not an issue on SMP if you have isolated threads on every CPU, because there will be no context switches. You can use shared threads on a per-CPU basis to avoid the TLB overhead. Btw, the Linux kernel avoids the TLB overhead by using global TLB entries ('global pages', available on PPro and later x86 CPUs) which survive even context switches between isolated threads. Anyway, the TLB-reload problem is a short-term x86 issue only - modern RISC CPUs and IA64 use context-tagged TLBs which survive context switches.

TLB flush can be expensive, and that is why shared-everything threads are useful in some situations. However, if you are frequently scheduling different threads onto the same CPU, your system is probably poorly designed. In an optimally designed and configured system, you have exactly one thread (or process) pinned to each CPU, each one using non-blocking/asynchronous system calls and callbacks (e.g. signals or NT's IO completion ports) to service many different requests simultaneously.

Seriously, the only reason public-domain is miscible with GPL is that the GPL authors can take the public-domain code and release modified versions under the GPL that aren't public-domain. So a GPL program which includes some public-domain pieces is distributed as a whole under the GPL, even though the original public-domain pieces remain public-domain on their own.

This isn't flamebait. It's just one of the primary issues I have with public-domain.

Most servlet engines interface with the webserver via a custom module that forwards the requests it receives to the JVM running the servlet engine. From what I can tell, TUX would speed up the communication until it hit the JVM. Unfortunately, this communication is extremely fast anyway; the main speed problems come from the JVM itself being slow. Don't expect any noticeable speed improvement from using servlets with TUX.

Cache coherency has nothing to do with it, really, as long as every CPU can access every part of memory equally fast in the cache miss case. The cost of cache coherency is taken by any and every inter-CPU communication. Keeping the inter-CPU communication down is easier to do with separate address spaces, because you're less likely to share something accidentally, but a well-written shared-everything thread model will do just as well.

For NUMA machines, it may be advantageous to make copies of shared structures rather than sharing the same memory across CPUs, especially if the structures are read-only. (Each CPU can have its own copy of the data "close to" that CPU.) But again, you can make these copies even in a shared-everything thread model, if you're clever.

Part 2 of Question 7 asked about using a TUX-like mechanism for SQL. Ingo responded that SQL is more complex, but I'm positive that there is a simpler answer...one that Microsoft has already implemented.

Microsoft included MTS (MS Transaction Server) in the kernel of Windows 2000, and SQL Server 2000 will reportedly use it. MTS can be used to distribute transactions across machines, and since it is in the kernel space, it runs very fast.

Finally, note that MTS can be used by COM/COM+ objects, not just SQL server. To quote MS: "Transaction services include code for connectivity, directory, security, process, and thread management, as well as the database connections required to create a transaction-aware application." For the whole article GO HERE [microsoft.com].

I think that someone should take a look at this. It could be of great benefit for mySQL, PostgreSQL, and even Bonobo. Note that since I'm not a kernel programmer, I'm probably WAY out of my league here, but I'd be willing to help.

You simply cannot charge for a GPL'ed program. You can charge for *support*, but if the product's good enough, it shouldn't *need* that support. (Makes you think about the quality of certain Linux distros, eh?)

Are you smoking crack? Of course you can charge for GPL'd programs! Let's say I write a version of GCC that works super-duper great for the Palm. Then I call up 3Com and say "give me $1,000,000 and you can have a copy, otherwise it's /dev/null for this baby..." Of course, you can only sell it once, but who cares?

And wtf are you talking about with support? Oracle makes a bundle with support, and a lot of people swear by their stuff. Microsoft probably doesn't make anything from support, and no one swears by that stuff...

Don't know about current, but this link [linuxjournal.com] in LJ shows that as of the date of the article, NT creates a thread in 0.9ms, while Linux takes 1.0ms to create a process. However, this table [linuxjournal.com] shows the effect of some changes to the scheduling - under light load conditions (small run queues), Linux switches processes (much more secure & protected) faster than NT switches threads. Read the main article [linuxjournal.com]. The changes happened in the 2.0.x series, so hopefully it got even better during 2.2, never mind 2.4.

I'd like to add the fact that the Linux kernel creates a new shared thread in 0.01 milliseconds (10 microseconds) on a 500 MHz PIII. Forking a new process (isolated thread with new page-tables) is about 0.5 milliseconds on the same box, using the latest 2.4 kernel.

You're only required to make the source available to people you distributed it to. So if you want to sell your binaries, support, etc. for $10,000, you are only obligated to distribute the source for no more than reasonable media cost to the people you distributed the binaries to. If those people in turn redistribute the binaries, they're on the hook for the source to those downstream folks, you're not.

Other than the fact that you're not allowed to make it materially harder for people to get the source than to get the binaries, you're allowed to charge whatever you please. RMS has been quite clear that any restrictions on how much someone may charge for the software are a violation of the GPL.

=====
The Linux tcp/ip stack requires that all data is copied from user space to kernel space before transmitting it (a server's ratio of tx to receive is roughly 10:1). In NT, no such copying needs to occur
=====

The presence of one unnecessary copy operation in the Linux stack cannot be used as the sole evaluation of that stack's speed or reliability. While the Linux stack has this one issue, the NT stack has been laden with issues, as I pointed out in my response to your original post of this message a few weeks back. Stability issues that can result in a denial of service are discovered in the NT stack on an almost monthly basis. There were at least a dozen stack-related DoS conditions for NT's stack posted to bugtraq over the last year while there may have been two or three at the most (if that) for linux. That indicates a serious difference in reliability which, as opposed to the presence or absence of a copy operation, can actually be used to evaluate the effectiveness of a stack in a real-world situation.

Not completely true; you are only allowed to charge as much as the media costs (last time I checked)

This is false. The original poster is correct. You can refuse to give the software to anyone until they pay whatever sum you demand. Your second portion is correct, though:

but if you distribute you must make the source code available.

Whoever you decide to distribute to has full rights to distribute to whomever they want (potentially without charging) without paying you a royalty, and you must make the source code available to them for no charge. If they distribute, then they must (independently!) make the source code available for no charge as well (a link to the original's web page is not sufficient).

Linux 2.2 and later have IO completion ports; they're called "queued realtime signals" on Linux. From the fcntl(2) man page:

F_SETSIG

Sets the signal sent when input or output becomes possible. A value of zero means to send the default SIGIO signal. Any other value (including SIGIO) is the signal to send instead, and in this case additional info is available to the signal handler if installed with SA_SIGINFO.

By using F_SETSIG with a non-zero value, and setting SA_SIGINFO for the signal handler (see sigaction(2)), extra information about I/O events is passed to the handler in a siginfo_t structure. If the si_code field indicates the source is SI_SIGIO, the si_fd field gives the file descriptor associated with the event. Otherwise, there is no indication which file descriptors are pending, and you should use the usual mechanisms (select(2), poll(2), read(2) with O_NONBLOCK set etc.) to determine which file descriptors are available for I/O.

By selecting a POSIX.1b real time signal (value >= SIGRTMIN), multiple I/O events may be queued using the same signal numbers. (Queuing is dependent on available memory). Extra information is available if SA_SIGINFO is set for the signal handler, as above.

Using these mechanisms, a program can implement fully asynchronous I/O without using select(2) or poll(2) most of the time.

Thanks for M$'s spin on things:)
I dunno, but I'm gonna guess that you don't read the Linux Kernel mailing list? It appears, from discussions on the list, that checksumming in hardware is supported on receive, but on transmit it really doesn't matter; in fact, on transmit, it appears that checksumming in hardware is actually undesirable, as this quote from Alan Cox would indicate:
Until you can lock user pages down nicely and DMA from them it is not a performance improvement to checksum in hardware [on transmit].
-Harold

This illusion lasts until the first race condition that you need to debug.

For those who do not know, a race condition is a case where two things (threads, processes, whatever) try to do the same thing at the same time when it is only safe for one to do it. Take incrementing a count of things in use: the first thread reads the current count and gets swapped out; the second reads the same count, increments it, and writes it back; the first comes back, increments the stale value it read, and writes that back. Two increments happened, but the count records only one. With a reference count, that kind of lost update can lead to premature freeing of memory, or to memory leaks.

While race conditions can happen with anything, the more you share, the more of a possibility you have for hitting them.

Now what are the characteristics of a race condition? Why they are quite simple:

1. They cannot be reliably reproduced.
2. The problem shows up somewhere different from the real mistake.
3. The probability rises sharply under load.

The last point is interesting. The odds of hitting a race grow with the number of pairs of competing things, which rises roughly quadratically with the thread count. So with 2 threads you may wait many years to be hit by a race that nabs you every week once you have 50 threads going.

Anyways, look at those items for threads and think of what this looks like from the other side. Hmmm. Bugs that show up under load and that nobody can figure out... does that sound familiar?

Quite simply, the task of optimising your threads or processes (however you want to make the distinction) is different for different OSes that handle them differently, and for different hardware with different capabilities. Somehow I don't think this is news :) That's why a good programmer is a good programmer. I can crank out bad C code all I want; a real artist faced with implementing the same algorithm would turn out something faster and probably smaller. That same skilled programmer would be better able to optimise the code for an OS or a hardware setup, and would likely be able to incorporate those tweaks into the code base in a suitably encapsulated form so they'd properly compile when the capabilities they required were present. I'd get lost in that task after the first half dozen #ifdefs.

This whole threads vs processes thing is REALLY getting old, since it all boils down to "I want to do it the way I'm used to, and since I don't want to learn any alternatives they must all be bad." It doesn't matter a goat's fart to me whether Jack spawns processes or Jane creates threads, so long as their code compiles on my server and gives me the performance I need without me having to mess with it too much.

# human firmware exploit
# Word will insert into your optic buffer
# without bounds checking

Zero-copy TCP has been discussed extensively on the Linux-kernel mailing list. You might want to check out this post by Linus [tux.org], who disagrees with zero-copy in most cases. Another post, from someone at CMU, [tux.org] has a fairly good argument in favor of zero-copy, but Ingo Molnar (to get back on topic), responds to him (taking a moderate position) in this post [tux.org].

Personally, I tend to fall into the camp which believes that sendfile() and other, specialized interfaces should use a copy-free approach, but there's no need to add this special case code into every single frickin' corner of the IO system.

Red Hat's Mission Statement / Corp Policy or whatever is that (from http://redhat.com/about) ``Red Hat shares all of its software innovations freely with the open source community under the GNU General Public License (GPL).'' Indeed, they tend to not talk about themselves in terms of a linux distribution company, but a company doing R&D, Integration, and distribution of Open Source software in general. Hence, the merger with cygnus.

In an optimally designed and configured system, you have exactly one thread (or process) pinned to each CPU, each one using non-blocking/asynchronous system calls and callbacks (e.g. signals or NT's IO completion ports) to service many different requests simultaneously.

Yeah right!

I am implementing a SQL based database server on Linux (Mimer [mimer.com]). I would love to be able to do it that way, but it would require a new level of asynchrony from the Linux kernel. In my opinion, one problem with most UNIX OS implementations is that the OS thinks it can suspend CPU processing too easily. How could the database server schedule (say) 10 asynchronous file read or write requests and be notified when they complete? And how do you do efficient and scalable asynchronous net I/O? poll() and select() don't scale well when you have thousands of simultaneous connections. I am not aware of any I/O completion port architecture on Linux.

I understand that TUX is able to get much greater performance under Linux by not using the conventional kernel APIs and doing the stuff directly in the kernel. But I feel very reluctant about moving an entire SQL database server into the kernel. There must be better ways...

Another problem with the "optimal" single-process event-driven approach is that it tends to turn your code "inside-out". Suppose you have subroutines that call each other, and 20 routines deep on the stack you get a database cache miss and need to perform an asynchronous database disk read. How do you reschedule to another task in the process?

user-mode threads

Turn your code inside-out and resolve all cache misses at top level (ugh!) (or set a flag and return up to the main dispatcher? And how do you get back?)

You're an idiot.
Juniper uses ASICs to do all the packet forwarding. The BSD-based JUNOS is strictly for out-of-band management, creating configs, logging, etc., and has absolutely nothing to do with moving real packets.
Kashani

well this looks much like nca (network cache accelerator, or something) for Solaris (new in 8, but existed in 7 with the netra isp pack), a similar implementation of an in-kernel http server. As I understand it, it communicates with a user-space web server using "solaris doors" and currently only Sun Web Server can talk on that interface (well, probably iplanet too). Supposedly Apache will have a patch soon.

Network Cache Accelerator:
The Network Cache Accelerator increases web server performance by maintaining an in-kernel cache of web pages accessed during HTTP requests. NCA provides full HTTP 1.1 support in the kernel by either handling the request or passing it to the web server for processing.

So where's the tux equivalent for BSD (and shush, all you license troublemakers)?
-o

Finally, for the BSD folks (not all of them, but a vocal minority) who complain that Linux can freely take from BSD but that BSD can't put GPL'ed code in its kernel: that's the way you wanted it, right? The whole idea of the BSD license is to *permit* other people to use your code with basically no strings attached. Why is it suddenly so bad when Linux plays by those rules, but when Apple does the same thing, it's perfectly fine? If you really want to ensure that everyone using your code plays by the same rules you do, then use the GPL or the like!

A few reasons:

1) In relation to Open Source, you will notice that BSD can share with GPL, but GPL cannot share with BSD. This is at the same time GPL advocates are advocating how free their code is. If they are ALWAYS talking about sharing, why can't I use the code in my BSD project(s), which is open source?

2) The GPL is seen as a license with an agenda. The NSI agreement also appears to have an agenda [domainname...sguide.com]. Many people distrust NSI due to this.

3) Closed-source projects rarely (if ever?) take code they acquired from an open-source project and distribute it under another open-source license. It is somewhat a slap in the face.

4) Most GPL advocates are stronger believers in copyright laws than BSD advocates even though RMS started the copyleft license to counter copyright.

I think you shall start seeing more licenses like OpenSSL's. It is BSD-like, but states that it must be distributed under its own license if distributed as open source.

Linux 2.2 and later have IO completion ports; they're called "queued realtime signals" on Linux.

This is actually very great stuff! This will probably be used in the next version of our database server.

The main problem we have is that we need to do asynchronous I/O on FILES also! And to make it more interesting, we want to queue several asynchronous read or write requests to the SAME database file. So if signals are used to complete the file I/O transaction, the si_fd field would not be enough to identify the I/O.

Does anyone know of any work to add asynchronous disk or file I/O to Linux?

Remember, you can charge all you want for it, but you have to give them the source. You can make people pay for the binaries, but of course, someone else can give away binaries. Generally, the way people make money selling things that include GPL code is by selling binaries, media, documentation, and so on.

In relation to Open Source, you will notice that BSD can share with GPL, but GPL cannot share with BSD. This is at the same time GPL advocates are advocating how free their code is. If they are ALWAYS talking about sharing, why can't I use the code in my BSD project(s), which is open source?

You can, but then you have to release as GPL. The GPL may be seen as a beneficial virus -- depending on your definition of beneficial.

Anyway, the point of the GPL is to let you use the code, but to make sure that the code makes it back out into the world. Is this anything other than noble?

Most GPL advocates are stronger believers in copyright laws than BSD advocates even though RMS started the copyleft license to counter copyright.

The GPL is good. RMS is not god. Copyright laws are bullshit, but I believe in a certain amount of copyright protection. To elaborate: if someone else makes money off of my work, that's bad. But if someone copies something I'm charging for, which they would never have paid for, and uses it for purposes that do not include making money or giving it to someone who would have paid for it, I have no problem with them. They are not stealing, because they are not depriving me of any revenue I would have received had they not copied it.

Anyway, I didn't say this was a *nice* or easy way to write the system. It's just the best performing way. The interesting point is that the easiest (and slowest) and the hardest (and fastest) way to build the system uses a minimal number of shared-everything threads.

Cute, but you can make an excellent argument that the MIT position is that of a pure scientist while the Unix approach is that of a solid engineer. (Do you really want to spend a month optimizing some obscure feature of your compiler -- only to discover that *nobody* tickled that code for two years? - it happened, see Programming Pearls.)

I think that even the most hard-core "MIT" proponent would agree that Unix is very much complete, robust, consistent, and everything else good and wholesome in the universe when compared to "the Operating System found on 98% of the World's PCs!". Windows, of course, is an example of an operating system designed by the marketing department.

Since the other posters haven't helped, let me try my take on actually answering the question (which, IMHO, is a reasonable one):

In a SMP (Symmetric Multi-Processing, that is 2 or more CPUs) system, each CPU has its own caches. One of the biggest performance hits in a modern system is a cache miss: having to go out into main memory to fetch something that wasn't in the cache. This is a big issue on SMP systems, since all of the CPUs have to compete for memory bandwidth.

SO, anything that will reduce the number of cache misses is a Good Thing.

The per-CPU SLAB cache attempts to do this by keeping track of a set of buffers that are frequently needed. As buffers are allocated and then freed, the code tracks the freed buffers separately for each CPU. When a new buffer is needed, the code tries to find one that was recently freed by the same CPU that needs the new buffer. That way, the memory the buffer occupies is probably still in that CPU's cache, accessing it doesn't cause a cache miss, and the whole thing runs faster.

As you say. If everything is designed properly, and implemented correctly, then threads give you better performance and frequently a simpler design as well. Which is pretty much what I said before you.

As you say. Sun uses blocking IO in Java to keep people from making design mistakes with threads. Which is pretty much what I said before you.

You don't say what happens if a mistake happens. I did. At length.

If you believe that mistakes never happen, then you are living in a fantasy world. Even if you were a perfect programmer, can you guarantee that every co-worker is? Heck, the Linux kernel folks regularly uncover race conditions, and it is not for nothing that people like Alan Cox and Linus Torvalds believe that very few programmers understand races well enough to properly avoid them.

In fact you may not have avoided them either. Do you know? Are you sure? Do you double-check everything that you are doing for correctness? Perhaps there are silent untested errors in your production code that just don't get hit very often?

Why does every competent programmer I know or have heard of from Knuth on down claim that they make mistakes and don't believe that there is anyone who doesn't?

Just today, I managed to get 3400 requests per second (1000 requests, sent ten at a time) out of Resin [caucho.com], a pure Java, open-source Java servlet 2.2/JSP engine. This was on a dual 600MHz PIII machine.

I'm sorry, but you're a twit. I've never taken to defending ESR before, but he is MOST ASSUREDLY better than the Chinese leadership. Yup. Fer sure. Not to say anything against your average person in China, I have the utmost respect for them and their history, but I'll take a gun-totin' libertarian anyday. In fact, the most significant effect of RMS (who I disagree with on MANY points) will be the "freeing" of software for China, India, and other such countries. It puts the pieces in place to totally demote the US as a major computer tech. source. As such, the GPL and RMS will be seen by history as one of the most major contributions to developing countries ever and/or the biggest act of sabotage to the Western economy ever.

MTS is a mechanism for running COM(+) objects in a transactional environment. By itself, it won't be of any benefit to a traditional SQL server system. To benefit from MTS and from running in the kernel, a program would have to rely heavily on COM objects.

Migrating a setup like this to Linux would require (a) standardizing on a component architecture like COM (which might be a good idea), (b) developing an in-kernel MTS equivalent for Linux (perhaps the Linux community could do a better job), and (c) rewriting software such as SQL servers to take advantage of this inevitably Linux-specific feature (OK, now we're fantasizing). This would be an unprecedented architectural change in direction for Linux, and is unlikely to happen anytime soon.

A semi-feasible alternative might be an in-kernel CORBA transaction server, which incidentally might be of benefit to the Berlin project. But it would be a long time before something like this had any direct effect on database servers on Linux.

Smoking Joe,
I have encountered Signal 11. He sought me out in the steel cage. I gave him the option of either joining us or dying in an unspeakably painful and erotic manner. He has not seen fit to join us. Do you still need him alive, or can I deal with him?
Your young apprentice,
--Shoeboy

Is that you make parallelism a very small part of the code. That is good software design. You can do it and have it work. I really believe that. But you need to isolate out the parallelism and treat it with extreme prejudice.

As for why Java only has blocking IO, they very specifically wanted to keep people from using the standard C select() loop construct that inevitably has a small race in it. (Early versions of Apache were bitten by that one. It worked fine until you asked Apache to listen to multiple ports at once.)

Our cause is just and we must always strive to keep the moral high ground. While Signal 11's God-like Karma would certainly benefit our cause, we cannot force him to join us. He must make that decision himself. It is similar to dealing with an addict -- they have to bottom out and really WANT to be cured.

I am an optimist. I do not believe that Signal 11 is too far gone. We can win him over, but we must take care not to push too hard as this would only drive him away. "Hold On Loosely" is the advice given by the band 38 Special. In time, Signal 11 will grow tired of being a model Karma Whore and will begin to turn like old milk. It happened to you, it happened to me, it happened to PHroD. When it happens to S11, we must be ready with open arms, Natalie Portman pictures, and a steaming hot bowl of grits to welcome him.