The Linux Kernel: It’s Worth More!

David A. Wheeler

October 12, 2004 (Lightly revised September 28, 2011)

This paper refines Ingo Molnar’s estimate of the development effort
it would take to redevelop Linux kernel version 2.6.
Molnar’s rough estimate found it would cost $176M (US) to
redevelop the Linux kernel using traditional proprietary approaches.
By using a more detailed cost model and much more information about the
Linux kernel, I found that the effort would be
closer to $612M (US)
to redevelop the Linux kernel as it existed in 2004.
A postscript lists some recalculations since then,
showing that these values have grown.
In any case, the Linux kernel is clearly worth far more than the $50,000
offered in 2004.

The discussion began with the following offer, posted by Jeff V. Merkey in 2004:

We offer to kernel.org the sum of $50,000.00 US for a one time
license to the Linux Kernel Source for a single snapshot of
a single Linux version by release number. This offer must be
accepted by **ALL** copyright holders and this snapshot will
subsequently convert the GPL license into a BSD style license
for the code.

Many respondents noted that this proposal was unworkable,
because it required complete agreement by all copyright holders.
Not only would such a process be lengthy, but
many copyright holders made it clear in various replies
that they would not agree to any such plan.
Many Linux kernel
developers expect improved versions of their code to be continuously
available to them, and a release using a BSD-style license would
violate those developers’ expectations.
Indeed, it was clear that many respondents felt that such a move
would strip the Linux kernel of legal protections
against someone who wanted to monopolize a derived version of the kernel.
Many open source software / Free software (OSS/FS)
developers allow conversion of their OSS/FS programs
to a proprietary program; some even encourage it.
The BSD-style licenses are specifically designed to allow conversion
of an OSS/FS program into a proprietary program.
However,
the GPL is the
most popular OSS/FS license, and it was specifically designed
to prevent this.
Based on the thread responses, it’s clear that
many Linux kernel developers prefer that the GPL continue to be used as
the Linux kernel license.

In addition, many people were suspicious about the motives for this offer.
Groklaw
published an article that mentioned this proposal, and
noted that someone with the same name
is listed on a patent recently obtained by the Canopy Group.
SCO is a Canopy Group company, and I have
since confirmed that the patent application refers to the same person.
Groklaw
later tried to learn more about him.
I don’t really know why Merkey made this proposal, and it
doesn’t really matter.
What’s more interesting to me is the question this raised,
namely: how much is Linux “worth”?
That is a valid question!

In one of the responses,
Ingo Molnar calculated the cost to re-develop the Linux kernel
using my tool
SLOCCount.
Molnar didn’t specify exactly which version of the Linux kernel he used,
but he did note that it was in the version 2.6 line, and
presumably it was a recent version as of October 2004.
He found that “the Linux 2.6 kernel, if developed from scratch
as commercial software, takes at least this much effort under the
default COCOMO model” -- about $176M (US) in redevelopment cost.

Ingo Molnar then commented,
“and you want an unlimited license for $0.05M? What is this, the latest
variant of the Nigerian/419 scam?”
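Molnar’s bottom line can be reproduced from SLOCCount’s defaults: the Basic
COCOMO “organic” equation (effort = 2.4 × KSLOC^1.05 person-months), with
SLOCCount’s default salary ($56,286/person-year) and overhead factor (2.4).
Here is a sketch; the kernel size is an assumption, using the 4,287,449
physical SLOC counted later in this paper (Molnar did not state which 2.6
release he measured):

```python
# Basic COCOMO "organic" estimate, as SLOCCount applies by default.
# Assumed size: 4,287,449 physical SLOC (the count used later in this paper).
ksloc = 4287.449
effort_pm = 2.4 * ksloc ** 1.05        # effort in person-months
cost = effort_pm / 12 * 56286 * 2.4    # person-years x salary x overhead
print(f"effort = {effort_pm:,.0f} person-months, cost = ${cost / 1e6:.0f}M")
```

This yields roughly $176M, matching Molnar’s figure.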

Strictly speaking,
the value of a product isn’t the same as the cost of developing it.
For example,
if no one wants to use a software product, then it has no value, no matter
how much was spent in developing it.
The value of a proprietary software product to its vendor
can be estimated by
computing the amount of money that the vendor will receive from it over all
future time (via sales, etc.),
minus the costs (development, sustainment, etc.)
over that same time period -- but predicting
the future is extremely difficult, and the Linux kernel isn’t a
proprietary product anyway.
Estimating value to users directly is surprisingly difficult.
But if a software product is used widely,
so much so that you’d be willing to
redevelop it, then development costs are a reasonable way to estimate
the lower bound of its value.
After all,
if you’re willing to redevelop a program, then it must have at least
that value.
The Linux kernel is widely used, so its redevelopment costs
will at least give you a lower bound of its value.

Thus, Molnar’s response is quite correct -- offering $50K for something
that would cost about $176M to redevelop is ludicrous.
It’s true that the kernel developers could continue to develop the
Linux kernel after a BSD-style release; after all, the *BSD operating systems
do this now.
But with a BSD-style release, someone else could take the code
and establish a competing proprietary product, and it would
take time for the kernel developers to add enough additional material
to compete with such a product.
It’s not clear that a proprietary vendor could really pick up the Linux
kernel and maintain the same pace without many of the original developers,
but that’s a different matter.
Certainly, the scale of the difference between $176M and $50K is enough
to see that the offer is not very much, compared to what the offerer
is trying to buy.

But in fact, it’s even sillier than it appears; I believe the cost to
redevelop the Linux kernel would actually be much greater than this.
Molnar correctly notes that he used the default Basic COCOMO model
for cost estimation.
This is the default cost model for SLOCCount, because it’s
a reasonable model for rough estimates about typical applications.
It’s also a reasonable default when
you’re examining a large set of software programs at once, since the ranges of
real efforts should eventually average out (this is the approach I used in my
More than a Gigabuck paper).
So, what Molnar did was perfectly reasonable for getting a rough
order of magnitude of effort.

But since there’s only one program being considered in this analysis --
the Linux kernel --
we can use a more detailed model to get a more accurate cost estimate.
I was curious what the answer would be.
So I’ve estimated the effort to create the Linux kernel, using a
more detailed cost model.
This paper shows the results -- and it shows that redeveloping the
Linux kernel would cost even more.

This estimate is of what it would cost to
rebuild a particular version; it is not exactly the same as the effort
actually invested in the kernel.
In particular, in Linux kernel development, a common practice is to
have a “bake-off” where competing ideas are all implemented
and then measured; the approach with the best result
(e.g., faster) is then used.
Bake-offs have much to commend them, but since only one approach
is actually included, the effort invested in the alternatives isn’t
included in this estimate.

To get better accuracy in our estimation,
we need to use a more detailed estimation model.
An obvious alternative, and the one I’ll use, is
the Intermediate COCOMO model.
This model requires more information than the Basic COCOMO model,
but it can produce higher-accuracy estimations if you can provide
the data it needs.
We’ll also use the version of COCOMO that uses physical SLOC
(since we don’t have the logical SLOC counts).
If you don’t want to know the details, feel free to skip to the next
section labelled “results”.

First, we now need to determine if this is an “organic”, “embedded”, or
“semidetached” application.
The Linux kernel is clearly not an organic application; organic applications
have a small software team developing software in a familiar,
in-house environment, without significant communication overheads,
and allow hard requirements to be negotiated away.
It could be argued that the Linux kernel is embedded, since it often
operates in tight constraints; but in practice
these constraints aren’t very tight,
and the kernel project can often negotiate requirements to a limited extent
(e.g., providing only partial support for a particular peripheral
or motherboard if key documentation is lacking).
While the Linux kernel developers don’t ignore resource constraints,
there are no specific constraints that the developers feel are
strictly required.
Thus, it appears that the kernel should be considered
a “semidetached” system; this is the
intermediate stage between organic and embedded.
“Semidetached” isn’t a very descriptive word, but that’s the word used by
the cost model so we’ll use it here.
It really just means between the two extremes of organic and embedded.

The intermediate COCOMO model also requires a number of additional parameters.
Here are those parameters, and their values for the Linux kernel
(as I perceive them); the parameter values are based on
Software Engineering Economics by Barry Boehm:

RELY: Required software reliability: High (1.15). The Linux kernel
is now used in situations where crashes can cause high financial loss.
Even more importantly, Linux
kernel developers expect the kernel to be highly reliable,
and the kernel undergoes extensive worldwide off-nominal testing.
While the testing approach is different than traditional testing regimes,
it clearly produces a highly reliable result
(see the
Reliability
section of my paper
Why OSS/FS? Look at the Numbers!).

DATA: Data base size: Nominal (1.0). Typically the Linux kernel
manages far larger databases (file systems) than itself, but it
handles them as somewhat opaque contents, so it’s questionable that
those larger sizes can really be counted as being much greater than
nominal. Handling the filesystems’
metadata is itself somewhat complicated, and does take significant
effort, but filesystem management is only one of many things that
the kernel does. So, absent more specific data, we’ll
claim it’s nominal. If we claim it’s higher, and there’s reason
for doing so, that would increase the estimated effort.

CPLX: Product complexity: Extra high (1.65).
The kernel must perform multiple resource handling with dynamically
changing priorities: multiple processes/tasks running on potentially
multiple processors, with multiple kinds of memory, accessing peripherals
which also have various dynamic priorities.
The kernel must deal with device timing-dependent coding, and
with highly coupled dynamic data structures (some of whose structure
is imposed by hardware). In addition, it implements
routines for interrupt servicing and masking, as well as multi-processor
threading and load balancing.
The kernel does have an internal design structure, which helps manage
complexity somewhat, but in the end no design can eliminate the
essential complexity of the task today’s kernels are asked to perform.
It’s true that toy kernels aren’t as complex; requiring single
processors, forbidding re-entry, ignoring resource contention issues,
ignoring error conditions, and a variety of other simplifications
can make a kernel much easier to build, at the cost of poor performance.
But the Linux kernel is no toy.
Real-world operating system kernels are considered extremely difficult
to develop, for a litany of good reasons.

TIME: Execution time constraint: High (1.11). Although it doesn’t need to
stay at less than 70% resource use, performance is an important
design criterion, and much effort has been spent on measuring and
improving performance.

STOR: Main storage constraint: Nominal (1.0). Although there has been
some effort to limit memory use (e.g., 4K kernel stacks), Linux kernel
development has not been strongly constrained by memory.

VIRT: Virtual machine volatility: High (1.15).
The most common processor (x86) doesn’t change that quickly, though new
releases by Intel and AMD do need to be taken into account.
The Linux kernel is also influenced by other processor architectures,
which in the aggregate change quite a bit over time.
Even more importantly, the other components of underlying machines
(such as motherboards, peripheral and bus interfaces, etc.)
change on a weekly basis. Often the documentation is unavailable,
and when available, it’s sometimes wrong (which from a developer’s
point of view looks like a volatile interface, since it keeps changing).
The Linux kernel developers spend a vast amount of time identifying
hardware limitations/problems and working around them.
What’s worse, there’s a variety of different hardware, and new ones
keep arriving.
The kernel developers do attempt to control things where they can.
For example, while they try to write code that works with a variety
of gcc versions, they limit themselves to one compiler (gcc), designate
an official gcc version, and try to limit when official gcc versions
are changed.
But these measures cannot hide the fact that
the interface of the underlying machine is
actually quite volatile.

TURN: Computer turnaround time: Nominal (1.0). Kernel recompilation
and rebooting aren’t interactive, but they’re reasonably fast on
2+ GHz processors. Once the first compilation has occurred,
recompilation is usually quite quick for localized changes.
Thus, there’s no reason for this to be a penalty.

ACAP: Analyst capability: High (0.86). It appears that the people
analyzing the system, identifying the “real” requirements, and the
needed design modifications to support them,
are significantly better at doing this than the industry average.
This analysis tends to be more distributed than in a typical proprietary
project, but it obviously still occurs.

AEXP: Applications experience: Nominal (1.0). It’s difficult to
determine how much experience with the Linux kernel
the software developers of the Linux kernel have.
Clearly, if you modify the same program day after day for many years,
you’ll tend to become more efficient at modifying it.
Some developers, such as Linus Torvalds and Alan Cox,
clearly have a vast amount of experience in modifying the Linux kernel.
But for many other kernel developers it isn’t clear that they have
a vast amount of experience modifying the Linux kernel.
In the absence of better information, I’ve chosen nominal. This suggests that
on average, developers of the Linux kernel have about 3 years’ full-time
experience in modifying the Linux kernel.
More experience on average would help, and lower the effort
estimation somewhat.

PCAP: Programmer capability: High (0.86).
Modern kernels such as Linux are complex, creating a strong barrier against
attempts to contribute by less capable developers.
Would-be contributors
must convince the existing experts that their work is worthwhile, so
new contributors’ works are normally revised by highly capable developers.
Key kernel developers are not accepted as such unless they convince the
other, already highly capable developers that they are also capable.
Generally only
highly capable, above-average developers (75th percentile or more)
will be successful at helping to develop the Linux kernel.

VEXP: Virtual machine experience: Nominal (1.0). The x86 processors,
which are by far the most popular for the Linux kernel, are
relatively stable and kernel developers have a lot of experience with them.
But they are not completely stable (e.g., the new 64-bit extensions
for x86 and the NX bit), which can also reduce experience slightly.
Authors of ports to other processors also tend to be experienced
with those processors.
On the other hand,
most of the kernel’s code is in its hardware drivers, and this
hardware often acts as a virtual machine as well as a needed interface.
Many driver developers, while experienced in general,
often have less experience with the particular component they’re
writing a driver for.
In particular, many drivers are not written by companies that
produce the hardware, and the developers
often don’t have good documentation to help them.
Sometimes this has helpful side-effects.
It can help unify how hardware is handled, since the
kernel developers who are writing drivers for several
similar peripherals will often develop a way
to unify their handling and apparent interface.
It can also aid reliability in the long term, since the driver
writers understand how the kernel works (Windows drivers
tend to be written by hardware companies who understand their product
but have less knowledge about Windows, and since their code is often
not peer reviewed by Windows developers, many Windows drivers can
cause the entire operating system to crash).
But this initial lack of information by Linux kernel developers
about the components does increase the effort to develop a driver.
What’s worse, hardware components are notorious for not operating as their
specifications proclaim, and the kernel’s job is to hide all that.
Thus, this is averaged as nominal, and this is probably being generous.

LEXP: Programming language experience: High (0.95).

MODP: Modern programming practices: High - in general use (0.91).
This program is written in C, which lacks structures such as
exception handling, so there is extensive use of “goto” (etc.) to implement
error handling. However, the use of such constructs tends to be
highly stylized and structured, so credit is given for using modern
practices. Some might claim that this is
giving too much credit, but changing this would only make the estimated
effort even larger.

TOOL: Use of software tools: Nominal (1.0).

SCED: Required development schedule: Nominal (1.0). There is little
schedule pressure per se, so the “most natural” speed is followed.

Results

In short, it would actually cost about $612 million (US) to re-develop the
Linux kernel.
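Under the ratings above, the Intermediate COCOMO arithmetic can be sketched in
a few lines. The salary ($56,286/person-year) and overhead factor (2.4) are
SLOCCount’s defaults, and the size is the 4,287,449 physical SLOC reported
later in this paper:

```python
# Intermediate COCOMO estimate for the Linux kernel, "semidetached" mode.
# Cost-driver multipliers as rated in this paper; salary and overhead are
# SLOCCount's defaults ($56,286/person-year, 2.4 overhead factor).
multipliers = {
    "RELY": 1.15, "DATA": 1.00, "CPLX": 1.65, "TIME": 1.11,
    "STOR": 1.00, "VIRT": 1.15, "TURN": 1.00, "ACAP": 0.86,
    "AEXP": 1.00, "PCAP": 0.86, "VEXP": 1.00, "LEXP": 0.95,
    "MODP": 0.91, "TOOL": 1.00, "SCED": 1.00,
}
eaf = 1.0
for m in multipliers.values():
    eaf *= m                               # effort adjustment factor, ~1.54869

ksloc = 4287.449                           # 4,287,449 physical SLOC
effort_pm = 3.0 * ksloc ** 1.12 * eaf      # person-months, semidetached mode
cost = effort_pm / 12 * 56286 * 2.4        # dollars
print(f"EAF = {eaf:.5f}, effort = {effort_pm:,.0f} person-months, "
      f"cost = ${cost / 1e6:.0f}M")
```

Note how both the larger exponent (1.12 rather than 1.05) and the effort
adjustment factor push the estimate well above the Basic COCOMO default.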

Why is this estimate so much larger than Molnar’s original estimate?
The answer is that SLOCCount presumes that it’s dealing with an
“average” piece of software (i.e., a typical application) unless
it’s given parameters that tell it otherwise.
This is usually a reasonable default; almost nothing is as hard
to develop as an operating system kernel.
But operating system kernels
are so much harder to develop that, if you include that difficulty
into the calculation, the effort estimations go way up.
This difficulty shows up in the nominal equation --
semidetached is fundamentally harder, and thus has a larger exponent
in its estimation equation than the default for basic COCOMO.
This difficulty also shows up in factors such as “complexity”;
the task the kernel does is fundamentally hard.
The strong capabilities of analysts and developers, use of modern practices,
and programming language experience all help,
but they can only partly compensate; it’s still very hard to
develop a modern operating system kernel.

This difference is smoothed over in my paper
More than a Gigabuck
because that paper
includes a large number of applications.
Some of the applications would cost less than was estimated, while
others would cost more; in general you’d expect that by computing the
costs over many programs the differences would be averaged out.
Providing that sort of information for every program would have been
too time-consuming for the limited time I had available to write that paper,
and I often didn’t have that much information anyway.
If I do such a study again, I might treat the kernel specially, since
the kernel’s size and complexity makes it reasonable to treat specially.
SLOCCount actually has options that allow you to provide the
parameters for more accurate estimates,
if you have the information they need and you’re willing
to take the time to provide them.
Since the nominal coefficient is 3, the effort adjustment factor for this
situation is 1.54869, and the exponent for semidetached projects is 1.12;
thus, simply providing SLOCCount with
the option “--effort 4.646 1.12”
would have produced this more accurate estimate.
But as you can see, it takes much more work to use this more
detailed estimation model, which is why many people don’t do it.
For many situations, a rough estimate is really all you need;
Molnar certainly didn’t need a more exact estimate to make his point.
And being able to give a rough estimate when given
little information is quite useful.
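The “--effort” arguments follow directly from the numbers above; here is a
small sketch (the sloccount command line printed at the end reflects the
option as described here, with a placeholder source-tree path, not a verified
invocation):

```python
# Derive SLOCCount's "--effort" arguments from the Intermediate COCOMO
# parameters used in this paper: nominal coefficient times effort
# adjustment factor, plus the semidetached exponent.
base = 3.0        # semidetached nominal coefficient
eaf = 1.54869     # product of the fifteen cost-driver multipliers
exponent = 1.12   # semidetached exponent

adjusted = base * eaf   # ~4.646
# Hypothetical invocation; "linux/" is a placeholder for the kernel tree.
print(f"sloccount --effort {adjusted:.3f} {exponent} linux/")
```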

In the end, Ingo Molnar’s response is still exactly correct.
Offering $50K for something
that would cost millions to redevelop, and is actively used and
supported, is absurd.

It’s interesting to note that there are already
several kernels with BSD licenses: the *BSDs (particularly
FreeBSD, OpenBSD, and NetBSD).
These are fine operating systems for many purposes,
indeed, my website once ran on OpenBSD.
But clearly, if there is a monetary offer to buy Linux code,
the Linux kernel developers must be doing something right.
Certainly, from a market share perspective, Linux-based systems are far
more popular than systems based on the *BSD kernels.
If you just want a kernel licensed under a BSD-style license,
you know where to find them.

It’s worth noting that these approaches only estimate development cost,
not value.
All proprietary developers invest in development with the presumption
that the value of the resulting product (as captured from license fees,
support fees, etc.) will exceed the development cost -- if not, they’re
out of business.
Thus, since the Linux kernel is being actively sustained, it’s only
reasonable to presume that its value far exceeds this development
estimate.
In fact, the kernel’s value probably well exceeds this estimate of
mere redevelopment cost.

It’s also worth noting that the Linux kernel has grown substantially.
That’s not surprising, given the explosion in the number of peripherals
and situations that it supports.
In
Estimating Linux’s size,
I used a Linux distribution released in March 2000,
and found that the Linux kernel had 1,526,722 physical source lines of code.
In
More than a Gigabuck,
the Linux distribution had been released in April 2001, and its
kernel (version 2.4.2) was 2,437,470 physical source lines of code (SLOC).
At that point, this Linux distribution would have cost more
than $1 Billion (a Gigabuck) to redevelop.
The much newer and larger Linux kernel considered here, with far more
drivers and capabilities than the one in that paper,
now has 4,287,449 physical source lines of code, and
is starting to approach a Gigabuck of effort all by itself.
If the kernel reaches 6,648,956 lines of code
(that is, (1E9/56286/2.4*12/3/1.54869)^(1/1.12) thousand physical SLOC),
then, given the other assumptions,
it’ll represent a billion dollars of effort all by itself.
And that’s just the kernel, which is only part of a working system.
There are other components that weren’t included in More than a Gigabuck
(such as OpenOffice.org) that are now common in Linux distributions,
which are also large and represent massive investments of effort.
More than a Gigabuck
noted the massive rise in size and scale
of OSS/FS systems, and that distributions were rapidly growing in
invested effort; this brief analysis is evidence that the trend continues.
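The billion-dollar break-even size quoted above comes from inverting the same
cost model: fix the cost at $1 billion and solve for size, holding all other
parameters fixed. A sketch using the same assumptions:

```python
# Invert the cost model: at what size does redevelopment cost $1 billion?
dollars = 1e9
effort_pm = dollars / 56286 / 2.4 * 12             # dollars -> person-months
ksloc = (effort_pm / 3.0 / 1.54869) ** (1 / 1.12)  # undo 3.0 * KSLOC**1.12 * EAF
print(f"break-even size = {ksloc * 1000:,.0f} physical SLOC")
```

This reproduces the 6,648,956-line threshold to within rounding.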

In short, the amount of effort that today’s OSS/FS programs represent
is rather amazing.
Carl Sagan’s phrase “billions and billions,” which he applied to
astronomical objects, easily applies to the effort
(measured in U.S. dollars) now invested in OSS/FS programs.

I’d like to thank Ingo Molnar for doing the original analysis
(using SLOCCount) that triggered this paper.
Indeed, I’m always delighted to see people doing analysis instead of
just guesswork.
Thanks for doing the analysis!
This paper is not in any way an attack on Molnar’s work; Molnar computed
a quick estimate, and this paper simply uses more data to refine his
effort estimation further.

Also, I’d like to tip my hat to
Charles Babcock’s October 19, 2007 article
“Linux Will Be Worth $1 Billion In First 100 Days of 2009”.
He noticed that, by my calculations, if
the Linux kernel ever reached 6.6 million lines of code,
it would be worth more than $1 billion in terms of equivalent, commercial development costs.
Using the current size and growth rates of the Linux kernel, he examined
the trend lines and found that
“Sometime during the first 100 days of 2009, Linux
will cross the 6.6 million lines of code mark and $1 billion in value.”

(C) Copyright 2004-2010 David A. Wheeler. All rights reserved.
This
article was reprinted in Groklaw by permission.
Before September 28, 2011, this article was titled
Linux Kernel 2.6: It’s Worth More!, but many of the
points apply to versions other than Linux kernel version 2.6.