
15 Minutes and 150MB of RAM to Compare Unix and Linux

Jun 20, 2003 • by Eric Kidd

SCO has recently made two
accusations: (1) IBM has contributed IBM employees' code to Linux
in violation of certain SCO/IBM contracts, and (2) some proprietary
Unix code has somehow been illegally contributed to Linux. I'm not
qualified to comment on whether or not IBM owns the code IBM
wrote--though on behalf of software authors everywhere, I hope IBM
does. However, I've written a tool which will allow SCO to find any
code shared between Linux and Unix in about 15 minutes. What SCO does
with this tool is up to them.

My Motivations

When I was young, my father once told me, "The truth will out." (I
think he was quoting someone.) He felt that it was better to face
the truth, get all the facts before the public, and do the right thing.
You might pay a price for your honesty, but you'd pay a bigger price if
you lied, because sooner or later, "The truth will out."

I don't fear the truth. And as a creator, I deeply respect other
people's copyrights. I don't want other people misusing my work. Linus
Torvalds is a creator, too, and he says he respects the work
of others. If there's Unix code in the Linux kernel, I want it removed
swiftly, and I want those responsible to be barred from future
contribution. Such illegal copying would be a stain upon the honor of
many good, creative people.

When I read Egan Orion's excellent article, I
decided to implement the idea he described, and to make the tool
publicly available to help copyright owners figure out whether their
code has been copied.

How to Use It

You'll need a Linux or Unix system with a decent C++ compiler.
Download srcdupchk-0.2.tar.gz
(that's "source duplication checker" to people who don't speak Unix),
decompress it, and type:

$ cd srcdupchk-0.2
$ ./configure
$ make
$ make install

You may need to be root to run the last command. Before continuing,
please read the README and COPYING files carefully. srcdupchk comes
with no warranties and is provided "AS IS".

Now place the programs to be compared in two different
directories, and type:

$ srcdupchk my-program-src linux-2.4.20

After 15 minutes or so, you'll get output which looks like this:

linux-2.4.20/foo/bar.c:20:107
linux-2.4.20/foo/baz.c:52:57
...

This means that lines 20 to 107 of bar.c are similar to code
in my-program-src, as are lines 52 to 57 of baz.c. By
default, srcdupchk won't print out the corresponding lines in
my-program-src, so you don't have to reveal more than necessary
about your own program. (If you want to see the other half of the
matches, use the --show-both option.)
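
If you want to post-process the report, the path:first:last format shown above is easy to parse. Here is a minimal Python sketch, assuming each match is one line in exactly that form and that the line range is inclusive. The names Match, parse_match and parse_report are made up for this illustration and are not part of srcdupchk.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """One reported match: a file plus an inclusive line range (assumed)."""
    path: str
    first_line: int
    last_line: int

    @property
    def line_count(self) -> int:
        # Inclusive range, so lines 20 to 107 count as 88 lines.
        return self.last_line - self.first_line + 1


def parse_match(line: str) -> Match:
    # Split from the right so a path that itself contains ':' stays intact.
    path, first, last = line.strip().rsplit(":", 2)
    return Match(path, int(first), int(last))


def parse_report(text: str) -> List[Match]:
    """Parse a whole report, skipping lines that are not path:first:last."""
    matches = []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            matches.append(parse_match(line))
        except ValueError:
            # Not in the expected form (for example a "..." placeholder).
            continue
    return matches
```

Summing line_count over the parsed matches gives a rough total of flagged lines, which is only a starting point for review, not a finding.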

For details on how srcdupchk actually works, see the README.
srcdupchk uses some neat tricks to ignore whitespace, commenting
style, brace placement, and other irrelevant details.
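
To give a feel for what "ignoring irrelevant details" can mean, here is a rough Python sketch of one possible approach. It is not srcdupchk's actual algorithm (the README describes that). It strips C-style comments, whitespace and braces from each line, then looks for runs of consecutive normalized lines that appear in both inputs. Every function name and the window size of 3 are assumptions made for this illustration, and the comment stripping is naive: it does not understand string literals.

```python
import re
from typing import Dict, List, Tuple

_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")


def normalize(source: str) -> List[Tuple[int, str]]:
    """Return (original line number, normalized text) for each non-empty line."""
    # Keep the newlines inside block comments so line numbers stay correct.
    source = _BLOCK_COMMENT.sub(lambda m: "\n" * m.group(0).count("\n"), source)
    source = _LINE_COMMENT.sub("", source)
    result = []
    for number, line in enumerate(source.split("\n"), start=1):
        text = re.sub(r"\s+", "", line)                 # ignore whitespace
        text = text.replace("{", "").replace("}", "")   # ignore brace placement
        if text:
            result.append((number, text))
    return result


def windows(lines: List[Tuple[int, str]], size: int) -> Dict[str, List[Tuple[int, int]]]:
    """Map each run of `size` normalized lines to its (first, last) line numbers."""
    index: Dict[str, List[Tuple[int, int]]] = {}
    for i in range(len(lines) - size + 1):
        chunk = lines[i:i + size]
        key = "\n".join(text for _, text in chunk)
        index.setdefault(key, []).append((chunk[0][0], chunk[-1][0]))
    return index


def merge(spans: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Combine overlapping (first, last) ranges; input must be sorted."""
    merged: List[Tuple[int, int]] = []
    for first, last in spans:
        if merged and first <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


def shared_ranges(mine: str, theirs: str, size: int = 3) -> List[Tuple[int, int]]:
    """Line ranges in `theirs` whose normalized text also occurs in `mine`."""
    known = set(windows(normalize(mine), size))
    hits = []
    for key, spans in windows(normalize(theirs), size).items():
        if key in known:
            hits.extend(spans)
    return merge(sorted(hits))
```

A real tool would hash the windows rather than keep the text in memory, and would need a smarter tokenizer, but the idea of comparing normalized lines instead of raw ones is the same.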

A Caveat

srcdupchk will find lots of perfectly legal code sharing. For
example, both Linux and Unix are allowed to contain BSD code, or public
domain code from various textbooks or the web. Other common sources of
duplication are license notices and the boilerplate code generated by
popular tools. Once you've found the duplication, you need to
investigate it carefully before you know what it means.
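
As a first pass over a long list of matches, it can help to set aside the ones that look like license notices or tool-generated boilerplate. The Python sketch below is a crude heuristic, not a feature of srcdupchk: the marker phrases and the function name likely_boilerplate are assumptions chosen for illustration. A flagged match still needs a human to look at it, and an unflagged match is not evidence of copying.

```python
from typing import Iterable

# Hypothetical marker phrases; extend these for the code bases being compared.
BOILERPLATE_MARKERS = (
    "copyright",
    "permission is hereby granted",
    "redistribution and use in source and binary forms",
    "gnu general public license",
    "do not edit",
    "generated by",
)


def likely_boilerplate(matched_lines: Iterable[str]) -> bool:
    """True if the matched text contains a license or generated-code marker."""
    cleaned = []
    for line in matched_lines:
        # Drop comment leaders such as "/*", " * " and "#", then collapse
        # whitespace, so a phrase that wraps across lines can still match.
        stripped = line.strip().strip("/*# \t").lower()
        cleaned.append(" ".join(stripped.split()))
    text = " ".join(cleaned)
    return any(marker in text for marker in BOILERPLATE_MARKERS)
```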

A Personal Request

Please don't publish the results of running srcdupchk on
other people's code. It isn't polite, it almost certainly violates any
non-disclosure agreements you've signed, and it may get you sued. Your
actions would reflect poorly on the reputations of many free software
developers, who, in my experience, are painfully scrupulous about
their legal and moral responsibilities.

I wrote this tool so that software developers could quickly find
improper uses of their code, and report those problems. Please respect
my wishes in this matter. If you want to violate your NDAs in the name
of journalism, there are people who will help you do that; they
are already offering to provide similar tools for just that purpose.
The ethics of a law-abiding citizen and the ethics of investigative
journalism are sometimes in conflict; how you resolve them is on your
own conscience. But whatever you do, please don't do it with my
tools.

What I'd Like SCO to Do

SCO has a choice. They could run this tool, keep the results
secret, and issue press releases saying there are 5,241 lines of code
which appear in both Unix and Linux. Or they could call up Linus
Torvalds on the phone and say: "The following files and line numbers in
Linux look suspicious to us. Would you please work with the Linux
community to figure out who 'contributed' this code, and if it
is illegally copied, would you please remove it promptly?"

The latter choice would end the wrongdoing which concerns SCO, and
would allow hundreds of dedicated, hardworking people to clear their
names. All I'm asking for is 15 minutes, 150MB of RAM, and one
phone call to Linus.