Ken Thompson Hack (1984)

Ken Thompson outlined a method for corrupting a compiler binary (and other compiled software, such as the login program on a *nix system) in 1984. I was curious to know whether modern compilation has addressed this security flaw or not.

Short description:

Rewrite the compiler's source to contain two flaws:

When compiling its own source, the compiler must inject these flaws into the resulting binary.

When compiling certain preselected code (e.g. the login program), it must inject an arbitrary backdoor.

Thus the compiler appears to work normally, but when it compiles the login program (or similar), it creates a security backdoor, and when it compiles newer versions of itself in the future, it re-inserts the previous flaws. The flaws exist only in the compiler binary, so they are extremely difficult to detect.
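The two flaws can be sketched as a toy shell script. Everything here is illustrative: `evil_cc` just copies text instead of compiling, and the names `login.c` and `backdoor` are hypothetical stand-ins, not anything from Thompson's paper.

```shell
#!/bin/sh
# Toy sketch of the two flaws; "compilation" is simulated with cp so the
# script runs without a real compiler. All names are hypothetical.

evil_cc() {
    src="$1"; out="$2"
    cp "$src" "$out"                                  # normal "compilation"
    if grep -q 'int login(' "$src"; then
        # Flaw 2: recognize the login source and splice in a backdoor.
        printf '/* injected */ int backdoor(void){return 1;}\n' >> "$out"
    fi
    if grep -q 'evil_cc' "$src"; then
        # Flaw 1: recognize the compiler's own source and re-insert the
        # whole injection logic so the flaw survives rebuilds.
        printf '/* injected: self-replicating patch */\n' >> "$out"
    fi
}

printf 'int login(void){return 0;}\n' > login.c
evil_cc login.c login.out
grep -c 'backdoor' login.out                          # prints 1
```

The point of the sketch is that neither `login.c` nor the compiler's visible source ever shows the backdoor; it lives only in the compiler binary's behavior.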

Questions:

I could not find any answers to these on the web:

How does this relate to just-in-time compilation?

Are functions like the program handling logins on a *nix system compiled when they are run?

Is this still a valid threat, or have there been developments in the security of compilation since 1984 that prevent this from being a significant issue?

Does this affect all languages?

Why do I want to know?

I came across this while doing some homework, and it seemed interesting but I lack the background to understand in a concrete way whether this is a current issue, or a solved issue.

10 Answers

This hack has to be understood in context. It was published at a time and in a culture where Unix running on all kinds of different hardware was the dominant system.

What made the attack so scary was that the C compiler was the central piece of software for these systems. Almost everything in the system went through the compiler when it was first installed (binary distributions were rare due to the heterogeneous hardware). Everyone compiled stuff all the time. People regularly inspected source code (they often had to make adjustments to get it to compile at all), so having the compiler inject backdoors seemed to be a kind of "perfect crime" scenario where you could not be caught.

Nowadays, hardware is much more compatible and compilers therefore have a much smaller role in the day-to-day operation of a system. A compromised compiler is not the most scary scenario anymore - rootkits and a compromised BIOS are even harder to detect and get rid of.

Or, since most people don't compile anything from source (say, on Windows) your average trojan will suffice :) (I'm agreeing that a compromised compiler is way overkill)
– Andres F., Jan 25 '13 at 22:18


@ArjunShankar: A non-free proprietary binary-only compiler does not need, and cannot have, this backdoor. This backdoor only applies to compilers that you compile yourself from source-code.
– ruakh, Jan 26 '13 at 4:48


Except on the desktop, Unix and all its variants are still the dominant operating system.
– Rob, Jan 26 '13 at 12:44


@ruakh: maybe I do not understand your emphasis on 'this', but I happen to disagree. If this backdoor has been introduced in the company that happens to own the non-free, proprietary compiler and uses this compiler to compile new versions of the same compiler, this backdoor would have a much worse impact than in the original scenario. You'll only need one attack vector to infect all.
– orithena, Jan 26 '13 at 15:14


Imagine someone compromises an Ubuntu build server and replaces the compiler without changing any source. It might take a little time for this to be found out, and by that time Ubuntu images would be pushed out to people all over with the compromised compiler built into them (along with compromised login binaries or what have you). I think this is still a perfectly valid concern.
– Jimmy Hoffa, Jan 28 '13 at 15:11

No

The attack, as originally described, was never a threat. While a compiler could theoretically do this, actually pulling off the attack would require programming the compiler to

Recognize when the source code being compiled is of a compiler, and

Figure out how to modify arbitrary source code to insert the hack into it.

This entails figuring out how the compiler works from its source code, so that it can modify it without breaking anything.

For instance, imagine that the linking format stores the lengths or offsets of the compiled machine code somewhere in the executable. The compiler would have to figure out for itself which of these need to be updated, and where, when inserting the exploit payload. Subsequent (innocuous) versions of the compiler can arbitrarily change this format, so the exploit code would effectively need to understand these concepts.

This is high-level self-directed programming, a hard AI problem (last I checked, the state of the art was generating code that is practically determined by its types). Look: few humans can even do this; you would have to learn the programming language and understand the code-base first.

Even if the AI problem is solved, people would notice if compiling their tiny compiler results in a binary with a huge AI library linked into it.

Analogous attack: bootstrapping trust

However, a generalization of the attack is relevant. The basic issue is that your chain of trust has to start somewhere, and in many domains its origin could subvert the entire chain in a hard-to-detect way.

An example that could easily be pulled off in real life

Your operating system, say Ubuntu Linux, ensures security (integrity) of updates by checking downloaded update packages against the repository's signing key (using public-key cryptography). But this only guarantees authenticity of the updates if you can prove that the signing key is owned by a legitimate source.

Where did you get the signing key? When you first downloaded the operating system distribution.

You have to trust that the source of your chain of trust, this signing key, isn't evil.

Anyone that can MITM the Internet connection between you and the Ubuntu download server—this could be your ISP, a government that controls Internet access (e.g. China), or Ubuntu's hosting provider—could have hijacked this process:

Detect that you're downloading the Ubuntu CD image. This is simple: see that the request is going to any of the (publicly-listed) Ubuntu mirrors and asks for the filename of the ISO image.

Serve the request from their own server, giving you a CD image containing the attacker's public key and repository location instead of Ubuntu's.

Thenceforth, you will get your updates securely from the attacker's server. Updates run as root, so the attacker has full control.

You can prevent the attack by making sure the original is authentic. But this requires that you validate the downloaded CD image using a hash (few people actually do this)—and the hash must itself be downloaded securely, e.g. over HTTPS. And if your attacker can add a certificate on your computer (common in a corporate environment) or controls a certificate authority (e.g. China), even HTTPS provides no protection.
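The validation step described above can be sketched in shell. The files here are simulated stand-ins; in a real check, `ubuntu.iso` would come from a mirror and `SHA256SUMS` would have to be fetched over an authenticated channel (HTTPS, and ideally with its signature verified against the distributor's key).

```shell
#!/bin/sh
# Simulated files: in reality ubuntu.iso comes from a mirror and
# SHA256SUMS from an authenticated source; names are illustrative.
printf 'pretend ISO contents\n' > ubuntu.iso
sha256sum ubuntu.iso > SHA256SUMS       # what the distributor publishes

# The check a downloader should run (few people actually do):
sha256sum -c SHA256SUMS                 # prints: ubuntu.iso: OK
```

Note that the check is only as trustworthy as the channel that delivered `SHA256SUMS`: an attacker who can substitute the ISO can usually substitute the checksum file too, which is exactly the chain-of-trust problem described above.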

This is false. The compiler only has to determine when it is compiling one very specific source file - its own source, with very specific contents - not when it is compiling any compiler whatsoever!
– Kaz, Jan 25 '13 at 23:45


@Kaz -- At some point, aboveboard modifications to the compiler or login program might get to the point where they defeat the backdoor's compiler-recognizer/login-recognizer, and subsequent iterations would lose the backdoor. This is analogous to a random biological mutation granting immunity to certain diseases.
– Russell Borogove, Jan 26 '13 at 0:27


The first half of your answer has the problem that Kaz describes, but the second half is so good that I'm +1'ing anyway!
– ruakh, Jan 26 '13 at 1:03


An evil compiler that only recognizes its very own source is easy to build, but relatively worthless in practice - few people who already have a binary of this compiler would use it to recreate said binary. For the attack to be successful for a longer period, the compiler would need more intelligence, to patch newer versions of its own source, thus running into the problems described in the answer.
– user281377, Jan 26 '13 at 21:34


A recognizer for a specific compiler could be quite general, and unlikely to break in the face of new versions. Take for instance gcc - many lines of code in gcc are very old, and haven't changed much. Simple things like the name almost never change. Before the recognition goes awry, it's likely the injected code would. And in reality, both of those problems are largely theoretical - in practice a malware author would have no trouble keeping up to date with the (slow) pace of compiler development.
– Eamon Nerbonne, Jan 27 '13 at 15:15

The purpose of that speech wasn't to highlight a vulnerability that needs to be addressed, or even to propose a theoretical vulnerability that we need to be aware of.

The purpose was to show that, when it comes to security, we'd like not to have to trust anyone, but unfortunately that's impossible. You always have to trust someone (hence the title: "Reflections on Trusting Trust").

Even if you're the paranoid type who encrypts his desktop hard-drive and refuses to run any software you didn't compile yourself, you still need to trust your operating system. And even if you compile the operating system yourself, you still need to trust the compiler you used. And even if you compile your own compiler, you still need to trust that compiler! And that's not even mentioning the hardware manufacturers!

You simply can't get away with trusting no one. That's the point he was trying to get across.

If one has an open-source compiler whose behavior does not depend upon any implementation-defined or unspecified behavior, compiles it using a variety of independently-developed compilers (trusted or not), and then compiles one program using all the different compiled versions of that open-source compiler, every version should produce exactly the same output. If they do, that would suggest that the only way a trojan could be in one would be if it was identically in all. That would seem rather unlikely. One of my peeves with much of .net, though, ...
– supercat, Jan 26 '13 at 19:07

...is that many of the compilers generally produce different output every time they are run, making comparisons of compiled code essentially impossible.
– supercat, Jan 26 '13 at 19:09


@supercat: You seem to be missing the point. You're saying that the hack Ken Thompson presented can be worked around. I am saying that the particular hack he chose doesn't matter; it was just an example, to demonstrate his larger point that you must always trust someone. That's why this question is somewhat meaningless - it completely misses the forest for the trees.
– BlueRaja - Danny Pflughoeft, Jan 26 '13 at 19:32


@supercat: It's highly unlikely that different compilers would produce the same bytecode for any non-trivial program, due to different design decisions, optimizations, etc. This raises the question - how would you even know that the binaries are identical?
– Ankit Soni, Jan 27 '13 at 21:01


Wouldn't some of this conversation just mean that for the things you tested, the binaries/hardware behaved as expected? There could still be something in it you didn't test for and are unaware of.
– Bart Silverstrim, Feb 1 '13 at 16:04

This particular hack could certainly (*) be done today in any of the major open source OS projects, particularly Linux, *BSD, and the like. I would expect it to work almost identically. For example, you download a copy of FreeBSD that includes an exploited compiler that modifies openssh. From then on, every time you upgrade openssh or the compiler from source, the problem persists. Assuming the attacker has exploited the system used to package FreeBSD in the first place (likely, since the image itself is corrupted, or the attacker is in fact the packager), then every time that system rebuilds FreeBSD binaries, it will reinject the problem. There are lots of ways for this attack to fail, but they're not fundamentally different from how Ken's attack could have failed (**). The world really hasn't changed that much.

Of course, similar attacks could just as easily (or more easily) be injected by their owners into systems like Java, the iOS SDK, Windows, or any other system. Certain kinds of security flaws can even be engineered into the hardware (particularly weakening random number generation).

(*) But by "certainly" I mean "in principle." Should you expect that this kind of hole exists in any particular system? No. I would consider it quite unlikely for various practical reasons. Over time, as the code changes and changes, the likelihood that this kind of hack would cause strange bugs increases. And that raises the likelihood that it would be discovered. Less ingenious backdoors would require conspiracies to maintain. Of course we know for a fact that "lawful intercept" backdoors have been installed in various telecommunications and networking systems, so in many cases this kind of elaborate hack is unnecessary. The hack is installed overtly.

So always, defense in depth.

(**) Assuming Ken's attack ever actually existed. He just discussed how it could be done. He didn't say he actually did it as far as I know.

Does this affect all languages?

This attack primarily affects languages that are self-hosting - that is, languages whose compiler is written in the language itself. C, Squeak Smalltalk, and the PyPy Python interpreter would be affected by this; Perl, JavaScript, and the CPython Python interpreter would not.

How does this relate to just-in-time compilation?

Not very much. It is the self-hosting nature of the compiler that allows the hack to be hidden. I don't know of any self-hosting JIT compilers. (Maybe LLVM?)

Are functions like the program handling logins on a *nix system compiled when they are run?

Not usually. But the question isn't when it is compiled, but by which compiler. If the login program is compiled by a tainted compiler, it will be tainted. If it is compiled by a clean compiler, it will be clean.

Is this still a valid threat, or have there been developments in the security of compilation since 1984 that prevent this from being a significant issue?

This is still a theoretical threat, but is not very likely.

One thing you could do to mitigate it is to use multiple compilers. For example, an LLVM compiler that is itself compiled by GCC will not pass along a back door. Similarly, a GCC compiled by LLVM will not pass along a back door. So, if you are worried about this sort of attack, you could compile your compiler with another breed of compiler. That means the evil hacker (at your OS vendor?) will have to taint both compilers to recognize each other - a much more difficult problem.

Your last paragraph isn't, strictly speaking, true. In theory, code could detect the compiler being compiled and output the back door appropriately. This is of course impractical in the real world, but there's nothing that inherently prevents it. But then, the original idea was not about real practical threats but rather a lesson in trust.
– Steven Burnap, Jan 25 '13 at 22:29

Fair point. After all, the hack carries along a backdoor for login, and a mod for the compiler, so it can carry a mod for another compiler too. But it becomes increasingly unlikely.
– Sean McMillan, Jan 30 '13 at 14:10

There's a theoretical chance for this to happen. There is, however, a way of checking whether a specific compiler (with available source) has been compromised, through diverse double-compiling.

Basically, compile the source of the suspect compiler twice: once with the suspect compiler itself, and once with an independent, second compiler. This gives you two binaries built from the same compiler source. Now compile the suspect source again using each of these two binaries. If the resulting binaries are identical (with the exception of a variety of things that may well legitimately vary, like assorted timestamps), the suspect compiler was not actually abusing trust.
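A toy simulation of the comparison step, with shell functions standing in for compiler binaries. The file names and the trivial "trojan" are illustrative assumptions; this is only the final compare-and-decide step of the diverse double-compiling idea, not the full procedure.

```shell
#!/bin/sh
# Toy model of the diverse double-compiling comparison. "Compilation"
# is a text copy; evil_cc models a trojaned binary of the suspect
# compiler, honest_cc a binary built by an independent compiler.

honest_cc() { cat "$1" > "$2"; }         # faithful translation
evil_cc() {                               # trojaned: recognizes its own source
    cat "$1" > "$2"
    if grep -q 'suspect compiler' "$1"; then
        echo 'TROJAN' >> "$2"
    fi
}

printf 'source of the suspect compiler\n' > cc.src

# Compile the same source with both stage-one binaries. Two honest
# binaries of the same deterministic source must emit identical output,
# so any difference exposes the trojan.
evil_cc   cc.src via_suspect.bin
honest_cc cc.src via_trusted.bin

cmp -s via_suspect.bin via_trusted.bin \
    || echo "outputs differ: the suspect binary is inserting something"
```

The check depends on the compiler being deterministic: if outputs may legitimately vary (timestamps, parallelism), those sources of variation have to be normalized away before comparing.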

As a specific attack, it's as much of a threat as it ever was, which is pretty much no threat at all.

How does this relate to just-in-time compilation?

Not sure what you mean by that. Is a JITter immune to this? No. Is it more vulnerable? Not really. As a developer, YOUR app is more vulnerable simply because you can't validate that it hasn't been done. Note that your as-yet-undeveloped app is basically immune to this and all practical variations; you only have to worry about a compiler that is newer than your code.

Are functions like the program handling logins on a *nix system compiled when they are run?

That's not really relevant.

Is this still a valid threat, or have there been developments in the security of compilation since 1984 that prevent this from being a significant issue?

There is no real security of compilation, and can't be. That was really the point of his talk, that at some point you have to trust someone.

Does this affect all languages?

Yes. Fundamentally, at some time or another, your instructions have to be turned into something the computer executes, and that translation can be done incorrectly.

Me, I'm more worried about hardware attacks. I think we need a complete VLSI design toolchain with FLOSS source code, one that we can modify and compile ourselves, that lets us build a microprocessor with no backdoors inserted by the tools. The tools should also let us understand the purpose of every transistor on the chip. Then we could pop open a sample of the finished chips and inspect them with a microscope, making sure they had the same circuitry that the tools said they were supposed to have.

Systems in which the end users have access to the source code are the ones in which you would have to hide this type of attack - open source systems, in today's world. The problem is that although all Linux systems depend on a single compiler, the attack would have to get onto the build servers of all of the major Linux distributions. Since those servers don't download fresh compiler binaries for each compiler release, the source of the attack would have had to be present on their build servers in at least one previous release of the compiler. Either that, or the very first version of the compiler that they downloaded as a binary would have to have been compromised.

Suppose one has source code for a compiler/linker package (say the Groucho Suite) written in such a way that its output will not depend upon any unspecified behaviors, nor on anything other than the content of the input source files, and one compiles/links that code on a variety of independently-produced compiler/linker packages (say the Harpo Suite, the Chico Suite, and the Zeppo Suite), yielding a different set of executables for each (call them G-Harpo, G-Chico, and G-Zeppo).

If one then compiles the Groucho suite once using each of G-Harpo (yielding G-G-Harpo), G-Chico (G-G-Chico), and G-Zeppo (G-G-Zeppo), then G-G-Harpo, G-G-Chico, and G-G-Zeppo should all be byte-for-byte identical. If the files match, that would imply that any "compiler virus" that exists in any of them must exist identically in all of them.

Depending upon the age and lineage of the other compilers, it may be possible to ensure that such a virus could not plausibly exist in them. For example, if one uses an antique Macintosh to feed a compiler that was written from scratch in 2007 through a version of MPW that was written in the 1980s, the 1980s compilers wouldn't know where to insert a virus in the 2007 compiler. It may be possible for a compiler today to do fancy enough code analysis to figure it out, but the level of computation required for such analysis would far exceed the level required to simply compile the code, and could not very well have gone unnoticed in a marketplace where compilation speed was a major selling point.

I would posit that if one is working with compilation tools where the bytes in an executable file to be produced should not depend in any way upon anything other than the content of the submitted source files, it is possible to achieve reasonably good immunity from a Thompson-style virus. Unfortunately, for some reason, non-determinism in compilation seems to be regarded as normal in some environments.

I recognize that on a multi-CPU system it may be possible for a compiler to run faster if it is allowed to have certain aspects of code generation vary depending upon which of two threads finishes a piece of work first. On the other hand, I'm not sure I see any reason that compilers/linkers shouldn't provide a "canonical output" mode where the output depends only upon the source files and a "compilation date" which may be overridden by the user. Even if compiling code in such a mode took twice as long as normal compilation, I would suggest that there would be considerable value in being able to recreate any "release build", byte for byte, entirely from source materials, even if it meant that release builds would take longer than "normal builds".
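The "canonical output" check described above amounts to building twice from the same inputs and comparing bytes. A minimal sketch follows; the build step is simulated, and `SOURCE_DATE_EPOCH` (a modern reproducible-builds convention, not something from this answer) stands in for the user-overridable "compilation date".

```shell
#!/bin/sh
# Deterministic "build": the output depends only on the source file and
# a pinned timestamp, never on the real clock or thread scheduling.
export SOURCE_DATE_EPOCH=443779200      # overridable "compilation date"

build() {
    { cat release.src; echo "built-at:$SOURCE_DATE_EPOCH"; } > "$1"
}

printf 'release 1.0 sources\n' > release.src
build build_a.bin
build build_b.bin

# A reproducible release build must be recreatable byte for byte.
cmp -s build_a.bin build_b.bin \
    && echo "release build reproduced byte-for-byte"
```

Any intentionally varying input (the clock, random link order, thread timing) has to be pinned or eliminated for such a comparison to mean anything, which is exactly the supercat comments' complaint about toolchains that emit different bytes on every run.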

-1. I don't see how your answer addresses the core aspects of the question.
– GlenH7, Jan 29 '13 at 16:39

@GlenH7: Many older compilation tools would consistently produce bit-identical output when given bit-identical input [outside things like TIME, which could be tweaked to report an "official" compile time]. Using such tools, one could pretty well protect against compiler viruses. The fact that some popular development frameworks provide no way of "deterministically" compiling code means that techniques that could have protected against viruses in older tools cannot be effectively used with newer ones.
– supercat, Jan 29 '13 at 17:55

Have you tried this? 1. Lead with your thesis. 2. Use shorter paragraphs. 3. Be more explicit about the difference between "functionally identical" (the result of the first stage) and "bit identical" (the result of the second), possibly with a list of all compiler binaries produced and their relationships to one another. 4. Cite David A. Wheeler's DDC paper.
– tepples, Feb 20 at 22:15