The Compiler as Attack Vector

Media exposure of serious security threats has
skyrocketed in the last five years, and a strange
parallel has developed alongside it. As software developers
have become more aware of security problems and have
taken steps to mitigate them during the development
phase, attackers have been forced to choose more
insidious exploit vectors. One vector that
is seldom explored is attacking the program as it
is built.

I first encountered this idea while reading
the September 1995 ACM classic of the month article
“Reflections on Trusting Trust”, by Ken Thompson. The article
originally appeared in the August 1984 issue of
Communications of the ACM, and it deals with the belief
that ultimate security is impossible to achieve
because in the chain of building an application
there is no way to trust every link fully.
The particular focus was on the C compiler for UNIX
and how, within the build process, the programmer
can be blind to the compiler's actions.

The same problem exists today. Because
so much software in the Linux world is downloaded and compiled,
an avenue of attack opens. Binary distributions like RPMs and
Debian packages are becoming increasingly popular; thus, attacking the
build machines for the distributions would yield many unsuspecting
victims.

GCC and Glibc

Before engaging in a discussion of how such attacks could take place, it is
important to become familiar with the target, and how someone
would evaluate it for places to attack. GCC, written and distributed by
the GNU Project, supports many languages and architectures. For the sake
of brevity, we focus on ANSI C and the x86 architecture in this article.

The first task is to become more familiar with GCC: what it
does to code and where. The best way to start is to build a simple
Hello World program, passing GCC the -v option at compile
time. The output should look similar to that shown in Listing 1. Examining
it yields several important details. GCC is not a single program;
it invokes several programs to translate the C source
file into an ELF binary. It also links in
numerous system libraries with virtually no verification that they are
what they appear to be.

Further information can be gained by repeating
the same build with the -save-temps option. This saves the
intermediate files created by GCC during the build. In addition to the
binary and source file, you now have filename.i, filename.s
and filename.o. The .i file contains your source after preprocessing,
the .s contains the translated assembly and the .o is the assembled
file before any linking happens. Using the file command on these files
provides some information as to what they are.

The thing to focus on while looking through the temp files is the type
and amount of code added at each step, as well as where the code comes
from. Attackers look for places where they can add
code, often called payloads, without being noticed. Attackers
also must add statements somewhere in the flow of a program to execute the
payload. Ideally, attackers would do this with the least
amount of effort, changing only one or two files. The phase that satisfies
both these requirements is the linking phase.

The linking phase, which generates the final ELF binary, is the
best place for attackers to strike if they want their changes to go
undetected. The linking phase also gives attackers a chance to
modify the flow of the program by changing the files that are linked in
by the compiler. Examining the verbose output of the Hello World
build, you can see several files like ld-linux.so.2 linked in. These are
the files an attacker will pay the most attention to because they contain
the standard functions the program needs to work. These collections are
often the easiest place in which to add a malicious payload and the code to
call it, often by replacing only a single file.
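
To see which shared objects a running program has actually pulled in,
a process can inspect its own memory map. The following is a minimal
sketch, assuming a Linux system with /proc mounted; the function name
list_mapped_shared_objects is a hypothetical helper and not part of the
original listings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Prints the path of each executable mapping that is backed by a
 * shared object and returns how many were found, or -1 if
 * /proc/self/maps cannot be opened. Linux-specific.
 */
int list_mapped_shared_objects(FILE *out)
{
    char line[512], perms[5], path[256];
    int count = 0;
    FILE *maps;

    assert(out != NULL);
    maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return -1;

    while (fgets(line, sizeof(line), maps) != NULL) {
        /* Fields: address perms offset dev inode [path] */
        if (sscanf(line, "%*s %4s %*s %*s %*s %255s", perms, path) != 2)
            continue;   /* anonymous mapping, no path */
        if (strlen(perms) == 4 && perms[2] == 'x' &&
            strstr(path, ".so") != NULL) {
            fprintf(out, "%s\n", path);
            count++;
        }
    }
    fclose(maps);
    return count;
}
```

Calling list_mapped_shared_objects(stdout) from a dynamically linked
program should list the C library and the dynamic loader; a statically
linked program is expected to report none.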

Let's take a small aside here and discuss some parts of
ELF binaries, how they work and how attackers can use this to their
advantage. Ask many people who write C code where their programs begin
executing and they will say “main”, of course. This is true only
to a point; main is where the code they wrote begins execution, but in
actuality, code has been executing long before main. You can examine this
with tools like nm, readelf and gdb. Executing the command readelf -l
hello shows the entry point for the program. This is where the
program begins executing. You then can look at what this does by setting
a breakpoint at the entry point and then running the program. You will
find the program actually starts executing at a function called _start,
line 47 of file <glibc-base-directory>/sysdeps/i386/elf/start.S. This
is actually part of glibc.
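
The entry point that readelf reports is simply a field in the ELF
header, so it also can be read programmatically. The sketch below
assumes a Linux system that provides <elf.h> and a file whose byte
order matches the host; read_elf_entry is a hypothetical helper name
introduced for illustration.

```c
#include <assert.h>
#include <elf.h>
#include <stdio.h>
#include <string.h>

/*
 * Reads the entry point address from the ELF header of the file at
 * `path`. Returns 0 on success and stores the address in *entry;
 * returns -1 if the file is unreadable or is not an ELF file.
 * Assumes the file uses the host's byte order.
 */
int read_elf_entry(const char *path, unsigned long long *entry)
{
    unsigned char buf[sizeof(Elf64_Ehdr)];
    size_t n;
    FILE *fp;

    assert(path != NULL && entry != NULL);
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf), fp);
    fclose(fp);

    /* Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'. */
    if (n < sizeof(Elf32_Ehdr) || memcmp(buf, ELFMAG, SELFMAG) != 0)
        return -1;

    if (buf[EI_CLASS] == ELFCLASS32) {
        Elf32_Ehdr hdr;
        memcpy(&hdr, buf, sizeof(hdr));
        *entry = hdr.e_entry;
        return 0;
    }
    if (buf[EI_CLASS] == ELFCLASS64 && n == sizeof(Elf64_Ehdr)) {
        Elf64_Ehdr hdr;
        memcpy(&hdr, buf, sizeof(hdr));
        *entry = hdr.e_entry;
        return 0;
    }
    return -1;
}
```

Passing the path of the hello binary should yield the same address
that readelf prints as the entry point, which is the address of _start
rather than main.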

Attackers can modify the assembly directly, or they can trace
the execution to a point where they are working with C for
easier modifications. In start.S, __libc_start_main is called
with the comment “Call the user's main function”.
Looking through the glibc source tree brings you to
<glibc-base-directory>/sysdeps/generic/libc-start.c. Examining this
file, you see that not only does it call the user's main function,
it also is responsible for setting up the command-line and environment
arguments, such as argc, argv and envp, to pass to main. It is also in C,
which makes modifications easier than in assembly. At this point, making
an effective attack is as simple as adding code to execute before main
is called. This is effective for several reasons. First, in order for
the attack to succeed, only one file needs to be changed. Second, because
it is before main(), typical debugging does not discover it. Finally,
because main is about to be called, all the built-ins that C coders
expect already have been set up.
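
The fact that code can run before main() is easy to demonstrate without
touching glibc at all. The following sketch uses the constructor
function attribute, a GCC extension also accepted by Clang; it is a
stand-in for illustration and is not the libc-start.c modification
described in this article. The names early_hook and early_hook_count
are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

static int early_hook_calls;    /* incremented by the hook below */

/*
 * GCC/Clang extension: functions marked as constructors are run by
 * the startup code before main() is entered.
 */
__attribute__((constructor))
static void early_hook(void)
{
    early_hook_calls++;
    fputs("Before main()\n", stderr);
}

/* Returns how many times the hook has run so far. */
int early_hook_count(void)
{
    assert(early_hook_calls >= 0);
    return early_hook_calls;
}
```

A main() that does nothing but call early_hook_count() should get 1
back, even though main() itself never calls early_hook().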

Attack

Now that we have completed a general introduction to GCC and the parts of
interest, we can apply the knowledge to attacks. The simplest attack
is to add new functionality, invoked by a command-line option.
Let's attack libc-start.c, because it is easier to wait for command-line
options to be set up for us than to do it with our own code.

This type of work should be done on a machine of
little importance, so that it can be re-installed when
necessary. The version of glibc used here is 2.3.1, built on Mandrake 9.1. The initial build
is lengthy, but as long as the build tree isn't
cleaned, future compiles should be pretty quick.

The first example makes simple text appear before
and after the main body executes. In order to do this,
the library that is linked in by the compiler is
modified. The modifications to libc-start.c simply
add hello and good-bye messages that are
displayed as the program runs. The modifications
include adding stdio.h as a header and two simple
printf statements before and after main, as shown in
Listing 2. With these simple changes made, kick off
another build of glibc and wait.

Listing 2. Modifications to the libc-start.c for Hello World

/* XXX This is where the try/finally handling
must be used. */
printf("Before main()\n");
result = main (argc, argv, __environ);
printf("After main()\n");

Waiting until the build is finished is not necessary. You can build
programs from the compile directory without risking machine
usability due to a faulty glibc install. Doing this requires some tricky
command-line options to GCC. For simplicity of demonstration, the
binary is built statically, as shown in Listing 3. The program
compiled is a simple Hello World program.

Pay close attention to -nostdlib, -nostartfiles
and -static. These options are followed by the paths of libraries for
the common C library, as well as standard libs like -lgcc.
These unusual options instruct GCC not to link in the standard libraries and
startup files. This allows us to specify exactly what we want linked
in and where. After the compile is complete, we are left with a hello ELF
binary as expected, but it is much larger than normal. This is a side
effect of building the program statically, meaning that the required
functions are built within the program, rather than relying on them to
be loaded on an as-needed basis. Running the binary results in our
messages being displayed before and after the hello world message, and
it verifies that we can indeed execute code before the developer intends.

A real attacker would not have to build statically
and could subvert the system copy of glibc in place
so that executables would look normal.

Looking back at the libc-start source file, it's easy to tell that this
function sets up argc, argv and envp before calling main(). Moving
on from displaying text, the execution of a shell is the next
step. Because an attacker would not want anyone to know that a
modification of this gravity exists, this shell executes only if
the correct command-line option is passed. The source file already
includes unistd.h, so it is simple and tempting to use getopt to parse
the command-line options before main() is called. Although this will
work, it can lead to discovery if getopt errors out due to unknown
options. I wrote a brief snippet of code that searches argv
for the option to invoke the shell, as shown in Listing 4. When you exit
the shell, you will notice the program continues operating normally.
Unless you knew the option used to start the shell, more than
likely you never would have known this back door existed.
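
The reason a hand-rolled search is quieter than getopt can be seen in
a few lines. This is a minimal sketch of a silent argv scan, not a
reproduction of Listing 4; the function name find_option is
hypothetical, and what happens after a match is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns the index of the first argument equal to `option`, or -1 if
 * it is absent. Unlike getopt, it never prints a diagnostic, so
 * options it does not recognize pass through untouched.
 */
int find_option(int argc, char *const argv[], const char *option)
{
    int i;

    assert(option != NULL);
    if (argv == NULL)
        return -1;

    /* argv[0] is the program name, so the scan starts at 1. */
    for (i = 1; i < argc && argv[i] != NULL; i++)
        if (strcmp(argv[i], option) == 0)
            return i;
    return -1;
}
```

Because the scan only reads argv, the program's own option handling in
main() later sees exactly the arguments the user typed.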

The previous examples are interesting, but they really don't do
anything noteworthy. The next example adds a unique identifier
to every binary built with GCC. This is most useful in honeypot-like
environments where it is possible an unknown party will build a program
on the machine, then remove it. The unique identifier, coupled with a
registry, can help a forensics analyst trace a program back to its point
of origin and establish a trail to the intruder.

There could be much debate about what the unique identifier should
be and how it should be generated. To avoid a trip to Crypto 101, the
identifier is a generic 26-character string. To prevent immediate
detection, the identifier is added as a void function that is visible
using nm. Its name is __ID_abcdefghijklmnopqrstuvwxyz(). This
is added to libc-start.c. After rebuilding glibc and compiling
the test program, the value is visible. The value I chose is for
demonstration purposes. In reality, the more obscure and
legitimate-sounding the identifier, the harder it is to detect. My choice for a
name in a real scenario would be something like __dl_sym_check_load(). In
addition to tagging the binary at build, a token could be inserted
that would create a single UDP packet, with the only payload being the
IP address of the machine on which it is running. This could be sent to
a logging server that could track what binaries are run in what places
and where they were built.
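
The tagging idea can be tried in an ordinary program before touching
glibc. The sketch below defines the same empty marker function in
user code, assuming GCC or Clang for the function attributes; in the
article it lives in libc-start.c instead. The helper build_marker and
the marker_fn type are hypothetical names added so the example can be
checked.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*marker_fn)(void);

/*
 * Empty marker function whose only purpose is to leave a recognizable
 * symbol in the binary. The attributes ask the compiler to keep it
 * even though nothing needs to call it.
 */
__attribute__((used, noinline))
void __ID_abcdefghijklmnopqrstuvwxyz(void)
{
}

/* Returns the marker's address so a caller can confirm it was linked in. */
marker_fn build_marker(void)
{
    marker_fn fn = __ID_abcdefghijklmnopqrstuvwxyz;

    assert(fn != NULL);
    return fn;
}
```

Running nm on an unstripped binary built from this code should show
the __ID_ symbol in the text section; stripping the binary removes it,
which is one limitation of tagging by symbol name.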

One of the more interesting elements of this attack vector is the
ability to make good code bad. The string-copy functions are a perfect
example, because there is both an unsafe version, strcpy, and a safer
one, strncpy, which takes an additional argument limiting how much of a
string is copied. Without reviewing how a buffer overflow works, strcpy
is far more desirable to an attacker than its bounds-checking big
brother. This is a relatively simple change that should not attract too
much attention, unless the program is stepped through with a debugger. In
the directory <glibc-base>/sysdeps/generic, there are two files, strcpy.c
and strncpy.c. Comment out everything strncpy does and replace it with
return strcpy(s1,s2);.

Using GDB, you can verify that this actually works by writing a snippet
of code that uses strncpy, and then single stepping through it. An easier
way to verify this is to copy a large string into a small buffer and wait
for a crash like the one shown in Listing 6.
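
A gentler check than waiting for a crash is to copy into a buffer that
is large enough to survive an unbounded copy and then look at the bytes
past the requested length. This is a minimal sketch under that
assumption; strncpy_respects_bound is a hypothetical name, and the
volatile function pointer is there so the compiler cannot substitute
its own built-in strncpy for the library's.

```c
#include <assert.h>
#include <string.h>

/*
 * Returns 1 if the C library's strncpy() copied exactly `limit` bytes
 * and left the rest of the buffer untouched, 0 otherwise.
 */
int strncpy_respects_bound(void)
{
    static const char src[] = "this source string is longer than eight bytes";
    char buf[64];
    const size_t limit = 8;
    size_t i;
    /* Force a real call into the library instead of a compiler built-in. */
    char *(*volatile copy)(char *, const char *, size_t) = strncpy;

    /* The buffer must hold all of src so that even a subverted,
     * unbounded copy cannot overflow it during the check. */
    assert(sizeof(src) <= sizeof(buf));

    memset(buf, 0x5A, sizeof(buf));     /* fill with a known pattern */
    copy(buf, src, limit);

    if (memcmp(buf, src, limit) != 0)
        return 0;
    for (i = limit; i < sizeof(buf); i++)
        if ((unsigned char)buf[i] != 0x5A)
            return 0;                   /* bytes past the limit changed */
    return 1;
}
```

Against an unmodified glibc this returns 1. Linked against a library
whose strncpy has been replaced with a call to strcpy, the fill pattern
past the eighth byte is overwritten and the function returns 0.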

Depending on the function of the code, it may
be useful only if it is undiscovered. To help keep
it a secret, adding conditional execution code
is useful. This means the added code remains
dormant unless a certain set of circumstances is
met. An example of this is to check whether the binary is
built with debug options and, if so, do nothing. This
helps keep the chances of discovery low, because a
release application might not get the same scrutiny
as a debug application.

Defense and Wrap-Up

Now that the whats and the hows of this vector have been explored,
the time has come to discuss ways to discover and stop these sorts of
attacks. The short answer is that there is no good way. Attacks of this
sort are not aimed at compromising a single box but rather at dispersing
trojaned code to the end user. The examples shown thus far have been
trivial and are intended to help people grasp the concepts of the attack.
However, without much effort, truly dangerous things could emerge. Some
examples are modifying gpg to capture passphrases and public keys,
changing sshd to create copies of private keys used for authentication,
or even modifying the login process to report user names and passwords to
a third-party source. Defending against these types of attacks requires diligent
use of host-based intrusion-detection systems to find modified system
libraries. Closer inspection at build time also must play a crucial
role. As you may have discovered looking at the examples above, most of
the changes will be made blatantly obvious in a debugger or by using tools like
binutils to inspect the final binary.
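
The core of a host-based check for modified libraries is comparing a
file's current fingerprint against one recorded while the system was
known to be clean. The sketch below uses the 64-bit FNV-1a hash purely
to stay self-contained; it is not cryptographically strong, and a real
intrusion-detection tool would use a cryptographic hash and protect
the baseline itself. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Computes the 64-bit FNV-1a hash of the file at `path`.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
int file_fnv1a64(const char *path, uint64_t *out)
{
    unsigned char buf[4096];
    uint64_t hash = UINT64_C(0xcbf29ce484222325);   /* FNV offset basis */
    size_t n, i;
    int failed;
    FILE *fp;

    assert(path != NULL && out != NULL);
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0) {
        for (i = 0; i < n; i++) {
            hash ^= buf[i];
            hash *= UINT64_C(0x100000001b3);        /* FNV prime */
        }
    }
    failed = ferror(fp);
    fclose(fp);
    if (failed)
        return -1;

    *out = hash;
    return 0;
}

/*
 * Returns 1 if the file still matches the recorded baseline hash,
 * 0 if it differs and -1 if it cannot be read.
 */
int matches_baseline(const char *path, uint64_t baseline)
{
    uint64_t current;

    if (file_fnv1a64(path, &current) != 0)
        return -1;
    return current == baseline;
}
```

The baseline for each system library would be recorded once on a
trusted install and stored where an intruder cannot rewrite it;
otherwise the attacker simply updates the baseline along with the
library.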

One more concrete method of defense involves
profiling all functions occurring before and after
main executes. In theory, the same versions of glibc
on the same machine should behave identically. A tool
that keeps a known safe state of this behavior and
checks newly built binaries will be able to detect
many of these changes. Of course, if attackers knew
a tool like that existed, they would try to evade
it using code that would not execute in a debugger
environment. The most important bit of
knowledge to take away from this article is not the
internal workings of glibc and GCC or how unknown
modifications can affect a program without alerting
the developer or the end user. The most important
thing is that, in this day and age, anything can be
used as a tool to undermine security, even the most
trustworthy staples of standard computing.