Analysis of "the-binary"

May 27, 2002

Phase 1: Getting acquainted

The first thing I did after downloading the binary was to run 'strings'
on it. The output gave a pretty strong idea that this was a Linux
binary (evidence: '@(#) The Linux C library 5.3.12'). Next, I loaded
it up in IDA Pro (4.1.7.600 on a Win2k box). The binary was confirmed
to be in ELF format, which caused me to get sidetracked reading up on the
format of ELF binaries. This being the first binary I have taken
a serious look at, I wanted to do some learning along the way. With
the ELF specification at hand, I loaded the binary up in my favorite hex
editor, UltraEdit, and started poking around to satisfy myself that I understood
the ELF specification. This greatly assisted me in understanding
all that IDA was telling me.
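
As a rough illustration of the kind of poking around described above, the following Python sketch decodes the first 24 bytes of an ELF header. The field offsets follow the ELF specification; the function name and the small lookup tables are made up for this example and cover only a few common values.

```python
import struct

ELF_MAGIC = b"\x7fELF"
# Partial lookup tables (illustrative, not exhaustive).
CLASS_NAMES = {1: "32-bit", 2: "64-bit"}
DATA_NAMES = {1: "little-endian", 2: "big-endian"}
TYPE_NAMES = {1: "relocatable", 2: "executable", 3: "shared object", 4: "core"}
MACHINE_NAMES = {3: "Intel 80386"}


def describe_elf_header(header: bytes) -> dict:
    """Decode e_ident plus e_type, e_machine and e_version (hypothetical helper)."""
    if len(header) < 24 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF header")
    ei_class, ei_data = header[4], header[5]
    # e_ident[EI_DATA] tells us the byte order of the fields that follow.
    order = ">" if ei_data == 2 else "<"
    e_type, e_machine, e_version = struct.unpack_from(order + "HHI", header, 16)
    return {
        "class": CLASS_NAMES.get(ei_class, "unknown"),
        "data": DATA_NAMES.get(ei_data, "unknown"),
        "type": TYPE_NAMES.get(e_type, "unknown"),
        "machine": MACHINE_NAMES.get(e_machine, "unknown"),
        "version": e_version,
    }
```

To try it on a real file, read the first 24 bytes in binary mode and pass them to describe_elf_header.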

Phase 2: Understanding data flows

IDA does a fantastic job of analyzing a binary, creating labels and cross
references during its initial analysis. I elected to go after low
hanging fruit first and tracked down all of the Linux system calls that
I could find. IDA had already labeled each one as to its purpose
with comments such as 'LINUX - sys_socketcall'. Using a Linux system
call reference allowed me to start naming functions and identifying data
types. My goal was to rename as many functions as I could according to their
purpose, comment as many lines of code as I could to remind myself what
I had discovered, define data structures based on known parameter requirements
for the system calls, and lastly to start renaming local variables and
parameters according to their purpose. Having many named data types,
variables and functions made understanding the code that much easier when
I finally dug into the main function. I generally elected to do a
depth first search into the code in order to try to discover the purpose/data
type of data used as function parameters, as I feel the single most useful
piece of the reverse engineering puzzle is knowing what kind of data is
being manipulated. It brings many other aspects of the code into
proper context.

Phase 3: Code Analysis

With a reference sheet for x86 assembly language at my side, the next thing
I did was dig into the code. I also found it very useful to have
access to the Linux man pages for all of the library functions that were
being called. Again this was invaluable in clueing me in to exactly
what types of data were being manipulated. The inclusion of so many
system calls led me to the conclusion (admittedly slow in coming)
that the binary was statically linked to all of its required library functions.
I found myself often frustrated with not following the logic behind some
section of code, and having the gut feeling that I was attempting to reverse
engineer the standard library. The appearance of printf style format
strings in the main function certainly suggested calls to functions such
as sprintf. In order to keep myself from drilling too far down into
library code when I wanted to be focussing on the author's code, I loaded
up a copy of the sources for the libc 5.3.12 C libraries. By
comparing the code I was looking at in IDA with the source code, I was
able to identify library functions much faster. Grep helped me identify
the gethostbyname function for example by searching the library for one
of the strings used in the function. This was a tremendous time saver
and validation technique. It allowed me to focus my efforts on the
tool code.
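
The string-matching trick above can be sketched in a few lines of Python: pull printable strings out of a chunk of the binary, then look for source files that contain one of them. This is only a stand-in for the 'strings' and grep commands; both function names are hypothetical, and the minimum string length of 4 is an assumption.

```python
import os
import re


def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return re.findall(pattern, data)


def find_files_containing(root: str, needle: bytes) -> list:
    """Return sorted paths under root whose contents include needle."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    if needle in handle.read():
                        matches.append(path)
            except OSError:
                continue  # unreadable file: skip it
    return sorted(matches)
```

A string that appears in exactly one library source file is a strong hint as to which library function the surrounding code belongs to.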

Phase 4: Code Functionality

With many functions and data types identified, I started to focus on the
tool code. By stepping through the code in my head, it became clear
that the tool attempts to hide itself by resetting argv[0] to "[mingetty]"
and forking. Next, following the closing of stdin, stdout, and stderr,
a raw IP socket is opened to receive IP packets that use protocol 11 (NVP-II).
This caused a brief sidetrack to the RFC on NVP-II before I decided to
let the binary tell me how it was using this protocol.
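
For anyone inspecting captured traffic, the protocol 11 channel is easy to spot from the IP header alone. The Python sketch below decodes the fixed 20-byte IPv4 header and checks the protocol field. It works on bytes that have already been captured and does not open any sockets; the function names are made up for this example.

```python
import struct

NVP_II = 11  # IP protocol number the tool listens for, per the analysis above


def parse_ipv4_header(packet: bytes) -> dict:
    """Decode the fixed 20-byte portion of an IPv4 header (hypothetical helper)."""
    if len(packet) < 20:
        raise ValueError("too short for an IPv4 header")
    (ver_ihl, _tos, total_len, _ident, _flags_frag,
     _ttl, protocol, _checksum, src, dst) = struct.unpack_from(
        "!BBHHHBBH4s4s", packet, 0)
    if ver_ihl >> 4 != 4:
        raise ValueError("not IPv4")
    return {
        "header_len": (ver_ihl & 0x0F) * 4,  # IHL counts 32-bit words
        "total_len": total_len,
        "protocol": protocol,
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }


def is_agent_traffic(packet: bytes) -> bool:
    """True when the packet carries IP protocol 11."""
    return parse_ipv4_header(packet)["protocol"] == NVP_II
```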

The tool then enters a loop to receive packets and perform tasks based
on the received data. IDA identified the switch table and labeled
all of the cases, making life much easier. There was only one function
call between the recv and the switch. The function was a bit more
than I could follow at first, so I moved on to examining the cases.
Always one for an easy target, I scanned the cases for things that looked
familiar. Case 3 looked like it wanted to invoke the shell to run
a command, based on the format string that it references. The "rb"
string that it references was a clue that it wanted to read a file and
led to discovery of the fopen, fread, fclose and unlink functions.
A logical conclusion here was that if the tool wanted to read data following
execution of a command, it was probably going to send that data back to
the attacker. This led to the discovery of the data transmission
functions, and pointed out that some of the parameters necessary to send
data to the attacker were only ever modified, and thus probably being initialized,
in case 2. Analysis of how data was used in this case started to
reveal the structure of incoming packets. One interesting programming
flaw in this case is that results are sent back to the handler in 398 byte
chunks, but random padding tacked on the end of each packet causes transmitted
packets to range in size from 400 to 600 bytes. The content of the
padding is hardly random, however. Sloppy buffer allocation results
in the padding bytes being taken from the unencoded command output, so that
output is sent in the clear. This explains the appearance of
plain text command output in the sample snort log provided by the Honeynet
contest team as shown in the raw packet below:

The clear text output shown above was most likely not intended to be
exposed by the attacker. Its encoded version is contained starting
at byte 0x0018 of the packet.

Case 6 was the next case to fall. Opening a backdoor listener
was fortunately something I had seen done before. The code following the
acceptance of an incoming connection revealed that the initial input had
to match the string "TfOjG" (found in the strings output), after being
upshifted by 1 position. This revealed the backdoor password as "SeNiF".
Case 7 was next as a simple command execution case. Case 8 was a
bit trickier. It was apparent that it was killing some process, but
which one was the question. Looking at the variable that it tested
(which had to be a process id in order to be passed to kill) and seeing
where it was set led to the initial conclusion that case 8 was used to
close the backdoor listener. It turned out that it was also used
to terminate a variety of other actions that might be in progress as well.
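
The password check in case 6 is simple enough to reproduce. The Python sketch below shifts each character up by one position and compares the result with the stored string, and also runs the shift in reverse to recover the password. The function names are made up, and the exact-match comparison is an assumption about how the agent compares its input.

```python
STORED = "TfOjG"  # string found in the strings output


def upshift(text: str, amount: int = 1) -> str:
    """Add amount to each character code, as described for the backdoor input."""
    return "".join(chr(ord(ch) + amount) for ch in text)


def recover_password(stored: str, amount: int = 1) -> str:
    """Undo the shift to find the input that would match the stored string."""
    return "".join(chr(ord(ch) - amount) for ch in stored)


def check_login(candidate: str) -> bool:
    """Assumed comparison: the shifted input must equal the stored string."""
    return upshift(candidate) == STORED
```

Calling recover_password(STORED) gives "SeNiF", matching the password found above.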

I noticed the very similar code of cases 4, 5, 9, 10, 11, and 12, and
elected to start by examining how each case set itself up prior to calling
a function specific to each case. The initial code in each case helped
point out what case 8 was killing off. When one of these 6 commands
is received, the code checks to see if some "service" is already in progress,
a "service" being the actions performed by these cases as well as case
6. If a service is already active, then these commands are silently
ignored. If no service is active, then each of these cases forks
and the child becomes the new active service. No more than one service
process can be active at any given time. All service processes, once
activated, continue indefinitely until terminated using command 8.
Following the fork, the child processes in each case manipulated the data
received in the command packet in order to pass parameters to a case specific
function. Analysis of exactly how these parameters were manipulated
was essential in understanding the command formats expected by the agents
(see the answers section for a more detailed
explanation of the format of each command). Each of these six
cases finishes with a call to a function that actually implements the case
specific service. A total of four separate service functions are
called by the six cases, meaning that four distinct services are available
for activation (in addition to the backdoor service). Cases 4 and
9 invoked the same service, and cases 10 and 11 invoked the same
service. While the activation commands have a slightly different
format, it turns out that case 4 is simply a specialized version of case
9. There is actually no need to include case 4 in the code.
Similarly, case 10 is a specialized version of case 11. Cases 5 and
12 each perform unique services.
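
The one-service-at-a-time rule can be modeled in a few lines of Python. This is a toy state machine, not the tool's code: nothing is forked or sent. The service labels are invented here, based on the descriptions in Phase 5, and treating command 6 as subject to the same "already active" check is an assumption.

```python
# Command number -> service label (labels are hypothetical).
SERVICE_FOR_COMMAND = {
    4: "dns_response_flood", 9: "dns_response_flood",
    5: "icmp_udp_flood",
    10: "syn_flood", 11: "syn_flood",
    12: "dns_query_flood",
    6: "backdoor",
}
STOP_COMMAND = 8


class AgentModel:
    """Tracks which service, if any, is currently active."""

    def __init__(self):
        self.active = None

    def handle(self, command: int) -> str:
        if command == STOP_COMMAND:
            stopped, self.active = self.active, None
            return "stopped " + stopped if stopped else "nothing to stop"
        service = SERVICE_FOR_COMMAND.get(command)
        if service is None:
            return "not a service command"
        if self.active is not None:
            return "ignored"  # a service is already running
        self.active = service
        return "started " + service
```

Feeding it the sequence 4, 10, 8, 11 shows the second command being ignored until command 8 clears the active service.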

Phase 5: The Tool reveals itself

These four services turn out to contain the "attack" capabilities of the
tool. I am not an expert, so I will do my best to describe the capabilities
of each of the services.

Analysis of case 4 (and 9) showed it to perform some form of DNS response
flooding against a specified target. Unfortunately this case also
turned out to be the least straightforward to analyze because it references
static data outside the function. Determining what was going on was
a result of looking at the socket setup and what data was sent in the resulting
sendto call (the socket and sendto functions were known because of the
earlier analysis of the Linux system calls made by the program).
A raw internet protocol socket was being created, and the outgoing message
was being assembled in a specific message
buffer. Knowing that an IP header would be the first thing in the
buffer, I defined an ip_header structure in IDA pro and overlaid it at
the start of the message buffer. At this point, the SANS "TCP/IP
and tcpdump Pocket Reference Guide" became my best friend. This greatly
assisted me in identifying what information was being used to construct
each message. The key piece of information was the fact that a UDP
packet was being built. As a result, IDA was used to define a UDP
header structure as well. The structure definition feature of IDA Pro
proved to be invaluable in showing what types of data the program was manipulating.
As seen in the screen shot to the right, life is a lot easier when IDA
is telling you exactly what data field is being manipulated. The
two sticky parts to this function were where the destination IP was coming
from and where the data portion of the packet was coming from. The
destination IP was pulled from a block of initialized static memory, which
upon further analysis turned out to be a block of 11444 IP addresses, presumably
all DNS servers once the rest of the function is understood. The
data portion of each packet was being copied in from another initialized
block of static memory, that happened to contain strings like "com", "org",
and "edu" among others. This was the tip-off that these were perhaps
DNS queries, and that the list of IPs held DNS servers. The query table contained
9 queries in all. A humorous feature of this table, discovered during
the live testing phase, was that 4 of the 9 queries were malformed.
Once the looping structure of the function was understood, the basic algorithm
turned out to be essentially this:

    repeat forever
        if necessary, resolve target host name
        for each of the 9 queries
            send a DNS lookup to all 11444 servers,
            spoofing the target IP as the source address

The result is that the target machine is flooded with DNS responses to
queries that it did not send. This type of attack is effective because
DNS responses are generally allowed through firewalls. Stateful firewalls
that match incoming responses to valid requests should drop these incoming
packets, having failed to see a previous matching request. More details
of this particular feature of the tool are available
here. It seemed apparent after analysis of this case that

Using similar analysis techniques, command 5 was discovered to be capable
of sending either an icmp echo flood, or a udp flood. The sloppiness
of the original author revealed itself in this function. First, he
sets the fragment field of each packet to a non-zero value, leading recipients
to believe they have received a fragmented packet. In the case of
the icmp flood, which I assume was meant to be a ping flood, the target
won't respond with an echo reply because it never receives all of the
fragments (I learned this in the live phase). The second sloppy feature
of this function is that the author does not know how to compute the checksum
of a udp header, so every udp packet sent has a bad checksum. Again
I found it somewhat funny that even if he had the proper checksum, the
first thing he does after computing it is to change one of the data bytes
in the packet. This function simply loops forever sending either
icmp or udp packets. Whether it sends icmp or udp packets is selectable
via one of the function's parameters, and can be specified by the agent's
handler. For more information on this command go
here.
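
Both flaws are easy to check for in a captured packet. The Python sketch below computes the UDP checksum the standard way, over the pseudo-header plus the UDP header and data as RFC 768 defines it, and tests the IP flags/fragment-offset field. It is a verification aid for captured bytes, not the tool's own routine, and the function names are made up.

```python
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit ones' complement sum, padding odd-length input with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return total


def _pseudo_header(src_ip: bytes, dst_ip: bytes, udp_len: int) -> bytes:
    # Source, destination, zero byte, protocol 17 (UDP), UDP length.
    return src_ip + dst_ip + struct.pack("!BBH", 0, 17, udp_len)


def udp_checksum(src_ip: bytes, dst_ip: bytes, udp_segment: bytes) -> int:
    """Checksum for a UDP segment whose checksum field is currently zero."""
    data = _pseudo_header(src_ip, dst_ip, len(udp_segment)) + udp_segment
    checksum = ~ones_complement_sum(data) & 0xFFFF
    return checksum or 0xFFFF  # zero on the wire means "no checksum"


def udp_checksum_ok(src_ip: bytes, dst_ip: bytes, udp_segment: bytes) -> bool:
    """A correctly checksummed segment sums to 0xFFFF with its checksum in place."""
    data = _pseudo_header(src_ip, dst_ip, len(udp_segment)) + udp_segment
    return ones_complement_sum(data) == 0xFFFF


def looks_fragmented(flags_and_offset: int) -> bool:
    """True if More Fragments is set or the 13-bit fragment offset is non-zero."""
    return bool(flags_and_offset & 0x3FFF)
```

Changing any data byte after the checksum has been filled in makes udp_checksum_ok return False, which is the situation described above.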

Command 10, perhaps the easiest of the cases to analyze, is a basic
tcp SYN flooder. Command 10 is a special case of command 11 and could
have been omitted. The analysis of this case was similar to commands
4, 5, and 9. Understanding the data being passed to the socket and
sendto functions helped define the remainder of the function. More
information on this command is available here.

Finally, command 12 is a variation on the DNS flooder of command 4.
Having deciphered command 4 made this case easier. The table of predefined
DNS queries was referenced, while the table of DNS server IP addresses
was not. All that remained was to analyze the loops. This function
is much simpler in that it directs all of the DNS queries to a server specified
by the handler in the command that activates this function. Because
the agent can be told to randomize the source address from which the query
originates, this command appears to be aimed specifically at flooding a
target DNS server with requests. The algorithm for this function
loops indefinitely, sending each of the nine canned queries for each pass
in the main loop. Command 12 is described here.

All of the DoS attacks execute a brief delay after sending each packet.
The intent of this delay is not immediately apparent to me, other than
to allow the agent host a chance to receive incoming packets occasionally,
and thus receive an incoming command to terminate the attack.

Phase 6: Live Testing

Armed with what I felt was thorough knowledge of the workings of the tool,
I moved into live testing on a test network. My personal preference
is to know exactly what to expect and have a plan to probe the tool looking
for very specific responses. I did not want to have to try to analyze
network traffic and wonder what the tool was doing and why it did it.
My test setup consisted of only two computers, a prober, and a host for
the-binary. I wrote simple packet generator programs whose goal it was
to invoke each of the behaviors of the tool. These programs needed to make
use of the network encoding scheme of the-binary. That scheme is
described on the answers page. I used
ethereal to monitor network traffic. At this stage of my analysis,
this was merely a validation exercise. I expected to see, and did
see each of the behaviors previously described.

Testing proceeded as follows:

ps -ax on agent host for reference

netstat -a on agent host for reference

lsof on agent host for reference

launch ethereal on prober

launch the-binary on agent host, no network traffic noted

lsof on agent host for comparison

ps -ax on agent host reveals new process [mingetty]

From prober, issue command 2 to set transmit list, ethereal confirms command
sent

From prober, issue command 3 to retrieve a list of active processes on
the agent host. Ethereal confirms transmission of the command and
receipt of resulting data packets sent from agent to the prober.

Decoder verifies that the packets newly received by the prober are in fact
the results of a ps command issued at the agent host.

Prober sends a kill command to be executed by the command 7 case of the-binary,
ps on the agent host verifies that the command executed properly.

Prober sends commands to initiate the flood attacks described above.
Ethereal confirms the transmission of these flood packets.

Use command 8 to terminate each of the flooding attacks.
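
The before-and-after ps comparison in the steps above can be automated. The Python sketch below parses two snapshots of 'ps -ax' style output and reports processes that appear only in the second. It assumes the usual five-column layout (PID, TTY, STAT, TIME, COMMAND), and the function names are made up.

```python
def parse_ps(output: str) -> dict:
    """Map PID to command text for each row of `ps -ax` style output."""
    processes = {}
    for line in output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 4)      # COMMAND may itself contain spaces
        if len(fields) == 5 and fields[0].isdigit():
            processes[int(fields[0])] = fields[4]
    return processes


def new_processes(before: str, after: str) -> dict:
    """Return {pid: command} for processes present only in the later snapshot."""
    old = parse_ps(before)
    return {pid: cmd for pid, cmd in parse_ps(after).items() if pid not in old}
```

Run against snapshots taken before and after launching the-binary, the only new entry should be the disguised [mingetty] process.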

Things learned from live testing:

The icmp and udp packets of attack 5 contained non-zero fragment numbers
and the target failed to respond to the attempted ping of the ping flood.

The udp packets of attack 5 are not checksummed properly.

4 of the 9 canned DNS queries for attacks 4, 9, and 12 were malformed (those
for "ie", "es", "de", and "gr", all European countries?).

Agent/Handler communications packets (IP protocol 11 packets) sailed right
through a RedHat 7.3 "high security" firewall configuration as selected
during the RedHat installation process. Full control of the agent
was possible with the single exception of being able to connect to the
backdoor listener. This was easily remedied by using the agent capabilities
to shut down the firewall.

Phase 7: Just for the hell of it

A true test of understanding is to duplicate the original, so just for
kicks, I wrote a functional equivalent to the-binary in C and tested it
out against the above procedures. The source code for the reverse
engineered version is available here: the-binary.c.
I also wrote scanner.c, which scans a class C network
for running instances of the-binary and lists any machines on which it
is found, and killer.c, which connects to a running agent and
kills it off.