Parsing file formats and converting them for further processing is a fundamental activity in many software tasks. Taking a cue from the gaming community, where graphics files are converted from one format to another by writing custom parsers, we will walk through the steps of writing a parser for the Android DEX file format used by the Dalvik VM.

What Is a File Format?

A file format is, in its simplest abstraction, a data structure: a specific arrangement of data. In this context, the data elements are organized in memory or on a filesystem as a linear sequence of bytes.

Many file formats, though not all, have a header or metadata section. Like a table of contents, it describes the data without necessarily being the data itself, and gives 1) the locations and 2) the lengths of the individual data elements contained in the file. Locations specify where to start reading and lengths specify the number of bytes covered. When you are dealing with such a file format, you need to consult the specification or documentation provided by its vendor or author. In our case, the DEX format is very well documented. When no detailed specification is available, you have to reverse engineer the parsing code of the software that uses the format. A quick online search for file format specifications takes us to sites like Wotsit.org, which serves as a repository for various software and hardware formats. There are some caveats though: quite a few of the links don’t work, not all submissions are complete, and the list is selective rather than exhaustive. Many closed source formats are simply not there.

Brief Overviews

The DEX file format is a little involved. It consists of a header that clearly demarcates where each file section starts. Each section contains items of a certain type, for instance, a MAP LIST, METHOD LIST, CLASS LIST, STRINGS LIST, and the like. The header specifies the location and the size of each. The format also uses the ULEB128 data type for certain fields. Further, each section must be parsed deeply, following the levels and sub-references within it, and, finally, the extracted information must be cross-referenced to recover all the structures contained within. Our goal is to extract the instruction array of each method, which contains the opcodes, so that these arrays can be sent to the Dalvik disassembler later on.

Most file formats give away their type in the first few bytes, their “tell.” These unique byte sequences are called “magic numbers.” It is therefore simple to build a file type identifier that reads the first few bytes and matches them against known patterns. PE files have “MZ” (0x4D 0x5A) at the very beginning of the file. DEX files begin with the ASCII string “dex” followed by a newline, a three-digit version such as “035”, and a null byte, for 8 bytes in total. While not as foolproof as parsing the entire file, this is a traditional method for preliminary format identification.
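A minimal sketch of such an identifier is shown here. The class and method names (MagicSniffer, Identify) are hypothetical and not part of the parser described in this article; the ZIP check is added only because an APK is a ZIP container.

```csharp
using System;

// Hypothetical helper: labels a buffer by inspecting its leading bytes only.
public static class MagicSniffer
{
    public static string Identify(byte[] data)
    {
        if (data == null) return "unknown";

        // DEX: 'd' 'e' 'x' '\n', three ASCII version digits, then a NUL byte.
        if (data.Length >= 8 &&
            data[0] == 0x64 && data[1] == 0x65 && data[2] == 0x78 && data[3] == 0x0A &&
            IsDigit(data[4]) && IsDigit(data[5]) && IsDigit(data[6]) &&
            data[7] == 0x00)
        {
            return "dex";
        }

        // PE/DOS executables start with "MZ" (0x4D 0x5A).
        if (data.Length >= 2 && data[0] == 0x4D && data[1] == 0x5A)
            return "pe";

        // ZIP containers (an APK is one) start with "PK" 0x03 0x04.
        if (data.Length >= 4 &&
            data[0] == 0x50 && data[1] == 0x4B && data[2] == 0x03 && data[3] == 0x04)
            return "zip";

        return "unknown";
    }

    private static bool IsDigit(byte b)
    {
        return b >= (byte)'0' && b <= (byte)'9';
    }
}
```

Calling MagicSniffer.Identify on the first bytes of classes.dex should yield "dex", while the enclosing APK should yield "zip". A match on the magic alone says nothing about whether the rest of the file is well formed.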

We will read the entire file into a byte array for ease of traversal and to avoid repeated disk access. The entire file is one long series of bit patterns; grouping them into a sequence of bytes makes indexing and parsing a more linear activity.

The general approach is depicted in the diagram below:

Data Sheets

This term comes from hardware specifications, or data sheets, for chipsets and microprocessors. For file formats like PE and DEX, the documentation is the data sheet. We need the specification of each format to understand how the data is organized before we start coding the parsing class. Let’s start with the DEX file specification and see if we can make sense of it. The Android Open Source Project site gives a good listing. A few excerpts are given below.

A sample Android malware APK file has the folder structure shown below when opened in Windows:

Every component in this directory structure can be parsed: the ZIP (APK) file itself, the binary XML, AndroidManifest.xml (the APK’s configuration file), and the classes.dex file. Each of the files in the folders could be parsed recursively and checked for file extension validity and entropy values to look for anomalies. For now, however, we focus only on classes.dex. A typical APK has a single classes.dex containing all the code and classes required for its execution (multidex APKs carry additional files such as classes2.dex).

Take a good look at the list of data types supported in the dex file. Their bit sizes immediately give their memory requirements. They include signed and unsigned types that run from byte, word, dword, and qword to the signed and unsigned LEB128 types, which are variable length. The background of the LEB128 type is interesting, and the specification gives the information required to write the custom type conversion code. We will implement the same logic in our parser.

This part of the document gives us an overview of the entire file layout. Read the descriptions and make notes of what types are contained in which sections and how they are arranged and organized.

This part of the document details the header fields used to reach the different file sections. Notice the regular pattern of SIZE followed by OFFSET. All of these values are unsigned integers. The SHA-1 signature is 20 bytes long, and the magic sequence at the outset is exactly 8 bytes long.
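The SIZE/OFFSET pattern can be sketched as a handful of fixed-position reads. The offsets below follow the published DEX header layout (magic at 0x00, checksum at 0x08, signature at 0x0C, file_size at 0x20, and so on). The type and member names (DexHeaderSketch, DexHeaderReader, TryRead) are hypothetical and are not the article’s Header struct, and only a subset of the fields is read.

```csharp
using System;

// Hypothetical container for a few header fields.
public struct DexHeaderSketch
{
    public uint FileSize, HeaderSize, EndianTag;
    public uint StringIdsSize, StringIdsOff;
    public uint MethodIdsSize, MethodIdsOff;
    public uint ClassDefsSize, ClassDefsOff;
}

public static class DexHeaderReader
{
    public const int MinHeaderLength = 0x70;

    // Little-endian 32-bit read that does not depend on host byte order.
    public static uint ReadUInt32(byte[] b, int off)
    {
        return (uint)b[off]
             | ((uint)b[off + 1] << 8)
             | ((uint)b[off + 2] << 16)
             | ((uint)b[off + 3] << 24);
    }

    // Returns false when the buffer is too short or the endian tag is unexpected.
    public static bool TryRead(byte[] image, out DexHeaderSketch h)
    {
        h = new DexHeaderSketch();
        if (image == null || image.Length < MinHeaderLength) return false;

        // 0x00 magic (8 bytes), 0x08 checksum (uint), 0x0C SHA-1 signature (20 bytes)
        h.FileSize      = ReadUInt32(image, 0x20);
        h.HeaderSize    = ReadUInt32(image, 0x24);
        h.EndianTag     = ReadUInt32(image, 0x28);
        h.StringIdsSize = ReadUInt32(image, 0x38);
        h.StringIdsOff  = ReadUInt32(image, 0x3C);
        h.MethodIdsSize = ReadUInt32(image, 0x58);
        h.MethodIdsOff  = ReadUInt32(image, 0x5C);
        h.ClassDefsSize = ReadUInt32(image, 0x60);
        h.ClassDefsOff  = ReadUInt32(image, 0x64);

        // 0x12345678 is the standard little-endian marker.
        return h.EndianTag == 0x12345678u;
    }
}
```

The remaining SIZE/OFFSET pairs (type_ids, proto_ids, field_ids, data, link, map) would be read the same way. A fuller parser would also verify the checksum and signature, which this sketch skips.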

To round off, we have eight areas to parse, excluding the initial header. Strings are very important in binary executables, and we will extract them from the dex file as well. Let us focus on the string_ids segment. The size and offset are specified.

The excerpt above is for the string_ids section of the dex file. Each entry in this section holds a single uint, string_data_off, which gives the offset from the start of the file to the corresponding string data item in the data section further down the file. Note that each string data item begins with a ULEB128 length, followed by the encoded characters and a terminating null byte.

Next we try to locate where the executable opcodes are. You will find that they are in the code_item structure, which is referenced from each method entry. The insns field within the code_item structure is an array of 2-byte elements whose length is given by the preceding field, “insns_size” (the code_item itself is 4-byte aligned, so two bytes of padding may follow the array to keep later fields aligned). Each method can be either virtual or direct, as the dex file is itself converted from the .class files of a Java application. The class_data_item structure references these method lists. The class_data_item is in turn referenced from the class_def_item structure in the class_defs section.
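As a rough illustration of following these offsets, the sketch below walks the string_ids table and pulls each string out of the data section. The names (DexStrings, ReadAll) are hypothetical. It also makes a simplifying assumption: the string bytes are decoded as plain UTF-8, whereas the format uses Modified UTF-8, so embedded nulls and characters outside the Basic Multilingual Plane would need extra handling.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: extracts strings given the header's string_ids size and offset.
public static class DexStrings
{
    public static List<string> ReadAll(byte[] image, uint stringIdsSize, uint stringIdsOff)
    {
        var result = new List<string>();
        for (uint i = 0; i < stringIdsSize; i++)
        {
            // Each string_ids entry is one little-endian uint: the string data offset.
            int idPos = checked((int)(stringIdsOff + i * 4));
            int pos = checked((int)ReadUInt32(image, idPos));

            // Skip the ULEB128 length prefix: bytes with the high bit set continue it.
            while ((image[pos++] & 0x80) != 0) { }

            // The character data runs up to the terminating null byte.
            int end = Array.IndexOf(image, (byte)0, pos);
            if (end < 0) throw new FormatException("Unterminated string data");

            // Assumption: plain UTF-8 decoding is close enough for ASCII-heavy samples.
            result.Add(Encoding.UTF8.GetString(image, pos, end - pos));
        }
        return result;
    }

    private static uint ReadUInt32(byte[] b, int off)
    {
        return (uint)b[off]
             | ((uint)b[off + 1] << 8)
             | ((uint)b[off + 2] << 16)
             | ((uint)b[off + 3] << 24);
    }
}
```

The length prefix is skipped rather than used because it counts UTF-16 code units, not bytes, so the null terminator is the more direct way to find the end of the data.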

Therefore, you have to start parsing from one section and keep digging into other sub-referenced sections to finally extract the opcode array.

Now that we have read the specifications, we have a good understanding of the overall layout of the dex file, its headers, data types, and the total number of items. We have to deal with the custom LEB128 data type to extract strings from the data section. We also have to parse the headers to locate each of the referenced items to get to the instructions array.

Writing the Parser

The parser is written in C#, which gives us full access to object-oriented programming and design.

First we implement the ULEB128 data type, because important structures parsed later on, including those that lead to the code items, use it.
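A minimal ULEB128 decoder might look like the following. Each byte contributes its low seven bits, least significant group first, and a set high bit means another byte follows. The class and method names are hypothetical, and the five-byte cap reflects that DEX only uses LEB128 for 32-bit quantities.

```csharp
using System;

// Hypothetical decoder for unsigned LEB128 values of up to 32 bits.
public static class Uleb128
{
    // Reads one value starting at pos and advances pos past it.
    public static uint Read(byte[] data, ref int pos)
    {
        uint result = 0;
        int shift = 0;
        for (int i = 0; i < 5; i++)
        {
            byte b = data[pos++];
            result |= (uint)(b & 0x7F) << shift;   // low 7 bits carry payload
            if ((b & 0x80) == 0) return result;    // high bit clear: last byte
            shift += 7;
        }
        throw new FormatException("ULEB128 value longer than 5 bytes");
    }
}
```

For example, the bytes E5 8E 26 decode to 624485: 0x65 + (0x0E << 7) + (0x26 << 14).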

Thereafter we proceed with the I/O code to manage the in-memory representation of the dex binary file. To enable this, we code a filepointer persistence function that keeps track of the current byte position in the parser.
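One way to keep track of the current position is a small cursor object wrapped around the byte array. This is only a sketch of the idea: ByteCursor and its members are hypothetical names, and the article’s own byteWalker() function may be shaped quite differently.

```csharp
using System;

// Hypothetical cursor: remembers the current offset between reads.
public sealed class ByteCursor
{
    private readonly byte[] _data;

    public int Position { get; private set; }

    public ByteCursor(byte[] data)
    {
        if (data == null) throw new ArgumentNullException("data");
        _data = data;
    }

    // Jump to an absolute offset, e.g. one taken from a header field.
    public void Seek(int offset)
    {
        if (offset < 0 || offset > _data.Length)
            throw new ArgumentOutOfRangeException("offset");
        Position = offset;
    }

    // Copy count bytes from the current position and advance past them.
    public byte[] ReadBytes(int count)
    {
        if (count < 0 || count > _data.Length - Position)
            throw new InvalidOperationException("Read past end of image");
        byte[] slice = new byte[count];
        Array.Copy(_data, Position, slice, 0, count);
        Position += count;
        return slice;
    }

    // Fixed-width little-endian reads built on ReadBytes.
    public ushort ReadUInt16()
    {
        byte[] b = ReadBytes(2);
        return (ushort)(b[0] | (b[1] << 8));
    }

    public uint ReadUInt32()
    {
        byte[] b = ReadBytes(4);
        return (uint)b[0] | ((uint)b[1] << 8) | ((uint)b[2] << 16) | ((uint)b[3] << 24);
    }
}
```

Sequential header fields can then be consumed with successive ReadUInt32() calls, while Seek() handles the jumps to offsets found along the way. Copying on every read keeps the sketch short; a production parser would likely read in place.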

Finally, the actual parsing begins from the in-memory byte array and the class members for the dex file structures are filled in by traversing the byte array. Once it is filled in, we dig into the instruction array and store the opcodes for retrieval for the disassembler class. After this we are done!

The parsing function returns a bool. If parsing fails for any reason, it does not parse any further. The full catch handler is more involved: it reports which part failed and clears all the data structures. It is not shown here for brevity.

The file contents are held in a generic List. The File.ReadAllBytes(string path) method from the System.IO namespace reads the entire file into memory, and parsing then proceeds over the byte array. Android files are small for the most part, and this approach reduces the number of I/O operations because the file is read in one shot.

List<byte[]> _fileImage = new List<byte[]>();
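The read-in-one-shot step combined with a bool result could be sketched as follows. DexLoader and TryLoad are hypothetical names, the minimum-size check (0x70, the classic header length) is an assumption of this sketch, and the error handling is far simpler than the handler the article describes.

```csharp
using System;
using System.IO;

// Hypothetical loader: reads the whole file once and reports success as a bool.
public static class DexLoader
{
    public static bool TryLoad(string filePath, out byte[] image, out string error)
    {
        image = null;
        error = null;
        try
        {
            // One I/O call pulls the entire file into memory.
            byte[] bytes = File.ReadAllBytes(filePath);

            // Assumption: anything shorter than a 0x70-byte header cannot be parsed.
            if (bytes.Length < 0x70)
            {
                error = "File too small to contain a DEX header";
                return false;
            }

            image = bytes;
            return true;
        }
        catch (IOException ex)
        {
            // Covers missing files, missing directories, and sharing violations.
            error = "I/O failure: " + ex.Message;
        }
        catch (UnauthorizedAccessException ex)
        {
            error = "Access denied: " + ex.Message;
        }
        return false;
    }
}
```

Only I/O and permission failures are caught here; malformed path arguments still surface as exceptions, which seems reasonable for programmer errors.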

Header dh = new Header();

The line above instantiates the Header struct, and the parser fills out each of its fields. Notice the use of the byteWalker() function to update and persist the current offset in the byte array representation of the file.

The excerpts above give an idea of the rather quick and dirty, though structured, way of filling a memory structure. Moving further in, we finally get to the code offsets from which we need to parse to get the opcodes. If you recall the path of sub-references, this is exactly what we have in mind:

The first foreach loop set was for digging into the DIRECT functions, while the second set of foreach loops digs into VIRTUAL functions.

All code instructions are added via _codeRep.Add() to the _codeRep list, which acts as a repository for the insns[] arrays.
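The chain of references from class_defs down to the insns arrays can be sketched in one pass. The structure sizes and field positions below come from the DEX specification (a class_def_item is eight uints with class_data_off as the seventh, and in a code_item insns_size sits at byte 12 with insns starting at byte 16). The names DexCodeWalker and CollectInsns are hypothetical, and a single loop covers direct and virtual methods because both use the same encoding, whereas the article’s code uses separate loops.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical walker: class_def_item -> class_data_item -> encoded_method -> code_item.
public static class DexCodeWalker
{
    public static List<ushort[]> CollectInsns(byte[] image, uint classDefsSize, uint classDefsOff)
    {
        var codeRep = new List<ushort[]>();
        for (uint i = 0; i < classDefsSize; i++)
        {
            int def = checked((int)(classDefsOff + i * 32));
            int pos = checked((int)U32(image, def + 24));   // class_data_off
            if (pos == 0) continue;                         // class has no class data

            uint staticFields   = Uleb(image, ref pos);
            uint instanceFields = Uleb(image, ref pos);
            uint directMethods  = Uleb(image, ref pos);
            uint virtualMethods = Uleb(image, ref pos);

            // Each encoded_field is two ULEB128 values; skip them all.
            for (uint f = 0; f < staticFields + instanceFields; f++)
            {
                Uleb(image, ref pos);   // field_idx_diff
                Uleb(image, ref pos);   // access_flags
            }

            // Direct methods come first, then virtual methods.
            for (uint m = 0; m < directMethods + virtualMethods; m++)
            {
                Uleb(image, ref pos);                       // method_idx_diff
                Uleb(image, ref pos);                       // access_flags
                int codeOff = checked((int)Uleb(image, ref pos));
                if (codeOff == 0) continue;                 // abstract or native method
                codeRep.Add(ReadInsns(image, codeOff));
            }
        }
        return codeRep;
    }

    private static ushort[] ReadInsns(byte[] image, int codeOff)
    {
        uint count = U32(image, codeOff + 12);              // insns_size, in 16-bit units
        var insns = new ushort[count];
        for (int k = 0; k < insns.Length; k++)
        {
            int p = codeOff + 16 + k * 2;
            insns[k] = (ushort)(image[p] | (image[p + 1] << 8));
        }
        return insns;
    }

    private static uint U32(byte[] b, int off)
    {
        return (uint)b[off] | ((uint)b[off + 1] << 8) | ((uint)b[off + 2] << 16) | ((uint)b[off + 3] << 24);
    }

    private static uint Uleb(byte[] data, ref int pos)
    {
        uint result = 0;
        for (int shift = 0; shift < 35; shift += 7)
        {
            byte b = data[pos++];
            result |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
        }
        throw new FormatException("ULEB128 value longer than 5 bytes");
    }
}
```

The sketch discards method indices and access flags, so the returned arrays cannot be tied back to method names. A parser feeding a disassembler would keep that information alongside each array.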

At this point the dex file parsing is complete.

Each of the struct instances is now exposed through properties that return the instances to another class. These instances can be queried further using the dot operator (‘.’) to get the individual types for further processing. The disassembler code in another class can then be fed the opcode lists directly. In keeping with the dex format design, the disassembler will also have to be fed data from other types and segments, so that code cross-referencing can be done precisely. That can be taken care of in the disassembler class.

Conclusion

We have taken a look at the process of coding a parser for a binary format, DEX, in C# using an object-oriented approach. If you want to parse another binary format, such as PE, you now know how you might approach it. For those looking for a bit of a challenge, try doing the same in a functional programming language like Haskell. The process can be interesting, and the approach is quite different from OOP.

Victor SR Marak is a security researcher and electronic musician. He likes to make and break software. He was into music production prior to joining the AV industry recently. He plays the piano and loves composing music of all genres. His debut EP 'My Never Ending Wish' under the moniker 'Emotion Coder' is out soon on Beatport Exclusive Releases on 30th October 2012. He can be reached at victormarak.wordpress.com

I appreciate the interest. The rest of the code is part of another project, but the excerpts shown here may be used without a license. If you require additional resources, you are welcome to get in touch with me (the author).
