Wednesday, February 11, 2009

Long ago I sat a set of exams designed to test the breadth of our computer science knowledge. I failed Compilers. Luckily, they had just changed the rules, and it was acceptable to fail Compilers provided you passed the Databases exam. Fortune smiled upon me, and I passed the Databases exam despite possessing even less knowledge in that field. Or perhaps fortune was actually laughing at me: I stayed away from compiler theory for the rest of my degree to my detriment. It's scandalous I still know so little.

I was especially dispirited by my failure since I had thought I knew compilers well. When I was a child, I felt moved after learning about recursive descent parsers. I had been mystified: how could a program possibly read an expression like (1+2)*((3+4)/5)? All I could think of was to use a simple loop to iterate over the tokens in the expression and assign them to local variables before processing them. This led nowhere, because the nesting of parentheses can be arbitrarily deep. Thus I was awestruck by the elegance of recursive descent; after the first reading the technique was indelibly etched in my mind.

Much later, in university, basic compiler theory (grammar, languages, automata, and so on) came easily to me. I implemented a compiler for a toy language which produced the smallest and fastest code in my class. I'd also written the odd recursive descent parser for pet projects.

So I thought I was a compiler expert. But I should have paid attention to my weaknesses. I found LR(1) parsers unintuitive. I leafed through their proofs of correctness and promptly forgot their details, especially since the textbook stated these parsers are rarely coded by hand and instead generated by programs such as yacc or bison. Instead of embracing these powerful tools, I stuck with trusty hand-coded parsers because I understood them completely.

Ultimately this caused me to fail my Compilers exam, which was peppered with yacc questions, and, more seriously, to waste copious amounts of time coding and debugging parsers and lexical analyzers. I have at last invested energy in seriously learning flex and bison, and I urge anyone in a similar situation to do the same. All these years, my default attitude should have been to use flex and bison unless there was a compelling reason not to, rather than to write recursive descent parsers unless there was a compelling reason not to.

I wrote a compiler that converts a toy language (essentially a tiny subset of Limbo) to Dalvik bytecode. I also wrote a Dalvik assembler along the way. In both cases, flex and bison were invaluable: gone are the finicky routines I've written and rewritten a million times, such as handling an unexpected end of input, or fine-tuning mutually recursive functions to get the desired operator precedence and associativity.
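For comparison, this is roughly how bison handles the precedence problem: associativity and binding strength become one-line declarations rather than carefully arranged mutually recursive functions. The grammar fragment below is illustrative, not taken from my compiler; NUMBER is a token assumed to be supplied by the lexer:

```
/* Sketch of a bison grammar fragment. Precedence is declared once:
   tokens declared later bind tighter, and %left makes them
   left-associative, so * and / bind tighter than + and -. */
%left '+' '-'
%left '*' '/'

%%
expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | '(' expr ')'
     | NUMBER
     ;
```

The deliberately ambiguous rules are resolved by the precedence declarations, which is the standard bison idiom for expression grammars.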

I've put my efforts on hold. I was hoping for a relatively functional and less object-oriented input language, but it seems such a language would be tricky to shoehorn into Dalvik bytecode: function calls are abstracted, and with no control over stack frames I doubt I can implement trampolining.

Also, I greatly underestimated my ignorance of compiler theory. I know nothing about the static single assignment (SSA) form, which appears too new for the textbooks I had. Moreover, I need to refresh my memory on older optimization techniques.

On the other hand, I don't want to discard my code, particularly as it will forever remind me to consider flex/bison before coding yet another parser by hand. Below is the git repository containing my primitive compiler and assembler that turns this:

Friday, February 6, 2009

When I was a kid, the Windows era had yet to arrive. Floor-model PCs in shops ran MS-DOS. The shops would often take steps to stop little imps like me from messing around with their systems. I recall one where they had even deleted the standard DOS utilities. But they had left COMMAND.COM unmodified, which is enough for plenty of mischief. One can create executables via:

$ copy con: hello.com

and then enter machine code by holding Alt while typing a number on the keypad. (It doesn't work in DOSBox, by the way.) Some numbers, such as 0, must be avoided, but there are workarounds; the book from which I learned the trick, Ralf Burger's "Computer Viruses: A High-Tech Disease", recommended first creating a program that read numbers in ASCII and output the corresponding binary, a sort of proto-assembler, which in turn could be used to create more complex executables.

I created a HELLO.COM program by typing the following, where the numbers are typed on the number pad with the Alt key held:

180 9 186 9 1 205 33 205 32 hello$

which is perhaps more recognizable to DOS programmers when converted to hex:

B409 BA0901 CD21 CD20 hello$

It is the machine code for:

MOV AH, 09h
MOV DX, 0109h
INT 21h
INT 20h

This instructs DOS to print the '$'-terminated string at address 0x109 and then exit. When .COM files are loaded, DOS places their contents at address 0x100 (locations 0 to 0xFF hold the Program Segment Prefix, miscellaneous information such as the command line). The 9 bytes of machine code occupy 0x100 to 0x108, so the string begins at 0x109.

Thus with 9 bytes plus the size of the message, plus a byte for the terminator, we get a valid DOS program. We can type raw machine code from the keyboard to produce a full program.

Linux on x86 systems is less straightforward. The machine code is similar, but there must also be an ELF header (not strictly true: in some circumstances, the entire machine code can be placed within a truncated ELF header!), making binaries more tedious to construct by hand. However, certain kernel configurations allow different header types, some of them brief enough to be easily memorized and hand-rolled.

Minimalist Android

How about Android? An APK file is a ZIP file containing, among other files, the Dalvik executable in a DEX file. We're interested in minimizing this DEX file, whose full specification is well-documented.

All Android code must lie in some class, and Android applications typically inherit from a library class such as Activity, so we must also record information such as the class name, access rights, and inheritance. This, in turn, requires a strings section so we can store the class names, as well as invoked library functions and function signatures. Other overheads include the header and the map section. We can cheat a little by defining the UI with XML, so it won't contribute to the size of the DEX file.

After compiling to Java class files, most or all of the R*.class files can be removed before making a DEX file, since the constants they contain are hardcoded into other class files by javac. For some projects, R$styleable must be retained, though this class appears to have since been removed from the public API.

I chose to write a program that parsed the DEX file from the HelloAndroid example and output only the necessary sections rather than attempt to create a DEX file from scratch. I found I could safely elide the source filenames, debug information and annotations. The result consists of 632 bytes, though we could trivially shrink this further by using a shorter name for the class.