
Sunday, January 2, 2011

Java Bytecode Fundamentals

Java bytecode is the bread and butter of JRebel, the productivity tool for Java developers, which is why I decided to write a blog post on the subject. This post is a summary of various sources that I found while googling. Hopefully someone will find it relevant and useful. What is a little weird is that there's not much information on the subject, in either books or articles. BTW, if you have anything to add or comment on, don't hesitate to post in the comments :)

Developers who use Java for application development usually do not need to be aware of the bytecode that is executed in the VM. However, developers who implement state-of-the-art frameworks, compilers, or Java tooling may need to understand bytecode and may even work with it directly. While special libraries (like ASM, cglib, Javassist) do help with bytecode manipulation, it is still important to understand the fundamentals in order to make effective use of those tools.

Let's start off with a simple example - a POJO, with one field, a getter and a setter.
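A minimal sketch of such a class could look like this. The names Foo and getBar appear in this post; the String type of the field and the setBar name are assumptions made for illustration.

```java
// A plain old Java object: one field, a getter and a setter.
// The String field type and the setter name are assumptions.
class Foo {
    private String bar;

    public String getBar() {
        return bar;
    }

    public void setBar(String bar) {
        this.bar = bar;
    }
}
```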

First of all, once you compile the class using javac Foo.java, you'll have Foo.class, which contains the bytecode. Here's how it looks in a hex editor:

Within a method's code, each byte (a pair of hex digits) is either an opcode, which has a human-readable mnemonic, or part of an operand, but obviously it would be too brutal to start reading the class in binary form. Let's proceed to the mnemonic representation.

The class is very simple and it is easy to see the relation between the source code and the generated bytecode. First of all, we notice that in the bytecode version of the class the compiler has generated the default constructor (as promised by the Java Language Specification).

Secondly, if we study the bytecode instructions (in our example aload_0 and aload_1), we can see that some mnemonics carry a prefix, as in aload_0 or istore_2. The prefix indicates the type of data the instruction operates on: 'a' means the opcode manipulates an object reference, 'i' means it manipulates an int.

One interesting thing to spot here is that some instructions take an odd-looking operand like #1 or #2, which refers to an entry in the constant pool of the class. This is a good point to get more information from the class file. Execute the following command: javap -c -s -verbose Foo (-s prints the signatures, -verbose prints all the details, including the constant pool)

One more thing to notice is that every opcode is marked with a number (0: aload_0). This is the position (byte offset) of the instruction within the method's bytecode array - explained later.

To understand how the bytecode works it is worth having a look at the execution model. The JVM uses a stack-based model of computation. Each thread has a JVM stack which stores frames. For instance, when running the application in a debugger, you will see those frames:

IntelliJ IDEA debugging session

Each time a method is invoked a new stack frame is created. The frame consists of an operand stack, an array of local variables, and a reference to the runtime constant pool of the class of the current method.

The size of the array of local variables is determined at compile time and is dependent on the number and size of local variables and formal method parameters. The operand stack is a LIFO stack used to push and pop values. Its size is also determined at compile time. Certain opcode instructions push values onto the operand stack; others take operands from the stack, manipulate them, and push the result. The operand stack is also used to receive return values from methods.
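To make the frame model a bit more concrete, here is a toy sketch in Java that mimics a single frame with an int array for the local variables and a deque for the operand stack. It is an illustration only, not how a real JVM is implemented, and ToyFrame and its method names are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of one JVM frame. ToyFrame and its methods are hypothetical names.
class ToyFrame {
    private final int[] locals;                                      // local variable array, size fixed up front
    private final Deque<Integer> operandStack = new ArrayDeque<>();  // LIFO operand stack

    ToyFrame(int maxLocals) {
        this.locals = new int[maxLocals];
    }

    void iconst(int value) { operandStack.push(value); }             // push a constant
    void istore(int index) { locals[index] = operandStack.pop(); }   // pop into a local variable
    void iload(int index)  { operandStack.push(locals[index]); }     // push a local variable

    void iadd() {                                                    // pop two values, push their sum
        int right = operandStack.pop();
        int left = operandStack.pop();
        operandStack.push(left + right);
    }

    int ireturn() { return operandStack.pop(); }                     // hand the top value to the caller

    // Evaluates 2 + 3 the way a stack machine would.
    static int demo() {
        ToyFrame frame = new ToyFrame(2);
        frame.iconst(2);
        frame.istore(0);
        frame.iconst(3);
        frame.istore(1);
        frame.iload(0);
        frame.iload(1);
        frame.iadd();
        return frame.ireturn();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 5
    }
}
```

Note that no instruction names its operands directly: values always travel through the operand stack, which is what "stack-based" means here.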

The bytecode for the getBar method consists of three instructions. The first, aload_0, pushes the value at index 0 of the local variable array onto the operand stack. For constructors and instance methods, the this reference is always stored at index 0 of the local variable array. The next instruction, getfield, pops that object reference and pushes the value of one of the object's fields. The last instruction, areturn, returns a reference from the method.

Each method has a corresponding bytecode array. Looking at a .class file with a hex editor, you would see the following values in the bytecode array:

That said, the bytecode for the getBar method is 2A B4 00 02 B0. The code 2A corresponds to the aload_0 instruction, and B0 corresponds to areturn. It might seem strange that the method has 3 instructions but the byte array holds 5 elements. This is because getfield (B4) takes a two-byte operand (00 02), the index of the field reference in the constant pool. Those two bytes occupy positions 2 and 3 in the array, hence the array size is 5 and the areturn instruction is shifted to position 4.
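The byte layout described above can be replayed with a few lines of Java. This is only a sketch: TinyDisassembler is a made-up helper that knows just the three opcodes used by getBar, not a general-purpose tool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that decodes only aload_0, getfield and areturn.
class TinyDisassembler {
    static List<String> decode(int[] code) {
        List<String> out = new ArrayList<>();
        int pc = 0; // position of the current instruction in the bytecode array
        while (pc < code.length) {
            int opcode = code[pc];
            switch (opcode) {
                case 0x2A: // aload_0, no operand bytes
                    out.add(pc + ": aload_0");
                    pc += 1;
                    break;
                case 0xB4: { // getfield, followed by a 2-byte constant pool index
                    int index = (code[pc + 1] << 8) | code[pc + 2];
                    out.add(pc + ": getfield #" + index);
                    pc += 3;
                    break;
                }
                case 0xB0: // areturn, no operand bytes
                    out.add(pc + ": areturn");
                    pc += 1;
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported opcode: " + opcode);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] getBar = {0x2A, 0xB4, 0x00, 0x02, 0xB0};
        // Prints "0: aload_0", "1: getfield #2" and "4: areturn", one per line.
        decode(getBar).forEach(System.out::println);
    }
}
```

The positions 0, 1 and 4 are the same numbers that javap prints in front of each instruction.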

The Local Variables Table

To demonstrate how the local variables are handled let's have a look at another example:

First, the method loads the constant 1 with iconst_1 and stores it in local variable number 2 with istore_2. We can see in the local variables table that slot number 2 is occupied by the variable named b, as expected. Next, iload_1 pushes the value of a onto the stack and iload_2 pushes the value of b. iadd pops the 2 operands from the stack, adds them, and pushes the result back onto the stack, from where ireturn returns it from the method.
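For reference, a method consistent with the instructions described above could look like the following sketch. The class name Example is taken from a reader comment below; the method name foo is an assumption.

```java
// Sketch of a method matching the bytecode described above.
// The method name "foo" is an assumption. Slots: 0 = this, 1 = a, 2 = b.
class Example {
    public int foo(int a) {
        int b = 1;      // iconst_1, istore_2
        return a + b;   // iload_1, iload_2, iadd, ireturn
    }
}
```

Slot 0 is taken by this because foo is an instance method, which is why the parameter a ends up in slot 1 and the local b in slot 2.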

Exception Handling

Another interesting case is the code generated for exception handling, i.e. for try-catch-finally constructs.

So in fact, the compiler generated code for all the scenarios possible within the try-catch-finally execution: the call to finallyMethod() was emitted 3 times (!). The try block is compiled just as it would be if the try were not present, with the finally code appended to it:

If the block executes successfully, the goto instruction will lead the execution to position 30, which is the return opcode.

If tryMethod throws an instance of Exception, the first (innermost) applicable exception handler in the exception table is chosen to handle the exception. From the exception table we can see that the position to proceed with the exception handling is 11:

So finallyMethod() is executed in any case, with aload_2 and athrow to rethrow the unhandled exception.

Bottom line
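The three scenarios can also be observed at the source level. The sketch below is a hypothetical demo: tryMethod and finallyMethod are the names used in this post, while catchMethod, the mode switch and the class name are assumptions added to trigger each path.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of the three paths through try-catch-finally.
class TryCatchFinallyDemo {
    final List<String> calls = new ArrayList<>();
    private final int mode; // 0 = normal, 1 = throw an Exception subclass, 2 = throw an Error

    TryCatchFinallyDemo(int mode) {
        this.mode = mode;
    }

    void tryMethod() {
        calls.add("try");
        if (mode == 1) throw new IllegalStateException("handled by catch (Exception)");
        if (mode == 2) throw new AssertionError("not handled by catch (Exception)");
    }

    void catchMethod() { calls.add("catch"); }

    void finallyMethod() { calls.add("finally"); }

    void run() {
        try {
            tryMethod();
        } catch (Exception e) {
            catchMethod();
        } finally {
            finallyMethod(); // the compiler emits this call once per exit path
        }
    }

    public static void main(String[] args) {
        for (int mode = 0; mode <= 2; mode++) {
            TryCatchFinallyDemo demo = new TryCatchFinallyDemo(mode);
            try {
                demo.run();
            } catch (Error e) {
                demo.calls.add("rethrown"); // the Error escaped run() after finally ran
            }
            System.out.println(demo.calls);
        }
    }
}
```

In every mode the list contains "finally", which matches the three copies of the finallyMethod() call in the generated bytecode.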

I have covered just a few aspects of JVM bytecode. Most of the inspiration came from the developerWorks article by Peter Haggar - Java bytecode: Understanding bytecode makes you a better programmer. It must be the best article on the subject that I've managed to find. The article looks a bit outdated already, though still relevant. Surprisingly, the BCEL user's manual page has a decent description of bytecode fundamentals, so I'd suggest taking a look if you're interested. The Java Virtual Machine Specification is also a useful source of information, but it is a little hard to read and it doesn't visualize the things it describes, which would often be useful.
Overall, I think that understanding how bytecode works is essential for developers who are looking to deepen their proficiency in Java programming, especially if one looks into framework internals, tooling or compilers for JVM languages.

@Rémi, yeah! The manuals for the bytecode manipulation libraries are almost the only comprehensive source on this subject. Other than that, not much information can be found. Probably some scientific papers in ACM or IEEE?

@samaaron again, this is what happens when you try to post something late at night and experiment too much :) Thx for re-checking this - I had indeed copy-pasted another example (with a different constructor). Corrected it now.

Yes, and people keep on asking where the CAFEBABE is. For those interested, please see the hex code in the post above. The first 4 bytes are reserved for this magic number. It helps the JVM to distinguish class files from other binary files.

It's really a nice article. But initially, I wondered how you got the local variable table for the Example class displayed in the bytecode. Later I found out that we should include the -g option while compiling the program, otherwise this information will not be included in the class file.