C Compiler Targeting the Java Virtual Machine


Jack Pien
Senior Honors Thesis (Advisor: Javed A. Aslam)
Dartmouth College Computer Science Technical Report PCS-TR
May 30, 1998

Abstract

One of the major drawbacks in the field of computer software development has been the inability of applications to be compiled once and executed across many different platforms. With the emergence of the Internet and the networking of many different platforms, Sun Microsystems created the Java programming language and the Java Platform to address this "Write Once, Run Anywhere" problem. What sets a compiled Java program apart from programs compiled from other high-level languages is that a Java Virtual Machine can execute the compiled program on any platform, as long as a Java Virtual Machine is running on top of that platform. Java's cross-platform capabilities can be extended to other high-level languages such as C. The main objective of our project is to implement a compiler targeting the Java Platform for a subset of the C language. This allows code written in that subset of C to be compiled into Java Virtual Machine instructions, also known as JVM bytecode, which can then be executed on a Java Virtual Machine running on any platform. The reader is assumed to be intimately familiar with compiler construction, the use of the flex scanner generator, the use of the GNU bison parser generator, and the structure and implementation of the Java Virtual Machine.

1 Introduction

The main focus of our project is to implement a compiler for a particular subset of the C programming language which targets the Java Virtual Machine. The compiler reads in a sourcefile, written in the implemented subset of C, and compiles that code to a JVM .class file called targetfile.class. The targetfile.class can then be executed on a Java Virtual Machine running on any platform. We feel that the chosen subset of the C programming language is sufficient to provide the essential backbone for adding further parts of the C grammar in the future.

Many different tools were used to implement the different segments of the compiler. The scanner was created using flex and the parser was created using bison (footnote 1). The code generator was written in C++. It uses a symbol table to look up the functions and variables that need to be accessed. Alongside the symbol table is a data structure representing the JVM Constant Pool table, which holds all the Constant Pool entries that the Java Virtual Machine needs in order to execute the compiled C code.

Footnote 1: For more information on compiler construction and the use of flex and bison, refer to [1].

Given a sourcefile to compile, the compiler first scans through the file, creating a list of tokens. From the token list, the parser creates a syntactic parse tree. The parse tree is then handed to the code generator, which generates JVM bytecode from it.

Many major issues concerning the differences between the JVM and other stack-based machine platforms came up while implementing the compiler. The main problem was finding a way to compile a non-object-oriented language such as C into a target machine language that was designed to support only the fully object-oriented nature of Java. Our compiler design also had to deal with the autonomous memory management of the JVM and the unique way the JVM accesses variable and function declarations.

2 Technical Description

2.1 C Programming Language Subset that was Implemented

The subset of C implemented in our compiler includes global variable declarations, local variable declarations, function declarations, if-else and if statements, while loops, and return statements. Variables can be declared as a single entity or as an array whose size is given by a numeric value. The compiler only handles variable declarations of type int. Functions can return either nothing (i.e. void) or an int. The only arguments that can be passed into a function call are ints or a reference to an int array. The exact grammar that was implemented is indicated as follows:

2.2 Structure of the Compiler

Token          String Literal
ID             <identifier name>
NUM            <number>
INT            int
VOID           void
IF             if
ELSE           else
WHILE          while
RETURN         return
LESSEQUAL      <=
LESSTHAN       <
GREATERTHAN    >
GREATEREQUAL   >=
EQUAL          ==
NOTEQUAL       !=

The compiler has three main parts: a scanner (created with flex), a parser (created with bison), and the code generator. First, the scanner scans the C code and creates a list of tokens. In addition to creating tokens, it stores the numeric value associated with each NUM token and the identifier name of each ID token. The parser then parses the token list following the rules of the C language grammar being implemented. During the parse, the parser verifies the syntactic structure of the token list against the grammar and produces a tree data structure representing the syntax of what was parsed. The root of the tree is the start symbol of the grammar (Program) and the leaves represent the tokens, or terminals, of the grammar. Finally, the code generator walks this tree and, for each section of the tree, generates the Java Virtual Machine bytecode needed to execute the C code that section represents. Semantic verification is performed during code generation. The code generator also has to recognize calls to functions from an external C library (e.g. stdio.h) and generate the JVM bytecode that handles linking to the other JVM .class files needed to execute those functions.

During code generation, the code generator writes bytecode to four temporary files, which contain the different segments of the complete JVM .class file. When code generation is completed, the four temporary files are merged into the final targetfile.class file.

2.3 Code Generator Implementation

While implementing the code generator, it was necessary to understand the Java Virtual Machine's similarities to and differences from other stack-based machine platforms (e.g. PowerPC, x86, MIPS). One of the main differences is the JVM's use of a data structure called the Constant Pool table. The JVM uses the Constant Pool table to represent the various classes, functions, and variables in the .class file. Since the entries of the Constant Pool are part of the final JVM .class file, the compiler must be able to generate the Constant Pool table and keep track of all its entries throughout code generation.

Another difference is the JVM's method of referencing variables and functions. By design, the JVM hides memory references and prohibits their manipulation. Because the compiler can neither know nor access the memory locations of declared variables and functions, it has to use the mechanisms the JVM provides to reference them. Also, referencing a global variable is very different from referencing a local variable. Since the JVM is designed mainly to support the fully object-oriented nature of Java, global variable declarations are not allowed, so the compiler has to find a way to represent global variables.

Given a way to represent the JVM Constant Pool and to reference variables and functions, a number of other issues must be considered during code generation. Because the JVM is a stack-based machine, generating the bytecode that directly affects the frame (or Java Operand Stack) is similar to generating code for other stack-based machines. However, the administrative code related to memory management, memory allocation, and program counter changes is extremely different from the administrative assembly code of other stack-based machines. These were the major issues that needed to be dealt with in order to successfully generate JVM bytecode from a given C file.
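To illustrate the stack-oriented side of code generation mentioned above, the following minimal sketch walks a small expression tree and emits JVM instructions in postfix order. It is written in Python for brevity and is not the thesis's C++ code generator. The tuple-based tree shape, the `slots` mapping, and the name `gen_expr` are assumptions made for illustration; the mnemonics (`bipush`, `iload`, `iadd`, `isub`, `imul`, `idiv`) are standard JVM instructions.

```python
def gen_expr(node, slots):
    """Return JVM mnemonics that leave the int value of `node` on the operand stack.

    Hypothetical tree shapes (not the thesis's parse tree):
      ("num", 7)          integer literal small enough for bipush
      ("var", "x")        local int variable; slots maps name -> slot index
      ("+", left, right)  binary operator, also "-", "*", "/"
    """
    binary_ops = {"+": "iadd", "-": "isub", "*": "imul", "/": "idiv"}
    kind = node[0]
    if kind == "num":
        if not -128 <= node[1] <= 127:
            raise ValueError("this sketch only handles literals that fit bipush")
        return [f"bipush {node[1]}"]
    if kind == "var":
        return [f"iload {slots[node[1]]}"]
    if kind in binary_ops:
        # Operands are pushed first; the operator then pops both and pushes the result.
        return gen_expr(node[1], slots) + gen_expr(node[2], slots) + [binary_ops[kind]]
    raise ValueError(f"unknown node kind: {kind!r}")
```

For the expression a + b * 2, with a in slot 0 and b in slot 1, the sketch yields iload 0, iload 1, bipush 2, imul, iadd: each operator appears only after the code that pushes its operands.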

2.3.1 Implementing the Constant Pool Table

During the execution of a .class file, the Java Virtual Machine relies heavily upon that .class file's Constant Pool to represent the various string constants, class names, field (variable) names, method (function) names, and other references in the file. To reference a class, variable, or function, the JVM looks up the corresponding entries in the Constant Pool table to obtain information about it. The compiler therefore needs to create and organize the Constant Pool table during code generation and, once code generation is complete, generate the JVM bytecode representing it.

Since entries in the Constant Pool must be accessed quickly and new entries must be inserted quickly, a dynamically growing array is the ideal data structure for implementing the Constant Pool table within the code generator. Each element in the array represents a Constant Pool entry and all the information associated with that entry. Constant Pool entries are added as the code generator walks the syntactic parse tree given by the parser. Each entry is added to the end of the Constant Pool table. The format of each entry depends on the type of the entry [3].

The Constant Pool data structure is accessed whenever the code generator comes across a function call or a global variable access. The code generator looks into the Constant Pool and generates the JVM bytecode needed to access the Constant Pool entry associated with that function or variable.
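A growing-array Constant Pool of the kind described above might look like the following sketch. It is in Python rather than the thesis's C++, and the class and method names (`ConstantPool`, `utf8`, `class_`, `name_and_type`, `fieldref`, `serialize`) are hypothetical. The tag values and entry layouts follow the JVM class file format. The sketch covers only the entry kinds needed to name a class and an int field, and it does not merge duplicate entries.

```python
import struct

# Constant Pool tags defined by the JVM class file format.
TAG_UTF8, TAG_CLASS, TAG_FIELDREF, TAG_NAME_AND_TYPE = 1, 7, 9, 12


class ConstantPool:
    """Append-only list of encoded entries; Constant Pool indices start at 1."""

    def __init__(self):
        self.entries = []

    def _add(self, encoded):
        self.entries.append(encoded)
        return len(self.entries)  # index of the entry just added

    def utf8(self, text):
        # For plain ASCII names, UTF-8 matches the JVM's modified UTF-8 encoding.
        data = text.encode("utf-8")
        return self._add(struct.pack(">BH", TAG_UTF8, len(data)) + data)

    def class_(self, name):
        name_index = self.utf8(name)
        return self._add(struct.pack(">BH", TAG_CLASS, name_index))

    def name_and_type(self, name, descriptor):
        name_index = self.utf8(name)
        descriptor_index = self.utf8(descriptor)
        return self._add(struct.pack(">BHH", TAG_NAME_AND_TYPE, name_index, descriptor_index))

    def fieldref(self, class_index, name, descriptor):
        nat_index = self.name_and_type(name, descriptor)
        return self._add(struct.pack(">BHH", TAG_FIELDREF, class_index, nat_index))

    def serialize(self):
        # constant_pool_count is defined as the number of entries plus one.
        return struct.pack(">H", len(self.entries) + 1) + b"".join(self.entries)
```

Registering a class named targetfile and then a static int field named counter (descriptor "I") in an empty pool yields index 2 for the class and index 6 for the field reference. The code generator would store such an index in the symbol table for later use.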
Since the JVM .class file format requires the Constant Pool to precede the executable bytecode, the JVM bytecode representing each Constant Pool entry is written to a temporary file only after code generation is complete, when all the necessary Constant Pool entries are known.

2.3.2 Referencing Variables and Functions

Besides the inability to access memory locations, referencing variables and functions differs from other stack-based machines because of the object-oriented nature of the Java Virtual Machine. Being fully object-oriented means that, at the global scope, there are only objects and classes. Since function and variable declarations must be made within those objects and classes, globally declared functions and variables cannot be represented in the JVM.

To compile the globally declared functions and variables of a non-object-oriented language such as C into the Java Virtual Machine, we chose to have the compiler wrap the compiled code within a fabricated class. The details of creating this class are discussed in 2.3.3, but the solution essentially turns all global variable declarations and function declarations into field and method declarations (respectively) within that class. With a class enclosure around the compiled code, referencing a global variable or a function becomes a matter of referencing the Constant Pool entry for the field or method of the enclosing class that represents it. Since the JVM allows fields and methods within a class to be declared static, no instance of the class is needed to access them. Thus, when the code generator needs to generate bytecode that references a global variable or function, it simply generates the bytecode that accesses the index of the Constant Pool entry representing that global variable or function.

On the other hand, local variables are not referenced through the Constant Pool. The JVM's definition of local variables includes the variables declared within the scope of a function as well as the arguments of that function. For each frame (or Java Operand Stack) allocated for a function call, four bytes, or one word, is allocated for each function argument and then for each local variable declaration in that function (two words for double and long variables).
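The two referencing schemes described above can be contrasted in a short sketch: a global is reached through a Constant Pool index, a local through a slot number in the frame. The sketch is in Python, not the thesis's C++. The function names and the `("global", index)` / `("local", index)` symbol encoding are assumptions made for illustration. The opcodes (`getstatic`, `putstatic`, `iload`, `istore`) and their operand widths are standard JVM instructions. Slots are assumed to be numbered from zero, arguments first and then locals in declaration order, as in a static method with only int variables.

```python
import struct

GETSTATIC, PUTSTATIC = 0xB2, 0xB3  # operand: 2-byte Constant Pool index
ILOAD, ISTORE = 0x15, 0x36         # operand: 1-byte local variable slot


def assign_slots(arg_names, local_names):
    """Give each int argument, then each int local, the next one-word slot."""
    return {name: slot for slot, name in enumerate(list(arg_names) + list(local_names))}


def emit_load(symbol):
    """Encode the instruction that pushes the variable's value on the operand stack."""
    kind, index = symbol
    if kind == "global":
        return struct.pack(">BH", GETSTATIC, index)
    if kind == "local":
        return struct.pack(">BB", ILOAD, index)
    raise ValueError(f"unknown symbol kind: {kind!r}")


def emit_store(symbol):
    """Encode the instruction that pops the operand stack into the variable."""
    kind, index = symbol
    if kind == "global":
        return struct.pack(">BH", PUTSTATIC, index)
    if kind == "local":
        return struct.pack(">BB", ISTORE, index)
    raise ValueError(f"unknown symbol kind: {kind!r}")
```

The one-byte slot operand limits this sketch to slots 0 through 255; `struct.pack` raises an error beyond that range. Covering larger frames would require the JVM's `wide` prefix, which is omitted here.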
Index i - 1, for every i from 1 up to the total number of arguments and local variables declared, refers to the i-th word from the top of the frame; the top of the frame holds the first word, representing the first argument, and the bottom of the frame holds the last word, representing the last local variable declared (footnote 2). The code generator uses that same index number to generate the bytecode that accesses the location or value of a variable within that function call.

Footnote 2: This is assuming that the JVM method is declared static. For more information on local variables, refer to [3, pages 66].

2.3.3 Generating Bytecode

As briefly explained in 2.3.2, the C code in the sourcefile has to be enclosed within a class, since the Java Virtual Machine was designed for the object-oriented language Java. Therefore, unlike code generators for other stack-based machine platforms, the code generator first virtually encloses the sourcefile code within a class, which it names the targetfile class. All functions and global variables declared within the sourcefile thus become publicly and statically declared methods and fields of the class targetfile. The code generator then adds to the Constant Pool structure all the Constant Pool entries necessary for the JVM to acknowledge the existence of the targetfile class. The bytecode that initializes the targetfile class is written to a temporary file dedicated to the targetfile class constructor. Once the administrative task of creating the targetfile class is completed, the code generator can begin to parse the global variable and function declarations, which are converted to member variable and function declarations of that class.

The code generator stores all variable and function identifiers in a symbol table that is implemented as a stack of hash tables. Each level of the stack represents a different scope level: when a scope begins, a new hash table is pushed onto the symbol table stack, and when a scope ends, the top hash table is popped off. An identifier name is added to the current scope of the symbol table, along with all the related function or variable information. When an identifier is looked up in the symbol table, the information that was stored with the identifier name is used to access the corresponding function or variable.

Currently, three types of identifiers are implemented in the compiler: a global variable, a local variable, and a function. When the code generator comes across a global variable declaration, the Constant Pool entries needed to access the variable are created and added to the compiler's Constant Pool data structure. All the JVM bytecode needed to make the JVM aware of this global variable's existence is then written to a temporary file dedicated to global variables.
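The stack-of-hash-tables symbol table described above can be sketched as follows. This is a minimal Python illustration, not the thesis's C++ class: the method names and the shape of the stored information (here a small tuple) are assumptions. Lookup is assumed to search from the innermost scope outward, so that a local declaration shadows a global one of the same name.

```python
class SymbolTable:
    """Stack of dictionaries; the last element is the innermost (current) scope."""

    def __init__(self):
        self.scopes = [{}]  # the global scope always sits at the bottom

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        if len(self.scopes) == 1:
            raise RuntimeError("cannot pop the global scope")
        self.scopes.pop()

    def declare(self, name, info):
        current = self.scopes[-1]
        if name in current:
            raise KeyError(f"redeclaration of {name!r} in the same scope")
        current[name] = info

    def lookup(self, name):
        # Search innermost scope first so inner declarations shadow outer ones.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise KeyError(f"undeclared identifier {name!r}")
```

In use, a global x might be stored as ("global", 6), where 6 stands for a Constant Pool index, and a local x inside a function as ("local", 0), where 0 is its slot. Looking up x inside the function returns the local entry; after `exit_scope()` the same lookup returns the global one.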
Essentially, the bytecode tells the JVM that this global variable is a field of a particular type within the class targetfile. If the global variable is an array pointer, the bytecode needed to initialize the array elements is added to the targetfile constructor (that is, to the temporary file dedicated to the constructor of the targetfile class), so that the array elements are initialized when the targetfile class is initialized. Finally, the code generator adds the identifier name of the variable to the symbol table, together with the reference index of the Constant Pool entry that refers to the variable.

The code generator treats a function declaration similarly. Constant Pool entries identifying the function (e.g. argument types and return type) need to be created, as well as the Constant Pool entries needed to call the function in the future. After these entries are created and stored in the compiler's Constant Pool data structure, the JVM bytecode for the body of the function is written to another temporary file dedicated to function declarations. The following information is then added to the symbol table along with the identifier name of the function: the Constant Pool reference information; the JVM bytecode needed to prepare for the execution of the function; the JVM bytecode needed to execute the function once all of its arguments have been pushed onto the frame (also known as the Java Operand Stack); and the other Constant Pool entries the JVM needs in order to execute the function.

When a local variable declaration is parsed within a function, the code generator only needs to increment a running count, which is the number of local variables previously declared plus the number of arguments of the function (footnote 3). This number becomes the reference index used to access the local variable in the future. The code generator then adds the variable's identifier name, along with the reference index, to the symbol table.

Footnote 3: Assuming the local variable is of type int, which is 4 bytes or one word wide. Future implementations of types long and double would require incrementing the total count by two, since they are two words wide.

Most of the JVM bytecode instructions that operate on the frame work similarly to the assembly language instructions of other machine platforms. The JVM allocates a frame (or Java Operand Stack) for each function call. Most of the bytecode of a function pops values from the Operand Stack, operates on the values, and then pushes a result back onto that same Operand Stack. The majority of bytecode instructions make some use of the Operand Stack, including the function-invoking bytecode, which pops the arguments being passed (by value) from the Operand Stack.

One of the main differences between generating code for the JVM and for other stack-based machine platforms comes from the JVM's method of memory management and the way its program counter operates.
Although the JVM resembles other machine platforms in that it is stack and heap based and allocates frames (Java Operand Stacks) upon function calls, it is also quite different: it performs all memory management automatically within the virtual machine and does not allow others to directly access, change, or allocate memory. Since the allocation of memory for the stack, heap, and frames is performed automatically by the JVM, the code generator does not need (nor is it allowed) to generate code that explicitly moves or stores the old stack or frame pointers upon invoking a function. The code generator only needs to generate the code that tells the JVM the amount of memory to allocate. Storing and restoring the stack and frame pointers is performed automatically by the JVM when function calls and function returns are made.

Likewise, the program counter cannot be changed directly by the code generator; it is incremented and manipulated by the JVM itself. Jumps to other parts of the program during conditional and iteration statements are therefore expressed through bytecode counting rather than program counter manipulation. Instead of changing the value of the program counter, the code generator generates bytecode that jumps to a particular byte of the code of the function being executed (footnote 4).

Since the format of the JVM .class file requires the Constant Pool information to appear first, then the class field declarations (our global declarations), and finally the class method declarations (including the constructor), it was necessary to write the generated code of each segment to a different file. The temporary file for the constructor was needed because global array variables can be declared anywhere in the C code of the sourcefile, so the compiler needs a way to add bytecode to the constructor whenever it parses a global array variable declaration. When code generation is complete, the four temporary files for the Constant Pool, the global variable declarations, the function declarations, and the constructor of our targetfile class are combined, in that order, into the complete targetfile.class file.

Footnote 4: For more information on bytecode counting and jumping, refer to [3].
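The byte-counting jumps described in this section can be illustrated for a while loop whose condition is a less-than comparison. The sketch below is in Python and is an assumption about how such offsets could be computed, not a transcription of the thesis's code generator; the helper names `branch` and `while_less_than` are hypothetical. The opcodes (`if_icmpge`, `goto`) are standard JVM instructions, and each takes a signed 16-bit offset measured from the address of the branch opcode itself.

```python
import struct

IF_ICMPGE, GOTO = 0xA2, 0xA7  # each is followed by a signed 16-bit offset


def branch(opcode, instr_addr, target_addr):
    """Encode a 3-byte branch; the offset is counted from the branch opcode's address."""
    return struct.pack(">Bh", opcode, target_addr - instr_addr)


def while_less_than(cond_load, body):
    """Lay out `while (a < b) body`.

    cond_load: bytes that push a and then b on the operand stack
    body:      bytes of the already generated loop body
    Addresses are counted from the first byte of the loop; because the
    offsets are relative, the loop can sit anywhere in the method.
    """
    exit_test_addr = len(cond_load)             # address of if_icmpge
    back_jump_addr = exit_test_addr + 3 + len(body)  # address of goto
    exit_addr = back_jump_addr + 3              # first byte after the loop
    return (cond_load
            + branch(IF_ICMPGE, exit_test_addr, exit_addr)  # leave when a >= b
            + body
            + branch(GOTO, back_jump_addr, 0))              # re-test the condition
```

With a four-byte condition (iload 0, iload 1) and a three-byte body (iinc 0 1), the exit test sits at byte 4 and jumps forward by 9, and the goto sits at byte 10 and jumps back by 10, giving 13 bytes in total.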

3 Conclusion

With the number of different machine platforms in existence and the increasing influence of the Internet, it becomes more and more important to write applications that can run on all platforms. With the increasing popularity of the Java programming language as a cross-platform language, our project gives the widely known C language the same ability to "Write Once, Run Anywhere". Despite the fact that the Java Virtual Machine was designed for the Java programming language, we wanted to show that C can be efficiently compiled to the JVM. Since the compiler is a one-pass compiler and the sourcefile is read through only once, the compiler executes relatively efficiently. All the code was written in C++, and the code generation segment of the compiler has an object-oriented design. Each object was designed to be small and efficient, at the expense of having fewer objects, but this makes the code extremely readable and easy to understand, which we thought was important in motivating any future interest in our project.

Currently, any C code in the sourcefile which conforms to the grammar described in 2.1 can be compiled into JVM bytecode. Future extensions of the project include extending the implemented C grammar and adding C++ grammar to allow for class objects. More immediate grammar extensions include adding the #include rule. Currently, the inclusion of our own version of the stdio.h library is hard-coded into the compiler. The symbol table in the code generator automatically reads in a header file called stdio.hjava and adds the information in the file to the global scope of the symbol table. This file includes all the information (e.g. Constant Pool entries and executable bytecode) the symbol table needs to allow the functions int input() and void output(int) to be called (footnote 5). These functions give the C programmer using the compiler an interface to the standard input and standard output. The format of the .hjava file is straightforward and very similar to the format of a method declaration in a JVM .class file (footnote 6).

Footnote 5: The details of int input() and void output(int) are discussed in Appendix A.

Footnote 6: The structure of the .hjava header file is documented in the file SymbolTable.cc.

Another possible extension to the compiler is a stronger semantic verifier within the code generator. Although the JVM does semantic checking at the virtual machine runtime level, it would be more efficient and more predictable to also perform more semantic checking at the compiler level.

In conclusion, we hope our project will spur future interest in combining the popularity of the C programming language with the cross-platform capability the Java Virtual Machine has to offer.

A User's Instructions

All the source code for our project is in the following directory until mid-July:

    /usr/tahoe1/slash/thesis/codegen/

To compile code written in the subset of the C programming language described in 2.1, execute:

    JVMcc targetfile < sourcefile

Where:

    JVMcc: is the C compiler
    sourcefile: is the C program being compiled
    targetfile: is the name of the JVM .class file (without the .class suffix) being compiled to

JVMcc reads in the redirected sourcefile and will scan, parse, and generate JVM bytecode into a file it creates called targetfile.class. The .class suffix is attached to the end of targetfile to indicate that targetfile is a JVM class file. To execute targetfile.class, run the Java interpreter with targetfile as its input file:

    java targetfile

Note on the C code in the sourcefile

To read from standard input, call the function int input(), which reads an integer from the standard input and returns the integer read. To write an integer to standard output, call the function void output(int), which writes the integer argument to the standard output.
