An older (yet remarkably up-to-date) resource on compiling with continuation-passing style is Compiling with Continuations. (This is a very advanced book and not recommended for struggling students.)

The classic compiler textbook is "The Dragon Book":
Compilers: Principles, Techniques and Tools.
Some of the techniques in this book work for dynamic languages like Python.
Unfortunately, most of the techniques only work well for C-like typed languages.

Bonus opportunities

Valid until the week of finals.

+5% to P1 grade: handle unicode.

+5% to P1 grade: handle tabs in input.

Use an exploit on caprica to gain root access: +10% for a local user exploit; +15% for a remote exploit (e.g. breaking in via apache).
You must exploit a vulnerability (e.g. buffer overflow) for Ubuntu 12.10 on caprica
to gain root; that is, you can't steal my laptop while it has an open ssh
connection to caprica to claim the prize.
You must write up a short summary of the vulnerability and how you exploited it. (You may use a prepackaged tool for exploitation.)
Mail the summary to me for approval and then to the class.
Each individual exploit may only be claimed once, and the first to exploit wins.
To signal that you have claimed root, modify the message of the day.

Project 1: A lexer for Python

Due: Friday, Feb 1, 11:59 PM Utah time

Your task is to construct a lexical analyzer for the Python programming language.

Python is a relatively simple language to tokenize, with the notable exception of its
significant whitespace.

(PUNCT text) -- for operators and
delimiters, where text is a Racket
string containing the text of the operator or delimiter.

(ENDMARKER) -- for the end of the input.

If you encounter a lexical error, print (ERROR "explanation") and quit.

Simplifications

You do not have to handle multiple character-set encodings. Assume the input is ASCII. (This nullifies the need to handle regular strings and byte strings differently.)
Assume an identifier matches the regular expression
[A-Za-z_][A-Za-z_0-9]*

You should not join adjacent strings together.
That's easier to do with a later pass.

If a tab appears as an indentation character, it is an error. (You do not have to bump it to the nearest multiple of 8.)

Hints and tips

You can use a lexer-generator like flex
or Racket's lexer for this project.
(You'll find flex preinstalled on most Unix systems.)
If you go this route, be sure to exploit the ^
regular expression pattern for matching the start of a line.

It's not unreasonable to write a lexer by hand for a language like Python.
If you use a language that supports regular expressions, you can get the longest matching prefix of a string for the regular expression
re with /^(re)/.
The special pattern ^ matches the start of the input (or in some cases, the start of a line).
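
As a rough illustration of the anchored-prefix idea, here is a minimal sketch in Python. The names TOKEN_PATTERNS and next_token are hypothetical, and the ID and LIT tags are assumptions made for the example; only the identifier pattern and the PUNCT tag come from this page. Python's pattern.match(text, pos) succeeds only when the match begins at pos, which plays the role of the leading ^. Regex alternation does not guarantee the longest match, so the sketch tries every pattern and keeps the longest lexeme.

```python
import re

# Hypothetical token table: (tag, compiled pattern).
# Only the identifier pattern and the PUNCT tag come from the assignment;
# the ID and LIT tags and the operator subset are assumptions.
TOKEN_PATTERNS = [
    ("ID", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("LIT", re.compile(r"[0-9]+")),
    ("PUNCT", re.compile(r"==|<=|>=|!=|[-+*/=<>()\[\]:,.]")),
]


def next_token(text, pos=0):
    """Return (tag, lexeme, new_pos) for the longest match at pos, or None."""
    best = None
    for tag, pattern in TOKEN_PATTERNS:
        # match() only succeeds if the match starts exactly at pos.
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (tag, m.group(), m.end())
    return best
```

Calling next_token repeatedly with the returned position walks through the input; a None result at a non-whitespace character would correspond to a lexical error.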

You may find it easier to write a two-pass lexical analyzer: the first pass could resolve significant whitespace, and the second could handle proper tokenization.
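
One possible shape for the first pass is sketched below in Python. It keeps a stack of indentation widths and emits a marker whenever the width grows or shrinks. The marker names INDENT, DEDENT and LINE are assumptions for this sketch, not the assignment's token specification, and the sketch ignores line continuations inside brackets and multi-line strings, which a full lexer would have to handle.

```python
def indent_pass(source):
    """First pass: turn leading spaces into INDENT/DEDENT markers.

    Yields ("INDENT",), ("DEDENT",) and ("LINE", text) tuples.
    These marker names are assumptions made for this sketch.
    """
    stack = [0]  # indentation widths of the enclosing blocks
    for raw in source.splitlines():
        body = raw.lstrip(" ")
        if body.strip() == "" or body.startswith("#"):
            continue  # blank and comment-only lines do not affect indentation
        if body.startswith("\t"):
            # Per the simplification above, a tab used for indentation is an error.
            raise ValueError("tab used for indentation")
        width = len(raw) - len(body)
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT",)
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT",)
        if width != stack[-1]:
            raise ValueError("dedent does not match any outer indentation level")
        yield ("LINE", body)
    while len(stack) > 1:  # close any blocks still open at end of input
        stack.pop()
        yield ("DEDENT",)
```

The second pass can then tokenize each LINE body without thinking about whitespace at the start of a line.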

For this project, using a functional language holds advantages over languages like C and Java, but it
does not hold an advantage over a lexer generator like flex.

Reference implementation

Submission instructions

cd /home/unid/pylex
make
make run < test1.py > test1.py.out
make run < test2.py > test2.py.out
make run < test3.py > test3.py.out
...

Project 2: A parser for Python

Due: Wed, Feb 27 11:59 PM Utah time

Your task is to construct a parser for the Python programming language.

Unusually among programming languages, Python is straightforward to parse.
In fact, it is one of the few popular programming languages with an LL(1) grammar, making it amenable to
a hand-written recursive-descent parser.

Of course, it is also possible to use a parsing tool.
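
To make the recursive-descent idea concrete, here is a minimal Python sketch for a toy LL(1) expression grammar. The grammar, the function name parse_expr and the nested-tuple output are assumptions chosen for illustration; they are not the course grammar or the required output format. Each nonterminal becomes one function, and a single token of lookahead (peek) decides which production to take.

```python
def parse_expr(tokens):
    """Parse a list of string tokens with one token of lookahead.

    Toy grammar (an assumption for this sketch, not the course grammar):
        expr ::= term (('+' | '-') term)*
        term ::= atom (('*' | '/') atom)*
        atom ::= NAME | NUMBER | '(' expr ')'
    Returns a nested tuple such as ("+", "1", ("*", "2", "x")).
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())  # left-associative
        return node

    def term():
        node = atom()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, atom())
        return node

    def atom():
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok.isalnum():
            return advance()
        raise SyntaxError("unexpected token " + repr(tok))

    tree = expr()
    if peek() is not None:
        raise SyntaxError("trailing input")
    return tree
```

The same one-function-per-nonterminal structure scales to a full grammar, at the cost of a good deal of typing.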

Guidelines

You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X.

If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements

You should assume that Python source is coming in -- not pre-lexed input.

If you need to use a working lexer, you can assume that the program
pylex is in the current PATH, and that it matches the behavior
of the web app by consuming a file on STDIN and printing its tokens as
S-Expressions line-by-line on STDOUT.

The output S-Expression should fit within the grammar for the concrete AST.

Include a README.txt file that contains:

your name;

a list of software dependencies necessary to run your assignment;

a description of what you got working,
what's partially working and what's completely broken; and

a manifest briefly describing each file you've turned in.

Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.

The program should use a Makefile:

make should compile the program (if compilation is necessary).

make run should accept input on STDIN and send
output to STDOUT.

make lex should run the lexer on STDIN.

make parse should run the parser on STDIN.

make test should run any test cases you used.

The program should output one s-expression containing the concrete parse tree on success or #f
on failure.

Simplifications

Use the simplified grammar provided.

Hints and tips

Do not write the parser by hand, unless you have a lot of time.

I recommend that you use yacc, Racket's parser generator or a derivative-based parsing tool.

Requirements

If you need to use a working lexer, you can assume that the program
pylex is in the current PATH, and that it matches the behavior
of the web app by consuming a file on STDIN and printing its tokens as
S-Expressions line-by-line on STDOUT.

If you need to use a working parser, you can assume that the program
pyparse is in the current PATH, and that it matches the
behavior of the reference app by consuming a tokenized input on STDIN and
printing the AST on stdout.

a description of what you got working,
what's partially working and what's completely broken; and

a manifest briefly describing each file you've turned in.

Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.

The program should use a Makefile:

make should compile the program (if compilation is necessary).

make trans should accept the output of project 2 on STDIN and send
output to STDOUT.

make run should accept a Python program on STDIN and send
output to STDOUT.

make test should run any test cases you used.

The program should output one S-expression containing the HIR code on success or
print an error message containing the word error on failure.

Your output need not match the reference implementation exactly.
It should be semantically equivalent (unless the reference implementation has a bug).

You can also test your output against a Python 3.x interpreter, but note that these may differ in how they display objects using print.

Simplifications

Since project 2 snipped out the class system, you can't create objects, but you should still assume that you can access and set their fields with .name.
(Note: The code to interpret HIR fails immediately with an error message if these constructs are invoked.)

For numbers, assume only integers are used in the source and at run-time. (No floating point, no imaginary, no complex.)

All try blocks have exactly one except clause,
and no else or finally clauses, i.e.:

try:
    main code
except:
    failure code

A raise can raise any value (not just exceptions).

The single except clause for a try block does not specify which exception it catches, nor does it name it.

You cannot delete a variable from the local scope.
You can only delete fields and indices.
That is, you can write:

del a[i]

or

del a.f

but not:

del x

Of course, you could also write:

del f(3)["foo"][i]

You can assume all global variables are defined at the top level before their first reference.
If you want to handle global variables first defined locally using the global construct, talk to Matt.
That is, you have to handle:

y = 10
def f(x):
    global y
    y = x

but not:

def f(x):
    global y
    y = x

Hints and tips

In this project, the expressive advantage of using a functional language
like Racket or Haskell over a language like C or Java starts to become
substantial.
Those wishing to complete this project in a non-functional language
need to start immediately and ask for help early!

A prominent feature in functional languages is the ability to pattern-match
on data structures.
This ability is the killer advantage in this project.

If you want to write your lexer, parser and translator in different languages,
you can tie them together with a Unix pipe, like so:

$ pylex | pyparse | ./pytrans

Break the transformation down into a transformer for each kind of Python term:
let transform-program
transform an entire Python program into HIR;
let transform-stmt
transform an individual Python stmt into HIR;
let transform-exp
transform an individual Python expression into HIR.
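
A minimal sketch of this three-function structure, written in Python over nested tuples, might look like the following. Both the input shapes (tags such as "assign", "expr", "call", "binop") and the output forms (such as "set!") are hypothetical stand-ins invented for this example; the real input is the AST from project 2 and the real output must follow the HIR grammar.

```python
def transform_program(program):
    """("program", stmt, ...) -> ("program", hir_stmt, ...)  (hypothetical shapes)."""
    return ("program",) + tuple(transform_stmt(s) for s in program[1:])


def transform_stmt(stmt):
    """Dispatch on the statement tag; every tag here is an assumption."""
    tag = stmt[0]
    if tag == "assign":   # ("assign", name, exp)
        return ("set!", stmt[1], transform_exp(stmt[2]))
    if tag == "expr":     # ("expr", exp)
        return transform_exp(stmt[1])
    if tag == "while":    # ("while", test, body_stmt, ...)
        body = tuple(transform_stmt(s) for s in stmt[2:])
        return ("while", transform_exp(stmt[1])) + body
    raise ValueError("error: unknown statement " + repr(tag))


def transform_exp(exp):
    """Dispatch on the expression tag; integers and names pass through."""
    if isinstance(exp, (int, str)):
        return exp
    tag = exp[0]
    if tag == "call":     # ("call", fn_exp, arg_exp, ...)
        return tuple(transform_exp(e) for e in exp[1:])
    if tag == "binop":    # ("binop", op, left, right)
        return (exp[1], transform_exp(exp[2]), transform_exp(exp[3]))
    raise ValueError("error: unknown expression " + repr(tag))
```

Each function only knows about its own class of term and calls the others for subterms, so adding a new statement or expression form means adding one case in one place.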

Use only very small programs for testing (initially).
Start with print(3) and work up from there.
Examine the output of the reference implementation on these small programs
to unravel its logic.

Code

This code is provided without warranty or guarantee of correctness.
If you find a discrepancy between this code and the project description,
notify Matt.

To execute HIR code, you can prepend hir-header.rkt to the result of your translation, and then execute it in racket.
(hir-header.rkt provides macros and functions to define all of the elements of
HIR in terms of Racket.)

If you want, you can start the project with stub code;
this file illustrates a common design pattern for compilers--a transformation function for each class of term.
In this case, it contains
transform-program,
transform-stmt and
transform-exp.

You can grab a reference implementation from pytrans_rkt.zo.
It's compiled Racket bytecode for Racket 5.3.1.
To run, use racket pytrans_rkt.zo and send input on STDIN.

Project 4: A CPS translator for Python

Of course, the expected way of accomplishing this is to translate
HIR
into CPS.

It is tedious to translate HIR directly to CPS.
It is strongly recommended that you push HIR into a desugaring phase--dropping it down to
LIR first.

Guidelines

You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X.
Once again, non-functional languages are at a steep expressiveness disadvantage, and will require substantially more
time and code invested.
Please plan and request help accordingly.

If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements

You should assume that Python source is coming in, not pre-lexed, pre-parsed or pre-translated input.

If you need to use a working lexer, you can assume that the program
pylex is in the current PATH, and that it matches the behavior
of the web app by consuming a file on STDIN and printing its tokens as
S-Expressions line-by-line on STDOUT.

If you need to use a working parser, you can assume that the program
pyparse is in the current PATH, and that it matches the
behavior of the web app by consuming a tokenized input on STDIN and
printing the AST on stdout.

If you need to use a working HIR translator, you can assume that the program
pytrans is in the current PATH, and that it matches the
behavior of the bytecode app by consuming a Python AST on STDIN and
printing the HIR AST on stdout.

a description of what you got working,
what's partially working and what's completely broken; and

a manifest briefly (a few words) describing each file you've turned in.

Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.

The program should use a Makefile:

make should compile the program (if compilation is necessary).

make run should accept Python code on STDIN and send
CPS code to STDOUT.

make test should run any test cases you used.

The program should output one S-expression containing the CPS code.
The output for invalid input is undefined, but it is recommended that
you print an error message containing the word error on failure.

Your output need not match the reference implementation exactly.
It should be semantically equivalent (unless the reference implementation has a bug).

When CPS-converting, the continuation argument always comes last, even for primitive operations.
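
The continuation-last convention can be seen in a small CPS converter. The Python sketch below works on a toy language of integers, variable names, ("lambda", (params...), body) and applications (f, arg, ...); this representation and the function names fresh, m, t_c, t_k and t_ks are assumptions for illustration, not the LIR or CPS grammars of this project. Primitive operations are treated like ordinary calls here, so they also receive their continuation as the final argument. The generated names (k0, rv1, ...) are assumed not to clash with names in the source program.

```python
import itertools

_counter = itertools.count()


def fresh(prefix):
    """Make a new variable name (assumed not to clash with source names)."""
    return "%s%d" % (prefix, next(_counter))


def is_atomic(e):
    is_lambda = isinstance(e, tuple) and len(e) == 3 and e[0] == "lambda"
    return isinstance(e, (int, str)) or is_lambda


def m(e):
    """Convert an atomic expression; lambdas gain a final continuation parameter."""
    if isinstance(e, tuple):
        k = fresh("k")
        return ("lambda", e[1] + (k,), t_c(e[2], k))
    return e


def t_c(e, c):
    """Convert e so that its value is passed to the syntactic continuation c."""
    if is_atomic(e):
        return (c, m(e))
    f, args = e[0], e[1:]
    # Continuation goes last in the converted call.
    return t_k(f, lambda f2: t_ks(args, lambda args2: (f2,) + args2 + (c,)))


def t_k(e, k):
    """Convert e, where k is a Python function that builds the rest of the term."""
    if is_atomic(e):
        return k(m(e))
    rv = fresh("rv")
    return t_c(e, ("lambda", (rv,), k(rv)))


def t_ks(es, k):
    """Convert a tuple of expressions left to right."""
    if not es:
        return k(())
    return t_k(es[0], lambda hd: t_ks(es[1:], lambda tl: k((hd,) + tl)))
```

For example, converting ("f", ("g", 1)) with the continuation "halt" yields a call to g whose last argument is a lambda that receives the result and then calls f with "halt" as its last argument.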

You can also test your output against a Python 3.x interpreter.

Simplifications

All object- and field-related operations have been removed from the language.

Hints and tips

This project is optimally split into two phases: desugaring and CPS conversion.

It is strongly recommended that you first translate HIR into
LIR,
and then translate LIR to CPS.

Even if you don't understand the define-syntax form in detail,
you may learn a great deal about how to desugar from hir-header.rkt.
This is particularly true for the desugaring of try.

A substantial harness is provided in the stub code.
Even if you don't plan to use Racket, it is recommended that you look at the harness.
In particular, try:

$ make test-cps
$ make test-lir

The stub code contains a desugarer
with all of the cases stubbed out.
It also contains the "top level" for the reference CPS converter.

Pattern-matching tree transforms once again dominate this project.

Reading material

Of the reading material for the course, the following are most helpful:

Project 5: A Python-to-C compiler

Due: Wed 24 April 11:59 PM Utah time

Your task is to construct a translator from the subset of Python in Project 4
into C.

Of course, the expected way of accomplishing this is to translate
from CPS.

To make this transformation as easy as possible, it is recommended that you
perform mutable variable elimination,
flat closure conversion and
lambda lifting.
If you do, then the result may be more easily transliterated into C.
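
Flat closure conversion depends on knowing which variables each lambda captures. As a minimal sketch of that analysis, the Python function below computes free variables over a toy term representation: integers, variable names as strings, ("lambda", (params...), body) and applications (f, arg, ...). The representation and the name free_vars are assumptions for illustration; the actual input would be the CPS form from Project 4.

```python
def free_vars(e):
    """Return the set of variable names that occur free in e."""
    if isinstance(e, str):      # a variable reference
        return {e}
    if isinstance(e, int):      # a literal has no variables
        return set()
    if len(e) == 3 and e[0] == "lambda":
        # Parameters are bound inside the body, so remove them.
        return free_vars(e[2]) - set(e[1])
    # Application: union over the function position and all arguments.
    return set().union(*(free_vars(x) for x in e))
```

A closure for a lambda then only needs to store the values of free_vars(lambda_term), and the lifted C function can read them from an explicit environment argument. Note that names of primitives and globals also show up as free in this sketch, so a real pass would treat them separately.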

In terms of restructuring the language for these transforms, it is recommended that you follow
this spec.

Guidelines

You can use any language you want for this project, but you must be able to run your program under Linux or Mac OS X.
Once again, non-functional languages are at a steep expressiveness disadvantage, and will require substantially more
time and code invested.
Please plan and request help accordingly.

If the TA or I can't get your code to run, we'll ask you to help us get it running.

Requirements

You should assume that Python source is coming in, not pre-lexed, pre-parsed or pre-translated input.

If you need to use a working lexer, you can assume that the program
pylex is in the current PATH, and that it matches the behavior
of the web app by consuming a file on STDIN and printing its tokens as
S-Expressions line-by-line on STDOUT.

If you need to use a working parser, you can assume that the program
pyparse is in the current PATH, and that it matches the
behavior of the web app by consuming a tokenized input on STDIN and
printing the AST on stdout.

If you need to use a working HIR translator, you can assume that the program
pytrans is in the current PATH, and that it matches the
behavior of the bytecode app by consuming a Python AST on STDIN and
printing the HIR AST on stdout.

If you need to use a working LIR translator, you can assume that the program
pydesugar is in the current PATH, and that it matches the
behavior of the bytecode app by consuming a Python AST on STDIN and
printing the LIR AST on stdout.

If you need to use a working CPS converter, you can assume that the program
pycps is in the current PATH, and that it matches the
behavior of the bytecode app by consuming a Python AST on STDIN and
printing the CPS code on stdout.

The output should be accepted by gcc.

Include a README.txt file that contains:

your name;

a list of software dependencies necessary to run your assignment;

a description of what you got working,
what's partially working and what's completely broken; and

a manifest briefly (a few words) describing each file you've turned in.

Every source code file you turn in needs to have your name and student number at the top and the bottom in comments.

The program should use a Makefile:

make should compile the program (if compilation is necessary).

make run should accept Python code on STDIN and send
C code to STDOUT.

make test should run any test cases you used.

The output for invalid input is undefined, but it is recommended that
you print an error message containing the word error on failure.

Your output need not match the reference implementation exactly.
It should be semantically equivalent (unless the reference implementation has a bug).

You can also test your output against a Python 3.x interpreter.

Hints and tips

This project is optimally split into four small transforms:
mutable variable elimination,
closure conversion,
lambda-lifting
and
C code emission.

Tree transforms once again dominate this project.

Code

This code is provided without warranty or guarantee of correctness.
If you find a discrepancy between this code and the project description,
notify Matt.

To execute MVE'd, closure-converted or lambda-lifted code, you can prepend c-header.rkt to the result of your translation, and then execute it in racket.