Full speed ahead!

My big project at the moment involves rewriting the Epoch language compiler for what feels like the millionth time.

The good news is, instead of yet another C++ incarnation of the compiler, this time around I'm moving towards a self-hosted model, where the compiler for Epoch is itself an Epoch program.

As I've detailed elsewhere, I decided to take a somewhat unconventional approach to this project. Instead of starting with a parser and building the compiler from the ground up, I'm creating a series of plugins that replace sections of the C++ compiler implementation, starting from the back-end and moving towards the lexer/parser as a final step.

The creation of bytecode streams has been ported for a while; currently, the bulk of the work centers on turning the compiler's internal representation (IR) of a program into a sequence of calls into the bytecode emitter. Two aspects of this merit a little more detail. First, doing this requires actually having an internal representation of programs within the Epoch compiler, and building that representation is where most of the effort goes. Second, once the representation is close enough in features and robustness to the parallel C++ version, it should be possible to turn IR into bytecode using either Epoch or C++.

Actually creating the bytecode sequences is pretty trivial; it's just a matter of traversing the IR tree structures and popping out a few blurbs of bytecode for each node. What really gets complex is porting the complete tree structures to Epoch in the first place. Part of what makes it tricky is that the only data structure I'm actually using in the Epoch compiler port is a singly linked list. This isn't so bad once you get used to it, but it's still a mental shift and it's not the most efficient code in the world.
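To give a rough feel for what "traverse the IR and pop out bytecode per node" means when the only container is a singly linked list, here is a small sketch. It is written in Python rather than Epoch purely for illustration, and every name in it (`Expr`, `ExprList`, `Emitter`, the `PUSH`/`INVOKE` mnemonics) is made up for the example; none of it is taken from the actual Epoch compiler or its bytecode format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Expr:
    """One IR node: either a literal or a call (hypothetical shape)."""
    kind: str                           # "literal" or "call"
    value: object                       # the literal value, or the callee name
    args: Optional["ExprList"] = None   # argument list, only used for calls


@dataclass
class ExprList:
    """Singly linked list cell holding one IR node."""
    head: Expr
    tail: Optional["ExprList"] = None


class Emitter:
    """Stand-in for a bytecode emitter; it just records mnemonics as text."""

    def __init__(self):
        self.out = []

    def push_literal(self, value):
        self.out.append(f"PUSH {value}")

    def invoke(self, name):
        self.out.append(f"INVOKE {name}")


def emit_expr(node, emitter):
    """Emit code for one node: arguments first, then the call itself."""
    if node.kind == "literal":
        emitter.push_literal(node.value)
    elif node.kind == "call":
        emit_list(node.args, emitter)
        emitter.invoke(node.value)
    else:
        raise ValueError(f"unknown node kind: {node.kind}")


def emit_list(cell, emitter):
    """Walk a linked list front to back, emitting each node in order."""
    if cell is None:
        return
    emit_expr(cell.head, emitter)
    emit_list(cell.tail, emitter)
```

Feeding it the IR for something like `print(add(1, 2))` yields the arguments pushed in order, followed by the inner and then the outer call. The per-node emission really is the easy part; the work is in having all of those node types exist in the first place.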

It may seem a bit weird to have such a constraint on the implementation. Fact of the matter is, it would have taken considerable work to implement other data structures, because the way references work in Epoch doesn't easily allow for controlling the lifetime of objects, nor for re-seating references. That rules out things like actual red/black trees, hashmaps, etc. I wanted to move to self-hosting first, before tackling those limitations, for two reasons. First, this limits the amount of C++ code maintenance I have to do; and second, it makes the self-hosting process a little easier because there's less functionality to reimplement.
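As an illustration of why a singly linked list is still workable under those rules, here is a sketch of a lookup table built purely by prepending new cells, with no reference ever being re-pointed after it is created. Again this is Python, not Epoch, and `Binding`, `bind`, and `lookup` are hypothetical names invented for the example; the frozen dataclass is only there to imitate the "can't re-seat" restriction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Binding:
    """One immutable list cell mapping a name to a type name."""
    name: str
    type_name: str
    next: Optional["Binding"] = None


def bind(scope, name, type_name):
    """Return a new list with the binding prepended; the old list is untouched."""
    return Binding(name, type_name, scope)


def lookup(scope, name):
    """Linear search by recursion; the most recent binding for a name wins."""
    if scope is None:
        return None
    if scope.name == name:
        return scope.type_name
    return lookup(scope.next, name)
```

Lookups are O(n) instead of the O(1) or O(log n) a hashmap or balanced tree would give, which is the "not the most efficient code in the world" trade-off in a nutshell.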

Right now, code generation in Epoch passes 33 of 62 compiler tests. Most of the remaining stuff has to do with automatically generated functions such as aggregate constructors, with a smattering of little bits and pieces here and there. I think, unfortunately, I've hit a point where it takes a lot of effort to make each additional test pass - the low-hanging fruit has been thoroughly scavenged.

On the plus side, I basically doubled the number of passing tests over a weekend, so there's that.

Once code generation is done, it'll be time to tackle semantic validation and type inference. Those two processes are deeply tied together in Epoch, and both operate in-place on the compiler's IR. Since code generation already requires roughly 85% of the IR to be implemented, extending the data structures should take minimal effort. The bulk of the job will be algorithmic work to actually reimplement the validation and inference logic.
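For anyone unfamiliar with what "in-place" means here, the general idea is that the pass walks the existing IR, fills in a type slot on each node, and reports problems as it goes, rather than building a second annotated tree. The toy below shows that shape in Python. It is not Epoch's actual inference algorithm, and the node kinds, the `type` field, and the type names are all assumptions made up for the example.

```python
class Node:
    """Toy IR node with a type slot that starts out unknown."""

    def __init__(self, kind, value=None, left=None, right=None):
        self.kind = kind
        self.value = value
        self.left = left
        self.right = right
        self.type = None   # filled in by infer()


def infer(node, errors):
    """Annotate node.type in place and append any validation errors."""
    if node.kind == "int":
        node.type = "integer"
    elif node.kind == "str":
        node.type = "string"
    elif node.kind == "add":
        left = infer(node.left, errors)
        right = infer(node.right, errors)
        if left is None or right is None:
            pass   # an operand already failed; avoid a cascading error
        elif left == right:
            node.type = left
        else:
            errors.append(f"cannot add {left} and {right}")
    else:
        errors.append(f"unknown node kind: {node.kind}")
    return node.type
```

Inference and validation are interleaved in the same walk, which is the sense in which the two are hard to separate: the check on `add` cannot run until the operand types have been inferred.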

As with any software task, estimating the amount of effort it will take to finish the self-hosting port is... difficult, to say the least. The cool thing is that I already have a complete battery of compiler tests to throw at this, so it's pretty straightforward to pinpoint bugs in the new compiler implementation. The much less cool thing is that I don't have any real debugging tools for Epoch programs yet, so the stickier issues generally tend to require a lot of manual print() calls and sifting through logs.

That said, given the pace of development thus far and the mountain of work left in front of me (keep in mind that after semantic validation/type inference I still have to write a complete lexer and parser for the language), I think it's fair to say that I should hit self-hosting completion by the end of the year. I'm trying to pad that time generously so that when I inevitably get tired (or busy), I can leave off for a while without feeling like I'm going to slip a personal deadline.

If you're morbidly curious about what a self-hosted compiler in progress looks like, check out the source (just under 2000 lines as of now).