Adding static typing and scope validation into the language, part 1

Continuing on my series discussing the language I wrote, this next post is going to talk about the basics of static typing and scope rules. So far my language implementation follows very closely to Parr’s examples in his book Language Implementation Patterns, which is what gave me the inspiration to do this project.

However, starting here, I began to get confused with Parr’s examples. I didn’t really see how to translate his examples into real code. On top of that some of his diagrams weren’t matching my mental model of what was going on, so that didn’t help either. In general, I understood what he was doing, but I didn’t want to follow his examples anymore – Parr switched over to using ANTLR for everything which didn’t help me: I wanted to know how to do this without a generator. So I just jumped in and took a stab at it.

For the scope builder (and later the interpreter) I did what I thought made sense. Like I’ve mentioned before, I haven’t formally studied compiler/language design, so if a solution seems naive or egregiously wrong take it with a grain of salt (but leave a comment so I can learn how to do it better).

Anyways, this is where the project started to take off because I got to make real language decisions. The first two things I decided on was that this language would have static typing and forward reference support. I, personally, prefer static typed languages to dynamic typed languages because I like compile time checking. I’m not going to deny that dynamic typing isn’t helpful in some situations, in fact my interpreter leverages the dynamic data type heavily. For fun I did build in some runtime typing into the language (which I’ll cover in a later post). The type system I implemented also supports type inference. However, my language is pretty rudimentary and I don’t support type promotion, though it wouldn’t be hard to add.

For scoping, I wanted to allow forward references for methods and class internals, but not for internal statements. I like forward references because it lets you structure your code by ideas and locality of reference, and not strictly by bottom up execution. I like the freedom that forward references gives you. Also I knew that solving forward references would be more difficult so I decided to tackle the problem.

For the remaining post, I’m going to discuss the groundwork required to dig into more complex parts of the scope builder. As always, if you are curious check out the github project for full source, examples, tests, and comments.

Let’s get started.

Iterating over the syntax tree

In the last post we built a parser that generates an abstract syntax tree representing the execution of our program. Now, we need to iterate over that tree and do meaningful work. However, we’re going to be going over this tree a bunch of different times and we want to do different work over it. We could put the tree iteration into the syntax tree, but thats nasty. Instead, I used the visitor pattern to iterate over the tree.

I defined two interfaces. The first, is the definition for the actual visitor.

This interface will accept different syntax trees and depending on which overloaded Visit function is called, the visitor will know how to iterate (and do work) on that particular syntax tree.

The next interface is applied to the abstract base class Ast, which means that all syntax tree classes have to implement it. This interface forces the ast classes to “accept” a visitor.

interface IAcceptVisitor
{
void Visit(IAstVisitor visitor);
}

Below is the class used to represent an expression. Notice its visit method. All the classes have exactly the same visit method. You have to do it this way because if the visit method was in the base class, the visitor wouldn’t be able to figure out which class was trying to be visited (since they all inherit from Ast).

Now all we need to do to iterate over the tree is create an actual visitor. As an example, here is the function invoke overloaded visit method on a visitor called PrintAstVisitor that iterates the syntax tree and writes the tree representation to the console. To iterate over other parts of the tree we tell the tree to visit itself using this visitor (with this).

Symbols, Types, and Scopes?

The goal of the scope builder visitor is to associate types to symbols and make sure that they are visible in the right scope. We all intuitively know what symbols are, we work with them all the time.

int x = 1

x, here, is a symbol. It also has a type of int. Some types are built into the language, like int, string, void, etc. Some types are user defined (like classes or structs). The scope builder is going to figure out which expressions are which types and tag each syntax tree node with its type.

The builder also starts to do some syntax validation for us. It’s responsible for making sure our assignments, declarations, and invocations all make sense. One of the builders job is to make sure we can’t reference undefined values. For example, this is invalid:

int x = y
int y = 0;

This is because x uses y before y is defined. The scope builder is also responsible for validating static typing: we want invalid type assignment to be prevented

void x = 1;

If we defined x to be a void then we certainly shouldn’t be able to assign 1 to a void. The scope builder should throw an exception and prevent us from doing this. By preventing us from doing this we eliminate having to find out at runtime that our code is bogus. Though, as the language “designer” here, if we wanted to let this happen we totally could! It’s really up to your language implementation as to how you deal with what this means.

On top of all of this, like mentioned above, the scope builder is responsible for validating that symbols are visible only where they should be. The code below should be invalid because y is declared in an inner scope not visible to the print statement. Some languages don’t do it this way (javascript/actionscript), but this drives me nuts, so I made sure to do it for my own language.

int x = 1;
{
int y = 2;
}
print y;

The scope builder sounds like it does a lot of work (and it does), but most of these things are intertwined and the underlying code is short and sweet so there isn’t too much overlapping of concerns in the same class.

Scopes

Scopes are easily represented by a stack. Each time you encounter a new scope block (like the inner scoped y above) all you need to do is push a new scope onto your scope stack. Symbols below the current scope on the stack are visible if you are in that scope, so x would be visible to y, but symbols are not allowed to be visible up the stack from the bottom.

To support classes I had one scope stack for the global space. This scoped items in what I considered “main”, or really any inline code that is not within a class. I also had a scope stack for each class I encountered. This is because you don’t want classes to be able to see symbols in the global scope. Imagine this scenario:

class foo{
int value = x;
}
int x = 0;

This should be invalid because foo shouldn’t be able to see x. [Note: classes get a little weird, because you need to define the class declaration in the global scope (so you can instantiate the class), but everything inside the class is in its own scope. There are also other issues with classes that I’ll talk about in a later post.]

A scope, then, is nothing more than a bag of symbols and a reference to its parent scope (below it in the scope stack).

For example, a basic Scope object in my language looks like this. Notice that a scope contains a reference to its parents scope (line 5). This will get set automatically by the ScopeStack I’ll show next. It’s important to understand that a scope will have access to all elements below it via its parent. You can see that logic in the Resolve function which checks if a symbol is in the current scopes dictionary, and if not, tries the parent scope.

To help with setting the parent scope each time we pushed a new scope onto the stack, I created a generic scope stack that facilitated pushing and popping scopes, as well as auto-linked scope references. This way when I access a scope reference I can go down its stack:

The concept of a stack based symbol visibility control will get re-used again when we discuss memory spaces (i.e. where a value is in memory). Memory and scope are very similar, but not the same. You’ll see why later, but needless to say the scope container class gets re-used a few times in different contexts.

Types

Every symbol has a type. Let me show you what my type definition looks like:

ExpressionTypes is an enum of expression types. I use this for type checking later. The Src property holds a reference to the syntax tree that generated this type. For example, if I have a class called foo I will create a type whose TypeName is foo and it’s Src property will point to its ClassAst reference. Later I can get info about the source syntax tree just from the type.

I only have two kinds of types in the entire system.

Built in types. These are types like int, string, etc. They all share the same class, and the only thing that is different about them is the ExpressionType enum they hold indicating which built in type they are.

User defined types. These are classes that the user defines.

To make a symbol is easy. Depending on the syntax tree type I can create the proper symbol type based on its token type.

But wait, symbols are also scopes? They can be. A class symbol is also a scope and when we’re within a class we’ll want to define it’s symbols within it’s own personal space (like I mentioned earlier with the separate scope stacks). Having a symbol also be a scope makes life really easy when you try and figure out where things are.

The scope builder is complicated, but it follows the same basic pattern

If the current symbol is being declared (variable declaration, method declaration argument definitions, etc) then define the symbol in the appropriate scope.

If we are referencing the symbol, resolve it and make sure it’s been declared and is visible in the scope we expect

Defining a symbol

A simplified version of the variable declration visit function looks like this:

[Note: it’s simplified because we need to handle variables that are declared without values, as well as type inferred values, and method assignments, and partial functions, and validating left and right hand side type assignments, etc. There is a lot to it, to be covered later. If you’re curious check the github]

This way each syntax tree can track what type it is. We can use this information to do static type checking later (basically validate that the left hand side and right hand side have the same, or promotable, types).

Resolving a symbol

Resolving is the other way around.

Here is a simplified version of the visitor for an expression. It first visits the left, then the right. If the left and right are null then we have a leaf in our syntax tree and we need to resolve what it is. If it’s a built in type, like an integer, it’ll get defined as a symbol, otherwise if it’s a word (a user defined variable) it’ll get resolved:

It’s a simplified version because it doesn’t handle forward references. I’ll discuss that in a later post.

Remember the scope rules above. If a variable isn’t visible in a scope, either because it’s undefined, or because it’s not visible, resolving the type will fail.

You’ll notice the SetScope function. Every syntax tree gets assigned a reference to the scope they came from. This way everyone knows who can see what.

Conclusion

At this point we have a very basic way of defining symbols, types, and scopes. We’ve also gone over a way to validate simple scoping rules. There is a lot more to the scope builder. Now that there is some infrastructure in the next post I’ll dig deeper into the scope builder and discuss how we can fix the forward reference problem and do type inference. Still to come in the scope builder is determining return types and handling how to build partial functions.