Blog Archives

First, let’s get our vocabulary straight. For those unfamiliar with the term REPL, it is short for a “Read-Eval-Print Loop”. It is basically a great way to poke at a language with a stick and get a taste for things you can do with it, and/or a great debugging aid, etc. Many modern languages offer a REPL (online, even!), and I hadn’t realized until recently just how bad Eureka was going to be at it. But why?

Let’s take that REPL piece by piece and see where we fall short. (We can assume in advance that I can write the Loop part.)

The Read step is basically a solved problem, and has been for some time. As my earlier link to the “online REPLs” shows, you can do this any number of ways; easy peasy.

Now to the Eval. The heart of Eureka’s interpreter is wrapped in a cute candy coating known as ekContextEval(), which takes arbitrary Eureka code in a string and compiles / assembles / interprets it in one swoop. It returns false if the eval fails (any compile or runtime errors), and you can ask the ekContext for a list of text errors. I think E is covered.

Now we just need to Print, and we’re finally catching up to Every Other Language On The Planet. But what to print? My poor Eval function loves to tell me all of the ways I’ve made a mistake, but it never gives me anything interesting in the form of an actual value. Perhaps we need to find out the kind of data Eval should be returning first, and then work on why we don’t have it.

Let’s try another REPL so we know what kind of output we want. Using a Ruby REPL as an example, I did this (input and output alternate lines):

> 5
=> 5
> [5,6,7]
=> [5, 6, 7]
> a = 5
=> 5
> a
=> 5

The flow here is pretty obvious. I give an expression (“5”), it compiles and runs the expression and spits out its value (=> 5). This is where Eureka breaks down; I don’t have that value anywhere!

It turns out Eureka’s assembler is a neat freak. Since Eureka’s interpreter relies almost entirely on two stacks (frames and values), the maintenance of those stacks is critical. If the interpreter allows things to linger in the value stack, it’ll just be a huge mess, and nobody wants a huge mess. The assembler carefully ensures that everything that is placed on the stack eventually has a permanent home, even if that home is in the garbage can (the free list).

Here is the assembly dump for Eureka when you give it the expression “5”:

.kint 0 5

.main
pki 0
pop 1
ret 0

Simple stuff. Beforehand, it defines constant int index 0 with a value of 5. Then, the code begins by pushing constant integer index 0 (the value 5) onto the stack. Everything is looking good so far! But wait, the next instruction pops a single value from the stack! Where did that come from? What about our precious eval result?!

The pop instruction comes from the innocuous-sounding function asmPad(), deep in the guts of Eureka’s assembler. Its job is to keep track of the flow of outputs of expressions (the “offers”) and the inputs of other expressions (the “keeps”). When an rvalue expression doesn’t “offer” as much as the associated lvalue wants to “keep”, asmPad() saves the day and pads the stack with null values. This ensures that the ops are always self-consistent, even if someone botches the arity of a function call or requests two return values from a function that isn’t returning anything. It protects in the other direction as well: if the rvalue “offers” more than the lvalue wants to “keep”, asmPad() happily injects pop instructions to politely discard the rvalue’s hard work. This is a well-oiled machine, except when you’d like to show off that discarded value to someone struggling with a Eureka REPL.

So, how to fix it? I tried a handful of terrible things in the guts of asmPad() and always came up short. There is no sense in punishing asmPad() for cleaning up after the messes of mismatched instructions; it is only doing the one job it has ever had! I tried injecting special versions of “ret” instead, to clue in the interpreter of when I wanted the data, etc. All of these attempts ended in failure. I finally realized that the answer was right under my nose: the pop instruction.

Most of the time, things popped by the pop instruction aren’t very interesting. However, when working in a REPL, all of the juiciest pieces of data (the results!) are either popped at the end of execution, or were recently stored in a variable (the vset instruction). All I need to do is pay attention to the most recent values affected by pop and vset, and I should have my REPL.

So I did just that: I added an optional argument to ekContextEval(), an empty array. If it is present during evaluation, I simply clear its contents before every pop or vset and add references to any values involved, bumping their refcounts in the process. Since Eureka’s interpreter is refcounted, this is actually pretty cheap. The result array stabilizes at a smallish size pretty quickly (especially if reused by the REPL code), and simply stores off pointers to ekValues the same way the interpreter does. When ekContextEval() returns, whatever is in that array is the eval result. My local Eureka REPL now handles inputs very similar to the Ruby session earlier.

I have been developing Yap in somewhat of a vacuum for the past 14 months. I would occasionally post something on here and/or show some newly functioning code to friends or coworkers with pride, but I can’t stress enough what a difference having a real “client” of your code makes.

I granted early access to Yap to my friend and former coworker Shannon, and his random questions/criticisms alone have really motivated me to add in features I have been delaying for no great reason. In just the last few days:

* Fixed a really dumb crash he found that I hopefully would have caught
* Started reorganizing all of my global functions into something that makes more sense
* Completely reworked my OO “system” and function notation to be more friendly
* Resurrected the ‘this’ keyword in a much simpler way
* Eliminated the somewhat hacky ‘with’ scope
* Added closure support!

Anyway, I’d like to offer a big Thank You to Shannon for being the first client of Yap; having a client is wonderful.

As for closure support, this code will associate a new integer variable alongside each function reference that is returned from makeClosure(), and is incremented and returned each time f() is called. Woo!

I had a long, long post here, and then I reread it and almost fell asleep. Never write details on conflating modules and chunks while you’re sick. Let’s just leave it at the fact that I conflated them, fixed it, and now exec() is easy to write without causing all kinds of stupid. I even have a sweet command line wannabe Python interpreter console thing.

When is an open parenthesis not an open parenthesis? When it is an OPEN parenthesis. Confused yet? My parser certainly was.

For natural languages, ambiguity in a statement can usually be cleared up by a little context. You might be able to mix around some prepositional phrases in a fun way to make it sound silly, use a homonym that someone misreads, or place an adverb in just the wrong place, but ultimately a human brain is parsing it. What you were trying to convey will probably come across just fine.

This is not the case with a programming language parser. The “context” is typically one “token”, which can be a single word or number, or possibly a large collection of tokens that has been “reduced” into a single thing. It is that reduction that makes it all possible, because it effectively boils down large amounts of context into a single, obvious nugget of data. For example, if you started reading a sentence from somewhere in the middle and the first word you read was “wind”, knowing just the previous word might be enough to understand it properly. Being able to discern whether it is acting as a verb or a noun might be enough to guess whether you’re talking about a watch or a sailboat, but it isn’t as good as knowing that the entire chapter you’ve been reading is about Mahjong. However, if the whole context you received was “wind up”, you’d be lost. You might be able to prioritize watch over sailboat, but you couldn’t be sure it wasn’t somehow talking about an airplane’s lift.

Alright, enough rambling; time for more rambling.

The best and worst part about this whole programming language business up to this point has been the grammar; good ol’ yapParser.y (don’t be fooled by the extension; the format is Lemon, not yacc). It pushes your programmer brain in directions it most likely hasn’t been before, which offers lots of really frustrating lows and doing-a-victory-lap highs. I can’t really explain what it is about building a complicated grammar that makes me feel like that, only that the last time I can remember feeling the same way might have been when I learned about recursion a million years ago. It offers this opportunity for terseness and elegance, and plenty of rope to hang yourself with. I’m sure it has to do with the feeling that I am solving a really hard puzzle.

This is where the dreaded “parsing conflict” error comes in. You sit down to add a few “minor” features, which might require tweaking your grammar a little. You add a seemingly innocuous line to your grammar file, only for Lemon to spit back “18 parsing conflicts”, which stops you dead in your tracks. You look over the output file (which is meaningless the first time you read it, but so wonderful once you “get it”) and search for “conflict” to find the bad states. You then think about why the parser could possibly be confused when it reads a comma, or a parenthesis, or everything (if you’ve really screwed it up). Sometimes you get the Eureka moment you were hoping for and adjust your grammar accordingly, and sometimes you put the file down and walk away for a few months.

I did the latter.

The grammar tweak I wanted sounded pretty simple. In many places in my grammar, I allowed for a block of code to exist, such as the body of an if or while statement. However, unlike Every Other Language On The Planet, I didn’t allow a single statement to take the place of a block. This can lead to an abundance of very unnecessary braces, or angry programmers who are used to being able to do this in every language other than Perl (postcondition doesn’t count!). I figured I’d add this in by making an intermediate token defined as either a statement or a statement_block.

Noooooooope. A billion zillion million parsing conflicts. I studied the output and quickly realized that by doing this, I’d done the worst possible thing to the parser: created an ambiguity that touched pretty much every rule in the grammar. I ended up commenting out the connection in the code along with a pretty weak TODO to come back to it “later”.

The reason these kinds of problems are so difficult is that you’ve most likely made a misstep somewhere completely different than the place you’re modifying, and it is coming back to bite you now. In my case, it turns out that the Big Stupid Mistake I was making was that I allowed my expression_list token to be “empty”. This seemed really convenient at the time; the grammar that handled function calls magically took zero arguments peacefully, my empty statements (just a semicolon) magically worked, amongst other things. What I didn’t realize was that I was adding an incredible amount of ambiguity to the parser, and I was just really, really lucky that I hadn’t been burnt too hard up until this point. I mean, how could I expect the parser to function properly if every possible location during the parse might be an empty list of expressions?
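In Lemon terms, the mistake and the fix look roughly like this. These rule names are illustrative, not lifted from the actual yapParser.y:

```
/* Before: an expression_list may be empty, so the parser must consider
   one lurking at every possible position. */
expression_list ::= .
expression_list ::= expression .
expression_list ::= expression_list COMMA expression .

/* After: the list itself is never empty; only the places that truly
   allow emptiness (like a call's argument list) say so explicitly. */
expression_list ::= expression .
expression_list ::= expression_list COMMA expression .
arg_list ::= .
arg_list ::= expression_list .
```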

This fixed most of my problems, but not all of them. I still had an issue with parentheses. Let’s take a line of code as a starting point for the issue:

var v = someFunc(a, b, c) * 7 * (3 + 4);

There are two parenthesized sets of tokens in that line of code, and they have wildly different meanings. The first parens serve as packaging for the arguments to a function call, and (in fact) cause the function call to occur in the first place. Removing them would cause that code to yield a reference to the function instead of actually calling it. The second set of parens is merely there to provide grouping and implicitly poke at the order of operations. Other than the fact that the parser orders the nodes in the syntax tree a little differently, the actual nodes themselves are the same.

The issue is with reduction. If you set up your grammar to reduce (3 + 4) into a single token of type “expression_list” or perhaps even “paren_expr_list”, that reduced token might not actually parse into a function call anymore, that is, unless your function call grammar was crafted with this reduction in mind (which mine wasn’t). You end up with a layering of bad grammar on bad grammar, and months away from your codebase to “recuperate”.

The fix ended up being a lucky series of googlings mixed with some more luck, along with the Lemon parser’s %fallback directive. I decided that the only way I was going to solve these grammar problems was if I could see an example Lemon grammar for a language that actually had some complexity to it (read: not a grammar for a calculator), and actually looked more like a regular programming language (read: not SQL). Also, the Lemon parser generator is not too wonderful with its documentation, no offense to the authors. It is a wonderful, lean-and-mean piece of code, and there are plenty of comments in the few Lemon grammars out there, but some features can only be found out about via word-of-mouth-or-newsgroup-or-IRC, and it just so happens that %fallback is one of them.

I saw on the Lua wiki that someone had taken a shot at making a Lemon grammar for Lua (Listing 1). As Yap looks like it is going to end up as some bizarre mashup of Lua with hints of Javascript in there, this was quite an exciting read. However, one of the lines in the grammar was this:

%fallback OPEN '(' .

… and the token OPEN being used later:

prefixexp ::= OPEN exp ')' .

The magic of this rule was completely lost on me, so I did what any sane geek would do: I googled it. I ended up here, which is just some old sqlite mailing list discussion about parsing ambiguity, with the fix being to use the %fallback directive. It was then that I had my Eureka moment and could fix things.

What the %fallback directive actually does is provide a safety net / second chance for a token when the parser chokes on it. In the case of the mailing list question, the author wanted to attempt to parse the second token in his rule as a string if the rule failed doing the “right” thing. However, in the Lua grammar example, the fallback token was “OPEN”, which is a new token! The big magic trick here is that the parenthesis needed to be two things at once: sometimes just a boring grouping mechanism, and sometimes something more (such as a function call). Since a fallback doesn’t occur until a parse failure is imminent, you can actually “fall back” to an alternate name for your own token, implicitly setting precedence between two usages of the same token and eliminating the ambiguity.
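Paraphrasing the Lua grammar’s trick (the rule names here are illustrative, not copied from the original):

```
%fallback OPEN '(' .

/* '(' first tries its primary job: delimiting a call's arguments. */
call ::= prefixexp '(' arg_list ')' .

/* Only when that parse is about to fail does '(' fall back to OPEN,
   the plain grouping parenthesis. */
prefixexp ::= OPEN expression ')' .
```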

This tool seems to convert Lemon grammars into some other grammar format, and explicitly doesn’t support %fallback. The docs state “…then you can manually define an nonterminal that does what the %fallback directive would have done”. This comment suggests that the %fallback mechanism is simply a clever way to perform something that can actually be done by arranging your grammar’s nonterminals differently, which I completely believe. It made me think of how you can convert recursive algorithms to iterative ones by just dropping in a stack, but that you might lose some of the “elegance”. It also made me realize that I am probably not experienced enough in the ways of programming language grammars to perform this without %fallback, and that is okay. I have sweet, sweet single-statement blocks now, and am blissfully ignorant of all the parsing conflicts I am yet to have. I already know I am hosed on the ternary operator in the same way Lua is (and for the same reason), but other than that, I think it might be time to burn through the Yap Language Issues and try to 1.0/release this codebase.

My writer’s block is over, in that I chose instead to implement the guts with some “reasonable” keywords; I can continue agonizing over the “perfect” keywords after it’s all done. I am pretty pleased with how everything turned out, even if I might change the keywords themselves ten more times over the life of Yap’s early development.

with

My goal for this keyword was to offer a way to declare a block of function declarations and variables, but without the baggage of having to do it “all at once”. In C/C++, there are architectural reasons for this missing functionality. In Javascript, the object notation allows for a “similar” feeling at the expense of having to chain all of the statements together with colons and commas. Something about it feels strange.

I decided that what I really wanted was a way to reference a variable as the focal point of an arbitrary block of declarations, and have the interpreter “know what to do” every time it saw a new variable declared at that scope level. This allows for the creation of a new/derived object to be decoupled from the overrides or extensions you plan to give it, without lots of boilerplate name prefixes.

Note: Yes, I’m aware this keyword does something else in Python. I currently don’t care, and by the time I do, I will have probably changed it in Yap anyhow.

inherits

I already hate this choice, as it is ambiguous whether it is setting the ancestor of an object or testing for a match. I actually named it “from” at first, but it looked a little bit silly when not chained together with other keywords (“car from vehicle;”). Either way, it sets up an object’s ancestor, and is pretty straightforward.

I haven’t bothered to implement the testing of the ancestor, but when I do, I will most likely rename this keyword to minimize the ambiguity. It certainly seems like a legitimate thing to want to type “if (car inherits vehicle)”, and that is a problem.

A few “minor” things make for quite a large shift in the look and feel of the language. It now feels a lot like Javascript, only instead of using “this”, it passes the object as the first parameter à la Python, but only when you specifically request it, like Lua (using a colon).

I’m not in love with it yet. The interpreter currently requires that if you implement init() for an object, it has to return itself. I thought it’d be neat to allow for generator objects, but in practice it is going to seem like a boilerplate nuisance. I used to have a “magical” workaround, but boilerplate might be better than magic. I also want to implement some intrinsic iterators that allow for things like ipairs() and/or members() for objects.

p.s. I had to switch the syntax highlighter from “python” to “javascript” for it to colorize it well.

I decided on prototypal inheritance for Yap a few weeks ago, as I want to keep the interpreter as simple as possible and classical inheritance adds lots of plumbing. Anyway, to pull it off, I don’t NEED any new keywords, as I could just offer some intrinsic functions that set up and test for inheritance (such as inherits() and instanceof()). However, I want the code to flow a little better than that, and I want to choose keywords that can be combined together and not sound completely ridiculous.

On top of all of this, I’d like to make a keyword whose sole job is to provide a block for instantiating functions, much like how a class block would act in another language. Any functions defined in that block would be added to the object referenced in the block’s creation.

Every time my brain has gone idle in the last few weeks, it slowly works its way back to playing madlibs with two or three possible constructs that I want to be valid. I figured I’d share this game of madlibs in case someone out there has a brilliant combination of keywords that just “feels right”. Here we go:

# __A__ : a keyword which creates a block of functions to attach to an
#         expression
# __B__ : a keyword which causes its LHS to be inherited from its RHS
#         (and if LHS is null, make it a fresh object)
# __C__ : a keyword which returns boolean true if LHS inherits from RHS
#         (must be different from __B__!)

# The __C__ keyword isn't going to be listed in the madlibs below, but it
# must fit thematically with them, and would look something like:

# if Car __C__ Vehicle ...

# Pretend that I've implemented an object named Vehicle, and I would like
# to override a few functions in an inherited object named Car. Here are the
# madlibs that must ALL "not feel terrible":

# Note: the __A__ block will be allowed multiple times for a single object,
# so don't choose something that sounds too "final".

Dict support, finally. My arrays and dicts currently share the same syntax, like Python and PHP do. I now firmly believe this is not done in languages for “consistency”, but more because it’s really easy to be lazy and not add new grammar.