I'm having some trouble getting control structures to work in my scripting system, mainly if statements for now.
My main idea was this. Say I have a piece of code like this:
print "hello"; if(name == "Zotoaster"){ print "Cool name"; } print "done..";
Every time I find an if, I take what's inside the braces and use that as the script, which means nested ifs work too. It all works fine, except for printing what comes after the if statement, in this case the print "done.."; part.
Does anyone have an idea how to fix this, or a better idea of how to make it work?
Thanks

Right, so you really do not have a clue. I really advise that you get a book on programming language development. I'll write a quick primer for you, though.

The basic operations behind any programming language are as follows: the input file is read, character by character, and split up into tokens (characters or groups of characters which have particular meanings) by a lexer. The tokens are then transformed into an abstract syntax tree by a parser. This transformation is described by means of a grammar, which is a complete description of the language syntax.

Note — Some modern parsers are able to handle tokenizing themselves instead, and some complex languages (such as C++) must interpret the language at the same time to build the AST.
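To make the lexing step concrete, here is a minimal hand-written sketch in OCaml that handles just enough to tokenize the snippet from the question. The token type and the function names (`token`, `tokenize`, `is_alpha`) are made up for this illustration, and a real lexer would also deal with numbers, comments and escape sequences.

```ocaml
(* Hypothetical token type, just large enough for the question's snippet. *)
type token =
  | IDENT of string   (* print, if, name, ... *)
  | STRING of string  (* "hello" *)
  | SYM of string     (* ( ) { } ; == and any other single character *)

let is_alpha c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c = '_'

(* Split a source string into tokens, reading it character by character. *)
let tokenize (src : string) : token list =
  let n = String.length src in
  (* Advance j while the predicate holds; return the first index where it fails. *)
  let rec scan p j = if j < n && p src.[j] then scan p (j + 1) else j in
  let rec go i acc =
    if i >= n then List.rev acc
    else
      match src.[i] with
      | ' ' | '\t' | '\n' | '\r' -> go (i + 1) acc
      | '"' ->
        let j = scan (fun c -> c <> '"') (i + 1) in
        if j >= n then failwith "unterminated string literal";
        go (j + 1) (STRING (String.sub src (i + 1) (j - i - 1)) :: acc)
      | c when is_alpha c ->
        let j = scan is_alpha i in
        go j (IDENT (String.sub src i (j - i)) :: acc)
      | '=' when i + 1 < n && src.[i + 1] = '=' -> go (i + 2) (SYM "==" :: acc)
      | c -> go (i + 1) (SYM (String.make 1 c) :: acc)
  in
  go 0 []
```

Calling `tokenize` on the if statement from the question yields the identifiers, string literals and punctuation as separate tokens, which is the form a parser wants to consume.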

Once the abstract syntax tree exists, the compiler performs various static bindings (linking the usage and declaration of variables and functions) and checks (type-safety, existence of the objects), and then moves on to the transform part.

Note — Again, some languages don't do static binding at all (PHP, PERL, SH) while most others delay the actual binding another step (a dedicated linking step). Also, not all perform any checking whatsoever. On the contrary, some perform complete proofs about the language.

The transformation consists in taking the abstract syntax tree, altering it, and outputting an executable graph. This might require several successive transforms (moving from front-end to back-end to binary in GCC), or it might be immediate (for example, in λ-calculus, the AST is executed almost as-is). The result is the executable structure, which is then passed on to an evaluator.

The evaluator is either the processor (when the executable structure is machine code), a virtual machine (when the executable structure is bytecode), or an evaluator defined by means of operational semantics. Note that it's perfectly plausible to define an evaluator through operational semantics, and then implement it as a virtual machine, or by compiling the AST to machine code.

Now, this is the simple part. The first difficult part is providing a grammar which can handle what you want to do. The typical imperative language grammar considers blocks to be statements or lists of statements, and statements to be function calls, assignments, or control sequences (which may include blocks). So, the AST of your example would be as such:
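As a rough sketch, the tree for the three statements in the question could be written as an OCaml value like the one below. The type and constructor names (`expr`, `stmt`, `Print`, `If`, and so on) are assumptions chosen for illustration, not taken from any particular implementation.

```ocaml
(* Expressions: things which have a value. *)
type expr =
  | Var of string
  | Str of string
  | Equal of expr * expr

(* Statements: a block is simply a list of statements. *)
type stmt =
  | Print of expr
  | If of expr * stmt list   (* condition, then the block between the braces *)

(* print "hello"; if(name == "Zotoaster"){ print "Cool name"; } print "done.."; *)
let example : stmt list =
  [ Print (Str "hello");
    If (Equal (Var "name", Str "Zotoaster"), [ Print (Str "Cool name") ]);
    Print (Str "done..") ]
```

The point to notice is that the if statement is one element of the outer list, with its own nested list, so the final print is still sitting in the outer list once the if has been dealt with.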

A typical operational semantics definition of such a language is to consider that you are executing a sequence of statements. This sequence is called, say, P (program). Every time you execute a statement, you alter the state of the program (its memory, its output, its input), which we'll call S (state). Then, the execution function here consists in telling the program what the final state is, given an initial state S and an initial program P.

So, here:

execute(program P, state S) is:

    execute([],S) = S                  // no code left? return the state

    execute([print x]::rest,S) =
        execute(rest,x::S)             // we "print x", thereby changing S,
                                       // and then determine what the rest of the program does

    execute([if(x) Y]::rest,S) =
        if eval(x,S)                   // Looking at the state of the program,
                                       // is the expression 'x' true or false?
        then execute(Y::rest,S)        // It's true: execute the if-statement code
                                       // 'Y' before the rest of the code
        else execute(rest,S)           // It's false, don't bother with 'Y' and move on
                                       // to the rest

Note that the execution function never forgets about 'rest', which is what code will be executed after the current statement, even if the current statement is an if-clause.
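The rules above translate almost line for line into OCaml. This is only a sketch under a few assumptions: the state is a record holding variable bindings and the output so far, every value is a string, the empty string counts as false, and all type and field names are invented for the example. The block `Y` is a list here, so "Y before the rest" becomes list concatenation.

```ocaml
type expr =
  | Var of string
  | Str of string
  | Equal of expr * expr

type stmt =
  | Print of expr
  | If of expr * stmt list

(* The state S: variable bindings plus the output so far (most recent first). *)
type state = { vars : (string * string) list; output : string list }

(* Evaluate an expression to a string; "1" and "" stand for true and false. *)
let rec eval (e : expr) (s : state) : string =
  match e with
  | Str v -> v
  | Var name -> (try List.assoc name s.vars with Not_found -> "")
  | Equal (a, b) -> if eval a s = eval b s then "1" else ""

let rec execute (p : stmt list) (s : state) : state =
  match p with
  | [] -> s                                   (* no code left: return the state *)
  | Print x :: rest ->
    execute rest { s with output = eval x s :: s.output }
  | If (x, y) :: rest ->
    if eval x s <> "" then execute (y @ rest) s  (* run Y, then the rest *)
    else execute rest s                          (* skip Y, go to the rest *)
```

Running this on the program from the question with `name` bound to "Zotoaster" produces hello, Cool name, done.. in that order; with any other name, the middle line is skipped and done.. is still printed.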

Now, once this is known, you translate the above execution system into whatever language and/or VM you wish to use.

Well, it's not so much that you're far behind. Language semantics is an entire field of study which has been around for quite a few decades now, so people have naturally evolved tools and mindsets towards this.

Look at it this way. You don't use punch cards anymore to give orders, and your CPU doesn't respond with a 100dpi printer: your computer has a screen and keyboard and mouse, because these are simpler to use when communicating with the machine. You don't write all your code in assembly language anymore, you write most simple programs in medium-level or high-level languages because it's simpler to use and it saves time.

In the same way, language developers and semanticians have their own tools: lexer generators (because writing an equivalent lexer by hand takes years of experience and a couple of days), parser generators (because writing an equivalent parser takes years of experience and a couple of weeks), adapted tree manipulation languages (just because C++, C# or Java definitely won't cut it until they have access to pattern matching and/or garbage collection), as well as many human-spoken languages to communicate the various concepts (tokens, grammar, operational/denotational semantics).

The idea is this: where the average language can be written by hand in a month from scratch, these tools allow us to whip up a working prototype in half a day, which leaves us the rest of the month to think about the actually smart things (such as improving the speed of compiled code by 10% in addition to what the optimizer may do, or perhaps writing a program that proves your code doesn't contain any bugs, or allowing the programmer to express 1500 lines of C in 10 lines of the prototype language, or allowing the code to be distributed to a hundred machines at compile time, and so on). Besides, the tools are so flexible that we can change the grammar in five seconds tops. Which is why, although I consider reinventing these wheels a fine occupation, I usually advise people to use the tools if they intend to get work done quickly.

So, let's consider your example language, and see how it would fit. This is more of an untested outline, but the idea is here. First, you seem to want a language that has a C-like syntax, and we'll add a few concepts of our own:

Imperative language: functions are sequences of statements. Statements are either control structures (if, while, return), function calls, printing, reading, or assigning a variable a value.

Our values are NULL, strings, integers and arrays. We'll go for a PHP-like approach where arrays are in fact hash tables, and conversion between strings and integers is automatic. Variables can be used before being initialized (they're initially NULL).

We use no overloading (since we have a single type) and I'm too lazy to do argument-count-based overload.

The language is interpreted by a virtual machine. It is not type-checked at all.

So, the first step is to decide which grammar we'll use. The typical grammar is split into three sections: expressions (things which have a value), statements (things which perform actions) and definitions (things which explain what a function is). Here, I'll use Menhir, which is a very nice parser generator for OCaml, to describe the grammar. The idea is that I define non-terminals to represent each of the "entities" that can appear in the language, using other non-terminals (for instance, an expression can be an expression plus another expression) and terminals (the symbols read from the file, also called tokens).

Sure, there are several niceties missing above, because I'm short on time (for instance, we don't have a unary minus, or an 'else' clause, or a 'break' statement, or a 'for' loop, or a 'switch' statement, oh well). But these are quickly added with minor alterations to the grammar. Also, I might have overlooked some stupid mistakes, but well, that happens.
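To show what "one non-terminal per entity" means without relying on a parser generator, here is a small hand-written recursive-descent sketch in plain OCaml. It is not a Menhir grammar, only an illustration of the same structure: each function plays the role of a non-terminal, and the tokens are the terminals. The token type, the AST types and every function name are assumptions made for this example, and it covers only print and if.

```ocaml
type token = IDENT of string | STRING of string | SYM of string

type expr = Var of string | Str of string | Equal of expr * expr
type stmt = Print of expr | If of expr * stmt list

exception Parse_error of string

(* Consume one expected symbol token or fail. *)
let expect sym = function
  | SYM s :: rest when s = sym -> rest
  | _ -> raise (Parse_error ("expected " ^ sym))

(* atom := IDENT | STRING *)
let parse_atom = function
  | IDENT x :: rest -> (Var x, rest)
  | STRING s :: rest -> (Str s, rest)
  | _ -> raise (Parse_error "expected an expression")

(* expr := atom | atom "==" atom *)
let parse_expr toks =
  let (lhs, rest) = parse_atom toks in
  match rest with
  | SYM "==" :: rest ->
    let (rhs, rest) = parse_atom rest in
    (Equal (lhs, rhs), rest)
  | _ -> (lhs, rest)

(* stmt := "print" expr ";" | "if" "(" expr ")" "{" stmts "}" *)
let rec parse_stmt = function
  | IDENT "print" :: rest ->
    let (e, rest) = parse_expr rest in
    (Print e, expect ";" rest)
  | IDENT "if" :: rest ->
    let (cond, rest) = parse_expr (expect "(" rest) in
    let (body, rest) = parse_stmts (expect "{" (expect ")" rest)) in
    (If (cond, body), expect "}" rest)
  | _ -> raise (Parse_error "expected a statement")

(* stmts := stmt*   (stops at "}" or at the end of the input) *)
and parse_stmts toks =
  match toks with
  | [] | SYM "}" :: _ -> ([], toks)
  | _ ->
    let (s, rest) = parse_stmt toks in
    let (ss, rest) = parse_stmts rest in
    (s :: ss, rest)
```

Each parsing function returns the tree it built together with the tokens it did not consume, which is how the statements after a closing brace end up back in the enclosing block. A generated parser does the same bookkeeping for you from the grammar description.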

Now, the next step is writing a lexer, which we'll plug into the parser. Ideally, I'll be using ocamllex, which is a lexer generator for OCaml:

// --- We define the different tokens as regular expressions
// Between braces, the associated token

This lexer should happily transform any input file (or standard input) into a sequence of parser-recognized tokens, which we then plug into the parser to construct our AST! So, there we go: a few hundred lines of description and we've got a near-optimal lexer and parser.

Now, the fun part is defining the AST. We'll actually define a few distinct parts: statements, assignable expressions, and expressions. These work differently: statements are executed, assignable expressions are modified and expressions are evaluated. Our abstract tree should reflect this. I'll also write the generation functions which are used by the parser, so everything works.
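A possible shape for those three parts, plus the runtime values, is sketched below. Only `AIndexed`, `Val_array` and `Val_Null` appear elsewhere in this post; every other constructor and function name is an assumption made for the sketch.

```ocaml
(* Runtime values: NULL, strings, integers and arrays (hash tables). *)
type value =
  | Val_Null
  | Val_int of int
  | Val_string of string
  | Val_array of (value, value) Hashtbl.t

(* Expressions are evaluated to a value. *)
type expr =
  | EConst of value
  | ERead of assignable            (* read a variable or an array cell *)
  | EAdd of expr * expr
  | EEqual of expr * expr
  | ECall of string * expr list

(* Assignable expressions are modified. *)
and assignable =
  | AVar of string
  | AIndexed of assignable * expr  (* a[i], where a is itself assignable *)

(* Statements are executed. *)
type stmt =
  | SPrint of expr
  | SAssign of assignable * expr
  | SIf of expr * stmt list
  | SWhile of expr * stmt list
  | SReturn of expr

(* Generation functions, in the style a parser's semantic actions would call. *)
let mk_if cond body = SIf (cond, body)
let mk_assign target e = SAssign (target, e)
```

Keeping assignable expressions in their own type means the left-hand side of an assignment cannot be an arbitrary expression such as a sum, and the type checker of the host language enforces that for free.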

Now, our AST is up and running, and can be generated in a breeze simply by launching the parser on a token stream. We can start defining operational semantics for the language. We basically want to implement two functions: eval and exec. Eval reads in an expression, and returns the value of that expression. Exec runs a list of statements, altering the world along the way, and returns the value returned by the function either when a "return" is encountered, or when it runs out of statements (implicit 'return;'). First, I'll assume for the sake of the eval function that we have access to a 'call(f,args)' function which calls a function with arguments and returns the return value. So, let's write that function:

The choices of interaction are fairly arbitrary (I decided that an array automatically turned into a 'null' whenever it was involved in an operation, except for equality) but the functions are there and can be amended to perform whatever you wish. The point is, describing the operations in a loosely typed language such as this one is necessarily a long deal, since there are N² combinations to consider for N types.
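A cut-down sketch of such an evaluation function, with only addition and equality, might look like the following. The `call` function is passed in as a parameter to stand for the assumed 'call(f,args)', and the type names, the `to_int` coercion and the treatment of mixed-type equality are arbitrary choices for this example.

```ocaml
type value =
  | Val_Null
  | Val_int of int
  | Val_string of string
  | Val_array of (value, value) Hashtbl.t

type expr =
  | EConst of value
  | EVar of string
  | EAdd of expr * expr
  | EEqual of expr * expr
  | ECall of string * expr list

(* PHP-like coercion: anything that is not a number counts as 0. *)
let to_int = function
  | Val_int n -> n
  | Val_string s -> (match int_of_string_opt s with Some n -> n | None -> 0)
  | Val_Null | Val_array _ -> 0

let of_bool b = Val_int (if b then 1 else 0)

(* [call] stands for the assumed call(f,args) function. *)
let rec eval call (vars : (string, value) Hashtbl.t) (e : expr) : value =
  match e with
  | EConst v -> v
  | EVar x -> (try Hashtbl.find vars x with Not_found -> Val_Null)
  | EAdd (a, b) ->
    (match eval call vars a, eval call vars b with
     (* An array involved in an operation turns the result into null. *)
     | Val_array _, _ | _, Val_array _ -> Val_Null
     | va, vb -> Val_int (to_int va + to_int vb))
  | EEqual (a, b) ->
    (match eval call vars a, eval call vars b with
     | Val_array x, Val_array y -> of_bool (x == y)  (* same table? *)
     | Val_array _, _ | _, Val_array _ -> of_bool false
     | Val_int x, Val_int y -> of_bool (x = y)
     | Val_string x, Val_string y -> of_bool (x = y)
     | Val_Null, Val_Null -> of_bool true
     | va, vb -> of_bool (to_int va = to_int vb))    (* mixed types: coerce *)
  | ECall (f, args) -> call f (List.map (eval call vars) args)
```

Even with two operators and four value types, most of the code is case analysis on type combinations, which is the N² effect mentioned above.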

Once the evaluation function is done, we can write a mutually recursive execution function. The execution function concerns itself with one call only. It keeps a list of stack positions, represented as the code left to be executed within each block it is present in. It then pops an element from the topmost block, executes it, and moves on to the rest (evaluating objects in the process, and altering the memory state if necessary).
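The "list of code left in each block" idea can be sketched as follows. To keep the example short it assumes integer-only values and no function calls, so eval and exec are not mutually recursive here; output goes to a Buffer, and all type and function names are made up for the illustration.

```ocaml
type expr = Const of int | Var of string | Less of expr * expr | Add of expr * expr

type stmt =
  | Assign of string * expr
  | Print of expr
  | If of expr * stmt list
  | While of expr * stmt list
  | Return of expr

let rec eval vars = function
  | Const n -> n
  | Var x -> (try Hashtbl.find vars x with Not_found -> 0)
  | Less (a, b) -> if eval vars a < eval vars b then 1 else 0
  | Add (a, b) -> eval vars a + eval vars b

(* [stack] holds, for each enclosing block, the code left to run in it. *)
let rec exec vars out (stack : stmt list list) : int =
  match stack with
  | [] -> 0                                   (* implicit 'return;' *)
  | [] :: outer -> exec vars out outer        (* block finished: pop it *)
  | (s :: rest) :: outer ->
    (match s with
     | Return e -> eval vars e
     | Assign (x, e) ->
       Hashtbl.replace vars x (eval vars e);
       exec vars out (rest :: outer)
     | Print e ->
       Buffer.add_string out (string_of_int (eval vars e) ^ "\n");
       exec vars out (rest :: outer)
     | If (c, body) ->
       if eval vars c <> 0 then exec vars out (body :: rest :: outer)
       else exec vars out (rest :: outer)
     | While (c, body) ->
       (* Keep the loop itself under its body so it is re-tested afterwards. *)
       if eval vars c <> 0 then exec vars out (body :: (s :: rest) :: outer)
       else exec vars out (rest :: outer))
```

When a block runs out of statements it is popped and execution resumes with whatever was left in the enclosing block, so the code after an if or while is never lost. A `Return` simply drops the whole stack.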

I'll leave it to you to choose how to implement print_val and read_val, and I'll concentrate on the assignment function instead. The objective of this function is to alter the variable or object described by its second argument (as part of the environment passed as its first argument) and set its value to its third argument. Since we want the left-hand object to be created on demand (even when it's an array-of-array-of-array-of-...), we'll write a helper which builds the object if it doesn't exist, and returns both its current value and a function to modify it.

    | AIndexed (a,i) ->
      (* Evaluate the index *)
      let i = eval vars i in
      (* Build whatever is below *)
      let (value,modify) = helper a in
      begin match value with
        (* Value is already an array, so keep it and add to it! *)
        | Val_array a ->
          (find_var a i),(fun x -> Hashtbl.remove a i; Hashtbl.add a i x)
        (* Value is not an array, we need to index into it so we alter it. *)
        | _ ->
          let a = Hashtbl.create 10 in
          modify (Val_array a);
          (Val_Null),(fun x -> Hashtbl.add a i x)
      end in

    (* Retrieve the modification function and call it on the value. *)
    snd (helper target) value
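For reference, here is one way the fragment above could be completed into something self-contained. The `AVar` case, the `find_var` helper, the stub `eval` (which only handles constants) and the use of string keys for the variable table are all assumptions filled in for this sketch, and `Hashtbl.replace` is used in place of the remove-then-add pair.

```ocaml
type value =
  | Val_Null
  | Val_int of int
  | Val_string of string
  | Val_array of (value, value) Hashtbl.t

(* Stub expression type and evaluator: constants only, for illustration. *)
type expr = EConst of value

type assignable =
  | AVar of string
  | AIndexed of assignable * expr

let eval _vars (EConst v) = v

(* Look a key up in a table, defaulting to NULL when it is absent. *)
let find_var tbl key = try Hashtbl.find tbl key with Not_found -> Val_Null

let assign vars target value =
  (* helper returns the current value of a target and a function to set it. *)
  let rec helper = function
    | AVar name ->
      (find_var vars name, fun x -> Hashtbl.replace vars name x)
    | AIndexed (a, i) ->
      let i = eval vars i in
      let (current, modify) = helper a in
      (match current with
       (* Already an array: keep it and write into it. *)
       | Val_array a -> (find_var a i, fun x -> Hashtbl.replace a i x)
       (* Not an array: replace it with a fresh one, then write into that. *)
       | _ ->
         let a = Hashtbl.create 10 in
         modify (Val_array a);
         (Val_Null, fun x -> Hashtbl.replace a i x))
  in
  snd (helper target) value
```

Assigning to something like t[1]["k"] when t does not exist yet first stores a fresh array in t, then a fresh array in t[1], and finally writes the value under "k".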

This should probably work. It'd probably take one more hour to correct all the various syntax errors, typos and typical mistakes (the code isn't tested), as well as binding code to execute the main function on startup and the usual command line arguments to specify which script to load. Perhaps adding a subscripting option for strings might be a good idea as well. Total time: three hours. Note that the values are passed by reference, but only arrays are mutable, so you can pass a mutable reference by wrapping it in an array, as such: