@bilabila/luaparse

luaparse

A Lua parser written in JavaScript, for my bachelor's thesis at Arcada.

Installation

Install through bower install luaparse or npm install luaparse.

Usage

CommonJS

var parser = require('luaparse');
var ast = parser.parse('i = 0');
console.log(JSON.stringify(ast));

AMD

require(['luaparse'], function(parser) {
  var ast = parser.parse('i = 0');
  console.log(JSON.stringify(ast));
});

Browser

<script src="luaparse.js"></script>
<script>
  var ast = luaparse.parse('i = 0');
  console.log(JSON.stringify(ast));
</script>

Parser Interface

Basic usage:

luaparse.parse(code, options);

The output of the parser is an Abstract Syntax Tree (AST) formatted in JSON.

The available options are:

wait: false Explicitly tell the parser when the input ends.

comments: true Store comments as an array in the chunk object.

scope: false Track identifier scopes.

locations: false Store location information on each syntax node.

ranges: false Store the start and end character locations on each syntax
node.

onCreateNode: null A callback which will be invoked when a syntax node
has been completed. The node which has been created will be passed as the
only parameter.

onCreateScope: null A callback which will be invoked when a new scope is
created.

onDestroyScope: null A callback which will be invoked when the current
scope is destroyed.

onLocalDeclaration: null A callback which will be invoked when a local
variable is declared. The identifier will be passed as the only parameter.

luaVersion: '5.1' The version of Lua the parser will target; supported
values are '5.1', '5.2', '5.3' and 'LuaJIT'.

extendedIdentifiers: false Whether to allow code points ≥ U+0080 in
identifiers, like LuaJIT does. See 'Note on character encodings' below
if you wish to use this option. Note: setting luaVersion: 'LuaJIT'
currently does not enable this option; this may change in the future.

The default options are also exposed through luaparse.defaultOptions, where
they can be overridden globally.

There is a second interface which might be preferable when using the wait
option.

var parser = luaparse.parse({ wait: true });
parser.write('foo = "');
parser.write('bar');
var ast = parser.end('"');

This would be identical to:

var ast = luaparse.parse('foo = "bar"');
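When the input arrives in pieces, the write/end interface can be wrapped in a small helper. The following is a minimal sketch: parseChunks is a hypothetical name, not part of luaparse, and the module is passed in as a parameter so that the helper does not depend on how luaparse was loaded.

```javascript
// Hypothetical helper, not part of luaparse.
// `luaparse` is the loaded module; `chunks` is an array of source fragments
// in the order they should be parsed.
function parseChunks(luaparse, chunks) {
  var parser = luaparse.parse({ wait: true });
  var last = chunks.length - 1;

  // Feed every fragment except the last one through write().
  for (var i = 0; i < last; i++) {
    parser.write(chunks[i]);
  }

  // end() takes the final fragment and returns the finished AST.
  return parser.end(last >= 0 ? chunks[last] : '');
}

// Usage, assuming luaparse is installed:
// var ast = parseChunks(require('luaparse'), ['foo = "', 'bar', '"']);
```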

AST format

If the following code is executed:

luaparse.parse('foo = "bar"');

then the returned value will be:

{
  "type": "Chunk",
  "body": [
    {
      "type": "AssignmentStatement",
      "variables": [
        {
          "type": "Identifier",
          "name": "foo"
        }
      ],
      "init": [
        {
          "type": "StringLiteral",
          "value": "bar",
          "raw": "\"bar\""
        }
      ]
    }
  ],
  "comments": []
}
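Because the AST is made of plain objects and arrays, it can be traversed without any help from the parser. The sketch below uses a hypothetical walk helper (not part of luaparse) that treats every object with a string type property as a node, and runs it over the AST shown above.

```javascript
// Hypothetical helper, not part of luaparse: call `visit` for every object
// that has a string "type" property, in depth-first order.
function walk(node, visit) {
  if (Array.isArray(node)) {
    node.forEach(function (child) { walk(child, visit); });
  } else if (node !== null && typeof node === 'object') {
    if (typeof node.type === 'string') visit(node);
    Object.keys(node).forEach(function (key) { walk(node[key], visit); });
  }
}

// The AST from the example above, written out as a literal.
var ast = {
  type: 'Chunk',
  body: [{
    type: 'AssignmentStatement',
    variables: [{ type: 'Identifier', name: 'foo' }],
    init: [{ type: 'StringLiteral', value: 'bar', raw: '"bar"' }]
  }],
  comments: []
};

var types = [];
walk(ast, function (node) { types.push(node.type); });
console.log(types.join(', '));
// Chunk, AssignmentStatement, Identifier, StringLiteral
```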

Note on character encodings

Unlike strings in JavaScript, Lua strings are not Unicode strings, but
bytestrings (sequences of 8-bit values); likewise, implementations of Lua
parse the source code as a sequence of octets. However, the input to this
parser is a JavaScript string, i.e. a sequence of 16-bit code units (not
necessarily well-formed UTF-16). This poses a problem of how those code
units should be interpreted, particularly if they are outside the Basic
Latin block ('ASCII').

Currently, this parser handles Unicode input by encoding it in WTF-8,
and reinterpreting the resulting code units as Unicode code points. This
applies to string literals and (if extendedIdentifiers is enabled) to
identifiers as well. Lua byte escapes inside string literals are interpreted
directly as code points, while Lua 5.3 \u{} escapes are similarly encoded
as UTF-8 code units, which are then reinterpreted as code points. It is as if
the parser input was being interpreted as ISO-8859-1, while actually being
encoded in UTF-8.
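For well-formed input, the effect of this transformation can be reproduced in Node.js as shown below. This is a sketch rather than the parser's own code: toByteString is a hypothetical name, and Buffer replaces unpaired surrogates with U+FFFD, whereas WTF-8 would preserve them.

```javascript
// Hypothetical helper, not part of luaparse: encode a JavaScript string as
// UTF-8, then reinterpret each resulting byte as one code point in the range
// U+0000 to U+00FF. Uses the Node.js global Buffer; only equivalent to the
// WTF-8 behaviour described above for well-formed input.
function toByteString(str) {
  return Buffer.from(str, 'utf8').toString('latin1');
}

var value = toByteString('💩');
console.log(value === '\u00f0\u009f\u0092\u00a9'); // true
console.log(value.length);                         // 4
```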

This ensures that no otherwise-valid input will be rejected due to encoding
errors. Assuming the input was originally encoded in UTF-8 (which includes
the case of only containing ASCII characters), it also preserves the following
properties:

String literal nodes representing the same string value in Lua (and
identifier nodes, if extendedIdentifiers is enabled) will have the same
interpretation in the AST: e.g. the Lua literals '💩', '\u{1f4a9}' and
'\240\159\146\169' will all have "\u00f0\u009f\u0092\u00a9" in their
.value property, and likewise local 💩 will have the same string in
its .name property.

The .length property of decoded string values in the AST is equal to
the value that the # operator would return in Lua.

Maintaining those properties makes the logic of static analysers and code
transformation tools simpler. However, it poses a problem when displaying
strings to the user and serialising AST back into a string; to recover the
original bytestrings, values transformed in this way will have to be encoded
in ISO-8859-1.
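In Node.js, that encoding step might look like the following sketch. toBytes is a hypothetical helper name, and decoding the result as UTF-8 is only meaningful if the original source was UTF-8.

```javascript
// Hypothetical helper, not part of luaparse: recover the original bytes of a
// decoded AST string by encoding it as ISO-8859-1 (one byte per code point).
// Uses the Node.js global Buffer.
function toBytes(value) {
  return Buffer.from(value, 'latin1');
}

var bytes = toBytes('\u00f0\u009f\u0092\u00a9');
console.log(bytes);                  // <Buffer f0 9f 92 a9>
console.log(bytes.toString('utf8')); // 💩 (assuming the source was UTF-8)
```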

Other solutions to this problem may be considered in the future. Some of
them have been listed below, with their drawbacks:

A mode that instead treats the input as if it were decoded according
to ISO-8859-1 (or the x-user-defined encoding)
and rejects code points that cannot appear in that encoding; may be
useful for source code in encodings other than UTF-8

Still tricky to get the semantics right

Using an ArrayBuffer or Uint8Array for source code and/or string
literals

May fail to be portable to older JavaScript engines

Cannot be (directly) serialised as JSON

Values of those types are fixed-length, which makes manipulation
cumbersome; they cannot be incrementally built by appending.

They cannot be used as keys in objects; one has to use
Map and WeakMap instead

Using a plain Array of numbers in the range [0, 256)

Memory-inefficient

May bloat the JSON serialisation considerably

Cannot be used as keys in objects either

Storing string literal values as ordinary String values, and requiring that
escape sequences in literals constitute well-formed UTF-8; an exception
is thrown if they do not

UTF-8 chauvinism; imposes semantics that may be unwanted

Reduced compatibility with other Lua implementations

Like above, but instead of throwing an exception, ill-formed escapes are
transformed to unpaired surrogates, just like Python's surrogateescape
encoding error handler

Destroys the property that ("\xc4" .. "\x99") == "\xc4\x99"

If the AST is encoded in JSON, some JSON libraries may refuse to parse it

Custom AST

The default AST structure is somewhat inspired by the Mozilla Parser API, but
can easily be overridden to customize the structure or to inject custom logic.

luaparse.ast is an object containing all functions used to create the AST. If,
for example, you wanted to trigger an event on node creation, you could use the
following:

var luaparse = require('luaparse'),
    events = new (require('events').EventEmitter);

Object.keys(luaparse.ast).forEach(function(type) {
  var original = luaparse.ast[type];
  luaparse.ast[type] = function() {
    var node = original.apply(null, arguments);
    events.emit(node.type, node);
    return node;
  };
});
events.on('Identifier', function(node) { console.log(node); });
luaparse.parse('i = "foo"');

This is only an example to illustrate what is possible. It might not suit your
needs, as the end location of the node has not yet been determined at that
point. If you want events, you should use the onCreateNode callback instead.
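A minimal sketch of the onCreateNode approach follows. collectIdentifiers is a hypothetical helper name, and the luaparse module is passed in as a parameter rather than loaded inside the function.

```javascript
// Hypothetical helper, not part of luaparse: collect the names of all
// Identifier nodes in `code` by listening for completed syntax nodes.
function collectIdentifiers(luaparse, code) {
  var names = [];
  luaparse.parse(code, {
    // Invoked once for every completed node, with the node as only argument.
    onCreateNode: function (node) {
      if (node.type === 'Identifier') names.push(node.name);
    }
  });
  return names;
}

// Usage, assuming luaparse is installed:
// var names = collectIdentifiers(require('luaparse'), 'i = "foo"');
```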

Lexer

The lexer used by luaparse can be used independently of the recursive descent
parser. The lex function is exposed as luaparse.lex(), and each call returns
the next token until EOF is reached.

Each token consists of:

type expressed as an enum flag which can be matched with luaparse.tokenTypes.

value

line, lineStart

range can be used to slice out raw values, e.g. lexing foo = "bar" yields a
StringLiteral token with the value bar, whereas slicing the source with the
token's range returns "bar".
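The difference between value and range can be illustrated without running the lexer. In the sketch below, the token is a hand-written stand-in rather than real lexer output, rawText is a hypothetical helper, and range is assumed to hold a start offset and an exclusive end offset into the source string.

```javascript
var source = 'foo = "bar"';

// Hand-written stand-in for the StringLiteral token described above.
// Assumption: range is [start, end) in character offsets.
var token = { value: 'bar', range: [6, 11] };

// Hypothetical helper: slice the raw source text covered by a token.
function rawText(source, token) {
  return source.slice(token.range[0], token.range[1]);
}

console.log(token.value);            // bar
console.log(rawText(source, token)); // "bar"
```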

Support

Quality Assurance

TL;DR: simply run make qa. This runs all quality assurance scripts, but
assumes that your environment is already set up correctly.

Begin by cloning the repository and installing the development dependencies
with npm install. To test AMD loading for browsers you should run bower install which will download RequireJS.

The luaparse test suite uses testem as a test runner, which makes it easy to
run the tests on different JavaScript engines or even in locally installed
browsers. The default runner currently uses PhantomJS and node, so you should
have PhantomJS installed when using make test or npm test.

Test runners

make test uses PhantomJS and node.

make testem-engines uses PhantomJS, node, narwhal, ringo, rhino and rhino
1.7R5. This requires that you have the engines installed.

make test-node uses a custom command line reporter to make the output
easier on the eyes while practicing TDD.

By installing testem globally you can also run the tests in a locally
installed browser.

Other quality assurance measures

You can check function complexity with complexity-report by running
make complexity-analysis.

Running make coverage will generate the coverage report.
To simply check that all code has coverage you can run make coverage-analysis.