Representing Indentations for Parsing

From:
andrew cooke <andrew@...>

Date:
Sun, 30 Aug 2009 13:06:31 -0400

What's the best way to represent indentation information for parsing?
There are two side to this question. First, from a UI perspective -
how best to support clear, concise specification of parsers for a wide
variety of different approaches. Second, from an implementation
perspective - how to do that efficiently within LEPL.
Taking the second first, LEPL now supports filtering of streams, but
these are costly in terms of memory (apart from an initial filtering
on the entire stream). That means that context-sensitive filtering of
streams is possible, but better avoided. A more efficient, but less
flexible, approach is to alter "global" state in a way that depends on
scope - this can be done with "monitors" and has constant time and
space costs.
One issue with UI seems to resolve around finding a good
representation of the data in the stream. In particular, avoiding
having tokens related to indentation in the stream where they are not
needed (for example, in Python, between parentheses indentation is not
significant). If - as in that example - this is restricted to a small
fraction of the stream, then filtering is a reasonable approach (since
memory usage is not a problem).
Another issue involves the difference between spaces/tabs and "depth".
One scheme might assign meaningful indentation to exact multiples of
4 spaces; another might only require that spaces "be consistent"; in
either case tabs may or may not be acceptable, and might have some
fixed size, or shift to the some fixed set of columns.
Yet another issue is blank lines. If these are significant, they need
to be present in the stream. If not, they are a nuisance and should
be filtered.
Putting all this together, my first reaction was to calculate a
"depth" for each indentation (using some user defined function of the
space matched at the start of the line) and to place that value in the
(tagged token) stream. Furthermore, blank lines would be discarded
and a global (but stack-aware) "current indentation level" set as
"zero".
A NoIndent matcher (token) would then match an indentation token with
a depth equal to the global indentation level. This would allow a
line with no indentation to be matched.
In addition, an Indent matcher would exist, which takes a list of
matchers (tokens) as an argument. This matcher: matches an indent
"one more" than the global value; increments the global value;
evaluates the internal matchers; on exit, decrements the global value.
Within the scope of the first Indent matcher, then, a NoIndent matcher
would match lines indented by "one".
That has at least two problems: continguous nested Indents (the second
indent has nothing to match because the indent token was consumed by
the first); second it assumes there's it's possible to calculate the
depth before matching.
The first problem can be addressed by making the Indent match but not
consume the token if some indentation "remains". The second can be
fixed by using "number of spaces" rather than depth (and having Indent
consume some number, perhaps (FixedIndent), or simply take what it's
given (VariableIndent)).
There's also the case of wanting to ignore indentation. That is, I
think, going to require a filter. And NoIndent might be the right
name for that matcher, in which case perhaps Line would be a good
choice for what I was calling NoIndent? Or Sol?
After writing the above I tried writing a parser (without having any
implementation) and learnt that defining what a line is, is important,
but using a separate Eol() is messy. It seems better to have various
ways of defining lines. Related to that, one example use case I had
missed is automatic continuation over multiple lines (indicated by
indentation).
So perhaps the basic grouping is Line(), which has a related
Multiline() for automatic continuation. This can be combined with
Indent(), NoIndent() and AnyIndent() (the last ignores indentation).
But Multiline and AnyIndent are both so closely related to Line that
perhaps they should be options on Line. And if we do that, shouldn't
Indent be, too?
Is there some other syntax we can use? I don't think so - we really
need this to group matches, so it has to have a (...) form, which
means a function invocation. Perhaps a single letter function name,
plus flags? There's always _(...), although that's traditionally for
i18n.
Maybe it's better to separate the main parser from the structure, and
then have some kind of complex layout expression that is itself
parsed? I'm thinking of nested lists. But as soon as you get any
complexity (alternatives for example), you're better using functions
and the standard approach (ie Or()).
Here's a later attempt. I'm now using Line... to make things appear related.
word = Token('[a-z_][a-z0-9_]*')
number = Token(Integer)
symbol = Token('[^a-z0-9_]')
# any indent, entire line
comment = LineAny(symbol('#') + Star(Any()))[:]
atom = number | word
# ignore line related tokens
args = symbol('(') + LineIgnore(atom[:,symbol(',')]) + symbol(')')
expr = ... # this includes nested function definitions
# single line function is distinct
func1 = Line(word('def') + word + args + symbol('=') + expr)
func = Line(word('def') + word + args + symbol('=') +
LineIndent(expr)[:])
program = (func|func1)[:]
Inline comments would require separating out the "#..." part and
adding it to the end of an expr.
But there's still a problem with nested definitions. If LineIndent
has consumed the indentation token, what does the Line() outside a
function definition match?
I guess this indicates that LineIndent doesn't consume indentation, it
just modifies the current level. So perhaps should be called Indent.
And perhaps LineIgnore is better called Freeform?
# any indent, entire line
comment = symbol('#') + Star(Any())
atom = number | word
# ignore line related tokens
args = symbol('(') + Freeform(atom[:,symbol(',')]) + symbol(')')
simple_expr = ...
expr = Line(simple_expr + Opt(comment))
line_comment = LineAny(comment)
# single line function is distinct
func1 = Line(word('def') + word + args + symbol('=') + expr +
Opt(comment))
func = Line(word('def') + word + args + symbol('=') + Opt(comment)
Indent(expr|func|func1)[:])
program = (func|func1)[:]
And it looks like it would be useful for a user to be able to easily
define something that was like Line but added an optional comment (and
in practice this should indeed be possible).
I haven't used LineMulti() above, but that would support implicit
continuation through indented lines.
Andrew

More Indentation

From:
andrew cooke <andrew@...>

Date:
Sun, 30 Aug 2009 18:57:52 -0400

So looking at that I noticed that there's an inconsistent mess in how
Line and Indent work together. Above I had:
func = Line(word('def') + word + args + symbol('=') + Opt(comment)
Indent(expr|func|func1)[:])
There's no operator between the comment and the Indent. My initial
reaction was that I had missed a &, but then Line would extend pass
end of line (dropping Eol and inferring it from the next indentation
doesn't solve the underlying problem, I think, which is that this is
messy).
Looking at the problem again, there are several different "things being done":
1 - Defining what constitutes a single line.
2 - Grouping lines with the same indentation together
3 - Indicating how one *group* is indented relative to another.
The confusion above seems to be confusing (3) with indicating how
*lines* indent relative to each other.
Using this new approach is going to be much more verbose, but also
clearer. On a more positive note, it might help separate two separate
tasks - definition of lines and blocks - and might even allow more
useful combinators to be defined for specific cases.
It's worth clarifying the difference between line and block. A line
starts with an empty indentation and ends with the end of line marker.
There is an exception with Freeformat() which discards indentation
and eol markers - if that is used inside a line then the line can
extend over several lines of text.
A block contains a one or more matchers that consume lines (these can
be simple lines, or they may be combined using Or, And etc). Because
they work by incrementing/decrementing the number of spaces "used"
they do not alter the stream flow in any way. This means that there
is very little constraint on how lines inside a block are combined.
If a block contains several lines, separated by commas, then they are
joined by And().
Given the above, Indent() seems like it is really an initial parameter
to a block, which sets the space to consume (perhaps by reading the
indentation, perhaps by simply incrementing by some fixed amount).
So the syntax would be Block(indent, Line(), Line()...) where indent
is some value (number of spaces, or some function)? But this value is
likely global, so we can set it in config and use a keyword argument
for exceptional cases.
The example then becomes:
word = Token('[a-z_][a-z0-9_]*')
number = Token(Integer)
symbol = Token('[^a-z0-9_]')
# any indent, entire line
comment = symbol('#') + Star(Any())
atom = number | word
# ignore line related tokens
args = symbol('(') + Freeform(atom[:,symbol(',')]) + symbol(')')
simple_expr = ...
expr = Line(simple_expr + Opt(comment))
line_comment = LineAny(comment)
# single line function is distinct
func1 = Line(word('def') + word + args + symbol('=') + \
expr + Opt(comment))
func = Line(word('def') + word + args + symbol('=') + \
Opt(comment)) + Block((expr|func|func1)[:])
program = (func|func1)[:]
Andrew