On 2/28/13 12:20 PM, Dmitry Olshansky wrote:
> 28-Feb-2013 19:31, Andrei Alexandrescu writes:
>> On 2/28/13 5:54 AM, Walter Bright wrote:
>>> On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
>>>>> Again, please see how lexer.c works. I assure you, there is no double
>>>>> copying going on, nor is there a double test for the terminating 0.
>>>>
>>>> I know what the lexer does, and remember that it _doesn't_ operate on
>>>> ranges,
>>>> and there are subtle differences between being able to just use char*
>>>> and
>>>> trying to handle generic ranges.
>>>
>>> Hence the need to invent SentinelInputRange.
>>
>> I don't think the sentinel input range is a blocker for redoing the
>> parser (with ranges) in D. This discussion has probably run its course.
>
>
>> The right thing to do at this point is port the lexer and figure what
>> primitives are necessary from its input.
>>
>
> No need to port - look at std.d.lexer again. It was revamped and is
> ready for a new round of review, I should say. Let's not use the old
> source for the new module and go down the long path of:
> split off --> port --> patch up --> D-ify & rewrite to ranges
>
> Instead we can just tweak the current std.d.lexer a little bit more and
> we have a good clean-room lexer written in idiomatic D. Well, it's
> getting there w.r.t. idiomaticness, but it supports ranges, including
> both random-access and forward ones (by transparently specializing for
> each one).
I think that's a good idea but I took a look at
https://github.com/bhelyer/std.d.lexer/blob/master/std/d/lexer.d and I
will destroy it. In a good sense :o).
Andrei

28-Feb-2013 22:08, Timon Gehr writes:
> On 02/28/2013 05:48 PM, Dmitry Olshansky wrote:
>> ...
>>
>> line 300:
>> case 0x1A: // ^Z means end of file
>> case 0:
>> break;
>>
>> On the lines you noted, it's claimed that 0x1A is outdated - along
>> with the fact that you allocate filesize+2 and fill the last 2 bytes
>> with zeros.
>>
>> In any case, I see 0 and 0x1A as two values that act like sentinels,
>> i.e. a pair. And this is what the spec says - any one of them is a
>> sentinel. Correct me if I'm wrong.
>>
>
> A sentinel is some data the original data is augmented with in order to
> simplify its processing.
I thought 0 was proposed as such. I might have misunderstood the proposal.
> The lexer acts the same on 0x1A and 0, but only the additional 0 at the
> end which does not occur in the input is the sentinel.
That would mean that when you see 0 or 0x1A, you do a check to see
whether that's the end of input, and only then decide it's really the
end of input. If that's the intended behavior, I fail to decipher it
from here:
http://dlang.org/lex.html#EndOfFile
> The lexer may
> even encounter a 0 that is not a sentinel.
--
Dmitry Olshansky
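As an aside, one way to square "a 0 that is not a sentinel" with a 0-padded buffer is to compare the cursor against the real end of input whenever a 0 (or 0x1A) turns up. A minimal C sketch of that idea, counting whitespace-separated tokens; the function and token rules are invented for illustration, this is not dmd's actual lexer.c logic:

```c
#include <stddef.h>

/* Hypothetical sketch: scan a buffer that has been padded with a
 * trailing 0 sentinel.  On seeing 0 or 0x1A, compare the cursor
 * against the real end of input to tell the padding sentinel apart
 * from a stray byte embedded in the source. */
static size_t count_tokens(const char *buf, size_t len)
{
    const char *p = buf;
    const char *end = buf + len;   /* end of the real input */
    size_t ntok = 0;
    for (;;) {
        switch (*p) {
        case 0:
        case 0x1A:
            if (p >= end)
                return ntok;       /* the padding sentinel: done */
            p++;                   /* embedded 0 or ^Z: skip it */
            break;
        case ' ':
            p++;                   /* whitespace */
            break;
        default:                   /* a run of non-space bytes = one token */
            ntok++;
            while (*p != 0 && *p != 0x1A && *p != ' ')
                p++;
            break;
        }
    }
}
```

Only the sentinel branch pays the bounds comparison; the common path stays a single switch on *p.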

28-Feb-2013 22:38, Andrei Alexandrescu writes:
> On 2/28/13 12:20 PM, Dmitry Olshansky wrote:
>> 28-Feb-2013 19:31, Andrei Alexandrescu writes:
>>> On 2/28/13 5:54 AM, Walter Bright wrote:
>>>> On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
>>>>>> Again, please see how lexer.c works. I assure you, there is no double
>>>>>> copying going on, nor is there a double test for the terminating 0.
>>>>>
>>>>> I know what the lexer does, and remember that it _doesn't_ operate on
>>>>> ranges,
>>>>> and there are subtle differences between being able to just use char*
>>>>> and
>>>>> trying to handle generic ranges.
>>>>
>>>> Hence the need to invent SentinelInputRange.
>>>
>>> I don't think the sentinel input range is a blocker for redoing the
>>> parser (with ranges) in D. This discussion has probably run its course.
>>
>>
>>> The right thing to do at this point is port the lexer and figure what
>>> primitives are necessary from its input.
>>>
>>
>> No need to port - look at std.d.lexer again. It was revamped and is
>> ready for a new round of review, I should say. Let's not use the old
>> source for the new module and go down the long path of:
>> split off --> port --> patch up --> D-ify & rewrite to ranges
>>
>> Instead we can just tweak the current std.d.lexer a little bit more and
>> we have a good clean-room lexer written in idiomatic D. Well, it's
>> getting there w.r.t. idiomaticness, but it supports ranges, including
>> both random-access and forward ones (by transparently specializing for
>> each one).
>
> I think that's a good idea but I took a look at
> https://github.com/bhelyer/std.d.lexer/blob/master/std/d/lexer.d and I
> will destroy it. In a good sense :o).
That's the wrong one. This is the one:
https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer
Though feel free to destroy the other one too ;)
But I need your full powers with Dscanner first :o)
> Andrei
--
Dmitry Olshansky

On Thursday, February 28, 2013 08:52:45 Walter Bright wrote:
> I've given you two examples (lexer and regexp) where you are certainly not
> stuck with that, and those two cases matter.
>
> > Pure input ranges fail utterly as you can't save them, so you get _zero_
> > lookahead [...]
My point is that in almost all cases, what you'll end up with for a
sentinel range is something like this:
struct SRange(R)
    if(is(Unqual!(ElementType!R) == char))
{
    enum char sentinel = 0;
    @property char front() { return _front; }
    @property bool empty() { return _front == sentinel; }
    void popFront()
    {
        _range.popFront();
        if(_range.empty)
            _front = sentinel;
        else
            _front = _range.front;
    }
    this(R range)
    {
        _range = range;
        if(_range.empty)
            _front = sentinel;
        else
            _front = _range.front;
    }
private:
    char _front;
    R _range;
}
Notice that it has to check for empty every time that the front is popped, and
it can't avoid that, because it's wrapping another range which is not a
sentinel range. The only time that it can avoid that is if it's managing its
own memory internally via an array or pointer or whatnot. And if it's just
going to be a string or an array, I'd argue for simply special casing strings
or arrays to skip unnecessary empty checks.
Also, with sentinel ranges, you'll be forced to wrap strings because of that
sentinel marker that's required for isSentinelRange, meaning that you'll get
something like
struct SRange
{
    enum char sentinel = 0;
    @property char front() { return _str[0]; }
    @property bool empty() { return _str[0] == sentinel; }
    void popFront() { _str = _str[1 .. $]; }
    this(string str)
    {
        _str = str;
        if(_str.empty)
            _str = [sentinel];
        else if(_str[$ - 1] != sentinel)
            _str ~= sentinel;
    }
private:
    string _str;
}
And while the compiler will hopefully optimize away all of the extra overhead
of the wrapper type, any functions which would be able to optimize what
they're doing for a string or array will not special case for this wrapper
type, because they probably won't even know about it. A prime case where that
might be an issue would be with std.utf.decode or std.utf.decodeFront which do
some special casing for strings and which any unicode-aware lexer would have
to use at least some of the time (even if the vast majority of the time, it
can just check ASCII stuff).
So, for the most part, the only time that you'll get any performance boost out
of this is with strings or arrays, and you'll take a performance hit if you're
using functions which special case strings or arrays but not the wrapper
sentinel type. And yet special-casing strings and arrays will give you exactly
the same performance boost without this sentinel range concept.
- Jonathan M Davis
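The overhead argument above can be made concrete in C terms. The two functions below count non-space bytes; count_checked and count_sentinel are hypothetical names for the wrapper-style and padded-buffer-style loops, respectively:

```c
#include <stddef.h>

/* A generic wrapper over a non-sentinel source must bounds-test on
 * every single advance, exactly like SRange's popFront checking
 * _range.empty each time. */
static size_t count_checked(const char *p, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)   /* bounds test per step */
        if (p[i] != ' ')
            n++;
    return n;
}

/* A 0-padded buffer the consumer owns: the terminator doubles as the
 * loop condition, so there is no separate bounds test in the hot loop. */
static size_t count_sentinel(const char *p)
{
    size_t n = 0;
    for (; *p != 0; p++)               /* sentinel is the test */
        if (*p != ' ')
            n++;
    return n;
}
```

Both give the same answer; the difference is only where the end-of-input test lives, which is exactly what special-casing strings and arrays buys you without a new range concept.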

On 2013-02-28 17:43, Walter Bright wrote:
> Consider the following code from lexer.c:
>
> p++;
> switch (*p)
>
> Written using an InputRange:
>
> popFront();
> switch (front)
>
> That code is INVALID. This is why a SentinelInputRange is necessary. You can't
> just use an InputRange in an invalid manner by convention.
>
I do not understand... Why make a special type of InputRange when you can
achieve exactly that with a normal string with an added extra '\0' at the end
and then use it without any calls to empty():
while(1) {
    ...
    popFront();
    switch(front) {
    case '\0': return;
    ...
    }
}
C strings are a very special case because we know the terminator beforehand and
the check is trivial CPU-wise.
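The '\0'-terminated loop sketched above is the classic C-string scanning idiom. A self-contained C toy of the same shape (the whitespace and token rules are invented for illustration):

```c
#include <stddef.h>

/* The terminating '\0' is handled as just another switch case, so the
 * hot loop never makes a separate empty() check. */
static size_t lex_count(const char *p)
{
    size_t ntok = 0;
    for (;;) {
        switch (*p) {
        case '\0':
            return ntok;           /* sentinel: end of input */
        case ' ':
        case '\t':
        case '\n':
            p++;                   /* skip whitespace */
            break;
        default:
            ntok++;                /* start of a token */
            while (*p != '\0' && *p != ' ' && *p != '\t' && *p != '\n')
                p++;
            break;
        }
    }
}
```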