On 2/28/2013 12:20 AM, Jacob Carlborg wrote:
> On 2013-02-28 06:13, Walter Bright wrote:
>
>> No. Look at how lexer.c is fed data.
>
> C strings have a sentinel from the beginning, '\0'.
>
Look at how the source file is transformed into a 0 terminated input. Files are
not C strings.
There are no extra copies and no extra tests.

On Thursday, 28 February 2013 at 10:37:35 UTC, Walter Bright
wrote:
> On 2/27/2013 11:37 PM, deadalnix wrote:
>> It does check for sentinel in many places (and it has to).
>
> No, it doesn't. I suggest examining line 481 of lexer.c. Then
> examine line 558 and 559.
>
> https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c
escapeSequence does check for it.

On 2/28/2013 12:25 AM, Jacob Carlborg wrote:
> On 2013-02-28 08:23, Walter Bright wrote:
>
>> There is always a terminating 0, even if the file ends in 0x1a.
>>
>> (The 0x1A comes from some old editors that end a file with a control Z.)
>
> http://dlang.org/lex.html, "End of File" shows this:
>
> EndOfFile:
> physical end of the file
> \u0000
> \u001A
>
> Doesn't that mean that the input is empty if the file is empty, \u0000 or \u001A
> is encountered?
>
Please follow the source code from file to input to the lexer.
https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c
line 1012 and 1038

On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
>> Again, please see how lexer.c works. I assure you, there is no double
>> copying going on, nor is there a double test for the terminating 0.
>
> I know what the lexer does, and remember that it _doesn't_ operate on ranges,
> and there are subtle differences between being able to just use char* and
> trying to handle generic ranges.
Hence the need to invent SentinelInputRange.
> Given how a lexer works (and I have been working on a lexer off and on
> recently), the only real difference is that you'd just use a couple of static
> ifs like
>
> static if(!isSomeString!R)
> {
> if(range.empty)
> break; //or whatever you do at the end
> }
>
> static if(isSomeString!R)
> {
> case 0:
> break; //or whatever you do at the end
> }
There are so many places where this would occur, it cries out for a new type.
> So, in the case of a lexer, I don't see sentinel ranges as buying us much. You
> end up having to wrap most any range that you pass to the lexer or whatever
> (including strings so that they'll pass isSentinelRange), you lose out on any
> optimizations of any functions that you call which special-case strings
> (though there probably wouldn't be many of those in a lexer), and all you
> avoid is a couple of static ifs.
And NO, THE SOURCE FILE INPUT IS NEITHER WRAPPED NOR DOUBLE COPIED. Here's how
it's done:
https://github.com/D-Programming-Language/dmd/blob/master/src/root/root.c
line 1012 and 1038
> The idea of sentinels certainly isn't useless, but anything caring about that
> sort of speed is likely to just use strings or arrays, and those can trivially
> be special cased to avoid unnecessary empty checks and to add the check for
> the sentinel, making the whole sentinel range idea an unnecessary complication
> IMHO.
You can't do efficient lookahead without sentinels, either. Lexers are sensitive
to every instruction executed per character read. No sentinels mean double the
number of instructions per source character.
InputRanges are an abject failure if "anyone caring about speed" is not going to
use them. And yes, I care very much about the D lexing speed.

On 2/28/2013 1:19 AM, Peter Alexander wrote:
> On Thursday, 28 February 2013 at 01:12:19 UTC, Walter Bright wrote:
>> Motivation:
>> 1. easy conversion of C strings to ranges
>
> This is already easy.
>
> struct CString
> {
> @property char front() { return *p; }
> void popFront() { ++p; }
> @property bool empty() const { return !*p; }
> char* p;
> }
>
> (Yep, rubbish implementation, missing save, constructors etc., but you get the
> idea).
Notice the double read of *p. This is what SentinelInputRange is trying to fix.
>> 2. necessary for a fast implementation of a lexer
>
> This is very domain specific. Is a whole new range concept in the standard
> library necessary?
I've posted in this thread several use cases.
> My worry here is that we are setting a precedent for adding new range concepts.
> If the only justification needed is that is saves a single operation in some
> niche area of computing then we are opening the door to a LOT of different range
> concepts.
There are many cases where speed is a very big deal.
> Are there any other REAL uses for this other than in one line of a lexer
> implementation? I think inclusions into the standard library should require at
> least several distinct and realistic use cases.
I've posted several - including one in Phobos.

On 2/28/2013 2:42 AM, deadalnix wrote:
> On Thursday, 28 February 2013 at 10:37:35 UTC, Walter Bright wrote:
>> On 2/27/2013 11:37 PM, deadalnix wrote:
>>> It does check for sentinel in many places (and it has to).
>>
>> No, it doesn't. I suggest examining line 481 of lexer.c. Then examine line 558
>> and 559.
>>
>> https://github.com/D-Programming-Language/dmd/blob/master/src/lexer.c
>
> escapeSequence does check for it.
Note the lookahead is done without checking first. This is the salient point. A
sentinel means you are guaranteed to always be able to look one character ahead
without checking.
This is used all over the lexer.
The other InputRange forms do not allow peeking ahead without testing *first*.

On Thursday, February 28, 2013 02:54:24 Walter Bright wrote:
> On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
> >> Again, please see how lexer.c works. I assure you, there is no double
> >> copying going on, nor is there a double test for the terminating 0.
> >
> > I know what the lexer does, and remember that it _doesn't_ operate on
> > ranges, and there are subtle differences between being able to just use
> > char* and trying to handle generic ranges.
>
> Hence the need to invent SentinelInputRange.
>
> > Given how a lexer works (and I have been working on a lexer off and on
> > recently), the only real difference is that you'd just use a couple of
> > static ifs like
> >
> > static if(!isSomeString!R)
> > {
> >
> > if(range.empty)
> >
> > break; //or whatever you do at the end
> >
> > }
> >
> > static if(isSomeString!R)
> > {
> >
> > case 0:
> > break; //or whatever you do at the end
> >
> > }
>
> There are so many places where this would occur, it cries out for a new
> type.
And you were just claiming that the lexer checked the sentinel type in only
one place. If that's indeed the case (and I think that it's quite close to
being true if it isn't true), then you _wouldn't_ need to use static ifs like
this in many places. So, which is it? If you need to check the sentinel often
enough that using static ifs is a problem, then it's probably not buying you
much of anything over checking empty anyway.
> > The idea of sentinels certainly isn't useless, but anything caring about
> > that sort of speed is likely to just use strings or arrays, and those can
> > trivially be special cased to avoid unnecessary empty checks and to add
> > the check for the sentinel, making the whole sentinel range idea an
> > unnecessary complication IMHO.
>
> You can't do efficient lookahead without sentinels, either. Lexers are
> sensitive to every instruction executed per character read. No sentinels
> mean double the number of instructions per source character.
But my point is that outside of strings or arrays, you're almost certainly
stuck with that. Most ranges would have to be wrapped in order to act as a
sentinel range, and the sentinel range would have to check empty on the
internal range on every popFront. The only time that a sentinel range would
buy you anything outside of strings or arrays is when you have a range type
written specifically to be a sentinel range and which doesn't have to call
empty on a range internally. But that means that it's going to have to manage
whatever it's holding internally itself and can't wrap another range - and
that memory will almost certainly be an array of some sort. And if that's the
case, what did we buy by wrapping it in a sentinel range? Why not just use the
array directly?
Pretty much the only type of range which isn't wrapping another range which
wouldn't have an array internally would be one that was either reading from a
file (or some other input) as it iterates or one which was generative. In
either case, you would be unable to put a zero on the end without checking
whether it was empty with each popFront, meaning that making it a sentinel
range was pointless. So, pretty much the only ranges which could gain efficiency
as sentinel ranges are those that hold arrays internally, meaning that the
only gain from having the sentinel range over special casing the code on
strings or arrays is the fact that you won't have to special case strings or
arrays at all with static ifs or template constraints, but I'm quite convinced
that in most cases, the number of static ifs involved would be quite few,
making it quite trivial to special case on strings, meaning that there's
pretty much no gain in having the sentinel ranges. It just incurs more
overhead that the compiler must optimize away, because you have to wrap
strings for them to work as sentinel ranges, or no template constraint will be
able to detect that they're sentinel ranges. Of course, you could special case
them so that they're treated as sentinel ranges in spite of that, but if
you're special casing, why use the sentinel ranges in the first place instead
of just special casing on strings?
> InputRanges are an abject failure if "anyone caring about speed" is not
> going to use them. And yes, I care very much about the D lexing speed.
Pure input ranges fail utterly as you can't save them, so you get _zero_
lookahead unless you start copying their contents somewhere else - which is
quite expensive in comparison to slicing an array and which complicates the
code. Forward ranges do better, but the lack of random access can be a killer
in some cases. For instance, anything involving unicode (which a lexer mostly
avoids, because it rarely has to decode unicode, but it still happens
sometimes) requires random access for speed at least some of the time
(especially when skipping code units), and it makes comparisons faster,
because you can compare slices at once rather than compare a character and
then pop it, compare, pop, compare, pop... For the fastest results, you need
random-access ranges.
There's tons of stuff that's fine with forward ranges (though I seriously
question the viability of pure input ranges given how insanely limiting is to
be unable to save their state), but there's also plenty of stuff that needs
random access, and if you _really_ care about speed, there's a decent chance
that you need random access (if nothing else, to be able to pop off multiple
elements at a time). That doesn't necessarily mean that you're using an array,
but odds are whatever you're using wraps an array if it isn't one. I don't
think that there's much of anything which is random access without having an
array or pointer underneath the hood. Reading from a file, I'd be very inclined
to use something like MmFile or reading the whole file into an array rather
than trying to wrap in something that's a pure input range. They're just too
limiting.
- Jonathan M Davis

On Thursday, 28 February 2013 at 10:57:29 UTC, Walter Bright
wrote:
>> My worry here is that we are setting a precedent for adding
>> new range concepts.
>> If the only justification needed is that is saves a single
>> operation in some
>> niche area of computing then we are opening the door to a LOT
>> of different range
>> concepts.
>
> There are many cases where speed is a very big deal.
Yes, are we going to add new range categories every time there's
a performance benefit to doing so?
has16ByteAlignedElements
hasUniqueElements
hasContiguousMemory
isMonotonic
isUnimodal
...
?