Douglas Katzman <dougk@...> writes:
> interesting point. I think one could argue that the dumping/restoring of
> the second argument to EQL was permitted to find and return any symbol in
> the image that was similar-as-constant. So in fact the answer might have
> been T even if (string 'foo) was not compile-time folded.
>
> Your point holds even for the character-to-string-of-length-1 but all the
> same I think it's on firmer ground to fold (string #\a) => "a" because
> "returns a string" doesn't preclude that there might be a cache of
> single-character strings whose codes are in the range of base-char, and
> that the string could come from that cache. And that any one-character
> string appearing as a constant in code might also come from the cache.
>
> If nothing else, we should make (string "foo") an identity.
For make-string it also says "returns a simple string", but nobody would
dare to fold that.
--
With best regards, Stas.

interesting point. I think one could argue that the dumping/restoring of
the second argument to EQL was permitted to find and return any symbol in
the image that was similar-as-constant. So in fact the answer might have
been T even if (string 'foo) was not compile-time folded.
Your point holds even for the character-to-string-of-length-1 but all the
same I think it's on firmer ground to fold (string #\a) => "a" because
"returns a string" doesn't preclude that there might be a cache of
single-character strings whose codes are in the range of base-char, and
that the string could come from that cache. And that any one-character
string appearing as a constant in code might also come from the cache.
If nothing else, we should make (string "foo") an identity.
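For concreteness, the distinction at issue (a sketch; the comments describe current behaviour as I understand it, not tested output):

```lisp
;; STRING on a symbol currently conses a fresh string on each call, so
;; object identity against a literal fails; if the call were folded to
;; the constant "FOO" and coalesced with the literal, this could
;; become T in compiled code:
(eql (string 'foo) "FOO")

;; STRING is already permitted not to allocate for a string argument,
;; so this could be guaranteed to return its argument:
(let ((s "foo"))
  (eq (string s) s))
```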
On Fri, Jun 13, 2014 at 1:35 PM, Stas Boukarev <stassats@...> wrote:
> Douglas Katzman <dougk@...> writes:
>
> > The description allows STRING not to allocate in the case of symbol or
> > string input, so it seems to me that (string #\Newline) should not
> require
> > a "#." in front of it in user code to obtain a constant string of one
> > newline.
> > And if through macroexpansion we end up with (string 'foo) this should be
> > folded to "FOO".
> >
> > Sound ok?
> >
> > *Description:*
> The issue is coalescing: (eql (string 'foo) "FOO") will change from
> NIL to T. Is that allowed? Is that expected?
>
> --
> With best regards, Stas.
>

Douglas Katzman <dougk@...> writes:
> The description allows STRING not to allocate in the case of symbol or
> string input, so it seems to me that (string #\Newline) should not require
> a "#." in front of it in user code to obtain a constant string of one
> newline.
> And if through macroexpansion we end up with (string 'foo) this should be
> folded to "FOO".
>
> Sound ok?
>
> *Description:*
The issue is coalescing: (eql (string 'foo) "FOO") will change from
NIL to T. Is that allowed? Is that expected?
--
With best regards, Stas.

Robert Swindells <rjs@...> writes:
>>if possible in `git format-patch` format, so that I can apply them with
>>`git am` and you get the credit for them.
>
> They were generated using 'git format-patch'; what do I need to do
> differently with them?
Ah! I think what you need to do differently is to leave a line between
the summary of the patch and any further information, so the commit
message looks like
    Disable test on NetBSD

    The disassembly is generated fine, but something prevents sbcl from
    exiting afterwards.
and then the format of the message should be as I expect it. I think
this only affects your patch 1/6, where there are three (related)
changes in one patch. Now that I understand that, I think I can
probably fix them up myself.
>>I don't like patch 5; I know it's annoying, but patching out the test
>>will mean that the test will never get fixed :-(. I'm willing to help
>>try to debug the failure; presumably it's in annotating the disassembly,
>>but why? Is it reproducible at the repl (i.e. without all the contrib
>>building/testing infrastructure)?
>
> The disassembly finishes, both from a full build and from a repl; the
> problem is that the sbcl process won't quit afterwards. The behaviour
> is the same in threaded and non-threaded builds.
Hm. Hm hm. Yuk.
Cheers,
Christophe

Christophe Rhodes wrote:
>Robert Swindells <rjs@...> writes:
>
>> Christophe Rhodes wrote:
>>>Robert Swindells <rjs@...> writes:
>>>
>>>> [...]
>>>
>>>I have received patches 1,2,3,4,5 out of 6... what delights am I
>>>missing? (Has anyone else received 6/6)?
>>
>> Sorry, I haven't sent 6.
>
>OK. Could I ask you to send new patches as follows:
>
>- patch 1
Ok.
>- patch 2 + patch 4 squashed together
I will give it a go but may run out of patience with git.
>- patch 3
Ok.
>if possible in `git format-patch` format, so that I can apply them with
>`git am` and you get the credit for them.
They were generated using 'git format-patch'; what do I need to do
differently with them?
>I don't like patch 5; I know it's annoying, but patching out the test
>will mean that the test will never get fixed :-(. I'm willing to help
>try to debug the failure; presumably it's in annotating the disassembly,
>but why? Is it reproducible at the repl (i.e. without all the contrib
>building/testing infrastructure)?
The disassembly finishes, both from a full build and from a repl; the
problem is that the sbcl process won't quit afterwards. The behaviour
is the same in threaded and non-threaded builds.
Robert Swindells

Robert Swindells <rjs@...> writes:
> Christophe Rhodes wrote:
>>Robert Swindells <rjs@...> writes:
>>
>>> [...]
>>
>>I have received patches 1,2,3,4,5 out of 6... what delights am I
>>missing? (Has anyone else received 6/6)?
>
> Sorry, I haven't sent 6.
OK. Could I ask you to send new patches as follows:
- patch 1
- patch 2 + patch 4 squashed together
- patch 3
if possible in `git format-patch` format, so that I can apply them with
`git am` and you get the credit for them.
I don't like patch 5; I know it's annoying, but patching out the test
will mean that the test will never get fixed :-(. I'm willing to help
try to debug the failure; presumably it's in annotating the disassembly,
but why? Is it reproducible at the repl (i.e. without all the contrib
building/testing infrastructure)?
Thanks!
Christophe

This looks unexpected -
* (specifier-type '(member -0d0 +0d0 foo))
=> #<UNION-TYPE (OR (MEMBER 0.0d0 FOO) (DOUBLE-FLOAT 0.0d0 0.0d0))>
I would have thought it to be
#<UNION-TYPE (OR (MEMBER FOO) (DOUBLE-FLOAT 0.0d0 0.0d0))>
If only one of the zeros is present, it results in a MEMBER type as
expected:
(specifier-type '(member -0d0 foo)) => #<MEMBER-TYPE (MEMBER -0.0d0 FOO)>
But if both +/-0 are in the set, it produces a true numeric range type:
(specifier-type '(member -0d0 +0d0))
=> #<NUMERIC-TYPE (DOUBLE-FLOAT 0.0d0 0.0d0)>
Canonicalization doesn't perform the "opposite" transformation of
augmenting this union's MEMBER part with an fp-zero:
(specifier-type '(OR (MEMBER FOO) (DOUBLE-FLOAT 0.0d0 0.0d0)))
=> #<UNION-TYPE (OR (MEMBER FOO) (DOUBLE-FLOAT 0.0d0 0.0d0))>
So is it right that adding FOO in the first example includes positive 0d0
in the (MEMBER) part of the union even though it's redundant?
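The asymmetry follows from MEMBER being defined in terms of EQL, which (in SBCL) distinguishes the two float zeros even though = does not:

```lisp
(eql -0d0 0d0)  ; NIL in SBCL: different bit patterns, EQL-distinct
(= -0d0 0d0)    ; T: numerically equal
;; So (MEMBER -0d0) must stay a MEMBER type, while (MEMBER -0d0 0d0)
;; covers both zeros and is exactly the degenerate interval
;; (DOUBLE-FLOAT 0.0d0 0.0d0).
```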

Christophe Rhodes wrote:
>Robert Swindells <rjs@...> writes:
>
>> [...]
>
>I have received patches 1,2,3,4,5 out of 6... what delights am I
>missing? (Has anyone else received 6/6)?
Sorry, I haven't sent 6.
It contains the start of getting SBCL to build on NetBSD/arm; I had
second thoughts about sending it as it doesn't produce a working image.
One change that would make things easier for me would be to move
the signal handlers from arm-arch.c to arm-linux-os.c.
Robert Swindells

Douglas Katzman <dougk@...> writes:
> Opinions are mixed, but I think it's convenient to assume that wild
> dimensions for an array type, given a sequence as the input, should do
> something reasonable. In support of interpreting * as "whatever it
> needs to be" we must allow: (coerce '(1 2 3) '(vector * *)) => #(1 2
> 3) simply because that is the same as vector, but * isn't an actual
> type.
But (vector * *) is a recognizable subtype of vector/sequence, whereas
(array (unsigned-byte 8) *) isn't. Overall I'm not a fan of
potentially-surprising convenient behaviour, particularly since the user
could use (vector (unsigned-byte 8)) instead. I don't like our error
message either; ECL's is closest to what I would want: if the argument
is a sequence, but (a) not of the type specified and (b) the type
specified isn't a subtype of sequence, then there's some possibility
that the user meant VECTOR instead of ARRAY.
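The recognizability difference can be checked directly (a sketch):

```lisp
;; Any vector, element type and length wild, is still a sequence:
(subtypep '(vector * *) 'sequence)                 ; => T, T
;; An array type with wild dimensions might have rank /= 1, so it is
;; not a recognizable subtype of SEQUENCE:
(subtypep '(array (unsigned-byte 8) *) 'sequence)  ; => NIL, T
;; The unambiguous spelling for the coercion is a VECTOR specifier:
(coerce '(1 2 3) '(vector (unsigned-byte 8)))      ; => #(1 2 3)
```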
Cheers,
Christophe

On 06/06/2014 03:18 PM, Richard M Kreuter wrote:
> Krzysztof Drewniak <krzysdrewniak@...> wrote:
>> On 05/31/2014 05:25 PM, Richard M Kreuter wrote:
>>> Nikodemus Siivola <nikodemus@...> wrote:
>
>>>> The following from the top of my head, not much thought invested:
>>>>
>>>> - Something like SB-EXT:*READ-NORMALIZE*...
>>>>
>>>> - ...tack the normalization behaviour of symbols onto readtables...
>>>>
>>>> - ...make it part of external format.
>>>
>>> ...having the reader normalize tokens... would violate 2.3.6's Print-read
>>> consistency requirement for any symbol whose name is not a string in the
>>> normal form the reader uses, unless such symbols are printed in some
>>> extended syntax...
>>>
>> 2.3.6, as I understand it, simply requires that (forall symbol (eql
>> symbol (read (print symbol))).
>
> Right. This is why I included "unless such symbols are printed in some
> extended syntax".
>
>> Printing symbols that are not normalized within pipes and inhibiting
>> normalization of those symbols (just as is currently done with
>> case-conversion) would, IMO, satisfy the requirements of ANSI.
>
> I agree that it would be possible to use the multiple escape syntax for
> this purpose (and if so, single escape should be made to agree). I would
> consider this an extension to Lisp's syntax, in some pedantic sense.
>
>> I think that, if we normalize symbols, we should normalize to NFKC,
>> which is both the recommendation of UAX #31 and what other languages,
>> such as Python, use, since our identifiers are case-insensitive.
>
> I don't have an intuition on this one.
>
> (Also, to be pedantic, identifiers are not case insensitive; the reader,
> however, folds lettercase by default. |CAR| and |car| are different
> symbols.)
>
>> Also, what should we do about format characters in identifiers, like the
>> RTL control characters or the Zero Width (Non) Joiners. Disallow them?
>> Act as if they were any other character? Allow them, but ignore them
>> when comparing symbols? Something else?
>
> I can't tell if you're asking what should happen if someone calls
> MAKE-SYMBOL or INTERN on a string containing such a character, or what
> the reader should do when seeing an unescaped instance of such a
> character. If it's the former, my read on INTERN and MAKE-SYMBOL is that
> the result of each ought to be a symbol whose name is STRING= to the
> function's argument.
>
> If your project is really going to try to make sense of Lisp syntax over
> the domain of Unicode characters, it might be worth trying to map
> Unicode characters to Lisp syntax types and constituent traits.
>
> It might also turn out that the Right Thing to do is to invent some
> additional syntax types and/or constituent traits and define how they
> extend the reader algorithm. (For example, perhaps zero-width joiners
> should have constituent syntax but a new constituent trait indicating
> that they're ignored/discarded when accumulating a token? Or maybe they
> should have a new syntax type that makes the reader error? Perhaps
> non-joiners should have whitespace syntax type and invalid constituent
> trait?)
One other possibility is to incorporate a version of the Unicode
identifier definition (\p{XID_Start}\p{XID_Continue}*) (adjusted to
include the characters allowed by CL but not most other languages in the
appropriate categories) somewhere in the reader. Many other languages
error out on identifiers that don't meet the definition; we could
instead warn about "Unusual character ~S in symbol".
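A minimal sketch of such a warning pass; UNUSUAL-CHAR-P below is a hypothetical stand-in for a real XID_Start/XID_Continue property test driven by the Unicode data tables:

```lisp
(defun unusual-char-p (char)
  ;; Crude approximation for illustration: accept alphanumerics and
  ;; the usual CL symbol punctuation, flag everything else.
  (not (or (alphanumericp char)
           (find char "+-*/<>=!?%&$_.@^~"))))

(defun warn-about-unusual-characters (symbol-name)
  ;; Warn about the first suspicious character, in the spirit of what
  ;; other languages reject outright.
  (let ((bad (find-if #'unusual-char-p symbol-name)))
    (when bad
      (warn "Unusual character ~S in symbol ~S" bad symbol-name))))
```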
> (Next, it might prove interesting or useful to make some decisions about
> the Unicode digits with code point above 128. At present I believe they
> all have constituent syntax and alphabetic constituent trait, but maybe
> some or all of them should be refiled as alphadigit and either defined
> for use in potential number tokens? Or some other traits could be
> invented and the definition of potential number extended? It would seem
> attractive to at least reserve such characters for eventual future use.)
>
Currently, every character with the Unicode Decimal_Digit property can
act as a digit in number tokens. (There's a bug in master's DIGIT-CHAR-P
that prevents the case that handles such digits from firing when RADIX
<= 10, though).
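The intended rule can be sketched with a hypothetical DECIMAL-DIGIT-WEIGHT helper exposing the Decimal_Digit property, which is roughly what DIGIT-CHAR-P's Unicode case should do once the RADIX <= 10 bug is fixed:

```lisp
(defun unicode-digit-char-p (char &optional (radix 10))
  ;; DECIMAL-DIGIT-WEIGHT is hypothetical: it returns 0-9 for
  ;; characters with the Decimal_Digit property (in any script),
  ;; NIL otherwise.
  (let ((weight (decimal-digit-weight char)))
    (when (and weight (< weight radix))
      weight)))
```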
> In the parenthesized content in the preceding few paragraphs, I'm just
> making guesses about how things might get used. It's your project, after
> all! However, I think that mapping Unicode's peculiarities onto Lisp's
> syntactic peculiarities is likely to be a nontrivial project, and
> perhaps a distraction from the more broadly useful things you're doing
> for handling Unicode as data. The improvements you're working on will be
> valuable even without altering SBCL's relatively Unicode-unaware reader.
>
You do have a point there.
>>> Finally, please note that if input normalization is to be performed at
>>> all, making it tunable would create additional user burden...
>>>
>> I'm not certain exactly what you mean by this. I do agree that SBCL
>> should only normalize symbols to one form, with no option to change that
>> form. Whether normalization behavior should be in the readtable, a
>> build-time option, or something else should probably be discussed further.
>
> I meant that if the reader ever does normalization, that it shouldn't be
> possible to turn that behavior off. (But maybe I was wrong, see below.)
>
Most other languages that normalize symbols (for example, Python) don't
provide the option to turn that behavior off (or maybe they make it a
build-time option, I don't know). Of course, most other programming
languages don't have readtables or provide as much control over parsing
as CL does, so I'm not sure we should look to them for precedent in this
case.
> Let's suppose the converse, and a mechanism: imagine the reader
> normalized symbol names to a specific normal form (NFKC, perhaps) if and
> only if *READ-NORMALIZE* was true.
> [...]
>
> On the other hand, many analogous things are already true, or
> approximately so, for programs that use readtables with differing values
> for READTABLE-CASE:
>
> 1. If I want to paste text from a source file into the REPL, I have to
> ensure the REPL uses the correct readtable anyway.
>
> 2. If a program reads its symbols at runtime, the program needs to
> ensure the right readtable, too.
>
> 3. If one program wants to name another program's symbols, it needs to
> address those symbols' names properly, which might require escaping in
> the addresser's readtable depending on the addressees' names' lettercase
> conventions.
>
> So maybe making the normalization behavior be a property of the
> readtable the way that case is, as Nikodemus proposed, wouldn't be so
> bad. (Still seems simpler to me to make it be a build-time constant
> choice, though.)
>
I agree with your analogy between normalizing symbols and
READTABLE-CASE. On my "normalizing reader" branch, I've added
READTABLE-NORMALIZATION, which is T by default on #+SB-UNICODE builds
(this isn't too much of an issue, as NFKC is a noop on strings
consisting solely of [U+00,U+FF]).
I guess I should state a few of the assumptions that underlie this
readtable work:
1. There are many more false negatives (symbols that seem like they
should be EQ, but aren't, such as '豈 (U+F900) and '豈 (U+8C48), which
should look visually identical) with a non-normalizing reader than there
would be false positives when symbols are normalized.
2. Anyone calling (setf (readtable-fee readtable) <nonstandard-value>)
knows what they're doing and how to deal with any of the issues with
other people's symbols you've described above.
3. It really can't hurt to give the users another reader control switch.
This is lisp, the language where you could redefine READ, EVAL, and/or
PRINT to do something completely different, after all.
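The pair in point 1, spelled out via code points (NORMALIZE-STRING here stands in for whatever NFKC entry point the branch exposes; an assumption, not a settled name):

```lisp
;; U+F900 (CJK COMPATIBILITY IDEOGRAPH-F900) NFKC-normalizes to
;; U+8C48, so a normalizing reader would intern both spellings as
;; the same symbol:
(string= (normalize-string (string (code-char #xF900)) :nfkc)
         (string (code-char #x8C48)))
```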
>>>>> 2) What, if anything, should we do about confusables?...
>>>>
>>>> I think nothing by default / magically...
>>>
>>> Users might be encouraged to note that in cases where disambiguating
>>> confusables is desired, it can be done with the pretty printer.
>>
>> How so?
>
> However you like. Here's a trivial demonstration (untested),
> substituting a made-up notion, "confusticatable", for any rigorous
> notion of "confusable".
>
> (defun confusticatablep (object)
>   ;; I'm easily confusticated.
>   (not (every #'(lambda (char)
>                   (and (standard-char-p char) (graphic-char-p char)))
>               (string object))))
>
> (set-pprint-dispatch '(and symbol (satisfies confusticatablep))
>                      #'(lambda (stream symbol)
>                          (write-char #\? stream)
>                          (write symbol :stream stream :pretty nil)))
>
> ;; And if you want to read them back in this way...
> (set-macro-character #\? #'(lambda (stream char) (read stream nil t t)))
>
Thanks for the example. I'm sure someone will find it useful in the
future. (As you said, having this in by default is probably not a good
idea).
Krzysztof

Krzysztof Drewniak <krzysdrewniak@...> wrote:
> On 05/31/2014 05:25 PM, Richard M Kreuter wrote:
> > Nikodemus Siivola <nikodemus@...> wrote:
> >> The following from the top of my head, not much thought invested:
> >>
> >> - Something like SB-EXT:*READ-NORMALIZE*...
> >>
> >> - ...tack the normalization behaviour of symbols onto readtables...
> >>
> >> - ...make it part of external format.
> >
> > ...having the reader normalize tokens... would violate 2.3.6's Print-read
> > consistency requirement for any symbol whose name is not a string in the
> > normal form the reader uses, unless such symbols are printed in some
> > extended syntax...
> >
> 2.3.6, as I understand it, simply requires that (forall symbol (eql
> symbol (read (print symbol))).
Right. This is why I included "unless such symbols are printed in some
extended syntax".
> Printing symbols that are not normalized within pipes and inhibiting
> normalization of those symbols (just as is currently done with
> case-conversion) would, IMO, satisfy the requirements of ANSI.
I agree that it would be possible to use the multiple escape syntax for
this purpose (and if so, single escape should be made to agree). I would
consider this an extension to Lisp's syntax, in some pedantic sense.
> I think that, if we normalize symbols, we should normalize to NFKC,
> which is both the recommendation of UAX #31 and what other languages,
> such as Python, use, since our identifiers are case-insensitive.
I don't have an intuition on this one.
(Also, to be pedantic, identifiers are not case insensitive; the reader,
however, folds lettercase by default. |CAR| and |car| are different
symbols.)
> Also, what should we do about format characters in identifiers, like the
> RTL control characters or the Zero Width (Non) Joiners. Disallow them?
> Act as if they were any other character? Allow them, but ignore them
> when comparing symbols? Something else?
I can't tell if you're asking what should happen if someone calls
MAKE-SYMBOL or INTERN on a string containing such a character, or what
the reader should do when seeing an unescaped instance of such a
character. If it's the former, my read on INTERN and MAKE-SYMBOL is that
the result of each ought to be a symbol whose name is STRING= to the
function's argument.
If your project is really going to try to make sense of Lisp syntax over
the domain of Unicode characters, it might be worth trying to map
Unicode characters to Lisp syntax types and constituent traits.
It might also turn out that the Right Thing to do is to invent some
additional syntax types and/or constituent traits and define how they
extend the reader algorithm. (For example, perhaps zero-width joiners
should have constituent syntax but a new constituent trait indicating
that they're ignored/discarded when accumulating a token? Or maybe they
should have a new syntax type that makes the reader error? Perhaps
non-joiners should have whitespace syntax type and invalid constituent
trait?)
(Next, it might prove interesting or useful to make some decisions about
the Unicode digits with code point above 128. At present I believe they
all have constituent syntax and alphabetic constituent trait, but maybe
some or all of them should be refiled as alphadigit and either defined
for use in potential number tokens? Or some other traits could be
invented and the definition of potential number extended? It would seem
attractive to at least reserve such characters for eventual future use.)
In the parenthesized content in the preceding few paragraphs, I'm just
making guesses about how things might get used. It's your project, after
all! However, I think that mapping Unicode's peculiarities onto Lisp's
syntactic peculiarities is likely to be a nontrivial project, and
perhaps a distraction from the more broadly useful things you're doing
for handling Unicode as data. The improvements you're working on will be
valuable even without altering SBCL's relatively Unicode-unaware reader.
> > Finally, please note that if input normalization is to be performed at
> > all, making it tunable would create additional user burden...
> >
> I'm not certain exactly what you mean by this. I do agree that SBCL
> should only normalize symbols to one form, with no option to change that
> form. Whether normalization behavior should be in the readtable, a
> build-time option, or something else should probably be discussed further.
I meant that if the reader ever does normalization, that it shouldn't be
possible to turn that behavior off. (But maybe I was wrong, see below.)
Let's suppose the converse, and a mechanism: imagine the reader
normalized symbol names to a specific normal form (NFKC, perhaps) if and
only if *READ-NORMALIZE* was true.
1. Consider a program whose source file is in some normal form other
than NFKC. Suppose you compiled or loaded that source file while
*READ-NORMALIZE* was true, then later, while developing the program,
copied and pasted a form from the source file into a REPL in which
*READ-NORMALIZE* was false. This could be harmless or horribly
confusing, depending on the form.
2. Or consider a program that attempts to READ at runtime any symbols
that happen to occur in its source code. Such a program would only be
reliable if it always rebound *READ-NORMALIZE* around its uses of READ,
or if the inputs it reads were always suitably escaped, e.g., according
to your proposed multiple escape convention.
3. Or consider a program that wants to employ symbols from another
program. The two programs' source files would either have to both be in
the same normal form (which is probably how anything that works today
manages to work), or else both programs'
source files would have to have been compiled and/or loaded under the
same value of *READ-NORMALIZE*, or the programs need to take care to
name each other's symbols in the appropriate form. (This seems like it'd
be an issue in DEFPACKAGE for sure.)
On the other hand, many analogous things are already true, or
approximately so, for programs that use readtables with differing values
for READTABLE-CASE:
1. If I want to paste text from a source file into the REPL, I have to
ensure the REPL uses the correct readtable anyway.
2. If a program reads its symbols at runtime, the program needs to
ensure the right readtable, too.
3. If one program wants to name another program's symbols, it needs to
address those symbols' names properly, which might require escaping in
the addresser's readtable depending on the addressees' names' lettercase
conventions.
So maybe making the normalization behavior be a property of the
readtable the way that case is, as Nikodemus proposed, wouldn't be so
bad. (Still seems simpler to me to make it be a build-time constant
choice, though.)
> >>> 2) What, if anything, should we do about confusables?...
> >>
> >> I think nothing by default / magically...
> >
> > Users might be encouraged to note that in cases where disambiguating
> > confusables is desired, it can be done with the pretty printer.
>
> How so?
However you like. Here's a trivial demonstration (untested),
substituting a made-up notion, "confusticatable", for any rigorous
notion of "confusable".
(defun confusticatablep (object)
  ;; I'm easily confusticated.
  (not (every #'(lambda (char)
                  (and (standard-char-p char) (graphic-char-p char)))
              (string object))))

(set-pprint-dispatch '(and symbol (satisfies confusticatablep))
                     #'(lambda (stream symbol)
                         (write-char #\? stream)
                         (write symbol :stream stream :pretty nil)))

;; And if you want to read them back in this way...
(set-macro-character #\? #'(lambda (stream char) (read stream nil t t)))
Regards,
Richard

On 05/31/2014 05:25 PM, Richard M Kreuter wrote:
> Nikodemus Siivola <nikodemus@...> wrote:
>
>>> 1) Should we normalize symbols?...
>>
>> - INTERN and MAKE-SYMBOL should not normalize anything, IMO.
>
> The definitions of INTERN and MAKE-SYMBOL appear to require the result
> to be a symbol whose name is the same as the string, so yeah.
>
>> The following from the top of my head, not much thought invested:
>>
>> - Something like SB-EXT:*READ-NORMALIZE*...
>>
>> - ...tack the normalization behaviour of symbols onto readtables...
>>
>> - ...make it part of external format.
>
> Since INTERN can't normalize symbol names, having the reader normalize
> tokens (as in proposals 1 and 2) would violate 2.3.6's Print-read
> consistency requirement for any symbol whose name is not a string in the
> normal form the reader uses, unless such symbols are printed in some
> extended syntax. (Could print such symbols in a syntax supported by a
> reader macro, or else by defining a new syntax type for inhibiting
> Unicode normalization in the token parser. But somebody might argue that
> neither option would be a conforming way to print a symbol, since
> 22.1.3.3 seems to offer no leeway.)
>
2.3.6, as I understand it, simply requires that (forall symbol (eql
symbol (read (print symbol))). Printing symbols that are not normalized
within pipes and inhibiting normalization of those symbols (just as is
currently done with case-conversion) would, IMO, satisfy the
requirements of ANSI.
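A sketch of the round-trip requirement (untested; under a normalizing reader the form only stays true if escaped printing inhibits normalization):

```lisp
;; Intern a symbol whose name is not in NFKC:
(let ((sym (intern (string (code-char #xF900)))))
  ;; Print-read consistency: reading the printed representation must
  ;; yield the identical symbol.  This holds today; a normalizing
  ;; reader must keep it holding, e.g. via |...| escapes.
  (eq sym (read-from-string (prin1-to-string sym))))
```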
I think that, if we normalize symbols, we should normalize to NFKC,
which is both the recommendation of UAX #31 and what other languages,
such as Python, use, since our identifiers are case-insensitive.
Also, what should we do about format characters in identifiers, like the
RTL control characters or the Zero Width (Non) Joiners. Disallow them?
Act as if they were any other character? Allow them, but ignore them
when comparing symbols? Something else?
> Normalizing strings is a different issue, and merits separate
> consideration. (Related to string normalization is
> sharpsign-P. Namestring syntax is already non-surjective to the range of
> filenames on Linux and *BSD at least. This is arguably a flaw on its
> own, but normalization could make for some further confusions.)
>
Normalizing strings isn't something that's usually done, and it's a bad
idea due to things like filenames, so we probably shouldn't do it.
> Finally, please note that if input normalization is to be performed at
all, making it tunable would create additional user burden about
> coordinating the normalization rules between program read-time and
> run-time, possibly between source-read-time and fasload time, between
> read-time and run-time squared when two programs in the same Lisp lookup
> each others' symbols, and so forth. So ISTM that at most one
> normalization rule should be permitted in the SBCL image, and if one,
> then it ought to be used everywhere and indicated by a feature that's
> set at bootstrap (and included in the FASL file, if that's still
> significant).
>
I'm not certain exactly what you mean by this. I do agree that SBCL
should only normalize symbols to one form, with no option to change that
form. Whether normalization behavior should be in the readtable, a
build-time option, or something else should probably be discussed further.
>>> 2) What, if anything, should we do about confusables?...
>>
>> I think nothing by default / magically...
>
> Users might be encouraged to note that in cases where disambiguating
> confusables is desired, it can be done with the pretty printer.
How so?
- Krzysztof

On 05/31/2014 12:06 PM, Nikodemus Siivola wrote:
>> 2) What, if anything, should we do about confusables? For example, :peak
>> and :реак are not visually distinct, though the second keyword is made
>> wholly of Cyrillic letters while the first is made of Latin ones. One
>> possibility is that symbols containing codepoints that are confusable
>> with Latin be printed with vertical bars to make it clear that something
>> might be up.
>
> I think nothing by default / magically. Then again, it might be useful
> to have a few utilities re. confusables to make error reporting
> regarding symbols easy -- or just to make it easy for users to find
> confusable symbols.
>
> For example, LIST-CONFUSABLE-SYMBOLS would allow doing something like
>
> (let ((alts (remove-if-not #'fboundp (list-confusable-symbols fname))))
>   (if alts
>       (error "undefined function: ~S. Did you mean one of: ~S" fname alts)
>       (error "undefined function: ~S" fname)))
>
> ...modulo details.
>
Yeah, magical default behavior is probably a bad idea.
I've implemented some confusable-detection functions on
unicode-algorithms, but list-all-confusables is currently O(don't). If
you (or anyone else) have any ideas for how to speed that function up to
the point that it could be integrated into an error message as above, I'd
love to hear them.
>> 4) Is it reasonable to integrate a tailored version of the Unicode line
>> breaker into the pretty-printer?
>
> Sounds like a nice thing to do.
>
After thinking about it a bit more, this might be a bad idea since lisp
"text" has significantly different break rules than natural-language text.
>> 5) Does anyone have any other Unicode-related ideas?
>
> Utilities for converting numbers and letters to
> subscripts/superscripts would be nice. I've found them really useful
> for pretty-printing myself.
>
Since the Unicode authors have, in their wisdom, scattered the
superscripts all over the place, you might be able to get away with the
following lispification of an old Unix utility (unless you need it early
in the compiler)
(defun tr (source dest string)
  (coerce
   (loop for char across string
         collect (let ((index (position char source)))
                   (if index (char dest index) char)))
   'string))

(defun superscript (string)
  (tr "0123456789+-=()abcdefghijklmnoprstuvwxyzABDEGHIJKLMNOPRTUVW"
      "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻᴬᴮᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴿᵀᵁⱽᵂ"
      string))

(defun subscript (string)
  (tr "0123456789+-=()aehijklmnoprstuvx"
      "₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎ₐₑₕᵢⱼₖₗₘₙₒₚᵣₛₜᵤᵥₓ"
      string))
Krzysztof