On 6/10/2011 9:37 AM, Jack Smiley wrote:
> Hi,
>
> I have three questions about the Lex regexes used to define the CSS
> tokens (section 4.1.1, Tokenization)
>
> 1) What do the dashes mean in the character class of the second
> alternate in the definition of URI
>
> |url\({w}([!#$%&*-\[\]-~]|{nonascii}|{escape})*{w}\)
>
> They're not escaped, so I'm assuming they're metacharacters (refer to
> ranges), but ranges don't seem to make sense here (what's the
> character range from * to [ or from ] to ~)?
Look up the ASCII table. In particular, * to [ is *+,-./:;<=>?@[, with
0-9 and A-Z in there as well, and ]-~ is ]^_`{|}~, with a-z as well.
In other words, it's every printable ASCII character except space, ", ',
(, ), and \.
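If you'd rather not read the ranges off the table by hand, a short script can list what the class leaves out. This is only a sketch: it assumes Python's re module reads this particular class the same way Lex does (it does for these escapes and ranges), and the names URL_CHAR, printable, and excluded are made up for the illustration.

```python
import re

# The character class from the URI token, copied as-is. The two unescaped
# dashes form the ranges *-[ and ]-~; the brackets inside are escaped.
URL_CHAR = re.compile(r"[!#$%&*-\[\]-~]")

# Printable ASCII runs from space (0x20) through ~ (0x7E).
printable = [chr(c) for c in range(0x20, 0x7F)]

# Collect every printable character the class does NOT match.
excluded = [ch for ch in printable if not URL_CHAR.fullmatch(ch)]

print(excluded)  # [' ', '"', "'", '(', ')', '\\']
```

The six characters left out are exactly the ones that would be ambiguous inside an unquoted url(...): whitespace, the two quote characters, the parentheses, and the backslash that starts an escape.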
>
> 3) Regarding the macro definition for nonascii, why does it go up to
> octal 237? (what's special about 237?) Why not octal 177 (decimal 127
> -- standard ASCII) or octal 377 (decimal 255 -- extended ASCII)?
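Octal 237 is decimal 159 (0x9F), the last of the C1 control characters
(0x80-0x9F), so nonascii starts at octal 240, i.e. U+00A0. Presumably the
point is to keep the C1 controls out along with plain ASCII.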
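The octal figures in the question are easy to check. The sketch below only converts the numbers and asks Python's unicodedata module for the general category of the code points on either side of the boundary; the variable names are made up for the illustration.

```python
import unicodedata

last_excluded = 0o237               # upper bound of the excluded range
first_nonascii = last_excluded + 1  # first code point nonascii can match

print(last_excluded, hex(last_excluded))                # 159 0x9f
print(oct(first_nonascii), f"U+{first_nonascii:04X}")   # 0o240 U+00A0

# The two alternatives mentioned in the question, for comparison.
print(0o177, 0o377)                                     # 127 255

# General category on each side of the boundary:
# "Cc" is a control character, "Zs" is a space separator (NO-BREAK SPACE).
print(unicodedata.category(chr(last_excluded)))         # Cc
print(unicodedata.category(chr(first_nonascii)))        # Zs
```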
--
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth