http://www.w3.org/Bugs/Public/show_bug.cgi?id=1373
------- Additional Comments From cmsmcq@w3.org 2005-06-02 00:29 -------
Scott Boag writes:
    I'm curious as to why, in the XML spec, there is:
      [22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
    vs.
      [24] VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"'
                           VersionNum '"')
    Section 6 states "Symbols are written with an initial capital
    letter if they are the start symbol of a regular language,
    otherwise with an initial lowercase letter." But it seems like a
    fuzzy line.
The XML WG may have made errors in drawing the line, but whether a
particular language over the alphabet of Unicode characters is regular
or not, in the technical sense, should not be fuzzy. The language
defined by a non-terminal in a regular right-part grammar is certainly
regular if the non-terminals on its right-hand side can be replaced
(iteratively) until there is nothing there but terminal symbols (in
this case, Unicode characters or expressions like [a-zA-Z]). This, in
turn, is the case if there is no recursion in the grammar rules (no
left-hand symbol turns up directly or indirectly in its own right-hand
side). (Strictly speaking, recursion alone does not make a language
non-regular, but recursion that nests matched delimiters does.)
If all the symbols in a right-hand side are known to denote regular
languages, then the symbol on the left-hand side also denotes a
regular language; if any symbol on the right denotes a non-regular
language, then the language of the left-hand side symbol will normally
be non-regular as well.
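As a rough illustration of the no-recursion test described above, the
following Python sketch records, for an abbreviated fragment of the XML 1.0
grammar, which non-terminals appear on each right-hand side, and reports the
symbols involved in recursion. The dictionary and the function names are
made up for this illustration; Name, S and VersionNum are treated as if they
were terminals, and several productions are omitted. Absence of recursion is
sufficient to conclude that a symbol is regular; finding recursion only flags
the symbol for closer inspection.

```python
# Abbreviated, hypothetical encoding of part of the XML 1.0 grammar:
# each non-terminal maps to the non-terminals on its right-hand side.
# Terminal strings are left out; Name, S and VersionNum are treated as
# terminals for the purposes of this sketch.
GRAMMAR = {
    "doctypedecl": {"S", "Name", "ExternalID", "intSubset"},
    "intSubset": {"markupdecl", "DeclSep"},
    "markupdecl": {"elementdecl"},
    "elementdecl": {"S", "Name", "contentspec"},
    "contentspec": {"Mixed", "children"},
    "children": {"choice", "seq"},
    "choice": {"S", "cp"},
    "seq": {"S", "cp"},
    "cp": {"Name", "choice", "seq"},
    "VersionInfo": {"S", "Eq", "VersionNum"},
    "Eq": {"S"},
    "S": set(),
    "VersionNum": set(),
    "Name": set(),
}


def reachable(symbol, grammar):
    """Return every non-terminal reachable from symbol's right-hand side."""
    seen = set()
    stack = list(grammar.get(symbol, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(grammar.get(current, ()))
    return seen


def recursive_symbols(symbol, grammar):
    """Return the recursive symbols that `symbol` is or depends on."""
    candidates = reachable(symbol, grammar) | {symbol}
    return {s for s in candidates if s in reachable(s, grammar)}


if __name__ == "__main__":
    print(sorted(recursive_symbols("doctypedecl", GRAMMAR)))
    print(sorted(recursive_symbols("VersionInfo", GRAMMAR)))
```

An empty result means every non-terminal can be substituted away, so the
symbol denotes a regular language; a non-empty result names the productions
to look at more closely.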
Consider the examples above. The language defined by using the XML 1.0
grammar with 'doctypedecl' as start symbol (I'll just call this 'the
language of doctypedecl' or 'the language denoted by doctypedecl' in
what follows) is not regular, because a doctypedecl can contain an
internal subset (intSubset), which can contain element declarations
(via markupdecl and elementdecl), which can contain content models for
elements with element content (via contentspec and children). Such
content models are not regular, because they require that opening and
closing parentheses match; there is indirect recursion in both choice
and seq, through cp. (Content models for mixed content are regular
because they can't have nested groups.)
So 'doctypedecl' is spelled with an initial lowercase letter.
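To make the recursion through cp concrete, here is a minimal recursive-descent
sketch in Python for a simplified version of the 'children' production. It is
only an illustration under stated assumptions: whitespace (S) is not accepted,
Name is restricted to an ASCII approximation, and the function names are
invented for this example. The point is that parse_cp and parse_group call
each other, mirroring the way cp refers to choice and seq and they refer back
to cp.

```python
import re

# ASCII-only approximation of the Name production (an assumption of this sketch).
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")


def parse_cp(text, pos):
    """cp ::= (Name | choice | seq) ('?' | '*' | '+')?  Returns the new position."""
    if pos < len(text) and text[pos] == "(":
        pos = parse_group(text, pos)  # recursion: cp -> choice/seq
    else:
        match = NAME.match(text, pos)
        if not match:
            raise ValueError(f"expected Name or '(' at offset {pos}")
        pos = match.end()
    if pos < len(text) and text[pos] in "?*+":
        pos += 1
    return pos


def parse_group(text, pos):
    """choice or seq: '(' cp (sep cp)* ')' with a single kind of separator."""
    pos = parse_cp(text, pos + 1)  # skip '(' ; recursion: choice/seq -> cp
    separator = None
    while pos < len(text) and text[pos] in "|,":
        if separator is None:
            separator = text[pos]
        elif text[pos] != separator:
            raise ValueError(f"mixed separators at offset {pos}")
        pos = parse_cp(text, pos + 1)
    if pos >= len(text) or text[pos] != ")":
        raise ValueError(f"expected ')' at offset {pos}")
    return pos + 1


def is_children(text):
    """children ::= (choice | seq) ('?' | '*' | '+')?"""
    if not text.startswith("("):
        return False
    try:
        pos = parse_group(text, 0)
    except ValueError:
        return False
    if pos < len(text) and text[pos] in "?*+":
        pos += 1
    return pos == len(text)
```

For example, is_children("(a,(b|c)*,d?)") accepts the nested group, while
is_children("(a,(b|c)") rejects the unbalanced one. A recognizer without
recursion (or an equivalent counter) cannot keep track of arbitrarily deep
nesting.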
'VersionInfo', by contrast, has an initial uppercase because it
denotes a regular language: it can be written
[24] VersionInfo ::= (#x20 | #x9 | #xD | #xA)+ 'version'
                     ((#x20 | #x9 | #xD | #xA)+)? '='
                     ((#x20 | #x9 | #xD | #xA)+)?
                     ("'" '1.0' "'" | '"' '1.0' '"')
which has no non-terminals on the right-hand side.
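Because the expanded rule contains only terminals, it can be transcribed
directly into a regular expression. The Python sketch below does so by hand,
assuming the expansion shown above (with VersionNum fixed at '1.0'); the
variable names are chosen for this illustration only.

```python
import re

# S ::= (#x20 | #x9 | #xD | #xA)+
S = r"[\x20\x09\x0D\x0A]+"
# Eq ::= S? '=' S?
EQ = "(?:" + S + ")?=(?:" + S + ")?"
# VersionNum ::= '1.0' (as in the expansion above)
VERSION_NUM = r"1\.0"

# VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
VERSION_INFO = re.compile(
    S + "version" + EQ + "(?:'" + VERSION_NUM + "'|\"" + VERSION_NUM + "\")"
)

if __name__ == "__main__":
    for candidate in (" version='1.0'", '\tversion = "1.0"', "version='1.0'"):
        print(repr(candidate), bool(VERSION_INFO.fullmatch(candidate)))
```

The last candidate is rejected because VersionInfo requires leading white
space. No comparable transcription exists for doctypedecl, which is the
practical content of the upper-case/lower-case distinction.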
That may not be the 'why' you had in mind, though.
The distinction between regular and non-regular non-terminals was the
result of a compromise. Someone (I'll leave the protagonists
anonymous) proposed that it would be easier to see how to write an XML
parser if we distinguished the lexical level and the grammar level
explicitly, so that interested parties could see at a glance where one
might plausibly draw the line between a lexer and a parser. Even if
an implementor later decided to move that line, it would be convenient
to have an initial suggestion. Someone else objected that different
implementors might choose to draw the lexer/parser line in different
places, and that trying to prescribe it, or even making a specific
suggestion, was a waste of time. In the end, we agreed to distinguish
regular from non-regular sublanguages, on the theory that conventional
lexers typically recognize only regular languages. The initial
capital letter effectively says "If you want to, you can conveniently
treat this non-terminal as a terminal symbol recognized by the lexer";
perhaps even more important, the initial lowercase letter says "If you
were thinking of treating this as a terminal symbol, using a
conventional lexer design, then forget it".
I gather that when XPath 1.0 was done, the XSL WG had no one who
thought that this was a worthwhile way to help implementors or
readers. Myself, I find it helpful but not essential.