How can I handle languages where keywords can also be identifiers?

> From: kearneym@piercom.ie
>
>
> How can I handle the strings which can be token and identifier at the
> same time.
> I have been writting the parser for the SQL,
> and just found out that SQL has no reserved word
> which meens that keywords can be used as ordinary identifiers.
>
> The first way that come acrros in my head is to use wild card
> like below.
>
> dropStatement
> : "drop" ("alias" | "event" "monitor") .
> ;
>
>
> However, if I replace all the IDENTIFIER with wild card,
> it will create ambignity warnings.

Of course you realize that this is genuinely ambiguous, right? There are
two issues involved in your approach:

Understanding the reasons for the ambiguities enough to know that your rules work the way they should.

Understanding the code generated by antlr and knowing why it is generated and how it works.

#2 is actually easier because antlr's code is very readable and easy to
follow, except for maybe the tree building stuff. #1 requires understanding
lookahead and how the parser makes decisions about which alternative to
take. This is explained in any text on parsing that covers LL(k) parsing,
and more details are in Ter's thesis available from antlr.org.

I wrote a parser for AREV R/BASIC which had no keywords either. It had tons
of ambiguity warnings but worked fine. Antlr can do it with some help.
Note that the wildcard approach is a little too broad. You don't actually
want to accept numbers or delimiters, etc., only literals.

Here's what I would suggest, and I'd love to hear if it works right or not
and help refine the approach. Basically keep track of which tokens came
from the ID rule and use that in the parser on ambiguous constructs. So:

Create your own CommonToken subclass that has an attribute called
"possibleID". In your lexer define an instance variable called possibleID.
Then when you create a literal like so:

I use possibleID but it's not the only way to go, you could also
manually write a rule that contains _all_ of the literals in it. This is
actually what I did for AREV but it was a real headache.

Antlr, unlike PCCTS 1.3x, does not "hoist" syntactic or semantic
predicates. That means that you can't isolate that part into a rule like
this:

id_or_keyword:
( ID
| ( ~(ID) )=>{ LT(1).possibleID() }? .
)
;

The lookahead for this rule would be any token and give you the same
ambiguity problems as using the wildcard operator ".".

Once we know that the token is not an ID and that it is a possibleID then
we know we are safe to match it with the wildcard operator. Numbers or
delimiters or what-have-you will still throw exceptions if they come after
your "drop" "alias" tokens.

I would truly love to come up with a sane, methodical approach to handling
languages without reserved keywords. Let me know how this works so we can
refine it.