Functional requirements:

F-2: Hints must be specified after the SELECT|INSERT|REPLACE|UPDATE|DELETE keywords.

F-3: EXPLAIN must understand hints.

F-4: System identifiers must be visible in EXPLAIN FORMAT=JSON.

Non-Functional requirements:

NONE

Gleb has implemented 4 alternative prototypes for parsing hints.

The basic approaches of the four prototypes are as follows:

Alternative 1: a set of parser rules that creates a list of hints which is forwarded to the hint processor. Each hint consists of a hint string (name) and a list of parameters. The parser does not need to be changed when new hints are added.

Implementation details: the main parser grammar (sql_yacc.yy) has only one generic rule for all hints: it doesn't care about individual hint syntaxes; it is the business of the underlying hint processor (C++) to recognize and validate them all.

good: minimal lexical scanner (sql_lex.cc) changes: it only has to recognize additional tokens between the "/*+" and the following "*/" strings and pass them to the main parser.

good: minimal changes in the main grammar:

we add only one generic rule for all kinds of future hints at the very beginning and never change it again,

we insert references to that generic hint rule after each SELECT_SYM, INSERT, REPLACE, UPDATE_SYM and DELETE_SYM token entry in the grammar.

bad: the results of hint parsing are very generic: the C++ code has to recognize and validate every hint from a generic parse tree -- too much work, easy to bloat the C++ code.

Alternative 2: parser rules for individual hints. Each hint type is handled by a separate contextualize function. New rules will have to be added for each hint.

Implementation details: each individual hint extends the main grammar (sql_yacc.yy).

good: a direct and easy way to add detailed rules for new optimizer hints to the current parser grammar (sql_yacc.yy).

bad: detailed hint grammar bloats the main grammar (sql_yacc.yy).

Alternative 3: a single parser rule which forwards the entire hint as a single string to the hint processor. No parser change is needed when hints are added, but the hint processor will have to implement a lexer/parser for the hint string.

good: minimal changes of the lexical scanner (sql_lex.cc) as before.

good: minimal changes in the main grammar (sql_yacc.yy) -- recognize a special optional HINT_SYM token after every SELECT_SYM, INSERT, REPLACE, UPDATE_SYM and DELETE_SYM entry in the grammar. Also see the 4th alternative: we can do the same and live without that HINT_SYM token (i.e. no grammar changes at all).

bad: it becomes the optimizer's business to parse the hint string byte by byte.

bad or acceptable: we "parse" the hint string twice:

the 1st pass with the main lexical scanner to extract the string inside the /*+ ... */ commentary,

the 2nd pass with some future hint string parser inside the optimizer.

Alternative 4: the lexical scanner invokes a separate hint parser and passes the resulting hint parse tree to the main grammar as a token value.

good: no main grammar (sql_yacc.yy) changes at all: we pass the hint parse tree from the lexical scanner to the main grammar as a token value of the SELECT_SYM, INSERT, REPLACE, UPDATE_SYM and DELETE_SYM tokens.

good: hint parser grammar is a separate Bison file (sql_hints.yy):

easy to extend,

a detailed strict grammar for each of the hints (as in the 2nd alternative),

no additional overhead to the main grammar parser.

good: the current implementation doesn't parse the hint commentary internals twice; the main lexical scanner (sql_lex.cc) just passes control to the specialized hint lexical scanner (sql_lex_hints.{h,cc}).

bad: the current implementation requires some overhead to look up the "/*+" string, but if and only if that string follows:

a SELECT, INSERT, REPLACE, UPDATE or DELETE keyword at the beginning of the query, or

the SELECT keyword at the beginning of a query block (i.e. the SELECT_SYM token after the "(" symbol or after the UNION_SYM token).

Decision

The 4th alternative seems to be the best one:

It offers a way to develop the hint grammar separately, in a simple declarative way (a separate Bison grammar for hints).

Additional parsing overhead looks low or negligible.

The new separate lexical scanner requires a way to recognize a new separate set of keywords (hint names). The current lexical scanner uses a generated hash function (produced by the gen_lex_hash.cc tool) that is hardcoded for use with the main sql_lex.cc scanner and sql_yacc.yy grammar. The 4th alternative gives a chance to refactor the old and unstructured gen_lex_hash.cc infrastructure to produce a few separate hash functions for separate scanners/parsers.

New parse tree node classes

The hint parser uses descendants of the same Parse_tree_node class for its parse tree nodes, like the main parser does.
To apply the semantics of a parsed hint commentary to the current LEX, SELECT_LEX etc. objects, override the virtual Parse_tree_node::contextualize() function.

The PT_hint_list class object represents the whole hint commentary parse tree (PT_hint_list is ready to use).
The PT_hint_list::hints contains a list of individual hints of a particular hint commentary string.
Each hint parse tree node class is a descendant of the common PT_hint class.

The current patch completely implements the MAX_EXECUTION_TIME hint support (see PT_hint_max_execution_time for reference).
Other hints from the base WL#3996 have dummy implementations (we need to override each contextualize() function to complete them).
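The contextualize() pattern described above can be sketched with heavily reduced stand-in classes (Parse_tree_node, PT_hint, PT_hint_list and the context type below are simplified illustrations, not the real MySQL definitions):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Minimal stand-in for the parse context (the real code applies hint
// semantics to LEX/SELECT_LEX objects).
struct Fake_context {
  unsigned max_execution_time_ms = 0;
};

// Simplified Parse_tree_node: contextualize() applies node semantics.
class Parse_tree_node {
 public:
  virtual ~Parse_tree_node() = default;
  virtual bool contextualize(Fake_context *pc) = 0;  // true on error
};

// Common base class for all hint parse tree nodes.
class PT_hint : public Parse_tree_node {};

// A hint node carrying one numeric argument, as MAX_EXECUTION_TIME does.
class PT_hint_max_execution_time : public PT_hint {
  unsigned milliseconds_;

 public:
  explicit PT_hint_max_execution_time(unsigned ms) : milliseconds_(ms) {}
  bool contextualize(Fake_context *pc) override {
    pc->max_execution_time_ms = milliseconds_;  // apply hint semantics
    return false;
  }
};

// PT_hint_list represents one whole hint commentary: it holds the
// individual hints and contextualizes each of them in turn.
class PT_hint_list : public PT_hint {
  std::vector<std::unique_ptr<PT_hint>> hints_;

 public:
  void push_back(std::unique_ptr<PT_hint> h) { hints_.push_back(std::move(h)); }
  bool contextualize(Fake_context *pc) override {
    for (auto &h : hints_)
      if (h->contextualize(pc)) return true;
    return false;
  }
};
```

In the real code the context argument wraps LEX/SELECT_LEX; the Fake_context stand-in only demonstrates the flow from the hint list down to each hint node.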

Hints and their parse tree node classes:

MAX_EXECUTION_TIME: PT_hint_max_execution_time

DEBUG_HINT1 (temporary): PT_hint_debug1

DEBUG_HINT2 (temporary): PT_hint_debug2

Hint parse tree usage in the main parser

SELECT statement and query blocks

We pass the resulting hint parse tree as a first or second parameter to parse
tree nodes of the main parser to combine hint trees and the main tree into a
single big parse tree. Then we call contextualize() on the tree root to apply
hint semantics to the LEX/SELECT_LEX object.

Note: in the case of a *syntax* error (see ER_WARN_OPTIMIZER_HINT_SYNTAX_ERROR) the parser outputs a warning and skips the rest of the hints in the hint commentary.
For other hint errors (like ER_WARN_BAD_MAX_EXECUTION_TIME) we just skip the malformed hint and process the rest.

Refactoring (partial) of the main lexical scanner

For some historical reasons the old parser uses a hand-written lexical scanner.
A list of those reasons (probably incomplete):

"Lexer hacks": in some cases we change the lexical scanner's state from the parser (this is a subject for further refactoring).

Character encodings that are not easy to parse with a generated lexical scanner (especially the 1-, 2- and 4-byte characters of GB18030, which we support in the lexer, and the 2-byte UCS-2 characters, which we still don't support in the lexical scanner).

That lexical scanner uses the my_lex_states enum to classify characters of the supported character sets.
For some other historical reasons that enum is declared in the public m_ctype.h file. Placing that lexer-internal data there is not a good idea, especially if we create new similar lexers with their own character classifications. So we move the my_lex_states enum declaration to the new sql_chars.h file.

The current CHARSET_INFO structure refers to a 256-byte array with a table of "states" for the main lexical scanner (my_lex_states enum values).

It seems that the reasons to keep the lexer's internal data in global CHARSET_INFO structures are simplicity and performance. Usually we access CHARSET_INFO from the parser/lexer either by searching by charset name (which is expensive) or by CHARSET_INFO pointer (but not by the charset number in the charset array or the like). If we kept that "state" array pointer outside the CHARSET_INFO structure, we would need an additional search by charset name to access it.

Thus it was decided to keep a pointer to the lexer's internal data in the CHARSET_INFO structure, but the initialization of that data has been moved from the mysys library to the lex_init() function. Also, the CHARSET_INFO::states field has been extended to store additional lexer information (the hint lexer's character classifications). Note: the binary layout of the CHARSET_INFO structure itself is unchanged.
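The role of such a per-charset "states" table can be illustrated with a reduced sketch (the enum values and table layout here are invented for illustration; the real my_lex_states enum and CHARSET_INFO layout differ):

```cpp
#include <array>
#include <cassert>
#include <cctype>

// Invented, reduced character classes; the real my_lex_states enum is larger.
enum class Char_class { IDENT, DIGIT, SPACE, OTHER };

// Build a 256-entry classification table once, like the per-charset
// "states" array referenced from CHARSET_INFO.
std::array<Char_class, 256> build_state_table() {
  std::array<Char_class, 256> t{};
  for (int c = 0; c < 256; ++c) {
    if (std::isdigit(c))
      t[c] = Char_class::DIGIT;
    else if (std::isalpha(c) || c == '_' || c == '$')
      t[c] = Char_class::IDENT;
    else if (std::isspace(c))
      t[c] = Char_class::SPACE;
    else
      t[c] = Char_class::OTHER;
  }
  return t;
}

// The lexer then classifies each input byte with a single table lookup,
// which is why keeping the table reachable from CHARSET_INFO is cheap.
Char_class classify(const std::array<Char_class, 256> &t, unsigned char c) {
  return t[c];
}
```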

Refactoring of the hash function generator

The existing lexical scanner uses a generated hash function to recognize keywords and internal function names. For historical reasons we use a hand-written hash function generator: gen_lex_hash.cc.
That generator is tied to the main lexical scanner and parser code, i.e. it can't generate more than 2 hash functions by design (one for reserved words and one for internal function names + reserved words).

Since we introduce a new parser with a separate set of reserved words, we have to recognize these reserved words in some way.
The current hash function generator produces fast hash functions that seem to outperform some industry-standard generators, so it would be nice to reuse it for new parsers.
Thus the WL introduces a way to refactor and extend gen_lex_hash.cc functionality for that.

First, we move the static parts of the generated code from gen_lex_hash.cc string constants and from sql_lex.cc to the new sql_lex_hash.{h,cc} files.

Then, we merge the "symbol" arrays in the lex.h file into one for simplicity/performance: in any hash function we need just a plain index into the "symbols" array to locate the SYMBOL object. Actually, we could have a growing number of SYMBOL arrays there, but note that our sets of SYMBOL objects intersect, so it is logical to save them all in a single array and mark the different sets with the SYMBOL::group bitmap.
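The single-array-plus-bitmap idea can be sketched as follows (the SYMBOL layout, the group flags and the keywords used are simplified illustrations, not the real lex.h definitions):

```cpp
#include <cassert>
#include <cstring>

// Invented group bits; the real SYMBOL::group values differ.
enum Symbol_group : unsigned { GROUP_MAIN = 1u << 0, GROUP_HINT = 1u << 1 };

struct Symbol {
  const char *name;
  unsigned group;  // bitmap: which scanners recognize this keyword
};

// One shared array; intersecting keyword sets are marked via the bitmap.
static const Symbol symbols[] = {
    {"SELECT", GROUP_MAIN},
    {"MAX_EXECUTION_TIME", GROUP_HINT},
    {"FORCE", GROUP_MAIN | GROUP_HINT},  // hypothetical shared keyword
};

// Linear lookup for illustration only; the generated hash function
// replaces this loop in the real code.
const Symbol *find_symbol(const char *name, unsigned group) {
  for (const Symbol &s : symbols)
    if ((s.group & group) && std::strcmp(s.name, name) == 0) return &s;
  return nullptr;
}
```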

Finally, we refactor and extend the rest of gen_lex_hash.cc to be more structured, readable and flexible, able to produce any number of hash functions on demand.

How to add a new hint

Note: please see the existing implementations of the MAX_EXECUTION_TIME, DEBUG_HINT1 and
DEBUG_HINT2 hints for reference.

Also, there are the hint_param_index_list and opt_hint_param_index_list rules for mandatory and optional index lists, respectively.

To add a numeric parameter to a hint, use the HINT_ARG_NUMBER token: it returns a LEX_CSTRING with the source text of a positive decimal integer. If the resulting value in the input query is not suitable for your hint (an out-of-range error), don't abort the parser with YYABORT: output a warning and return NULL from the hint rule instead.
For example, see the max_execution_time_hint rule.

Predefined tokens

HINT_ARG_NUMBER

HINT_ARG_NUMBER ::= [0-9]+

This token returns positive integers in textual form (LEX_CSTRING).
We have to check ranges and convert the token's value on the parser's side.
See the max_execution_time_hint grammar rule for reference.
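The check-and-warn pattern for HINT_ARG_NUMBER can be sketched like this (LEX_CSTRING is reduced to a simple struct, the warning is a stub, and the function name is invented; the real rule lives in sql_hints.yy):

```cpp
#include <cassert>
#include <cstddef>

// Reduced stand-in for LEX_CSTRING.
struct Lex_cstring {
  const char *str;
  size_t length;
};

static int warnings_issued = 0;
static void push_warning(const char *) { ++warnings_issued; }  // stub

// Convert the textual token value (assumed to be decimal digits only,
// per the HINT_ARG_NUMBER definition), warning instead of aborting on
// overflow. Returns false and sets *out on success; true on a bad value,
// in which case the caller returns NULL from the rule (no YYABORT).
bool hint_number_from_token(const Lex_cstring &tok, unsigned long max,
                            unsigned long *out) {
  unsigned long value = 0;
  for (size_t i = 0; i < tok.length; ++i) {
    unsigned long digit = (unsigned long)(tok.str[i] - '0');
    if (value > (max - digit) / 10) {  // value * 10 + digit would exceed max
      push_warning("out-of-range hint value");
      return true;
    }
    value = value * 10 + digit;
  }
  *out = value;
  return false;
}
```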

HINT_ARG_IDENT

This token's scanner is designed to be compatible with main parser's rules for
identifier scanning.

Identifier formats:

Simple identifier:

starts with "_", "$" or any alphabetical character,

continues with 0 or more "_" or "$" signs, alphabetical characters or decimal digits.

Identifier with a numeric prefix:

starts with a decimal digit,

continues with 1 or more "_" or "$" signs, alphabetical characters or decimal digits.

Back-quoted identifier:

starts with the back-quote (`) separator sign,

continues with 1 or more valid characters excluding "\0",

duplicated separators (``) convert into a single ` symbol (not a separator),

ends with the back-quote (`) separator sign.

Double-quoted identifier:

starts with the double-quote (") separator sign,

continues with 1 or more valid characters excluding "\0",

duplicated separators ("") convert into a single " symbol (not a separator),

ends with the double-quote (") separator sign.
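An ASCII-only sketch of the first two identifier formats (the real scanner is charset-aware per Notes 1 and 2 below; this simplified classifier and its function names are for illustration only):

```cpp
#include <cassert>
#include <cctype>
#include <string>

static bool is_ident_start(unsigned char c) {
  return std::isalpha(c) || c == '_' || c == '$';
}
static bool is_ident_char(unsigned char c) {
  return std::isalnum(c) || c == '_' || c == '$';
}

// Simple identifier: an identifier start character, then 0 or more
// identifier characters.
bool is_simple_ident(const std::string &s) {
  if (s.empty() || !is_ident_start((unsigned char)s[0])) return false;
  for (size_t i = 1; i < s.size(); ++i)
    if (!is_ident_char((unsigned char)s[i])) return false;
  return true;
}

// Identifier with a numeric prefix: decimal digits first, but at least
// one non-digit identifier character must follow, otherwise the token
// would be scanned as a plain number (HINT_ARG_NUMBER).
bool is_numeric_prefix_ident(const std::string &s) {
  size_t i = 0;
  while (i < s.size() && std::isdigit((unsigned char)s[i])) ++i;
  if (i == 0 || i == s.size()) return false;  // no digit prefix / all digits
  for (size_t j = i; j < s.size(); ++j)
    if (!is_ident_char((unsigned char)s[j])) return false;
  return true;
}
```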

Note 1: The "alphabetic" character set includes:

national alphabetic characters of the current character set,

valid multibyte characters (if any) of the current character set.

Note 2: The "valid" character set includes:

for single-byte character sets: any character excluding "\0";

for multi-byte character sets: any valid single- or multi-byte character excluding "\0".

Note 3: Double-quote sign is an identifier separator if and only if the
ANSI_QUOTES flag is on (see the SQL_MODE system variable).

Note 5: From the SQL parser's "point of view" our hint expressions are regular
commentaries. Thus, the "*/" sequence must close the hint expression list, even if
the "*/" sequence occurs inside a quoted identifier!

So, the

SELECT /*+ HINT(`tricky hint name*/`) */ 1

query will fail with a syntax error at "`) */ 1".

OTOH the

SELECT /*+ HINT(`tricky hint name*/`) */`.* FROM t1 `) */`

is a valid query that returns the columns of t1 (the `) */` sequence is a tricky alias
of t1). In that case the hint scanner will warn about an invalid hint and return the
HINT_ERROR token to the hint parser.

HINT_ARG_QB_NAME

This token is a HINT_ARG_IDENT with the "@" prefix sign (whitespace characters
between the "@" prefix and the identifier suffix aren't allowed).

The lexical scanner recognizes the HINT_ARG_QB_NAME token if and only if:

The "@" sign occurs right after the previous HINT_ARG_IDENT token (whitespace
separators between the previous identifier and the "@" sign are not allowed):

ident@query_block_name

`tricky ident`@query_block_name,

ident@`select#1`,

`tricky ident`@`select#1`,

123tricky$ident@012tricky_query_block_name etc.

The "@" sign occurs after the "(" sign (whitespace separators are ok):

/*+ HINT(@query_block_name ... ) */,

/*+ HINT( @query_block_name ... ) */

/*+ HINT( @"select#1" ... ) */ -- if ANSI_QUOTES flag is set,

/*+ HINT( @```tricky name``` ... ) */ etc.

The HINT_ARG_QB_NAME token's scan process depends on the context.

HINT_CLOSE and HINT_ERROR

These tokens are for use in the root start rule only.

Please don't reference them in hint grammar rules.

The HINT_CLOSE token represents the "*/" sequence in the parser grammar: it's
needed since we don't scan the hint "commentary" string twice (the main lexer
passes control to the hint scanner/parser on the "/*+" sequence, and the hint
parser/lexer returns control on the "*/" sequence or at the end of the query).

The HINT_ERROR token represents scanner errors:

OOM,

invalid characters,

unexpected end of query string,

empty quoted identifier ("" if ANSI_QUOTES is on or ``).

How to add a new semantic value type

Since we have to maintain compatibility with the old Bison 2.1 release, it is problematic to have a separate semantic value type (%union) per grammar: instead, we share the same YYSTYPE union (see sql_lex.h) between the main sql_yacc.yy grammar and the new sql_hints.yy grammar.

sql_hints.yy: add new terminals or non-terminals with the sample_type type:

+ %type <sample_type> SAMPLE_TOKEN sample_rule
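A %type declaration with a type tag needs a matching member in the shared union; the union side of such a change might look like the following sketch (the member name and Sample_type are hypothetical, not the actual patch):

```c
/* sql_lex.h: the shared semantic value union gets one new member. */
union YYSTYPE {
  /* ... existing members used by sql_yacc.yy rules ... */
  struct Sample_type *sample_type;  /* hypothetical member for SAMPLE_TOKEN */
};
```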

Data flow

Old parser

In the original server, we activate the parser with a MYSQLparse() function call. The MYSQLparse() function pulls lexical data from the lexical scanner token by token with the help of the MYSQLlex() function.

New parser

In the new server we activate the main parser with the MYSQLparse() function as before. The main parser pulls tokens from the main lexical scanner via repeated MYSQLlex() calls as well.

Additionally, the main lexical scanner now looks for the "/*+" string and, under certain conditions, may pass control to the secondary parser (the HINT_PARSER_parse() function). With the help of the secondary lexical scanner (the HINT_PARSER_lex() function) it pulls the hint string token by token and returns control to the main lexical scanner and the main parser at the end of the hint commentary string.

When the secondary (hint) parser returns control to the main lexical scanner, the lexical scanner associates the resulting hint parse tree with the value of the SELECT_SYM, INSERT, REPLACE, UPDATE_SYM or DELETE_SYM token and returns it that way to the main parser.
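The lookup step can be sketched as a reduction (for illustration this function locates the closing "*/" up front, which the real scanner avoids by handing control directly to the hint lexer; the function itself is invented):

```cpp
#include <cassert>
#include <string>

// After a SELECT/INSERT/REPLACE/UPDATE/DELETE keyword, check whether a
// hint commentary opens here; if so, report the offsets of its body.
// Returns true when a "/*+ ... */" span was found at 'pos' (skipping
// leading spaces), false otherwise.
bool detect_hint_comment(const std::string &query, size_t pos,
                         size_t *body_begin, size_t *body_end) {
  while (pos < query.size() && query[pos] == ' ') ++pos;  // skip blanks
  if (query.compare(pos, 3, "/*+") != 0) return false;
  size_t close = query.find("*/", pos + 3);
  if (close == std::string::npos) return false;  // unexpected end of query
  *body_begin = pos + 3;  // first byte after "/*+"
  *body_end = close;      // first byte of "*/"
  return true;
}
```

In the real server the hint scanner consumes that span token by token and the resulting parse tree travels back as the keyword token's value, as described above.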