WL#3464: Add replication event to denote gap in replication

RATIONALE
---------
Needed to solve BUG#21494.

DESCRIPTION
-----------
Introduce a way to denote that there has been an outage and that there is a gap
in the replication stream, typically either by adding a flag to an existing
event or by introducing a new event. The indication can be used to force
resynchronization, but an initial implementation will just stop the slave,
indicating that the master and the slave are out of sync and need
resynchronization.

We want the log after the gap to be "clean": after the gap we should have a
log file that can be used directly with an initial snapshot, without needing
to execute SKIP EVENT or a similar command.

REQUIREMENT
-----------
An incident is something that occurs and that affects the contents of the
database but which cannot easily be represented as a set of changes. Examples of
incidents are: server crashes, database resynchronization, (some) software
updates, and (some) hardware changes.
R1. It shall be possible to record an incident in the binary log.
R2. It shall be possible to record an incident through the injector interface.
R3.1. When seeing a known incident, the slave shall stop with an indication
of the incident that occurred. Other ways of handling known incidents might
be implemented in the future, but in the current implementation the slave
will stop.
R3.2. For *unknown* incident codes, the slave shall *always* stop.
R4. It shall be possible for the DBA to start replication from just
after the incident once the appropriate actions have been taken
to handle the incident.

DECISIONS REGARDING THIS WL
---------------------------
- Elliot Murphy, Brian Aker, Calvin Sun and Trudy Pelzer
agreed on 28 Aug 2006 that this task is needed for 5.1.
(Reported by Trudy)
- Lars and Mats discussed this by email and we will most likely
simply insert a Format_description_log_event instead of
the gap event, possibly after rotating the log. There is no
need to change the interface that is presented to NDB,
so the Cluster team will not have to make any changes to their part.
- Lars and Mats discussed this again on 2007-01-25 and realized that a
separate event might be best.

BUG SCENARIOS
-------------
We need a solution that fixes all of the following problems:
1. An NDB node crashes and SUMA produces a gap in the cluster replication
log sent to the event interface in the MySQL server.
2. The MySQL server crashes and the event queue is lost. When the MySQL
server is restarted it receives new events from SUMA, but it has lost
the events that were stored in its event queue.
3. Network outage. The replication log is lost on the connection
between the NDB nodes and the MySQL server.

POSSIBLE SOLUTIONS
------------------
1. The slave stops with an error message when it receives GAP information
from the master. The user needs to take some special action to restart
replication, e.g. skip the event or issue START SLAVE WITH INITIAL START
(a new command).
2. The slave stops with an error message but automatically skips the gap. A
new START SLAVE command resumes replication without further
user action. Both Lars and Mats think this is a bad solution because
it makes it too easy for the DBA to make mistakes.

SUGGESTED SOLUTION
------------------
If a gap error occurs, the user needs to execute SKIP EVENT.

IMPLEMENTATION
--------------
Possibilities:
I1. GAP info in separate event
I2. GAP info in rotate_event
I3. GAP info in format_description_log_event

Mats and Lars think that an "INCIDENT" event followed by a log rotation event
seems like the best way to indicate a gap. Then we get a fresh log
after the gap (R1) and we don't need to skip rotate events (I2).

The incident event should contain:
1. A number saying what incident it conveys. For this case we think
it should be "LOST_EVENTS".
2. An optional string with information about the nature of the
incident. For this case it can be "MySQL Server crashed", "Cluster
did not provide continuous replication log", or "Network outage between
NDB node and MySQL server". (It is up to the Cluster team to provide
proper messages.)
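
The two fields listed above can be illustrated with a small sketch. It is
written in Python for brevity and is not the server implementation. The byte
layout (a 2-byte incident number, a 1-byte message length, then the message
bytes), the numeric value of LOST_EVENTS, and the names Incident, IncidentEvent,
encode and decode are all assumptions made for illustration; this worklog does
not define them.

```python
import struct
from dataclasses import dataclass
from enum import IntEnum


class Incident(IntEnum):
    # Numeric values are assumptions made for this sketch.
    NONE = 0
    LOST_EVENTS = 1


@dataclass(frozen=True)
class IncidentEvent:
    incident: int       # number saying which incident is conveyed
    message: str = ""   # optional text describing the nature of the incident

    def encode(self) -> bytes:
        text = self.message.encode("utf-8")
        if len(text) > 255:
            raise ValueError("message does not fit in a one-byte length field")
        # Hypothetical layout: <u16 incident> <u8 length> <message bytes>
        return struct.pack("<HB", self.incident, len(text)) + text

    @classmethod
    def decode(cls, payload: bytes) -> "IncidentEvent":
        incident, length = struct.unpack_from("<HB", payload, 0)
        text = payload[3:3 + length]
        if len(text) != length:
            raise ValueError("truncated incident message")
        return cls(incident, text.decode("utf-8"))


# Round trip: what the master records is what the slave reads back.
event = IncidentEvent(Incident.LOST_EVENTS, "MySQL Server crashed")
assert IncidentEvent.decode(event.encode()) == event
```

Keeping the message optional means an incident can still be recorded when the
reporting component has nothing more specific to say than the incident number.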

# Behaviour
When the slave encounters a gap in the replication log, it shall stop with an
error message. The error message will be available in the output of SHOW SLAVE
STATUS.
The format of the message is::

    The incident *symbol* occurred on the master. Message: *message*

where *symbol* is a symbol denoting the incident (the constant name without the
INCIDENT_ prefix) and *message* is the message supplied with the incident.
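
The behaviour described above, together with R3.1 and R3.2, can be sketched as
follows. This is an illustrative Python model, not server code: the Incident
enumeration and its numbering, the SlaveStopped exception, the function names,
and the placeholder text used for unknown codes are all assumptions of this
sketch.

```python
from enum import IntEnum


class Incident(IntEnum):
    # Hypothetical numbering; only LOST_EVENTS is named in this worklog.
    INCIDENT_LOST_EVENTS = 1


class SlaveStopped(Exception):
    """Models the SQL thread stopping with an error message."""


def incident_symbol(code: int) -> str:
    """Return the constant name without the INCIDENT_ prefix."""
    try:
        name = Incident(code).name
    except ValueError:
        # Unknown incident code: there is no symbol to show. The
        # placeholder text is an assumption of this sketch.
        return f"<unknown incident {code}>"
    return name[len("INCIDENT_"):]


def apply_incident(code: int, message: str) -> None:
    """Known and unknown incidents alike stop the slave (R3.1, R3.2)."""
    symbol = incident_symbol(code)
    raise SlaveStopped(
        f"The incident {symbol} occurred on the master. Message: {message}"
    )


try:
    apply_incident(1, "MySQL Server crashed")
except SlaveStopped as err:
    print(err)
```

In this model the only difference between a known and an unknown code is the
symbol shown in the message; both paths end with the slave stopped, so a slave
that does not recognise a code sent by a newer master can never continue past
it silently.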
# Administration
If the slave stops at an incident, the output of SHOW SLAVE STATUS will indicate
that the SQL thread has stopped due to an incident registered in the replication
stream.
In order to restart the slave, the DBA has to issue the following commands::

    slave> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
    slave> START SLAVE;
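
The effect of the two commands can be pictured with a toy model of the SQL
thread, written in Python. It is a simplification under stated assumptions:
events are plain (kind, payload) tuples, the skip counter is treated as a
count of individual events, and run_sql_thread is a hypothetical name. It only
illustrates why a counter of 1 lets the slave step over the incident event and
continue with the events recorded after it.

```python
def run_sql_thread(events, start=0, skip_counter=0):
    """Apply events from index `start`.

    Returns (applied payloads, next position, stopped_at_incident).
    """
    applied = []
    pos = start
    while pos < len(events):
        kind, payload = events[pos]
        if skip_counter > 0:
            skip_counter -= 1           # event is skipped, not applied
        elif kind == "incident":
            return applied, pos, True   # stop at the incident event
        else:
            applied.append(payload)
        pos += 1
    return applied, pos, False


log = [("row", "a"), ("incident", "LOST_EVENTS"), ("row", "b")]

# First run: the slave applies "a" and stops at the incident.
applied, pos, stopped = run_sql_thread(log)
assert (applied, pos, stopped) == (["a"], 1, True)

# After the DBA has handled the incident: skip one event and restart.
applied, pos, stopped = run_sql_thread(log, start=pos, skip_counter=1)
assert (applied, pos, stopped) == (["b"], 3, False)
```

Skipping the event does not repair the data lost in the gap. As R4 states,
the restart is only meaningful once the appropriate actions have been taken to
handle the incident.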