Slave replication fails when RDB loading takes a long time and hz is set to 1 #1535


Tested on 2.8.4
When hz is set to 1 and loading the RDB at the final stage of slave replication takes longer than repl-timeout, the slave fails with: "MASTER timeout: no data nor PING received..."
The failure isn't because the master disconnects the repl link, so I know this isn't #1449.

What seems to be happening is that while rdbLoad runs, the server cron isn't updating unixtime. So we're left with a very old cached unixtime when we create the master client object (server.master = createClient()). This causes the lastinteraction value in the client object to be very old.
At this point there's a race: a call to replicationCron will kill the replication link because of the old lastinteraction value, but reception of new data (or a ping) from the master will update lastinteraction to a decent value. I guess both events are pending in the event loop.

For some reason, when hz is set to 1, replicationCron always wins the race and kills my replication link.

This is a bit complex, so I might be missing something. Assuming this analysis is correct, I'd suggest something like updating server.unixtime after calling rdbLoad in the replication code.

server.unixtime and server.mstime are cached, less precise timestamps
that we use whenever we don't need an accurate time representation and
a syscall would be too slow for the number of calls we require.
One such example is the initialization and update of the last
interaction time with the client, which is used for timeouts.
However, rdbLoad() can take some time to load the DB, and it did not
update the cached time during DB loading. This resulted in the bug
described in issue #1535, where in the replication process the slave
loads the DB, creates the redisClient representation of its master, but
the timestamp is so old that the master, under certain conditions, is
sensed as already "timed out".
Thanks to @yoav-steinberg and Redis Labs Inc for the bug report and
analysis.
