[ https://issues.apache.org/jira/browse/ZOOKEEPER-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ZOOKEEPER-3249:
--------------------------------------
Labels: pull-request-available (was: )
> Avoid reverting the cversion and pzxid during replaying txns with fuzzy snapshot
> --------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-3249
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3249
> Project: ZooKeeper
> Issue Type: Improvement
> Components: server
> Affects Versions: 3.6.0
> Reporter: Fangmin Lv
> Assignee: Fangmin Lv
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.6.0
>
>
> The only case we need to have [the tricky hack code|https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/DataTree.java#L1036-L1065]
, is because of the scenario below:
> If the child is deleted due to session close and re-created in a different global session
after that the parent is serialized, then when replay the txn because the node is belonging
to a different session, replay the closeSession txn won't delete it anymore, and we'll get
NODEEXISTS error when replay the createNode txn. In this case, we need to update the cversion
and pzxid to the new value with this tricky code here.
> This could be solved in ZOOKEEPER-3145 with explicit CloseSessionTxn. In theory, with
that code, we don't need this kind of hack code anymore, but there is another case, which
could cause the cversion and pzxid being reverted, and we still need to patch it, here is
the scenario:
> 1. Start to take snapshot at T0
> 2. Txn T1 create /P/N1, set P's cversion and pzxid to (1, 1)
> 3. Txn T2 create /P/N2, set P's cversion and pzxid to (2, 2)
> 4. Txn T3 delete /P/N1, set P's pzxid to 3, which is (2, 3)
> Those state are in the fuzzy snapshot.
> When loading the snapshot and txns during start up based on the current code:
> 1. replay T1, since /P/N1 is not exist, we'll overwrite P's cversion and pzxid to (1,
1)
> 2. replay T2, node already exist, so go through the hack code to patch cversion and pzxid,
and it became (2, 2)
> 3. replay T3, set P's pzxid to 3, which is now (2, 3)
> The state is consistent with the tricky patch code, but it's error-prone and hacky, we
should remove that. To be able to remove that, in this patch, we're going to check the cversion
first and avoid reverting the cversion and pzxid when replaying txns.
> We've also added metrics to verify that logic is not active on prod anymore, after
that I'll open another Jira to remove it to make the logic cleaner.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)