Multi deserialization issues

Details

Description

From the mailing list:

FileTxnSnapLog.restore contains a code block handling a NODEEXISTS failure during deserialization. The problem is explained there in a code comment. The code block however is only executed for a CREATE txn, not for a multiTxn containing a CREATE.

Even if the mentioned code block would also be executed for multi transactions, it needs adaption for multi transactions. What, if after the first failed transaction in a multi txn during deserialization, there would be subsequent transactions in the same multi that would also have failed?
We don't know, since the first failed transaction hides the information about the remaining transactions.

Activity

The test here is a little bit vague, because I don't really understand how a "proper" but broken multitxn would look. Handling the error codes in FileTxnSnapLog is also a bit fuzzy. But I think the general refactor should fix the issue. Would be great if Marshall could take a look at this to verify.

Camille Fournier
added a comment - 28/Oct/11 22:37 The test here is a little bit vague, because I don't really understand how a "proper" but broken multitxn would look. Handling the error codes in FileTxnSnapLog is also a bit fuzzy. But I think the general refactor should fix the issue. Would be great if Marshall could take a look at this to verify.

Ted Dunning
added a comment - 29/Oct/11 00:53 Once a multi makes it to the log, it is pretty much guaranteed that all of the elements of the multi will succeed or log replay will totally fail.
It may be a bit too cowboy to assume that this is really true, but the point of the idempotency property is to make this true.

I am pretty sure that it SHOULD be true. Whether it is actually true is another matter. Being in the log is evidence that there were no reasons for the update to fail if the data store is actually working correctly.

Again, the assumption of correct operation at this scale may be too strong.

Ted Dunning
added a comment - 29/Oct/11 01:13 I am pretty sure that it SHOULD be true. Whether it is actually true is another matter. Being in the log is evidence that there were no reasons for the update to fail if the data store is actually working correctly.
Again, the assumption of correct operation at this scale may be too strong.

Ah yes, being in the log is enough for it to be true were snapshots taken in a frozen system state. But since they are not, you can have these operations fail in playback due to concurrency issues. Multi isn't a special case above the other zk ops, they all have this potential race.

Camille Fournier
added a comment - 29/Oct/11 01:23 Ah yes, being in the log is enough for it to be true were snapshots taken in a frozen system state. But since they are not, you can have these operations fail in playback due to concurrency issues. Multi isn't a special case above the other zk ops, they all have this potential race.

we convert multi to idempotent transactions. by nature if an operation is a multi-op fails, there will not be any transaction there at all, so we should process the transactions resulting from multi-op in exactly the same way as normal transactions. the only difference is that we have to make sure that if one of the transactions get processed, they all do.

Benjamin Reed
added a comment - 29/Oct/11 14:32 we convert multi to idempotent transactions. by nature if an operation is a multi-op fails, there will not be any transaction there at all, so we should process the transactions resulting from multi-op in exactly the same way as normal transactions. the only difference is that we have to make sure that if one of the transactions get processed, they all do.

Right, ok. So I think the patch attached to this issue does exactly that, if someone would like to review it. What I'm not sure is whether the test I put in is particularly good, so would really appreciate one of the multi experts taking a gander there.

Camille Fournier
added a comment - 29/Oct/11 17:10 Right, ok. So I think the patch attached to this issue does exactly that, if someone would like to review it. What I'm not sure is whether the test I put in is particularly good, so would really appreciate one of the multi experts taking a gander there.

I agree with Benjamin's comment. In the PrepRequestProcessor we only apply the multi-op if we know all the contained ops will succeed. If any of them fail, the whole thing gets thrown away and we rollback. So there should be no multi-op specific handling that needs to be done in your patch. I looked over the patch and I don't see any problems.

Marshall McMullen
added a comment - 29/Oct/11 17:30 I agree with Benjamin's comment. In the PrepRequestProcessor we only apply the multi-op if we know all the contained ops will succeed. If any of them fail, the whole thing gets thrown away and we rollback. So there should be no multi-op specific handling that needs to be done in your patch. I looked over the patch and I don't see any problems.