rcoli helped me investigate this issue. The mystery was that the segment of commit log was probably not fsynced to disk since the setting was set to periodic with 10 second delay and CRC32 checksum validation failed skipping the reply, so what happened in my scenario can be explained by this. I am going to change our settings to batch mode.

I was restarting Cassandra nodes again today. 1 hour later my support team let me know that a customer has reported some missing data. I suppose this is the same issue. The application logs show that our client got success from the Thrift log and proceeded with responding to the user and I could grep the commit log for a missing record like I did before.

We have durable writes enabled. To me, it seams like when stuff are in memtables and hasn't been flushed to disk, when I restart the node, the commit log doesn't get replayed correctly.

Thanks for your reply. I did grep on the commit logs for the offending key and grep showed Binary file matches. I am trying to use this tool to extract the commitlog and actually confirm if the mutation was a write: