Jonathan Ellis
added a comment - 08/Mar/12 22:35 We added the multithreadedness specifically because it improves startup time for people with multiple spindles or SSDs...
Any ideas how to get the best of both worlds besides falling back to a config option?
(Maybe it's time to add random vs sequential speed ratio as a setting, which at least is general enough to be useful in other places.)

Brandon Williams
added a comment - 08/Mar/12 22:39 (Maybe it's time to add random vs sequential speed ratio as a setting, which at least is general enough to be useful in other places.)
This sounds like a good idea, we're never going to strike a balance that's sufficient between SSD and rotational media without a knob to turn.
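For concreteness, such a knob might look something like this (purely a sketch; the setting name, default, and threshold below are invented for illustration, not anything in cassandra.yaml today):
{code:java}
// Hypothetical sketch of the proposed speed-ratio knob.
public class DiskTuningConfig
{
    /**
     * Ratio of sequential read throughput to random read throughput for
     * the data volumes: ~1 for SSDs, much higher for rotational disks.
     */
    public double sequential_to_random_read_ratio = 10.0;

    /**
     * Example policy: open sstables in parallel only when random reads
     * are nearly as cheap as sequential ones (i.e. SSD-like media).
     */
    public boolean preferParallelSSTableLoading()
    {
        return sequential_to_random_read_ratio <= 2.0;
    }
}
{code}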

Michael Harris
added a comment - 09/Mar/12 02:00 - edited My $0.02 is that it may be helpful to batch reads. I'm not sure whether the underlying stream used to read the bloom filters pulls in a large chunk and caches it, but if not, it could help to read 64K or 1MB (maybe configurable?) into a buffer and grab the longs out of that, instead of just calling ois.readLong(). This doesn't completely fix the problem of disk contention, but it might cause larger sequential reads to be submitted to the disk, which the disk might handle more gracefully.
The specific example I'm thinking of here: the deserialization of LegacyBloomFilter (perhaps what 0.8 uses?) is just an ois.readObject() for a BitSet, and that's it. Whereas for BloomFilter (the new version?), deserialization is a tight loop of readLong() calls; same with serialization, FWIW. Not that using Java serialization for LTS is necessarily a good idea, but it may be easier on the disk.
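Roughly this shape (just a sketch to show the idea; the class name, buffer size, and stream wiring are placeholders, not actual Cassandra code):
{code:java}
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public final class BufferedFilterRead
{
    // Read `count` longs via a large in-memory buffer, so each readLong()
    // is served from memory and the disk sees big sequential reads.
    public static long[] readLongs(String path, int count) throws IOException
    {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path), 64 * 1024)))
        {
            long[] words = new long[count];
            for (int i = 0; i < count; i++)
                words[i] = in.readLong();
            return words;
        }
    }
}
{code}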

Michael Harris
added a comment - 09/Mar/12 06:28 +1 on the patch. When we get a chance, we'll try it out on the cluster that demonstrated the issue.
The overall intent of the ticket as filed is that startup is slower overall; this was just one example I found in the code that could be part of the problem. Any other ideas about what could cause slower startup? We noticed a lot of SSTables being opened in the logs during startup, and we do have some 100+ GB SSTables on disk, so general SSTable reading was my first instinct to check.

Michael Harris
added a comment - 09/Mar/12 06:48 Also, FWIW, the observed slowness is in 1.0.7, not 1.0.1 as listed in the Affects Version field (though it may certainly affect 1.0.1 as well). I don't seem to have permission to update this.

Michael Harris
added a comment - 10/Mar/12 04:05 Hi Jonathan. We patched our cluster, and startup time didn't decrease much, if at all. I still think the patch is probably a positive help, but it looks like there are bigger problems elsewhere. Let me know if any additional testing or logs would help out here.

Yuki Morishita
added a comment - 19/Mar/12 23:21 I measured how long it takes to load each sstable component (Data, Index, Filter) and found that loading from the index file takes longer in 1.0 than in 0.8.
Comparing the code between 0.8 and 1.0, I noticed that in 1.0 every key stored in the index file gets deserialized, while in 0.8 only the keys that are added to the index summary get deserialized.
The reason we deserialize all keys in 1.0 is to obtain the first and last keys stored in the sstable. The attached patch tries to skip deserializing keys when possible.
Patch is against 1.0 branch.
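The core idea looks roughly like this (an illustrative sketch assuming the 1.0 index layout of a length-prefixed key followed by an 8-byte position; the class and method names are invented, not the patch itself):
{code:java}
import java.io.DataInput;
import java.io.IOException;

public final class IndexEntrySkip
{
    // Walk one index entry without materializing the key: read the 2-byte
    // key length, skip the key bytes, and return only the data file position.
    public static long readPositionOnly(DataInput in) throws IOException
    {
        int keyLength = in.readUnsignedShort();
        in.skipBytes(keyLength);
        return in.readLong();
    }
}
{code}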

Yuki Morishita
added a comment - 23/Mar/12 16:42 In trunk, CASSANDRA-2319 changed the contents of the primary index file, so we need to calculate the serialized size differently to determine whether a key is the last one.

Yuki Morishita
added a comment - 23/Mar/12 22:24 Attaching patch to current trunk.
Unfortunately, there is no way to determine the serialized size of a promoted row index without actually deserializing some of it, so we cannot identify the last key up front.
The patch deserializes every key when the sstable has promoted indexes, but in exchange it tries to reduce the amount of I/O when deserializing the index itself.
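One way to picture the I/O reduction (a sketch only; it assumes the promoted part is prefixed with its serialized size as an int, and the names are invented):
{code:java}
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;

public final class PromotedIndexBulkRead
{
    // Instead of many small reads while parsing the promoted index,
    // pull its whole serialized form into memory with one read and
    // parse from the in-memory copy.
    public static DataInput readPromotedPart(DataInput in) throws IOException
    {
        int size = in.readInt();      // serialized size of the promoted part
        byte[] buf = new byte[size];
        in.readFully(buf);            // single bulk read
        return new DataInputStream(new ByteArrayInputStream(buf));
    }
}
{code}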

Yuki Morishita
added a comment - 23/Mar/12 22:36 Is this I/O-reducing something that could also apply to 1.0?
It's for RowIndexEntry, which was introduced for 1.2 (CASSANDRA-2319), so it only applies to trunk.

Jonathan Ellis
added a comment - 23/Mar/12 22:44 Could we rename lastKey to lastKeyForUnpromoted, and then take the hasPromotedIndexes out of that assignment, since you also added it to the big if statement?
I guess deserializePositionOnly is only for unpromoted indexes? Because for promoted ones, decoratedKey will never be null.
+1 otherwise; I guess we'll have to rely on CASSANDRA-2392 for startup speed in 1.2.

Yuki Morishita
added a comment - 24/Mar/12 19:15 Could we rename lastKey to lastKeyForUnpromoted, and then take the hasPromotedIndexes out of that assignment?
Done.
I guess deserializePositionOnly is only for unpromoted indexes?
It deserializes the whole row index only when decoratedKey is not null AND we are caching keys; otherwise it deserializes only the position.
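In rough shape (hypothetical names and layout, not the patch verbatim):
{code:java}
import java.io.DataInput;
import java.io.IOException;

public final class ConditionalIndexDeserialize
{
    // Always read the data file position; deserialize the promoted row
    // index only when the caller actually needs the full entry.
    public static long deserialize(DataInput in, boolean needFullEntry) throws IOException
    {
        long position = in.readLong();
        int promotedSize = in.readInt();   // 0 when there is no promoted part
        if (needFullEntry)
        {
            byte[] promoted = new byte[promotedSize];
            in.readFully(promoted);        // parse into column index entries here
        }
        else
        {
            in.skipBytes(promotedSize);    // position-only fast path
        }
        return position;
    }
}
{code}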

Yuki Morishita
added a comment - 26/Mar/12 22:26 The NPE only happens on trunk, which is why I reopened this issue.
The fix is attached as trunk-4023.txt.
The v3 patch attached above is for the 1.0 branch and has already been committed to 1.0 and 1.1 without any problem.

Dave Brosius
added a comment - 26/Mar/12 22:32 - edited No, v3 has this bug.
The problem is that in the general case, where RowIndexEntry.serializer.deserialize() uses the hasPromotedIndexes code, the amount of data read is variable (assuming the first int read in deserialize can be non-zero). Therefore the code in SSTableReader can no longer statically predict when a particular key will be the last one.
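To spell it out (illustrative sketch; the entry layout and names are assumptions): with fixed-size entries you could compute up front which entry is the last, but with a variable-size promoted part the only safe test is hitting end-of-file:
{code:java}
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public final class LastKeyScan
{
    // Keep reading entries and remember the latest key; an entry is only
    // known to be the last once the next read hits EOF.
    public static byte[] lastKey(DataInputStream in) throws IOException
    {
        byte[] last = null;
        while (true)
        {
            int keyLength;
            try { keyLength = in.readUnsignedShort(); }
            catch (EOFException e) { return last; }
            byte[] key = new byte[keyLength];
            in.readFully(key);
            in.readLong();                     // data file position
            int promotedSize = in.readInt();   // variable-size tail
            in.skipBytes(promotedSize);
            last = key;
        }
    }
}
{code}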
Ah I see.. yes this happens on trunk... so if it is fixed elsewhere, then good... never mind me.

Jonathan Ellis
added a comment - 27/Mar/12 05:44 Committed.
(I note, for the record, that my original 4023.txt patch is unnecessary because we're already using a buffered input stream in loadBloomFilter.)