incubator-cassandra-dev mailing list archives

That is strange. In 0.7.0 I see this for seek:

public void seek(long pos) throws IOException
{
    this.curr_ = pos;
}
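
If I am reading it right, in 0.7.0 the buffer check is deferred: seek() only
records the position, and it is the next read()/write() that re-buffers when
that position falls outside the current window. Roughly (paraphrased, not the
exact 0.7.0 source; apart from curr_ and reBuffer(), the field names are my
shorthand):

public void write(int b) throws IOException
{
    // only here, not in seek(), do we pay for re-filling the buffer
    if (this.curr_ < this.lo_ || this.curr_ >= this.hi_)
    {
        reBuffer();
    }
    // ... then copy b into the buffer at (curr_ - lo_) and advance curr_ ...
}
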
Ivan
On 21.2.2011 at 21:20, Jonathan Ellis wrote:
> BRAF.seek has not changed since 0.7.0. Here is the implementation:
>
> public void seek(long newPosition) throws IOException
> {
>     current = newPosition;
>
>     if (newPosition >= bufferEnd || newPosition < bufferOffset)
>     {
>         reBuffer(); // this will set bufferEnd for us
>     }
>
>     final int delta = (int) (newPosition - bufferOffset);
>     buffer.position(delta);
> }
>
> In other words: if we are seeking outside of the part of the file that
> is currently buffered, we have to call rebuffer.
>
> Now, you might wonder why seek is being called at all in SSTW.append.
> Here is the relevant part (also unchanged since 0.7.0):
>
> // write placeholder for the row size, since we don't know it yet
> long sizePosition = dataFile.getFilePointer();
> dataFile.writeLong(-1);
> // write out row data
> int columnCount = ColumnFamily.serializer().serializeWithIndexes(cf, dataFile);
> // seek back and write the row size (not including the size Long itself)
> long endPosition = dataFile.getFilePointer();
> dataFile.seek(sizePosition);
> dataFile.writeLong(endPosition - (sizePosition + 8));
>
> So basically, once your row size is larger than the BRAF buffer, you can
> expect a lot of reBuffer calls during flush or compact. Sounds to me like
> the difference you are seeing is caused by your row size and not a
> regression in Cassandra.
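>
> As a toy model of that window check (this is not Cassandra code; the 64k
> figure is just an example buffer size, not necessarily the configured one):
>
> public class SeekBackDemo
> {
>     // rebuffer is needed exactly when the target is outside [bufferOffset, bufferEnd)
>     static boolean needsRebuffer(long target, long bufferOffset, long bufferEnd)
>     {
>         return target >= bufferEnd || target < bufferOffset;
>     }
>
>     public static void main(String[] args)
>     {
>         long bufferSize   = 64 * 1024;   // example buffer size
>         long sizePosition = 0;           // where the row-size placeholder was written
>         long rowBytes     = 1024 * 1024; // a row larger than the buffer
>
>         // after the row is serialized, the buffered window only covers its tail
>         long bufferEnd    = sizePosition + 8 + rowBytes;
>         long bufferOffset = bufferEnd - bufferSize;
>
>         // so the seek back to the placeholder falls outside the window
>         System.out.println(needsRebuffer(sizePosition, bufferOffset, bufferEnd)); // true
>     }
> }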
>
> 2011/2/21 Ivan Georgiev <yngwiie@bk.ru>:
>> Some more digging. This is the code path causing the excessive rebuffer()
>> calls.
>>
>> java.lang.Exception: Stack trace
>>     at java.lang.Thread.dumpStack(Unknown Source)
>>     at org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:204)
>>     at org.apache.cassandra.io.util.BufferedRandomAccessFile.seek(BufferedRandomAccessFile.java:394)
>>     at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:148)
>>     at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:159)
>>     at org.apache.cassandra.db.Memtable.access$000(Memtable.java:49)
>>     at org.apache.cassandra.db.Memtable$1.runMayThrow(Memtable.java:174)
>>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>     at java.lang.Thread.run(Unknown Source)
>>
>> Basically, reBuffer() is being called through seek(). In 0.7.0, seek() does
>> not call reBuffer(), while from 0.7.1 on it does, and that is where the
>> performance degradation is happening in my test case.
>> Any ideas how to alleviate the problem?
>>
>> Ivan
>>
>> On 21.2.2011 at 14:07, Ivan Georgiev wrote:
>>> I did some very rough measurements in a desperate attempt to see if I can
>>> find the issue (if there is an issue).
>>> Since I don't know the code base well enough, I chose
>>> BufferedRandomAccessFile as my suspect, since it was rewritten between
>>> 0.7.0 and 0.7.1.
>>>
>>> I did rough measurements of how many times write() and reBuffer() were
>>> called, and how much time was spent in them. The measurements were printed
>>> out when close() on the file was called. Time measurements are rounded to
>>> seconds. I have included the measurements only for the large files that
>>> were flushed to disk.
>>>
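>>> Roughly the kind of instrumentation behind these numbers (field and method
>>> names here are made up for illustration):
>>>
>>> // accumulate call counts and wall time, then dump the totals from close()
>>> private long reBufferCalls = 0;
>>> private long reBufferNanos = 0;
>>>
>>> private void timedReBuffer() throws IOException
>>> {
>>>     long start = System.nanoTime();
>>>     reBuffer();
>>>     reBufferNanos += System.nanoTime() - start;
>>>     reBufferCalls++;
>>> }
>>>
>>> // and in close():
>>> // logger.info("Times called reBuffer : " + reBufferCalls);
>>> // logger.info("Time spent in reBuffer : " + (reBufferNanos / 1000000000L) + " seconds.");
>>>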
>>> 0.7.0
>>> Total time in reBuffer : 0 seconds.
>>> Total times called reBuffer : 2
>>> Total time in write : 0 seconds.
>>> Total times called write : 579374
>>>
>>> 0.7.2
>>> Time spent in reBuffer : 67 seconds.
>>> Times called reBuffer : 30888
>>> Time spent in write : 0 seconds.
>>> Times called write : 1884107
>>>
>>> It seems like reBuffer() is being called a lot more in 0.7.2, and that's
>>> where the performance degradation is coming from.
>>>
>>> reBuffer() can end up being called from write() itself, but since the time
>>> spent in write() is insignificant, this excessive calling of reBuffer() in
>>> 0.7.2 is not happening from the write path.
>>>
>>> Ivan
>>>
>>>
>>> On 20.2.2011 at 05:21, Ivan Georgiev wrote:
>>>> Is it possible that the changes to BufferedRandomAccessFile.java could be
>>>> causing the issue? I think the most notable change there is using a
>>>> ByteBuffer instead of a byte[] for the buffer.
>>>>
>>>> I will run more tests and see what comes out of it.
>>>>
>>>> Ivan
>>>>
>>>> On 20.2.2011 at 05:03, Jonathan Ellis wrote:
>>>>> I guess you'll need to binary search through the 0.7.1 changes to find
>>>>> what made the difference. I can't think of any obvious candidates.
>>>>>
>>>>> On Sat, Feb 19, 2011 at 10:32 AM, Ivan Georgiev <yngwiie@bk.ru> wrote:
>>>>>> On 19.2.2011 at 16:43, Jonathan Ellis wrote:
>>>>>>> Flush code didn't change between 0.7.0 and 0.7.2. There must be some
>>>>>>> other variable here. Memory pressure maybe?
>>>>>> Could you please elaborate on that one?
>>>>>> The conditions are exactly the same for the tests with 0.7.0 and 0.7.2.
>>>>>> By the way, the 0.7.1 tests behave like 0.7.2, while everything from the
>>>>>> early 0.7 betas through 0.7.0 is fine.
>>>>>>
>>>>>> Ivan
>>>>>>
>>>>>> On 19.2.2011 at 16:43, Jonathan Ellis wrote:
>>>>>>> Flush code didn't change between 0.7.0 and 0.7.2. There must be some
>>>>>>> other variable here. Memory pressure maybe?
>>>>>>>
>>>>>>> On Sat, Feb 19, 2011 at 8:16 AM, Ivan Georgiev <yngwiie@bk.ru> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am testing 0.7.2 on a Windows 2003 x64 system (one node) and I am
>>>>>>>> having the following problem.
>>>>>>>> My insertion speed is relatively slow, so the memtables do not get full
>>>>>>>> and the actual flushing is triggered by memtable_flush_after_mins; this
>>>>>>>> happens on the hour mark. My problem with 0.7.2 is that when that
>>>>>>>> happens, the CPU spikes to 25% overall usage (x4 Xeon) and the
>>>>>>>> operation takes anywhere from 2 to 4 minutes, leaving the node
>>>>>>>> unresponsive during that time. This has forced me to increase the rpc
>>>>>>>> timeout option beyond what I feel comfortable with.
>>>>>>>> I have run multiple tests with 0.7.0 and 0.7.2 with the same dataset
>>>>>>>> and the results are consistent. During the same operation 0.7.0 takes
>>>>>>>> about 10 seconds to complete vs. 2 to 4 minutes for 0.7.2. I am
>>>>>>>> attaching a log with the timestamps from one such flush in 0.7.2.
>>>>>>>> Please let me know if there is anything I can do to speed up and get
>>>>>>>> results similar to 0.7.0.
>>>>>>>>
>>>>>>>> Regards:
>>>>>>>> Ivan
>>>>>>>>
>>>>>>>> This is the log of the operations which took the most time during the
>>>>>>>> flush operation. Using 0.7.0, with the same number of operations the
>>>>>>>> flushing takes less than 10 seconds.
>>>>>>>>
>>>>>>>> INFO 01:36:44,906 Enqueuing flush of Memtable-ArchiveFiles.6f776e65724944@1225856921(1619949 bytes, 34467 operations)
>>>>>>>> INFO 01:37:47,375 Completed flushing C:\Storage\data\Storage\ArchiveFiles.68617368-f-3-Data.db (5549187 bytes)
>>>>>>>> INFO 01:37:47,375 Writing Memtable-ArchiveFiles.6e616d65@978152661(1619949 bytes, 34467 operations)
>>>>>>>> INFO 01:37:47,375 Enqueuing flush of Memtable-ArchiveFiles.706172656e74466f6c6465724944@2097700961(1619949 bytes, 34467 operations)
>>>>>>>> INFO 01:38:51,343 Completed flushing C:\Storage\data\Storage\ArchiveFiles.6e616d65-f-3-Data.db (3820265 bytes)
>>>>>>>> INFO 01:38:51,343 Writing Memtable-ArchiveFiles.6f776e65724944@1225856921(1619949 bytes, 34467 operations)
>>>>>>>>
>>>>>
>>>>
>>>
>>
>
>