Great tips. I will investigate further with your suggestions in mind. Hopefully the problem has gone away since I pulled in fresh data on the node with problems.
On Apr 13, 2011, at 3:54 AM, aaron morton wrote:
> Ah, unreadable rows, and in the validation compaction no less. Makes a little more sense now.
>
> Can anyone help with the EOF when deserializing columns? Is the fix to run scrub, or to drop the sstable?
>
> Here's a theory (rough sketch below): AES is trying to...
>
> 1) Create TreeRequests that specify a range we want to validate.
> 2) Send the TreeRequests to the local node and to a neighbour.
> 3) Process each TreeRequest by running a validation compaction (CompactionManager.doValidationCompaction in your prev stacks).
> 4) When both TreeRequests return, work out the differences and then stream data if needed.
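>
> A minimal sketch of those four steps, assuming made-up types (TreeRequest, MerkleTree, Endpoint here are illustrative stand-ins, not the real 0.7 AntiEntropyService classes):
>
> import java.util.List;
>
> class RepairFlowSketch {
>     interface MerkleTree { List<String> differenceWith(MerkleTree other); }
>     interface Endpoint   { MerkleTree validate(String range); }  // runs a validation compaction
>
>     void repair(String range, Endpoint local, Endpoint neighbour) {
>         // steps 1 + 2: build a TreeRequest for the range, send to local node and neighbour
>         MerkleTree localTree  = local.validate(range);      // step 3, locally
>         MerkleTree remoteTree = neighbour.validate(range);  // step 3, on the neighbour
>         // step 4: once both trees are back, diff them and stream what differs
>         for (String diffRange : localTree.differenceWith(remoteTree))
>             streamRange(diffRange, neighbour);
>     }
>
>     void streamRange(String diffRange, Endpoint to) { /* stream sstable data for the range */ }
> }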
>
> Perhaps step 3 is not completing because of errors like http://www.mail-archive.com/user@cassandra.apache.org/msg12196.html
> If the row is spread over multiple sstables we can skip it in one sstable. However, if it's in a single sstable, PrecompactedRow will raise an IOError if there is a problem. That is not what the linked error stack shows (there the row is skipped); it's just a hunch we could check out.
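>
> To illustrate the hunch (hypothetical names, not the real compaction code): with several copies of a row we can drop the one copy that fails to deserialize, but with a single copy the failure surfaces as an IOError.
>
> import java.io.IOError;
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> class RowMergeSketch {
>     interface RowCopy { Object deserialize() throws IOException; }  // one row from one sstable
>
>     List<Object> mergeCopies(List<RowCopy> copies) {
>         List<Object> readable = new ArrayList<Object>();
>         for (RowCopy copy : copies) {
>             try {
>                 readable.add(copy.deserialize());
>             } catch (IOException e) {
>                 if (copies.size() > 1)
>                     continue;              // row exists in other sstables: skip this copy
>                 throw new IOError(e);      // only copy: PrecompactedRow-style hard failure
>             }
>         }
>         return readable;
>     }
> }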
>
> Do you see any IOErrors (not exceptions) in the logs, or exceptions with doValidationCompaction in the stack?
>
> For a tree request on the node you start compaction on, you should see these logs...
> 1) "Waiting for repair requests..."
> 2) One of "Stored local tree" or "Stored remote tree" (at DEBUG level), depending on which returns first
> 3) "Queuing comparison"
>
> If we do not see the 3rd log then we did not get a reply from either the local or the remote node.
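>
> In other words, the comparison is gated on both replies arriving, roughly like this (a CountDownLatch sketch of the idea, not the actual callback code):
>
> import java.util.concurrent.CountDownLatch;
>
> class TreeRendezvousSketch {
>     private final CountDownLatch trees = new CountDownLatch(2);  // local + remote
>
>     void onTreeStored(String which) {   // fires for "Stored local tree" / "Stored remote tree"
>         trees.countDown();
>     }
>
>     void compareWhenReady() throws InterruptedException {
>         trees.await();                  // never returns if one reply goes missing
>         System.out.println("Queuing comparison");  // the 3rd log you should see
>     }
> }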
>
> Aaron
>
> On 13 Apr 2011, at 00:57, Jonathan Colby wrote:
>
>> There is no "Repair session" message either. It just starts with a message like:
>>
>> INFO [manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723] 2011-04-10 14:00:59,051 AntiEntropyService.java (line 770) Waiting for repair requests: [#<TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.46.108.101, (DFS,main)>, #<TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.100, (DFS,main)>, #<TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.102, (DFS,main)>, #<TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.101, (DFS,main)>]
>>
>> NETSTATS:
>>
>> Mode: Normal
>> Not sending any streams.
>> Not receiving any streams.
>> Pool Name                    Active   Pending      Completed
>> Commands                        n/a         0         150846
>> Responses                       n/a         0         443183
>>
>> One node in our cluster still has "unreadable rows", where the reads trip up every time for certain sstables (you've probably seen my earlier threads regarding that). My suspicion is that the bloom filter read on the node with the corrupt sstables is never reporting back to the repair, thereby causing it to hang.
>>
>>
>> What would be great is a scrub tool that ignores unreadable/unserializable rows! : )
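>>
>> Something along these lines, say (hypothetical interfaces, just the shape of the idea):
>>
>> import java.io.IOException;
>>
>> class SkippingScrubSketch {
>>     interface RowReader { boolean hasNext(); Object next() throws IOException; }
>>     interface RowWriter { void append(Object row); }
>>
>>     // Copy every readable row into a fresh sstable; count and drop the rest.
>>     int scrub(RowReader in, RowWriter out) {
>>         int skipped = 0;
>>         while (in.hasNext()) {
>>             try {
>>                 out.append(in.next());
>>             } catch (IOException e) {
>>                 skipped++;  // unreadable/unserializable row: skip it and keep going
>>             }
>>         }
>>         return skipped;
>>     }
>> }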
>>
>>
>> On Apr 12, 2011, at 2:15 PM, aaron morton wrote:
>>
>>> Do you see a message starting "Repair session " and ending with "completed successfully"?
>>>
>>> Or do you see any streaming activity using "nodetool netstats"?
>>>
>>> Repair can hang if a neighbour dies and fails to send a requested stream. It will time out after 24 hours (I think).
>>>
>>> Aaron
>>>
>>> On 12 Apr 2011, at 23:39, Karl Hiramoto wrote:
>>>
>>>> On 12/04/2011 13:31, Jonathan Colby wrote:
>>>>> There are a few other threads related to problems with nodetool repair in 0.7.4. However, I'm not seeing any errors, just never getting a message that the repair completed successfully.
>>>>>
>>>>> In my production and test clusters (with just a few MB of data) the nodetool repair prompt never returns, and the last entry in cassandra.log is always something like:
>>>>>
>>>>> #<TreeRequest manual-repair-f739ca7a-bef8-4683-b249-09105f6719d9, /10.46.108.102, (DFS,main)> completed successfully: 1 outstanding
>>>>>
>>>>> But I don't see a message, even hours later, that the 1 outstanding request "finished successfully".
>>>>>
>>>>> Anyone else experience this? These are physical server nodes in local data centers, not EC2.
>>>>>
>>>>
>>>> I've seen this. To fix it, try a "nodetool compact" and then a repair.
>>>>
>>>>
>>>> --
>>>> Karl
>>>
>>
>