OK, just to make sure: you can see these gossip/state messages when
the node is going down and coming back up again, but not afterwards?
That is, after you restart the node, you see "10.6.168.20 UP" and
"state jump to normal" only once, and not again when the write rate
goes to zero and/or comes back to 230?
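
In case it helps to check this, here is a rough sketch that pulls the
relevant lines out of a log. It assumes each log entry sits on a single
line (the excerpts quoted below are wrapped by the mail client), and
the function name gossip_events is just something I made up for
illustration:

```python
import re

# Message fragments taken from the log excerpts quoted in this thread.
# Assumption: each entry is on one line in the actual log file.
EVENTS = ("is now dead", "has restarted, now UP again", "state jump to normal")
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def gossip_events(lines, address):
    """Return (timestamp, event) pairs for log lines about `address`."""
    found = []
    for line in lines:
        # The trailing space keeps 10.6.168.20 from matching 10.6.168.200.
        if "/" + address + " " not in line:
            continue
        stamp = TIMESTAMP.search(line)
        for event in EVENTS:
            if event in line:
                found.append((stamp.group(0) if stamp else None, event))
    return found
```

Feeding it the open log file from one of the other three nodes, with
"10.6.168.20" as the address, should show whether the UP and "state
jump to normal" messages appear once per restart or keep repeating
while the write rate fluctuates.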
On Wed, Dec 23, 2009 at 12:01 PM, Ramzi Rabah <rrabah@playdom.com> wrote:
> Hi Jaakko, thanks for your response.
>
> I compiled the very latest from the 0.5 branch yesterday (whatever
> yesterday night's build was). I do see that Node X.X.X.X is dead, and
> Node X.X.X.X has restarted.
>
> This shows up on all 3 other servers:
> INFO [Timer-1] 2009-12-22 20:38:43,738 Gossiper.java (line 194)
> InetAddress /10.6.168.20 is now dead.
>
> Node /10.6.168.20 has restarted, now UP again
> INFO [GMFD:1] 2009-12-22 20:43:12,812 StorageService.java (line 475)
> Node /10.6.168.20 state jump to normal
>
> This time the first time I restarted the node it seemed fine, but the
> second time I restarted it, this is what cfstats is showing for
> traffic on it:
>
> Column Family: Datastore
> Memtable Columns Count: 407
> Memtable Data Size: 42268
> Memtable Switch Count: 1
> Read Count: 0
> Read Latency: NaN ms.
> Write Count: 0
> Write Latency: NaN ms.
> Pending Tasks: 0
>
> and then it went up and now it's back to:
>
> Column Family: Datastore
> Memtable Columns Count: 2331
> Memtable Data Size: 242364
> Memtable Switch Count: 1
> Read Count: 107
> Read Latency: 0.486 ms.
> Write Count: 113
> Write Latency: 0.000 ms.
> Pending Tasks: 0
>
> which is half the traffic the other nodes are showing. The other 3
> nodes are showing a consistent ~230 reads/writes per second, which
> node 4 was showing before it was restarted. I hope data is not being
> lost in the process?
>
>
> On Tue, Dec 22, 2009 at 4:43 PM, Jaakko <rosvopaallikko@gmail.com> wrote:
>> Hi,
>>
>> Which revision number are you running?
>>
>> Can you see any log lines related to the node being UP or dead? (like
>> "InetAddress X.X.X.X is now dead" or "Node X.X.X.X has restarted, now
>> UP again"). These messages come from the Gossiper and indicate whether
>> it for some reason thinks the node is dead. They are logged at info
>> level.
>>
>> Another thing is: can you see any log messages like "Node X.X.X.X
>> state normal, token XXX"? These are logged at debug level.
>>
>> -Jaakko
>>
>>
>> On Wed, Dec 23, 2009 at 12:59 AM, Ramzi Rabah <rrabah@playdom.com> wrote:
>>> I just recently upgraded to the latest in the 0.5 branch, and I am
>>> running into a serious issue. I have a cluster with 4 nodes, rackunaware
>>> strategy, and using my own tokens distributed evenly over the hash
>>> space. I am reading from and writing to them at an equal rate of about
>>> 230 reads/writes per second (and cfstats shows that). The first 3 nodes
>>> are seeds, the last one isn't. When I start all the nodes together at
>>> the same time, they all receive equal amounts of reads/writes (about
>>> 230).
>>> When I bring node 4 down and bring it back up again, node 4's load
>>> fluctuates between the 230 it used to get and sometimes no traffic at
>>> all. The other 3 still have the same amount of traffic. And there are
>>> no errors whatsoever in the logs. Any ideas what could be causing this
>>> fluctuation on node 4 after I restarted it?
>>>
>>
>