On Fri, Mar 3, 2017 at 7:56 AM, Romain Hardouin <romainh_ml@yahoo.fr> wrote:
> I suspect a lack of 3.x reliability. Cassandra could have given up with
> dropped messages but not with a "drop keyspace". I mean I have already seen
> some Spark jobs with too many executors that produce a high load average on
> a DC. I saw a C* node with a 1 min. load avg of 140 that could still have a P99
> read latency at 40ms. But I never saw a disappearing keyspace. There are
> old tickets regarding C* 1.x but as far as I remember it was due to a
> create/drop/create keyspace.
>
>
> On Friday, March 3, 2017 at 13:44, George Webster <webstergd@gmail.com>
> wrote:
>
>
> Thank you for your reply and good to know about the debug statement. I
> haven't
>
> We never dropped or re-created the keyspace before. We haven't even
> performed writes to that keyspace in months. I also checked the permissions
> of the Apache user; it had read-only access.
>
> Unfortunately, I reverted from a backup recently. I cannot say for sure
> anymore if I saw something in system before the revert.
>
> Anyway, hopefully it was just a fluke. We have some crazy ML libraries
> running on it, so maybe Cassandra just gave up? Oh well, Cassandra is a
> champ and we haven't really had issues with it before.
>
> On Thu, Mar 2, 2017 at 6:51 PM, Romain Hardouin <romainh_ml@yahoo.fr>
> wrote:
>
> Did you inspect system tables to see if there are any traces of your
> keyspace? Did you ever drop and re-create this keyspace before that?
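For anyone wanting to script that check, here is a minimal sketch. It assumes Cassandra 3.x, where schema metadata is stored in system_schema.keyspaces, and the DataStax Python driver (cassandra-driver). The helper name find_keyspace_traces and the keyspace name are placeholders of mine, not something from this thread.

```python
def find_keyspace_traces(session, keyspace_name):
    """Return the keyspace names in the schema tables that match keyspace_name.

    `session` is assumed to be a cassandra-driver Session, or any object
    with a compatible execute(query, parameters) method. An empty list
    means the schema tables no longer list the keyspace.
    """
    query = (
        "SELECT keyspace_name FROM system_schema.keyspaces "
        "WHERE keyspace_name = %s"
    )
    rows = session.execute(query, (keyspace_name,))
    return [row.keyspace_name for row in rows]
```

With the driver installed, this would be called with something like find_keyspace_traces(Cluster(["127.0.0.1"]).connect(), "my_keyspace"), where the contact point is hypothetical. Taking the session as a parameter keeps the helper easy to try against a stub.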
>
> Lines in debug appear because the fd interval is > 2 seconds (logs are in
> nanoseconds). You can override intervals via the
> -Dcassandra.fd_initial_value_ms and -Dcassandra.fd_max_interval_ms
> properties. Are you sure you didn't have these lines in debug logs before?
> I used to see them a lot prior to increasing the intervals to 4 seconds.
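To make the nanosecond values easier to read, here is a small sketch that pulls the intervals out of debug log text and converts them to seconds. The 2-second threshold comes from the explanation above; the function and constant names are illustrative only, and the regular expression assumes the message format shown in the entries quoted in this thread.

```python
import re

# Threshold described above: intervals longer than 2 seconds are ignored.
MAX_INTERVAL_NS = 2_000_000_000

# Matches the message part of the FailureDetector debug entries.
LINE_RE = re.compile(r"Ignoring interval time of (\d+) for /(\S+)")


def parse_ignored_intervals(log_text):
    """Return (endpoint, interval_in_seconds) pairs for each matching entry."""
    return [(host, int(ns) / 1e9) for ns, host in LINE_RE.findall(log_text)]


def exceeds_threshold(interval_ns, max_ns=MAX_INTERVAL_NS):
    """True if the interval is longer than the configured maximum."""
    return interval_ns > max_ns
```

For example, an interval of 2945213745 ns is roughly 2.95 seconds, which is over the 2-second limit but would fit under a 4-second one.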
>
> Best,
>
> Romain
>
> On Tuesday, February 28, 2017 at 18:25, George Webster <webstergd@gmail.com>
> wrote:
>
>
> Hey Cassandra Users,
>
> We recently encountered an issue where a keyspace just disappeared. I was
> curious if anyone has had this occur before and can provide some insight.
>
> We are using Cassandra 3.10 with 2 DCs, 3 nodes each.
> The data is still located in the storage folder but is no longer visible
> inside Cassandra.
>
> I searched the logs for any hints of error or commands being executed that
> could have caused a loss of a keyspace. Unfortunately I found nothing. In
> the logs the only unusual issue I saw was a series of read timeouts that
> occurred right around when the keyspace went away. Since then I have seen
> numerous entries in the debug log like the following:
>
> DEBUG [GossipStage:1] 2017-02-28 18:14:12,580 FailureDetector.java:457 -
> Ignoring interval time of 2155674599 for /x.x.x..12
> DEBUG [GossipStage:1] 2017-02-28 18:14:16,580 FailureDetector.java:457 -
> Ignoring interval time of 2945213745 for /x.x.x.81
> DEBUG [GossipStage:1] 2017-02-28 18:14:19,590 FailureDetector.java:457 -
> Ignoring interval time of 2006530862 for /x.x.x..69
> DEBUG [GossipStage:1] 2017-02-28 18:14:27,434 FailureDetector.java:457 -
> Ignoring interval time of 3441841231 for /x.x.x.82
> DEBUG [GossipStage:1] 2017-02-28 18:14:29,588 FailureDetector.java:457 -
> Ignoring interval time of 2153964846 for /x.x.x.82
> DEBUG [GossipStage:1] 2017-02-28 18:14:33,582 FailureDetector.java:457 -
> Ignoring interval time of 2588593281 for /x.x.x.82
> DEBUG [GossipStage:1] 2017-02-28 18:14:37,588 FailureDetector.java:457 -
> Ignoring interval time of 2005305693 for /x.x.x.69
> DEBUG [GossipStage:1] 2017-02-28 18:14:38,592 FailureDetector.java:457 -
> Ignoring interval time of 2009244850 for /x.x.x.82
> DEBUG [GossipStage:1] 2017-02-28 18:14:43,584 FailureDetector.java:457 -
> Ignoring interval time of 2149192677 for /x.x.x.69
> DEBUG [GossipStage:1] 2017-02-28 18:14:45,605 FailureDetector.java:457 -
> Ignoring interval time of 2021180918 for /x.x.x.85
> DEBUG [GossipStage:1] 2017-02-28 18:14:46,432 FailureDetector.java:457 -
> Ignoring interval time of 2436026101 for /x.x.x.81
> DEBUG [GossipStage:1] 2017-02-28 18:14:46,432 FailureDetector.java:457 -
> Ignoring interval time of 2436187894 for /x.x.x.82
>
> During the time of the disappearing keyspace we had two concurrent
> activities:
> 1) Running a Spark job (via HDP 2.5.3 in YARN) that was performing a
> countByKey. It was using the keyspace that disappeared. The operation
> crashed.
> 2) We created a new keyspace to test out a schema. The only "fancy" thing in
> that keyspace is a few materialized view tables. Data was being loaded into
> that keyspace during the crash. The load process was extracting information
> and then just writing to Cassandra.
>
> Any ideas? Anyone seen this before?
>
> Thanks,
> George
>
>
>
>
>
>
Cassandra takes snapshots for certain events. Does this extend to drop
keyspace commands? Maybe it should.
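
Either way, it is easy to check what snapshots actually exist on disk. A rough sketch follows, assuming the usual on-disk layout of <data_dir>/<keyspace>/<table>-<id>/snapshots/<snapshot_name>/. The function name and the paths in the usage note are placeholders, and this only lists directories; it does not say what created them.

```python
from pathlib import Path


def list_snapshots(data_dir, keyspace):
    """Return sorted (table_dir, snapshot_name) pairs found on disk.

    Assumes the layout:
    <data_dir>/<keyspace>/<table>-<id>/snapshots/<snapshot_name>/
    """
    keyspace_dir = Path(data_dir) / keyspace
    if not keyspace_dir.is_dir():
        return []
    found = []
    for snap in keyspace_dir.glob("*/snapshots/*"):
        if snap.is_dir():
            # snap.parent is "snapshots"; its parent is the table directory.
            found.append((snap.parent.parent.name, snap.name))
    return sorted(found)
```

On a node this might be run as list_snapshots("/var/lib/cassandra/data", "my_keyspace"), where both the data directory and the keyspace name are assumptions to adjust for the actual installation.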