On 04/19/2010 11:55 AM, Travis Crawford wrote:
> It would be a lot easier from the operations perspective if the leader
> explicitly published some health stats:
>
> (a) Count of instances in the ensemble.
> (b) Count of up-to-date instances in the ensemble.
>
> This would greatly simplify monitoring & alerting - when an instance
> falls behind one could configure their monitoring system to let
> someone know and take a look at the logs.

That's a great idea. Please enter a JIRA for this - a new 4 letter word
and JMX support. It would also be a great starter project for someone
interested in becoming more familiar with the server code.
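For readers unfamiliar with four-letter words: a client opens a TCP connection to the server's client port, writes the four ASCII characters, and reads until the server closes the connection. A minimal sketch of that exchange follows. The helper name `four_letter_word` and the timeout default are illustrative assumptions, not part of ZooKeeper.

```python
import socket


def four_letter_word(host, port, word, timeout=5.0):
    """Send a four-letter command and return the server's full reply as text.

    Hypothetical helper for illustration; not part of any ZooKeeper client.
    """
    if len(word) != 4:
        raise ValueError("expected a four-letter command")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Against a running server this could be called as `four_letter_word("localhost", 2181, "ruok")`, assuming the default client port; the command proposed in this issue would be queried the same way once it exists.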

Activity

Travis Crawford
added a comment - 19/Apr/10 22:54 Attaching a Ganglia screen capture showing an example of how this data would be used. Many monitoring systems have a way to collect & store timeseries data; exporting these data as easily-parsable plaintext will make writing importers to any monitoring system easy.

Patrick Hunt
added a comment - 24/May/10 17:56 Looks good so far. You should update the forrest docs as part of this change. At a minimum, detail the format (including the expectation that order is not maintained), note that it is compatible with the Java properties format, and state that the content may change (keys added) over time. That way users will better understand how to integrate and will have a reasonable expectation of backward compatibility, and implementers will have insight into how they can change this command over time. We didn't do a good job of this for the other commands, but since this one is new and intended for integration into third-party tools (rather than a human always looking at the results), it's a good idea to add. You should add this command to the tests as well.
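To make those expectations concrete, here is a minimal sketch of a consumer that does not depend on key order and tolerates keys it has never seen. It assumes one `key<whitespace>value` pair per line, which is only one of the separators the Java properties format allows. The function name `parse_stats` and the sample keys other than `zk_server_state` are hypothetical.

```python
def parse_stats(text):
    """Parse 'key<whitespace>value' lines into a dict.

    Assumption: keys and values are separated by whitespace. Blank lines
    and '#' comment lines are skipped, as in Java properties files.
    """
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        stats[key] = value
    return stats


# Hypothetical sample output; real key names may differ.
sample = "zk_server_state\tleader\nzk_znode_count\t4\nzk_some_future_key\t7\n"
stats = parse_stats(sample)

# Look values up by key rather than by position, and supply a default
# so that a key missing on some platform does not break the script.
znodes = int(stats.get("zk_znode_count", 0))
```

Because the result is a plain dict, keys added in later releases simply show up as extra entries and never shift the meaning of existing ones.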

Travis Crawford
added a comment - 26/May/10 03:13 Looks awesome! One nit: many monitoring systems do not interpret strings, so it may be appropriate to export everything as numbers. For example, consider a script that loops through these values, poking them into Ganglia (or another timeseries database). The script would need to be special-cased to handle "leader". Later, as more values are added, the import script would need to be updated with the new strings. Exporting everything as numbers ensures new values would "just work" without updating other systems.
With that in mind, perhaps:
zk_server_state 1 (instead of: leader)
zk_build_timestamp unix_timestamp (instead of build string)

Andrei Savu
added a comment - 26/May/10 23:05 I have fixed the patch: changed forrest docs and added tests.
@Travis: I believe your script should do some sort of filtering / formatting. Is it really a good idea to just throw any output into Ganglia?
PS: sorry for the late answer, I had some problems with forrest and Java 1.6
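The two positions can be combined in an import script: pass through anything that already parses as a number, and translate the few known string values through an explicit table. The sketch below assumes the stats have already been read into a dict of strings. Only the `leader` to 1 mapping comes from this thread; the other codes and the name `to_metrics` are hypothetical.

```python
# Only "leader" -> 1 is suggested in the thread; the other codes are
# made up for illustration.
STATE_CODES = {"leader": 1, "follower": 2, "standalone": 3}


def to_metrics(stats):
    """Keep numeric values and map known state strings to numbers.

    Non-numeric values with no known mapping (for example a version
    string) are dropped rather than sent to the timeseries database.
    """
    metrics = {}
    for key, value in stats.items():
        try:
            metrics[key] = float(value)
        except ValueError:
            if key == "zk_server_state" and value in STATE_CODES:
                metrics[key] = float(STATE_CODES[value])
    return metrics
```

With this approach a newly added numeric key flows through untouched, while a newly added string key is ignored until someone decides how to encode it.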

Andrei Savu
added a comment - 01/Jun/10 19:47 You can review the patch for commit. Right now I'm writing monitoring scripts for Nagios, Cacti and Ganglia (in this order). The script for Nagios is almost ready. Thanks.
original message
Subject: [jira] Commented: ( ZOOKEEPER-744 ) Add monitoring four-letter word
From: "Patrick Hunt (JIRA)" <jira@apache.org>
Date: 01/06/2010 19:07
[ https://issues.apache.org/jira/browse/ZOOKEEPER-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874111#action_12874111 ]
Patrick Hunt commented on ZOOKEEPER-744 :
----------------------------------------
Andrei, are you still working on this or should I review for commit?

Patrick Hunt
added a comment - 02/Jun/10 01:08 Andrei, looks good, a few comments while reviewing the patch:
1) indicate in the docs that not all keys are available on all platforms (fd count only on unix for example)
2) change "node_count" to "znode_count" (reduce confusion between serving nodes and znodes)
3) your implementation of ephemeral counting:
org.apache.zookeeper.server.DataTree.getEphemeralsCount()
is inefficient; iterate over entrySet rather than keySet
4) take a look at how ephemeral counting is done here:
org.apache.zookeeper.server.DataTreeBean.countEphemerals()
You might refactor so that the same code is used in both places.
5) watch_count is only counting the number of paths that are watched, not the total number of watches (a path may have multiple watches, i.e. multiple sessions watching the same path)
This looks like a bug in the existing implementation (currently only exposed in the bean). You should fix this. Add a test for this while you are at it to verify correct counting.
6) good that you capture the quorum info, is there a way to capture the date/time of the last election?
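Items 3 and 5 both come down to how a map of collections is counted. The sketch below is a Python analogue, not the Java server code. It assumes watches are stored as path -> set of watching sessions and ephemerals as session id -> set of paths. Both shapes and all function names are illustrative assumptions.

```python
def count_watched_paths(watches):
    """Number of distinct paths with at least one watch (the reported value)."""
    return len(watches)


def count_watches(watches):
    """Total number of watches: one per (path, session) pair."""
    return sum(len(sessions) for sessions in watches.values())


def count_ephemerals(ephemerals):
    """Total ephemeral nodes across all sessions.

    Iterating over the values directly avoids one lookup per key, which
    is the Python counterpart of preferring entrySet over keySet plus
    get() in Java.
    """
    return sum(len(paths) for paths in ephemerals.values())


# Two sessions watch "/a" and one watches "/b":
watches = {"/a": {"session-1", "session-2"}, "/b": {"session-1"}}
```

For this sample there are two watched paths but three watches, which is the discrepancy item 5 describes.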

Andrei Savu
added a comment - 06/Jun/10 22:53 @Patrick I have fixed 1-5. I will resubmit the patch after writing some tests to ensure that the node watch count works as expected (I'm having some problems with this part). Right now all tests are passing.
6. I believe the leader does not record the time of the last election. I will look more into this and change the code as needed.
Should I also add JVM memory stats?

Andrei Savu
added a comment - 14/Jun/10 23:59 I've updated the patch and added a new test for getWatchCount(). I'm not yet recording the time of the last election; I'm thinking about opening a JIRA for this later. I want to move on and work on ZOOKEEPER-613 .