Goal
How can I check the health of Zookeeper?
Environment
Zookeeper
Guide
Why Zookeeper?
Zookeeper is used as a state store for the Solr cluster: it lets Solr understand the cluster state and holds configuration information. It uses what are called ephemeral znodes, so if the connection with a Solr server is dropped, the znodes associated with that connection are automatically removed as well; this is how Solr understands the state of the cluster. Requests to Solr are NOT routed through Zookeeper. It holds a minimal amount of information and does a minimal amount of processing, and thus has minimal resource requirements to run well.
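As a concrete illustration of those ephemeral znodes: SolrCloud registers each live node under /live_nodes, and Zookeeper's 'dump' four-letter word (answered by the quorum leader) lists sessions together with their ephemeral znodes. The output below is an illustrative sample fed through a here-doc so the snippet runs as-is; against a live ensemble you would pipe `echo dump | nc <leader-host> 2181` instead, and the hostname in the sample is a placeholder:

```shell
# Count Solr live-node ephemerals in sample 'dump' output
# (solr1.example.com is a placeholder hostname).
grep -c '/live_nodes/' <<'EOF'
SessionTracker dump:
Session Sets (1):
1 expire at Fri Nov 22 10:00:10 GMT 2019:
        0x100048351d80003
ephemeral nodes dump:
Sessions with Ephemerals (1):
0x100048351d80003:
        /live_nodes/solr1.example.com:8983_solr
EOF
# prints 1
```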
Best practices
- ZK should always be set up as a quorum of at least 3 zookeepers, and always an odd number. Zookeeper needs a majority of its quorum up to respond to requests (2 of 3, 3 of 5, etc.), so an even number is always a bad option: a 4-node quorum needs 3 nodes up, so it tolerates no more failures than a 3-node quorum while adding overhead
- You do not need a lot of zookeepers. It does not route requests to Solr and does not do a significant amount of processing. Its main overhead is simply the number of connections it has open
- ZK should have very low disk IO latency and fast access to its disk. Do not put production ZK on nodes that do a high amount of IO to the same disk (such as with a Solr server that does any significant indexing). This is the biggest no-no of a ZK setup
- Do not open up lots of unnecessary connections to Zookeeper through clients trying to index to Solr
- Do not use a single load-balanced address for all your ZKs. The best practice is to configure the full quorum address listing every ZK node. We have seen significant disconnects/reconnects on systems that use a single load-balanced address
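For example, a Solr node or client would be given the full quorum connect string rather than a load balancer address, roughly like this (the hostnames and the /solr chroot here are placeholders, not values from this article):

```shell
# Full quorum connect string: every ZK node, comma-separated, with an
# optional chroot suffix. Hostnames below are placeholders.
ZK_HOST="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/solr"
echo "$ZK_HOST"
# Solr would then be started with something like: bin/solr start -c -z "$ZK_HOST"
```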
Check on its health
- Zookeeper has a set of admin commands called the four-letter words. Documentation is here: https://zookeeper.apache.org/doc/r3.4.8/zookeeperAdmin.html#sc_zkCommands
- The most common ones we use are 'mntr' and 'cons'. For example:
cmd> echo 'mntr' | nc localhost 9983
zk_version 3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on 06/29/2018 00:39 GMT
zk_avg_latency 0
zk_max_latency 0
zk_min_latency 0
zk_packets_received 4244
zk_packets_sent 4243
zk_num_alive_connections 1
zk_outstanding_requests 0
zk_server_state standalone
zk_znode_count 1199
zk_watch_count 0
zk_ephemerals_count 0
zk_approximate_data_size 3052823
zk_open_file_descriptor_count 43
zk_max_file_descriptor_count 10240
zk_fsync_threshold_exceed_count 0
cmd> echo 'cons' | nc localhost 9983
/127.0.0.1:60301[1](queued=0,recved=59,sent=60,sid=0x100048351d80004,lop=PING,est=1574447048795,to=10000,lcxid=0x38,lzxid=0x45f4,lresp=82867730,llat=0,minlat=0,avglat=0,maxlat=6)
/127.0.0.1:60300[1](queued=0,recved=8011,sent=8090,sid=0x100048351d80003,lop=GETD,est=1574447048686,to=30000,lcxid=0x1f4a,lzxid=0x45f4,lresp=82869974,llat=0,minlat=0,avglat=0,maxlat=7)
/127.0.0.1:60331[0](queued=0,recved=1,sent=0)
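Comparing a single mntr field across the quorum can be scripted with a small loop. A minimal sketch: the live form is shown in the comment (the zk1/zk2/zk3 hostnames are assumptions), and sample mntr output is fed via printf so the snippet runs standalone:

```shell
# Against a live quorum (hostnames are assumptions):
#   for h in zk1 zk2 zk3; do
#     echo mntr | nc "$h" 2181 | awk '$1=="zk_znode_count"{print $2}'
#   done
# Sample mntr output stands in here so the snippet runs as-is:
printf 'zk_server_state follower\nzk_znode_count 1199\n' \
  | awk '$1=="zk_znode_count"{print $2}'
# prints 1199
```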
- We tend to look for high max latency numbers, the total number of alive connections, and whether or not the znode counts are the same across the quorum. High latency can come either from a slow ZK or from slow Solr connectivity; usually it is Solr that is slow, unless you have disk IO issues on ZK.
- Checking resources: be very sensitive to any monitoring view that shows IO wait on your ZK system, or a slow disk.
- It is possible for the connection limit to be set too low: maxClientCnxns, the limit on concurrent connections from a single host, defaults to 60
- Beware fsync warnings in your logs. They show you have a slow disk and ZK is having trouble syncing writes to disk in a timely fashion
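Those warnings can be counted with a simple grep. The message text below matches Zookeeper's FileTxnLog slow-fsync warning; the log line is an illustrative sample fed via a here-doc so the snippet runs as-is, and the log path in the comment is an assumption (it varies by install):

```shell
# On a real system (log path is an assumption):
#   grep -c "fsync-ing the write ahead log" /var/log/zookeeper/zookeeper.log
# Illustrative sample log line stands in here:
grep -c "fsync-ing the write ahead log" <<'EOF'
2019-11-22 10:00:01,123 [myid:1] - WARN [SyncThread:1:FileTxnLog@338] - fsync-ing the write ahead log in SyncThread:1 took 1520ms which will adversely effect operation latency.
EOF
# prints 1
```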
Cause:
N/A