Cannot get zookeeper to work running in docker using swarm mode

Thank you, arunkollipara, for the suggestion. While I believe it might work for zookeeper instances that each run in a separate container on a separate computer, it did not turn out to be what was needed to resolve this issue.

I finally figured out what the problem was based on a couple of Stack Overflow questions. Apparently, a zookeeper node not only wants to know about all of the nodes that should be in its ensemble, it also has to be able to resolve and contact ALL of them, including itself, in order to work. It would seem that it only tries to look up the hostname for its own ID through local resolution, without going out to a DNS server. Why? I don’t know; it may just have something to do with how Linux networking operates. To fix it, one of two things has to happen:

  1. The hostname needs to be in the hosts file as an alias to localhost.
  2. You have to use 0.0.0.0 as the host in the server's own entry. More concretely, if the ID of the zookeeper that is starting is 1, then the ZOO_SERVERS environment variable has to be “server.1=0.0.0.0:2888:3888 server.2=zookeeper-0161_company_com:2888:3888 server.3=zookeeper-0114_company_com:2888:3888”.
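
The second option can be scripted so that one list of service names yields the right ZOO_SERVERS value for each ID. This is only a minimal sketch: build_zoo_servers is a hypothetical helper of my own naming (it is not part of the zookeeper image), and it assumes the service names are passed in ID order and that every server uses ports 2888 and 3888.

```shell
#!/bin/sh
# Hypothetical helper: print a ZOO_SERVERS value in which the entry for
# the local ID uses 0.0.0.0 instead of the service name.
# Usage: build_zoo_servers <my_id> <name-for-id-1> <name-for-id-2> ...
build_zoo_servers() {
    my_id=$1
    shift
    id=1
    out=""
    for host in "$@"; do
        # The starting server refers to itself as 0.0.0.0.
        if [ "$id" -eq "$my_id" ]; then
            host="0.0.0.0"
        fi
        # Append the entry, separated from earlier entries by one space.
        out="${out:+$out }server.${id}=${host}:2888:3888"
        id=$((id + 1))
    done
    printf '%s\n' "$out"
}

# Value for the server with ID 1:
build_zoo_servers 1 zookeeper-1046_company_com zookeeper-0161_company_com zookeeper-0114_company_com
```

The printed string could then be passed to --env ZOO_SERVERS="..." when creating the service for that ID.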

So I changed the configuration to be:

docker service create \
--network my-net \
--name zookeeper-1046_company_com \
--mount type=bind,source=/home/docker/data/zookeeper,target=/data \
--env ZOO_MY_ID=1 \
--env ZOO_SERVERS="server.1=0.0.0.0:2888:3888 server.2=zookeeper-0161_company_com:2888:3888 server.3=zookeeper-0114_company_com:2888:3888" \
--constraint "node.hostname == 1046.company.com" \
zookeeper

docker service create \
--network my-net \
--name zookeeper-0161_company_com \
--mount type=bind,source=/home/docker/data/zookeeper,target=/data \
--env ZOO_MY_ID=2 \
--env ZOO_SERVERS="server.1=zookeeper-1046_company_com:2888:3888 server.2=0.0.0.0:2888:3888 server.3=zookeeper-0114_company_com:2888:3888" \
--constraint "node.hostname == 0161.company.com" \
zookeeper

docker service create \
--network my-net \
--name zookeeper-0114_company_com \
--mount type=bind,source=/home/docker/data/zookeeper,target=/data \
--env ZOO_MY_ID=3 \
--env ZOO_SERVERS="server.1=zookeeper-1046_company_com:2888:3888 server.2=zookeeper-0161_company_com:2888:3888 server.3=0.0.0.0:2888:3888" \
--constraint "node.hostname == 0114.company.com" \
zookeeper

Which gives me:

docker service ls    

ID            NAME                        MODE        REPLICAS  IMAGE
evtli9w4cuh3  zookeeper-0146_company_com  replicated  1/1       zookeeper:latest
mgm3qxxoida8  zookeeper-0114_company_com  replicated  1/1       zookeeper:latest
uxrns860wd8j  zookeeper-0161_company_com  replicated  1/1       zookeeper:latest

And in the logs:

2017-01-18 17:14:20,029 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Leader@952] - Have quorum of supporters, sids: [ 1,3 ]; starting up and setting last processed zxid: 0x700000000
2017-01-18 17:14:39,245 [myid:1] - INFO  [LearnerHandler-/10.0.0.5:42734:LearnerHandler@384] - Synchronizing with Follower sid: 2 maxCommittedLog=0x0 minCommittedLog=0x0 peerLastZxid=0x0
2017-01-18 17:14:39,302 [myid:1] - INFO  [LearnerHandler-/10.0.0.5:42734:LearnerHandler@518] - Received NEWLEADER-ACK message from 2

And getting a shell into one of the nodes:

netstat -tlupn

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.11:44701        0.0.0.0:*               LISTEN      -
tcp        0      0 :::42429                :::*                    LISTEN      -
tcp        0      0 :::2181                 :::*                    LISTEN      -
tcp        0      0 :::2888                 :::*                    LISTEN      -
tcp        0      0 :::3888                 :::*                    LISTEN      -
udp        0      0 127.0.0.11:52263        0.0.0.0:*

Which shows all of our expected ports listening.

Running a check against zookeeper in the same shell gives us:

telnet localhost 2181
stat

Zookeeper version: 3.4.9-1757313, built on 08/23/2016 06:50 GMT
Clients:
 /127.0.0.1:59878[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/0/0
Received: 1
Sent: 0
Connections: 1
Outstanding: 0
Zxid: 0x700000000
Mode: leader
Node count: 4
Connection closed by foreign host

And against another node:

telnet zookeeper-0114_company_com 2181
stat

Zookeeper version: 3.4.9-1757313, built on 08/23/2016 06:50 GMT
Clients:
 /10.0.0.3:59118[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/0/0
Received: 1
Sent: 0
Connections: 1
Outstanding: 0
Zxid: 0x0
Mode: follower
Node count: 4
Connection closed by foreign host
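
To compare roles across the ensemble without reading the full stat output each time, the Mode line can be pulled out with a small filter. This is a sketch only: zk_mode is a hypothetical helper name, and the live usage shown in the comment assumes nc is available in the container, which it may not be.

```shell
#!/bin/sh
# Hypothetical helper: read the response to the "stat" command on stdin
# and print only the server's role (leader, follower or standalone).
zk_mode() {
    sed -n 's/^Mode: //p'
}

# Example with a captured response. Against a live server the input
# could come from something like: echo stat | nc <host> 2181
printf 'Zxid: 0x700000000\nMode: leader\nNode count: 4\n' | zk_mode
```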

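The curious thing here is that this shows a node count of 4, which is one more than the number of servers in the ensemble. As far as I can tell, though, “Node count” in the stat output is the number of znodes in the data tree rather than the number of servers, so it does not point to an extra ensemble member or to a problem with the 0.0.0.0 entry.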

I hope that this helps someone else!
