I have a swarm with 3 managers. I demoted the leader in order to update its VM image, expecting the other two managers to elect a new leader between them (with 3 managers a majority is 2, so the remaining pair should still have had quorum) and the swarm to carry on. Instead they failed to elect a new leader and just reported:
Error response from daemon: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online
Before I demoted the leader, I checked with docker node ls that all the managers were reachable:
ID                            HOSTNAME                STATUS  AVAILABILITY  MANAGER STATUS
iqqagm3r4hbjieu27blwdav0j     npt-preprod-mgr000002   Ready   Active        Reachable
l4p7x0xjde39jl2shmyrpj6vc     npt-preprod-wrk000004   Ready   Active
lxg66hlvgnn226kolabunytig     npt-preprod-mgr000000   Ready   Active        Leader
rxoaxgy2zgnkqyd1720z2xvxj *   npt-preprod-mgr000001   Ready   Active        Reachable
u76ijuu00pacn26b693zhpybq     npt-preprod-wrk000000   Ready   Active
y8baf64nlbkokytfr8t9v3512     npt-preprod-wrk000001   Down    Drain
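From memory the sequence I ran was roughly this (exact commands approximate; node names are from the docker node ls output above):

docker node demote npt-preprod-mgr000000   # demote the current leader so its VM can be rebuilt
docker node ls                             # expected mgr000001 or mgr000002 to show up as Leader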
After the demotion, one of the two remaining managers (npt-preprod-mgr000002) was logging:
Sep 06 19:40:30 npt-preprod-mgr000002 dockerd[1577]: time="2017-09-06T19:40:30.858443334+01:00" level=info msg="1da7a4e112ef2be9 is starting a new election at term 511" module=raft node.id=iqqagm3r4hbjieu27blwdav0j
Sep 06 19:40:30 npt-preprod-mgr000002 dockerd[1577]: time="2017-09-06T19:40:30.858509934+01:00" level=info msg="1da7a4e112ef2be9 became candidate at term 512" module=raft node.id=iqqagm3r4hbjieu27blwdav0j
Sep 06 19:40:30 npt-preprod-mgr000002 dockerd[1577]: time="2017-09-06T19:40:30.858526034+01:00" level=info msg="1da7a4e112ef2be9 received MsgVoteResp from 1da7a4e112ef2be9 at term 512" module=raft node.id=iqqagm3r4hbjieu27blwdav0j
Sep 06 19:40:30 npt-preprod-mgr000002 dockerd[1577]: time="2017-09-06T19:40:30.858539935+01:00" level=info msg="1da7a4e112ef2be9 [logterm: 95, index: 62090] sent MsgVote request to 49282bf0d7e20ebc at term 512" module=raft node.id=iqqagm3r4hbjieu27blwdav0j
And the other remaining manager, npt-preprod-mgr000001:
Sep 06 19:40:34 npt-preprod-mgr000001 dockerd[1531]: time="2017-09-06T19:40:34.078548975+01:00" level=error msg="Handler for GET /v1.24/services returned error: rpc error: code = 4 desc = context deadline exceeded"
Sep 06 19:40:41 npt-preprod-mgr000001 dockerd[1531]: time="2017-09-06T19:40:41.580789361+01:00" level=info msg="Node join event for npt-preprod-wrk000004-9886be872d25/10.240.4.73"
Sep 06 19:40:43 npt-preprod-mgr000001 dockerd[1531]: time="2017-09-06T19:40:43.412109375+01:00" level=error msg="agent: session failed" error="session initiation timed out" module="node/agent" node.id=rxoaxgy2zgnkqyd1720z2xvxj
Sep 06 19:40:56 npt-preprod-mgr000001 dockerd[1531]: time="2017-09-06T19:40:56.117383677+01:00" level=error msg="agent: session failed" error="session initiation timed out" module="node/agent" node.id=rxoaxgy2zgnkqyd1720z2xvxj
Sep 06 19:40:59 npt-preprod-mgr000001 dockerd[1531]: time="2017-09-06T19:40:59.086653423+01:00" level=error msg="Error getting services: rpc error: code = 4 desc = context deadline exceeded"
Sep 06 19:40:59 npt-preprod-mgr000001 dockerd[1531]: time="2017-09-06T19:40:59.086711123+01:00" level=error msg="Handler for GET /v1.24/services returned error: rpc error: code = 4 desc = context deadline exceeded"
Sep 06 19:41:01 npt-preprod-mgr000001 dockerd[1531]: time="2017-09-06T19:41:01.950593447+01:00" level=error msg="agent: session failed" error="session initiation timed out" module="node/agent" node.id=rxoaxgy2zgnkqyd1720z2xvxj
Sep 06 19:41:11 npt-preprod-mgr000001 dockerd[1531]: time="2017-09-06T19:41:11.585026048+01:00" level=info msg="Node join event for npt-preprod-mgr000002-d0b6804e2f6d/10.240.4.71"
I tried to recover by running docker swarm init --force-new-cluster on one of the remaining managers. Although that restored the quorum, the overlay network no longer worked and the containers couldn't communicate over it.
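For reference, the recovery I attempted looked roughly like this; I don't recall the exact invocation, and the --advertise-addr value here is just an example taken from the node address listed further down:

# on one of the surviving managers, rebuild a single-manager cluster from its existing raft state
docker swarm init --force-new-cluster --advertise-addr 10.240.4.69:2377
# then promote the other managers back up to restore a 3-manager quorum
docker node promote npt-preprod-mgr000001 npt-preprod-mgr000002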
In the end I had to delete the entire swarm and rebuild it from scratch. It was a useful test of our version control and DR processes, I suppose, but I'd rather not have to repeat it.
Has anyone seen this before, or does anyone know why it happened? It looks like it must be a bug in Swarm, but I thought I'd check here before raising an issue.
Also, is there a way to force a new leader election, so that I can confirm failover works before demoting a manager in future?
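The only thing I could think of to trigger a leader change deliberately is restarting the Docker daemon on the current leader and watching where leadership moves, along these lines (assuming systemd; it's a forced failover rather than a clean election, so I'm not sure it's a fair test):

sudo systemctl restart docker    # on the current leader
docker node ls                   # from another manager, check which node now shows as Leader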
Docker info from one of the managers:
Containers: 7
Running: 6
Paused: 0
Stopped: 1
Images: 6
Server Version: 17.06.2-ce
Storage Driver: overlay
Backing Filesystem: xfs
Supports d_type: true
Logging Driver: syslog
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: nscsalv9o8xnjgp55msm5kvkw
Is Manager: true
ClusterID: dah4jycouc5ovx8qojyq8l5wt
Managers: 3
Nodes: 5
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Root Rotation In Progress: false
Node Address: 10.240.4.69
Manager Addresses:
10.240.4.68:2377
10.240.4.69:2377
10.240.4.70:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-514.26.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.352GiB
Name: npt-preprod-mgr000003
ID: AN5Z:XL64:3P3B:NCFL:USGO:UQYJ:YEP4:65EU:E3NS:PQ7U:TKNV:ZFLB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false