Swarm in Broken State after ASG replaced 2 out of 3 Managers

Expected behavior

Swarm should come back to life after managers are replaced, assuming enough managers are left. Failing that, I need to be able to recover the swarm from the one working manager so I don’t have to rebuild the swarm environment; even if manual steps are involved, I should not have to rebuild my entire swarm. See the Additional Information section for details.

Actual behavior

The swarm is still functioning at the Docker container level; however, all swarm-related commands return Error response from daemon: rpc error: code = 4 desc = context deadline exceeded.
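
For illustration, any command that has to go through the swarm's Raft store fails the same way (docker node ls shown here as one example; service and secret commands behave identically):

$ docker node ls
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded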

The one surviving manager keeps calling new Raft elections but never reaches quorum: it collects only 2 of the 3 votes required (quorum:3) because the MsgVote requests to the two unreachable peers time out. Its logs show the following:

Jul 24 17:17:48 moby root: time="2017-07-24T17:17:48.740986650Z" level=debug msg="Calling GET /_ping"  
Jul 24 17:17:48 moby root: time="2017-07-24T17:17:48.801706712Z" level=debug msg="failed to send message MsgVote" error="rpc error: code = 4 desc = context deadline exceeded" peer_id=493893846edb589  
Jul 24 17:17:48 moby root: time="2017-07-24T17:17:48.801777346Z" level=debug msg="failed to send message MsgVote" error="rpc error: code = 4 desc = context deadline exceeded" peer_id=7a7b7dab304819e2  
Jul 24 17:17:50 moby root: time="2017-07-24T17:17:50.174394407Z" level=debug msg="memberlist: Initiating push/pull sync with: 172.31.7.104:7946"  
Jul 24 17:17:50 moby root: time="2017-07-24T17:17:50.917465586Z" level=debug msg="Calling GET /_ping"  
Jul 24 17:17:51 moby root: time="2017-07-24T17:17:51.795057138Z" level=info msg="2fe3d3d53ab832ef is starting a new election at term 20245" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:51 moby root: time="2017-07-24T17:17:51.795123065Z" level=info msg="2fe3d3d53ab832ef became candidate at term 20246" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:51 moby root: time="2017-07-24T17:17:51.795149640Z" level=info msg="2fe3d3d53ab832ef received MsgVoteResp from 2fe3d3d53ab832ef at term 20246" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:51 moby root: time="2017-07-24T17:17:51.795175424Z" level=info msg="2fe3d3d53ab832ef [logterm: 6, index: 1189709] sent MsgVote request to 493893846edb589 at term 20246" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:51 moby root: time="2017-07-24T17:17:51.795199839Z" level=info msg="2fe3d3d53ab832ef [logterm: 6, index: 1189709] sent MsgVote request to 7a7b7dab304819e2 at term 20246" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:51 moby root: time="2017-07-24T17:17:51.795222690Z" level=info msg="2fe3d3d53ab832ef [logterm: 6, index: 1189709] sent MsgVote request to 2519b2537be2a611 at term 20246" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:51 moby root: time="2017-07-24T17:17:51.798989499Z" level=info msg="2fe3d3d53ab832ef received MsgVoteResp from 2519b2537be2a611 at term 20246" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:51 moby root: time="2017-07-24T17:17:51.799014962Z" level=info msg="2fe3d3d53ab832ef [quorum:3] has received 2 MsgVoteResp votes and 0 vote rejections" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:52 moby root: time="2017-07-24T17:17:52.118361478Z" level=debug msg="Calling GET /version"  
Jul 24 17:17:52 moby root: time="2017-07-24T17:17:52.200531655Z" level=debug msg="memberlist: TCP connection from=172.31.28.179:53500"  
Jul 24 17:17:52 moby root: time="2017-07-24T17:17:52.204492678Z" level=debug msg="ip-172-31-13-70.ec2.internal-ca2554560b32: Initiating  bulk sync for networks [o3dneytku64fe0b880xjo04mb ltxz265jp2yn19qctskflfdu5 zo7mzf2ahn2squdsv916s8jlf pk5x39orrsgnht9wlyd1gckdt z4jxng0o5zxbost48j2nt45d7 kly27wkx11wn3lona5jgm9kjj 0n8hd7wnlqcou6r13g7ync9al] with node ip-172-31-28-179.ec2.internal-69bea454e563"  
Jul 24 17:17:53 moby root: time="2017-07-24T17:17:53.722688237Z" level=debug msg="(*session).start" module="node/agent" node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:53 moby root: time="2017-07-24T17:17:53.796131587Z" level=debug msg="failed to send message MsgVote" error="rpc error: code = 4 desc = context deadline exceeded" peer_id=7a7b7dab304819e2  
Jul 24 17:17:53 moby root: time="2017-07-24T17:17:53.796186181Z" level=debug msg="failed to send message MsgVote" error="rpc error: code = 4 desc = context deadline exceeded" peer_id=493893846edb589  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.281361553Z" level=debug msg="Running health check for container da2719ee140ba953b336a625ced1a35cb93ce7d2a6688d2729acf8a107dce170 ..."  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.281598807Z" level=debug msg="starting exec command 1e3fa796fb27d71196dc552f4d0bba650929c3c4cb84df72934b5654c573bccc in container da2719ee140ba953b336a625ced1a35cb93ce7d2a6688d2729acf8a107dce170"  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.283046046Z" level=debug msg="attach: stdout: begin"  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.283086327Z" level=debug msg="attach: stderr: begin"  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.331981375Z" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"start-process\", Id:\"da2719ee140ba953b336a625ced1a35cb93ce7d2a6688d2729acf8a107dce170\", Status:0x0, Pid:\"1e3fa796fb27d71196dc552f4d0bba650929c3c4cb84df72934b5654c573bccc\", Timestamp:(*timestamp.Timestamp)(0xc64d1d4020)}"  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.332051913Z" level=debug msg="libcontainerd: event unhandled: type:\"start-process\" id:\"da2719ee140ba953b336a625ced1a35cb93ce7d2a6688d2729acf8a107dce170\" pid:\"1e3fa796fb27d71196dc552f4d0bba650929c3c4cb84df72934b5654c573bccc\" timestamp:<seconds:1500916674 nanos:331509265 > "  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.336330382Z" level=debug msg="containerd: process exited" id=da2719ee140ba953b336a625ced1a35cb93ce7d2a6688d2729acf8a107dce170 pid=1e3fa796fb27d71196dc552f4d0bba650929c3c4cb84df72934b5654c573bccc status=0 systemPid=18598  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.336720176Z" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"exit\", Id:\"da2719ee140ba953b336a625ced1a35cb93ce7d2a6688d2729acf8a107dce170\", Status:0x0, Pid:\"1e3fa796fb27d71196dc552f4d0bba650929c3c4cb84df72934b5654c573bccc\", Timestamp:(*timestamp.Timestamp)(0xc64d4df330)}"  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.336841875Z" level=debug msg="attach: stderr: end"  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.336857989Z" level=debug msg="attach: stdout: end"  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.336872927Z" level=debug msg="Health check for container da2719ee140ba953b336a625ced1a35cb93ce7d2a6688d2729acf8a107dce170 done (exitCode=0)"  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.648896372Z" level=debug msg="Calling GET /_ping"  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.795057415Z" level=info msg="2fe3d3d53ab832ef is starting a new election at term 20246" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.795102990Z" level=info msg="2fe3d3d53ab832ef became candidate at term 20247" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.795128896Z" level=info msg="2fe3d3d53ab832ef received MsgVoteResp from 2fe3d3d53ab832ef at term 20247" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.795154437Z" level=info msg="2fe3d3d53ab832ef [logterm: 6, index: 1189709] sent MsgVote request to 493893846edb589 at term 20247" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.795178158Z" level=info msg="2fe3d3d53ab832ef [logterm: 6, index: 1189709] sent MsgVote request to 7a7b7dab304819e2 at term 20247" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.795205198Z" level=info msg="2fe3d3d53ab832ef [logterm: 6, index: 1189709] sent MsgVote request to 2519b2537be2a611 at term 20247" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.799245018Z" level=info msg="2fe3d3d53ab832ef received MsgVoteResp from 2519b2537be2a611 at term 20247" module=raft node.id=khgqscjpphabk6sw8phooh2h4  
Jul 24 17:17:54 moby root: time="2017-07-24T17:17:54.799286361Z" level=info msg="2fe3d3d53ab832ef [quorum:3] has received 2 MsgVoteResp votes and 0 vote rejections" module=raft node.id=khgqscjpphabk6sw8phooh2h4

Additional Information

Rebuilding Swarm steps
Rebuilding by tearing down and recreating the stack with the CloudFormation template is a considerable amount of work, and the downtime for the entire swarm is around 1 hour or so:

  • Route53 changes to support the CNAME SSL / new ELB configuration
  • Destroying the previous formation (takes 20+ minutes)
  • Creating the new formation (takes 10+ minutes)
  • Re-binding Security Groups for VPNs, etc.
  • Setting up static routes for ElasticSearch
  • Setting up VPC peering so the VPN VPC / dev network can communicate with the Docker VPC
  • Setting up all secrets again
  • Deploying the monitoring stack and setting up Grafana
  • Deploying the base infrastructure stacks
  • Deploying the application environment stacks (3)

So, as you can see, if there’s a way to recover from this, even if it’s a manual process, it would be much appreciated.
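
For reference, the usual manual recovery path when quorum is permanently lost is to re-bootstrap Raft from the surviving manager and then re-join the replacement nodes as managers. A rough sketch, run on the surviving manager (the address and node IDs below are placeholders, not values from this cluster):

$ # Re-create a single-node Raft cluster from this manager's existing state
$ docker swarm init --force-new-cluster --advertise-addr <surviving-manager-ip>
$ # Identify and remove the stale entries for the terminated managers
$ docker node ls
$ docker node demote <old-manager-node-id>
$ docker node rm --force <old-manager-node-id>
$ # Generate a manager join token and re-join the replacement instances
$ docker swarm join-token manager
$ # (on each replacement instance)
$ docker swarm join --token <manager-token> <surviving-manager-ip>:2377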

Docker Version

$ docker version
Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 21:43:09 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 21:43:09 2017
 OS/Arch:      linux/amd64
 Experimental: true

This was not a testing attempt; this was the AWS ASG kicking in and replacing 2 of the 3 managers in quick succession, as shown in the ASG activity history below.

  • Successful: Launching a new EC2 instance: i-034716ac58a99d0c9 (2017 July 23 14:48:50 UTC-4 to 2017 July 23 14:49:25 UTC-4)
  • Successful: Terminating EC2 instance: i-02931a76a602d1fe9 (2017 July 23 14:48:18 UTC-4 to 2017 July 23 14:50:21 UTC-4)
  • Successful: Launching a new EC2 instance: i-07c4ed6617d71e9b4 (2017 July 23 14:42:24 UTC-4 to 2017 July 23 14:42:58 UTC-4)
  • Successful: Terminating EC2 instance: i-0e5967ec31e5aa642 (2017 July 23 14:41:53 UTC-4 to 2017 July 23 14:43:23 UTC-4)

Docker Diagnose

$ docker-diagnose
OK hostname=ip-172-31-31-200-ec2-internal session=1500917211-MaOBJTPzodoFGXjmdfjZKGMsJBx6xoxW
OK hostname=ip-172-31-13-70-ec2-internal session=1500917211-MaOBJTPzodoFGXjmdfjZKGMsJBx6xoxW
OK hostname=ip-172-31-45-122-ec2-internal session=1500917211-MaOBJTPzodoFGXjmdfjZKGMsJBx6xoxW
OK hostname=ip-172-31-28-167-ec2-internal session=1500917211-MaOBJTPzodoFGXjmdfjZKGMsJBx6xoxW
OK hostname=ip-172-31-28-179-ec2-internal session=1500917211-MaOBJTPzodoFGXjmdfjZKGMsJBx6xoxW
OK hostname=ip-172-31-13-24-ec2-internal session=1500917211-MaOBJTPzodoFGXjmdfjZKGMsJBx6xoxW
OK hostname=ip-172-31-34-252-ec2-internal session=1500917211-MaOBJTPzodoFGXjmdfjZKGMsJBx6xoxW
OK hostname=ip-172-31-7-104-ec2-internal session=1500917211-MaOBJTPzodoFGXjmdfjZKGMsJBx6xoxW
Done requesting diagnostics.
Your diagnostics session ID is 1500917211-MaOBJTPzodoFGXjmdfjZKGMsJBx6xoxW
Please provide this session ID to the maintainer debugging your issue.

Steps to reproduce the behavior

  1. Build a Docker swarm using Docker CE for AWS 17.05.0-ce (17.05.0-ce-aws2).
  2. Remove 2 managers in a short amount of time (e.g. via the AWS CLI, as sketched after this list).
  3. Sit back and watch things bomb.
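
A minimal way to trigger step 2 from the AWS CLI; the instance IDs are placeholders, and this assumes the managers belong to an Auto Scaling group that launches replacements:

$ aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id <manager-instance-1> --no-should-decrement-desired-capacity
$ aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id <manager-instance-2> --no-should-decrement-desired-capacity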

Worked with Ken on the Docker for AWS GitHub repo to rebuild the swarm cluster from the surviving manager: