Cluster Broken and Docker unresponsive

Expected behavior

A healthy cluster

Actual behavior

After a few weeks running a cluster on Docker for AWS 17.05.0-ce-rc1-aws1, today everything started to hang. I noticed it after I wasn't able to do a docker service update through Jenkins:

Error response from daemon: rpc error: code = 4 desc = context deadline exceeded

~/docker # docker service ls
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
~/docker # docker info
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded

Additional Information

The other two members of the cluster have dropped out of the swarm now:

~ # docker service ls
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.

~ # docker info
Containers: 4
Running: 4
Paused: 0
Stopped: 0
Images: 5
Server Version: 17.05.0-ce-rc1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Volume: local
Network: bridge host ipvlan macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
Profile: default
Kernel Version: 4.9.21-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.785GiB
Name: ip-172-28-44-18.ec2.internal
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 46
Goroutines: 46
System Time: 2017-06-22T11:02:53.714211716Z
EventsListeners: 0
Experimental: true
Insecure Registries:
Live Restore Enabled: false

I have no idea why this might have happened.

Any advice? Any other information I can supply?

Thank you very much.


I had a very similar problem: all docker service commands were unresponsive, failing with the same "rpc error: code = 4 desc = context deadline exceeded" error.

Looking at your docker info output, it states that the swarm is "inactive", which seems a bit odd. Which node did you run this command on?

When the swarm is active (even when the manager nodes are dead), you should see further information about the manager nodes, including their IP addresses.

In our case, I noticed that we had six internal manager IP addresses listed, but only three EC2 manager instances running. When I looked more closely, it turned out that EC2 instances had died over time and been auto-recreated with different IP addresses. The services were listening on a particular IP:PORT, so when a machine died, we lost the ability to communicate with its manager.
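This is consistent with losing raft quorum: swarm managers need a majority of the listed manager set to answer, so with six managers still listed but only three reachable, every manager RPC times out. A quick sketch of the arithmetic (the 6/3 counts are from our cluster; yours may differ):

```shell
# Raft quorum check for a swarm manager set.
listed_managers=6    # manager entries still listed by the swarm
alive_managers=3     # EC2 manager instances actually running
quorum=$(( listed_managers / 2 + 1 ))   # majority needed for raft consensus

echo "quorum needed: $quorum, managers alive: $alive_managers"
if [ "$alive_managers" -lt "$quorum" ]; then
  # Below quorum, manager RPCs hang and eventually fail with
  # "context deadline exceeded", exactly as in the report above.
  echo "quorum lost"
fi
```

With 6 listed and 3 alive, quorum is 4 and the cluster cannot make progress, which matches the symptoms.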

Unfortunately, in our case, we had to shut down the services, leave the old swarm from whatever nodes we could, create a new swarm, and then restart the relevant services on it without a specific IP:PORT binding. We shall see how that goes.
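For reference, the sequence looked roughly like the following. This is a dry-run sketch, not the exact commands we ran: the `run` helper just prints each command instead of executing it, and the advertise address and service name/image are hypothetical placeholders.

```shell
# Dry-run helper: print each command rather than executing it.
run() { printf '+ %s\n' "$*"; }

# 1. On every node still responding: force-leave the broken swarm.
run docker swarm leave --force

# 2. On the node chosen as the first manager of the new swarm
#    (172.28.44.18 is a placeholder address):
run docker swarm init --advertise-addr 172.28.44.18

# 3. Re-create the services on the new swarm, publishing through the
#    routing mesh instead of binding to a specific manager IP:PORT
#    ("web" / nginx:alpine are illustrative only):
run docker service create --name web --publish 80:80 nginx:alpine
```

Publishing via the routing mesh means the service stays reachable on every node's published port, so a replaced manager instance with a new IP no longer breaks connectivity.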

Hope that helps.