How to kill a broken swarm?

I’ve been using 1.12 since RC1 all the way to the latest production release, and it’s still pretty buggy. Some of the bugs are hard to reproduce, but I’ve consistently found that if I start and stop enough containers, the swarm will eventually die and become unrecoverable.

Once that happens, I have no idea how to actually get my Docker manager node back to a “blank slate” state - in other words, not part of a swarm at all.

For example, I’m currently in a situation where the swarm has bugged out. I have a global service running that I attempt to remove:

root@master1:~# docker service rm nginxtest
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded

Clearly broken. I will never be able to remove that service. But the issue is, once I’m in this state, I have no idea how to actually just kill the swarm and get out of dodge.

I can’t just leave the swarm:

root@master1:~# docker swarm leave
Error response from daemon: You are attempting to leave cluster on a node that is participating as a manager. Removing the last manager will erase all current state of the cluster. Use `--force` to ignore this message.

I can’t --force leave the swarm:

root@master1:~# docker swarm leave --force
Error response from daemon: context deadline exceeded

I can’t initialize a new cluster:

root@master1:~# docker swarm init --force-new-cluster
Error response from daemon: context deadline exceeded

Is there any way to outright remove this node as a swarm master, without killing the entire machine (which has been my method so far)?

I have the same issue…

Thanks for sharing. I’ll keep an eye on this thread.

I have the same problem too. I’ve killed my cluster by doing the following (see the command sketch after this list):

  • create a global service (loadbalancer)
  • service was running great
  • upgraded the version (image) of the service
  • my new image was broken, so the containers exited with code 1 after 3 seconds
  • swarm kept creating new containers, and the task list filled up
  • I saw what was happening and changed the image back to the one I’d used before
  • swarm broke: two nodes still have the service in the running state (still the old version, because the rolling upgrade hadn’t reached them yet), while the other node had the service in a state of “active” or something similar (I don’t remember exactly, but it started with an “a”) with the broken image attached
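
For reference, a rough sketch of that sequence with placeholder names (“myrepo/lb” and the tags are made up; the real service was my loadbalancer image):

# Sketch of the sequence above (image name and tags are placeholders)
docker service create --mode global --name loadbalancer myrepo/lb:1.0
docker service update --image myrepo/lb:1.1 loadbalancer    # broken image, containers exit with code 1
docker service update --image myrepo/lb:1.0 loadbalancer    # attempt to roll back to the working image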

Since then I’ve also been getting the “context deadline exceeded” errors on all nodes, and the whole swarm is broken. My services are not critical, so I could start from scratch, but there seems to be no way to do that.

This issue is concerning; I’d also like to know how to fix it. @klamar - when you say the task list got filled up, what do you mean by that?

So I initiated the update on the service (changed the image). Swarm tried to upgrade the first server: it stopped the old container and created a new one. The new one failed after 3 seconds with exit code 1, so the task was marked as failed and a new task got created on the same node. Of course the same thing happened to that task as well, so I ended up with about 10 tasks within roughly 30 seconds.

I saw that and tried to “undo” my service update by initiating another update with the old image, which had worked before. After I did this, the last task on this node became “accepted”, but nothing else happened. At this point the swarm cluster basically stopped working with the error. I did not get a new task using the old image, and my task list was still full of those 10 tasks.

I wanted to see whether I really had 10 containers lying around (which would not be good), but I found that I didn’t. For this service I think I didn’t even have one container (for the task in the “accepted” state) on this node, so the task entries were still in the swarm with no container assigned.
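
For reference, this is roughly how the filled-up task list can be inspected - a minimal sketch, assuming the service is called loadbalancer (placeholder name):

# List the tasks for a service; every failed start shows up as its own entry
docker service ps loadbalancer
# After the broken update the list contained about 10 failed tasks for this node,
# plus the newest one stuck in "accepted" with no container behind it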

Unfortunately I tried to “fix” my cluster and destroyed a lot of information in the process, so everything I said above is from memory and might not be accurate. I’m sorry for that; I’d like to help more.

I’ll probably create a new swarm cluster soon on very different nodes. I will try to reproduce this issue.

I have the same problem. If I kill Docker for Mac and restart it, I get back to the same state and I can’t remove any services. Has anyone made any progress on this? I am using Docker for Mac Version 1.12.1 (build: 12133)
2d5b4d9c3daa089e3869e6355a47dd96dbf39856 on macOS Sierra.

Exactly the same for me - it always happens when there are problems starting the images for a service.
I’m using docker-machine with the generic driver, and for example, if I try to start a service based on a Docker Hub image:

docker $(docker-machine config pine64-1) service create --replicas 3 -p 6379:6379 --name redis-server adomenech73:redis-aarch64

and I haven’t previously pulled the image on every node of the cluster, the service fails with a “No such image:” error and the cluster becomes inoperable, giving the same error:

Error response from daemon: rpc error: code = 4 desc = context deadline exceeded

I haven’t found any way around it except recreating the whole cluster from scratch.
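
Since the failure only seems to happen when the image isn’t already present on a node, one thing that may avoid triggering it is pre-pulling the image on every node before creating the service. A rough sketch, assuming the other machines are called pine64-2 and pine64-3 (placeholder names):

# Pre-pull the image on every node before creating the service
# (node names other than pine64-1 are assumptions; adjust to your cluster)
for node in pine64-1 pine64-2 pine64-3; do
  docker $(docker-machine config $node) pull adomenech73:redis-aarch64
done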

docker version

Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:33:38 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Sun Aug 21 19:50:51 2016
 OS/Arch:      linux/arm64

I have the same issue but have not seen how to KILL the swarm yet.

root@sc-ubu-bld05:~# docker node promote sc-ubu-bld04
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
root@sc-ubu-bld05:~# docker node ls
ID                           HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
2rhb6424qlqhdvgft2hbd3omt *  sc-ubu-bld05  Ready   Active        Leader
4hunevvnhhm7yhl9u4ti0d0gq    sc-ubu-bld04  Ready   Active
6l2hnwkp3ch2w4jx3oy280vqu    sc-ubu-bld06  Ready   Active        Unreachable
emstr6wsaakqb5poqh6en2iad    sc-ubu-bld03  Ready   Active
root@sc-ubu-bld05:~# docker node ls
ID                           HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
2rhb6424qlqhdvgft2hbd3omt *  sc-ubu-bld05  Ready   Active        Leader
4hunevvnhhm7yhl9u4ti0d0gq    sc-ubu-bld04  Ready   Active
6l2hnwkp3ch2w4jx3oy280vqu    sc-ubu-bld06  Ready   Active        Unreachable
emstr6wsaakqb5poqh6en2iad    sc-ubu-bld03  Ready   Active
root@sc-ubu-bld05:~# docker node rm emstr6wsaakqb5poqh6en2iad
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
root@sc-ubu-bld05:~# docker node rm --force emstr6wsaakqb5poqh6en2iad
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
root@sc-ubu-bld05:~# docker info
Containers: 11
Running: 1
Paused: 0
Stopped: 10
Images: 79
Server Version: 1.12.1
Storage Driver: aufs
Root Dir: /local/mnt/docker/aufs
Backing Filesystem: extfs
Dirs: 173
Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: overlay null bridge host
Swarm: active
NodeID: 2rhb6424qlqhdvgft2hbd3omt
Is Manager: true
ClusterID: 5sadezu6u8o07uz7ofjcglwag
Managers: 2
Nodes: 4
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 10.49.13.222
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 3.13.0-92-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.798 GiB
Name: sc-ubu-bld05
ID: CP4J:YAC6:OMGC:4TMD:XV72:DJUX:5TDX:IFIA:CCMR:S5J7:MQK5:XRPK
Docker Root Dir: /local/mnt/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8
root@sc-ubu-bld05:~# docker version
Client:
Version: 1.12.1
API version: 1.24
Go version: go1.6.3
Git commit: 23cf638
Built: Thu Aug 18 05:22:43 2016
OS/Arch: linux/amd64

Server:
Version: 1.12.1
API version: 1.24
Go version: go1.6.3
Git commit: 23cf638
Built: Thu Aug 18 05:22:43 2016
OS/Arch: linux/amd64

I have the same problem, and it’s fairly easy to recreate by updating and/or removing services. Are there any updates on this?
[centos@ip-10-0-2-29 ~]$ docker version
Client:
Version: 17.05.0-ce
API version: 1.29
Go version: go1.7.5
Git commit: 89658be
Built: Thu May 4 22:06:25 2017
OS/Arch: linux/amd64

Server:
Version: 17.05.0-ce
API version: 1.29 (minimum version 1.12)
Go version: go1.7.5
Git commit: 89658be
Built: Thu May 4 22:06:25 2017
OS/Arch: linux/amd64
Experimental: false
[centos@ip-10-0-2-29 ~]$ docker swarm status
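
For what it’s worth, a rough sketch of the kind of sequence that reproduces it for me (the service name and image tags are placeholders):

# Example sequence that tends to wedge the swarm (name and tags are placeholders)
docker service create --name web --replicas 3 nginx:1.13.0
docker service update --image nginx:1.13.1 web
docker service rm web    # eventually fails with "context deadline exceeded"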

Solution: I manually cd’d into /var/lib/docker and removed the swarm folder to get out of this deadlock.
You need to stop the Docker daemon before doing this, and restart the daemon afterwards.
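
For anyone else stuck here, a minimal sketch of those steps, assuming a systemd-based host (use `service docker stop`/`start` on older init systems). Note that this throws away all of the node’s swarm state:

# Stop the daemon, wipe this node's swarm state, and restart the daemon
systemctl stop docker
rm -rf /var/lib/docker/swarm
systemctl start docker
# docker info should now report "Swarm: inactive", and a fresh cluster
# can be created with docker swarm init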

Hi guys,

Can somebody else confirm whether the solution from myjpa was successful?

I have the same issue.

Sorry for my English.

Deleting the swarm folder in /var/lib/docker worked for me

Worked for me, thanks.