I’ve been using 1.12 since RC1 all the way to the latest production release, and it’s still pretty buggy. Some of the bugs are hard to reproduce, but I’ve consistently found that if I start and stop enough containers, the swarm will eventually die and become unrecoverable.
Once that happens, I have no idea how to get my Docker manager node back to a “blank slate” state - in other words, not part of a swarm at all.
For example, I’m currently in a situation where the swarm has bugged out. I have a global service running that I attempt to remove:
root@master1:~# docker service rm nginxtest
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
Clearly broken. I will never be able to remove that service. But the issue is, once I’m in this state, I have no idea how to actually just kill the swarm and get out of dodge.
I can’t just leave the swarm:
root@master1:~# docker swarm leave
Error response from daemon: You are attempting to leave cluster on a node that is participating as a manager. Removing the last manager will erase all current state of the cluster. Use `--force` to ignore this message.
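For reference, the escape hatch that error message points at is the forced leave, which throws away all cluster state on the last manager and should return the node to a “blank slate”, though in a swarm that is already wedged it may not help:

docker swarm leave --force   # abandon the cluster and discard this manager's swarm state
docker swarm init            # only if you then want to start a fresh, empty swarm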
I have the same problem too. I’ve killed my cluster by doing the following:
create a global service (loadbalancer)
service was running great
upgraded the version (image) of the service
my new image was broken, so the containers exited with code 1 after about 3 seconds
swarm kept creating new containers, and the task list filled up
I saw what happened and changed the image back to the one I’d used before
swarm broke: 2 nodes still have the service in the “running” state (still the old version, because the one-by-one upgrade had not reached them yet), and the other node had the service in a state of “active” or something similar (I don’t remember exactly, but it started with an “a”) with the broken image attached (see the command sketch below)
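In command form, the repro above amounts to roughly the following (the image tags are placeholders; only the service name comes from the description):

docker service create --name loadbalancer --mode global example/lb:v1   # global service, one task per node
docker service update --image example/lb:v2 loadbalancer                # broken image: containers exit 1 after ~3s
docker service update --image example/lb:v1 loadbalancer                # attempt to roll back to the working image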
Since then I’ve also been getting the “context deadline exceeded” errors on all nodes, and the whole swarm is broken. My services are not critical, so I could start from scratch, but there is no way to do that.
So I initiated the update on the service (changed image).
Swarm tried to upgrade the first server: it stopped the old container and created a new one. The new one failed after 3 seconds with exit code 1, so the task was marked as failed and a new task was created on the same node. Of course the same thing happened to that task as well, so within about 30 seconds I had roughly 10 tasks. When I saw that, I tried to “undo” my service update by starting another service update, this time back to the old image, which had worked before.

After I did this, the last task on this node became “accepted”, but nothing else happened. At that point the swarm cluster basically stopped working with the error above. I never got a new task using the old image, and my task list was still full of those 10 tasks. I wanted to check whether I really had 10 containers lying around (which would not be good), but it turned out I didn’t - for this service I don’t think I had even one container on this node (only the task in the “accepted” state). So the task entries were still in the swarm with no container assigned.
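A quick way to check whether those task entries really correspond to containers is to compare the swarm’s view with what is actually on the node (the service name is a placeholder):

docker service ps loadbalancer            # tasks as swarm sees them, including the failed ones
docker ps -a --filter name=loadbalancer   # containers that actually exist on this node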
Unfortunately I tried to “fix” my cluster and destroyed a lot of information in the process, so everything I’ve said above is from memory and might not be accurate. I’m sorry for that; I’d like to help more.
I’ll probably create a new swarm cluster soon on very different nodes. I will try to reproduce this issue.
I have the same problem. If I kill Docker for Mac and restart it, I come back to the same state and can’t remove any services. Has anyone made any progress on this? I am using Docker for Mac Version 1.12.1 (build: 12133, 2d5b4d9c3daa089e3869e6355a47dd96dbf39856) on macOS Sierra.
Exactly the same for me; it always happens when there are problems starting a service’s containers from its image.
I’m using docker-machine with the generic driver. For example, if I try to start a service based on a Docker Hub image that I haven’t previously pulled on every node of the cluster, the service fails with a “No such image:” error and the cluster becomes inoperable, giving the same error. I haven’t found any way around it except recreating the whole cluster from scratch:
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
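For context, the trigger is nothing more exotic than a plain service create from a Docker Hub image that hasn’t been pulled on the nodes beforehand; for example (service and image names are only examples):

docker service create --name web --replicas 3 nginx:latest

Pre-pulling the image on every node first (docker pull nginx:latest on each host) seems to avoid the “No such image:” failure, although that obviously shouldn’t be necessary.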
docker version
Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:33:38 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Sun Aug 21 19:50:51 2016
 OS/Arch:      linux/arm64
I have the same problem, and it’s fairly easy to reproduce by updating and/or removing services. Are there any updates on this?
[centos@ip-10-0-2-29 ~]$ docker version
Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May 4 22:06:25 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May 4 22:06:25 2017
 OS/Arch:      linux/amd64
 Experimental: false
[centos@ip-10-0-2-29 ~]$ docker swarm status
Solution: I manually cd’d into /var/lib/docker and removed the swarm folder to get out of this deadlock.
You need to stop the Docker daemon before doing it, and restart the daemon afterwards.
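In command form, that works out to something like the following (assuming systemd manages the daemon; note that this wipes all local swarm state on the node):

systemctl stop docker
rm -rf /var/lib/docker/swarm   # remove the node's swarm state (membership, raft data, certificates)
systemctl start docker
docker swarm init              # only if you want to create a fresh swarm afterwards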