Docker worker nodes shown as "Down" after re-start

I have a cluster of 4 physical nodes, all running Docker 1.12.1, with one manager and three workers in the swarm. Manager availability is set to ‘Drain’.
Services are created (with docker service create …) and run fine on the worker nodes until a power cycle. At the moment there is just one service (or replica) per node.
After a power cycle (e.g. a power failure, not a graceful shutdown) the nodes sometimes have a status of “Down” as shown by docker node ls. Availability is still “Active”. Tasks are shown as Allocated, but no services are running. Sometimes the nodes and services do recover after a power cycle.
What is the recommended procedure to recover the worker nodes and get the services back up and running? How can I change the status from Down to Ready?
At the moment I have to ssh to each node in turn and run ‘docker swarm leave’, then switch to the manager and run ‘docker node rm ’, then switch back to the node and run ‘docker swarm join’. After re-joining the swarm, tasks start to run on the nodes.
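
The manual leave / remove / re-join procedure described above can be sketched as a small dry-run helper. This is only a sketch under assumptions: the function name rejoin_plan and its five arguments (worker host, manager host, node ID, join token, manager address) are hypothetical placeholders, and ssh access to both hosts is assumed. It prints the three commands in order instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the manual recovery steps in order.
# $1 worker host, $2 manager host, $3 node ID or hostname as seen by the manager,
# $4 join token, $5 manager address (host:port). All values come from the caller.
rejoin_plan() {
    worker=$1; manager=$2; node=$3; token=$4; addr=$5
    # Step 1: the worker leaves the swarm.
    printf 'ssh %s docker swarm leave\n' "$worker"
    # Step 2: the manager removes the stale node entry.
    printf 'ssh %s docker node rm %s\n' "$manager" "$node"
    # Step 3: the worker joins again with the swarm's join token.
    printf 'ssh %s docker swarm join --token %s %s\n' "$worker" "$token" "$addr"
}
```

Review the printed commands first; piping the output to sh would then execute them one after another.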

Is it possible that the node in question has a different IP address when it comes back up? There was a very similar symptom described here that was traced back to the node getting a different IP.

The IP addresses remain the same after the reboot.

Hi,

Same “issue” here.
I have 3 managers with Drain availability and 3 workers.
After restarting the 3 workers, their status is set to Down while their availability is set to Active.

Any idea?

Hi,

I also have the same issue after restarting.

Try removing all nodes from the cluster, including the manager (on the manager, use docker swarm leave --force), and then create the cluster again with all the old nodes.
Your services and the current state of the machines will remain the same.

I have the same issue. Other than removing and re-adding the worker nodes, is there any other solution?

[root@docker1 ~]# docker node ls
ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
dqtt8gei6ozueyzjimkmnpaec *  docker1   Ready   Active        Leader
jrbii6h9olcuixvs74m6x81a1    docker3   Down    Active
o7rfieltu7om9sqgpxub7xbff    docker2   Down    Active
[root@docker1 ~]#

One workaround that works for me is to execute ‘docker ps’ after a reboot.
Option 1:
sudo crontab -e
then add a line like this
@reboot docker ps

Option 2: if you use ansible:
# Create a crontab entry like "@reboot docker ps" to help nodes join the swarm
- cron:
    name: "Initialise docker after reboot"
    special_time: reboot
    job: "docker ps"
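
One possible refinement of the @reboot workaround is a small retry wrapper. It rests on an assumption the thread does not confirm: that the cron job can fire before the Docker daemon is ready, in which case a single docker ps would just fail. The function name retry and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds or the attempts run out.
# $1 = maximum number of attempts, $2 = seconds to sleep between attempts,
# remaining arguments = the command to run.
retry() {
    attempts=$1; delay=$2; shift 2
    n=1
    while ! "$@"; do
        # Give up once the last allowed attempt has failed.
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}
```

A boot script could then call something like retry 10 3 docker ps instead of a bare docker ps; the counts here are arbitrary.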

Is there anyone who does not have this problem? It would surprise me if this behaviour differed between machines.

It’s a bit weird that the nodes do not reconnect on their own, especially since $ docker ps seems to fix everything. Why not run that kind of call inside Docker when the docker service starts on the host?

I have:

ID                            HOSTNAME                 STATUS              AVAILABILITY        MANAGER STATUS
2nux77pt5w1uvk8ca47n208od *   docker-swarm-manager-1   Ready               Active              Leader
q9xe45perkz08eama2dwr5qeg     docker-swarm-worker-1    Down                Active
v4cwsa9b7i5dpixr3n6nocslb     docker-swarm-worker-1    Down                Active
wcs1v35x5i7izgt8tpuhps3eu     docker-swarm-worker-2    Ready               Active

I ran $ docker swarm leave and $ docker swarm join again on docker-swarm-worker-1, which for some reason is now also duplicated. I’m running Docker version 17.05.0-ce, build 89658be on CoreOS (alpha channel, to get a new enough Docker).

If I run $ docker ps I immediately get:

ID                            HOSTNAME                 STATUS              AVAILABILITY        MANAGER STATUS                                                                                    
2nux77pt5w1uvk8ca47n208od *   docker-swarm-manager-1   Ready               Active              Leader                                                                                            
q9xe45perkz08eama2dwr5qeg     docker-swarm-worker-1    Ready               Active               
v4cwsa9b7i5dpixr3n6nocslb     docker-swarm-worker-1    Down                Active               
wcs1v35x5i7izgt8tpuhps3eu     docker-swarm-worker-2    Ready               Active
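
To see at a glance which nodes are still Down after trying a workaround, the table printed by docker node ls can be filtered. The sketch below is an assumption-laden helper, not a Docker feature: the name down_nodes is hypothetical, and it relies on the default column layout shown in this thread (ID, optional * marker, HOSTNAME, STATUS, ...).

```shell
#!/bin/sh
# Hypothetical helper: read `docker node ls` table output on stdin and
# print the hostname of every node whose STATUS column is "Down".
down_nodes() {
    awk 'NR > 1 {
        # The current node has an extra "*" field after its ID.
        if ($2 == "*") { host = $3; status = $4 } else { host = $2; status = $3 }
        if (status == "Down") print host
    }'
}
```

On a manager it would be used as docker node ls | down_nodes. Note that a duplicated hostname, as in the listing above, is printed once per Down entry.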

Running these commands in order for each node may fix the problem:

  1. sudo docker-machine start worker1
  2. sudo docker-machine regenerate-certs worker1
  3. sudo docker-machine env worker1

To change an individual node's status from Down to Ready:

docker-machine ssh myvm1 "docker node update myvm2 --availability active"
did the trick for me

I had the same issue with two nodes: one was the leader and the other was a worker.
After the Docker service was restarted, the leader node was shown as Down.

I fixed this by promoting the worker node to manager and then, on the new manager node, demoting the failed leader node.

ubuntu@staging1:~$ docker node ls
ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
y0363og32cur9xq9yy0nqg6j9 *  staging1  Down    Active        Reachable
x68yyqtt0rogmabec552634mf    staging2  Ready   Active

ubuntu@staging1:~$ docker node promote staging2

root@staging1:~# docker node ls
ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
plxkuqqnkxotrzy7nhjj27w34 *  staging1  Down    Active        Leader
x68yyqtt0rogmabec552634mf    staging2  Ready   Active        Reachable

root@staging2:~# docker node demote staging1

root@staging2:~# docker node ls
ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
plxkuqqnkxotrzy7nhjj27w34    staging1  Down    Active
x68yyqtt0rogmabec552634mf *  staging2  Ready   Active        Leader

root@staging2:~# docker node rm staging1

Get join-token from leader node:
root@staging2:~# docker swarm join-token manager

Reconnect failed node to docker swarm cluster:

root@staging1:~# docker swarm leave --force
root@staging1:~# systemctl stop docker
root@staging1:~# rm -rf /var/lib/docker/swarm/
root@staging1:~# systemctl start docker
root@staging1:~# docker swarm join --token XXXXXXXX 192.168.XX.XX:2377

root@staging1:~# docker node ls
ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
y0363og32cur9xq9yy0nqg6j9 *  staging1  Ready   Active        Reachable
x68yyqtt0rogmabec552634mf    staging2  Ready   Active        Leader

root@staging1:~# docker node demote staging2

root@staging1:~# docker node ls
ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
y0363og32cur9xq9yy0nqg6j9 *  staging1  Ready   Active        Leader
x68yyqtt0rogmabec552634mf    staging2  Ready   Active

maybe firewall or iptables
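
If a firewall is the suspect, these are the ports that swarm mode needs open between nodes: 2377/tcp for cluster management, 7946/tcp and 7946/udp for node-to-node communication, and 4789/udp for overlay network traffic. The sketch below simply encodes that list; the function names swarm_ports and swarm_tcp_ports are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the ports swarm mode uses between nodes.
swarm_ports() {
    cat <<'EOF'
2377/tcp cluster management
7946/tcp node-to-node communication
7946/udp node-to-node communication
4789/udp overlay network traffic
EOF
}

# Print only the TCP port numbers, which a simple connect test can probe.
swarm_tcp_ports() {
    swarm_ports | awk '{ split($1, p, "/"); if (p[2] == "tcp") print p[1] }'
}
```

From a worker, each TCP port could then be probed against the manager's address with whatever connect test is installed, for example nc -z where that option is available. The UDP ports need a different kind of check.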

Thanks! Works for me.

restart docker daemon

I hit this issue and found the cause to be my proxy settings. Once I added the swarm hosts to NO_PROXY in /etc/systemd/system/docker.service.d/http-proxy.conf and ran systemctl daemon-reload and systemctl restart docker (on all swarm hosts), everything worked as expected.

[Service]
Environment="HTTP_PROXY=http://XX.XX.XX.XX:8080/"
Environment="HTTPS_PROXY=http://XX.XX.XX.XX:8080/"
Environment="NO_PROXY=*.mydomain.com"
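
As a rough sketch of "adding the swarm hosts to NO_PROXY", the line for the drop-in file can be generated from a list of entries. The function name no_proxy_line is hypothetical, and the addresses in the usage note are placeholders, not values from this thread. It only joins the entries with commas and does not check how Docker matches them.

```shell
#!/bin/sh
# Hypothetical helper: build the NO_PROXY line for the systemd drop-in
# from the entries given as arguments (domains, hostnames or IPs).
no_proxy_line() {
    # Join all arguments with commas in a subshell so IFS is not changed globally.
    joined=$(IFS=,; printf '%s' "$*")
    printf 'Environment="NO_PROXY=%s"\n' "$joined"
}
```

For example, no_proxy_line '*.mydomain.com' 192.168.0.10 192.168.0.11 prints a single Environment= line that can replace the NO_PROXY line above, followed by the daemon-reload and restart already described.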

On each VM, run:
docker swarm update

That worked perfectly for me. I don’t know why, but it’s OK now.

Here is the scenario I faced:

I built one manager and 2 nodes in AWS. It all worked well. Once I did a restart, it showed this:

[root@ip-172-31-37-141 ec2-user]# docker node ls
ID                           HOSTNAME          STATUS  AVAILABILITY  MANAGER STATUS  ENGINE VERSION
x5irz7ye2t6u2f16x35xo1o9e *  ip-172-31-37-141  Ready   Active        Leader          19.03.6-ce
pkei5lx21eul97nk8pqwz9xmb    ip-172-31-41-201  Down    Active                        19.03.6-ce
d3uadc27pvfpcz854txpkbsoe    ip-172-31-91-112  Down    Active                        19.03.6-ce

On checking, I found that I had created the swarm with the public IP, which changed after the server restart. I did the following to make it work again, in a few simple steps:

On Manager :

1. /etc/init.d/docker stop
2. rm -rf /var/lib/docker/swarm/
3. /etc/init.d/docker start
4. docker swarm leave --force
5. docker swarm init --advertise-addr 172.31.37.141

[root@ip-172-31-37-141 ec2-user]# docker swarm init --advertise-addr 172.31.37.141
Swarm initialized: current node (s8kz86z5iqoo0rp05v2g5ivqz) is now a manager.

To add a worker to this swarm, run the following command:

docker swarm join --token SWMTKN-1-4jdx7tri1y2xflt7gv9zekd7dd7ok5wx4hdva4dtg04k4n5vb7-3mfbeirjbv594gb0t6enyhsae 172.31.37.141:2377

To add a manager to this swarm, run ‘docker swarm join-token manager’ and follow the instructions.

Finally, I rejoined the worker nodes as well.

Rejoining a worker node:
[root@ip-172-31-41-201 ec2-user]# docker swarm leave --force
Node left the swarm.
[root@ip-172-31-41-201 ec2-user]# docker swarm join --token SWMTKN-1-4jdx7tri1y2xflt7gv9zekd7dd7ok5wx4hdva4dtg04k4n5vb7-3mfbeirjbv594gb0t6enyhsae 172.31.37.141:2377
This node joined a swarm as a worker.
[root@ip-172-31-41-201 ec2-user]#

It worked !!

Now it is :

[root@ip-172-31-37-141 ec2-user]# docker node ls
ID                           HOSTNAME          STATUS  AVAILABILITY  MANAGER STATUS  ENGINE VERSION
s8kz86z5iqoo0rp05v2g5ivqz *  ip-172-31-37-141  Ready   Active        Leader          19.03.6-ce
tzlbqf06584az3txkhkbv7f2c    ip-172-31-41-201  Ready   Active                        19.03.6-ce
lx5e308ctyvtc9c5hhb75aa0a    ip-172-31-91-112  Ready   Active                        19.03.6-ce

Cheers & be Safe friends !

Regards,
Satya
Email : satyajitjem@gmail.com

Go to the worker node whose status is Down and start the Docker service using “sudo service docker start”. Then, on the manager, run “docker node ls” to see the status and feel happy.

This works perfectly for me. I am not sure if it would still work after a reboot of the machines, though. Anyway, thanks a million!!