
Docker worker nodes shown as "Down" after restart


(Jonathanlister) #1

I have a cluster of 4 physical nodes, all running Docker 1.12.1, with one manager and three workers in the swarm. Manager availability is set to ‘Drain’.
Services are created (with docker service create …) and run fine on the worker nodes until a power-cycle. At the moment it is just one service (or replica) per node.
After the power cycle (e.g. power fail, not a graceful shutdown) the nodes sometimes have a status of “Down” as shown by docker node ls. Availability is still “Active”. Tasks are shown as Allocated, but no services are running. Sometimes the nodes and services recover after a power cycle.
What is the recommended procedure to recover the worker nodes and get the services back up and running? How can I change the status from Down to Ready?
At the moment I have to ssh to each node in turn and run ‘docker swarm leave’, then switch to the manager and run ‘docker node rm <node>’, then switch back to the node and run ‘docker swarm join’. After re-joining the swarm, tasks start to run on the nodes again.
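Roughly, the sequence is the following (NODE, TOKEN and MANAGER_IP are placeholders, not my real values):

# on the affected worker
docker swarm leave

# on the manager
docker node rm NODE
docker swarm join-token worker    # prints the join command, including the token

# back on the worker
docker swarm join --token TOKEN MANAGER_IP:2377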


(Jeff Anderson) #2

Is it possible that the node in question has a different IP address when it comes back up? There was a very similar symptom described here that was traced back to the node getting a different IP.
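If you want to double-check that, you can compare the address the swarm recorded for the node with the address the host actually has after the reboot. On recent Docker versions (I am not certain the Status.Addr field exists in 1.12) something like this should show it:

docker node inspect NODE --format '{{ .Status.Addr }}'   # on a manager: the address the swarm has recorded
ip addr show                                             # on the worker: its current address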


(Jonathanlister) #3

The IP addresses remain the same after the reboot


(Rudy YAYON) #4

Hi,

Same “issue” here.
I have 3 managers with Drain availability and 3 workers.
After restarting the 3 workers, their status is set to Down while their availability is set to Active.

Any idea?


(Ivandir) #5

Hi,

I also have the same issue after restarting.


(Maheshpawar149) #6

Try removing all nodes, including the manager (for the manager, use docker swarm leave --force), from the cluster, and then create the cluster again with the old nodes.
Your services and the current state of the machines will remain the same.


(Bijujo) #7

I have the same issue. Other than removing and adding worker nodes, is there any other solution?

[root@docker1 ~]# docker node ls
ID                            HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
dqtt8gei6ozueyzjimkmnpaec *   docker1    Ready    Active         Leader
jrbii6h9olcuixvs74m6x81a1     docker3    Down     Active
o7rfieltu7om9sqgpxub7xbff     docker2    Down     Active
[root@docker1 ~]#


(Jonathanlister) #8

One workaround that works for me is to execute ‘docker ps’ after a reboot.
Option 1:
sudo crontab -e
then add a line like this
@reboot docker ps

Option 2: if you use ansible:
# Create a crontab entry like "@reboot docker ps" to help nodes join the swarm
- cron:
    name: "Initialise docker after reboot"
    special_time: reboot
    job: "docker ps"


(Fredrik Söderström) #9

Is there anyone who does not have this problem? It would surprise me if this behaviour differed between machines.

It’s a bit weird that the nodes do not reconnect on their own, especially since $ docker ps seems to fix everything. Why not run that kind of call inside Docker when the docker service starts on the host?

I have:

ID                            HOSTNAME                 STATUS              AVAILABILITY        MANAGER STATUS
2nux77pt5w1uvk8ca47n208od *   docker-swarm-manager-1   Ready               Active              Leader
q9xe45perkz08eama2dwr5qeg     docker-swarm-worker-1    Down                Active
v4cwsa9b7i5dpixr3n6nocslb     docker-swarm-worker-1    Down                Active
wcs1v35x5i7izgt8tpuhps3eu     docker-swarm-worker-2    Ready               Active

I did $ docker swarm leave and $ docker swarm join again on docker-swarm-worker-1; for some reason it’s also duplicated. I’m running Docker version 17.05.0-ce, build 89658be, on CoreOS (alpha channel, to get a new enough Docker).

If I run $ docker ps on the worker, docker node ls immediately shows:

ID                            HOSTNAME                 STATUS              AVAILABILITY        MANAGER STATUS                                                                                    
2nux77pt5w1uvk8ca47n208od *   docker-swarm-manager-1   Ready               Active              Leader                                                                                            
q9xe45perkz08eama2dwr5qeg     docker-swarm-worker-1    Ready               Active               
v4cwsa9b7i5dpixr3n6nocslb     docker-swarm-worker-1    Down                Active               
wcs1v35x5i7izgt8tpuhps3eu     docker-swarm-worker-2    Ready               Active
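One guess as to why that helps (an assumption on my part, I have not verified it on CoreOS): if the host only has docker.socket enabled and not docker.service, the daemon is socket-activated, so it does not start at boot (and re-join the swarm) until the first client call such as docker ps. Enabling the service unit so dockerd starts at boot might make the workaround unnecessary:

sudo systemctl is-enabled docker.socket docker.service   # see what is enabled now
sudo systemctl enable docker.service                     # start dockerd at boot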

(Foxzy) #10

Running these on each node, in order, may fix the problem:

  1. sudo docker-machine start worker1
  2. sudo docker-machine regenerate-certs worker1
  3. sudo docker-machine env worker1

Each node's status should then change from Down to Ready.


(Omereis) #11

docker-machine ssh myvm1 "docker node update myvm2 --availability active"
did the trick for me


(Fierman) #12

I had the same issue with two nodes. One node was the leader, the other was a worker.
After the docker service was restarted, the leader node was down.

I fixed this by promoting the worker node to manager and then, on the new manager, demoting the failed leader node.

ubuntu@staging1:~$ docker node ls
ID                            HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
y0363og32cur9xq9yy0nqg6j9 *   staging1   Down     Active         Reachable
x68yyqtt0rogmabec552634mf     staging2   Ready    Active

ubuntu@staging1:~$ docker node promote staging2

root@staging1:~# docker node ls
ID                            HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
plxkuqqnkxotrzy7nhjj27w34 *   staging1   Down     Active         Leader
x68yyqtt0rogmabec552634mf     staging2   Ready    Active         Reachable

root@staging2:~# docker node demote staging1

root@staging2:~# docker node ls
ID                            HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
plxkuqqnkxotrzy7nhjj27w34     staging1   Down     Active
x68yyqtt0rogmabec552634mf *   staging2   Ready    Active         Leader

root@staging2:~# docker node rm staging1

Get join-token from leader node:
root@staging2:~# docker swarm join-token manager

Reconnect failed node to docker swarm cluster:

root@staging1:~# docker swarm leave --force
root@staging1:~# systemctl stop docker
root@staging1:~# rm -rf /var/lib/docker/swarm/
root@staging1:~# systemctl start docker
root@staging1:~# docker swarm join --token XXXXXXXX 192.168.XX.XX:2377

root@staging1:~# docker node ls
ID                            HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
y0363og32cur9xq9yy0nqg6j9 *   staging1   Ready    Active         Reachable
x68yyqtt0rogmabec552634mf     staging2   Ready    Active         Leader

root@staging1:~# docker node demote staging2

root@staging1:~# docker node ls
ID                            HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
y0363og32cur9xq9yy0nqg6j9 *   staging1   Ready    Active         Leader
x68yyqtt0rogmabec552634mf     staging2   Ready    Active


(Ericwmj) #13

Maybe a firewall or iptables rules are blocking the swarm ports.
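If that is the cause, the ports swarm needs open between all nodes (per the Docker documentation) are 2377/tcp (cluster management), 7946/tcp and 7946/udp (node-to-node communication), and 4789/udp (overlay network traffic). With plain iptables that would be something along these lines:

iptables -A INPUT -p tcp --dport 2377 -j ACCEPT   # cluster management
iptables -A INPUT -p tcp --dport 7946 -j ACCEPT   # node communication (TCP)
iptables -A INPUT -p udp --dport 7946 -j ACCEPT   # node communication (UDP)
iptables -A INPUT -p udp --dport 4789 -j ACCEPT   # overlay network (VXLAN)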


(Deviirmr) #14

Thanks! Works for me.


(Victorbarajas89) #15

Restart the docker daemon.
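On a systemd-based host, for example:

sudo systemctl restart docker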