Docker swarm join fails for removed and rejoined node

andrew007hooper · August 19, 2019, 11:54am

I have a swarm made up of three manager nodes.
Node 3 recently got into a bad state, so it was removed and rejoined after I cleared the contents of /var/lib/docker to get the Docker daemon to start.
Rejoining seemed to work, but a few days later node 3 dropped out of the cluster: docker node ls showed it as reachable but down.
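
For anyone wanting to spot the same mismatch, here is a minimal sketch that filters the node list for nodes reported as Down while the manager status still says Reachable. It assumes it is run on a healthy manager; the helper name flag_inconsistent is made up for illustration.

```shell
# Reads lines of "HOSTNAME STATUS MANAGERSTATUS" on stdin and prints the
# hostname of every node that is Down while still listed as Reachable.
flag_inconsistent() {
    awk 'tolower($2) == "down" && tolower($3) == "reachable" { print $1 }'
}

# Usage on a manager (needs a running swarm, so it is left commented out):
# docker node ls --format '{{.Hostname}} {{.Status}} {{.ManagerStatus}}' | flag_inconsistent
```

Worker nodes have an empty manager status, so they never match the filter.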

Doing the same rejoin trick no longer works and node 3 goes into a pending state.
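
For context, a remove-and-rejoin cycle usually looks something like the sketch below. This is not necessarily the exact sequence used here: the hostname node3, the token and the manager address are placeholders, and the run helper is hypothetical. It only prints the commands unless APPLY=1 is set, which seems prudent on a production swarm.

```shell
# Dry run by default: print each command instead of executing it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

NODE="${NODE:-node3}"   # placeholder hostname of the broken manager

# Run on a healthy manager: drop the stale entry and fetch a fresh join token.
remove_node() {
    run docker node demote "$NODE"
    run docker node rm --force "$NODE"
    run docker swarm join-token manager
}

# Run on the broken node: $1 is the join token, $2 is MANAGER_IP:2377.
rejoin_node() {
    run docker swarm leave --force
    run docker swarm join --token "$1" "$2"
}
```

Calling remove_node on a manager and then rejoin_node with the printed token on node 3 shows the commands first; setting APPLY=1 executes them.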

Looking in /var/lib/docker/swarm on the other two nodes shows an entry for node 3 in state.json even when the node has been removed from the swarm.

Questions:
Does state.json have any actual function, so that editing out the deleted node would have an effect on the swarm?
I assume that the state of the swarm is maintained in /var/lib/docker/swarm/raft…, which is not human readable. Is there a tool or method to consistency-check this state file?

The swarm supports a lot of live production containers and I am reluctant to destroy the swarm and start again.

andrew007hooper · August 23, 2019, 10:03am

Update
After some experimenting we gave up and rebuilt the whole swarm, this time using 3 manager nodes and 4 workers. We altered the compose files to limit deployment to worker nodes, so the managers are more or less empty except for Portainer.
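
The worker-only restriction can be expressed as a placement constraint. A minimal sketch, assuming a compose v3 stack file: the function name worker_only_override, the service name and the file names are placeholders, not taken from the actual setup.

```shell
# Print a compose override that pins one service to worker nodes.
# $1 is the (placeholder) service name as it appears in the stack file.
worker_only_override() {
    cat <<EOF
version: "3.7"
services:
  $1:
    deploy:
      placement:
        constraints:
          - node.role == worker
EOF
}

# Usage (needs a swarm, so it is left commented out):
# worker_only_override app > worker-only.yml
# docker stack deploy -c docker-compose.yml -c worker-only.yml mystack
```

For a service that is already running, docker service update --constraint-add 'node.role==worker' SERVICE applies the same constraint without touching the compose file.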
The application is built to use RabbitMQ, so as long as each queue is serviced by a working container on any node or swarm, the application keeps working. So switching swarms was painless.
The services are getting quite large, and a deployment can easily fully load the CPU on two nodes. Our thinking is that during deployment the Raft database gets out of sync due to delayed or missed writes, and it is a downward spiral from there.
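
The arithmetic behind that suspicion, as a rough sketch: Raft needs a majority of managers to acknowledge each write, so a three-manager swarm can only tolerate one stalled manager. Whether CPU starvation was really the cause is still a guess. The function names are made up for illustration.

```shell
# Number of managers that must agree on a write (a strict majority).
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of managers that can be lost or stalled before quorum is lost.
fault_tolerance() {
    echo $(( ($1 - 1) / 2 ))
}
```

With 3 managers, quorum 3 gives 2 and fault_tolerance 3 gives 1, so two overloaded managers at the same time are already one too many.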
I can see why it is encouraged to keep workload containers off the manager nodes. Lesson learned.
