Docker Community Forums

Docker swarm join fails for removed and rejoined node

I have a swarm made up of three manager nodes.
Node 3 recently got into a bad state and was removed and rejoined, after the contents of /var/lib/docker were removed to get the Docker daemon starting again.
Rejoining seemed to work, but a few days later node 3 dropped out of the cluster: docker node ls showed it as reachable but down.
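
For anyone wanting to look at the same "reachable but down" combination in more detail, here is a rough sketch of two checks that can be run on a healthy manager. The function names and the node name node3 are placeholders of mine, not anything from the original setup, and the first function assumes the node is a manager (workers have no manager status).

```shell
# Sketch only: function names and "node3" are placeholders.
# Both functions are meant to be run on a healthy manager.

# Print node state, availability, and manager reachability for one node.
# Assumes the node is a manager; workers have no ManagerStatus.
node_status() {
  docker node inspect "$1" \
    --format '{{.Status.State}} {{.Spec.Availability}} {{.ManagerStatus.Reachability}}'
}

# List the manager peers (node ID and address) this daemon knows about.
known_managers() {
  docker info \
    --format '{{range .Swarm.RemoteManagers}}{{.NodeID}} {{.Addr}}{{println}}{{end}}'
}

# Example usage:
#   node_status node3
#   known_managers
```

Comparing the output of known_managers on each manager may help show whether a removed node is still being remembered somewhere.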

Repeating the same remove-and-rejoin trick no longer works: node 3 now goes into a pending state.
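
For reference, the remove-and-rejoin sequence described above looks roughly like the sketch below. This is my reading of the usual procedure, not necessarily the exact commands that were run here. The function names, the node name, the token, and the manager address are all placeholders.

```shell
# Sketch only: function names, node name, token, and address are placeholders.

# Run on a healthy manager: drop the entry for the broken node.
# A manager has to be demoted to a worker before it can be removed.
remove_stale_manager() {
  node="$1"
  docker node demote "$node"
  docker node rm "$node"
}

# Run on a healthy manager: print the join command (with token) for managers.
manager_join_command() {
  docker swarm join-token manager
}

# Run on the broken node: discard any old swarm membership, then join again.
rejoin_as_manager() {
  token="$1"
  manager_addr="$2"   # address of a healthy manager, e.g. <ip>:2377
  # Leaving fails if the node is not currently in a swarm; ignore that case.
  docker swarm leave --force || true
  docker swarm join --token "$token" "$manager_addr"
}

# Example usage:
#   remove_stale_manager node3          (on a healthy manager)
#   manager_join_command                (on a healthy manager)
#   rejoin_as_manager <token> <ip>:2377 (on node 3)
```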

Looking in /var/lib/docker/swarm on the other two nodes shows an entry for node 3 in state.json even when the node has been removed from the swarm.

Does state.json have any actual function? In other words, would editing out the deleted node have an effect on the swarm?
I assume that the state of the swarm is maintained in /var/lib/docker/swarm/raft…, which is not human readable. Is there a tool or method to consistency check this state?

The swarm supports a lot of live production containers and I am reluctant to destroy the swarm and start again.

After some experimenting we gave up and rebuilt the whole swarm, this time using 3 manager nodes and 4 workers. We altered the compose files to limit deployment to worker nodes, so the managers are more or less empty except for Portainer.
The application is built around RabbitMQ, so as long as each queue is serviced by a working container on any node or swarm, the application keeps working. Switching swarms was therefore painless.
The services are getting quite large, and a deployment can easily fully load the CPU on two nodes. Our thinking is that during deployment the raft database gets out of sync due to delayed or missed writes, and it is a downward spiral from there.
I can see why it is encouraged to keep workload containers off the manager nodes. Lesson learned.
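
In case it helps anyone else, limiting a service to worker nodes in a compose file can be done with a placement constraint, roughly as in the sketch below. The file name, service name, image, and stack name are made up for illustration and are not our real ones.

```shell
# Sketch only: file name, service name, and image are hypothetical.
# Writes a minimal stack file whose service is restricted to worker nodes.
cat > worker-only-stack.yml <<'EOF'
version: "3.8"
services:
  app:
    image: example/app:latest
    deploy:
      placement:
        constraints:
          - node.role == worker
EOF

# Deploy from a manager (stack name is a placeholder):
#   docker stack deploy -c worker-only-stack.yml mystack
```

Services without such a constraint, Portainer in our case, can still be scheduled on the managers.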