Docker Community Forums


"Error getting node <id>: node <id> not found" repeated, looking for removed node

OS : CentOS 7, Kernel 3.10.0-862.el7.x86_64
Docker version : 19.03.12 Community, API 1.40

Hello,

I was running a Docker Swarm cluster with 3 managers and 9 workers.

Yesterday, for some reason, the cluster broke. I reinitialized one manager with --force-new-cluster and added the other nodes back to the cluster.

Now there is only one manager (let’s call it m1), and m1 writes log lines like these to /var/log/messages every second:

Nov 27 14:37:47 PYSTOMGT dockerd: time="2020-11-27T14:37:47.972559924+09:00" level=error msg="Error getting node 2rcgf7rzqvojr4yzyx3ko13ut: node 2rcgf7rzqvojr4yzyx3ko13ut not found"
Nov 27 14:37:47 PYSTOMGT dockerd: time="2020-11-27T14:37:47.973323970+09:00" level=error msg="Error getting node i4i2f2pxh3896nnfnpbv5h143: node i4i2f2pxh3896nnfnpbv5h143 not found"
Nov 27 14:37:47 PYSTOMGT dockerd: time="2020-11-27T14:37:47.973882459+09:00" level=error msg="Error getting node xkig9cw27h9x64lqnv743s5g4: node xkig9cw27h9x64lqnv743s5g4 not found"
Nov 27 14:37:49 PYSTOMGT dockerd: time="2020-11-27T14:37:49.090099529+09:00" level=error msg="Error getting node 2rcgf7rzqvojr4yzyx3ko13ut: node 2rcgf7rzqvojr4yzyx3ko13ut not found"
Nov 27 14:37:49 PYSTOMGT dockerd: time="2020-11-27T14:37:49.090776238+09:00" level=error msg="Error getting node i4i2f2pxh3896nnfnpbv5h143: node i4i2f2pxh3896nnfnpbv5h143 not found"
Nov 27 14:37:49 PYSTOMGT dockerd: time="2020-11-27T14:37:49.091393776+09:00" level=error msg="Error getting node xkig9cw27h9x64lqnv743s5g4: node xkig9cw27h9x64lqnv743s5g4 not found"
...

Those three IDs (2rcg…, i4i2…, xkig…) had been assigned to three worker nodes (call them w1, w2, and w3).
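To see exactly which node IDs the manager keeps asking for (and that it is only these three), the repeated log lines can be collapsed into a list of distinct IDs. This is a minimal, read-only sketch that assumes the log format shown in the excerpt above; `missing_node_ids` is a hypothetical helper name, not a Docker command.

```shell
# Sketch: print each distinct node ID that dockerd reports as "not found".
# Assumption: log lines look like the /var/log/messages excerpt above.
# missing_node_ids is a hypothetical helper, not part of the Docker CLI.
missing_node_ids() {
  # grep -o keeps only the matching fragment of each line,
  # sed keeps just the first ID in that fragment,
  # and sort -u removes the repeats.
  grep -o 'Error getting node [a-z0-9]*: node [a-z0-9]* not found' \
    | sed 's/^Error getting node \([a-z0-9]*\):.*$/\1/' \
    | sort -u
}

# Usage on m1 (only reads the log file; does not touch the swarm):
#   missing_node_ids < /var/log/messages
```

If the output lists only 2rcg…, i4i2…, and xkig…, the manager is looking only for the three workers' old identities.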

When I was trying to recover the cluster yesterday, I did the following:

# on m1
docker swarm init --force-new-cluster

# then, on w1
docker swarm leave --force
docker swarm join --token .... m1:2377
# (w1's node ID changed from 2rcg... to a new one)

# same on w2 and w3

There is now no node with ID 2rcg…, but the manager is still looking for it.

I can’t even remove the node by its ID:

# on m1
# docker node rm 2rcgf7rzqvojr4yzyx3ko13ut
Error: No such node: 2rcgf7rzqvojr4yzyx3ko13ut
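To double-check that none of the three IDs from the log belongs to a current swarm member, the IDs can be compared against the output of docker node ls -q, which prints one node ID per line. This is a minimal sketch, assuming that output contains full (untruncated) node IDs; `absent_node_ids` is a hypothetical helper name, and the check is read-only.

```shell
# Sketch: given node IDs as arguments and the current swarm node IDs on
# stdin (one per line), print the arguments that are NOT current members.
# absent_node_ids is a hypothetical helper, not part of the Docker CLI.
absent_node_ids() {
  current=$(cat)
  for id in "$@"; do
    # -F: fixed string, -x: whole-line match, -q: exit status only
    if ! printf '%s\n' "$current" | grep -Fxq -- "$id"; then
      printf '%s\n' "$id"
    fi
  done
}

# Usage on m1:
#   docker node ls -q | absent_node_ids 2rcgf7rzqvojr4yzyx3ko13ut \
#     i4i2f2pxh3896nnfnpbv5h143 xkig9cw27h9x64lqnv743s5g4
```

If all three IDs are printed, the manager is referring to nodes that no longer exist in the node list, which matches the "No such node" error from docker node rm.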

What can I do now? I would like to restart the Docker daemon on the m1 manager, but I’m afraid the cluster would break again. This is a live service environment, so I want to be very careful.

You might say, “First, promote more nodes to managers, then restart dockerd on the manager.”
I tried that yesterday, but the docker node promote m2 m3 command hung, I lost quorum, and I had to reinitialize the cluster again. So I’m planning to reboot every node and initialize the whole cluster from scratch on the next maintenance day, which is months away. For now, I just want to make the m1 manager STOP looking for those 3 nodes.

Thanks.