I’m using docker swarm and docker service to manage my app.
Recently I updated a service with 2 tasks, and about half an hour later there was a 50% chance that the service was inaccessible through its service name and VIP.
I ran nslookup tasks.myservice and found 4 IPs listed: 2 belong to the currently running containers, and the other 2 belong to the stopped containers.
I ran the same command, nslookup tasks.myservice, in the two running containers, and the result was slightly different: only 3 IPs were listed. The IP of the stopped container on the same host was not in the list.
It seems that when a task stopped, its IP was removed on the local host, but that removal failed to synchronize to the rest of the swarm cluster.
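To make the comparison above repeatable, here is a minimal Python sketch of the check I did by hand. It is only an illustration: the function names `resolve_task_ips` and `find_stale_ips` are my own, the set of running IPs has to be supplied manually (for example, copied from `docker inspect` output), and the lookup has to run inside a container attached to the service's overlay network so that tasks.myservice resolves.

```python
import socket


def resolve_task_ips(service_name):
    """Return the set of IPv4 addresses DNS reports for tasks.<service_name>."""
    # gethostbyname_ex returns a tuple (hostname, alias_list, ip_address_list).
    _, _, ips = socket.gethostbyname_ex(f"tasks.{service_name}")
    return set(ips)


def find_stale_ips(resolved_ips, running_ips):
    """Return the addresses DNS still lists that belong to no running task."""
    return sorted(set(resolved_ips) - set(running_ips))
```

For example, `find_stale_ips(resolve_task_ips("myservice"), running)` with `running` set to the two live container IPs should return an empty list on a healthy cluster; any address it does return is a leftover entry like the ones described above.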
I also found a message like this in the dockerd log:
msg="rmServiceBinding d0283c4f3f93e348e91b5239d30d1af0921e0f606ced2f057e85d8997ec7e8c9 possible transient state ok:false entries:0 set:false "
My temporary fix is to make the node leave the swarm cluster and join it again. Are there any suggestions for investigating the root cause or avoiding this issue? Thanks in advance.
My system info
uname -a
Linux dongni-nginx2 5.10.134-15.an8.x86_64 #1 SMP Thu Jul 20 00:35:47 CST 2023 x86_64 x86_64 x86_64 GNU/Linux
docker info
Client: Docker Engine - Community
Version: 24.0.7
Context: default