Context/configuration:
I have two nodes: one manager/leader (A) and one worker (B). The backend apps are deployed on A and the IIS/web server on B. All services are attached to the same persistent overlay network, and only the web server service publishes a port (80). The two stacks are deployed from separate compose files. Both nodes run Windows Server 2022 with the same Docker CE version (27.0.3, build 7d4bcd8), are on the same network, and are both Xen VMs.
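For reference, a stripped-down sketch of how the two stacks are wired together (service names, image tags, and the network name are illustrative placeholders, not my actual files; placement constraints omitted):

```yaml
# Stack 1 (backend, on node A) - creates the shared overlay network
services:
  backend:
    image: my-backend:latest        # placeholder
    networks: [appnet]
networks:
  appnet:
    driver: overlay

---
# Stack 2 (web, on node B) - joins that network and publishes port 80
services:
  webserver:
    image: my-iis-site:latest       # placeholder
    ports:
      - "80:80"                     # published via the ingress routing mesh
    networks: [appnet]
networks:
  appnet:
    external: true                  # the network created by stack 1
```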
Issue:
When I update/recreate the service running the site (the one with the published port), the update completes and the new task is running, but the service is no longer reachable (i.e. http://ip_of_A_or_B:80 from another machine on the network). The specific command is:
"docker service update stack_service --force"
My workaround (not feasible for a prod environment):
Currently, the only way to restore reachability is to restart the Docker engine service on both nodes. If I had to guess, the routing mesh / ingress load balancing is not properly dropping the old container's entry or registering the new one, so traffic sent to either node's IP on the published port is still routed to the stopped container instead of the new one.
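For completeness, this is the restart workaround as I run it on each node (PowerShell; assumes the engine runs as the Windows service named "docker", and the service name in the last step is a placeholder):

```powershell
# Run on both node A and node B: bounce the engine, which forces the
# HNS load-balancer/ingress policies to be rebuilt from scratch.
Restart-Service docker

# Afterwards, confirm the web task is running and the published port answers:
docker service ps mystack_webserver     # placeholder service name
Invoke-WebRequest http://ip_of_A_or_B -UseBasicParsing
```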
Logs:
Looking at the daemon/panic logs on both nodes, I see some strangeness, but I am unsure how to remediate it and can't track down similar issues. Notable entries after the service update (hashes/IDs removed and replaced with "trunc" for readability):
Manager/leader:
level=error msg="Failed to find policies for service stack_svc (10.0.1.90)"
level=warning msg="rmServiceBinding trunc possible transient state ok:false entries:0 set:false "
level=error msg="Failed to add ELB policy for service stack_svc (ip:10.0.0.9 target port:80 published port:80) with endpoints [{trunc}] using load balancer IP 10.0.0.4 on network ingress: failed during hnsCallRawResponse: hnsCall failed in Win32: The specified port already exists. (0x803b0013)"
Worker:
level=warning msg="rmServiceBinding trunc possible transient state ok:false entries:0 set:false "
level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint trunc], retrying..."
level=warning msg="deleteServiceInfoFromCluster NetworkDB DeleteEntry failed for trunc err:cannot delete entry endpoint_table with network id trunc and key trunc does not exist or is already being deleted"
level=warning msg="rmServiceBinding deleteServiceInfoFromCluster stack_svc trunc aborted c.serviceBindings[skey] !ok"
level=error msg="Failed to add ELB policy for service stack_svc (ip:10.0.0.9 target port:80 published port:80) with endpoints [{trunc}] using load balancer IP 10.0.0.5 on network ingress: failed during hnsCallRawResponse: hnsCall failed in Win32: The specified port already exists. (0x803b0013)"
Also worth mentioning: even when everything is running okay, I get constant entries showing a failure to reach the network adapter's DNS servers:
level=warning msg="[resolver] connect failed" error="dial udp 'ip_of_DNS': connect: A socket operation was attempted to an unreachable network."
Strangely enough, all services/containers are still able to resolve and reach external resources, such as data shares and DB servers.
Anyone have any ideas? Any help is greatly appreciated.