Docker Swarm v18.06 Intermittent issues - Please help

I am running a 3-node Docker cluster in Swarm mode on 3 virtual machines. The Docker version is 18.03.1-ce, build 9ee9f40. Each VM has 8 CPUs and 64 GB RAM.

I am running microservice-based application services inside Docker containers in the cluster.

The services start fine the first time. But after a period of time, say about 6-8 hours, I am no longer able to access the application services from a browser, and service-to-service communication also fails intermittently. The problem is resolved once I refresh the affected service alone.

I am using docker service update --force to refresh the service.

Looking at the syslog messages on the VM, I see the entries below, which I am unable to interpret.

The services in the cluster go down on their own and get restarted. Any help in resolving this would be greatly appreciated.

Sep 3 09:22:46 my-app-vm01 systemd: Starting Network Manager Script Dispatcher Service...
Sep 3 09:22:46 my-app-vm01 dbus[860]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Sep 3 09:22:46 my-app-vm01 systemd: Started Network Manager Script Dispatcher Service.
Sep 3 09:22:46 my-app-vm01 nm-dispatcher: Dispatching action 'up' for vethaeab948
Sep 3 09:22:46 my-app-vm01 kernel: IPVS: Creating netns size=2040 id=8265
Sep 3 09:22:46 my-app-vm01 kernel: docker_gwbridge: port 4(vethaeab948) entered disabled state
Sep 3 09:22:46 my-app-vm01 kernel: br0: port 4(veth7961) entered disabled state
Sep 3 09:22:47 my-app-vm01 NetworkManager[976]: (veth8ecb9dc): failed to disable userspace IPv6LL address handling
Sep 3 09:22:47 my-app-vm01 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth7962: link becomes ready
Sep 3 09:22:47 my-app-vm01 kernel: br0: port 4(veth7962) entered forwarding state
Sep 3 09:22:47 my-app-vm01 kernel: br0: port 4(veth7962) entered forwarding state
Sep 3 09:22:51 my-app-vm01 NetworkManager[976]: (veth1fab53b): failed to disable userspace IPv6LL address handling
Sep 3 09:22:51 my-app-vm01 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethaeab948: link becomes ready
Sep 3 09:22:51 my-app-vm01 kernel: docker_gwbridge: port 4(vethaeab948) entered forwarding state
Sep 3 09:22:51 my-app-vm01 kernel: docker_gwbridge: port 4(vethaeab948) entered forwarding state
Sep 3 09:22:51 my-app-vm01 NetworkManager[976]: (vethaeab948): link connected
Sep 3 09:22:51 my-app-vm01 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth7961: link becomes ready
Sep 3 09:22:51 my-app-vm01 kernel: br0: port 4(veth7961) entered forwarding state
Sep 3 09:22:51 my-app-vm01 kernel: br0: port 4(veth7961) entered forwarding state
Sep 3 09:22:51 my-app-vm01 NetworkManager[976]: (vethd475a86): failed to disable userspace IPv6LL address handling
Sep 3 09:23:02 my-app-vm01 kernel: br0: port 4(veth7962) entered forwarding state
Sep 3 09:23:06 my-app-vm01 kernel: docker_gwbridge: port 4(vethaeab948) entered forwarding state
Sep 3 09:23:06 my-app-vm01 kernel: br0: port 4(veth7961) entered forwarding state
Sep 3 09:23:14 my-app-vm01 kernel: IPVS: __ip_vs_del_service: enter
Sep 3 09:23:15 my-app-vm01 kernel: br0: port 4(veth7962) entered disabled state
Sep 3 09:23:15 my-app-vm01 NetworkManager[976]: (veth8ecb9dc): failed to find device 49130 'veth8ecb9dc' with udev
Sep 3 09:23:15 my-app-vm01 NetworkManager[976]: (veth8ecb9dc): new Veth device (carrier: OFF, driver: 'veth', ifindex: 49130)
Sep 3 09:23:15 my-app-vm01 kernel: br0: port 4(veth7962) entered disabled state
Sep 3 09:23:15 my-app-vm01 kernel: device veth7962 left promiscuous mode
Sep 3 09:23:15 my-app-vm01 kernel: br0: port 4(veth7962) entered disabled state
Sep 3 09:23:15 my-app-vm01 kernel: IPVS: __ip_vs_del_service: enter
Sep 3 09:23:15 my-app-vm01 NetworkManager[976]: (veth8ecb9dc): failed to disable userspace IPv6LL address handling
Sep 3 09:23:15 my-app-vm01 kernel: IPVS: __ip_vs_del_service: enter
Sep 3 09:23:15 my-app-vm01 kernel: IPVS: __ip_vs_del_service: enter
Sep 3 09:23:15 my-app-vm01 kernel: IPVS: __ip_vs_del_service: enter
Sep 3 09:23:16 my-app-vm01 kernel: IPVS: __ip_vs_del_service: enter

Hi, I see similar behavior with some buggy services that try to grab all the memory on the host.

Do you have monitoring on the VMs? Can you check that you are not running out of resources? Also, can you check the maximum number of open files on the Docker hosts?

Regards

Thanks for the response! Yes, we do have monitoring for the VMs. We initially had this issue on a VM with 8 CPUs. We assumed that increasing the number of CPUs would help, so we increased it to 16. The issue went away for some time but came back after a couple of days.

Regarding checking the max open files on the Docker hosts - I am not sure how to check this. Can you please help?

If you are not seeing some kind of "Too many open files (24)" error in the logs, it is probably not that.
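In case it helps, here is one way to check the open-file limits on a Linux Docker host. This is just a sketch; the paths under /proc are standard Linux, and the dockerd process name is the usual one for the Docker daemon:

```shell
# System-wide limit on open file handles:
cat /proc/sys/fs/file-max
# Handles currently in use (allocated, unused, max):
cat /proc/sys/fs/file-nr
# Per-process limit for the Docker daemon, if it is running:
if DOCKER_PID=$(pidof dockerd 2>/dev/null); then
    grep "open files" "/proc/$DOCKER_PID/limits"
fi
# Soft limit for the current shell:
ulimit -n
```

If the "allocated" number from file-nr keeps creeping toward file-max, or the daemon's "Max open files" limit is low, that would point at descriptor exhaustion.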

Maybe you can limit the amount of resources each service can take and try to figure out which one is causing the issue.
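As an illustration, per-service limits can be set in a version 3 stack file under deploy.resources; the service name, image, and numbers below are placeholders to adapt:

```yaml
version: "3.4"
services:
  my-service:                      # placeholder service name
    image: my-org/my-service:latest  # placeholder image
    deploy:
      resources:
        limits:
          cpus: "1.0"    # at most one CPU per task
          memory: 2G     # task is killed if it exceeds this
        reservations:
          memory: 512M   # scheduler reserves this much per task
```

The same caps can also be applied to an already-running service with docker service update --limit-cpu 1 --limit-memory 2G my-service, which avoids redeploying the whole stack.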