Microservice connection problems on replicated service

We have multiple Docker workers and a few managers running in Docker Swarm.

After deploying new applications, I have found from time to time that some of our microservices have connection problems and requests are failing. Usually the problem goes away after scaling the service to 0 and back to the original number of replicas.
In the latest case we had 3 replicas of the service running on separate workers. After requests started failing we scaled down to 1 and all the failing requests were gone (according to the monitoring). We then tried to scale up to 2 and the failing requests started again, so in this case scaling did not solve the issue.

I’m wondering what the issue could be, but also how to dig deeper into it; at the moment I cannot find anything in the logs.

One thing we could do is update Docker to the latest version, but before that I would like to know that this really is an issue with Docker Swarm and not something else…

Docker version: 17.09.1-ce, build 19e2cf6
App: Spring Boot Java application

Where do the failing requests originate? From ingress?
From the services amongst each other, or via a self-managed containerized reverse proxy?
Something else?

Thanks, to clarify: we are using Traefik as a reverse proxy, and it is running in a self-managed container.

Also, there are other containers running without problems on the same worker where I suspect the failing replicated microservice is running.

Afaik, Traefik does not have problems with DNS result caching. That is a typical problem people experience with self-created reverse proxy configurations…

In your scenario, the only situation where I would expect communication to fail is the time window in which a replica of a service has become unresponsive but its container has not yet been stopped by Docker; Traefik picks up the stop event from Docker’s event stream and modifies the reverse proxy rules accordingly.

In other words: are the containers of your replicas running stably?

Introducing health checks might help here:

  • after container start, traffic is only routed to the container once the first health check succeeds
  • three failed health checks in a row will result in the container being killed and restarted

Though this will not prevent replicas of a service from becoming unavailable, it will reduce the time it takes to remove an unresponsive replica and replace it with a fresh one.
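For reference, a service-level health check in a stack file could look roughly like this (a minimal sketch; the service name, image and health URL are placeholders, and it assumes curl is available inside the image):

```yaml
version: "3.4"

services:
  my-service:                                        # placeholder service name
    image: registry.example.com/my-service:latest    # placeholder image
    healthcheck:
      # curl -f makes the check fail on HTTP error codes; adjust URL/port to your app
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3          # three consecutive failures mark the container unhealthy
      start_period: 30s   # grace period after start before failures are counted
    deploy:
      replicas: 3
```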

Thank you, really appreciated. I think the issue is not with the proxy, at least based on what I have been digging through… There are health checks configured for the service (all replicas), so that makes me wonder what the issue could be here.
When we had this issue I checked the status of the service and the containers, and Docker showed nothing suspicious. Is it possible to do some sort of Docker “health check” for a container?

Of course. The Dockerfile specs provide the HEALTHCHECK instruction for that purpose.
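For example, such an instruction could look roughly like this (a sketch only; it assumes curl is installed in the image and the application answers on port 8080 — adjust the path and port to whatever endpoint reflects your application’s health):

```dockerfile
# Sketch: mark the container unhealthy when the app stops answering its health endpoint
HEALTHCHECK --interval=30s --timeout=5s --start-period=30s --retries=3 \
  CMD curl -fsS http://localhost:8080/health || exit 1
```

The resulting health state can then be read per container with docker inspect --format '{{ .State.Health.Status }}' <container>.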

This happened again when I deployed new containers; the containers seem to be healthy. I really cannot figure this out, it might be an issue with our application…
Now I caught the error in the application log (Spring Boot):
org.springframework.web.client.ResourceAccessException: I/O error on GET request for “http://address:1234/test/v1”: Connection refused (Connection refused); nested exception is java.net.ConnectException: Connection refused (Connection refused)

Is “address:1234” a valid hostname and port combination inside the container network? Does the hostname part match the service name or a network alias of the target service?

I hope you used a Spring Boot Actuator endpoint to determine the application’s health.
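Exposing it takes little more than having the actuator starter on the classpath; a minimal sketch, assuming a Spring Boot 2.x application with spring-boot-starter-actuator (on 1.x the /health endpoint is available as soon as the starter is added):

```yaml
# application.yml – minimal sketch for exposing the health endpoint over HTTP
management:
  endpoints:
    web:
      exposure:
        include: "health"
```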

Yup, everything is correct. But some of the requests are failing, roughly 1 in 10, and I get these:
“I/O error on GET request for…nested exception is java.net.UnknownHostException”, “I/O error on GET request for…Connection refused (Connection refused)” and “I/O error on GET request…No route to host (Host unreachable)”. Something I noticed: this happens when we have lots of requests, so could this be related to caching/routing? It seems that not all the connections are failing, only some…

By any chance, you are not routing your traffic through a self-maintained reverse proxy, are you?

The problems I encountered in the past were timeouts with long-polling communication patterns, caused by deploy.endpoint_mode: vip (which uses IPVS). Setting deploy.endpoint_mode: dnsrr for the target service helps with that type of timeout. Instead of sending the traffic to the IPVS virtual IP, a multi-value DNS lookup is done and used to communicate directly with the target containers. Though, this might introduce the DNS name caching issue, where the name is resolved once and the resolved IP is used for all follow-up calls.
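For reference, switching the target service to DNS round robin is a small change in the stack file (a sketch; the service name and image are placeholders). Keep in mind that a service in dnsrr mode cannot publish ports in ingress mode:

```yaml
version: "3.4"

services:
  target-service:                                        # placeholder: the service being called
    image: registry.example.com/target-service:latest    # placeholder image
    deploy:
      endpoint_mode: dnsrr   # multi-value DNS lookup instead of the IPVS-backed VIP
      replicas: 3
```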

This still reads like your containers die and respawn. Your problem could be a race condition that occurs when a container dies but is not yet unregistered from the VIP, resulting in new requests still being sent to dead containers.

Another problem can be a general issue with the overlay network (blocked ports, a packet filter other than iptables), which would result in local container-to-container traffic working while host-crossing container-to-container traffic does not. If your containers are spread across multiple hosts using an overlay network and only experience random outages, then a general issue is less likely to be the cause.

Well, a race condition does sound like a good candidate for the issue, like you said:
“This still reads like your containers die and respawn. Your problem could be a race condition that occurs when a container dies but is not yet unregistered from the VIP, resulting in new requests still being sent to dead containers.”

I just don’t understand why the issue “stays on” and won’t settle down even when there are several minutes between the deployment and the issue. Is there a way to control the registration, or how should this be fixed?

I don’t believe this is the issue: “Another problem can be a general issue with the overlay network (blocked ports, a packet filter other than iptables)”
I don’t have any issues when I scale the service up outside the traffic peak. Only when traffic is high do I need to scale one of the services down.

And yes, we have a self-maintained reverse proxy (Traefik) on the network.

I missed that you had already written about Traefik earlier and that I had already responded to it.

You might want to try endpoint mode dnsrr instead of vip to bypass the IPVS virtual IP. At least I would give it a try. If a VIP unregister race condition takes place, we are talking about fractions of a second up to maybe a couple of seconds.

I am out of ideas. I really hope you will be able to solve the issue; please keep us posted on how you solved it.

Thank you so much for the hints/ideas! I will continue to investigate the issue and will keep you/the forum updated if something is found. Perhaps the solution will be to deploy outside of peak hours…

I think Traefik is not the issue, as it only acts as a proxy for the service.

Could there be old/ghost containers causing this “effect”? And shouldn’t removing the Docker stack (and its services) clean up the environment so that there are no old ones at all? Should there be a delay between removing the old stack and deploying the new one?