[Resolved] Linked container regularly becomes unreachable

For the last two weeks I have been hitting a similar issue: sometimes the hostname of one Service becomes unreachable from another Service. Both services are defined in the same Stack:

test-backend-develop:
  autoredeploy: true
  command: pm2 start pm2.json --env=test --no-daemon
  expose:
    - '8082'
  image: 'test/test-backend:develop'
test-frontend-develop:
  autoredeploy: true
  command: sh /etc/nginx/run.sh
  environment:
    - TEST_BACKEND_HOST=test-backend-develop
    - TEST_BACKEND_PORT=8082
    - TEST_FRONTEND_PORT=80
  image: 'test/test-frontend:develop'
  ports:
    - '8888:80'
  roles:
    - global

The log of the frontend container shows the following:

[test-frontend-develop-1]2018-03-06T05:55:24.633547378Z 2018/03/06 05:55:24 [error] 9#9: *12 connect() failed (113: Host is unreachable) while connecting to upstream, client: someip, server: , request: "POST /api/himnark HTTP/1.1", upstream: "http://10.7.0.9:8082/api/himnark", host: "test-frontend-develop.test-develop.blablabla.svc.dockerapp.io:8888", referrer: "http://test-frontend-develop.test-develop.blablabla.svc.dockerapp.io:8888/"

After a redeploy, nginx in the “test-frontend-develop” service can reach the “test-backend-develop” service again for a while, and then the error comes back.
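To tell a name-resolution problem apart from a network-level failure like the "Host is unreachable" above, a small probe can be run from inside the frontend container. The following is only a minimal sketch, assuming Python 3 is available in that container (it may not be in an nginx image); the function names check_upstream and probe_from_env are hypothetical, and the defaults simply mirror the TEST_BACKEND_HOST and TEST_BACKEND_PORT values from the stackfile.

```python
import os
import socket


def check_upstream(host, port, timeout=3.0):
    """Return (ok, detail) for a DNS lookup followed by a TCP connect."""
    try:
        # Resolve first, so a DNS failure is reported separately
        # from a failure to connect to the resolved address.
        addr = socket.gethostbyname(host)
    except socket.gaierror as exc:
        return False, f"DNS lookup for {host} failed: {exc}"
    try:
        # Open and immediately close a TCP connection to the backend port.
        with socket.create_connection((addr, port), timeout=timeout):
            return True, f"connected to {addr}:{port}"
    except OSError as exc:
        # Covers refused connections, timeouts and "Host is unreachable".
        return False, f"connect to {addr}:{port} failed: {exc}"


def probe_from_env():
    """Probe the backend named by the frontend's environment variables."""
    # Assumption: these defaults are copied from the develop stackfile.
    host = os.environ.get("TEST_BACKEND_HOST", "test-backend-develop")
    port = int(os.environ.get("TEST_BACKEND_PORT", "8082"))
    return check_upstream(host, port)
```

Calling probe_from_env() when the error appears and again right after a redeploy would show whether the hostname still resolves and whether the resolved address accepts connections on the backend port.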

I also have a similar Stack for the “master” tag:

test-backend-master:
  autoredeploy: true
  command: pm2 start pm2.json --env=production --no-daemon
  environment:
    - 'DEBUG=*'
  expose:
    - '8081'
  image: 'test/test-backend:master'
  restart: on-failure
test-frontend-master:
  autoredeploy: true
  command: sh /etc/nginx/run.sh
  environment:
    - TEST_BACKEND_HOST=test-backend-master
    - TEST_BACKEND_PORT=8081
    - TEST_FRONTEND_PORT=80
  image: 'test/test-frontend:master'
  ports:
    - '80:80'
  restart: on-failure
  roles:
    - global

But in this case everything works fine. The main difference between the two Stacks is how often they are redeployed: develop can be redeployed several times a day, while master is redeployed once a week.

I can’t figure out the reason for the behavior described above. I have tried deleting the stack and creating a new one, but it didn’t help. Our nodes run on AWS EC2 instances. Swarm mode is disabled (I once enabled it by accident).

After more investigation, we noticed that the application (test-backend-develop) was consuming almost 100% CPU on the system. So we fixed the issue that was causing the 100% CPU usage.
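For anyone who wants to check for the same condition on a node, overall CPU usage can be estimated from two samples of the aggregate "cpu" line in /proc/stat. This is a rough sketch, assuming a Linux node; the function names are hypothetical, and it measures the whole node rather than a single container.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (busy, total) ticks."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    # Columns: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already counted in user, so it is not added again.
    values = [int(v) for v in fields[1:9]]
    values += [0] * (8 - len(values))  # older kernels report fewer columns
    idle = values[3] + values[4]       # idle + iowait
    total = sum(values)
    return total - idle, total


def cpu_busy_percent(before, after):
    """Percentage of non-idle CPU time between two 'cpu' line samples."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (busy1 - busy0) / elapsed


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which aggregates all CPUs."""
    with open(path) as f:
        return f.readline()
```

Taking one read_cpu_line() sample, waiting a few seconds, taking a second one and passing both to cpu_busy_percent() gives the node-wide busy percentage for that interval; a value that stays close to 100 would match what we saw.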