[Resolved] Linked container regularly becomes unreachable

Hi. I’m having an issue with a linked container becoming unreachable from another container. I don’t know exactly at what layer the problem lies, but here is a description of my system:

I’m using Docker Cloud with a single AWS node. I have a stack with a Postgres container (postgres:latest) and my own container for a simple Play web app, which is linked to the Postgres container (links: - database). Everything works fine for a while.
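For reference, the stack file for this setup looks roughly like the following (a sketch; only the postgres image and the link are from my actual file — the web image name and port mapping are placeholders):

```yaml
# Hypothetical Docker Cloud stack file for the setup described above.
database:
  image: postgres:latest
web:
  image: myuser/play-app:latest   # placeholder; my real image name differs
  links:
    - database                    # the link alias the Play app connects to
  ports:
    - "80:9000"                   # assumed Play default port mapping
```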

For the last five days, I’ve been getting a notification every day around 07:00 UTC that the Play container stopped with exit code 255. The Play app crashes with “Cannot connect to database [default]”, “Caused by: java.net.ConnectException: Connection timed out”. If I connect to the Play container and try to ping “database”, I get:

```
PING database.4d087578-b7f7-440e-85dc-315e6882ff2a.local.dockerapp.io (10.7.0.1): 56 data bytes
^C
--- database.4d087578-b7f7-440e-85dc-315e6882ff2a.local.dockerapp.io ping statistics ---
6 packets transmitted, 0 packets received, 100% packet loss
```
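To separate name resolution from plain reachability, it helps to check how the hostname resolves via the system resolver (the same path the JDBC driver takes) versus pinging the raw IP. A minimal sketch, run from inside the web container (“database” is the link alias from my stack):

```shell
#!/bin/sh
# Check whether a name resolves through the libc resolver (nsswitch/DNS),
# the same path a JDBC connection string takes. "localhost" is just a
# known-good control name.
check_resolution() {
  if getent hosts "$1" > /dev/null 2>&1; then
    echo "$1 resolves"
  else
    echo "$1 does not resolve"
  fi
}
check_resolution localhost    # control: should always resolve
# check_resolution database   # the link alias; fails during the outage
# If the name fails but `ping 10.7.0.1` (the resolved IP) works, it's DNS;
# if both name and IP fail, it's the network layer (weave) itself.
```

This is how I distinguished “DNS issue” from “IP issue” in the updates below.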

I redeploy the database container every time this happens, and after that it becomes reachable again. Next time this happens I will take a look at the logs of the Postgres container before I redeploy, but to me it seems to be an issue with the whole container, not with Postgres itself.

Any pointers would be useful, as I’m new to Docker.

Regards,
Christian Z.

Update:

There is nothing in the Postgres log (except for the previous restart). Also, I discovered that the database container is reachable under its IP address, but not under its hostname (from within the web app container). Why is the hostname not usable?

EDIT: This information might be wrong: sometimes the system suddenly starts working again, so I may have mistaken it for a DNS issue. From the later posts it looks more like an IP issue.

Update 2: I terminated my cluster (with one node) and started my stack in a new one. This has kept the issue away for at least one day now.

Still having the same issue with linked containers, even after it was OK for a few days.

I have the exact same problem. Similar setup with only one node, except my database is MongoDB instead.
If you run docker logs weave on your node, do you also see log entries like this around the time of the problem?

[allocator XX:XX:XX:XX:XX:XX] Ignored address 10.7.0.8 claimed by XXX - not in our universe

I have no idea if it’s really related, but 10.7.0.8 is the IP of the container that has trouble connecting to my linked MongoDB container…
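For anyone wanting to check their own node: what I do boils down to grepping the weave container’s log for that allocator message. A sketch (the log line below is a sample modelled on the one above; on a real node you would pipe `docker logs weave` into the filter):

```shell
#!/bin/sh
# Filter for the weave allocator conflict message. On a node this would be:
#   docker logs weave 2>&1 | grep "not in our universe"
# Here the filter is demonstrated on a sample line so the sketch is self-contained.
has_allocator_conflict() {
  grep -q "not in our universe"
}
sample='[allocator 7a:14:08:aa:bb:cc] Ignored address 10.7.0.8 claimed by peer - not in our universe'
if echo "$sample" | has_allocator_conflict; then
  echo "allocator conflict found"
fi
```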

Hi kimfiedler,

Yeah, I have exactly the same entry in the log! The IP belongs to the web app container that can’t access the DB. The times when this line repeats seem to match up exactly with the times when the DB container is unreachable.

Contrary to one of my earlier messages, the DB container was not reachable even by its IP address, so it might not be a DNS issue after all.

I hope that we can get this fixed, as this is a showstopper for me using Docker Cloud.

P.S. Found this, which is an old issue that might be related: https://github.com/tutumcloud/weave-daemon/issues/34


I seem to be experiencing the same issue here too; some containers demonstrate it more than others. I also get the same log entries around the times the instance’s connectivity drops.

Thanks for sharing!

When the problem comes back, if I issue docker restart weave or docker restart weave-xxxxx.xxxxxxxxx, I can temporarily ping the linked machine until weave auto-restarts; once it restarts, the problem returns. A full stack redeploy usually helps.
Can a Docker dev or an experienced user please give us some pointers, or let us know where we should file a bug? Thanks

Did anyone find a solution? I’m going to have to stop using Docker Cloud because of this issue :frowning:

Nope - I’d love to find out what’s happening, whether it’s an issue with our setup or with weave. I think it’s memory-usage related: I’ve noticed that some of our intensive cron jobs that run over the weekend can (seemingly) make the container unreachable from the parent linked container, and it only comes back when we do a redeploy.

Yeah, it could be memory related. Thanks for the hint. I just checked my machine and it is pretty low on memory. So maybe the real issue is that there is no error message when weave (or whatever it is) runs out of memory.
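If memory pressure is the suspect, a quick triage on the node looks something like this (generic Linux checks, not a confirmed diagnosis of the weave issue; the kernel log path varies by distro):

```shell
#!/bin/sh
# Rough memory triage on a node. None of this proves weave is the culprit;
# it only shows whether the node is under memory pressure.
free -m                                                              # overall node memory
grep -i 'oom\|killed process' /var/log/kern.log 2>/dev/null || true  # OOM-killer traces, if readable
command -v docker > /dev/null && docker stats --no-stream || true    # per-container usage, if docker is present
```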

Experiencing this issue w/ our Go services at the moment.

@revett

Have you checked your memory usage? I ended up moving away from Docker Cloud and just running Docker manually on one node (and it works fine now). I’m guessing that with Docker Cloud one should provision a bit more memory than for a bare setup.

@zommerfelds memory is fine on all nodes

Same problem here. Since today, all my container links break randomly, but all at the same time.

I have 4 web containers linked to 4 MySQL containers (one for each) and a Redis container. When the problem happens, all my apps log errors because they can’t reach their database or their cache store, and I can’t ping any linked container by its name (but I can by its IP).

The first time, the issue went away by itself after about an hour of downtime.
As I write this post, it’s down again…

I can’t find what triggers the situation… I don’t see anything unusual in the server monitoring when the links stop working.

I think this topic is related: [RESOLVED] Dockerapp.io DNS Down?

The latest Docker Cloud release is now available with support for Docker Engine 1.11.2-cs5, which introduces service discovery and DNS improvements, along with more reliable networking between containers.

For more information on this release and how to upgrade nodes to Docker Engine 1.11.2-cs5, check out: Docker Cloud Release Notes (09/27/2016)

Still happening on:

```
$ docker version
Client:
 Version:       17.09.0-ce
 API version:   1.32
 Go version:    go1.8.3
 Git commit:    afdb6d4
 Built:         Tue Sep 26 22:40:46 2017
 OS/Arch:       linux/amd64

Server:
 Version:       17.09.0-ce
 API version:   1.32 (minimum version 1.12)
 Go version:    go1.8.3
 Git commit:    afdb6d4
 Built:         Tue Sep 26 22:39:27 2017
 OS/Arch:       linux/amd64
 Experimental:  false
```

I’ve got exactly the same issue. It seems to happen when I use the network too heavily; it temporarily loses the link via the service name.

I managed to work around the issue by pinging the db, copying the IP, and adding it to my /etc/hosts… it seems to be a bug.
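Spelled out, the workaround is just pinning the service’s current IP next to its name inside the web container. A sketch, with the file path parameterized so it can be tried safely; the IP 10.7.0.8 and the name “database” are examples from this thread, and the pin breaks as soon as the IP changes on redeploy:

```shell
#!/bin/sh
# Fragile workaround: pin the linked service's current IP to its name.
# HOSTS_FILE defaults to /etc/hosts inside the web container.
HOSTS_FILE="${HOSTS_FILE:-/etc/hosts}"
pin_host() {
  ip="$1"; name="$2"
  # append only if the name isn't pinned already
  grep -q "[[:space:]]$name\$" "$HOSTS_FILE" 2>/dev/null || echo "$ip $name" >> "$HOSTS_FILE"
}
# pin_host 10.7.0.8 database
```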


Okay, but that doesn’t sound good, as the internal IP is auto-generated by Swarm.