Intermittent "host unreachable" between 2 containers (only!)

Hi,

I’ve been using Docker on my small computers for a long time now. In particular, I’ve been running the LinuxServer team’s Swag container as a reverse proxy in front of many other containers, including Portainer, for several years, without much trouble. However, for the past few weeks or months, one strange thing has been happening. For context, even though I have other computers around here also running Docker containers, everything below happens on a single Raspberry Pi 4 running a recently updated Raspberry Pi OS.

I access the Portainer container in one of two ways. One way is through the LAN IP of the Raspberry Pi, connecting to the port exposed by Docker for that purpose. This works flawlessly and Portainer behaves as usual. The other way is through the reverse proxy, which, even inside the local network, is now behaving erratically (again, this is a fairly recent development). What happens is that I can use Portainer somewhat, but every few page loads I’m redirected to the login page, as if I had just logged myself out. Looking at the nginx logs, I find “host unreachable” errors when this happens:

2024/11/22 12:29:31 [error] 933#933: *19703 connect() failed (113: Host is unreachable) while connecting to upstream, client: EDITED, server: home.*, request: "GET /portainer/api/users/me HTTP/2.0", upstream: "http://172.18.0.15:9000/api/users/me", host: "EDITED", referrer: "https://EDITED/portainer/"

To confirm the networking nature of the issue, I ran a shell inside the nginx container:

root@aa1dd0d5edeb:/# ping 172.18.0.15
PING 172.18.0.15 (172.18.0.15): 56 data bytes
64 bytes from 172.18.0.15: seq=1 ttl=64 time=0.190 ms
64 bytes from 172.18.0.15: seq=3 ttl=64 time=0.308 ms
64 bytes from 172.18.0.15: seq=5 ttl=64 time=0.344 ms
64 bytes from 172.18.0.15: seq=9 ttl=64 time=0.183 ms
64 bytes from 172.18.0.15: seq=10 ttl=64 time=0.184 ms
64 bytes from 172.18.0.15: seq=11 ttl=64 time=0.171 ms
64 bytes from 172.18.0.15: seq=13 ttl=64 time=0.300 ms
64 bytes from 172.18.0.15: seq=15 ttl=64 time=0.294 ms
64 bytes from 172.18.0.15: seq=19 ttl=64 time=0.164 ms
64 bytes from 172.18.0.15: seq=21 ttl=64 time=0.172 ms
64 bytes from 172.18.0.15: seq=23 ttl=64 time=0.177 ms

I’m no ping expert, but the gaps in the sequence numbers indicate that some ping packets are received as expected while many others are lost.
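(As a side note, running ping with a fixed count prints a loss summary at the end, which is an easy way to quantify this. The output below is illustrative only, and its exact wording depends on the ping implementation:)

root@aa1dd0d5edeb:/# ping -c 20 172.18.0.15
[...]
20 packets transmitted, 11 packets received, 45% packet loss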

I also ran wget against the Portainer URL that had last failed, to confirm that the problem really is “host unreachable”:

root@aa1dd0d5edeb:/# wget http://172.18.0.15:9000/api/users/me
Connecting to 172.18.0.15:9000 (172.18.0.15:9000)
wget: server returned error: HTTP/1.1 401 Unauthorized
root@aa1dd0d5edeb:/# wget http://172.18.0.15:9000/api/users/me
Connecting to 172.18.0.15:9000 (172.18.0.15:9000)
wget: server returned error: HTTP/1.1 401 Unauthorized
root@aa1dd0d5edeb:/# wget http://172.18.0.15:9000/api/users/me
Connecting to 172.18.0.15:9000 (172.18.0.15:9000)
wget: can't connect to remote host (172.18.0.15): Host is unreachable

The first two attempts worked (of course the 401 Unauthorized reply is expected here; interestingly, the second call was a bit slower to respond), and the third one failed. This pattern can be reproduced at will, with some randomness to it.

Also interestingly, to the best of my knowledge, NONE of the other containers I run show such issues; things are working well for every other service I run (unless I’m just not seeing it yet…). The issue also survives reboots, so it’s a permanent situation. Then again, I don’t think it’s a Portainer issue, since accessing the container without the reverse proxy works fine, as mentioned.

What could be happening here? I’m primarily looking for pointers on how to investigate this further, as I feel stuck. My networking and Docker knowledge is fairly basic - I guess I know just enough to be frustrated by the thought that this doesn’t make any sense :slight_smile:

Thanks in advance to anyone who could help.
Pierric.

Doesn’t swag use fail2ban (or something similar) under the hood, which decides whether to block IPs based on application logs? If your reverse proxy does not preserve the client’s source IP, the proxy’s own IP would appear as the source IP in the application logs, which could lead to fail2ban blocking the reverse proxy’s IP.

Thanks @meyay !
It does use fail2ban. However, wouldn’t fail2ban blocking the request result in a connection refused or a connection time-out, rather than host unreachable? Also, I think that pinging the portainer container from the nginx+fail2ban container would in principle not involve fail2ban at all (or nginx, for that matter); it should just be ping -> kernel (swag container) -> network -> kernel (portainer container)?
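(For reference, and assuming plain Linux iptables semantics rather than fail2ban’s actual configuration, the firewall verdict does determine which error the client sees. The source IP below is just a placeholder for the proxy’s address:)

iptables -A INPUT -s 172.18.0.14 -j DROP
# client sees a connection time-out
iptables -A INPUT -s 172.18.0.14 -j REJECT --reject-with icmp-port-unreachable
# client sees "Connection refused" (fail2ban's usual default)
iptables -A INPUT -s 172.18.0.14 -j REJECT --reject-with icmp-host-unreachable
# client sees "Host is unreachable", like the nginx error above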

It was more like an uneducated guess. I have no idea how fail2ban works under the hood. I only remembered that it applies iptables rules based on application logs.

You are probably right: if the container had been banned, no ping would succeed. Though the wget commands that follow the ping could potentially trigger a ban, depending on which logs are used to decide whether a ban should happen.

We have another topic where the user experiences connection problems with docker-ce on OMV. By any chance, do you run the same distro on your Pi?

Hi @meyay

No, I run Raspberry Pi OS (formerly “Raspbian”), the 64-bit “lite” version, and I upgraded everything shortly before posting, so it’s current.

So the only common denominator is that both use an OS based on Debian.

I hope someone else has an idea of what’s going on and what causes the problem.


Just because I noticed the similarity as well (that’s why I came here, remembering this topic), I’m sharing the other one

It is similar in that it is about the network connection between containers, and a proxy is involved in both cases, but the error messages are different. Since I don’t have a very good idea either, I’m sharing the following issue only because you use Portainer

I know it is about MAC addresses, but I wonder how you created the containers. Did you also edit or even create the containers from Portainer?

I created the linked forum post, and I think this is exactly what’s happening! I also use Portainer, and going back to my network inspect output from when I was having the issue, I do see duplicate MAC addresses for some containers. Thank you so much, that at least points me in the right direction. I’m still unclear, after reading the linked GitHub issue, whether there is a clear workaround for this. Do I just need to handle updates outside of Portainer? It seems the issue also occurs on reboots, so I’m not sure that would solve it. Perhaps explicitly setting the MAC address in my compose files would be a more permanent solution?

Even before the MAC address issue, I never recommended creating containers in Portainer. It is great for browsing, but for me it always makes debugging harder when it hides what is happening behind the scenes. I don’t know why the issue would happen after a reboot, but Portainer is restarting too, so I have no idea what it does, for example whether it changes how the restart policy works for containers that it created.

Hi all,

Thanks a lot, this is very interesting. I’m a bit confused as to how this triggers specifically the behaviour I’m seeing (issues between just one container and another, the rest apparently working fine), but I checked this morning and I do have many duplicate MAC addresses - and that’s just on the one server, I haven’t even looked for duplicates across servers.

I wrote a quick bash loop, which I include here in case someone sees something wrong with this approach:

for container in $(docker ps --format '{{.ID}}'); do
        # print "MAC: container-id" for every non-empty MacAddress field
        docker inspect "$container" | grep "MacAddress\": \"[^\"]" | sed -E "s/^.*\"([^\"]*)\",$/\1: $container/g"
done

And the result when running bash tempdocker.sh | sort | uniq is:

02:42:ac:11:00:02: 33da655039fb
02:42:ac:11:00:03: f1c45c1e6dea
02:42:ac:11:00:04: 550880ed7728
02:42:ac:11:00:05: 6ce78ca18d34
02:42:ac:11:00:06: 507b4be1eff2
02:42:ac:12:00:02: 9aa96770d74a
02:42:ac:12:00:03: e07478b514fb
02:42:ac:12:00:04: a3735f568598
02:42:ac:12:00:05: 405b2aa1e891
02:42:ac:12:00:06: 508d2250d686
02:42:ac:12:00:07: 655d0826186d
02:42:ac:12:00:09: 81ec01140a1b
02:42:ac:12:00:09: 83d666ded144
02:42:ac:12:00:0a: 386ea5484b28
02:42:ac:12:00:0a: e329c598d367
02:42:ac:12:00:0b: 543e7aea7133
02:42:ac:12:00:0e: a0d597ba4f22
02:42:ac:12:00:0f: a238190b7f8a
02:42:ac:12:00:0f: bc6af55b2aae
02:42:ac:12:00:10: 15a8e7af76c7
02:42:ac:12:00:11: 6ce78ca18d34
02:42:ac:13:00:02: 5bd2f2a0a1c9
02:42:ac:13:00:03: 6ce78ca18d34
02:42:c0:a8:01:e0: d8704d049f84

There are quite a few containers sharing MAC addresses!
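For what it’s worth, a tighter variant of the same idea, using docker inspect’s Go templates instead of grep/sed, plus GNU uniq to keep only the duplicated addresses (one “MAC container-name” line per attached network; just a sketch):

for c in $(docker ps -q); do
        docker inspect --format '{{range .NetworkSettings.Networks}}{{.MacAddress}}  {{$.Name}}{{"\n"}}{{end}}' "$c"
done | sed '/^$/d' | sort | uniq -w 17 -D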

To @rimelek 's question:

I know it is about MAC addresses, but I wonder how you created the containers. Did you also edit or even create the containers from Portainer?

I normally create containers with Ansible’s docker module first, though I will sometimes use Portainer if I don’t expect something to be long-term. That said, I frequently use Portainer to either change a container’s setup (e.g. add/remove a command-line flag on Prometheus, add a label…) or pull the latest image and recreate the container. In either case, it’s the “duplicate/edit” functionality that I then use. So we might be onto something. In particular, Portainer itself shares a MAC address with another container, Prometheus, which I recreated not so long ago from inside Portainer. Timewise, this could match the start of the issue.

For context, I also use Watchtower to automatically update some containers. I know it’s bad practice, and I don’t do it on all my containers, but keep in mind that I’m talking about a homelab here, not a company’s production system. And time available for maintenance is limited :slight_smile:

RESOLUTION: I have edited the prometheus container again (from Portainer :slight_smile: ) and set a new, unique MAC address. Problem gone: 100% ping success from swag to portainer, and Portainer now seems to behave correctly through the reverse proxy.

But still, now, I wonder:

  1. as mentioned, why would I only see issues on the swag<->portainer communication and nothing else?
  2. if the issue is that Docker assigns random MAC addresses that have been “forced” as user-chosen ones by Portainer, shouldn’t Docker simply check that the generated MAC addresses are not already in use? Obviously I must be missing something obvious, forgive my naivety
  3. what would be a good way for me to fix the issue for all existing containers? am I now doomed to edit the MAC address manually for each one?

Would love to get more insights on those few points, but in the meantime, thanks a lot for pointing me to the issue! I love it when I ask a question and someone smart (or several smart someones!) quickly gives me the answer that I would not have found on my own in a million years! :pray:

It matters only when the two containers are communicating. If the portainer container had the same address as another one, the response from the proxy could not arrive. Or if the portainer and the proxy containers had the same MAC, the problem could happen only between those two. If the other containers sharing MACs are not communicating in any way, not even with the proxy, it could be fine. And there is also the possibility of luck: you could run into a problem later. Without knowing the exact case, we can only guess and easily be wrong. If you are interested in this issue, you could reproduce it in a test environment where you intentionally set the same MAC for different containers and test which can communicate with what on the same network, as in the sketch below.
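Something like this should do for the experiment (a minimal sketch; the container names and the duplicated address are made up):

docker network create mactest
docker run -d --name mac-a --network mactest --mac-address 02:42:ac:99:00:02 alpine sleep 3600
docker run -d --name mac-b --network mactest --mac-address 02:42:ac:99:00:02 alpine sleep 3600
docker run -d --name mac-c --network mactest alpine sleep 3600
# if the duplicate MAC is the culprit, expect intermittent loss here:
docker exec mac-c ping -c 10 mac-a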

Yes, Docker should check it, and I think this point was also raised in the GitHub issue or somewhere else I read about it. I think they just didn’t expect someone to set the same MAC address, as there is nothing to warn you when you set the same MAC on multiple virtual machines either. Then Portainer came and said “hold my beer” :slight_smile: . Of course, since Docker is so popular and easy to use, there could be a MAC address check. It would probably slow down container creation, though I’m not sure by how much or when it would matter. And sometimes you might even want to test duplicate MAC addresses, and then it is good that Docker supports them. So maybe a diagnostic tool could be the solution, similar to what Docker Desktop has, just for different issues.

Not necessarily, but depending on how many containers you have, it is possibly the easiest and fastest way, compared to writing a script that deletes the containers and creates new ones. In the future, you could avoid using Portainer to create new containers or edit existing ones, at least until they fix this.

I think it was happening in more cases, just with less visible impact. I had recently been annoyed that my Grafana dashboard was getting slower at refreshing some basic data from Prometheus. I was worried this had to do with my NFS setup, since I had just moved my hard drives to a new server. But behold, today all my graphs refresh instantly! Probably Prometheus was not always answering, but Grafana silently retried until it got the data, so I thought it was just slow.

Based on the MAC addresses I’ve seen assigned on my system, it seems that assignment is not really random but follows some kind of sequential logic - possibly a prefix for the virtual network plus a number incremented by one for each container. So Docker already keeps some track of MAC assignments. I may be biased by my naive, individual home-lab state of mind (forgive me :slight_smile: ), but I wouldn’t imagine that indexing the assigned MAC addresses somewhere, checking that a new address is not already in use, and picking another one until it fits would take any significant amount of resources.
In addition, there may be simpler methods than the current one that would already yield better results, even if not perfect. Fully randomising the MAC address would still leave a small risk of collision, but the address space is big enough that collisions would be much less likely than today. Similarly, keeping track of just the latest address and always incrementing when assigning a new one, without ever going back to a lower address in the space, should work fine until so many addresses have been handed out that Docker needs to wrap around to zero.
I’m not saying a bad decision was made in the first place: without the Portainer use case, I suppose it made sense to do things as they are now. But I want to hope there are solutions out there. Then again, it’s not like I’d be able to implement them :slight_smile:
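Side observation: the listing above is consistent with the MAC simply being derived from the container’s IPv4 address, i.e. 02:42 followed by the four IP octets in hex (172.18.0.9 would give 02:42:ac:12:00:09), which would also explain the apparent sequential pattern. A quick sanity check of that hypothesis (my own sketch, not a statement about the engine’s actual code):

ip_to_mac() {
        # convert a dotted-quad IPv4 address into a Docker-style 02:42:… MAC
        IFS=. read -r a b c d <<< "$1"
        printf '02:42:%02x:%02x:%02x:%02x\n' "$a" "$b" "$c" "$d"
}
ip_to_mac 172.18.0.9      # prints 02:42:ac:12:00:09
ip_to_mac 192.168.1.224   # prints 02:42:c0:a8:01:e0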

Agree, though I think it’s only the auto-assignment that needs adjusting. If the user (or Portainer) assigns a MAC address, then Docker would respect it, but upon creation of a container without a MAC specified, the generated MAC would be guaranteed to be unique. So even if Portainer still pins down an initially random address, making it “user-selected” because it doesn’t know better, it would not conflict with other new containers, which would get a different address. Basically, Portainer could pin down the MAC of every container it creates, and we’d still never run into a conflict.

Ah yes, but the convenience… :slight_smile: For now it was quick enough to just force new MAC addresses onto the few impacted containers, at least on that server (I’ll check the other ones!). I also wonder whether the same could happen when other software recreates containers, such as Watchtower. Ultimately, I think my containers are recreated by Watchtower more often than by Portainer.
Maybe I’ll just start manually choosing a MAC address for each container when I create them in the future. A bit annoying, but it might be the safest option!
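For the record, in a compose file that would look something like the snippet below (the service and image names are just placeholders; pick an unused, locally administered address):

services:
  portainer:
    image: portainer/portainer-ce:latest
    mac_address: "02:42:ac:12:00:99"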

Just rambling here (and I may be completely wrong about everything!), but to be clear, let me repeat my thanks for helping me understand what was happening (and for fixing my Grafana without my even realising it :slight_smile: ). You made my weekend!

The best thing is not to configure a fixed MAC address at all, so the engine can generate one itself.

Though, if a client (like Portainer, the cli, or the cli compose plugin) creates a container with a fixed MAC, then that’s what the engine will use.

Some management tools allow updating containers, which basically should remove the old container and create a new one based on a new image version, using the exact same configuration the old one had. This can be problematic and lead to non-working updated containers, for instance because some important environment variables require values they would normally inherit from the new image, but which are now overridden by the old values copied from the old container.
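In CLI terms, such an update roughly amounts to the following (a sketch with made-up names): note how a MAC that was originally auto-generated is read back from the old container and becomes an explicitly user-set value on the new one.

old_mac=$(docker inspect --format '{{range .NetworkSettings.Networks}}{{.MacAddress}}{{end}}' myapp)
docker rm -f myapp
docker run -d --name myapp --mac-address "$old_mac" myimage:latest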

After @rimelek shared the link, I remembered that this is not the first time we have seen this issue with Portainer users.

I think that’s the point. The MAC address is normally assigned by the Docker daemon, which will not duplicate it. At least I haven’t seen a case, or don’t remember one, where Portainer was not involved. If Portainer changes the default value, that is something that should be fixed in Portainer. @meyay noted that too; I just wanted to reply directly to your statement.

I guess convenience means different things to everyone :slight_smile: As long as it works for you, it’s okay :slight_smile:

Exactly. That is the default. I don’t know how or why Portainer does things differently, but if it turns out everything is caused by the Docker daemon, we have to rethink who should fix this. If it’s Portainer, I’m not sure it will ever happen, as the issue is pretty old and still open.