Why did 502 resolve after clearing up the docker network?

Hello All,

I’m hoping to find some answers and get an idea of how to optimize this. Initially, I had 5 containers running in one Docker network, “Network A”.

Network A:

  • Java web-application (HTTP traffic) (depends_on: redis)
  • 2 x JS applications (HTTP traffic)
  • Redis server
  • PHP web-application.

The Java application and the Redis server communicate within the network. The web applications are sitting behind an NGINX reverse proxy.

After a certain period, the client browsers start to get 502s. The 502s come up randomly and get resolved after refreshing multiple times.

The solution was moving the Redis container to another Docker network, “Network B”. The Java container was given access to both “Network A” and “Network B”. This fixed the 502s immediately.

Network A:

  • Java web-application (HTTP traffic) (depends_on: redis)
  • 2 x JS applications (HTTP traffic)
  • PHP web-application.

Network B:

  • Java web-application (HTTP traffic) (depends_on: redis)
  • Redis server

Generally, I understand that I’ve reduced the congestion in “Network A” by moving Redis out of it. But I don’t exactly understand the nuances.

Could someone please explain why? And what are the best practices for handling HTTP traffic and internal Docker networking?

Thank you

Update:

Before switching the networks, the containers were restarted + recreated multiple times.
The logs showed no errors. Also, the HTTP requests never reached the web servers in the containers. Only NGINX showed the 502s, which meant it was never even able to reach the web servers.

Here is the simplified version of the docker-compose file:

networks:
  appnetwork:
    external: true
  shared-network:
    external: true

services:
  redis:
    image: 'redis:latest'
    container_name: 'redis'
    expose:
      - 6379
    networks:
      - shared-network

  core:
    image: core
    container_name: core
    depends_on:
      - redis
    restart: always       
    extra_hosts:
      - "host.docker.internal:host-gateway"
    expose:
      - 7000
    ports:
      - "127.0.0.1:7000:7000"
    networks:
      - appnetwork 
      - shared-network
    
  js:
    image: js
    container_name: js
    restart: always
    expose:
      - 80
    ports:
      - "127.0.0.1:5001:80"
    networks:
      - appnetwork

  js2:
    image: js2
    container_name: js2
    restart: always
    expose:
      - 80
    ports:
      - "127.0.0.1:5002:80"
    networks:
      - appnetwork

You state the problem occurs after a certain period. And then it was “immediately” fixed? Maybe because you restarted the containers with the new networks and it just seems fixed?

Even when using a different Docker network, I don’t think that any “congestion” is avoided. So personally I don’t think this is really “the solution”.

I fully agree with @bluepuma77. “Moving” a container from one network to another also requires recreating it, which also means restarting, which could “solve” some issues. But I would check the logs in all containers, and I would enable verbose or debug logging wherever it is possible. A gateway error can be returned when a target server is not running or not behaving as expected.

Changing the network can help mostly when you have multiple compose services with the same name on the same network, so the proxy load balances among all of them when just one or some of them are actually listening on the required port.

You can share your config if you need more help and someone might be able to catch what is happening.

@bluepuma77 @rimelek
Hey, thanks for the response. I forgot to mention that, so I updated my question. The containers were restarted + recreated multiple times before changing networks. I also added the docker-compose file to the question.

I can’t explain that, but I wouldn’t try to guess without a more verbose error message. An HTTP 502 is returned by a server, and when a server returns it, it has to know why. When I wrote about error messages, I didn’t mean the container that couldn’t be reached, but the proxy server that was supposed to reach it: your nginx reverse proxy.

Your networks in the compose file are external, so there is no way to tell what else is on those networks, if we assume the reverse proxy was trying to reach another container.

I’m replying only now because I found it hard to connect the pieces together. Like what is network A and B, which is which in your compose file, which network is used by the proxy server, and whether the compose file is the original setup or the fixed one. If you explained it, I missed it, but when you simplify things, it is better if you use the right names to refer to them. After reading the updated post multiple times, I understand now that it is the new compose file, as you have two networks.

So if you can reproduce the issue again with the original setup and configure your reverse proxy to show you a verbose error message, I can’t promise anything, but at least there is a chance that we can give you a better answer.
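For example, assuming a stock nginx on the host (paths and the log-format name are illustrative), raising the log level and logging the upstream variables can show exactly why a 502 was returned:

```nginx
# Illustrative snippet for the http {} context; adjust paths to your setup.
# Raise verbosity so failed upstream connections are recorded:
error_log /var/log/nginx/error.log info;

# Log which upstream was contacted and what it answered:
log_format upstream_debug '$remote_addr -> $upstream_addr '
                          'status=$status upstream_status=$upstream_status '
                          'connect_time=$upstream_connect_time req_time=$request_time';
access_log /var/log/nginx/upstream.log upstream_debug;
```

A 502 with an empty `$upstream_addr`, or a "connect() failed" line in the error log, would tell you whether nginx even tried to reach a backend and which address it tried.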

I wouldn’t rule out a weird Docker network error completely, but I wouldn’t jump to that conclusion yet either.

Can you also tell us more about your Docker environment?

We usually need the following information to understand the issue:

  1. What platform are you using? Windows, Linux or macOS? Which version of the operating systems? In case of Linux, which distribution?

  2. How did you install Docker? Sharing the platform almost answers it, but only almost. Direct links to the guide you followed can be useful.

  3. On Debian-based Linux, the following commands can give us some idea and help recognize an incorrectly installed Docker:

    docker info
    docker version
    

    Review the output before sharing and remove confidential data if any appears (a public IP, for example).

    dpkg -l 'docker*' | grep '^ii'
    snap list docker
    

    When you share the outputs, always format your posts according to the following guide: How to format your forum posts

I would put my money on: https://forums.docker.com/t/nginx-swarm-redeploy-timeouts/68904/5

nginx is known to cache resolved IPs indefinitely. The post shows how to mitigate the problem.
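The mitigation from that thread is, roughly, to make nginx re-resolve the upstream name instead of caching it forever. A sketch, assuming nginx runs inside Docker where 127.0.0.11 is the embedded DNS server (the `core:7000` upstream matches the compose file above, but treat the snippet as illustrative):

```nginx
server {
    listen 80;
    # Docker's embedded DNS; re-resolve the name every 30 seconds
    resolver 127.0.0.11 valid=30s;

    location / {
        # Using a variable forces name resolution at request time,
        # not once when the configuration is loaded
        set $upstream core:7000;
        proxy_pass http://$upstream;
    }
}
```

With a plain `proxy_pass http://core:7000;`, nginx resolves the name once at startup, so a recreated container with a new IP can leave nginx pointing at a stale address.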

You might want to consider switching to traefik as a reverse proxy, as it uses the docker event stream to register/unregister the reverse proxy rules for containers.


If that’s the case, although I also prefer Traefik nowadays, nginx-proxy uses IP addresses to connect to containers instead of service names.

So probably any reverse proxy would work which was configured to work with containers properly.

Hey, thanks for the response. Sorry for the delayed reply; it was the weekend and I was gathering some information.

The platform is Linux: AlmaLinux 8.
Docker 25.0.4, installed via the package manager (dnf).

Client: Docker Engine - Community
 Version:    25.0.4
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.13.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 26
  Running: 26
  Paused: 0
  Stopped: 0
 Images: 46
 Server Version: 26.1.3
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: local
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 Kernel Version: 4.18.0-477.21.1.el8_8.x86_64
 Operating System: AlmaLinux 8.10 (Cerulean Leopard)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.08GiB
 Name: server
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

The networks were created in Docker. Here are the networks:

$ docker network ls
NETWORK ID     NAME                           DRIVER    SCOPE
e3df3df02ed    appnetwork                    bridge    local
ad9a91a838b    shared-network                bridge    local
6f9a2047a75    bridge                        bridge    local
246cb50ee64    host                          host      local
$ docker network inspect appnetwork shared-network
[
    {
        "Name": "appnetwork",
        ....
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "171.0.0.0/16",
                    "IPRange": "171.0.0.128/25",
                    "Gateway": "171.0.0.1"
                }
            ]
        }
        ....
    },
    {
        "Name": "shared-network",
        ....
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.28.0.0/16",
                    "Gateway": "172.28.0.1"
                }
            ]
        }
        ....
    }
]

The issue hasn’t reoccurred since the networks were changed, so I can’t check the logs.

I spoke to a few network people and they suggest it could be something to do with the Docker network’s port/IP exhaustion? Is that a possibility? If so, is there a way I can reset the networks every night via a cron task?

NGINX in my server context is not a Docker container. It is managed by server management software and is installed on the bare-metal server. The container ports are published by Docker, and NGINX connects to these ports.

It seems strange that you use a one-year-old Docker version and that the client and server have different versions. I would try to fix that first.

There are multiple things I have to point out. One was mentioned by @bluepuma77 already

  • You are using AlmaLinux, which is not officially supported by Docker. Even if a distro is based on an officially supported one, it is not that one, and there can be differences. Here are the supported distros: Install | Docker Docs
  • Your Docker client is v25.0.4 and the server is v26.1.3. The versions don’t have to be the same, but it is best if they are. Also, I don’t think either version is supported, as the current latest version is v28.1.1 and only v28 and v27 are mentioned in the documentation’s “Release notes” summary page: Release notes | Docker Docs
  • Your cgroup version is v1, which is a legacy version, but it is probably not related to your issue

I guess anything is possible, but I’m not a network guy myself, so I deal with issues when I meet them, and I haven’t met this one. But I don’t see how it could be IP exhaustion if your container could start. Docker would not let you create a container on a network when there is no available IP left. The same goes for ports, unless you mean dynamic ports for TCP communication, but if you have problems with the number of ports, something must be seriously wrong with your app, and I don’t think that is the case. But again, not a network expert here.

If you are satisfied with how it works now, that’s okay, but if it was indeed the changed network that solved the issue, you will not be able to reproduce it. So if you want to make sure you know what fixed it, you could try to run another test project set up as it was before the fix. Or just wait and see if it ever occurs again. The topic will be automatically closed after 30 months, so if you have the issue again after that, you can open a new topic (we can merge it into this one if needed).

So how do you refer to the services from there? Using the loopback IPs like 127.0.0.1:5002? Then why are your networks external? Normally you would have a compose-project-level network for internal communication between containers in the same project, and one additional network for the reverse proxy container, so that it can access your web server in the compose project.
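A sketch of that layout (service and network names are illustrative): only the proxy-facing network is external, while inter-container traffic stays on the project’s own default network, which Compose creates automatically:

```yaml
networks:
  proxy-network:
    external: true   # shared only with the reverse proxy container

services:
  redis:
    image: redis:latest
    # no networks: key, so it sits only on the project's default network

  core:
    image: core
    depends_on:
      - redis
    networks:
      - default        # internal: reaches redis by its service name
      - proxy-network  # reachable by the reverse proxy
```

Note that once a service declares a `networks:` list, Compose no longer attaches it to `default` implicitly, which is why `core` lists `default` explicitly.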

Be careful with external networks, because then you cannot always use the service names to refer to another container in the project: it doesn’t matter which project a service belongs to as long as it is on the same network, and requests to containers with the same service name will be load balanced.