Slow network startup in docker-compose

Hi there,

I have the problem that it takes up to one minute until a container can communicate with the outside world, although its state is “running”. The side effect is that containers which need external resources in entrypoint.sh crash, because they can’t download from or communicate with those resources.

Is there a way to troubleshoot a container’s network initialization?

My Infrastructure:
Containerhost: QNAP with Intel i5 8400T and 64GB RAM with latest QTS and Container Station using Docker-Compose
Storage: nvme SSD Raid 1
Management Software: Portainer Business Edition
Container-Sample where I have this issue: traefik latest, reverse Proxy

docker-compose.yml

version: "3.3"

services:
  traefik:
#    dns:
#      - "1.1.1.1"
#      - "8.8.8.8"
    image: traefik:latest
    restart: always
    container_name: traefik
    environment: 
        CF_DNS_API_TOKEN: 'mytoken'

#        TRAEFIK_CERTIFICATESRESOLVERS_MYRESOLVER_ACME_DNSCHALLENGE_DELAYBEFORECHECK: 120
    command:

      - --api.insecure=true # <== Enabling insecure api, NOT RECOMMENDED FOR PRODUCTION
      - --api.dashboard=true # <== Enabling the dashboard to view services, middlewares, routers, etc.
      - --api.debug=true # <== Enabling additional endpoints for debugging and profiling
      - --log.level=TRACE # <== Setting the level of the logs from traefik
      - --providers.docker=true # <== Enabling docker as the provider for traefik
      - --providers.docker.exposedbydefault=false # <== Don't expose every container to traefik
      - --providers.docker.network=web # <== Operate on the docker network named web
      - --entrypoints.web.address=192.168.178.3:80
      - --entrypoints.websecure.address=192.168.178.3:443
      #DNS Challenge
      - --certificatesresolvers.myresolver.acme.dnschallenge=true
      - --certificatesresolvers.myresolver.acme.dnschallenge.provider=cloudflare
      
      # ACME Base
      - --certificatesresolvers.myresolver.acme.email=postmaster@mydomain.com
      - --certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json
      - --entrypoints.websecure.http.tls=true
      - --entrypoints.websecure.http.tls.certresolver=myresolver
      - --entrypoints.websecure.http.tls.domains[0].main=mydomain.com
      - --entrypoints.websecure.http.tls.domains[0].sans=*.mydomain.com
      - --serverstransport.insecureskipverify=true


    volumes:
      - /var/run/docker.sock:/var/run/docker.sock # <== Volume for docker admin
      - /share/ContainerStation/persistent/traefik/dynamic.yaml:/dynamic.yaml # <== Volume for dynamic conf file, **ref: line 27
      - /share/ContainerStation/persistent/traefik/config.yml:/config.yml
      - /share/ContainerStation/persistent/traefik/letsencrypt:/letsencrypt
      - /share/ContainerStation/persistent/traefik/certs:/certs:ro
      - /share/ContainerStation/persistent/traefik/certs.yml:/certs.yml
      - /share/ContainerStation/persistent/traefik/entrypoint.sh:/entrypoint.sh
    networks:
       web: # <== Placing traefik on the network named web, to access containers on this network
       qnet-static-eth1-b03c93: # <== Static IP in server dmz
          ipv4_address: 192.168.178.3
    labels:
      - "traefik.enable=true" # <== Enable traefik on itself to view dashboard and assign subdomain to$
      - "traefik.http.routers.api.rule=Host(`monitor.mydomain.com`)" # <== Setting the domain for the d$
      - "traefik.http.routers.api.service=api@internal" # <== Enabling the api to be a service to acce$
networks:
  web:
    external: true
  qnet-static-eth1-b03c93:
    external: true

many thanks in advance

Not sure about the troubleshooting, or the reason the container takes a while to launch, but you can define a healthcheck and have the dependent services launch only once it is healthy.

Good hint, but I don’t think I really understand how to implement it.
The examples I found always use a second container, which checks whether the first one is up. But I must make sure that entrypoint.sh waits to execute until the network of the started container is fully functional.

Additionally, there is the strange fact that I must ping out of my problem container to get a proper network connection within around 30 seconds. If the ping isn’t executed, it can take several minutes until the container is reachable. Very strange.

I’m not entirely sure I understand what you mean about needing a secondary container

Basically the way the healthcheck works is you set a script that runs every interval, if that script exits with an error, the container is considered unhealthy.
You can define that healthcheck on the container which takes a while to launch, you can have it ping itself (localhost), or whatever is necessary to actually check the service has started.

Then, you can have your other services depend on that one being marked as healthy, so that they only start once it has already finished launching

As for the specifics, that’d depend on your application and project structure, but that’s the gist of it

Ok, I think I have an imagination problem :smiley:
I know that if I have a stack of several services, it is possible to set dependencies between these services. But in my case it’s a stack of only ONE service/container. So I must make sure that entrypoint.sh doesn’t start before the network inside this container is working properly.

I found examples like this here:

version: "2.1"
services:
    api:
        build: .
        container_name: api
        ports:
            - "8080:8080"
        depends_on:
            db:
                condition: service_healthy
    db:
        container_name: db
        image: mysql
        ports:
            - "3306"
        environment:
            MYSQL_ALLOW_EMPTY_PASSWORD: "yes"
            MYSQL_USER: "user"
            MYSQL_PASSWORD: "password"
            MYSQL_DATABASE: "database"
        healthcheck:
            test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
            timeout: 20s
            retries: 10

But here the “api” container waits until “db” is up. That’s clear to me. But doesn’t cover my issue.

Something like this doesn’t work, because the dependency would be the container itself, which results in a chicken-and-egg problem.

version: "3.3"

services:
  traefik:
#    dns:
#      - "1.1.1.1"
#      - "8.8.8.8"
    image: traefik:latest
    restart: always
    container_name: traefik
    environment: 
        CF_DNS_API_TOKEN: 'mytoken'

#        TRAEFIK_CERTIFICATESRESOLVERS_MYRESOLVER_ACME_DNSCHALLENGE_DELAYBEFORECHECK: 120
    healthcheck:
        test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://192.168.178.10:60080/cgi-bin"]
        timeout: 10s
        retries: 100
    depends_on:
        traefik: # <== the service depends on itself, which cannot work
            condition: service_healthy
    command:
...

What I need is a mechanism, which starts the container, but postpones the entrypoint.sh until the network is up.

I would say I hope it is Docker Compose v2, but based on your shared code snippets, I don’t think so. So make sure you are using Docker Compose v2, the only supported Compose. Sorry for not linking due to my attempt to respond quickly, but a Google search should give you the answer quickly.

How do you know that? Are you sure that the container is the problem and not “the outer world”?
Once the container is started it should immediately have everything in place, including the network.

Then change the entrypoint. You cannot delay the entrypoint itself, as it is part of the command that creates the process to run in the container. There is no running container without an already running process, as the container is basically the isolation of that process. What you can do is change the entrypoint script and add a loop that tries the command, waits a second (or more if needed) after a failure, and tries again until it succeeds.
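A minimal sketch of such a retry loop, assuming the image ships a POSIX shell. The function name `wait_for` and its arguments are made up for illustration and are not part of traefik or Docker:

```shell
#!/bin/sh
# wait_for MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0.
# Returns 0 as soon as COMMAND succeeds, 1 after MAX_ATTEMPTS failures.
wait_for() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    # Probe output is discarded; only the exit status matters.
    while ! "$@" >/dev/null 2>&1; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}
```

In a custom entrypoint.sh this could be called as `wait_for 60 1 ping -c 1 "$GATEWAY_IP"` before handing over to the real process with `exec`. `GATEWAY_IP` is a placeholder you would have to set yourself, and the probe can be any command that only succeeds once the network works.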

Oops, sorry for my very late answer. It seems my notifications didn’t work.

Basically I can see the network init delay as follows:
I have many containers which try to download things during startup. In my example above, traefik tries to solve the ACME DNS challenge, which fails because the internet isn’t reachable yet.

A second example is my Plex media server, which tries to download drivers for the hardware transcoder. That is also not possible due to the same issue.

Dec 19, 2024 06:29:13.088 [140172413377336] WARN - [HttpClient/HCl#1] HTTP error requesting GET https://plex.tv/api/codecs/mpeg2video_decoder?build=linux-x86_64-standard&deviceId=13d8a508-70bb-44fc-8806-0371e1648dd9&oldestPreviousVersion=1%2E24%2E2%2E4973-2b1b51db9&version=e613bce-3d5ad59c62e771ae9cb5738e (6, Couldn't resolve host name) (Could not resolve host: plex.tv)
Dec 19, 2024 06:29:13.089 [140172360760120] ERROR - Codecs: Failed to download XML for codec 'mpeg2video_decoder'
Dec 19, 2024 06:29:13.089 [140172360760120] WARN - Codecs: Failed to download mpeg2video decoder; bailing out
Dec 19, 2024 06:29:13.089 [140172360760120] DEBUG - [GPU] Got device: Intel CoffeeLake-S GT2 [UHD Graphics 630], intel@builtin, default true, best true, ID 8086:3e92:8086:2212@0000:00:02.0, DevID [8086:3e92:8086:2212], flags 0x38e7
Dec 19, 2024 06:29:13.089 [140172360760120] INFO - Preemptively preparing driver imd for GPU Intel CoffeeLake-S GT2 [UHD Graphics 630]
Dec 19, 2024 06:29:13.089 [140172360760120] INFO - [DriverDL/imd] Obtaining driver
Dec 19, 2024 06:29:13.089 [140172360760120] INFO - [DriverDL/imd] Fetching zipped component
Dec 19, 2024 06:29:13.089 [140172360760120] DEBUG - [DriverDL/imd/GetFile/HCl#a] HTTP requesting GET https://downloads.plex.tv/intel-media-driver/imd-1c7ab8176722a768cec9f3e5-linux-x86_64.zip
Dec 19, 2024 06:29:13.089 [140172360760120] INFO - Preemptively preparing driver icr for GPU Intel CoffeeLake-S GT2 [UHD Graphics 630]
Dec 19, 2024 06:29:13.089 [140172360760120] INFO - [DriverDL/icr] Obtaining driver
Dec 19, 2024 06:29:13.089 [140172360760120] INFO - [DriverDL/icr] Fetching zipped component
Dec 19, 2024 06:29:13.089 [140172360760120] DEBUG - [DriverDL/icr/GetFile/HCl#b] HTTP requesting GET https://downloads.plex.tv/intel-compute-runtime/icr-2cc111011bf540977e2b145d-linux-x86_64.zip
Dec 19, 2024 06:29:13.104 [140172413377336] WARN - [HttpClient/HCl#2] HTTP error requesting GET https://plex.tv/media/providers?X-Plex-Token=xxxxxxxxxxxxxxxxxxxx (6, Couldn't resolve host name) (Could not resolve host: plex.tv)
Dec 19, 2024 06:29:13.104 [140172407999288] ERROR - [MediaProviderManager] Error parsing content.
Dec 19, 2024 06:29:13.105 [140172407999288] ERROR - [MediaProviderManager] Error parsing XML: Error parsing file.
Dec 19, 2024 06:29:13.105 [140172407999288] DEBUG - [MediaProviderManager/HCl#c] HTTP requesting GET https://plex.tv/media/providers?X-Plex-Token=xxxxxxxxxxxxxxxxxxxx
Dec 19, 2024 06:29:13.109 [140172413377336] WARN - [HttpClient/HCl#3] HTTP error requesting GET https://plex.tv/api/v2/server/access_tokens?auth_token=xxxxxxxxxxxxxxxxxxxx (6, Couldn't resolve host name) (Could not resolve host: plex.tv)
Dec 19, 2024 06:29:13.110 [140172360760120] DEBUG - MyPlex: using cached data for request for https://plex.tv/api/v2/server/access_tokens?auth_token=xxxxxxxxxxxxxxxxxxxx

My “quickfix” for that was to put a ping to my gateway, running for at least one minute, in front of the container’s startup script. There I can see that it takes up to 40 seconds until my ping gets a response, which is a problem for a lot of containers. Any ideas how to fix the root cause?
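To put a number on that delay instead of watching ping output, something like the following could be run at container start. It is only a rough sketch: the function name `seconds_until` is invented for this example, and it assumes the image’s `date` supports `+%s`, which is common but not guaranteed everywhere.

```shell
#!/bin/sh
# seconds_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: polls COMMAND about once per second and prints how
# many whole seconds had passed when it first succeeded.
# Returns 1 without printing anything if TIMEOUT_SECONDS is reached first.
seconds_until() {
    timeout=$1
    shift
    start=$(date +%s)
    while :; do
        elapsed=$(( $(date +%s) - start ))
        if "$@" >/dev/null 2>&1; then
            echo "$elapsed"
            return 0
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
    done
}
```

For example, `seconds_until 120 ping -c 1 "$GATEWAY_IP"` would print roughly how long the gateway stayed unreachable, where `GATEWAY_IP` is again a placeholder. A slow probe makes the result coarser, since each failed attempt adds its own timeout to the loop.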

Since you run Docker on QNAP, the behavior will most likely be caused by how their (!) Docker Engine is integrated into the system. From what I remember they also provide a network plugin that does not exist anywhere else.

I am confident that users of a QNAP forum will more likely know what causes this behavior and tell whether it can be changed or not.

Of course, I also created a thread in the QNAP community. But I think most users there are not that deep into Docker, so I got no response at all.
This issue seems to be very “special”.

Hi N300,

So… I was struggling with the same issue, and after troubleshooting I concluded that spanning tree on the virtual switch was causing the delay in network connectivity. Once you disable spanning tree on your virtual switch, you should be good to go. The option is in small print at the bottom of the config window.

Good luck

Hi jdaumeri,

you are my hero :blush:
Since I deactivated spanning tree on the affected physical interfaces, I don’t get a single timeout when restarting a container. Formerly it took up to one minute to come back after a restart.

Really nice!!