Debug slow docker container startup

Running a swarm of 8 nodes.
Docker version is 24.0.2

Have had intermittent problems, and think the issue may relate to docker taking a very long time to do things.

time docker container run --rm --net none --pid host --uts host --privileged --log-driver none --name speedtest busybox true | grep real

I have already pulled this image so it should be local.

If I run this command locally on my dev machine, it executes in < 1s.
On a few of the swarm hosts it can take 10, 20 or even 30s at times. The best I usually manage there is 2-3s.
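
Since the slowness is intermittent, a single measurement says little. Below is a minimal sketch for repeating the measurement and printing each run's wall-clock time. It assumes GNU date (for %N nanoseconds), which Ubuntu ships. time_runs is a hypothetical helper name, not a Docker command.

```shell
# Hypothetical helper: run a command N times and print the wall-clock
# duration of every run in milliseconds, one line per run.
# Assumes GNU date, which supports %N (nanoseconds).
time_runs() (
  n=$1
  shift
  i=1
  while [ "$i" -le "$n" ]; do
    start=$(date +%s%N)
    # Discard the command's output; its exit status is deliberately ignored.
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    echo "run $i: $(( (end - start) / 1000000 )) ms"
    i=$((i + 1))
  done
)
```

For example, time_runs 10 docker container run --rm --net none --log-driver none busybox true would show whether the delay hits every run or only some of them.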

Looking for any ideas on how to debug this issue.
Network seems unused (but why would it even hit the network anyway?),
disk IO is very low,
disk space is high,
and CPU is very low.

Not sure where else to possibly investigate.

Any ideas are welcomed!

It could be related to your environment or the backing filesystem. If you share the output of docker info, it could reveal some valuable information. Before posting it here, remove anything you wouldn’t want to share, such as a registry IP or username.

The last time I saw Docker working really slowly, I suspected high disk usage on the host, and the VMs didn’t have dedicated disks. I couldn’t confirm it 100%, but some hours later everything was fast again.

If you run locally on a 16-core CPU with plenty of RAM and NVMe storage, while the (“cloud”?) server has only 1 CPU core available, other processes running, little RAM and slow network-attached storage, then this might be expected.

It’s interesting that you mention Swarm, but the container you run seems plain Docker.

Here you go - maybe you can see something in there. It might just be a HW issue…? At this point I wouldn’t rule out gremlins. Again, even running time docker run hello-world can take 20s or more. This is not a virtual server; it is running on bare metal.

Client: Docker Engine - Community
 Version:    24.0.2
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.5
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.18.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 8
  Running: 5
  Paused: 0
  Stopped: 3
 Images: 143
 Server Version: 24.0.2
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: 7uboxlda8xmxsu5ci44i9exla
  Is Manager: true
  ClusterID: jeoeqv4cyqg0n41btl480o4e4
  Managers: 5
  Nodes: 8
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address:XXXX
  Manager Addresses:
XXXXX
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 5.4.0-150-generic
 Operating System: Ubuntu 18.04.6 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 125.9GiB
 Name: XXXX
 ID: XXXXXX
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 106
  Goroutines: 220
  System Time: 2024-06-08T16:12:06.335112825Z
  EventsListeners: 2
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

Is this related to swarm service deployments, or also to plain container deployments with docker run or docker compose?

Swarm deployments are indeed not instantaneous: you only deploy changes to the desired state. The scheduler then reconciles the desired state with the current state and determines a node that satisfies the placement and resource constraints. Once the node is determined, it schedules the service task on that node, which usually also pulls the image from the registry. In my experience, most of the time is lost while pulling the image, especially if the registry is not reachable via a high-bandwidth, low-latency network. You could check whether setting the pull_policy is supported in swarm and whether it prevents unnecessary pulls.

With plain container deployments, it should indeed be near-instantaneous, especially when overlay2 is used as the storage driver.
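
To see which step of a plain container run is slow, the run can be split into its create, start and remove phases and each one timed separately. This is only a rough sketch: phase, container_phases and the DOCKER variable are hypothetical names introduced here, and it assumes GNU date and an already pulled busybox image.

```shell
# DOCKER is a hypothetical override hook, so the script can be dry-run
# with a stand-in command; by default it calls the real docker CLI.
DOCKER=${DOCKER:-docker}

# Current time in milliseconds (GNU date with %N assumed).
now_ms() { echo $(( $(date +%s%N) / 1000000 )); }

# Run one command, then print its label, duration and exit status.
phase() (
  label=$1
  shift
  start=$(now_ms)
  "$@" >/dev/null 2>&1
  status=$?
  end=$(now_ms)
  echo "$label: $((end - start)) ms (exit $status)"
)

# Time container creation, start (attached, so it waits for exit)
# and removal as three separate steps.
container_phases() (
  cname=${1:-phasetest}
  phase create $DOCKER container create --name "$cname" --network none --log-driver none busybox true
  phase start $DOCKER container start --attach "$cname"
  phase remove $DOCKER container rm "$cname"
)
```

Calling container_phases speedtest2 on a slow host prints three lines. If create is slow, the storage driver and filesystem are the first suspects. If start is slow, the runtime, cgroups or namespace setup are more likely.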

You should find the required information to access the syslog of the docker daemon, and enable debug mode on this page: https://docs.docker.com/config/daemon/logs/
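
On a systemd host such as Ubuntu, the daemon log can be narrowed to the window of one slow run. The sketch below assumes the daemon runs as the docker.service unit and that you are allowed to read the journal (root or the systemd-journal group). logs_during and the JOURNALCTL variable are hypothetical names used only for illustration.

```shell
# JOURNALCTL is a hypothetical override hook; by default the real
# journalctl is used.
JOURNALCTL=${JOURNALCTL:-journalctl}

# Run a command, then show the docker.service journal entries that were
# logged between the start and the end of that command.
logs_during() (
  start=$(date '+%Y-%m-%d %H:%M:%S')
  "$@" >/dev/null 2>&1
  end=$(date '+%Y-%m-%d %H:%M:%S')
  $JOURNALCTL --no-pager -u docker.service --since "$start" --until "$end"
)
```

For example, logs_during docker run --rm hello-world prints only the daemon messages produced while that container was being started. With debug mode enabled, gaps between consecutive timestamps can hint at where the time goes.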

Can you try with ext4 filesystem instead of xfs? We had a similar problem (very very long container start and stop) and it solved the issue.
Does anybody have any insight on why it helped?

I haven’t used Docker with XFS yet, but we discussed a similar issue here before, assuming it was slow because of the disk IO:

https://forums.docker.com/t/why-is-my-xfs-block-io-insane-compared-to-my-ext4-block-io/116788

There was no solution in that topic. When did it happen to you? Do you remember which Docker version you had?