I have already pulled this image so it should be local.
If I run this command locally on my dev machine, it executes in under 1s.
On a few of the swarm hosts, it can take 10, 20, or even 30s at times. Usually the best I can manage is 2-3s.
Looking for any ideas on how to debug this issue.
Network seems unused (but why would it even hit the network anyway?),
disk IO is very low,
disk space is high,
and CPU is very low.
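Since the delay varies so much between runs, it may help to collect several timings per host rather than a single measurement. A minimal sketch, assuming Python 3 is available on the hosts; the helper names `time_command` and `summarize` are made up for illustration:

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=5):
    """Run cmd `runs` times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min / median / max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Example call (assumes the hello-world image is already pulled):
# print(summarize(time_command(["docker", "run", "--rm", "hello-world"], runs=10)))
```

Comparing the min and max on a fast host against a slow one should show whether the slow hosts are consistently slow or only spike occasionally.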
It could be related to your environment or the backing filesystem. If you share the output of docker info, it could reveal some valuable information. Before posting it here, remove anything you wouldn’t want to share, such as a registry IP or username.
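If you want to scrub the output automatically, something like the sketch below could work. The label list is an assumption: label names differ between Docker versions, and multi-line lists (for example registry mirrors) are not handled, so check the result by eye before posting.

```python
import re

# Assumption: these single-line labels are the ones worth hiding.
# Extend the tuple to match what your docker info actually prints.
SENSITIVE_LABELS = ("Username", "Registry", "HTTP Proxy", "HTTPS Proxy", "No Proxy")


def redact_docker_info(text, labels=SENSITIVE_LABELS):
    """Replace the value of every 'Label: value' line whose label is in `labels`."""
    alternatives = "|".join(re.escape(label) for label in labels)
    pattern = re.compile(r"^([ \t]*(?:%s):[ \t]*).*$" % alternatives, re.MULTILINE)
    return pattern.sub(r"\1<redacted>", text)
```

Save the output of docker info to a file, pass its contents through `redact_docker_info`, and post the result.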
The last time I saw Docker running really slowly, I suspected high disk usage on the host, since the VMs didn’t have dedicated disks. I couldn’t confirm it 100%, but some hours later everything was fast again.
If you run locally on a 16-core CPU with plenty of RAM and NVMe storage, and the (“cloud”?) server only has 1 CPU core available, other processes running, little RAM and slow network-attached storage, then this might be expected.
It’s interesting that you mention Swarm, but the container you run seems plain Docker.
Here you go - maybe you can see something in there. It might just be a HW issue… ? At this point I wouldn’t rule out gremlins. Again, even running time docker run hello-world can take 20s or more. This is not a virtual server, it is running on bare metal.
Is this related to swarm service deployments, or also to plain container deployments with docker run or docker compose?
Swarm deployments are indeed not instantaneous: you only deploy changes to the desired state, then the scheduler reconciles the desired state with the current state and determines a node that satisfies the placement and resource constraints. Once the node is determined, it schedules the service task on that node, which usually also pulls the image from the registry. In my experience, most of the time is lost pulling the image, especially if the registry is not reachable over a high-bandwidth, low-latency network. You could check whether setting pull_policy is supported in swarm and can prevent unnecessary pulls.
With plain container deployments, it should indeed be instantaneous, especially when overlay2 is used as the storage driver.
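To narrow down where a plain docker run loses its time, you could split it into its separate steps and time each one. This is only a rough sketch: it assumes the image is already local, and `run` and `time_phases` are illustrative helper names, not anything Docker provides.

```python
import subprocess
import time


def run(cmd):
    """Run a command and return its stdout as stripped text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def time_phases(image, runner=run):
    """Time docker create, docker start and docker rm separately for one container."""
    timings = {}

    t0 = time.perf_counter()
    container_id = runner(["docker", "create", image])  # prints the new container ID
    timings["create"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    runner(["docker", "start", "--attach", container_id])  # waits until the container exits
    timings["start"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    runner(["docker", "rm", container_id])
    timings["rm"] = time.perf_counter() - t0

    return timings


# Example call: print(time_phases("hello-world"))
```

If most of the time ends up in one step, that at least tells you which part of the daemon to look at next.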
Can you try with an ext4 filesystem instead of xfs? We had a similar problem (very long container start and stop times) and switching solved the issue.
Does anybody have any insight on why it helped?
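To check which filesystem actually backs the Docker data directory on each host, you can look it up in /proc/mounts. A small sketch, assuming Linux and the default data directory /var/lib/docker (docker info prints the real one as "Docker Root Dir"); `fs_type_for` is a made-up helper name:

```python
import os


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point that contains path."""
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # >= so that a later mount on the same mount point wins
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


# Example call on a Linux host:
# with open("/proc/mounts") as f:
#     print(fs_type_for("/var/lib/docker", f.read()))
```

Running this on a fast host and a slow host makes it easy to see whether they differ in filesystem type.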