Slow Network Startup in a Swarm

I am troubleshooting load times for a system of containers. In particular, ShinyProxy in swarm mode launches one container per user. I can start the container in about three seconds with the Docker client. I have also timed the various steps in the app, and they all take negligible time, including connecting out to a database and importing data. From the app's point of view there appears to be some delay before the network is ready to listen (receive and handle requests).
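To illustrate what I am measuring, this is roughly how I time a plain container start until the app answers; the image name and port (myorg/shiny-app, 3838) are placeholders for my actual app:

```bash
# Time from "docker run" until the app answers over HTTP.
# Image name and port are placeholders -- adjust to your app.
start=$(date +%s)
cid=$(docker run -d -p 3838:3838 myorg/shiny-app)
until curl -sf -o /dev/null http://localhost:3838; do sleep 0.5; done
echo "ready after $(( $(date +%s) - start ))s (container ${cid:0:12})"
```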

I have tried to manually start the container as a service, in a similar way to how ShinyProxy launches the containers, by creating a service in the swarm. This process seems to take a long time; e.g. polling the service shows it takes a while to become available.
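For reference, this is roughly the kind of manual test I mean, with placeholder service/image names; it times the service from creation until the task reports Running and until the published port answers:

```bash
# Time a swarm service from creation to Running and to first HTTP response.
# Service name, image, and port are placeholders.
start=$(date +%s)
docker service create -d --name shiny-test --publish 3838:3838 myorg/shiny-app
until docker service ps shiny-test --format '{{.CurrentState}}' | grep -q Running; do sleep 0.5; done
echo "task running after $(( $(date +%s) - start ))s"
until curl -sf -o /dev/null http://localhost:3838; do sleep 0.5; done
echo "port answering after $(( $(date +%s) - start ))s"
docker service rm shiny-test
```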

I am still triangulating, but it seems like the launch time is either overhead in the swarm service model or some networking issue in the container base image (currently a Debian-based image).

I thought I would ask the community to see if there is any way to troubleshoot this kind of issue: slow network startup in a swarm.

Swarm service deployments are slower than plain container deployments. The scheduler needs to find a node that meets the deployment constraints and has free capacity for the requested resources, then schedule a task on that node, which pulls the image from a registry and then creates the container.
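If it helps to see where the time goes, the individual task can be inspected for its creation time and the timestamp of its current state; the service name below is a placeholder:

```bash
# Compare a task's creation time with the timestamp of its current state
# (service name is a placeholder; takes the first listed task).
task_id=$(docker service ps shiny-test -q | head -n1)
docker inspect "$task_id" \
  --format 'created={{.CreatedAt}} state={{.Status.State}} since={{.Status.Timestamp}}'
```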

A typical bottleneck is how fast the image can be pulled from the registry (worst case, the task will remain in the “prepare” state for minutes, or even hours if your Docker Hub pull rate is exceeded). Usually, running a local registry and using only images from that local registry speeds up the prepare step by orders of magnitude.
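A minimal sketch of that setup on a single host, with placeholder image names and the default registry port:

```bash
# Run a plain local registry and serve the app image from it.
# Image names and the port are placeholders; on a multi-node swarm the
# registry needs a hostname reachable from every node.
docker run -d --name registry -p 5000:5000 registry:2
docker tag myorg/shiny-app localhost:5000/shiny-app
docker push localhost:5000/shiny-app
docker service create --name shiny-test --publish 3838:3838 localhost:5000/shiny-app
```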

The default image pull policy for swarm services appears to be “always”. You could try setting pull_policy to “missing” and see whether it mitigates the problem, though I am not sure whether swarm stacks even support that configuration item.
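Independent of pull_policy, there are related CLI flags I am aware of: --no-resolve-image on docker service create and --resolve-image never on docker stack deploy. As far as I understand, they stop the manager from querying the registry to resolve the image digest at deploy time, so a locally present image is used as-is. Names and files below are placeholders:

```bash
# Skip the registry digest lookup at deploy time (names are placeholders).
docker service create --no-resolve-image --name shiny-test myorg/shiny-app
docker stack deploy --resolve-image never -c docker-compose.yml shiny-stack
```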

You can use docker service ps {service name} to check whether a service is in the prepare state and see how long it stays there.
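For example, something like this refreshes the task state once a second (service name is a placeholder):

```bash
# Watch the task state; the Error column shows pull or scheduling failures.
watch -n1 "docker service ps shiny-test --format 'table {{.Name}}\t{{.CurrentState}}\t{{.Error}}'"
```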

I did think about the pull policy and made sure it was disabled. I monitored docker events and the container logs in real time. It appears to be a networking issue, maybe in the container or on the host.

E.g. during the service launch the container is up and running but not yet reachable.
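To narrow down where the gap is, I probe the app both from inside the container and through the published port once the task is Running; this assumes curl is available in the image, and the names/ports are placeholders:

```bash
# Which path lags: the app itself, or the swarm-published port?
# Assumes curl exists in the image; names and ports are placeholders.
cid=$(docker ps -q -f name=shiny-test | head -n1)
until docker exec "$cid" curl -sf -o /dev/null http://localhost:3838; do sleep 0.5; done
echo "$(date +%T) app answering inside the container"
until curl -sf -o /dev/null http://localhost:3838; do sleep 0.5; done
echo "$(date +%T) app answering via the published port"
```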

Are there any other insights into how I can improve or understand this process? I feel like some things are just pending/blocking, such as the container healthcheck. When I monitor the process, I don't see much CPU or network traffic. I am doing this on a single node, for example.
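To rule out the healthcheck, I look at the health state of the running container and, as an experiment, tune the check at service creation time; the names and timings below are placeholders:

```bash
# Is a healthcheck what the service is waiting on? (prints null if none)
docker inspect <container-id> --format '{{json .State.Health}}'
# Experiment: shorten the check, or disable it entirely with --no-healthcheck.
docker service create -d --name shiny-test \
  --health-start-period 2s --health-interval 2s \
  myorg/shiny-app
```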

I have no further ideas beyond what I already wrote in my last response.

For me it’s usually the image pull that takes ages, which can be improved by running a local container image registry in the network that acts as a pull-through cache.
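A minimal sketch of such a pull-through cache for Docker Hub, with a placeholder host/port, using the registry:2 proxy mode and the daemon’s registry-mirrors setting:

```bash
# Run registry:2 as a pull-through cache for Docker Hub.
docker run -d --name hub-mirror -p 5000:5000 \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  registry:2

# Then point each node's daemon at the mirror and restart the docker daemon:
# /etc/docker/daemon.json
# {
#   "registry-mirrors": ["http://localhost:5000"]
# }
```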