Docker Community Forums

Stack deploy services never start

I am working on a newly set up swarm with 3 manager nodes on CentOS 7. Individual services run fine: dockersamples/visualizer is running and can see services on any of the 3 nodes.
The nodes look healthy:

:~/stacktest$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
m2gwj719n04n4mlqopoi37x2f * swarm-1 Ready Active Reachable 20.10.6
mj5wgc9yml9nhmetdc6q3s5co swarm-2 Ready Active Leader 20.10.6
fp2ewnf7ywjn7kuan9dif7pdk swarm-3 Ready Active Reachable 20.10.6

I followed the example in Deploy a stack to a swarm | Docker Documentation and everything worked fine up to the step ‘Deploy the stack to the swarm’. When I run the command ‘docker stack deploy -c docker-compose.yml stackdemo’, I get:

:~/stacktest$ docker stack deploy -c docker-compose.yml stackdemo
Creating network stackdemo_default
Creating service stackdemo_web
Creating service stackdemo_redis

However, the services never get to the running state even though no error is given. The current state remains ‘New’ indefinitely:

:~/stacktest$ docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
mhspwv44uljg registry replicated 1/1 registry:2 *:5000->5000/tcp
3suwj8vmoi33 stackdemo_redis replicated 0/1 redis:alpine
tiyihji32p0o stackdemo_web replicated 0/1 127.0.0.1:5000/stackdemo:latest
6wslf6nds61p viz replicated 1/1 dockersamples/visualizer:latest *:8080->8080/tcp
:~/stacktest$ docker service ps stackdemo_redis
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
xcp6v1vq70nz stackdemo_redis.1 redis:alpine Running New 13 minutes ago
:~/stacktest$ docker service ps stackdemo_web
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
ilfhstti03cz stackdemo_web.1 127.0.0.1:5000/stackdemo:latest Running New 14 minutes ago

If I run dockerd -D and retry the stack deploy, the debug log shows an error saying the stackdemo_default network is not found:

DEBU[2021-05-05T14:19:52.704978621-04:00] swarm-1: Initiating bulk sync for networks [bgp2p8e98hcml6dn1oj9ohia2] with node 97d5a9f34ab2
DEBU[2021-05-05T14:19:52.904793793-04:00] Calling HEAD /_ping
DEBU[2021-05-05T14:19:52.912946386-04:00] Calling HEAD /_ping
DEBU[2021-05-05T14:19:52.977910738-04:00] Calling GET /v1.41/info
DEBU[2021-05-05T14:19:53.044689994-04:00] Calling GET /v1.41/networks?filters=%7B%22label%22%3A%7B%22com.docker.stack.namespace%3Dstackdemo%22%3Atrue%7D%7D
DEBU[2021-05-05T14:19:53.105499434-04:00] Calling POST /v1.41/networks/create
DEBU[2021-05-05T14:19:53.105694071-04:00] form data: {"Attachable":false,"CheckDuplicate":false,"ConfigFrom":null,"ConfigOnly":false,"Driver":"overlay","EnableIPv6":false,"IPAM":null,"Ingress":false,"Internal":false,"Labels":{"com.docker.stack.namespace":"stackdemo"},"Name":"stackdemo_default","Options":null,"Scope":""}
DEBU[2021-05-05T14:19:53.182405347-04:00] Calling GET /v1.41/services?filters=%7B%22label%22%3A%7B%22com.docker.stack.namespace%3Dstackdemo%22%3Atrue%7D%7D
DEBU[2021-05-05T14:19:53.240936740-04:00] Calling GET /v1.41/distribution/127.0.0.1:5000/stackdemo:latest/json
DEBU[2021-05-05T14:19:53.305659117-04:00] Calling POST /v1.41/services/create
DEBU[2021-05-05T14:19:53.305841802-04:00] form data: {"EndpointSpec":{"Ports":[{"Protocol":"tcp","PublishMode":"ingress","PublishedPort":8000,"TargetPort":8000}]},"Labels":{"com.docker.stack.image":"127.0.0.1:5000/stackdemo","com.docker.stack.namespace":"stackdemo"},"Mode":{"Replicated":{}},"Name":"stackdemo_web","TaskTemplate":{"ContainerSpec":{"Image":"127.0.0.1:5000/stackdemo:latest","Labels":{"com.docker.stack.namespace":"stackdemo"},"Privileges":{"CredentialSpec":null,"SELinuxContext":null}},"ForceUpdate":0,"Networks":[{"Aliases":["web"],"Target":"stackdemo_default"}],"Placement":{},"Resources":{}}}
DEBU[2021-05-05T14:19:53.306914737-04:00] error handling rpc error="rpc error: code = NotFound desc = network stackdemo_default not found" rpc=/docker.swarmkit.v1.Control/GetNetwork

But this is what I get when I list networks:

:~/stacktest$ docker network ls
NETWORK ID NAME DRIVER SCOPE
a4ff3b97d648 bridge bridge local
aa463181d7a1 docker_gwbridge bridge local
bacf509c8c32 host host local
bgp2p8e98hcm ingress overlay swarm
9bcd7049fa90 none null local
wrf071cu1k0i stackdemo_default swarm

The same thing happens when using another custom stack. I’m stumped. Any ideas for troubleshooting this?

I figured out the problem. I had initialized the swarm with a /16 default address pool and a default address pool mask length of 16, so the pool held exactly one /16 subnet. The ingress network created when the swarm was initialized took that subnet, leaving no address space for any other overlay network such as stackdemo_default. I was able to troubleshoot this using concepts from the tutorial at Networking with overlay networks | Docker Documentation.

This situation arose because I had misconfigured the new swarm: with that address pool it could never deploy new stacks with default networks. The same issue could occur on an older swarm that has run out of address space for new default networks. It’s tough to troubleshoot if you don’t know where to look, because deploying the stack produces no error message.
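
For anyone hitting the same symptom, one place to look is the swarm's configured address pool next to the subnets that overlay networks have actually been given. The sketch below is only a starting point. It assumes it runs on a manager node and that `docker info` prints `Default Address Pool` and `SubnetSize` lines in its Swarm section (the exact wording may differ between Docker versions). The function name `show_pool_usage` is made up for illustration.

```shell
# Sketch: compare the swarm's address pool with the subnets held by swarm-scoped networks.
show_pool_usage() {
  # Pool and per-network subnet size, as reported by `docker info` (assumed line labels).
  docker info 2>/dev/null | grep -E 'Default Address Pool|SubnetSize'

  # For each swarm-scoped network, print the subnet(s) in its IPAM config, if any.
  for net in $(docker network ls --filter scope=swarm --format '{{.Name}}'); do
    subnets=$(docker network inspect "$net" --format '{{range .IPAM.Config}}{{.Subnet}} {{end}}')
    printf '%s: %s\n' "$net" "${subnets:-<no subnet allocated>}"
  done
}
```

If one network's subnet already covers the whole pool and a newer network shows no subnet at all, the pool is probably exhausted, which would match the behaviour described above.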

The original (incorrect) swarm init command, which only has room for one network, was:

docker swarm init --advertise-addr 128.253.109.225 --default-addr-pool 10.0.0.0/16 --default-addr-pool-mask-length 16

The new (correct) swarm init command, which has room for 4096 networks, is:

docker swarm init --advertise-addr 128.253.109.225 --default-addr-pool 10.0.0.0/12 --default-addr-pool-mask-length 24
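
The network counts for the two commands follow from the prefix lengths: a pool with prefix length P, split into subnets of mask length M, holds 2^(M-P) subnets. Here is a small sketch of that arithmetic. The function name `pool_capacity` is hypothetical, not a Docker command.

```shell
# Number of subnets of mask length $2 that fit in a pool with prefix length $1.
pool_capacity() {
  pool_prefix=$1
  mask_length=$2
  # A subnet cannot be larger than the pool it is carved from.
  if [ "$mask_length" -lt "$pool_prefix" ]; then
    echo "mask length must be >= pool prefix length" >&2
    return 1
  fi
  echo $(( 1 << (mask_length - pool_prefix) ))
}

pool_capacity 16 16   # original init (10.0.0.0/16, mask length 16): prints 1
pool_capacity 12 24   # new init (10.0.0.0/12, mask length 24): prints 4096
```

With a capacity of 1, the ingress network uses the only subnet available, so every later overlay network has nowhere to go.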