Issue: The stack runs the service, but some containers show as "New" and never start, and there isn't a node assigned either. Check out the output of the docker stack ps command:
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
goo9f6e5evmt bdnet_bdnet-bls.1 evolvecomputing/evolve:binLogService-v1 Running New about an hour ago
There isn’t a node set.
OS Version: CentOS 7.8
Docker Version:
Client: Docker Engine - Community
Version: 19.03.13
API version: 1.40
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:03:45 2020
OS/Arch: linux/amd64
Experimental: false
The output of the command is hard to read. Can you repost it, but this time use docker stack ps ${your stack name} --format '{{json .}}' to get JSON output and post that instead?
Hmm, indeed the Node information is missing, but at the same time Error is empty. That’s odd.
Can you try docker stack ps nzb --format '{{json .}}' --no-trunc to get a more detailed output?
Looks like the scheduler did not even create a task. Though in case of scheduling problems, I would expect Error not to be empty…
Hmm, according to https://docs.docker.com/engine/swarm/how-swarm-mode-works/swarm-task-states/ your service is stuck in the initial state. I have never seen a deployment being stuck in that state without actually providing ANY sort of details that would help with root cause analysis. You can check today's systemd logs for errors: `sudo journalctl -t dockerd -t docker -S "$(date +%F)"` and see if something suspicious pops up.
Do you use any additional Docker volume or network plugins, or a non-standard runc implementation (e.g. the one from Nvidia)?
I pretty much use the basic swarm setup. Nothing special at all. No special network plugins or anything.
Log events that are suspicious:
dockerd[11919]: time="2020-10-08T08:28:01.652266185-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
dockerd[11919]: time="2020-10-08T08:28:15.896577911-04:00" level=warning msg="Entry was not in db: nid:5ln37mtgvzvfc4978muc20bjn eid:48de25bd7ab4fbf59dc05567bab645c852eb989e59ba1262154b44691b066385 peerIP:10.0.39.26 peerMac:02:42:0a:00:27:1a isLocal:false vtep:10.10.10.119"
dockerd[11919]: time="2020-10-08T08:28:15.896673613-04:00" level=warning msg="Peer operation failed:could not delete fdb entry for nid:5ln37mtgvzvfc4978muc20bjn eid:48de25bd7ab4fbf59dc05567bab645c852eb989e59ba1262154b44691b066385 into the sandbox:Search neighbor failed for IP 10.10.10.119, mac 02:42:0a:00:27:1a, present in db:false op:&{2 5ln37mtgvzvfc4978muc20bjn 48de25bd7ab4fbf59dc05567bab645c852eb989e59ba1262154b44691b066385 [0 0 0 0 0 0 0 0 0 0 255 255 10 0 39 26] [255 255 255 0] [2 66 10 0 39 26] [0 0 0 0 0 0 0 0 0 0 255 255 10 10 10 119] false false false EventNotify}"
This looks inconclusive to me. Do both nodes have enough free space in /var/lib/docker? I had situations where the swarm raft log was broken on single nodes but could be recovered by the other manager nodes - though this only works with a 3(+) manager node setup.
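A quick way to check the free space on each node (this assumes the default Docker data root of /var/lib/docker; adjust the path if your daemon uses a different data-root):

```shell
# Show free space on the filesystem backing Docker's data root.
# /var/lib/docker is the default data-root; fall back to /var/lib if it's absent.
df -h /var/lib/docker 2>/dev/null || df -h /var/lib
```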
Other than that: I have no idea what might cause this situation. I hope you will eventually find a solution and share it with us.
level=error msg="task allocation failure" error="failed to allocate network IP for task qyn5fr80yjyz0okvwvyl7ijge network hgjy9rs9qpqf6pztq8gjuanlt: could not find an available IP" module=node node.id=ja75y7yaeqrui4cnyj6dktubx
Seems it can’t find an available IP address for the node. Perhaps it’s hanging on that. Thoughts?
It does make sense that this is the problem or at least part of the problem.
Did you try to redeploy the services with a different overlay network? Can you rule out an IP-range collision between the overlay network and your real network and routing rules?
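If you want to rule out a collision on paper, a small pure-bash sketch can tell you whether two IPv4 CIDR blocks overlap. The subnets below are made-up examples; substitute the overlay subnet reported by `docker network inspect` and your real LAN range:

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip2int() { local IFS=.; set -- $1; echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 )); }

# Succeed (exit 0) if the two CIDR blocks overlap: mask both network
# addresses with the shorter of the two prefixes and compare.
cidr_overlap() {
  local a_net=${1%/*} a_len=${1#*/} b_net=${2%/*} b_len=${2#*/}
  local len=$(( a_len < b_len ? a_len : b_len ))
  local mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip2int "$a_net") & mask )) -eq $(( $(ip2int "$b_net") & mask )) ]
}

# Example checks (subnets are assumptions, not taken from this thread):
cidr_overlap 10.0.39.0/24 10.0.0.0/16 && echo "overlap" || echo "no overlap"
cidr_overlap 172.28.0.0/16 10.0.0.0/8 && echo "overlap" || echo "no overlap"
```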
The challenge is that I need a bigger network than a /24 for these services. I want them all on the same network talking to each other. And this network is going to get BIG. I don't see any way to update the existing swarm overlay network. It appears I have to delete and recreate it.
I could use something like this:
docker network create \
--driver=overlay \
--subnet=172.28.0.0/16 \
--ip-range=172.28.0.0/16 \
--gateway=172.28.31.254 \
br0
That would give me over 65,000 IP addresses.
Do you concur?