Container in Swarm missing Node

Hey guys,

Issue: The stack deploys, but some containers show as “New” and never start running. There also isn’t a node assigned to them. Check out the output of the docker stack ps command:

ID             NAME                IMAGE                                     NODE   DESIRED STATE   CURRENT STATE           ERROR   PORTS
goo9f6e5evmt   bdnet_bdnet-bls.1   evolvecomputing/evolve:binLogService-v1          Running         New about an hour ago

There isn’t a node set.

OS Version: CentOS 7.8

Docker Version:
Client: Docker Engine - Community
Version: 19.03.13
API version: 1.40
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:03:45 2020
OS/Arch: linux/amd64
Experimental: false

Server: Docker Engine - Community
Engine:
Version: 19.03.13
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:02:21 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.7
GitCommit: 8fba4e9a7d01810a393d5d25a3621dc101981175
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683

Reproduce: I simply run docker stack deploy against the docker-compose.yml and the services start.

This problem seemed to start about a week or so ago. I don’t remember any updates that would have caused it.

Thanks,
Chuck

The output of the command is hard to read. Can you repost it, but this time use docker stack ps ${your stack name} --format '{{json .}}' to get JSON output and post that instead?

Sorry about that. Here is the json:

{"CurrentState":"New 2 hours ago","DesiredState":"Running","Error":"","ID":"goo9f6e5evmt","Image":"evolvecomputing/evolve:binLogService-v1","Name":"bdnet_bdnet-bls.1","Node":"","Ports":""}

{"CurrentState":"New 2 hours ago","DesiredState":"Running","Error":"","ID":"1m2pk9nwietj","Image":"evolvecomputing/spaw:activeCompServiceProd-latest","Name":"bdnet_bdnet-acs.1","Node":"","Ports":""}

{"CurrentState":"Running 2 hours ago","DesiredState":"Running","Error":"","ID":"83zwass96nnu","Image":"evolvecomputing/spaw:webProd-latest","Name":"bdnet_bdnet-web.1","Node":"dnode1","Ports":""}

{"CurrentState":"Running 2 hours ago","DesiredState":"Running","Error":"","ID":"nlybrfqgb9k8","Image":"evolvecomputing/maria-db:latest","Name":"bdnet_bdnet-db.1","Node":"dnode1","Ports":""}

Hmm, indeed the Node information is missing, but at the same time Error is empty. That’s odd.
Can you try docker stack ps ${your stack name} --format '{{json .}}' --no-trunc to get more detailed output?
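As an aside, once the output is one JSON object per line, picking out the stuck tasks is easy to script. A sketch (the field names come from the {{json .}} output; the sample lines are shortened copies of the ones posted above):

```python
import json

def stuck_tasks(lines):
    """Return names of tasks that should be Running but have no node assigned."""
    tasks = (json.loads(line) for line in lines if line.strip())
    return [t["Name"] for t in tasks
            if t["DesiredState"] == "Running" and not t["Node"]]

# Sample lines copied (shortened) from the docker stack ps JSON output above.
sample = [
    '{"CurrentState":"New 2 hours ago","DesiredState":"Running","Error":"","ID":"goo9f6e5evmt","Image":"evolvecomputing/evolve:binLogService-v1","Name":"bdnet_bdnet-bls.1","Node":"","Ports":""}',
    '{"CurrentState":"Running 2 hours ago","DesiredState":"Running","Error":"","ID":"83zwass96nnu","Image":"evolvecomputing/spaw:webProd-latest","Name":"bdnet_bdnet-web.1","Node":"dnode1","Ports":""}',
]
print(stuck_tasks(sample))  # only the bls task has no node
```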

Looks like the scheduler did not even assign the task to a node. Though, in case of scheduling problems, I would expect Error not to be empty…

{"CurrentState":"New 2 hours ago","DesiredState":"Running","Error":"","ID":"goo9f6e5evmtqdd7bcwbd2m3p","Image":"evolvecomputing/evolve:binLogService-v1@sha256:a680b530d99d6d7aab81123600c5400740293c96c16536edca1cfd3c071c244d","Name":"bdne_bdne-bls.1","Node":"","Ports":""}

{"CurrentState":"New 2 hours ago","DesiredState":"Running","Error":"","ID":"1m2pk9nwietj8o4y2pq7gi9yh","Image":"evolvecomputing/spaw:activeCompServiceProd-latest@sha256:e2d34f7f1b351502a6dc8251eb52c912750e9677622830fecec66c3db84c19c3","Name":"bdne_bdne-acs.1","Node":"","Ports":""}

{"CurrentState":"Running 2 hours ago","DesiredState":"Running","Error":"","ID":"83zwass96nnuqni5fqfpbzgyy","Image":"evolvecomputing/spaw:webProd-latest@sha256:5852d7fa06aae2cceb799d670c2297b21b79ae86f41a5c4895bafaf7cb4b52a0","Name":"bdne_bdne-web.1","Node":"dnode1","Ports":""}

{"CurrentState":"Running 2 hours ago","DesiredState":"Running","Error":"","ID":"nlybrfqgb9k84sern6jng1jeb","Image":"evolvecomputing/maria-db:latest@sha256:ae093e0bedf3c20f12e40b840bca665cde1faef5b535cf907afac65368db1c29","Name":"bdne_bdne-db.1","Node":"dnode1","Ports":""}

Hmm, according to https://docs.docker.com/engine/swarm/how-swarm-mode-works/swarm-task-states/ your service is stuck in the initial state. I have never seen a deployment being stuck in that state without providing ANY sort of detail that would help with root cause analysis. You can check today’s systemd logs for errors with `sudo journalctl -t dockerd -t docker -S "$(date +%F)"` and see if something suspicious pops up.
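If the journal is noisy, a small filter helps. A sketch that keeps only warning/error lines, assuming dockerd’s key=value log format (the sample lines below are made up for illustration):

```python
import re

# dockerd logs in key=value format, e.g. level=warning msg="..."
LEVEL_RE = re.compile(r'level=(warning|error)')

def problem_lines(lines):
    """Keep only journal lines that dockerd logged at warning or error level."""
    return [line for line in lines if LEVEL_RE.search(line)]

# Made-up sample lines mimicking dockerd's log format.
sample = [
    'dockerd[11919]: time="..." level=info msg="ignoring event"',
    'dockerd[11919]: time="..." level=warning msg="Entry was not in db"',
    'dockerd[11919]: time="..." level=error msg="task allocation failure"',
]
for line in problem_lines(sample):
    print(line)
```

In practice you would pipe the journalctl output into this rather than hard-code the lines.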

Do you use any additional docker volume or network plugins or a non standard runc implementation (e.g. the one from nvidia)?

I pretty much use the basic swarm setup. Nothing special at all. No special network plugins or anything.

Log events that are suspicious:
dockerd[11919]: time="2020-10-08T08:28:01.652266185-04:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

dockerd[11919]: time="2020-10-08T08:28:15.896577911-04:00" level=warning msg="Entry was not in db: nid:5ln37mtgvzvfc4978muc20bjn eid:48de25bd7ab4fbf59dc05567bab645c852eb989e59ba1262154b44691b066385 peerIP:10.0.39.26 peerMac:02:42:0a:00:27:1a isLocal:false vtep:10.10.10.119"

dockerd[11919]: time="2020-10-08T08:28:15.896673613-04:00" level=warning msg="Peer operation failed:could not delete fdb entry for nid:5ln37mtgvzvfc4978muc20bjn eid:48de25bd7ab4fbf59dc05567bab645c852eb989e59ba1262154b44691b066385 into the sandbox:Search neighbor failed for IP 10.10.10.119, mac 02:42:0a:00:27:1a, present in db:false op:&{2 5ln37mtgvzvfc4978muc20bjn 48de25bd7ab4fbf59dc05567bab645c852eb989e59ba1262154b44691b066385 [0 0 0 0 0 0 0 0 0 0 255 255 10 0 39 26] [255 255 255 0] [2 66 10 0 39 26] [0 0 0 0 0 0 0 0 0 0 255 255 10 10 10 119] false false false EventNotify}"

This looks inconclusive to me. Do both nodes have enough free space in /var/lib/docker? I have had situations where the swarm raft log was broken on a single node, but could be recovered from the other manager nodes - though this only works with a setup of three or more manager nodes.
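For the free-space check, `df -h /var/lib/docker` on each node is the quickest way. The equivalent in Python, for completeness (using "/" as a stand-in path here, since /var/lib/docker usually lives on the root filesystem; adjust if your data root differs):

```python
import shutil

def free_gib(path):
    """Free space on the filesystem containing the given path, in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30

# "/var/lib/docker" is Docker's default data root; "/" is used here as a
# stand-in because the data root typically sits on the root filesystem.
print(f"{free_gib('/'):.1f} GiB free")
```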

Other than that: I have no idea what might cause this situation. I hope you will eventually find a solution and share it with us.

This is an interesting error:

level=error msg="task allocation failure" error="failed to allocate network IP for task qyn5fr80yjyz0okvwvyl7ijge network hgjy9rs9qpqf6pztq8gjuanlt: could not find an available IP" module=node node.id=ja75y7yaeqrui4cnyj6dktubx

Seems it can’t find an available IP address for the task. Perhaps it’s hanging on that. Thoughts?

And I have over 500 GB free for /var/lib/docker.

It does make sense that this is the problem, or at least part of it.
Did you try to redeploy the services with a different overlay network? Can you rule out an IP-range collision between the overlay network and your real network and routing rules?
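One way to rule out a collision is to compare the ranges programmatically. A sketch with Python's ipaddress module (the subnets below are made-up examples - substitute your real overlay subnet from docker network inspect and your physical LAN range):

```python
import ipaddress

def collides(overlay_cidr, lan_cidr):
    """True if the two CIDR ranges share any addresses."""
    return ipaddress.ip_network(overlay_cidr).overlaps(
        ipaddress.ip_network(lan_cidr))

# Hypothetical example values -- replace with your actual overlay subnet
# and your physical network range.
print(collides("10.0.39.0/24", "10.0.0.0/16"))   # overlay inside the LAN
print(collides("172.28.0.0/16", "10.0.0.0/16"))  # disjoint ranges
```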

The challenge is that I need a bigger network than a /24 for these services. I want them all on one network, talking to each other, and this network is going to get BIG. I don’t see any way to update the existing swarm overlay network; it appears I have to delete and recreate it.

I could use something like this:

docker network create \
  --driver=bridge \
  --subnet=172.28.0.0/16 \
  --ip-range=172.28.0.0/16 \
  --gateway=172.28.31.254 \
  br0

That would give me over 1 million IP addresses.

Do you concur?

Actually…mistake there. Here is the correct one:

docker network create \
  --driver=overlay \
  --scope=swarm \
  --subnet=172.16.0.0/12 \
  --ip-range=172.16.0.0/12 \
  --gateway=172.28.31.254 \
  br0
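For what it’s worth, the /12 does give the address count you’re after. A quick sanity check on the math with Python’s ipaddress module:

```python
import ipaddress

# Compare the raw address space of the subnets discussed in this thread.
for cidr in ("172.28.0.0/24", "172.28.0.0/16", "172.16.0.0/12"):
    net = ipaddress.ip_network(cidr)
    print(cidr, "->", net.num_addresses, "addresses")
# /24 -> 256, /16 -> 65,536, /12 -> 1,048,576 (before subtracting the
# network/broadcast addresses and anything Docker reserves, e.g. the gateway)
```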

I assume --driver=bridge was a typo.
Apart from that it looks good to me!

Update: your updated command looks fine. Though, I am unclear whether you need to provide an IPAM strategy to get dynamic IP assignment going.

I didn’t need an IPAM strategy before, so it’s probably fine.

Thanks so much for your help. I truly appreciate you taking the time.

Have a great day/evening.

Many Thanks,
Chuck

Welcome!

It seems that /24 subnets are used for a reason: https://docs.docker.com/engine/reference/commandline/network_create/#overlay-network-limitations.
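If I read that page correctly, the recommendation is to stay at /24 per overlay network (256 addresses) and create multiple smaller networks instead of one huge one. To put the sizes in perspective, a /12 worth of address space corresponds to 4096 such /24 blocks:

```python
import ipaddress

# Split the /12 address space into the /24 blocks the docs recommend
# using per overlay network.
big = ipaddress.ip_network("172.16.0.0/12")
blocks = list(big.subnets(new_prefix=24))
print(len(blocks))            # 4096 /24 networks fit in one /12
print(blocks[0], blocks[-1])  # first and last block
```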

hmmm…okay. I’ll take a look at that.

Thanks.