Swarm manager loses state in one-node swarm

We have an application deployed via Swarm in a single-node swarm, i.e. manager and worker are the same machine. I’ve noticed “recently” that pulling logs for failed tasks fails – it looks like the Manager functionality is losing contact/state with the Worker functionality:

$ docker service logs krpc47lhliv8
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node p5ccjzptq7ulm2p6vs8j6x00n is not available

$ docker node ls
ID                            HOSTNAME                   STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
p5ccjzptq7ulm2p6vs8j6x00n *   somevm.myhost.com   Ready     Active         Leader           20.10.23

$ docker --version
Docker version 20.10.23, build 7155243

$ head -5 /etc/os-release
NAME="Oracle Linux Server"
VERSION="8.7"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"

$ uname -a
Linux somevm.myhost.com 5.15.0-6.80.3.1.el8uek.x86_64 #2 SMP Tue Jan 10 11:28:16 PST 2023 x86_64 x86_64 x86_64 GNU/Linux

I say “recently” above because this used to work, with at least one upgrade of the deployed application, and possibly a Docker upgrade, in the interim.

I can still find the dead container with docker ps -a and pull the logs with docker logs directly, but that’s a workaround that ends once we need more than one node in the swarm.
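The manual workaround above can be scripted. This is a minimal sketch, assuming it runs on the node that hosted the task and that the task object records its container ID under `.Status.ContainerStatus.ContainerID`. The helper name `task_logs` is hypothetical.

```shell
# Hypothetical helper: print the logs of the container behind a swarm task,
# bypassing `docker service logs`. Must run on the node that ran the task.
task_logs() {
    task_id="$1"
    # Look up the container ID recorded in the task's status.
    container_id=$(docker inspect --format '{{.Status.ContainerStatus.ContainerID}}' "$task_id") || return 1
    if [ -z "$container_id" ]; then
        echo "no container recorded for task $task_id" >&2
        return 1
    fi
    # Read the logs straight from the local engine.
    docker logs "$container_id"
}
```

Calling `task_logs <taskid>` would then replace the `docker ps -a` lookup, but like the manual version it only sees containers on the local node.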

Web searches haven’t yielded much of obvious help.

Does anyone have suggestions for what to try next?

So if I understand you correctly, you can still list nodes and create new services, but you can’t get the logs? Or is docker service create failing as well?

Sorry for the delay in responding. And thank you for your time and attention.

Yes, you’ve summarized correctly. I was able to successfully create a new service:

$ docker service create --replicas 1 --name helloworld alpine ping docker.com
piukqnxzn73836h11k5wsyhgz
overall progress: 1 out of 1 tasks 
1/1: running   [==================================================>] 
verify: Service converged 

$ docker service ls
ID             NAME                MODE         REPLICAS   IMAGE                                                 PORTS
[...]
piukqnxzn738   helloworld          replicated   1/1        alpine:latest                                         

But I still can’t gather logs for any of the services/tasks, including the newly-created service:

$ docker service logs piukqnxzn738
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node p5ccjzptq7ulm2p6vs8j6x00n is not available

$ docker service logs helloworld
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node p5ccjzptq7ulm2p6vs8j6x00n is not available

I can even define and deploy a new stack:

$ cat docker-compose.yml
version: '3.6'

services:
  dpinger:
    image: 'alpine'
    command: 'ping docker.com'
  gpinger:
    image: 'alpine'
    command: 'ping google.com'

$ docker stack deploy --compose-file docker-compose.yml pingers
Creating network pingers_default
Creating service pingers_gpinger
Creating service pingers_dpinger

$ docker stack ps pingers
ID             NAME                IMAGE           NODE                       DESIRED STATE   CURRENT STATE            ERROR     PORTS
8do1h05synif   pingers_dpinger.1   alpine:latest   oncs-blackduck.ciena.com   Running         Running 15 seconds ago
kpfiv5ao4vkz   pingers_gpinger.1   alpine:latest   oncs-blackduck.ciena.com   Running         Running 15 seconds ago

But still no dice on getting logs:

$ docker service logs pingers_dpinger
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node p5ccjzptq7ulm2p6vs8j6x00n is not available

Weird, eh?

I found this issue

which mentions

warning: incomplete log stream

Someone wrote that log rotation helped, but it didn’t work for everyone. Maybe you can find a comment there that helps.
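For anyone who wants to try the log rotation idea, here is a minimal sketch of enabling rotation for a single service. It assumes the service uses the `json-file` log driver. The helper name and the size and file-count values are illustrative, not taken from the linked issue, and it is only a guess that this is what the commenter meant by log rotation.

```shell
# Hypothetical helper: enable json-file log rotation for one service.
# The max-size and max-file values are illustrative assumptions.
enable_log_rotation() {
    service="$1"
    docker service update \
        --log-driver json-file \
        --log-opt max-size=10m \
        --log-opt max-file=3 \
        "$service"
}
```

For example, `enable_log_rotation helloworld`. Note that updating a service this way redeploys its tasks.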

Thank you for digging that up.

I’ve read the entire thread, and unfortunately it doesn’t look like there’s any real fix.

The really odd thing is that docker service logs <taskid> fails but docker service logs --follow <taskid> works fine.
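Since `--follow` works, one possible stopgap is to follow the logs for a fixed window and then stop. This is a rough sketch, assuming GNU coreutils `timeout` is available. The helper name `follow_logs` and the 5-second default are hypothetical.

```shell
# Hypothetical helper: stream logs with --follow for a fixed window, then stop.
# Usage: follow_logs <service-or-task> [seconds]
follow_logs() {
    target="$1"
    seconds="${2:-5}"
    # `timeout` ends the stream after $seconds and exits with status 124
    # when the limit is hit, which is the expected outcome here.
    timeout "$seconds" docker service logs --follow "$target"
    status=$?
    [ "$status" -eq 124 ] && return 0
    return "$status"
}
```

For example, `follow_logs helloworld 10 > helloworld.log` would capture whatever the stream delivers in ten seconds.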

The thread started in September 2017 and it’s now March 2023, so I don’t have much hope for a fix soon. :frowning:

Have you checked if corrupt logs are the reason?

for log in /var/lib/docker/containers/*/*-json.log; do jq . < "$log" > /dev/null 2>&1 || echo "$log corrupted"; done

The command uses jq to check whether each container log is valid JSON. If a log is not, its filename is printed to the console. It needs to be run on each node.

Interesting idea. Thanks for the suggestion.

Your suggested pipeline produces no indication that any of the logs are corrupt. I expected this outcome, as the error reported indicates that the Docker Swarm Manager node can’t talk to the Worker node (even though they are the same VM), which would be a precondition to reading the log files. As I stated in the original post, I can still successfully retrieve logs by bypassing the Swarm functionality via docker logs, so the log files themselves must be intact.

So did the setup change between the first post, where you had a single-node swarm, and now?
A node is either a manager node or a worker node. A manager node can not be a manager node and a worker node at the same time.

So being able to use swarm-related commands (docker node, docker service, docker stack, …) indicates that the swarm mode itself works. To me, it sounded like only the logs can’t be fetched by docker service logs. The log output warning: incomplete log stream made me think about issues I had years ago with corrupt logs.

Did you try to use docker service logs {service name} instead of the task ID? I didn’t know a task ID can be used instead of the service name. Then again, it has been 4 years since I used swarm in my job. I still run a swarm pet cluster in my homelab.

As there is no official Docker release for Oracle Linux, you might want to identify the maintainer of the package you installed and raise an issue, assuming the package is supported on that OS. It could very well be an incompatibility or bug. It doesn’t really make sense that only the logs are affected.

No, the setup has not changed. When I referred to “Manager” and “Worker” in my previous post, I meant to refer to the Docker software entities, both running on the same VM in my case. My use of the word “node” in that context was misleading because, as you rightly point out, a given (virtual) machine can only be one type of node or the other. Sorry for the confusion.

Yes, I’ve tried specifying the service by name, and the results are the same. My primary use case for docker service logs is retrieving logs for failed instances of services, so specifying the task ID is the preferred option for me.

As demonstrated by the link that @rimelek provided above, this issue is not new, and not unique to Oracle Linux. It’s been around since at least Docker 17.06, and folks in that thread report seeing it on Ubuntu, Debian, Arch, and possibly others.