Swarm manager loses state in one-node swarm

We have an application deployed via Swarm in a single-node swarm, i.e. manager and worker are the same machine. I’ve noticed “recently” that pulling logs for failed tasks fails – it looks like the Manager functionality is losing contact/state with the Worker functionality:

$ docker service logs krpc47lhliv8
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node p5ccjzptq7ulm2p6vs8j6x00n is not available

$ docker node ls
ID                            HOSTNAME                   STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
p5ccjzptq7ulm2p6vs8j6x00n *   somevm.myhost.com   Ready     Active         Leader           20.10.23

$ docker --version
Docker version 20.10.23, build 7155243

$ head -5 /etc/os-release
NAME="Oracle Linux Server"
VERSION="8.7"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"

$ uname -a
Linux somevm.myhost.com 5.15.0-6.80.3.1.el8uek.x86_64 #2 SMP Tue Jan 10 11:28:16 PST 2023 x86_64 x86_64 x86_64 GNU/Linux

I say “recently” above because this used to work, with at least one upgrade of the deployed application, and possibly a Docker upgrade, in the interim.

I can still find the dead container with docker ps -a and pull the logs with docker logs directly, but that’s a workaround that ends once we need more than one node in the swarm.
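
For illustration, the workaround looks like this (the container ID and name below are made up):

$ docker ps -a --filter status=exited --format '{{.ID}}  {{.Names}}  {{.Status}}'
3f2a1b4c5d6e  myapp_web.1.krpc47lhliv8  Exited (1) 2 hours ago

$ docker logs 3f2a1b4c5d6e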

Web searches haven’t yielded much obvious help.

Does anyone have suggestions for what to try next?

So if I understand you correctly, you can still list nodes and create new services, but you can’t get the logs? Or is docker service create failing as well?

Sorry for the delay in responding. And thank you for your time and attention.

Yes, you’ve summarized correctly. I was able to successfully create a new service:

$ docker service create --replicas 1 --name helloworld alpine ping docker.com
piukqnxzn73836h11k5wsyhgz
overall progress: 1 out of 1 tasks 
1/1: running   [==================================================>] 
verify: Service converged 

$ docker service ls
ID             NAME                MODE         REPLICAS   IMAGE                                                 PORTS
[...]
piukqnxzn738   helloworld          replicated   1/1        alpine:latest                                         

But I still can’t gather logs for any of the services/tasks, including the newly-created service:

$ docker service logs piukqnxzn738
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node p5ccjzptq7ulm2p6vs8j6x00n is not available

$ docker service logs helloworld
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node p5ccjzptq7ulm2p6vs8j6x00n is not available

I can even define and deploy a new stack:

$ cat docker-compose.yml
version: '3.6'

services:
  dpinger:
    image: 'alpine'
    command: 'ping docker.com'
  gpinger:
    image: 'alpine'
    command: 'ping google.com'

$ docker stack deploy --compose-file docker-compose.yml pingers
Creating network pingers_default
Creating service pingers_gpinger
Creating service pingers_dpinger

$ docker stack ps pingers
ID             NAME                IMAGE           NODE                DESIRED STATE   CURRENT STATE            ERROR     PORTS
8do1h05synif   pingers_dpinger.1   alpine:latest   somevm.myhost.com   Running         Running 15 seconds ago
kpfiv5ao4vkz   pingers_gpinger.1   alpine:latest   somevm.myhost.com   Running         Running 15 seconds ago
But still no dice on getting logs:

$ docker service logs pingers_dpinger
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node p5ccjzptq7ulm2p6vs8j6x00n is not available

Weird, eh?

I found this issue

which mentions

warning: incomplete log stream

Someone wrote that log rotation helped, but it didn’t work for everyone. Maybe you can find a comment that helps.
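
If you want to try that, log rotation for the default json-file log driver is configured in /etc/docker/daemon.json, roughly like this (the size and file-count values below are just examples), followed by a restart of the Docker daemon:

$ cat /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

Note that these options only apply to containers created after the change.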

Thank you for digging that up.

I’ve read the entire thread, and unfortunately it doesn’t look like there’s any real fix.

The really odd thing is that docker service logs <taskid> fails but docker service logs --follow <taskid> works fine.
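
For anyone else hitting this, the working variant looks like this for me (the --tail count is arbitrary; Ctrl-C stops following):

$ docker service logs --follow --tail 100 piukqnxzn738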

The thread started in September 2017, and it’s now March 2023, so I don’t have much hope for a fix soon. :frowning:

Have you checked if corrupt logs are the reason?

for log in /var/lib/docker/containers/*/*-json.log; do jq . "$log" > /dev/null 2>&1 || echo "$log corrupted"; done

The command uses jq to check whether a container log is valid JSON. If it’s not, the filename will be printed to the console. It needs to be run on each node.


Interesting idea. Thanks for the suggestion.

Your suggested pipeline produces no indication that any of the logs are corrupt. I expected this outcome, as the error reported indicates that the Docker Swarm Manager node can’t talk to the Worker node (even though they are the same VM), which would be a precondition to reading the log files. As I stated in the original post, I can still successfully retrieve logs by bypassing the Swarm functionality via docker logs, so the log files themselves must be intact.

So did the setup change between the first post, where you had a single-node swarm, and now?
A node is either a manager node or a worker node. It cannot be both at the same time.
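
If in doubt, you can check the role of a node with docker node inspect, e.g. with the node ID from your output:

$ docker node inspect --format '{{ .Spec.Role }}' p5ccjzptq7ulm2p6vs8j6x00n
manager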

So being able to use swarm-related commands (docker node, docker service, docker stack, …) indicates that the swarm mode itself works. To me, it sounded like only the logs can’t be fetched by docker service logs. The log output warning: incomplete log stream made me think about issues I had years ago with corrupt logs.

Did you try to use docker service logs {service name} instead of the task ID? I didn’t know a task ID could be used instead of the service name. Though, it’s been 4 years since I’ve used swarm in my job. I still run a swarm pet cluster in my homelab.

As there is no official Docker release for Oracle Linux, you might want to identify the maintainer of the package you installed and raise an issue - of course, that is if the package is supported on that OS. It could very well be an incompatibility/bug. It doesn’t really make sense that only the logs are affected.
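
On an RPM-based distro, something like this should show where the package came from (assuming it is named docker-ce; adjust to whatever you actually installed):

$ rpm -qi docker-ce | grep -E '^(Vendor|Packager|URL)'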

No, the setup has not changed. When I referred to “Manager” and “Worker” in my previous post, I meant to refer to the Docker software entities, both running on the same VM in my case. My use of the word “node” in that context was misleading because, as you rightly point out, a given (virtual) machine can only be one type of node or the other. Sorry for the confusion.

Yes, I’ve tried specifying the service by name, and the results are the same. My primary use case for docker service logs is retrieving logs for failed instances of services, so specifying the task ID is the preferred option for me.
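
For the record, I find the task IDs of failed instances with something like this (helloworld is just the example service from above):

$ docker service ps --no-trunc --filter 'desired-state=shutdown' helloworld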

As demonstrated by the link that @rimelek provided above, this issue is not new, and not unique to Oracle Linux. It’s been around since at least Docker 17.06, and folks in that thread report seeing it on Ubuntu, Debian, Arch, and possibly others.

I doubt that it’s a general problem, as it works like a charm on my homelab 3-node, manager-only cluster using the latest docker-ce and Ubuntu 20.04.

I hope you find a solution. Since you are running your swarm in a VM, why not test the behavior with one of the supported distributions and OS versions?

It’s obviously not a problem that occurs very frequently, or else Docker would have fixed it by now. I suspect that there is something particular about my stack that is tickling an otherwise hard-to-reach bug.

In fact, one poster in the thread linked above suggested that he can reproduce the issue at will, and that the key factors in doing so are (a) one or more services that produce high volumes of output to the console, and (b) network failures between nodes. Obviously (b) isn’t an issue for me, with the swarm being contained on a single node. But I do know that at least a couple of the 16 services in this stack are very verbose. We often push this application hard, so I’m sure it’s spouting a tonne of data to the console at peak times. If Docker is susceptible to losing sync while trying to keep up, I wouldn’t be surprised if this scenario is triggering it.
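
As a crude check of that verbosity, I can look at how large the json-file logs grow on disk, e.g.:

$ du -h /var/lib/docker/containers/*/*-json.log | sort -h | tail -3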

Our testing instance for the same application is literally a copy of the production system on which we see this issue – same VM specs, same OS, same Docker configuration, same stack configuration, and it uses a copy of the production database. The test instance has never manifested this problem. So again, there is something specific to the production instance that gives rise to the behaviour.

Thanks, I hope we find a solution too. :slight_smile: As for OS concerns, Oracle Linux is binary-compatible with RHEL, so it’s very unlikely that the OS is to blame… especially given the list of OSes that users have reported this issue against. (See my previous post and/or the linked thread above.) If we have to, I can try a different distribution, but given that this is a production system, I’m not going to bother until I have a much stronger indication that the OS is the culprit. Right now, there is no evidence to support that theory, and there is counter evidence demonstrating that the problem is OS-agnostic.

I’ve encountered the same problem, and rotating the CA doesn’t make any difference. The issue occurs every month in my test cluster, which has only one manager node. In my production cluster with three manager nodes, it happens every three months. Currently, my only solution is to restart Docker on the manager nodes. Is there any other solution?
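
For reference, the restart I mean is simply this (assuming systemd):

$ sudo systemctl restart docker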

Running Docker Swarm pretty stably on Debian, with 3 managers and 6 workers, even across different Docker versions now and then.

I only just killed my 600-day uptime because of an underlying VLAN issue on the provider’s side.