gray380
(Gray380)
January 25, 2024, 11:17am
1
Hello,
When I try to remove a node from the cluster, the following error message appears:
```
docker node rm --force p5npyj7vms82jsiwmsywpietm
Error response from daemon: rpc error: code = Unknown desc = raft: raft message is too large and can't be sent
```
I’ve tried to remove it by the hostname as well, the result is the same.
journal:
```
Jan 25 12:12:16 sbtv-dock044 dockerd[1508]: time="2024-01-25T12:12:16.160354363+02:00" level=error msg="Handler for DELETE /v1.44/nodes/p5npyj7vms82jsiwmsywpietm returned error: rpc error: code = Unknown desc = raft: raft message is too large and can't be sent"
Jan 25 12:14:16 sbtv-dock044 dockerd[1508]: time="2024-01-25T12:14:16.284127977+02:00" level=error msg="Handler for DELETE /v1.44/nodes/sbtv-dock004 returned error: rpc error: code = Unknown desc = raft: raft message is too large and can't be sent"
```
The node list:
```
docker node ls
ID                          HOSTNAME       STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
nieeiv4ja9dxp00zlef952pba   sbtv-dock003   Ready    Active         Leader           25.0.1
p5npyj7vms82jsiwmsywpietm   sbtv-dock004   Down     Active                          25.0.1
n0x6vx2bz9wwb9hwss1g7932z   sbtv-dock005   Ready    Active                          25.0.1
ds3dspc1kc5nvwawyhyjn7yy7   sbtv-dock006   Ready    Active                          25.0.1
p69gynx3p0pz52d982ju6sltb   sbtv-dock007   Ready    Active         Reachable        25.0.1
pjotlc68rgmsjre3jh3t62quj * sbtv-dock044   Ready    Active         Reachable        25.0.1
```
We ran into this issue while updating the cluster nodes from 20.x to 25.x: one node stopped being a manager for some reason, and when we tried to rejoin it we ended up with two nodes that had the same hostname but different IDs in the cluster configuration. We then tried to remove the “failed” node and hit this error.
So we can demote, promote, pause, and drain this node, but not remove it.
Could you help to get rid of this node from the cluster configuration?
Docker info output:
```
Client: Docker Engine - Community
 Version:    25.0.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.12.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.24.2
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan
```
Best regards,
Serhiy.
rimelek
(Ákos Takács)
January 27, 2024, 10:00am
2
You mean you had Docker 20.x and updated to Docker 25.x in one step? I guess that could even make the nodes incompatible with each other.
Have you searched for the error messages and read issues like this?
GitHub issue (opened 05 Apr 2018, closed 14 Apr 2018; area/swarm):
I have a swarm with ~1300 nodes, and some enter and leave all the time (about 10/minute).
Since about a week ago, I'm experiencing an error when trying to remove dead nodes with `docker node rm xxxx` from the swarm:
```
Error response from daemon: rpc error: code = Unknown desc = raft: raft message is too large and can't be sent
```
All I see in the logs is the same:
```
Apr 5 12:35:31 ip-10-0-0-10 dockerd[1239]: time="2018-04-05T12:35:31.686809606Z" level=error msg="Error removing node x2nhsvmnzoaq5hp3xqfl2a7dp: rpc error: code = Unknown desc = raft: raft message is too large and can't be sent"
Apr 5 12:35:31 ip-10-0-0-10 dockerd[1239]: time="2018-04-05T12:35:31.687236652Z" level=error msg="Handler for DELETE /v1.37/nodes/x2nhsvmnzoaq5hp3xqfl2a7dp returned error: rpc error: code = Unknown desc = raft: raft message is too large and can't be sent"
Apr 5 12:35:36 ip-10-0-0-10 dockerd[1239]: time="2018-04-05T12:35:36.289574281Z" level=error msg="Error removing node kdmprxylwjmvutsfb9y1f2o17: rpc error: code = Unknown desc = raft: raft message is too large and can't be sent"
Apr 5 12:35:36 ip-10-0-0-10 dockerd[1239]: time="2018-04-05T12:35:36.289644704Z" level=error msg="Handler for DELETE /v1.37/nodes/kdmprxylwjmvutsfb9y1f2o17 returned error: rpc error: code = Unknown desc = raft: raft message is too large and can't be sent"
```
gray380
(Gray380)
January 28, 2024, 10:21am
3
Yes, thanks.
Certificate rotation does not help.
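For reference, here is the rotation that was attempted (assuming it was the standard swarm root CA rotation):

```bash
# Rotate the swarm root CA; node TLS certificates are reissued in the process
docker swarm ca --rotate
```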
coryaent
(Stephen)
July 23, 2024, 6:55pm
4
I have the same issue. I checked the GitHub issue related to too many node removals and ran docker node ls | wc -l. It returns 34, far fewer than the ~1300 nodes described there.
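Side note: `docker node ls` prints a header row, so `wc -l` reports the node count plus one. A quick mock (no live swarm needed, `mock_node_ls` is just a stand-in for the real command) to illustrate:

```bash
# `docker node ls | wc -l` counts the header line too, so N lines = N-1 nodes.
mock_node_ls() {
  printf 'ID        HOSTNAME   STATUS  AVAILABILITY  MANAGER STATUS\n'
  printf 'aaa111    node1      Ready   Active        Leader\n'
  printf 'bbb222    node2      Ready   Active\n'
}
lines=$(mock_node_ls | wc -l)
echo "$((lines - 1))"   # prints 2 (3 lines minus the header)
```

So 34 lines here means 33 node entries, still nowhere near ~1300.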
Update: Someone else is having the same issue, and apparently the solution could be to wait a few days and try again.
GitHub issue (opened 22 Feb 2024; status/0-triage, kind/bug, area/swarm, version/25.0):
### Description
I have a cluster of 3 nodes (see them below). I am unable to remove the node that shows as down, and it's breaking my cluster.
```
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
pcefq43rapf8mw8887inkztvi * <CENSORED>llbprmmid01.<CENSORED>.net Ready Active Reachable 25.0.3
xl8kfa7y56uu3vz1xsw0lxb61 <CENSORED>llbprmmid02.<CENSORED>.net Ready Active Reachable 25.0.3
d31t8u7reuky3kjyym84smsl4 <CENSORED>llbprmmid03.<CENSORED>.net Ready Active Leader 25.0.3
hdresbrj592dpjlh80gwsuy9h <CENSORED>llbprmmid03.<CENSORED>.net Down Drain 25.0.2
```
When I do a `docker node ls | wc -l` it returns
```
5
```
I found out that there was a similar issue reported before, https://forums.docker.com/t/removing-node-from-the-swarm-issue-raft-message-is-too-large-and-cant-be-sent/139518. I went through it and tried what they advised, but still no luck.
Any idea how I can fix this? This actually broke a production environment!
### Reproduce
1. docker node rm -f hdresbrj592dpjlh80gwsuy9h
### Expected behavior
docker node rm should remove the node that I want removed from the cluster without issues.
### docker version
```bash
Client: Docker Engine - Community
Version: 25.0.3
API version: 1.44
Go version: go1.21.6
Git commit: 4debf41
Built: Tue Feb 6 21:15:16 2024
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 25.0.3
API version: 1.44 (minimum version 1.24)
Go version: go1.21.6
Git commit: f417435
Built: Tue Feb 6 21:14:12 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.28
GitCommit: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
runc:
Version: 1.1.12
GitCommit: v1.1.12-0-g51d5e94
docker-init:
Version: 0.19.0
GitCommit: de40ad0
```
### docker info
```bash
Client: Docker Engine - Community
Version: 25.0.3
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.12.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.24.5
Path: /usr/libexec/docker/cli-plugins/docker-compose
scan: Docker Scan (Docker Inc.)
Version: v0.23.0
Path: /usr/libexec/docker/cli-plugins/docker-scan
Server:
Containers: 4
Running: 4
Paused: 0
Stopped: 0
Images: 5
Server Version: 25.0.3
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: active
NodeID: pcefq43rapf8mw8887inkztvi
Is Manager: true
ClusterID: nixy9iaupmii3yn6uidkw2l10
Managers: 3
Nodes: 4
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Data Path Port: 4789
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 3
Autolock Managers: false
Root Rotation In Progress: true
Node Address: 192.168.32.110
Manager Addresses:
192.168.32.110:2377
192.168.32.111:2377
192.168.32.112:2377
Runtimes: runc io.containerd.runc.v2
Default Runtime: runc
Init Binary: docker-init
containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
runc version: v1.1.12-0-g51d5e94
init version: de40ad0
Security Options:
seccomp
Profile: builtin
Kernel Version: 4.18.0-477.21.1.el8_8.x86_64
Operating System: Red Hat Enterprise Linux 8.8 (Ootpa)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.39GiB
Name: <CENSORED>llbprmmid01.<CENSORED>.net
ID: a7918f44-224e-45dd-abf4-3c95d61e0f6f
Docker Root Dir: /u02/docker
Debug Mode: false
HTTP Proxy: <CENSORED>
HTTPS Proxy: <CENSORED>
No Proxy: <CENSORED>
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
```
### Additional Info
Below is a snippet of the logs, from `journalctl -u docker.service -f`, after enabling debug and trying to remove node3 from node1
```
Feb 22 18:10:53 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:10:53.346690739Z" level=debug msg="sending heartbeat to manager { } with timeout 5s" method="(*session).heartbeat" module=node/agent node.id=pcefq43rapf8mw8887inkztvi session.id=72z12be65dn88tkuxtckrxxmj sessionID=72z12be65dn88tkuxtckrxxmj
Feb 22 18:10:53 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:10:53.347704874Z" level=debug msg="heartbeat successful to manager { }, next heartbeat period: 5.305433283s" method="(*session).heartbeat" module=node/agent node.id=pcefq43rapf8mw8887inkztvi session.id=72z12be65dn88tkuxtckrxxmj sessionID=72z12be65dn88tkuxtckrxxmj
Feb 22 18:10:56 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:10:56.465861670Z" level=debug msg="Calling HEAD /_ping"
Feb 22 18:10:56 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:10:56.466267559Z" level=debug msg="Calling DELETE /v1.44/nodes/hdresbrj592dpjlh80gwsuy9h?force=1"
Feb 22 18:10:56 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:10:56.806858949Z" level=debug msg="error handling rpc" error="rpc error: code = Unknown desc = raft: raft message is too large and can't be sent" rpc=/docker.swarmkit.v1.Control/RemoveNode
Feb 22 18:10:56 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:10:56.807027288Z" level=debug msg="Error removing node" error="rpc error: code = Unknown desc = raft: raft message is too large and can't be sent" node-id=hdresbrj592dpjlh80gwsuy9h
Feb 22 18:10:56 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:10:56.807091867Z" level=debug msg="FIXME: Got an API for which error does not match any expected type!!!" error="rpc error: code = Unknown desc = raft: raft message is too large and can't be sent" error_type="*status.Error" module=api
Feb 22 18:10:56 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:10:56.807104059Z" level=error msg="Handler for DELETE /v1.44/nodes/hdresbrj592dpjlh80gwsuy9h returned error: rpc error: code = Unknown desc = raft: raft message is too large and can't be sent"
Feb 22 18:10:56 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:10:56.807116950Z" level=debug msg="FIXME: Got an API for which error does not match any expected type!!!" error="rpc error: code = Unknown desc = raft: raft message is too large and can't be sent" error_type="*status.Error" module=api
Feb 22 18:10:58 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:10:58.653777697Z" level=debug msg="sending heartbeat to manager { } with timeout 5s" method="(*session).heartbeat" module=node/agent node.id=pcefq43rapf8mw8887inkztvi session.id=72z12be65dn88tkuxtckrxxmj sessionID=72z12be65dn88tkuxtckrxxmj
Feb 22 18:10:58 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:10:58.654635872Z" level=debug msg="heartbeat successful to manager { }, next heartbeat period: 5.339470303s" method="(*session).heartbeat" module=node/agent node.id=pcefq43rapf8mw8887inkztvi session.id=72z12be65dn88tkuxtckrxxmj sessionID=72z12be65dn88tkuxtckrxxmj
Feb 22 18:11:03 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:11:03.994666787Z" level=debug msg="sending heartbeat to manager { } with timeout 5s" method="(*session).heartbeat" module=node/agent node.id=pcefq43rapf8mw8887inkztvi session.id=72z12be65dn88tkuxtckrxxmj sessionID=72z12be65dn88tkuxtckrxxmj
Feb 22 18:11:03 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:11:03.995801446Z" level=debug msg="heartbeat successful to manager { }, next heartbeat period: 4.774354673s" method="(*session).heartbeat" module=node/agent node.id=pcefq43rapf8mw8887inkztvi session.id=72z12be65dn88tkuxtckrxxmj sessionID=72z12be65dn88tkuxtckrxxmj
Feb 22 18:11:05 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:11:05.980417978Z" level=debug msg="memberlist: Stream connection from=192.168.32.112:57678"
Feb 22 18:11:05 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:11:05.980569597Z" level=debug msg="<CENSORED>llbprmmid01.<CENSORED>.net(54699a50f58e): Initiating bulk sync for networks [lp9jegtxlrr1ojaijz43kr233] with node 2d3fae1bf26e"
Feb 22 18:11:07 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:11:07.057352324Z" level=debug msg="memberlist: Stream connection from=192.168.32.111:51210"
Feb 22 18:11:07 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:11:07.057502628Z" level=debug msg="<CENSORED>llbprmmid01.<CENSORED>.net(54699a50f58e): Initiating bulk sync for networks [lp9jegtxlrr1ojaijz43kr233] with node 67d13d52bcbe"
Feb 22 18:11:08 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:11:08.770571281Z" level=debug msg="sending heartbeat to manager { } with timeout 5s" method="(*session).heartbeat" module=node/agent node.id=pcefq43rapf8mw8887inkztvi session.id=72z12be65dn88tkuxtckrxxmj sessionID=72z12be65dn88tkuxtckrxxmj
Feb 22 18:11:08 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:11:08.771434267Z" level=debug msg="heartbeat successful to manager { }, next heartbeat period: 5.469749527s" method="(*session).heartbeat" module=node/agent node.id=pcefq43rapf8mw8887inkztvi session.id=72z12be65dn88tkuxtckrxxmj sessionID=72z12be65dn88tkuxtckrxxmj
Feb 22 18:11:09 <CENSORED>llbprmmid01.<CENSORED>.net dockerd[674413]: time="2024-02-22T18:11:09.545754661Z" level=debug msg="memberlist: Stream connection from=192.168.32.112:42024"
```
coryaent
(Stephen)
July 25, 2024, 5:03pm
5
I waited a couple of days, and now it works.
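A guess at why waiting helped: the error comes from a raft proposal exceeding the transport's message size limit, and the raft log shrinks when a snapshot compacts it (every 10000 entries by default, per the docker info output above). If that is the mechanism, forcing earlier compaction and then retrying the removal might avoid the wait. This is an untested sketch, not a confirmed fix:

```bash
# Untested sketch: compact the raft log sooner by lowering the snapshot
# interval (default 10000 log entries), then retry removing the dead node.
docker swarm update --snapshot-interval 1000
docker node rm --force p5npyj7vms82jsiwmsywpietm

# Optionally restore the default afterwards:
docker swarm update --snapshot-interval 10000
```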