Docker swarm Raft WAL corrupted because of disk space utilized 100%

jsarvabhowma · February 6, 2024, 12:59pm

Hi Team,

I am running Docker Swarm with a single manager node that also runs the workloads (there are no separate worker nodes). The swarm currently runs a couple of services, such as Tomcat and PostgreSQL. The disk filled up to 100%, the Docker services went down, and the Docker Swarm Raft WAL appears to have been corrupted. After freeing disk space, the Docker service commands fail with an “irreparable WAL error”. Is there any way to restore the corrupted Raft WAL in Docker Swarm? The logs are below.

Error from Docker daemon logs:
level=error msg="Cluster exited with an error: manager stopped: can't initialize raft node: irreparable WAL error: wal: max entry size limit exceeded, recBytes: 13269, fileSize(64000000) - offset(63989856) - padBytes(3) = entryLimit(10141)"

~# docker service ls
Error response from the daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to the swarm and try again.
~#

root@sarvavm:/var/lib/docker/swarm/raft/wal-v3-encrypted# ls -larth
total 611M
drwx------ 4 root root 4.0K Dec 17 2021 ..
-rw-r--r-- 1 root root 62M Jul 6 2022 0000000000000002-000000000000fafa.wal.broken
-rw-r--r-- 1 root root 62M Nov 14 2022 0000000000000003-00000000000166fe.wal.broken
-rw-r--r-- 1 root root 62M Mar 15 2023 000000000000001e-00000000000f1aa6.wal.broken
-rw-r--r-- 1 root root 62M Mar 16 2023 000000000000001f-00000000000fa43e.wal.broken
-rw-r--r-- 1 root root 62M Jun 26 2023 0000000000000020-0000000000102b9a.wal.broken
-rw-r--r-- 1 root root 62M Jun 26 2023 0000000000000021-000000000010ab28.wal.broken
-rw-r--r-- 1 root root 62M Jun 29 2023 0000000000000027-000000000013ad28.wal.broken
-rw-r--r-- 1 root root 62M Jul 4 2023 000000000000002d-0000000000169c02.wal.broken
-rw------- 1 root root 62M Feb 2 16:00 0000000000000391-0000000001d19a2f.wal
-rw------- 1 root root 62M Feb 2 16:57 0000000000000392-0000000001d212e5.wal
drwx------ 2 root root 4.0K Feb 6 07:58 .
root@sarvavm:/var/lib/docker/swarm/raft/wal-v3-encrypted#
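
For readers trying to interpret the daemon log line quoted above, the numbers in it can be checked with a few lines of Python. This is only a sketch: the function name `parse_wal_error` and the regular expression are illustrative, and the reading of the fields (the record claims more bytes than remain in the 64 MB segment file, which would be consistent with a write cut short by the full disk) is an assumption based on the field names, not on Docker documentation.

```python
import re

# The WAL part of the error message from the daemon log in this thread.
LOG_LINE = (
    "irreparable WAL error: wal: max entry size limit exceeded, "
    "recBytes: 13269, fileSize(64000000) - offset(63989856) - padBytes(3) "
    "= entryLimit(10141)"
)


def parse_wal_error(line: str) -> dict:
    """Extract the numeric fields from the WAL size error text (hypothetical helper)."""
    pattern = (
        r"recBytes: (?P<rec_bytes>\d+), "
        r"fileSize\((?P<file_size>\d+)\) - offset\((?P<offset>\d+)\) "
        r"- padBytes\((?P<pad_bytes>\d+)\) = entryLimit\((?P<entry_limit>\d+)\)"
    )
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("line does not look like the WAL size error")
    return {name: int(value) for name, value in match.groupdict().items()}


fields = parse_wal_error(LOG_LINE)

# Bytes left in the segment file after the current read offset and padding.
remaining = fields["file_size"] - fields["offset"] - fields["pad_bytes"]

# How far the claimed record size exceeds the space that is actually left.
overshoot = fields["rec_bytes"] - remaining

print(f"space left in segment: {remaining} bytes")
print(f"record claims {fields['rec_bytes']} bytes, {overshoot} more than fits")
```

The computed remainder matches the entryLimit value printed in the log, so the message is internally consistent: the entry at that offset cannot fit in the file as it exists on disk.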

rimelek · February 6, 2024, 1:35pm

If you don’t have a backup, there is nothing to restore from.
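
For future reference, the Docker administration guide describes backing up a swarm by stopping Docker on a manager and copying the entire /var/lib/docker/swarm directory. A minimal sketch of that copy step is below. The function name `backup_swarm_state`, the archive naming, and the /backup/swarm destination are assumptions made for illustration; the script does not stop or start Docker itself, so that has to be done around it.

```python
import tarfile
import time
from pathlib import Path


def backup_swarm_state(swarm_dir: str, backup_dir: str) -> Path:
    """Archive the swarm state directory into a timestamped .tar.gz file.

    Hypothetical helper: run it as root while the Docker daemon is stopped,
    so the Raft files are not changing while they are being copied.
    """
    src = Path(swarm_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"swarm state directory not found: {src}")

    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"swarm-state-{stamp}.tar.gz"

    # Store paths relative to the directory name, e.g. "swarm/raft/...".
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=src.name)
    return archive


# Example call (paths are assumptions; the destination should be on a
# different filesystem than the Docker data root):
# backup_swarm_state("/var/lib/docker/swarm", "/backup/swarm")
```

Writing the archive to a different filesystem matters here, because a backup stored on the same full disk would be exposed to the same failure.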

jsarvabhowma · February 6, 2024, 2:03pm

Thanks for the input @rimelek
Is there any way to prevent this? For example, could the Docker daemon stop gracefully when disk usage reaches 99%, or could the Raft WAL be backed up regularly and then restored by the Docker daemon itself as soon as the disk space issue is resolved?

rimelek · February 6, 2024, 3:26pm

I would create a small partition dedicated to the swarm state, but it would have to be created before installing Docker, or at least before initializing the swarm. I don’t know the maximum size the swarm state can reach, so I would try to find that out and size the partition accordingly. It might not work if Docker can’t handle additional files in that folder, such as lost+found. When I deployed etcd that way, I had to move etcd to a subpath, but I don’t think you could do that with that folder under the Docker data root.

The best option is to monitor your system and send an alert when free space is running low.
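
A free-space check of that kind could look roughly like the sketch below, run from cron or a systemd timer. The function names, the 10% threshold, and the default path "/" are assumptions; in practice the path would be the filesystem holding the Docker data root, and the print statement would be replaced by whatever alerting channel is in use.

```python
import shutil
import sys


def free_fraction(path: str) -> float:
    """Return the fraction of the filesystem containing `path` that is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a total of zero; treat as no space.
        return 0.0
    return usage.free / usage.total


def is_low_on_space(path: str, min_free: float = 0.10) -> bool:
    """True when less than `min_free` (a fraction, 0.10 = 10%) is free."""
    return free_fraction(path) < min_free


if __name__ == "__main__":
    # Path to check; "/" is a placeholder for the Docker data root filesystem.
    target = sys.argv[1] if len(sys.argv) > 1 else "/"
    if is_low_on_space(target):
        print(f"ALERT: {target} has only {free_fraction(target):.1%} free")
        sys.exit(1)
    print(f"OK: {target} has {free_fraction(target):.1%} free")
```

The non-zero exit code on low space lets a scheduler or monitoring agent treat the run as a failure and raise the alert.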

If you have only one node, do you need swarm? Are you using a feature that works with swarm only?