Task History kills my Swarm

Hi

We have a Swarm cluster with one manager and 3 worker nodes.
We use Docker services with "--mode replicated-job" to run recurring jobs on a schedule.
These get created every time they are due to run and removed after they finish.

Now I have the problem that they stay in the task history: every single run since the last reboot of the host machines is still listed.
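
For context, this is roughly how each job gets created and cleaned up (the service name and image here are placeholders, not our real ones):

docker service create \
  --name nightly-report \
  --mode replicated-job \
  registry.example.com:5000/jobs/report:latest

# once the job has finished:
docker service rm nightly-report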

Output from "docker node ps":

ID             NAME                                                  IMAGE                      NODE                  DESIRED STATE   CURRENT STATE
j9b7zj6oe0iy   zsg0utokootnpin10u8yks6m6.q7apfswjq1cs2lv5bgbxale3f   registry.*:5000/*:latest   vsrv-swarmmanager01   Complete        Complete 5 hours ago
wqf2en8fm0s4   zsx2r6gdv3evvhoz6hetbqyop.q7apfswjq1cs2lv5bgbxale3f   registry.*:5000/*:latest   vsrv-swarmmanager01   Complete        Complete 25 hours ago
rrpyxm7f11wj   zt1g6m68u2wfwokbzcvvaqlh6.q7apfswjq1cs2lv5bgbxale3f   registry.*:5000/*:latest   vsrv-swarmmanager01   Complete        Complete 12 hours ago
sqdq0m9z0tzi   zt1q36319anelooa6v693brny.q7apfswjq1cs2lv5bgbxale3f   registry.*:5000/*:latest   vsrv-swarmmanager01   Complete        Complete 27 hours ago
jd5elxnjgca0   zul9qa08zf5wwaret9wpeqgh8.q7apfswjq1cs2lv5bgbxale3f   registry.*:5000/*:latest   vsrv-swarmmanager01   Complete        Complete 9 hours ago

I currently have 4000+ of those entries. At around 40,000 entries the cluster is no longer able to start new tasks and has to be rebooted to resume operation.
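
A rough way to watch the buildup from a manager node (the tail just strips the header row, and the count is only an approximation of the real task store):

docker node ps $(docker node ls -q) | tail -n +2 | wc -l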

Why do those not get cleaned up? Why is stuff that's done and dusted allowed to kill a whole cluster?

Regards
Sintbert

The task history has a default limit of 5 tasks. As the default setting doesn't seem to fit your use case, you will need to adjust the value to whatever supports it.

Note: Depending on whether you want to be able to perform rolling updates and have Swarm revert to the last running state, you can set --task-history-limit to whatever value you see fit. If you don't care about recovering from failed rolling updates, just set it to 0 and keep no task history at all.
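
For example, on a manager node:

docker swarm update --task-history-limit 0

# read the current value back (this format path is my reading of the
# docker info JSON; adjust if your version exposes it differently):
docker info --format '{{.Swarm.Cluster.Spec.Orchestration.TaskHistoryRetentionLimit}}'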


Hi @meyay
The task history limit doesn't apply here at all. It limits how many tasks stay in the history per service, and at least one is always kept there for every running service.
The problem is that those tasks usually get removed when their service gets removed. But not the ones started by a "--mode replicated-job" service: those stay behind when the service gets removed and are now orphaned.
I have not found any way except a complete cluster reboot to remove those orphaned tasks, since tasks can only be removed together with their corresponding service.
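
You can confirm an entry is orphaned by resolving the task back to its service; the task ID below is taken from my "docker node ps" output above, the service ID is a placeholder:

docker inspect --format '{{.ServiceID}}' jd5elxnjgca0
# prints the ID of the (already removed) service; then:
docker service inspect <service-id-from-above>
# fails with a 'no such service' error, because the service is gone
# while its task is still in the task store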

Created issue: `replicated-job` doesn't respect `task-history-limit` · Issue #45443 · moby/moby · GitHub