Potential dockerd memory leak introduced between 26.0.0 and 27.4.0

We’ve been investigating an issue where our containers would sometimes get killed, seemingly at random, after weeks of running fine.

We figured out that the OS was OOM-killing our containers, so we started looking into what could be causing the high memory consumption. We added additional monitoring to try to narrow it down. We also reduced the memory consumption of our containers, but it didn’t solve the issue.
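For reference, this is roughly how one can confirm that the kernel OOM killer is behind the kills (my-container is a placeholder name):

$ journalctl -k | grep -iE 'out of memory|oom-kill'
$ docker inspect --format '{{.State.OOMKilled}} exit={{.State.ExitCode}}' my-container

An exit code of 137 (SIGKILL) combined with oom-kill messages in the kernel log points at the OOM killer rather than an application crash.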

Digging further into it, it appears that it’s dockerd itself whose memory consumption keeps increasing over time.
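For anyone wanting to reproduce the measurement, a one-liner like this is enough to sample dockerd’s RSS once per hour (the log file name is arbitrary, and it assumes a single dockerd process):

$ while true; do echo "$(date -Is) $(ps -o rss= -p "$(pidof dockerd)")"; sleep 3600; done >> dockerd-rss.log &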

$ ps aux --sort -rss
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        1264  1.5 47.3 5503944 1840040 ?     Ssl   2024 640:45 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
root        3344 11.6 15.9 2541548 620748 ?      Ssl   2024 4857:12 python3 src/crisp/hw_api/app.py
root     1533064 12.8  9.1 2203244 353724 ?      Sl   Jan03 1990:18 python3 src/crisp/hw_api/app.py
root        3509  0.3  2.7 663868 107116 ?       Ssl   2024 154:56 python3 src/crisp/jobs/service.py
$ docker system info
Client: Docker Engine - Community
 Version:    27.4.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.19.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.31.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 4
  Running: 4
  Paused: 0
  Stopped: 0
 Images: 7
 Server Version: 27.4.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 88bf19b2105c8b17560993bee28a01ddc2f97182
 runc version: v1.2.2-0-g7cb3632
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.1.0-rpi7-rpi-v8
 Operating System: Debian GNU/Linux 12 (bookworm)
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 3.704GiB
 Name: RIS0141
 ID: fc9725eb-8064-4150-a4f3-735358fb3f0c
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
$ uptime
 11:25:10 up 28 days, 22:17,  1 user,  load average: 0.16, 0.23, 0.22
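To put the ps output above in perspective: 1840040 KiB of RSS is roughly 1.75 GiB, i.e. the 47.3% of this machine’s 3.704 GiB of RAM shown above, accumulated over about 28 days of uptime. Very roughly, and assuming dockerd starts out at a few tens of MiB (as on the second machine shown below), that works out to around 60 MiB of growth per day.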

Memory consumption over time:
(Removed due to new user upload restrictions.) This is a graph showing dockerd’s memory consumption climbing steadily over the machine’s uptime.

On a different machine, running the same containers for a longer period of time, the same command produces this output:

$ ps aux --sort -rss
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        2201  7.7 41.1 3592392 1597648 ?     Ssl   2024 12573:58 python3 crisp/hw_api/app.py
root     2789509  2.3 34.5 3306148 1340800 ?     Sl    2024 3754:02 python3 crisp/hw_api/app.py
root        4062  3.7  3.2 895936 124940 ?       Ssl   2024 6058:53 python3 crisp/jobs/service.py
root      452656  0.4  2.0 1317616 79744 ?       Ssl   2024 208:45 /usr/sbin/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port=41641
root        1325 45.1  1.6 2853760 62908 ?       Ssl   2024 73292:49 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Here dockerd is consuming only 1.6% of memory (note that this machine is currently under load; I will try to gather more information once it’s no longer under load).
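If the maintainers need more than RSS numbers: as far as I know, dockerd exposes Go’s pprof endpoints on the API socket when the daemon runs with debug enabled ("debug": true in /etc/docker/daemon.json), so a heap profile could be grabbed along these lines (a sketch; I haven’t verified the endpoint path on these exact versions):

$ curl --unix-socket /var/run/docker.sock http://localhost/debug/pprof/heap -o dockerd-heap.pprof
$ go tool pprof -top dockerd-heap.pprof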

$ docker system info
Client: Docker Engine - Community
 Version:    26.0.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.13.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.25.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 26.0.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: e377cd56a71523140ca6ae87e30244719194a521
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.6.20+rpt-rpi-v8
 Operating System: Debian GNU/Linux 12 (bookworm)
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 3.703GiB
 Name: RIS0161
 ID: a8293512-26d2-47d2-b105-96254e4ac756
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No memory limit support
WARNING: No swap limit support
$ uptime
 11:24:24 up 112 days, 20:03,  2 users,  load average: 0.24, 0.28, 0.22

Memory consumption over time:
(Removed due to new user upload restrictions.) This is a graph that’s largely flat over the same period.
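For comparison: dockerd here sits at 62908 KiB (about 61 MiB) of RSS after 112 days of uptime on 26.0.0, versus roughly 1.75 GiB after only 28 days on the 27.4.0 machine above.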

From my point of view, it appears that dockerd is leaking memory, and that this leak was introduced somewhere between 26.0.0 and 27.4.0. That’s at least the only explanation I can think of that fits my observations.

Note: I’m aware that the number of running containers differs (2 vs 4), but the extra containers should be insignificant and, in any case, shouldn’t make the dockerd process itself consume more memory.

It appears this has been fixed in 27.4.1. I’ll update my versions :slight_smile:
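For anyone else on Debian bookworm who installed Docker Engine from Docker’s apt repository (an assumption on my part; adjust for your install method), the upgrade should be along these lines:

$ sudo apt-get update
$ sudo apt-get install --only-upgrade docker-ce docker-ce-cli containerd.io
$ docker --version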