We’ve been investigating an issue where our containers would sometimes get killed, seemingly at random, after weeks of running fine.
We figured out that the OS was OOM-killing our containers, so we started looking into what could be causing the high memory consumption and added additional monitoring to try to narrow it down. We also reduced the memory consumption of our containers, but that didn’t solve the issue.
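For reference, the container-level monitoring boils down to periodically sampling per-container memory usage. A minimal sketch of that kind of check (the log path and five-minute interval are arbitrary choices):

#!/bin/sh
# Sketch: log per-container memory usage so growth inside the containers can be ruled out.
while true; do
    {
        date -Iseconds
        docker stats --no-stream --format '{{.Name}}: {{.MemUsage}}'
    } >> /var/log/container-mem.log
    sleep 300
done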
Digging further into it, it appears that it’s dockerd itself that’s increasing in memory consumption over time.
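To make sure this wasn’t a one-off reading, something along these lines can sample dockerd’s RSS from /proc over time (a sketch; the pgrep pattern, interval and log path are assumptions):

#!/bin/sh
# Sketch: record dockerd's resident set size every 10 minutes to confirm the trend.
while true; do
    pid=$(pgrep -xo dockerd)
    rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
    echo "$(date -Iseconds) dockerd pid=$pid rss_kb=$rss_kb" >> /var/log/dockerd-rss.log
    sleep 600
done

A current snapshot on one of the affected machines: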
$ ps aux --sort -rss
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1264 1.5 47.3 5503944 1840040 ? Ssl 2024 640:45 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
root 3344 11.6 15.9 2541548 620748 ? Ssl 2024 4857:12 python3 src/crisp/hw_api/app.py
root 1533064 12.8 9.1 2203244 353724 ? Sl Jan03 1990:18 python3 src/crisp/hw_api/app.py
root 3509 0.3 2.7 663868 107116 ? Ssl 2024 154:56 python3 src/crisp/jobs/service.py
$ docker system info
Client: Docker Engine - Community
Version: 27.4.0
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.19.2
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.31.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 4
Running: 4
Paused: 0
Stopped: 0
Images: 7
Server Version: 27.4.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 88bf19b2105c8b17560993bee28a01ddc2f97182
runc version: v1.2.2-0-g7cb3632
init version: de40ad0
Security Options:
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.1.0-rpi7-rpi-v8
Operating System: Debian GNU/Linux 12 (bookworm)
OSType: linux
Architecture: aarch64
CPUs: 4
Total Memory: 3.704GiB
Name: RIS0141
ID: fc9725eb-8064-4150-a4f3-735358fb3f0c
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
$ uptime
11:25:10 up 28 days, 22:17, 1 user, load average: 0.16, 0.23, 0.22
Memory consumption over time:
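To see where inside dockerd that memory is going, the obvious next step is a heap profile from the daemon itself. A sketch, assuming the daemon has been restarted with debug mode enabled ("debug": true in /etc/docker/daemon.json), which should expose the Go pprof endpoints on the API socket:

# Sketch: pull a heap profile from a debug-enabled dockerd and inspect it.
# Assumes the default socket path; a Go toolchain is needed for the second step.
curl --unix-socket /var/run/docker.sock \
     -o dockerd-heap.pprof \
     http://localhost/debug/pprof/heap
go tool pprof -top dockerd-heap.pprof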
On a different machine that has been running the same containers for a longer time, the same command produces this output:
$ ps aux --sort -rss
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2201 7.7 41.1 3592392 1597648 ? Ssl 2024 12573:58 python3 crisp/hw_api/app.py
root 2789509 2.3 34.5 3306148 1340800 ? Sl 2024 3754:02 python3 crisp/hw_api/app.py
root 4062 3.7 3.2 895936 124940 ? Ssl 2024 6058:53 python3 crisp/jobs/service.py
root 452656 0.4 2.0 1317616 79744 ? Ssl 2024 208:45 /usr/sbin/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port=41641
root 1325 45.1 1.6 2853760 62908 ? Ssl 2024 73292:49 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
Here dockerd is consuming only 1.6% of memory. (Note that this machine is currently under load; I will try to gather this information again once it’s no longer under load.)
$ docker system info
Client: Docker Engine - Community
Version: 26.0.0
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.13.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.25.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 1
Server Version: 26.0.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e377cd56a71523140ca6ae87e30244719194a521
runc version: v1.1.12-0-g51d5e94
init version: de40ad0
Security Options:
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.6.20+rpt-rpi-v8
Operating System: Debian GNU/Linux 12 (bookworm)
OSType: linux
Architecture: aarch64
CPUs: 4
Total Memory: 3.703GiB
Name: RIS0161
ID: a8293512-26d2-47d2-b105-96254e4ac756
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No memory limit support
WARNING: No swap limit support
$ uptime
11:24:24 up 112 days, 20:03, 2 users, load average: 0.24, 0.28, 0.22
Memory consumption over time:
(Graph removed due to new-user upload restrictions.) The graph is largely flat over the same period.
From my point of view, it looks like dockerd is leaking memory, and that this was introduced somewhere between 26.0.0 and 27.4.0. That’s at least the only explanation I can think of that fits my observations.
Note: I’m aware that the number of running containers differs (2 vs. 4), but the extra containers should be insignificant and, in any case, shouldn’t make the dockerd process itself consume this much more memory.
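If this really is a regression between those versions, one way to test it on our side would be to downgrade a single machine’s engine packages and watch whether dockerd’s RSS still climbs. A rough sketch (the exact package version string is an assumption; check what apt actually offers first):

# Sketch: pin one host back to a 26.x engine and re-check dockerd's memory growth.
apt-cache madison docker-ce             # list the versions the repository actually offers
VER='5:26.0.0-1~debian.12~bookworm'     # assumption: replace with a version from the list above
sudo apt-get install --allow-downgrades \
    "docker-ce=$VER" "docker-ce-cli=$VER"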