We’ve been investigating an issue where our containers would sometimes get killed, seemingly at random, after weeks of running fine.
We figured out that the OS was OOM-killing our containers, so we started looking into what could be causing the high memory consumption and added additional monitoring to try to narrow it down. We also reduced the memory consumption of our containers, but that didn’t solve the issue.
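For reference, the container-level monitoring boils down to periodically sampling per-container memory usage. A minimal sketch of that kind of check (the log path and five-minute interval are arbitrary choices):

#!/bin/sh
# Sketch: log per-container memory usage so growth inside the containers can be ruled out.
while true; do
    {
        date -Iseconds
        docker stats --no-stream --format '{{.Name}}: {{.MemUsage}}'
    } >> /var/log/container-mem.log
    sleep 300
done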
Digging further into it, it appears that it’s dockerd itself that’s increasing in memory consumption over time.
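To make sure this wasn’t a one-off reading, something along these lines can sample dockerd’s RSS from /proc over time (a sketch; the pgrep pattern, interval and log path are assumptions):

#!/bin/sh
# Sketch: record dockerd's resident set size every 10 minutes to confirm the trend.
while true; do
    pid=$(pgrep -xo dockerd)
    rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
    echo "$(date -Iseconds) dockerd pid=$pid rss_kb=$rss_kb" >> /var/log/dockerd-rss.log
    sleep 600
done

A current snapshot on one of the affected machines: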
$ ps aux --sort -rss
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1264 1.5 47.3 5503944 1840040 ? Ssl 2024 640:45 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
root 3344 11.6 15.9 2541548 620748 ? Ssl 2024 4857:12 python3 src/crisp/hw_api/app.py
root 1533064 12.8 9.1 2203244 353724 ? Sl Jan03 1990:18 python3 src/crisp/hw_api/app.py
root 3509 0.3 2.7 663868 107116 ? Ssl 2024 154:56 python3 src/crisp/jobs/service.py
$ docker system info
Client: Docker Engine - Community
Version: 27.4.0
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.19.2
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.31.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 4
Running: 4
Paused: 0
Stopped: 0
Images: 7
Server Version: 27.4.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 88bf19b2105c8b17560993bee28a01ddc2f97182
runc version: v1.2.2-0-g7cb3632
init version: de40ad0
Security Options:
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.1.0-rpi7-rpi-v8
Operating System: Debian GNU/Linux 12 (bookworm)
OSType: linux
Architecture: aarch64
CPUs: 4
Total Memory: 3.704GiB
Name: RIS0141
ID: fc9725eb-8064-4150-a4f3-735358fb3f0c
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
$ uptime
11:25:10 up 28 days, 22:17, 1 user, load average: 0.16, 0.23, 0.22
Memory consumption over time:
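To see where inside dockerd that memory is going, the obvious next step is a heap profile from the daemon itself. A sketch, assuming the daemon has been restarted with debug mode enabled ("debug": true in /etc/docker/daemon.json), which should expose the Go pprof endpoints on the API socket:

# Sketch: pull a heap profile from a debug-enabled dockerd and inspect it.
# Assumes the default socket path; a Go toolchain is needed for the second step.
curl --unix-socket /var/run/docker.sock \
     -o dockerd-heap.pprof \
     http://localhost/debug/pprof/heap
go tool pprof -top dockerd-heap.pprof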
On a different machine that has been running the same containers for a longer time, the same command produces this output:
$ ps aux --sort -rss
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2201 7.7 41.1 3592392 1597648 ? Ssl 2024 12573:58 python3 crisp/hw_api/app.py
root 2789509 2.3 34.5 3306148 1340800 ? Sl 2024 3754:02 python3 crisp/hw_api/app.py
root 4062 3.7 3.2 895936 124940 ? Ssl 2024 6058:53 python3 crisp/jobs/service.py
root 452656 0.4 2.0 1317616 79744 ? Ssl 2024 208:45 /usr/sbin/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port=41641
root 1325 45.1 1.6 2853760 62908 ? Ssl 2024 73292:49 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
Here dockerd is consuming only 1.6% of memory. (Note that this machine is currently under load; I will try to gather this information again once it’s no longer under load.)
$ docker system info
Client: Docker Engine - Community
Version: 26.0.0
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.13.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.25.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 1
Server Version: 26.0.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e377cd56a71523140ca6ae87e30244719194a521
runc version: v1.1.12-0-g51d5e94
init version: de40ad0
Security Options:
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.6.20+rpt-rpi-v8
Operating System: Debian GNU/Linux 12 (bookworm)
OSType: linux
Architecture: aarch64
CPUs: 4
Total Memory: 3.703GiB
Name: RIS0161
ID: a8293512-26d2-47d2-b105-96254e4ac756
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No memory limit support
WARNING: No swap limit support
$ uptime
11:24:24 up 112 days, 20:03, 2 users, load average: 0.24, 0.28, 0.22
Memory consumption over time:
(Graph removed due to new-user upload restrictions.) The graph is largely flat over the same period.
From my point of view, it looks like dockerd is leaking memory, and that this was introduced somewhere between 26.0.0 and 27.4.0. That’s at least the only explanation I can think of that fits my observations.
Note: I’m aware that the number of running containers differs (2 vs. 4), but the extra containers should be insignificant and, in any case, shouldn’t make the dockerd process itself consume this much more memory.
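If this really is a regression between those versions, one way to test it on our side would be to downgrade a single machine’s engine packages and watch whether dockerd’s RSS still climbs. A rough sketch (the exact package version string is an assumption; check what apt actually offers first):

# Sketch: pin one host back to a 26.x engine and re-check dockerd's memory growth.
apt-cache madison docker-ce             # list the versions the repository actually offers
VER='5:26.0.0-1~debian.12~bookworm'     # assumption: replace with a version from the list above
sudo apt-get install --allow-downgrades \
    "docker-ce=$VER" "docker-ce-cli=$VER"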