Dockerd using 100% CPU

Hey folks - I’ve been trying to debug why dockerd on one of our bare metal servers has been pegging the CPU at 100% for a few days now. None of the containers seem to be using much CPU (judging from docker stats). The dockerd logs don’t show anything helpful as far as I can see. What else should I look for? Any help is appreciated. Here’s some docker info from our host, in case it’s useful:

# docker info
Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 4
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-157-generic
Operating System: Ubuntu 16.04.6 LTS
OSType: linux
Architecture: x86_64
CPUs: 20
Total Memory: 31.3GiB
Name: basil
ID: PHHV:SHFX:DT6S:4BW4:UEWF:MXJM:32YT:NSDI:2J3F:EBF6:V3IB:IRIQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: bowerybot
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support
# docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:24:56 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:21 2018
  OS/Arch:          linux/amd64
  Experimental:     false
1 Like

One possible cause is corrupt json-log files.

Use

find /var/lib/docker/containers/ -name "*-json.log" -exec bash -c 'jq . {} > /dev/null 2>&1 || echo "file corrupt: {}"' \;

to check whether all files are still JSON compliant. Delete all corrupt files.
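If the affected containers are still running, a variation I would suggest (my own addition, not strictly necessary) is to truncate the corrupt files in place instead of deleting them, since dockerd keeps an open handle on the log file of a running container:

find /var/lib/docker/containers/ -name "*-json.log" -exec bash -c 'jq . {} > /dev/null 2>&1 || truncate -s 0 {}' \;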

7 Likes

Wow good call that seems to have fixed it! Thank you so much for the help. Can you explain how that command finds corrupted log files? And why they may have been corrupted in the first place? I need to learn more about this. Thanks

I try my best :slight_smile:

The find command searches for all files that end with -json.log in /var/lib/docker/containers and hands each one to jq to parse; the parsed output is redirected to /dev/null so it doesn’t pollute the screen. The || is a boolean OR that makes use of jq’s return code: if it is 0, parsing succeeded; if it is greater than 0, parsing failed and the message "file corrupt: ${filename}" gets printed. The {} characters are placeholders in find -exec that get replaced by the actual filename.
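If it helps, here is a tiny illustration of the same exit-code logic on two throwaway files (the /tmp paths are just made up for the demo):

printf '{"log":"ok"}\n' > /tmp/good.log
printf '{"log":"trunc' > /tmp/bad.log                # simulates an entry cut off mid-write
jq . /tmp/good.log > /dev/null 2>&1 || echo "file corrupt: /tmp/good.log"   # prints nothing
jq . /tmp/bad.log > /dev/null 2>&1 || echo "file corrupt: /tmp/bad.log"     # prints the warning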

Does that make sense?

We had it in the past when the filesystem was full and log entries got cut off mid-write.

Yes that makes perfect sense. Thank you for the explanation. Cheers

This post saved my bacon, ty <3

My log files were not corrupt, and yet moving them out brought my CPU usage back to normal. Not sure why O.o

I used find /var/lib/docker/containers/ -name "*-json.log" -exec mv {} ~/logs-backup \;

1 Like

Hello,

is there any Docker bug report tracking this issue? It has happened to me several times on different hosts in the last few weeks. It’s Docker CE version 24 running on Debian 11.

This issue is really unpleasant. It can exhaust production hosts to the point where containers become unresponsive. Any clues how to mitigate it?

The only situation where I encountered this behavior in the past was when the filesystem was full and docker could not write complete log lines. We got rid of it by using bigger partitions and by reducing the log level and the chattiness of what actually was logged.
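If you want to rule that out, a couple of quick checks (nothing fancy, just standard tools and the default data root path):

df -h /var/lib/docker                                              # free space on the docker data root
du -sh /var/lib/docker/containers/*/*-json.log | sort -h | tail    # the biggest container log files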

Thanks for the reply. Yes, one of the containers is a bit chatty (java :roll_eyes:), but there is enough free disk space for now.

I tried to reduce the overall log size with a log rotation config in daemon.json:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-file": "5",
    "max-size": "100m"
  }
}

Could it be somehow related?
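For what it’s worth, this is how I check whether the rotation settings are actually applied after reloading the daemon. As far as I understand, the log-opts only affect containers created after the change, so existing containers keep their old settings until they are recreated (the container name below is just a placeholder):

sudo systemctl restart docker                                          # pick up the new daemon.json
docker inspect --format '{{json .HostConfig.LogConfig}}' <container>   # log driver and opts the container was created with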

I am having 100% CPU usage from docker, on Mac. The above command does not work (I presume the paths are for Linux). Any suggestions?

Usually when the category is not one of the “Docker Desktop” categories, it is indeed about docker-ce on Linux. I have no idea how to run the command in Docker Desktop.

I remember that @rimelek shared an nsenter command a long time ago that uses a privileged container to get a shell in the host namespace of the utility VM that runs the Docker backend. But even then, the paths would most likely not be different for Docker Desktop.

100% of your entire host (100% of all CPUs), 100% of one CPU or 100% of the CPUs of Docker Desktop’s virtual machine?

Since Docker Desktop runs almost everything in the virtual machine, which has limited resources, logs should not have a serious effect on the host.

You can still check the resource limits in the GUI, including the limits of the virtual machine, and also the log rotation in the daemon config.

100% of one host CPU.

Then that could be Docker Desktop, as you assumed. This is the command that @meyay referred to:

docker run --rm -it --privileged --pid host ubuntu:20.04 \
    nsenter --all -t 1 \
      -- ctr -n services.linuxkit task exec -t --exec-id test docker \
           sh

It will give you a shell so you can see the files, including the docker data root and the config file, but don’t change anything there as long as the graphical interface works; you can change the daemon config from the GUI. docker stats can also show you how much resources containers use, and you can try docker system prune to remove

  - all stopped containers
  - all networks not used by at least one container
  - all dangling images
  - all dangling build cache

which will give you more space in the virtual machine. Check unused volumes too.
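A few commands that can help with that (standard docker CLI, nothing Docker Desktop specific):

docker system df -v                  # disk usage per image, container and volume
docker volume ls -f dangling=true    # volumes not referenced by any container
docker volume prune                  # remove them (asks for confirmation first)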

1 Like

We had a similar incident with Docker itself maxing out 20 of 64 CPU cores on Ubuntu 20.04.5 LTS.

The high CPU usage persisted even after we stopped all Docker containers. When we inspected the host with iotop, /run/containerd/containerd.sock showed abnormal IO throughput.

The JSON log file issue mentioned in this thread was not the root cause; we inspected the JSON log files following the instructions here. However, we think the issue was still something log related.

We saw errors in syslog related to Docker:

Oct 25 10:03:41 poly dockerd[988]: time="2023-10-25T10:03:41.747578758Z" level=error msg="Handler for GET /v1.41/images/json returned error: write unix /run/docker.sock->@: write: broken pipe"
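In case it helps anyone searching for the same symptom, this is roughly how we spotted them (the first path assumes a default Debian/Ubuntu syslog setup):

grep dockerd /var/log/syslog | grep "level=error" | tail      # last dockerd errors in the syslog file
journalctl -u docker.service | grep "level=error" | tail      # same idea via the systemd journal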

Somebody had seen a similar issue in the past.

As a fix:

  • We updated Docker from version 20.10.18 (build b40c2f6) to version 24.0.6 (build ed223bc)
  • With the update, we restarted Docker

The issue is now gone.