Hey folks - I’m trying to debug why dockerd on one of our bare metal servers has been pegging our CPU at 100% for a few days now. None of the containers seem to be utilizing a lot of CPU (from running docker stats). The dockerd logs don’t show anything helpful from what I can see. What else should I look for? Any help is appreciated. Here’s some docker info from our host, in case it’s useful:
Use

find /var/lib/docker/containers/ -name '*-json.log' \
  -exec bash -c 'jq . "$1" > /dev/null 2>&1 || echo "file corrupt: $1"' bash {} \;

to identify whether all files are still JSON-compliant. Delete all corrupt files.
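If you’d rather remove the corrupt files in the same pass, a variant like this should work (it assumes the default data root, and you may want to back the files up first):

find /var/lib/docker/containers/ -name '*-json.log' \
  -exec bash -c 'jq . "$1" > /dev/null 2>&1 || rm -v "$1"' bash {} \;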
Wow, good call, that seems to have fixed it! Thank you so much for the help. Can you explain how that command finds corrupted log files? And why they may have been corrupted in the first place? I need to learn more about this. Thanks
The find command searches for all files ending in -json.log under /var/lib/docker/containers and hands each one to jq to parse; the parsed output is redirected to /dev/null so it doesn’t pollute the screen. The || is a boolean OR that makes use of jq’s return code: if it’s 0, parsing succeeded; if it’s >0, parsing failed and "file corrupt: ${filename}" gets printed. The {} is a placeholder that find -exec replaces with the actual filename, and the inline script receives it as $1.
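You can see the exit-code behavior with a quick test; the truncated JSON below mimics a log file that was cut off mid-write:

echo '{"log":"complete line"}' | jq . > /dev/null; echo $?    # prints 0
printf '{"log":"truncated' | jq . > /dev/null 2>&1; echo $?   # prints a non-zero code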
Does that make sense?
We had it in the past when the filesystem was full and writes got cut off in the middle of log entries, leaving truncated lines behind.
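A quick way to check whether you’re in the same situation is to look at free space on the Docker data root and what Docker itself is consuming:

df -h /var/lib/docker
docker system df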
Is there any Docker bug report tracking this issue? It has happened to me several times on different hosts in the last few weeks. It’s Docker CE version 24 running on Debian 11.
This issue is really unpleasant. It can exhaust production hosts and leave containers unresponsive. Any clues how to mitigate it?
The only situation where I encountered this behavior in the past was when the filesystem was full and Docker could not write complete log lines. We got rid of it by using bigger partitions and by reducing the log level and the chattiness of what actually was logged.
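Capping log growth with the json-file driver’s built-in rotation also helps. A minimal sketch, assuming you don’t already have an /etc/docker/daemon.json you’d need to merge with (the size and file count are just example values):

cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
sudo systemctl restart docker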
Usually when the category is not one of the “Docker Desktop” categories, it is indeed about docker-ce on Linux. I have no idea how to run the command in Docker Desktop.
I remember that @rimelek shared an nsenter command a long time ago that uses a privileged container to access a shell in the host namespace of the utility VM that runs the Docker backend. But even then, the paths would most likely not be different for Docker Desktop.
Then that could be Docker Desktop as you assumed. This is the command that @meyay referred to:
docker run --rm -it --privileged --pid host ubuntu:20.04 \
nsenter --all -t 1 \
-- ctr -n services.linuxkit task exec -t --exec-id test docker \
sh
It will give you a shell so you can see the files, including the docker data root and the config file, but don’t change anything there as long as the graphical interface works; you can change the daemon config from the GUI. docker stats can also show you how many resources containers use, and you can try docker system prune to remove
- all stopped containers
- all networks not used by at least one container
- all dangling images
- all dangling build cache
which will give you more space in the virtual machine. Check unused volumes too.
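For example:

docker system prune                  # asks for confirmation before removing the above
docker volume ls -f dangling=true    # list volumes not referenced by any container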
We had a similar incident with Docker itself maxing out 20 CPU cores out of 64 on Ubuntu 20.04.5 LTS.
The high CPU persisted even after we stopped all the Docker containers. When we inspected with iotop, /run/containerd/containerd.sock showed abnormal I/O throughput.
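For anyone wanting to run the same check, iotop can be limited to processes that are actively doing I/O:

sudo iotop -o -P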
The JSON log file issue mentioned in this thread was not the root cause; we inspected the JSON log files following the instructions here. However, we think the issue was still something log-related.
We saw errors in syslog related to Docker:
Oct 25 10:03:41 poly dockerd[988]: time="2023-10-25T10:03:41.747578758Z" level=error msg="Handler for GET /v1.41/images/json returned error: write unix /run/docker.sock->@: write: broken pipe"
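If you want to look for the same kind of errors on your own hosts, grepping syslog works, and on systemd hosts the journal does too, e.g.:

grep dockerd /var/log/syslog | grep level=error
journalctl -u docker.service -p err --since '1 hour ago'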