Random, high response times on containers across identical Linux hosts

I’m completely baffled by this.

I’m running two identical Ubuntu (24.04 minimal) VMs (6 core, 4GB RAM), running only the docker runtime and a couple other utilities (git, samba, nano). Completely fresh install. I’ll call these two (identical) docker hosts Docker 1 and Docker 2. Each host runs about 17/18 containers (mix of wordpress, phpmyadmin, ulogger, gitea, bookstack, mariadb, etc) and has plenty of available RAM and seemingly no issues with CPU contention. Also, each VM guest, while on the same host machine, is on its own SSD.

The issue is, I’m seeing these random spikes/plateaus from SOME containers running on either host. Response times are pretty much identical: they spike to 11 seconds for a period and then return to normal. The spikes are also consistently the same duration - 11 seconds in this case, across BOTH hosts. Some other apps never seem to get the issue, such as Portainer and Gitea. It SEEMS to affect Wordpress sites mostly, but it is not exclusive to WP sites.

Here’s the weirder thing… sometimes after reboots or restarting containers, the issue will “jump” and start affecting another container, where it will emit the same 11-second spikes/plateaus. The behavior will show even for container apps that are extremely lightweight, such as uLogger or phpMyAdmin (i.e., unauth’d login screens). WTF?

In the screenshots below, I’m using Uptime Kuma and am making HTTP requests every minute, with a delay threshold of 48 seconds (default). The red lines below are from 502 errors reported by the reverse proxy.

So far I have tried:

  • Recreating the hosts - This started when running Ubuntu 20.04, and recreating the hosts with 24.04 made no change. The only difference is the delay spikes changed from 15 seconds to 11 seconds with the new host.

  • Swapping reverse proxies - I originally noticed the issue when using Nginx on Ubuntu 20.04, but noticed the same behavior when moving to Traefik (on both Ubuntu 20.04 and 24.04).

  • Pinging containers directly by port - When I was using Nginx originally, the ports were randomly assigned on the host, so they were available over HTTP. This made no difference, though I did not do too much testing. I abandoned this after noticing Traefik showed the same issues.

The only thing common between the two hosts is that Uptime Kuma is pinging them, though I think I have also seen random spikes with my own HTTP ping utility written in C# (I’m not 100% certain). UK is pinging every minute, but my utility is pinging every 5 seconds. With my utility I see random spikes upward of 20-40 seconds, but not as consistently as UK indicates.
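For anyone who wants to reproduce the measurement without Uptime Kuma, here is a minimal sketch of an HTTP ping loop like the one described above, in Python rather than the original C#. The function names (`probe`, `slow_samples`, `monitor`) and the default threshold are made up for illustration and are not taken from the original utility.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe(url, timeout=48.0):
    """Return (seconds, status) for one GET; status is None if no HTTP reply arrived."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # e.g. a 502 returned by the reverse proxy
    except (OSError, http.client.HTTPException):
        status = None  # timeout, refused connection, DNS failure, broken reply
    return time.monotonic() - start, status


def slow_samples(samples, threshold=5.0):
    """Keep only the (seconds, status) samples at or above threshold seconds."""
    return [sample for sample in samples if sample[0] >= threshold]


def monitor(url, interval=5.0, count=12):
    """Probe url `count` times, `interval` seconds apart, printing each result."""
    samples = []
    for _ in range(count):
        sample = probe(url)
        print(f"{sample[0]:6.2f}s  status={sample[1]}")
        samples.append(sample)
        time.sleep(interval)
    return samples
```

Calling `monitor()` with a container's URL and then passing the result to `slow_samples()` should make it easy to see whether the slow requests cluster around one fixed duration.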

It SEEMS like some sort of networking issue, but I do not know what, as everything is default and traefik is routing on internal container IPs.

Screenshots

(Since I’m new I’m allowed only 1 screenshot…)

Here’s an example of Roundcube running on Docker 1, however, there are similar spikes/plateaus for some other containers on the same host as well as Docker 2 (both are around 11 seconds for containers on both hosts)

Roundcube running on Docker 1:
[Screenshot: Roundcube response times on Docker 1]

Any insight is appreciated, as I’m completely baffled by what is going on here. I was hoping a re-build of the VM would help but apparently not. Thank you.

Do you run the VMs on dedicated hardware or shared hosts?

Have you tried to run something like netdata to see if any underlying Linux functions (disk access, etc) show the same delay? (recently they are going very enterprise-y, but you can still run a free local instance with Docker)

The VMs are on dedicated hardware - a PC I built years ago.

I’m trying out netdata. Wow, there is a ton here. Any particular thing to look for when running my tests? Just glancing at it, nothing seems to stand out. I’m getting anomalies being detected, but I’m also seeing them on a completely different docker host that I’m testing netdata on as well, so not sure if that’s alarming. Thanks.

I would look for graphs with a similar pattern to your Docker service response times. If it’s an old PC, maybe disk has issues.

I’m seeing TCP sockets sitting in TIME_WAIT quite a bit. Trying to figure out how to see which container is doing this.
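One possible way to attribute TIME_WAIT sockets to a container is to read `/proc/<pid>/net/tcp` for the container's main process, since that file reflects the network namespace of that process. The sketch below assumes the PID was obtained with `docker inspect --format '{{.State.Pid}}' <container>` and that the script runs with enough privileges (likely root) to read that file. The helper names are made up for illustration, and only IPv4 sockets are covered (IPv6 lives in `tcp6`).

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of a /proc/.../net/tcp file."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        state = fields[3]  # fourth column is the hex state code
        counts[TCP_STATES.get(state, state)] += 1
    return counts


def container_tcp_states(pid):
    """Count TCP states inside the network namespace of process `pid`."""
    with open(f"/proc/{pid}/net/tcp") as handle:
        return count_tcp_states(handle.read())
```

Running `container_tcp_states()` once per container PID and comparing the `TIME_WAIT` counts should show which container is opening and closing the most connections.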

This level of network debugging isn’t my strength, so if anyone has some tips, LMK. Thanks!

I’m not sure if this is the definite fix, but for the last 1.5 days I’ve moved all the containers into a user-defined bridge network (as opposed to the default system bridge running previously). So far, there have been no weird response-time spikes/plateaus. Things seem to be running as expected.

Is there any explanation as to why this may be the reason? I know there are some optimizations with user-defined versus system-defined, but I thought it was minimal. Any thoughts? Thanks.

I spoke too soon. One of the random containers started spiking. No idea why.

The long timeouts are always 11.5 seconds. Across BOTH hosts.

How in the hell do two identical Ubuntu hosts, freshly installed, with just the docker runtime on there, cause random containers to have 11.5 second response times (sometimes more)? The containers on each are different, except for portainer and traefik, which are on both. Neither of these is the cause, however.

It’s not anything related to a load balancer, as I can shell into the container and curl the local 172.18.X.X IP address and get similar delays.

I’m going to try podman, when I work through some minor nuances. I’ll try running the VMs on another host of mine. I’ll try an older version of docker, as I swear this wasn’t a problem before version 25 or 26.

I did not measure the time, but I had an issue in a VM where Docker ran and even the bash shell was slow periodically. The host machine was used for other purposes as well and had a large home directory on software raid where the VM images had to be. The VMs had very few resources, although they were not using a lot of CPU and memory. When I checked the output of htop, I noticed that the “incus_agent” (formerly lxd_agent) in the LXD virtual machine periodically used 100% vCPU. I could not confirm it, but one of my guesses was that this somehow affected the performance. On the other hand, even the “ls /home/user/” command was sometimes slow in a folder with very few files or none, while other folders not on the software raid gave a much quicker response all the time. Since the server eventually has to read something from the host, if you have similar storage solutions on both machines, that could also be a cause. Since you wrote about SSD, if it is directly attached to the host, it shouldn’t be a problem, but maybe you will have a new idea based on my observation.

Thanks for the replies. I did figure out how to fix the issue, though I’m not 100% sure WHY this is the fix.

Keep in mind, these are fresh OS installations of Ubuntu 20 and 24. I merely installed git and docker onto them afterward, and ran my scripts to deploy my containers. The ubuntu servers were using static IPs and DNS is using my local (LAN) DNS servers.

The issue apparently was with DNS (of-fucking-course!) as the spikes seemed to be timing out for consistent periods of time, but then resumed and gave a response to the request. From my understanding, requests from the container (such as calls to a database using a host name) were using the local Docker DNS (127.0.0.11), which was forwarded to systemd-resolved (127.0.0.53), and then, maybe that forwards to the internal DNS servers on my LAN.
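To check whether a delay really comes from name resolution, the lookup can be timed on its own, separately from the HTTP request. This is a minimal sketch meant to be run inside an affected container; `time_lookup` is a made-up helper name, and the hostname to test would be whatever name the app uses (for example its database host).

```python
import socket
import time


def time_lookup(hostname, port=80):
    """Return (seconds, addresses) for one resolver lookup of hostname."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the resolved address.
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addresses = []  # lookup failed; the elapsed time is still of interest
    return time.monotonic() - start, addresses
```

If the lookup alone takes several seconds while requests to the bare IP are fast, that points at the resolver chain rather than the app or the proxy.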

I initially set my two DNS servers in /etc/docker/daemon.json, but this still showed consistent 5-second spikes/plateaus. This was effectively using 127.0.0.11, which forwarded to my two LAN DNS servers.

Finally, I set --dns 192.168.1.X for my local DNS servers and it worked perfectly - no more spikes or plateaus. It’s been over a week with this configuration and so far so good.
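To see which resolver a container actually ended up with after a change like this, its `/etc/resolv.conf` can be inspected. The sketch below parses the `nameserver` lines and the `timeout`/`attempts` options; the defaults of 5 seconds and 2 attempts are the glibc values documented in resolv.conf(5). Whether that 5-second default is related to the 5-second plateaus above is only a guess. `parse_resolv_conf` is a made-up helper name.

```python
def parse_resolv_conf(text):
    """Return (nameservers, options) parsed from resolv.conf-style text."""
    nameservers = []
    options = {"timeout": 5, "attempts": 2}  # glibc defaults per resolv.conf(5)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            for option in fields[1:]:
                name, _, value = option.partition(":")
                if name in options and value.isdigit():
                    options[name] = int(value)
    return nameservers, options


def read_resolv_conf(path="/etc/resolv.conf"):
    """Parse the resolver configuration of the environment this runs in."""
    with open(path) as handle:
        return parse_resolv_conf(handle.read())
```

Running `read_resolv_conf()` inside the container shows whether lookups go to Docker's embedded DNS at 127.0.0.11 or to some other nameserver.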

So, I don’t know why the resolution using Docker’s DNS and systemd-resolved was causing random spikes on brand-new installations, but explicitly setting the DNS servers as the main nameservers fixed the spikes.

Would anyone have an explanation as to why DNS was causing these issues?

Hopefully this helps anyone seeing this same issue.
