Random, high response times on containers across identical Linux hosts

binarydad · August 3, 2024, 2:56pm

I’m completely baffled with this.

I’m running two identical Ubuntu (24.04 minimal) VMs (6 core, 4GB RAM), running only the docker runtime and a couple other utilities (git, samba, nano). Completely fresh install. I’ll call these two (identical) docker hosts Docker 1 and Docker 2. Each host runs about 17/18 containers (mix of wordpress, phpmyadmin, ulogger, gitea, bookstack, mariadb, etc) and has plenty of available RAM and seemingly no issues with CPU contention. Also, each VM guest, while on the same host machine, is on its own SSD.

The issue is, I’m seeing these random spikes/plateaus from SOME containers running on either host. Response times are pretty much identical, where it would spike to 11 seconds for a period and then return to normal. The spikes would also be consistently the same time - 11 seconds in this case, across BOTH hosts. Some other apps don’t seem to ever get the issue, such as Portainer and Gitea. It SEEMs to affect Wordpress sites mostly, but not exclusive to WP sites.

Here’s the weirder thing… sometimes after reboots or restarting containers, the issue will “jump” and start affecting another container, where it will emit the same 11-second spikes/plateaus. The behavior will show even for container apps that are extremely lightweight, such as uLogger or phpMyAdmin (i.e., unauth’d login screens). WTF?

In the screenshots below, I’m using Uptime Kuma and am making HTTP requests every minute, with a delay threshold of 48 seconds (default). The red lines below are from 502 errors reported by the reverse proxy.

So far I have tried:

Recreating the hosts - This started when running Ubuntu 20.04, and recreating the hosts with 24.04 made no change. The only difference is the delay spikes changed from 15 seconds to 11 seconds with the new host.
Swapping reverse proxies - I originally noticed the issue when using Nginx on Ubuntu 20.04, but noticed the same behavior when moving to Traefik (on both Ubuntu 20.04 and 24.04).
Pinging containers directly by port - When I was using Nginx originally, the ports were randomly assigned on the host, so they were available over HTTP. This made no difference, though I did not do too much testing. I abandoned this after noticing Traefik showed the same issues.

The only things common between the two hosts is the fact that Uptime Kuma is pinging these, though I have noticed random spikes when I wrote my own HTTP ping utility in C#, though I’m not 100% certain. UK is pinging every minute, but my utility is pinging every 5 seconds. I see random spikes upward of 20-40 seconds, but not consistently as indicated by UK.

It SEEMS like some sort of networking issue, but I do not know what, as everything is default and traefik is routing on internal container IPs.

Screenshots

(Since I’m new I’m allowed only 1 screenshot…)

Here’s an example of Roundcube running on Docker 1, however, there are similar spikes/plateaus for some other containers on the same host as well as Docker 2 (both are around 11 seconds for containers on both hosts)

Roundcube running on Docker 1:

Any insight is appreciated, as I’m completely baffled by what is going on here. I was hoping a re-build of the VM would help but apparently not. Thank you.

bluepuma77 · August 3, 2024, 6:34pm

Do you run the VMs on dedicated hardware or shared hosts?

Have you tried to run something like netdata to see if any underlying Linux functions (disk access, etc) show the same delay? (recently they are going very enterprise-y, but you can still run a free local instance with Docker)

binarydad · August 3, 2024, 9:41pm

The VMs are on dedicated hardware - a PC I build years ago.

I’m trying out netdata. Wow, there is a ton here. Any particular thing to look for when running my tests? Just glancing at it, nothing seems to stand out. I’m getting anomalies being detected, but I’m also seeing them on a completely different docker host that I’m testing netdata on as well, so not sure if that’s alarming. Thanks.

bluepuma77 · August 4, 2024, 9:17am

I would look for graphs with a similar pattern to your Docker service response times. If it’s an old PC, maybe disk has issues.

binarydad · August 5, 2024, 1:25pm

I’m seeing TCP sockets sitting in TIMEWAIT quite a bit. Trying to figure out how to see which container is doing this.

This level of network debugging isn’t my strength, so if anyone has some tips, LMK. Thanks!

binarydad · August 8, 2024, 5:46pm

I’m not sure if this is the definite fix, but for the last 1.5 days I’ve moved all the containers into a user-defined bridge network (as opposed to the default system bridge running previously). So far, there have been no weird response-time spikes/plateaus. Things seem to be running as expected.

Is there any explanation as to why this may be the reason? I know there are some optimizations with user-defined versus system-defined, but I thought it was minimal. Any thoughts? Thanks.

binarydad · August 9, 2024, 1:32am

I spoke too soon. One of the random containers started spiking. No idea why.

The long timeouts are always 11.5 seconds. Across BOTH hosts.

How in the hell do two identical Ubuntu hosts, freshly installed, with just the docker runtime on there, cause random containers to have 11.5 second response times (sometimes more)? The containers on each are different, except for portainer and traefik, which are on both. Neither of these are the cause, however.

It’s not anything related to a load balancer, as I can shell into the container and curl the local 172.18.X.X IP address and get similar delays.

I’m going to try podman, when I work through some minor nuances. I’ll try running the VMs on another host of mine. I’ll try an older version of docker, as I swear this wasn’t a problem before version 25 or 26.

rimelek · August 10, 2024, 6:28am

I did not measure the time, but I had an issue in a VM where Docker ran and even the bash shell was slow periodically. The host machine was used for other purposes as well and had a large home directory on software raid where the VM images had to be. The VMs had very little resources, although they were not using a lot of CPU and memory. When I checked the output of htop, I noticed that the “incus_agent” (formerly lxd_agent) in the LXD virtual machine periodically used 100% vCPU. I could not confirm it, but this was one of my guesses to somehow affect the performance. On the other hand, even the “ls /home/user/” command was slow sometimes in a folder with very few files or none, while other folders not on the software raid gave much quicker response all the time. Since the server has eventually has to read sometthing from the host, if you have similar storage solutions on both machines, that could also be a cause. Since you wrote about SSD, if it is dirrectly attached to the host, it shouldn’t be a problem, but maybe you will have a new idea based on my observation.

binarydad · August 23, 2024, 3:00pm

Thanks for the replies. I did figure out how to fix the issue, though I’m not 100% sure WHY this is the fix.

Keep in mind, these are fresh OS installations of Ubuntu 20 and 24. I merely installed git and docker onto them afterward, and ran my scripts to deploy my containers. The ubuntu servers were using static IPs and DNS is using my local (LAN) DNS servers.

The issue apparently was with DNS (of-fucking-course!) as the spikes seemed to be timing out for consistent periods of time, but then resumed and gave a response to the request. From my understanding, requests from the container (such as calls to a database using a host name) were using the local Docker DNS (127.0.0.11), which was forwarded to systemd-resolved (127.0.0.53), and then, maybe that forwards to the internal DNS servers on my LAN.

I initially set my two DNS servers in /etc/docker/daemon.json, but this still showed consistent 5-second spikes/plateaus. This was effectively using 127.0.0.11, which forwarded to my two LAN DNS servers.

Finally, I set --dns 192.168.1.X for my local DNS servers and it worked perfectly - no more spikes or plateaus. It’s been over a week with this configuration and so far so good.

So, I don’t know why the resolution using Docker’s DNS and systemd-resolved was causing random spikes on brand-new installations, but explicitly setting the DNS servers as the main nameservers fixed the spikes.

Would anyone have an explanation as to why DNS was causing these issues?

Hopefully this helps anyone seeing this same issue.

binarydad · December 5, 2024, 12:27am

seem this is still sometimes an issue… oddly, this didn’t come after any kind of apt updates, but after restarting 2 containers, BOTH starting having issues. granted, the restart was part of a “Recreate” action under Portainer, instead of me running my own setup scripts.

I’m baffled. What in the world could cause very random DNS spikes for a period of time, go away briefly, and then come back? some sort of port exhaustion?

rimelek · December 5, 2024, 8:06pm

I still have no answers, but you mentioned Portainer in the issue, so I wonder, did you use Portainer to create the containers that are having problems with the DNS requests?

binarydad · December 11, 2024, 7:55pm

I am using portainer, yes. I typically run the “recreate” option to recreate the container upon a new image, but started recently using the CLI to manually pull, stop, remove, and run the container. not sure if this helps, it’s among my set of (possibly baseless) attempts to remedy this.

trying not to try a third server, as this one was already a fresh ubuntu install.

really feels like some quirk that i’m hitting. i’m doing nothing special with my setup, and only running about 18 containers. again, the reverse proxy made no difference (used nginx and now traefik - both had the same issue).

I discovered the other day I can’t even ping the internal virtual IP when containers get into this “state”, until i kill it and recreate it. makes absolutely no sense.

rimelek · December 11, 2024, 9:44pm

I asked about Portainer only because recently we had multiple reports about Portainer somehow breaking the network of Docker containers by duplicating mac addresses. Your issue seems different, but if you created anything from Portainer in your current environment, you could still check it.

After those reports came we both tried to write a script with @meyay that helps detecting duplicate mac addresses and this is what I made which I wanted to publish but haven’t done it yet.

#!/usr/bin/env bash

set -eu -o pipefail

network_name="${1:-}"

if [[ -z "$network_name" ]]; then
   mapfile -t network_names < <(docker network ls --format '{{ .Name }}')
else
  network_names=("$network_name")
fi

for inet in "${network_names[@]}"; do
  docker network inspect --format json "$inet" \
    | jq '.[0] | .Containers[] + {"Network": (.Name)} | select(.MacAddress != "") | {(.MacAddress): .}' \
    | jq -s 'reduce .[] as $item ({}; .[$item | keys_unsorted[0]] += [$item[$item | keys_unsorted[0]]])' \
    | jq '[. | to_entries[] | select(.value | length > 1)] | from_entries | select(. != {})'
done

You could save it as docker-duplicate-mac-detector.sh, make it executable and run it without arguments to find duplicate IPs in all networks or run it with a network name ass the only argument to detect duplicate mac only in that network.

The output is a json containing the problematic containers’ name, mac address, IP and network name.

But the previous issues were about containers communicating with eachother on the same network not another container on another host.

binarydad · December 12, 2024, 12:09am

Interesting! Does it count if I’ve recreated containers using portainer? My issues never happen after I run them via my own scripts but one time did notice that after recreating a container via portainer, I had intermittent ping issues.

Ill try this when I get back. Thanks!

rimelek · December 12, 2024, 12:33am

Yes, recreating counts

binarydad · December 12, 2024, 12:35am

Sounds promising. I was reading about a similar issue with Moby and the ping output/behavior shared is identical to my issue. So far none of mine show it, but i haven’t restarted anything. If it happens again I’ll check the mac addresses.

Or just stop using portainer altogether, as I only use it as a web UI for quick monitoring.

binarydad · December 18, 2024, 8:08pm

Just an update, but so far it’s working fine. I’ve since ditched Portainer in lieu of Yacht, so even if portainer was the cause, I would not know unfortunately. After seeing the logs and behaviors of those with the issue, and having it almost 100% mirror my tests, I’m almost certain that was the cause.

I also only deploy/redeploy my containers using my own scripts (stop, remove, run). I’ll report back if it comes back again. Thank you @rimelek for the insight! This has been driving me nuts for many months.

rimelek · December 19, 2024, 1:40pm

Thank you for the updates. I thought it was a different issue, but it is good to know that it was likely affected by Portainer

chatre7 · February 7, 2025, 2:01pm

I use portainer to deploy container 2 server but i found issue random slow response and random container
I use 2.21.4
Please share me your portainer version.

Topic		Replies	Views
Curious behaviour General	0	916	July 22, 2016
Response time problem for rails app in docker General	1	882	May 23, 2017
Docker scalability issues -- major problems when running hundreds/thousands of interactive sessions General	10	5322	April 12, 2017
Difference in sending speed between 2 containers at one host and different host? General	0	722	April 22, 2019
Strange performance behavior? Docker Desktop	0	886	January 17, 2020

Random, high response times on containers across identical Linux hosts

So far I have tried:

Screenshots

Related topics