Background: We have developed an application that hosts a REST API (a Flask app running under gunicorn) and makes connections to MySQL and MongoDB servers (separate servers, not containers).
Issue: When the application is configured to bind to port 5000 through Docker’s bridge network, all 20 containers (replicas) in the Docker service process data at 100% CPU each. Then, after some arbitrary time (usually 15-30 seconds), each container’s CPU drops to 1% and the bandwidth observed on the bridge interface is a steady 10-11 Mbps. Finally, after perhaps 45-75 seconds, the containers again receive data at a fast rate and process at 100% CPU for another 15-30 seconds. This cycle repeats indefinitely (15-30 seconds of fast, unrestricted bandwidth, then 45-75 seconds at 10-11 Mbps).
Case: When the application is configured to bind to port 5000 through the host network, the single container in the Docker service processes data at 100% CPU. Throughout its entire lifecycle, the container continues receiving and processing data at full speed.
Summary: The application receives data at an unrestricted rate when attached to the host network. When attached to a bridge network, the rate appears unrestricted for 15-30 seconds, then drops to 10-11 Mbps for 45-75 seconds, and the cycle repeats.
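For reference, a minimal sketch of the two configurations being compared. The service name and image are placeholders, not our actual values, and this assumes ports are published through the default ingress/bridge path in the first case:

```shell
# Bridge / published-port configuration: 20 replicas, port 5000 published
# through Docker's network stack (the slow/fast cycling case)
docker service create \
  --name api \
  --replicas 20 \
  --publish published=5000,target=5000 \
  myorg/flask-api:latest   # placeholder image

# Host-network configuration: a single replica bound directly to the host
# (the case that runs at full speed continuously)
docker service create \
  --name api \
  --replicas 1 \
  --network host \
  myorg/flask-api:latest   # placeholder image
```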
Note: The script being run to hit the application is running locally on the Docker server, pointing to http://localhost:5000
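The load script itself is simple; roughly the following shape, though the endpoint path and payload here are hypothetical stand-ins for the real ones:

```python
# Hypothetical sketch of the local load script; the /ingest path and the
# payload contents are assumptions, not the actual script's values.
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # script runs on the Docker host itself


def build_request(path, payload):
    """Build a POST request with a JSON body for the local API."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
    )


def send_forever(path="/ingest", payload=None):
    """Hit the API in a tight loop, as fast as responses come back."""
    while True:
        with urllib.request.urlopen(build_request(path, payload or {})) as resp:
            resp.read()
```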
Docker Version 19.03.8
Note: This application is running in an internet-restricted environment. Upgrades to any software version are a difficult process and cannot be considered unless absolutely necessary to fix the issue. Upgrading software for the reason of “just upgrade to the latest and see what happens” is not possible.
What could be limiting high-performance traffic on Docker networks in a way that makes it alternate between fast and slow, over and over?