Slow ssh/scp download speeds with published ports in mesh network

Hi,

The scp download (but not upload) speed is very slow from a container to an external PC over the mesh network. Is there a way to configure the mesh network to be faster?

Background:

We have a Docker Swarm with multiple nodes where we publish a unique port for each user of our system so they can ssh into their own container.
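
For reference, a minimal sketch of how such a per-user service could be published (the service name, image, and port numbers below are just placeholders, not our actual setup):

# hypothetical per-user SSH service, each user gets their own published port
docker service create --name ssh-user42 \
  --publish published=2242,target=22 \
  our-ssh-image:latest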

However, when downloading files with scp or rsync (over ssh) the download speed is very low (100 KB/s on a 10 Gbit/s link).

Uploading over scp/rsync works fine (it maxes out the network connection), so it is only the downlink that is an issue.

I have ruled out:

From a Docker Swarm perspective the setup is pretty standard, but we have increased the ingress network from a /24 to a /16 to overcome the maximum number of published ports (in our case it restricted the number of concurrent users too much). Currently we have 27 active users.
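
Roughly what we did to enlarge the ingress network (the subnet below is just an example; note that services publishing ports have to be removed before the ingress network can be recreated):

# recreate the ingress network with a larger subnet
docker network rm ingress
docker network create --driver overlay --ingress \
  --subnet 10.0.0.0/16 --gateway 10.0.0.1 ingress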

I have tried searching both the Docker issue tracker and this forum, but only found very old issues (2014) that already had fixes merged (Incredibly unreliable/slow networking using port forwarding · Issue #4827 · moby/moby · GitHub and related).

Best Regards
Jonas

I cannot claim to be a network expert, but I would expect the speed to be the same in both directions, which is clearly not the case.

Have you also checked the disk on the client side to which you download the files? I don’t think that is the problem, but I have to ask to really rule it out. I also assume you tried to download files to multiple machines and the download speed was always slow, right?
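
For example, a crude sequential write test on the client (assuming a Linux client; the file name is just a placeholder):

# write 1 GB while bypassing the page cache, then clean up
dd if=/dev/zero of=./dd-testfile bs=1M count=1024 oflag=direct
rm ./dd-testfile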

Can you try to run a simple Docker container without Swarm? Or is that what magic-wormhole did? I would test scp with different Docker networks: the default Docker bridge, a custom Docker bridge like the one that docker compose creates, and maybe also the host network, if that is possible in your environment for testing purposes.
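
Something along these lines, assuming an image with sshd is available (image name, container names, and ports are placeholders):

# default bridge network with a published port
docker run -d --name ssh-test -p 2222:22 our-ssh-image
# host network, i.e. no NAT or port forwarding involved
docker run -d --name ssh-test-host --network host our-ssh-image
# then download from each and compare
scp -P 2222 user@dockerhost:/tmp/bigfile .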

You can also check your networks on the hosts and make sure there is no big difference between the MTU settings.

And my last idea for now is using tcpdump or tshark to see if there are any rejected or retransmitted packets in the network traffic that could slow it down.
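
For Swarm overlay traffic specifically, the VXLAN tunnel between the nodes runs over UDP port 4789 by default, so a capture on the host could look like this (the interface name is just an example):

# capture the overlay (VXLAN) traffic for later analysis in Wireshark/tshark
tcpdump -i eth0 -n -w overlay.pcap 'udp port 4789'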

Hi,

Yes, the disk on the client side is also “fast” (tested on different clients as well), and since I tested the workaround (wormhole) on the same client and got good speed, I am pretty sure it is not a client limitation.

Wormhole is software that (somewhat simplified) creates a “p2p tunnel” (much like WebRTC) between two computers to e.g. send files. Wormhole is not aware that it runs in Docker and does not use any Docker-specific “tricks”; I only used it to see if I could get around the download limitation of scp.

We will do some tests in the coming days and see what tcpdump can show. However, unless there is some other way to publish a port in a Docker Swarm (with no direct client access to the swarm workers) than to route it over the ingress network, we need to use the mesh network.

If it’s WebRTC, it might use TURN (“Traversal Using Relays around NAT”) servers if direct communication fails. That would explain the slowness.

Hi,

To avoid misunderstandings:

  • wormhole is fast in both uplink and downlink
  • scp/rsync is fast in uplink but slow in downlink

Wormhole was only used as a workaround to avoid the mesh network; the final goal is to get the speed of scp to work as expected.

I wouldn’t recommend alternative networks just because one is slower than expected, so the network tests are only for figuring out where the real issue is. Then we can point you in the right direction to report the issue if we can’t help.

Hi All,

I have now done some more investigation and I thought I should share the results but first some background.

We have two Docker Swarms, let’s call them prod and dev, running on the same HW (i.e. same links and switches, but different VLANs and on similar yet different servers).
All network links are 10 Gbit (SFP+, MikroTik CRS326-24S+2Q+); the network cards vary from Intel X710 to some cheap ASUS AQC100 cards.
There are however three main differences between these environments: in the prod environment we have 12 workers while in dev we have 2; dev is running Docker Engine 28.1.1 (also tested with 28.3.3) on kernel 5.15.0-152-generic (Ubuntu jammy), while prod is running 28.2.2 on kernel 6.8.0-60-generic (Ubuntu noble). “Some Swarm services are not discoverable over DNS · Issue #50129 · moby/moby · GitHub” affects 28.2.2, but it is not exactly the issue we have.

In the dev environment everything works more or less as expected (same results with both 28.1.1 and 28.3.3): with iperf3 I get around 6 Gbit/s node to node (i.e. without the mesh) and around 4 Gbit/s over the mesh (i.e. container to container). There seems to be a cap on the mesh network at around 5-6 Gbit/s (tested with both Docker nodes on the same VM host, where the raw capacity between the Docker nodes is around 18 Gbit/s). The speed is the same in both directions.
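
For reference, these numbers come from tests of roughly this kind (the IP is a placeholder); iperf3’s -R flag reverses the direction so both ways can be measured from the same side:

# on one node/container
iperf3 -s
# on the other: forward direction, then reverse
iperf3 -c 10.0.1.5
iperf3 -c 10.0.1.5 -R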

However, in the prod environment I still get around 6 Gbit/s node to node (in both directions), but container to container I get 4 Gbit/s in one direction and around 8 Mbit/s in the other.

Our next test will be to upgrade the dev environment to the same kernel version and OS as prod and see if we can replicate the issue.

Can the mesh network route the traffic over different paths for uplink and downlink traffic? And how can I debug this?

Regards

Short update;

Updated dev and prod to the latest versions, so they are now on the same Docker, kernel, and OS versions.
This did not help: dev works well (as before), while prod has good performance (4 Gbit/s) in one direction but not in the other (10 Mbit/s) over the overlay network. Node-to-node communication is still around 6-7 Gbit/s.

I forgot to mention that 3 nodes in the prod swarm and 1 node in dev have 1 Gbit links. I have not tested from any of them, but they are part of the swarm, in case that matters.

Regards

I can only recommend the same as before.

MTU would be shown in the output of

ip link

I rarely have to solve network issues myself and I don’t use Swarm, so I don’t have a lot of ideas, but I would definitely try tshark to see what happens with the packets. It is like Wireshark, just from the terminal.

Hi,

All nodes have the same MTU setting (1500) so that should be ok.

I have confirmed that the issue only occurs between containers running on the manager and a worker node (between containers running on two worker nodes there is no issue).

Furthermore, if I run more than one stream, i.e. iperf3 -P2, the throughput is 2-3 Gbit/s per stream, while if I run only one stream the performance is 10 Mbit/s.
I have also tested running two containers with 1 stream from each, and both get around 10 Mbit/s.
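
I.e. roughly (the IP is a placeholder):

# single stream: ~10 Mbit/s
iperf3 -c 10.0.1.5 -t 30
# two parallel streams: 2-3 Gbit/s per stream
iperf3 -c 10.0.1.5 -t 30 -P 2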

I have captured dumps on the dev and prod system and something is odd:

On the dev system I see a lot of UDP (VXLAN) packets, while on the prod system it is mostly TCP and a lot of dup ACKs.
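
One way to quantify that from the captures (the file names are placeholders):

# count retransmissions and duplicate ACKs in each dump
tshark -r prod.pcap -Y "tcp.analysis.retransmission || tcp.analysis.duplicate_ack" | wc -l
tshark -r dev.pcap -Y "tcp.analysis.retransmission || tcp.analysis.duplicate_ack" | wc -l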

I am starting to suspect something is fishy with our manager node, but what speaks against that is that the managers of both dev and prod are virtual machines running on the same physical host.

I am confused.

Regards

Then it is above my networking skills. I don’t know if it is a bug or some kind of configuration issue, so I hope someone with more Swarm overlay and networking experience finds this topic.

Have you checked and compared the native and the Docker network settings?

We see some issues in the forum where people use VLANs but the MTU is not set to 1400, so they have issues with larger packets getting corrupted.

A docker network create does not always correctly recognize the need for a smaller MTU, so it may need to be set manually.
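
For an overlay network the MTU can be set explicitly at creation time, for example (the network name and value are just examples):

docker network create -d overlay \
  --opt com.docker.network.driver.mtu=1400 my_overlay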

Also note that a default Docker Swarm network only allows 254 service containers because of the default /24 subnet.

Also, native network drivers might have jumbo frame settings. Maybe they are not the same on all machines.
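
A quick way to compare that across machines (run on each host):

# shows the MTU of every interface at a glance
ip -br link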

  • MTU on all hosts 1500
  • MTU inside containers 1450
  • Switches have jumbo frames enabled, so the link-layer PDU is >= 9000 bytes.

Swarm network created as a /16 network.

I could try to increase the MTU on the hosts, but unfortunately I will not be able to test modifications in the prod system for some time, as students are coming back to classes this week. I will try to replicate the issue in the dev system.

Has anyone had issues with network card offloading or similar Proxmox network issues? (We recently migrated from VMware to Proxmox.)
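
In case it helps, the current offload settings inside a VM can be listed like this (eth0 is just an example interface name):

# list all offload features of the vNIC
ethtool -k eth0
# or just the checksum/segmentation related ones
ethtool -k eth0 | grep -E 'checksum|segmentation'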

I migrated my homelab from VMware to Proxmox 6 years ago and ran a Swarm cluster there. I use 10G NICs in my homelab, but never bothered to switch to jumbo frames. I do remember that I started with lift-and-shift using the VMware vmxnet3 vNICs, but eventually replaced the VMs with Ubuntu cloud images ([distro]-server-cloudimg-amd64.img) to leverage cloud-init. I also switched the vNICs to virtio. Never had any performance issues since.

Hi, for those interested: I think we have reached the end of the road on this one and will (when time allows) reinstall the Docker environment (at least the manager).

Anyway, I will share some final results of the analysis.
The networks are configured in the same way:
(Screenshots: on the left the network that works fine in our dev environment, on the right the prod environment.)

The speed between the Docker hosts is similar (no screenshot of that, unfortunately). Looking at the TCP traces, we can see in the prod environment a long delay after a build-up of large packet drops.

This does not happen in the dev environment or in the other direction.

This could be caused by faulty checksum offloading.

Try disabling it in the VM: ethtool -K eth0 tx-checksum-ip-generic off (of course you need to replace the interface with the affected one).
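
If that turns out to be the culprit, the current state can be verified and related offloads toggled the same way (eth0 is again a placeholder):

# check the current offload state
ethtool -k eth0 | grep checksum
# generic receive/segmentation offloads can be switched off the same way if needed
ethtool -K eth0 gro off gso off tso off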
