Docker Swarm failing to route TFTP traffic between servers in latest version of Docker (v24)

I’m trying to create two services in Docker swarm that can TFTP files between them.

The docker-compose.yml file I provide to swarm is as follows:

version: "3"

services:
  tftp-server:
    image: tftp-server
    ports:
      - "69/udp"
      - "40000:40000/udp"
      - "40001:40001/udp"
  tftp-client:
    image: tftp-client

And I deploy it with:

docker stack deploy -c docker-compose.yml tftp-example

My tftp-server image is built from the following Dockerfile:

FROM alpine:latest

RUN apk update; \
        apk add tftp-hpa tcpdump

RUN mkdir /tftpd; \
        touch /tftpd/test.txt; \
        chmod -R 777 /tftpd

EXPOSE 69/udp
EXPOSE 40000/udp
EXPOSE 40001/udp

CMD in.tftpd -4 -Lvvv -R 40000:40001 --address 0.0.0.0:69 /tftpd

and my client is built from the following Dockerfile:

FROM alpine:latest

RUN apk update; \
        apk add tftp-hpa

RUN echo "hello world" >> test.txt

CMD tail -f /dev/null

Once deployed, I exec into the client with the following command:

docker container exec -it $(docker container ls -aqf name=client) sh

And attempt a TFTP transfer with:

tftp -vvv tftp-example_tftp-server 69 -R 40000:40001 -c put /test.txt /tftpd/test.txt

With Docker version 17.03.1-ce, build c6d412e running on CentOS 7, this works as expected. The tftp command runs successfully with the following output:

Connected to tftp-example_tftp-server (10.0.0.4), port 69
putting /test.txt to tftp-example_tftp-server:/tftpd/test.txt [netascii]
Sent 13 bytes in 0.0 seconds [7217 bit/s]

Running tcpdump -i any within the tftp-server container shows the following traffic, which confirms it’s working. From the output, it appears the two containers are talking to each other directly.

10:35:48.453304 eth2  In  IP tftp-example_tftp-client.1.nz949j5gydq45moyekqz0g2fj.tftp-example_default.40000 > c1a95db058b1.69: TFTP, length 27, WRQ "/tftpd/test.txt" netascii
10:35:48.453834 eth2  Out IP c1a95db058b1.40000 > tftp-example_tftp-client.1.nz949j5gydq45moyekqz0g2fj.tftp-example_default.40000: UDP, length 4
10:35:48.453907 eth2  In  IP tftp-example_tftp-client.1.nz949j5gydq45moyekqz0g2fj.tftp-example_default.40000 > c1a95db058b1.40000: UDP, length 17
10:35:48.454006 eth2  Out IP c1a95db058b1.40000 > tftp-example_tftp-client.1.nz949j5gydq45moyekqz0g2fj.tftp-example_default.40000: UDP, length 4

However, on Rocky 9, running Docker version 24.0.6, build ed223bc, with the exact same docker-compose.yml and Dockerfiles, the tftp command hangs, times out and seg faults (:rofl:) with the following output:

Connected to tftp-example_tftp-server (10.0.3.7), port 69
putting /test.txt to tftp-example_tftp-server:/tftpd/test.txt [netascii]
Transfer timed out.
Segmentation fault (core dumped)

The tcpdump -i any output when run within the tftp-server container is as follows:

10:42:35.951161 eth1  B   ARP, Request who-has 10.0.3.7 tell tftp-example_tftp-client.1.jh6hjhn9qwu3oiz0ac8ni0e6v.tftp-example_default, length 28
10:42:35.951291 eth1  B   ARP, Request who-has 1cfd400d96d7 tell 10.0.3.4, length 28
10:42:35.951303 eth1  Out ARP, Reply 1cfd400d96d7 is-at 02:42:0a:00:03:08 (oui Unknown), length 28
10:42:35.951313 eth1  In  IP 10.0.3.4.40000 > 1cfd400d96d7.69: TFTP, length 27, WRQ "/tftpd/test.txt" netascii
10:42:35.951894 eth1  Out IP 1cfd400d96d7.40000 > 10.0.3.4.40000: UDP, length 4
10:42:35.951922 eth1  In  IP 10.0.3.4 > 1cfd400d96d7: ICMP 10.0.3.4 udp port 40000 unreachable, length 40

In this case, 1cfd400d96d7 is the container ID for the tftp-server, and interestingly, 10.0.3.4 is the load balancer that Docker Swarm creates for the network that both services belong to (in this example, it is given the name lb-tftp-example_default).

According to the TFTP RFC:

The initial request happens over (conventionally) port 69. Then high ephemeral ports are used for the actual file transfer.

So from the tcpdump, it seems the tftp-client service talks to the tftp-server service on port 69, as expected. The server then tries to begin the file transfer over a high ephemeral port, sending its response to the swarm service load balancer. But rather than the LB forwarding the response on to the tftp-client container, it just returns “port 40000 (in this case) on the LB is unreachable”.
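For anyone trying to reproduce this, the relevant traffic is easy to isolate inside the tftp-server container with something like the following, since both ends are pinned to ports 40000:40001 by the -R flags:

tcpdump -ni any 'udp port 69 or udp portrange 40000-40001 or icmp'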

Does anyone know why this is happening? Is this a bug in the newer version of Docker Engine (specifically Swarm?) or is there some configuration that I’m missing to enable this to work?

Any help would be very greatly appreciated!

You use Docker Swarm and stack deploy, but the compose has no deploy section?

When you just use compose (no stack deploy), it would place the two services on a default Docker network.

Swarm does not create a load balancer for a network, it only provides a gateway.

Each service, on the other hand, uses an IPVS VIP, which indeed balances the traffic to the service tasks.

If endpoint_mode: dnsrr is used, the service name would instead be resolved to a multi-value DNS record with the IPs of the service tasks. Traffic between nodes will use the host’s gw_bridge interface for communication.
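A rough sketch of what that would look like in the compose file (untested here; as far as I know, dnsrr can’t be combined with ingress-mode published ports, so those would need to be removed or switched to mode: host):

services:
  tftp-server:
    image: tftp-server
    deploy:
      endpoint_mode: dnsrr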

Can you share the output of docker version and docker info?

Are you referring to this deploy section?

Yes, you are correct; it does auto-generate a network and places both services in that network.
Are you suggesting setting up the docker-compose.yml configuration differently? And if so, how would that address the problem in question?

Swarm does not create a load balancer for a network, it only provides a gateway.

When you run docker network inspect on the auto-generated network that the two services are placed in, the output includes this in the Containers section:

...
        "Containers": {
            "6b08ef3df1589076e15ef101d44e92a3ba7c2ee75b9df7a1cd80138d775cae55": {
                "Name": "tftp-example_tftp-server.1.a594gwhwg1wsv9vm2e5wu1fpa",
                "EndpointID": "58caf61714e75d64bdc0014e92528f2fcb5f42fec4e0c2831cbc15bf248b80ed",
                "MacAddress": "02:42:0a:00:05:06",
                "IPv4Address": "10.0.5.6/24",
                "IPv6Address": ""
            },
            "e513df3b5be5a585cfd01fe1231083d68c656dae31f330eb61678be58035a9d7": {
                "Name": "tftp-example_tftp-client.1.rj3bnfnis8h3s5jb5nzxxfcsx",
                "EndpointID": "02a5786006187a19b13643beb61f46f0c9838837bfd443efccdf773b07717437",
                "MacAddress": "02:42:0a:00:05:08",
                "IPv4Address": "10.0.5.8/24",
                "IPv6Address": ""
            },
            "lb-tftp-example_default": {
                "Name": "tftp-example_default-endpoint",
                "EndpointID": "3853565e50973ca3dd563902461ac093d2a1d524ff2d81088cbdf4e02e0b300c",
                "MacAddress": "02:42:0a:00:05:04",
                "IPv4Address": "10.0.5.4/24",
                "IPv6Address": ""
            }
        },
...

I’m assuming lb-tftp-example_default is a load balancer that Docker Swarm provisions? I’m assuming that’s what lb means in this case. Do correct me if I’m wrong!

TFTP traffic between the server and client is routed via this lb-tftp-example_default container, which seems to be performing some NAT on the requests, but when the TFTP server responds to the client, lb-tftp-example_default fails to forward the traffic back to the client and just responds with “port X is unreachable”.
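To double-check which hop drops the reply, the same kind of capture can be run from inside the client container (tcpdump isn’t in the client image, so it needs installing first):

docker container exec -it $(docker container ls -aqf name=client) sh
apk add tcpdump
tcpdump -ni any 'udp portrange 40000-40001 or icmp'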

If endpoint_mode: dnsrr is used, the service name would instead be resolved to a multi-value DNS record with the IPs of the service tasks. Traffic between nodes will use the host’s gw_bridge interface for communication.

I’ll give endpoint_mode: dnsrr a go to see if this changes the behaviour.

Sure, my docker version is:

Client: Docker Engine - Community
 Version:           24.0.6
 API version:       1.43
 Go version:        go1.20.7
 Git commit:        ed223bc
 Built:             Mon Sep  4 12:33:18 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.6
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.7
  Git commit:       1a79695
  Built:            Mon Sep  4 12:31:49 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.24
  GitCommit:        61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
 runc:
  Version:          1.1.9
  GitCommit:        v1.1.9-0-gccaecfc
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

and my docker info is:

Client: Docker Engine - Community
 Version:    24.0.6
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.21.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 3
  Running: 3
  Paused: 0
  Stopped: 0
 Images: 4
 Server Version: 24.0.6
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: v5lazx6ad3ua411dgfaeaigk9
  Is Manager: true
  ClusterID: wvbmsebxryr3dj5hyc3julxdb
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.0.1.23
  Manager Addresses:
   10.0.1.23:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
 runc version: v1.1.9-0-gccaecfc
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.14.0-284.30.1.el9_2.x86_64
 Operating System: Rocky Linux 9.2 (Blue Onyx)
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.506GiB
 Name: mozart
 ID: ed690a0a-1599-4b53-8096-bbbb7aa4ac0d
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Seems you are correct :slight_smile: Though, I am unclear why an lb container (!) exists in each overlay network and where it’s actually used. It is not the VIP of a service; in fact, none of them use the lb IP.

What do you want to achieve? Do you want to run the two services on the same node, or do you have multiple Swarm nodes? Do you want to use the externally published ports or the internal ports within a Docker (overlay?) network for the communication? Should outside clients be able to connect to the server later on?

Ideally it should all run on one node. The TFTP client and server only need to communicate with each other, so it would be great if their ports were only open on the Docker network and not accessible from the host OS.
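So for this cut-down example, the published ports aren’t actually needed; something like the following (nothing published to the host, only the overlay network the stack creates) would suit my purposes, assuming the routing issue itself gets resolved:

version: "3"

services:
  tftp-server:
    image: tftp-server
  tftp-client:
    image: tftp-client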

I understand this seems an odd setup. The config and Dockerfiles I’ve provided are a very simplified part of a much larger microservices system I’ve inherited (it’s a single-node reference/test environment which imitates a multi-node production instance). When I uplifted Docker in this inherited system, I ran into the problem I outlined above with the TFTP containers, and I reduced it down to the simplified example above to help understand the problem.

We use Docker Swarm and simply use a Docker overlay network, which is a Docker network spanning all Swarm nodes, with all needed services attached to it. See the simple Traefik Swarm example.

I did some more research, and it appears the network’s lb container is indeed responsible for implementing the service VIP.

Get the service vip:

me@swarm1:~$ docker service inspect portainer_agent --format '{{json .Endpoint}}'
{"Spec":{"Mode":"vip"},"VirtualIPs":[{"NetworkID":"bz0dwtk3prov6c3jwahjywxl5","Addr":"10.0.4.5/24"}]}

Confirm that name resolution for the service name returns the service vip:

me@swarm1:~$ docker run --rm -it --net container:$(docker ps -q --filter name=portainer_agent) nicolaka/netshoot nslookup agent
Server:         127.0.0.11
Address:        127.0.0.11#53

Non-authoritative answer:
Name:   agent
Address: 10.0.4.5

Check name resolution for the service tasks (multi-value dns response with all service task ips):

me@swarm1:~$ docker run --rm -it --net container:$(docker ps -q --filter name=portainer_agent) nicolaka/netshoot nslookup tasks.agent
Server:         127.0.0.11
Address:        127.0.0.11#53

Non-authoritative answer:
Name:   tasks.agent
Address: 10.0.4.8
Name:   tasks.agent
Address: 10.0.4.6
Name:   tasks.agent
Address: 10.0.4.7

Use nsenter to enter the network namespace of the network lb container and query ipvsadm:

me@swarm1:~$ short_id=$(docker network ls  --format '{{ slice .ID 0 9 }}'  --filter name=portainer)
me@swarm1:~$ sudo nsenter --net=/var/run/docker/netns/lb_${short_id} ipvsadm -l -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
FWM  267 rr
  -> 10.0.4.6:0                   Masq    1      0          0
  -> 10.0.4.7:0                   Masq    1      0          0
  -> 10.0.4.8:0                   Masq    1      0          0
FWM  273 rr
  -> 10.0.4.3:0                   Masq    1      0          0

The three service task IPs are listed underneath FWM 267. The other entry is Portainer.

There is still a missing piece: I don’t see how the service vip (.5) is connected to the network lb container (.4).

I must admit I never noticed the lb before and never looked deeper. I always assumed docker would create the network namespaces and use ipvs on the host without creating a container for it.
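If I had to guess at the missing piece: I would expect an iptables MARK rule in the mangle table of that same lb namespace, mapping the VIP to the firewall mark (267 above) that ipvs then balances on. An untested guess, but if so it should be visible with:

sudo nsenter --net=/var/run/docker/netns/lb_${short_id} iptables -t mangle -L -n -v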

Many thanks for doing more research into this for me. Apologies for the late reply.
It seems something is going wrong with SNAT’ing as traffic goes between the load balancer and the services, but I haven’t had a chance to look at it further. I’ve got some more things I want to try, but many thanks for your help.
In some way I’m glad to know it isn’t something obvious and that I wasn’t making a silly mistake.
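When I do get the chance, my plan is to start by looking at the NAT rules and connection tracking state in the lb namespace on the Rocky 9 host, along the lines of (using the same nsenter approach as above, and assuming conntrack-tools is installed on the host):

short_id=$(docker network ls --format '{{ slice .ID 0 9 }}' --filter name=tftp-example_default)
sudo nsenter --net=/var/run/docker/netns/lb_${short_id} iptables -t nat -L -n -v
sudo nsenter --net=/var/run/docker/netns/lb_${short_id} conntrack -L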

Cheers

It is indeed not obvious :slight_smile:

We recently had a topic where someone had problems with overlay traffic communication in swarm. It was caused by the NSX service on ESXi hosts the swarm vm’s were running on. By any chance, is your Rocky vm running on such a host/cluster?

My Rocky VM is running on AWS EC2. Not sure what hypervisor AWS uses or how it relates to ESXi. The microservices system where the problem originally occurred runs on completely different kit, so I imagine it’s hard to draw a good comparison.
A colleague recently managed to deploy our microservice system without the issue occurring, which is great, but we’re not sure what he’s done differently to get it to work. We might do some investigation when we get the chance.
I’ll update here if we ever get to the bottom of it, but otherwise, many thanks for your help!
Cheers

You shouldn’t have the problem that ESXi has, where a service prevents port 4789 communication on all VMs. AWS uses the Nitro hypervisor, which is based on KVM. I never experienced any problems with it.

My experience is that on AWS EC2 everything just works, if the security groups are configured properly according to the Docker docs.

Though, it has been 5 years now since I last used or saw Swarm in production in any of my projects. All of them use k8s, preferably the hyperscaler’s own flavor or OpenShift.