Docker swarm overlay network not working on AWS

Hey guys. So I’m trying to set up Swarm on two AWS instances. Everything works except the overlay network; it seems Docker isn’t connecting to it at all. I’ve already enabled all the necessary ports in the security groups for both instances in the AWS console. Running tcpdump on port 4789 shows no traffic at all, but it does show traffic on ports 2377 and 7946.
I’ve tried this configuration on Docker 19.03.4-ce and also 18.06.3-ce, but I get the same result.
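
For reference, these are roughly the commands I’m using to watch the swarm ports (a sketch; eth0 is the interface name on these instances, adjust if yours differs):

    # VXLAN data-plane traffic for the overlay network (UDP 4789) - stays silent
    sudo tcpdump -nn -i eth0 udp port 4789

    # cluster management (TCP 2377) and gossip (TCP/UDP 7946) - both show traffic
    sudo tcpdump -nn -i eth0 port 2377 or port 7946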

Here’s the docker info output for the manager node:

    Containers: 1
     Running: 0
     Paused: 0
     Stopped: 1
    Images: 6
    Server Version: 18.06.3-ce
    Storage Driver: overlay2
     Backing Filesystem: extfs
     Supports d_type: true
     Native Overlay Diff: true
    Logging Driver: json-file
    Cgroup Driver: cgroupfs
    Plugins:
     Volume: local
     Network: bridge host macvlan null overlay
     Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
    Swarm: active
     NodeID: yiyvvk7rcz033eorn9l0rydxd
     Is Manager: true
     ClusterID: y01nnnejg938bz1mtpjj6n1uo
     Managers: 1
     Nodes: 3
     Orchestration:
      Task History Retention Limit: 5
     Raft:
      Snapshot Interval: 10000
      Number of Old Snapshots to Retain: 0
      Heartbeat Tick: 1
      Election Tick: 10
     Dispatcher:
      Heartbeat Period: 5 seconds
     CA Configuration:
      Expiry Duration: 3 months
      Force Rotate: 0
     Autolock Managers: false
     Root Rotation In Progress: false
     Node Address: 18.184.183.97
     Manager Addresses:
      18.184.183.97:2377
    Runtimes: runc
    Default Runtime: runc
    Init Binary: docker-init
    containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
    runc version: a592beb5bc4c4092b1b1bac971afed27687340c5
    init version: fec3683
    Security Options:
     apparmor
     seccomp
      Profile: default
    Kernel Version: 4.15.0-1052-aws
    Operating System: Ubuntu 18.04.3 LTS
    OSType: linux
    Architecture: x86_64
    CPUs: 1
    Total Memory: 983.9MiB
    Name: ip-172-31-12-68
    ID: 5OQ4:XQZ5:M3ML:WEC5:2WTA:JTXX:ABK6:DYUK:3Z27:R4IH:ECGH:KEYU
    Docker Root Dir: /var/lib/docker
    Debug Mode (client): false
    Debug Mode (server): false
    Registry: https://index.docker.io/v1/
    Labels:
    Experimental: false
    Insecure Registries:
     127.0.0.0/8
    Live Restore Enabled: false

Here’s the output for the worker node:

    Containers: 2
     Running: 1
     Paused: 0
     Stopped: 1
    Images: 3
    Server Version: 18.06.3-ce
    Storage Driver: overlay2
     Backing Filesystem: extfs
     Supports d_type: true
     Native Overlay Diff: true
    Logging Driver: json-file
    Cgroup Driver: cgroupfs
    Plugins:
     Volume: local
     Network: bridge host macvlan null overlay
     Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
    Swarm: active
     NodeID: t0a1ttjoctmlhqobl1ur5qcwy
     Is Manager: false
     Node Address: 3.120.139.109
     Manager Addresses:
      18.184.183.97:2377
    Runtimes: runc
    Default Runtime: runc
    Init Binary: docker-init
    containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
    runc version: a592beb5bc4c4092b1b1bac971afed27687340c5
    init version: fec3683
    Security Options:
     apparmor
     seccomp
      Profile: default
    Kernel Version: 4.15.0-1052-aws
    Operating System: Ubuntu 18.04.3 LTS
    OSType: linux
    Architecture: x86_64
    CPUs: 1
    Total Memory: 983.9MiB
    Name: ip-172-31-4-225
    ID: DPB6:5FSV:DTSC:XVI3:Z762:4KJA:QZGE:FPGJ:KFAR:UHBC:L6AH:D2I6
    Docker Root Dir: /var/lib/docker
    Debug Mode (client): false
    Debug Mode (server): false
    Registry: https://index.docker.io/v1/
    Labels:
    Experimental: false
    Insecure Registries:
     127.0.0.0/8
    Live Restore Enabled: false

The interesting thing is that the same configuration works on DigitalOcean without any issues.

I suspect there’s some AWS-specific setup required, but I’m not entirely sure.

As I used to run Swarm without issues on AWS for roughly two years, I am quite sure this cannot be a general problem.

Usually those ports are all you need. Still, why not loosen the rules to accept traffic on all ports, test, and if that works, lock the rules down again.
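
For reference, assuming both instances share one security group, the usual swarm ports can be opened with the AWS CLI roughly like this (sg-12345678 is a placeholder for your security group ID):

    # cluster management
    aws ec2 authorize-security-group-ingress --group-id sg-12345678 --protocol tcp --port 2377 --source-group sg-12345678
    # node-to-node gossip
    aws ec2 authorize-security-group-ingress --group-id sg-12345678 --protocol tcp --port 7946 --source-group sg-12345678
    aws ec2 authorize-security-group-ingress --group-id sg-12345678 --protocol udp --port 7946 --source-group sg-12345678
    # overlay network (VXLAN) data traffic
    aws ec2 authorize-security-group-ingress --group-id sg-12345678 --protocol udp --port 4789 --source-group sg-12345678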

I’ve actually tried that already. Port 4789 still keeps misbehaving.

@meyay I’ve just run telnet on the manager node; port 7946 gives the following result:

    telnet ip 7946
    Trying ip...
    Connected to ip.

While port 4789 gives:

    telnet ip 4789
    Trying ip...
    telnet: Unable to connect to remote host: Connection timed out

Could this be an issue on AWS and not Docker Swarm itself?

Your result is not really surprising, as telnet cannot connect to UDP ports.
You might want to follow this SO response to see examples of how to test UDP connections.

Are you sure you created rules for UDP traffic as well?
Are you using encryption on the swarm network?
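
As a rough example with socat (placeholder IP; the listener side may refuse to bind 4789 if the kernel’s VXLAN socket already holds it):

    # on the receiving node: listen for UDP datagrams on 4789
    socat - UDP-LISTEN:4789

    # on the sending node: send a test datagram to the receiver
    echo test | socat - UDP:<receiver-ip>:4789

If the test datagram shows up on the listener, UDP traffic on that port gets through.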

I ran socat - UDP:ip:4789 and socat - UDP:ip:7946 and there was no output. It seems UDP traffic is not being transmitted :thinking:.
These are current rules on AWS (temporary):

The firewall isn’t enabled on either server.
I haven’t enabled encryption either.

For anybody who faces this issue: I got it working by using the private IPs instead of the public ones.

Thanks @michaelbukachi !

How did you do that? Did you need to init the swarm manager again? I don’t want to stop my swarm to apply this change :frowning:

No, I didn’t have to. When joining a swarm just use the private IP address of the server instead of the public one.
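
For a fresh setup that would look roughly like this (a sketch; the private addresses are taken from the instance hostnames above, 172.31.12.68 for the manager and 172.31.4.225 for the worker):

    # on the manager: advertise the private address instead of the public one
    docker swarm init --advertise-addr 172.31.12.68

    # on the worker: join via the manager's private address
    docker swarm join --token <worker-token> 172.31.12.68:2377

The worker token comes from docker swarm join-token worker on the manager.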

So I’m going through the same thing. Do you have an example of the command you used to correct this? I’m doing a Docker Swarm deployment through docker-compose; would the same approach translate to that as well?

I strongly recommend not exposing the Docker cluster backplane to the internet.

Make sure to use a private network, and then use an ALB with a target group that points to all the nodes in the private network that should serve content to the internet.

Put one or more jumphosts in one of your public subnets and use them either to connect to your manager nodes in the private subnets or, even better, to a dedicated “control” instance that only has the Docker CLI and uses docker context to communicate with the manager nodes.
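
From such a control instance, that could look roughly like this (user, address and context name are placeholders):

    # create a context that talks to the manager over SSH
    docker context create swarm-manager --docker "host=ssh://ubuntu@10.0.1.10"

    # run regular docker commands against the remote manager
    docker --context swarm-manager node ls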

Though, if you are already on AWS, why not just use EKS (= managed Kubernetes) or ECS (= managed containers)?

Update: I never used it that way, but it is possible to perform remote docker-compose deployments with ECS, see: Deploying Docker containers on ECS | Docker Documentation
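
I have not verified this myself, but according to that page the flow is roughly this (the context name is a placeholder):

    # create an ECS context (prompts for AWS credentials / profile)
    docker context create ecs myecscontext

    # deploy the compose file to ECS using that context
    docker context use myecscontext
    docker compose up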

Hey @meyay ,

Thanks again for the feedback I appreciate it greatly!
So I have a service that is deployed to Docker Swarm and also includes a tunneling service that routes my traffic to an ngrok URL.
Would that still be bad practice for a Docker Swarm deployment even if the service gets tunneled to a specific ngrok URL?
I haven’t tried the AWS ECS service, but I may in the future should this set of services actually get deployed successfully. I’ll definitely try out their service later if I can.

Thanks again,

-Brian

I assume ngrok provides some sort of reverse port forwarding, with the public entrypoint somewhere at
ngrok, tunneling the traffic through to your nodes? I haven’t used it, as I either use public ALBs, public NLBs, or API Gateway + PrivateLink + private NLB, depending on the requirement.

That says nothing about whether your nodes are on a public or private network: since you initiate the connection from one of your nodes to ngrok, the traffic could leave the network through a NAT gateway or a public IP. A swarm cluster is not the sort of payload you should run in a public subnet.

ECS/EKS has the charm that you can run auto-scaling groups for the nodes (or even use fully managed Fargate nodes), which are managed by AWS: you simply deploy your stuff without having to administer or patch the nodes in any way. I would create neither a Docker Swarm cluster nor a Kubernetes cluster manually on AWS. The drawbacks outweigh the advantages…
