All processes are on the same machine

I have created a 20-machine swarm on EC2. The problem is that every container I run ends up on the same machine, regardless of the --hostname I specify. This seems broken in the first place, since the default scheduling strategy is 'spread'.

The script I use to create the cluster is below. The cluster itself comes up fine, but it is not distributing work correctly.

#!/bin/bash

#
# Setup a docker swarm with 20 workers
#

# Spawn an instance to generate a Swarm token
docker-machine create \
    --driver amazonec2 \
    --amazonec2-instance-type m3.medium \
    --amazonec2-subnet-id subnet-40502c36 \
    --amazonec2-zone=c \
    --amazonec2-vpc-id=vpc-66f0e002 \
    'aws.swarm-token-machine'

# Setup environment to run a command on this node
eval "$(docker-machine env 'aws.swarm-token-machine')"

# Create a token for our swarm cluster and set it in our environment
export SWARM_CLUSTER_TOKEN=$(docker run --rm swarm create)

# Create 20 swarm workers at once
# (note: bash brace expansion can't use a variable, hence seq)
WORKER_COUNT=20
for i in $(seq 1 "$WORKER_COUNT")
do
    docker-machine create \
    --driver amazonec2 \
    --amazonec2-instance-type m3.medium \
    --amazonec2-subnet-id subnet-40502c36 \
    --amazonec2-zone=c \
    --amazonec2-vpc-id=vpc-66f0e002 \
    --swarm \
    --swarm-discovery token://$SWARM_CLUSTER_TOKEN \
    aws.agent$i &
done

# Wait for the backgrounded creates to finish, then create the swarm master
# (the eval below expects this machine to exist)
wait
docker-machine create \
    --driver amazonec2 \
    --amazonec2-instance-type m3.medium \
    --amazonec2-subnet-id subnet-40502c36 \
    --amazonec2-zone=c \
    --amazonec2-vpc-id=vpc-66f0e002 \
    --swarm \
    --swarm-master \
    --swarm-discovery token://$SWARM_CLUSTER_TOKEN \
    'aws.swarm-master'

# Set our environment to this 20 machine swarm
eval "$(docker-machine env --swarm 'aws.swarm-master')"

#
# Selenium stuff
#

# Setup a selenium hub
# (NOTE: --hostname only sets the hostname inside the container;
# it does not control which node Swarm schedules it on)
docker run -d \
    --name selenium-hub \
    -p 4444:4444 \
    --hostname aws.agent1 \
    selenium/hub:2.53.0

# Setup 20 Chrome nodes linked to the hub
for i in {1..20}
do
  docker run -d --name=chrome-node-$i --link selenium-hub:hub selenium/node-chrome:2.53.0
done

Why do all containers stick to one machine, and how can I fix this? I have tried using --hostname in my docker run commands, but it has no effect.
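(As an aside: standalone Swarm controls placement with scheduling constraints rather than --hostname. A minimal sketch, using one of this cluster's node names:

docker run -d -e constraint:node==aws.agent1 selenium/hub:2.53.0
)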

This is the output of docker ps:

CONTAINER ID        IMAGE ... STATUS              PORTS                         NAMES
5a81c6a3e381        selenium/node-chrome:2.53.0 ... Up 5 seconds                                      aws.agent13/chrome-node-2
74f46a0b5809        selenium/node-chrome:2.53.0 ... Up 2 minutes                                      aws.agent13/chrome-node-1
244e4d3092cd        selenium/hub:2.53.0 ... Up 4 minutes        54.86.124.43:4444->4444/tcp   aws.agent13/selenium-hub,aws.agent13/chrome-node-1/hub

It looks like you are using the default network, and using --link. I would expect those nodes to be scheduled onto the same host as the container you are linking to.

If you were to set up an overlay network, then those containers could communicate with each other on any node.

Check out the docker network create stuff.
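For example, a minimal sketch with a made-up network name (it assumes each engine was started with the --cluster-store and --cluster-advertise options, which overlay networking requires):

docker network create -d overlay my-overlay

Containers run with --net=my-overlay can then reach each other from any node.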

I will try your suggestion. You mean that linked containers have to run on the same agent? That is very unexpected, and I do not understand it, so I will check out the docs you mention.

In the meantime, maybe this isn't a valid issue after all, but I already created one here: https://github.com/docker/docker/issues/21968

When you are using the default bridge network, --link only works between containers on the same host that are connected to that same network.

The overlay network driver was specifically developed to achieve multi-host container networking. If you use an overlay network, you can --link between hosts. You could also skip the --link feature entirely and use the new service discovery feature: basically, all containers connected to the same docker network can reach any other container on that network by referencing its --name.

Thanks! OK, so if I create an overlay network and run the currently-linked containers with the --net option, they will then be spread across the cluster?

That is correct, although if you do use the docker network create stuff, you don’t need to use the --link feature.

Without --link, how will a selenium node know how to connect to the selenium hub?

You can use the new network discovery system.

Basically, you can reach another container by resolving the name of that container.

For example:

$ docker network create foo
$ docker run -d --name web --net=foo nginx
$ docker run --rm -it --net=foo alpine ping -c 1 web
PING web (10.0.10.2): 56 data bytes
64 bytes from 10.0.10.2: seq=0 ttl=64 time=0.119 ms

--- web ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss

Docker runs an internal embedded DNS server that allows this to happen: https://docs.docker.com/engine/userguide/networking/configure-dns/
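Applied to the Selenium setup above, a sketch (the HUB_PORT_4444_TCP_ADDR/HUB_PORT_4444_TCP_PORT variables are an assumption about what the 2.53.0 node image expects in place of the env vars --link would have injected):

# Create an overlay network for the grid
docker network create -d overlay selenium-grid

# Hub and nodes join the same network; nodes resolve the hub by name
docker run -d --name selenium-hub --net=selenium-grid -p 4444:4444 selenium/hub:2.53.0

for i in {1..20}
do
  # These env vars mimic what --link would have set (assumption about
  # the selenium/node-chrome entrypoint)
  docker run -d --name=chrome-node-$i --net=selenium-grid \
    -e HUB_PORT_4444_TCP_ADDR=selenium-hub \
    -e HUB_PORT_4444_TCP_PORT=4444 \
    selenium/node-chrome:2.53.0
done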

Thanks. It turns out that to use overlay networks I have to run a distributed key-value store. Now I'm having trouble getting the swarm master to finish booting when I use consul as that key-value store. The keystore boots up OK, but the swarm master simply never finishes initializing.

# Setup consul as the key-value store for our overlay network,
# so linked containers can be spread across the cluster
docker-machine create \
    --driver amazonec2 \
    --amazonec2-instance-type m3.medium \
    --amazonec2-subnet-id subnet-40502c36 \
    --amazonec2-zone=c \
    --amazonec2-vpc-id=vpc-66f0e002 \
    'aws.mh-keystore'

eval "$(docker-machine env aws.mh-keystore)"

docker run -d \
    -p "8500:8500" \
    -h "consul" \
    progrium/consul -server -bootstrap
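
# Quick sanity check before creating the master (assumes the standard
# Consul HTTP API on port 8500): this should print the leader address
curl "http://$(docker-machine ip aws.mh-keystore):8500/v1/status/leader"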

# Create the swarm master
docker-machine create \
    --driver amazonec2 \
    --amazonec2-instance-type m3.medium \
    --amazonec2-subnet-id subnet-40502c36 \
    --amazonec2-zone=c \
    --amazonec2-vpc-id=vpc-66f0e002 \
    --swarm \
    --swarm-master \
    --swarm-discovery="consul://$(docker-machine ip aws.mh-keystore):8500" \
    --engine-opt="cluster-store=consul://$(docker-machine ip aws.mh-keystore):8500" \
    --engine-opt="cluster-advertise=eth1:2376" \
    aws.swarm-master

But docker-machine can't reach the daemon on the new machine:

Running pre-create checks...
Creating machine...
(aws.swarm-master) Launching instance...
Waiting for machine to be running, this may take a few minutes...
Detecting operating system of created instance...
Waiting for SSH to be available...
Detecting the provisioner...
Provisioning with ubuntu(systemd)...
Installing Docker...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Error creating machine: Error running provisioning: Unable to verify the Docker daemon is listening: Maximum number of retries (10) exceeded

Any idea what I’m doing wrong? I checked the security group, and ports 22 and 2376 are open.

To answer my own question: once I changed eth1 to eth0, the machine comes up :) (The EC2 instances docker-machine creates only have an eth0 interface, so cluster-advertise=eth1:2376 pointed at an interface that doesn't exist.)
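A quick way to confirm which interfaces a machine actually has:

docker-machine ssh aws.swarm-master ip addr show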

Ah, now I can’t get nodes to join the swarm. Using this script:

for i in {1..20}
do
    # (the token-based create from the first script is replaced
    # with consul discovery below)
    docker-machine create \
        --driver amazonec2 \
        --amazonec2-instance-type m3.medium \
        --amazonec2-subnet-id subnet-40502c36 \
        --amazonec2-zone=c \
        --amazonec2-vpc-id=vpc-66f0e002 \
        --swarm \
        --swarm-discovery="consul://$(docker-machine ip aws.mh-keystore):8500" \
        --engine-opt="cluster-store=consul://$(docker-machine ip aws.mh-keystore):8500" \
        --engine-opt="cluster-advertise=eth1:2376" \
        aws.agent.$i &
done
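One way to see what actually registered in discovery (a sketch using the swarm image's list subcommand against the same consul URL):

docker run --rm swarm list consul://$(docker-machine ip aws.mh-keystore):8500

If this prints nothing, the engines never advertised themselves into consul (for example, because cluster-advertise names a nonexistent interface).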

docker info after I run eval "$(docker-machine env --swarm 'aws.swarm-master')" shows:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: swarm/1.2.0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 0
Plugins: 
 Volume: 
 Network: 
Kernel Version: 4.2.0-18-generic
Operating System: linux
Architecture: amd64
CPUs: 0
Total Memory: 0 B
Name: aws.swarm-master

despite docker-machine ls showing the nodes are up:

Russells-MacBook-Pro-OLD-495:marketing rjurney$ docker-machine ls
NAME               ACTIVE      DRIVER      STATE     URL                         SWARM                       DOCKER    ERRORS
aws.agent.1        -           amazonec2   Running   tcp://54.175.96.235:2376    aws.swarm-master            v1.11.0   
aws.agent.2        -           amazonec2   Running   tcp://54.175.95.149:2376    aws.swarm-master            v1.11.0   
aws.agent.3        -           amazonec2   Running   tcp://54.175.95.129:2376    aws.swarm-master            v1.11.0   
aws.agent.4        -           amazonec2   Running   tcp://54.87.141.182:2376    aws.swarm-master            v1.11.0   
aws.agent.5        -           amazonec2   Running   tcp://54.152.199.55:2376    aws.swarm-master            v1.11.0   
aws.agent.6        -           amazonec2   Running   tcp://54.165.106.166:2376   aws.swarm-master            v1.11.0   
aws.agent.7        -           amazonec2   Running   tcp://54.174.220.235:2376   aws.swarm-master            v1.11.0   
aws.agent.8        -           amazonec2   Running   tcp://52.90.145.100:2376    aws.swarm-master            v1.11.0   
aws.agent.9        -           amazonec2   Running   tcp://54.174.219.184:2376   aws.swarm-master            v1.11.0   
aws.agent.10       -           amazonec2   Running   tcp://54.175.93.9:2376      aws.swarm-master            v1.11.0   
aws.agent.11       -           amazonec2   Running   tcp://54.175.95.229:2376    aws.swarm-master            v1.11.0   
aws.agent.12       -           amazonec2   Running   tcp://54.89.146.252:2376    aws.swarm-master            v1.11.0   
aws.agent.13       -           amazonec2   Running   tcp://54.175.99.194:2376    aws.swarm-master            v1.11.0   
aws.agent.14       -           amazonec2   Running   tcp://52.90.236.249:2376    aws.swarm-master            v1.11.0   
aws.agent.15       -           amazonec2   Running   tcp://54.85.210.108:2376    aws.swarm-master            v1.11.0   
aws.agent.16       -           amazonec2   Running   tcp://54.175.96.108:2376    aws.swarm-master            v1.11.0   
aws.agent.17       -           amazonec2   Running   tcp://54.175.103.39:2376    aws.swarm-master            v1.11.0   
aws.mh-keystore    -           amazonec2   Running   tcp://52.91.92.218:2376                                 v1.11.0   
aws.swarm-master   * (swarm)   amazonec2   Running   tcp://54.174.197.186:2376   aws.swarm-master (master)   v1.11.0   

Any idea what I’m doing wrong?

Changing the interface to eth0 has no effect either.

Any help? I'm stuck :(

I tested creating a Docker Swarm cluster from the Docker image on EC2; it gets installed, and separate nodes get added. A quick verification command follows the steps:

  1. Generate a cluster token using the Docker image "swarm".
    sudo docker run --rm swarm create
  2. Using the token returned, start the Swarm manager.
    docker run -d -p <swarm_port>:2375 swarm manage token://<cluster_id>
  3. Start the Swarm agents.
    docker run -d swarm join --addr=<node_ip>:2375 token://<cluster_id>
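
To verify that nodes joined, point a Docker client at the manager (<manager_ip> here is a placeholder for the manager's address):

    docker -H tcp://<manager_ip>:<swarm_port> info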