All processes are on the same machine

I have created a 20-machine swarm on EC2. The problem is that every container I run ends up on the same machine, regardless of the --hostname I specify. This seems broken in the first place, since the default scheduling strategy is 'spread'.

The script I use to create the cluster is below. The cluster itself comes up fine, but it is not distributing work correctly.

#!/bin/bash

#
# Setup a docker swarm with 20 workers
#

# Spawn an instance to generate a Swarm token
docker-machine create \
    --driver amazonec2 \
    --amazonec2-instance-type m3.medium \
    --amazonec2-subnet-id subnet-40502c36 \
    --amazonec2-zone=c \
    --amazonec2-vpc-id=vpc-66f0e002 \
    'aws.swarm-token-machine'

# Setup environment to run a command on this node
eval "$(docker-machine env 'aws.swarm-token-machine')"

# Create a token for our swarm cluster and set it in our environment
export SWARM_CLUSTER_TOKEN=$(docker run --rm swarm create)

# Create 20 swarm workers at once
# (note: bash brace expansion can't use a variable, hence seq)
WORKER_COUNT=20
for i in $(seq 1 "$WORKER_COUNT")
do
    docker-machine create \
    --driver amazonec2 \
    --amazonec2-instance-type m3.medium \
    --amazonec2-subnet-id subnet-40502c36 \
    --amazonec2-zone=c \
    --amazonec2-vpc-id=vpc-66f0e002 \
    --swarm \
    --swarm-discovery token://$SWARM_CLUSTER_TOKEN \
    aws.agent$i &
done

# Wait for the backgrounded creates to finish, then create the swarm master
# (the eval below expects this machine to exist)
wait
docker-machine create \
    --driver amazonec2 \
    --amazonec2-instance-type m3.medium \
    --amazonec2-subnet-id subnet-40502c36 \
    --amazonec2-zone=c \
    --amazonec2-vpc-id=vpc-66f0e002 \
    --swarm \
    --swarm-master \
    --swarm-discovery token://$SWARM_CLUSTER_TOKEN \
    'aws.swarm-master'

# Set our environment to this 20 machine swarm
eval "$(docker-machine env --swarm 'aws.swarm-master')"

#
# Selenium stuff
#

# Setup a selenium hub
# (NOTE: --hostname only sets the hostname inside the container;
# it does not control which node Swarm schedules it on)
docker run -d \
    --name selenium-hub \
    -p 4444:4444 \
    --hostname aws.agent1 \
    selenium/hub:2.53.0

# Setup 20 Chrome nodes linked to the hub
for i in {1..20}
do
  docker run -d --name=chrome-node-$i --link selenium-hub:hub selenium/node-chrome:2.53.0
done

Why do all containers stick to one machine, and how can I fix this? I have tried using --hostname in my docker run commands, but it has no effect.
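(As an aside: standalone Swarm controls placement with scheduling constraints rather than --hostname. A minimal sketch, using one of this cluster's node names:

docker run -d -e constraint:node==aws.agent1 selenium/hub:2.53.0
)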

This is the output of docker ps:

CONTAINER ID        IMAGE ... STATUS              PORTS                         NAMES
5a81c6a3e381        selenium/node-chrome:2.53.0 ... Up 5 seconds                                      aws.agent13/chrome-node-2
74f46a0b5809        selenium/node-chrome:2.53.0 ... Up 2 minutes                                      aws.agent13/chrome-node-1
244e4d3092cd        selenium/hub:2.53.0 ... Up 4 minutes        54.86.124.43:4444->4444/tcp   aws.agent13/selenium-hub,aws.agent13/chrome-node-1/hub

It looks like you are using the default network, and using --link. I would expect those nodes to be scheduled onto the same host as the container you are linking to.

If you were to set up an overlay network, then those containers could communicate with each other on any node.

Check out the docker network create stuff.
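For example, a minimal sketch with a made-up network name (it assumes each engine was started with the --cluster-store and --cluster-advertise options, which overlay networking requires):

docker network create -d overlay my-overlay

Containers run with --net=my-overlay can then reach each other from any node.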

I will try your suggestion. You mean that linked containers have to run on the same agent? That is very unexpected, and I do not understand it, so I will check out the docs you mention.

In the meantime, maybe this isn't a valid issue after all, but I already created one here: https://github.com/docker/docker/issues/21968

When you are using the default bridge network, --link only works between containers on the same host that are connected to that same network.

The overlay network driver was specifically developed to achieve multi-host container networking. If you use an overlay network, you can --link between hosts. You could also skip the --link feature entirely and use the new service discovery feature: basically, all containers connected to the same docker network can reach any other container on that network by referencing its --name.

Thanks! OK, so if I create an overlay network and run the currently-linked containers with the --net option, they will then be spread across the cluster?

That is correct, although if you do use the docker network create stuff, you don’t need to use the --link feature.

Without --link, how will a selenium node know how to connect to the selenium hub?

You can use the new network discovery system.

Basically, you can reach another container by resolving the name of that container.

For example:

$ docker network create foo
$ docker run -d --name web --net=foo nginx
$ docker run --rm -it --net=foo alpine ping -c 1 web
PING web (10.0.10.2): 56 data bytes
64 bytes from 10.0.10.2: seq=0 ttl=64 time=0.119 ms

--- web ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss

Docker runs an internal embedded DNS server that allows this to happen: https://docs.docker.com/engine/userguide/networking/configure-dns/
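Applied to the Selenium setup above, a sketch (the HUB_PORT_4444_TCP_ADDR/HUB_PORT_4444_TCP_PORT variables are an assumption about what the 2.53.0 node image expects in place of the env vars --link would have injected):

# Create an overlay network for the grid
docker network create -d overlay selenium-grid

# Hub and nodes join the same network; nodes resolve the hub by name
docker run -d --name selenium-hub --net=selenium-grid -p 4444:4444 selenium/hub:2.53.0

for i in {1..20}
do
  # These env vars mimic what --link would have set (assumption about
  # the selenium/node-chrome entrypoint)
  docker run -d --name=chrome-node-$i --net=selenium-grid \
    -e HUB_PORT_4444_TCP_ADDR=selenium-hub \
    -e HUB_PORT_4444_TCP_PORT=4444 \
    selenium/node-chrome:2.53.0
done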

Thanks. It turns out that to use overlay networks I have to run a distributed key-value store. Now I'm having trouble getting the swarm master to finish booting when I use consul as that key-value store. The keystore boots up OK, but the swarm master simply never finishes initializing.

# Setup consul as the key-value store for our overlay network,
# so linked containers can be spread across the cluster
docker-machine create \
    --driver amazonec2 \
    --amazonec2-instance-type m3.medium \
    --amazonec2-subnet-id subnet-40502c36 \
    --amazonec2-zone=c \
    --amazonec2-vpc-id=vpc-66f0e002 \
    'aws.mh-keystore'

eval "$(docker-machine env aws.mh-keystore)"

docker run -d \
    -p "8500:8500" \
    -h "consul" \
    progrium/consul -server -bootstrap
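
# Quick sanity check before creating the master (assumes the standard
# Consul HTTP API on port 8500): this should print the leader address
curl "http://$(docker-machine ip aws.mh-keystore):8500/v1/status/leader"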

# Create the swarm master
docker-machine create \
    --driver amazonec2 \
    --amazonec2-instance-type m3.medium \
    --amazonec2-subnet-id subnet-40502c36 \
    --amazonec2-zone=c \
    --amazonec2-vpc-id=vpc-66f0e002 \
    --swarm \
    --swarm-master \
    --swarm-discovery="consul://$(docker-machine ip aws.mh-keystore):8500" \
    --engine-opt="cluster-store=consul://$(docker-machine ip aws.mh-keystore):8500" \
    --engine-opt="cluster-advertise=eth1:2376" \
    aws.swarm-master

But docker-machine can't reach the daemon on the new machine:

Running pre-create checks...
Creating machine...
(aws.swarm-master) Launching instance...
Waiting for machine to be running, this may take a few minutes...
Detecting operating system of created instance...
Waiting for SSH to be available...
Detecting the provisioner...
Provisioning with ubuntu(systemd)...
Installing Docker...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Error creating machine: Error running provisioning: Unable to verify the Docker daemon is listening: Maximum number of retries (10) exceeded

Any idea what I’m doing wrong? I checked the security group, and ports 22 and 2376 are open.

To answer my own question: once I changed eth1 to eth0, the machine comes up :) (The EC2 instances docker-machine creates only have an eth0 interface, so cluster-advertise=eth1:2376 pointed at an interface that doesn't exist.)
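A quick way to confirm which interfaces a machine actually has:

docker-machine ssh aws.swarm-master ip addr show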

Ah, now I can’t get nodes to join the swarm. Using this script:

for i in {1..20}
do
    # (the token-based create from the first script is replaced
    # with consul discovery below)
    docker-machine create \
        --driver amazonec2 \
        --amazonec2-instance-type m3.medium \
        --amazonec2-subnet-id subnet-40502c36 \
        --amazonec2-zone=c \
        --amazonec2-vpc-id=vpc-66f0e002 \
        --swarm \
        --swarm-discovery="consul://$(docker-machine ip aws.mh-keystore):8500" \
        --engine-opt="cluster-store=consul://$(docker-machine ip aws.mh-keystore):8500" \
        --engine-opt="cluster-advertise=eth1:2376" \
        aws.agent.$i &
done
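One way to see what actually registered in discovery (a sketch using the swarm image's list subcommand against the same consul URL):

docker run --rm swarm list consul://$(docker-machine ip aws.mh-keystore):8500

If this prints nothing, the engines never advertised themselves into consul (for example, because cluster-advertise names a nonexistent interface).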

docker info after I run eval "$(docker-machine env --swarm 'aws.swarm-master')" shows:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: swarm/1.2.0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 0
Plugins: 
 Volume: 
 Network: 
Kernel Version: 4.2.0-18-generic
Operating System: linux
Architecture: amd64
CPUs: 0
Total Memory: 0 B
Name: aws.swarm-master

despite docker-machine ls showing the nodes are up:

Russells-MacBook-Pro-OLD-495:marketing rjurney$ docker-machine ls
NAME               ACTIVE      DRIVER      STATE     URL                         SWARM                       DOCKER    ERRORS
aws.agent.1        -           amazonec2   Running   tcp://54.175.96.235:2376    aws.swarm-master            v1.11.0   
aws.agent.2        -           amazonec2   Running   tcp://54.175.95.149:2376    aws.swarm-master            v1.11.0   
aws.agent.3        -           amazonec2   Running   tcp://54.175.95.129:2376    aws.swarm-master            v1.11.0   
aws.agent.4        -           amazonec2   Running   tcp://54.87.141.182:2376    aws.swarm-master            v1.11.0   
aws.agent.5        -           amazonec2   Running   tcp://54.152.199.55:2376    aws.swarm-master            v1.11.0   
aws.agent.6        -           amazonec2   Running   tcp://54.165.106.166:2376   aws.swarm-master            v1.11.0   
aws.agent.7        -           amazonec2   Running   tcp://54.174.220.235:2376   aws.swarm-master            v1.11.0   
aws.agent.8        -           amazonec2   Running   tcp://52.90.145.100:2376    aws.swarm-master            v1.11.0   
aws.agent.9        -           amazonec2   Running   tcp://54.174.219.184:2376   aws.swarm-master            v1.11.0   
aws.agent.10       -           amazonec2   Running   tcp://54.175.93.9:2376      aws.swarm-master            v1.11.0   
aws.agent.11       -           amazonec2   Running   tcp://54.175.95.229:2376    aws.swarm-master            v1.11.0   
aws.agent.12       -           amazonec2   Running   tcp://54.89.146.252:2376    aws.swarm-master            v1.11.0   
aws.agent.13       -           amazonec2   Running   tcp://54.175.99.194:2376    aws.swarm-master            v1.11.0   
aws.agent.14       -           amazonec2   Running   tcp://52.90.236.249:2376    aws.swarm-master            v1.11.0   
aws.agent.15       -           amazonec2   Running   tcp://54.85.210.108:2376    aws.swarm-master            v1.11.0   
aws.agent.16       -           amazonec2   Running   tcp://54.175.96.108:2376    aws.swarm-master            v1.11.0   
aws.agent.17       -           amazonec2   Running   tcp://54.175.103.39:2376    aws.swarm-master            v1.11.0   
aws.mh-keystore    -           amazonec2   Running   tcp://52.91.92.218:2376                                 v1.11.0   
aws.swarm-master   * (swarm)   amazonec2   Running   tcp://54.174.197.186:2376   aws.swarm-master (master)   v1.11.0   

Any idea what I’m doing wrong?

Changing the interface to eth0 has no effect either.

Any help? I'm stuck :(

I tested creating a Docker Swarm cluster from the Docker image on EC2; it gets installed, and separate nodes get added. A quick verification command follows the steps:

  1. Generate a cluster token using the Docker image "swarm".
    sudo docker run --rm swarm create
  2. Using the token returned, start the Swarm manager.
    docker run -d -p <swarm_port>:2375 swarm manage token://<cluster_id>
  3. Start the Swarm agents.
    docker run -d swarm join --addr=<node_ip>:2375 token://<cluster_id>
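
To verify that nodes joined, point a Docker client at the manager (<manager_ip> here is a placeholder for the manager's address):

    docker -H tcp://<manager_ip>:<swarm_port> info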