Boot Swarm agents, but master shows none registered

rjurney · April 16, 2016, 12:46am

This started in another thread, but I figure it is best for the next guy if I make a new thread.

I am having trouble getting a Docker Swarm cluster to boot using Consul as the keystore. The script I am using to boot the keystore and the swarm master is:

#
# Setup a docker swarm with 20 workers
#

# Setup consul for our overlay network to divide linked containers across the network
docker-machine create \
    --driver amazonec2 \
    --amazonec2-instance-type m3.medium \
    --amazonec2-subnet-id subnet-40502c36 \
    --amazonec2-zone=c \
    --amazonec2-vpc-id=vpc-66f0e002 \
    'aws.mh-keystore'

eval "$(docker-machine env aws.mh-keystore)"

docker run -d \
    --name consul \
    -p "8500:8500" \
    -h "consul" \
    progrium/consul -server -bootstrap

# Create swarm master
docker-machine create \
    --driver amazonec2 \
    --amazonec2-instance-type m3.medium \
    --amazonec2-subnet-id subnet-40502c36 \
    --amazonec2-zone=c \
    --amazonec2-vpc-id=vpc-66f0e002 \
    --swarm \
    --swarm-master \
    --swarm-discovery="consul://$(docker-machine ip aws.mh-keystore):8500" \
    --engine-opt="cluster-store=consul://$(docker-machine ip aws.mh-keystore):8500" \
    --engine-opt="cluster-advertise=eth0:2376" \
    aws.swarm-master

And then I boot agents:

# Create 20 swarm workers at once
for i in {1..20}
do
    docker-machine create \
        --driver amazonec2 \
        --amazonec2-instance-type m3.medium \
        --amazonec2-subnet-id subnet-40502c36 \
        --amazonec2-zone=c \
        --amazonec2-vpc-id=vpc-66f0e002 \
        --swarm \
        --swarm-discovery="consul://$(docker-machine ip aws.mh-keystore):8500" \
        --engine-opt="cluster-store=consul://$(docker-machine ip aws.mh-keystore):8500" \
        --engine-opt="cluster-advertise=eth0:2376" \
        aws.agent.$i &
done

# Set our environment to this 20 machine swarm
eval "$(docker-machine env --swarm 'aws.swarm-master')"

docker-machine ls shows the agent as part of the swarm:

NAME               ACTIVE      DRIVER      STATE     URL                         SWARM                       DOCKER    ERRORS
aws.agent.1        -           amazonec2   Running   tcp://54.89.49.121:2376     aws.swarm-master            v1.11.0   
aws.mh-keystore    -           amazonec2   Running   tcp://52.23.162.150:2376                                v1.11.0   
aws.swarm-master   * (swarm)   amazonec2   Running   tcp://54.164.148.229:2376   aws.swarm-master (master)   v1.11.0

However, docker info shows no connected nodes and I can’t run anything:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: swarm/1.2.0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 0
Plugins: 
 Volume: 
 Network: 
Kernel Version: 4.2.0-18-generic
Operating System: linux
Architecture: amd64
CPUs: 0
Total Memory: 0 B
Name: aws.swarm-master
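
To tell whether the agents never registered or the manager is simply not reading them, one option is to ask Consul directly. The sketch below only builds the URL to query. It assumes Swarm's default discovery prefix (docker/swarm/nodes) and Consul's default HTTP port 8500, and swarm_nodes_url is a hypothetical helper name, not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch: build the Consul KV URL under which Swarm agents are assumed to register.
# Assumption: Swarm uses its default discovery prefix "docker/swarm/nodes".

swarm_nodes_url() {
    local consul_ip="$1"
    local port="${2:-8500}"   # Consul's default HTTP port unless overridden
    printf 'http://%s:%s/v1/kv/docker/swarm/nodes?recurse\n' "$consul_ip" "$port"
}

# Usage (needs docker-machine and network access, so it is left commented out):
#   curl -s "$(swarm_nodes_url "$(docker-machine ip aws.mh-keystore)")"
```

If the query comes back empty (Consul answers 404 when no key matches), the agents have probably not written anything to the keystore, which would be consistent with Nodes: 0 above.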

Just to be sure it isn’t a firewall issue, I edited the docker-machine security group to let everything in and out, so that isn’t it.

I got this recipe from the docker website. What am I doing wrong?

rjurney · April 17, 2016, 2:32am

Been working on this some more…

When I boot a node and try to join a swarm manually like so:

docker run swarm join --advertise=$(docker-machine ip aws.mh-keystore):2375 consul://$(docker-machine ip aws.mh-keystore)

I get a frequently seen error:

time="2016-04-17T02:15:13Z" level=info msg="Initializing discovery without TLS" 
time="2016-04-17T02:15:13Z" level=info msg="Registering on the discovery service every 1m0s..." addr="52.23.162.150:2375" discovery="consul://52.23.162.150" 
time="2016-04-17T02:15:13Z" level=error msg="cannot set or renew session for ttl, unable to operate on sessions" 
time="2016-04-17T02:16:13Z" level=info msg="Registering on the discovery service every 1m0s..." addr="52.23.162.150:2375" discovery="consul://52.23.162.150" 
time="2016-04-17T02:16:13Z" level=error msg="cannot set or renew session for ttl, unable to operate on sessions" 
time="2016-04-17T02:17:13Z" level=info msg="Registering on the discovery service every 1m0s..." addr="52.23.162.150:2375" discovery="consul://52.23.162.150" 
time="2016-04-17T02:17:13Z" level=error msg="cannot set or renew session for ttl, unable to operate on sessions" 
time="2016-04-17T02:18:13Z" level=info msg="Registering on the discovery service every 1m0s..." addr="52.23.162.150:2375" discovery="consul://52.23.162.150" 
time="2016-04-17T02:18:13Z" level=error msg="cannot set or renew session for ttl, unable to operate on sessions" 
time="2016-04-17T02:19:13Z" level=info msg="Registering on the discovery service every 1m0s..." addr="52.23.162.150:2375" discovery="consul://52.23.162.150" 
time="2016-04-17T02:19:13Z" level=error msg="cannot set or renew session for ttl, unable to operate on sessions"

The code in consul.go shows:

var (
	// ...
	// ErrSessionRenew is thrown when the session can't be
	// renewed because the Consul version does not support sessions
	ErrSessionRenew = errors.New("cannot set or renew session for ttl, unable to operate on sessions")
)

...

// Put a value at "key"
func (s *Consul) Put(key string, value []byte, opts *store.WriteOptions) error {
	key = s.normalize(key)

	p := &api.KVPair{
		Key:   key,
		Value: value,
		Flags: api.LockFlagValue,
	}

	if opts != nil && opts.TTL > 0 {
		// Create or renew a session holding a TTL. Operations on sessions
		// are not deterministic: creating or renewing a session can fail
		for retry := 1; retry <= RenewSessionRetryMax; retry++ {
			err := s.renewSession(p, opts.TTL)
			if err == nil {
				break
			}
			if retry == RenewSessionRetryMax {
				return ErrSessionRenew
			}
		}
	}

	_, err := s.client.KV().Put(p, nil)
	return err
}

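According to its comment, the error is supposed to mean the Consul version is too old to support sessions, but we are using the latest version of consul in progrium/consul. What the code actually does is return ErrSessionRenew whenever renewSession fails RenewSessionRetryMax times in a row, and the underlying error is discarded. So this looks like a failure to connect to consul after the retries run out, not a version problem.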

So, why can’t I connect to consul? My security group permissions are totally permissive to all traffic. What gives? Others have had this problem too.
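
One difference worth ruling out: the boot script points discovery at consul://<keystore>:8500 and advertises engines on port 2376, while the manual join above omits the Consul port and advertises 52.23.162.150:2375, which is the address of aws.mh-keystore in the docker-machine ls output, not of the joining node. Whether that explains the session error is untested. The sketch below only assembles a join command that mirrors the values from the boot script; build_join_cmd is a hypothetical helper, and the ports are assumptions taken from that script.

```shell
#!/usr/bin/env bash
# Sketch: assemble a manual `swarm join` command line.
# Assumptions: 2376 is the engine port (as in cluster-advertise in the boot
# script) and 8500 is the Consul HTTP port (as in --swarm-discovery there).

build_join_cmd() {
    local node_ip="$1"      # address of the node that is joining
    local consul_ip="$2"    # address of the Consul keystore
    printf 'docker run -d swarm join --advertise=%s:2376 consul://%s:8500\n' \
        "$node_ip" "$consul_ip"
}

# Usage, run against the agent's own engine (left commented out):
#   eval "$(docker-machine env aws.agent.1)"
#   eval "$(build_join_cmd "$(docker-machine ip aws.agent.1)" \
#                          "$(docker-machine ip aws.mh-keystore)")"
```

If the join still logs the same session error with the port spelled out, the cause lies somewhere else.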

dvohra · April 24, 2016, 2:00am

Is the Swarm manager running?
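
A quick way to answer that is to list the containers on the master's engine and look for the manager. The sketch below assumes docker-machine names the manager container swarm-agent-master; has_container is a hypothetical helper that checks a list of container names read from standard input.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the wanted container name appears, as a whole line,
# in the list of names read from standard input.
# Assumption: docker-machine names the Swarm manager "swarm-agent-master".

has_container() {
    local wanted="$1"
    grep -qx -- "$wanted"
}

# Usage against the master's engine, not the swarm endpoint (left commented out):
#   docker $(docker-machine config aws.swarm-master) ps --format '{{.Names}}' \
#       | has_container swarm-agent-master && echo "manager is running"
```

If the container is missing, docker ps -a on the same engine should show whether it exited.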