Unstable 1.12 cluster

I have a cluster of 8 hosts on Docker 1.12 RC4 in swarm mode.

I have a service whose image is in a private repo on Docker Hub.

The service is published with docker service create in global mode, so one instance runs on each server.

The service provides internal DNS mappings for the other containers that will be installed in that cluster.

First, when I tried deploying my service, I noticed that docker login and docker pull need to be done manually on each server. The login is not scoped at the swarm level, which would make much more sense; on a larger farm this gets hard to manage.
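For clarity, the manual per-node workaround looked roughly like this (run on every host; the tag here is just an example for my earlier version):

docker login
docker pull toutougabi/docker_binddns:V0.4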

Second, even once I was logged in, the swarm was not able to pull the image itself; I had to pull the image manually on each server first. I experienced this both when creating the service and now on updates. For example, I updated with docker service update --image from 0.4 to 0.5 (the version of my service) and the service came completely down. Not only was the swarm unable to pull the image (even though docker login was fine), it also killed the 0.4 instances one after the other without first checking whether the new version was working.
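The update command was along these lines (the service name is the one I use later in this thread, and the tag is illustrative):

docker service update --image toutougabi/docker_binddns:V0.5 InternalDNS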

Also, when this kind of issue happens, the docker service task list just says accepted for each node. The swarm does not surface any error, yet the service is completely down. To see the actual error, I had to run sudo journalctl -fu docker.service.

So we need to make sure proper errors reach the service details, otherwise service management is not really useful.

As a third issue, when I rebooted the servers, all the containers deployed on the cluster with docker service create were unstable. On some machines the service would work, on others it would not.

On some machines I got an error asking me to re-pull the image, but when I did, Docker just responded that the image was already up to date. I had to restart the Docker daemon to clear this error. Even after the daemon restart the containers would come up, but now the networking was completely broken. I get errors like these in the logs:

Jul 21 21:52:46 SWCANCENSM01 dockerd[1210]: time="2016-07-21T21:52:46.378431971Z" level=error msg="fatal task error" error="Unable to complete atomic operation, key modified" module=taskmanager task.id=5xric420qib7jcnvsr3877s2v
Jul 21 21:52:46 SWCANCENSM01 dockerd[1210]: time="2016-07-21T21:52:46.816276710Z" level=error msg="network ingress remove failed: network ingress not found" module=taskmanager task.id=2cxtd98qoirw4hekcgmuio5oz
Jul 21 21:52:46 SWCANCENSM01 dockerd[1210]: time="2016-07-21T21:52:46.816928610Z" level=error msg="remove task failed" error="network ingress not found" module=taskmanager task.id=2cxtd98qoirw4hekcgmuio5oz

On the nodes that get this error, the containers come up but the routing mesh built into swarm does not work.

I also noticed that on reboot the swarm creates new containers for each service, but the old containers are not recycled. Not only is this wasteful, because there is still a good copy of the container on the server that could simply be started, but there is also no garbage collection of the old containers created by swarm services. This could fill up the container list and the disk space pretty quickly.
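Until there is garbage collection, the only workaround I can see is manual cleanup on each node, something along these lines (a sketch; the filter only targets containers in the exited state):

docker ps -aq --filter status=exited
docker rm $(docker ps -aq --filter status=exited)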

As a final point, whenever I repaired errors on one member of the swarm, another member that had been working fine would have its networking die. I'm not sure why, but getting the network or the container working on one machine would somehow break the others.

The service I'm publishing here is fully stateless, based on BIND 9, and has neither volumes nor any attachment point to the local machine. Its configuration is baked into the image itself, so this should be the ideal service for a swarm. This worries me.

Has anyone had the same experience with 1.12?

Note that I'm running this cluster in Azure, if it makes a difference, using the latest RC from Git. Most hosts run on 1 CPU / 3.5 GB RAM, some on 2 CPU / 14 GB RAM.

It seems the default network is ingress and it is not overridable. Refer to the similar issue.
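You can confirm this on any swarm node with:

docker network inspect ingress

The 10.255.0.x addresses in the log excerpt further down are from that network's default 10.255.0.0/16 range.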

I created a new overlay network; the behavior is really strange.
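For reference, the network was created without specifying a subnet, roughly like this (a sketch; only the name Swarm1 is certain, taken from the service command below):

docker network create -d overlay Swarm1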

Used:

docker service create --mode global --name InternalDNS --network Swarm1 -p 53:53/udp -p 53:53/tcp toutougabi/docker_binddns:V0.6

Only 1 of the 8 nodes responds to DNS requests, even though all nodes have an active and working DNS server.
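My check was roughly a dig against each node's address in turn (the IP and query name here are placeholders):

dig @10.0.0.4 myrecord.internal.example.com

Only one node returned an answer; the others did not respond.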

I decided to recreate the service in replicated mode with --replicas 8 (see the command sketch below), and the results are even weirder:
I now have 2 of the 8 nodes responding to DNS requests, even though again all nodes have an active instance.
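The recreate was roughly this sketch (I removed the global service first and reused the same name, network and ports):

docker service rm InternalDNS
docker service create --mode replicated --replicas 8 --name InternalDNS --network Swarm1 -p 53:53/udp -p 53:53/tcp toutougabi/docker_binddns:V0.6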

Here is the log from one of the machines that does not work:
Jul 22 14:33:36 SWCANCENSM06 dockerd[47789]: peerdbupdate in sandbox failed for ip 10.255.0.3 and mac 02:42:0a:ff:00:03: couldn't find the subnet "10.255.0.3/16" in network "53umz1fsp2a0jc2rm17ajttxi"
Jul 22 14:33:36 SWCANCENSM06 dockerd[47789]: peerdbupdate in sandbox failed for ip 10.255.0.9 and mac 02:42:0a:ff:00:09: couldn't find the subnet "10.255.0.9/16" in network "53umz1fsp2a0jc2rm17ajttxi"
Jul 22 14:33:36 SWCANCENSM06 dockerd[47789]: peerdbupdate in sandbox failed for ip 10.255.0.5 and mac 02:42:0a:ff:00:05: couldn't find the subnet "10.255.0.5/16" in network "53umz1fsp2a0jc2rm17ajttxi"
Jul 22 14:33:36 SWCANCENSM06 dockerd[47789]: peerdbupdate in sandbox failed for ip 10.255.0.15 and mac 02:42:0a:ff:00:0f: couldn't find the subnet "10.255.0.15/16" in network "53umz1fsp2a0jc2rm17ajttxi"
Jul 22 14:33:36 SWCANCENSM06 dockerd[47789]: peerdbupdate in sandbox failed for ip 10.255.0.14 and mac 02:42:0a:ff:00:0e: couldn't find the subnet "10.255.0.14/16" in network "53umz1fsp2a0jc2rm17ajttxi"
Jul 22 14:33:37 SWCANCENSM06 dockerd[47789]: time="2016-07-22T14:33:37Z" level=info msg="Firewalld running: false"
Jul 22 14:33:37 SWCANCENSM06 dockerd[47789]: time="2016-07-22T14:33:37Z" level=info msg="Firewalld running: false"
Jul 22 14:33:37 SWCANCENSM06 dockerd[47789]: time="2016-07-22T14:33:37Z" level=info msg="Firewalld running: false"
Jul 22 14:33:37 SWCANCENSM06 dockerd[47789]: time="2016-07-22T14:33:37Z" level=info msg="Firewalld running: false"
Jul 22 14:33:37 SWCANCENSM06 dockerd[47789]: time="2016-07-22T14:33:37Z" level=info msg="Firewalld running: false"

Are the subnets and gateways specified in the docker network create -d overlay command?

No. In the case of 1.12 in swarm mode, it is unclear to me whether this should be filled in.

Specify the subnets. Refer to the example:

docker network create -d overlay \
--subnet=192.168.0.0/16 \
--subnet=192.170.0.0/16 \
--gateway=192.168.0.100 \
--gateway=192.170.0.100 \
--ip-range=192.168.1.0/24 \