Nice to hear @rusher81572!
Summary of my findings of the GA release:
Constraints are not honored and containers are running anywhere they want.
Mesh networking is flaky:
- It takes a long time to access a container on any host
- Most of the time you cannot access the containers at all
- Containers still have problems resolving other containers on the same network on other machines
Sorry to hear @rusher81572. You might want to file issues for the stated problems at https://github.com/docker/docker/issues/new with minimally reproducible examples if possible. As with anything new, ironing out the details in swarm mode will take a while.
Tested out the GA release and it still doesn’t work
Two real Linux nodes, running 1.12.0, build 8eab29e.
The network is defined on the swarm with:
docker network create --driver overlay --subnet 192.168.99.0/24 SIGMAnet
All services are created with the option --network SIGMAnet.
Any service can resolve any other service started on the same node, but can't resolve any service on the opposite node.
From the notes above, it sounds like a fairly endemic problem.
I haven’t tested it myself yet, but check the release notes on 1.12.1-rc1.
It lists numerous network bugs like the ones described here as fixed.
Thanks Colin. It's a bit scary that 1.12.0 would go GA with such fundamental issues (if that is the case).
Thanks for the clarification. I am trying 1.12.1 on my Rpi cluster now to see if there are any improvements
Nope, did not work on 1.12.1.
Relief - not just me being stupid then (hopefully).
@rusher81572 Sorry to hear about your issues. You seem to be running a Rpi cluster. Is that correct? Not sure if you are hitting any issue specific to being in Raspberry pi cluster. Would you mind opening an issue with detailed information about your setup and a sequence of steps to reproduce the problem?
Some of us identified that the Raspbian kernel was missing the vxlan module. Running rpi-update (plus a reboot) adds the module.
I put these test scenarios together to try and document what was going wrong:
After the update I was able to get through them all. I’ve now got an 8-node cluster which can run my redis hit-counter in Swarmmode.
@mrjana @alexellis2 Thanks for the suggestions. Please note that my basic Swarm functionality testing done per the creation of this forum topic was with docker-machine on x86 hardware with VirtualBox using 1.12.0.
I ran rpi-update and updated the kernel. However, I did not see a vxlan module with lsmod so I used modprobe and now it is visible.
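For anyone repeating the lsmod / modprobe step above, here is a minimal sketch of that check. The helper name has_module is made up for illustration; it only parses lsmod-style text from stdin, so it makes no assumptions about the machine it runs on. The real invocation is left as a comment because it needs root and a kernel that ships the vxlan module.

```shell
# has_module NAME: reads `lsmod`-style text on stdin and succeeds (exit 0)
# if a module called NAME appears in the first column.
# NOTE: has_module is a hypothetical helper name, not part of any tool.
has_module() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Intended use on each Pi (commented out so the snippet is safe to paste):
#   lsmod | has_module vxlan || sudo modprobe vxlan
```

Matching on the first column only avoids a false positive when vxlan shows up merely in the "Used by" column of another module.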
Used three Pi’s with Docker 1.12.1-rc2, build 236317f, experimental, vxlan module loaded. Kernel 4.4.19-v7+.
docker swarm init
docker swarm join.....(for each Pi as needed)
1mmb8fgd9s8m9peodyr3x6qoc *  rpi-2  Ready  Active  Leader
c26mil621fs4f0iwa1e8884ar    rpi-3  Ready  Active
em26jwtm7u3f800pkrl4fw2py    rpi-4  Ready  Active
docker network create chat -d overlay
docker service create --name mysql --network chat registry:5000/mysql
docker service create --name test --network chat -p 444:444 registry:5000/test
docker service ls
ID            NAME   REPLICAS  IMAGE                COMMAND
4n6r4fbiu0cl  mysql  1/1       registry:5000/mysql
8h36vyo2luyy  test   1/1       registry:5000/test
We can see here that the database and the Node.js application are running on separate machines.
# docker node ps rpi-3
ID                         NAME     IMAGE                NODE   DESIRED STATE  CURRENT STATE               ERROR
1zfxjmkbqgz7c0q9ji26h7qkg  mysql.1  registry:5000/mysql  rpi-3  Running        Running about a minute ago
# docker node ps rpi-2
ID                         NAME     IMAGE                NODE   DESIRED STATE  CURRENT STATE               ERROR
5dcwkxxdy4aunbvg6sk9q3wtw  test.1   registry:5000/test   rpi-2  Running        Running about a minute ago
The service named test is running my Node.js app fine with database connectivity and is accessible from each host. More testing is needed with other apps but is looking good so far.
The only problem that remains now is with labels and constraints, which do not seem to be working right.
This command works and runs MySQL on rpi-3 as expected:
docker service create --name sql2 -l "node=rpi-3" --network chat registry:5000/mysql
Then I tried to run MySQL with a label naming a node that does not exist, and it just went to a random node:
docker service create --name sql2 -l "node=rpi-1" --network chat registry:5000/mysql
For my final test, I created three MySQL services using the same label and they all went to random nodes, which is not the behavior I would expect.
docker service create --name sql3 -l "node=rpi-4" --network chat registry:5000/mysql
docker service create --name sql2 -l "node=rpi-4" --network chat registry:5000/mysql
docker service create --name sql1 -l "node=rpi-4" --network chat registry:5000/mysql
@rusher81572 thanks for confirming the network connectivity behaviour with the rpi-update and 1.12.1-rc (though I would recommend using the released 1.12.1 version).
Regarding the scheduling constraints, I don’t think you are using the correct options in the docker service create command. “-l” just adds a label to an object. If you are looking to constrain the scheduling, you should be using the --constraint option with the appropriate supported constraints, as mentioned here: https://docs.docker.com/engine/reference/commandline/service_create/#/specify-service-constraints .
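To make the difference between -l and --constraint concrete, here is a rough sketch of what the earlier sql2 command might look like with a hostname constraint instead of a label. The helper pinned_create_cmd is a made-up name for illustration, and it only prints the command rather than running it, so you can inspect it before pointing it at a real swarm. The network name chat and the image are taken from the earlier posts; node.hostname is one of the documented constraint keys.

```shell
# pinned_create_cmd NAME HOST IMAGE: prints (does NOT run) a service-create
# command that pins the service to the node whose hostname is HOST.
# NOTE: pinned_create_cmd is a hypothetical helper name; the "chat" network
# is assumed to exist already, as in the posts above.
pinned_create_cmd() {
  printf "docker service create --name %s --constraint 'node.hostname == %s' --network chat %s\n" "$1" "$2" "$3"
}

# Print the constrained equivalent of the earlier "-l node=rpi-3" attempt.
pinned_create_cmd sql2 rpi-3 registry:5000/mysql
```

If the named node does not exist, the expectation is that the task stays unscheduled instead of landing on a random node, but I have not verified that on a Pi cluster.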
I used constraints in classical Swarm (where you need an external kv store) - as I remember it involved editing the daemon script to add a label. I haven’t tried scheduling by hostname, but this looked relevant: https://github.com/docker/docker/pull/24397#issuecomment-231227571
Btw, did the RC come via experimental.docker.com? If you are finding that re-running the get.docker.com command is not refreshing the Docker version to the general release, then you might want to run apt-get remove docker-engine prior to the script.
Let us know how that goes - I will also try scheduling via label on my swarm and report back.
I’ve had a quick go on my 7 node cluster.
Set up via node hostname
$ docker service create --constraint node.hostname==pi2swarm7 --name hello1 --publish 3000:3000 --replicas=1 alexellis2/arm-alpinehello
6u0lvc8cm1d8unfasek2n6x2r
$ docker service ps hello1
ID                         NAME      IMAGE                       NODE       DESIRED STATE  CURRENT STATE          ERROR
97p4vkcgs42309srjvn6gb5d5  hello1.1  alexellis2/arm-alpinehello  pi2swarm7  Running        Running 3 minutes ago
This is an example with a custom node label:
$ docker node update pi2swarm2 --label-add db=1
$ docker service create --constraint 'node.labels.db == 1' --name hello2 --publish 3001:3000 --replicas=1 alexellis2/arm-alpinehello
$ docker service ps hello2
ID                         NAME      IMAGE                       NODE       DESIRED STATE  CURRENT STATE           ERROR
e5zb4g4myzsf35ipydt21fsvi  hello2.1  alexellis2/arm-alpinehello  pi2swarm2  Running        Running 36 seconds ago
Hope these examples help with your set-up.
Constraints are working, thanks @alexellis2. I moved Swarm 1.12.1 into production on my Pi’s. I did notice a few issues.
First, I am running phpfpm in a container, and TinyRSS reloads over and over again after clicking on “All Articles”. I scaled phpfpm down to 1 instance and that looked like it resolved the problem, but I would like to scale it out like I did before using nginx. Then I started getting 504 Gateway Time-outs when accessing TinyRSS. The WordPress blog and TinyRSS both connect to the same phpfpm container, and WordPress works fine, so there must be something wrong with the internal Swarm plumbing. I removed the service and created it again, and it seems to work now. The only problem is that 2 of the 3 phpfpm instances are now running on the same host.
There are also performance issues. I connect to the first Pi, and that should load balance all my requests to the other Pis, just like my setup before. But with Swarm taking over the load balancing through the mesh networking, it is very slow. This behavior is breaking all of my Node.js apps now.
As with previous Docker versions, it is annoying that containers on a virtual network cannot resolve each other if one of them is restarted.
Lastly, I noticed that scaling containers will sometimes schedule multiple instances on the same Pi. Is this expected behavior? I would rather have 1 instance on each Pi.
@rusher81572 Can you please confirm whether the issues that you raised earlier in this thread are addressed?
- Using 1.12.1
- multi-host networking issue that you raised in r-pi is resolved by rpi-update (and modprobe ?)
- Constraints issue that you brought up is a human error (invalid flag usage).
Regarding the other issues that you have raised in the recent comments, I think we should take this to the docker/docker issue tracker so that we can gather more information (and correct any human errors) and also get better visibility and support from the maintainers.
Yes, the earlier issues in this thread are addressed, namely:
multi-host networking issue that you raised in r-pi is resolved by rpi-update (and modprobe ?)
Constraints issue that you brought up is a human error (invalid flag usage).
Thanks for the help.
@rusher81572 Can you please characterize what you mean by “load balancing with mesh networking is very slow”? Slow as in it is taking a long time or are you seeing throughput issues? Also is this when you expose a port and access it from outside the cluster or are you doing intra cluster?
I ran into the same problem today with the newest Docker engine version (CE). Are the swarm mode routing mesh and internal routing production ready?