Docker Community Forums

Share and learn in the Docker community.

Swarm failure - voting does not work

(Wolfgangpfnuer) #1

Expected behavior

I expect the swarm to be resilient with multiple manager nodes.

Actual behavior

The swarm dies a lot (every couple of hours or days, it seems). Apparently, if a heartbeat message to the leader is lost, a re-election is triggered, but votes get lost due to some kind of deadlock or something similar. That means the swarm is down - it is impossible to get any information about whatever happens inside the swarm, as all commands simply complain that there’s no manager.

As I had 3 manager nodes, I tried to start 2 at a time as suggested, but to no avail. I tried all 3 possible combinations (stopping all 3, starting 2 of them).

I then started a new swarm as described there. I can’t join the other 2 manager nodes to that swarm, because I can’t get them to leave their old swarm:

docker swarm leave --force
Error response from daemon: context deadline exceeded

I can’t even force-init a new cluster on those nodes, with the same error message.

Also my new 1-manager cluster is absolutely useless - while I can see my services and tasks, and even the nodes, all nodes are failing their heartbeat check. I assume that’s happening due to the token being changed when re-initializing the swarm…


A) How can I recover from this state? Is it possible at all? My only lead would be to change the tokens in DynamoDB to the values produced by the new init, stop the two other managers, and start 2 new ones (they should join the swarm using the tokens from DynamoDB). Then start the ssh daemon container on the worker nodes manually with -h worker_node_host and join the swarm with their respective docker services (that way the possibly still-running services should be integrated - alternatively, just scale up a few new worker nodes and then destroy the old ones).
Is there an easier way? Actually, creating a new stack and tearing down the old one seems easier than my solution…
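For what it’s worth, the documented escape hatch for a swarm that has lost its quorum is --force-new-cluster, which recreates a single-manager cluster from one surviving manager’s local state without needing any votes. A minimal sketch of that route (the IP address and token are hypothetical placeholders, and this assumes a Docker 1.12+ daemon):

```shell
# Sketch of quorum-loss recovery; run recover_manager on one surviving
# manager, then rejoin_worker on each node with the token it prints.
recover_manager() {
  # Rebuilds a single-manager swarm from local raft state; no quorum needed.
  docker swarm init --force-new-cluster --advertise-addr "$1"
  # Print the token the other nodes will need to rejoin.
  docker swarm join-token worker
}

rejoin_worker() {
  # On each stuck node: abandon the dead swarm, then join the new one.
  docker swarm leave --force
  docker swarm join --token "$2" "$1:2377"
}

# Usage (hypothetical values):
# recover_manager 10.0.0.5
# rejoin_worker 10.0.0.5 SWMTKN-1-...
```

This sidesteps the DynamoDB token surgery entirely, since the rejoining nodes get the new token directly from the recovered manager.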

B) As per the GitHub link above, this is going to be fixed in Docker 1.12.1. When is it going to be released / integrated into the AMI for the CloudFormation template (beta5?)?

C) Is a leaderless swarm still restarting containers? Or does that mean everything is down, not just the CLI?

Additional Information

This is a huge blocker for our usage of Docker swarm mode on AWS… I hoped that the rough edges would be more in the CloudFormation template - but I did expect Docker 1.12 to actually be stable in its swarm implementation…
What exactly is causing this deadlock / endless re-election cycle? Could anything I changed have something to do with it? As per my other posts here, I added VPC peering to the default VPC and the corresponding routing-table entries, so that traffic is forwarded to my default VPC.
Except for that, I was deploying 100+ containers total, about half of which still had errors and were restarting (I was fighting with the change from the docker-compose format to the service format). I was using micro instances for manager nodes, but load was low (single-digit CPU/memory).
Before that 3-manager attempt I had 5 managers (c4.xlarge) with 3 worker nodes (c4.xlarge as well). This worked with one service (50 tasks) over the weekend. On Monday I tried to add additional services (again, with errors => restarting). I was using the manager nodes for containers as well, so load was a bit higher, but shouldn’t have been too high (lower double digits, if at all). After a few hours the swarm died the same way as my 3-manager swarm did. I thought maybe the problem was that the other containers took too many resources and the manager(s) got starved, so I destroyed the swarm and created a new one (the 3-manager swarm from above).

(Michael Friis) #2

@wolfgangpfnuer thanks for helping us test this and sorry about the breakage.

B) We’re working on shipping beta5 (today or tomorrow), and that will include 1.12.1-RC2. Even though it’s an RC, the consensus seems to be that it’ll be better than 1.12.0. We’re also working on a less complicated token mechanism for Docker for AWS to avoid the dead end you got into.

I hope you’ll stick with us and try the next release too. I’m pinging people at Docker to see if we can get you answers to your other questions. Let me know if you have other questions or if there’s anything else we can help with.

(Wolfgangpfnuer) #3

Tried it. Now I’m getting similar but different problems.

If I want to update a service, I usually get:

docker service update --image private/image project/frontend
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded

All services are shown as 0/x (probably because the manager can’t see the worker nodes anymore).

Sometimes I also get stack traces like the following on docker service ps <service_name> (and other commands, I guess):

runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0xa7dc87 m=0

goroutine 0 [idle]:

goroutine 5 [syscall]:
runtime.notetsleepg(0x13dfb20, 0xffffffffffffffff, 0x1)
       	/usr/local/go/src/runtime/lock_futex.go:205 +0x4e fp=0xc82001df40 sp=0xc82001df18
       	/usr/local/go/src/runtime/sigqueue.go:116 +0x132 fp=0xc82001df78 sp=0xc82001df40
       	/usr/local/go/src/os/signal/signal_unix.go:22 +0x18 fp=0xc82001dfc0 sp=0xc82001df78
       	/usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc82001dfc8 sp=0xc82001dfc0
created by os/signal.init.1
       	/usr/local/go/src/os/signal/signal_unix.go:28 +0x37

goroutine 1 [select]:, 0xc820010eb8, 0x7f0550ffc860, 0xc820309a00, 0xc8200b8540, 0x0, 0x0, 0x0)
       	/go/src/ +0x49d*Client).sendClientRequest(0xc82032a240, 0x7f0550ff7788, 0xc820010eb8, 0xd585d0, 0x3, 0xc820309ca0, 0x1b, 0x0, 0x0, 0x0, ...)
       	/go/src/ +0x509*Client).sendRequest(0xc82032a240, 0x7f0550ff7788, 0xc820010eb8, 0xd585d0, 0x3, 0xc820309ca0, 0x1b, 0x0, 0x0, 0x0, ...)
       	/go/src/ +0x2dc*Client).get(0xc82032a240, 0x7f0550ff7788, 0xc820010eb8, 0xc820309ca0, 0x1b, 0x0, 0x0, 0xc8201c6fb0, 0x0, 0x0)
       	/go/src/ +0xa6*Client).ServiceInspectWithRaw(0xc82032a240, 0x7f0550ff7788, 0xc820010eb8, 0x7ffd29908ecd, 0x11, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
       	/go/src/ +0xef, 0x7ffd29908ecd, 0x11, 0xc8202e7300, 0xc8202d7050, 0x0, 0x0)
       	/go/src/ +0x10d, 0xc8203190a0, 0x1, 0x1, 0x0, 0x0)
       	/go/src/ +0x79*Command).execute(0xc8202fb440, 0xc820319040, 0x1, 0x1, 0x0, 0x0)
       	/go/src/ +0x6fe*Command).ExecuteC(0xc8200758c0, 0xc8202fb440, 0x0, 0x0)
       	/go/src/ +0x55c*Command).Execute(0xc8200758c0, 0x0, 0x0)
       	/go/src/ +0x2d, 0xc820072510, 0x7ffd29908ec2, 0x7, 0xc82000a1e0, 0x2, 0x2, 0x0, 0x0)
       	/go/src/ +0x25f, 0x2, 0x2, 0x0, 0x0)
       	/go/src/ +0x8c*Cli).Run(0xc8203206c0, 0xc82000a1d0, 0x3, 0x3, 0x0, 0x0)
       	/go/src/ +0x34b
       	/go/src/ +0x599

goroutine 17 [syscall, locked to thread]:
       	/usr/local/go/src/runtime/asm_amd64.s:1998 +0x1

goroutine 6 [select]:
net/http.(*persistConn).roundTrip(0xc82031f930, 0xc820319110, 0x0, 0x0, 0x0)
       	/usr/local/go/src/net/http/transport.go:1476 +0xf1f
net/http.(*Transport).RoundTrip(0xc82000c840, 0xc8200b8540, 0xc82000c840, 0x0, 0x0)
       	/usr/local/go/src/net/http/transport.go:327 +0x9bb
net/http.send(0xc8200b8540, 0x7f0550ff3578, 0xc82000c840, 0x0, 0x0, 0x0, 0xc82026f310, 0x0, 0x0)
       	/usr/local/go/src/net/http/client.go:260 +0x6b7
net/http.(*Client).send(0xc820320ae0, 0xc8200b8540, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
       	/usr/local/go/src/net/http/client.go:155 +0x185
net/http.(*Client).doFollowingRedirects(0xc820320ae0, 0xc8200b8540, 0xedb930, 0x0, 0x0, 0x0)
       	/usr/local/go/src/net/http/client.go:475 +0x8a4
net/http.(*Client).Do(0xc820320ae0, 0xc8200b8540, 0x0, 0x0, 0x0)
       	/usr/local/go/src/net/http/client.go:188 +0xff, 0xc820309a00, 0xc8200b8540, 0xc82032a480)
       	/go/src/ +0x35
created by
       	/go/src/ +0xff

goroutine 8 [runnable]:
created by net/http.(*Transport).dialConn
       	/usr/local/go/src/net/http/transport.go:860 +0x10a6

goroutine 9 [runnable]:
created by net/http.(*Transport).dialConn
       	/usr/local/go/src/net/http/transport.go:861 +0x10cb

rax    0x0
rbx    0x139a4a8
rcx    0xa7dc87
rdx    0x6
rdi    0xb8b
rsi    0xb8b
rbp    0xf0c1be
rsp    0x7ffd29906828
r8     0xa
r9     0x1e4e880
r10    0x8
r11    0x206
r12    0x1e50d40
r13    0xed8c08
r14    0x0
r15    0x8
rip    0xa7dc87
rflags 0x206
cs     0x33
fs     0x0
gs     0x0
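The "pthread_create failed: Resource temporarily unavailable" abort at the top of that trace usually means the process could not spawn another OS thread - on a micro instance that tends to be the per-user process limit or memory pressure rather than Docker itself. A quick check, assuming a Linux host (these are just generic kernel limits to inspect, not a confirmed diagnosis):

```shell
# Limits that commonly make pthread_create fail with EAGAIN:
ulimit -u                          # max user processes (threads count too)
cat /proc/sys/kernel/threads-max   # system-wide thread cap
cat /proc/sys/kernel/pid_max       # PID space shared by all threads
```

If ulimit -u is tiny on the manager instances, the 100+ restarting containers could plausibly exhaust it and take the CLI down with them.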

(Michael Friis) #4

Hm, and this is with the 1.12.1-based beta5 template?

Do you have more details on the instance sizes you chose, the manager and worker counts, and what app you’re deploying? I’d like to try to reproduce the problem.


(Wolfgangpfnuer) #5

Yeah, exactly. I didn’t have the email yet back then, but I saw somebody here posting about beta5, so I just did a replace on the URL to get the beta5 template :wink:

I used 3 micro instances for manager nodes, and I think 4-5 c4.xlarge for worker nodes.

I was deploying a Python-container-based app that runs only on worker nodes, divided into frontend/backend (I labeled 2 nodes as backend). The frontend uses port 80; the backend is just doing its thing, accessing Redis, SQL and SQS through the VPC peering.