Docker Community Forums

Hybrid swarm crashing Windows host


(James06) #1

Hello,

We are currently building a hybrid swarm but are running into a few problems. The immediate problem is that deploying a stack causes the Windows host to go to a black screen and reboot. (We run two monitors: one goes black, the other freezes, and then the workstation reboots.) The Linux host (Ubuntu 16.04) does not crash. When the system comes back up, we often have to reinitialize the swarm and redeploy the stack. Sometimes the Windows host crashes right after we deploy the stack; other times it runs for a few hours before rebooting. This happens both in my environment and in a similarly configured environment that my colleague uses.

What we are running
Windows 10 Version 1709 (OS Build 16299.192)
Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)

Our stack uses mode: host and mode: global.
The compose file we deploy uses compose file format version 3.2.
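
For readers unfamiliar with these settings, here is a minimal sketch of what a version 3.2 stack file using both modes could look like. The poster's actual compose file is not shown in the thread, so the service name, image, and port numbers below are hypothetical placeholders. It also assumes that "mode host" refers to the port publishing mode and "mode global" to the deploy mode.

```yaml
# Hypothetical stack file; names, image, and ports are illustrative only.
version: "3.2"

services:
  web:                           # hypothetical service name
    image: example/web:latest    # hypothetical image
    ports:
      # Long port syntax (available from file format 3.2).
      - target: 80               # port inside the container
        published: 8080          # port on the node itself
        protocol: tcp
        mode: host               # publish on each node directly, bypassing the ingress routing mesh
    deploy:
      mode: global               # run exactly one task on every eligible node
```

A file like this would be deployed from a manager node with `docker stack deploy -c <file> <stack-name>`.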

Windows Docker version
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:05:22 2017
OS/Arch: windows/amd64

Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.24)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:15:52 2017
OS/Arch: windows/amd64
Experimental: false

Windows docker info
Containers: 6
Running: 6
Paused: 0
Stopped: 0
Images: 7
Server Version: 17.12.0-ce
Storage Driver: windowsfilter
Windows:
Logging Driver: json-file
Plugins:
Volume: local
Network: ics l2bridge l2tunnel nat null overlay transparent
Log: awslogs etwlogs fluentd gelf json-file logentries splunk syslog
Swarm: active
NodeID: tz1aw9ug20559diuyoktwhlso
Is Manager: true
ClusterID: 6kpiamwhgidpl57l1di28byhj
Managers: 1
Nodes: 2
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 192.168.216.123
Manager Addresses:
192.168.216.123:2377
Default Isolation: hyperv
Kernel Version: 10.0 16299 (16299.15.amd64fre.rs3_release.170928-1534)
Operating System: Windows 10 Enterprise
OSType: windows
Architecture: x86_64
CPUs: 24
Total Memory: 15.95GiB
Name: 204
ID: XVQK:PTOC:EMLN:X7ES:7J5R:F7LW:V2OC:YVIJ:JF57:GYM2:2R2J:XXXX
Docker Root Dir: C:\ProgramData\Docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: -1
Goroutines: 236
System Time: 2018-03-06T13:49:17.4278821+01:00
EventsListeners: 7
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false

Linux Docker version
Client:
Version: 17.12.1-ce
API version: 1.35
Go version: go1.9.4
Git commit: 7390fc6
Built: Tue Feb 27 22:17:40 2018
OS/Arch: linux/amd64

Server:
Engine:
Version: 17.12.1-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.4
Git commit: 7390fc6
Built: Tue Feb 27 22:16:13 2018
OS/Arch: linux/amd64
Experimental: false

Linux docker info
Containers: 5
Running: 5
Paused: 0
Stopped: 0
Images: 80
Server Version: 17.12.1-ce
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 298
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
NodeID: n1ui1cj868w5six25s8n014pv
Is Manager: false
Node Address: 192.168.205.156
Manager Addresses:
192.168.216.123:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9b55aab90508bd389d7654c4baf173a981477d55
runc version: 9f9c96235cc97674e935002fc3d78361b696a69e
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-116-generic
Operating System: Ubuntu 16.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.859GiB
Name: docker
ID: IL2E:QE67:7REN:UYQE:JBLS:UQPA:OT73:4UGE:LSLR:ZSFO:DE3K:XXXX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

To reproduce the problem, we followed the standard procedures for installing Docker on Windows and Linux (stable version on both). We initialize the swarm on the Windows host and keep it as the manager. As mentioned above, the Windows host sometimes runs for hours before crashing, while other times it crashes almost immediately after we deploy the stack.


(James06) #2

OK, I think we sorted this out. It all comes down to the network. In both of our cases, the network configuration on Windows had become corrupted, though to different degrees.

We listed the host's virtual network adapters with Get-VMNetworkAdapter -ManagementOS and found many more than we expected to see. My list looked fine, but my colleague’s workstation showed far too many adapters.

We used netcfg -d to clean up the network configuration, and from there things started looking up.

Two other things we made sure to do for the Windows/Linux hybrid environment:

  1. made sure we were running compose file format version 3.3 for our stack
  2. used endpoint_mode: dnsrr (this requires compose file format version 3.3)
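
The two points above might translate into a stack file along these lines. This is only a sketch under stated assumptions, not the poster's actual file: the service name, image, and ports are hypothetical, and it keeps the host-mode port publishing and global deploy mode mentioned in the first post. One detail worth noting is that swarm rejects ports published through the ingress routing mesh when a service uses dnsrr, so host-mode publishing is the usual companion to this setting.

```yaml
# Hypothetical stack file; names, image, and ports are illustrative only.
version: "3.3"

services:
  web:                           # hypothetical service name
    image: example/web:latest    # hypothetical image
    ports:
      - target: 80               # port inside the container
        published: 8080          # port on the node itself
        protocol: tcp
        mode: host               # ingress-mode publishing cannot be combined with dnsrr
    deploy:
      mode: global               # one task on every eligible node
      endpoint_mode: dnsrr       # DNS round-robin service discovery instead of a virtual IP
```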