Hello there. I’m using Docker Swarm to run a mixed-OS cluster: one Ubuntu 16.04 Linux node and one Windows Server 2016 node. The goal is to run a Windows application as a web function via OpenFaaS. The OpenFaaS gateway and helper containers/services must run on Linux, hence the mixed-OS cluster.
Just yesterday I had everything working. Between yesterday and today, all the Windows function containers stopped running. Looking at the event log, I can see that the Windows node was restarted in the evening.
People on the OpenFaaS Slack pointed me at some troubleshooting steps, and what I’ve found suggests that the Windows node changed its node ID after rebooting, which they said is not expected behavior. That would explain why the service containers weren’t restarted.
root@faas-gateway:~# docker node ls
ID                            HOSTNAME       STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
e7kluijdhdtqqxudgo8o939mt *   faas-gateway   Ready    Active         Leader           18.06.1-ce
j3r3kqrv0xbcbdelqr3t03jmm     faas-runner    Ready    Active                          17.06.2-ee-16
root@faas-gateway:~# docker service ls
ID             NAME                MODE         REPLICAS   IMAGE                             PORTS
arepuq0z4nvj   base64              replicated   1/1        functions/alpine:latest
89ej897kke4o   echoit              replicated   1/1        functions/alpine:latest
muvz7nkk0x7r   func_alertmanager   replicated   1/1        prom/alertmanager:v0.15.0
otyzha5ah8a1   func_faas-swarm     replicated   1/1        openfaas/faas-swarm:0.4.2
y6cn9k605p7u   func_gateway        replicated   1/1        openfaas/gateway:0.9.2            *:8080->8080/tcp
rhe2joei0pob   func_nats           replicated   1/1        nats-streaming:0.6.0
vn3hn43z4bgq   func_prometheus     replicated   1/1        prom/prometheus:v2.3.1            *:9090->9090/tcp
4px6g4lfmcc9   func_queue-worker   replicated   1/1        openfaas/queue-worker:0.5.2
vwwpzjymzek9   hubstats            replicated   1/1        functions/hubstats:latest
tvr9evio1h1t   ipconfig            replicated   0/1        alexellis/windows-ipconfig:nano
ik5eifgfdo74   nodeinfo            replicated   1/1        functions/nodeinfo:latest
kqb6sytxme0t   faas-fn             replicated   0/1        runner7:7.6-faas
o919pigchkiz   faas-fn-optimize    replicated   0/1        runner7:7.6-faas-optimize
8tprw1ohy1en   wordcount           replicated   1/1        functions/alpine:latest
root@faas-gateway:~# docker service logs --tail 100 faas-fn
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node lz3qinihv1d2to67dbn8a6gq2 is not available
ipconfig, faas-fn, and faas-fn-optimize are the services for the Windows functions. There was another Windows node in the swarm at one point, but it was replaced one week ago by the current Windows node, and I don’t have its node ID. The fact that the service tries to get logs from a node that is not in the swarm tells me that either those services are still scheduled on the old node for some reason, or the current server changed its node ID.
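One way to narrow this down might be to pull the node ID out of the log error and check it against the current swarm members. The following is only a sketch: the function names are made up for illustration, and it assumes node IDs are always 25 lowercase alphanumeric characters, which matches the IDs shown above but is not something I have confirmed in general.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical helpers, not Docker commands.

# Print the node ID found in a "node <id> is not available" message.
# Assumes swarm node IDs are 25 lowercase alphanumeric characters.
extract_unavailable_node() {
    printf '%s\n' "$1" |
        sed -n 's/.*node \([a-z0-9]\{25\}\) is not available.*/\1/p'
}

# Return 0 if ID $1 is one of the newline-separated IDs in $2.
id_in_list() {
    printf '%s\n' "$2" | grep -qxF "$1"
}

# Run on a manager: compare the ID from the log error with current members.
check_stale_node() {
    service="$1"
    err=$(docker service logs --tail 1 "$service" 2>&1)
    missing=$(extract_unavailable_node "$err")
    if [ -z "$missing" ]; then
        echo "no unavailable node reported for $service"
        return 0
    fi
    if id_in_list "$missing" "$(docker node ls --format '{{.ID}}')"; then
        echo "$missing is a current swarm member"
    else
        echo "$missing is not listed by 'docker node ls'"
    fi
}
```

Calling `check_stale_node faas-fn` on the manager would report whether the ID from the error is still a member. Running `docker info --format '{{.Swarm.NodeID}}'` on the Windows node itself prints the ID that node currently believes it has, which can be compared with the `faas-runner` row in `docker node ls`.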
For reference, this is a representative function declaration file for the Windows OpenFaaS functions:
provider:
  name: faas
  gateway: http://127.0.0.1:8080
functions:
  faas-fn:
    image: runner7:7.6-faas
    skip_build: true
    environment:
      suppress_lock: true
      read_timeout: "28800s"
      write_timeout: "28800s"
      ack_timeout: "28800s"
    constraints:
      - "node.platform.os == windows"
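Since the functions are pinned with a `node.platform.os == windows` constraint, it may be worth confirming that the manager still sees a ready, active Windows node that can satisfy it. A rough sketch follows, with hypothetical helper names; the inspect fields used (`.Description.Platform.OS`, `.Spec.Availability`, `.Status.State`) are the ones that appear in `docker node inspect` output.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print "hostname os availability state" for every node (run on a manager).
node_platforms() {
    for id in $(docker node ls -q); do
        docker node inspect \
            --format '{{.Description.Hostname}} {{.Description.Platform.OS}} {{.Spec.Availability}} {{.Status.State}}' \
            "$id"
    done
}

# Read node_platforms output on stdin and print the hostnames that are
# active, ready, and report the OS given as $1.
eligible_nodes() {
    awk -v os="$1" '$2 == os && $3 == "active" && $4 == "ready" { print $1 }'
}
```

For example, `node_platforms | eligible_nodes windows` should print `faas-runner` if the scheduler has somewhere to place the Windows functions. An empty result would suggest the constraint cannot currently be met.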
All nodes in the swarm, past and present, were created in an on-prem instance of Microsoft’s Virtual Machine Manager Self-Service Portal.
One suggestion was to try scaling the problematic services to 0 then back to 1, but it hangs when scaling back up:
ubuntu@faas-gateway:~$ sudo docker service scale faas-fn=0
[sudo] password for ubuntu:
faas-fn scaled to 0
overall progress: 0 out of 0 tasks
verify: Service converged
ubuntu@faas-gateway:~$ sudo docker service scale faas-fn=1
faas-fn scaled to 1
overall progress: 0 out of 1 tasks
1/1:
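To avoid sitting at the hung progress display, the scale-up could be issued with `--detach` and then polled with a timeout. This is just a sketch under the assumption that the manager's CLI supports `docker service scale --detach`; the function names and the 5-second polling interval are arbitrary choices of mine.

```shell
#!/bin/sh
# Sketch only: function names and polling interval are arbitrary.

# Return 0 when a REPLICAS value such as "1/1" shows running == desired.
replicas_converged() {
    case "$1" in
        */*) ;;
        *) return 1 ;;   # empty or unexpected value: not converged
    esac
    running=${1%%/*}
    desired=${1##*/}
    [ "$running" = "$desired" ]
}

# Scale without blocking, then poll for up to $3 seconds (default 60).
scale_and_wait() {
    service="$1"; count="$2"; timeout="${3:-60}"
    docker service scale --detach "$service=$count" >/dev/null || return 1
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # The name filter matches by prefix, so pick the exact name with awk.
        replicas=$(docker service ls --filter "name=$service" \
            --format '{{.Name}} {{.Replicas}}' |
            awk -v s="$service" '$1 == s { print $2 }')
        if replicas_converged "$replicas"; then
            echo "$service converged at $replicas"
            return 0
        fi
        sleep 5
        elapsed=$((elapsed + 5))
    done
    # Timed out: show the task history, including any error column.
    docker service ps --no-trunc "$service"
    return 1
}
```

`scale_and_wait faas-fn 1 120` would give up after two minutes and print the task list instead of hanging. `docker service update --force faas-fn` is another way to make the scheduler recreate the tasks, though I have not verified that it helps in this situation.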
I installed Docker on the Windows node by following the instructions on Microsoft’s website, which installed 17.06. It was pointed out that this method doesn’t install the latest version of Docker, and that I might try upgrading to 18.03.
Any ideas for getting these services back up and running or additional diagnostics to try?
Edit: I’ve found what looks like an HNS failure, but I don’t know what I can do about it:
root@faas-gateway:~# docker service ps --no-trunc=true faas-fn
ID                          NAME            IMAGE              NODE          DESIRED STATE   CURRENT STATE                 ERROR                                          PORTS
kru9sv5nuoszbn0ur28n8omr2   faas-fn.1       runner7:7.6-faas   faas-runner   Running         Running 39 seconds ago
onrhsi8un210av42qskoo8lek    \_ faas-fn.1   runner7:7.6-faas   faas-runner   Shutdown        Shutdown 35 seconds ago
dm4s2uv6je3lkhl85nenhjcic    \_ faas-fn.1   runner7:7.6-faas   faas-runner   Shutdown        Failed about a minute ago     "HNS failed with error : Element not found. "
jlxfkkvqwazlamot5w3fa20lq    \_ faas-fn.1   runner7:7.6-faas   faas-runner   Shutdown        Shutdown about a minute ago
f3levys95x559gzg92b4r8xts    \_ faas-fn.1   runner7:7.6-faas   faas-runner   Shutdown        Shutdown about a minute ago
f8egdqvwje9dosuzodpln5n7w    \_ faas-fn.1   runner7:7.6-faas   faas-runner   Shutdown        Shutdown 18 hours ago
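Because the error column is easy to miss in the wide listing above, a small filter that prints only the tasks carrying an error might help when checking the other Windows services as well. This is a sketch with hypothetical function names; it uses `|` as the field separator on the assumption that task names, node names, and states never contain that character.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Read "name|node|current state|error" lines on stdin and print only
# those whose error field is non-empty.
tasks_with_errors() {
    awk -F '|' '$4 != "" { print }'
}

# Run on a manager: list the tasks of service $1 that recorded an error.
failed_tasks() {
    docker service ps --no-trunc \
        --format '{{.Name}}|{{.Node}}|{{.CurrentState}}|{{.Error}}' "$1" |
        tasks_with_errors
}
```

Running `failed_tasks faas-fn`, and likewise for `ipconfig` and `faas-fn-optimize`, would show whether the same HNS error appears on every Windows service or only on this one.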