Swarm node came up with a different node ID?

Hello there. I’m using Docker Swarm to build a mixed-OS cluster: one Ubuntu 16.04 Linux node and one Windows Server 2016 node, the latter running a Windows application as a web function via OpenFaaS. The OpenFaaS gateway and helper containers/services must run on Linux, hence the mixed-OS cluster.

Just yesterday I had everything working. Between yesterday and today, all the Windows function containers stopped running. Looking at the event log, I can see that the Windows node was restarted in the evening.

I got a bit of help from the OpenFaaS Slack that pointed me at some troubleshooting steps, and I think what I’ve come across means that the Windows node changed IDs after rebooting - which they said is not expected behavior. This would explain why the service containers weren’t restarted.

root@faas-gateway:~# docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
e7kluijdhdtqqxudgo8o939mt *   faas-gateway        Ready               Active              Leader              18.06.1-ce
j3r3kqrv0xbcbdelqr3t03jmm     faas-runner         Ready               Active                                  17.06.2-ee-16
root@faas-gateway:~# docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE                             PORTS
arepuq0z4nvj        base64              replicated          1/1                 functions/alpine:latest
89ej897kke4o        echoit              replicated          1/1                 functions/alpine:latest
muvz7nkk0x7r        func_alertmanager   replicated          1/1                 prom/alertmanager:v0.15.0
otyzha5ah8a1        func_faas-swarm     replicated          1/1                 openfaas/faas-swarm:0.4.2
y6cn9k605p7u        func_gateway        replicated          1/1                 openfaas/gateway:0.9.2            *:8080->8080/tcp
rhe2joei0pob        func_nats           replicated          1/1                 nats-streaming:0.6.0
vn3hn43z4bgq        func_prometheus     replicated          1/1                 prom/prometheus:v2.3.1            *:9090->9090/tcp
4px6g4lfmcc9        func_queue-worker   replicated          1/1                 openfaas/queue-worker:0.5.2
vwwpzjymzek9        hubstats            replicated          1/1                 functions/hubstats:latest
tvr9evio1h1t        ipconfig            replicated          0/1                 alexellis/windows-ipconfig:nano
ik5eifgfdo74        nodeinfo            replicated          1/1                 functions/nodeinfo:latest
kqb6sytxme0t        faas-fn             replicated          0/1                 runner7:7.6-faas
o919pigchkiz        faas-fn-optimize    replicated          0/1                 runner7:7.6-faas-optimize
8tprw1ohy1en        wordcount           replicated          1/1                 functions/alpine:latest
root@faas-gateway:~# docker service logs --tail 100 faas-fn
error from daemon in stream: Error grabbing logs: rpc error: code = Unknown desc = warning: incomplete log stream. some logs could not be retrieved for the following reasons: node lz3qinihv1d2to67dbn8a6gq2 is not available

ipconfig, faas-fn, and faas-fn-optimize are the services for the Windows functions. There was another Windows node in the swarm at one point, but it was replaced one week ago by the current Windows node, and I don’t have the old node’s ID. The log request above refers to node lz3qinihv1d2to67dbn8a6gq2, which does not appear in docker node ls. That tells me that either the swarm is trying to schedule those services on the old node for some reason, or the current server changed its node ID.
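
To make that comparison less error-prone than eyeballing the IDs, here is a minimal sketch of a check I could run on the manager. The helper name node_known is my own invention, and the docker commands in the comments assume a running swarm; I haven't confirmed what they report on the 17.06 EE engine.

```shell
#!/bin/sh
# node_known (hypothetical helper): reads "ID HOSTNAME" lines on stdin and
# exits 0 if the node ID given as $1 matches the first column of any line,
# otherwise exits 1.
node_known() {
    awk -v id="$1" '$1 == id { found = 1 } END { exit !found }'
}

# Intended usage on the manager (not run here, needs a live swarm):
#   docker node ls --format '{{.ID}} {{.Hostname}}' \
#       | node_known lz3qinihv1d2to67dbn8a6gq2 \
#       && echo "node is a current swarm member" \
#       || echo "node is not a current swarm member"
#
# On the Windows node itself, its own swarm ID can be printed with:
#   docker info --format '{{.Swarm.NodeID}}'
```

If the ID printed on the Windows node matches what docker node ls shows for faas-runner, the unknown ID in the log error more likely belongs to the replaced node than to a renamed one.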

For reference, this is a representative function declaration file for the Windows OpenFaaS functions:

provider:
  name: faas
  gateway: http://127.0.0.1:8080

functions:
  faas-fn:
    image: runner7:7.6-faas
    skip_build: true
    environment:
      suppress_lock: true
      read_timeout: "28800s"
      write_timeout: "28800s"
      ack_timeout: "28800s"
    constraints:
      - "node.platform.os == windows"

All nodes in the swarm, past and present, were created in an on-prem instance of Microsoft’s Virtual Machine Manager Self-Service Portal.

One suggestion was to try scaling the problematic services to 0 then back to 1, but it hangs when scaling back up:

ubuntu@faas-gateway:~$ sudo docker service scale faas-fn=0
[sudo] password for ubuntu:
faas-fn scaled to 0
overall progress: 0 out of 0 tasks
verify: Service converged
ubuntu@faas-gateway:~$ sudo docker service scale faas-fn=1
faas-fn scaled to 1
overall progress: 0 out of 1 tasks
1/1:

I installed Docker on the Windows node by following the instructions on Microsoft’s website, which installed 17.06. It was pointed out that those instructions don’t install the latest version of Docker, and that I might try upgrading to 18.03.

Any ideas for getting these services back up and running or additional diagnostics to try?

Edit: I’ve found what looks like an HNS failure, but I don’t know what I can do about it:

root@faas-gateway:~# docker service ps --no-trunc=true faas-fn
ID                          NAME                IMAGE                  NODE                   DESIRED STATE       CURRENT STATE                 ERROR                                           PORTS
kru9sv5nuoszbn0ur28n8omr2   faas-fn.1           runner7:7.6-faas       faas-runner            Running             Running 39 seconds ago
onrhsi8un210av42qskoo8lek    \_ faas-fn.1       runner7:7.6-faas       faas-runner            Shutdown            Shutdown 35 seconds ago
dm4s2uv6je3lkhl85nenhjcic    \_ faas-fn.1       runner7:7.6-faas       faas-runner            Shutdown            Failed about a minute ago     "HNS failed with error : Element not found. "
jlxfkkvqwazlamot5w3fa20lq    \_ faas-fn.1       runner7:7.6-faas       faas-runner            Shutdown            Shutdown about a minute ago
f3levys95x559gzg92b4r8xts    \_ faas-fn.1       runner7:7.6-faas       faas-runner            Shutdown            Shutdown about a minute ago
f8egdqvwje9dosuzodpln5n7w    \_ faas-fn.1       runner7:7.6-faas       faas-runner            Shutdown            Shutdown 18 hours ago
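
Since the error column is easy to miss in the wide table, a small filter along these lines could pull out only the tasks that recorded an error. This is a sketch: failed_tasks is a made-up helper name, and the "|" separator is my own choice for the --format string, assuming error messages don't contain that character.

```shell
#!/bin/sh
# failed_tasks (hypothetical helper): reads "NODE|CURRENT STATE|ERROR" lines
# on stdin and prints only those whose third field (the error) is non-empty.
failed_tasks() {
    awk -F '|' '$3 != ""'
}

# Intended usage on the manager (not run here, needs a live swarm):
#   docker service ps --no-trunc \
#       --format '{{.Node}}|{{.CurrentState}}|{{.Error}}' faas-fn \
#       | failed_tasks
```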

I’ve found a fix, but I still don’t know the root cause. Redeploying the functions via faas-cli deploy caused the services to start their replicas successfully. The replicas all show up in docker service ls, I can read their logs with docker service logs -f FUNCTION, and each task is listed as Running in docker service ps FUNCTION.
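
For anyone repeating this, here is a rough sketch of how the redeploy could be checked from the manager. The file name stack.yml is a placeholder for whatever function declaration file is in use, and under_replicated is a hypothetical helper of mine, not part of Docker or OpenFaaS.

```shell
#!/bin/sh
# under_replicated (hypothetical helper): reads "NAME RUNNING/DESIRED" lines
# on stdin and prints the names of services running fewer replicas than
# desired (for example "0/1").
under_replicated() {
    awk '{ split($2, r, "/"); if (r[1] + 0 < r[2] + 0) print $1 }'
}

# Intended usage on the manager (not run here, needs a live swarm):
#   faas-cli deploy -f stack.yml    # stack.yml is a placeholder file name
#   docker service ls --format '{{.Name}} {{.Replicas}}' | under_replicated
```

An empty result from the last command would mean every service has reached its desired replica count.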