Cannot re-deploy service to node after unexpected restart (Simulated node death)

I am facing an issue where a service cannot be run on a swarm node that previously failed and lost all of its data (simulating a critical HW failure).

If I manually remove the old node and rejoin it, I still cannot scale the service back up: the docker service scale command just hangs indefinitely.
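
For reference, the remove/rejoin went roughly like this (from memory, and assuming the node has to be demoted first since it was a manager):

# on a surviving manager:
docker node demote sh-apisix-2
docker node rm --force sh-apisix-2
docker swarm join-token manager

# on the reinstalled sh-apisix-2, using the join command / token printed above:
docker swarm join --token <manager-token> <manager-address>:2377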

In docker node ls I can clearly see the node (the sh-apisix-2 one) as Ready and Active:

ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
49skokrsbhvvky6bx1pvptema *   sh-apisix-1   Ready     Active         Leader           24.0.2
o7905ox7gf4htt8m2hko7frcr     sh-apisix-2   Ready     Active         Reachable        24.0.4
v04f8pet6sgtanavvr000vzag     sh-apisix-3   Ready     Active         Reachable        24.0.2

Yet subsequently running docker service scale apisix_etcd=3 only outputs:

apisix_etcd scaled to 3
overall progress: 2 out of 3 tasks
1/3: running   [==================================================>]
2/3: running   [==================================================>]
3/3:

And… Never anything more.

The nodes’ dockerd logs (grep dockerd /var/log/syslog) do not indicate anything being wrong. Yet I cannot even see the new node in docker service ps apisix_etcd:

ID             NAME            IMAGE                NODE                        DESIRED STATE   CURRENT STATE          ERROR                              PORTS
1tgb40ux1g5i   apisix_etcd.1   bitnami/etcd:3.4.9   sh-apisix-1                 Running         Running 5 days ago
ms5dym9xo2zf   apisix_etcd.2   bitnami/etcd:3.4.9   24lkfyzhmroswozgxqx16mdqw   Shutdown        Rejected 2 hours ago   "cannot create a swarm scoped …"
xyhib7aqrqgy   apisix_etcd.3   bitnami/etcd:3.4.9   sh-apisix-3                 Running         Running 5 days ago

The second node is the one I already had to remove, and the one that rejoined doesn’t appear to have joined the service / run the task.

Yet the other service from the stack did deploy without issues:

docker service ps apisix_apisix:

ID             NAME                  IMAGE                        NODE                        DESIRED STATE   CURRENT STATE                ERROR                              PORTS
tdy4nfertnbn   apisix_apisix.1       apache/apisix:3.4.0-debian   sh-apisix-3                 Running         Running 6 days ago
zfx773zp2jbf   apisix_apisix.2       apache/apisix:3.4.0-debian   sh-apisix-1                 Running         Running 6 days ago
wu7aq6l4tq3m   apisix_apisix.3       apache/apisix:3.4.0-debian   sh-apisix-2                 Running         Running 13 minutes ago

The service definition file:

version: "3.8"

services:
  apisix:
    image: "apache/apisix:3.4.0-debian"
    volumes:
      - /var/apisix/config.yaml:/usr/local/apisix/conf/config.yaml:ro
    depends_on:
      - etcd
    ports:
      - "9180:9180/tcp"
      - "9080:9080/tcp"
      - "9091:9091/tcp"
      - "9443:9443/tcp"
    networks:
      - apisix
    deploy:
      mode: replicated
      replicas: 3
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 5
      update_config:
        parallelism: 1
        delay: 10s
      placement:
        max_replicas_per_node: 1

  etcd:
    image: bitnami/etcd:3.4.9
    user: root
    extra_hosts:
      - "sh-apisix-1.internal:172.16.0.1"
      - "sh-apisix-2.internal:172.16.0.2"
      - "sh-apisix-3.internal:172.16.0.3"
    volumes:
      - /var/lib/etcd:/etcd_data:rw
    environment:
      ETCD_DATA_DIR: /etcd_data
      ETCD_ENABLE_V2: "true"
      ALLOW_NONE_AUTHENTICATION: "yes"
      ETCD_NAME: "{{.Node.Hostname}}"
      ETCD_ADVERTISE_CLIENT_URLS: "http://{{.Node.Hostname}}.internal:2379"
      ETCD_LISTEN_CLIENT_URLS: "http://0.0.0.0:2379"
      ETCD_LISTEN_PEER_URLS: "http://0.0.0.0:2380"
      ETCD_INITIAL_CLUSTER: "sh-apisix-1=http://sh-apisix-1.internal:2380,sh-apisix-2=http://sh-apisix-2.internal:2380,sh-apisix-3=http://sh-apisix-3.internal:2380"
      ETCD_INITIAL_CLUSTER_STATE: "new"
      ETCD_INITIAL_CLUSTER_TOKEN: "token-00"
      ETCD_INITIAL_ADVERTISE_PEER_URLS: "http://{{ .Node.Hostname }}.internal:2380"
    ports:
      - "2379:2379/tcp"
      - "2380:2380/tcp"
    networks:
      - apisix
    deploy:
      mode: replicated
      replicas: 3
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 5
      update_config:
        parallelism: 1
        delay: 10s
      placement:
        max_replicas_per_node: 1

networks:
  apisix:
    driver: overlay
    attachable: true
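
For completeness, the stack is deployed the standard way; the stack name apisix matches the service names above, while the file name below is just what I call it locally:

docker stack deploy --compose-file apisix-stack.yml apisix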

Would anyone be able to point me in the right direction, please? I’m… kinda desperate…

There is a truncated error message that probably would shed some light on the issue.

Try docker service ps apisix_etcd --no-trunc

Yes and no. It just states that the task failed because it “cannot create a swarm scoped network when swarm is not active”, which is the error this whole issue started with. To resolve it, I tried scaling the service down and back up, restarting the second node, making sure it was part of the swarm, and finally completely reinstalling it while wiping the whole /var/lib/docker/* content. But to no avail.
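
Roughly, those attempts looked like this (approximate order):

# from a manager: scale the etcd service down, then back up
docker service scale apisix_etcd=2
docker service scale apisix_etcd=3

# on sh-apisix-2: leave the swarm and wipe the local engine state before reinstalling
docker swarm leave --force
systemctl stop docker
rm -rf /var/lib/docker/*
systemctl start docker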

Now, after re-joining the cluster, the second node isn’t even present in the task’s process list:

:~# docker service ps --no-trunc apisix_etcd
ID                          NAME            IMAGE                                                                                        NODE                        DESIRED STATE   CURRENT STATE          ERROR                                                             PORTS
1tgb40ux1g5idq0c7ahlp7wsb   apisix_etcd.1   bitnami/etcd:3.4.9@sha256:ec70db1eef17ef58d1d05e10a2797a6ba378a31a5a9eb1ea9bd6d911b155e8fe   sh-apisix-1                 Running         Running 5 days ago
ms5dym9xo2zfx5lo0u3os83pd   apisix_etcd.2   bitnami/etcd:3.4.9@sha256:ec70db1eef17ef58d1d05e10a2797a6ba378a31a5a9eb1ea9bd6d911b155e8fe   24lkfyzhmroswozgxqx16mdqw   Shutdown        Rejected 3 hours ago   "cannot create a swarm scoped network when swarm is not active"
xyhib7aqrqgy1sswgc06i9t0r   apisix_etcd.3   bitnami/etcd:3.4.9@sha256:ec70db1eef17ef58d1d05e10a2797a6ba378a31a5a9eb1ea9bd6d911b155e8fe   sh-apisix-3                 Running         Running 5 days ago

Note, however, that the second node’s old entry remains, just under a random-looking name instead of its hostname, leading me to think that it’s somehow “stuck” in the swarm’s configuration. Perhaps this is the source of my problems?
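
If it helps, inspecting that name from a manager should show whether it is still registered as a (down) node or already gone; I’m assuming the value in the NODE column above is the dead node’s old swarm ID:

# prints a state such as "down", or errors with "no such node" if the entry is gone
docker node inspect 24lkfyzhmroswozgxqx16mdqw --format '{{ .Status.State }}'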

Yet I don’t seem to be able to deploy the service to the new node that replaced it.

What does docker node ls say?

It’s in the initial post:

:~# docker node ls
ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
49skokrsbhvvky6bx1pvptema *   sh-apisix-1   Ready     Active         Leader           24.0.2
o7905ox7gf4htt8m2hko7frcr     sh-apisix-2   Ready     Active         Reachable        24.0.4
v04f8pet6sgtanavvr000vzag     sh-apisix-3   Ready     Active         Reachable        24.0.2

Looks like a bug to me. You might want to raise an issue in the BuildKit GitHub project.

Furthermore, you might want to check on the node itself whether it reports swarm mode as enabled: docker info --format '{{.Swarm.LocalNodeState}}'. There might be a misalignment between the membership state recorded in the swarm and the node’s local state.
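
For example, comparing the node’s local view with what the managers have registered should show whether the two agree:

# on sh-apisix-2 itself:
docker info --format '{{ .Swarm.LocalNodeState }} / {{ .Swarm.NodeID }}'

# on a manager, to compare against the ID reported above:
docker node ls --format '{{ .ID }}  {{ .Hostname }}  {{ .Status }}  {{ .ManagerStatus }}'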

Alright, I’ll try raising a GitHub issue. Thanks for the pointer.

And just for the record, yes, docker info --format '{{.Swarm.LocalNodeState}}' reports “active”.

The scheduler should periodically try to reconcile the current state with the desired state, and should find the new node to be a suitable node for the deployment. For whatever reason, it doesn’t seem to do so.
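
In the meantime, you could try to nudge it: a forced update re-evaluates placement for every task of the service (note that it will also restart the currently running replicas):

docker service update --force apisix_etcd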