Docker Community Forums

Docker Swarm: bulk sync to node failed

Hi all,
I’m running a Docker Swarm cluster with 3 manager nodes. The Docker versions on these nodes are 19.03.2, 19.03.1, and 18.06.1-ce.
Recently I ran into a problem that caused one node to be marked as failed. I checked the Docker daemon logs on all 3 nodes, and here’s what I found:

1. On node A:
level=warning msg="memberlist: Push/Pull with 195b81e7dbab failed: dial tcp IP-node-B:7946: i/o timeout"
level=warning msg="bulk sync to node 195b81e7dbab failed: failed to send a TCP message during bulk sync: dial tcp IP-node-B:7946: i/o timeout"
level=info msg="memberlist: Suspect 195b81e7dbab has failed, no acks received"
level=info msg="memberlist: Marking 195b81e7dbab as failed, suspect timeout reached (0 peer confirmations)"
level=info msg="Node 195b81e7dbab/IP-node-B, left gossip cluster"
level=info msg="Node 195b81e7dbab/IP-node-B, added to failed nodes list"
level=info msg="Node 195b81e7dbab/IP-node-B, joined gossip cluster"

2. On node B:
level=error msg="Bulk sync to node 9b2291a0f417 timed out"
level=warning msg="bulk sync to node 9b2291a0f417 failed: failed to send a TCP message during bulk sync: dial tcp IP-node-A:7946: i/o timeout"
level=warning msg="memberlist: Was able to connect to 9b2291a0f417 but other probes failed, network may be misconfigured"
level=warning msg="memberlist: failed to receive: read tcp IP-node-B:7946->IP-node-A:46810: i/o timeout from=IP-node-A:46810"
level=warning msg="bulk sync to node 9b2291a0f417 failed: failed to send a TCP message during bulk sync: dial tcp IP-node-A:7946: i/o timeout"
level=warning msg="memberlist: Refuting a suspect message (from: 9b2291a0f417)"
level=error msg="Bulk sync to node 9b2291a0f417 timed out"
level=warning msg="memberlist: Push/Pull with 9b2291a0f417 failed: dial tcp IP-node-A:7946: i/o timeout"
level=warning msg="bulk sync to node 9b2291a0f417 failed: failed to send a TCP message during bulk sync: dial tcp IP-node-A:7946: i/o timeout"

3. On node C:
level=warning msg="memberlist: Refuting a suspect message (from: 195b81e7dbab)"
level=info msg="memberlist: Suspect 195b81e7dbab has failed, no acks received"
level=info msg="memberlist: Marking 195b81e7dbab as failed, suspect timeout reached (0 peer confirmations)"

According to the logs, it seems that node A and node B cannot communicate with each other over TCP on port 7946. As a result, the gossip cluster treats one of them as unavailable: both node A and node C mark 195b81e7dbab (node B) as failed.
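
To double-check which node each daemon considers failed, the log lines can be tallied by node ID. This is only a minimal sketch, assuming the daemon log lines have been saved as plain text. The helper name `failed_node_ids` is hypothetical, and the pattern only matches the "Marking ... as failed" wording seen in the logs above.

```python
import re
from collections import Counter

# Matches the memberlist wording seen in the logs above, e.g.
# "memberlist: Marking 195b81e7dbab as failed, suspect timeout reached".
MARKED_FAILED = re.compile(r"memberlist: Marking (\w+) as failed")


def failed_node_ids(log_lines):
    """Count how often each node ID is marked as failed (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        match = MARKED_FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines from each node separately shows, per node, which peer IDs that daemon gave up on.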

However, I ran netstat -tuplen on all 3 nodes and found that port 7946 is still listening on each of them.
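
A listening socket in netstat only shows that the local daemon has bound the port; it does not show that the other nodes can actually reach it. Below is a minimal sketch of a TCP reachability check that could be run from each node against its peers, assuming Python 3 is available on the hosts. The helper name `tcp_reachable` and the 3-second timeout are illustrative choices, and the check covers TCP only (Swarm also uses UDP on port 7946).

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False
```

For example, calling `tcp_reachable("IP-node-B", 7946)` on node A (with the real address substituted) would return False if the same i/o timeout seen in the logs still occurs.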

Does anyone have any idea why this happened?