Node doesn't update status to UCP

Cluster: 1 manager, 4 workers, 1 DTR
OS: RHEL 7.7
Docker version: 19.03.5

From one day to the next, three workers and the DTR node started showing the status “Node-local UCP component status was last updated XXXXXX seconds ago”, and the Dynatrace logs pointed to slowness in the cluster services. I rebooted one of the affected nodes once, but it made no difference.

Finally, we removed one of the affected nodes from the cluster and added it back. Since then, the ucp-agent container cannot become healthy and restarts every 10-12 seconds, although other deployments made by the teams are still distributed normally across all workers.

DOCKERD LOG (MANAGER)

-- Logs begin at Tue 2020-04-28 17:08:50 -03, end at Thu 2020-04-30 13:57:54 -03. --

Apr 30 13:57:48 MANAGER dockerd[20799]: time="2020-04-30T13:57:48.329302748-03:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from "10.11.17.147:50532": remote error: tls: bad certificate" module=grpc

Apr 30 13:57:48 MANAGER dockerd[20799]: time="2020-04-30T13:57:48.613578850-03:00" level=error msg="failed to sign CSR" error="unable to perform certificate signing request: Post https://10.11.17.141:12381/api/v1/cfssl/sign: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "swarm-ca")" method="(*Server).signNodeCert" module=ca node.id=ct7db0bk53ucmqn602h2ckma4

Apr 30 13:57:49 MANAGER dockerd[20799]: time="2020-04-30T13:57:49.266491155-03:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from "10.11.17.147:50534": remote error: tls: bad certificate" module=grpc
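
In case it helps anyone narrowing down which node is presenting the bad certificate, here is a minimal Python sketch that counts the "tls: bad certificate" handshake warnings per remote IP in the manager's dockerd log. The function name `count_bad_cert_peers` is my own, and the regex is only based on the message format in the lines above, so treat it as an assumption rather than a stable log format.

```python
import re
from collections import Counter

# Pattern derived from the dockerd warnings quoted above (an assumption,
# not a documented format). It accepts straight or typographic quotes,
# optionally backslash-escaped, around the "ip:port" pair.
HANDSHAKE_RE = re.compile(
    r'security handshake from \\?["“](?P<ip>\d{1,3}(?:\.\d{1,3}){3}):\d+\\?["”]'
    r': remote error: tls: bad certificate'
)


def count_bad_cert_peers(lines):
    """Return a Counter mapping remote IP -> number of bad-certificate handshakes."""
    peers = Counter()
    for line in lines:
        match = HANDSHAKE_RE.search(line)
        if match:
            peers[match.group("ip")] += 1
    return peers
```

One way to use it would be to pipe the manager's dockerd journal output into a small script that passes `sys.stdin` to `count_bad_cert_peers` and prints `most_common()`. For the three lines above it would report 10.11.17.147 twice.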

DOCKERD LOG (BROKEN WORKER)

-- Logs begin at Thu 2020-04-30 13:19:46 -03, end at Thu 2020-04-30 15:28:37 -03. --

Apr 30 15:26:55 WORKER4 dockerd[2442]: time="2020-04-30T15:26:55.808051780-03:00" level=warning msg="7aeae2455217fe0aee791e5955be43f21cc4820faa9ef91c4e9a156b95d3ec01 cleanup: failed to unmount IPC: umount /var/lib/docker/containers/7aeae2455217fe0aee791e5955be43f21cc4820faa9ef91c4e9a156b95d3ec01/mounts/shm, flags: 0x2: no such file or directory"

Apr 30 15:26:55 WORKER4 dockerd[2442]: time="2020-04-30T15:26:55.824540165-03:00" level=error msg="fatal task error" error="task: non-zero exit (1)" module=node/agent/taskmanager node.id=ct7db0bk53ucmqn602h2ckma4 service.id=ciuxeokovt1aevgnlka268g6z task.id=a81sv93scrl7zc56pix1he948

Apr 30 15:27:11 WORKER4 dockerd[2442]: time="2020-04-30T15:27:11.857760984-03:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

UCP-AGENT CONTAINER LOG (WHILE TRYING TO BECOME HEALTHY)

{"level":"info","msg":"Loading local node Docker Info","time":"2020-04-30T18:31:02Z"}

{"level":"info","msg":"Loading local node TLS configuration","time":"2020-04-30T18:31:02Z"}

{"level":"info","msg":"Loading Node TLS Config","time":"2020-04-30T18:31:02Z"}

{"level":"info","msg":"UCP Node Certs do not exist - falling back to Swarm-mode node certs","time":"2020-04-30T18:31:02Z"}

{"level":"info","msg":"Connecting to etcd cluster at addresses [10.11.17.141]","time":"2020-04-30T18:31:02Z"}

{"level":"info","msg":"Attempting to connect to the etcd cluster at the following addresses: [10.11.17.141:12379]","time":"2020-04-30T18:31:02Z"}
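
Since the container restarts every few seconds, it can be handy to pull out the last step the agent logged before it exited. Below is a rough Python sketch for that, assuming each log line is a single JSON object with `msg` and `time` fields as in the output above. The helper name `last_agent_step` is hypothetical. In the excerpt above, the last entry is the etcd connection attempt, which may be where the agent fails, though the log alone does not prove it.

```python
import json


def last_agent_step(lines):
    """Return (time, msg) of the last parseable JSON log entry, or None."""
    last = None
    for line in lines:
        # Logs pasted through a forum often carry typographic quotes;
        # normalise them so the line parses as JSON again.
        text = line.strip().replace("“", '"').replace("”", '"')
        if not text:
            continue
        try:
            entry = json.loads(text)
        except json.JSONDecodeError:
            # Skip anything that is not a JSON log line.
            continue
        if isinstance(entry, dict) and "msg" in entry:
            last = (entry.get("time"), entry["msg"])
    return last
```

Feeding it the captured output of the agent container across a few restarts should show whether it always stops at the same step.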

Seems like certificate rotation got messed up. Been there, tried to fix it, and failed…

Even though there is a well-described solution in the Success Center, I never managed to correct this error myself. Instead of trying to fix it yourself, I would strongly advise raising a support ticket.