Docker Community Forums

Node doesn't Update Stats to UCP

Cluster: 1 manager, 4 workers, 1 DTR
OS: RHEL 7.7
Docker version: 19.03.5

What happened was that, from one day to the next, 3 workers and the DTR node got the status "Node-local UCP component status was last updated XXXXXX seconds ago", and the Dynatrace log pointed to slowness in the cluster services. I rebooted one of the problem nodes once, but nothing changed.

Finally, we removed one of the problem nodes and added it back to the cluster. After that, the ucp-agent container never becomes HEALTHY and restarts every 10-12 seconds, but other deploys made by the teams are distributed normally across all workers.

LOG DOCKERD MANAGER

-- Logs begin at Tue 2020-04-28 17:08:50 -03, end at Thu 2020-04-30 13:57:54 -03. --

Apr 30 13:57:48 MANAGER dockerd[20799]: time="2020-04-30T13:57:48.329302748-03:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"10.11.17.147:50532\": remote error: tls: bad certificate" module=grpc

Apr 30 13:57:48 MANAGER dockerd[20799]: time="2020-04-30T13:57:48.613578850-03:00" level=error msg="failed to sign CSR" error="unable to perform certificate signing request: Post https://10.11.17.141:12381/api/v1/cfssl/sign: x509: certificate signed by unknown authority (possibly because of \"x509: ECDSA verification failure\" while trying to verify candidate authority certificate \"swarm-ca\")" method="(*Server).signNodeCert" module=ca node.id=ct7db0bk53ucmqn602h2ckma4

Apr 30 13:57:49 MANAGER dockerd[20799]: time="2020-04-30T13:57:49.266491155-03:00" level=warning msg="grpc: Server.Serve failed to complete security handshake from \"10.11.17.147:50534\": remote error: tls: bad certificate" module=grpc

LOG DOCKERD WORKER BROKEN

-- Logs begin at Thu 2020-04-30 13:19:46 -03, end at Thu 2020-04-30 15:28:37 -03. --

Apr 30 15:26:55 WORKER4 dockerd[2442]: time="2020-04-30T15:26:55.808051780-03:00" level=warning msg="7aeae2455217fe0aee791e5955be43f21cc4820faa9ef91c4e9a156b95d3ec01 cleanup: failed to unmount IPC: umount /var/lib/docker/containers/7aeae2455217fe0aee791e5955be43f21cc4820faa9ef91c4e9a156b95d3ec01/mounts/shm, flags: 0x2: no such file or directory"

Apr 30 15:26:55 WORKER4 dockerd[2442]: time="2020-04-30T15:26:55.824540165-03:00" level=error msg="fatal task error" error="task: non-zero exit (1)" module=node/agent/taskmanager node.id=ct7db0bk53ucmqn602h2ckma4 service.id=ciuxeokovt1aevgnlka268g6z task.id=a81sv93scrl7zc56pix1he948

Apr 30 15:27:11 WORKER4 dockerd[2442]: time="2020-04-30T15:27:11.857760984-03:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

LOG CONTAINER UCP-AGENT TRY HEALTHY

{"level":"info","msg":"Loading local node Docker Info","time":"2020-04-30T18:31:02Z"}

{"level":"info","msg":"Loading local node TLS configuration","time":"2020-04-30T18:31:02Z"}

{"level":"info","msg":"Loading Node TLS Config","time":"2020-04-30T18:31:02Z"}

{"level":"info","msg":"UCP Node Certs do not exist - falling back to Swarm-mode node certs","time":"2020-04-30T18:31:02Z"}

{"level":"info","msg":"Connecting to etcd cluster at addresses [10.11.17.141]","time":"2020-04-30T18:31:02Z"}

{"level":"info","msg":"Attempting to connect to the etcd cluster at the following addresses: [10.11.17.141:12379]","time":"2020-04-30T18:31:02Z"}

Seems like certificate rotation was messed up. Been there, tried to fix it, and failed…

Even though there is a well-described solution in the Success Center, I never managed to correct this error myself. Instead of trying it yourself, I would strongly advise raising a support ticket.
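Before opening the ticket, it can help to confirm the CA mismatch the manager log hints at. A minimal sketch, assuming the default swarm certificate paths under /var/lib/docker/swarm (run it on the manager and on a broken worker and compare the issuer and CA fingerprint — they should match on a healthy cluster):

```shell
# Show who signed this node's swarm cert and when it expires.
openssl x509 -in /var/lib/docker/swarm/certificates/swarm-node.crt \
  -noout -issuer -dates

# Show the root CA this node trusts; compare its fingerprint across nodes.
openssl x509 -in /var/lib/docker/swarm/certificates/swarm-root-ca.crt \
  -noout -subject -fingerprint -sha256
```

If the worker's cert issuer does not correspond to the manager's root CA, that lines up with the "certificate signed by unknown authority … swarm-ca" error in the dockerd log.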

The cause is related to the ucp-metrics pods not being able to rotate their certificates.

Upgrade to the latest UCP version (this has been fixed in 3.1.15, 3.2.8, and 3.3.2).

Otherwise, you can use kubectl to delete all ucp-metrics pods; the DaemonSet will restart them automatically and you will see all metrics back on your dashboard:

kubectl delete pod/ucp-metrics-xxxxx pod/ucp-metrics-yyyyy pod/ucp-metrics-zzzzz -n kube-system
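Since the random pod-name suffixes differ per cluster, the same deletion can be scripted instead of typing each name. A sketch that filters by the pod-name prefix rather than assuming any label selector:

```shell
# List kube-system pods, keep only those whose name starts with
# "ucp-metrics-", and delete them; the DaemonSet recreates them.
kubectl get pods -n kube-system --no-headers \
  | awk '/^ucp-metrics-/{print $1}' \
  | xargs -r kubectl delete pod -n kube-system
```

`xargs -r` skips the delete entirely if no matching pods are found.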