Docker Community Forums

Share and learn in the Docker community.

UCP 2.1 continuously reconfiguring

(Rlaveycal) #1

I’ve just upgraded my 2 node DDC cluster from cs engine 1.12 / UCP 2.0 / DDR 2.1 to the latest versions. Since then the UCP agents on both nodes have been in a state of continuous restarting.


Centos 7.3 on AWS t2.medium instances. 1 UCP manager; 1 UCP worker (with DDR)

docker info on the worker:

Containers: 26
 Running: 12
 Paused: 0
 Stopped: 14
Images: 21
Server Version: 1.13.1-cs1
Storage Driver: devicemapper
 Pool Name: docker-thinpool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 3.45 GB
 Data Space Total: 10.2 GB
 Data Space Available: 6.746 GB
 Metadata Space Used: 852 kB
 Metadata Space Total: 104.9 MB
 Metadata Space Available: 104 MB
 Thin Pool Minimum Free Space: 1.019 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: eebh1phvkjvbqmvqo2u5ugmcx
 Is Manager: false
 Node Address:
 Manager Addresses:
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: 949e6fa
Security Options:
  Profile: default
Kernel Version: 3.10.0-514.6.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.451 GiB
Name: calaawsdock02.calastone.local
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Experimental: false
Insecure Registries:
Live Restore Enabled: false

Manager is configured with a CA signed wildcard certificate

The Nodes page in UCP usually shows “Unable to determine node state” or “The ucp-agent task is finished” for the manager, and “The ucp-agent task is shutdown” for the worker.

Running docker ps on either node just hangs.

DTR seems to be running OK.

journalctl -u docker -b -r on the manager node contains entries like

Feb 14 09:05:09 calaawsdock01.calastone.local dockerd[27399]: time="2017-02-14T09:05:09.368994838Z" level=warning msg="Health check for container 4ceb9a7ee7d5505abb7fccb129407677efc4e8a7ab8be72335213662cc125f39 error: unable to find user nobody: no matching entries in passwd file"
Feb 14 09:27:04 calaawsdock01.calastone.local dockerd[27399]: 2017/02/14 09:27:04 grpc: Server.processUnaryRPC failed to write status stream error: code = 4 desc = "context deadline exceeded"
Feb 14 09:50:30 calaawsdock01.calastone.local dockerd[27399]: time="2017-02-14T09:50:30.670938918Z" level=warning msg="Health check for container 4d061e77bff9db544ab276b873de231a1b77a1a24471ad30ad1811d66726f751 error: rpc error: code = 4 desc = context deadline exceeded"

On the worker…

Feb 14 09:50:39 calaawsdock02.calastone.local dockerd[804]: time="2017-02-14T09:50:39.193377911Z" level=error msg="Error unmounting container 4e36f0502f12f17a0beac709cde44a71a5e4e1f867bd1b34b1a008ffc58aa859: invalid argument"
Feb 14 09:50:39 calaawsdock02.calastone.local dockerd[804]: time="2017-02-14T09:50:39.193335168Z" level=error msg="devmapper: Error unmounting device a46ed8f5a1991fd4c7ab43cbf429d1c8b8bd13d7690d882d4c5eb677e7c30823: invalid argument"

I’ve also seen similar agent log entries to those mentioned in Issues Installing UCP 2.1.0 on Docker 1.13 in one of the rare occasions when the agent was running.

(Jsoler) #2

What do you get with docker ps and docker service ps ucp-agent?
I prefer a setup of at least 3 UCP manager nodes and 3 workers to host DTR. Every time I get this type of error (“devmapper: Error unmounting device …”) I restart docker or reboot the server, because I don’t know a proper way to release the device resources.
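For reference, a sketch of the diagnostics suggested above. The service name ucp-agent is taken from the thread; run the service commands on a manager node, as docker service only works against a swarm manager:

```shell
# Check local container state on the affected node (may hang, per the report)
docker ps

# On a manager: list the ucp-agent service tasks across the cluster,
# including recently shut-down tasks and their errors
docker service ps ucp-agent

# Blunt workaround mentioned above: restart the engine on the affected node
sudo systemctl restart docker
```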

(Wwiii) #3

Seeing the same issues. Thought I was the only one.

Restarting the docker service and rebooting didn’t help.

docker ps shows the ucp-agent with the updated 2.1.0 container.

Various boxes exhibit the issue over time.

Extremely disconcerting. Eager to see an update on this issue.

(EManalo) #4

I am seeing the same issue when we upgraded to CS engine 1.13.1-cs1 and UCP 2.1.0. We have a 3 UCP node cluster, 3 DTRs, and 3 Container nodes in our test environment using the standard cert, all on RHEL 7.3. Previous version setup was UCP 2.0.2, DTR 2.1.4 and CS Engine 1.12.6-cs7.

(Wwiii) #5


What versions are you running?

Docker version 1.13.0-cs1-rc1, build 9989e84
RHEL 7.3

3 Node UCP
3 Node DTR
3 Node Worker

(Michael) #6

I’m seeing similar issues. I was seeing it on a cluster with 3 DTRs, 3 controllers, and 2 workers, so I went back and retried with a 3-controller, 2-worker setup. It looked okay for a bit but then went back to seemingly random nodes reporting timeouts, and then those nodes being okay and other nodes reporting timeouts. Essentially it reports that a node is unhealthy and has timed out on a _ping. I’ve tried adjusting some of the heartbeat and other settings with no luck.

This is at:
Red Hat Enterprise Linux Server release 7.3 (Maipo)

(Wwiii) #7

I have a paid support account. Going to open a ticket about this to help expedite.

(Trapier) #8

Thanks for reaching out wwiii!

This symptom is consistent with a UCP 2.1 problem we’re tracking under internal issue ID orca/#5917. The issue is expected to be corrected by the following:

docker service update --mount-add type=bind,source=/etc/docker,target=/etc/docker ucp-agent

The fix for this issue is targeted for release with the next patch release of UCP 2.1. Am working on a KB for this issue and will post a link here when it’s up.
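After applying the update, one way to confirm the bind mount landed on the service spec (a sketch; the --format template assumes the standard docker service inspect JSON layout, where mounts live under Spec.TaskTemplate.ContainerSpec):

```shell
# Add the /etc/docker bind mount to the ucp-agent service (from the post above)
docker service update --mount-add type=bind,source=/etc/docker,target=/etc/docker ucp-agent

# Verify the mount is now present in the service spec
docker service inspect \
  --format '{{ json .Spec.TaskTemplate.ContainerSpec.Mounts }}' ucp-agent
```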

(Michael) #9

I tried
docker service update --mount-add type=bind,source=/etc/docker,target=/etc/docker ucp-agent
for my issue. It seemed okay for a couple of hours yesterday, but overnight I am seeing the same thing as before: bouncing between nodes with “Unhealthy UCP Node” reports and “timeout exceeded while awaiting headers”.

(EManalo) #10

Tried the recommended docker service update command as well, and it did not fix the issue. The unhealthy UCP node and timeout are still showing repeatedly.

(Trapier) #11

Update on engineering investigation into potential lingering stability issues after adding the ucp-agent /etc/docker bind mount mentioned earlier.

The metrics UI in UCP 2.1.0 periodically polls for container file system utilization. The processing of these endpoints in the engine takes out locks on container objects and is relatively slow on devicemapper, resulting in contention. An upstream issue has been opened to explore options in the engine, including a proposal to remove the need for locks when determining container file system utilization.

Secondary signatures of this issue may include:

  • kernel logs journalctl -fk show an uninterrupted stream of umount events against dm devices
  • dockerd frequenting the top of top, with ~100% cpu utilization
  • docker ps run against the local engine (no client bundle loaded) takes 30+ seconds to complete.
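The signatures above can be checked quickly with something like the following (a sketch; run on the affected node, with no client bundle loaded so the commands hit the local engine):

```shell
# Watch kernel logs for an uninterrupted stream of umount events on dm devices
journalctl -fk | grep --line-buffered umount

# Check whether dockerd is pinning a CPU (~100% utilization)
top -b -n 1 | grep dockerd

# Time a local docker ps; 30+ seconds to complete indicates the issue
time docker ps >/dev/null
```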

As an interim work-around the following can be applied on all managers to disable the UCP metrics backend:

docker exec -it ucp-metrics sed -i s:^:#: /etc/prometheus/prometheus.yml
sudo kill -SIGHUP $(pidof prometheus)

This will result in stats showing as grayed out 0% figures on the Resources/Nodes view.

To subsequently restore metrics functionality if desired, run the following on each manager:

docker exec -it ucp-metrics sed -i s:^#:: /etc/prometheus/prometheus.yml
sudo kill -SIGHUP $(pidof prometheus)

Will draft a KB to represent this issue as well and reply here with a link when published.

(Trapier) #12

Engineering expects to have a CS Engine RC test build for the secondary issue with file system metrics available shortly. If anyone sees relief from disabling the metrics backend and is willing to test an engine build expected to achieve the same outcome without disabling metrics, please open a support case and message me with the case number. If you have any issues opening a support case, message me as well and we’ll get it worked out.

(Rlaveycal) #13

I restarted the docker service and applied both changes to my manager node. So far so good…

(Michael) #14

Hey trapier,

I’m seeing the same issues, I’ll send you off my support ticket number. I tried the work around and it didn’t offer any improvement for me.

(Skiermatt2) #15

I’m seeing this same issue. Tried disabling the metrics backend, but the problem persists. Any update?

(Nrknettjenester) #16

I am seeing the same issue here, any news on this?

(Michael) #17

There is a new CS release out; I’d try upgrading to that and then disabling prometheus in the ucp-metrics container. That’s gotten me to a working environment.

(EManalo) #18

Just want to check if there’s been update on this issue.

I upgraded to the latest CS Docker Engine version 1.13.1-cs2 and tried to disable prometheus in the ucp-metrics container, and it did not work for me. I am still seeing the same issue.

(Michael) #19

Was this an upgrade or a fresh install?

There was another issue on upgrades, if this was an upgrade did you try the update to the ucp-agent service mentioned earlier in this thread?

(EManalo) #20

Yes, I did a complete fresh install of all nine nodes - first with 1 UCP, 3 DTRs, 3 Container nodes. I started with all of them on a fresh install of the latest CS Docker Engine 1.13.1-cs2, UCP 2.1.0, DTR 2.2.2. The 1-UCP setup with 3 DTRs and 3 Container hosts works okay: I was able to do a test deploy and configure the DTRs successfully with 2 additional replicas.

When I added two more UCP nodes (both with a fresh install of the latest CS Docker Engine), the moment those 2 UCP nodes were added I started seeing the following “unhealthy” message again in the UCP Web UI. All three UCP nodes show this unhealthy message.

Unhealthy UCP Controller: unable to reach controller: Get https://<UCP_Node>/_ping: net/http: request canceled (Client.Timeout exceeded while awaiting headers)