Docker stability issues

We’re seeing persistent but hard-to-replicate stability issues, possibly related to nvidia-docker/Docker. We are cross-posting this as nvidia-docker issue 494 (no link because of the new-user limitation), and we have also been coordinating with Azure support, because it has been extremely difficult to track down the underlying source.

Any feedback is greatly appreciated.

We had previously been working with nvidia-docker/Docker for many months with no meaningful problems.

Repeated error message:

“Unable to connect to plugin: /var/lib/nvidia-docker/nvidia-docker.sock/Plugin.Activate: Post http://%2Fvar%2Flib%2Fnvidia-docker%2Fnvidia-docker.sock/Plugin.Activate: dial unix /var/lib/nvidia-docker/nvidia-docker.sock: connect: no such file or directory, retrying in 1s”
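That message is dockerd failing to reach the nvidia-docker volume plugin over its Unix socket. For anyone hitting the same thing, a quick sanity check of the socket and the plugin service looks roughly like this (the service name assumes the stock nvidia-docker 1.x systemd unit; adjust if installed differently):

# Does the plugin socket exist at the path dockerd is dialing?
ls -l /var/lib/nvidia-docker/nvidia-docker.sock

# Is the plugin service itself alive, and what has it logged?
systemctl status nvidia-docker
journalctl -u nvidia-docker --since "1 hour ago"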

Manifestation:

Docker freezes/becomes non-responsive. Sometimes “docker ps” will work; other times it will just hang. After this occurs, new containers typically cannot be brought up (the run freezes/hangs).
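A hang like this can be detected without wedging a shell by probing with a timeout; a minimal sketch (the 10-second limit is arbitrary):

# Exit status 124 from timeout means "docker ps" never returned
timeout 10 docker ps > /dev/null 2>&1; echo "docker ps exit: $?"

# The daemon socket can also be pinged directly, bypassing the CLI
curl --unix-socket /var/run/docker.sock --max-time 10 http://localhost/_ping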

Replication:

We have been unable to figure out a way to replicate the above, but it seems to be happening multiple times per day per VM.
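We have no deterministic reproducer; the sketch below is only the kind of container churn one can run while watching for the hang (nvidia/cuda and the sleep interval are placeholders, not a known trigger):

# Loop until a run fails or hangs; each run is bounded by timeout
while timeout 60 nvidia-docker run --rm nvidia/cuda nvidia-smi > /dev/null; do
    sleep 5
done
echo "run failed or hung at $(date)"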

Remediation:

Fully cleaning out Docker containers/data and rebooting seems to restore stability, but not permanently.
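Concretely, the cleanup amounts to something like the following. This is destructive (it wipes all containers, images, and volumes) and is offered only as a sketch of what “fully cleaning out” means here, not an exact transcript:

# Destructive: removes all Docker and nvidia-docker state before rebooting
sudo systemctl stop docker nvidia-docker
sudo rm -rf /var/lib/docker /var/lib/nvidia-docker
sudo reboot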

We have also re-installed nvidia-docker and Docker on affected VMs, and have repeatedly brought up fresh VMs and re-run the entire installation of Docker, nvidia-docker, etc.

System info:

There is a good chance that this is related to other issues (see below), but we wanted to file this in the off chance that 1) this is a known issue, 2) we are missing something, and/or 3) you can suggest diagnostics.

Platform information:

Azure NC6s (GPU instances)
We are on the linux-azure kernel, for which there was a known Docker issue that was supposedly just fixed (https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1719045). This seemed to improve stability but did not entirely solve things; it is possible that our outstanding issues are unrelated to the linux-azure kernel issues.
Our OS disk is encrypted via dm-crypt.
On Linux master0 4.4.0-96-generic with an unencrypted disk we have not yet observed the above issues (and have been running a similar configuration for many months). Downgrading a box to a non-linux-azure kernel and encrypting it is next on our list of attempted remediations, but we have been cycling on this problem for an extended period and would greatly appreciate any external feedback.
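For completeness, the two variables we are comparing across boxes, kernel flavour and dm-crypt, can be checked like this (a sketch; lsblk reports TYPE “crypt” for dm-crypt devices):

# Which kernel flavour is this box actually running?
uname -r

# Is the root filesystem sitting on a dm-crypt device?
lsblk -o NAME,TYPE,MOUNTPOINT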

Docker info:

cluster@fathom-dp-proc9:~/diseaseTools$ docker info
Containers: 26
 Running: 7
 Paused: 0
 Stopped: 19
Images: 6
Server Version: 17.09.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local nvidia-docker
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.11.0-1013-azure
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 6
Total Memory: 55.03GiB
Name: fathom-dp-proc9
ID: UXVA:2MZL:7TQF:5GIY:AY6J:6VJG:NHS6:ULCC:VOUE:EHBA:UG36:OTMC
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: bonneaud
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Nvidia-smi:

cluster@fathom-dp-proc9:~/diseaseTools$ nvidia-smi
Wed Oct 11 02:46:56 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000317:00:00.0 Off |                    0 |
| N/A   35C    P8    26W / 149W |     11MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Some logs:

-- Logs begin at Wed 2017-10-11 01:55:50 UTC, end at Wed 2017-10-11 02:03:07 UTC. --
Oct 11 01:56:12 fathom-dp-proc9 systemd[1]: Starting Docker Application Container Engine...
Oct 11 01:56:12 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:12.403977100Z" level=info msg="libcontainerd: new containerd process, pid: 1605"
Oct 11 01:56:15 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:15.755630500Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
Oct 11 01:56:16 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:16.644749800Z" level=warning msg="Unable to connect to plugin: /var/lib/nvidia-docker/nvidia-docker.sock/Plugin.Activate: Post http://%2Fvar%2Flib%2Fnvidia-docker%2Fnvidia-docker.sock/Plugin.Activate: dial unix /var/lib/nvidia-docker/nvidia-docker.sock: connect: no such file or directory, retrying in 1s"
Oct 11 01:56:17 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:17.645215000Z" level=warning msg="Unable to connect to plugin: /var/lib/nvidia-docker/nvidia-docker.sock/Plugin.Activate: Post http://%2Fvar%2Flib%2Fnvidia-docker%2Fnvidia-docker.sock/Plugin.Activate: dial unix /var/lib/nvidia-docker/nvidia-docker.sock: connect: no such file or directory, retrying in 2s"
Oct 11 01:56:19 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:19.674717800Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Oct 11 01:56:19 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:19.675056600Z" level=warning msg="Your kernel does not support swap memory limit"
Oct 11 01:56:19 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:19.675260700Z" level=warning msg="Your kernel does not support cgroup rt period"
Oct 11 01:56:19 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:19.675399300Z" level=warning msg="Your kernel does not support cgroup rt runtime"
Oct 11 01:56:19 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:19.675531000Z" level=warning msg="Your kernel does not support cgroup blkio weight"
Oct 11 01:56:19 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:19.675659000Z" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Oct 11 01:56:19 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:19.676160800Z" level=info msg="Loading containers: start."
Oct 11 01:56:19 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:19.971500700Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Oct 11 01:56:20 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:20.086152500Z" level=info msg="Loading containers: done."
Oct 11 01:56:20 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:20.353561600Z" level=info msg="Docker daemon" commit=afdb6d4 graphdriver(s)=overlay2 version=17.09.0-ce
Oct 11 01:56:20 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:20.354119400Z" level=info msg="Daemon has completed initialization"
Oct 11 01:56:20 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:56:20.361190600Z" level=info msg="API listen on /var/run/docker.sock"
Oct 11 01:56:20 fathom-dp-proc9 systemd[1]: Started Docker Application Container Engine.
Oct 11 01:58:16 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:58:16.222797512Z" level=error msg="Handler for GET /v1.22/containers/449928a5a8c943f0eeeb73e3d65c59c2d33c09f1b9bbf836427439f8b3945cc3/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Oct 11 01:58:16 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:58:16.222910308Z" level=error msg="Handler for GET /v1.22/containers/449928a5a8c943f0eeeb73e3d65c59c2d33c09f1b9bbf836427439f8b3945cc3/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Oct 11 01:58:16 fathom-dp-proc9 dockerd[1497]: http: multiple response.WriteHeader calls
Oct 11 01:58:16 fathom-dp-proc9 dockerd[1497]: http: multiple response.WriteHeader calls
Oct 11 01:58:34 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:58:34.721960785Z" level=error msg="attach: stderr: write unix /var/run/docker.sock->@: write: broken pipe"
Oct 11 01:58:35 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:58:35.494824014Z" level=error msg="attach: stderr: write unix /var/run/docker.sock->@: write: broken pipe"
Oct 11 01:58:35 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:58:35.686069316Z" level=error msg="attach: stdout: write unix /var/run/docker.sock->@: write: broken pipe"
Oct 11 01:58:35 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:58:35.686129416Z" level=error msg="attach failed with error: write unix /var/run/docker.sock->@: write: broken pipe"
Oct 11 01:58:35 fathom-dp-proc9 dockerd[1497]: time="2017-10-11T01:58:35.882506662Z" level=error msg="attach: stdout: write unix /var/run/docker.sock->@: write: broken pipe"
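
A diagnostic that may help the next time the daemon hangs: dockerd dumps its goroutine stacks when sent SIGUSR1, which should show where it is blocked. Our understanding for 17.09 is that the dump lands in the daemon log and/or a goroutine-stacks*.log file under /var/run/docker; treat the exact location as an assumption:

# Ask the hung daemon to dump its goroutine stacks
sudo kill -USR1 "$(pidof dockerd)"

# Then look for the dump in the daemon log and under Docker's run dir
sudo journalctl -u docker --since "5 minutes ago" | tail -n 100
ls /var/run/docker/goroutine-stacks-*.log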