This was working great until we ran a failover test on the network equipment between the Docker hosts and the NFS server. The failover test broke the NFS containers, and they were not able to recover until we rebooted all failing nodes.
The symptom was high load (around 300) and high latency. Looking at /var/log/messages I could see:
Aug 18 21:51:11 docker-worker01 kernel: [422211.885112] nfs: server not responding, timed out
Aug 18 21:51:11 docker-worker01 kernel: [422212.461092] nfs: server not responding, timed out
Aug 18 21:51:11 docker-worker01 kernel: [422212.521107] nfs: server not responding, timed out
Last time I researched this in general (as in, not specific to Docker), I could only find the usual recommendation: unmount/remount the remote share. This is not going to work with a volume.
It could very well be a missing feature in Docker itself: it could try to access the remote share and remount it if it doesn't respond within a given timeframe. From what I understand, this would still require all containers that access it to be restarted as well, since the inode of the mounted remote share changes. The namespace isolation hands the mounts over to the container by their inode, not by their path (which would be ${docker data-root}/volumes/${volume name}/_data).
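For context, an NFS-backed named volume of the kind discussed here is typically declared along these lines in a compose file (the server address, export path, and mount options below are placeholders, not taken from the original setup). Note that with a `soft` mount and `timeo`/`retrans` set, NFS operations eventually return an error instead of hanging forever, which is one knob people tune when a hung `hard` mount is driving load through the roof:

```yaml
# Hypothetical example of an NFS-backed named volume.
# Server address, export path, and mount options are placeholders.
volumes:
  nfs-data:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=192.0.2.10,rw,soft,timeo=50,retrans=3"
      device: ":/exports/data"
```

The trade-off: `soft` can surface I/O errors to applications during a blip, while `hard` (the default) retries indefinitely and produces exactly the "server not responding" hangs seen above.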
If the stale volume can be detected by your healthcheck, in theory you should be able to use it to flag the task as unhealthy, which might result in a recovery when the task is terminated and replaced with a new one. I doubt it will work if multiple task replicas run on the same node, as the remote share remains mounted until the last container using it is stopped or deleted.
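A sketch of what such a healthcheck could look like, assuming the volume is mounted at `/data` inside the container (image name and paths are placeholders). The trick is that operations on a stale NFS mount tend to hang rather than fail, so wrapping a cheap read in `timeout` converts the hang into a failing exit code:

```yaml
# Hypothetical healthcheck sketch: a directory listing on a stale NFS
# mount hangs, so `timeout` turns the hang into a non-zero exit code
# that marks the task unhealthy.
services:
  app:
    image: myapp:latest        # placeholder image
    volumes:
      - nfs-data:/data
    healthcheck:
      test: ["CMD", "timeout", "5", "ls", "/data"]
      interval: 30s
      timeout: 10s
      retries: 3
```

As noted above, this only helps if replacing the unhealthy task actually tears down the last container using the mount on that node.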
You could raise an issue in Docker's upstream project Moby and file either a feature request or a bug report, whichever you feel is more suitable.
From what I understand of the outage: the container detected the issue, so it was restarting again and again. But thanks to your input I now understand that the volume didn't try to unmount/remount the NFS side, which could explain the issue.
But I'm very surprised that I would be the "only one" facing this issue.
You are not. AFAIK, you are the first to share that they run NFS in an HA setup and expect the failover to be handled transparently.
There might be an ugly way to make failover work: mount the NFS share on the host, then bind mount the parent folder of the mountpoint into the container with a bind propagation setting (see: https://docs.docker.com/engine/storage/bind-mounts/#configure-bind-propagation) that propagates sub-mounts (= NFS mountpoints) of the original mount (= the parent folder you bind mount into the container) to the container (= the replica mount of the bind-mounted parent folder).
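The bind-propagation approach could be sketched roughly like this (all paths and the image name are hypothetical). The NFS share is mounted on the host under a parent directory, and that *parent* directory is bind mounted into the container with `rslave` propagation, so a host-side remount of the share after failover becomes visible inside the running container:

```yaml
# Sketch of the bind-propagation workaround; all paths are hypothetical.
# On the host, the NFS share would be mounted under a parent dir first:
#   mount -t nfs 192.0.2.10:/exports/data /srv/nfs/data
# The parent folder /srv/nfs (not the mountpoint itself) is bind mounted
# with rslave propagation, so host-side remounts of /srv/nfs/data
# propagate into the container without restarting it.
services:
  app:
    image: myapp:latest        # placeholder
    volumes:
      - type: bind
        source: /srv/nfs
        target: /mnt/nfs
        bind:
          propagation: rslave
```

`rslave` propagates mounts one way (host into container), which is usually what you want here; `rshared` would additionally propagate mounts made inside the container back to the host. Something outside the container (a script or systemd unit) still has to notice the dead mount and remount it on the host.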