Docker Host with High load on NFS failover

romgo · September 1, 2025, 9:08am

Dear cmmunity,

I’m running a swarm cluster 9 nodes running on debian 12 hosts.

Some of the service I deployed use a NFS share for persistence.

I use the mount of NFS directly through docker file to avoid any mismatch between OS and container state.

So in my stack file I got volume definition like :


volumes:
  grafana-data:
    driver: local
    driver_opts:
      type: nfs
      o: addr=nfs.domain.com,rw,nolock,soft,vers=4.0
      device: ":/nfs/docker/grafana-dev"

This is working great until we did some failover test on network equipment between the docker hosts and NFS server. This failover test break the NFS containers and was not able to recover until we rebooted all failing nodes.

Symptom was High load (something like 300) and high latency. Looking at /var/log/messages I could see :

Aug 18 21:51:11 docker-worker01 kernel: [422211.885112] nfs: server  not responding, timed out
Aug 18 21:51:11 docker-worker01 kernel: [422212.461092] nfs: server  not responding, timed out
Aug 18 21:51:11 docker-worker01 kernel: [422212.521107] nfs: server  not responding, timed out

Anyone encounter this kind of Issue ?

Thank you for your help.

bluepuma77 · September 1, 2025, 6:28pm

What’s the failover? Different server? Different IP?

romgo · September 5, 2025, 12:50pm

On NFS side, we use keepalived to failover a VIP between two servers. Data are synced with DRBD

meyay · September 5, 2025, 5:15pm

The issue are stale NFS mounts.

last time I researched it in general (as in not specific to docker), I could only find the usual recommendation: unmount/remount the remote share. This is not going to work with a volume.

It could very well be a missing feature in docker itself, as it could try to access the remote share, and remount it, if it doesn’t respond within a given timeframe. From what I understand this still would require all containers that access it to be restarted as well, as the inode of the mounted remote share changes. The namespace isolation hands over the mounts to the container by their inode, and not by their path (which would be ${docker data-root}/volumes/${volume name}/_data ).

If the stale volume could be detected by your healthcheck, in theory you should be able to use it to flag the task as unhealthy, which might result in a recovery when the task is terminated and replaced with a new one. I doubt it will work, if multiple task replicas run on the same node, as the remote share will remain mounted, until the last container using it is stopped or deleted.

You could raise an issue in docker’s upstream project Moby and create either a feature or bug, whatever you feel is more suited.

romgo · September 9, 2025, 8:54am

From what I understand during the outage : container detected issue so it was restarting again and again, but Thanks to your input I understand the volume didn’t try to umount/mount the NFS side which could explain the issue.

But I’m very suprise that I would be the “only one” facing this issue.

meyay · September 9, 2025, 9:01pm

You are not. Afaik, you are the first to share he runs nfs in a ha setup, and expect the failover to be handled transparently.

There might be an ugly way to make failover work: mount the nfs share on the host, bind the parent folder of the mountpoint into the container with a bind propagation (see: https://docs.docker.com/engine/storage/bind-mounts/#configure-bind-propagation>) that propagates sub-mounts (=nfs mountpoints) of the original mount (=parent folder you bind mount into the container) to the container (=replica mount of the bind mounted parent folder).

Topic		Replies	Views
NFS volume: same service but inconsistent storage between replicas in different nodes Swarm docker , swarm	1	3329	May 29, 2018
Docker Swarm NFS mount Swarm docker , swarm	6	30976	April 3, 2018
Best practice for persistent data using swarm General	8	31308	April 25, 2018
Docker on scalable storage (NFS, ceph, gluster) Swarm	4	7775	September 25, 2017
Docker (Swarm) Volume problem with NFS General swarm	6	9649	December 9, 2020

Docker Host with High load on NFS failover

Related topics