
NFS stale handle in containers, but ok in OS

Hi all,

I’m running CoreOS in several VMs. Within the CoreOS instances I’m mounting NFS shares from a local NAS, which are then passed on to the containers (NFS share sub-dirs as volumes).

After some traffic on the NAS the NFS handles turn stale within the containers, but in the underlying CoreOS the mounts are still intact and the data is still accessible from the OS. As a consequence I have to restart the containers, since they cannot access any data on the volumes anymore. After the restart the volumes are back to normal until the next occurrence.

Setup & Configuration:

    Server
        VirtualBox on Windows 10 PRO
        several CoreOS 2247.7.0 (Rhyolite) on VMs (VB)
            kernel:    4.19.84
            rkt:       1.30.0
            docker:    18.06.3

    NAS
        UNRAID OS, providing NFS V3 (no V4, unfortunately)
        fuse_remember set to 600 (setting it to -1 or 0 didn't help)

    Mount in CoreOS (configured in json; expanded with mount options after this list)
        mount -t nfs <NAS IP>:<path in NAS> /mnt/nas

    Container YAML
        volumes:
          - /mnt/nas/<path>:<path in container>
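
Spelled out, the mount from the list above ends up roughly like this (nfsvers=3 is my assumption here, since the NAS only offers V3):

    # explicit NFS v3 mount of the NAS share on the CoreOS host
    mount -t nfs -o nfsvers=3,rw <NAS IP>:<path in NAS> /mnt/nas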

The stale handles appear after some traffic on the NAS. Traffic would be on the order of a few GB, sometimes even less than a GB. Sometimes it happens after just a few minutes, sometimes after a day or two. This behaviour is very inconvenient and I haven’t found any solution to it.

I know that the NFS client has to refresh the handle every now and again. By the looks of it, the CoreOS NFS client does this correctly, but somehow the volumes in the containers don’t get updated properly.
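
To illustrate what I mean (the container name and paths are just placeholders): listing the same data works on the CoreOS host but fails inside a container once the handle has gone stale:

    # on the CoreOS host the mount is still fine
    ls /mnt/nas/<path>

    # inside an affected container the same data fails with "Stale file handle"
    docker exec <container name> ls <path in container>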

What can I do to ensure that the containers don’t lose their access to the volumes? Any ideas, anybody?

Regards

Actually having the same problem. Found this with a google search. Have you found a solution?

I’m mounting a filesystem on a linux box that is backed with mergerfs. I suspect our issues are related.

So I think I found the option for my issue. There is a parameter called “noforget” that you add to mergerfs to prevent stale mounts. Since adding this I have not had the stale mounts in my docker containers.

A quick Google search shows that Unraid has this option, and I’ve found some forum threads discussing how to use it in Unraid. It might be worth a look.
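
In case it helps, this is roughly what a mergerfs mount with that option looks like in fstab (the branch paths and the other options here are just an example, not Unraid’s actual config; as far as I know Unraid sets this through its own tunables rather than fstab):

    # example fstab entry: pool the /mnt/disk* branches with noforget enabled
    /mnt/disk*  /mnt/storage  fuse.mergerfs  defaults,allow_other,noforget  0  0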

A lot of people also say they just switch to SMB to fix it :frowning:

Hi jdyyc,

I think I found the solution. I’m still testing/observing it before I roll it out to every instance / container.

What you are referring to is the fuse_remember setting in UNRAID. This is what I’ve been playing with, but without success.

The thing is that NFS handles become stale after a while. This is managed by the server. The client regularly checks for those stale handles and refreshes them itself. This is what happens at OS level.

Containers don’t do that, since they don’t know that there is something to do. They don’t know there is an NFS client behind the affected volume, so the volume doesn’t get updated and, consequently, access fails.

I didn’t know that there was a way to point each container’s volume directly at the NFS share, without manually mounting the share within each container. This means the container then knows that the volume is an NFS share and acts accordingly. I’ve changed it for two of my notoriously failing containers and it seems to work. As I said, I’m still observing the behaviour before I change it for all containers and CoreOS VMs.

Instead of taking a ‘normal’ directory (the mounted NFS share)

volumes:
  - /mnt/nas/<path>:<path in container>

I now refer to the NFS share directly from the container

services:
  cont1:
    image: ....
    container_name: ....
    ports:
      ....
    volumes:
      - nas1_dock1_cont1:<path in container>

volumes:
  nas1_dock1_cont1:
    driver: local
    driver_opts:
      type: nfs
      o: addr=<NAS IP>,rw
      device: ":<path in NAS>"

The above shows the relevant parts of the YAML files I use to configure the containers.
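
If you are not using compose files, the same kind of NFS-backed named volume can, as far as I can tell, also be created from the Docker CLI (names and paths below are just the placeholders from my example above):

    # create a named volume backed directly by the NFS share
    docker volume create \
      --driver local \
      --opt type=nfs \
      --opt o=addr=<NAS IP>,rw \
      --opt device=:<path in NAS> \
      nas1_dock1_cont1

Containers started with -v nas1_dock1_cont1:<path in container> should then use the NFS share directly, just like the compose variant.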

Regarding the recommendations for SMB, of which I also found tons:

  1. CoreOS does not natively support SMB. I prefer to take CoreOS and the stuff provided within it as is, rather than adding/modifying things that might not work after the next update, at which point the tinkering starts again.

  2. Actually, I installed SMB once and it worked as such. Unfortunately, I couldn’t manage to get the NextCloud container to install, since the installation failed when it tried to create symbolic links. With NFS shares it worked.

After all that pain, with SMB being no option and NFS causing failing containers, I had almost had enough and was close to ditching my ‘self hosting’ infrastructure. I will give my tests another few days and, provided the modified containers stay intact while the ‘old’ ones keep failing, will then roll the change out to the entire infrastructure.

I hope the above helped. Good luck!

Regards

Cool. I’m going to give that direct NFS mount option a try today too. Thanks for the tip.

It’s interesting that this whole thing started as soon as I started using mergerfs. Prior to that I was using NFS volumes for everything in containers with no problems. According to the mergerfs docs it’s something specific to FUSE and NFS, and I suppose Unraid is using FUSE under the hood too.

So far after implementing “noforget” on the server, I haven’t had the problem occur again.

Heh, I too only have two containers that seem to exhibit this problem all the time. I wonder if they’re the same two containers… :thinking:

And ya, I’m also not keen on using SMB.

Hi jdyyc,

well, testing looks OK, but still not 100%.

The reason why not all containers are affected is that some of them only use the data on the volume at start-up time and then keep it in memory. Only the containers that keep reading data over time will fail when the volume disappears. The others won’t notice it.

In my case I have a letsencrypt reverse proxy running, which also serves a webpage. When the handles go stale the proxy still runs perfectly with the data in memory, but the webpage won’t load, because the HTML and PHP files can’t be found anymore. After the re-configuration the letsencrypt container works fine, but others still won’t work properly.

I’ve got a few Ghost blogs running, which I also re-configured. They do update the handles and the containers carry on working, but only 5-10 seconds after I call them up. That means when the handles have gone stale and I open the blog, it fails at first. After two or three attempts the blog comes up and works OK from then on. [Edit] Actually, I just noticed that the Ghost containers are restarting themselves.

Quite annoying. I think the re-config was a big step in the right direction, but the problem is still not 100% solved.

How is your testing going?

Regards