I have a reasonably sized Kubernetes cluster with at least 20 pods per node and over 10 nodes.
The Docker daemons on the nodes use overlay2 as their storage driver.
We frequently see that after a period of hours to days, processes within the container are no longer able to write to certain directories. The error seen is:
root@node:/project/input# echo " " >> test.txt
bash: test.txt: No such file or directory
After significant testing, it appears that files cannot be written to directories which are infrequently written to. After looking through the overlay2 directories, I can see that the directories I am able to write to are in the ‘upper’ directory. The directories I cannot write to are within the ‘lower’ linked directories.
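For anyone who wants to reproduce the upper/lower check, here is a rough sketch of how it could be scripted. The function name find_in_layers is made up for illustration. It assumes the overlay mount options are passed in as they appear in /proc/mounts, that the layer paths are absolute, and that they contain no commas or colons.

```shell
# Hypothetical helper (not part of Docker or Kubernetes): report which
# overlay2 layer(s) contain a given path.
#   $1 = overlay mount options string, as shown in /proc/mounts
#   $2 = path relative to the container's root filesystem
# Assumes absolute layer paths without commas or colons in them.
find_in_layers() {
    opts=$1
    rel=$2

    # Pull upperdir=... and lowerdir=... out of the comma-separated options.
    upper=$(printf '%s\n' "$opts" | tr ',' '\n' | sed -n 's/^upperdir=//p')
    lower=$(printf '%s\n' "$opts" | tr ',' '\n' | sed -n 's/^lowerdir=//p')

    # The upper directory is the writable layer of the container.
    if [ -n "$upper" ] && [ -e "$upper/$rel" ]; then
        echo "upper: $upper/$rel"
    fi

    # lowerdir is a colon-separated list of read-only layers, topmost first.
    printf '%s\n' "$lower" | tr ':' '\n' | while IFS= read -r dir; do
        if [ -n "$dir" ] && [ -e "$dir/$rel" ]; then
            echo "lower: $dir/$rel"
        fi
    done
}
```

On a node, the options string could be copied from the container's overlay entry in /proc/mounts and passed as the first argument, for example find_in_layers "$opts" project/input. The same paths should also be visible under GraphDriver.Data in the output of docker inspect for the container. A directory reported only under "lower" has never been copied up into the writable layer.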
The nodes are below 10% inode usage, we’re not hitting inotify watch limits, at least 10% of memory (>1GB) is still free, and CPU usage averages 60%.
What could be causing this? To say I’m pulling my hair out is an understatement.